From ba564e2f791642443a7c28ff8fa5a0b194866819 Mon Sep 17 00:00:00 2001
From: Neil Smith
Date: Sun, 1 Jun 2014 19:51:35 +0100
Subject: [PATCH] Caching word segmentation

---
 slides/word-segmentation.html | 34 +++++++++++++++++++++++++++++++---
 1 file changed, 31 insertions(+), 3 deletions(-)

diff --git a/slides/word-segmentation.html b/slides/word-segmentation.html
index 6eb88e3..d9d1ec6 100644
--- a/slides/word-segmentation.html
+++ b/slides/word-segmentation.html
@@ -47,12 +47,40 @@
 
 # Word segmentation
 
-a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|--
-k | e | y | w | o | r | d | a | b | c | f | g | h | i | j | l | m | n | p | q | s | t | u | v | x | z
+`makingsenseofthis`
+`making sense of this`
+
+---
+
+# The problem
+
+Ciphertext is re-split into groups to hide word boundaries.
+
+How can we rediscover the word boundaries?
+
+---
+
+# Simple approach
+
+1. Try all possible word boundaries
+2. Return the one that looks most like English
+
+What's the complexity of this process?
+
+* (We'll fix that in a bit...)
+
+---
+
+# What do we mean by "looks like English"?
+
+Naïve Bayes bag-of-words worked well for cipher breaking. Can we apply the same intuition here?
+
+Probability of a bag-of-words (ignoring inter-word dependencies).
+
+Finding the counts of words in text is harder than letters.
+
+* More tokens, so need more data to cover sufficient words.
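The "Simple approach" slide added above asks for the complexity of trying all possible word boundaries. A sketch of that exhaustive enumeration (not code from the slides; the function name `all_segmentations` is invented for illustration) makes the answer concrete: a text of n characters has n-1 gaps, each either a boundary or not, so there are 2^(n-1) candidate segmentations.

```python
from itertools import product

def all_segmentations(text):
    """Enumerate every way of splitting text at the n-1 gaps
    between characters: 2**(len(text)-1) candidates in total."""
    gaps = len(text) - 1
    for pattern in product([False, True], repeat=gaps):
        words, start = [], 0
        for i, split_here in enumerate(pattern, start=1):
            if split_here:
                words.append(text[start:i])
                start = i
        words.append(text[start:])
        yield words

# 'abc' has two gaps, so 2**2 = 4 segmentations
print(list(all_segmentations('abc')))
# → [['abc'], ['ab', 'c'], ['a', 'bc'], ['a', 'b', 'c']]
```

For the 17-letter `makingsenseofthis` that is already 65,536 candidates, which is why the slides promise to "fix that in a bit".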
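The commit subject, "Caching word segmentation", points at the fix for that exponential blow-up: score candidates with the bag-of-words probability the slides describe, and memoise the best segmentation of each suffix so it is computed only once. The sketch below is one way to do this, not the code the slides use; the word probabilities in `WORD_LOGPROB` are made up for the example, where a real segmenter would estimate them from word counts in a large corpus.

```python
import math
from functools import lru_cache

# Toy unigram log-probabilities (invented values for illustration).
WORD_LOGPROB = {
    'making': math.log(1e-4),
    'sense': math.log(1e-4),
    'of': math.log(1e-2),
    'this': math.log(1e-3),
}

def log_prob(word):
    # Penalise unseen words heavily, scaled by length, so long
    # non-words score worse than short ones.
    return WORD_LOGPROB.get(word, math.log(1e-9) * len(word))

@lru_cache(maxsize=None)
def segment(text):
    """Best segmentation of text: split off each possible first
    word, recurse on the remainder, keep the candidate with the
    highest total log-probability. The cache means each suffix is
    solved once, so the work is polynomial in len(text) rather
    than the 2**(n-1) of the exhaustive search."""
    if not text:
        return []
    candidates = ([text[:i]] + segment(text[i:])
                  for i in range(1, len(text) + 1))
    return max(candidates,
               key=lambda words: sum(log_prob(w) for w in words))

print(segment('makingsenseofthis'))
# → ['making', 'sense', 'of', 'this']
```

Summing log-probabilities of the individual words is exactly the "bag of words, ignoring inter-word dependencies" assumption from the final slide.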