From: Neil Smith <neil.git@njae.me.uk>
Date: Sun, 1 Jun 2014 18:51:35 +0000 (+0100)
Subject: Caching word segmentation
X-Git-Url: https://git.njae.me.uk/?a=commitdiff_plain;ds=sidebyside;h=ba564e2f791642443a7c28ff8fa5a0b194866819;p=cipher-training.git

Caching word segmentation
---

diff --git a/slides/word-segmentation.html b/slides/word-segmentation.html
index 6eb88e3..d9d1ec6 100644
--- a/slides/word-segmentation.html
+++ b/slides/word-segmentation.html
@@ -47,12 +47,40 @@
 
 # Word segmentation
 
-a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|--
-k | e | y | w | o | r | d | a | b | c | f | g | h | i | j | l | m | n | p | q | s | t | u | v | x | z
+`makingsenseofthis`
+`making sense of this`
 
 ----
 
+# The problem
+
+Ciphertext is re-split into groups to hide word boundaries.
+
+How can we rediscover the word boundaries?
+
+---
+
+# Simple approach
+
+1. Try all possible word boundaries
+2. Return the one that looks most like English
+
+What's the complexity of this process?
+
+* (We'll fix that in a bit...)
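A minimal sketch of this exhaustive approach (the function name is illustrative, not from the training code). An n-letter text has 2^(n-1) possible segmentations, one per choice of break/no-break at each internal position, which is why the naive approach needs fixing:

```python
def segmentations(text):
    """Yield all 2**(len(text)-1) ways of splitting text into words."""
    if len(text) <= 1:
        yield [text]
        return
    for rest in segmentations(text[1:]):
        yield [text[0]] + rest                 # break after the first letter
        yield [text[0] + rest[0]] + rest[1:]   # join first letter onto the next word
```

For example, `segmentations("abc")` yields the four splits `['a', 'b', 'c']`, `['ab', 'c']`, `['a', 'bc']` and `['abc']`.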
+
+---
+
+# What do we mean by "looks like English"?
+
+Naïve Bayes bag-of-words worked well for cipher breaking. Can we apply the same intuition here?
+
+Score a candidate split by the probability of its bag of words (ignoring inter-word dependencies).
+
+Finding the counts of words in text is harder than finding the counts of letters.
+
+* There are many more word types than letters, so we need more data to cover enough of them.
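The bag-of-words scoring might be sketched like this, assuming a hypothetical table of unigram counts (the counts below are made up; a real model would be built from a large corpus):

```python
import math

# Hypothetical unigram counts -- stand-ins for counts from a large corpus.
counts = {'making': 120, 'sense': 200, 'of': 5000, 'this': 3000,
          'mak': 2, 'ing': 40}
total = sum(counts.values())

def log_prob_words(words, floor=1e-9):
    """Log-probability of a bag of words, treating each word as independent."""
    return sum(math.log(max(counts.get(w, 0) / total, floor)) for w in words)
```

Working in log space avoids underflow when multiplying many small probabilities; the `floor` is a crude stand-in for proper smoothing of unseen words. Under these counts, `['making', 'sense', 'of', 'this']` scores higher than `['mak', 'ing', 'sense', 'of', 'this']`.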
+
 
     </textarea>
     <script src="http://gnab.github.io/remark/downloads/remark-0.6.0.min.js" type="text/javascript">