<!DOCTYPE html>
<html>
<head>
  <title>Word segmentation</title>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
  <style type="text/css">
    /* Slideshow styles */
    body {
      font-size: 20px;
    }
    h1, h2, h3 {
      font-weight: 400;
      margin-bottom: 0;
    }
    h1 { font-size: 3em; }
    h2 { font-size: 2em; }
    h3 { font-size: 1.6em; }
    a, a > code {
      text-decoration: none;
    }
    code {
      -moz-border-radius: 5px;
      -webkit-border-radius: 5px;
      background: #e7e8e2;
      border-radius: 5px;
      font-size: 16px;
    }
    .plaintext {
      background: #272822;
      color: #80ff80;
      text-shadow: 0 0 20px #333;
      padding: 2px 5px;
    }
    .ciphertext {
      background: #272822;
      color: #ff6666;
      text-shadow: 0 0 20px #333;
      padding: 2px 5px;
    }
    .indexlink {
      position: absolute;
      bottom: 1em;
      left: 1em;
    }
    .float-right {
      float: right;
    }
  </style>
</head>

<body>
<textarea id="source">

# Word segmentation

`makingsenseofthis`

`making sense of this`

---

layout: true

.indexlink[[Index](index.html)]

---

# The problem

Ciphertext is re-split into groups to hide word boundaries.

* HELMU TSCOU SINSA REISU PPOSE KINDI NTHEI ROWNW AYBUT THERE ISLIT TLEWA RMTHI NTHEK INDNE SSIRE CEIVE

How can we rediscover the word boundaries?

* helmut s cousins are i suppose kind in their own way but there is little warmth in the kindness i receive

---

# Simple approach

1. Try all possible word boundaries
2. Return the one that looks most like English

What's the complexity of this process? (See the sketch below.)

* (We'll fix that in a bit...)

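For intuition, here's a brute-force sketch (the helper name `all_segmentations` is ours, not part of the course code). Each of the `len(text) - 1` gaps between letters either gets a boundary or doesn't, so an n-letter text has 2^(n-1) candidate segmentations.

```python
from itertools import product

def all_segmentations(text):
    # Each gap between letters either has a word boundary or not,
    # giving 2 ** (len(text) - 1) candidate segmentations.
    for boundaries in product([False, True], repeat=len(text) - 1):
        words, start = [], 0
        for i, boundary in enumerate(boundaries, start=1):
            if boundary:
                words.append(text[start:i])
                start = i
        words.append(text[start:])
        yield words
```
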
---

# What do we mean by "looks like English"?

Naïve Bayes bag-of-words worked well for cipher breaking. Can we apply the same intuition here?

Probability of a bag-of-words (ignoring inter-word dependencies):
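
P(text) = P(w1) × P(w2) × ... × P(wn)

In practice we work with logs: log P(text) = log P(w1) + ... + log P(wn).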

Finding the counts of words in text is harder than for letters.

* There are many more word types than letters, so we need much more data to cover enough of them.

---

# Data sparsity and smoothing

`count_1w.txt` lists the 333,333 most common word types, with the number of tokens for each, collected by Google.

It doesn't cover a lot of words we want, such as proper nouns.

We'll have to guess the probability of an unknown word.

Lots of ways to do this properly (Laplace smoothing, Good-Turing smoothing)...

...but we'll ignore them all.

Assume unknown words have a count of 1 (so each has probability 1/N, for N total tokens).

---

# Storing word probabilities

We want something like a `defaultdict`, but with our own default value.

Subclass a dict!

The constructor (`__init__`) takes a data file and does all the adding up and taking of logs.

`__missing__` handles the case when the key is missing.


```python
class Pdist(dict):
    def __init__(self, data=[]):
        for key, count in data:
            ...
        self.total = ...
    def __missing__(self, key):
        return ...

Pw = Pdist(data...)

def Pwords(words):
    return ...
```
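
One possible completion, as a sketch: it stores log10 probabilities (matching the test values on the next slide), gives unknown words a count of 1, and uses the `datafile` helper that appears later in these slides.

```python
from math import log10

class Pdist(dict):
    def __init__(self, data=[]):
        # Accumulate the counts, then convert each to a log10 probability.
        for key, count in data:
            self[key] = self.get(key, 0) + int(count)
        self.total = sum(self.values())
        for key in self:
            self[key] = log10(self[key] / self.total)
    def __missing__(self, key):
        # Unknown words are assumed to have a count of 1.
        return log10(1 / self.total)

Pw = Pdist(datafile('count_1w.txt'))

def Pwords(words):
    # Bag-of-words model: sum of per-word log probabilities.
    return sum(Pw[word] for word in words)
```
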

---

# Testing the bag of words model

```python
>>> 'hello' in Pw.keys()
True
>>> 'inigo' in Pw
True
>>> 'blj' in Pw
False
>>> Pw['hello']
-4.25147684171819
>>> Pw['my']
-2.7442478375632335
>>> Pw['name']
-3.102452772219651
>>> Pw['is']
-2.096840784739768
>>> Pw['blj']
-11.76946906492656
```

```python
>>> Pwords(['hello'])
-4.25147684171819
>>> Pwords(['hello', 'my'])
-6.995724679281423
>>> Pwords(['hello', 'my', 'name'])
-10.098177451501074
>>> Pwords(['hello', 'my', 'name', 'is'])
-12.195018236240843
>>> Pwords(['hello', 'my', 'name', 'is', 'inigo'])
-18.927603013570945
>>> Pwords(['hello', 'my', 'name', 'is', 'blj'])
-23.964487301167402
```

---

# Splitting the input

```
To segment a string:
    find all possible splits into a first portion and remainder
    for each split:
        segment the remainder
    return the split with the highest score
```

Indexing pulls out letters: `'sometext'[0]` = `'s'`; `'sometext'[3]` = `'e'`; `'sometext'[-1]` = `'t'`

Slices pull out substrings: `'sometext'[1:4]` = `'ome'`; `'sometext'[:3]` = `'som'`; `'sometext'[5:]` = `'ext'`

`range()` will sweep across the string.
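
Putting indexing, slices, and `range()` together, a minimal `splits` consistent with the test case below (a sketch; your version may differ):

```python
def splits(text):
    # Split point i gives a first word text[:i] and remainder text[i:];
    # i runs up to len(text), so the whole text can be the final word.
    return [(text[:i], text[i:]) for i in range(1, len(text) + 1)]
```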

## Test case

```python
>>> splits('sometext')
[('s', 'ometext'), ('so', 'metext'), ('som', 'etext'), ('some', 'text'),
 ('somet', 'ext'), ('somete', 'xt'), ('sometex', 't'), ('sometext', '')]
```

The last one is important.

* What if this is the last word of the text?

---

# Efficiency and memoisation

* helmut s cousins are i suppose kind in their own way but there is little warmth in the kindness i receive

At any stage, we can consider the sentence as prefix, word, suffix:

* `littlewarmthin | the | kindness i receive`
* `littlewarmthi | nthe | kindness i receive`
* `littlewarmth | inthe | kindness i receive`
* `littlewarmt | hinthe | kindness i receive`

P(sentence) = P(prefix) × P(word) × P(suffix)

* We're assuming independence of the sections.
* For a given word/suffix split, there is only one best segmentation of the suffix.
* The best segmentation of the sentence (with a split here) must contain the best segmentation of the suffix.
* Once we've found it, there's no need to recalculate it.

## What's the complexity now?

---

# Memoisation

* Maintain a table of previously-found results.
* Every time we're asked to calculate a segmentation, look in the table.
* If it's in the table, just return that.
* If not, calculate it and store the result in the table.

Wrap the `segment` function in something that maintains that table.

In the standard library: `lru_cache`, used as a function decorator.

```python
from functools import lru_cache

@lru_cache()
def segment(text):
    ...
```

* (Plenty of tutorials online on function decorators.)
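
One way the body might be completed, as a sketch (it assumes the `splits` and `Pwords` functions from earlier):

```python
from functools import lru_cache

@lru_cache()
def segment(text):
    # For each (first word, remainder) split, segment the remainder
    # (memoised) and keep the candidate with the highest log probability.
    if not text:
        return []
    candidates = ([first] + segment(rest)
                  for first, rest in splits(text))
    return max(candidates, key=Pwords)
```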

---

# Implementation detail

You'll hit Python's recursion depth limit: `segment` can recurse one level per letter of the text, well beyond the default limit of 1000.

Easy to raise:

```python
import sys
sys.setrecursionlimit(1000000)
```

---

# Testing segmentation

```python
>>> segment('hello')
['hello']
>>> segment('hellomy')
['hello', 'my']
>>> segment('hellomyname')
['hello', 'my', 'name']
>>> segment('hellomynameis')
['hellomynameis']
```

Oh.

Why?

---

# A broken language model

```python
>>> Pwords(['hello'])
-4.25147684171819
>>> Pwords(['hello', 'my'])
-6.995724679281423
>>> Pwords(['hello', 'my', 'name'])
-10.098177451501074
>>> Pwords(['hello', 'my', 'name', 'is'])
-12.195018236240843

>>> Pw['is']
-2.096840784739768
>>> Pw['blj']
-11.76946906492656
```

Every unknown word gets the same log probability, −11.77, however long it is: the single "word" `hellomynameis` (−11.77) outscores `hello my name is` (−12.20).

We need a better estimate for the probability of unknown words, one that takes account of the word's length.

* Longer words are less probable.

## To IPython for investigation!

---

# Making Pdist more flexible

We want to give a sensible default for unknown elements.

* But the right default will vary with what we're modelling
* Different languages, *n*-grams, etc.

Make it a parameter!

---

# Hint

```python
class Pdist(dict):
    def __init__(self, data=[], estimate_of_missing=None):
        self.estimate_of_missing = estimate_of_missing
        for key, count in data:
            ...
        self.total = ...
    def __missing__(self, key):
        if self.estimate_of_missing:
            return self.estimate_of_missing(key, self.total)
        else:
            return ...

def log_probability_of_unknown_word(key, N):
    return -log10(N * 10**((len(key) - 2) * 1.4))

Pw = Pdist(datafile('count_1w.txt'), log_probability_of_unknown_word)
```
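
This estimate is log10 of 1 / (N × 10^(1.4 × (length − 2))): each letter beyond the second makes an unknown word about 10^1.4 ≈ 25 times less probable.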

---

# Testing segmentation again

```python
>>> segment('hello')
['hello']
>>> segment('hellomy')
['hello', 'my']
>>> segment('hellomyname')
['hello', 'my', 'name']
>>> segment('hellomynameis')
['hello', 'my', 'name', 'is']
>>> ' '.join(segment(sanitise('HELMU TSCOU SINSA REISU PPOSE KINDI NTHEI ROWNW '
...                           'AYBUT THERE ISLIT TLEWA RMTHI NTHEK INDNE SSIRE CEIVE ')))
'helmut s cousins are i suppose kind in their own way but there is
little warmth in the kindness i receive'
```

Try it out on the full decrypt of `2013/2b.ciphertext` (it's a Caesar cipher).
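
A possible way to chain the two steps (assuming the `caesar_break` and `caesar_decipher` functions from earlier in the course; adjust the names to match your own code):

```python
ciphertext = open('2013/2b.ciphertext').read()
key, fitness = caesar_break(ciphertext)              # find the shift
plaintext = caesar_decipher(sanitise(ciphertext), key)
print(' '.join(segment(plaintext)))
```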


</textarea>
<script src="http://gnab.github.io/remark/downloads/remark-0.6.0.min.js" type="text/javascript">
</script>

<script type="text/javascript"
  src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&delayStartupUntil=configured"></script>

<script type="text/javascript">
var slideshow = remark.create({ ratio: "16:9" });

// Set up MathJax
MathJax.Hub.Config({
  tex2jax: {
    skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
  }
});
MathJax.Hub.Queue(function() {
  $(MathJax.Hub.getAllJax()).map(function(index, elem) {
    return(elem.SourceElement());
  }).parent().addClass('has-jax');
});
MathJax.Hub.Configured();
</script>
</body>
</html>