<!DOCTYPE html>
<html>
<head>
  <title>Word segmentation</title>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
  <style type="text/css">
    /* Slideshow styles */
    body {
      font-size: 20px;
    }
    h1, h2, h3 {
      font-weight: 400;
      margin-bottom: 0;
    }
    h1 { font-size: 3em; }
    h2 { font-size: 2em; }
    h3 { font-size: 1.6em; }
    a, a > code {
      text-decoration: none;
    }
    code {
      -moz-border-radius: 5px;
      -webkit-border-radius: 5px;
      background: #e7e8e2;
      border-radius: 5px;
      font-size: 16px;
    }
    .plaintext {
      background: #272822;
      color: #80ff80;
      text-shadow: 0 0 20px #333;
      padding: 2px 5px;
    }
    .ciphertext {
      background: #272822;
      color: #ff6666;
      text-shadow: 0 0 20px #333;
      padding: 2px 5px;
    }
    .float-right {
      float: right;
    }
  </style>
</head>

<body>
<textarea id="source">

# Word segmentation

`makingsenseofthis`

`making sense of this`

---

# The problem

Ciphertext is re-split into groups to hide word boundaries.

* HELMU TSCOU SINSA REISU PPOSE KINDI NTHEI ROWNW AYBUT THERE ISLIT TLEWA RMTHI NTHEK INDNE SSIRE CEIVE

How can we rediscover the word boundaries?

* helmut s cousins are i suppose kind in their own way but there is little warmth in the kindness i receive

---

# Simple approach

1. Try all possible word boundaries
2. Return the one that looks most like English

What's the complexity of this process?

* (We'll fix that in a bit...)
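
To see the blow-up concretely, here's a sketch (a hypothetical helper, not part of the exercise code): each of the `n - 1` gaps between letters is either a word boundary or not, so a text of `n` letters has `2**(n - 1)` candidate segmentations.

```python
from itertools import combinations

def all_segmentations(text):
    """Yield every segmentation of text: one per subset of the internal gaps."""
    n = len(text)
    for r in range(n):  # r = number of word boundaries chosen
        for gaps in combinations(range(1, n), r):
            bounds = (0,) + gaps + (n,)
            yield [text[i:j] for i, j in zip(bounds, bounds[1:])]
```

`sum(1 for _ in all_segmentations('sometext'))` is `2**7` = 128; the 85-letter ciphertext above has `2**84` candidates.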

---

# What do we mean by "looks like English"?

Naïve Bayes bag-of-words worked well for cipher breaking. Can we apply the same intuition here?

Score a candidate segmentation by the probability of its bag of words (ignoring inter-word dependencies).

Finding the counts of words in text is harder than finding the counts of letters.

* Many more word types than letter types, so we need more data to cover enough words.

---
# Data sparsity and smoothing

`count_1w.txt` lists the 333,333 most common word types, with the number of tokens for each, collected by Google.

Doesn't cover a lot of words we want, such as proper nouns.

We'll have to guess the probability of an unknown word.

Lots of ways to do this properly (Laplace smoothing, Good-Turing smoothing)...

...but we'll ignore them all.

Assume unknown words have a count of 1.

---

# Storing word probabilities

We want something like a `defaultdict` but with our own default value

Subclass a dict!

Constructor (`__init__`) takes a data file, does all the adding up and taking logs

`__missing__` handles the case when the key is missing


```python
class Pdist(dict):
    def __init__(self, data=[]):
        for key, count in data:
            ...
        self.total = ...
    def __missing__(self, key):
        return ...

Pw = Pdist(data...)

def Pwords(words):
    return ...
```
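
One way the blanks might be filled in, as a sketch (the slides leave this as an exercise; the `datafile` reader and the word–count file format here are assumptions):

```python
import math

def datafile(name, sep='\t'):
    """Yield (word, count) pairs from lines of 'word<TAB>count' (assumed format)."""
    with open(name) as f:
        for line in f:
            word, count = line.split(sep)
            yield word, int(count)

class Pdist(dict):
    """Map each word to the log10 of its probability in the training data."""
    def __init__(self, data=[]):
        counts = dict(data)
        self.total = sum(counts.values())
        for key, count in counts.items():
            self[key] = math.log10(count / self.total)
    def __missing__(self, key):
        # crude smoothing: pretend every unknown word was seen once
        return math.log10(1 / self.total)

Pw = Pdist(datafile('count_1w.txt'))

def Pwords(words):
    """Log-probability of a bag of words, assuming the words are independent."""
    return sum(Pw[word] for word in words)
```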

---

# Testing the bag of words model


```python
>>> 'hello' in Pw.keys()
True
>>> 'inigo' in Pw.keys()
True
>>> 'blj' in Pw.keys()
False
>>> Pw['hello']
-4.25147684171819
>>> Pw['my']
-2.7442478375632335
>>> Pw['name']
-3.102452772219651
>>> Pw['is']
-2.096840784739768
>>> Pw['blj']
-11.76946906492656
>>> Pwords(['hello'])
-4.25147684171819
>>> Pwords(['hello', 'my'])
-6.995724679281423
>>> Pwords(['hello', 'my', 'name'])
-10.098177451501074
>>> Pwords(['hello', 'my', 'name', 'is'])
-12.195018236240843
>>> Pwords(['hello', 'my', 'name', 'is', 'inigo'])
-18.927603013570945
>>> Pwords(['hello', 'my', 'name', 'is', 'blj'])
-23.964487301167402
```

---

# Splitting the input

```
To segment a string:
    find all possible splits into a first portion and remainder
    for each split:
        segment the remainder
    return the split with highest score
```

Indexing pulls out letters. `'sometext'[0]` = 's' ; `'sometext'[3]` = 'e' ; `'sometext'[-1]` = 't'

Slices pull out substrings. `'sometext'[1:4]` = 'ome' ; `'sometext'[:3]` = 'som' ; `'sometext'[5:]` = 'ext'

`range()` will sweep across the string
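
Those three pieces are enough for a one-line `splits` (a sketch that matches the test case below):

```python
def splits(text):
    """All (first, rest) pairs, cutting after each position 1..len(text)."""
    # the final pair is (text, ''), so the whole text can be the last word
    return [(text[:i], text[i:]) for i in range(1, len(text) + 1)]
```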

## Test case

```python
>>> splits('sometext')
[('s', 'ometext'), ('so', 'metext'), ('som', 'etext'), ('some', 'text'),
 ('somet', 'ext'), ('somete', 'xt'), ('sometex', 't'), ('sometext', '')]
```

The last one is important

* What if this is the last word of the text?

---

# Efficiency and memoisation

* helmut s cousins are i suppose kind in their own way but there is little warmth in the kindness i receive

At any stage, can consider the sentence as prefix, word, suffix

* `littlewarmthin | the | kindness i receive`
* `littlewarmthi | nthe | kindness i receive`
* `littlewarmth | inthe | kindness i receive`
* `littlewarmt | hinthe | kindness i receive`

P(sentence) = P(prefix) × P(word) × P(suffix)

* We're assuming independence of sections.
* For a given word/suffix split, there is only one best segmentation of the suffix.
* Best segmentation of sentence (with split here) must have the best segmentation of the suffix.
* Once we've found it, no need to recalculate it.

## What's the complexity now?

---

# Memoisation

* Maintain a table of previously-found results
* Every time we're asked to calculate a segmentation, look in the table.
* If it's in the table, just return that.
* If not, calculate it and store the result in the table.

Wrap the segment function in something that maintains that table.

In the standard library: `lru_cache` as a function decorator.

```python
from functools import lru_cache

@lru_cache()
def segment(text):
    ...
```
* (Plenty of tutorials online on function decorators.)
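
Putting the pieces together, a minimal sketch of the memoised function (assuming the `splits` and `Pwords` defined earlier; `maxsize=None` keeps every previously-found result):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def segment(text):
    """Return the highest-scoring segmentation of text as a list of words."""
    if not text:
        return []
    candidates = ([first] + segment(rest) for first, rest in splits(text))
    return max(candidates, key=Pwords)
```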

---

# Implementation detail

You'll hit Python's recursion depth limit.

Easy to raise:

```python
import sys
sys.setrecursionlimit(1000000)
```

---

# Testing segmentation

```python
>>> segment('hello')
['hello']
>>> segment('hellomy')
['hello', 'my']
>>> segment('hellomyname')
['hello', 'my', 'name']
>>> segment('hellomynameis')
['hellomynameis']
```

Oh.

Why?

---

# A broken language model

```python
>>> Pwords(['hello'])
-4.25147684171819
>>> Pwords(['hello', 'my'])
-6.995724679281423
>>> Pwords(['hello', 'my', 'name'])
-10.098177451501074
>>> Pwords(['hello', 'my', 'name', 'is'])
-12.195018236240843

>>> Pw['is']
-2.096840784739768
>>> Pw['blj']
-11.76946906492656
```

Need a better estimate for the probability of unknown words.

It needs to take account of the length of the word.

* Longer words are less probable.

## To IPython for investigation!
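
One such investigation, as a sketch (assuming the `Pw` built above): how quickly does the best log-probability at each word length fall?

```python
from collections import defaultdict

# best (highest) log-probability among known words of each length
best_at_length = defaultdict(lambda: float('-inf'))
for word in Pw:
    best_at_length[len(word)] = max(best_at_length[len(word)], Pw[word])
for length in sorted(best_at_length):
    print(length, best_at_length[length])
```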

---

# Making Pdist more flexible

Want to give a sensible default for unknown elements

* But this will vary by referent
* Different languages, *n*-grams, etc.

Make it a parameter!

---

# Hint

```python
from math import log10

class Pdist(dict):
    def __init__(self, data=[], estimate_of_missing=None):
        self.estimate_of_missing = estimate_of_missing
        for key, count in data:
            ...
        self.total = ...
    def __missing__(self, key):
        if self.estimate_of_missing:
            return self.estimate_of_missing(key, self.total)
        else:
            return ...

def log_probability_of_unknown_word(key, N):
    return -log10(N * 10**((len(key) - 2) * 1.4))

Pw = Pdist(datafile('count_1w.txt'), log_probability_of_unknown_word)
```
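
A quick sanity check of the estimator (a sketch; the exact numbers depend on the counts file):

```python
# longer unknown words should get steeply lower log-probabilities
for word in ['blj', 'qwzxv', 'qwzxvqwzxv']:
    print(word, log_probability_of_unknown_word(word, Pw.total))
```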

---

# Testing segmentation again

```python
>>> segment('hello')
['hello']
>>> segment('hellomy')
['hello', 'my']
>>> segment('hellomyname')
['hello', 'my', 'name']
>>> segment('hellomynameis')
['hello', 'my', 'name', 'is']
>>> ' '.join(segment(sanitise('HELMU TSCOU SINSA REISU PPOSE KINDI NTHEI ROWNW '
...                           'AYBUT THERE ISLIT TLEWA RMTHI NTHEK INDNE SSIRE CEIVE ')))
'helmut s cousins are i suppose kind in their own way but there is
little warmth in the kindness i receive'
```

Try it out on the full decrypt of `2013/2b.ciphertext` (it's a Caesar cipher)


</textarea>
<script src="http://gnab.github.io/remark/downloads/remark-0.6.0.min.js" type="text/javascript">
</script>

<script type="text/javascript"
  src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&delayStartupUntil=configured"></script>

<script type="text/javascript">
  var slideshow = remark.create({ ratio: "16:9" });

  // Setup MathJax
  MathJax.Hub.Config({
    tex2jax: {
      skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
    }
  });
  MathJax.Hub.Queue(function() {
    $(MathJax.Hub.getAllJax()).map(function(index, elem) {
      return(elem.SourceElement());
    }).parent().addClass('has-jax');
  });
  MathJax.Hub.Configured();
</script>
</body>
</html>