<title>Word segmentation</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<style type="text/css">
h1 { font-size: 3em; }
h2 { font-size: 2em; }
h3 { font-size: 1.6em; }
text-decoration: none;
-moz-border-radius: 5px;
-webkit-border-radius: 5px;
text-shadow: 0 0 20px #333;
text-shadow: 0 0 20px #333;
<textarea id="source">

`making sense of this`
Ciphertext is re-split into groups to hide word boundaries.

* HELMU TSCOU SINSA REISU PPOSE KINDI NTHEI ROWNW AYBUT THERE ISLIT TLEWA RMTHI NTHEK INDNE SSIRE CEIVE

How can we rediscover the word boundaries?

* helmut s cousins are i suppose kind in their own way but there is little warmth in the kindness i receive
1. Try all possible word boundaries
2. Return the one that looks most like English

What's the complexity of this process?

* (We'll fix that in a bit...)
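The brute-force idea can be sketched like this (a sketch; the function name is ours, not the slides' code). Each of the `len(text) - 1` gaps between letters is either a boundary or not, so there are 2<sup>n−1</sup> candidate segmentations:

```python
def all_segmentations(text):
    """Yield every way of splitting text into words: each gap between
    letters is either a word boundary or not, giving 2**(len(text)-1)
    candidates in total."""
    if len(text) <= 1:
        yield [text]
        return
    for rest in all_segmentations(text[1:]):
        yield [text[0]] + rest                # boundary after the first letter
        yield [text[0] + rest[0]] + rest[1:]  # no boundary: glue onto the next word
```

With a scoring function for "looks like English", the naive segmenter is just `max(all_segmentations(text), key=score)` — and that exponential blow-up is the complexity problem.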
# What do we mean by "looks like English"?
Naïve Bayes bag-of-words worked well for cipher breaking. Can we apply the same intuition here?

Probability of a bag-of-words (ignoring inter-word dependencies).

Finding the counts of words in text is harder than finding the counts of letters.

* Many more word types than letter types, so we need more data to cover enough words.
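Scoring a bag of words is then just a sum of per-word log probabilities (a minimal sketch: the table values are invented, and the `-10.0` default is a crude stand-in for the unknown-word handling discussed next):

```python
# Toy log10 probability table standing in for the real Pw (values invented).
Pw = {'hello': -4.25, 'my': -2.74, 'name': -3.10}

def Pwords(words):
    """log10 P(bag of words), assuming the words are independent:
    log P(w1 .. wn) = log P(w1) + ... + log P(wn)."""
    return sum(Pw.get(w, -10.0) for w in words)  # -10.0: crude unknown-word guess
```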
# Data sparsity and smoothing

`count_1w.txt` is the 333,333 most common word types, with the number of tokens for each, collected by Google.

Doesn't cover a lot of words we want, such as proper nouns.

We'll have to guess the probability of an unknown word.

Lots of ways to do this properly (Laplace smoothing, Good-Turing smoothing)...

...but we'll ignore them all.

Assume unknown words have a count of 1.
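In log space, that count-of-1 guess is simply log10(1/N) for N total tokens (a sketch; the function name is ours, not the slides' code):

```python
from math import log10

def log_prob_unknown(N):
    """Pretend every unseen word occurred exactly once among N tokens."""
    return log10(1 / N)
```

So with N around 10\*\*9 tokens, every unknown word scores about −9 — however implausible the letter string, which is what eventually breaks the model.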
# Storing word probabilities

We want something like a `defaultdict`, but with our own default value.

The constructor (`__init__`) takes a data file and does all the adding up and taking of logs.

`__missing__` handles the case when the key is missing.
    def __init__(self, data=[]):
        data1, data2 = itertools.tee(data)
        for key, count in data2:
            ...

    def __missing__(self, key):
        ...
# Testing the bag of words model

    >>> 'hello' in Pw.keys()
    True
    >>> 'inigo' in Pw.keys()
    True
    >>> 'blj' in Pw.keys()
    False
    >>> Pw['hello']
    -4.25147684171819
    >>> Pw['my']
    -2.7442478375632335
    >>> Pw['name']
    -3.102452772219651

    >>> Pwords(['hello'])
    -4.25147684171819
    >>> Pwords(['hello', 'my'])
    -6.995724679281423
    >>> Pwords(['hello', 'my', 'name'])
    -10.098177451501074
    >>> Pwords(['hello', 'my', 'name', 'is'])
    -12.195018236240843
    >>> Pwords(['hello', 'my', 'name', 'is', 'inigo'])
    -18.927603013570945
    >>> Pwords(['hello', 'my', 'name', 'is', 'blj'])
    -23.964487301167402
# Splitting the input

1. Find all possible splits into a first portion and remainder
2. Segment the remainder
3. Return the split with the highest score
Indexing pulls out letters. `'sometext'[0]` = 's' ; `'sometext'[3]` = 'e' ; `'sometext'[-1]` = 't'

Slices pull out substrings. `'sometext'[1:4]` = 'ome' ; `'sometext'[:3]` = 'som' ; `'sometext'[5:]` = 'ext'

`range()` will sweep across the string
    >>> splits('sometext')
    [('s', 'ometext'), ('so', 'metext'), ('som', 'etext'), ('some', 'text'),
     ('somet', 'ext'), ('somete', 'xt'), ('sometex', 't'), ('sometext', '')]

The last one is important.

* What if this is the last word of the text?
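A possible one-liner matching that doctest (a sketch, using the indexing and slicing tricks above):

```python
def splits(text):
    """All ways to cut text into a non-empty first portion and the
    remainder -- including the cut after the last letter, so the
    whole text can stand as the final word."""
    return [(text[:i + 1], text[i + 1:]) for i in range(len(text))]
```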
# Efficiency and memoisation

* helmut s cousins are i suppose kind in their own way but there is little warmth in the kindness i receive

At any stage, we can consider the sentence as prefix, word, suffix.

* `littlewarmthin | the | kindness i receive`
* `littlewarmthi | nthe | kindness i receive`
* `littlewarmth | inthe | kindness i receive`
* `littlewarmt | hinthe | kindness i receive`

P(sentence) = P(prefix) × P(word) × P(suffix)

* We're assuming independence of sections.
* For a given word/suffix split, there is only one best segmentation of the suffix.
* The best segmentation of the sentence (with a split here) must contain the best segmentation of the suffix.
* Once we've found it, there's no need to recalculate it.
## What's the complexity now?

* Maintain a table of previously-found results.
* Every time we're asked to calculate a segmentation, look in the table.
* If it's in the table, just return that.
* If not, calculate it and store the result in the table.

Wrap the segment function in something that maintains that table.
In the standard library: `lru_cache` as a function decorator.
    from functools import lru_cache

* (Plenty of tutorials online on function decorators.)
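Putting the pieces together, a memoised segmenter might look like this — a sketch, with `Pwords` and `splits` repeated so it is self-contained, and a toy word table invented for illustration (unknown words get a count of 1):

```python
from functools import lru_cache
from math import log10

# Toy language model: invented counts for a few words.
counts = {'hello': 700, 'my': 900, 'name': 500, 'is': 950}
N = sum(counts.values())

def Pwords(words):
    """log10 probability of a bag of words; unknowns count as 1."""
    return sum(log10(counts.get(w, 1) / N) for w in words)

def splits(text):
    """All (first portion, remainder) cuts, including ('text', '')."""
    return [(text[:i + 1], text[i + 1:]) for i in range(len(text))]

@lru_cache(maxsize=None)
def segment(text):
    """Best segmentation: try every first-word/remainder split,
    segment the remainder recursively (cached), keep the best."""
    if not text:
        return []
    candidates = ([first] + segment(rest) for first, rest in splits(text))
    return max(candidates, key=Pwords)
```

`lru_cache` keys the table on the string argument, so each suffix is segmented only once — exactly the table-keeping scheme above.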
# Implementation detail

You'll hit Python's recursion depth limit.

    import sys
    sys.setrecursionlimit(1000000)
# Testing segmentation

    >>> segment('hellomy')
    ['hello', 'my']
    >>> segment('hellomyname')
    ['hello', 'my', 'name']
    >>> segment('hellomynameis')
# A broken language model

    >>> Pwords(['hello'])
    >>> Pwords(['hello', 'my'])
    >>> Pwords(['hello', 'my', 'name'])
    >>> Pwords(['hello', 'my', 'name', 'is'])

We need a better estimate for the probability of unknown words.

It needs to take account of the length of the word.

* Longer words are less probable.

## To IPython for investigation!
# Making Pdist more flexible

We want to give a sensible default for unknown elements.

* But the right default varies with what the distribution describes
* Different languages, *n*-grams, etc.
    def __init__(self, data=[], estimate_of_missing=None):
        data1, data2 = itertools.tee(data)
        for key, count in data2:
            ...
        self.estimate_of_missing = estimate_of_missing

    def __missing__(self, key):
        if self.estimate_of_missing:
            return self.estimate_of_missing(key, self.total)
        ...

    def log_probability_of_unknown_word(key, N):
        return -log10(N * 10**((len(key) - 2) * 1.4))

    Pw = Pdist(datafile('count_1w.txt'), log_probability_of_unknown_word)
# Testing segmentation again

    >>> segment('hellomy')
    ['hello', 'my']
    >>> segment('hellomyname')
    ['hello', 'my', 'name']
    >>> segment('hellomynameis')
    ['hello', 'my', 'name', 'is']
    >>> ' '.join(segment(sanitise('HELMU TSCOU SINSA REISU PPOSE KINDI NTHEI ROWNW '
    ...                           'AYBUT THERE ISLIT TLEWA RMTHI NTHEK INDNE SSIRE CEIVE ')))
    'helmut s cousins are i suppose kind in their own way but there is little warmth in the kindness i receive'

Try it out on the full decrypt of `2013/2b.ciphertext` (it's a Caesar cipher).
<script src="http://gnab.github.io/remark/downloads/remark-0.6.0.min.js" type="text/javascript">
<script type="text/javascript"
  src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&delayStartupUntil=configured"></script>
<script type="text/javascript">
var slideshow = remark.create({ ratio: "16:9" });
skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
MathJax.Hub.Queue(function() {
  $(MathJax.Hub.getAllJax()).map(function(index, elem) {
    return(elem.SourceElement());
  }).parent().addClass('has-jax');
MathJax.Hub.Configured();