<title>Word segmentation</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<style type="text/css">
h1 { font-size: 3em; }
h2 { font-size: 2em; }
h3 { font-size: 1.6em; }
text-decoration: none;
-moz-border-radius: 5px;
-webkit-border-radius: 5px;
text-shadow: 0 0 20px #333;
text-shadow: 0 0 20px #333;
<textarea id="source">
`making sense of this`

.indexlink[[Index](index.html)]
Ciphertext is re-split into groups to hide word boundaries.

* HELMU TSCOU SINSA REISU PPOSE KINDI NTHEI ROWNW AYBUT THERE ISLIT TLEWA RMTHI NTHEK INDNE SSIRE CEIVE

How can we rediscover the word boundaries?

* helmut s cousins are i suppose kind in their own way but there is little warmth in the kindness i receive
1. Try all possible word boundaries
2. Return the one that looks most like English

What's the complexity of this process?

* (We'll fix that in a bit...)
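To see why the brute-force version hurts, here is a sketch of it (the name `all_segmentations` is illustrative, not from the course code). Each of the *n* − 1 gaps between letters either is or isn't a boundary, so there are 2^(*n*−1) candidate segmentations:

```python
def all_segmentations(text):
    """Yield every segmentation of text: all 2**(len(text)-1) of them."""
    if not text:
        yield []
        return
    for i in range(1, len(text) + 1):                 # choose the first word...
        for rest in all_segmentations(text[i:]):      # ...then segment the remainder
            yield [text[:i]] + rest
```

Scoring every candidate and taking the best is hopeless for a hundred-letter ciphertext, hence the fix promised above.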
# What do we mean by "looks like English"?
Naïve Bayes bag-of-words worked well for cipher breaking. Can we apply the same intuition here?

Probability of a bag-of-words (ignoring inter-word dependencies).
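As a sketch, scoring a bag of words is just a sum of per-word log probabilities (assuming `Pw`, built below, maps each word to its log10 probability):

```python
def Pwords(words):
    """Log10 probability of a bag of words, assuming words occur independently."""
    return sum(Pw[w] for w in words)
```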
Finding the counts of words in text is harder than for letters.

* There are many more word types than letter types, so we need much more data to cover enough words.
# Data sparsity and smoothing

`count_1w.txt` lists the 333,333 most common word types, with the number of tokens for each, collected by Google.
Doesn't cover a lot of words we want, such as proper nouns.

We'll have to guess the probability of an unknown word.

Lots of ways to do this properly (Laplace smoothing, Good-Turing smoothing)...

...but we'll ignore them all.
Assume unknown words have a count of 1.
# Storing word probabilities

We want something like a `defaultdict` but with our own default value.

Constructor (`__init__`) takes a data file, does all the adding up and taking logs.

`__missing__` handles the case when the key is missing.
from math import log10

class Pdist(dict):
    def __init__(self, data=[]):
        for key, count in data:
            self[key] = int(count)
        self.total = sum(self.values())
        for key in self:   # convert raw counts to log10 probabilities
            self[key] = log10(self[key] / self.total)

    def __missing__(self, key):
        return log10(1 / self.total)   # unknown words: assume a count of 1
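The constructor needs the counts as `(word, count)` pairs; a minimal reader for a tab-separated counts file might look like this (`datafile` here is a sketch of the course's helper, not necessarily its exact code):

```python
def datafile(name, sep='\t'):
    """Yield (word, count) pairs from a 'word<TAB>count' file (a sketch)."""
    with open(name) as f:
        for line in f:
            yield line.strip().split(sep)

Pw = Pdist(datafile('count_1w.txt'))
```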
# Testing the bag of words model

>>> 'hello' in Pw.keys()
True
>>> 'inigo' in Pw
True
>>> 'blj' in Pw
False
>>> Pw['hello']
-4.25147684171819
>>> Pw['my']
-2.7442478375632335
>>> Pw['name']
-3.102452772219651

>>> Pwords(['hello'])
-4.25147684171819
>>> Pwords(['hello', 'my'])
-6.995724679281423
>>> Pwords(['hello', 'my', 'name'])
-10.098177451501074
>>> Pwords(['hello', 'my', 'name', 'is'])
-12.195018236240843
>>> Pwords(['hello', 'my', 'name', 'is', 'inigo'])
-18.927603013570945
>>> Pwords(['hello', 'my', 'name', 'is', 'blj'])
-23.964487301167402
# Splitting the input

find all possible splits into a first portion and remainder

segment the remainder
return the split with highest score
Indexing pulls out letters. `'sometext'[0]` = 's' ; `'sometext'[3]` = 'e' ; `'sometext'[-1]` = 't'

Slices pull out substrings. `'sometext'[1:4]` = 'ome' ; `'sometext'[:3]` = 'som' ; `'sometext'[5:]` = 'ext'
`range()` will sweep across the string
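One plausible `splits` (a sketch; the doctest below pins down the behaviour it must match):

```python
def splits(text):
    """All (first, rest) pairs of text; the final pair is (text, '')."""
    return [(text[:i + 1], text[i + 1:]) for i in range(len(text))]
```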
>>> splits('sometext')
[('s', 'ometext'), ('so', 'metext'), ('som', 'etext'), ('some', 'text'),
 ('somet', 'ext'), ('somete', 'xt'), ('sometex', 't'), ('sometext', '')]
The last one is important

* What if this is the last word of the text?
# Efficiency and memoisation

* helmut s cousins are i suppose kind in their own way but there is little warmth in the kindness i receive
At any stage, can consider the sentence as prefix, word, suffix

* `littlewarmthin | the | kindness i receive`
* `littlewarmthi | nthe | kindness i receive`
* `littlewarmth | inthe | kindness i receive`
* `littlewarmt | hinthe | kindness i receive`

P(sentence) = P(prefix) × P(word) × P(suffix)
* We're assuming independence of sections.
* For a given word/suffix split, there is only one best segmentation of the suffix.
* Best segmentation of sentence (with split here) must have the best segmentation of the suffix.
* Once we've found it, no need to recalculate it.
## What's the complexity now?

* Maintain a table of previously-found results.
* Every time we're asked to calculate a segmentation, look in the table.
* If it's in the table, just return that.
* If not, calculate it and store the result in the table.

Wrap the segment function in something that maintains that table.

In the standard library: `lru_cache` as a function decorator.
from functools import lru_cache

* (Plenty of tutorials online on function decorators.)
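Putting the pieces together, a memoised `segment` might look like this (a sketch using the `splits` and `Pwords` above, with `lru_cache` imported as shown; not necessarily the course's exact code):

```python
@lru_cache(maxsize=None)       # remember every segmentation we've already found
def segment(text):
    """Best segmentation of text into a list of words, by bag-of-words score."""
    if not text:
        return []
    candidates = ([first] + segment(rest) for first, rest in splits(text))
    return max(candidates, key=Pwords)
```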
# Implementation detail

You'll hit Python's recursion limit.
import sys
sys.setrecursionlimit(1000000)
# Testing segmentation

>>> segment('hellomy')
['hello', 'my']
>>> segment('hellomyname')
['hello', 'my', 'name']
>>> segment('hellomynameis')
['hellomynameis']
# A broken language model
>>> Pwords(['hello'])
-4.25147684171819
>>> Pwords(['hello', 'my'])
-6.995724679281423
>>> Pwords(['hello', 'my', 'name'])
-10.098177451501074
>>> Pwords(['hello', 'my', 'name', 'is'])
-12.195018236240843

* An unknown word always scores log10(1/N) ≈ -11.77, however long it is, so the single unknown token `hellomynameis` beats the correct four-word split.
Need a better estimate for the probability of unknown words.

It needs to take account of the length of the word.

* Longer words are less probable.
## To IPython for investigation!
# Making Pdist more flexible

Want to give a sensible default for unknown elements

* But what counts as sensible varies with what we're modelling
* Different languages, *n*-grams, etc.
class Pdist(dict):
    def __init__(self, data=[], estimate_of_missing=None):
        for key, count in data:
            self[key] = int(count)
        self.total = sum(self.values())
        for key in self:
            self[key] = log10(self[key] / self.total)
        self.estimate_of_missing = estimate_of_missing

    def __missing__(self, key):
        if self.estimate_of_missing:
            return self.estimate_of_missing(key, self.total)
        else:
            return log10(1 / self.total)

def log_probability_of_unknown_word(key, N):
    return -log10(N * 10**((len(key) - 2) * 1.4))

Pw = Pdist(datafile('count_1w.txt'), log_probability_of_unknown_word)

Now a thirteen-letter unknown like `hellomynameis` costs -(log10(N) + 11 × 1.4) ≈ -27.2, far worse than the genuine four-word split at -12.2.
# Testing segmentation again

>>> segment('hellomy')
['hello', 'my']
>>> segment('hellomyname')
['hello', 'my', 'name']
>>> segment('hellomynameis')
['hello', 'my', 'name', 'is']
>>> ' '.join(segment(sanitise('HELMU TSCOU SINSA REISU PPOSE KINDI NTHEI ROWNW '
...                           'AYBUT THERE ISLIT TLEWA RMTHI NTHEK INDNE SSIRE CEIVE ')))
'helmut s cousins are i suppose kind in their own way but there is little warmth in the kindness i receive'
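`sanitise` above just normalises the text before segmenting; assuming it keeps letters only and lowercases them, a minimal sketch is:

```python
def sanitise(text):
    """Keep letters only, lowercased (assumed behaviour; a sketch)."""
    return ''.join(c.lower() for c in text if c.isalpha())
```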
Try it out on the full decrypt of `2013/2b.ciphertext` (it's a Caesar cipher).
</textarea>
<script src="http://gnab.github.io/remark/downloads/remark-0.6.0.min.js" type="text/javascript">
</script>
<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&delayStartupUntil=configured"></script>
<script type="text/javascript">
var slideshow = remark.create({ ratio: "16:9" });

// Setup MathJax
MathJax.Hub.Config({
    tex2jax: {
        skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
    }
});

MathJax.Hub.Queue(function() {
    $(MathJax.Hub.getAllJax()).map(function(index, elem) {
        return(elem.SourceElement());
    }).parent().addClass('has-jax');
});

MathJax.Hub.Configured();
</script>