<title>Word segmentation</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<style type="text/css">
h1 { font-size: 3em; }
h2 { font-size: 2em; }
h3 { font-size: 1.6em; }
text-decoration: none;
-moz-border-radius: 5px;
-webkit-border-radius: 5px;
text-shadow: 0 0 20px #333;
text-shadow: 0 0 20px #333;
<textarea id="source">
`making sense of this`

.indexlink[[Index](index.html)]
Ciphertext is re-split into groups to hide word boundaries.

* HELMU TSCOU SINSA REISU PPOSE KINDI NTHEI ROWNW AYBUT THERE ISLIT TLEWA RMTHI NTHEK INDNE SSIRE CEIVE

How can we rediscover the word boundaries?

* helmut s cousins are i suppose kind in their own way but there is little warmth in the kindness i receive
1. Try all possible word boundaries
2. Return the one that looks most like English

What's the complexity of this process?

* (We'll fix that in a bit...)
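To see why the brute-force version hurts, here is a sketch of it (the name `all_segmentations` is illustrative, not from the course code). Each of the *n* − 1 gaps between letters either is or isn't a boundary, so there are 2^(*n*−1) candidate segmentations:

```python
def all_segmentations(text):
    """Yield every segmentation of text: all 2**(len(text)-1) of them."""
    if not text:
        yield []
        return
    for i in range(1, len(text) + 1):                 # choose the first word...
        for rest in all_segmentations(text[i:]):      # ...then segment the remainder
            yield [text[:i]] + rest
```

Scoring every candidate and taking the best is hopeless for a hundred-letter ciphertext, hence the fix promised above.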
# What do we mean by "looks like English"?
Naïve Bayes bag-of-words worked well for cipher breaking. Can we apply the same intuition here?

Probability of a bag-of-words (ignoring inter-word dependencies).
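As a sketch, scoring a bag of words is just a sum of per-word log probabilities (assuming `Pw`, built below, maps each word to its log10 probability):

```python
def Pwords(words):
    """Log10 probability of a bag of words, assuming words occur independently."""
    return sum(Pw[w] for w in words)
```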
Finding the counts of words in text is harder than for letters.

* There are many more word types than letter types, so we need much more data to cover enough words.
# Data sparsity and smoothing

`count_1w.txt` lists the 333,333 most common word types, with the number of tokens for each, collected by Google.
Doesn't cover a lot of words we want, such as proper nouns.

We'll have to guess the probability of an unknown word.

Lots of ways to do this properly (Laplace smoothing, Good-Turing smoothing)...

...but we'll ignore them all.
Assume unknown words have a count of 1.
# Storing word probabilities

We want something like a `defaultdict` but with our own default value.

Constructor (`__init__`) takes a data file, does all the adding up and taking logs.

`__missing__` handles the case when the key is missing.
from math import log10

class Pdist(dict):
    def __init__(self, data=[]):
        for key, count in data:
            self[key] = int(count)
        self.total = sum(self.values())
        for key in self:   # convert raw counts to log10 probabilities
            self[key] = log10(self[key] / self.total)

    def __missing__(self, key):
        return log10(1 / self.total)   # unknown words: assume a count of 1
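The constructor needs the counts as `(word, count)` pairs; a minimal reader for a tab-separated counts file might look like this (`datafile` here is a sketch of the course's helper, not necessarily its exact code):

```python
def datafile(name, sep='\t'):
    """Yield (word, count) pairs from a 'word<TAB>count' file (a sketch)."""
    with open(name) as f:
        for line in f:
            yield line.strip().split(sep)

Pw = Pdist(datafile('count_1w.txt'))
```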
# Testing the bag of words model

>>> 'hello' in Pw.keys()
True
>>> 'inigo' in Pw
True
>>> 'blj' in Pw
False
>>> Pw['hello']
-4.25147684171819
>>> Pw['my']
-2.7442478375632335
>>> Pw['name']
-3.102452772219651

>>> Pwords(['hello'])
-4.25147684171819
>>> Pwords(['hello', 'my'])
-6.995724679281423
>>> Pwords(['hello', 'my', 'name'])
-10.098177451501074
>>> Pwords(['hello', 'my', 'name', 'is'])
-12.195018236240843
>>> Pwords(['hello', 'my', 'name', 'is', 'inigo'])
-18.927603013570945
>>> Pwords(['hello', 'my', 'name', 'is', 'blj'])
-23.964487301167402
# Splitting the input

find all possible splits into a first portion and remainder

segment the remainder
return the split with highest score
Indexing pulls out letters. `'sometext'[0]` = 's' ; `'sometext'[3]` = 'e' ; `'sometext'[-1]` = 't'

Slices pull out substrings. `'sometext'[1:4]` = 'ome' ; `'sometext'[:3]` = 'som' ; `'sometext'[5:]` = 'ext'
`range()` will sweep across the string
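One plausible `splits` (a sketch; the doctest below pins down the behaviour it must match):

```python
def splits(text):
    """All (first, rest) pairs of text; the final pair is (text, '')."""
    return [(text[:i + 1], text[i + 1:]) for i in range(len(text))]
```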
>>> splits('sometext')
[('s', 'ometext'), ('so', 'metext'), ('som', 'etext'), ('some', 'text'),
 ('somet', 'ext'), ('somete', 'xt'), ('sometex', 't'), ('sometext', '')]
The last one is important

* What if this is the last word of the text?
# Efficiency and memoisation

* helmut s cousins are i suppose kind in their own way but there is little warmth in the kindness i receive
At any stage, can consider the sentence as prefix, word, suffix

* `littlewarmthin | the | kindness i receive`
* `littlewarmthi | nthe | kindness i receive`
* `littlewarmth | inthe | kindness i receive`
* `littlewarmt | hinthe | kindness i receive`

P(sentence) = P(prefix) × P(word) × P(suffix)
* We're assuming independence of sections.
* For a given word/suffix split, there is only one best segmentation of the suffix.
* Best segmentation of sentence (with split here) must have the best segmentation of the suffix.
* Once we've found it, no need to recalculate it.
## What's the complexity now?

* Maintain a table of previously-found results.
* Every time we're asked to calculate a segmentation, look in the table.
* If it's in the table, just return that.
* If not, calculate it and store the result in the table.

Wrap the segment function in something that maintains that table.

In the standard library: `lru_cache` as a function decorator.
from functools import lru_cache

* (Plenty of tutorials online on function decorators.)
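Putting the pieces together, a memoised `segment` might look like this (a sketch using the `splits` and `Pwords` above, with `lru_cache` imported as shown; not necessarily the course's exact code):

```python
@lru_cache(maxsize=None)       # remember every segmentation we've already found
def segment(text):
    """Best segmentation of text into a list of words, by bag-of-words score."""
    if not text:
        return []
    candidates = ([first] + segment(rest) for first, rest in splits(text))
    return max(candidates, key=Pwords)
```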
# Implementation detail

You'll hit Python's recursion limit.
import sys
sys.setrecursionlimit(1000000)
# Testing segmentation

>>> segment('hellomy')
['hello', 'my']
>>> segment('hellomyname')
['hello', 'my', 'name']
>>> segment('hellomynameis')
['hellomynameis']
# A broken language model
>>> Pwords(['hello'])
-4.25147684171819
>>> Pwords(['hello', 'my'])
-6.995724679281423
>>> Pwords(['hello', 'my', 'name'])
-10.098177451501074
>>> Pwords(['hello', 'my', 'name', 'is'])
-12.195018236240843

* An unknown word always scores log10(1/N) ≈ -11.77, however long it is, so the single unknown token `hellomynameis` beats the correct four-word split.
Need a better estimate for the probability of unknown words.

It needs to take account of the length of the word.

* Longer words are less probable.
## To IPython for investigation!
# Making Pdist more flexible

Want to give a sensible default for unknown elements

* But what counts as sensible varies with what we're modelling
* Different languages, *n*-grams, etc.
class Pdist(dict):
    def __init__(self, data=[], estimate_of_missing=None):
        for key, count in data:
            self[key] = int(count)
        self.total = sum(self.values())
        for key in self:
            self[key] = log10(self[key] / self.total)
        self.estimate_of_missing = estimate_of_missing

    def __missing__(self, key):
        if self.estimate_of_missing:
            return self.estimate_of_missing(key, self.total)
        else:
            return log10(1 / self.total)

def log_probability_of_unknown_word(key, N):
    return -log10(N * 10**((len(key) - 2) * 1.4))

Pw = Pdist(datafile('count_1w.txt'), log_probability_of_unknown_word)

Now a thirteen-letter unknown like `hellomynameis` costs -(log10(N) + 11 × 1.4) ≈ -27.2, far worse than the genuine four-word split at -12.2.
# Testing segmentation again

>>> segment('hellomy')
['hello', 'my']
>>> segment('hellomyname')
['hello', 'my', 'name']
>>> segment('hellomynameis')
['hello', 'my', 'name', 'is']
>>> ' '.join(segment(sanitise('HELMU TSCOU SINSA REISU PPOSE KINDI NTHEI ROWNW '
...                           'AYBUT THERE ISLIT TLEWA RMTHI NTHEK INDNE SSIRE CEIVE ')))
'helmut s cousins are i suppose kind in their own way but there is little warmth in the kindness i receive'
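`sanitise` above just normalises the text before segmenting; assuming it keeps letters only and lowercases them, a minimal sketch is:

```python
def sanitise(text):
    """Keep letters only, lowercased (assumed behaviour; a sketch)."""
    return ''.join(c.lower() for c in text if c.isalpha())
```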
Try it out on the full decrypt of `2013/2b.ciphertext` (it's a Caesar cipher).
</textarea>
<script src="http://gnab.github.io/remark/downloads/remark-0.6.0.min.js" type="text/javascript">
</script>
<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&delayStartupUntil=configured"></script>
<script type="text/javascript">
var slideshow = remark.create({ ratio: "16:9" });

// Setup MathJax
MathJax.Hub.Config({
    tex2jax: {
        skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
    }
});

MathJax.Hub.Queue(function() {
    $(MathJax.Hub.getAllJax()).map(function(index, elem) {
        return(elem.SourceElement());
    }).parent().addClass('has-jax');
});

MathJax.Hub.Configured();
</script>