<!DOCTYPE html>
<html>
<head>
  <title>Word segmentation</title>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
  <style type="text/css">
    /* Slideshow styles */
    body {
      font-size: 20px;
    }
    h1, h2, h3 {
      font-weight: 400;
      margin-bottom: 0;
    }
    h1 { font-size: 3em; }
    h2 { font-size: 2em; }
    h3 { font-size: 1.6em; }
    a, a > code {
      text-decoration: none;
    }
    code {
      -moz-border-radius: 5px;
      -webkit-border-radius: 5px;
      background: #e7e8e2;
      border-radius: 5px;
      font-size: 16px;
    }
    .plaintext {
      background: #272822;
      color: #80ff80;
      text-shadow: 0 0 20px #333;
      padding: 2px 5px;
    }
    .ciphertext {
      background: #272822;
      color: #ff6666;
      text-shadow: 0 0 20px #333;
      padding: 2px 5px;
    }
    .float-right {
      float: right;
    }
  </style>
</head>

<body>
<textarea id="source">

# Word segmentation

`makingsenseofthis`

`making sense of this`

---

# The problem

Ciphertext is re-split into groups to hide word boundaries.

* HELMU TSCOU SINSA REISU PPOSE KINDI NTHEI ROWNW AYBUT THERE ISLIT TLEWA RMTHI NTHEK INDNE SSIRE CEIVE

How can we rediscover the word boundaries?

* helmut s cousins are i suppose kind in their own way but there is little warmth in the kindness i receive

---

# Simple approach

1. Try all possible word boundaries
2. Return the one that looks most like English

What's the complexity of this process?

* (We'll fix that in a bit...)
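
To see the blow-up concretely, here's a sketch (a hypothetical helper, not part of the exercise code): each of the `n - 1` gaps between letters is either a word boundary or not, so a text of `n` letters has `2**(n - 1)` candidate segmentations.

```python
from itertools import combinations

def all_segmentations(text):
    """Yield every segmentation of text: one per subset of the internal gaps."""
    n = len(text)
    for r in range(n):  # r = number of word boundaries chosen
        for gaps in combinations(range(1, n), r):
            bounds = (0,) + gaps + (n,)
            yield [text[i:j] for i, j in zip(bounds, bounds[1:])]
```

`sum(1 for _ in all_segmentations('sometext'))` is `2**7` = 128; the 85-letter ciphertext above has `2**84` candidates.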

---

# What do we mean by "looks like English"?

Naïve Bayes bag-of-words worked well for cipher breaking. Can we apply the same intuition here?

Score a candidate segmentation by the probability of its bag of words (ignoring inter-word dependencies).

Finding the counts of words in text is harder than finding the counts of letters.

* Many more word types than letter types, so we need more data to cover enough words.

---
# Data sparsity and smoothing

`count_1w.txt` lists the 333,333 most common word types, with the number of tokens for each, collected by Google.

Doesn't cover a lot of words we want, such as proper nouns.

We'll have to guess the probability of an unknown word.

Lots of ways to do this properly (Laplace smoothing, Good-Turing smoothing)...

...but we'll ignore them all.

Assume unknown words have a count of 1.

---

# Storing word probabilities

We want something like a `defaultdict` but with our own default value

Subclass a dict!

Constructor (`__init__`) takes a data file, does all the adding up and taking logs

`__missing__` handles the case when the key is missing


```python
class Pdist(dict):
    def __init__(self, data=[]):
        for key, count in data:
            ...
        self.total = ...
    def __missing__(self, key):
        return ...

Pw = Pdist(data...)

def Pwords(words):
    return ...
```
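
One way the blanks might be filled in, as a sketch (the slides leave this as an exercise; the `datafile` reader and the word–count file format here are assumptions):

```python
import math

def datafile(name, sep='\t'):
    """Yield (word, count) pairs from lines of 'word<TAB>count' (assumed format)."""
    with open(name) as f:
        for line in f:
            word, count = line.split(sep)
            yield word, int(count)

class Pdist(dict):
    """Map each word to the log10 of its probability in the training data."""
    def __init__(self, data=[]):
        counts = dict(data)
        self.total = sum(counts.values())
        for key, count in counts.items():
            self[key] = math.log10(count / self.total)
    def __missing__(self, key):
        # crude smoothing: pretend every unknown word was seen once
        return math.log10(1 / self.total)

Pw = Pdist(datafile('count_1w.txt'))

def Pwords(words):
    """Log-probability of a bag of words, assuming the words are independent."""
    return sum(Pw[word] for word in words)
```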

---

# Testing the bag of words model


```python
>>> 'hello' in Pw.keys()
True
>>> 'inigo' in Pw.keys()
True
>>> 'blj' in Pw.keys()
False
>>> Pw['hello']
-4.25147684171819
>>> Pw['my']
-2.7442478375632335
>>> Pw['name']
-3.102452772219651
>>> Pw['is']
-2.096840784739768
>>> Pw['blj']
-11.76946906492656
>>> Pwords(['hello'])
-4.25147684171819
>>> Pwords(['hello', 'my'])
-6.995724679281423
>>> Pwords(['hello', 'my', 'name'])
-10.098177451501074
>>> Pwords(['hello', 'my', 'name', 'is'])
-12.195018236240843
>>> Pwords(['hello', 'my', 'name', 'is', 'inigo'])
-18.927603013570945
>>> Pwords(['hello', 'my', 'name', 'is', 'blj'])
-23.964487301167402
```

---

# Splitting the input

```
To segment a string:
    find all possible splits into a first portion and remainder
    for each split:
        segment the remainder
    return the split with highest score
```

Indexing pulls out letters. `'sometext'[0]` = 's' ; `'sometext'[3]` = 'e' ; `'sometext'[-1]` = 't'

Slices pull out substrings. `'sometext'[1:4]` = 'ome' ; `'sometext'[:3]` = 'som' ; `'sometext'[5:]` = 'ext'

`range()` will sweep across the string
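
Those three pieces are enough for a one-line `splits` (a sketch that matches the test case below):

```python
def splits(text):
    """All (first, rest) pairs, cutting after each position 1..len(text)."""
    # the final pair is (text, ''), so the whole text can be the last word
    return [(text[:i], text[i:]) for i in range(1, len(text) + 1)]
```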

## Test case

```python
>>> splits('sometext')
[('s', 'ometext'), ('so', 'metext'), ('som', 'etext'), ('some', 'text'),
 ('somet', 'ext'), ('somete', 'xt'), ('sometex', 't'), ('sometext', '')]
```

The last one is important

* What if this is the last word of the text?

---

# Efficiency and memoisation

* helmut s cousins are i suppose kind in their own way but there is little warmth in the kindness i receive

At any stage, can consider the sentence as prefix, word, suffix

* `littlewarmthin | the | kindness i receive`
* `littlewarmthi | nthe | kindness i receive`
* `littlewarmth | inthe | kindness i receive`
* `littlewarmt | hinthe | kindness i receive`

P(sentence) = P(prefix) × P(word) × P(suffix)

* We're assuming independence of sections.
* For a given word/suffix split, there is only one best segmentation of the suffix.
* Best segmentation of sentence (with split here) must have the best segmentation of the suffix.
* Once we've found it, no need to recalculate it.

## What's the complexity now?

---

# Memoisation

* Maintain a table of previously-found results
* Every time we're asked to calculate a segmentation, look in the table.
* If it's in the table, just return that.
* If not, calculate it and store the result in the table.

Wrap the segment function in something that maintains that table.

In the standard library: `lru_cache` as a function decorator.

```python
from functools import lru_cache

@lru_cache()
def segment(text):
    ...
```
* (Plenty of tutorials online on function decorators.)
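
Putting the pieces together, a minimal sketch of the memoised function (assuming the `splits` and `Pwords` defined earlier; `maxsize=None` keeps every previously-found result):

```python
from functools import lru_cache

@lru_cache(maxsize=None)
def segment(text):
    """Return the highest-scoring segmentation of text as a list of words."""
    if not text:
        return []
    candidates = ([first] + segment(rest) for first, rest in splits(text))
    return max(candidates, key=Pwords)
```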

---

# Implementation detail

You'll hit Python's recursion depth limit.

Easy to raise:

```python
import sys
sys.setrecursionlimit(1000000)
```

---

# Testing segmentation

```python
>>> segment('hello')
['hello']
>>> segment('hellomy')
['hello', 'my']
>>> segment('hellomyname')
['hello', 'my', 'name']
>>> segment('hellomynameis')
['hellomynameis']
```

Oh.

Why?

---

# A broken language model

```python
>>> Pwords(['hello'])
-4.25147684171819
>>> Pwords(['hello', 'my'])
-6.995724679281423
>>> Pwords(['hello', 'my', 'name'])
-10.098177451501074
>>> Pwords(['hello', 'my', 'name', 'is'])
-12.195018236240843

>>> Pw['is']
-2.096840784739768
>>> Pw['blj']
-11.76946906492656
```

Need a better estimate for the probability of unknown words.

It needs to take account of the length of the word.

* Longer words are less probable.

## To IPython for investigation!
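
One such investigation, as a sketch (assuming the `Pw` built above): how quickly does the best log-probability at each word length fall?

```python
from collections import defaultdict

# best (highest) log-probability among known words of each length
best_at_length = defaultdict(lambda: float('-inf'))
for word in Pw:
    best_at_length[len(word)] = max(best_at_length[len(word)], Pw[word])
for length in sorted(best_at_length):
    print(length, best_at_length[length])
```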

---

# Making Pdist more flexible

Want to give a sensible default for unknown elements

* But this will vary by referent
* Different languages, *n*-grams, etc.

Make it a parameter!

---

# Hint

```python
from math import log10

class Pdist(dict):
    def __init__(self, data=[], estimate_of_missing=None):
        self.estimate_of_missing = estimate_of_missing
        for key, count in data:
            ...
        self.total = ...
    def __missing__(self, key):
        if self.estimate_of_missing:
            return self.estimate_of_missing(key, self.total)
        else:
            return ...

def log_probability_of_unknown_word(key, N):
    return -log10(N * 10**((len(key) - 2) * 1.4))

Pw = Pdist(datafile('count_1w.txt'), log_probability_of_unknown_word)
```
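
A quick sanity check of the estimator (a sketch; the exact numbers depend on the counts file):

```python
# longer unknown words should get steeply lower log-probabilities
for word in ['blj', 'qwzxv', 'qwzxvqwzxv']:
    print(word, log_probability_of_unknown_word(word, Pw.total))
```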

---

# Testing segmentation again

```python
>>> segment('hello')
['hello']
>>> segment('hellomy')
['hello', 'my']
>>> segment('hellomyname')
['hello', 'my', 'name']
>>> segment('hellomynameis')
['hello', 'my', 'name', 'is']
>>> ' '.join(segment(sanitise('HELMU TSCOU SINSA REISU PPOSE KINDI NTHEI ROWNW '
...                           'AYBUT THERE ISLIT TLEWA RMTHI NTHEK INDNE SSIRE CEIVE ')))
'helmut s cousins are i suppose kind in their own way but there is
little warmth in the kindness i receive'
```

Try it out on the full decrypt of `2013/2b.ciphertext` (it's a Caesar cipher)


</textarea>
<script src="http://gnab.github.io/remark/downloads/remark-0.6.0.min.js" type="text/javascript">
</script>

<script type="text/javascript"
  src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&delayStartupUntil=configured"></script>

<script type="text/javascript">
  var slideshow = remark.create({ ratio: "16:9" });

  // Setup MathJax
  MathJax.Hub.Config({
    tex2jax: {
      skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
    }
  });
  MathJax.Hub.Queue(function() {
    $(MathJax.Hub.getAllJax()).map(function(index, elem) {
      return(elem.SourceElement());
    }).parent().addClass('has-jax');
  });
  MathJax.Hub.Configured();
</script>
</body>
</html>