<!DOCTYPE html>
<html>
<head>
  <title>Word segmentation</title>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
  <style type="text/css">
    /* Slideshow styles */
    body {
      font-size: 20px;
    }
    h1, h2, h3 {
      font-weight: 400;
      margin-bottom: 0;
    }
    h1 { font-size: 3em; }
    h2 { font-size: 2em; }
    h3 { font-size: 1.6em; }
    a, a > code {
      text-decoration: none;
    }
    code {
      -moz-border-radius: 5px;
      -webkit-border-radius: 5px;
      background: #e7e8e2;
      border-radius: 5px;
      font-size: 16px;
    }
    .plaintext {
      background: #272822;
      color: #80ff80;
      text-shadow: 0 0 20px #333;
      padding: 2px 5px;
    }
    .ciphertext {
      background: #272822;
      color: #ff6666;
      text-shadow: 0 0 20px #333;
      padding: 2px 5px;
    }
    .float-right {
      float: right;
    }
  </style>
</head>

<body>
<textarea id="source">

# Word segmentation

`makingsenseofthis`

`making sense of this`

---

# The problem

Ciphertext is re-split into fixed-size groups to hide word boundaries.

* HELMU TSCOU SINSA REISU PPOSE KINDI NTHEI ROWNW AYBUT THERE ISLIT TLEWA RMTHI NTHEK INDNE SSIRE CEIVE

How can we rediscover the word boundaries?

* helmut s cousins are i suppose kind in their own way but there is little warmth in the kindness i receive

---

# Simple approach

1. Try all possible word boundaries
2. Return the one that looks most like English

What's the complexity of this process?

* (We'll fix that in a bit... but first, see the naive sketch below.)

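An n-letter string has 2^(n-1) possible segmentations (each of the n-1 gaps between letters is either a word boundary or not), so trying them all is exponential. A minimal sketch of the naive recursion, in terms of the `splits` and `Pwords` helpers developed on the following slides:

```python
def segment_naive(text):
    # Cut off every possible first word, segment the remainder
    # recursively, and keep whichever segmentation the bag-of-words
    # model scores highest.
    if not text:
        return []
    candidates = ([first] + segment_naive(rest)
                  for first, rest in splits(text))
    return max(candidates, key=Pwords)
```
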
---

# What do we mean by "looks like English"?

Naïve Bayes bag-of-words worked well for cipher breaking. Can we apply the same intuition here?

We can score a candidate segmentation by the probability of its bag of words (ignoring inter-word dependencies).

Finding the counts of words in text is harder than finding the counts of letters.

* There are many more word types than letter types, so we need much more data to cover enough of them.

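Under the independence assumption, that score is a product of per-word probabilities, which we will compute as a sum of logs to avoid underflow:

$$ P(\text{bag of words}) \approx \prod_{w \in \text{bag}} P(w) $$
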
---

# Data sparsity and smoothing

`counts_1w.txt` lists the 333,333 most common word types, with the number of tokens for each, collected by Google.

It still doesn't cover a lot of words we want, such as proper nouns.

We'll have to guess the probability of an unknown word.

There are lots of ways to do this properly (Laplace smoothing, Good-Turing smoothing)...

...but we'll ignore them all.

Assume unknown words have a count of 1.

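With that assumption, an unseen word gets the same probability as a word counted exactly once:

$$ P(w_{\text{unknown}}) \approx \frac{1}{N} $$

where N is the total number of tokens counted.
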
---

# Storing word probabilities

We want something like a `defaultdict`, but with our own default value.

Subclass `dict`!

The constructor (`__init__`) takes the word-count data, does all the adding up, and takes logs.

`__missing__` handles lookups for words we've never seen.

```python
class Pdist(dict):
    def __init__(self, data=[]):
        for key, count in data:
            ...
        self.total = ...
    def __missing__(self, key):
        return ...

Pw = Pdist(data...)

def Pwords(words):
    return ...
```

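For reference, one way the blanks might be filled in. This is a sketch, not the canonical solution: it assumes `counts_1w.txt` holds tab-separated word/count lines, stores log10 probabilities (matching the numbers on the next slide), and uses a hypothetical `datafile` reader:

```python
import math

class Pdist(dict):
    def __init__(self, data=[]):
        # Accumulate raw counts...
        for key, count in data:
            self[key] = self.get(key, 0) + int(count)
        self.total = sum(self.values())
        # ...then convert them to log10 probabilities.
        for key in self:
            self[key] = math.log10(self[key] / self.total)
    def __missing__(self, key):
        # Unknown words are treated as if they had a count of 1.
        return math.log10(1 / self.total)

def datafile(name, sep='\t'):
    # Hypothetical helper: yield (word, count) pairs from the data file.
    with open(name) as f:
        for line in f:
            yield line.split(sep)

Pw = Pdist(datafile('counts_1w.txt'))

def Pwords(words):
    # Log-probability of a bag of words, assuming independence.
    return sum(Pw[w] for w in words)
```
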
---

# Testing the bag of words model

```python
>>> 'hello' in Pw.keys()
True
>>> 'inigo' in Pw.keys()
True
>>> 'blj' in Pw.keys()
False
>>> Pw['hello']
-4.25147684171819
>>> Pw['my']
-2.7442478375632335
>>> Pw['name']
-3.102452772219651
>>> Pw['is']
-2.096840784739768
>>> Pw['blj']
-11.76946906492656
>>> Pwords(['hello'])
-4.25147684171819
>>> Pwords(['hello', 'my'])
-6.995724679281423
>>> Pwords(['hello', 'my', 'name'])
-10.098177451501074
>>> Pwords(['hello', 'my', 'name', 'is'])
-12.195018236240843
>>> Pwords(['hello', 'my', 'name', 'is', 'inigo'])
-18.927603013570945
>>> Pwords(['hello', 'my', 'name', 'is', 'blj'])
-23.964487301167402
```

---

# Splitting the input

```
To segment a string:
    find all possible splits into a first portion and remainder
    for each split:
        segment the remainder
    return the split with highest score
```

Indexing pulls out letters. `'sometext'[0]` = 's' ; `'sometext'[3]` = 'e' ; `'sometext'[-1]` = 't'

Slicing pulls out substrings. `'sometext'[1:4]` = 'ome' ; `'sometext'[:3]` = 'som' ; `'sometext'[5:]` = 'ext'

`range()` will sweep across the string.

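A one-line version of `splits` along these lines (a sketch; the test case below pins down the expected behaviour):

```python
def splits(text):
    # Every way to cut text into a non-empty first piece and the
    # remainder, including the final cut that leaves nothing over.
    return [(text[:i], text[i:]) for i in range(1, len(text) + 1)]
```
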
## Test case

```python
>>> splits('sometext')
[('s', 'ometext'), ('so', 'metext'), ('som', 'etext'), ('some', 'text'),
 ('somet', 'ext'), ('somete', 'xt'), ('sometex', 't'), ('sometext', '')]
```

The last one is important.

* What if this is the last word of the text?


</textarea>
<script src="http://gnab.github.io/remark/downloads/remark-0.6.0.min.js" type="text/javascript">
</script>

<script type="text/javascript"
  src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&delayStartupUntil=configured"></script>

<script type="text/javascript">
var slideshow = remark.create({ ratio: "16:9" });

// Setup MathJax
MathJax.Hub.Config({
  tex2jax: {
    skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
  }
});
// Mark the parents of rendered maths elements (plain DOM; no jQuery on this page).
MathJax.Hub.Queue(function() {
  MathJax.Hub.getAllJax().forEach(function(elem) {
    elem.SourceElement().parentNode.classList.add('has-jax');
  });
});
MathJax.Hub.Configured();
</script>
</body>
</html>