<!DOCTYPE html>
<html>
<head>
  <title>Word segmentation</title>
  <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
  <style type="text/css">
    /* Slideshow styles */
    body {
      font-size: 20px;
    }
    h1, h2, h3 {
      font-weight: 400;
      margin-bottom: 0;
    }
    h1 { font-size: 3em; }
    h2 { font-size: 2em; }
    h3 { font-size: 1.6em; }
    a, a > code {
      text-decoration: none;
    }
    code {
      -moz-border-radius: 5px;
      -webkit-border-radius: 5px;
      background: #e7e8e2;
      border-radius: 5px;
      font-size: 16px;
    }
    .plaintext {
      background: #272822;
      color: #80ff80;
      text-shadow: 0 0 20px #333;
      padding: 2px 5px;
    }
    .ciphertext {
      background: #272822;
      color: #ff6666;
      text-shadow: 0 0 20px #333;
      padding: 2px 5px;
    }
    .float-right {
      float: right;
    }
  </style>
</head>

<body>
<textarea id="source">

# Word segmentation

`makingsenseofthis`

`making sense of this`

---

# The problem

Ciphertext is re-split into fixed-size groups to hide word boundaries.

* HELMU TSCOU SINSA REISU PPOSE KINDI NTHEI ROWNW AYBUT THERE ISLIT TLEWA RMTHI NTHEK INDNE SSIRE CEIVE

How can we rediscover the word boundaries?

* helmut s cousins are i suppose kind in their own way but there is little warmth in the kindness i receive

---

# Simple approach

1. Try all possible word boundaries
2. Return the one that looks most like English

What's the complexity of this process?

* (We'll fix that in a bit... but first, see the naive sketch below.)

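An n-letter string has 2^(n-1) possible segmentations (each of the n-1 gaps between letters is either a word boundary or not), so trying them all is exponential. A minimal sketch of the naive recursion, in terms of the `splits` and `Pwords` helpers developed on the following slides:

```python
def segment_naive(text):
    # Cut off every possible first word, segment the remainder
    # recursively, and keep whichever segmentation the bag-of-words
    # model scores highest.
    if not text:
        return []
    candidates = ([first] + segment_naive(rest)
                  for first, rest in splits(text))
    return max(candidates, key=Pwords)
```
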
---

# What do we mean by "looks like English"?

Naïve Bayes bag-of-words worked well for cipher breaking. Can we apply the same intuition here?

We can score a candidate segmentation by the probability of its bag of words (ignoring inter-word dependencies).

Finding the counts of words in text is harder than finding the counts of letters.

* There are many more word types than letter types, so we need much more data to cover enough of them.

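Under the independence assumption, that score is a product of per-word probabilities, which we will compute as a sum of logs to avoid underflow:

$$ P(\text{bag of words}) \approx \prod_{w \in \text{bag}} P(w) $$
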
---

# Data sparsity and smoothing

`counts_1w.txt` lists the 333,333 most common word types, with the number of tokens for each, collected by Google.

It still doesn't cover a lot of words we want, such as proper nouns.

We'll have to guess the probability of an unknown word.

There are lots of ways to do this properly (Laplace smoothing, Good-Turing smoothing)...

...but we'll ignore them all.

Assume unknown words have a count of 1.

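With that assumption, an unseen word gets the same probability as a word counted exactly once:

$$ P(w_{\text{unknown}}) \approx \frac{1}{N} $$

where N is the total number of tokens counted.
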
---

# Storing word probabilities

We want something like a `defaultdict`, but with our own default value.

Subclass `dict`!

The constructor (`__init__`) takes the word-count data, does all the adding up, and takes logs.

`__missing__` handles lookups for words we've never seen.

```python
class Pdist(dict):
    def __init__(self, data=[]):
        for key, count in data:
            ...
        self.total = ...
    def __missing__(self, key):
        return ...

Pw = Pdist(data...)

def Pwords(words):
    return ...
```

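For reference, one way the blanks might be filled in. This is a sketch, not the canonical solution: it assumes `counts_1w.txt` holds tab-separated word/count lines, stores log10 probabilities (matching the numbers on the next slide), and uses a hypothetical `datafile` reader:

```python
import math

class Pdist(dict):
    def __init__(self, data=[]):
        # Accumulate raw counts...
        for key, count in data:
            self[key] = self.get(key, 0) + int(count)
        self.total = sum(self.values())
        # ...then convert them to log10 probabilities.
        for key in self:
            self[key] = math.log10(self[key] / self.total)
    def __missing__(self, key):
        # Unknown words are treated as if they had a count of 1.
        return math.log10(1 / self.total)

def datafile(name, sep='\t'):
    # Hypothetical helper: yield (word, count) pairs from the data file.
    with open(name) as f:
        for line in f:
            yield line.split(sep)

Pw = Pdist(datafile('counts_1w.txt'))

def Pwords(words):
    # Log-probability of a bag of words, assuming independence.
    return sum(Pw[w] for w in words)
```
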
---

# Testing the bag of words model

```python
>>> 'hello' in Pw.keys()
True
>>> 'inigo' in Pw.keys()
True
>>> 'blj' in Pw.keys()
False
>>> Pw['hello']
-4.25147684171819
>>> Pw['my']
-2.7442478375632335
>>> Pw['name']
-3.102452772219651
>>> Pw['is']
-2.096840784739768
>>> Pw['blj']
-11.76946906492656
>>> Pwords(['hello'])
-4.25147684171819
>>> Pwords(['hello', 'my'])
-6.995724679281423
>>> Pwords(['hello', 'my', 'name'])
-10.098177451501074
>>> Pwords(['hello', 'my', 'name', 'is'])
-12.195018236240843
>>> Pwords(['hello', 'my', 'name', 'is', 'inigo'])
-18.927603013570945
>>> Pwords(['hello', 'my', 'name', 'is', 'blj'])
-23.964487301167402
```

---

# Splitting the input

```
To segment a string:
    find all possible splits into a first portion and remainder
    for each split:
        segment the remainder
    return the split with highest score
```

Indexing pulls out letters. `'sometext'[0]` = 's' ; `'sometext'[3]` = 'e' ; `'sometext'[-1]` = 't'

Slicing pulls out substrings. `'sometext'[1:4]` = 'ome' ; `'sometext'[:3]` = 'som' ; `'sometext'[5:]` = 'ext'

`range()` will sweep across the string.

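A one-line version of `splits` along these lines (a sketch; the test case below pins down the expected behaviour):

```python
def splits(text):
    # Every way to cut text into a non-empty first piece and the
    # remainder, including the final cut that leaves nothing over.
    return [(text[:i], text[i:]) for i in range(1, len(text) + 1)]
```
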
## Test case

```python
>>> splits('sometext')
[('s', 'ometext'), ('so', 'metext'), ('som', 'etext'), ('some', 'text'),
 ('somet', 'ext'), ('somete', 'xt'), ('sometex', 't'), ('sometext', '')]
```

The last one is important.

* What if this is the last word of the text?


</textarea>
<script src="http://gnab.github.io/remark/downloads/remark-0.6.0.min.js" type="text/javascript">
</script>

<script type="text/javascript"
  src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&delayStartupUntil=configured"></script>

<script type="text/javascript">
var slideshow = remark.create({ ratio: "16:9" });

// Setup MathJax
MathJax.Hub.Config({
  tex2jax: {
    skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
  }
});
// Mark the parents of rendered maths elements (plain DOM; no jQuery on this page).
MathJax.Hub.Queue(function() {
  MathJax.Hub.getAllJax().forEach(function(elem) {
    elem.SourceElement().parentNode.classList.add('has-jax');
  });
});
MathJax.Hub.Configured();
</script>
</body>
</html>