<title>Breaking Caesar ciphers</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<style type="text/css">
h1 { font-size: 3em; }
h2 { font-size: 2em; }
h3 { font-size: 1.6em; }
text-decoration: none;
-moz-border-radius: 5px;
-webkit-border-radius: 5px;
text-shadow: 0 0 20px #333;
<textarea id="source">
# Breaking Caesar ciphers
![center-aligned Caesar wheel](caesarwheel1.gif)
Slow but clever vs Dumb but fast
Ciphertext | Plaintext
---|---
![left-aligned Ciphertext frequencies](c1a_frequency_histogram.png) | ![left-aligned English frequencies](english_frequency_histogram.png)
* How many keys to try?
```
for each key:
    decipher with this key
    how close is it to English?
    remember the best key
```
What steps do we know how to do?
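That loop is only a few lines of Python. A minimal sketch, assuming a `caesar_decipher(text, key)` function and an `english_score(text)` fitness function already exist (both names are placeholders here):

```python
# Sketch of the brute-force attack. caesar_decipher and english_score
# are placeholder names for functions assumed to exist already.
def caesar_break(ciphertext):
    best_key, best_score = 0, float('-inf')
    for key in range(26):                     # only 26 keys to try
        plaintext = caesar_decipher(ciphertext, key)
        score = english_score(plaintext)      # higher = more English-like
        if score > best_score:
            best_key, best_score = key, score
    return best_key, best_score
```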
# How close is it to English?

What does English look like?

* We need a model of English.

How do we define "closeness"?
# What does English look like?

## Abstraction: frequency of letter counts
One way of thinking about this is as a 26-dimensional vector.

Create a vector of our text, and one of idealised English.

The distance between the vectors is how far from English the text is.
# Frequencies of English

But before that, how do we count the letters?

* Read a file into a string, then count the letters (a sketch follows)
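One plausible way to do both steps with the standard library, using `collections.Counter` (the file name is illustrative):

```python
import collections

# Read a file into a single string (the file name is illustrative)
with open('war-and-peace.txt') as f:
    text = f.read()

# Count every character, junk and all
counts = collections.Counter(text)
print(counts.most_common(5))
```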
Counting letters in _War and Peace_ gives all manner of junk.
* Convert the text into canonical form (lower case, accents removed, non-letters stripped) before counting
```python
[l.lower() for l in text if ...]
```
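The `...` is left for you to fill in; one possible completion, assuming accents have already been dealt with, is:

```python
# One possible completion: keep letters only, lower-cased.
def sanitise(text):
    return ''.join([l.lower() for l in text if l.isalpha()])

print(sanitise('The Quick, Brown Fox!'))   # 'thequickbrownfox'
```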
```python
>>> caesar_encipher_letter('é', 1)
```

What does it produce?

What should it produce?
## Unicode, combining codepoints, and normal forms

Text encodings will bite you when you least expect it.

* urlencoding is the other pain point.
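Part of the problem is that the same accented letter can be one codepoint or two. A quick demonstration:

```python
import unicodedata

composed = 'é'                                    # one codepoint, U+00E9
decomposed = unicodedata.normalize('NFD', composed)

print(len(composed), len(decomposed))   # 1 2
print(composed == decomposed)           # False, though they render identically
print([unicodedata.name(c) for c in decomposed])
# ['LATIN SMALL LETTER E', 'COMBINING ACUTE ACCENT']
```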
# Five minutes on StackOverflow later...
171 """Remove all accents from letters.
172 It does this by converting the unicode string to decomposed compatibility
173 form, dropping all the combining accents, then re-encoding the bytes.
175 >>> unaccent('hello')
177 >>> unaccent('HELLO')
179 >>> unaccent('héllo')
181 >>> unaccent('héllö')
183 >>> unaccent('HÉLLÖ')
186 return unicodedata.normalize('NFKD', text).\
187 encode('ascii', 'ignore').\
.float-right[![right-aligned Vector subtraction](vector-subtraction.svg)]

Several different distance measures (__metrics__, also called __norms__):
* L<sub>2</sub> norm (Euclidean distance):
`\(\|\mathbf{a} - \mathbf{b}\| = \sqrt{\sum_i (\mathbf{a}_i - \mathbf{b}_i)^2} \)`

* L<sub>1</sub> norm (Manhattan distance, taxicab distance):
`\(\|\mathbf{a} - \mathbf{b}\| = \sum_i |\mathbf{a}_i - \mathbf{b}_i| \)`

* L<sub>3</sub> norm:
`\(\|\mathbf{a} - \mathbf{b}\| = \sqrt[3]{\sum_i |\mathbf{a}_i - \mathbf{b}_i|^3} \)`

The higher the power used, the more weight is given to the largest differences in components.
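All three are one-liners in Python. A sketch, using toy three-element vectors in place of real 26-element frequency vectors:

```python
# L_p distance between two equal-length vectors
def lp_distance(a, b, p):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a = [0.08, 0.01, 0.03]   # toy frequency vectors, not real English data
b = [0.07, 0.02, 0.04]

print(lp_distance(a, b, 1))   # L1: sum of absolute differences
print(lp_distance(a, b, 2))   # L2: Euclidean distance
print(lp_distance(a, b, 3))   # L3: weights the largest differences more
```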
(Extra for experts:

* L<sub>0</sub> norm (Hamming distance):
`$$\|\mathbf{a} - \mathbf{b}\| = \sum_i \left\{
\begin{matrix} 1 &\mbox{if}\ \mathbf{a}_i \neq \mathbf{b}_i , \\
0 &\mbox{if}\ \mathbf{a}_i = \mathbf{b}_i \end{matrix} \right. $$`

* L<sub>∞</sub> norm:
`\(\|\mathbf{a} - \mathbf{b}\| = \max_i{|\mathbf{a}_i - \mathbf{b}_i|} \)`

neither of which will be that useful here.)
# Normalisation of vectors

Frequency distributions drawn from different sources will have different lengths. For a fair comparison we need to scale them.
* Euclidean scaling (vector with unit length): `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \sqrt{\mathbf{x}_1^2 + \mathbf{x}_2^2 + \mathbf{x}_3^2 + \dots } }$$`

* Normalisation (components of vector sum to 1): `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|_1} = \frac{\mathbf{x}}{ \mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3 + \dots }$$`
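Both scalings, sketched over a plain list of counts:

```python
import math

counts = [5, 3, 2]   # toy letter counts

# Euclidean scaling: the resulting vector has length 1
length = math.sqrt(sum(x ** 2 for x in counts))
unit = [x / length for x in counts]

# Sum-to-1 normalisation: components become proportions
total = sum(counts)
proportions = [x / total for x in counts]

print(unit)          # squares sum to 1
print(proportions)   # [0.5, 0.3, 0.2]
```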
# Angle, not distance

Rather than looking at the distance between the vectors, look at the angle between them.

.float-right[![right-aligned Vector dot product](vector-dot-product.svg)]

Vector dot product shows how much of one vector lies in the direction of another:
`\( \mathbf{A} \bullet \mathbf{B} = \| \mathbf{A} \| \cdot \| \mathbf{B} \| \cos{\theta} \)`
But
`\( \mathbf{A} \bullet \mathbf{B} = \sum_i \mathbf{A}_i \cdot \mathbf{B}_i \)`
and `\( \| \mathbf{A} \| = \sqrt{\sum_i \mathbf{A}_i^2} \)`
A bit of rearranging gives the cosine similarity:
`$$ \cos{\theta} = \frac{ \mathbf{A} \bullet \mathbf{B} }{ \| \mathbf{A} \| \cdot \| \mathbf{B} \| } =
\frac{\sum_i \mathbf{A}_i \cdot \mathbf{B}_i}{\sqrt{\sum_i \mathbf{A}_i^2} \times \sqrt{\sum_i \mathbf{B}_i^2}} $$`
This is independent of vector lengths!
Cosine similarity is 1 if the vectors are parallel, 0 if they are perpendicular, and −1 if they are antiparallel.
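A sketch of that formula in plain Python:

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x ** 2 for x in a))
    norm_b = math.sqrt(sum(x ** 2 for x in b))
    return dot / (norm_a * norm_b)

print(cosine_similarity([1, 0], [2, 0]))    # 1.0: parallel
print(cosine_similarity([1, 0], [0, 3]))    # 0.0: perpendicular
print(cosine_similarity([1, 0], [-2, 0]))   # -1.0: antiparallel
```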
# An infinite number of monkeys

What is the probability that this string of letters is a sample of English?

Given 'th', 'e' is about six times more likely than 'a' or 'i'.
## Naive Bayes, or the bag of letters

Ignore letter order, just treat each letter individually.

The probability of a text is `\( \prod_i p_i \)`

(Implementation issue: this can often underflow, so get in the habit of rephrasing it as `\( \sum_i \log p_i \)`.)
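A sketch of the log-probability version, assuming a dict of letter probabilities (the values below are illustrative, not a real frequency table):

```python
import math

# Illustrative probabilities for a few letters; a real table would cover
# all 26 letters, estimated from a large English corpus.
letter_probs = {'h': 0.061, 'e': 0.127, 'l': 0.040, 'o': 0.075}

def log_probability(text):
    # Sum of log p(letter): the log of the product,
    # but it does not underflow on long texts.
    return sum(math.log(letter_probs[l]) for l in text)

print(log_probability('hello'))   # less negative = more English-like
```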
   | Euclidean | Normalised
---|-----------|------------
L1 |           |
L2 |           |
L3 |           |

And the probability measure!

* Nine different ways of measuring fitness.
## Computing is an empirical science
</textarea>
<script src="http://gnab.github.io/remark/downloads/remark-0.6.0.min.js" type="text/javascript">
</script>
<script type="text/javascript"
  src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&delayStartupUntil=configured"></script>
<script type="text/javascript">
var slideshow = remark.create({ ratio: "16:9" });
MathJax.Hub.Config({
    tex2jax: {
        skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
    }
});
MathJax.Hub.Queue(function() {
    $(MathJax.Hub.getAllJax()).map(function(index, elem) {
        return(elem.SourceElement());
    }).parent().addClass('has-jax');
});
MathJax.Hub.Configured();
</script>