<title>Breaking Caesar ciphers</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<style type="text/css">
h1 { font-size: 3em; }
h2 { font-size: 2em; }
h3 { font-size: 1.6em; }
text-decoration: none;
-moz-border-radius: 5px;
-webkit-border-radius: 5px;
text-shadow: 0 0 20px #333;
<textarea id="source">
# Breaking Caesar ciphers
![center-aligned Caesar wheel](caesarwheel1.gif)

Slow but clever vs Dumb but fast
Ciphertext | Plaintext
---|---
![left-aligned Ciphertext frequencies](c1a_frequency_histogram.png) | ![left-aligned English frequencies](english_frequency_histogram.png)
* How many keys to try?

    for each key:
        decipher with this key
        how close is it to English?

What steps do we know how to do?
# How close is it to English?

What does English look like?

* We need a model of English.

How do we define "closeness"?
# What does English look like?

## Abstraction: frequency of letter counts

One way of thinking about this is as a 26-dimensional vector.

Create a vector of our text, and one of idealised English.

The distance between the vectors is how far from English the text is.
# Frequencies of English
But before that, how do we count the letters?
* Read a file into a string

Create the `language_models.py` file for this.

Counting letters in _War and Peace_ gives all manner of junk.
* Convert the text to canonical form (lower case, accents removed, non-letters stripped) before counting, along the lines of the snippet below.

    [l.lower() for l in text if ...]
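One plausible way of filling that in (a sketch only: the helper names `letters` and `sanitise` are illustrative, and accent handling is dealt with separately below):

```python
import string

def letters(text):
    """Keep only the ASCII letters from text."""
    return ''.join([l for l in text if l in string.ascii_letters])

def sanitise(text):
    """Canonical form of text: lower case, non-letters stripped.
    Accented characters are simply dropped here; see unaccent() below
    for a better treatment."""
    return ''.join([l.lower() for l in text if l in string.ascii_letters])
```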
    >>> 'é' in string.ascii_letters
    False
    >>> 'e' in string.ascii_letters
    True
## Unicode, combining codepoints, and normal forms

Text encodings will bite you when you least expect it.

- **é** : LATIN SMALL LETTER E WITH ACUTE (U+00E9)
- **e** + **́** : LATIN SMALL LETTER E (U+0065) + COMBINING ACUTE ACCENT (U+0301)
* URL encoding is the other pain point.
# Five minutes on StackOverflow later...

    import unicodedata

    def unaccent(text):
        """Remove all accents from letters.
        It does this by converting the unicode string to decomposed compatibility
        form, dropping all the combining accents, then re-encoding the bytes.

        >>> unaccent('hello')
        'hello'
        >>> unaccent('HELLO')
        'HELLO'
        >>> unaccent('héllo')
        'hello'
        >>> unaccent('héllö')
        'hello'
        >>> unaccent('HÉLLÖ')
        'HELLO'
        """
        return unicodedata.normalize('NFKD', text).\
            encode('ascii', 'ignore').\
            decode('utf-8')
# Find the frequencies of letters in English

1. Read from `shakespeare.txt`, `sherlock-holmes.txt`, and `war-and-peace.txt`.
2. Find the frequencies
3. Sort by count (`sorted(, key=)` ; `.items()`, `.keys()`, `.values()`, `.get()`)
4. Write counts to `count_1l.txt`
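One possible sketch of this exercise, assuming the `sanitise` and `unaccent` helpers sketched above, that the corpus files are in the current directory, and a simple tab-separated format for `count_1l.txt` (none of these choices are fixed by the slides):

```python
import collections

corpora = ['shakespeare.txt', 'sherlock-holmes.txt', 'war-and-peace.txt']

# Count letters across all three corpora, in canonical form.
counts = collections.Counter()
for filename in corpora:
    with open(filename) as corpus:
        counts.update(sanitise(unaccent(corpus.read())))

# Write the counts, most frequent letter first.
with open('count_1l.txt', 'w') as outfile:
    for letter, count in sorted(counts.items(), key=lambda lc: lc[1], reverse=True):
        outfile.write('{}\t{}\n'.format(letter, count))
```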
.float-right[![right-aligned Vector subtraction](vector-subtraction.svg)]
Several different distance measures (__metrics__, also called __norms__):

* L<sub>2</sub> norm (Euclidean distance):
`\(\|\mathbf{a} - \mathbf{b}\| = \sqrt{\sum_i (\mathbf{a}_i - \mathbf{b}_i)^2} \)`

* L<sub>1</sub> norm (Manhattan distance, taxicab distance):
`\(\|\mathbf{a} - \mathbf{b}\| = \sum_i |\mathbf{a}_i - \mathbf{b}_i| \)`

* L<sub>3</sub> norm:
`\(\|\mathbf{a} - \mathbf{b}\| = \sqrt[3]{\sum_i |\mathbf{a}_i - \mathbf{b}_i|^3} \)`

The higher the power used, the more weight is given to the largest differences in components.
* L<sub>0</sub> norm (Hamming distance):
`$$\|\mathbf{a} - \mathbf{b}\| = \sum_i \left\{
\begin{matrix} 1 &\mbox{if}\ \mathbf{a}_i \neq \mathbf{b}_i , \\
0 &\mbox{if}\ \mathbf{a}_i = \mathbf{b}_i \end{matrix} \right. $$`

* L<sub>∞</sub> norm:
`\(\|\mathbf{a} - \mathbf{b}\| = \max_i{|\mathbf{a}_i - \mathbf{b}_i|} \)`

(Neither of these will be that useful here, but they keep cropping up.)
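As a concrete illustration, here is a minimal sketch of the L<sub>1</sub> and L<sub>2</sub> distances between two letter-frequency dicts (the function names are illustrative; letters missing from one dict count as zero):

```python
def l1_distance(freqs1, freqs2):
    """Manhattan (L1) distance between two frequency dicts."""
    return sum(abs(freqs1.get(l, 0) - freqs2.get(l, 0))
               for l in set(freqs1) | set(freqs2))

def l2_distance(freqs1, freqs2):
    """Euclidean (L2) distance between two frequency dicts."""
    return sum((freqs1.get(l, 0) - freqs2.get(l, 0)) ** 2
               for l in set(freqs1) | set(freqs2)) ** 0.5
```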
# Normalisation of vectors

Frequency distributions drawn from different sources will have different lengths. For a fair comparison we need to scale them.
* Euclidean scaling (vector with unit length): `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \sqrt{\mathbf{x}_1^2 + \mathbf{x}_2^2 + \mathbf{x}_3^2 + \dots } }$$`

* Normalisation (components of vector sum to 1): `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3 + \dots }$$`
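Both scalings are easy to write over a frequency dict; a sketch with illustrative names (Python 3 division assumed):

```python
def normalise(frequencies):
    """Scale a frequency dict so its components sum to 1."""
    total = sum(frequencies.values())
    return {letter: count / total for letter, count in frequencies.items()}

def euclidean_scale(frequencies):
    """Scale a frequency dict so it has unit Euclidean length."""
    length = sum(count ** 2 for count in frequencies.values()) ** 0.5
    return {letter: count / length for letter, count in frequencies.items()}
```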
# Angle, not distance

Rather than looking at the distance between the vectors, look at the angle between them.

.float-right[![right-aligned Vector dot product](vector-dot-product.svg)]

The vector dot product shows how much of one vector lies in the direction of another:
`\( \mathbf{A} \bullet \mathbf{B} = \| \mathbf{A} \| \cdot \| \mathbf{B} \| \cos{\theta} \)`

But also
`\( \mathbf{A} \bullet \mathbf{B} = \sum_i \mathbf{A}_i \cdot \mathbf{B}_i \)`
and `\( \| \mathbf{A} \| = \sqrt{\sum_i \mathbf{A}_i^2} \)`

A bit of rearranging gives the cosine similarity:
`$$ \cos{\theta} = \frac{ \mathbf{A} \bullet \mathbf{B} }{ \| \mathbf{A} \| \cdot \| \mathbf{B} \| } =
\frac{\sum_i \mathbf{A}_i \cdot \mathbf{B}_i}{\sqrt{\sum_i \mathbf{A}_i^2} \times \sqrt{\sum_i \mathbf{B}_i^2}} $$`
This is independent of vector lengths!

Cosine similarity is 1 if the vectors are parallel, 0 if they are perpendicular, and -1 if they are antiparallel.
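A sketch of cosine similarity over two letter-frequency dicts (names illustrative; letters missing from a dict are taken as zero):

```python
import math

def cosine_similarity(freqs1, freqs2):
    """Cosine of the angle between two letter-frequency vectors."""
    letters = set(freqs1) | set(freqs2)
    dot_product = sum(freqs1.get(l, 0) * freqs2.get(l, 0) for l in letters)
    length1 = math.sqrt(sum(c ** 2 for c in freqs1.values()))
    length2 = math.sqrt(sum(c ** 2 for c in freqs2.values()))
    return dot_product / (length1 * length2)
```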
# An infinite number of monkeys

What is the probability that this string of letters is a sample of English?

Given 'th', 'e' is about six times more likely than 'a' or 'i'.

## Naive Bayes, or the bag of letters

Ignore letter order, just treat each letter individually.

Probability of a text is `\( \prod_i p_i \)`
(Implementation issue: this can often underflow, so get in the habit of rephrasing it as `\( \sum_i \log p_i \)`)
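A tiny, illustrative demonstration of the underflow problem (the probability and text length here are made up):

```python
import math

letter_probability = 0.07   # a typical single-letter probability
text_length = 1000

# The direct product underflows to 0.0 for any realistically long text...
print(letter_probability ** text_length)            # 0.0

# ...but the equivalent sum of logs is a perfectly ordinary number.
print(text_length * math.log(letter_probability))   # about -2659.3
```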
 | Euclidean | Normalised
---|-----------|------------
And the probability measure!

* Nine different ways of measuring fitness.

## Computing is an empirical science

Let's do some experiments to find the best solution!
## Step 1: get **some** codebreaking working

Let's start with the letter probability norm, because it's easy.

## Step 2: build some other scoring functions

We also need a way of passing the different functions to the key-finding function (see the sketch after these steps).

## Step 3: find the best scoring function

Try them all on random ciphertexts, and see which one works best.
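Step 2 relies on the fact that Python functions are values that can be passed around. A minimal, illustrative sketch (the names `best_candidate` and `crude_fitness` are made up for this example):

```python
def best_candidate(candidates, fitness):
    """Return whichever candidate text the given fitness function scores highest."""
    return max(candidates, key=fitness)

# Any function from text to number can be plugged in; here is a deliberately
# crude scorer, just to show the mechanism.
def crude_fitness(text):
    return text.count('e') + text.count(' ')

print(best_candidate(['xzzqj kkw', 'hello there'], crude_fitness))  # hello there
```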
# Reading letter probabilities

1. Load the file `count_1l.txt` into a dict, with letters as keys.

2. Normalise the counts (components of vector sum to 1): `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3 + \dots }$$`
    * Remember the doctest!

3. Create a dict `Pl` that gives the log probability of a letter

4. Create a function `Pletters` that gives the probability of an iterable of letters
    * What preconditions should this function have?
    * Remember the doctest!
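A possible sketch of these steps. It assumes the tab-separated `letter<TAB>count` format used in the earlier counting sketch; adjust the parsing to however you actually wrote `count_1l.txt`:

```python
import math

# 1. Load the counts into a dict keyed by letter.
english_counts = {}
with open('count_1l.txt') as counts_file:
    for line in counts_file:
        letter, count = line.split()
        english_counts[letter] = int(count)

# 2. Normalise the counts so they sum to 1.
total = sum(english_counts.values())
letter_probabilities = {letter: count / total
                        for letter, count in english_counts.items()}

# 3. Log probability of each letter.
Pl = {letter: math.log(p) for letter, p in letter_probabilities.items()}

# 4. (Log) probability of an iterable of letters; it presumes the text has
#    already been sanitised so that every character is a known letter.
def Pletters(letters):
    return sum(Pl[letter] for letter in letters)
```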
# Breaking Caesar ciphers (at last!)

## Remember the basic idea

    for each key:
        decipher with this key
        how close is it to English?
    remember the best key

Try it on the text in `2013/1a.ciphertext`. Does it work?
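That idea turned into code might look something like the sketch below. It assumes a `caesar_decipher(message, key)` function from the earlier Caesar-cipher work, plus the `sanitise` and `Pletters` helpers sketched above; your own names and signatures may differ:

```python
def caesar_break(message, fitness=Pletters):
    """Try every key, score each trial decipherment, and return the best key
    along with its fitness score."""
    sanitised = sanitise(message)
    best_key, best_fit = 0, float('-inf')
    for key in range(26):
        plaintext = caesar_decipher(sanitised, key)
        fit = fitness(plaintext)
        if fit > best_fit:
            best_key, best_fit = key, fit
    return best_key, best_fit

# Usage on the challenge text named in the slide:
# key, fit = caesar_break(open('2013/1a.ciphertext').read())
```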
<script src="http://gnab.github.io/remark/downloads/remark-0.6.0.min.js" type="text/javascript">
<script type="text/javascript"
  src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&delayStartupUntil=configured"></script>
<script type="text/javascript">
var slideshow = remark.create({ ratio: "16:9" });
skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
MathJax.Hub.Queue(function() {
    $(MathJax.Hub.getAllJax()).map(function(index, elem) {
        return(elem.SourceElement());
    }).parent().addClass('has-jax');
});
MathJax.Hub.Configured();