<title>Breaking Caesar ciphers</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<style type="text/css">
h1 { font-size: 3em; }
h2 { font-size: 2em; }
h3 { font-size: 1.6em; }
text-decoration: none;
-moz-border-radius: 5px;
-webkit-border-radius: 5px;
text-shadow: 0 0 20px #333;
text-shadow: 0 0 20px #333;
<textarea id="source">
# Breaking Caesar ciphers

![center-aligned Caesar wheel](caesarwheel1.gif)

Slow but clever vs Dumb but fast

Ciphertext | Plaintext
-----------|----------
![left-aligned Ciphertext frequencies](c1a_frequency_histogram.png) | ![left-aligned English frequencies](english_frequency_histogram.png)

* How many keys to try?

decipher with this key
how close is it to English?

What steps do we know how to do?
# How close is it to English?

What does English look like?

* We need a model of English.

How do we define "closeness"?

# What does English look like?

## Abstraction: frequency of letter counts

Use this to predict the probability of each letter, and hence the probability of a sequence of letters.
# An infinite number of monkeys

What is the probability that this string of letters is a sample of English?

## Naive Bayes, or the bag of letters

Ignore letter order, just treat each letter individually.

Probability of a text is `\( \prod_i p_i \)`
Letter      | h       | e       | l       | l       | o       | hello
------------|---------|---------|---------|---------|---------|-------
Probability | 0.06645 | 0.12099 | 0.04134 | 0.04134 | 0.08052 | 1.10648239 × 10<sup>-6</sup>
Letter      | i       | f       | m       | m       | p       | ifmmp
------------|---------|---------|---------|---------|---------|-------
Probability | 0.06723 | 0.02159 | 0.02748 | 0.02748 | 0.01607 | 1.76244520 × 10<sup>-8</sup>
(Implementation issue: this can often underflow, so get in the habit of rephrasing it as `\( \sum_i \log p_i \)`)
Letter      | h       | e       | l       | l       | o       | hello
------------|---------|---------|---------|---------|---------|-------
Probability | -1.1774 | -0.9172 | -1.3836 | -1.3836 | -1.0940 | -5.956055
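The tables above can be checked directly. A minimal sketch, using the (rounded) probabilities from the table, so the final digits differ slightly from the full-precision values shown:

```python
import math

# Letter probabilities from the table above (rounded to five places)
probs = {'h': 0.06645, 'e': 0.12099, 'l': 0.04134, 'o': 0.08052}

# Naive product: a very small number, liable to underflow on long texts
product = math.prod(probs[c] for c in 'hello')

# Sum of logs: the same information, kept in a safe numeric range
log_sum = sum(math.log10(probs[c]) for c in 'hello')
```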
# Frequencies of English

But before then, how do we count the letters?

* Read a file into a string

Create the `language_models.py` file for this.

Counting letters in _War and Peace_ gives all manner of junk.
* Convert the text into canonical form (lower case, accents removed, non-letters stripped) before counting
[l.lower() for l in text if ...]
>>> 'é' in string.ascii_letters
False
>>> 'e' in string.ascii_letters
True
## Unicode, combining codepoints, and normal forms

Text encodings will bite you when you least expect it.
- **é** : LATIN SMALL LETTER E WITH ACUTE (U+00E9)
- **e** + **́** : LATIN SMALL LETTER E (U+0065) + COMBINING ACUTE ACCENT (U+0301)
* URL encoding is the other pain point.
# Five minutes on StackOverflow later...
import unicodedata

def unaccent(text):
    """Remove all accents from letters.
    It does this by converting the unicode string to decomposed compatibility
    form, dropping all the combining accents, then re-encoding the bytes.

    >>> unaccent('hello')
    'hello'
    >>> unaccent('HELLO')
    'HELLO'
    >>> unaccent('héllo')
    'hello'
    >>> unaccent('héllö')
    'hello'
    >>> unaccent('HÉLLÖ')
    'HELLO'
    """
    return unicodedata.normalize('NFKD', text).\
        encode('ascii', 'ignore').\
        decode('utf-8')
# Find the frequencies of letters in English

1. Read from `shakespeare.txt`, `sherlock-holmes.txt`, and `war-and-peace.txt`.
2. Find the frequencies (`.update()`)
4. Write counts to `count_1l.txt` (`'text{}\n'.format()`)
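One way the counting step could go, as a sketch rather than the canonical solution; the loop over the corpus files is left as a comment since the files aren't bundled here:

```python
import collections
import string

def letter_counts(text):
    """Count letters, canonicalising to lower case and dropping non-letters."""
    return collections.Counter(c for c in text.lower()
                               if c in string.ascii_lowercase)

# For the exercise: merge the three corpora with Counter.update()
# counts = collections.Counter()
# for filename in ['shakespeare.txt', 'sherlock-holmes.txt', 'war-and-peace.txt']:
#     with open(filename) as f:
#         counts.update(letter_counts(f.read()))
```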
# Reading letter probabilities

1. Load the file `count_1l.txt` into a dict, with letters as keys.
2. Normalise the counts (components of vector sum to 1): `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3 + \dots }$$`
   * Remember the doctest!
3. Create a dict `Pl` that gives the log probability of a letter
4. Create a function `Pletters` that gives the probability of an iterable of letters
   * What preconditions should this function have?
   * Remember the doctest!
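A sketch of steps 2–4. The counts here are made-up illustrative numbers, not the real `count_1l.txt` contents, and `Pletters` is assumed to return the summed log probability, matching its later use as a fitness score:

```python
import math

counts = {'e': 120, 't': 90, 'a': 80}  # hypothetical counts for illustration

total = sum(counts.values())
probabilities = {letter: count / total for letter, count in counts.items()}
Pl = {letter: math.log10(p) for letter, p in probabilities.items()}

def Pletters(letters):
    """Log probability of an iterable of letters.
    Precondition: every item must be a key of Pl (a known lower-case letter)."""
    return sum(Pl[letter] for letter in letters)
```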
# Breaking Caesar ciphers

## Remember the basic idea

decipher with this key
how close is it to English?
remember the best key
Try it on the text in `2013/1a.ciphertext`. Does it work?
Better than scattering `print()` statements through your code
logger = logging.getLogger(__name__)
logger.addHandler(logging.FileHandler('cipher.log'))
logger.setLevel(logging.WARNING)
logger.debug('Caesar break attempt using key {0} gives fit of {1} '
             'and decrypt starting: {2}'.format(shift, fit, plaintext[:50]))
Use `logger.setLevel()` to change the level: CRITICAL, ERROR, WARNING, INFO, DEBUG
# Back to frequency of letter counts

Another way of thinking about this is as a 26-dimensional vector.

Create a vector of our text, and one of idealised English.

The distance between the vectors is how far from English the text is.

.float-right[![right-aligned Vector subtraction](vector-subtraction.svg)]
* L<sub>2</sub> norm (Euclidean distance):
`\(\|\mathbf{a} - \mathbf{b}\| = \sqrt{\sum_i (\mathbf{a}_i - \mathbf{b}_i)^2} \)`
* L<sub>1</sub> norm (Manhattan distance, taxicab distance):
`\(\|\mathbf{a} - \mathbf{b}\| = \sum_i |\mathbf{a}_i - \mathbf{b}_i| \)`
* L<sub>3</sub> norm:
`\(\|\mathbf{a} - \mathbf{b}\| = \sqrt[3]{\sum_i |\mathbf{a}_i - \mathbf{b}_i|^3} \)`
The higher the power used, the more weight is given to the largest differences in components.
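These three norms might be coded like this, with frequency profiles as dicts keyed by letter. This is a sketch; the deck's `norms` module presumably has its own versions:

```python
def lp(frequencies_a, frequencies_b, p):
    """Lp distance between two frequency profiles (dicts keyed by letter)."""
    keys = set(frequencies_a) | set(frequencies_b)
    return sum(abs(frequencies_a.get(k, 0) - frequencies_b.get(k, 0)) ** p
               for k in keys) ** (1 / p)

def l1(a, b): return lp(a, b, 1)
def l2(a, b): return lp(a, b, 2)
def l3(a, b): return lp(a, b, 3)
```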
* L<sub>0</sub> norm (Hamming distance):
`$$\|\mathbf{a} - \mathbf{b}\| = \sum_i \left\{
\begin{matrix} 1 &\mbox{if}\ \mathbf{a}_i \neq \mathbf{b}_i , \\
0 &\mbox{if}\ \mathbf{a}_i = \mathbf{b}_i \end{matrix} \right. $$`
* L<sub>∞</sub> norm:
`\(\|\mathbf{a} - \mathbf{b}\| = \max_i{|\mathbf{a}_i - \mathbf{b}_i|} \)`
(Neither of which will be that useful here, but they keep cropping up.)
# Normalisation of vectors

Frequency distributions drawn from different sources will have different lengths. For a fair comparison we need to scale them.

* Euclidean scaling (vector with unit length): `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \sqrt{\mathbf{x}_1^2 + \mathbf{x}_2^2 + \mathbf{x}_3^2 + \dots } }$$`

* Normalisation (components of vector sum to 1): `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3 + \dots }$$`
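The two scalings as code, again as a sketch in the dict-based style; in the deck they presumably become `norms.euclidean_scale` and `norms.normalise`:

```python
import math

def normalise(frequencies):
    """Scale a frequency profile so its components sum to 1."""
    total = sum(frequencies.values())
    return {k: v / total for k, v in frequencies.items()}

def euclidean_scale(frequencies):
    """Scale a frequency profile to unit Euclidean length."""
    length = math.sqrt(sum(v ** 2 for v in frequencies.values()))
    return {k: v / length for k, v in frequencies.items()}
```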
# Angle, not distance

Rather than looking at the distance between the vectors, look at the angle between them.

.float-right[![right-aligned Vector dot product](vector-dot-product.svg)]

Vector dot product shows how much of one vector lies in the direction of another:
`\( \mathbf{A} \bullet \mathbf{B} = \| \mathbf{A} \| \cdot \| \mathbf{B} \| \cos{\theta} \)`
`\( \mathbf{A} \bullet \mathbf{B} = \sum_i \mathbf{A}_i \cdot \mathbf{B}_i \)`
and `\( \| \mathbf{A} \| = \sqrt{\sum_i \mathbf{A}_i^2} \)`
A bit of rearranging gives the cosine similarity:
`$$ \cos{\theta} = \frac{ \mathbf{A} \bullet \mathbf{B} }{ \| \mathbf{A} \| \cdot \| \mathbf{B} \| } =
\frac{\sum_i \mathbf{A}_i \cdot \mathbf{B}_i}{\sqrt{\sum_i \mathbf{A}_i^2} \times \sqrt{\sum_i \mathbf{B}_i^2}} $$`
This is independent of vector lengths!

Cosine similarity is 1 if parallel, 0 if perpendicular, -1 if antiparallel.
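Cosine similarity as a sketch in the same dict-based style, named to match the `norms.cosine_similarity` used later:

```python
import math

def cosine_similarity(frequencies_a, frequencies_b):
    """Cosine of the angle between two frequency profiles."""
    keys = set(frequencies_a) | set(frequencies_b)
    dot = sum(frequencies_a.get(k, 0) * frequencies_b.get(k, 0) for k in keys)
    magnitude_a = math.sqrt(sum(v ** 2 for v in frequencies_a.values()))
    magnitude_b = math.sqrt(sum(v ** 2 for v in frequencies_b.values()))
    return dot / (magnitude_a * magnitude_b)
```

Note that because the magnitudes divide out, scaling either vector leaves the similarity unchanged.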
| Euclidean | Normalised
---|-----------|------------
And the probability measure!

* Nine different ways of measuring fitness.

## Computing is an empirical science

Let's do some experiments to find the best solution!
# Experimental harness

## Step 1: build some other scoring functions

We need a way of passing the different functions to the keyfinding function.

## Step 2: find the best scoring function

Try them all on random ciphertexts, see which one works best.
# Functions are values!

<function Pletters at 0x7f60e6d9c4d0>
def caesar_break(message, fitness=Pletters):
    """Breaks a Caesar cipher using frequency analysis"""
    for shift in range(26):
        plaintext = caesar_decipher(message, shift)
        fit = fitness(plaintext)
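Fleshed out into something runnable, as a sketch: `caesar_decipher` here is one plausible implementation, and a toy fitness that counts common English letters stands in for `Pletters`:

```python
import string

def caesar_decipher(message, shift):
    """Shift each letter back by `shift` places; leave other characters alone."""
    deciphered = []
    for character in message.lower():
        if character in string.ascii_lowercase:
            deciphered.append(chr((ord(character) - ord('a') - shift) % 26
                                  + ord('a')))
        else:
            deciphered.append(character)
    return ''.join(deciphered)

def toy_fitness(text):
    """Crude stand-in for Pletters: count the commonest English letters."""
    return sum(text.count(letter) for letter in 'etaoin')

def caesar_break(message, fitness=toy_fitness):
    """Try all 26 shifts; keep the key whose decrypt best fits English."""
    best_shift, best_fit = 0, float('-inf')
    for shift in range(26):
        fit = fitness(caesar_decipher(message, shift))
        if fit > best_fit:
            best_shift, best_fit = shift, fit
    return best_shift, best_fit
```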
# Changing the comparison function

* Must be a function that takes a text and returns a score
* Better fit must give higher score, opposite of the vector distance norms

def make_frequency_compare_function(target_frequency, frequency_scaling, metric, invert):
    def frequency_compare(text):
        ...
    return frequency_compare
# Data-driven processing

metrics = [{'func': norms.l1, 'invert': True, 'name': 'l1'},
           {'func': norms.l2, 'invert': True, 'name': 'l2'},
           {'func': norms.l3, 'invert': True, 'name': 'l3'},
           {'func': norms.cosine_similarity, 'invert': False, 'name': 'cosine_similarity'}]
scalings = [{'corpus_frequency': normalised_english_counts,
             'scaling': norms.normalise,
             'name': 'normalised'},
            {'corpus_frequency': euclidean_scaled_english_counts,
             'scaling': norms.euclidean_scale,
             'name': 'euclidean_scaled'}]

Use this to make all nine scoring functions.
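A self-contained sketch of the idea, with stand-in versions of `l1` and `normalise`, a hypothetical body for `frequency_compare`, and made-up two-letter target frequencies; the real version loops over the `metrics` and `scalings` tables above:

```python
import collections

def l1(a, b):
    """Stand-in L1 distance between two frequency dicts."""
    keys = set(a) | set(b)
    return sum(abs(a.get(k, 0) - b.get(k, 0)) for k in keys)

def normalise(frequencies):
    """Stand-in scaling: components sum to 1."""
    total = sum(frequencies.values())
    return {k: v / total for k, v in frequencies.items()}

def make_frequency_compare_function(target_frequency, frequency_scaling,
                                    metric, invert):
    def frequency_compare(text):
        # Hypothetical body: count letters, scale, compare to the target.
        # Distances are negated so that a higher score always means a better fit.
        counts = collections.Counter(text)
        distance = metric(target_frequency, frequency_scaling(counts))
        return -distance if invert else distance
    return frequency_compare

target = {'a': 0.6, 'b': 0.4}  # hypothetical corpus frequencies
compare = make_frequency_compare_function(target, normalise, l1, invert=True)
```

Looping this over the four metrics and two scalings gives eight scoring functions; `Pletters` is the ninth.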
<script src="http://gnab.github.io/remark/downloads/remark-0.6.0.min.js" type="text/javascript"></script>
<script type="text/javascript"
  src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&delayStartupUntil=configured"></script>
<script type="text/javascript">
var slideshow = remark.create({ ratio: "16:9" });
skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
MathJax.Hub.Queue(function() {
  $(MathJax.Hub.getAllJax()).map(function(index, elem) {
    return(elem.SourceElement());
  }).parent().addClass('has-jax');
});
MathJax.Hub.Configured();