4 <title>Breaking caesar ciphers
</title>
5 <meta http-equiv=
"Content-Type" content=
"text/html; charset=UTF-8"/>
6 <style type=
"text/css">
15 h1 { font-size:
3em; }
16 h2 { font-size:
2em; }
17 h3 { font-size:
1.6em; }
19 text-decoration: none;
22 -moz-border-radius:
5px;
23 -web-border-radius:
5px;
31 text-shadow:
0 0 20px #
333;
37 text-shadow:
0 0 20px #
333;
46 <textarea id=
"source">
48 # Breaking caesar ciphers
50 ![center-aligned Caesar wheel](caesarwheel1.gif)
56 Slow but clever vs Dumb but fast
60 Ciphertext | Plaintext
62 ![left-aligned Ciphertext frequencies](c1a_frequency_histogram.png) | ![left-aligned English frequencies](english_frequency_histogram.png)
74 * How many keys to try?
80 decipher with this key
81 how close is it to English?
85 What steps do we know how to do?
88 # How close is it to English?
90 What does English look like?
92 * We need a model of English.
94 How do we define
"closeness"?
98 # What does English look like?
100 ## Abstraction: frequency of letter counts
114 Use this to predict the probability of each letter, and hence the probability of a sequence of letters.
118 # An infinite number of monkeys
120 What is the probability that this string of letters is a sample of English?
122 ## Naive Bayes, or the bag of letters
124 Ignore letter order, just treat each letter individually.
126 Probability of a text is `\( \prod_i p_i \)`
128 Letter | h | e | l | l | o | hello
129 ------------|---------|---------|---------|---------|---------|-------
130 Probability |
0.06645 |
0.12099 |
0.04134 |
0.04134 |
0.08052 |
1.10648239 ×
10<sup>-
6</sup>
132 Letter | i | f | m | m | p | ifmmp
133 ------------|---------|---------|---------|---------|---------|-------
134 Probability |
0.06723 |
0.02159 |
0.02748 |
0.02748 |
0.01607 |
1.76244520 ×
10<sup>-
8</sup>
136 (Implmentation issue: this can often underflow, so get in the habit of rephrasing it as `\( \sum_i \log p_i \)`)
138 Letter | h | e | l | l | o | hello
139 ------------|---------|---------|---------|---------|---------|-------
140 Probability | -
1.1774 | -
0.9172 | -
1.3836 | -
1.3836 | -
1.0940 | -
5.956055
145 # Frequencies of English
147 But before then how do we count the letters?
149 * Read a file into a string
159 Create the `language_models.py` file for this.
165 Counting letters in _War and Peace_ gives all manner of junk.
167 * Convert the text in canonical form (lower case, accents removed, non-letters stripped) before counting
170 [l.lower() for l in text if ...]
177 >>> 'é' in string.ascii_letters
178 >>> 'e' in string.ascii_letters
181 ## Unicode, combining codepoints, and normal forms
183 Text encodings will bite you when you least expect it.
185 - **é** : LATIN SMALL LETTER E WITH ACUTE (U+
00E9)
187 - **e** + **
́** : LATIN SMALL LETTER E (U+
0065) + COMBINING ACUTE ACCENT (U+
0301)
189 * urlencoding is the other pain point.
193 # Five minutes on StackOverflow later...
197 """Remove all accents from letters.
198 It does this by converting the unicode string to decomposed compatibility
199 form, dropping all the combining accents, then re-encoding the bytes.
201 >>> unaccent('hello')
203 >>> unaccent('HELLO')
205 >>> unaccent('héllo')
207 >>> unaccent('héllö')
209 >>> unaccent('HÉLLÖ')
212 return unicodedata.normalize('NFKD', text).\
213 encode('ascii', 'ignore').\
219 # Find the frequencies of letters in English
221 1. Read from `shakespeare.txt`, `sherlock-holmes.txt`, and `war-and-peace.txt`.
222 2. Find the frequencies (`.update()`)
224 4. Write counts to `count_1l.txt` (`'text{}\n'.format()`)
228 # Reading letter probabilities
230 1. Load the file `count_1l.txt` into a dict, with letters as keys.
232 2. Normalise the counts (components of vector sum to
1): `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3 + \dots }$$`
234 * Remember the doctest!
236 3. Create a dict `Pl` that gives the log probability of a letter
238 4. Create a function `Pletters` that gives the probability of an iterable of letters
239 * What preconditions should this function have?
240 * Remember the doctest!
244 # Breaking caesar ciphers
246 ## Remember the basic idea
250 decipher with this key
251 how close is it to English?
252 remember the best key
255 Try it on the text in `
2013/
1a.ciphertext`. Does it work?
261 Better than scattering `print()`statements through your code
266 logger = logging.getLogger(__name__)
267 logger.addHandler(logging.FileHandler('cipher.log'))
268 logger.setLevel(logging.WARNING)
270 logger.debug('Caesar break attempt using key {
0} gives fit of {
1} '
271 'and decrypt starting: {
2}'.format(shift, fit, plaintext[:
50]))
276 Use `logger.setLevel()` to change the level: CRITICAL, ERROR, WARNING, INFO, DEBUG
278 Use `logger.debug()`, `logger.info()`, etc. to log a message.
282 # How much ciphertext do we need?
284 ## Let's do an experiment to find out
286 1. Load the whole corpus into a string (sanitised)
287 2. Select a random chunk of plaintext and a random key
289 4. Score
1 point if `caesar_cipher_break()` recovers the correct key
290 5. Repeat many times and with many plaintext lengths
294 <script src=
"http://gnab.github.io/remark/downloads/remark-0.6.0.min.js" type=
"text/javascript">
297 <script type=
"text/javascript"
298 src=
"http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&delayStartupUntil=configured"></script>
300 <script type=
"text/javascript">
301 var slideshow = remark.create({ ratio:
"16:9" });
306 skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
309 MathJax.Hub.Queue(function() {
310 $(MathJax.Hub.getAllJax()).map(function(index, elem) {
311 return(elem.SourceElement());
312 }).parent().addClass('has-jax');
314 MathJax.Hub.Configured();