4 <title>Breaking caesar ciphers
</title>
5 <meta http-equiv=
"Content-Type" content=
"text/html; charset=UTF-8"/>
6 <style type=
"text/css">
15 h1 { font-size:
3em; }
16 h2 { font-size:
2em; }
17 h3 { font-size:
1.6em; }
19 text-decoration: none;
22 -moz-border-radius:
5px;
23 -web-border-radius:
5px;
31 text-shadow:
0 0 20px #
333;
37 text-shadow:
0 0 20px #
333;
51 <textarea id=
"source">
53 # Breaking caesar ciphers
55 ![center-aligned Caesar wheel](caesarwheel1.gif)
61 .indexlink[[Index](index.html)]
67 Slow but clever vs Dumb but fast
71 Ciphertext | Plaintext
73 ![left-aligned Ciphertext frequencies](c1a_frequency_histogram.png) | ![left-aligned English frequencies](english_frequency_histogram.png)
85 * How many keys to try?
91 decipher with this key
92 how close is it to English?
96 What steps do we know how to do?
99 # How close is it to English?
101 What does English look like?
103 * We need a model of English.
105 How do we define
"closeness"?
107 ## Here begineth the yak shaving
111 # What does English look like?
113 ## Abstraction: frequency of letter counts
127 Use this to predict the probability of each letter, and hence the probability of a sequence of letters.
131 .float-right[![right-aligned Typing monkey](typingmonkeylarge.jpg)]
133 # Naive Bayes, or the bag of letters
135 What is the probability that this string of letters is a sample of English?
137 Ignore letter order, just treat each letter individually.
139 Probability of a text is `\( \prod_i p_i \)`
141 Letter | h | e | l | l | o | hello
142 ------------|---------|---------|---------|---------|---------|-------
143 Probability |
0.06645 |
0.12099 |
0.04134 |
0.04134 |
0.08052 |
1.10648239 ×
10<sup>-
6</sup>
145 Letter | i | f | m | m | p | ifmmp
146 ------------|---------|---------|---------|---------|---------|-------
147 Probability |
0.06723 |
0.02159 |
0.02748 |
0.02748 |
0.01607 |
1.76244520 ×
10<sup>-
8</sup>
149 (Implmentation issue: this can often underflow, so get in the habit of rephrasing it as `\( \sum_i \log p_i \)`)
151 Letter | h | e | l | l | o | hello
152 ------------|---------|---------|---------|---------|---------|-------
153 Probability | -
1.1774 | -
0.9172 | -
1.3836 | -
1.3836 | -
1.0940 | -
5.956055
158 # Frequencies of English
160 But before then how do we count the letters?
162 * Read a file into a string
170 collections.Counter()
173 Create the `language_models.py` file for this.
179 Counting letters in _War and Peace_ gives all manner of junk.
181 * Convert the text in canonical form (lower case, accents removed, non-letters stripped) before counting
184 [l.lower() for l in text if ...]
191 >>> 'é' in string.ascii_letters
192 >>> 'e' in string.ascii_letters
195 ## Unicode, combining codepoints, and normal forms
197 Text encodings will bite you when you least expect it.
199 - **é** : LATIN SMALL LETTER E WITH ACUTE (U+
00E9)
201 - **e** + **
́** : LATIN SMALL LETTER E (U+
0065) + COMBINING ACUTE ACCENT (U+
0301)
203 * urlencoding is the other pain point.
207 # Five minutes on StackOverflow later...
211 """Remove all accents from letters.
212 It does this by converting the unicode string to decomposed compatibility
213 form, dropping all the combining accents, then re-encoding the bytes.
215 >>> unaccent('hello')
217 >>> unaccent('HELLO')
219 >>> unaccent('héllo')
221 >>> unaccent('héllö')
223 >>> unaccent('HÉLLÖ')
226 return unicodedata.normalize('NFKD', text).\
227 encode('ascii', 'ignore').\
233 # Find the frequencies of letters in English
235 1. Read from `shakespeare.txt`, `sherlock-holmes.txt`, and `war-and-peace.txt`.
236 2. Find the frequencies (`.update()`)
237 3. Sort by count (read the docs...)
238 4. Write counts to `count_1l.txt`
240 with open('count_1l.txt', 'w') as f:
242 f.write('text\t{}\n'.format(count))
247 # Reading letter probabilities
249 1. Load the file `count_1l.txt` into a dict, with letters as keys.
251 2. Normalise the counts (components of vector sum to
1): `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3 + \dots }$$`
253 * Remember the doctest!
255 3. Create a dict `Pl` that gives the log probability of a letter
257 4. Create a function `Pletters` that gives the probability of an iterable of letters
258 * What preconditions should this function have?
259 * Remember the doctest!
263 # Breaking caesar ciphers
265 New file: `cipherbreak.py`
267 ## Remember the basic idea
271 decipher with this key
272 how close is it to English?
273 remember the best key
276 Try it on the text in `
2013/
1a.ciphertext`. Does it work?
282 Better than scattering `print()`statements through your code
287 logger = logging.getLogger(__name__)
288 logger.addHandler(logging.FileHandler('cipher.log'))
289 logger.setLevel(logging.WARNING)
291 logger.debug('Caesar break attempt using key {
0} gives fit of {
1} '
292 'and decrypt starting: {
2}'.format(shift, fit, plaintext[:
50]))
297 Use `logger.setLevel()` to change the level: CRITICAL, ERROR, WARNING, INFO, DEBUG
299 Use `logger.debug()`, `logger.info()`, etc. to log a message.
303 # Homework: how much ciphertext do we need?
305 ## Let's do an experiment to find out
307 1. Load the whole corpus into a string (sanitised)
308 2. Select a random chunk of plaintext and a random key
310 4. Score
1 point if `caesar_cipher_break()` recovers the correct key
311 5. Repeat many times and with many plaintext lengths
317 with open('caesar_break_parameter_trials.csv', 'w') as f:
318 writer = csv.DictWriter(f, ['name'] + message_lengths,
319 quoting=csv.QUOTE_NONNUMERIC)
321 for scoring in sorted(scores.keys()):
322 scores[scoring]['name'] = scoring
323 writer.writerow(scores[scoring])
327 <script src=
"http://gnab.github.io/remark/downloads/remark-0.6.0.min.js" type=
"text/javascript">
330 <script type=
"text/javascript"
331 src=
"http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&delayStartupUntil=configured"></script>
333 <script type=
"text/javascript">
334 var slideshow = remark.create({ ratio:
"16:9" });
339 skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
342 MathJax.Hub.Queue(function() {
343 $(MathJax.Hub.getAllJax()).map(function(index, elem) {
344 return(elem.SourceElement());
345 }).parent().addClass('has-jax');
347 MathJax.Hub.Configured();