4 <title>Breaking caesar ciphers
</title>
5 <meta http-equiv=
"Content-Type" content=
"text/html; charset=UTF-8"/>
6 <style type=
"text/css">
15 h1 { font-size:
3em; }
16 h2 { font-size:
2em; }
17 h3 { font-size:
1.6em; }
19 text-decoration: none;
22 -moz-border-radius:
5px;
23 -web-border-radius:
5px;
31 text-shadow:
0 0 20px #
333;
37 text-shadow:
0 0 20px #
333;
51 <textarea id=
"source">
53 # Breaking caesar ciphers
55 ![center-aligned Caesar wheel](caesarwheel1.gif)
61 .indexlink[[Index](index.html)]
67 Slow but clever vs Dumb but fast
71 Ciphertext | Plaintext
73 ![left-aligned Ciphertext frequencies](c1a_frequency_histogram.png) | ![left-aligned English frequencies](english_frequency_histogram.png)
85 * How many keys to try?
91 decipher with this key
92 how close is it to English?
96 What steps do we know how to do?
99 # How close is it to English?
101 What does English look like?
103 * We need a model of English.
105 How do we define
"closeness"?
107 ## Here begineth the yak shaving
111 # What does English look like?
113 ## Abstraction: frequency of letter counts
127 Use this to predict the probability of each letter, and hence the probability of a sequence of letters.
131 .float-right[![right-aligned Typing monkey](typingmonkeylarge.jpg)]
133 # Naive Bayes, or the bag of letters
135 What is the probability that this string of letters is a sample of English?
137 Ignore letter order, just treat each letter individually.
139 Probability of a text is `\( \prod_i p_i \)`
141 Letter | h | e | l | l | o | hello
142 ------------|---------|---------|---------|---------|---------|-------
143 Probability |
0.06645 |
0.12099 |
0.04134 |
0.04134 |
0.08052 |
1.10648239 ×
10<sup>-
6</sup>
145 Letter | i | f | m | m | p | ifmmp
146 ------------|---------|---------|---------|---------|---------|-------
147 Probability |
0.06723 |
0.02159 |
0.02748 |
0.02748 |
0.01607 |
1.76244520 ×
10<sup>-
8</sup>
149 (Implmentation issue: this can often underflow, so get in the habit of rephrasing it as `\( \sum_i \log p_i \)`)
151 Letter | h | e | l | l | o | hello
152 ------------|---------|---------|---------|---------|---------|-------
153 Probability | -
1.1774 | -
0.9172 | -
1.3836 | -
1.3836 | -
1.0940 | -
5.956055
158 # Frequencies of English
160 But before then how do we count the letters?
162 * Read a file into a string
170 collections.Counter()
173 Create the `language_models.py` file for this.
179 Counting letters in _War and Peace_ gives all manner of junk.
181 * Convert the text in canonical form (lower case, accents removed, non-letters stripped) before counting
184 [l.lower() for l in text if ...]
191 >>> 'é' in string.ascii_letters
192 >>> 'e' in string.ascii_letters
195 ## Unicode, combining codepoints, and normal forms
197 Text encodings will bite you when you least expect it.
199 - **é** : LATIN SMALL LETTER E WITH ACUTE (U+
00E9)
201 - **e** + **
́** : LATIN SMALL LETTER E (U+
0065) + COMBINING ACUTE ACCENT (U+
0301)
203 * urlencoding is the other pain point.
207 # Five minutes on StackOverflow later...
211 """Remove all accents from letters.
212 It does this by converting the unicode string to decomposed compatibility
213 form, dropping all the combining accents, then re-encoding the bytes.
215 >>> unaccent('hello')
217 >>> unaccent('HELLO')
219 >>> unaccent('héllo')
221 >>> unaccent('héllö')
223 >>> unaccent('HÉLLÖ')
226 return unicodedata.normalize('NFKD', text).\
227 encode('ascii', 'ignore').\
233 # Find the frequencies of letters in English
235 1. Read from `shakespeare.txt`, `sherlock-holmes.txt`, and `war-and-peace.txt`.
236 2. Find the frequencies (`.update()`)
237 3. Sort by count (read the docs...)
238 4. Write counts to `count_1l.txt`
240 with open('count_1l.txt', 'w') as f:
242 f.write('text\t{}\n'.format(count))
247 # Reading letter probabilities
249 New file: `language_models.py`
251 1. Load the file `count_1l.txt` into a dict, with letters as keys.
253 2. Normalise the counts (components of vector sum to
1): `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3 + \dots }$$`
255 * Remember the doctest!
257 3. Create a dict `Pl` that gives the log probability of a letter
259 4. Create a function `Pletters` that gives the probability of an iterable of letters
260 * What preconditions should this function have?
261 * Remember the doctest!
265 # Breaking caesar ciphers
267 New file: `cipherbreak.py`
269 ## Remember the basic idea
273 decipher with this key
274 how close is it to English?
275 remember the best key
278 Try it on the text in `
2013/
1a.ciphertext`. Does it work?
284 Better than scattering `print()`statements through your code
289 logger = logging.getLogger(__name__)
290 logger.addHandler(logging.FileHandler('cipher.log'))
291 logger.setLevel(logging.WARNING)
293 logger.debug('Caesar break attempt using key {
0} gives fit of {
1} '
294 'and decrypt starting: {
2}'.format(shift, fit, plaintext[:
50]))
299 Use `logger.setLevel()` to change the level: CRITICAL, ERROR, WARNING, INFO, DEBUG
301 Use `logger.debug()`, `logger.info()`, etc. to log a message.
305 # Homework: how much ciphertext do we need?
307 ## Let's do an experiment to find out
309 1. Load the whole corpus into a string (sanitised)
310 2. Select a random chunk of plaintext and a random key
312 4. Score
1 point if `caesar_cipher_break()` recovers the correct key
313 5. Repeat many times and with many plaintext lengths
319 with open('caesar_break_parameter_trials.csv', 'w') as f:
320 writer = csv.DictWriter(f, ['name'] + message_lengths,
321 quoting=csv.QUOTE_NONNUMERIC)
323 for scoring in sorted(scores.keys()):
324 scores[scoring]['name'] = scoring
325 writer.writerow(scores[scoring])
329 <script src=
"http://gnab.github.io/remark/downloads/remark-0.6.0.min.js" type=
"text/javascript">
332 <script type=
"text/javascript"
333 src=
"http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&delayStartupUntil=configured"></script>
335 <script type=
"text/javascript">
336 var slideshow = remark.create({ ratio:
"16:9" });
341 skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
344 MathJax.Hub.Queue(function() {
345 $(MathJax.Hub.getAllJax()).map(function(index, elem) {
346 return(elem.SourceElement());
347 }).parent().addClass('has-jax');
349 MathJax.Hub.Configured();