4 <title>Breaking caesar ciphers
</title>
5 <meta http-equiv=
"Content-Type" content=
"text/html; charset=UTF-8"/>
6 <style type=
"text/css">
15 h1 { font-size:
3em; }
16 h2 { font-size:
2em; }
17 h3 { font-size:
1.6em; }
19 text-decoration: none;
22 -moz-border-radius:
5px;
23 -web-border-radius:
5px;
31 text-shadow:
0 0 20px #
333;
37 text-shadow:
0 0 20px #
333;
51 <textarea id=
"source">
53 # Breaking caesar ciphers
55 ![center-aligned Caesar wheel](caesarwheel1.gif)
61 .indexlink[[Index](index.html)]
67 Slow but clever vs Dumb but fast
71 Ciphertext | Plaintext
73 ![left-aligned Ciphertext frequencies](c1a_frequency_histogram.png) | ![left-aligned English frequencies](english_frequency_histogram.png)
85 * How many keys to try?
91 decipher with this key
92 how close is it to English?
96 What steps do we know how to do?
99 # How close is it to English?
101 What does English look like?
103 * We need a model of English.
105 How do we define
"closeness"?
107 ## Here begineth the yak shaving
111 # What does English look like?
113 ## Abstraction: frequency of letter counts
115 .float-right[![right-aligned Letter frequencies](letter-frequency-treemap.png)]
129 Use this to predict the probability of each letter, and hence the probability of a sequence of letters.
133 .float-right[![right-aligned Typing monkey](typingmonkeylarge.jpg)]
135 # Naive Bayes, or the bag of letters
137 What is the probability that this string of letters is a sample of English?
139 Ignore letter order, just treat each letter individually.
141 Probability of a text is `\( \prod_i p_i \)`
143 Letter | h | e | l | l | o | hello
144 ------------|---------|---------|---------|---------|---------|-------
145 Probability |
0.06645 |
0.12099 |
0.04134 |
0.04134 |
0.08052 |
1.10648239 ×
10<sup>-
6</sup>
147 Letter | i | f | m | m | p | ifmmp
148 ------------|---------|---------|---------|---------|---------|-------
149 Probability |
0.06723 |
0.02159 |
0.02748 |
0.02748 |
0.01607 |
1.76244520 ×
10<sup>-
8</sup>
151 (Implmentation issue: this can often underflow, so we rephrase it as `\( \sum_i \log p_i \)`)
153 Letter | h | e | l | l | o | hello
154 ------------|---------|---------|---------|---------|---------|-------
155 Probability | -
1.1774 | -
0.9172 | -
1.3836 | -
1.3836 | -
1.0940 | -
5.956055
160 # Frequencies of English
162 But before then how do we count the letters?
164 * Read a file into a string
172 collections.Counter()
175 Create the `language_models.py` file for this.
181 Counting letters in _War and Peace_ gives all manner of junk.
183 * Convert the text in canonical form (lower case, accents removed, non-letters stripped) before counting
186 [l.lower() for l in text if ...]
193 >>> 'é' in string.ascii_letters
194 >>> 'e' in string.ascii_letters
197 ## Unicode, combining codepoints, and normal forms
199 Text encodings will bite you when you least expect it.
201 - **é** : LATIN SMALL LETTER E WITH ACUTE (U+
00E9)
203 - **e** + **
́** : LATIN SMALL LETTER E (U+
0065) + COMBINING ACUTE ACCENT (U+
0301)
205 * urlencoding is the other pain point.
209 # Five minutes on StackOverflow later...
215 """Remove all accents from letters.
216 It does this by converting the unicode string to decomposed compatibility
217 form, dropping all the combining accents, then re-encoding the bytes.
219 >>> unaccent('hello')
221 >>> unaccent('HELLO')
223 >>> unaccent('héllo')
225 >>> unaccent('héllö')
227 >>> unaccent('HÉLLÖ')
230 return unicodedata.normalize('NFKD', text).\
231 encode('ascii', 'ignore').\
237 # Find the frequencies of letters in English
239 1. Read from `shakespeare.txt`, `sherlock-holmes.txt`, and `war-and-peace.txt`.
240 2. Find the frequencies (`.update()`)
241 3. Sort by count (read the docs...)
242 4. Write counts to `count_1l.txt`
244 with open('count_1l.txt', 'w') as f:
246 f.write('text\t{}\n'.format(count))
251 # Reading letter probabilities
253 1. Load the file `count_1l.txt` into a dict, with letters as keys.
255 2. Normalise the counts (components of vector sum to
1): `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3 + \dots }$$`
257 * Remember the doctest!
259 3. Create a dict `Pl` that gives the log probability of a letter
261 4. Create a function `Pletters` that gives the probability of an iterable of letters
262 * What preconditions should this function have?
263 * Remember the doctest!
267 # Breaking caesar ciphers
269 New file: `cipherbreak.py`
271 ## Remember the basic idea
275 decipher with this key
276 how close is it to English?
277 remember the best key
280 Try it on the text in `
2013/
1a.ciphertext`. Does it work?
286 Better than scattering `print()`statements through your code
291 logger = logging.getLogger(__name__)
292 logger.addHandler(logging.FileHandler('cipher.log'))
293 logger.setLevel(logging.WARNING)
295 logger.debug('Caesar break attempt using key {
0} gives fit of {
1} '
296 'and decrypt starting: {
2}'.format(shift, fit, plaintext[:
50]))
301 Use `logger.setLevel()` to change the level: CRITICAL, ERROR, WARNING, INFO, DEBUG
303 Use `logger.debug()`, `logger.info()`, etc. to log a message.
307 # Homework: how much ciphertext do we need?
309 ## Let's do an experiment to find out
311 1. Load the whole corpus into a string (sanitised)
312 2. Select a random chunk of plaintext and a random key
314 4. Score
1 point if `caesar_cipher_break()` recovers the correct key
315 5. Repeat many times and with many plaintext lengths
321 with open('caesar_break_parameter_trials.csv', 'w') as f:
322 writer = csv.DictWriter(f, ['name'] + message_lengths,
323 quoting=csv.QUOTE_NONNUMERIC)
325 for scoring in sorted(scores.keys()):
326 scores[scoring]['name'] = scoring
327 writer.writerow(scores[scoring])
331 <script src=
"http://gnab.github.io/remark/downloads/remark-0.6.0.min.js" type=
"text/javascript">
334 <script type=
"text/javascript"
335 src=
"http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&delayStartupUntil=configured"></script>
337 <script type=
"text/javascript">
338 var slideshow = remark.create({ ratio:
"16:9" });
343 skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
346 MathJax.Hub.Queue(function() {
347 $(MathJax.Hub.getAllJax()).map(function(index, elem) {
348 return(elem.SourceElement());
349 }).parent().addClass('has-jax');
351 MathJax.Hub.Configured();