4 <title>Breaking caesar ciphers
</title>
5 <meta http-equiv=
"Content-Type" content=
"text/html; charset=UTF-8"/>
6 <style type=
"text/css">
15 h1 { font-size:
3em; }
16 h2 { font-size:
2em; }
17 h3 { font-size:
1.6em; }
19 text-decoration: none;
22 -moz-border-radius:
5px;
23 -web-border-radius:
5px;
31 text-shadow:
0 0 20px #
333;
37 text-shadow:
0 0 20px #
333;
51 <textarea id=
"source">
53 # Breaking caesar ciphers
55 ![center-aligned Caesar wheel](caesarwheel1.gif)
61 .indexlink[[Index](index.html)]
67 Slow but clever vs Dumb but fast
71 Ciphertext | Plaintext
73 ![left-aligned Ciphertext frequencies](c1a_frequency_histogram.png) | ![left-aligned English frequencies](english_frequency_histogram.png)
85 * How many keys to try?
91 decipher with this key
92 how close is it to English?
96 What steps do we know how to do?
99 # How close is it to English?
101 What does English look like?
103 * We need a model of English.
105 How do we define
"closeness"?
107 ## Here begineth the yak shaving
111 # What does English look like?
113 ## Abstraction: frequency of letter counts
127 Use this to predict the probability of each letter, and hence the probability of a sequence of letters.
131 # An infinite number of monkeys
133 What is the probability that this string of letters is a sample of English?
135 ## Naive Bayes, or the bag of letters
137 Ignore letter order, just treat each letter individually.
139 Probability of a text is `\( \prod_i p_i \)`
141 Letter | h | e | l | l | o | hello
142 ------------|---------|---------|---------|---------|---------|-------
143 Probability |
0.06645 |
0.12099 |
0.04134 |
0.04134 |
0.08052 |
1.10648239 ×
10<sup>-
6</sup>
145 Letter | i | f | m | m | p | ifmmp
146 ------------|---------|---------|---------|---------|---------|-------
147 Probability |
0.06723 |
0.02159 |
0.02748 |
0.02748 |
0.01607 |
1.76244520 ×
10<sup>-
8</sup>
149 (Implmentation issue: this can often underflow, so get in the habit of rephrasing it as `\( \sum_i \log p_i \)`)
151 Letter | h | e | l | l | o | hello
152 ------------|---------|---------|---------|---------|---------|-------
153 Probability | -
1.1774 | -
0.9172 | -
1.3836 | -
1.3836 | -
1.0940 | -
5.956055
158 # Frequencies of English
160 But before then how do we count the letters?
162 * Read a file into a string
170 collections.Counter()
173 Create the `language_models.py` file for this.
179 Counting letters in _War and Peace_ gives all manner of junk.
181 * Convert the text in canonical form (lower case, accents removed, non-letters stripped) before counting
184 [l.lower() for l in text if ...]
191 >>> 'é' in string.ascii_letters
192 >>> 'e' in string.ascii_letters
195 ## Unicode, combining codepoints, and normal forms
197 Text encodings will bite you when you least expect it.
199 - **é** : LATIN SMALL LETTER E WITH ACUTE (U+
00E9)
201 - **e** + **
́** : LATIN SMALL LETTER E (U+
0065) + COMBINING ACUTE ACCENT (U+
0301)
203 * urlencoding is the other pain point.
207 # Five minutes on StackOverflow later...
211 """Remove all accents from letters.
212 It does this by converting the unicode string to decomposed compatibility
213 form, dropping all the combining accents, then re-encoding the bytes.
215 >>> unaccent('hello')
217 >>> unaccent('HELLO')
219 >>> unaccent('héllo')
221 >>> unaccent('héllö')
223 >>> unaccent('HÉLLÖ')
226 return unicodedata.normalize('NFKD', text).\
227 encode('ascii', 'ignore').\
233 # Find the frequencies of letters in English
235 1. Read from `shakespeare.txt`, `sherlock-holmes.txt`, and `war-and-peace.txt`.
236 2. Find the frequencies (`.update()`)
238 4. Write counts to `count_1l.txt` (`'text{}\n'.format()`)
242 # Reading letter probabilities
244 1. Load the file `count_1l.txt` into a dict, with letters as keys.
246 2. Normalise the counts (components of vector sum to
1): `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3 + \dots }$$`
248 * Remember the doctest!
250 3. Create a dict `Pl` that gives the log probability of a letter
252 4. Create a function `Pletters` that gives the probability of an iterable of letters
253 * What preconditions should this function have?
254 * Remember the doctest!
258 # Breaking caesar ciphers
260 ## Remember the basic idea
264 decipher with this key
265 how close is it to English?
266 remember the best key
269 Try it on the text in `
2013/
1a.ciphertext`. Does it work?
275 Better than scattering `print()`statements through your code
280 logger = logging.getLogger(__name__)
281 logger.addHandler(logging.FileHandler('cipher.log'))
282 logger.setLevel(logging.WARNING)
284 logger.debug('Caesar break attempt using key {
0} gives fit of {
1} '
285 'and decrypt starting: {
2}'.format(shift, fit, plaintext[:
50]))
290 Use `logger.setLevel()` to change the level: CRITICAL, ERROR, WARNING, INFO, DEBUG
292 Use `logger.debug()`, `logger.info()`, etc. to log a message.
296 # Homework: how much ciphertext do we need?
298 ## Let's do an experiment to find out
300 1. Load the whole corpus into a string (sanitised)
301 2. Select a random chunk of plaintext and a random key
303 4. Score
1 point if `caesar_cipher_break()` recovers the correct key
304 5. Repeat many times and with many plaintext lengths
310 with open('caesar_break_parameter_trials.csv', 'w') as f:
311 writer = csv.DictWriter(f, ['name'] + message_lengths,
312 quoting=csv.QUOTE_NONNUMERIC)
314 for scoring in sorted(scores.keys()):
315 scores[scoring]['name'] = scoring
316 writer.writerow(scores[scoring])
320 <script src=
"http://gnab.github.io/remark/downloads/remark-0.6.0.min.js" type=
"text/javascript">
323 <script type=
"text/javascript"
324 src=
"http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&delayStartupUntil=configured"></script>
326 <script type=
"text/javascript">
327 var slideshow = remark.create({ ratio:
"16:9" });
332 skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
335 MathJax.Hub.Queue(function() {
336 $(MathJax.Hub.getAllJax()).map(function(index, elem) {
337 return(elem.SourceElement());
338 }).parent().addClass('has-jax');
340 MathJax.Hub.Configured();