4 <title>Affine ciphers
</title>
5 <meta http-equiv=
"Content-Type" content=
"text/html; charset=UTF-8"/>
6 <style type=
"text/css">
15 h1 { font-size:
3em; }
16 h2 { font-size:
2em; }
17 h3 { font-size:
1.6em; }
19 text-decoration: none;
22 -moz-border-radius:
5px;
23 -web-border-radius:
5px;
31 text-shadow:
0 0 20px #
333;
37 text-shadow:
0 0 20px #
333;
47 <textarea id=
"source">
53 `making sense of this`
59 Ciphertext is re-split into groups to hide word bounaries.
61 * HELMU TSCOU SINSA REISU PPOSE KINDI NTHEI ROWNW AYBUT THERE ISLIT TLEWA RMTHI NTHEK INDNE SSIRE CEIVE
63 How can we rediscover the word boundaries?
65 * helmut s cousins are i suppose kind in their own way but there is little warmth in the kindness i receive
71 1. Try all possible word boundaries
72 2. Return the one that looks most like English
74 What's the complexity of this process?
76 * (We'll fix that in a bit...)
80 # What do we mean by
"looks like English"?
82 Naïve Bayes bag-of-words worked well for cipher breaking. Can we apply the same intuition here?
84 Probability of a bag-of-words (ignoring inter-word dependencies).
86 Finding the counts of words in text is harder than letters.
88 * More tokens, so need more data to cover sufficient words.
91 # Data sparsity and smoothing
93 `counts_1w.txt` is the
333,
333 most common words types, with number of tokens for each, collected by Google.
95 Doesn't cover a lot of words we want, such as proper nouns.
97 We'll have to guess the probability of unknown word.
99 Lots of ways to do this properly (Laplace smoothing, Good-Turing smoothing)...
101 ...but we'll ignore them all.
103 Assume unknown words have a count of
1.
107 # Storing word probabilities
109 We want something like a `defaultdict` but with our own default value
113 Constructor (`__init__`) takes a data file, does all the adding up and taking logs
115 `__missing__` handles the case when the key is missing
120 def __init__(self, data=[]):
121 for key, count in data2:
124 def __missing__(self, key):
135 # Testing the bag of words model
138 >>> 'hello' in Pw.keys()
140 >>> 'inigo' in Pw.keys()
142 >>> 'blj' in Pw.keys()
154 >>> Pwords(['hello'])
156 >>> Pwords(['hello', 'my'])
158 >>> Pwords(['hello', 'my', 'name'])
160 >>> Pwords(['hello', 'my', 'name', 'is'])
162 >>> Pwords(['hello', 'my', 'name', 'is', 'inigo'])
164 >>> Pwords(['hello', 'my', 'name', 'is', 'blj'])
170 # Splitting the input
174 find all possible splits into a first portion and remainder
176 segment the remainder
177 return the split with highest score
180 Indexing pulls out letters. `'sometext'[
0]` = 's' ; `'keyword'[
3]` = 'e' ; `'keyword'[-
1]` = 't'
182 Slices pulls out substrings. `'keyword'[
1:
4]` = 'ome' ; `'keyword'[:
3]` = 'som' ; `'keyword'[
5:]` = 'ext'
184 `range()` will sweep across the string
189 >>> splits('sometext')
190 [('s', 'ometext'), ('so', 'metext'), ('som', 'etext'), ('some', 'text'),
191 ('somet', 'ext'), ('somete', 'xt'), ('sometex', 't'), ('sometext', '')]
194 The last one is important
196 * What if this is the last word of the text?
200 <script src=
"http://gnab.github.io/remark/downloads/remark-0.6.0.min.js" type=
"text/javascript">
203 <script type=
"text/javascript"
204 src=
"http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&delayStartupUntil=configured"></script>
206 <script type=
"text/javascript">
207 var slideshow = remark.create({ ratio:
"16:9" });
212 skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
215 MathJax.Hub.Queue(function() {
216 $(MathJax.Hub.getAllJax()).map(function(index, elem) {
217 return(elem.SourceElement());
218 }).parent().addClass('has-jax');
220 MathJax.Hub.Configured();