<title>Breaking Caesar ciphers</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<style type="text/css">
h1 { font-size: 3em; }
h2 { font-size: 2em; }
h3 { font-size: 1.6em; }
text-decoration: none;
-moz-border-radius: 5px;
-webkit-border-radius: 5px;
text-shadow: 0 0 20px #333;
text-shadow: 0 0 20px #333;
<textarea id="source">
# Breaking Caesar ciphers
![centre-aligned Caesar wheel](caesarwheel1.gif)
decipher with this key
how close is it to English?

What steps do we know how to do?
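The brute-force attack above can be sketched as a short loop. The `decipher` and `score` arguments stand in for whatever Caesar-decipher function and English-closeness measure we build later; the names are illustrative, not fixed:

```python
# Minimal sketch of the brute-force attack. The assumed helpers are
# decipher(text, key), which undoes a Caesar shift, and score(text),
# which says how English-like the result looks (higher is better).
def caesar_break(ciphertext, decipher, score):
    best_key, best_score = 0, float('-inf')
    for key in range(26):  # only 26 possible keys
        candidate = decipher(ciphertext, key)
        if score(candidate) > best_score:
            best_key, best_score = key, score(candidate)
    return best_key
```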
# How close is it to English?
What does English look like?
* We need a model of English.

How do we define "closeness"?
# What does English look like?

## Abstraction: frequency of letter counts
One way of thinking about this is a 26-dimensional vector.

Create a vector of our text, and one of idealised English.

The distance between the vectors is how far from English the text is.
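As a sketch, building that 26-dimensional vector from a piece of text could look like this (the function name is my own):

```python
from collections import Counter
from string import ascii_lowercase

def letter_vector(text):
    """26-dimensional vector of letter counts, ignoring non-letters."""
    counts = Counter(c for c in text.lower() if c in ascii_lowercase)
    return [counts[letter] for letter in ascii_lowercase]
```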
## FIXME! Diagram of vector subtraction here
Several different distance measures (__metrics__, also called __norms__):
* L<sub>2</sub> norm (Euclidean distance): `\(|\mathbf{x} - \mathbf{y}| = \sqrt{\sum_i (\mathbf{x}_i - \mathbf{y}_i)^2} \)`
* L<sub>1</sub> norm (Manhattan distance, taxicab distance): `\(|\mathbf{x} - \mathbf{y}| = \sum_i |\mathbf{x}_i - \mathbf{y}_i| \)`
* L<sub>3</sub> norm: `\(|\mathbf{x} - \mathbf{y}| = \sqrt[3]{\sum_i |\mathbf{x}_i - \mathbf{y}_i|^3} \)`
The higher the power used, the more weight is given to the largest differences in components.
* L<sub>0</sub> norm (Hamming distance): `\(|\mathbf{x} - \mathbf{y}| = \sum_i \left\{\begin{matrix} 1 &\mbox{if}\ \mathbf{x}_i \neq \mathbf{y}_i , \\ 0 &\mbox{if}\ \mathbf{x}_i = \mathbf{y}_i \end{matrix} \right. \)`
* L<sub>∞</sub> norm: `\(|\mathbf{x} - \mathbf{y}| = \max_i{|\mathbf{x}_i - \mathbf{y}_i|} \)`
(Neither of these extremes will be that useful here.)
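The L<sub>1</sub>, L<sub>2</sub> and L<sub>3</sub> norms differ only in the power used, so one sketch covers all three (the function name is illustrative):

```python
def lp_distance(xs, ys, p=2):
    """L_p distance between two vectors: p=1 Manhattan, p=2 Euclidean, p=3, ..."""
    return sum(abs(x - y) ** p for x, y in zip(xs, ys)) ** (1 / p)
```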
# Normalisation of vectors
Frequency distributions drawn from different sources will have different lengths. For a fair comparison we need to scale them.
* Euclidean scaling (vector with unit length): `\( \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \sqrt{\mathbf{x}_1^2 + \mathbf{x}_2^2 + \mathbf{x}_3^2 + \dots } }\)`
* Normalisation (components of vector sum to 1): `\( \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3 + \dots }\)`
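Both scalings fall straight out of the formulas; a sketch (names are my own):

```python
import math

def scale_euclidean(xs):
    """Scale a vector to unit length (L2 norm = 1)."""
    length = math.sqrt(sum(x * x for x in xs))
    return [x / length for x in xs]

def scale_sum(xs):
    """Scale a vector so its components sum to 1 (a probability distribution)."""
    total = sum(xs)
    return [x / total for x in xs]
```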
# Angle, not distance
Rather than looking at the distance between the vectors, look at the angle between them.
Vector dot product shows how much of one vector lies in the direction of another:
`\( \mathbf{A} \bullet \mathbf{B} = \| \mathbf{A} \| \cdot \| \mathbf{B} \| \cos{\theta} \)`
`\( \mathbf{A} \bullet \mathbf{B} = \sum_i \mathbf{A}_i \cdot \mathbf{B}_i \)`
and `\( \| \mathbf{A} \| = \sqrt{\sum_i \mathbf{A}_i^2} \)`
A bit of rearranging gives the cosine similarity:
`\( \cos{\theta} = \frac{ \mathbf{A} \bullet \mathbf{B} }{ \| \mathbf{A} \| \cdot \| \mathbf{B} \| } =
\frac{\sum_i \mathbf{A}_i \cdot \mathbf{B}_i}{\sqrt{\sum_i \mathbf{A}_i^2} \times \sqrt{\sum_i \mathbf{B}_i^2}} \)`
This is independent of vector lengths!
Cosine similarity is 1 if the vectors point in the same direction, 0 if they are at 90°, and −1 if they are antiparallel.
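A direct sketch of the formula; note that both vector lengths take a square root:

```python
import math

def cosine_similarity(xs, ys):
    """cos(theta) = A.B / (|A| |B|), with both lengths square-rooted."""
    dot = sum(x * y for x, y in zip(xs, ys))
    length_x = math.sqrt(sum(x * x for x in xs))
    length_y = math.sqrt(sum(y * y for y in ys))
    return dot / (length_x * length_y)
```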
## FIXME! Cosine distance bug: frequencies2 length not squared.
## FIXME! Diagram of vector dot product
<script src="http://gnab.github.io/remark/downloads/remark-0.6.0.min.js" type="text/javascript">
<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&delayStartupUntil=configured"></script>
<script type="text/javascript">
var slideshow = remark.create({ ratio: "16:9" });
skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
MathJax.Hub.Queue(function() {
  $(MathJax.Hub.getAllJax()).map(function(index, elem) {
    return(elem.SourceElement());
  }).parent().addClass('has-jax');
MathJax.Hub.Configured();