<!DOCTYPE html>
<html>
  <head>
    <title>Breaking Caesar ciphers</title>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    <style type="text/css">
      /* Slideshow styles */
      body {
        font-size: 20px;
      }
      h1, h2, h3 {
        font-weight: 400;
        margin-bottom: 0;
      }
      h1 { font-size: 3em; }
      h2 { font-size: 2em; }
      h3 { font-size: 1.6em; }
      a, a > code {
        text-decoration: none;
      }
      code {
        -moz-border-radius: 5px;
        -webkit-border-radius: 5px;
        background: #e7e8e2;
        border-radius: 5px;
        font-size: 16px;
      }
      .plaintext {
        background: #272822;
        color: #80ff80;
        text-shadow: 0 0 20px #333;
        padding: 2px 5px;
      }
      .ciphertext {
        background: #272822;
        color: #ff6666;
        text-shadow: 0 0 20px #333;
        padding: 2px 5px;
      }
    </style>
  </head>
  <body>
    <textarea id="source">

# Breaking Caesar ciphers

![centre-aligned Caesar wheel](caesarwheel1.gif)

---

# Brute force

How many keys to try?

## Basic idea

```
for each key:
    decipher with this key
    how close is it to English?
    remember the best key
```

What steps do we know how to do?
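
A rough Python sketch of that loop, assuming we already have a `caesar_decipher(text, key)` function and an `englishness(text)` scoring function (both hypothetical names here):

```
def caesar_break(ciphertext):
    # Try all 26 keys; keep the key whose decipherment
    # looks most like English.
    best_key, best_score = 0, float('-inf')
    for key in range(26):
        plaintext = caesar_decipher(ciphertext, key)  # assumed helper
        score = englishness(plaintext)                # assumed helper
        if score > best_score:
            best_key, best_score = key, score
    return best_key
```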

---

# How close is it to English?

What does English look like?
* We need a model of English.

How do we define "closeness"?

---

# What does English look like?

## Abstraction: letter frequency counts

Letter | Count
-------|-------
a      | 489107
b      |  92647
c      | 140497
d      | 267381
e      | 756288
.      | .
.      | .
.      | .
z      |   3575

One way of thinking about this is as a 26-dimensional vector.

Create one vector from our text, and another from idealised English.

The distance between the vectors shows how far our text is from English.
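
As a sketch, building such a vector from a text needs only the standard library:

```
import collections
import string

def letter_vector(text):
    # Count the 26 letters, ignoring case and any other characters
    counts = collections.Counter(c for c in text.lower()
                                 if c in string.ascii_lowercase)
    return [counts[letter] for letter in string.ascii_lowercase]
```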

---

# Vector distances

## FIXME! Diagram of vector subtraction here

Several different distance measures (__metrics__, each derived from a __norm__):

* L<sub>2</sub> norm (Euclidean distance): `\(|\mathbf{x} - \mathbf{y}| = \sqrt{\sum_i (\mathbf{x}_i - \mathbf{y}_i)^2} \)`

* L<sub>1</sub> norm (Manhattan distance, taxicab distance): `\(|\mathbf{x} - \mathbf{y}| = \sum_i |\mathbf{x}_i - \mathbf{y}_i| \)`

* L<sub>3</sub> norm: `\(|\mathbf{x} - \mathbf{y}| = \sqrt[3]{\sum_i |\mathbf{x}_i - \mathbf{y}_i|^3} \)`

The higher the power used, the more weight is given to the largest differences in components.

(The family extends to:

* L<sub>0</sub> norm (Hamming distance): `\(|\mathbf{x} - \mathbf{y}| = \sum_i \left\{\begin{matrix} 1 &amp;\mbox{if}\ \mathbf{x}_i \neq \mathbf{y}_i , \\ 0 &amp;\mbox{if}\ \mathbf{x}_i = \mathbf{y}_i \end{matrix} \right. \)`

* L<sub>&infin;</sub> norm: `\(|\mathbf{x} - \mathbf{y}| = \max_i{|\mathbf{x}_i - \mathbf{y}_i|} \)`

neither of which will be that useful here.)
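
A sketch of the useful ones in Python:

```
def l1_distance(xs, ys):
    # Manhattan / taxicab distance
    return sum(abs(x - y) for x, y in zip(xs, ys))

def l2_distance(xs, ys):
    # Euclidean distance
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) ** 0.5

def l3_distance(xs, ys):
    return sum(abs(x - y) ** 3 for x, y in zip(xs, ys)) ** (1.0 / 3.0)
```
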
---

# Normalisation of vectors

Frequency vectors drawn from texts of different lengths will have different magnitudes. For a fair comparison we need to scale them.

* Euclidean scaling (vector with unit length): `\( \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|_2} = \frac{\mathbf{x}}{ \sqrt{\mathbf{x}_1^2 + \mathbf{x}_2^2 + \mathbf{x}_3^2 + \dots } }\)`

* Normalisation (components of vector sum to 1): `\( \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|_1} = \frac{\mathbf{x}}{ \mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3 + \dots }\)`
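
Both scalings are short functions in Python (a sketch):

```
def scale_unit_length(xs):
    # Divide by the L2 norm: the result has length 1
    length = sum(x ** 2 for x in xs) ** 0.5
    return [x / length for x in xs]

def scale_to_sum(xs):
    # Divide by the L1 norm: the components sum to 1
    total = sum(xs)
    return [x / total for x in xs]
```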

---

# Angle, not distance

Rather than looking at the distance between the vectors, look at the angle between them.

The vector dot product shows how much of one vector lies in the direction of another:
`\( \mathbf{A} \bullet \mathbf{B} =
\| \mathbf{A} \| \cdot \| \mathbf{B} \| \cos{\theta} \)`

But,
`\( \mathbf{A} \bullet \mathbf{B} = \sum_i \mathbf{A}_i \cdot \mathbf{B}_i \)`
and `\( \| \mathbf{A} \| = \sqrt{\sum_i \mathbf{A}_i^2} \)`

A bit of rearranging gives the cosine similarity:
`\( \cos{\theta} = \frac{ \mathbf{A} \bullet \mathbf{B} }{ \| \mathbf{A} \| \cdot \| \mathbf{B} \| } =
\frac{\sum_i \mathbf{A}_i \cdot \mathbf{B}_i}{\sqrt{\sum_i \mathbf{A}_i^2} \times \sqrt{\sum_i \mathbf{B}_i^2}} \)`

This is independent of vector lengths!

Cosine similarity is 1 if the vectors point in the same direction, 0 if they are at 90°, -1 if they are antiparallel.
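
In Python, a sketch (note that both vector lengths in the denominator are square-rooted):

```
def cosine_similarity(xs, ys):
    # Dot product divided by the product of the two vector lengths
    dot = sum(x * y for x, y in zip(xs, ys))
    length_x = sum(x ** 2 for x in xs) ** 0.5
    length_y = sum(y ** 2 for y in ys) ** 0.5
    return dot / (length_x * length_y)
```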

## FIXME! Diagram of vector dot product

    </textarea>
    <script src="http://gnab.github.io/remark/downloads/remark-0.6.0.min.js" type="text/javascript">
    </script>

    <script type="text/javascript"
      src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&delayStartupUntil=configured"></script>

    <script type="text/javascript">
      var slideshow = remark.create({ ratio: "16:9" });

      // Set up MathJax: don't typeset inside tags that hold code or slide source
      MathJax.Hub.Config({
        tex2jax: {
          skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
        }
      });
      // Tag the container of each typeset formula so it can be styled.
      // (Plain DOM calls: jQuery is not loaded on this page.)
      MathJax.Hub.Queue(function() {
        MathJax.Hub.getAllJax().forEach(function(jax) {
          jax.SourceElement().parentNode.className += ' has-jax';
        });
      });
      MathJax.Hub.Configured();
    </script>
  </body>
</html>