<!DOCTYPE html>
<html>
<head>
<title>Breaking Caesar ciphers</title>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
<style type="text/css">
  /* Slideshow styles */
  body {
    font-size: 20px;
  }
  h1, h2, h3 {
    font-weight: 400;
    margin-bottom: 0;
  }
  h1 { font-size: 3em; }
  h2 { font-size: 2em; }
  h3 { font-size: 1.6em; }
  a, a > code {
    text-decoration: none;
  }
  code {
    -moz-border-radius: 5px;
    -webkit-border-radius: 5px;
    background: #e7e8e2;
    border-radius: 5px;
    font-size: 16px;
  }
  .plaintext {
    background: #272822;
    color: #80ff80;
    text-shadow: 0 0 20px #333;
    padding: 2px 5px;
  }
  .ciphertext {
    background: #272822;
    color: #ff6666;
    text-shadow: 0 0 20px #333;
    padding: 2px 5px;
  }
</style>
</head>
<body>
<textarea id="source">

# Breaking Caesar ciphers

![centre-aligned Caesar wheel](caesarwheel1.gif)

---

# Human vs Machine

Slow but clever vs Dumb but fast

## Human approach

Ciphertext | Plaintext
---|---
![left-aligned Ciphertext frequencies](c1a_frequency_histogram.png) | ![left-aligned English frequencies](english_frequency_histogram.png)

---

# Human vs Machine

## Machine approach

Brute force.

Try all keys.

* How many keys to try?

## Basic idea

```
for each key:
    decipher with this key
    how close is it to English?
    remember the best key
```

What steps do we know how to do?
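
Putting those steps together, a minimal Python sketch of the loop might look like this. It assumes two helpers that are not defined on these slides: a hypothetical `caesar_decipher(text, key)` and an English-closeness score `english_score(text)`.

```
def break_caesar(ciphertext):
    # Sketch only: caesar_decipher and english_score are assumed helpers.
    best_key, best_score = 0, float('-inf')
    for key in range(26):                    # try every possible key
        plaintext = caesar_decipher(ciphertext, key)
        score = english_score(plaintext)     # how close is it to English?
        if score > best_score:               # remember the best key
            best_key, best_score = key, score
    return best_key, best_score
```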

---

# How close is it to English?

What does English look like?

* We need a model of English.

How do we define "closeness"?

---

# What does English look like?

## Abstraction: letter frequency counts

Letter | Count
-------|------
a | 489107
b | 92647
c | 140497
d | 267381
e | 756288
. | .
. | .
. | .
z | 3575

One way of thinking about this is as a 26-dimensional vector.

Create a vector for our text, and one for idealised English.

The distance between the vectors measures how far the text is from English.
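
For example, the letter-count vector of a piece of text can be built in a few lines of Python (a sketch using only the standard library):

```
import collections
import string

def letter_vector(text):
    # Count the letters a-z, ignoring case and non-letters,
    # and return the counts as a 26-element list.
    counts = collections.Counter(c for c in text.lower()
                                 if c in string.ascii_lowercase)
    return [counts[letter] for letter in string.ascii_lowercase]
```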

---

# Vector distances


## FIXME! Diagram of vector subtraction here

Several different distance measures (__metrics__, also called __norms__):

* L<sub>2</sub> norm (Euclidean distance): `\(\|\mathbf{x} - \mathbf{y}\| = \sqrt{\sum_i (\mathbf{x}_i - \mathbf{y}_i)^2} \)`

* L<sub>1</sub> norm (Manhattan distance, taxicab distance): `\(\|\mathbf{x} - \mathbf{y}\| = \sum_i |\mathbf{x}_i - \mathbf{y}_i| \)`

* L<sub>3</sub> norm: `\(\|\mathbf{x} - \mathbf{y}\| = \sqrt[3]{\sum_i |\mathbf{x}_i - \mathbf{y}_i|^3} \)`

The higher the power used, the more weight is given to the largest differences in components.

(This family extends to:

* L<sub>0</sub> norm (Hamming distance): `\(\|\mathbf{x} - \mathbf{y}\| = \sum_i \left\{\begin{matrix} 1 &amp;\mbox{if}\ \mathbf{x}_i \neq \mathbf{y}_i , \\ 0 &amp;\mbox{if}\ \mathbf{x}_i = \mathbf{y}_i \end{matrix} \right. \)`

* L<sub>&infin;</sub> norm: `\(\|\mathbf{x} - \mathbf{y}\| = \max_i{|\mathbf{x}_i - \mathbf{y}_i|} \)`

neither of which will be that useful.)
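
The three useful distances above translate directly into Python (a sketch; `x` and `y` are plain lists of numbers of equal length):

```
def l1_distance(x, y):
    # Manhattan distance: sum of absolute differences
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

def l2_distance(x, y):
    # Euclidean distance: square root of the sum of squared differences
    return sum((xi - yi) ** 2 for xi, yi in zip(x, y)) ** 0.5

def l3_distance(x, y):
    # Cube root of the sum of cubed absolute differences
    return sum(abs(xi - yi) ** 3 for xi, yi in zip(x, y)) ** (1 / 3)
```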

---

# Normalisation of vectors

Frequency distributions drawn from different sources will have different lengths. For a fair comparison we need to scale them.

* Euclidean scaling (vector with unit length): `\( \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \sqrt{\mathbf{x}_1^2 + \mathbf{x}_2^2 + \mathbf{x}_3^2 + \dots } }\)`

* Normalisation (components of vector sum to 1): `\( \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3 + \dots }\)`
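
Both scalings are a couple of lines of Python each (a sketch; assumes the vector has a non-zero length or sum):

```
def euclidean_scale(x):
    # Divide by the Euclidean (L2) length so the result has unit length.
    length = sum(xi ** 2 for xi in x) ** 0.5
    return [xi / length for xi in x]

def normalise(x):
    # Divide by the total so the components sum to 1.
    total = sum(x)
    return [xi / total for xi in x]
```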

---

# Angle, not distance

Rather than looking at the distance between the vectors, look at the angle between them.

The vector dot product shows how much of one vector lies in the direction of another:
`\( \mathbf{A} \bullet \mathbf{B} =
\| \mathbf{A} \| \cdot \| \mathbf{B} \| \cos{\theta} \)`

But,
`\( \mathbf{A} \bullet \mathbf{B} = \sum_i \mathbf{A}_i \cdot \mathbf{B}_i \)`
and `\( \| \mathbf{A} \| = \sqrt{\sum_i \mathbf{A}_i^2} \)`

A bit of rearranging gives the cosine similarity:
`\( \cos{\theta} = \frac{ \mathbf{A} \bullet \mathbf{B} }{ \| \mathbf{A} \| \cdot \| \mathbf{B} \| } =
\frac{\sum_i \mathbf{A}_i \cdot \mathbf{B}_i}{\sqrt{\sum_i \mathbf{A}_i^2} \times \sqrt{\sum_i \mathbf{B}_i^2}} \)`

This is independent of vector lengths!

Cosine similarity is 1 if the vectors point in the same direction, 0 if they are at 90°, and -1 if they are antiparallel.
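
As a sketch, cosine similarity is only a few lines of Python (standard library only; `a` and `b` are equal-length lists):

```
import math

def cosine_similarity(a, b):
    dot = sum(ai * bi for ai, bi in zip(a, b))
    length_a = math.sqrt(sum(ai ** 2 for ai in a))   # vector lengths, not squared lengths
    length_b = math.sqrt(sum(bi ** 2 for bi in b))
    return dot / (length_a * length_b)
```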

## FIXME! Cosine distance bug: frequencies2 length not squared.


## FIXME! Diagram of vector dot product

</textarea>
<script src="http://gnab.github.io/remark/downloads/remark-0.6.0.min.js" type="text/javascript">
</script>

<script type="text/javascript"
  src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&delayStartupUntil=configured"></script>

<script type="text/javascript">
var slideshow = remark.create({ ratio: "16:9" });

// Set up MathJax
MathJax.Hub.Config({
  tex2jax: {
    skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
  }
});
// Mark the parent of each rendered maths element with a 'has-jax' class
// (plain DOM, as jQuery is not loaded on this page).
MathJax.Hub.Queue(function() {
  MathJax.Hub.getAllJax().forEach(function(jax) {
    jax.SourceElement().parentNode.className += ' has-jax';
  });
});
MathJax.Hub.Configured();
</script>
</body>
</html>