<!DOCTYPE html>
<html>
  <head>
    <title>Breaking Caesar ciphers</title>
    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
    <style type="text/css">
      /* Slideshow styles */
      body {
        font-size: 20px;
      }
      h1, h2, h3 {
        font-weight: 400;
        margin-bottom: 0;
      }
      h1 { font-size: 3em; }
      h2 { font-size: 2em; }
      h3 { font-size: 1.6em; }
      a, a > code {
        text-decoration: none;
      }
      code {
        -moz-border-radius: 5px;
        -webkit-border-radius: 5px;
        background: #e7e8e2;
        border-radius: 5px;
        font-size: 16px;
      }
      .plaintext {
        background: #272822;
        color: #80ff80;
        text-shadow: 0 0 20px #333;
        padding: 2px 5px;
      }
      .ciphertext {
        background: #272822;
        color: #ff6666;
        text-shadow: 0 0 20px #333;
        padding: 2px 5px;
      }
    </style>
  </head>
  <body>
    <textarea id="source">

# Breaking Caesar ciphers

![centre-aligned Caesar wheel](caesarwheel1.gif)

---

# Brute force

How many keys to try?

## Basic idea

```
for each key:
    decipher with this key
    how close is it to English?
    remember the best key
```

What steps do we know how to do?
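
A rough Python sketch of that loop, assuming we already have a `caesar_decipher(text, key)` function and an `englishness(text)` scoring function (both hypothetical names here):

```
def caesar_break(ciphertext):
    # Try all 26 keys; keep the key whose decipherment
    # looks most like English.
    best_key, best_score = 0, float('-inf')
    for key in range(26):
        plaintext = caesar_decipher(ciphertext, key)  # assumed helper
        score = englishness(plaintext)                # assumed helper
        if score > best_score:
            best_key, best_score = key, score
    return best_key
```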

---

# How close is it to English?

What does English look like?
* We need a model of English.

How do we define "closeness"?

---

# What does English look like?

## Abstraction: letter frequency counts

Letter | Count
-------|-------
a      | 489107
b      |  92647
c      | 140497
d      | 267381
e      | 756288
.      | .
.      | .
.      | .
z      |   3575

One way of thinking about this is as a 26-dimensional vector.

Create one vector from our text, and another from idealised English.

The distance between the vectors shows how far our text is from English.
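
As a sketch, building such a vector from a text needs only the standard library:

```
import collections
import string

def letter_vector(text):
    # Count the 26 letters, ignoring case and any other characters
    counts = collections.Counter(c for c in text.lower()
                                 if c in string.ascii_lowercase)
    return [counts[letter] for letter in string.ascii_lowercase]
```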

---

# Vector distances

## FIXME! Diagram of vector subtraction here

Several different distance measures (__metrics__, each derived from a __norm__):

* L<sub>2</sub> norm (Euclidean distance): `\(|\mathbf{x} - \mathbf{y}| = \sqrt{\sum_i (\mathbf{x}_i - \mathbf{y}_i)^2} \)`

* L<sub>1</sub> norm (Manhattan distance, taxicab distance): `\(|\mathbf{x} - \mathbf{y}| = \sum_i |\mathbf{x}_i - \mathbf{y}_i| \)`

* L<sub>3</sub> norm: `\(|\mathbf{x} - \mathbf{y}| = \sqrt[3]{\sum_i |\mathbf{x}_i - \mathbf{y}_i|^3} \)`

The higher the power used, the more weight is given to the largest differences in components.

(The family extends to:

* L<sub>0</sub> norm (Hamming distance): `\(|\mathbf{x} - \mathbf{y}| = \sum_i \left\{\begin{matrix} 1 &amp;\mbox{if}\ \mathbf{x}_i \neq \mathbf{y}_i , \\ 0 &amp;\mbox{if}\ \mathbf{x}_i = \mathbf{y}_i \end{matrix} \right. \)`

* L<sub>&infin;</sub> norm: `\(|\mathbf{x} - \mathbf{y}| = \max_i{|\mathbf{x}_i - \mathbf{y}_i|} \)`

neither of which will be that useful here.)
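
A sketch of the useful ones in Python:

```
def l1_distance(xs, ys):
    # Manhattan / taxicab distance
    return sum(abs(x - y) for x, y in zip(xs, ys))

def l2_distance(xs, ys):
    # Euclidean distance
    return sum((x - y) ** 2 for x, y in zip(xs, ys)) ** 0.5

def l3_distance(xs, ys):
    return sum(abs(x - y) ** 3 for x, y in zip(xs, ys)) ** (1.0 / 3.0)
```
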
---

# Normalisation of vectors

Frequency vectors drawn from texts of different lengths will have different magnitudes. For a fair comparison we need to scale them.

* Euclidean scaling (vector with unit length): `\( \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|_2} = \frac{\mathbf{x}}{ \sqrt{\mathbf{x}_1^2 + \mathbf{x}_2^2 + \mathbf{x}_3^2 + \dots } }\)`

* Normalisation (components of vector sum to 1): `\( \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|_1} = \frac{\mathbf{x}}{ \mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3 + \dots }\)`
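
Both scalings are short functions in Python (a sketch):

```
def scale_unit_length(xs):
    # Divide by the L2 norm: the result has length 1
    length = sum(x ** 2 for x in xs) ** 0.5
    return [x / length for x in xs]

def scale_to_sum(xs):
    # Divide by the L1 norm: the components sum to 1
    total = sum(xs)
    return [x / total for x in xs]
```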

---

# Angle, not distance

Rather than looking at the distance between the vectors, look at the angle between them.

The vector dot product shows how much of one vector lies in the direction of another:
`\( \mathbf{A} \bullet \mathbf{B} =
\| \mathbf{A} \| \cdot \| \mathbf{B} \| \cos{\theta} \)`

But,
`\( \mathbf{A} \bullet \mathbf{B} = \sum_i \mathbf{A}_i \cdot \mathbf{B}_i \)`
and `\( \| \mathbf{A} \| = \sqrt{\sum_i \mathbf{A}_i^2} \)`

A bit of rearranging gives the cosine similarity:
`\( \cos{\theta} = \frac{ \mathbf{A} \bullet \mathbf{B} }{ \| \mathbf{A} \| \cdot \| \mathbf{B} \| } =
\frac{\sum_i \mathbf{A}_i \cdot \mathbf{B}_i}{\sqrt{\sum_i \mathbf{A}_i^2} \times \sqrt{\sum_i \mathbf{B}_i^2}} \)`

This is independent of vector lengths!

Cosine similarity is 1 if the vectors point in the same direction, 0 if they are at 90°, -1 if they are antiparallel.
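
In Python, a sketch (note that both vector lengths in the denominator are square-rooted):

```
def cosine_similarity(xs, ys):
    # Dot product divided by the product of the two vector lengths
    dot = sum(x * y for x, y in zip(xs, ys))
    length_x = sum(x ** 2 for x in xs) ** 0.5
    length_y = sum(y ** 2 for y in ys) ** 0.5
    return dot / (length_x * length_y)
```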

## FIXME! Diagram of vector dot product

    </textarea>
    <script src="http://gnab.github.io/remark/downloads/remark-0.6.0.min.js" type="text/javascript">
    </script>

    <script type="text/javascript"
      src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&delayStartupUntil=configured"></script>

    <script type="text/javascript">
      var slideshow = remark.create({ ratio: "16:9" });

      // Set up MathJax: don't typeset inside tags that hold code or slide source
      MathJax.Hub.Config({
        tex2jax: {
          skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
        }
      });
      // Tag the container of each typeset formula so it can be styled.
      // (Plain DOM calls: jQuery is not loaded on this page.)
      MathJax.Hub.Queue(function() {
        MathJax.Hub.getAllJax().forEach(function(jax) {
          jax.SourceElement().parentNode.className += ' has-jax';
        });
      });
      MathJax.Hub.Configured();
    </script>
  </body>
</html>