slides/alternative-plaintext-scoring.html

   1 <!DOCTYPE html>
   2 <html>
   3   <head>
   4     <title>Alternative plaintext scoring</title>
   5     <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
   6     <style type="text/css">
   7       /* Slideshow styles */
   8       body {
   9         font-size: 20px;
  10       }
  11       h1, h2, h3 {
  12         font-weight: 400;
  13         margin-bottom: 0;
  14       }
  15       h1 { font-size: 3em; }
  16       h2 { font-size: 2em; }
  17       h3 { font-size: 1.6em; }
  18       a, a > code {
  19         text-decoration: none;
  20       }
  21       code {
  22         -moz-border-radius: 5px;
   23         -webkit-border-radius: 5px;
  24         background: #e7e8e2;
  25         border-radius: 5px;
  26         font-size: 16px;
  27       }
  28       .plaintext {
  29         background: #272822;
  30         color: #80ff80;
  31         text-shadow: 0 0 20px #333;
  32         padding: 2px 5px;
  33       }
  34       .ciphertext {
  35         background: #272822;
  36         color: #ff6666;
  37         text-shadow: 0 0 20px #333;
  38         padding: 2px 5px;
  39       }
  40       .indexlink {
  41         position: absolute;
  42         bottom: 1em;
  43         left: 1em;
  44       }
   45       .float-right {
  46         float: right;
  47       }
  48     </style>
  49   </head>
  50   <body>
  51     <textarea id="source">
  52
  53 # Alternative plaintext scoring methods
  54
  55 ---
  56
  57 layout: true
  58
  59 .indexlink[[Index](index.html)]
  60
  61 ---
  62
   63 # Back to letter frequency counts
  64
  65 Letter | Count
  66 -------|------
  67 a | 489107
  68 b | 92647
  69 c | 140497
  70 d | 267381
  71 e | 756288
  72 . | .
  73 . | .
  74 . | .
  75 z | 3575
  76
   77 Another way of thinking about this table is as a 26-dimensional vector, with one component per letter.
   78
   79 Create one vector from our text and one from idealised English.
  80
  81 The distance between the vectors is how far from English the text is.
  82
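---

# Sketch: turning text into a vector

A minimal sketch of building the 26-dimensional vector from a piece of text. The name `letter_vector` is an assumption made for illustration; it is not defined elsewhere in these slides.

```python
import string
from collections import Counter

def letter_vector(text):
    """Return a list of 26 letter counts, ordered a to z (hypothetical helper)."""
    # Ignore case and skip anything that is not a plain letter
    counts = Counter(c for c in text.lower() if c in string.ascii_lowercase)
    return [counts[letter] for letter in string.ascii_lowercase]

print(letter_vector("Hello, world!"))
```
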
  83 ---
  84
  85 # Vector distances
  86
  87 .float-right[![right-aligned Vector subtraction](vector-subtraction.svg)]
  88
   89 Several different distance measures (__metrics__), each the __norm__ of the difference between the vectors:
  90
  91 * L<sub>2</sub> norm (Euclidean distance):
  92 `\(\|\mathbf{a} - \mathbf{b}\| = \sqrt{\sum_i (\mathbf{a}_i - \mathbf{b}_i)^2} \)`
  93
  94 * L<sub>1</sub> norm (Manhattan distance, taxicab distance):
  95 `\(\|\mathbf{a} - \mathbf{b}\| = \sum_i |\mathbf{a}_i - \mathbf{b}_i| \)`
  96
  97 * L<sub>3</sub> norm:
  98 `\(\|\mathbf{a} - \mathbf{b}\| = \sqrt[3]{\sum_i |\mathbf{a}_i - \mathbf{b}_i|^3} \)`
  99
 100 The higher the power used, the more weight is given to the largest differences in components.
 101
 102 (Extends out to:
 103
 104 * L<sub>0</sub> norm (Hamming distance):
 105 `$$\|\mathbf{a} - \mathbf{b}\| = \sum_i \left\{
 106 \begin{matrix} 1 &amp;\mbox{if}\ \mathbf{a}_i \neq \mathbf{b}_i , \\
 107  0 &amp;\mbox{if}\ \mathbf{a}_i = \mathbf{b}_i \end{matrix} \right. $$`
 108
 109 * L<sub>&infin;</sub> norm:
  110 `\(\|\mathbf{a} - \mathbf{b}\| = \max_i{|\mathbf{a}_i - \mathbf{b}_i|} \)`
 111
 112 neither of which will be that useful here, but they keep cropping up.)
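
---

# Sketch: the distance measures in code

One possible way to write three of these distances, assuming each vector is a plain list of numbers. The function names follow the `norms.l1`, `norms.l2` and `norms.l3` names used later in the slides, but the bodies here are illustrative guesses, not the real module.

```python
def l1(a, b):
    """Manhattan distance: sum of absolute differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def l2(a, b):
    """Euclidean distance: square root of the sum of squared differences."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5

def l3(a, b):
    """Cube root of the sum of cubed absolute differences."""
    return sum(abs(x - y) ** 3 for x, y in zip(a, b)) ** (1 / 3)

# Two components that differ by 3 and by 4
a, b = [0, 0], [3, 4]
for distance in (l1, l2, l3):
    print(distance.__name__, distance(a, b))
```

As the power rises, the result moves towards the largest single difference.
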
 113 ---
 114
 115 # Normalisation of vectors
 116
  117 Frequency vectors drawn from texts of different sizes will have different lengths: a longer text gives larger counts. For a fair comparison we need to scale them.
 118
  119 * Euclidean scaling (vector with unit length): `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|_2} = \frac{\mathbf{x}}{ \sqrt{\mathbf{x}_1^2 + \mathbf{x}_2^2 + \mathbf{x}_3^2 + \dots } }$$`
  120
  121 * Normalisation (components of vector sum to 1): `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|_1} = \frac{\mathbf{x}}{ \mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3 + \dots }$$`
 122
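---

# Sketch: scaling in code

A rough sketch of the two scalings, assuming vectors are plain lists of non-negative counts with at least one non-zero entry. The names echo `norms.normalise` and `norms.euclidean_scale` from later slides; the bodies are assumptions.

```python
def normalise(v):
    """Scale v so that its components sum to 1."""
    total = sum(v)
    return [x / total for x in v]

def euclidean_scale(v):
    """Scale v so that its Euclidean length is 1."""
    length = sum(x ** 2 for x in v) ** 0.5
    return [x / length for x in v]

counts = [3, 4]
print(normalise(counts))
print(euclidean_scale(counts))
```
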
 123 ---
 124
 125 # Angle, not distance
 126
 127 Rather than looking at the distance between the vectors, look at the angle between them.
 128
 129 .float-right[![right-aligned Vector dot product](vector-dot-product.svg)]
 130
 131 Vector dot product shows how much of one vector lies in the direction of another:
 132 `\( \mathbf{A} \bullet \mathbf{B} =
 133 \| \mathbf{A} \| \cdot \| \mathbf{B} \| \cos{\theta} \)`
 134
  135 But,
  136 `\( \mathbf{A} \bullet \mathbf{B} = \sum_i \mathbf{A}_i \cdot \mathbf{B}_i \)`
  137 and `\( \| \mathbf{A} \| = \sqrt{\sum_i \mathbf{A}_i^2} \)`
  138
  139 A bit of rearranging gives the cosine similarity:
  140 `$$ \cos{\theta} = \frac{ \mathbf{A} \bullet \mathbf{B} }{ \| \mathbf{A} \| \cdot \| \mathbf{B} \| } =
  141 \frac{\sum_i \mathbf{A}_i \cdot \mathbf{B}_i}{\sqrt{\sum_i \mathbf{A}_i^2} \cdot \sqrt{\sum_i \mathbf{B}_i^2}} $$`
 142
 143 This is independent of vector lengths!
 144
  145 Cosine similarity is 1 if the vectors are parallel, 0 if perpendicular, -1 if antiparallel. Letter counts are never negative, so here it lies between 0 and 1.
 146
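---

# Sketch: cosine similarity in code

A small sketch of the formula above, assuming plain lists of numbers and non-zero vectors. The name matches `norms.cosine_similarity` from later slides, but the body is an illustrative guess.

```python
def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = sum(x ** 2 for x in a) ** 0.5
    length_b = sum(y ** 2 for y in b) ** 0.5
    return dot / (length_a * length_b)

# Same direction but different lengths, then perpendicular
print(cosine_similarity([1, 2], [10, 20]))
print(cosine_similarity([1, 0], [0, 1]))
```
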
 147 ---
 148
 149 # Which is best?
 150
 151    | Euclidean | Normalised
 152 ---|-----------|------------
 153 L1 |     x     |      x
 154 L2 |     x     |      x
 155 L3 |     x     |      x
 156 Cosine |     x     |      x
 157
 158 And the probability measure!
 159
 160 * Nine different ways of measuring fitness.
 161
  162 ## Computing is an empirical science
 163
 164 Let's do some experiments to find the best solution!
 165
 166 ---
 167
 168 # Experimental harness
 169
 170 ## Step 1: build some other scoring functions
 171
  172 We need a way of passing the different functions to the key-finding function.
  173
  174 ## Step 2: find the best scoring function
  175
  176 Try them all on random ciphertexts and see which one works best.
 177
 178 ---
 179
 180 # Functions are values!
 181
 182 ```python
 183 >>> Pletters
 184 <function Pletters at 0x7f60e6d9c4d0>
 185 ```
 186
 187 ```python
 188 def caesar_break(message, fitness=Pletters):
  189     """Breaks a Caesar cipher using frequency analysis"""
  190     ...
 191     for shift in range(26):
 192         plaintext = caesar_decipher(message, shift)
 193         fit = fitness(plaintext)
 194 ```
 195
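---

# Sketch: passing a function as an argument

A self-contained toy version of the idea, to show a fitness function being handed over like any other value. The helper `count_e`, the body of `caesar_decipher` and the `(shift, fit)` return value are all assumptions for this sketch, not the course code.

```python
import string

def caesar_decipher(message, shift):
    """Shift each lowercase letter back by `shift` places (assumed behaviour)."""
    letters = string.ascii_lowercase
    table = str.maketrans(letters, letters[-shift:] + letters[:-shift])
    return message.translate(table)

def count_e(text):
    """Toy fitness function: more 'e's means a better score."""
    return text.count('e')

def caesar_break(message, fitness):
    """Try every shift and keep the one the fitness function likes best."""
    best_shift, best_fit = 0, float('-inf')
    for shift in range(26):
        plaintext = caesar_decipher(message, shift)
        fit = fitness(plaintext)
        if fit > best_fit:
            best_shift, best_fit = shift, fit
    return best_shift, best_fit

# The function itself is the argument: note there are no brackets after count_e
print(caesar_break("phhw ph khuh", count_e))
```
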
 196 ---
 197
 198 # Changing the comparison function
 199
 200 * Must be a function that takes a text and returns a score
  201     * A better fit must give a higher score, the opposite of the vector distance norms, so those are inverted
 202
 203 ```python
 204 def make_frequency_compare_function(target_frequency, frequency_scaling, metric, invert):
 205     def frequency_compare(text):
 206         ...
 207         return score
 208     return frequency_compare
 209 ```
 210
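---

# Sketch: filling in the closure

One way the elided body might look, as a guess rather than the actual implementation. It assumes vectors are plain lists; `letter_counts`, `normalise` and `l1` are hypothetical stand-ins, and the target is built from a short sample instead of a real corpus.

```python
import string

def letter_counts(text):
    """Hypothetical helper: 26 letter counts, a to z."""
    text = text.lower()
    return [text.count(letter) for letter in string.ascii_lowercase]

def normalise(v):
    total = sum(v)
    return [x / total for x in v]

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def make_frequency_compare_function(target_frequency, frequency_scaling, metric, invert):
    def frequency_compare(text):
        counts = frequency_scaling(letter_counts(text))
        score = metric(target_frequency, counts)
        # Distances shrink as the fit improves, so negate them
        return -score if invert else score
    return frequency_compare

# Stand-in target: a real run would use counts from a large English corpus
target = normalise(letter_counts("a stand-in for a large english corpus"))
score_l1 = make_frequency_compare_function(target, normalise, l1, invert=True)

print(score_l1("an english sentence") > score_l1("zzzz qqqq xxxx"))
```

The returned function remembers `target_frequency`, `frequency_scaling`, `metric` and `invert`, so callers only ever pass it a text.
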
 211 ---
 212
 213 # Data-driven processing
 214
 215 ```python
 216 metrics = [{'func': norms.l1, 'invert': True, 'name': 'l1'},
 217     {'func': norms.l2, 'invert': True, 'name': 'l2'},
 218     {'func': norms.l3, 'invert': True, 'name': 'l3'},
 219     {'func': norms.cosine_similarity, 'invert': False, 'name': 'cosine_similarity'}]
 220 scalings = [{'corpus_frequency': normalised_english_counts,
 221          'scaling': norms.normalise,
 222          'name': 'normalised'},
 223         {'corpus_frequency': euclidean_scaled_english_counts,
 224          'scaling': norms.euclidean_scale,
 225          'name': 'euclidean_scaled'}]
 226 ```
 227
  228 Use this to make the eight vector-based scoring functions; `Pletters` is the ninth.
 229
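---

# Sketch: looping over the tables

A cut-down sketch of how tables like these could drive the construction of the scoring functions. It uses two metrics instead of four, plain lists for vectors, and a one-sentence stand-in for the English counts; every function body here is an assumption. Texts with no letters at all are not handled.

```python
import string

def letter_counts(text):
    """Hypothetical helper: 26 letter counts, a to z."""
    text = text.lower()
    return [text.count(letter) for letter in string.ascii_lowercase]

def normalise(v):
    total = sum(v)
    return [x / total for x in v]

def euclidean_scale(v):
    length = sum(x ** 2 for x in v) ** 0.5
    return [x / length for x in v]

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sum(x ** 2 for x in a) ** 0.5 * sum(y ** 2 for y in b) ** 0.5)

def make_frequency_compare_function(target_frequency, frequency_scaling, metric, invert):
    def frequency_compare(text):
        counts = frequency_scaling(letter_counts(text))
        score = metric(target_frequency, counts)
        return -score if invert else score
    return frequency_compare

# Stand-in for real English counts, just to make the sketch run
english_counts = letter_counts("the quick brown fox jumps over the lazy dog")

metrics = [{'func': l1, 'invert': True, 'name': 'l1'},
           {'func': cosine_similarity, 'invert': False, 'name': 'cosine_similarity'}]
scalings = [{'corpus_frequency': normalise(english_counts),
             'scaling': normalise, 'name': 'normalised'},
            {'corpus_frequency': euclidean_scale(english_counts),
             'scaling': euclidean_scale, 'name': 'euclidean_scaled'}]

# One scoring function per (metric, scaling) pair
scorers = {}
for metric in metrics:
    for scaling in scalings:
        name = metric['name'] + ' + ' + scaling['name']
        scorers[name] = make_frequency_compare_function(
            scaling['corpus_frequency'], scaling['scaling'],
            metric['func'], metric['invert'])

for name, scorer in scorers.items():
    print(name, scorer("hello world"))
```

Adding another metric or scaling is then a one-line change to a table, with no new control flow.
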
 230
 231     </textarea>
 232     <script src="http://gnab.github.io/remark/downloads/remark-0.6.0.min.js" type="text/javascript">
 233     </script>
 234
 235     <script type="text/javascript"
 236       src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&delayStartupUntil=configured"></script>
 237
 238     <script type="text/javascript">
 239       var slideshow = remark.create({ ratio: "16:9" });
 240
 241       // Setup MathJax
 242       MathJax.Hub.Config({
 243         tex2jax: {
 244         skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
 245         }
 246       });
  247       MathJax.Hub.Queue(function() {
  248         MathJax.Hub.getAllJax().forEach(function(elem) {
  249             elem.SourceElement().parentNode.classList.add('has-jax');
  250         });
  251       });
 252       MathJax.Hub.Configured();
 253     </script>
 254   </body>
 255 </html>