slides/alternative-plaintext-scoring.html

   1 <!DOCTYPE html>
   2 <html>
   3   <head>
   4     <title>Alternative plaintext scoring</title>
   5     <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
   6     <style type="text/css">
   7       /* Slideshow styles */
   8       body {
   9         font-size: 20px;
  10       }
  11       h1, h2, h3 {
  12         font-weight: 400;
  13         margin-bottom: 0;
  14       }
  15       h1 { font-size: 3em; }
  16       h2 { font-size: 2em; }
  17       h3 { font-size: 1.6em; }
  18       a, a > code {
  19         text-decoration: none;
  20       }
  21       code {
  22         -moz-border-radius: 5px;
  23         -webkit-border-radius: 5px;
  24         background: #e7e8e2;
  25         border-radius: 5px;
  26         font-size: 16px;
  27       }
  28       .plaintext {
  29         background: #272822;
  30         color: #80ff80;
  31         text-shadow: 0 0 20px #333;
  32         padding: 2px 5px;
  33       }
  34       .ciphertext {
  35         background: #272822;
  36         color: #ff6666;
  37         text-shadow: 0 0 20px #333;
  38         padding: 2px 5px;
  39       }
  40       .float-right {
  41         float: right;
  42       }
  43     </style>
  44   </head>
  45   <body>
  46     <textarea id="source">
  47
  48 # Alternative plaintext scoring methods
  49
  50 ---
  51
  52 # Back to letter frequency counts
  53
  54 Letter | Count
  55 -------|------
  56 a | 489107
  57 b | 92647
  58 c | 140497
  59 d | 267381
  60 e | 756288
  61 . | .
  62 . | .
  63 . | .
  64 z | 3575
  65
  66 Another way of thinking about these counts is as a 26-dimensional vector, one component per letter.
  67
  68 Create a vector of our text, and one of idealised English.
  69
  70 The distance between the vectors is how far from English the text is.
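
A minimal sketch of turning a text into such a vector. The helper name `letter_vector` is hypothetical, not part of the course code.

```python
from collections import Counter
import string

def letter_vector(text):
    """Return a list of 26 counts, one per letter a to z (case-insensitive)."""
    counts = Counter(c for c in text.lower() if c in string.ascii_lowercase)
    return [counts[letter] for letter in string.ascii_lowercase]

vec = letter_vector("Hello, World!")
print(vec[string.ascii_lowercase.index('l')])  # 3
```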
  71
  72 ---
  73
  74 # Vector distances
  75
  76 .float-right[![right-aligned Vector subtraction](vector-subtraction.svg)]
  77
  78 Several different distance measures (__metrics__, also called __norms__):
  79
  80 * L<sub>2</sub> norm (Euclidean distance):
  81 `\(\|\mathbf{a} - \mathbf{b}\| = \sqrt{\sum_i (\mathbf{a}_i - \mathbf{b}_i)^2} \)`
  82
  83 * L<sub>1</sub> norm (Manhattan distance, taxicab distance):
  84 `\(\|\mathbf{a} - \mathbf{b}\| = \sum_i |\mathbf{a}_i - \mathbf{b}_i| \)`
  85
  86 * L<sub>3</sub> norm:
  87 `\(\|\mathbf{a} - \mathbf{b}\| = \sqrt[3]{\sum_i |\mathbf{a}_i - \mathbf{b}_i|^3} \)`
  88
  89 The higher the power used, the more weight is given to the largest differences in components.
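
The three norms above differ only in the power used, so one function can cover them all. This is a sketch; `lp_distance` is a hypothetical name, and the vectors are assumed to be equal-length lists of numbers.

```python
def lp_distance(a, b, p):
    """L_p distance between two equal-length sequences of numbers."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

a = [3, 0, 0]
b = [0, 4, 0]
print(lp_distance(a, b, 1))  # 7.0
print(lp_distance(a, b, 2))  # 5.0
print(lp_distance(a, b, 3))  # between 4 and 5: closer to the largest difference
```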
  90
  91 (Extends out to:
  92
  93 * L<sub>0</sub> norm (Hamming distance):
  94 `$$\|\mathbf{a} - \mathbf{b}\| = \sum_i \left\{
  95 \begin{matrix} 1 &amp;\mbox{if}\ \mathbf{a}_i \neq \mathbf{b}_i , \\
  96  0 &amp;\mbox{if}\ \mathbf{a}_i = \mathbf{b}_i \end{matrix} \right. $$`
  97
  98 * L<sub>&infin;</sub> norm:
  99 `\(\|\mathbf{a} - \mathbf{b}\| = \max_i{|\mathbf{a}_i - \mathbf{b}_i|} \)`
 100
 101 neither of which will be that useful here, but they keep cropping up.)
 102 ---
 103
 104 # Normalisation of vectors
 105
 106 Frequency counts drawn from texts of different sizes give vectors of different lengths. For a fair comparison we need to scale them.
 107
 108 * Euclidean scaling (vector with unit length): `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \sqrt{\mathbf{x}_1^2 + \mathbf{x}_2^2 + \mathbf{x}_3^2 + \dots } }$$`
 109
 110 * Normalisation (components of vector sum to 1): `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3 + \dots }$$`
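
A sketch of both scalings. The names `normalise` and `euclidean_scale` echo the `norms` module used later in these slides, but the bodies here are assumptions, not the module's actual code.

```python
import math

def normalise(x):
    """Scale x so its components sum to 1."""
    total = sum(x)
    return [v / total for v in x]

def euclidean_scale(x):
    """Scale x so its Euclidean length is 1."""
    length = math.sqrt(sum(v ** 2 for v in x))
    return [v / length for v in x]

counts = [3, 4]
print(normalise(counts))        # components sum to 1
print(euclidean_scale(counts))  # [0.6, 0.8]
```

Both divide by zero on an all-zero vector, so a text with no letters needs handling separately.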
 111
 112 ---
 113
 114 # Angle, not distance
 115
 116 Rather than looking at the distance between the vectors, look at the angle between them.
 117
 118 .float-right[![right-aligned Vector dot product](vector-dot-product.svg)]
 119
 120 Vector dot product shows how much of one vector lies in the direction of another:
 121 `\( \mathbf{A} \bullet \mathbf{B} =
 122 \| \mathbf{A} \| \cdot \| \mathbf{B} \| \cos{\theta} \)`
 123
 124 But,
 125 `\( \mathbf{A} \bullet \mathbf{B} = \sum_i \mathbf{A}_i \cdot \mathbf{B}_i \)`
 126 and `\( \| \mathbf{A} \| = \sqrt{\sum_i \mathbf{A}_i^2} \)`
 127
 128 A bit of rearranging gives the cosine similarity:
 129 `$$ \cos{\theta} = \frac{ \mathbf{A} \bullet \mathbf{B} }{ \| \mathbf{A} \| \cdot \| \mathbf{B} \| } =
 130 \frac{\sum_i \mathbf{A}_i \cdot \mathbf{B}_i}{\sqrt{\sum_i \mathbf{A}_i^2} \times \sqrt{\sum_i \mathbf{B}_i^2}} $$`
 131
 132 This is independent of vector lengths!
 133
 134 Cosine similarity is 1 if in parallel, 0 if perpendicular, -1 if antiparallel.
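
A sketch of the calculation, assuming the vectors are equal-length lists of numbers and neither is all zeros.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between vectors a and b."""
    dot = sum(x * y for x, y in zip(a, b))
    length_a = math.sqrt(sum(x * x for x in a))
    length_b = math.sqrt(sum(y * y for y in b))
    return dot / (length_a * length_b)

print(cosine_similarity([1, 2], [10, 20]))  # parallel: 1, up to rounding
print(cosine_similarity([1, 0], [0, 1]))    # perpendicular: 0.0
```

Letter counts are never negative, so for frequency vectors the result always falls between 0 and 1.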
 135
 136 ---
 137
 138 # Which is best?
 139
 140    | Euclidean | Normalised
 141 ---|-----------|------------
 142 L1 |     x     |      x
 143 L2 |     x     |      x
 144 L3 |     x     |      x
 145 Cosine |     x     |      x
 146
 147 And the probability measure!
 148
 149 * Nine different ways of measuring fitness.
 150
 151 ## Computing is an empirical science
 152
 153 Let's do some experiments to find the best solution!
 154
 155 ---
 156
 157 # Experimental harness
 158
 159 ## Step 1: build some other scoring functions
 160
 161 We need a way of passing the different functions to the keyfinding function.
 162
 163 ## Step 2: find the best scoring function
 164
 165 Try them all on random ciphertexts, see which one works best.
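
One possible shape for the experiment, as a sketch only. `caesar_shift`, `toy_caesar_break`, `success_rate` and `count_e` are hypothetical stand-ins, not the course's own functions, and the text is assumed to be lowercase.

```python
import random
import string

def caesar_shift(text, shift):
    """Shift each lowercase letter forward by `shift` places, wrapping at z."""
    alphabet = string.ascii_lowercase
    k = shift % 26
    return text.translate(str.maketrans(alphabet, alphabet[k:] + alphabet[:k]))

def toy_caesar_break(ciphertext, fitness):
    """Return the shift whose decipherment scores highest under `fitness`."""
    return max(range(26), key=lambda s: fitness(caesar_shift(ciphertext, -s)))

def success_rate(fitness, plaintext, trials=100, seed=0):
    """Fraction of random-shift trials in which the true shift is recovered."""
    rng = random.Random(seed)
    correct = 0
    for _ in range(trials):
        shift = rng.randrange(26)
        if toy_caesar_break(caesar_shift(plaintext, shift), fitness) == shift:
            correct += 1
    return correct / trials

def count_e(text):
    """A deliberately crude stand-in fitness: how many e's the text has."""
    return text.count('e')

sample = "the quick brown fox jumps over the lazy dog and meets these three eels"
print(success_rate(count_e, sample))
```

Running `success_rate` once per scoring function, over many texts of varying length, gives a table of results to compare.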
 166
 167 ---
 168
 169 # Functions are values!
 170
 171 ```python
 172 >>> Pletters
 173 <function Pletters at 0x7f60e6d9c4d0>
 174 ```
 175
 176 ```python
 177 def caesar_break(message, fitness=Pletters):
 178     """Breaks a Caesar cipher using frequency analysis"""
 179     ...
 180     for shift in range(26):
 181         plaintext = caesar_decipher(message, shift)
 182         fit = fitness(plaintext)
 183 ```
 184
 185 ---
 186
 187 # Changing the comparison function
 188
 189 * Must be a function that takes a text and returns a score
 190     * Better fit must give higher score, opposite of the vector distance norms
 191
 192 ```python
 193 def make_frequency_compare_function(target_frequency, frequency_scaling, metric, invert):
 194     def frequency_compare(text):
 195         ...
 196         return score
 197     return frequency_compare
 198 ```
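
One way the elided body could be filled in, as a sketch under stated assumptions: vectors are 26-element lists, and `invert` negates a distance so that a closer fit scores higher. `letter_counts`, `normalise` and `l1` are stand-ins written for this example.

```python
import string
from collections import Counter

def letter_counts(text):
    """26 letter counts, a to z, ignoring case and non-letters."""
    counts = Counter(c for c in text.lower() if c in string.ascii_lowercase)
    return [counts[letter] for letter in string.ascii_lowercase]

def make_frequency_compare_function(target_frequency, frequency_scaling, metric, invert):
    def frequency_compare(text):
        observed = frequency_scaling(letter_counts(text))
        score = metric(target_frequency, observed)
        # Assumption: "invert" negates a distance so a closer fit scores higher.
        return -score if invert else score
    return frequency_compare

# Stand-in scaling and metric, for illustration only.
def normalise(x):
    total = sum(x)
    return [v / total for v in x]

def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

sample = "the quick brown fox jumps over the lazy dog"
target = normalise(letter_counts(sample))  # toy target, not real English frequencies
fitness = make_frequency_compare_function(target, normalise, l1, invert=True)
print(fitness(sample) > fitness("zzzz qqqq xxxx"))  # True
```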
 199
 200 ---
 201
 202 # Data-driven processing
 203
 204 ```python
 205 metrics = [{'func': norms.l1, 'invert': True, 'name': 'l1'},
 206     {'func': norms.l2, 'invert': True, 'name': 'l2'},
 207     {'func': norms.l3, 'invert': True, 'name': 'l3'},
 208     {'func': norms.cosine_similarity, 'invert': False, 'name': 'cosine_similarity'}]
 209 scalings = [{'corpus_frequency': normalised_english_counts,
 210          'scaling': norms.normalise,
 211          'name': 'normalised'},
 212         {'corpus_frequency': euclidean_scaled_english_counts,
 213          'scaling': norms.euclidean_scale,
 214          'name': 'euclidean_scaled'}]
 215 ```
 216
 217 Use this to build the eight vector-based scoring functions (four metrics, two scalings each). With the probability measure, that makes nine.
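
A cut-down sketch of the data-driven idea: every metric is paired with every scaling. The metric and scaling functions here are stand-ins for the `norms` module, `make_scorer` is a hypothetical factory that works on count vectors rather than text, and the three-component target is toy data.

```python
import math
from itertools import product

# Stand-ins for the norms module; these bodies are assumptions.
def l1(a, b):
    return sum(abs(x - y) for x, y in zip(a, b))

def l2(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def normalise(x):
    total = sum(x)
    return [v / total for v in x]

def euclidean_scale(x):
    length = math.sqrt(sum(v ** 2 for v in x))
    return [v / length for v in x]

metrics = [{'func': l1, 'invert': True, 'name': 'l1'},
           {'func': l2, 'invert': True, 'name': 'l2'}]
scalings = [{'scaling': normalise, 'name': 'normalised'},
            {'scaling': euclidean_scale, 'name': 'euclidean_scaled'}]

def make_scorer(target_counts, scaling, metric, invert):
    """Build a scorer that compares scaled counts against the scaled target."""
    target = scaling(target_counts)
    def scorer(counts):
        distance = metric(target, scaling(counts))
        return -distance if invert else distance
    return scorer

target_counts = [5, 3, 2]  # toy three-letter "alphabet", not real English data
scorers = {m['name'] + '_' + s['name']:
               make_scorer(target_counts, s['scaling'], m['func'], m['invert'])
           for m, s in product(metrics, scalings)}
print(sorted(scorers))
```

Adding a metric or a scaling then means adding one dictionary to a list, not writing another function by hand.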
 218
 219
 220     </textarea>
 221     <script src="http://gnab.github.io/remark/downloads/remark-0.6.0.min.js" type="text/javascript">
 222     </script>
 223
 224     <script type="text/javascript"
 225       src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&delayStartupUntil=configured"></script>
 226
 227     <script type="text/javascript">
 228       var slideshow = remark.create({ ratio: "16:9" });
 229
 230       // Setup MathJax
 231       MathJax.Hub.Config({
 232         tex2jax: {
 233         skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
 234         }
 235       });
 236       MathJax.Hub.Queue(function() {
 237         MathJax.Hub.getAllJax().forEach(function(jax) {
 238             jax.SourceElement().parentNode.classList.add('has-jax');
 239         });
 240       });
 241       MathJax.Hub.Configured();
 242     </script>
 243   </body>
 244 </html>