From: Neil Smith
Date: Wed, 12 Mar 2014 12:37:20 +0000 (+0000)
Subject: Fixed conflicting equation typesetting
X-Git-Url: https://git.njae.me.uk/?a=commitdiff_plain;h=5b51a469cc152b1035a5cf69f2c38d51f9d16eb8;hp=-c;p=cipher-training.git

Fixed conflicting equation typesetting
---

5b51a469cc152b1035a5cf69f2c38d51f9d16eb8
diff --combined slides/caesar-break.html
index c6d4023,6df8365..a47e236
--- a/slides/caesar-break.html
+++ b/slides/caesar-break.html
@@@ -37,9 -37,6 +37,9 @@@
  text-shadow: 0 0 20px #333;
  padding: 2px 5px;
 }
+ .float-right {
+   float: right;
+ }
@@@ -47,13 -44,31 +47,31 @@@

# Breaking Caesar ciphers

-![centre-aligned Caesar wheel](caesarwheel1.gif)
+![center-aligned Caesar wheel](caesarwheel1.gif)

---

- # Brute force
+ # Human vs Machine

- How many keys to try?
+ Slow but clever vs Dumb but fast
+
+ ## Human approach
+
+ Ciphertext | Plaintext
+ ---|---
+ ![left-aligned Ciphertext frequencies](c1a_frequency_histogram.png) | ![left-aligned English frequencies](english_frequency_histogram.png)
+
+ ---
+
+ # Human vs Machine
+
+ ## Machine approach
+
+ Brute force.
+
+ Try all keys.
+
+ * How many keys to try?

## Basic idea

@@@ -101,56 -116,26 +119,56 @@@ The distance between the vectors is ho
---

-# Vector distances
+# Frequencies of English
+
+But before that, how do we count the letters?
+
+* Read a file into a string
+```python
+# read() pulls the whole file in as one string (filename is illustrative)
+text = open('war-and-peace.txt').read()
+```
+* Count them
+```python
+import collections
+counts = collections.Counter(text)  # counts every character, junk and all
+```
+
+---
+# Canonical forms

-## FIXME! Diagram of vector subtraction here
+Counting letters in _War and Peace_ gives all manner of junk.
+
+* Convert the text into canonical form (lower case, accents removed, non-letters stripped) before counting (see the sketch below)
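+
+A minimal sketch of that pipeline, assuming the standard library's `unicodedata` for stripping accents (the function names here are illustrative, not necessarily the ones used in the exercises):
+
+```python
+import collections
+import string
+import unicodedata
+
+def sanitise(text):
+    """Reduce text to canonical form: lower-case ASCII letters only."""
+    # NFD decomposition splits each accented letter into a base letter
+    # plus a combining mark, so filtering to a-z drops the marks along
+    # with spaces, digits and punctuation
+    decomposed = unicodedata.normalize('NFD', text.lower())
+    return ''.join(c for c in decomposed if c in string.ascii_lowercase)
+
+def letter_frequencies(text):
+    """Count each letter of the canonical form of the text."""
+    return collections.Counter(sanitise(text))
+```
+
+`letter_frequencies(text)` then gives counts that are fit for comparison.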
+
+---
+
+# Vector distances
+
+.float-right[![right-aligned Vector subtraction](vector-subtraction.svg)]

Several different distance measures (__metrics__, also called __norms__):

-* L2 norm (Euclidean distance): `\(\|\mathbf{x} - \mathbf{y}\| = \sqrt{\sum_i (\mathbf{x}_i - \mathbf{y}_i)^2} \)`
+* L2 norm (Euclidean distance):
- `\(|\mathbf{a} - \mathbf{b}| = \sqrt{\sum_i (\mathbf{a}_i - \mathbf{b}_i)^2} \)`
++`\(\|\mathbf{a} - \mathbf{b}\| = \sqrt{\sum_i (\mathbf{a}_i - \mathbf{b}_i)^2} \)`

-* L1 norm (Manhattan distance, taxicab distance): `\(\|\mathbf{x} - \mathbf{y}\| = \sum_i |\mathbf{x}_i - \mathbf{y}_i| \)`
+* L1 norm (Manhattan distance, taxicab distance):
- `\(|\mathbf{a} - \mathbf{b}| = \sum_i |\mathbf{a}_i - \mathbf{b}_i| \)`
++`\(\|\mathbf{a} - \mathbf{b}\| = \sum_i |\mathbf{a}_i - \mathbf{b}_i| \)`

-* L3 norm: `\(\|\mathbf{x} - \mathbf{y}\| = \sqrt[3]{\sum_i |\mathbf{x}_i - \mathbf{y}_i|^3} \)`
+* L3 norm:
- `\(|\mathbf{a} - \mathbf{b}| = \sqrt[3]{\sum_i |\mathbf{a}_i - \mathbf{b}_i|^3} \)`
++`\(\|\mathbf{a} - \mathbf{b}\| = \sqrt[3]{\sum_i |\mathbf{a}_i - \mathbf{b}_i|^3} \)`

The higher the power used, the more weight is given to the largest differences in components.

(Extends out to:

-* L0 norm (Hamming distance): `\(\|\mathbf{x} - \mathbf{y}\| = \sum_i \left\{\begin{matrix} 1 &\mbox{if}\ \mathbf{x}_i \neq \mathbf{y}_i , \\ 0 &\mbox{if}\ \mathbf{x}_i = \mathbf{y}_i \end{matrix} \right| \)`
+* L0 norm (Hamming distance):
- `$$|\mathbf{a} - \mathbf{b}| = \sum_i \left\{
++`$$\|\mathbf{a} - \mathbf{b}\| = \sum_i \left\{
+\begin{matrix} 1 &\mbox{if}\ \mathbf{a}_i \neq \mathbf{b}_i , \\
+ 0 &\mbox{if}\ \mathbf{a}_i = \mathbf{b}_i \end{matrix} \right.
+$$`

-* L∞ norm: `\(\|\mathbf{x} - \mathbf{y}\| = \max_i{(\mathbf{x}_i - \mathbf{y}_i)} \)`
+* L∞ norm:
- `\(|\mathbf{a} - \mathbf{b}| = \max_i{(\mathbf{a}_i - \mathbf{b}_i)} \)`
++`\(\|\mathbf{a} - \mathbf{b}\| = \max_i{|\mathbf{a}_i - \mathbf{b}_i|} \)`

neither of which will be that useful.)

---

@@@ -159,9 -144,9 +177,9 @@@

Frequency distributions drawn from different sources will have different lengths. For a fair comparison we need to scale them.

-* Euclidean scaling (vector with unit length): `\( \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \sqrt{\mathbf{x}_1^2 + \mathbf{x}_2^2 + \mathbf{x}_3^2 + \dots } }\)`
+* Euclidean scaling (vector with unit length): `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \sqrt{\mathbf{x}_1^2 + \mathbf{x}_2^2 + \mathbf{x}_3^2 + \dots } }$$`

-* Normalisation (components of vector sum to 1): `\( \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|_1} = \frac{\mathbf{x}}{ \mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3 + \dots }\)`
+* Normalisation (components of vector sum to 1): `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|_1} = \frac{\mathbf{x}}{ \mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3 + \dots }$$`

---

@@@ -169,8 -154,6 +187,8 @@@

Rather than looking at the distance between the vectors, look at the angle between them.

+.float-right[![right-aligned Vector dot product](vector-dot-product.svg)]
+
Vector dot product shows how much of one vector lies in the direction of another:
`\( \mathbf{A} \bullet \mathbf{B} = \| \mathbf{A} \| \cdot \| \mathbf{B} \| \cos{\theta} \)`

@@@ -180,47 -163,17 +198,47 @@@ But
and `\( \| \mathbf{A} \| = \sqrt{\sum_i \mathbf{A}_i^2} \)`

A bit of rearranging gives the cosine similarity:
-`\( \cos{\theta} = \frac{ \mathbf{A} \bullet \mathbf{B} }{ \| \mathbf{A} \| \cdot \| \mathbf{B} \| } =
-\frac{\sum_i \mathbf{A}_i \cdot \mathbf{B}_i}{\sqrt{\sum_i \mathbf{A}_i^2} \times \sqrt{\sum_i \mathbf{B}_i^2}} \)`
+`$$ \cos{\theta} = \frac{ \mathbf{A} \bullet \mathbf{B} }{ \| \mathbf{A} \| \cdot \| \mathbf{B} \| } =
+\frac{\sum_i \mathbf{A}_i \cdot \mathbf{B}_i}{\sqrt{\sum_i \mathbf{A}_i^2} \times \sqrt{\sum_i \mathbf{B}_i^2}} $$`

This is independent of vector lengths!

-Cosine similarity is 1 if in same direction, 0 if at 90⁰, -1 if antiparallel.
+Cosine similarity is 1 if parallel, 0 if perpendicular, -1 if antiparallel.
+
+---
+
+# An infinite number of monkeys
+
+What is the probability that this string of letters is a sample of English?
+
+Given 'th', 'e' is about six times more likely than 'a' or 'i'.
+
+## Naive Bayes, or the bag of letters
+
+Ignore letter order, just treat each letter individually.
+
+Probability of a text is `\( \prod_i p_i \)`
+
+(Implementation issue: this can often underflow, so get in the habit of rephrasing it as `\( \sum_i \log p_i \)`)
+
+---
+
+# Which is best?

Metric | Euclidean | Normalised
---|-----------|------------
L1 | x | x
L2 | x | x
L3 | x | x
Cosine | x | x

And the probability measure!

+* Nine different ways of measuring fitness.

-## FIXME! Cosine distance bug: frequencies2 length not squared.
+## Computing is an empirical science

-## FIXME! Diagram of vector dot product
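+
+For the empirical test, here is a sketch of the cosine measure over two frequency tables, written straight from the formula above (the function name and dictionary representation are illustrative). Note that *both* vector lengths take a square root, which is the bug the cosine distance FIXME was tracking:
+
+```python
+import math
+
+def cosine_similarity(frequencies1, frequencies2):
+    """Cosine of the angle between two letter-frequency vectors."""
+    letters = set(frequencies1) | set(frequencies2)
+    dot_product = sum(frequencies1.get(l, 0) * frequencies2.get(l, 0)
+                      for l in letters)
+    length1 = math.sqrt(sum(f ** 2 for f in frequencies1.values()))
+    # the second length must be square-rooted too
+    length2 = math.sqrt(sum(f ** 2 for f in frequencies2.values()))
+    return dot_product / (length1 * length2)
+```
+
+Scaling the counts first makes no difference to this one: cosine similarity already ignores vector length.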