From: Neil Smith
Date: Wed, 9 Apr 2014 19:23:02 +0000 (+0100)
Subject: Split out vector-based frequency analysis, started on affine ciphers
X-Git-Url: https://git.njae.me.uk/?a=commitdiff_plain;h=5442ad81b503a960fdcdacc1eb20707672c75fbe;p=cipher-training.git

Split out vector-based frequency analysis, started on affine ciphers
---

diff --git a/slides/affine-encipher.html b/slides/affine-encipher.html
index 9c54d8a..9e5c20e 100644
--- a/slides/affine-encipher.html
+++ b/slides/affine-encipher.html
@@ -47,6 +47,37 @@
 # Affine ciphers
 
+a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z
+--|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|--
+b | e | h | k | n | q | t | w | z | c | f | i | l | o | r | u | x | a | d | g | j | m | p | s | v | y
+
+An extension of Caesar ciphers
+
+* Count the gaps between the letters.
+
+---
+# How affine ciphers work
+
+_ciphertext_letter_ = _plaintext_letter_ × a + b
+
+* Convert letters to numbers
+* Take the total modulo 26
+
+# Enciphering is easy
+
+* Build the `affine_encipher()` function
+
+---
+
+# Deciphering affine ciphers is harder
+
+`$$p = \frac{c - b}{a}$$`
+
+But modular division is hard!
+
+
+---
+
 ## Explanation of extended Euclid's algorithm from [Programming with finite fields](http://jeremykun.com/2014/03/13/programming-with-finite-fields/)
 
 **Definition:** An element _d_ is called a greatest common divisor (gcd) of _a, b_ if it divides both _a_ and _b_, and for every other _z_ dividing both _a_ and _b_, _z_ divides _d_.

diff --git a/slides/alternative-plaintext-scoring.html b/slides/alternative-plaintext-scoring.html
new file mode 100644
index 0000000..d6f4aa1
--- /dev/null
+++ b/slides/alternative-plaintext-scoring.html
@@ -0,0 +1,244 @@
+    <title>Alternative plaintext scoring</title>

diff --git a/slides/caesar-break.html b/slides/caesar-break.html
index f6a031f..5ea77b9 100644
--- a/slides/caesar-break.html
+++ b/slides/caesar-break.html
@@ -271,178 +271,23 @@ logger.setLevel(logging.WARNING)
                     'and decrypt starting: {2}'.format(shift, fit, plaintext[:50]))
 ```
 
- * Yes, it's ugly.
+* Yes, it's ugly.
 
- Use `logger.setLevel()` to change the level: CRITICAL, ERROR, WARNING, INFO, DEBUG
+Use `logger.setLevel()` to change the level: CRITICAL, ERROR, WARNING, INFO, DEBUG
----
-
-# Back to frequency of letter counts
-
-Letter | Count
--------|------
-a | 489107
-b | 92647
-c | 140497
-d | 267381
-e | 756288
-. | .
-. | .
-. | .
-z | 3575
-
-Another way of thinking about this is as a 26-dimensional vector.
-
-Create a vector of our text, and one of idealised English.
-
-The distance between the vectors measures how far the text is from English.
-
----
-
-# Vector distances
-
-.float-right[![right-aligned Vector subtraction](vector-subtraction.svg)]
-
-Several different distance measures (__metrics__, also called __norms__):
-
-* L2 norm (Euclidean distance):
-`\(\|\mathbf{a} - \mathbf{b}\| = \sqrt{\sum_i (\mathbf{a}_i - \mathbf{b}_i)^2} \)`
-
-* L1 norm (Manhattan distance, taxicab distance):
-`\(\|\mathbf{a} - \mathbf{b}\| = \sum_i |\mathbf{a}_i - \mathbf{b}_i| \)`
-
-* L3 norm:
-`\(\|\mathbf{a} - \mathbf{b}\| = \sqrt[3]{\sum_i |\mathbf{a}_i - \mathbf{b}_i|^3} \)`
-
-The higher the power used, the more weight is given to the largest differences in components.
-
-(Extends out to:
-
-* L0 norm (Hamming distance):
-`$$\|\mathbf{a} - \mathbf{b}\| = \sum_i \left\{
-\begin{matrix} 1 &\mbox{if}\ \mathbf{a}_i \neq \mathbf{b}_i , \\
- 0 &\mbox{if}\ \mathbf{a}_i = \mathbf{b}_i \end{matrix} \right. $$`
-
-* L∞ norm:
-`\(\|\mathbf{a} - \mathbf{b}\| = \max_i{|\mathbf{a}_i - \mathbf{b}_i|} \)`
-
-neither of which will be that useful here, but they keep cropping up.)
----
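To make the three norms above concrete: they are all the same recipe with a different power. A minimal sketch, separate from the commit itself; the function name and the truncated frequency vectors are my own illustration:

```python
# Lp distance between two equal-length frequency vectors.
# Illustrative sketch only, not the repository's norms module.
def lp_distance(a, b, p):
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

english = [0.082, 0.015, 0.028]   # frequencies of a, b, c, ... (truncated)
sample  = [0.090, 0.012, 0.030]

l1 = lp_distance(english, sample, 1)   # Manhattan distance
l2 = lp_distance(english, sample, 2)   # Euclidean distance
l3 = lp_distance(english, sample, 3)   # emphasises the largest differences
```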
-
-# Normalisation of vectors
-
-Frequency distributions drawn from different sources will have different lengths. For a fair comparison we need to scale them.
-
-* Euclidean scaling (scaling to a vector with unit length): `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \sqrt{\mathbf{x}_1^2 + \mathbf{x}_2^2 + \mathbf{x}_3^2 + \dots } }$$`
-
-* Normalisation (scaling so the components of the vector sum to 1): `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|_1} = \frac{\mathbf{x}}{ \mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3 + \dots }$$`
-
----
-
-# Angle, not distance
-
-Rather than looking at the distance between the vectors, look at the angle between them.
-
-.float-right[![right-aligned Vector dot product](vector-dot-product.svg)]
-
-The vector dot product shows how much of one vector lies in the direction of another:
-`\( \mathbf{A} \bullet \mathbf{B} =
-\| \mathbf{A} \| \cdot \| \mathbf{B} \| \cos{\theta} \)`
-
-But,
-`\( \mathbf{A} \bullet \mathbf{B} = \sum_i \mathbf{A}_i \cdot \mathbf{B}_i \)`
-and `\( \| \mathbf{A} \| = \sqrt{\sum_i \mathbf{A}_i^2} \)`
-
-A bit of rearranging gives the cosine similarity:
-`$$ \cos{\theta} = \frac{ \mathbf{A} \bullet \mathbf{B} }{ \| \mathbf{A} \| \cdot \| \mathbf{B} \| } =
-\frac{\sum_i \mathbf{A}_i \cdot \mathbf{B}_i}{\sqrt{\sum_i \mathbf{A}_i^2} \times \sqrt{\sum_i \mathbf{B}_i^2}} $$`
-
-This is independent of vector lengths!
-
-Cosine similarity is 1 if the vectors are parallel, 0 if they are perpendicular, and -1 if they are antiparallel.
-
----
-
-# Which is best?
-
-Metric | Euclidean scaling | Normalised scaling
--------|-------------------|--------------------
-L1     | x | x
-L2     | x | x
-L3     | x | x
-Cosine | x | x
-
-And the probability measure!
-
-* Nine different ways of measuring fitness.
-
-## Computing is an empirical science
-
-Let's do some experiments to find the best solution!
-
----
-
-# Experimental harness
-
-## Step 1: build some other scoring functions
-
-We need a way of passing the different scoring functions to the key-finding function.
-
-## Step 2: find the best scoring function
-
-Try them all on random ciphertexts, and see which one works best.
-
----
-
-# Functions are values!
-
-```python
->>> Pletters
-<function Pletters at 0x…>
-```
-
-```python
-def caesar_break(message, fitness=Pletters):
-    """Breaks a Caesar cipher using frequency analysis
-...
-    for shift in range(26):
-        plaintext = caesar_decipher(message, shift)
-        fit = fitness(plaintext)
-```
-
----
-
-# Changing the comparison function
-
-* Must be a function that takes a text and returns a score
-    * A better fit must give a higher score, the opposite of the vector distance norms
-
-```python
-def make_frequency_compare_function(target_frequency, frequency_scaling, metric, invert):
-    def frequency_compare(text):
-        ...
-        return score
-    return frequency_compare
-```
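The `...` above is left as an exercise. One plausible shape for the body, offered as a sketch rather than the course's actual solution (the `Counter`-based letter counting and the argument conventions of the `norms` helpers are my assumptions):

```python
import collections
import string

def make_frequency_compare_function(target_frequency, frequency_scaling, metric, invert):
    def frequency_compare(text):
        # Count the letters of the candidate plaintext, in a..z order
        counts = collections.Counter(c for c in text.lower()
                                     if c in string.ascii_lowercase)
        frequencies = frequency_scaling([counts[l] for l in string.ascii_lowercase])
        # Compare against the (already scaled) target distribution
        score = metric(target_frequency, frequencies)
        # Distance metrics shrink as the fit improves, so flip their sign
        # to keep "higher score means better fit"
        return -score if invert else score
    return frequency_compare
```

Built this way, `make_frequency_compare_function(normalised_english_counts, norms.normalise, norms.l2, True)` would be one of the nine scoring functions.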
-
----
-
-# Data-driven processing
-
-```python
-metrics = [{'func': norms.l1, 'invert': True, 'name': 'l1'},
-           {'func': norms.l2, 'invert': True, 'name': 'l2'},
-           {'func': norms.l3, 'invert': True, 'name': 'l3'},
-           {'func': norms.cosine_similarity, 'invert': False, 'name': 'cosine_similarity'}]
-scalings = [{'corpus_frequency': normalised_english_counts,
-             'scaling': norms.normalise,
-             'name': 'normalised'},
-            {'corpus_frequency': euclidean_scaled_english_counts,
-             'scaling': norms.euclidean_scale,
-             'name': 'euclidean_scaled'}]
-```
-
-Use this to make all nine scoring functions.
+Use `logger.debug()`, `logger.info()`, etc. to log a message.
+
+---
+
+# How much ciphertext do we need?
+
+## Let's do an experiment to find out
+
+1. Load the whole corpus into a string (sanitised)
+2. Select a random chunk of plaintext and a random key
+3. Encipher the text
+4. Score 1 point if `caesar_break()` recovers the correct key
+5. Repeat many times and with many plaintext lengths
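Wiring the five steps together might look roughly like this; a sketch that assumes `sanitise()` and `caesar_encipher()` from earlier exercises, assumes `caesar_break()` returns the winning shift and its fitness, and uses an illustrative corpus file name:

```python
import random

corpus = sanitise(open('shakespeare.txt').read())

def test_caesar_break(message_length, trials=100, fitness=Pletters):
    score = 0
    for _ in range(trials):
        # Steps 1 and 2: a random chunk of plaintext and a random key
        start = random.randrange(len(corpus) - message_length)
        plaintext = corpus[start:start + message_length]
        key = random.randrange(26)
        # Step 3: encipher it
        ciphertext = caesar_encipher(plaintext, key)
        # Step 4: a point if the right key is recovered
        found_key, _ = caesar_break(ciphertext, fitness=fitness)
        if found_key == key:
            score += 1
    return score / trials   # Step 5: call this for many message lengths
```

Plotting `test_caesar_break()` against `message_length` for each scoring function would show how much ciphertext each one needs.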