From: Neil Smith <neil.git@njae.me.uk>
Date: Wed, 9 Apr 2014 19:23:02 +0000 (+0100)
Subject: Split out vector-based frequency analysis, started on affine ciphers
X-Git-Url: https://git.njae.me.uk/?a=commitdiff_plain;h=5442ad81b503a960fdcdacc1eb20707672c75fbe;p=cipher-training.git

Split out vector-based frequency analysis, started on affine ciphers
---

diff --git a/slides/affine-encipher.html b/slides/affine-encipher.html
index 9c54d8a..9e5c20e 100644
--- a/slides/affine-encipher.html
+++ b/slides/affine-encipher.html
@@ -47,6 +47,37 @@
 
 # Affine ciphers
 
+a | b | c | d | e | f | g | h | i | j | k | l | m | n | o | p | q | r | s | t | u | v | w | x | y | z
+--|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|--
+b | e | h | k | n | q | t | w | z | c | f | i | l | o | r | u | x | a | d | g | j | m | p | s | v | y
+
+An extension of Caesar ciphers
+
+* Count the gaps between successive ciphertext letters.
+
+---
+# How affine ciphers work
+
+_ciphertext_letter_ = _plaintext_letter_ × a + b
+
+* Convert letters to numbers
+* Take the total modulo 26
+
+---
+# Enciphering is easy
+
+* Build the `affine_encipher()` function
+
+---
+
+# Deciphering affine ciphers is harder
+
+`$$p = \frac{c - b}{a}$$`
+
+But modular division is hard!
+
+
+---
+
 ## Explanation of extended Euclid's algorithm from [Programming with finite fields](http://jeremykun.com/2014/03/13/programming-with-finite-fields/)
 
 **Definition:** An element _d_ is called a greatest common divisor (gcd) of _a, b_ if it divides both _a_ and _b_, and for every other _z_ dividing both _a_ and _b_, _z_ divides _d_.
diff --git a/slides/alternative-plaintext-scoring.html b/slides/alternative-plaintext-scoring.html
new file mode 100644
index 0000000..d6f4aa1
--- /dev/null
+++ b/slides/alternative-plaintext-scoring.html
@@ -0,0 +1,244 @@
+<!DOCTYPE html>
+<html>
+  <head>
+    <title>Alternative plaintext scoring</title>
+    <meta http-equiv="Content-Type" content="text/html; charset=UTF-8"/>
+    <style type="text/css">
+      /* Slideshow styles */
+      body {
+        font-size: 20px;
+      }
+      h1, h2, h3 {
+        font-weight: 400;
+        margin-bottom: 0;
+      }
+      h1 { font-size: 3em; }
+      h2 { font-size: 2em; }
+      h3 { font-size: 1.6em; }
+      a, a > code {
+        text-decoration: none;
+      }
+      code {
+        -moz-border-radius: 5px;
+        -webkit-border-radius: 5px;
+        background: #e7e8e2;
+        border-radius: 5px;
+        font-size: 16px;
+      }
+      .plaintext {
+        background: #272822;
+        color: #80ff80;
+        text-shadow: 0 0 20px #333;
+        padding: 2px 5px;
+      }
+      .ciphertext {
+        background: #272822;
+        color: #ff6666;
+        text-shadow: 0 0 20px #333;
+        padding: 2px 5px;
+      }
+      .float-right {
+        float: right;
+      }
+    </style>
+  </head>
+  <body>
+    <textarea id="source">
+
+# Alternative plaintext scoring methods
+
+---
+
+# Back to frequency of letter counts
+
+Letter | Count
+-------|------
+a | 489107
+b | 92647
+c | 140497
+d | 267381
+e | 756288
+. | .
+. | .
+. | .
+z | 3575
+
+Another way of thinking about this is as a 26-dimensional vector.
+
+Create a vector of our text, and one of idealised English.
+
+The distance between the vectors is how far from English the text is.
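+
+A minimal sketch of the idea (the function names here are illustrative, not the course library's own):
+
+```python
+import collections
+import string
+
+def letter_vector(text):
+    """Return the 26-dimensional letter-count vector of a text."""
+    counts = collections.Counter(c for c in text.lower()
+                                 if c in string.ascii_lowercase)
+    return [counts[letter] for letter in string.ascii_lowercase]
+
+def euclidean_distance(a, b):
+    """Straight-line distance between two equal-length vectors."""
+    return sum((x - y) ** 2 for x, y in zip(a, b)) ** 0.5
+```
+
+(Raw counts only compare fairly between texts of similar length; scaling is the subject of a later slide.)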
+
+---
+
+# Vector distances
+
+.float-right[_(figure: difference of two vectors)_]
+
+Several different distance measures (__metrics__, also called __norms__):
+
+* L<sub>2</sub> norm (Euclidean distance):
+`\(\|\mathbf{a} - \mathbf{b}\| = \sqrt{\sum_i (\mathbf{a}_i - \mathbf{b}_i)^2} \)`
+
+* L<sub>1</sub> norm (Manhattan distance, taxicab distance):
+`\(\|\mathbf{a} - \mathbf{b}\| = \sum_i |\mathbf{a}_i - \mathbf{b}_i| \)`
+
+* L<sub>3</sub> norm:
+`\(\|\mathbf{a} - \mathbf{b}\| = \sqrt[3]{\sum_i |\mathbf{a}_i - \mathbf{b}_i|^3} \)`
+
+The higher the power used, the more weight is given to the largest differences in components.
+
+(The family extends to:
+
+* L<sub>0</sub> norm (Hamming distance):
+`$$\|\mathbf{a} - \mathbf{b}\| = \sum_i \left\{
+\begin{matrix} 1 &\mbox{if}\ \mathbf{a}_i \neq \mathbf{b}_i , \\
+ 0 &\mbox{if}\ \mathbf{a}_i = \mathbf{b}_i \end{matrix} \right. $$`
+
+* L<sub>∞</sub> norm:
+`\(\|\mathbf{a} - \mathbf{b}\| = \max_i{|\mathbf{a}_i - \mathbf{b}_i|} \)`
+
+neither of which will be that useful here, but they keep cropping up.)
+---
+
+# Normalisation of vectors
+
+Frequency distributions drawn from different sources will have different lengths. For a fair comparison we need to scale them.
+
+* Euclidean scaling (vector with unit length): `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \sqrt{\mathbf{x}_1^2 + \mathbf{x}_2^2 + \mathbf{x}_3^2 + \dots } }$$`
+
+* Normalisation (components of vector sum to 1): `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{ \mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3 + \dots }$$`
+
+---
+
+# Angle, not distance
+
+Rather than looking at the distance between the vectors, look at the angle between them.
+
+.float-right[_(figure: angle between two vectors)_]
+
+The vector dot product shows how much of one vector lies in the direction of another:
+`\( \mathbf{A} \bullet \mathbf{B} =
+\| \mathbf{A} \| \cdot \| \mathbf{B} \| \cos{\theta} \)`
+
+But,
+`\( \mathbf{A} \bullet \mathbf{B} = \sum_i \mathbf{A}_i \cdot \mathbf{B}_i \)`
+and `\( \| \mathbf{A} \| = \sqrt{\sum_i \mathbf{A}_i^2} \)`
+
+A bit of rearranging gives the cosine similarity:
+`$$ \cos{\theta} = \frac{ \mathbf{A} \bullet \mathbf{B} }{ \| \mathbf{A} \| \cdot \| \mathbf{B} \| } =
+\frac{\sum_i \mathbf{A}_i \cdot \mathbf{B}_i}{\sqrt{\sum_i \mathbf{A}_i^2} \times \sqrt{\sum_i \mathbf{B}_i^2}} $$`
+
+This is independent of vector lengths!
+
+Cosine similarity is 1 if the vectors are parallel, 0 if they are perpendicular, and -1 if they are antiparallel.
+
+---
+
+# Which is best?
+
+ | Euclidean | Normalised
+---|-----------|------------
+L1 | x | x
+L2 | x | x
+L3 | x | x
+Cosine | x | x
+
+And the probability measure!
+
+* Nine different ways of measuring fitness: four metrics × two scalings, plus the probability measure.
+
+## Computing is an empirical science
+
+Let's do some experiments to find the best solution!
+
+---
+
+# Experimental harness
+
+## Step 1: build some other scoring functions
+
+We need a way of passing the different functions to the key-finding function.
+
+## Step 2: find the best scoring function
+
+Try them all on random ciphertexts, and see which one works best.
+
+---
+
+# Functions are values!
+
+```python
+>>> Pletters
+<function Pletters at 0x7f60e6d9c4d0>
+```
+
+```python
+def caesar_break(message, fitness=Pletters):
+    """Breaks a Caesar cipher using frequency analysis
+...
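+    # fitness is an ordinary value: any function from text to score can be passed in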
+    for shift in range(26):
+        plaintext = caesar_decipher(message, shift)
+        fit = fitness(plaintext)
+```
+
+---
+
+# Changing the comparison function
+
+* Must be a function that takes a text and returns a score
+    * A better fit must give a higher score, the opposite of the vector distance norms, where smaller means closer
+
+```python
+def make_frequency_compare_function(target_frequency, frequency_scaling, metric, invert):
+    def frequency_compare(text):
+        ...
+        return score
+    return frequency_compare
+```
+
+---
+
+# Data-driven processing
+
+```python
+metrics = [{'func': norms.l1, 'invert': True, 'name': 'l1'},
+           {'func': norms.l2, 'invert': True, 'name': 'l2'},
+           {'func': norms.l3, 'invert': True, 'name': 'l3'},
+           {'func': norms.cosine_similarity, 'invert': False, 'name': 'cosine_similarity'}]
+scalings = [{'corpus_frequency': normalised_english_counts,
+             'scaling': norms.normalise,
+             'name': 'normalised'},
+            {'corpus_frequency': euclidean_scaled_english_counts,
+             'scaling': norms.euclidean_scale,
+             'name': 'euclidean_scaled'}]
+```
+
+Use this to make the eight frequency-comparison functions; with the probability measure, that's all nine scoring functions.
+
+
+    </textarea>
+    <script src="http://gnab.github.io/remark/downloads/remark-0.6.0.min.js" type="text/javascript">
+    </script>
+
+    <script type="text/javascript"
+      src="http://cdn.mathjax.org/mathjax/latest/MathJax.js?config=TeX-AMS-MML_HTMLorMML&delayStartupUntil=configured"></script>
+
+    <script type="text/javascript">
+      var slideshow = remark.create({ ratio: "16:9" });
+
+      // Setup MathJax
+      MathJax.Hub.Config({
+          tex2jax: {
+              skipTags: ['script', 'noscript', 'style', 'textarea', 'pre']
+          }
+      });
+      MathJax.Hub.Queue(function() {
+          $(MathJax.Hub.getAllJax()).map(function(index, elem) {
+              return(elem.SourceElement());
+          }).parent().addClass('has-jax');
+      });
+      MathJax.Hub.Configured();
+    </script>
+  </body>
+</html>
diff --git a/slides/caesar-break.html b/slides/caesar-break.html
index f6a031f..5ea77b9 100644
--- a/slides/caesar-break.html
+++ b/slides/caesar-break.html
@@ -271,178 +271,23 @@ logger.setLevel(logging.WARNING)
                     'and decrypt starting: {2}'.format(shift, fit, plaintext[:50]))
 ```
 
-    * Yes, it's ugly.
+* Yes, it's ugly.
 
-    Use `logger.setLevel()` to change the level: CRITICAL, ERROR, WARNING, INFO, DEBUG
+Use `logger.setLevel()` to change the level: CRITICAL, ERROR, WARNING, INFO, DEBUG
 
----
-
-# Back to frequency of letter counts
-
-Letter | Count
--------|------
-a | 489107
-b | 92647
-c | 140497
-d | 267381
-e | 756288
-. | .
-. | .
-. | .
-z | 3575
-
-Another way of thinking about this is a 26-dimensional vector.
-
-Create a vector of our text, and one of idealised English.
-
-The distance between the vectors is how far from English the text is.
-
----
-
-# Vector distances
-
-.float-right[_(figure: difference of two vectors)_]
-
-Several different distance measures (__metrics__, also called __norms__):
-
-* L<sub>2</sub> norm (Euclidean distance):
-`\(\|\mathbf{a} - \mathbf{b}\| = \sqrt{\sum_i (\mathbf{a}_i - \mathbf{b}_i)^2} \)`
-
-* L<sub>1</sub> norm (Manhattan distance, taxicab distance):
-`\(\|\mathbf{a} - \mathbf{b}\| = \sum_i |\mathbf{a}_i - \mathbf{b}_i| \)`
-
-* L<sub>3</sub> norm:
-`\(\|\mathbf{a} - \mathbf{b}\| = \sqrt[3]{\sum_i |\mathbf{a}_i - \mathbf{b}_i|^3} \)`
-
-The higher the power used, the more weight is given to the largest differences in components.
-
-(Extends out to:
-
-* L<sub>0</sub> norm (Hamming distance):
-`$$\|\mathbf{a} - \mathbf{b}\| = \sum_i \left\{
-\begin{matrix} 1 &\mbox{if}\ \mathbf{a}_i \neq \mathbf{b}_i , \\
- 0 &\mbox{if}\ \mathbf{a}_i = \mathbf{b}_i \end{matrix} \right. $$`
-
-* L<sub>∞</sub> norm:
-`\(\|\mathbf{a} - \mathbf{b}\| = \max_i{(\mathbf{a}_i - \mathbf{b}_i)} \)`
-
-neither of which will be that useful here, but they keep cropping up.)
----
-
-# Normalisation of vectors
-
-Frequency distributions drawn from different sources will have different lengths. For a fair comparison we need to scale them.
-
-* Eucliean scaling (vector with unit length): `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \sqrt{\mathbf{x}_1^2 + \mathbf{x}_2^2 + \mathbf{x}_3^2 + \dots } }$$`
-
-* Normalisation (components of vector sum to 1): `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3 + \dots }$$`
-
----
-
-# Angle, not distance
-
-Rather than looking at the distance between the vectors, look at the angle between them.
-
-.float-right[_(figure: angle between two vectors)_]
-
-Vector dot product shows how much of one vector lies in the direction of another:
-`\( \mathbf{A} \bullet \mathbf{B} =
-\| \mathbf{A} \| \cdot \| \mathbf{B} \| \cos{\theta} \)`
-
-But,
-`\( \mathbf{A} \bullet \mathbf{B} = \sum_i \mathbf{A}_i \cdot \mathbf{B}_i \)`
-and `\( \| \mathbf{A} \| = \sum_i \mathbf{A}_i^2 \)`
-
-A bit of rearranging give the cosine simiarity:
-`$$ \cos{\theta} = \frac{ \mathbf{A} \bullet \mathbf{B} }{ \| \mathbf{A} \| \cdot \| \mathbf{B} \| } =
-\frac{\sum_i \mathbf{A}_i \cdot \mathbf{B}_i}{\sum_i \mathbf{A}_i^2 \times \sum_i \mathbf{B}_i^2} $$`
-
-This is independent of vector lengths!
-
-Cosine similarity is 1 if in parallel, 0 if perpendicular, -1 if antiparallel.
-
----
-
-# Which is best?
-
- | Euclidean | Normalised
----|-----------|------------
-L1 | x | x
-L2 | x | x
-L3 | x | x
-Cosine | x | x
-
-And the probability measure!
-
-* Nine different ways of measuring fitness.
-
-## Computing is an empircal science
-
-Let's do some experiments to find the best solution!
+Use `logger.debug()`, `logger.info()`, etc. to log a message.
 
 ---
 
-# Experimental harness
-
-## Step 1: build some other scoring functions
-
-We need a way of passing the different functions to the keyfinding function.
-
-## Step 2: find the best scoring function
-
-Try them all on random ciphertexts, see which one works best.
-
----
-
-# Functions are values!
-
-```python
->>> Pletters
-<function Pletters at 0x7f60e6d9c4d0>
-```
-
-```python
-def caesar_break(message, fitness=Pletters):
-    """Breaks a Caesar cipher using frequency analysis
-...
-    for shift in range(26):
-        plaintext = caesar_decipher(message, shift)
-        fit = fitness(plaintext)
-```
-
----
-
-# Changing the comparison function
-
-* Must be a function that takes a text and returns a score
-    * Better fit must give higher score, opposite of the vector distance norms
-
-```python
-def make_frequency_compare_function(target_frequency, frequency_scaling, metric, invert):
-    def frequency_compare(text):
-        ...
-        return score
-    return frequency_compare
-```
-
----
-
-# Data-driven processing
-
-```python
-metrics = [{'func': norms.l1, 'invert': True, 'name': 'l1'},
-           {'func': norms.l2, 'invert': True, 'name': 'l2'},
-           {'func': norms.l3, 'invert': True, 'name': 'l3'},
-           {'func': norms.cosine_similarity, 'invert': False, 'name': 'cosine_similarity'}]
-scalings = [{'corpus_frequency': normalised_english_counts,
-             'scaling': norms.normalise,
-             'name': 'normalised'},
-            {'corpus_frequency': euclidean_scaled_english_counts,
-             'scaling': norms.euclidean_scale,
-             'name': 'euclidean_scaled'}]
-```
+# How much ciphertext do we need?
 
-Use this to make all nine scoring functions.
+## Let's do an experiment to find out
 
+1. Load the whole corpus into a string (sanitised)
+2. Select a random chunk of plaintext and a random key
+3. Encipher the text
+4. Score 1 point if `caesar_break()` recovers the correct key
+5. Repeat many times and with many plaintext lengths
 
 </textarea>