From: Neil Smith
Date: Sun, 27 Mar 2016 12:35:08 +0000 (+0100)
Subject: Merge conflict resolved
X-Git-Url: https://git.njae.me.uk/?a=commitdiff_plain;h=346ad6e81b0a018e2164ba407412cf2045c55ef0;hp=50f2a1425d9523aaff91ea81fa098a59f44b69cd;p=cipher-training.git

Merge conflict resolved
---

diff --git a/count_1l.txt b/count_1l.txt
new file mode 100644
index 0000000..e9ac0c6
--- /dev/null
+++ b/count_1l.txt
@@ -0,0 +1,26 @@
+e 758103
+t 560576
+o 504520
+a 490129
+i 421240
+n 419374
+h 416369
+s 404473
+r 373599
+d 267917
+l 259023
+u 190269
+m 172199
+w 154157
+y 143040
+c 141094
+f 135318
+g 117888
+p 100690
+b 92919
+v 65297
+k 54248
+x 7414
+j 6679
+q 5499
+z 3577
diff --git a/slides/caesar-break.html b/slides/caesar-break.html
index 7a2fbf6..81a8396 100644
--- a/slides/caesar-break.html
+++ b/slides/caesar-break.html
@@ -112,6 +112,8 @@ How do we define "closeness"?
 
 ## Abstraction: frequency of letter counts
 
+.float-right[![right-aligned Letter frequencies](letter-frequency-treemap.png)]
+
 Letter | Count
 -------|------
 a | 489107
@@ -146,7 +148,7 @@ Letter | i | f | m | m | p | ifmmp
 ------------|---------|---------|---------|---------|---------|-------
 Probability | 0.06723 | 0.02159 | 0.02748 | 0.02748 | 0.01607 | 1.76244520 × 10-8
 
-(Implmentation issue: this can often underflow, so get in the habit of rephrasing it as `\( \sum_i \log p_i \)`)
+(Implmentation issue: this can often underflow, so we rephrase it as `\( \sum_i \log p_i \)`)
 
 Letter | h | e | l | l | o | hello
 ------------|---------|---------|---------|---------|---------|-------
@@ -207,6 +209,8 @@ Text encodings will bite you when you least expect it.
 # Five minutes on StackOverflow later...
 
 ```python
+import unicodedata
+
 def unaccent(text):
     """Remove all accents from letters.
     It does this by converting the unicode string to decomposed compatibility
@@ -246,8 +250,6 @@ with open('count_1l.txt', 'w') as f:
 
 # Reading letter probabilities
 
-New file: `language_models.py`
-
 1. Load the file `count_1l.txt` into a dict, with letters as keys.
 
 2. Normalise the counts (components of vector sum to 1): `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3 + \dots }$$`
diff --git a/slides/caesar-encipher.html b/slides/caesar-encipher.html
index 4afd78d..279c2bd 100644
--- a/slides/caesar-encipher.html
+++ b/slides/caesar-encipher.html
@@ -90,7 +90,7 @@ Before doing anything, create a new branch in Git
 
 Experiment in IPython (ephemeral, for us)
 
-Once you've got something working, copy the code into a `.py` file (permanent and reusable)
+Once you've got something working, export the code into a `.py` file (permanent and reusable)
 
 ```python
 from imp import reload
@@ -224,6 +224,15 @@ ciphertext = [caesar_encipher_letter(p, key) for p in plaintext]
 
 ''.join()
 ```
+You'll be doing this a lot, so define a couple of utility functions:
+
+```python
+cat = ''.join
+wcat = ' '.join
+```
+
+`cat` after the Unix command (_concatenate_ files), `wcat` for _word concatenate_.
+
 
 
 
diff --git a/slides/letter-frequency-treemap.png b/slides/letter-frequency-treemap.png
new file mode 100644
index 0000000..256230e
Binary files /dev/null and b/slides/letter-frequency-treemap.png differ
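
Note on the `caesar-break.html` hunks: the two "Reading letter probabilities" steps (load `count_1l.txt` into a dict, then normalise so the values sum to 1) and the log-sum rephrasing of the probability product could be sketched roughly as follows. This is an illustrative sketch only, not the repository's own code: the names `load_counts`, `normalise` and `log_probability` are hypothetical, the counts come from an inline copy of the `count_1l.txt` added in this commit rather than from disk, and the choice of base-10 logarithms is an assumption.

```python
import math

# Inline copy of the count_1l.txt contents added in this commit,
# so the sketch runs without the file being present.
COUNT_1L = """\
e 758103
t 560576
o 504520
a 490129
i 421240
n 419374
h 416369
s 404473
r 373599
d 267917
l 259023
u 190269
m 172199
w 154157
y 143040
c 141094
f 135318
g 117888
p 100690
b 92919
v 65297
k 54248
x 7414
j 6679
q 5499
z 3577
"""


def load_counts(lines):
    """Build a dict of letter -> count from 'letter count' lines (hypothetical helper)."""
    counts = {}
    for line in lines:
        if line.strip():
            letter, count = line.split()  # split() copes with spaces or tabs
            counts[letter] = int(count)
    return counts


def normalise(counts):
    """Return a new dict whose values are scaled to sum to 1."""
    total = sum(counts.values())
    return {letter: count / total for letter, count in counts.items()}


def log_probability(text, probabilities):
    """Sum the log probabilities of the letters in text, avoiding underflow
    from multiplying many small numbers. Characters without a known
    probability are skipped (an assumption of this sketch)."""
    return sum(math.log10(probabilities[letter])
               for letter in text.lower()
               if letter in probabilities)


counts = load_counts(COUNT_1L.splitlines())
probabilities = normalise(counts)

# A higher (less negative) score means the text looks more like English.
print(log_probability('hello', probabilities))
print(log_probability('ifmmp', probabilities))
```

With a file on disk, `load_counts` could be passed an open file object instead, since it only iterates over lines.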