From: Neil Smith
Date: Wed, 16 Jul 2014 15:29:25 +0000 (+0100)
Subject: Copied updated slides across
X-Git-Url: https://git.njae.me.uk/?p=cipher-training.git;a=commitdiff_plain;h=995a501e53864ff95b984e846966162d851ee9b9

Copied updated slides across
---

diff --git a/slides/caesar-break.html b/slides/caesar-break.html
index 4d2ebfa..7a2fbf6 100644
--- a/slides/caesar-break.html
+++ b/slides/caesar-break.html
@@ -128,11 +128,11 @@ Use this to predict the probability of each letter, and hence the probability of
 
 ---
 
-# An infinite number of monkeys
+.float-right[![right-aligned Typing monkey](typingmonkeylarge.jpg)]
 
-What is the probability that this string of letters is a sample of English?
+# Naive Bayes, or the bag of letters
 
-## Naive Bayes, or the bag of letters
+What is the probability that this string of letters is a sample of English?
 
 Ignore letter order, just treat each letter individually.
 
@@ -234,13 +234,20 @@ def unaccent(text):
 
 1. Read from `shakespeare.txt`, `sherlock-holmes.txt`, and `war-and-peace.txt`.
 2. Find the frequencies (`.update()`)
-3. Sort by count
-4. Write counts to `count_1l.txt` (`'text{}\n'.format()`)
+3. Sort by count (read the docs...)
+4. Write counts to `count_1l.txt`
+```python
+with open('count_1l.txt', 'w') as f:
+    for each letter...:
+        f.write('text\t{}\n'.format(count))
+```
 
 ---
 
 # Reading letter probabilities
 
+New file: `language_models.py`
+
 1. Load the file `count_1l.txt` into a dict, with letters as keys.
 2. Normalise the counts (components of vector sum to 1):
 `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3 + \dots }$$`
@@ -257,6 +264,8 @@ def unaccent(text):
 
 # Breaking caesar ciphers
 
+New file: `cipherbreak.py`
+
 ## Remember the basic idea
 
 ```
diff --git a/slides/caesar-encipher.html b/slides/caesar-encipher.html
index 4ef1d34..4afd78d 100644
--- a/slides/caesar-encipher.html
+++ b/slides/caesar-encipher.html
@@ -82,6 +82,34 @@ chr()
 
 ---
 
+# Using the tools
+
+Before doing anything, create a new branch in Git
+
+* This will keep your changes isolated
+
+Experiment in IPython (ephemeral, for us)
+
+Once you've got something working, copy the code into a `.py` file (permanent and reusable)
+
+```python
+from imp import reload
+
+import test
+reload(test)
+from test import *
+```
+
+Re-evaluate the second cell to reload the file into the IPython notebook
+
+When you've made progress, make a Git commit
+
+* Commit early and often!
+
+When you've finished, change back to `master` branch and `merge` the development branch
+
+---
+
 # The [string module](http://docs.python.org/3.3/library/string.html) is your friend
 
 ```python
@@ -95,6 +123,7 @@ string.punctuation
 ```
 
 ---
+
 # DRY and YAGNI
 
 Is your code DRY?
@@ -131,7 +160,7 @@ if __name__ == "__main__":
 
 ---
 
-# Doing all the letters
+# Doing the whole message
 
 ## Test-first developement
 
@@ -142,7 +171,7 @@ if __name__ == "__main__":
 
 ---
 
-# Doing all the letters
+# Doing the whole message
 
 ## Abysmal
 
@@ -152,9 +181,11 @@ for i in range(len(plaintext)):
     ciphertext += caesar_encipher_letter(plaintext[i], key)
 ```
 
+Try it in IPython
+
 ---
 
-# Doing all the letters
+# Doing the whole message
 
 ## Bad
 
@@ -168,7 +199,7 @@ for p in plaintext:
 
 ---
 
-# Doing all the letters
+# Doing the whole message
 
 ## Good (but unPythonic)
 
@@ -178,7 +209,7 @@ ciphertext = map(lambda p: caesar_encipher_letter(p, key), plaintext)
 
 ---
 
-# Doing all the letters
+# Doing the whole message
 
 ## Best
 
diff --git a/slides/keyword-break.html b/slides/keyword-break.html
index 08013f3..ddf82c1 100644
--- a/slides/keyword-break.html
+++ b/slides/keyword-break.html
@@ -115,7 +115,11 @@ for each key:
 
 Repetition of code is a bad smell.
 
-Separate the 'try all keys, keep the best' logic from the 'score this one key' logic.
+Separate out
+
+* enumerate the keys
+* score a key
+* find the key with the best score
 
 ---
 
diff --git a/slides/typingmonkeylarge.jpg b/slides/typingmonkeylarge.jpg
new file mode 100644
index 0000000..8078671
Binary files /dev/null and b/slides/typingmonkeylarge.jpg differ
diff --git a/slides/word-segmentation.html b/slides/word-segmentation.html
index 35721ab..6215255 100644
--- a/slides/word-segmentation.html
+++ b/slides/word-segmentation.html
@@ -129,7 +129,7 @@ Constructor (`__init__`) takes a data file, does all the adding up and taking lo
 ```python
 class Pdist(dict):
     def __init__(self, data=[]):
-        for key, count in data2:
+        for key, count in data:
             ...
         self.total = ...
     def __missing__(self, key):
@@ -177,9 +177,9 @@ To segment a string:
     return the split with highest score
 ```
 
-Indexing pulls out letters. `'sometext'[0]` = 's' ; `'keyword'[3]` = 'e' ; `'keyword'[-1]` = 't'
+Indexing pulls out letters. `'sometext'[0]` = 's' ; `'sometext'[3]` = 'e' ; `'sometext'[-1]` = 't'
 
-Slices pulls out substrings. `'keyword'[1:4]` = 'ome' ; `'keyword'[:3]` = 'som' ; `'keyword'[5:]` = 'ext'
+Slices pulls out substrings. `'sometext'[1:4]` = 'ome' ; `'sometext'[:3]` = 'som' ; `'sometext'[5:]` = 'ext'
 
 `range()` will sweep across the string
 
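
The indexing and slicing examples corrected in the last hunk can be checked directly, and the `range()` remark can be sketched as a sweep over split positions. This is only an illustration: the `splits` helper is a hypothetical name chosen here, not code taken from the slides or the repository.

```python
text = 'sometext'

# Indexing pulls out single letters.
first, fourth, last = text[0], text[3], text[-1]

# Slicing pulls out substrings.
middle, head, tail = text[1:4], text[:3], text[5:]


def splits(s):
    """Hypothetical helper: every (first portion, remainder) pair of s.

    range() sweeps the split position across the string; the first
    portion is never empty, the remainder may be.
    """
    return [(s[:i], s[i:]) for i in range(1, len(s) + 1)]


if __name__ == "__main__":
    print(first, fourth, last)
    print(middle, head, tail)
    print(splits('abc'))
```

A word segmenter along the lines of the slide's pseudocode would presumably score each pair from such a sweep and keep the best one; that scoring step is not shown here.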