X-Git-Url: https://git.njae.me.uk/?a=blobdiff_plain;f=slides%2Fcaesar-break.html;fp=slides%2Fcaesar-break.html;h=7a2fbf6d550cbd8e8dc90bcea3b694c0e4dcd293;hb=2def5210d568279bb38890c2de282e171f4ff7dd;hp=4d2ebfa0d01d556dbdd37599f3e4320f92fc4031;hpb=31407ccd2650f1467329b8aece52a2b889d53341;p=cipher-training.git

diff --git a/slides/caesar-break.html b/slides/caesar-break.html
index 4d2ebfa..7a2fbf6 100644
--- a/slides/caesar-break.html
+++ b/slides/caesar-break.html
@@ -128,11 +128,11 @@ Use this to predict the probability of each letter, and hence the probability of
 
 ---
 
-# An infinite number of monkeys
+.float-right[![right-aligned Typing monkey](typingmonkeylarge.jpg)]
 
-What is the probability that this string of letters is a sample of English?
+# Naive Bayes, or the bag of letters
 
-## Naive Bayes, or the bag of letters
+What is the probability that this string of letters is a sample of English?
 
 Ignore letter order, just treat each letter individually.
 
@@ -234,13 +234,20 @@ def unaccent(text):
 1. Read from `shakespeare.txt`, `sherlock-holmes.txt`, and `war-and-peace.txt`.
 2. Find the frequencies (`.update()`)
-3. Sort by count
-4. Write counts to `count_1l.txt` (`'text{}\n'.format()`)
+3. Sort by count (read the docs...)
+4. Write counts to `count_1l.txt`
+```python
+with open('count_1l.txt', 'w') as f:
+    for each letter...:
+        f.write('text\t{}\n'.format(count))
+```
 
 ---
 
 # Reading letter probabilities
 
+New file: `language_models.py`
+
 1. Load the file `count_1l.txt` into a dict, with letters as keys.
 2. Normalise the counts (components of vector sum to 1):
    `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3 + \dots }$$`
@@ -257,6 +264,8 @@ def unaccent(text):
 
 # Breaking caesar ciphers
 
+New file: `cipherbreak.py`
+
 ## Remember the basic idea
 
 ```
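
The exercise in the second hunk (read the three corpus files, count letter frequencies with `.update()`, sort by count, write `count_1l.txt`) could be completed along these lines. This is a sketch only: the tab-separated `letter\tcount` output format comes from the slide's hint, but the restriction to lowercase a–z and the use of `Counter.most_common()` for the sorting step are assumptions, not taken from the repository.

```python
import collections
import string

counts = collections.Counter()
for corpus in ['shakespeare.txt', 'sherlock-holmes.txt', 'war-and-peace.txt']:
    with open(corpus) as f:
        text = f.read().lower()
    # Count only the letters a-z; everything else is ignored (an assumption).
    counts.update(c for c in text if c in string.ascii_lowercase)

# Write one "letter<TAB>count" line per letter, most common first.
with open('count_1l.txt', 'w') as f:
    for letter, count in counts.most_common():
        f.write('{}\t{}\n'.format(letter, count))
```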
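The "Reading letter probabilities" slide asks for those counts to be loaded and normalised so they sum to 1, and the "Naive Bayes, or the bag of letters" slide asks for the probability that a string is English, ignoring letter order. A minimal sketch of both steps, assuming the `count_1l.txt` format written above; the names `read_counts` and `Pletters` and the use of log probabilities are illustrative, not necessarily what `language_models.py` in the repository contains.

```python
import math

def read_counts(filename='count_1l.txt'):
    # Load "letter<TAB>count" lines into a dict of ints.
    counts = {}
    with open(filename) as f:
        for line in f:
            letter, count = line.split('\t')
            counts[letter] = int(count)
    return counts

counts = read_counts()
total = sum(counts.values())
# Normalise so the components sum to 1, as in the formula on the slide.
Pl = {letter: count / total for letter, count in counts.items()}

def Pletters(message):
    # Score the message as a bag of letters: sum of log-probabilities,
    # ignoring letter order and any character without a probability.
    return sum(math.log10(Pl[letter]) for letter in message if letter in Pl)
```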
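The closing hunk introduces `cipherbreak.py` and "the basic idea", but the code block itself is cut off at the hunk boundary. As a hedged sketch of that idea — try every shift, score each trial decryption, keep the best — assuming a letter-fitness function such as the `Pletters` sketch above and a decipher function like the one built in the earlier slides:

```python
import string

def caesar_decipher(message, shift):
    # Undo a Caesar shift on lowercase letters, leaving other characters alone.
    lower = string.ascii_lowercase
    mapping = {c: lower[(i - shift) % 26] for i, c in enumerate(lower)}
    return ''.join(mapping.get(c, c) for c in message)

def caesar_break(message, fitness):
    # Try all 26 shifts and keep the one whose decryption scores best.
    best_shift, best_fit = 0, float('-inf')
    for shift in range(26):
        fit = fitness(caesar_decipher(message, shift))
        if fit > best_fit:
            best_shift, best_fit = shift, fit
    return best_shift, best_fit
```

Called as `caesar_break(ciphertext, Pletters)`, this would return the most plausible shift together with its fitness score.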