X-Git-Url: https://git.njae.me.uk/?a=blobdiff_plain;f=slides%2Fcaesar-break.html;h=187719da5f81e687adcac89c30c90646663df232;hb=12030d72c08d30d157f59a692c0f9ca10e57f655;hp=a47e2364c9d75c1da8f30993d41581d90739e467;hpb=5b51a469cc152b1035a5cf69f2c38d51f9d16eb8;p=cipher-training.git

diff --git a/slides/caesar-break.html b/slides/caesar-break.html
index a47e236..187719d 100644
--- a/slides/caesar-break.html
+++ b/slides/caesar-break.html
@@ -141,6 +141,53 @@ Counting letters in _War and Peace_ gives all manner of junk.
 
 * Convert the text in canonical form (lower case, accents removed, non-letters stripped) before counting
 
+```python
+[l.lower() for l in text if ...]
+```
+---
+
+
+# Accents
+
+```python
+>>> caesar_encipher_letter('é', 1)
+```
+What does it produce?
+
+What should it produce?
+
+## Unicode, combining codepoints, and normal forms
+
+Text encodings will bite you when you least expect it.
+
+* urlencoding is the other pain point.
+
+---
+
+# Five minutes on StackOverflow later...
+
+```python
+def unaccent(text):
+    """Remove all accents from letters.
+    It does this by converting the unicode string to decomposed compatibility
+    form, dropping all the combining accents, then re-encoding the bytes.
+
+    >>> unaccent('hello')
+    'hello'
+    >>> unaccent('HELLO')
+    'HELLO'
+    >>> unaccent('héllo')
+    'hello'
+    >>> unaccent('héllö')
+    'hello'
+    >>> unaccent('HÉLLÖ')
+    'HELLO'
+    """
+    return unicodedata.normalize('NFKD', text).\
+        encode('ascii', 'ignore').\
+        decode('utf-8')
+```
+
 ---
 
 # Vector distances
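
Note on the `unaccent` hunk: the slide's function calls `unicodedata.normalize`, but the hunk shows no import, so it presumably relies on `import unicodedata` elsewhere. The sketch below is a self-contained variant for experimenting, not the committed code. It adds the import explicitly and decodes as ASCII, which gives the same result as the slide's `'utf-8'` because only ASCII bytes survive the encode step. `show_decomposition` is a hypothetical helper, not part of the repository, included to make the "combining codepoints" idea visible.

```python
import unicodedata


def unaccent(text):
    """Remove accents by decomposing to NFKD, then dropping non-ASCII codepoints."""
    decomposed = unicodedata.normalize('NFKD', text)
    # 'ignore' silently discards every codepoint that has no ASCII encoding,
    # which includes the combining accents split off by NFKD.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


def show_decomposition(text):
    """Hypothetical helper: name each codepoint after NFKD normalisation."""
    return [unicodedata.name(c) for c in unicodedata.normalize('NFKD', text)]


if __name__ == '__main__':
    print(show_decomposition('é'))
    print(unaccent('héllö'))
    print(unaccent('søren'))
```

Run as a script, this shows that `'é'` decomposes into `LATIN SMALL LETTER E` followed by `COMBINING ACUTE ACCENT`, and that `'héllö'` becomes `'hello'`. One caveat of this approach: letters with no decomposition, such as `'ø'`, are dropped entirely rather than mapped to a base letter, so `'søren'` becomes `'sren'`.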