From: Neil Smith
Date: Wed, 12 Mar 2014 12:46:44 +0000 (+0000)
Subject: Moved discussion of accents to cipher breaking
X-Git-Url: https://git.njae.me.uk/?p=cipher-training.git;a=commitdiff_plain;h=12030d72c08d30d157f59a692c0f9ca10e57f655

Moved discussion of accents to cipher breaking
---

diff --git a/slides/caesar-break.html b/slides/caesar-break.html
index a47e236..187719d 100644
--- a/slides/caesar-break.html
+++ b/slides/caesar-break.html
@@ -141,6 +141,53 @@ Counting letters in _War and Peace_ gives all manner of junk.
 
 * Convert the text in canonical form (lower case, accents removed, non-letters stripped) before counting
 
+```python
+[l.lower() for l in text if ...]
+```
+---
+
+
+# Accents
+
+```python
+>>> caesar_encipher_letter('é', 1)
+```
+What does it produce?
+
+What should it produce?
+
+## Unicode, combining codepoints, and normal forms
+
+Text encodings will bite you when you least expect it.
+
+* urlencoding is the other pain point.
+
+---
+
+# Five minutes on StackOverflow later...
+
+```python
+def unaccent(text):
+    """Remove all accents from letters.
+    It does this by converting the unicode string to decomposed compatibility
+    form, dropping all the combining accents, then re-encoding the bytes.
+
+    >>> unaccent('hello')
+    'hello'
+    >>> unaccent('HELLO')
+    'HELLO'
+    >>> unaccent('héllo')
+    'hello'
+    >>> unaccent('héllö')
+    'hello'
+    >>> unaccent('HÉLLÖ')
+    'HELLO'
+    """
+    return unicodedata.normalize('NFKD', text).\
+        encode('ascii', 'ignore').\
+        decode('utf-8')
+```
+
 ---
 
 # Vector distances
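
Note on the `unaccent` function added in this commit: the docstring describes decomposing the string and then dropping the combining accents. The sketch below is a minimal, standalone illustration of that idea, not the slide's exact code. It spells out the `unicodedata` import that the slide snippet leaves implicit, and `codepoints` is a hypothetical helper added here only to make the decomposition visible.

```python
import unicodedata


def unaccent(text):
    """Same approach as the commit's unaccent, with the import spelled out."""
    # NFKD splits an accented letter into a base letter plus combining marks.
    decomposed = unicodedata.normalize('NFKD', text)
    # Encoding to ASCII with errors='ignore' silently drops every non-ASCII
    # codepoint, which includes the combining marks.
    return decomposed.encode('ascii', 'ignore').decode('ascii')


def codepoints(text):
    """Hypothetical helper: name each codepoint after NFKD decomposition."""
    return [unicodedata.name(c) for c in unicodedata.normalize('NFKD', text)]


print(codepoints('\u00e9'))        # the two codepoints behind 'é'
print(unaccent('h\u00e9ll\u00f6'))  # 'héllö'
print(unaccent('stra\u00dfe'))      # 'straße'
```

One consequence of this design worth knowing: letters with no decomposition, such as 'ß' or 'ø', are not mapped to an ASCII lookalike. They are removed altogether, so 'straße' comes back as 'strae'. For frequency counting on English text that is probably acceptable, but it may matter for other languages.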