From 12030d72c08d30d157f59a692c0f9ca10e57f655 Mon Sep 17 00:00:00 2001
From: Neil Smith
Date: Wed, 12 Mar 2014 12:46:44 +0000
Subject: [PATCH] Moved discussion of accents to cipher breaking

---
 slides/caesar-break.html | 47 ++++++++++++++++++++++++++++++++++++++++
 1 file changed, 47 insertions(+)

diff --git a/slides/caesar-break.html b/slides/caesar-break.html
index a47e236..187719d 100644
--- a/slides/caesar-break.html
+++ b/slides/caesar-break.html
@@ -141,6 +141,53 @@ Counting letters in _War and Peace_ gives all manner of junk.
 
 * Convert the text in canonical form (lower case, accents removed, non-letters stripped) before counting
 
+```python
+[l.lower() for l in text if ...]
+```
+---
+
+
+# Accents
+
+```python
+>>> caesar_encipher_letter('é', 1)
+```
+What does it produce?
+
+What should it produce?
+
+## Unicode, combining codepoints, and normal forms
+
+Text encodings will bite you when you least expect it.
+
+* urlencoding is the other pain point.
+
+---
+
+# Five minutes on StackOverflow later...
+
+```python
+def unaccent(text):
+    """Remove all accents from letters.
+    It does this by converting the unicode string to decomposed compatibility
+    form, dropping all the combining accents, then re-encoding the bytes.
+
+    >>> unaccent('hello')
+    'hello'
+    >>> unaccent('HELLO')
+    'HELLO'
+    >>> unaccent('héllo')
+    'hello'
+    >>> unaccent('héllö')
+    'hello'
+    >>> unaccent('HÉLLÖ')
+    'HELLO'
+    """
+    return unicodedata.normalize('NFKD', text).\
+        encode('ascii', 'ignore').\
+        decode('utf-8')
+```
+
 ---
 
 # Vector distances
-- 
2.34.1
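
Note on the `unaccent` slide: the docstring describes two steps (decompose to NFKD, then drop the combining accents), and a small sketch may make them easier to see separately. The helper names `strip_combining` and `unaccent_ascii` below are hypothetical and not part of the patch. `unaccent_ascii` mirrors the patch's approach but decodes as ASCII. `strip_combining` is an alternative that removes only combining marks, shown for comparison.

```python
import unicodedata


def strip_combining(text):
    """Hypothetical helper: decompose to NFKD, then drop only combining marks."""
    decomposed = unicodedata.normalize('NFKD', text)
    # unicodedata.combining() returns 0 for base characters, non-zero for accents.
    return ''.join(c for c in decomposed if not unicodedata.combining(c))


def unaccent_ascii(text):
    """Hypothetical helper mirroring the patch: drop every non-ASCII codepoint."""
    return (unicodedata.normalize('NFKD', text)
            .encode('ascii', 'ignore')
            .decode('ascii'))


# 'é' decomposes into a plain 'e' followed by U+0301 (combining acute accent).
print([hex(ord(c)) for c in unicodedata.normalize('NFKD', 'é')])

print(strip_combining('héllö'))   # accents removed, base letters kept
print(unaccent_ascii('HÉLLÖ'))    # same result for accented Latin letters

# The two differ on letters with no decomposition, such as 'ß':
print(strip_combining('straße'))  # 'ß' is kept
print(unaccent_ascii('straße'))   # 'ß' is silently dropped
```

For frequency counting the ASCII route is probably what the slides want, since anything left outside a-z would be stripped as a non-letter anyway. The difference matters only if the unaccented text is reused elsewhere.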