From: Neil Smith <neil.git@njae.me.uk>
Date: Wed, 12 Mar 2014 12:46:44 +0000 (+0000)
Subject: Moved discussion of accents to cipher breaking
X-Git-Url: https://git.njae.me.uk/?p=cipher-training.git;a=commitdiff_plain;h=12030d72c08d30d157f59a692c0f9ca10e57f655

Moved discussion of accents to cipher breaking
---

diff --git a/slides/caesar-break.html b/slides/caesar-break.html
index a47e236..187719d 100644
--- a/slides/caesar-break.html
+++ b/slides/caesar-break.html
@@ -141,6 +141,53 @@ Counting letters in _War and Peace_ gives all manner of junk.
 
 * Convert the text in canonical form (lower case, accents removed, non-letters stripped) before counting
 
+```python
+[l.lower() for l in text if ...]
+```
+---
+
+
+# Accents
+
+```python
+>>> caesar_encipher_letter('é', 1)
+```
+What does it produce?
+
+What should it produce?
+
+## Unicode, combining codepoints, and normal forms
+
+Text encodings will bite you when you least expect it.
+
+* urlencoding is the other pain point.
+
+---
+
+# Five minutes on StackOverflow later...
+
+```python
+def unaccent(text):
+    """Remove all accents from letters. 
+    It does this by converting the unicode string to decomposed compatibility
+    form, dropping all the combining accents, then re-encoding the bytes.
+
+    >>> unaccent('hello')
+    'hello'
+    >>> unaccent('HELLO')
+    'HELLO'
+    >>> unaccent('héllo')
+    'hello'
+    >>> unaccent('héllö')
+    'hello'
+    >>> unaccent('HÉLLÖ')
+    'HELLO'
+    """
+    return unicodedata.normalize('NFKD', text).\
+        encode('ascii', 'ignore').\
+        decode('utf-8')
+```
+
 ---
 
 # Vector distances