X-Git-Url: https://git.njae.me.uk/?a=blobdiff_plain;f=slides%2Fcaesar-break.html;h=187719da5f81e687adcac89c30c90646663df232;hb=12030d72c08d30d157f59a692c0f9ca10e57f655;hp=a47e2364c9d75c1da8f30993d41581d90739e467;hpb=5b51a469cc152b1035a5cf69f2c38d51f9d16eb8;p=cipher-training.git

diff --git a/slides/caesar-break.html b/slides/caesar-break.html
index a47e236..187719d 100644
--- a/slides/caesar-break.html
+++ b/slides/caesar-break.html
@@ -141,6 +141,53 @@ Counting letters in _War and Peace_ gives all manner of junk.
 
 * Convert the text to canonical form (lower case, accents removed, non-letters stripped) before counting
 
+```python
+[l.lower() for l in text if ...]
+```
+---
+
+
+# Accents
+
+```python
+>>> caesar_encipher_letter('é', 1)
+```
+What does it produce?
+
+What should it produce?
+
+## Unicode, combining codepoints, and normal forms
+
+Text encodings will bite you when you least expect it.
+
+* urlencoding is the other pain point.
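+
+A minimal sketch of why normal forms matter, using only the standard `unicodedata` module. The variable names `composed` and `decomposed` are illustrative, not part of the course code. The same visible letter can be stored as one codepoint or as a base letter plus a combining accent:
+
+```python
+import unicodedata
+
+composed = '\u00e9'      # 'é' as a single precomposed codepoint
+decomposed = 'e\u0301'   # 'e' followed by COMBINING ACUTE ACCENT
+
+# The two strings render identically but are different sequences of codepoints.
+print(composed == decomposed)           # False
+print(len(composed), len(decomposed))   # 1 2
+
+# Normalising both to the same form makes them comparable.
+print(unicodedata.normalize('NFC', decomposed) == composed)   # True
+print(unicodedata.normalize('NFD', composed) == decomposed)   # True
+
+# combining() returns a non-zero class for combining marks.
+print([unicodedata.combining(c) for c in decomposed])   # [0, 230]
+```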
+
+---
+
+# Five minutes on StackOverflow later...
+
+```python
+import unicodedata
+
+def unaccent(text):
+    """Remove all accents from letters.
+    It does this by converting the Unicode string to decomposed compatibility
+    form (NFKD), encoding to ASCII while ignoring the combining accents (and
+    any other non-ASCII characters), then decoding the bytes back to a string.
+
+    >>> unaccent('hello')
+    'hello'
+    >>> unaccent('HELLO')
+    'HELLO'
+    >>> unaccent('héllo')
+    'hello'
+    >>> unaccent('héllö')
+    'hello'
+    >>> unaccent('HÉLLÖ')
+    'HELLO'
+    """
+    return unicodedata.normalize('NFKD', text).\
+        encode('ascii', 'ignore').\
+        decode('utf-8')
+```
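+
+One possible way to combine this with the earlier canonical-form step is sketched below. The name `sanitise` is a hypothetical helper introduced here for illustration, and `unaccent` is repeated only so the block runs on its own:
+
+```python
+import string
+import unicodedata
+
+def unaccent(text):
+    """Same approach as above: decompose, then drop anything non-ASCII."""
+    return unicodedata.normalize('NFKD', text).\
+        encode('ascii', 'ignore').\
+        decode('utf-8')
+
+def sanitise(text):
+    """Hypothetical helper: remove accents, lower-case, keep only a-z."""
+    return ''.join(l for l in unaccent(text).lower()
+                   if l in string.ascii_lowercase)
+
+print(sanitise('Héllo, Wörld!'))   # helloworld
+```
+
+Letters with no decomposition, such as 'ø', are not unaccented by this approach: the ASCII encoding step simply drops them.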
+
 ---
 
 # Vector distances