## Abstraction: frequency of letter counts
+.float-right[![right-aligned Letter frequencies](letter-frequency-treemap.png)]
+
Letter | Count
-------|------
a | 489107
------------|---------|---------|---------|---------|---------|-------
Probability | 0.06723 | 0.02159 | 0.02748 | 0.02748 | 0.01607 | 1.76244520 × 10<sup>-8</sup>
-(Implmentation issue: this can often underflow, so get in the habit of rephrasing it as `\( \sum_i \log p_i \)`)
+(Implementation issue: this product can often underflow, so we rephrase it as `\( \sum_i \log p_i \)`)
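
A minimal sketch of the underflow problem, assuming the rounded letter probabilities from the `hello` table. The names `letter_probs`, `product_probability` and `log_probability` are made up for this illustration and are not part of the course code.

```python
import math

# Illustrative values only: the rounded probabilities shown for "hello".
letter_probs = {'h': 0.06723, 'e': 0.02159, 'l': 0.02748, 'o': 0.01607}

def product_probability(text, probs):
    """Multiply the per-letter probabilities together (prone to underflow)."""
    result = 1.0
    for letter in text:
        result *= probs[letter]
    return result

def log_probability(text, probs):
    """Sum the logs of the per-letter probabilities instead of multiplying."""
    return sum(math.log(probs[letter]) for letter in text)

short_product = product_probability('hello', letter_probs)
# 500 letters: the product is far smaller than the smallest float, so it becomes 0.0
long_product = product_probability('hello' * 100, letter_probs)
# The sum of logs stays a perfectly ordinary (large, negative) number
long_log = log_probability('hello' * 100, letter_probs)
```

Larger (less negative) log probabilities still mean more likely text, so scores can be compared without ever converting back to a product.
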
Letter | h | e | l | l | o | hello
------------|---------|---------|---------|---------|---------|-------
# Five minutes on StackOverflow later...
```python
+import unicodedata
+
def unaccent(text):
    """Remove all accents from letters.
    It does this by converting the unicode string to decomposed compatibility
# Reading letter probabilities
-New file: `language_models.py`
-
1. Load the file `count_1l.txt` into a dict, with letters as keys.
2. Normalise the counts (components of vector sum to 1): `$$ \hat{\mathbf{x}} = \frac{\mathbf{x}}{\| \mathbf{x} \|} = \frac{\mathbf{x}}{ \mathbf{x}_1 + \mathbf{x}_2 + \mathbf{x}_3 + \dots }$$`
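
The two steps above could be sketched like this. It assumes each line of the counts file holds a letter and its count separated by a tab. That layout is a guess, so adjust `sep` to match the real file. The function names `load_counts` and `normalise` are hypothetical.

```python
import collections

def load_counts(filename, sep='\t'):
    """Read `letter<sep>count` lines into a Counter keyed by letter.

    The file layout is an assumption; change `sep` if the file differs.
    """
    counts = collections.Counter()
    with open(filename) as f:
        for line in f:
            if not line.strip():
                continue  # skip blank lines
            letter, count = line.rstrip('\n').split(sep)
            counts[letter] = int(count)
    return counts

def normalise(counts):
    """Scale the values so that they sum to 1."""
    total = sum(counts.values())
    return {letter: count / total for letter, count in counts.items()}
```

Under those assumptions, `normalise(load_counts('count_1l.txt'))` would give a dict of letter probabilities.
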
Experiment in IPython (ephemeral, for us)
-Once you've got something working, copy the code into a `.py` file (permanent and reusable)
+Once you've got something working, export the code into a `.py` file (permanent and reusable)
```python
from importlib import reload
''.join()
```
+You'll be doing this a lot, so define a couple of utility functions:
+
+```python
+cat = ''.join
+wcat = ' '.join
+```
+
+`cat` after the Unix command (_concatenate_ files), `wcat` for _word concatenate_.
+
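A quick sketch of the two helpers in use. The variable names are for illustration only.

```python
cat = ''.join
wcat = ' '.join

joined_letters = cat(['h', 'e', 'l', 'l', 'o'])   # 'hello'
joined_words = wcat(['hello', 'world'])           # 'hello world'

# Both accept any iterable of strings, including a generator expression
only_letters = cat(c for c in 'Hello, world!' if c.isalpha())  # 'Helloworld'
```
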
</textarea>
<script src="http://gnab.github.io/remark/downloads/remark-0.6.0.min.js" type="text/javascript">
</script>