Merge branch 'master' of git.njae.me.uk:cas-master-teacher-training
[cas-master-teacher-training.git] / word_filter_comparison.ipynb
1 {
2 "metadata": {
3 "name": "",
4 "signature": "sha256:79decad1a699da8d950a948a249c8270194d5a7efe1d66f65f37c997a74f5325"
5 },
6 "nbformat": 3,
7 "nbformat_minor": 0,
8 "worksheets": [
9 {
10 "cells": [
11 {
12 "cell_type": "markdown",
13 "metadata": {},
14 "source": [
15 "# Filtering words\n",
16 "The challenge is to read a list of words from a dictionary, and keep only those words which contain only lower-case letters. Any \"word\" that contains an upper-case letter, punctuation, spaces, or similar should be rejected on the basis that it's a proper noun, and abbreviation, or something else that means it can't be a valid target word for Hangman."
17 ]
18 },
19 {
20 "cell_type": "code",
21 "collapsed": false,
22 "input": [
23 "# Import the libraries we'll need\n",
24 "import re\n",
25 "import random\n",
26 "import string"
27 ],
28 "language": "python",
29 "metadata": {},
30 "outputs": [],
31 "prompt_number": 3
32 },
33 {
34 "cell_type": "markdown",
35 "metadata": {},
36 "source": [
37 "Get the list of all words."
38 ]
39 },
40 {
41 "cell_type": "code",
42 "collapsed": false,
43 "input": [
44 "all_words = [w.strip() for w in open('/usr/share/dict/british-english').readlines()]\n",
45 "len(all_words)"
46 ],
47 "language": "python",
48 "metadata": {},
49 "outputs": [
50 {
51 "metadata": {},
52 "output_type": "pyout",
53 "prompt_number": 4,
54 "text": [
55 "99156"
56 ]
57 }
58 ],
59 "prompt_number": 4
60 },
61 {
62 "cell_type": "markdown",
63 "metadata": {},
64 "source": [
65 "# Checking a word\n",
66 "\n",
67 "## Explicit iteration over the word\n",
68 "This function walks over the word, character by character, and checks if it's in the list of valid characters (as given in `string.ascii_lowercase`). If it's not, the `valid` flag is set to `False`. The final value is returned."
69 ]
70 },
71 {
72 "cell_type": "code",
73 "collapsed": false,
74 "input": [
75 "def check_word_explicit(word):\n",
76 " valid = True\n",
77 " for letter in word:\n",
78 " if letter not in string.ascii_lowercase:\n",
79 " valid = False\n",
80 " return valid"
81 ],
82 "language": "python",
83 "metadata": {},
84 "outputs": [],
85 "prompt_number": 5
86 },
87 {
88 "cell_type": "markdown",
89 "metadata": {},
90 "source": [
91 "### Short-circuiting explicit iteration\n",
92 "As above, but the function `return`s `False` as soon as it detects an invalid character. This should make it quicker to reject words."
93 ]
94 },
95 {
96 "cell_type": "code",
97 "collapsed": false,
98 "input": [
99 "def check_word_short_circuit(word):\n",
100 " for letter in word:\n",
101 " if letter not in string.ascii_lowercase:\n",
102 " return False\n",
103 " return True"
104 ],
105 "language": "python",
106 "metadata": {},
107 "outputs": [],
108 "prompt_number": 6
109 },
110 {
111 "cell_type": "markdown",
112 "metadata": {},
113 "source": [
114 "## Using comprehensions\n",
115 "Use a comprehension function to convert the list of letters into a list of Booleans showing whether the character in that position is a valid letter. Use the built-in `all()` function to check that all the values in the list are `True`."
116 ]
117 },
118 {
119 "cell_type": "code",
120 "collapsed": false,
121 "input": [
122 "# Examples of the idea\n",
123 "print('hello :', [letter in string.ascii_lowercase for letter in 'hello'])\n",
124 "print('heLLo :', [letter in string.ascii_lowercase for letter in 'heLLo'])"
125 ],
126 "language": "python",
127 "metadata": {},
128 "outputs": [
129 {
130 "output_type": "stream",
131 "stream": "stdout",
132 "text": [
133 "hello : [True, True, True, True, True]\n",
134 "heLLo : [True, True, False, False, True]\n"
135 ]
136 }
137 ],
138 "prompt_number": 7
139 },
140 {
141 "cell_type": "code",
142 "collapsed": false,
143 "input": [
144 "def check_word_comprehension(word):\n",
145 " return all(letter in string.ascii_lowercase for letter in word)"
146 ],
147 "language": "python",
148 "metadata": {},
149 "outputs": [],
150 "prompt_number": 8
151 },
152 {
153 "cell_type": "markdown",
154 "metadata": {},
155 "source": [
156 "### Short-circuited comprehensions\n",
157 "An attempt to be clever. Can we stop the checking of letters as soon as we've found an invalid one?"
158 ]
159 },
160 {
161 "cell_type": "code",
162 "collapsed": false,
163 "input": [
164 "def check_word_comprehension_clever(word):\n",
165 " return not any(letter not in string.ascii_lowercase for letter in word)"
166 ],
167 "language": "python",
168 "metadata": {},
169 "outputs": [],
170 "prompt_number": 9
171 },
172 {
173 "cell_type": "markdown",
174 "metadata": {},
175 "source": [
176 "## A recursive definition\n",
177 "A word if all lowercase if the first character is lowercase and the rest of the word is all lowercase. The base case is an empty word. This should evaluate to `True` because an empty list does not contain any invalid characters.\n",
178 "\n",
179 "Note the Pythonic use of \"truthiness\" values. If you try to take the Boolean value of a string, it evaluates as `False` if it's empty and `True` otherwise. Using \n",
180 "\n",
181 "` if word == '':` \n",
182 "\n",
183 "in the first line is just as correct, but not as Pythonic."
184 ]
185 },
186 {
187 "cell_type": "code",
188 "collapsed": false,
189 "input": [
190 "def check_word_recursive(word):\n",
191 " if word:\n",
192 " if word[0] not in string.ascii_lowercase:\n",
193 " return False\n",
194 " else:\n",
195 " return check_word_recursive(word[1:])\n",
196 " else:\n",
197 " return True"
198 ],
199 "language": "python",
200 "metadata": {},
201 "outputs": [],
202 "prompt_number": 10
203 },
204 {
205 "cell_type": "markdown",
206 "metadata": {},
207 "source": [
208 "## Regular expressions\n",
209 "A regular expression is a way of defining a finite state machine (FSM) that accepts some sequences of characters. They're used a lot whenever you want to process something text-based. In this case, the regex consists of:\n",
210 "* `^` : match the start of the string\n",
211 "* `[a-z]` : match a single character in the range `a` to `z`\n",
212 "* `[a-z]+` : match a sequence of one one or more characters in the range `a` to `z`\n",
213 "* `$` : match the end of the string\n",
214 "This means you have a regular expression that matches strings containing just lower-case letters with nothing else between the matched letters and the start and end of the string. \n",
215 "\n",
216 "Python has the `re.compile` feature to build the specialised FSM that does the matching. This is faster if you want to use the same regular expression a lot. If you only want to use it a few times, it's often easier to just give the regex directly. See below for examples.\n",
217 "\n",
218 "Regular expresions are incredibly powerful, but take time to learn. See the [regular expression tutorial](http://www.regular-expressions.info/tutorial.html) for a guide."
219 ]
220 },
221 {
222 "cell_type": "code",
223 "collapsed": false,
224 "input": [
225 "valid_word_re = re.compile(r'^[a-z]+$')"
226 ],
227 "language": "python",
228 "metadata": {},
229 "outputs": [],
230 "prompt_number": 11
231 },
232 {
233 "cell_type": "markdown",
234 "metadata": {},
235 "source": [
236 "# Evaluation\n",
237 "Which of these alternatives is the best?\n",
238 "\n",
239 "The important measure is whether the program is both readable and correct. You can be the judge of that (though I used a regex as a first recourse).\n",
240 "\n",
241 "We can also look at performance: which is the fastest?\n",
242 "\n",
243 "Use the IPython timing cell-magic to find out. We'll also use an `assert`ion to check that all the approaches give the same answer.\n",
244 "\n",
245 "You'll have to run the notebook to find the answer. Which do you think would be these fastest, or the slowest?"
246 ]
247 },
248 {
249 "cell_type": "code",
250 "collapsed": false,
251 "input": [
252 "valid_word_count = len([w for w in all_words if valid_word_re.match(w)])\n",
253 "valid_word_count"
254 ],
255 "language": "python",
256 "metadata": {},
257 "outputs": [
258 {
259 "metadata": {},
260 "output_type": "pyout",
261 "prompt_number": 12,
262 "text": [
263 "62856"
264 ]
265 }
266 ],
267 "prompt_number": 12
268 },
269 {
270 "cell_type": "code",
271 "collapsed": false,
272 "input": [
273 "%%timeit\n",
274 "words = [w for w in all_words if check_word_explicit(w)]\n",
275 "assert len(words) == valid_word_count"
276 ],
277 "language": "python",
278 "metadata": {},
279 "outputs": []
280 },
281 {
282 "cell_type": "code",
283 "collapsed": false,
284 "input": [
285 "%%timeit\n",
286 "words = [w for w in all_words if check_word_short_circuit(w)]\n",
287 "assert len(words) == valid_word_count"
288 ],
289 "language": "python",
290 "metadata": {},
291 "outputs": []
292 },
293 {
294 "cell_type": "code",
295 "collapsed": false,
296 "input": [
297 "%%timeit\n",
298 "words = [w for w in all_words if check_word_comprehension(w)]\n",
299 "assert len(words) == valid_word_count"
300 ],
301 "language": "python",
302 "metadata": {},
303 "outputs": []
304 },
305 {
306 "cell_type": "code",
307 "collapsed": false,
308 "input": [
309 "%%timeit\n",
310 "words = [w for w in all_words if check_word_comprehension_clever(w)]\n",
311 "assert len(words) == valid_word_count"
312 ],
313 "language": "python",
314 "metadata": {},
315 "outputs": []
316 },
317 {
318 "cell_type": "code",
319 "collapsed": false,
320 "input": [
321 "%%timeit\n",
322 "words = [w for w in all_words if check_word_recursive(w)]\n",
323 "assert len(words) == valid_word_count"
324 ],
325 "language": "python",
326 "metadata": {},
327 "outputs": []
328 },
329 {
330 "cell_type": "code",
331 "collapsed": false,
332 "input": [
333 "%%timeit\n",
334 "words = [w for w in all_words if re.match(r'^[a-z]+$', w)]\n",
335 "assert len(words) == valid_word_count"
336 ],
337 "language": "python",
338 "metadata": {},
339 "outputs": []
340 },
341 {
342 "cell_type": "code",
343 "collapsed": false,
344 "input": [
345 "%%timeit\n",
346 "words = [w for w in all_words if valid_word_re.match(w)]\n",
347 "assert len(words) == valid_word_count"
348 ],
349 "language": "python",
350 "metadata": {},
351 "outputs": []
352 },
353 {
354 "cell_type": "code",
355 "collapsed": false,
356 "input": [],
357 "language": "python",
358 "metadata": {},
359 "outputs": []
360 }
361 ],
362 "metadata": {}
363 }
364 ]
365 }