Moved word filter
author Neil Smith <neil.git@njae.me.uk>
Sun, 7 Dec 2014 12:09:31 +0000 (12:09 +0000)
committer Neil Smith <neil.git@njae.me.uk>
Sun, 7 Dec 2014 12:09:31 +0000 (12:09 +0000)
hangman/word_filter_comparison.ipynb [new file with mode: 0644]
word_filter_comparison.ipynb [deleted file]

diff --git a/hangman/word_filter_comparison.ipynb b/hangman/word_filter_comparison.ipynb
new file mode 100644
index 0000000..b96359e
--- /dev/null
+++ b/hangman/word_filter_comparison.ipynb
@@ -0,0 +1,365 @@
+{
+ "metadata": {
+  "name": "",
+  "signature": "sha256:b1430467f492182774cf211bf9da55e45dbf53644a26cb4e401bed473b1551ed"
+ },
+ "nbformat": 3,
+ "nbformat_minor": 0,
+ "worksheets": [
+  {
+   "cells": [
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# Filtering words\n",
+      "The challenge is to read a list of words from a dictionary, and keep only those words which contain only lower-case letters. Any \"word\" that contains an upper-case letter, punctuation, spaces, or similar should be rejected on the basis that it's a proper noun, an abbreviation, or something else that means it can't be a valid target word for Hangman."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# Import the libraries we'll need\n",
+      "import re\n",
+      "import random\n",
+      "import string"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 3
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "Get the list of all words."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "all_words = [w.strip() for w in open('/usr/share/dict/british-english').readlines()]\n",
+      "len(all_words)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "metadata": {},
+       "output_type": "pyout",
+       "prompt_number": 4,
+       "text": [
+        "99156"
+       ]
+      }
+     ],
+     "prompt_number": 4
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# Checking a word\n",
+      "\n",
+      "## Explicit iteration over the word\n",
+      "This function walks over the word, character by character, and checks if it's in the list of valid characters (as given in `string.ascii_lowercase`). If it's not, the `valid` flag is set to `False`. The final value is returned."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "def check_word_explicit(word):\n",
+      "    valid = True\n",
+      "    for letter in word:\n",
+      "        if letter not in string.ascii_lowercase:\n",
+      "            valid = False\n",
+      "    return valid"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 5
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "### Short-circuiting explicit iteration\n",
+      "As above, but the function `return`s `False` as soon as it detects an invalid character. This should make it quicker to reject words."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "def check_word_short_circuit(word):\n",
+      "    for letter in word:\n",
+      "        if letter not in string.ascii_lowercase:\n",
+      "            return False\n",
+      "    return True"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 6
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "## Using comprehensions\n",
+      "Use a comprehension to convert the word's letters into a list of Booleans showing whether the character in that position is a valid letter. Use the built-in `all()` function to check that all the values in the list are `True`."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "# Examples of the idea\n",
+      "print('hello :', [letter in string.ascii_lowercase for letter in 'hello'])\n",
+      "print('heLLo :', [letter in string.ascii_lowercase for letter in 'heLLo'])"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "output_type": "stream",
+       "stream": "stdout",
+       "text": [
+        "hello : [True, True, True, True, True]\n",
+        "heLLo : [True, True, False, False, True]\n"
+       ]
+      }
+     ],
+     "prompt_number": 7
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "def check_word_comprehension(word):\n",
+      "    return all(letter in string.ascii_lowercase for letter in word)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 8
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "### Short-circuited comprehensions\n",
+      "An attempt to be clever. Can we stop the checking of letters as soon as we've found an invalid one?"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "def check_word_comprehension_clever(word):\n",
+      "    return not any(letter not in string.ascii_lowercase for letter in word)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 9
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "## A recursive definition\n",
+      "A word is all lowercase if the first character is lowercase and the rest of the word is all lowercase. The base case is an empty word. This should evaluate to `True` because an empty word does not contain any invalid characters.\n",
+      "\n",
+      "Note the Pythonic use of \"truthiness\" values. If you try to take the Boolean value of a string, it evaluates as `False` if it's empty and `True` otherwise. Using \n",
+      "\n",
+      "`    if word != '':` \n",
+      "\n",
+      "in the first line is just as correct, but not as Pythonic."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "def check_word_recursive(word):\n",
+      "    if word:\n",
+      "        if word[0] not in string.ascii_lowercase:\n",
+      "            return False\n",
+      "        else:\n",
+      "            return check_word_recursive(word[1:])\n",
+      "    else:\n",
+      "        return True"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 10
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "## Regular expressions\n",
+      "A regular expression is a way of defining a finite state machine (FSM) that accepts some sequences of characters. They're used a lot whenever you want to process something text-based. In this case, the regex consists of:\n",
+      "* `^` : match the start of the string\n",
+      "* `[a-z]` : match a single character in the range `a` to `z`\n",
+      "* `[a-z]+` : match a sequence of one or more characters in the range `a` to `z`\n",
+      "* `$` : match the end of the string\n",
+      "\nThis means you have a regular expression that matches strings containing just lower-case letters with nothing else between the matched letters and the start and end of the string. \n",
+      "\n",
+      "Python has the `re.compile` feature to build the specialised FSM that does the matching. This is faster if you want to use the same regular expression a lot. If you only want to use it a few times, it's often easier to just give the regex directly. See below for examples.\n",
+      "\n",
+      "Regular expressions are incredibly powerful, but take time to learn. See the [regular expression tutorial](http://www.regular-expressions.info/tutorial.html) for a guide."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "valid_word_re = re.compile(r'^[a-z]+$')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 11
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "# Evaluation\n",
+      "Which of these alternatives is the best?\n",
+      "\n",
+      "The important measure is whether the program is both readable and correct. You can be the judge of that (though I used a regex as a first recourse).\n",
+      "\n",
+      "We can also look at performance: which is the fastest?\n",
+      "\n",
+      "Use the IPython timing cell-magic to find out. We'll also use an `assert`ion to check that all the approaches give the same answer.\n",
+      "\n",
+      "You'll have to run the notebook to find the answer. Which do you think would be the fastest, or the slowest?"
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "valid_word_count = len([w for w in all_words if valid_word_re.match(w)])\n",
+      "valid_word_count"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "metadata": {},
+       "output_type": "pyout",
+       "prompt_number": 12,
+       "text": [
+        "62856"
+       ]
+      }
+     ],
+     "prompt_number": 12
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%timeit\n",
+      "words = [w for w in all_words if check_word_explicit(w)]\n",
+      "assert len(words) == valid_word_count"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%timeit\n",
+      "words = [w for w in all_words if check_word_short_circuit(w)]\n",
+      "assert len(words) == valid_word_count"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%timeit\n",
+      "words = [w for w in all_words if check_word_comprehension(w)]\n",
+      "assert len(words) == valid_word_count"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%timeit\n",
+      "words = [w for w in all_words if check_word_comprehension_clever(w)]\n",
+      "assert len(words) == valid_word_count"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%timeit\n",
+      "words = [w for w in all_words if check_word_recursive(w)]\n",
+      "assert len(words) == valid_word_count"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%timeit\n",
+      "words = [w for w in all_words if re.match(r'^[a-z]+$', w)]\n",
+      "assert len(words) == valid_word_count"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "%%timeit\n",
+      "words = [w for w in all_words if valid_word_re.match(w)]\n",
+      "assert len(words) == valid_word_count"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [],
+     "language": "python",
+     "metadata": {},
+     "outputs": []
+    }
+   ],
+   "metadata": {}
+  }
+ ]
+}
\ No newline at end of file
diff --git a/word_filter_comparison.ipynb b/word_filter_comparison.ipynb
deleted file mode 100644
index b96359e..0000000
--- a/word_filter_comparison.ipynb
+++ /dev/null
@@ -1,365 +0,0 @@
-{
- "metadata": {
-  "name": "",
-  "signature": "sha256:b1430467f492182774cf211bf9da55e45dbf53644a26cb4e401bed473b1551ed"
- },
- "nbformat": 3,
- "nbformat_minor": 0,
- "worksheets": [
-  {
-   "cells": [
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "# Filtering words\n",
-      "The challenge is to read a list of words from a dictionary, and keep only those words which contain only lower-case letters. Any \"word\" that contains an upper-case letter, punctuation, spaces, or similar should be rejected on the basis that it's a proper noun, an abbreviation, or something else that means it can't be a valid target word for Hangman."
-     ]
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [
-      "# Import the libraries we'll need\n",
-      "import re\n",
-      "import random\n",
-      "import string"
-     ],
-     "language": "python",
-     "metadata": {},
-     "outputs": [],
-     "prompt_number": 3
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "Get the list of all words."
-     ]
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [
-      "all_words = [w.strip() for w in open('/usr/share/dict/british-english').readlines()]\n",
-      "len(all_words)"
-     ],
-     "language": "python",
-     "metadata": {},
-     "outputs": [
-      {
-       "metadata": {},
-       "output_type": "pyout",
-       "prompt_number": 4,
-       "text": [
-        "99156"
-       ]
-      }
-     ],
-     "prompt_number": 4
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "# Checking a word\n",
-      "\n",
-      "## Explicit iteration over the word\n",
-      "This function walks over the word, character by character, and checks if it's in the list of valid characters (as given in `string.ascii_lowercase`). If it's not, the `valid` flag is set to `False`. The final value is returned."
-     ]
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [
-      "def check_word_explicit(word):\n",
-      "    valid = True\n",
-      "    for letter in word:\n",
-      "        if letter not in string.ascii_lowercase:\n",
-      "            valid = False\n",
-      "    return valid"
-     ],
-     "language": "python",
-     "metadata": {},
-     "outputs": [],
-     "prompt_number": 5
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "### Short-circuiting explicit iteration\n",
-      "As above, but the function `return`s `False` as soon as it detects an invalid character. This should make it quicker to reject words."
-     ]
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [
-      "def check_word_short_circuit(word):\n",
-      "    for letter in word:\n",
-      "        if letter not in string.ascii_lowercase:\n",
-      "            return False\n",
-      "    return True"
-     ],
-     "language": "python",
-     "metadata": {},
-     "outputs": [],
-     "prompt_number": 6
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "## Using comprehensions\n",
-      "Use a comprehension to convert the word's letters into a list of Booleans showing whether the character in that position is a valid letter. Use the built-in `all()` function to check that all the values in the list are `True`."
-     ]
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [
-      "# Examples of the idea\n",
-      "print('hello :', [letter in string.ascii_lowercase for letter in 'hello'])\n",
-      "print('heLLo :', [letter in string.ascii_lowercase for letter in 'heLLo'])"
-     ],
-     "language": "python",
-     "metadata": {},
-     "outputs": [
-      {
-       "output_type": "stream",
-       "stream": "stdout",
-       "text": [
-        "hello : [True, True, True, True, True]\n",
-        "heLLo : [True, True, False, False, True]\n"
-       ]
-      }
-     ],
-     "prompt_number": 7
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [
-      "def check_word_comprehension(word):\n",
-      "    return all(letter in string.ascii_lowercase for letter in word)"
-     ],
-     "language": "python",
-     "metadata": {},
-     "outputs": [],
-     "prompt_number": 8
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "### Short-circuited comprehensions\n",
-      "An attempt to be clever. Can we stop the checking of letters as soon as we've found an invalid one?"
-     ]
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [
-      "def check_word_comprehension_clever(word):\n",
-      "    return not any(letter not in string.ascii_lowercase for letter in word)"
-     ],
-     "language": "python",
-     "metadata": {},
-     "outputs": [],
-     "prompt_number": 9
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "## A recursive definition\n",
-      "A word is all lowercase if the first character is lowercase and the rest of the word is all lowercase. The base case is an empty word. This should evaluate to `True` because an empty word does not contain any invalid characters.\n",
-      "\n",
-      "Note the Pythonic use of \"truthiness\" values. If you try to take the Boolean value of a string, it evaluates as `False` if it's empty and `True` otherwise. Using \n",
-      "\n",
-      "`    if word != '':` \n",
-      "\n",
-      "in the first line is just as correct, but not as Pythonic."
-     ]
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [
-      "def check_word_recursive(word):\n",
-      "    if word:\n",
-      "        if word[0] not in string.ascii_lowercase:\n",
-      "            return False\n",
-      "        else:\n",
-      "            return check_word_recursive(word[1:])\n",
-      "    else:\n",
-      "        return True"
-     ],
-     "language": "python",
-     "metadata": {},
-     "outputs": [],
-     "prompt_number": 10
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "## Regular expressions\n",
-      "A regular expression is a way of defining a finite state machine (FSM) that accepts some sequences of characters. They're used a lot whenever you want to process something text-based. In this case, the regex consists of:\n",
-      "* `^` : match the start of the string\n",
-      "* `[a-z]` : match a single character in the range `a` to `z`\n",
-      "* `[a-z]+` : match a sequence of one or more characters in the range `a` to `z`\n",
-      "* `$` : match the end of the string\n",
-      "\nThis means you have a regular expression that matches strings containing just lower-case letters with nothing else between the matched letters and the start and end of the string. \n",
-      "\n",
-      "Python has the `re.compile` feature to build the specialised FSM that does the matching. This is faster if you want to use the same regular expression a lot. If you only want to use it a few times, it's often easier to just give the regex directly. See below for examples.\n",
-      "\n",
-      "Regular expressions are incredibly powerful, but take time to learn. See the [regular expression tutorial](http://www.regular-expressions.info/tutorial.html) for a guide."
-     ]
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [
-      "valid_word_re = re.compile(r'^[a-z]+$')"
-     ],
-     "language": "python",
-     "metadata": {},
-     "outputs": [],
-     "prompt_number": 11
-    },
-    {
-     "cell_type": "markdown",
-     "metadata": {},
-     "source": [
-      "# Evaluation\n",
-      "Which of these alternatives is the best?\n",
-      "\n",
-      "The important measure is whether the program is both readable and correct. You can be the judge of that (though I used a regex as a first recourse).\n",
-      "\n",
-      "We can also look at performance: which is the fastest?\n",
-      "\n",
-      "Use the IPython timing cell-magic to find out. We'll also use an `assert`ion to check that all the approaches give the same answer.\n",
-      "\n",
-      "You'll have to run the notebook to find the answer. Which do you think would be the fastest, or the slowest?"
-     ]
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [
-      "valid_word_count = len([w for w in all_words if valid_word_re.match(w)])\n",
-      "valid_word_count"
-     ],
-     "language": "python",
-     "metadata": {},
-     "outputs": [
-      {
-       "metadata": {},
-       "output_type": "pyout",
-       "prompt_number": 12,
-       "text": [
-        "62856"
-       ]
-      }
-     ],
-     "prompt_number": 12
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [
-      "%%timeit\n",
-      "words = [w for w in all_words if check_word_explicit(w)]\n",
-      "assert len(words) == valid_word_count"
-     ],
-     "language": "python",
-     "metadata": {},
-     "outputs": []
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [
-      "%%timeit\n",
-      "words = [w for w in all_words if check_word_short_circuit(w)]\n",
-      "assert len(words) == valid_word_count"
-     ],
-     "language": "python",
-     "metadata": {},
-     "outputs": []
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [
-      "%%timeit\n",
-      "words = [w for w in all_words if check_word_comprehension(w)]\n",
-      "assert len(words) == valid_word_count"
-     ],
-     "language": "python",
-     "metadata": {},
-     "outputs": []
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [
-      "%%timeit\n",
-      "words = [w for w in all_words if check_word_comprehension_clever(w)]\n",
-      "assert len(words) == valid_word_count"
-     ],
-     "language": "python",
-     "metadata": {},
-     "outputs": []
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [
-      "%%timeit\n",
-      "words = [w for w in all_words if check_word_recursive(w)]\n",
-      "assert len(words) == valid_word_count"
-     ],
-     "language": "python",
-     "metadata": {},
-     "outputs": []
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [
-      "%%timeit\n",
-      "words = [w for w in all_words if re.match(r'^[a-z]+$', w)]\n",
-      "assert len(words) == valid_word_count"
-     ],
-     "language": "python",
-     "metadata": {},
-     "outputs": []
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [
-      "%%timeit\n",
-      "words = [w for w in all_words if valid_word_re.match(w)]\n",
-      "assert len(words) == valid_word_count"
-     ],
-     "language": "python",
-     "metadata": {},
-     "outputs": []
-    },
-    {
-     "cell_type": "code",
-     "collapsed": false,
-     "input": [],
-     "language": "python",
-     "metadata": {},
-     "outputs": []
-    }
-   ],
-   "metadata": {}
-  }
- ]
-}
\ No newline at end of file
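
Note on the moved notebook: its `%%timeit` cells are committed without outputs, so the performance comparison cannot be read from this diff. The following is a minimal sketch of how the same comparison might be run outside IPython with the standard `timeit` module. It assumes a small in-memory list in place of `/usr/share/dict/british-english`; the names `sample_words`, `check_word_regex`, `CHECKS`, `filter_words`, and `time_checks` are hypothetical and not part of the notebook, and only three of the notebook's approaches are included.

```python
import re
import string
import timeit

# Hypothetical stand-in for the dictionary file read in the notebook.
sample_words = ["hello", "heLLo", "world", "it's", "Neil", "hangman", "e.g.", "words"]

valid_word_re = re.compile(r'^[a-z]+$')


def check_word_short_circuit(word):
    # Same logic as the notebook: reject at the first invalid character.
    for letter in word:
        if letter not in string.ascii_lowercase:
            return False
    return True


def check_word_comprehension(word):
    # Same logic as the notebook: every letter must be lower-case ASCII.
    return all(letter in string.ascii_lowercase for letter in word)


def check_word_regex(word):
    # Hypothetical wrapper so the compiled regex has the same call shape.
    return valid_word_re.match(word) is not None


CHECKS = {
    "short_circuit": check_word_short_circuit,
    "comprehension": check_word_comprehension,
    "compiled_regex": check_word_regex,
}


def filter_words(check, words):
    """Keep only the words accepted by `check`."""
    return [w for w in words if check(w)]


def time_checks(words, repeats=100):
    """Return {name: seconds} for filtering `words` `repeats` times."""
    expected = filter_words(check_word_regex, words)
    timings = {}
    for name, check in CHECKS.items():
        # Mirror the notebook's assertion that every approach agrees.
        assert filter_words(check, words) == expected
        timings[name] = timeit.timeit(lambda: filter_words(check, words),
                                      number=repeats)
    return timings


if __name__ == "__main__":
    for name, seconds in time_checks(sample_words).items():
        print(f"{name:15s} {seconds:.6f} s")
```

One difference to keep in mind when comparing results: the iteration-based checks accept an empty string, while `[a-z]+` rejects it, so the agreement assertion only holds if the word list has no empty entries.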