--- /dev/null
+{
+ "metadata": {
+ "name": "",
+ "signature": "sha256:b1430467f492182774cf211bf9da55e45dbf53644a26cb4e401bed473b1551ed"
+ },
+ "nbformat": 3,
+ "nbformat_minor": 0,
+ "worksheets": [
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Filtering words\n",
+ "The challenge is to read a list of words from a dictionary, and keep only those words which contain only lower-case letters. Any \"word\" that contains an upper-case letter, punctuation, spaces, or similar should be rejected on the basis that it's a proper noun, and abbreviation, or something else that means it can't be a valid target word for Hangman."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "# Import the libraries we'll need\n",
+ "import re\n",
+ "import random\n",
+ "import string"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [],
+ "prompt_number": 3
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "Get the list of all words."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "all_words = [w.strip() for w in open('/usr/share/dict/british-english').readlines()]\n",
+ "len(all_words)"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "metadata": {},
+ "output_type": "pyout",
+ "prompt_number": 4,
+ "text": [
+ "99156"
+ ]
+ }
+ ],
+ "prompt_number": 4
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Checking a word\n",
+ "\n",
+ "## Explicit iteration over the word\n",
+ "This function walks over the word, character by character, and checks if it's in the list of valid characters (as given in `string.ascii_lowercase`). If it's not, the `valid` flag is set to `False`. The final value is returned."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "def check_word_explicit(word):\n",
+ " valid = True\n",
+ " for letter in word:\n",
+ " if letter not in string.ascii_lowercase:\n",
+ " valid = False\n",
+ " return valid"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [],
+ "prompt_number": 5
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Short-circuiting explicit iteration\n",
+ "As above, but the function `return`s `False` as soon as it detects an invalid character. This should make it quicker to reject words."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "def check_word_short_circuit(word):\n",
+ " for letter in word:\n",
+ " if letter not in string.ascii_lowercase:\n",
+ " return False\n",
+ " return True"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [],
+ "prompt_number": 6
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Using comprehensions\n",
+ "Use a comprehension function to convert the list of letters into a list of Booleans showing whether the character in that position is a valid letter. Use the built-in `all()` function to check that all the values in the list are `True`."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "# Examples of the idea\n",
+ "print('hello :', [letter in string.ascii_lowercase for letter in 'hello'])\n",
+ "print('heLLo :', [letter in string.ascii_lowercase for letter in 'heLLo'])"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "output_type": "stream",
+ "stream": "stdout",
+ "text": [
+ "hello : [True, True, True, True, True]\n",
+ "heLLo : [True, True, False, False, True]\n"
+ ]
+ }
+ ],
+ "prompt_number": 7
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "def check_word_comprehension(word):\n",
+ " return all(letter in string.ascii_lowercase for letter in word)"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [],
+ "prompt_number": 8
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "### Short-circuited comprehensions\n",
+ "An attempt to be clever. Can we stop the checking of letters as soon as we've found an invalid one?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "def check_word_comprehension_clever(word):\n",
+ " return not any(letter not in string.ascii_lowercase for letter in word)"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [],
+ "prompt_number": 9
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## A recursive definition\n",
+ "A word if all lowercase if the first character is lowercase and the rest of the word is all lowercase. The base case is an empty word. This should evaluate to `True` because an empty list does not contain any invalid characters.\n",
+ "\n",
+ "Note the Pythonic use of \"truthiness\" values. If you try to take the Boolean value of a string, it evaluates as `False` if it's empty and `True` otherwise. Using \n",
+ "\n",
+ "` if word != '':` \n",
+ "\n",
+ "in the first line is just as correct, but not as Pythonic."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "def check_word_recursive(word):\n",
+ " if word:\n",
+ " if word[0] not in string.ascii_lowercase:\n",
+ " return False\n",
+ " else:\n",
+ " return check_word_recursive(word[1:])\n",
+ " else:\n",
+ " return True"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [],
+ "prompt_number": 10
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Regular expressions\n",
+ "A regular expression is a way of defining a finite state machine (FSM) that accepts some sequences of characters. They're used a lot whenever you want to process something text-based. In this case, the regex consists of:\n",
+ "* `^` : match the start of the string\n",
+ "* `[a-z]` : match a single character in the range `a` to `z`\n",
+ "* `[a-z]+` : match a sequence of one one or more characters in the range `a` to `z`\n",
+ "* `$` : match the end of the string\n",
+ "This means you have a regular expression that matches strings containing just lower-case letters with nothing else between the matched letters and the start and end of the string. \n",
+ "\n",
+ "Python has the `re.compile` feature to build the specialised FSM that does the matching. This is faster if you want to use the same regular expression a lot. If you only want to use it a few times, it's often easier to just give the regex directly. See below for examples.\n",
+ "\n",
+ "Regular expresions are incredibly powerful, but take time to learn. See the [regular expression tutorial](http://www.regular-expressions.info/tutorial.html) for a guide."
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "valid_word_re = re.compile(r'^[a-z]+$')"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [],
+ "prompt_number": 11
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "# Evaluation\n",
+ "Which of these alternatives is the best?\n",
+ "\n",
+ "The important measure is whether the program is both readable and correct. You can be the judge of that (though I used a regex as a first recourse).\n",
+ "\n",
+ "We can also look at performance: which is the fastest?\n",
+ "\n",
+ "Use the IPython timing cell-magic to find out. We'll also use an `assert`ion to check that all the approaches give the same answer.\n",
+ "\n",
+ "You'll have to run the notebook to find the answer. Which do you think would be these fastest, or the slowest?"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "valid_word_count = len([w for w in all_words if valid_word_re.match(w)])\n",
+ "valid_word_count"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": [
+ {
+ "metadata": {},
+ "output_type": "pyout",
+ "prompt_number": 12,
+ "text": [
+ "62856"
+ ]
+ }
+ ],
+ "prompt_number": 12
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%timeit\n",
+ "words = [w for w in all_words if check_word_explicit(w)]\n",
+ "assert len(words) == valid_word_count"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%timeit\n",
+ "words = [w for w in all_words if check_word_short_circuit(w)]\n",
+ "assert len(words) == valid_word_count"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%timeit\n",
+ "words = [w for w in all_words if check_word_comprehension(w)]\n",
+ "assert len(words) == valid_word_count"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%timeit\n",
+ "words = [w for w in all_words if check_word_comprehension_clever(w)]\n",
+ "assert len(words) == valid_word_count"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%timeit\n",
+ "words = [w for w in all_words if check_word_recursive(w)]\n",
+ "assert len(words) == valid_word_count"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%timeit\n",
+ "words = [w for w in all_words if re.match(r'^[a-z]+$', w)]\n",
+ "assert len(words) == valid_word_count"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [
+ "%%timeit\n",
+ "words = [w for w in all_words if valid_word_re.match(w)]\n",
+ "assert len(words) == valid_word_count"
+ ],
+ "language": "python",
+ "metadata": {},
+ "outputs": []
+ },
+ {
+ "cell_type": "code",
+ "collapsed": false,
+ "input": [],
+ "language": "python",
+ "metadata": {},
+ "outputs": []
+ }
+ ],
+ "metadata": {}
+ }
+ ]
+}
\ No newline at end of file
+++ /dev/null
-{
- "metadata": {
- "name": "",
- "signature": "sha256:b1430467f492182774cf211bf9da55e45dbf53644a26cb4e401bed473b1551ed"
- },
- "nbformat": 3,
- "nbformat_minor": 0,
- "worksheets": [
- {
- "cells": [
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Filtering words\n",
- "The challenge is to read a list of words from a dictionary, and keep only those words which contain only lower-case letters. Any \"word\" that contains an upper-case letter, punctuation, spaces, or similar should be rejected on the basis that it's a proper noun, and abbreviation, or something else that means it can't be a valid target word for Hangman."
- ]
- },
- {
- "cell_type": "code",
- "collapsed": false,
- "input": [
- "# Import the libraries we'll need\n",
- "import re\n",
- "import random\n",
- "import string"
- ],
- "language": "python",
- "metadata": {},
- "outputs": [],
- "prompt_number": 3
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "Get the list of all words."
- ]
- },
- {
- "cell_type": "code",
- "collapsed": false,
- "input": [
- "all_words = [w.strip() for w in open('/usr/share/dict/british-english').readlines()]\n",
- "len(all_words)"
- ],
- "language": "python",
- "metadata": {},
- "outputs": [
- {
- "metadata": {},
- "output_type": "pyout",
- "prompt_number": 4,
- "text": [
- "99156"
- ]
- }
- ],
- "prompt_number": 4
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Checking a word\n",
- "\n",
- "## Explicit iteration over the word\n",
- "This function walks over the word, character by character, and checks if it's in the list of valid characters (as given in `string.ascii_lowercase`). If it's not, the `valid` flag is set to `False`. The final value is returned."
- ]
- },
- {
- "cell_type": "code",
- "collapsed": false,
- "input": [
- "def check_word_explicit(word):\n",
- " valid = True\n",
- " for letter in word:\n",
- " if letter not in string.ascii_lowercase:\n",
- " valid = False\n",
- " return valid"
- ],
- "language": "python",
- "metadata": {},
- "outputs": [],
- "prompt_number": 5
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Short-circuiting explicit iteration\n",
- "As above, but the function `return`s `False` as soon as it detects an invalid character. This should make it quicker to reject words."
- ]
- },
- {
- "cell_type": "code",
- "collapsed": false,
- "input": [
- "def check_word_short_circuit(word):\n",
- " for letter in word:\n",
- " if letter not in string.ascii_lowercase:\n",
- " return False\n",
- " return True"
- ],
- "language": "python",
- "metadata": {},
- "outputs": [],
- "prompt_number": 6
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Using comprehensions\n",
- "Use a comprehension function to convert the list of letters into a list of Booleans showing whether the character in that position is a valid letter. Use the built-in `all()` function to check that all the values in the list are `True`."
- ]
- },
- {
- "cell_type": "code",
- "collapsed": false,
- "input": [
- "# Examples of the idea\n",
- "print('hello :', [letter in string.ascii_lowercase for letter in 'hello'])\n",
- "print('heLLo :', [letter in string.ascii_lowercase for letter in 'heLLo'])"
- ],
- "language": "python",
- "metadata": {},
- "outputs": [
- {
- "output_type": "stream",
- "stream": "stdout",
- "text": [
- "hello : [True, True, True, True, True]\n",
- "heLLo : [True, True, False, False, True]\n"
- ]
- }
- ],
- "prompt_number": 7
- },
- {
- "cell_type": "code",
- "collapsed": false,
- "input": [
- "def check_word_comprehension(word):\n",
- " return all(letter in string.ascii_lowercase for letter in word)"
- ],
- "language": "python",
- "metadata": {},
- "outputs": [],
- "prompt_number": 8
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "### Short-circuited comprehensions\n",
- "An attempt to be clever. Can we stop the checking of letters as soon as we've found an invalid one?"
- ]
- },
- {
- "cell_type": "code",
- "collapsed": false,
- "input": [
- "def check_word_comprehension_clever(word):\n",
- " return not any(letter not in string.ascii_lowercase for letter in word)"
- ],
- "language": "python",
- "metadata": {},
- "outputs": [],
- "prompt_number": 9
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## A recursive definition\n",
- "A word if all lowercase if the first character is lowercase and the rest of the word is all lowercase. The base case is an empty word. This should evaluate to `True` because an empty list does not contain any invalid characters.\n",
- "\n",
- "Note the Pythonic use of \"truthiness\" values. If you try to take the Boolean value of a string, it evaluates as `False` if it's empty and `True` otherwise. Using \n",
- "\n",
- "` if word != '':` \n",
- "\n",
- "in the first line is just as correct, but not as Pythonic."
- ]
- },
- {
- "cell_type": "code",
- "collapsed": false,
- "input": [
- "def check_word_recursive(word):\n",
- " if word:\n",
- " if word[0] not in string.ascii_lowercase:\n",
- " return False\n",
- " else:\n",
- " return check_word_recursive(word[1:])\n",
- " else:\n",
- " return True"
- ],
- "language": "python",
- "metadata": {},
- "outputs": [],
- "prompt_number": 10
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "## Regular expressions\n",
- "A regular expression is a way of defining a finite state machine (FSM) that accepts some sequences of characters. They're used a lot whenever you want to process something text-based. In this case, the regex consists of:\n",
- "* `^` : match the start of the string\n",
- "* `[a-z]` : match a single character in the range `a` to `z`\n",
- "* `[a-z]+` : match a sequence of one one or more characters in the range `a` to `z`\n",
- "* `$` : match the end of the string\n",
- "This means you have a regular expression that matches strings containing just lower-case letters with nothing else between the matched letters and the start and end of the string. \n",
- "\n",
- "Python has the `re.compile` feature to build the specialised FSM that does the matching. This is faster if you want to use the same regular expression a lot. If you only want to use it a few times, it's often easier to just give the regex directly. See below for examples.\n",
- "\n",
- "Regular expresions are incredibly powerful, but take time to learn. See the [regular expression tutorial](http://www.regular-expressions.info/tutorial.html) for a guide."
- ]
- },
- {
- "cell_type": "code",
- "collapsed": false,
- "input": [
- "valid_word_re = re.compile(r'^[a-z]+$')"
- ],
- "language": "python",
- "metadata": {},
- "outputs": [],
- "prompt_number": 11
- },
- {
- "cell_type": "markdown",
- "metadata": {},
- "source": [
- "# Evaluation\n",
- "Which of these alternatives is the best?\n",
- "\n",
- "The important measure is whether the program is both readable and correct. You can be the judge of that (though I used a regex as a first recourse).\n",
- "\n",
- "We can also look at performance: which is the fastest?\n",
- "\n",
- "Use the IPython timing cell-magic to find out. We'll also use an `assert`ion to check that all the approaches give the same answer.\n",
- "\n",
- "You'll have to run the notebook to find the answer. Which do you think would be these fastest, or the slowest?"
- ]
- },
- {
- "cell_type": "code",
- "collapsed": false,
- "input": [
- "valid_word_count = len([w for w in all_words if valid_word_re.match(w)])\n",
- "valid_word_count"
- ],
- "language": "python",
- "metadata": {},
- "outputs": [
- {
- "metadata": {},
- "output_type": "pyout",
- "prompt_number": 12,
- "text": [
- "62856"
- ]
- }
- ],
- "prompt_number": 12
- },
- {
- "cell_type": "code",
- "collapsed": false,
- "input": [
- "%%timeit\n",
- "words = [w for w in all_words if check_word_explicit(w)]\n",
- "assert len(words) == valid_word_count"
- ],
- "language": "python",
- "metadata": {},
- "outputs": []
- },
- {
- "cell_type": "code",
- "collapsed": false,
- "input": [
- "%%timeit\n",
- "words = [w for w in all_words if check_word_short_circuit(w)]\n",
- "assert len(words) == valid_word_count"
- ],
- "language": "python",
- "metadata": {},
- "outputs": []
- },
- {
- "cell_type": "code",
- "collapsed": false,
- "input": [
- "%%timeit\n",
- "words = [w for w in all_words if check_word_comprehension(w)]\n",
- "assert len(words) == valid_word_count"
- ],
- "language": "python",
- "metadata": {},
- "outputs": []
- },
- {
- "cell_type": "code",
- "collapsed": false,
- "input": [
- "%%timeit\n",
- "words = [w for w in all_words if check_word_comprehension_clever(w)]\n",
- "assert len(words) == valid_word_count"
- ],
- "language": "python",
- "metadata": {},
- "outputs": []
- },
- {
- "cell_type": "code",
- "collapsed": false,
- "input": [
- "%%timeit\n",
- "words = [w for w in all_words if check_word_recursive(w)]\n",
- "assert len(words) == valid_word_count"
- ],
- "language": "python",
- "metadata": {},
- "outputs": []
- },
- {
- "cell_type": "code",
- "collapsed": false,
- "input": [
- "%%timeit\n",
- "words = [w for w in all_words if re.match(r'^[a-z]+$', w)]\n",
- "assert len(words) == valid_word_count"
- ],
- "language": "python",
- "metadata": {},
- "outputs": []
- },
- {
- "cell_type": "code",
- "collapsed": false,
- "input": [
- "%%timeit\n",
- "words = [w for w in all_words if valid_word_re.match(w)]\n",
- "assert len(words) == valid_word_count"
- ],
- "language": "python",
- "metadata": {},
- "outputs": []
- },
- {
- "cell_type": "code",
- "collapsed": false,
- "input": [],
- "language": "python",
- "metadata": {},
- "outputs": []
- }
- ],
- "metadata": {}
- }
- ]
-}
\ No newline at end of file