From: Neil Smith Date: Sun, 7 Dec 2014 12:09:31 +0000 (+0000) Subject: Moved word filter X-Git-Url: https://git.njae.me.uk/?a=commitdiff_plain;h=e1902dd364afaafaa80759b461bb43172458541e;p=cas-master-teacher-training.git Moved word filter --- diff --git a/hangman/word_filter_comparison.ipynb b/hangman/word_filter_comparison.ipynb new file mode 100644 index 0000000..b96359e --- /dev/null +++ b/hangman/word_filter_comparison.ipynb @@ -0,0 +1,365 @@ +{ + "metadata": { + "name": "", + "signature": "sha256:b1430467f492182774cf211bf9da55e45dbf53644a26cb4e401bed473b1551ed" + }, + "nbformat": 3, + "nbformat_minor": 0, + "worksheets": [ + { + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Filtering words\n", + "The challenge is to read a list of words from a dictionary, and keep only those words which contain only lower-case letters. Any \"word\" that contains an upper-case letter, punctuation, spaces, or similar should be rejected on the basis that it's a proper noun, an abbreviation, or something else that means it can't be a valid target word for Hangman." + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "# Import the libraries we'll need\n", + "import re\n", + "import random\n", + "import string" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 3 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Get the list of all words." 
+ ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "all_words = [w.strip() for w in open('/usr/share/dict/british-english').readlines()]\n", + "len(all_words)" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "metadata": {}, + "output_type": "pyout", + "prompt_number": 4, + "text": [ + "99156" + ] + } + ], + "prompt_number": 4 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Checking a word\n", + "\n", + "## Explicit iteration over the word\n", + "This function walks over the word, character by character, and checks if it's in the list of valid characters (as given in `string.ascii_lowercase`). If it's not, the `valid` flag is set to `False`. The final value is returned." + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "def check_word_explicit(word):\n", + " valid = True\n", + " for letter in word:\n", + " if letter not in string.ascii_lowercase:\n", + " valid = False\n", + " return valid" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 5 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Short-circuiting explicit iteration\n", + "As above, but the function `return`s `False` as soon as it detects an invalid character. This should make it quicker to reject words." + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "def check_word_short_circuit(word):\n", + " for letter in word:\n", + " if letter not in string.ascii_lowercase:\n", + " return False\n", + " return True" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 6 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using comprehensions\n", + "Use a comprehension function to convert the list of letters into a list of Booleans showing whether the character in that position is a valid letter. Use the built-in `all()` function to check that all the values in the list are `True`." 
+ ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "# Examples of the idea\n", + "print('hello :', [letter in string.ascii_lowercase for letter in 'hello'])\n", + "print('heLLo :', [letter in string.ascii_lowercase for letter in 'heLLo'])" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "stream": "stdout", + "text": [ + "hello : [True, True, True, True, True]\n", + "heLLo : [True, True, False, False, True]\n" + ] + } + ], + "prompt_number": 7 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "def check_word_comprehension(word):\n", + " return all(letter in string.ascii_lowercase for letter in word)" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 8 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Short-circuited comprehensions\n", + "An attempt to be clever. Can we stop the checking of letters as soon as we've found an invalid one?" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "def check_word_comprehension_clever(word):\n", + " return not any(letter not in string.ascii_lowercase for letter in word)" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 9 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## A recursive definition\n", + "A word is all lowercase if the first character is lowercase and the rest of the word is all lowercase. The base case is an empty word. This should evaluate to `True` because an empty word does not contain any invalid characters.\n", + "\n", + "Note the Pythonic use of \"truthiness\" values. If you try to take the Boolean value of a string, it evaluates as `False` if it's empty and `True` otherwise. Using \n", + "\n", + "` if word != '':` \n", + "\n", + "in the first line is just as correct, but not as Pythonic." 
+ ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "def check_word_recursive(word):\n", + " if word:\n", + " if word[0] not in string.ascii_lowercase:\n", + " return False\n", + " else:\n", + " return check_word_recursive(word[1:])\n", + " else:\n", + " return True" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 10 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Regular expressions\n", + "A regular expression is a way of defining a finite state machine (FSM) that accepts some sequences of characters. They're used a lot whenever you want to process something text-based. In this case, the regex consists of:\n", + "* `^` : match the start of the string\n", + "* `[a-z]` : match a single character in the range `a` to `z`\n", + "* `[a-z]+` : match a sequence of one or more characters in the range `a` to `z`\n", + "* `$` : match the end of the string\n", + "This means you have a regular expression that matches strings containing just lower-case letters with nothing else between the matched letters and the start and end of the string. \n", + "\n", + "Python has the `re.compile` feature to build the specialised FSM that does the matching. This is faster if you want to use the same regular expression a lot. If you only want to use it a few times, it's often easier to just give the regex directly. See below for examples.\n", + "\n", + "Regular expressions are incredibly powerful, but take time to learn. See the [regular expression tutorial](http://www.regular-expressions.info/tutorial.html) for a guide." 
+ ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "valid_word_re = re.compile(r'^[a-z]+$')" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 11 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Evaluation\n", + "Which of these alternatives is the best?\n", + "\n", + "The important measure is whether the program is both readable and correct. You can be the judge of that (though I used a regex as a first recourse).\n", + "\n", + "We can also look at performance: which is the fastest?\n", + "\n", + "Use the IPython timing cell-magic to find out. We'll also use an `assert`ion to check that all the approaches give the same answer.\n", + "\n", + "You'll have to run the notebook to find the answer. Which do you think would be the fastest, or the slowest?" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "valid_word_count = len([w for w in all_words if valid_word_re.match(w)])\n", + "valid_word_count" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "metadata": {}, + "output_type": "pyout", + "prompt_number": 12, + "text": [ + "62856" + ] + } + ], + "prompt_number": 12 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%timeit\n", + "words = [w for w in all_words if check_word_explicit(w)]\n", + "assert len(words) == valid_word_count" + ], + "language": "python", + "metadata": {}, + "outputs": [] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%timeit\n", + "words = [w for w in all_words if check_word_short_circuit(w)]\n", + "assert len(words) == valid_word_count" + ], + "language": "python", + "metadata": {}, + "outputs": [] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%timeit\n", + "words = [w for w in all_words if check_word_comprehension(w)]\n", + "assert len(words) == valid_word_count" + ], + "language": "python", + "metadata": {}, + "outputs": [] + }, + { + "cell_type": 
"code", + "collapsed": false, + "input": [ + "%%timeit\n", + "words = [w for w in all_words if check_word_comprehension_clever(w)]\n", + "assert len(words) == valid_word_count" + ], + "language": "python", + "metadata": {}, + "outputs": [] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%timeit\n", + "words = [w for w in all_words if check_word_recursive(w)]\n", + "assert len(words) == valid_word_count" + ], + "language": "python", + "metadata": {}, + "outputs": [] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%timeit\n", + "words = [w for w in all_words if re.match(r'^[a-z]+$', w)]\n", + "assert len(words) == valid_word_count" + ], + "language": "python", + "metadata": {}, + "outputs": [] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%timeit\n", + "words = [w for w in all_words if valid_word_re.match(w)]\n", + "assert len(words) == valid_word_count" + ], + "language": "python", + "metadata": {}, + "outputs": [] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [], + "language": "python", + "metadata": {}, + "outputs": [] + } + ], + "metadata": {} + } + ] +} \ No newline at end of file diff --git a/word_filter_comparison.ipynb b/word_filter_comparison.ipynb deleted file mode 100644 index b96359e..0000000 --- a/word_filter_comparison.ipynb +++ /dev/null @@ -1,365 +0,0 @@ -{ - "metadata": { - "name": "", - "signature": "sha256:b1430467f492182774cf211bf9da55e45dbf53644a26cb4e401bed473b1551ed" - }, - "nbformat": 3, - "nbformat_minor": 0, - "worksheets": [ - { - "cells": [ - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Filtering words\n", - "The challenge is to read a list of words from a dictionary, and keep only those words which contain only lower-case letters. 
Any \"word\" that contains an upper-case letter, punctuation, spaces, or similar should be rejected on the basis that it's a proper noun, and abbreviation, or something else that means it can't be a valid target word for Hangman." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "# Import the libraries we'll need\n", - "import re\n", - "import random\n", - "import string" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": 3 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "Get the list of all words." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "all_words = [w.strip() for w in open('/usr/share/dict/british-english').readlines()]\n", - "len(all_words)" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "metadata": {}, - "output_type": "pyout", - "prompt_number": 4, - "text": [ - "99156" - ] - } - ], - "prompt_number": 4 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Checking a word\n", - "\n", - "## Explicit iteration over the word\n", - "This function walks over the word, character by character, and checks if it's in the list of valid characters (as given in `string.ascii_lowercase`). If it's not, the `valid` flag is set to `False`. The final value is returned." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "def check_word_explicit(word):\n", - " valid = True\n", - " for letter in word:\n", - " if letter not in string.ascii_lowercase:\n", - " valid = False\n", - " return valid" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": 5 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Short-circuiting explicit iteration\n", - "As above, but the function `return`s `False` as soon as it detects an invalid character. This should make it quicker to reject words." 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "def check_word_short_circuit(word):\n", - " for letter in word:\n", - " if letter not in string.ascii_lowercase:\n", - " return False\n", - " return True" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": 6 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Using comprehensions\n", - "Use a comprehension function to convert the list of letters into a list of Booleans showing whether the character in that position is a valid letter. Use the built-in `all()` function to check that all the values in the list are `True`." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "# Examples of the idea\n", - "print('hello :', [letter in string.ascii_lowercase for letter in 'hello'])\n", - "print('heLLo :', [letter in string.ascii_lowercase for letter in 'heLLo'])" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "output_type": "stream", - "stream": "stdout", - "text": [ - "hello : [True, True, True, True, True]\n", - "heLLo : [True, True, False, False, True]\n" - ] - } - ], - "prompt_number": 7 - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "def check_word_comprehension(word):\n", - " return all(letter in string.ascii_lowercase for letter in word)" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": 8 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "### Short-circuited comprehensions\n", - "An attempt to be clever. Can we stop the checking of letters as soon as we've found an invalid one?" 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "def check_word_comprehension_clever(word):\n", - " return not any(letter not in string.ascii_lowercase for letter in word)" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": 9 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## A recursive definition\n", - "A word if all lowercase if the first character is lowercase and the rest of the word is all lowercase. The base case is an empty word. This should evaluate to `True` because an empty list does not contain any invalid characters.\n", - "\n", - "Note the Pythonic use of \"truthiness\" values. If you try to take the Boolean value of a string, it evaluates as `False` if it's empty and `True` otherwise. Using \n", - "\n", - "` if word != '':` \n", - "\n", - "in the first line is just as correct, but not as Pythonic." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "def check_word_recursive(word):\n", - " if word:\n", - " if word[0] not in string.ascii_lowercase:\n", - " return False\n", - " else:\n", - " return check_word_recursive(word[1:])\n", - " else:\n", - " return True" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": 10 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "## Regular expressions\n", - "A regular expression is a way of defining a finite state machine (FSM) that accepts some sequences of characters. They're used a lot whenever you want to process something text-based. 
In this case, the regex consists of:\n", - "* `^` : match the start of the string\n", - "* `[a-z]` : match a single character in the range `a` to `z`\n", - "* `[a-z]+` : match a sequence of one one or more characters in the range `a` to `z`\n", - "* `$` : match the end of the string\n", - "This means you have a regular expression that matches strings containing just lower-case letters with nothing else between the matched letters and the start and end of the string. \n", - "\n", - "Python has the `re.compile` feature to build the specialised FSM that does the matching. This is faster if you want to use the same regular expression a lot. If you only want to use it a few times, it's often easier to just give the regex directly. See below for examples.\n", - "\n", - "Regular expresions are incredibly powerful, but take time to learn. See the [regular expression tutorial](http://www.regular-expressions.info/tutorial.html) for a guide." - ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "valid_word_re = re.compile(r'^[a-z]+$')" - ], - "language": "python", - "metadata": {}, - "outputs": [], - "prompt_number": 11 - }, - { - "cell_type": "markdown", - "metadata": {}, - "source": [ - "# Evaluation\n", - "Which of these alternatives is the best?\n", - "\n", - "The important measure is whether the program is both readable and correct. You can be the judge of that (though I used a regex as a first recourse).\n", - "\n", - "We can also look at performance: which is the fastest?\n", - "\n", - "Use the IPython timing cell-magic to find out. We'll also use an `assert`ion to check that all the approaches give the same answer.\n", - "\n", - "You'll have to run the notebook to find the answer. Which do you think would be these fastest, or the slowest?" 
- ] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "valid_word_count = len([w for w in all_words if valid_word_re.match(w)])\n", - "valid_word_count" - ], - "language": "python", - "metadata": {}, - "outputs": [ - { - "metadata": {}, - "output_type": "pyout", - "prompt_number": 12, - "text": [ - "62856" - ] - } - ], - "prompt_number": 12 - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "%%timeit\n", - "words = [w for w in all_words if check_word_explicit(w)]\n", - "assert len(words) == valid_word_count" - ], - "language": "python", - "metadata": {}, - "outputs": [] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "%%timeit\n", - "words = [w for w in all_words if check_word_short_circuit(w)]\n", - "assert len(words) == valid_word_count" - ], - "language": "python", - "metadata": {}, - "outputs": [] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "%%timeit\n", - "words = [w for w in all_words if check_word_comprehension(w)]\n", - "assert len(words) == valid_word_count" - ], - "language": "python", - "metadata": {}, - "outputs": [] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "%%timeit\n", - "words = [w for w in all_words if check_word_comprehension_clever(w)]\n", - "assert len(words) == valid_word_count" - ], - "language": "python", - "metadata": {}, - "outputs": [] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "%%timeit\n", - "words = [w for w in all_words if check_word_recursive(w)]\n", - "assert len(words) == valid_word_count" - ], - "language": "python", - "metadata": {}, - "outputs": [] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "%%timeit\n", - "words = [w for w in all_words if re.match(r'^[a-z]+$', w)]\n", - "assert len(words) == valid_word_count" - ], - "language": "python", - "metadata": {}, - "outputs": [] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [ - "%%timeit\n", - "words = [w for 
w in all_words if valid_word_re.match(w)]\n", - "assert len(words) == valid_word_count" - ], - "language": "python", - "metadata": {}, - "outputs": [] - }, - { - "cell_type": "code", - "collapsed": false, - "input": [], - "language": "python", - "metadata": {}, - "outputs": [] - } - ], - "metadata": {} - } - ] -} \ No newline at end of file