From: Neil Smith Date: Thu, 6 Nov 2014 16:48:48 +0000 (+0000) Subject: Added word filter comparisons X-Git-Url: https://git.njae.me.uk/?a=commitdiff_plain;h=2b75c2854524fcdd432f4bec0b4b8cc063cab71d;p=cas-master-teacher-training.git Added word filter comparisons --- diff --git a/word_filter_comparison.ipynb b/word_filter_comparison.ipynb new file mode 100644 index 0000000..04452e6 --- /dev/null +++ b/word_filter_comparison.ipynb @@ -0,0 +1,427 @@ +{ + "metadata": { + "name": "", + "signature": "sha256:a0281c893b46c2a49f8ca60a55050ba07e01c477d80741db9e38b50971f0ed34" + }, + "nbformat": 3, + "nbformat_minor": 0, + "worksheets": [ + { + "cells": [ + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Filtering words\n", + "The challenge is to read a list of words from a dictionary, and keep only those words which contain only lower-case letters. Any \"word\" that contains an upper-case letter, punctuation, spaces, or similar should be rejected on the basis that it's a proper noun, an abbreviation, or something else that means it can't be a valid target word for Hangman." + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "# Import the libraries we'll need\n", + "import re\n", + "import random\n", + "import string" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 17 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "Get the list of all words."
+ ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "all_words = [w.strip() for w in open('/usr/share/dict/british-english').readlines()]\n", + "len(all_words)" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "metadata": {}, + "output_type": "pyout", + "prompt_number": 18, + "text": [ + "99156" + ] + } + ], + "prompt_number": 18 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Checking a word\n", + "\n", + "## Explicit iteration over the word\n", + "This function walks over the word, character by character, and checks whether each character is in the set of valid characters (as given in `string.ascii_lowercase`). If a character is not, the `valid` flag is set to `False`. The final value of the flag is returned." + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "def check_word_explicit(word):\n", + "    valid = True\n", + "    for letter in word:\n", + "        if letter not in string.ascii_lowercase:\n", + "            valid = False\n", + "    return valid" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 19 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Short-circuiting explicit iteration\n", + "As above, but the function `return`s `False` as soon as it detects an invalid character. This should make it quicker to reject words." + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "def check_word_short_circuit(word):\n", + "    for letter in word:\n", + "        if letter not in string.ascii_lowercase:\n", + "            return False\n", + "    return True" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 20 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Using comprehensions\n", + "Use a comprehension to convert the list of letters into a list of Booleans showing whether the character in that position is a valid letter. Use the built-in `all()` function to check that all the values in the list are `True`."
+ ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "# Examples of the idea\n", + "print('hello :', [letter in string.ascii_lowercase for letter in 'hello'])\n", + "print('heLLo :', [letter in string.ascii_lowercase for letter in 'heLLo'])" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "stream": "stdout", + "text": [ + "hello : [True, True, True, True, True]\n", + "heLLo : [True, True, False, False, True]\n" + ] + } + ], + "prompt_number": 21 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "def check_word_comprehension(word):\n", + "    return all(letter in string.ascii_lowercase for letter in word)" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 22 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "### Short-circuited comprehensions\n", + "An attempt to be clever. Can we stop the checking of letters as soon as we've found an invalid one?" + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "def check_word_comprehension_clever(word):\n", + "    return not any(letter not in string.ascii_lowercase for letter in word)" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 23 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## A recursive definition\n", + "A word is all lowercase if the first character is lowercase and the rest of the word is all lowercase. The base case is an empty word. This should evaluate to `True` because an empty string does not contain any invalid characters.\n", + "\n", + "Note the Pythonic use of \"truthiness\" values. If you try to take the Boolean value of a string, it evaluates as `False` if it's empty and `True` otherwise. Using \n", + "\n", + "` if word == '':` \n", + "\n", + "in the first line is just as correct, but not as Pythonic."
+ ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "def check_word_recursive(word):\n", + "    if word:\n", + "        if word[0] not in string.ascii_lowercase:\n", + "            return False\n", + "        else:\n", + "            return check_word_recursive(word[1:])\n", + "    else:\n", + "        return True" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 24 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "## Regular expressions\n", + "A regular expression is a way of defining a finite state machine (FSM) that accepts some sequences of characters. They're used a lot whenever you want to process something text-based. In this case, the regex consists of:\n", + "* `^` : match the start of the string\n", + "* `[a-z]` : match a single character in the range `a` to `z`\n", + "* `[a-z]+` : match a sequence of one or more characters in the range `a` to `z`\n", + "* `$` : match the end of the string\n", + "This means you have a regular expression that matches strings containing just lower-case letters with nothing else between the matched letters and the start and end of the string. \n", + "\n", + "Python has the `re.compile` feature to build the specialised FSM that does the matching. This is faster if you want to use the same regular expression a lot. If you only want to use it a few times, it's often easier to just give the regex directly. See below for examples.\n", + "\n", + "Regular expressions are incredibly powerful, but take time to learn. See the [regular expression tutorial](http://www.regular-expressions.info/tutorial.html) for a guide."
+ ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "valid_word_re = re.compile(r'^[a-z]+$')" + ], + "language": "python", + "metadata": {}, + "outputs": [], + "prompt_number": 25 + }, + { + "cell_type": "markdown", + "metadata": {}, + "source": [ + "# Evaluation\n", + "Which of these alternatives is the best?\n", + "\n", + "The important measure is whether the program is both readable and correct. You can be the judge of that (though I used a regex as a first recourse).\n", + "\n", + "We can also look at performance: which is the fastest?\n", + "\n", + "Use the IPython timing cell-magic to find out. We'll also use an `assert`ion to check that all the approaches give the same answer." + ] + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "valid_word_count = len([w for w in all_words if valid_word_re.match(w)])\n", + "valid_word_count" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "metadata": {}, + "output_type": "pyout", + "prompt_number": 38, + "text": [ + "62856" + ] + } + ], + "prompt_number": 38 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%timeit\n", + "words = [w for w in all_words if check_word_explicit(w)]\n", + "assert len(words) == valid_word_count" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "stream": "stdout", + "text": [ + "10 loops, best of 3: 70.2 ms per loop\n" + ] + } + ], + "prompt_number": 48 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%timeit\n", + "words = [w for w in all_words if check_word_short_circuit(w)]\n", + "assert len(words) == valid_word_count" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "stream": "stdout", + "text": [ + "10 loops, best of 3: 59.4 ms per loop\n" + ] + } + ], + "prompt_number": 40 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%timeit\n", + "words = [w for w in all_words if 
check_word_comprehension(w)]\n", + "assert len(words) == valid_word_count" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "stream": "stdout", + "text": [ + "10 loops, best of 3: 107 ms per loop\n" + ] + } + ], + "prompt_number": 41 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%timeit\n", + "words = [w for w in all_words if check_word_comprehension_clever(w)]\n", + "assert len(words) == valid_word_count" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "stream": "stdout", + "text": [ + "10 loops, best of 3: 107 ms per loop\n" + ] + } + ], + "prompt_number": 42 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%timeit\n", + "words = [w for w in all_words if check_word_recursive(w)]\n", + "assert len(words) == valid_word_count" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "stream": "stdout", + "text": [ + "10 loops, best of 3: 174 ms per loop\n" + ] + } + ], + "prompt_number": 43 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%timeit\n", + "words = [w for w in all_words if re.match(r'^[a-z]+$', w)]\n", + "assert len(words) == valid_word_count" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "stream": "stdout", + "text": [ + "10 loops, best of 3: 86.6 ms per loop\n" + ] + } + ], + "prompt_number": 45 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [ + "%%timeit\n", + "words = [w for w in all_words if valid_word_re.match(w)]\n", + "assert len(words) == valid_word_count" + ], + "language": "python", + "metadata": {}, + "outputs": [ + { + "output_type": "stream", + "stream": "stdout", + "text": [ + "10 loops, best of 3: 30.4 ms per loop\n" + ] + } + ], + "prompt_number": 46 + }, + { + "cell_type": "code", + "collapsed": false, + "input": [], + "language": "python", + "metadata": {}, + "outputs": 
[], + "prompt_number": 46 + } + ], + "metadata": {} + } + ] +} \ No newline at end of file
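The notebook's timings rely on the IPython `%%timeit` cell magic, which is not available in a plain Python script. As a rough sketch only, the same kind of comparison could be run with the standard-library `timeit` module. The names `SAMPLE_WORDS`, `check_loop`, `check_all`, `check_regex` and `filter_words` are made up for this illustration and do not appear in the notebook, and the tiny sample list stands in for `/usr/share/dict/british-english`, so the timings it prints will not resemble the notebook's figures.

```python
import re
import string
import timeit

# Hypothetical stand-in for the dictionary file used in the notebook.
SAMPLE_WORDS = ["hello", "heLLo", "it's", "zebra", "Zebra", "co-op", "apple"]

# Compiled once, as the notebook does with valid_word_re.
VALID_WORD_RE = re.compile(r'^[a-z]+$')


def check_loop(word):
    """Short-circuiting explicit loop."""
    for letter in word:
        if letter not in string.ascii_lowercase:
            return False
    return True


def check_all(word):
    """Generator expression passed to all()."""
    return all(letter in string.ascii_lowercase for letter in word)


def check_regex(word):
    """Pre-compiled regular expression."""
    return VALID_WORD_RE.match(word) is not None


def filter_words(checker, words):
    """Keep only the words that the given checker accepts."""
    return [w for w in words if checker(w)]


# Use one approach as the reference answer, then check the others agree.
expected = filter_words(check_regex, SAMPLE_WORDS)

for checker in (check_loop, check_all, check_regex):
    assert filter_words(checker, SAMPLE_WORDS) == expected
    seconds = timeit.timeit(lambda: filter_words(checker, SAMPLE_WORDS), number=1000)
    print(f"{checker.__name__}: {seconds:.4f} s for 1000 runs")
```

One difference to keep in mind: the loop and `all()` versions accept an empty string, while the regex requires at least one letter because of the `+`. The approaches only agree when the word list contains no empty entries, which appears to be the case for the dictionary used in the notebook, since its assertions pass.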