Added explanations to the duplicate letter solutions
authorNeil Smith <neil.git@njae.me.uk>
Sun, 8 Feb 2015 19:00:37 +0000 (19:00 +0000)
committerNeil Smith <neil.git@njae.me.uk>
Sun, 8 Feb 2015 19:00:37 +0000 (19:00 +0000)
SIGNED.md
duplicate-letters.ipynb

index 447be387a488f8998f0b91cedccc7a7f43af46a0..5e8e1781dd5b786b3244527ae8188d01178192e6 100644 (file)
--- a/SIGNED.md
+++ b/SIGNED.md
@@ -3,19 +3,19 @@
 -----BEGIN PGP SIGNATURE-----
 Version: GnuPG v2
 
-iQIcBAABAgAGBQJUwfg1AAoJEJPB2e07PgbqpYEP/3HVphHcvq977tiPTB854NAY
-YDM2QPnKQ7kQetphTX9Y/U0qwGeYph+Rd00lqdJgBJLhJr15UZHEHd8SLF+yB58x
-m39e6YBwdzsyilpHl/GHKKtDjcXuMsnNvCVGdpeCP/319i0FTSzT/kBpkyBQAXK0
-TCBocPo5GaeSAgxdZ1DJA73wMlCiQzJsv9Vjf7rg3APwetuFwY8ZpVwfcslglnsX
-vGu1lfbdIvVpbhyjGxyI0zaXzehVuCIyBb6Hu0lw0FGU060ntN75Pz++Rf3LrE+U
-0XnCeYjbbWLTzVE436mWB45MJnc48dTzdsLMDMEE5z0dGifJf1nbtNgqsVRi1vZQ
-i9yGCCwKPP6rhZ1ezfNfpmzjQ6riF26RXYq9V4w6oCDTbIN9vjPFNC0+5YYYDjM1
-ROcccxFvUKTkb91NECcAbZGqPFWovRvaFX7KzGyizEWCGMoD7/5rpRw3YrY/x14C
-lhZ6n5AwEAqL9RBZk+Mrubz+YgNSDbTM/+QcL5PHGb8mRC41PD4xSo4PCPU6AB+y
-vDSW523YNkhHB3IC3uAijZppimyk484IYsuw1WY9uVdClgCsUcs2VT8E3GOQOemQ
-34ZaPgeZR/JLKVpLo3lCuxbgI2JsNvGYmFw8moTcW4mTlO3dmPSi6s3bns3/GNA8
-OTu2ZMrmO+Lff9JwTr5L
-=SFX3
+iQIcBAABAgAGBQJU17JdAAoJEJPB2e07PgbqqdAP/A0wHqJCvYJ1FhrfqBZlMm0w
+d6rIBnNCIxCOLhZqdHlKNYJc3BQd+p82L+OXouV6c0pXdxd57jNCPZxPUWRE5g8t
++jyO93zJ7MNoQJKGrVwHtlqbmzrMWfEzuc6yFsbmX3Vj64NFtc1XykTCnn+z+cq8
+ARAkW31nf0a2BVylX+RdEdHf0o2eSukfNolCAKNSj5EJXoEKmPiHnl++2O+E3esS
+Lg/NIy1RREWT18GNWJyM6mp/54RmLi4UDEczp7M4QVPAy0KtbhtalVktgcE0OYDd
++92FsmUcr5Eh49FC1CADmviIzcAoGB3rm178zOnK7UiFZOCF8stGNObD+FrzOudi
+ZUQmt+/88iKP0Y+KUqbktC8LukNFKAOwKP+acByt9J7EJ82d0WOg0r+muFxidld+
+Y308Ka6avk4FcjG7Z6wnO8gqHBV8dkAaMl4fA9r9lQmOJrWCGiRv4KtOY0BeVYXT
+17Izc5j87R75N3QpjsG16Z1IqUWKXCd5XRtfWfKHRi7tWPw7O8B6OPQYYeZAz51T
+jqcXAENAy+1X7eIm9i9dbvFQ60zy19kt18eeNQMDpvLWMzvG0Xm6c5R8zPGvRq7H
+G579la7H/pNGXqYWughjnzldgKDbrNIhlBEg8uELL2WPBOakRxISzf8yhSuHyA4J
+EozFRjYaEH4P3q3RU5WE
+=fjvz
 -----END PGP SIGNATURE-----
 
 ```
@@ -38,6 +38,7 @@ size    exec  file                                 contents
 27510           big-o-chart-2.png                  a9f40d4499bd43d07bad2b9674b3055a9a3f319349a1256d88a9419f206287d7|56acfe8154f080f9d3186170980a44efd640d4c1748c37c7031f1f20d2f3342e
 11222           big-o-chart-smaller.png            5b453ac13c187ea797dbdcf03c2f1941c730a952e285d2afbfde83cf231f0bc3|ba81f54c5daae7c199d9eb66f58db952e52e6b15cf9c69ed34245c4400ffc318
 31971           big-o-chart.png                    82cea33b61f5af16f928d7037877b31526e532ea62a7cf99ca3761bc4313e4f1|0c43ead93dc010d2fee9724cfda23f2d922cf58931d6258f1da9977fd5bb8cd1
+23974           duplicate-letters.ipynb            dae47262661abc6bca7526b10cd9928905b94eb6b4c04577a8491be10bad3a4d                                                                 
 21947           euler-11.ipynb                     81780dca17496751894f582e8691935ee6f94949017fb3135087056489f8ad88                                                                 
 24903           euler-14.ipynb                     3652a2be86ec270420431814efa453f8679ad054ae28c015d34232c48dac16bd                                                                 
                 hangman/                                                                                                                                                            
index 0b15536d10007e9842e149b979055bdb02ba1d27..26c182e582466f6518b5a25d5544ae5c74a873b4 100644 (file)
@@ -1,7 +1,7 @@
 {
  "metadata": {
   "name": "",
-  "signature": "sha256:19128e67a9754262fffc5d0706d52d873b5cd867f34b15dc0cb1987a03c4dd42"
+  "signature": "sha256:9a1b8b45577dfcc1daaffba595f74655068367c9070816fe6eeb9345bb55729d"
  },
  "nbformat": 3,
  "nbformat_minor": 0,
@@ -12,7 +12,8 @@
      "cell_type": "markdown",
      "metadata": {},
      "source": [
-      "You need to write a procedure that returns all duplicate characters from a given String.\n",
+      "#The specification\n",
+      "Write a procedure that returns all duplicate characters from a given String.\n",
       "\n",
       "For example if String is \"Java\" then program should return [\"a\"].\n",
       "\n",
       "Bonus points if you also write unit tests for normal and edge cases."
      ]
     },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "##Solution 1\n",
+      "Walk over the word, incrementing a count for each letter as you go (stored in a `dict`). Then, walk over the set of counts and record those with a count greater than one."
+     ]
+    },
     {
      "cell_type": "code",
      "collapsed": false,
      ],
      "prompt_number": 11
     },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "##Solution 2\n",
+      "As solution 1, but using a `defaultdict` to record the counts, but pretending that each count is zero if it hasn't been seen already. Note that this makes the logic simpler in the first loop."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "import collections\n",
+      "\n",
+      "def duplicates_via_defaultdict(word):\n",
+      "    counts = collections.defaultdict(int)\n",
+      "    for letter in word:\n",
+      "        counts[letter] += 1\n",
+      "    duplicates = []\n",
+      "    for letter in counts:\n",
+      "        if counts[letter] > 1:\n",
+      "            duplicates += [letter]\n",
+      "    return duplicates"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 34
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "duplicates_via_defaultdict('occurrences')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "metadata": {},
+       "output_type": "pyout",
+       "prompt_number": 35,
+       "text": [
+        "['c', 'e', 'r']"
+       ]
+      }
+     ],
+     "prompt_number": 35
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "##Solution 3\n",
+      "Define a helper function to count all the occurrences of a particular letter in the word. \n",
+      "\n",
+      "Walk over the word and find how many times that letter occurs in the the word (stored in a `dict`). Then, walk over the set of counts and record those with a count greater than one.\n",
+      "\n",
+      "Note that the first loop doesn't check if a letter already exists in the dict. You could introduce a test to avoid duplicating calls to `count_letter`."
+     ]
+    },
     {
      "cell_type": "code",
      "collapsed": false,
      ],
      "prompt_number": 12
     },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "##Solution 4\n",
+      "A direct approach using comprehensions. The first one is a nested comprehension that counts how many times each letter appears in the word. The second comprehension just returns the letters that are duplicated."
+     ]
+    },
     {
      "cell_type": "code",
      "collapsed": false,
      ],
      "prompt_number": 65
     },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "##Solution 5\n",
+      "A recursive approach. The idea is that a letter is a duplicate if it's either the first letter of the string and it occurs in the rest of the string, or it's a duplicate if it's duplicated in the rest of the string.\n",
+      "\n",
+      "We keep track of the duplicates found so far to avoid double-counting letters that occur more than twice in the string."
+     ]
+    },
     {
      "cell_type": "code",
      "collapsed": false,
      ],
      "prompt_number": 26
     },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "##Solution 6\n",
+      "Sort the word, then walk along it. A letter is duplicated if you find two consecutive letters the same. Store the results in a set, so that we don't have to worry about double-counting more than two letters."
+     ]
+    },
     {
      "cell_type": "code",
      "collapsed": false,
      ],
      "prompt_number": 30
     },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "##Solution 7\n",
+      "As above, but this time we store the duplicates in a list and use an additional variable `found_duplicates` to keep the state of whether we've found duplicates of this letter.\n",
+      "\n",
+      "Initially `found_duplicates` if `False`. As soon as we find a second occurrence of the letter, set `found_duplicates` to `True` and record the letter as a duplicate. Then continue to walk along the list until we find a different letter, when we reset `found_duplicates` to `False` again."
+     ]
+    },
     {
      "cell_type": "code",
      "collapsed": false,
      ],
      "prompt_number": 56
     },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "##Solution 8\n",
+      "Sort the string then split it into a list of lists, where each sublist is the group of identical letters that occur together. Then find the segments that have more than one letter."
+     ]
+    },
     {
      "cell_type": "code",
      "collapsed": false,
      ],
      "prompt_number": 50
     },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "##Solution 9\n",
+      "The same as above, but without having to sort the word first. This requires an additional loop, where we look for the bucket to place this letter. If we can't find the bucket, create a new one."
+     ]
+    },
     {
      "cell_type": "code",
      "collapsed": false,
      ],
      "prompt_number": 44
     },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "##Solution 10\n",
+      "Keep two lists: a list of pending duplicates, and a list of actual duplicates. The first time we see a letter, add it to the pending list. If we see the letter again, it's already in the pending list, so that prompts us to add it to the duplicates list."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "def duplicates_via_pending_list(word):\n",
+      "    pending = []\n",
+      "    duplicates = []\n",
+      "    for letter in word:\n",
+      "        if letter in pending:\n",
+      "            if letter not in duplicates:\n",
+      "                duplicates.append(letter)\n",
+      "        else:\n",
+      "            pending.append(letter)\n",
+      "    return duplicates"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 27
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "duplicates_via_pending_list('occurrences')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "metadata": {},
+       "output_type": "pyout",
+       "prompt_number": 28,
+       "text": [
+        "['c', 'r', 'e']"
+       ]
+      }
+     ],
+     "prompt_number": 28
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "##Solution 11\n",
+      "Exactly the same as above, but using sets instead of lists. This means we don't have to worry about adding more than one copy of the letter to either `pending` or `duplicates`."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "def duplicates_via_pending_set(word):\n",
+      "    pending = set()\n",
+      "    duplicates = set()\n",
+      "    for letter in word:\n",
+      "        if letter in pending:\n",
+      "            duplicates.add(letter)\n",
+      "        pending.add(letter)\n",
+      "    return list(duplicates)"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 31
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "duplicates_via_pending_set('occurrences')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "metadata": {},
+       "output_type": "pyout",
+       "prompt_number": 32,
+       "text": [
+        "['c', 'r', 'e']"
+       ]
+      }
+     ],
+     "prompt_number": 32
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "##Solution 12\n",
+      "Creating a set removes all the duplicates. Make a set from the word, then remove each letter in the set from the word. Whatever letters are left are the duplicates. (Convert the duplicates to a set then to a list to eliminate letters that occur more than three times.)"
+     ]
+    },
     {
      "cell_type": "code",
      "collapsed": false,
      ],
      "prompt_number": 32
     },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "##Solution 13\n",
+      "Use the `Counter` in the standard `collections` library. This will count the occurrences of each letter, allowing us to simply find those with a count of more than one."
+     ]
+    },
     {
      "cell_type": "code",
      "collapsed": false,
      ],
      "prompt_number": 34
     },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "##Solution 14\n",
+      "Similar to solution 12 above, but this time using Counters as multisets (or bags), where we can just do multiset subtraction to find the duplicates. `letters` will have a count of 1 for each letter."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "import collections\n",
+      "\n",
+      "def duplicates_via_counter_and_set(word):\n",
+      "    letters = collections.Counter(set(word))\n",
+      "    counts = collections.Counter(word)\n",
+      "    duplicates = counts - letters\n",
+      "    return [letter for letter in duplicates]"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 24
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "duplicates_via_counter_and_set('occurrences')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "metadata": {},
+       "output_type": "pyout",
+       "prompt_number": 25,
+       "text": [
+        "['c', 'r', 'e']"
+       ]
+      }
+     ],
+     "prompt_number": 25
+    },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "##Solution 15\n",
+      "Using the `itertools.groupby` standard library function to find all the groups in the sorted word. We can then pull out the longer groups. \n",
+      "\n",
+      "The same approach as solution 8 above, but using library functions to simplify the implementation."
+     ]
+    },
     {
      "cell_type": "code",
      "collapsed": false,
      ],
      "prompt_number": 42
     },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "##Solution 16\n",
+      "Using `functools.reduce` to merge the counts of letters in sublists of the word. Start by counting all the letters in the length one sublists, then combine them. Finally, find the letters with counts more than one."
+     ]
+    },
     {
      "cell_type": "code",
      "collapsed": false,
      ],
      "prompt_number": 63
     },
+    {
+     "cell_type": "markdown",
+     "metadata": {},
+     "source": [
+      "##Solution 17\n",
+      "Similar to above, but refactored to more clearly show the separate `map` and `reduce` stages of the map-reduce strategy.\n",
+      "\n",
+      "Also uses the `update` method of `Counter`s to merge intermediate results."
+     ]
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "import functools\n",
+      "import collections\n",
+      "\n",
+      "def duplicates_via_map_reduce_update(word):\n",
+      "    def mapper(letter):\n",
+      "        return collections.Counter(letter)\n",
+      "    \n",
+      "    def reducer(left, right):\n",
+      "        left.update(right.elements())\n",
+      "        return left\n",
+      "    \n",
+      "    letters = map(mapper, word)\n",
+      "    counts = functools.reduce(reducer, letters)\n",
+      "    return [letter for letter, count in counts.items() if count > 1]"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [],
+     "prompt_number": 36
+    },
+    {
+     "cell_type": "code",
+     "collapsed": false,
+     "input": [
+      "duplicates_via_map_reduce_update('occurrences')"
+     ],
+     "language": "python",
+     "metadata": {},
+     "outputs": [
+      {
+       "metadata": {},
+       "output_type": "pyout",
+       "prompt_number": 37,
+       "text": [
+        "['c', 'e', 'r']"
+       ]
+      }
+     ],
+     "prompt_number": 37
+    },
     {
      "cell_type": "code",
      "collapsed": false,