{
 "metadata": {
  "name": "",
  "signature": "sha256:9a1b8b45577dfcc1daaffba595f74655068367c9070816fe6eeb9345bb55729d"
 },
 "nbformat": 3,
 "nbformat_minor": 0,
 "worksheets": [
  {
   "cells": [
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "#The specification\n",
      "Write a procedure that returns all duplicate characters from a given String.\n",
      "\n",
      "For example if String is \"Java\" then program should return [\"a\"].\n",
      "\n",
      "* \"Java\" -> ['a']\n",
      "* \"String\" -> []\n",
      "* \"Computer science\" -> [\"c\", \"e\"]\n",
      "* \"happy\" -> ['p']\n",
      "\n",
      "Bonus points if your program is robust and handle different kinds of input e.g. String without duplicate, null or empty String etc. \n",
      "\n",
      "Bonus points if you also write unit tests for normal and edge cases."
     ]
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Solution 1\n",
      "Walk over the word, incrementing a count for each letter as you go (stored in a `dict`). Then, walk over the set of counts and record those with a count greater than one."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def duplicates_via_dict(word):\n",
      "    counts = {}\n",
      "    for letter in word:\n",
      "        if letter in counts:\n",
      "            counts[letter] += 1\n",
      "        else:\n",
      "            counts[letter] = 1\n",
      "    duplicates = []\n",
      "    for letter in counts:\n",
      "        if counts[letter] > 1:\n",
      "            duplicates += [letter]\n",
      "    return duplicates"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 4
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "duplicates_via_dict('occurrences')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 11,
       "text": [
        "['r', 'e', 'c']"
       ]
      }
     ],
     "prompt_number": 11
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Solution 2\n",
      "As solution 1, but using a `defaultdict` to record the counts, but pretending that each count is zero if it hasn't been seen already. Note that this makes the logic simpler in the first loop."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import collections\n",
      "\n",
      "def duplicates_via_defaultdict(word):\n",
      "    counts = collections.defaultdict(int)\n",
      "    for letter in word:\n",
      "        counts[letter] += 1\n",
      "    duplicates = []\n",
      "    for letter in counts:\n",
      "        if counts[letter] > 1:\n",
      "            duplicates += [letter]\n",
      "    return duplicates"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 34
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "duplicates_via_defaultdict('occurrences')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 35,
       "text": [
        "['c', 'e', 'r']"
       ]
      }
     ],
     "prompt_number": 35
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Solution 3\n",
      "Define a helper function to count all the occurrences of a particular letter in the word. \n",
      "\n",
      "Walk over the word and find how many times that letter occurs in the the word (stored in a `dict`). Then, walk over the set of counts and record those with a count greater than one.\n",
      "\n",
      "Note that the first loop doesn't check if a letter already exists in the dict. You could introduce a test to avoid duplicating calls to `count_letter`."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def duplicates_via_counts(word):\n",
      "    def count_letter(word, letter):\n",
      "        count = 0\n",
      "        for l in word:\n",
      "            if l == letter:\n",
      "                count += 1\n",
      "        return count\n",
      "    counts = {}\n",
      "    for letter in word:\n",
      "        counts[letter] = count_letter(word, letter)\n",
      "    duplicates = []\n",
      "    for letter in counts:\n",
      "        if counts[letter] > 1:\n",
      "            duplicates += [letter]\n",
      "    return duplicates"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 8
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "duplicates_via_counts('occurrences')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 12,
       "text": [
        "['r', 'e', 'c']"
       ]
      }
     ],
     "prompt_number": 12
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Solution 4\n",
      "A direct approach using comprehensions. The first one is a nested comprehension that counts how many times each letter appears in the word. The second comprehension just returns the letters that are duplicated."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def duplicates_via_comprehension(word):\n",
      "    counts = {letter: len([l for l in word if l == letter]) for letter in word}\n",
      "    return [letter for letter in counts if counts[letter] > 1]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 64
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "duplicates_via_comprehension('occurrences')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 65,
       "text": [
        "['r', 'e', 'c']"
       ]
      }
     ],
     "prompt_number": 65
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Solution 5\n",
      "A recursive approach. The idea is that a letter is a duplicate if it's either the first letter of the string and it occurs in the rest of the string, or it's a duplicate if it's duplicated in the rest of the string.\n",
      "\n",
      "We keep track of the duplicates found so far to avoid double-counting letters that occur more than twice in the string."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def duplicates_via_recursion(word):\n",
      "    return duplicates_via_recursion_helper(word, [])\n",
      "\n",
      "def duplicates_via_recursion_helper(word, duplicates_so_far):\n",
      "    if word:\n",
      "        letter = word[0]\n",
      "        if letter in word[1:] and letter not in duplicates_so_far:\n",
      "            return duplicates_via_recursion_helper(word[1:], [letter] + duplicates_so_far)\n",
      "        else:\n",
      "            return duplicates_via_recursion_helper(word[1:], duplicates_so_far)\n",
      "    else:\n",
      "        return duplicates_so_far"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 25
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "duplicates_via_recursion('occurrences')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 26,
       "text": [
        "['e', 'r', 'c']"
       ]
      }
     ],
     "prompt_number": 26
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Solution 6\n",
      "Sort the word, then walk along it. A letter is duplicated if you find two consecutive letters the same. Store the results in a set, so that we don't have to worry about double-counting more than two letters."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def duplicates_via_sorting(word):\n",
      "    duplicates = set()\n",
      "    sorted_word = sorted(word)\n",
      "    for i in range(len(sorted_word) - 1):\n",
      "        if sorted_word[i] == sorted_word[i+1]:\n",
      "            duplicates.add(sorted_word[i])\n",
      "    return list(duplicates)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 29
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "duplicates_via_sorting('occurrences')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 30,
       "text": [
        "['r', 'e', 'c']"
       ]
      }
     ],
     "prompt_number": 30
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Solution 7\n",
      "As above, but this time we store the duplicates in a list and use an additional variable `found_duplicates` to keep the state of whether we've found duplicates of this letter.\n",
      "\n",
      "Initially `found_duplicates` if `False`. As soon as we find a second occurrence of the letter, set `found_duplicates` to `True` and record the letter as a duplicate. Then continue to walk along the list until we find a different letter, when we reset `found_duplicates` to `False` again."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def duplicates_via_runs(word):\n",
      "    duplicates = []\n",
      "    sorted_word = sorted(word)\n",
      "    i = 0\n",
      "    found_duplicates = False\n",
      "    for i in range(len(sorted_word) - 1):\n",
      "        if sorted_word[i] == sorted_word[i+1]:\n",
      "            if not found_duplicates:\n",
      "                duplicates += [sorted_word[i]]\n",
      "                found_duplicates = True\n",
      "        else:\n",
      "            found_duplicates = False\n",
      "    return duplicates"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 55
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "duplicates_via_runs('occurrences')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 56,
       "text": [
        "['c', 'e', 'r']"
       ]
      }
     ],
     "prompt_number": 56
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Solution 8\n",
      "Sort the string then split it into a list of lists, where each sublist is the group of identical letters that occur together. Then find the segments that have more than one letter."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def duplicates_via_segments(word):\n",
      "    segments = []\n",
      "    sorted_word = sorted(word)\n",
      "    segment_start = 0\n",
      "    for i in range(len(sorted_word) - 1):\n",
      "        if sorted_word[i] != sorted_word[i+1]:\n",
      "            segments += [sorted_word[segment_start:i+1]]\n",
      "            segment_start = i + 1\n",
      "    return [segment[0] for segment in segments if len(segment) > 1]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 49
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "duplicates_via_segments('occurrences')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 50,
       "text": [
        "['c', 'e', 'r']"
       ]
      }
     ],
     "prompt_number": 50
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Solution 9\n",
      "The same as above, but without having to sort the word first. This requires an additional loop, where we look for the bucket to place this letter. If we can't find the bucket, create a new one."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def duplicates_via_buckets(word):\n",
      "    buckets = []\n",
      "    for letter in word:\n",
      "        found_bucket = False\n",
      "        for bucket in buckets:\n",
      "            if bucket[0] == letter:\n",
      "                bucket += [letter]\n",
      "                found_bucket = True\n",
      "        if not found_bucket:\n",
      "            buckets += [[letter]]\n",
      "    duplicates = []\n",
      "    for bucket in buckets:\n",
      "        if len(bucket) > 1:\n",
      "            duplicates += bucket[0]\n",
      "    return duplicates"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 43
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "duplicates_via_buckets('occurrences')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 44,
       "text": [
        "['c', 'r', 'e']"
       ]
      }
     ],
     "prompt_number": 44
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Solution 10\n",
      "Keep two lists: a list of pending duplicates, and a list of actual duplicates. The first time we see a letter, add it to the pending list. If we see the letter again, it's already in the pending list, so that prompts us to add it to the duplicates list."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def duplicates_via_pending_list(word):\n",
      "    pending = []\n",
      "    duplicates = []\n",
      "    for letter in word:\n",
      "        if letter in pending:\n",
      "            if letter not in duplicates:\n",
      "                duplicates.append(letter)\n",
      "        else:\n",
      "            pending.append(letter)\n",
      "    return duplicates"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 27
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "duplicates_via_pending_list('occurrences')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 28,
       "text": [
        "['c', 'r', 'e']"
       ]
      }
     ],
     "prompt_number": 28
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Solution 11\n",
      "Exactly the same as above, but using sets instead of lists. This means we don't have to worry about adding more than one copy of the letter to either `pending` or `duplicates`."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def duplicates_via_pending_set(word):\n",
      "    pending = set()\n",
      "    duplicates = set()\n",
      "    for letter in word:\n",
      "        if letter in pending:\n",
      "            duplicates.add(letter)\n",
      "        pending.add(letter)\n",
      "    return list(duplicates)"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 31
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "duplicates_via_pending_set('occurrences')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 32,
       "text": [
        "['c', 'r', 'e']"
       ]
      }
     ],
     "prompt_number": 32
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Solution 12\n",
      "Creating a set removes all the duplicates. Make a set from the word, then remove each letter in the set from the word. Whatever letters are left are the duplicates. (Convert the duplicates to a set then to a list to eliminate letters that occur more than three times.)"
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "def duplicates_via_set(word):\n",
      "    letters = set(word)\n",
      "    duplicates = list(word)\n",
      "    for letter in letters:\n",
      "        duplicates.remove(letter)\n",
      "    return list(set(duplicates))"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 31
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "duplicates_via_set('occurrences')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 32,
       "text": [
        "['r', 'e', 'c']"
       ]
      }
     ],
     "prompt_number": 32
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Solution 13\n",
      "Use the `Counter` in the standard `collections` library. This will count the occurrences of each letter, allowing us to simply find those with a count of more than one."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import collections\n",
      "\n",
      "def duplicates_via_counter(word):\n",
      "    counts = collections.Counter(word)\n",
      "    return [letter for letter, count in counts.items() \n",
      "            if count > 1]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 33
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "duplicates_via_counter('occurrences')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 34,
       "text": [
        "['r', 'e', 'c']"
       ]
      }
     ],
     "prompt_number": 34
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Solution 14\n",
      "Similar to solution 12 above, but this time using Counters as multisets (or bags), where we can just do multiset subtraction to find the duplicates. `letters` will have a count of 1 for each letter."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import collections\n",
      "\n",
      "def duplicates_via_counter_and_set(word):\n",
      "    letters = collections.Counter(set(word))\n",
      "    counts = collections.Counter(word)\n",
      "    duplicates = counts - letters\n",
      "    return [letter for letter in duplicates]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 24
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "duplicates_via_counter_and_set('occurrences')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 25,
       "text": [
        "['c', 'r', 'e']"
       ]
      }
     ],
     "prompt_number": 25
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Solution 15\n",
      "Using the `itertools.groupby` standard library function to find all the groups in the sorted word. We can then pull out the longer groups. \n",
      "\n",
      "The same approach as solution 8 above, but using library functions to simplify the implementation."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import itertools\n",
      "\n",
      "def duplicates_via_groupby(word):\n",
      "    return [k for k, g in itertools.groupby(sorted(word)) \n",
      "            if len(list(g)) > 1]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 41
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "duplicates_via_groupby('occurrences')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 42,
       "text": [
        "['c', 'e', 'r']"
       ]
      }
     ],
     "prompt_number": 42
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Solution 16\n",
      "Using `functools.reduce` to merge the counts of letters in sublists of the word. Start by counting all the letters in the length one sublists, then combine them. Finally, find the letters with counts more than one."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import functools\n",
      "\n",
      "def duplicates_via_reduce(word):\n",
      "    def merge_counts(counts1, counts2):\n",
      "        counts = {l: c for l, c in counts1.items()}\n",
      "        for l, c in counts2.items():\n",
      "            if l in counts:\n",
      "                counts[l] += c\n",
      "            else:\n",
      "                counts[l] = c\n",
      "        return counts\n",
      "    \n",
      "    counts = [{letter: 1} for letter in word]\n",
      "    return [letter for letter, count in functools.reduce(merge_counts, counts).items() if count > 1]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 62
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "duplicates_via_reduce('occurrences')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 63,
       "text": [
        "['r', 'e', 'c']"
       ]
      }
     ],
     "prompt_number": 63
    },
    {
     "cell_type": "markdown",
     "metadata": {},
     "source": [
      "##Solution 17\n",
      "Similar to above, but refactored to more clearly show the separate `map` and `reduce` stages of the map-reduce strategy.\n",
      "\n",
      "Also uses the `update` method of `Counter`s to merge intermediate results."
     ]
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "import functools\n",
      "import collections\n",
      "\n",
      "def duplicates_via_map_reduce_update(word):\n",
      "    def mapper(letter):\n",
      "        return collections.Counter(letter)\n",
      "    \n",
      "    def reducer(left, right):\n",
      "        left.update(right.elements())\n",
      "        return left\n",
      "    \n",
      "    letters = map(mapper, word)\n",
      "    counts = functools.reduce(reducer, letters)\n",
      "    return [letter for letter, count in counts.items() if count > 1]"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [],
     "prompt_number": 36
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [
      "duplicates_via_map_reduce_update('occurrences')"
     ],
     "language": "python",
     "metadata": {},
     "outputs": [
      {
       "metadata": {},
       "output_type": "pyout",
       "prompt_number": 37,
       "text": [
        "['c', 'e', 'r']"
       ]
      }
     ],
     "prompt_number": 37
    },
    {
     "cell_type": "code",
     "collapsed": false,
     "input": [],
     "language": "python",
     "metadata": {},
     "outputs": []
    }
   ],
   "metadata": {}
  }
 ]
}