Removing files from data analysis directory
[ou-summer-of-code-2017.git] / 01-ticket-prices / ticket-pricing-solution.ipynb
1 {
2 "cells": [
3 {
4 "cell_type": "markdown",
5 "metadata": {},
6 "source": [
7 "# Ticket pricing\n",
8 "\n",
9 "You've been shopping around for a holiday package deal and its time to make your choice of which deal to go with. The file [01-holidays.txt](01-holidays.txt) contains a summary of your investigations. \n",
10 "\n",
11 "It's a simple text file, with one possible holiday package per line.\n",
12 "\n",
13 "Each line has four fields, separated by spaces. They are:\n",
14 "* The deal ID, from the price comparison website you found it.\n",
15 "* The holiday price, in whole pounds.\n",
16 "* The location of the holiday, always a single word.\n",
17 "* The number of nights you'd be staying. \n",
18 "\n",
19 "For example, the data file might look like this:\n",
20 "\n",
21 "```\n",
22 "db61bb90 769 Morgantown 3\n",
23 "202c898b5f 1284 Morgantown 21\n",
24 "def36ffcd 1514 Giessenmestia 21\n",
25 "389018bd0707 1052 Estacada 21\n",
26 "a487c4270 782 Geoje-Si 14\n",
27 "6caf2584a55 724 Stonington-Island 14\n",
28 "199608abc5 1209 Nordkapp 21\n",
29 "```"
30 ]
31 },
32 {
33 "cell_type": "markdown",
34 "metadata": {},
35 "source": [
36 "## Part 1\n",
37 "You have a budget of £1200. How many of the holidays can you afford?\n",
38 "\n",
39 "Given the example data above, you could afford four of the holidays: the trips to Estacada, Geoje-Si and Stonnington-Island, and the three-day trip to Morgantown. \n",
40 "\n",
41 "The 21 day trip to Morgantown and the trips to Giessenmestia and Nordkapp are all too expensive."
42 ]
43 },
44 {
45 "cell_type": "markdown",
46 "metadata": {},
47 "source": [
48 "# Part 2\n",
49 "You don't just want _a_ holiday. You want the _best_ holiday. What is the code of the holiday which would give you the best value?\n",
50 "\n",
51 "The \"value\" of a holiday is the duration per pound. Because some destinations are better than others, you'll want to scale the value for some locations. For instance, a night in Timbuktu is worth three times as much as a holiday in Bletchley.\n",
52 "\n",
53 "Assume all holidays have a relative value of 1, apart from these destinations.\n",
54 "\n",
55 "| Destination | Score |\n",
56 "|-------------|-------|\n",
57 "| Almaty | 2.0 |\n",
58 "| Brorfelde | 0.9 |\n",
59 "| Estacada | 0.4 |\n",
60 "| Jayuya | 0.6 |\n",
61 "| Karlukovo | 2.2 |\n",
62 "| Morgantown | 2.9 |\n",
63 "| Nordkapp | 1.5 |\n",
64 "| Nullarbor | 2.2 |\n",
65 "| Puente-Laguna-Garzonkuala-Penyu | 0.4 |\n",
66 "| Uzupis | 0.9 |\n",
67 "\n",
68 "## Example\n",
69 "\n",
70 "Given the holiday list above, the holiday to Geoje-Si (with the standard weighting of 1) has a value of $\\frac{14}{782} = 0.0179$ nights per pound. \n",
71 "\n",
72 "The trip to Estacada looks promising, at $\\frac{21}{1052} = 0.0200$ nights per pound. Unfortunately, the weighting for Estacada is low, to the adjusted cost is $0.4 \\times \\frac{21}{1052} = 0.00798$ nights per pound.\n",
73 "\n",
74 "The best value holiday is the 21 day trip to Morgantown, with a value of $2.9 \\times \\frac{21}{1284} = 0.0474$ nights per pound. Unfortunately, it's unaffordable. \n",
75 "\n",
76 "The best value affordable holiday is the trip to Stonnington Island, with $\\frac{14}{1284} = 0.0193$ nights per pound."
77 ]
78 },
79 {
80 "cell_type": "markdown",
81 "metadata": {},
82 "source": [
83 "# Worked example of solving\n",
84 "\n",
85 "For those still getting to grips with things, I thought I'd show how I'd work through solving the day 1 task.\n",
86 "\n",
87 "My first step when looking at this kind of thing is to think about data structures and operations. What data do we have, and what do we need to do with it?\n",
88 "\n",
89 "We're given a list of holidays. In the first part of the task, we have to filter that list against some condition and count how many meet the condition. In the second part, we have to find the holiday with some maximal value. \n",
90 "\n",
91 "How to store holidays? We just need something we can walk over (iterate over). There are many choices. We could just use a list, or a set (if we assume each holiday is distinct), or a dict/map/hashmap, using the holiday ID as a key. They're all much of a sameness, so let's take the path of least resistance and use a list.\n",
92 "\n",
93 "How to represent each holiday? Each holiday is a record of four parts (ID, destination, price, duration). I could use a dict or (in Python) a namedtuple to store the parts. That has the advantage that it gives a name to each of the parts. The alternative is to use a list to store each holiday and refer to the parts by number. Either is good, but as this task is small, I'll be able to get away with just remembering four numbers and what they mean.\n",
94 "\n",
95 "So, my chosen representation is a list of lists. That's not the best representation, but it's good enough for this task."
96 ]
97 },
98 {
99 "cell_type": "markdown",
100 "metadata": {},
101 "source": [
102 "Next step: reading the list of holidays. Let's just start by trying to read in the data file and split it into a list, one element for each holiday. Let's not try to sort out each holiday in that list at the moment.\n",
103 "\n",
104 "I could look through the Python documentation for how to do this, or I could hit a search engine. \n",
105 "\n",
106 "I'm not the first to have done that. Typing \"python read file lines\" suggests and autocomplete of \"python read file lines into a list\", which is spot on, and leads to a [StackOverflow post](https://stackoverflow.com/questions/3925614/how-do-you-read-a-file-into-a-list-in-python). Result!\n",
107 "\n",
108 "Let's try the top-rated answer, with some print statements to check what's going on:\n",
109 "```\n",
110 "with open('01-holidays.txt') as f:\n",
111 " lines = f.read().splitlines()\n",
112 "print(len(lines), lines[0])\n",
113 "```\n",
114 "That gives output\n",
115 "```\n",
116 "124 dda7d369 1546 Uzupis 21\n",
117 "```\n",
118 "which looks right: 124 lines, and that's the first line in the file.\n",
119 "\n",
120 "Another quick search for splitting a string into a list points me in the direction of .split().\n",
121 "\n",
122 "I can combine those ideas like this:\n",
123 "\n",
124 "```\n",
125 "holidays = []\n",
126 "with open('01-holidays.txt') as f:\n",
127 " for line in f.read().splitlines():\n",
128 " holidays += [line.split()]\n",
129 " \n",
130 "print(len(lines), holidays[0])\n",
131 "```\n",
132 "\n",
133 "But I actually like the second answer of that Stack Overflow post better, and prefer to use the list comprehension:\n",
134 "\n",
135 "```\n",
136 "holidays = [line.strip().split() for line in open('01-holidays.txt')]\n",
137 "\n",
138 "print(len(lines), holidays[0])\n",
139 "```\n",
140 "\n",
141 "The basic format is `[<do something to item> for <item> in <bunch of items>]`\n",
142 "\n",
143 "I think it's cleaner, and it's certainly more compact."
144 ]
145 },
146 {
147 "cell_type": "markdown",
148 "metadata": {},
149 "source": [
150 "Now we have the holidays, time to find the affordable ones. \n",
151 "\n",
152 "The basic idea is to keep a count of the number of affordable holidays as we walk along the list of holidays. Every time we get to an affordable one, increase the count by one. The price is in the second element of the holiday record, which is number 1 in Python's count-from-zero world.\n",
153 "\n",
154 "That turns quite simply into code:\n",
155 "\n",
156 "```\n",
157 "affordable_count = 0\n",
158 "for holiday in holidays:\n",
159 " if holiday[1] <= 1200:\n",
160 " affordable_count += 1\n",
161 "\n",
162 "print(affordable_count)\n",
163 "```\n",
164 "\n",
165 "…which gives an error about unorderable types, comparing `str()` with `int()`.\n",
166 "\n",
167 "That makes sense. Strings and numbers are different, and what we've got is a list of strings. A quick search online tells us that `int(something)` will convert `something` into a number, if possible. Let's make that change and try again.\n",
168 "\n",
169 "```\n",
170 "affordable_count = 0\n",
171 "for holiday in holidays:\n",
172 " if int(holiday[1]) <= 1200:\n",
173 " affordable_count += 1\n",
174 "\n",
175 "print(affordable_count)\n",
176 "```\n",
177 "\n",
178 "And that gives the right answer!"
179 ]
180 },
181 {
182 "cell_type": "markdown",
183 "metadata": {},
184 "source": [
185 "Now for part 2. \n",
186 "\n",
187 "At first sight, its intimidating. Complex calculation, lots of data to move around. But, let's just start with what we can do easily and see where we end up.\n",
188 "\n",
189 "Each destination has a weight for how much it's liked. If a weight's not given, we use 1.\n",
190 "\n",
191 "We can store the given weights in a `dict`: it's what they were made for. Associate some value with a key; in this case, the key is the destination name and the value is the weight.\n",
192 "\n",
193 "```\n",
194 "destination_values = {'Almaty': 2.0, 'Brorfelde': 0.9, 'Estacada': 0.4, 'Jayuya': 0.6, 'Karlukovo': 2.2, \n",
195 " 'Morgantown': 2.9,'Nordkapp': 1.5, 'Nullarbor': 2.2, \n",
196 " 'Puente-Laguna-Garzonkuala-Penyu': 0.4, 'Uzupis': 0.9}\n",
197 "```\n",
198 "\n",
199 "We can get the values from the dict quite easily:\n",
200 "\n",
201 "```\n",
202 "print(destination_values['Jayaya'])\n",
203 "```\n",
204 "\n",
205 "How to get the value when the destination isn't in the table? We could use the defaultdict from the Python library, or we could use the `get()` method which accepts a default value, or we could just wrap some control in a function.\n",
206 "\n",
207 "```\n",
208 "def value_of_destination(name):\n",
209 " if name in destination_values:\n",
210 " return destination_values[name]\n",
211 " else:\n",
212 " return 1\n",
213 "```\n",
214 "\n",
215 "And try it:\n",
216 "\n",
217 "```\n",
218 "print(destination_values['Mamula'])\n",
219 "```"
220 ]
221 },
222 {
223 "cell_type": "markdown",
224 "metadata": {},
225 "source": [
226 "Now we can find the destination weight for each destination, how to find the value of a holiday? Again, wrap that in a function so we can call it without getting bogged down in the details. \n",
227 "\n",
228 "The tricky part here is to keep track of data types. We have to convert strings to numbers again, but division involving integers will often give an integer as the answer. The standard trick in these situation is to make sure that at least one number in each calculation is a floating-point number, and Python will ensure that the whole calculation is done a floating point numbers.\n",
229 "\n",
230 "```\n",
231 "def value_of_holiday(holiday):\n",
232 " hid, cost, destination, duration = tuple(holiday)\n",
233 " value = value_of_destination(destination) * float(duration) / int(cost)\n",
234 " return value\n",
235 "```\n",
236 "\n",
237 "Again, we can test it\n",
238 "\n",
239 "```\n",
240 "print(value_of_holiday(holidays[0]))\n",
241 "```"
242 ]
243 },
244 {
245 "cell_type": "markdown",
246 "metadata": {},
247 "source": [
248 "Now we have the parts for solving the task. Just like part 1, we'll walk over the list of holidays, keeping track of the best value one so far. We have remember two items now, though: the best value, and the code of the holiday with that value. We'll just use two separate variables to keep track of them.\n",
249 "\n",
250 "```\n",
251 "best_holiday = ''\n",
252 "best_value = 0\n",
253 "\n",
254 "for holiday in holidays:\n",
255 " if int(holiday[1]) <= 1200:\n",
256 " if value_of_holiday(holiday) > best_value:\n",
257 " best_value = value_of_holiday(holiday)\n",
258 " best_holiday = holiday[0] \n",
259 "\n",
260 "print(best_holiday)\n",
261 "```\n",
262 "\n",
263 "Rather than having two `if` statements, we could say \n",
264 "\n",
265 "```\n",
266 "if int(holiday[1]) <= 1200 and value_of_holiday(holiday) > best_value:\n",
267 "```\n",
268 "\n",
269 "but I'm not sure it's much better in this context.\n",
270 "\n",
271 "And there you have it!"
272 ]
273 },
274 {
275 "cell_type": "markdown",
276 "metadata": {},
277 "source": [
278 "I hope you found the exposition useful.\n",
279 "\n",
280 "Yes, I use the internet a lot to look up bits of syntax. It's much the same as looking in a textbook and finding snippets of code to reuse and repurpose. The skill of programming isn't so much the mastery of syntax as it is about understanding the problem and how to put together a solution. By all means look stuff up; a good programmer knows what to look up (or ask) and how to use the answer."
281 ]
282 },
283 {
284 "cell_type": "markdown",
285 "metadata": {},
286 "source": [
287 "### Solution"
288 ]
289 },
290 {
291 "cell_type": "code",
292 "execution_count": 2,
293 "metadata": {},
294 "outputs": [
295 {
296 "data": {
297 "text/plain": [
298 "[['dda7d369', '1546', 'Uzupis', '21'],\n",
299 " ['68022753', '1239', 'Mamula', '21'],\n",
300 " ['b261dbd1cef', '996', 'Holmegaard', '21']]"
301 ]
302 },
303 "execution_count": 2,
304 "metadata": {},
305 "output_type": "execute_result"
306 }
307 ],
308 "source": [
309 "holidays = []\n",
310 "with open('01-holidays.txt') as f:\n",
311 " for hol_line in f:\n",
312 " holidays.append(hol_line.split())\n",
313 " \n",
314 "holidays[:3]"
315 ]
316 },
317 {
318 "cell_type": "code",
319 "execution_count": 3,
320 "metadata": {},
321 "outputs": [
322 {
323 "data": {
324 "text/plain": [
325 "[['dda7d369', '1546', 'Uzupis', '21'],\n",
326 " ['68022753', '1239', 'Mamula', '21'],\n",
327 " ['b261dbd1cef', '996', 'Holmegaard', '21']]"
328 ]
329 },
330 "execution_count": 3,
331 "metadata": {},
332 "output_type": "execute_result"
333 }
334 ],
335 "source": [
336 "holidays = [line.split() for line in open('01-holidays.txt')]\n",
337 " \n",
338 "holidays[:3]"
339 ]
340 },
341 {
342 "cell_type": "code",
343 "execution_count": 13,
344 "metadata": {},
345 "outputs": [
346 {
347 "data": {
348 "text/plain": [
349 "59"
350 ]
351 },
352 "execution_count": 13,
353 "metadata": {},
354 "output_type": "execute_result"
355 }
356 ],
357 "source": [
358 "affordable_holidays = []\n",
359 "for h in holidays:\n",
360 " if int(h[1]) <= 1200:\n",
361 " affordable_holidays.append(h)\n",
362 "\n",
363 "len(affordable_holidays)"
364 ]
365 },
366 {
367 "cell_type": "code",
368 "execution_count": 5,
369 "metadata": {},
370 "outputs": [
371 {
372 "data": {
373 "text/plain": [
374 "59"
375 ]
376 },
377 "execution_count": 5,
378 "metadata": {},
379 "output_type": "execute_result"
380 }
381 ],
382 "source": [
383 "affordable_holidays = [h for h in holidays if int(h[1]) <= 1200]\n",
384 "len(affordable_holidays)"
385 ]
386 },
387 {
388 "cell_type": "code",
389 "execution_count": 14,
390 "metadata": {},
391 "outputs": [
392 {
393 "data": {
394 "text/plain": [
395 "124"
396 ]
397 },
398 "execution_count": 14,
399 "metadata": {},
400 "output_type": "execute_result"
401 }
402 ],
403 "source": [
404 "len(holidays)"
405 ]
406 },
407 {
408 "cell_type": "markdown",
409 "metadata": {},
410 "source": [
411 "### Smart-alec one-line solution"
412 ]
413 },
414 {
415 "cell_type": "code",
416 "execution_count": 4,
417 "metadata": {},
418 "outputs": [
419 {
420 "data": {
421 "text/plain": [
422 "59"
423 ]
424 },
425 "execution_count": 4,
426 "metadata": {},
427 "output_type": "execute_result"
428 }
429 ],
430 "source": [
431 "sum(1 for h in open('01-holidays.txt') if int(h.split()[1]) <= 1200)"
432 ]
433 },
434 {
435 "cell_type": "code",
436 "execution_count": 16,
437 "metadata": {
438 "collapsed": true
439 },
440 "outputs": [],
441 "source": [
442 "destination_values = {'Almaty': 2.0, 'Brorfelde': 0.9, 'Estacada': 0.4, 'Jayuya': 0.6, 'Karlukovo': 2.2, \n",
443 " 'Morgantown': 2.9,'Nordkapp': 1.5, 'Nullarbor': 2.2, \n",
444 " 'Puente-Laguna-Garzonkuala-Penyu': 0.4, 'Uzupis': 0.9}"
445 ]
446 },
447 {
448 "cell_type": "code",
449 "execution_count": 17,
450 "metadata": {
451 "collapsed": true
452 },
453 "outputs": [],
454 "source": [
455 "def value_of_destination(name):\n",
456 " if name in destination_values:\n",
457 " return destination_values[name]\n",
458 " else:\n",
459 " return 1"
460 ]
461 },
462 {
463 "cell_type": "code",
464 "execution_count": 18,
465 "metadata": {
466 "collapsed": true
467 },
468 "outputs": [],
469 "source": [
470 "def value_of_holiday(holiday):\n",
471 " hid, cost, destination, duration = tuple(holiday)\n",
472 " value = value_of_destination(destination) * float(duration) / int(cost)\n",
473 " return value"
474 ]
475 },
476 {
477 "cell_type": "code",
478 "execution_count": 19,
479 "metadata": {},
480 "outputs": [
481 {
482 "data": {
483 "text/plain": [
484 "'ee064e1e2ea'"
485 ]
486 },
487 "execution_count": 19,
488 "metadata": {},
489 "output_type": "execute_result"
490 }
491 ],
492 "source": [
493 "best_holiday = ''\n",
494 "best_value = 0\n",
495 "\n",
496 "for h in affordable_holidays:\n",
497 " if value_of_holiday(h) > best_value:\n",
498 " best_value = value_of_holiday(h)\n",
499 " best_holiday = h[0]\n",
500 " \n",
501 "best_holiday"
502 ]
503 },
504 {
505 "cell_type": "markdown",
506 "metadata": {},
507 "source": [
508 "## Smart-alec solution"
509 ]
510 },
511 {
512 "cell_type": "code",
513 "execution_count": 20,
514 "metadata": {},
515 "outputs": [
516 {
517 "data": {
518 "text/plain": [
519 "'ee064e1e2ea'"
520 ]
521 },
522 "execution_count": 20,
523 "metadata": {},
524 "output_type": "execute_result"
525 }
526 ],
527 "source": [
528 "# Right answer\n",
529 "max(affordable_holidays, key=value_of_holiday)[0]"
530 ]
531 },
532 {
533 "cell_type": "code",
534 "execution_count": 21,
535 "metadata": {},
536 "outputs": [
537 {
538 "data": {
539 "text/plain": [
540 "'c86e2e5826'"
541 ]
542 },
543 "execution_count": 21,
544 "metadata": {},
545 "output_type": "execute_result"
546 }
547 ],
548 "source": [
549 "# Answer if you don't filter by affordability\n",
550 "max(holidays, key=value_of_holiday)[0]"
551 ]
552 },
553 {
554 "cell_type": "code",
555 "execution_count": 23,
556 "metadata": {},
557 "outputs": [
558 {
559 "data": {
560 "text/plain": [
561 "'f60e203aaaf9'"
562 ]
563 },
564 "execution_count": 23,
565 "metadata": {},
566 "output_type": "execute_result"
567 }
568 ],
569 "source": [
570 "# Answer if you don't scale by perceived value\n",
571 "max(affordable_holidays, key=lambda h: float(h[3]) / float(h[1]))[0]"
572 ]
573 },
574 {
575 "cell_type": "code",
576 "execution_count": 22,
577 "metadata": {},
578 "outputs": [
579 {
580 "data": {
581 "text/plain": [
582 "'f60e203aaaf9'"
583 ]
584 },
585 "execution_count": 22,
586 "metadata": {},
587 "output_type": "execute_result"
588 }
589 ],
590 "source": [
591 "# Answer if you don't scale by perceived value, AND don't filter by affordability\n",
592 "max(holidays, key=lambda h: float(h[3]) / float(h[1]))[0]"
593 ]
594 }
595 ],
596 "metadata": {
597 "kernelspec": {
598 "display_name": "Python 3",
599 "language": "python",
600 "name": "python3"
601 },
602 "language_info": {
603 "codemirror_mode": {
604 "name": "ipython",
605 "version": 3
606 },
607 "file_extension": ".py",
608 "mimetype": "text/x-python",
609 "name": "python",
610 "nbconvert_exporter": "python",
611 "pygments_lexer": "ipython3",
612 "version": "3.5.2+"
613 }
614 },
615 "nbformat": 4,
616 "nbformat_minor": 2
617 }