notebooks/04.5 Visualising Data/.ipynb_checkpoints/4.5.2 Getting Started With ggplot-checkpoint.ipynb

   1 {
   2  "metadata": {
   3   "name": "",
   4   "signature": "sha256:afb82a020546413ed833589b41bdec8f02beaae66e1085c91a668be000123418"
   5  },
   6  "nbformat": 3,
   7  "nbformat_minor": 0,
   8  "worksheets": [
   9   {
  10    "cells": [
  11     {
  12      "cell_type": "heading",
  13      "level": 1,
  14      "metadata": {},
  15      "source": [
  16       "Getting Started With ggplot"
  17      ]
  18     },
  19     {
  20      "cell_type": "markdown",
  21      "metadata": {},
  22      "source": [
  23       "*ggplot* is a Python port of *ggplot2*, Hadley Wickham's popular R implementation of many of the ideas proposed in Leland Wilkinson's *The Grammar of Graphics*. As such, it attempts to enforce good practice in the generation of charts from appropriately shaped datasets.\n",
  24       "\n",
  25       "The \"default\" graphics library for use with *pandas* is arguably the [matplotlib](http://matplotlib.org/) 2D Python plotting library, but whilst *matplotlib* offers a wide variety of powerful charting capabilities, it is often less concise and more convoluted than *ggplot*. It doesn't produce charts that are quite as pretty (or professional looking) as *ggplot* does out of the box either!\n",
  26       "\n",
  27       "*ggplot* also offers a cleaner separation of data and graphical transformations, in accord with Wilkinson's original model. The full original *ggplot2* implementation was also developed with the production of *statistical graphics* in mind. That is, the library provided a range of statistical transformations that could be applied to a dataset as part of the graphic generation process.\n",
  28       "\n",
  29       "The full *ggplot* documentation can be found here: [*ggplot* documentation](http://ggplot.yhathq.com/docs/).\n",
  30       "\n",
  31       "The Python *[statsmodels](http://statsmodels.sourceforge.net/)* library is one of the more widely used statistical computing libraries for Python, providing a range of powerful chart types as well as employing *pandas*-based data structures for representing datasets. But whilst *statsmodels* does support the generation of powerful statistical charts, it is rather lacking in support for simpler chart types.\n",
  32       "\n",
  33       "So on these grounds of what we might term, at worst, *principled expediency*, we will tend to focus on the use of *ggplot*, although you are free to explore other charting libraries yourself. If you particularly want to make use of interactive JavaScript-style charts, however, [Vincent](http://vincent.readthedocs.org/en/latest/) could be a good choice. If you prefer using *matplotlib*, that's fine too. However, we do expect that you *also* gain an understanding of how to make use of graphics libraries based on the ideas of *The Grammar of Graphics*."
  34      ]
  35     },
  36     {
  37      "cell_type": "heading",
  38      "level": 2,
  39      "metadata": {},
  40      "source": [
  41       "Finding Some Data..."
  42      ]
  43     },
  44     {
  45      "cell_type": "code",
  46      "collapsed": false,
  47      "input": [
  48       "import pandas as pd"
  49      ],
  50      "language": "python",
  51      "metadata": {},
  52      "outputs": []
  53     },
  54     {
  55      "cell_type": "markdown",
  56      "metadata": {},
  57      "source": [
  58       "The data we will be using in this notebook was released under a Freedom of Information request to the Isle of Wight Council and describes the revenue taken by two ticket machines in a particular pay and display car park over a twelve month period. (You can see the original FOI request here: [Pay and display ticket machine logs](https://www.whatdotheyknow.com/request/pay_and_display_ticket_machine_l).)"
  59      ]
  60     },
  61     {
  62      "cell_type": "markdown",
  63      "metadata": {},
  64      "source": [
  65       "The data is supplied in a set of Excel spreadsheets. We will open just one for now."
  66      ]
  67     },
  68     {
  69      "cell_type": "code",
  70      "collapsed": false,
  71      "input": [
  72       "!ls data"
  73      ],
  74      "language": "python",
  75      "metadata": {},
  76      "outputs": []
  77     },
  78     {
  79      "cell_type": "code",
  80      "collapsed": false,
  81      "input": [
  82       "! unzip data/pay_and_display_ticket_machine_l.zip -P "
  83      ],
  84      "language": "python",
  85      "metadata": {},
  86      "outputs": []
  87     },
  88     {
  89      "cell_type": "code",
  90      "collapsed": false,
  91      "input": [
  92       "df=pd.read_excel(\"data/iw_parkingMeterData/4_5_Transaction Report RR Dec 2012 March 2013.xls\")\n",
  93       "df[:10]"
  94      ],
  95      "language": "python",
  96      "metadata": {},
  97      "outputs": []
  98     },
  99     {
 100      "cell_type": "markdown",
 101      "metadata": {},
 102      "source": [
 103       "By inspection, we see that there are six rows before the header row. Let's try loading the data in by skipping those rows."
 104      ]
 105     },
 106     {
 107      "cell_type": "code",
 108      "collapsed": false,
 109      "input": [
 110       "df=pd.read_excel(\"data/iw_parkingMeterData/4_5_Transaction Report RR Dec 2012 March 2013.xls\", \\\n",
 111       "                 skiprows=6)"
 112      ],
 113      "language": "python",
 114      "metadata": {},
 115      "outputs": []
 116     },
 117     {
 118      "cell_type": "markdown",
 119      "metadata": {},
 120      "source": [
 121       "There appear to be some columns that just contain NaN values - let's tidy the dataset a little by dropping those columns."
 122      ]
 123     },
 124     {
 125      "cell_type": "code",
 126      "collapsed": false,
 127      "input": [
 128       "df.dropna(how='all',axis=1,inplace=True)\n",
 129       "df[:10]"
 130      ],
 131      "language": "python",
 132      "metadata": {},
 133      "outputs": []
 134     },
 135     {
 136      "cell_type": "code",
 137      "collapsed": false,
 138      "input": [
 139       "#Check to see how the columns are typed\n",
 140       "df.dtypes"
 141      ],
 142      "language": "python",
 143      "metadata": {},
 144      "outputs": []
 145     },
 146     {
 147      "cell_type": "markdown",
 148      "metadata": {},
 149      "source": [
 150       "Let's do a little more tidying:"
 151      ]
 152     },
 153     {
 154      "cell_type": "code",
 155      "collapsed": false,
 156      "input": [
 157       "#Cast the date column as a date type, specifying how to parse the dates.\n",
 158       "#Set errors='coerce' to cast any strings that aren't recognised to a NaT value.\n",
 159       "df.Date=pd.to_datetime(df.Date,  format=\"%Y-%m-%d %H:%M:%S\",errors='coerce')\n",
 160       "#The final row, whose \"Date\" value was the label Total, is actually a total row\n",
 161       "df[-4:]"
 162      ],
 163      "language": "python",
 164      "metadata": {},
 165      "outputs": []
 166     },
 167     {
 168      "cell_type": "code",
 169      "collapsed": false,
 170      "input": [
 171       "#Let's see if any other dates weren't recognised as such\n",
 172       "df[df[\"Date\"].isnull()]"
 173      ],
 174      "language": "python",
 175      "metadata": {},
 176      "outputs": []
 177     },
 178     {
 179      "cell_type": "code",
 180      "collapsed": false,
 181      "input": [
 182       "#Let's also just check the total by summing the values (except the total) in the Cash column\n",
 183       "df[['Cash']][:-1].sum()"
 184      ],
 185      "language": "python",
 186      "metadata": {},
 187      "outputs": []
 188     },
 189     {
 190      "cell_type": "code",
 191      "collapsed": false,
 192      "input": [
 193       "#Let's just check the row count too\n",
 194       "df[['Cash']][:-1].count()"
 195      ],
 196      "language": "python",
 197      "metadata": {},
 198      "outputs": []
 199     },
 200     {
 201      "cell_type": "code",
 202      "collapsed": false,
 203      "input": [
 204       "#Drop the final total row\n",
 205       "df.dropna(subset=[\"Date\"],inplace=True)"
 206      ],
 207      "language": "python",
 208      "metadata": {},
 209      "outputs": []
 210     },
 211     {
 212      "cell_type": "heading",
 213      "level": 2,
 214      "metadata": {},
 215      "source": [
 216       "Using ggplot"
 217      ]
 218     },
 219     {
 220      "cell_type": "code",
 221      "collapsed": false,
 222      "input": [
 223       "#The ggplot library is currently under active development\n",
 224       "#Grab the most recent version from the github repository\n",
 225       "#!pip3 uninstall -y ggplot\n",
 226       "#!pip3 install git+https://github.com/yhat/ggplot.git"
 227      ],
 228      "language": "python",
 229      "metadata": {},
 230      "outputs": []
 231     },
 232     {
 233      "cell_type": "markdown",
 234      "metadata": {},
 235      "source": [
 236       "To get started with ggplot we need to load it in."
 237      ]
 238     },
 239     {
 240      "cell_type": "code",
 241      "collapsed": false,
 242      "input": [
 243       "from ggplot import *"
 244      ],
 245      "language": "python",
 246      "metadata": {},
 247      "outputs": []
 248     },
 249     {
 250      "cell_type": "markdown",
 251      "metadata": {},
 252      "source": [
 253       "We call `ggplot` by passing in a dataframe and specifying the aesthetic mappings. We need to make sure we define an appropriate aesthetic mapping for each dimension we wish to represent in the final display."
 254      ]
 255     },
 256     {
 257      "cell_type": "heading",
 258      "level": 3,
 259      "metadata": {},
 260      "source": [
 261       "geom_point()"
 262      ]
 263     },
 264     {
 265      "cell_type": "markdown",
 266      "metadata": {},
 267      "source": [
 268       "We use `geom_point()` to generate a scatterplot. `geom_point()` requires at least *x* and *y* value mappings. In the `aes()` definition, assign the *x* and *y* coordinate axes to the (quoted) column names you wish to plot from the specified dataset."
 269      ]
 270     },
 271     {
 272      "cell_type": "code",
 273      "collapsed": false,
 274      "input": [
 275       "ggplot(df, aes(x=\"Date\",y=\"Cash\"))+geom_point()"
 276      ],
 277      "language": "python",
 278      "metadata": {},
 279      "outputs": []
 280     },
 281     {
 282      "cell_type": "heading",
 283      "level": 3,
 284      "metadata": {},
 285      "source": [
 286       "ggtitle()"
 287      ]
 288     },
 289     {
 290      "cell_type": "markdown",
 291      "metadata": {},
 292      "source": [
 293       "`ggplot` charts are constructed on a layered basis. The `ggtitle()` layer can be used to add a title to the chart."
 294      ]
 295     },
 296     {
 297      "cell_type": "code",
 298      "collapsed": false,
 299      "input": [
 300       "ggplot(df, aes(x=\"Date\",y=\"Cash\")) + geom_point() \\\n",
 301       "                                   + ggtitle(\"Payments made over time\")"
 302      ],
 303      "language": "python",
 304      "metadata": {},
 305      "outputs": []
 306     },
 307     {
 308      "cell_type": "heading",
 309      "level": 3,
 310      "metadata": {},
 311      "source": [
 312       "labs()"
 313      ]
 314     },
 315     {
 316      "cell_type": "markdown",
 317      "metadata": {},
 318      "source": [
 319       "Another useful layer for styling the presentation of a chart is the `labs()` layer that can be used to set axis labels.\n",
 320       "\n",
 321       "Note that *ggplot* actually returns a chart object, which means that we can assign it to a variable and then add further layers or modifications to that chart object."
 322      ]
 323     },
 324     {
 325      "cell_type": "code",
 326      "collapsed": false,
 327      "input": [
 328       "g = ggplot(df, aes(x=\"Date\",y=\"Cash\")) + geom_point()\n",
 329       "g = g + ggtitle(\"Payments made over time\")\n",
 330       "g = g + labs(\"Transaction Date\", \"Transaction amount (\u00a3)\")\n",
 331       "g"
 332      ],
 333      "language": "python",
 334      "metadata": {},
 335      "outputs": []
 336     },
 337     {
 338      "cell_type": "heading",
 339      "level": 3,
 340      "metadata": {},
 341      "source": [
 342       "Bar Charts and Histograms"
 343      ]
 344     },
 345     {
 346      "cell_type": "markdown",
 347      "metadata": {},
 348      "source": [
 349       "Whilst a very simple technique, *counting* is often one of the most useful tools in our toolbox. For example, let's count how many tickets were issued by each machine for each tariff by aggregating over each group using the `len` function."
 350      ]
 351     },
 352     {
 353      "cell_type": "code",
 354      "collapsed": false,
 355      "input": [
 356       "df[[\"Tariff\",\"Machine\"]].groupby(['Tariff',\"Machine\"]).agg(len).sort_index()"
 357      ],
 358      "language": "python",
 359      "metadata": {},
 360      "outputs": []
 361     },
 362     {
 363      "cell_type": "markdown",
 364      "metadata": {},
 365      "source": [
 366       "Bar charts can be used to show counts across categorical variables. Supply the categorical variable you wish to chart as the *x* value and use `geom_bar()`.\n",
 367       "\n",
 368       "Note that if the variable (`foo`) you wish to use for the categorical x-values has a numeric type, you can cast it to a categorical variable by specifying it as follows:  `x='factor(foo)'`."
 369      ]
 370     },
 371     {
 372      "cell_type": "code",
 373      "collapsed": false,
 374      "input": [
 375       "p = ggplot(aes(x='Tariff'), data=df)\n",
 376       "p + geom_bar() + ggtitle(\"Number of Tickets per Tariff\")  + labs(\"Tariff Code\", \"Count\")"
 377      ],
 378      "language": "python",
 379      "metadata": {},
 380      "outputs": []
 381     },
 382     {
 383      "cell_type": "markdown",
 384      "metadata": {},
 385      "source": [
 386       "We can add in a grouping variable to produce a stacked bar chart. Here we can see the contribution to the total count in each tariff made from each machine."
 387      ]
 388     },
 389     {
 390      "cell_type": "code",
 391      "collapsed": false,
 392      "input": [
 393       "p = ggplot(aes(x='Tariff',fill=\"Machine\"), data=df)\n",
 394       "p + geom_bar() + ggtitle(\"Number of Tickets per Tariff\")  + labs(\"Tariff Code\", \"Count\")"
 395      ],
 396      "language": "python",
 397      "metadata": {},
 398      "outputs": []
 399     },
 400     {
 401      "cell_type": "markdown",
 402      "metadata": {},
 403      "source": [
 404       "If the range we want along the horizontal *x*-axis is a continuous one, we can make use of `geom_histogram()`. The *binwidth* will be automatically calculated, but we can also force it to a particular width using the *binwidth* parameter.\n",
 405       "\n",
 406       "Here I set the *binwidth* to 0.1, that is, 10 pence, so we can more closely look for small overpayments."
 407      ]
 408     },
 409     {
 410      "cell_type": "code",
 411      "collapsed": false,
 412      "input": [
 413       "p = ggplot(aes(x='Cash'), data=df)\n",
 414       "p + geom_histogram(binwidth=0.1) "
 415      ],
 416      "language": "python",
 417      "metadata": {},
 418      "outputs": []
 419     },
 420     {
 421      "cell_type": "heading",
 422      "level": 3,
 423      "metadata": {},
 424      "source": [
 425       "Frequency Distributions - geom_density()"
 426      ]
 427     },
 428     {
 429      "cell_type": "markdown",
 430      "metadata": {},
 431      "source": [
 432       "Sometimes, whilst we *can* plot a chart, it may not really be meaningful to do so.\n",
 433       "\n",
 434       "For situations where you have a continuous distribution of values along a continuous numerical axis, it may make sense to produce a *frequency distribution chart*. For example, a chart showing the distribution of the height of people in a population. The `geom_density()` chart works out the relative frequency of each value and produces a smoothed curve that estimates the continuous (*frequency*) distribution. The vertical *y*-axis is the proportion of the population taking the value. The area under the curve should sum to 1.\n",
 435       "\n",
 436       "The *Cash* payment received is, in a sense, a continuous variable (people could pay any amount, at least in steps of 5 pence, the smallest coin accepted by the parking meters) although the expectation is that only discrete amounts (as specified by the tariffs) are required as payment."
 437      ]
 438     },
 439     {
 440      "cell_type": "code",
 441      "collapsed": false,
 442      "input": [
 443       "p = ggplot(aes(x='Cash'), data=df)\n",
 444       "p + geom_density() + ggtitle(\"Distribution of Payment Amounts\")  + labs(\"Payment (\u00a3)\", \"Proportion\")"
 445      ],
 446      "language": "python",
 447      "metadata": {},
 448      "outputs": []
 449     },
 450     {
 451      "cell_type": "markdown",
 452      "metadata": {},
 453      "source": [
 454       "In the chart, we see peaks at about 50 pence, just below \u00a32, at about \u00a33.30, a small bump at about \u00a34.40 and a final burst at about \u00a36.60.\n",
 455       "\n",
 456       "We can look up the tariff amounts from the *Description.1* column:"
 457      ]
 458     },
 459     {
 460      "cell_type": "code",
 461      "collapsed": false,
 462      "input": [
 463       "df['Description.1'].unique()"
 464      ],
 465      "language": "python",
 466      "metadata": {},
 467      "outputs": []
 468     },
 469     {
 470      "cell_type": "markdown",
 471      "metadata": {},
 472      "source": [
 473       "That is, we have distinct payment amounts at \u00a310, \u00a36.60, \u00a34.50, \u00a33.40, \u00a33.00, \u00a31.90 and \u00a30.60.\n",
 474       "\n",
 475       "We can add additional layers to the chart to highlight these values using `geom_vline()`, which adds a vertical line at a particular *x* value."
 476      ]
 477     },
 478     {
 479      "cell_type": "code",
 480      "collapsed": false,
 481      "input": [
 482       "p = ggplot(aes(x='Cash'), data=df)\n",
 483       "p = p + geom_density() + ggtitle(\"Distribution of Payment Amounts\")  + labs(\"Payment (\u00a3)\", \"Proportion\")\n",
 484       "p + geom_vline(xintercept=[10, 6.6, 4.5, 3.4, 3.0, 1.9, 0.6],colour='blue')"
 485      ],
 486      "language": "python",
 487      "metadata": {},
 488      "outputs": []
 489     },
 490     {
 491      "cell_type": "markdown",
 492      "metadata": {},
 493      "source": [
 494       "We see high frequency spikes at all these amounts apart from at the \u00a33 tariff (short stay coaches)."
 495      ]
 496     },
 497     {
 498      "cell_type": "markdown",
 499      "metadata": {},
 500      "source": [
 501       "Note that we can set the extent of the *x* and *y* axes by adding `+ xlim(MIN_X, MAX_X)` and `+ ylim(MIN_Y, MAX_Y)` modification layers to the chart."
 502      ]
 503     },
 504     {
 505      "cell_type": "heading",
 506      "level": 3,
 507      "metadata": {},
 508      "source": [
 509       "Line Charts - geom_line()"
 510      ]
 511     },
 512     {
 513      "cell_type": "markdown",
 514      "metadata": {},
 515      "source": [
 516       "For charting continuous values, particularly ones that are plotted over time, a line chart often makes most sense.\n",
 517       "\n",
 518       "When looking at transaction reports that include a separate amount for each transaction, a chart showing the *running total* or accumulated amount can often be useful.\n",
 519       "\n",
 520       "We can add such a value as an additional column by sorting the data frame appropriately and then calculating the cumulative sum over the *Cash* column using the `cumsum()` method."
 521      ]
 522     },
 523     {
 524      "cell_type": "code",
 525      "collapsed": false,
 526      "input": [
 527       "df.sort_values(['Date'],inplace=True)\n",
 528       "df['Cash_cumul'] = df.Cash.cumsum()\n",
 529       "df[:5]"
 530      ],
 531      "language": "python",
 532      "metadata": {},
 533      "outputs": []
 534     },
 535     {
 536      "cell_type": "markdown",
 537      "metadata": {},
 538      "source": [
 539       "To plot the value as a line chart, we use `geom_line()`."
 540      ]
 541     },
 542     {
 543      "cell_type": "code",
 544      "collapsed": false,
 545      "input": [
 546       "#As well as passing a dataframe to the ggplot function as the first argument, we can also pass it via the data= argument\n",
 547       "g = ggplot(aes(x=\"Date\",y=\"Cash_cumul\"), data=df )+ geom_line()\n",
 548       "g"
 549      ],
 550      "language": "python",
 551      "metadata": {},
 552      "outputs": []
 553     },
 554     {
 555      "cell_type": "heading",
 556      "level": 3,
 557      "metadata": {},
 558      "source": [
 559       "Exercise"
 560      ]
 561     },
 562     {
 563      "cell_type": "markdown",
 564      "metadata": {},
 565      "source": [
 566       "Modify the chart generated directly above by adding an appropriate title and tidying up the axis titles."
 567      ]
 568     },
 569     {
 570      "cell_type": "code",
 571      "collapsed": false,
 572      "input": [
 573       "#Add title\n",
 574       "\n",
 575       "#Add suitable axis labels, and then display the chart\n"
 576      ],
 577      "language": "python",
 578      "metadata": {},
 579      "outputs": []
 580     },
 581     {
 582      "cell_type": "heading",
 583      "level": 3,
 584      "metadata": {},
 585      "source": [
 586       "Grouping by Colour"
 587      ]
 588     },
 589     {
 590      "cell_type": "markdown",
 591      "metadata": {},
 592      "source": [
 593       "Judicious use of colour can often help us pack more information into a chart in a way that still allows us to read it. For example, suppose we want to look at the accumulated spend over time by Tariff to see which Tariff appears to be generating most revenue.\n",
 594       "\n",
 595       "We can group on the tariff and calculate the accumulated revenue within each tariff band. "
 596      ]
 597     },
 598     {
 599      "cell_type": "code",
 600      "collapsed": false,
 601      "input": [
 602       "group=df[['Tariff','Cash']].groupby('Tariff')\n",
 603       "#For each group of rows, apply the transformation to each row in the group\n",
 604       "#The number of rows in the response will be the same as the number of rows in the original data frame\n",
 605       "df['Cash_cumul2']=group.transform('cumsum')['Cash']\n",
 606       "df[:10]"
 607      ],
 608      "language": "python",
 609      "metadata": {},
 610      "outputs": []
 611     },
 612     {
 613      "cell_type": "markdown",
 614      "metadata": {},
 615      "source": [
 616       "We can now plot the tariff-based cumulative totals as separate lines, splitting each line out using the `colour` aesthetic."
 617      ]
 618     },
 619     {
 620      "cell_type": "code",
 621      "collapsed": false,
 622      "input": [
 623       "ggplot(df,aes(x=\"Date\",y=\"Cash_cumul2\",colour=\"Tariff\"))+geom_line()"
 624      ],
 625      "language": "python",
 626      "metadata": {},
 627      "outputs": []
 628     },
 629     {
 630      "cell_type": "heading",
 631      "level": 3,
 632      "metadata": {},
 633      "source": [
 634       "Faceted Charts"
 635      ]
 636     },
 637     {
 638      "cell_type": "markdown",
 639      "metadata": {},
 640      "source": [
 641       "On other occasions, we may wish to split out data from different groups into different charts. This is referred to as *faceting*. We can split a dataset across separate charts based on the value of a particular group attribute by using the `facet_wrap()` layer."
 642      ]
 643     },
 644     {
 645      "cell_type": "code",
 646      "collapsed": false,
 647      "input": [
 648       "ggplot(df, aes(x=\"Date\",y=\"Cash_cumul2\")) + geom_line() \\\n",
 649       "                                   + ggtitle(\"Payments made over time\") \\\n",
 650       "                                   + labs(\"Transaction Date\", \"Transaction amount (\u00a3)\") \\\n",
 651       "                                   + facet_wrap(\"Tariff\")"
 652      ],
 653      "language": "python",
 654      "metadata": {},
 655      "outputs": []
 656     },
 657     {
 658      "cell_type": "markdown",
 659      "metadata": {},
 660      "source": [
 661       "By default, axis values are generated for each chart independently. However, we can also force them to use the same axes by setting the `scales` parameter to `fixed`, as opposed to `free`."
 662      ]
 663     },
 664     {
 665      "cell_type": "code",
 666      "collapsed": false,
 667      "input": [
 668       "ggplot(df, aes(x=\"Date\",y=\"Cash_cumul2\")) + geom_line() \\\n",
 669       "                                   + ggtitle(\"Payments made over time\") \\\n",
 670       "                                   + labs(\"Transaction Date\", \"Transaction amount (\u00a3)\") \\\n",
 671       "                                   + facet_wrap(\"Tariff\",scales = \"fixed\")"
 672      ],
 673      "language": "python",
 674      "metadata": {},
 675      "outputs": []
 676     },
 677     {
 678      "cell_type": "heading",
 679      "level": 3,
 680      "metadata": {},
 681      "source": [
 682       "Exercise"
 683      ]
 684     },
 685     {
 686      "cell_type": "markdown",
 687      "metadata": {},
 688      "source": [
 689       "How many ticket machines is the data collected from and how many transactions are recorded by each one?\n",
 690       "\n",
 691       "How would you generate a faceted chart showing the accumulated transactions over time for each of the ticket machines identified in the *Machine* column?"
 692      ]
 693     },
 694     {
 695      "cell_type": "code",
 696      "collapsed": false,
 697      "input": [
 698       "# Identifying the number of distinct machines and number of transactions recorded by each one\n"
 699      ],
 700      "language": "python",
 701      "metadata": {},
 702      "outputs": []
 703     },
 704     {
 705      "cell_type": "code",
 706      "collapsed": false,
 707      "input": [
 708       "# Accumulated total for each machine\n"
 709      ],
 710      "language": "python",
 711      "metadata": {},
 712      "outputs": []
 713     },
 714     {
 715      "cell_type": "code",
 716      "collapsed": false,
 717      "input": [
 718       "#Chart faceted by ticket machine\n"
 719      ],
 720      "language": "python",
 721      "metadata": {},
 722      "outputs": []
 723     },
 724     {
 725      "cell_type": "heading",
 726      "level": 2,
 727      "metadata": {},
 728      "source": [
 729       "Themes"
 730      ]
 731     },
 732     {
 733      "cell_type": "markdown",
 734      "metadata": {},
 735      "source": [
 736       "When it comes to publishing charts in a particular publication, we often require that the chart is presented in a particular *style*. In the same way that we can use different CSS style files to alter the look of a particular HTML document, so we can alter the look of a chart generated using *ggplot* by applying different *themes* to the chart.\n",
 737       "\n",
 738       "For example, if we need to inject a little humour or apparent \"casualness\" into a chart, at the expense of some accuracy in the chart, we can use the XKCD theme.\n",
 739       "\n",
 740       "(For more information about the use of such a theme, and techniques for creating such \"sketchy visualisations\", see *Wood, Jo, Petra Isenberg, Tobias Isenberg, Jason Dykes, Nadia Boukhelifa, and Aidan Slingsby. \"[Sketchy rendering for information visualization](http://hal.archives-ouvertes.fr/docs/00/72/08/24/PDF/Wood_2012_SRI.pdf).\" IEEE Transactions on Visualization and Computer Graphics, 18(12), 2012: 2749-2758*.)"
 741      ]
 742     },
 743     {
 744      "cell_type": "code",
 745      "collapsed": false,
 746      "input": [
 747       "p = ggplot(aes(x='Tariff',fill=\"Machine\"), data=df)\n",
 748       "p = p + geom_bar() + ggtitle(\"Number of Tickets per Tariff\")  + labs(\"Tariff Code\", \"Count\") \n",
 749       "\n",
 750       "p + theme_xkcd()\n"
 751      ],
 752      "language": "python",
 753      "metadata": {},
 754      "outputs": []
 755     },
 756     {
 757      "cell_type": "markdown",
 758      "metadata": {},
 759      "source": [
 760       "Another very useful theme is the \"clean\" looking `theme_bw()`."
 761      ]
 762     },
 763     {
 764      "cell_type": "code",
 765      "collapsed": false,
 766      "input": [
 767       "p + theme_bw()"
 768      ],
 769      "language": "python",
 770      "metadata": {},
 771      "outputs": []
 772     },
 773     {
 774      "cell_type": "heading",
 775      "level": 2,
 776      "metadata": {},
 777      "source": [
 778       "What Next?"
 779      ]
 780     },
 781     {
 782      "cell_type": "markdown",
 783      "metadata": {},
 784      "source": [
 785       "In this notebook, we have introduced some of the basic chart types that are supported by *ggplot*, as well as some of the modifications you can make to the charts. You can find further information, as well as descriptions of additional chart types, in the [*ggplot* documentation](http://ggplot.yhathq.com/docs/).\n",
 786       "\n",
 787       "Feel free to extend this notebook as your own personal reference notebook by adding further sections about additional chart types.\n",
 788       "\n",
 789       "If you are working through this notebook as part of an inline exercise, return to the course materials now. If you are working through this set of notebooks as a whole, move on to [4.5.3 Getting Started With Maps - folium](4.5.3%20Getting%20Started%20With%20Maps%20-%20folium.ipynb)."
 790      ]
 791     }
 792    ],
 793    "metadata": {}
 794   }
 795  ]
 796 }