{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Section 5.1: Using the model" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "### Imports and defintions" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "init_cell": true }, "outputs": [], "source": [ "library(tidyverse)\n", "library(repr)\n", "library(ggfortify)\n", "\n", "# Change plot size to 4 x 3\n", "options(repr.plot.width=6, repr.plot.height=4)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "init_cell": true }, "outputs": [], "source": [ "source('plot_extensions.R')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Modelling abrasion loss\n", "\n", "This example concerns the dataset `rubber`, which you first met in Exercise 3.1. The experiment introduced in that exercise concerned an investigation of the resistance of rubber to abrasion (the abrasion loss) and how that resistance depended on various attributes of the rubber. In Exercise 3.1, the only explanatory variable we analysed was hardness; but in Exercise 3.19, a second explanatory variable, tensile strength, was taken into account too. There are 30 datapoints." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": true }, "outputs": [], "source": [ "rubber <- read.csv('rubber.csv')\n", "rubber" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The figures below show scatterplots of abrasion loss against hardness and of abrasion loss against strength. (A scatterplot of abrasion loss against hardness also appeared in Solution 3.1.) Figure (a) suggests a strong decreasing linear relationship of abrasion loss with hardness. Figure (b) is much less indicative of any strong dependence of abrasion loss on tensile strength." ] }, { "cell_type": "code", "execution_count": null, "metadata": { "scrolled": false }, "outputs": [], "source": [ "hardloss <- ggplot(rubber, aes(x=hardness, y=loss)) + geom_point()\n", "strloss <- ggplot(rubber, aes(x=strength, y=loss)) + geom_point()\n", "\n", "multiplot(hardloss, strloss, cols=2)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In exercise 3.5(a) you obtained the following output for the regression of abrasion loss on hardness." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fit.h <- lm(loss ~ hardness, data = rubber)\n", "summary(fit.h)\n", "anova(fit.h)\n", "# af <- anova(fit)\n", "# afss <- af$\"Sum Sq\"\n", "# print(cbind(af,PctExp=afss/sum(afss)*100))" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ggplotRegression(fit.h)" ] }, { "cell_type": "markdown", "metadata": { "solution": "shown", "solution2": "hidden", "solution2_first": true, "solution_first": true }, "source": [ "### Exercise 5.1\n", "Now repeat the for the regression of abrasion loss on tensile strength.\n", "\n", "Enter your solution in the cell below." 
] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your solution here" ] }, { "cell_type": "markdown", "metadata": { "heading_collapsed": true }, "source": [ "### Solution" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true, "solution2": "hidden" }, "outputs": [], "source": [ "fit.s <- lm(loss ~ strength, data = rubber)\n", "summary(fit.s)\n", "anova(fit.s)" ] }, { "cell_type": "code", "execution_count": null, "metadata": { "hidden": true }, "outputs": [], "source": [ "ggplotRegression(fit.s)" ] }, { "cell_type": "markdown", "metadata": { "hidden": true, "solution2": "hidden" }, "source": [ "Individually, then, regression of abrasion loss on hardness seems satisfactory, albeit accounting for a disappointingly small percentage of the variance in the response; regression of abrasion loss on tensile strength, on the other hand, seems to have very little explanatory power. " ] }, { "cell_type": "markdown", "metadata": { "solution": "hidden" }, "source": [ "### Multiple regression\n", "However, back in Exercise 3.19 it was shown that the residuals from the former regression showed a clear relationship with the tensile strength values, and this suggested that a regression model involving both variables is necessary. \n", "\n", "Let us try it. The only change is to include `strength` in the equations being fitted in the `lm()` function. Instead of \n", "\n", "```\n", "lm(loss ~ hardness, data = rubber)\n", "```\n", "use \n", "```\n", "lm(loss ~ hardness + strength, data = rubber)\n", "```" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fit.hs <- lm(loss ~ hardness + strength, data = rubber)\n", "summary(fit.hs)\n", "anova(fit.hs)" ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "ggplotRegression(fit.hs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Looking at the regression coefficient output next, the estimated model for the mean response is\n", "\n", "$$ \\hat{y} = \\hat{\\alpha} + \\hat{\\beta}_1 x_1 + \\hat{\\beta}_2 x_2 = 885.2 − 6.571 x_1 − 1.374 x_2 $$ \n", "\n", "where $x_1$ and $x_2$ stand for values of hardness and tensile strength, respectively. \n", "\n", "Associated with each estimated parameter, GenStat gives a standard error (details of the calculation of which need not concern us now) and hence a _t_-statistic (estimate divided by standard error) to be compared with the distribution on `d.f. (Residual)`=27 degrees of freedom. GenStat makes the comparison and gives _p_ values, which in this case are all very small, suggesting strong evidence for the non-zeroness (and hence presence) of each parameter, $\\alpha$, $\\beta_1$ and $\\beta_2$, individually." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "fits <- list(fit.h, fit.s, fit.hs)\n", "data.frame(\n", " \"Vars\" = sapply(fits, function(x) toString(attr(summary(x)$terms, \"variables\")[-(1:2)]) ),\n", " \"Adj R^2\" = sapply(fits, function(x) summary(x)$adj.r.squared)\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "So, even though the single regression of abrasion loss on tensile strength did not suggest a close relationship, when taken in conjunction with the effect of hardness the effect of tensile strength is also a considerable one. 
"A key idea here is that $\\beta_1$ and $\\beta_2$ (and more generally $\\beta_1, \\beta_2, \\ldots , \\beta_k$ in model (5.1)) are partial regression coefficients. That is, $\\beta_1$ measures the effect of an increase of one unit in $x_1$ when the value of the other variable, $x_2$, is held fixed. Contrast this with the single regression model $E(Y) = \\alpha + \\beta_1 x_1$, in which $\\beta_1$ measures the effect of an increase of one unit in $x_1$ with no account taken of $x_2$ at all. The meaning of $\\beta_1$ in the regression models with one and two explanatory variables is not the same.\n", "\n", "You will notice that the percentage of variance accounted for (the adjusted $R^2$, tabulated in the cell above) has increased dramatically in the two-explanatory-variable model, to a much more respectable 82.8%. This statistic has the same interpretation — or difficulty of interpretation — as for the case of one explanatory variable (see [Unit 3](unit3.ipynb)).\n", "\n", "Other items in the R output are as for the case of one explanatory variable: the `Residual standard error` is $\\hat{\\sigma}$, the square root of the residual mean square in the analysis of variance; standardised residuals are discussed below; and the notion of leverage will be ignored until [Unit 10](unit10.ipynb).\n", "\n", "The simple residuals are, again, defined as the differences between the observed and predicted responses:\n", "\n", "$$ r_i = y_i - \\left( \\hat{\\alpha} + \\sum_{j=1}^{k} \\hat{\\beta}_j x_{i, j} \\right) ,\\ i = 1, 2, \\ldots, n. $$\n", "\n", "R can obtain these for you and produce a plot of residuals against fitted values. The fitted values are simply $\\hat{Y}_i = \\hat{\\alpha} + \\sum_{j=1}^{k} \\hat{\\beta}_j x_{i, j}$. As in the regression models in [Unit 3](unit3.ipynb), we shall mainly work with standardised residuals (simple residuals divided by their estimated standard errors), which in this context are equivalent to deviance residuals." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Example 5.2\n", "The residuals from the two-explanatory-variable model can be examined using the composite residual plots produced by `autoplot()`." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "autoplot(fit.hs)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "These plots are of just the same sort as those in [Unit 3](unit3.ipynb), and are interpreted in the same way. In this case, the normal Q-Q plot shows some slight suggestion of skewness in the residuals, but there is no obvious cause to doubt the model. One of the points (number 19) stands out as having a large standardised (i.e. deviance) residual; the rules for singling out such points are the same as those described in [Units 3](unit3.ipynb) and [4](unit4.ipynb). In this case it is the most negative residual.\n",
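"\n", "As a quick check, the standardised residuals can be extracted directly from the `fit.hs` object with `rstandard()`; for example:\n", "\n", "```\n", "# Standardised residuals for the two-explanatory-variable model\n", "round(rstandard(fit.hs), 2)\n", "\n", "# The observation with the most negative standardised residual\n", "which.min(rstandard(fit.hs))\n", "```\n", "\n",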
"However, the value of this standardised residual is not very large (at −2.38), and the plot of residuals against fitted values above makes it clear that this particular residual is not exceptionally large relative to some of the others, which are almost as large." ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 5.2\n", "The table below shows the first five of the 13 datapoints in a dataset concerned with predicting the heat evolved when cement sets from knowledge of its constituents. There are two explanatory variables, tricalcium aluminate (TA) and tricalcium silicate (TS), and the response variable is the heat generated in calories per gram (heat).\n", "\n", "heat | TA | TS\n", "-----|----|-----\n", "78.5 | 7 | 26\n", "74.3 | 1 | 29\n", "104.3 | 11 | 56\n", "87.6 | 11 | 31\n", "95.9 | 7 | 52\n", " $\\vdots$ | $\\vdots$ | $\\vdots$\n" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "a. Load the `cemheat` dataset.\n", "\n", "b. Make scatterplots of heat against each of TA and TS in turn, and comment on what you see.\n", "\n", "c. Use R to fit each individual regression equation (of heat on TA and of heat on TS) in turn, and then to fit the regression equation with two explanatory variables. Does the latter regression equation give you a better model than either of the individual ones?\n", "\n", "d. According to the regression equation with two explanatory variables fitted in part (c), what is the predicted value of heat when TA = 15 and TS = 55?\n", "\n", "e. By looking at the (default) composite residual plots, comment on the appropriateness of the fitted regression model.\n", "\n", "The solution is in the [Section 5.1 solutions](section5.1solutions.ipynb) notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your solution here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "### Exercise 5.3: Fitting a quadratic regression model\n", "In Unit 3, the dataset `anaerob` was briefly considered (Example 3.1) and swiftly dismissed as a candidate for simple linear regression modelling. The reason is clear from the figure below, in which the response variable, expired ventilation ($y$), is plotted against the single explanatory variable, oxygen uptake ($x$). (The same plot appeared as Figure 3.1 in Example 3.1.) A model quadratic in $x$, i.e. $E(Y) = \\alpha + \\beta x + \\gamma x^2$, was suggested instead." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "anaerobic <- read.csv('anaerob.csv')\n", "ggplot(anaerobic, aes(x=oxygen, y=ventil)) + geom_point()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Since this quadratic model is nonetheless linear in its parameters, we can fit it using multiple regression of $y$ on $x$ plus the new variable $x^2$. Even though $x_1 = x$ and $x_2 = x^2$ are closely related, there is no immediate impediment to forgetting this and regressing on them in the usual way. A variable such as $x_2$ is sometimes called a derived variable.\n", "\n", "a. Using R, perform the regression of expired ventilation (`ventil`) on oxygen uptake (`oxygen`). Are you at all surprised by how good this regression model seems?\n", "\n", "b. Now form a new variable `oxy2`, say, by squaring oxygen. (Create a new column in the `anaerobic` dataframe which is `anaerobic$oxygen ^ 2`.) Perform the regression of `ventil` on `oxygen` and `oxy2`. ",
"Comment on the fit of this model according to the printed output (and with recourse to Figure 3.2 in Example 3.1).\n", "\n", "c. Make the usual residual plots and comment on the fit of the model again.\n", "\n", "The solution is in the [Section 5.1 solutions](section5.1solutions.ipynb) notebook." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [ "# Your solution here" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Next steps\n", "Move on to the next notebook, or return to the corresponding section of the course material." ] }, { "cell_type": "code", "execution_count": null, "metadata": {}, "outputs": [], "source": [] } ], "metadata": { "kernelspec": { "display_name": "R", "language": "R", "name": "ir" }, "language_info": { "codemirror_mode": "r", "file_extension": ".r", "mimetype": "text/x-r-source", "name": "R", "pygments_lexer": "r", "version": "3.4.2" }, "toc": { "nav_menu": {}, "number_sections": true, "sideBar": true, "skip_h1_title": false, "title_cell": "Table of Contents", "title_sidebar": "Contents", "toc_cell": false, "toc_position": { "height": "calc(100% - 180px)", "left": "10px", "top": "150px", "width": "342px" }, "toc_section_display": true, "toc_window_display": true } }, "nbformat": 4, "nbformat_minor": 2 }