---
title: "section5.1"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)
library(repr)
library(ggfortify)

# Change plot size to 6 x 4
# options(repr.plot.width=6, repr.plot.height=4)

# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols: Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }

  if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
```

## Modelling abrasion loss

This example concerns the dataset `rubber`, which you first met in Exercise 3.1. The experiment introduced in that exercise concerned an investigation of the resistance of rubber to abrasion (the abrasion loss) and how that resistance depended on various attributes of the rubber. In Exercise 3.1, the only explanatory variable we analysed was hardness; but in Exercise 3.19, a second explanatory variable, tensile strength, was taken into account too. There are 30 datapoints.

```{r load_rubber}
rubber <- read.csv('rubber.csv')
rubber
```

The figures below show scatterplots of abrasion loss against hardness and of abrasion loss against strength. (A scatterplot of abrasion loss against hardness also appeared in Solution 3.1.) The left-hand plot suggests a strong decreasing linear relationship of abrasion loss with hardness. The right-hand plot is much less indicative of any strong dependence of abrasion loss on tensile strength.

```{r rubber_plots}
hardloss <- ggplot(rubber, aes(x=hardness, y=loss)) + geom_point()
strloss <- ggplot(rubber, aes(x=strength, y=loss)) + geom_point()

multiplot(hardloss, strloss, cols=2)
```
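
(As an aside, if the gridExtra package is available, the same side-by-side layout can be produced without the `multiplot()` helper; this is just an alternative, not part of the original analysis.)

```{r rubber_plots_gridextra, eval=FALSE}
# Alternative layout using gridExtra (if installed); equivalent to
# multiplot(hardloss, strloss, cols=2)
library(gridExtra)
grid.arrange(hardloss, strloss, ncol = 2)
```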

In Exercise 3.5(a) you obtained the following output for the regression of abrasion loss on hardness.

```{r hardness_fit}
fit <- lm(loss ~ hardness, data = rubber)
summary(fit)
anova(fit)
```

### Exercise 5.1

Now repeat this for the regression of abrasion loss on tensile strength.

Enter your solution in the cell below.

```{r ex5.1}
# Your solution here
```

### Solution 5.1

```{r ex5.1sol}
fit <- lm(loss ~ strength, data = rubber)
summary(fit)
anova(fit)
```

Individually, then, regression of abrasion loss on hardness seems satisfactory, albeit accounting for a disappointingly small percentage of the variance in the response; regression of abrasion loss on tensile strength, on the other hand, seems to have very little explanatory power.


### Multiple regression
However, back in Exercise 3.19 it was shown that the residuals from the former regression showed a clear relationship with the tensile strength values, and this suggested that a regression model involving both variables is necessary.

Let us try it. The only change needed is to include `strength` in the model formula passed to `lm()`. Instead of

```{r, eval = FALSE}
lm(loss ~ hardness, data = rubber)
```
use
```{r, eval = FALSE}
lm(loss ~ hardness + strength, data = rubber)
```

```{r hardness_str_fit}
fit <- lm(loss ~ hardness + strength, data = rubber)
summary(fit)
anova(fit)
```

Looking at the regression coefficient output next, the estimated model for the mean response is

$$ \hat{y} = \hat{\alpha} + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 = 885.2 - 6.571 x_1 - 1.374 x_2 $$

where $x_1$ and $x_2$ stand for values of hardness and tensile strength, respectively.
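
The estimates can be extracted from the fitted model directly, and `predict()` applies the fitted equation to new data. A minimal sketch (the hardness and strength values below are made-up illustrative inputs, not from the dataset):

```{r predict_sketch}
# Estimated coefficients: intercept, hardness slope, strength slope
coef(fit)

# Predicted mean abrasion loss at (hypothetical) hardness 70 and strength 180
predict(fit, newdata = data.frame(hardness = 70, strength = 180))
```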

Associated with each estimated parameter, R gives a standard error (details of the calculation of which need not concern us now) and hence a _t_-statistic (estimate divided by standard error) to be compared with the _t_ distribution on the residual degrees of freedom, here $n - 3 = 27$. R makes the comparison and gives _p_ values, which in this case are all very small, suggesting strong evidence for the non-zeroness (and hence presence) of each parameter, $\alpha$, $\beta_1$ and $\beta_2$, individually.

So, even though the single regression of abrasion loss on tensile strength did not suggest a close relationship, when taken in conjunction with the effect of hardness the effect of tensile strength is also a considerable one. A key idea here is that $\beta_1$ and $\beta_2$ (and more generally $\beta_1, \beta_2, \ldots , \beta_k$ in model (5.1)) are partial regression coefficients. That is, $\beta_1$ measures the effect of an increase of one unit in $x_1$ with the value of the other variable, $x_2$, held fixed. Contrast this with the single regression model $E(Y) = \alpha + \beta_1 x_1$, in which $\beta_1$ represents the effect of an increase of one unit in $x_1$ ignoring $x_2$ altogether. The meaning of $\beta_1$ in the regression models with one and two explanatory variables is not the same.
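
You can see this numerically: the estimated hardness coefficient changes when strength enters the model, because it is then a partial effect. A quick check, reusing the rubber data:

```{r partial_coef_check}
# Slope of hardness when it is the only explanatory variable...
coef(lm(loss ~ hardness, data = rubber))["hardness"]

# ...and its partial slope once strength is also in the model
coef(lm(loss ~ hardness + strength, data = rubber))["hardness"]
```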

You will notice that the percentage of variance accounted for (the adjusted R-squared in the `summary()` output) has increased dramatically in the two-explanatory-variable model to a much more respectable 82.8%. This statistic has the same interpretation (or difficulty of interpretation) as for the case of one explanatory variable (see [Unit 3](unit3.ipynb)).
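
Both versions of this statistic can be pulled out of the summary object directly:

```{r rsquared_check}
# Multiple R-squared and adjusted R-squared for the two-variable model
summary(fit)$r.squared
summary(fit)$adj.r.squared
```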

Other items in the output are as for the case of one explanatory variable: the residual standard error reported by `summary()` is $\hat{\sigma}$, and is the square root of the residual mean square in the ANOVA table; standardised residuals will be clarified below; and leverage will be ignored until [Unit 10](unit10.ipynb).
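
That correspondence is easy to verify; a small sketch (using `sigma()`, which recent versions of base R provide for fitted models):

```{r sigma_check}
# Residual standard error...
sigma(fit)

# ...equals the square root of the residual mean square from the ANOVA table
sqrt(anova(fit)["Residuals", "Mean Sq"])
```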

The simple residuals are, again, defined as the differences between the observed and predicted responses:

$$ r_i = y_i - \left( \hat{\alpha} + \sum_{j=1}^{k} \hat{\beta}_j x_{i, j} \right) ,\ i = 1, 2, \ldots, n. $$

R can obtain these for you and produce a plot of residuals against fitted values. The fitted values are simply $\hat{Y}_i = \hat{\alpha} + \sum_{j=1}^{k} \hat{\beta}_j x_{i, j}$. As in the regression models in [Unit 3](unit3.ipynb), we work with standardised residuals (simple residuals divided by their estimated standard errors) which, in this context, are equivalent to deviance residuals.
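
A minimal sketch of computing these quantities by hand and checking them against R's built-in accessors:

```{r residuals_by_hand}
# Fitted values and simple residuals, computed from the estimated coefficients
fitted_by_hand <- coef(fit)[1] + coef(fit)[2] * rubber$hardness +
  coef(fit)[3] * rubber$strength
resid_by_hand <- rubber$loss - fitted_by_hand

# These agree with the built-in accessors
all.equal(unname(fitted_by_hand), unname(fitted(fit)))
all.equal(unname(resid_by_hand), unname(residuals(fit)))

# Standardised residuals: simple residuals scaled by their standard errors
head(rstandard(fit))
```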

## Gratuitous example of more complex maths markup
If

$$ \rho(z) = \rho_c \exp \left( - \frac{z^2}{2H^2} \right) $$

then

\begin{eqnarray*}
\frac{\partial \rho(z)}{\partial z} & = & \rho_c \frac{\partial}{\partial z}\exp
    \left( - \frac{z^2}{2H^2} \right) \\
 & = & \rho_c \exp \left( - \frac{z^2}{2H^2} \right) \cdot
    \frac{\partial}{\partial z} \left( - \frac{z^2}{2H^2} \right) \\
 & = & \rho_c \exp \left( - \frac{z^2}{2H^2} \right) \cdot
    \left( - \frac{z}{H^2} \right) \\
 & = & - \frac{z}{H^2} \rho(z)
\end{eqnarray*}


## Example 5.2
The composite residual plots for the fitted two-explanatory-variable model are produced below with `autoplot()` from the ggfortify package.

```{r residual_plots}
fit <- lm(loss ~ hardness + strength, data = rubber)
autoplot(fit)
```

These plots are of just the same sort as those in [Unit 3](unit3.ipynb), and are interpreted in the same way. In this case, the normal Q-Q plot shows some slight suggestion of skewness in the residuals, but there is no obvious cause to doubt the model. One of the points (number 19) is flagged as having a large standardised (i.e. deviance) residual; the rules for deciding which points to flag in this way are the same as those described in [Units 3](unit3.ipynb) and [4](unit4.ipynb). Here the flagged point has the most negative residual. However, the value of this standardised residual is not very large (at −2.38), and the plot of residuals against fitted values above makes it clear that this particular residual is not exceptionally large relative to some of the others, which are almost as large.
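
To confirm which observation is being flagged, the standardised residuals can be inspected directly:

```{r flag_check}
# Identify the observation with the most extreme standardised residual
rstd <- rstandard(fit)
which.min(rstd)   # expected to be observation 19
min(rstd)         # its value, roughly -2.38
```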


### Exercise 5.2
The table below shows the first five of the 13 datapoints in a dataset concerned with predicting the heat evolved when cement sets via knowledge of its constituents. There are two explanatory variables, tricalcium aluminate (TA) and tricalcium silicate (TS), and the response variable is the heat generated in calories per gram (heat).

heat | TA | TS
-----|----|-----
78.5 | 7 | 26
74.3 | 1 | 29
104.3 | 11 | 56
87.6 | 11 | 31
95.9 | 7 | 52
$\vdots$ | $\vdots$ | $\vdots$

a. Load the `cemheat` dataset.

b. Make scatterplots of heat against each of TA and TS in turn, and comment on what you see.

c. Use R to fit each individual regression equation (of heat on TA and of heat on TS) in turn, and then to fit the regression equation with two explanatory variables. Does the latter regression equation give you a better model than either of the individual ones?

d. According to the regression equation with two explanatory variables fitted in part (c), what is the predicted value of heat when TA = 15 and TS = 55?

e. By looking at the (default) composite residual plots, comment on the appropriateness of the fitted regression model.

The solution is in the [Section 5.1 solutions](section5.1solutions.ipynb) notebook.

### Exercise 5.3: Fitting a quadratic regression model
In Unit 3, the dataset `anaerob` was briefly considered (Example 3.1) and swiftly dismissed as a candidate for simple linear regression modelling. The reason is clear from the figure below, in which the response variable, expired ventilation ($y$), is plotted against the single explanatory variable, oxygen uptake ($x$). (The same plot appeared as Figure 3.1 in Example 3.1.) A model quadratic in $x$, i.e. $E(Y) = \alpha + \beta x + \gamma x^2$, was suggested instead.

```{r anaerobic_plot}
anaerobic <- read.csv('anaerob.csv')
ggplot(anaerobic, aes(x=oxygen, y=ventil)) + geom_point()
```

Since this quadratic model is nonetheless linear in its parameters, we can fit it using multiple regression of $y$ on $x$ plus the new variable $x^2$. Even though $x_1 = x$ and $x_2 = x^2$ are closely related, there is no immediate impediment to forgetting this and regressing on them in the usual way. A variable such as $x_2$ is sometimes called a derived variable.
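
(As an aside: in R the derived variable need not be created explicitly; the same quadratic model can be specified inline with `I()`, as in the sketch below. The exercise asks you to create the squared column yourself, which is equally valid.)

```{r quadratic_inline, eval=FALSE}
# Inline alternative to creating an explicit squared column:
# I() protects the arithmetic inside the model formula
lm(ventil ~ oxygen + I(oxygen^2), data = anaerobic)
```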

a. Using R, perform the regression of expired ventilation (`ventil`) on oxygen uptake (`oxygen`). Are you at all surprised by how good this regression model seems?

b. Now form a new variable `oxy2`, say, by squaring oxygen. (Create a new column in the `anaerobic` dataframe which is `anaerobic$oxygen ^ 2`.) Perform the regression of `ventil` on `oxygen` and `oxy2`. Comment on the fit of this model according to the printed output (and with recourse to Figure 3.2 in Example 3.1).

c. Make the usual residual plots and comment on the fit of the model again.

The solution is in the [Section 5.1 solutions](section5.1solutions.ipynb) notebook.