---
title: "section5.1"
output: html_document
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)

library(tidyverse)
library(repr)
library(ggfortify)

# Change plot size to 6 x 4
# options(repr.plot.width=6, repr.plot.height=4)

# Multiple plot function
#
# ggplot objects can be passed in ..., or to plotlist (as a list of ggplot objects)
# - cols: Number of columns in layout
# - layout: A matrix specifying the layout. If present, 'cols' is ignored.
#
# If the layout is something like matrix(c(1,2,3,3), nrow=2, byrow=TRUE),
# then plot 1 will go in the upper left, 2 will go in the upper right, and
# 3 will go all the way across the bottom.
#
multiplot <- function(..., plotlist=NULL, file, cols=1, layout=NULL) {
  library(grid)

  # Make a list from the ... arguments and plotlist
  plots <- c(list(...), plotlist)

  numPlots = length(plots)

  # If layout is NULL, then use 'cols' to determine layout
  if (is.null(layout)) {
    # Make the panel
    # ncol: Number of columns of plots
    # nrow: Number of rows needed, calculated from # of cols
    layout <- matrix(seq(1, cols * ceiling(numPlots/cols)),
                     ncol = cols, nrow = ceiling(numPlots/cols))
  }

  if (numPlots==1) {
    print(plots[[1]])

  } else {
    # Set up the page
    grid.newpage()
    pushViewport(viewport(layout = grid.layout(nrow(layout), ncol(layout))))

    # Make each plot, in the correct location
    for (i in 1:numPlots) {
      # Get the i,j matrix positions of the regions that contain this subplot
      matchidx <- as.data.frame(which(layout == i, arr.ind = TRUE))

      print(plots[[i]], vp = viewport(layout.pos.row = matchidx$row,
                                      layout.pos.col = matchidx$col))
    }
  }
}
```

## Modelling abrasion loss

This example concerns the dataset `rubber`, which you first met in Exercise 3.1. The experiment introduced in that exercise concerned an investigation of the resistance of rubber to abrasion (the abrasion loss) and how that resistance depended on various attributes of the rubber. In Exercise 3.1, the only explanatory variable we analysed was hardness; but in Exercise 3.19, a second explanatory variable, tensile strength, was taken into account too. There are 30 datapoints.

```{r load_rubber}
rubber <- read.csv('rubber.csv')
rubber
```

The figures below show scatterplots of abrasion loss against hardness and of abrasion loss against strength. (A scatterplot of abrasion loss against hardness also appeared in Solution 3.1.) The left-hand plot suggests a strong decreasing linear relationship of abrasion loss with hardness. The right-hand plot is much less indicative of any strong dependence of abrasion loss on tensile strength.

```{r rubber_plots}
hardloss <- ggplot(rubber, aes(x=hardness, y=loss)) + geom_point()
strloss <- ggplot(rubber, aes(x=strength, y=loss)) + geom_point()

multiplot(hardloss, strloss, cols=2)
```
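
(As an aside, if the gridExtra package is available, the same side-by-side layout can be produced without the `multiplot()` helper; this is just an alternative, not part of the original analysis.)

```{r rubber_plots_gridextra, eval=FALSE}
# Alternative layout using gridExtra (if installed); equivalent to
# multiplot(hardloss, strloss, cols=2)
library(gridExtra)
grid.arrange(hardloss, strloss, ncol = 2)
```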

In Exercise 3.5(a) you obtained the following output for the regression of abrasion loss on hardness.

```{r hardness_fit}
fit <- lm(loss ~ hardness, data = rubber)
summary(fit)
anova(fit)
```

### Exercise 5.1

Now repeat this for the regression of abrasion loss on tensile strength.

Enter your solution in the cell below.

```{r ex5.1}
# Your solution here
```

### Solution 5.1

```{r ex5.1sol}
fit <- lm(loss ~ strength, data = rubber)
summary(fit)
anova(fit)
```

Individually, then, regression of abrasion loss on hardness seems satisfactory, albeit accounting for a disappointingly small percentage of the variance in the response; regression of abrasion loss on tensile strength, on the other hand, seems to have very little explanatory power.


### Multiple regression
However, back in Exercise 3.19 it was shown that the residuals from the former regression showed a clear relationship with the tensile strength values, and this suggested that a regression model involving both variables is necessary.

Let us try it. The only change needed is to include `strength` in the model formula passed to `lm()`. Instead of

```{r, eval = FALSE}
lm(loss ~ hardness, data = rubber)
```
use
```{r, eval = FALSE}
lm(loss ~ hardness + strength, data = rubber)
```

```{r hardness_str_fit}
fit <- lm(loss ~ hardness + strength, data = rubber)
summary(fit)
anova(fit)
```

Looking at the regression coefficient output next, the estimated model for the mean response is

$$ \hat{y} = \hat{\alpha} + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 = 885.2 - 6.571 x_1 - 1.374 x_2 $$

where $x_1$ and $x_2$ stand for values of hardness and tensile strength, respectively.
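
The estimates can be extracted from the fitted model directly, and `predict()` applies the fitted equation to new data. A minimal sketch (the hardness and strength values below are made-up illustrative inputs, not from the dataset):

```{r predict_sketch}
# Estimated coefficients: intercept, hardness slope, strength slope
coef(fit)

# Predicted mean abrasion loss at (hypothetical) hardness 70 and strength 180
predict(fit, newdata = data.frame(hardness = 70, strength = 180))
```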

Associated with each estimated parameter, R gives a standard error (details of the calculation of which need not concern us now) and hence a _t_-statistic (estimate divided by standard error) to be compared with the _t_ distribution on the residual degrees of freedom, here $n - 3 = 27$. R makes the comparison and gives _p_ values, which in this case are all very small, suggesting strong evidence for the non-zeroness (and hence presence) of each parameter, $\alpha$, $\beta_1$ and $\beta_2$, individually.

So, even though the single regression of abrasion loss on tensile strength did not suggest a close relationship, when taken in conjunction with the effect of hardness the effect of tensile strength is also a considerable one. A key idea here is that $\beta_1$ and $\beta_2$ (and more generally $\beta_1, \beta_2, \ldots , \beta_k$ in model (5.1)) are partial regression coefficients. That is, $\beta_1$ measures the effect of an increase of one unit in $x_1$ with the value of the other variable, $x_2$, held fixed. Contrast this with the single regression model $E(Y) = \alpha + \beta_1 x_1$, in which $\beta_1$ represents the effect of an increase of one unit in $x_1$ ignoring $x_2$ altogether. The meaning of $\beta_1$ in the regression models with one and two explanatory variables is not the same.
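
You can see this numerically: the estimated hardness coefficient changes when strength enters the model, because it is then a partial effect. A quick check, reusing the rubber data:

```{r partial_coef_check}
# Slope of hardness when it is the only explanatory variable...
coef(lm(loss ~ hardness, data = rubber))["hardness"]

# ...and its partial slope once strength is also in the model
coef(lm(loss ~ hardness + strength, data = rubber))["hardness"]
```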

You will notice that the percentage of variance accounted for (the adjusted R-squared in the `summary()` output) has increased dramatically in the two-explanatory-variable model to a much more respectable 82.8%. This statistic has the same interpretation (or difficulty of interpretation) as for the case of one explanatory variable (see [Unit 3](unit3.ipynb)).
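
Both versions of this statistic can be pulled out of the summary object directly:

```{r rsquared_check}
# Multiple R-squared and adjusted R-squared for the two-variable model
summary(fit)$r.squared
summary(fit)$adj.r.squared
```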

Other items in the output are as for the case of one explanatory variable: the residual standard error reported by `summary()` is $\hat{\sigma}$, and is the square root of the residual mean square in the ANOVA table; standardised residuals will be clarified below; and leverage will be ignored until [Unit 10](unit10.ipynb).
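
That correspondence is easy to verify; a small sketch (using `sigma()`, which recent versions of base R provide for fitted models):

```{r sigma_check}
# Residual standard error...
sigma(fit)

# ...equals the square root of the residual mean square from the ANOVA table
sqrt(anova(fit)["Residuals", "Mean Sq"])
```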

The simple residuals are, again, defined as the differences between the observed and predicted responses:

$$ r_i = y_i - \left( \hat{\alpha} + \sum_{j=1}^{k} \hat{\beta}_j x_{i, j} \right) ,\ i = 1, 2, \ldots, n. $$

R can obtain these for you and produce a plot of residuals against fitted values. The fitted values are simply $\hat{Y}_i = \hat{\alpha} + \sum_{j=1}^{k} \hat{\beta}_j x_{i, j}$. As in the regression models in [Unit 3](unit3.ipynb), we work with standardised residuals (simple residuals divided by their estimated standard errors) which, in this context, are equivalent to deviance residuals.
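
A minimal sketch of computing these quantities by hand and checking them against R's built-in accessors:

```{r residuals_by_hand}
# Fitted values and simple residuals, computed from the estimated coefficients
fitted_by_hand <- coef(fit)[1] + coef(fit)[2] * rubber$hardness +
  coef(fit)[3] * rubber$strength
resid_by_hand <- rubber$loss - fitted_by_hand

# These agree with the built-in accessors
all.equal(unname(fitted_by_hand), unname(fitted(fit)))
all.equal(unname(resid_by_hand), unname(residuals(fit)))

# Standardised residuals: simple residuals scaled by their standard errors
head(rstandard(fit))
```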

## Gratuitous example of more complex maths markup
If

$$ \rho(z) = \rho_c \exp \left( - \frac{z^2}{2H^2} \right) $$

then

\begin{eqnarray*}
\frac{\partial \rho(z)}{\partial z} & = & \rho_c \frac{\partial}{\partial z}\exp
    \left( - \frac{z^2}{2H^2} \right) \\
 & = & \rho_c \exp \left( - \frac{z^2}{2H^2} \right) \cdot
    \frac{\partial}{\partial z} \left( - \frac{z^2}{2H^2} \right) \\
 & = & \rho_c \exp \left( - \frac{z^2}{2H^2} \right) \cdot
    \left( - \frac{z}{H^2} \right) \\
 & = & - \frac{z}{H^2} \rho(z)
\end{eqnarray*}


## Example 5.2
The composite residual plots for the fitted two-explanatory-variable model are produced below with `autoplot()` from the ggfortify package.

```{r residual_plots}
fit <- lm(loss ~ hardness + strength, data = rubber)
autoplot(fit)
```

These plots are of just the same sort as those in [Unit 3](unit3.ipynb), and are interpreted in the same way. In this case, the normal Q-Q plot shows some slight suggestion of skewness in the residuals, but there is no obvious cause to doubt the model. One of the points (number 19) is flagged as having a large standardised (i.e. deviance) residual; the rules for deciding which points to flag in this way are the same as those described in [Units 3](unit3.ipynb) and [4](unit4.ipynb). Here the flagged point has the most negative residual. However, the value of this standardised residual is not very large (at −2.38), and the plot of residuals against fitted values above makes it clear that this particular residual is not exceptionally large relative to some of the others, which are almost as large.
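
To confirm which observation is being flagged, the standardised residuals can be inspected directly:

```{r flag_check}
# Identify the observation with the most extreme standardised residual
rstd <- rstandard(fit)
which.min(rstd)   # expected to be observation 19
min(rstd)         # its value, roughly -2.38
```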


### Exercise 5.2
The table below shows the first five of the 13 datapoints in a dataset concerned with predicting the heat evolved when cement sets via knowledge of its constituents. There are two explanatory variables, tricalcium aluminate (TA) and tricalcium silicate (TS), and the response variable is the heat generated in calories per gram (heat).

heat | TA | TS
-----|----|-----
78.5 | 7 | 26
74.3 | 1 | 29
104.3 | 11 | 56
87.6 | 11 | 31
95.9 | 7 | 52
$\vdots$ | $\vdots$ | $\vdots$

a. Load the `cemheat` dataset.

b. Make scatterplots of heat against each of TA and TS in turn, and comment on what you see.

c. Use R to fit each individual regression equation (of heat on TA and of heat on TS) in turn, and then to fit the regression equation with two explanatory variables. Does the latter regression equation give you a better model than either of the individual ones?

d. According to the regression equation with two explanatory variables fitted in part (c), what is the predicted value of heat when TA = 15 and TS = 55?

e. By looking at the (default) composite residual plots, comment on the appropriateness of the fitted regression model.

The solution is in the [Section 5.1 solutions](section5.1solutions.ipynb) notebook.

### Exercise 5.3: Fitting a quadratic regression model
In Unit 3, the dataset `anaerob` was briefly considered (Example 3.1) and swiftly dismissed as a candidate for simple linear regression modelling. The reason is clear from the figure below, in which the response variable, expired ventilation ($y$), is plotted against the single explanatory variable, oxygen uptake ($x$). (The same plot appeared as Figure 3.1 in Example 3.1.) A model quadratic in $x$, i.e. $E(Y) = \alpha + \beta x + \gamma x^2$, was suggested instead.

```{r anaerobic_plot}
anaerobic <- read.csv('anaerob.csv')
ggplot(anaerobic, aes(x=oxygen, y=ventil)) + geom_point()
```

Since this quadratic model is nonetheless linear in its parameters, we can fit it using multiple regression of $y$ on $x$ plus the new variable $x^2$. Even though $x_1 = x$ and $x_2 = x^2$ are closely related, there is no immediate impediment to forgetting this and regressing on them in the usual way. A variable such as $x_2$ is sometimes called a derived variable.
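
(As an aside: in R the derived variable need not be created explicitly; the same quadratic model can be specified inline with `I()`, as in the sketch below. The exercise asks you to create the squared column yourself, which is equally valid.)

```{r quadratic_inline, eval=FALSE}
# Inline alternative to creating an explicit squared column:
# I() protects the arithmetic inside the model formula
lm(ventil ~ oxygen + I(oxygen^2), data = anaerobic)
```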

a. Using R, perform the regression of expired ventilation (`ventil`) on oxygen uptake (`oxygen`). Are you at all surprised by how good this regression model seems?

b. Now form a new variable `oxy2`, say, by squaring oxygen. (Create a new column in the `anaerobic` dataframe which is `anaerobic$oxygen ^ 2`.) Perform the regression of `ventil` on `oxygen` and `oxy2`. Comment on the fit of this model according to the printed output (and with recourse to Figure 3.2 in Example 3.1).

c. Make the usual residual plots and comment on the fit of the model again.

The solution is in the [Section 5.1 solutions](section5.1solutions.ipynb) notebook.