This example concerns the dataset rubber, which you first met in Exercise 3.1. The experiment introduced in that exercise investigated the resistance of rubber to abrasion (the abrasion loss) and how that resistance depended on various attributes of the rubber. In Exercise 3.1, the only explanatory variable we analysed was hardness; in Exercise 3.19, a second explanatory variable, tensile strength, was taken into account too. There are 30 datapoints.
rubber <- read.csv('rubber.csv')
rubber
## loss hardness strength
## 1 372 45 162
## 2 206 55 233
## 3 175 61 232
## 4 154 66 231
## 5 136 71 231
## 6 112 71 237
## 7 55 81 224
## 8 45 86 219
## 9 221 53 203
## 10 166 60 189
## 11 164 64 210
## 12 113 68 210
## 13 82 79 196
## 14 32 81 180
## 15 228 56 200
## 16 196 68 173
## 17 128 75 188
## 18 97 83 161
## 19 64 88 119
## 20 249 59 161
## 21 219 71 151
## 22 186 80 165
## 23 155 82 151
## 24 114 89 128
## 25 341 51 161
## 26 340 59 146
## 27 283 65 148
## 28 267 74 144
## 29 215 81 134
## 30 148 86 127
The figures below show scatterplots of abrasion loss against hardness and of abrasion loss against strength. (A scatterplot of abrasion loss against hardness also appeared in Solution 3.1.) Figure (a) suggests a strong decreasing linear relationship of abrasion loss with hardness. Figure (b) is much less indicative of any strong dependence of abrasion loss on tensile strength.
hardloss <- ggplot(rubber, aes(x=hardness, y=loss)) + geom_point()
strloss <- ggplot(rubber, aes(x=strength, y=loss)) + geom_point()
multiplot(hardloss, strloss, cols=2)
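Note that multiplot() is not part of ggplot2; it is assumed here to be a helper defined in the course's setup code. If it is not available in your session, a similar side-by-side layout can be produced with the gridExtra package (an alternative suggestion, not part of the original exercise):

```r
# Alternative to multiplot(): arrange the two ggplots side by side
# (install.packages("gridExtra") first if necessary)
library(gridExtra)
grid.arrange(hardloss, strloss, ncol = 2)
```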
In Exercise 3.5(a) you obtained the following output for the regression of abrasion loss on hardness.
fit <- lm(loss ~ hardness, data = rubber)
summary(fit)
##
## Call:
## lm(formula = loss ~ hardness, data = rubber)
##
## Residuals:
## Min 1Q Median 3Q Max
## -86.15 -46.77 -19.49 54.27 111.49
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 550.4151 65.7867 8.367 4.22e-09 ***
## hardness -5.3366 0.9229 -5.782 3.29e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 60.52 on 28 degrees of freedom
## Multiple R-squared: 0.5442, Adjusted R-squared: 0.5279
## F-statistic: 33.43 on 1 and 28 DF, p-value: 3.294e-06
anova(fit)
## Analysis of Variance Table
##
## Response: loss
## Df Sum Sq Mean Sq F value Pr(>F)
## hardness 1 122455 122455 33.433 3.294e-06 ***
## Residuals 28 102556 3663
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Now repeat the analysis for the regression of abrasion loss on tensile strength.
Enter your solution in the cell below.
# Your solution here
fit <- lm(loss ~ strength, data = rubber)
summary(fit)
##
## Call:
## lm(formula = loss ~ strength, data = rubber)
##
## Residuals:
## Min 1Q Median 3Q Max
## -155.640 -59.919 2.795 61.221 183.285
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 305.2248 79.9962 3.815 0.000688 ***
## strength -0.7192 0.4347 -1.654 0.109232
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 85.56 on 28 degrees of freedom
## Multiple R-squared: 0.08904, Adjusted R-squared: 0.0565
## F-statistic: 2.737 on 1 and 28 DF, p-value: 0.1092
anova(fit)
## Analysis of Variance Table
##
## Response: loss
## Df Sum Sq Mean Sq F value Pr(>F)
## strength 1 20035 20034.8 2.7368 0.1092
## Residuals 28 204977 7320.6
Individually, then, regression of abrasion loss on hardness seems satisfactory, albeit accounting for a disappointingly small percentage of the variance in the response; regression of abrasion loss on tensile strength, on the other hand, seems to have very little explanatory power.
However, back in Exercise 3.19 it was shown that the residuals from the former regression showed a clear relationship with the tensile strength values, and this suggested that a regression model involving both variables is necessary.
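That claim is easy to check directly: plot the residuals from the hardness-only regression against tensile strength and look for a systematic pattern. A sketch (using the variable names from the dataset):

```r
# Residuals from the hardness-only fit, plotted against strength;
# a clear trend here suggests strength should enter the model too
fit1 <- lm(loss ~ hardness, data = rubber)
rubber$resid1 <- residuals(fit1)
ggplot(rubber, aes(x = strength, y = resid1)) + geom_point()
```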
Let us try it. The only change is to include strength in the model formula passed to the lm() function. Instead of

lm(loss ~ hardness, data = rubber)

use

lm(loss ~ hardness + strength, data = rubber)
fit <- lm(loss ~ hardness + strength, data = rubber)
summary(fit)
##
## Call:
## lm(formula = loss ~ hardness + strength, data = rubber)
##
## Residuals:
## Min 1Q Median 3Q Max
## -79.385 -14.608 3.816 19.755 65.981
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 885.1611 61.7516 14.334 3.84e-14 ***
## hardness -6.5708 0.5832 -11.267 1.03e-11 ***
## strength -1.3743 0.1943 -7.073 1.32e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 36.49 on 27 degrees of freedom
## Multiple R-squared: 0.8402, Adjusted R-squared: 0.8284
## F-statistic: 71 on 2 and 27 DF, p-value: 1.767e-11
anova(fit)
## Analysis of Variance Table
##
## Response: loss
## Df Sum Sq Mean Sq F value Pr(>F)
## hardness 1 122455 122455 91.970 3.458e-10 ***
## strength 1 66607 66607 50.025 1.325e-07 ***
## Residuals 27 35950 1331
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Looking at the regression coefficient output, the estimated model for the mean response is
\[ \hat{y} = \hat{\alpha} + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 = 885.2 - 6.571 x_1 - 1.374 x_2 \]
where \(x_1\) and \(x_2\) stand for values of hardness and tensile strength, respectively.
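To illustrate, the fitted equation can be used for prediction with predict(); the values hardness = 60 and strength = 200 below are illustrative choices within the observed range, not part of the original example:

```r
# Predicted abrasion loss at hardness 60 and strength 200
predict(fit, newdata = data.frame(hardness = 60, strength = 200))

# The same value by hand from the estimated coefficients:
# 885.1611 - 6.5708 * 60 - 1.3743 * 200, approximately 216
```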
Associated with each estimated parameter, R gives a standard error (details of the calculation of which need not concern us now) and hence a t-statistic (estimate divided by standard error) to be compared with the t-distribution on the residual degrees of freedom, here 27. R makes the comparison and gives p values, which in this case are all very small, suggesting strong evidence that each parameter, \(\alpha\), \(\beta_1\) and \(\beta_2\), is individually non-zero (and hence belongs in the model).
So, even though the single regression of abrasion loss on tensile strength did not suggest a close relationship, when taken in conjunction with hardness, tensile strength also has a considerable effect. A key idea here is that \(\beta_1\) and \(\beta_2\) (and more generally \(\beta_1, \beta_2, \ldots , \beta_k\) in model (5.1)) are partial regression coefficients. That is, \(\beta_1\) measures the effect of an increase of one unit in \(x_1\) treating the value of the other variable \(x_2\) as fixed. Contrast this with the single regression model \(E(Y) = \alpha + \beta_1 x_1\), in which \(\beta_1\) measures the effect of an increase of one unit in \(x_1\) ignoring \(x_2\) altogether. The meaning of \(\beta_1\) in the regression models with one and two explanatory variables is not the same.
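One way to see the "partial" nature of \(\beta_1\) numerically is by residual-on-residual regression: regressing the part of loss not explained by strength on the part of hardness not explained by strength reproduces the hardness coefficient from the two-variable model. This check (an illustration of the Frisch–Waugh–Lovell result, not part of the original text) can be sketched as:

```r
# Remove the strength effect from both loss and hardness, then regress
r_loss <- residuals(lm(loss ~ strength, data = rubber))
r_hard <- residuals(lm(hardness ~ strength, data = rubber))

# The slope below matches the hardness coefficient (-6.5708) from
# lm(loss ~ hardness + strength, data = rubber)
coef(lm(r_loss ~ r_hard))["r_hard"]
```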
You will notice that the percentage of variance accounted for has increased dramatically in the two-explanatory-variable model to a much more respectable 82.8%. This statistic has the same interpretation — or difficulty of interpretation — as for the case of one explanatory variable (see Unit 3).
Other items in the output are as for the case of one explanatory variable: the residual standard error is \(\hat{\sigma}\), the square root of the residual mean square; standardised residuals will be clarified below; and leverage will be ignored until Unit 10.
The simple residuals are, again, defined as the differences between the observed and predicted responses:
\[ r_i = y_i - \left( \hat{\alpha} + \sum_{j=1}^{k} \hat{\beta}_j x_{i, j} \right) ,\ i = 1, 2, \ldots, n. \]
R can obtain these for you and produce a plot of residuals against fitted values. The fitted values are simply \(\hat{Y}_i = \hat{\alpha} + \sum_{j=1}^{k} \hat{\beta}_j x_{i, j}\). As in the regression models in Unit 3, the plots here use standardised residuals (simple residuals divided by their estimated standard errors), which in this context are equivalent to deviance residuals.
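These quantities are all available directly in R; a sketch, assuming fit holds the two-variable model:

```r
# Simple residuals, fitted values and standardised residuals
r  <- residuals(fit)    # y_i minus the fitted value
f  <- fitted(fit)       # alpha-hat + sum of beta-hat_j * x_ij
rs <- rstandard(fit)    # standardised (internally studentised) residuals

# A basic plot of standardised residuals against fitted values
plot(f, rs)
```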
The residual plots for this model are produced below.
fit <- lm(loss ~ hardness + strength, data = rubber)
autoplot(fit)
These plots are of just the same sort as those in Unit 3, and are interpreted in the same way. In this case, the normal plot shows some slight suggestion of skewness in the residuals, but there is no obvious cause to doubt the model. One of the points (number 19) is flagged as having a large standardised (i.e. deviance) residual, following the same rules as described in Units 3 and 4; it is the most negative residual. However, the value of this standardised residual is not very large (at −2.38), and the plot of residuals against fitted values makes it clear that this particular residual is not exceptionally large relative to some of the others, which are almost as large.
The table below shows values for the first five of the 13 datapoints of a dataset concerned with predicting the heat evolved when cement sets from knowledge of its constituents. There are two explanatory variables, tricalcium aluminate (TA) and tricalcium silicate (TS), and the response variable is the heat generated in calories per gram (heat).
| heat | TA | TS |
|---|---|---|
| 78.5 | 7 | 26 |
| 74.3 | 1 | 29 |
| 104.3 | 11 | 56 |
| 87.6 | 11 | 31 |
| 95.9 | 7 | 52 |
| \(\vdots\) | \(\vdots\) | \(\vdots\) |
Load the cemheat dataset.
(a) Make scatterplots of heat against each of TA and TS in turn, and comment on what you see.

(b) Use R to fit each individual regression equation (of heat on TA and of heat on TS) in turn, and then to fit the regression equation with two explanatory variables. Does the latter regression equation give you a better model than either of the individual ones?

(c) According to the regression equation with two explanatory variables fitted in part (b), what is the predicted value of heat when TA = 15 and TS = 55?

(d) By looking at the (default) composite residual plots, comment on the appropriateness of the fitted regression model.
The solution is in the Section 5.1 solutions notebook.
In Unit 3, the dataset anaerob was briefly considered (Example 3.1) and swiftly dismissed as a candidate for simple linear regression modelling. The reason is clear from the figure below, in which the response variable, expired ventilation (\(y\)), is plotted against the single explanatory variable, oxygen uptake (\(x\)). (The same plot appeared as Figure 3.1 in Example 3.1.) A model quadratic in \(x\), i.e. \(E(Y) = \alpha + \beta x + \gamma x^2\), was suggested instead.
anaerobic <- read.csv('anaerob.csv')
ggplot(anaerobic, aes(x=oxygen, y=ventil)) + geom_point()
Since this quadratic model is nonetheless linear in its parameters, we can fit it using multiple regression of \(y\) on \(x\) plus the new variable \(x^2\). Even though \(x_1 = x\) and \(x_2 = x^2\) are closely related, there is no immediate impediment to forgetting this and regressing on them in the usual way. A variable such as \(x_2\) is sometimes called a derived variable.
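As an aside, R can also form the derived variable inside the model formula itself, without creating a new column; the fits below are equivalent ways of specifying the quadratic model (shown for reference, they are not required for the exercise that follows):

```r
# Inline squared term: I() protects ^ from its special formula meaning
fit_q1 <- lm(ventil ~ oxygen + I(oxygen^2), data = anaerobic)

# The same quadratic fit via poly() with raw (non-orthogonal) terms
fit_q2 <- lm(ventil ~ poly(oxygen, 2, raw = TRUE), data = anaerobic)
```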
Using R, perform the regression of expired ventilation (ventil) on oxygen uptake (oxygen). Are you at all surprised by how good this regression model seems?
Now form a new variable oxy2, say, by squaring oxygen. (Create a new column in the anaerobic dataframe which is anaerobic$oxygen ^ 2.) Perform the regression of ventil on oxygen and oxy2. Comment on the fit of this model according to the printed output (and with recourse to Figure 3.2 in Example 3.1).
Make the usual residual plots and comment on the fit of the model again.
The solution is in the Section 5.1 solutions notebook.