Modelling abrasion loss

This example concerns the dataset rubber, which you first met in Exercise 3.1. The experiment introduced in that exercise investigated the resistance of rubber to abrasion (the abrasion loss) and how that resistance depended on various attributes of the rubber. In Exercise 3.1, the only explanatory variable we analysed was hardness; in Exercise 3.19, a second explanatory variable, tensile strength, was taken into account too. There are 30 datapoints.

rubber <- read.csv('rubber.csv')
rubber
##    loss hardness strength
## 1   372       45      162
## 2   206       55      233
## 3   175       61      232
## 4   154       66      231
## 5   136       71      231
## 6   112       71      237
## 7    55       81      224
## 8    45       86      219
## 9   221       53      203
## 10  166       60      189
## 11  164       64      210
## 12  113       68      210
## 13   82       79      196
## 14   32       81      180
## 15  228       56      200
## 16  196       68      173
## 17  128       75      188
## 18   97       83      161
## 19   64       88      119
## 20  249       59      161
## 21  219       71      151
## 22  186       80      165
## 23  155       82      151
## 24  114       89      128
## 25  341       51      161
## 26  340       59      146
## 27  283       65      148
## 28  267       74      144
## 29  215       81      134
## 30  148       86      127

The plots below show scatterplots of abrasion loss against hardness and of abrasion loss against strength. (A scatterplot of abrasion loss against hardness also appeared in Solution 3.1.) The left-hand plot suggests a strong decreasing linear relationship of abrasion loss with hardness. The right-hand plot is much less indicative of any strong dependence of abrasion loss on tensile strength.

library(ggplot2)  # scatterplots

hardloss <- ggplot(rubber, aes(x = hardness, y = loss)) + geom_point()
strloss  <- ggplot(rubber, aes(x = strength, y = loss)) + geom_point()

multiplot(hardloss, strloss, cols=2)
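
Note that multiplot() is not a ggplot2 function: it is assumed here to be a helper defined earlier in the course materials (a version is given in the Cookbook for R). If it is not available, a sketch of an equivalent, assuming the gridExtra package is installed, is:

library(gridExtra)  # for arranging ggplot objects side by side
grid.arrange(hardloss, strloss, ncol = 2)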

In Exercise 3.5(a) you obtained the following output for the regression of abrasion loss on hardness.

fit <- lm(loss ~ hardness, data = rubber)
summary(fit)
## 
## Call:
## lm(formula = loss ~ hardness, data = rubber)
## 
## Residuals:
##    Min     1Q Median     3Q    Max 
## -86.15 -46.77 -19.49  54.27 111.49 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 550.4151    65.7867   8.367 4.22e-09 ***
## hardness     -5.3366     0.9229  -5.782 3.29e-06 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 60.52 on 28 degrees of freedom
## Multiple R-squared:  0.5442, Adjusted R-squared:  0.5279 
## F-statistic: 33.43 on 1 and 28 DF,  p-value: 3.294e-06
anova(fit)
## Analysis of Variance Table
## 
## Response: loss
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## hardness   1 122455  122455  33.433 3.294e-06 ***
## Residuals 28 102556    3663                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
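
As an illustrative aside (not part of the original exercise), with a single explanatory variable the F-statistic in the analysis of variance table is just the square of the slope's t-statistic, which you can check directly:

# F = t^2 for the slope in simple linear regression
coef(summary(fit))["hardness", "t value"]^2   # about 33.43, matching the F value above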

Exercise 5.1

Now repeat the analysis for the regression of abrasion loss on tensile strength.

Enter your solution in the cell below.

# Your solution here

Solution 5.1

fit <- lm(loss ~ strength, data = rubber)
summary(fit)
## 
## Call:
## lm(formula = loss ~ strength, data = rubber)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -155.640  -59.919    2.795   61.221  183.285 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 305.2248    79.9962   3.815 0.000688 ***
## strength     -0.7192     0.4347  -1.654 0.109232    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 85.56 on 28 degrees of freedom
## Multiple R-squared:  0.08904,    Adjusted R-squared:  0.0565 
## F-statistic: 2.737 on 1 and 28 DF,  p-value: 0.1092
anova(fit)
## Analysis of Variance Table
## 
## Response: loss
##           Df Sum Sq Mean Sq F value Pr(>F)
## strength   1  20035 20034.8  2.7368 0.1092
## Residuals 28 204977  7320.6

Individually, then, the regression of abrasion loss on hardness seems satisfactory, albeit accounting for a disappointingly small percentage of the variance in the response (an adjusted R-squared of about 53%); the regression of abrasion loss on tensile strength, on the other hand, seems to have very little explanatory power (under 6%).

Multiple regression

However, back in Exercise 3.19 it was shown that the residuals from the former regression display a clear relationship with the tensile strength values, which suggests that a regression model involving both explanatory variables is necessary.

Let us try it. The only change needed is to include strength in the model formula passed to lm(). Instead of

lm(loss ~ hardness, data = rubber)

use

lm(loss ~ hardness + strength, data = rubber)
fit <- lm(loss ~ hardness + strength, data = rubber)
summary(fit)
## 
## Call:
## lm(formula = loss ~ hardness + strength, data = rubber)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -79.385 -14.608   3.816  19.755  65.981 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 885.1611    61.7516  14.334 3.84e-14 ***
## hardness     -6.5708     0.5832 -11.267 1.03e-11 ***
## strength     -1.3743     0.1943  -7.073 1.32e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 36.49 on 27 degrees of freedom
## Multiple R-squared:  0.8402, Adjusted R-squared:  0.8284 
## F-statistic:    71 on 2 and 27 DF,  p-value: 1.767e-11
anova(fit)
## Analysis of Variance Table
## 
## Response: loss
##           Df Sum Sq Mean Sq F value    Pr(>F)    
## hardness   1 122455  122455  91.970 3.458e-10 ***
## strength   1  66607   66607  50.025 1.325e-07 ***
## Residuals 27  35950    1331                      
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Looking at the coefficients section of the summary output, the estimated model for the mean response is

\[ \hat{y} = \hat{\alpha} + \hat{\beta}_1 x_1 + \hat{\beta}_2 x_2 = 885.2 - 6.571 x_1 - 1.374 x_2 \]

where \(x_1\) and \(x_2\) stand for values of hardness and tensile strength, respectively.
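
As a quick numerical check (a sketch using the coefficients printed above), the fitted equation can be evaluated by hand for the first datapoint (hardness 45, strength 162) and compared with R's fitted values:

# Evaluate the estimated mean response by hand for datapoint 1
co <- coef(fit)
unname(co["(Intercept)"] + co["hardness"] * 45 + co["strength"] * 162)
fitted(fit)[1]   # the same value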

Associated with each estimated parameter, R gives a standard error (details of the calculation of which need not concern us now) and hence a t-statistic (estimate divided by standard error) to be compared with the t-distribution on the residual degrees of freedom, here 27. R makes the comparison and gives p values, which in this case are all very small, suggesting strong evidence that each of the parameters \(\alpha\), \(\beta_1\) and \(\beta_2\) is non-zero (and hence belongs in the model).
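
If you want to see where these figures come from, the t-statistics and p-values can be reconstructed from the coefficient table using base R (an illustrative check, not a required step):

# Recompute t-statistics and two-sided p-values from the coefficient table
coefs <- coef(summary(fit))   # columns: Estimate, Std. Error, t value, Pr(>|t|)
tvals <- coefs[, "Estimate"] / coefs[, "Std. Error"]
pvals <- 2 * pt(-abs(tvals), df = df.residual(fit))   # t-distribution on 27 df
cbind(t = tvals, p = pvals)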

So, even though the single regression of abrasion loss on tensile strength did not suggest a close relationship, when taken in conjunction with the effect of hardness the effect of tensile strength is also considerable. A key idea here is that \(\beta_1\) and \(\beta_2\) (and more generally \(\beta_1, \beta_2, \ldots, \beta_k\) in model (5.1)) are partial regression coefficients. That is, \(\beta_1\) measures the effect on the mean response of an increase of one unit in \(x_1\) with the value of the other variable \(x_2\) held fixed. Contrast this with the single regression model \(E(Y) = \alpha + \beta_1 x_1\), in which \(\beta_1\) measures the effect of a unit increase in \(x_1\) ignoring \(x_2\) altogether. The meaning of \(\beta_1\) in the regression models with one and two explanatory variables is not the same.
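
The 'held fixed' interpretation can be seen numerically with predict(); the particular values 70 and 180 below are arbitrary choices for illustration:

# Increase hardness by one unit while holding strength fixed:
# the change in the predicted mean response equals the hardness coefficient
nd <- data.frame(hardness = c(70, 71), strength = c(180, 180))
diff(predict(fit, newdata = nd))   # about -6.571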

You will notice that the percentage of variance accounted for (the adjusted R-squared) has increased dramatically in the two-explanatory-variable model, to a much more respectable 82.8%. This statistic has the same interpretation (or difficulty of interpretation) as in the case of one explanatory variable (see Unit 3).
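
If you want this figure programmatically, it can be extracted from the summary object:

# Adjusted R-squared, as a percentage
summary(fit)$adj.r.squared * 100   # about 82.8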

Other items in the output are as for the case of one explanatory variable: the residual standard error is \(\hat{\sigma}\), the square root of the residual mean square (here \(\sqrt{1331} \approx 36.5\)); standardised residuals will be clarified below; and leverage will be ignored until Unit 10.

The simple residuals are, again, defined as the differences between the observed and predicted responses:

\[ r_i = y_i - \left( \hat{\alpha} + \sum_{j=1}^{k} \hat{\beta}_j x_{i, j} \right) ,\ i = 1, 2, \ldots, n. \]

R can obtain these for you, via residuals(), and produce a plot of residuals against fitted values; the fitted values, available via fitted(), are simply \(\hat{Y}_i = \hat{\alpha} + \sum_{j=1}^{k} \hat{\beta}_j x_{i, j}\). As in the regression models in Unit 3, it is the standardised residuals (simple residuals divided by their estimated standard errors) that are most useful for diagnostics; in R these are given by rstandard().
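
A quick sketch of these quantities using base R functions:

# Simple residuals by hand agree with residuals(fit)
r <- rubber$loss - fitted(fit)
all.equal(unname(r), unname(residuals(fit)))

# Standardised residuals, for use in diagnostics
head(rstandard(fit))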


Example 5.2

The composite residual plots for the two-explanatory-variable model are obtained as follows.

library(ggfortify)  # provides autoplot() diagnostic plots for lm objects

fit <- lm(loss ~ hardness + strength, data = rubber)
autoplot(fit)

These plots are of just the same sort as those in Unit 3, and are interpreted in the same way. In this case, the normal Q-Q plot shows some slight suggestion of skewness in the residuals, but there is no obvious cause to doubt the model. The diagnostic plots label the most extreme points, and point 19 stands out as having the largest standardised residual in absolute value; the rules for deciding which points deserve attention are the same as those described in Units 3 and 4. However, the value of this standardised residual is not very large (at about −2.38), and the plot of residuals against fitted values makes it clear that this particular residual is not exceptionally large relative to some of the others, which are almost as large.
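
To locate the flagged observation programmatically (a small sketch using base R):

# Find and inspect the most extreme standardised residual
which.max(abs(rstandard(fit)))   # observation 19
rstandard(fit)[19]               # about -2.38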

Exercise 5.2

The table below shows the first five of the 13 datapoints in a dataset concerned with predicting the heat evolved when cement sets from knowledge of its constituents. There are two explanatory variables, tricalcium aluminate (TA) and tricalcium silicate (TS), and the response variable is the heat generated, in calories per gram (heat).

 heat  TA  TS
 78.5   7  26
 74.3   1  29
104.3  11  56
 87.6  11  31
 95.9   7  52
\(\vdots\)  \(\vdots\)  \(\vdots\)
  1. Load the cemheat dataset.

  2. Make scatterplots of heat against each of TA and TS in turn, and comment on what you see.

  3. Use R to fit each individual regression equation (of heat on TA and of heat on TS) in turn, and then to fit the regression equation with two explanatory variables. Does the latter regression equation give you a better model than either of the individual ones?

  4. According to the regression equation with two explanatory variables fitted in part 3, what is the predicted value of heat when TA = 15 and TS = 55?

  5. By looking at the (default) composite residual plots, comment on the appropriateness of the fitted regression model.

The solution is in the Section 5.1 solutions notebook.

Exercise 5.3: Fitting a quadratic regression model

In Unit 3, the dataset anaerob was briefly considered (Example 3.1) and swiftly dismissed as a candidate for simple linear regression modelling. The reason is clear from the figure below, in which the response variable, expired ventilation (\(y\)), is plotted against the single explanatory variable, oxygen uptake (\(x\)). (The same plot appeared as Figure 3.1 in Example 3.1.) A model quadratic in \(x\), i.e. \(E(Y) = \alpha + \beta x + \gamma x^2\), was suggested instead.

anaerobic <- read.csv('anaerob.csv')
ggplot(anaerobic, aes(x=oxygen, y=ventil)) + geom_point()

Since this quadratic model is nonetheless linear in its parameters, we can fit it using multiple regression of \(y\) on \(x\) plus the new variable \(x^2\). Even though \(x_1 = x\) and \(x_2 = x^2\) are closely related, there is no immediate impediment to forgetting this and regressing on them in the usual way. A variable such as \(x_2\) is sometimes called a derived variable.
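
The mechanics in R are straightforward; the sketch below shows two equivalent ways to include the squared term, either as an explicit derived variable (as in the exercise below) or inline using I(), which makes ^ act arithmetically inside a formula:

# Two equivalent ways to fit the quadratic model
anaerobic$oxy2 <- anaerobic$oxygen^2
lm(ventil ~ oxygen + oxy2, data = anaerobic)
# or, without creating a new column:
lm(ventil ~ oxygen + I(oxygen^2), data = anaerobic)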

  1. Using R, perform the regression of expired ventilation (ventil) on oxygen uptake (oxygen). Are you at all surprised by how good this regression model seems?

  2. Now form a new variable oxy2, say, by squaring oxygen. (Create a new column in the anaerobic dataframe which is anaerobic$oxygen ^ 2.) Perform the regression of ventil on oxygen and oxy2. Comment on the fit of this model according to the printed output (and with recourse to Figure 3.2 in Example 3.1).

  3. Make the usual residual plots and comment on the fit of the model again.

The solution is in the Section 5.1 solutions notebook.