
# Relationships Between Variables, Part 2: Residual Analysis

Picking a model for a problem is a major undertaking. If the model fits well, it can be used to increase understanding of the problem, for prediction, or both. For instance, by fitting the linear model to the heights and weights of the baseball data, we could see that weight seems to increase by about 5 pounds for each additional inch in height. This is not a great discovery, but it is easy to think of situations where this rate of change is quite important. For example, consider an experiment on a cancer drug that is supposed to reduce the size of a tumor, where the response is the shrinkage (Y) in tumor size for a given dose (X) of the drug. If a linear model seems appropriate, then the slope is the expected additional reduction in tumor size when the dose is increased by one unit. In this section, we deal with the question of model adequacy.

We will only discuss the simple linear model. So we are considering two variables X and Y and we want to examine the adequacy of the model

Y = a + bX + e

The variable e denotes random error; that is, if there were no error, Y would be a deterministic linear function of X.

When is a model good? At first, one might say when there is no error. But for all the data that we consider in this class there will always be error. For the baseball data above, there is a distribution of weight for each height. Actually we will say a model is good if there is no connection between e and a + bX; that is, the random error is free of X. Hence, for predicting Y, we have found the model that contains all the information based on X. Now there may be other variables which help in predicting Y. These will be contained in e. So the assumption we want to verify on a model is:

Model Assumption: The random error component is independent of the X component.

How would we check this assumption? If we knew the random errors, e, we could just plot them against a + bX. A random scatter would indicate that the errors do not depend on a + bX; i.e., the errors are free of a + bX. Thus the model is good. However, we don't know the errors; we only know Y and X. But using Y and X we can estimate a and b. This leads to an estimate of a + bX, the predicted value of Y, which we label as $\hat{Y}$. Our estimate of the error is $Y - \hat{Y}$. This is called the residual, literally, what's left. We will denote the residual by $\hat{e}$, that is,

$$\hat{e} = Y - \hat{Y}.$$

Then we can check our model assumption by plotting $\hat{e}$ versus $\hat{Y}$. This is called the residual plot. A random scatter indicates a good model. If it is not a random scatter, then we need to rethink the model.
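As a rough illustration of a residual plot that is not a random scatter, the sketch below computes the (fitted value, residual) pairs for a small made-up data set in which Y is exactly the square of X. The data, the function name `residual_points`, and the line a = -12, b = 8 (the LS line for this toy data) are assumptions chosen for illustration; they are not the baseball data.

```python
def residual_points(xs, ys, a, b):
    """Return (fitted value, residual) pairs for the line a + b*x."""
    points = []
    for x, y in zip(xs, ys):
        fitted = a + b * x               # predicted value of Y at this x
        points.append((fitted, y - fitted))  # residual = observed - predicted
    return points


# Hypothetical toy data with a curved relationship: y = x**2.
xs = [1, 2, 3, 4, 5, 6, 7]
ys = [x ** 2 for x in xs]

# a = -12 and b = 8 are the LS estimates for this toy data set.
points = residual_points(xs, ys, a=-12, b=8)

print("fitted  residual")
for fitted, resid in points:
    print(f"{fitted:6.1f}  {resid:8.1f}")
```

The residuals start positive, dip below zero in the middle, and rise again at the end. A systematic U-shape like this, rather than a random scatter, suggests that a straight line is missing part of the relationship.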

For example, consider the LS fit of the original baseball data, (no outlier). The prediction equation is .

For each data point, we can find the predicted value and then the residual. We can then plot the residuals versus the fitted values to check our model assumption. For example, the first data point is (74,218). The predicted value is 193.26 pounds. Hence the residual is 218 - 193.26 = 24.74 pounds. So we underpredicted the weight of the first individual by 24.74 pounds. Hence one point on the residual plot is (193.26,24.74). Figure 1.7 contains the complete residual plot. Locate the point (193.26,24.74) on the plot. Determine the residual for the data point (76,200) and find it on the plot.

The residual plot is given by the regression module. Check the "Plot residuals vs predicted value" button if you wish the residual plot to be returned.

As a final example, consider the changed data set. The LS residual plot is given in Figure 1.8. Notice how the outlier stands out.

The Wilcoxon residual plot for the changed data set is given in Figure 1.9. Notice that the outlier stands out even further, at 120 compared to 100 on the LS plot. Again, the outlier draws the LS fit toward itself, which shortens the distance (residual) between the outlier and the fit.

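The LS estimates of slope and intercept are given by

$$\hat{b} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_{i=1}^{n} (X_i - \bar{X})^2}, \qquad \hat{a} = \bar{Y} - \hat{b}\,\bar{X},$$

where $\bar{X}$ and $\bar{Y}$ are the sample means of the X and Y values.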
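A minimal Python sketch of the standard closed-form LS estimates follows: the slope is the sum of cross-products of deviations from the means divided by the sum of squared X deviations, and the intercept is the mean of Y minus the slope times the mean of X. The function name `ls_fit` and the five data points are hypothetical, chosen only so the arithmetic is easy to follow by hand.

```python
def ls_fit(xs, ys):
    """Return the least squares (intercept, slope) for the line a + b*x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Sum of cross-products and sum of squares of deviations from the means.
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b = sxy / sxx          # slope
    a = y_bar - b * x_bar  # intercept
    return a, b


# Hypothetical toy data, not taken from the text.
xs = [1, 2, 3, 4, 5]
ys = [3, 5, 4, 8, 10]

a, b = ls_fit(xs, ys)
residuals = [y - (a + b * x) for x, y in zip(xs, ys)]

print(f"intercept a = {a:.2f}, slope b = {b:.2f}")
print("residuals:", [round(r, 2) for r in residuals])
```

For an LS fit with an intercept, the residuals always sum to zero (up to rounding), which makes a quick check on hand calculations.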

Exercise 2.9.1
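1. Let X be the length (cm) of a laboratory mouse and let Y be its weight (gm). Consider the data for X and Y given below.

| X (cm) | Y (gm) |
|--------|--------|
| 16     | 32     |
| 15     | 26     |
| 20     | 40     |
| 13     | 27     |
| 15     | 30     |
| 17     | 38     |
| 16     | 34     |
| 21     | 43     |
| 22     | 64     |
| 23     | 45     |
| 24     | 46     |
| 18     | 39     |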

Recall that you obtained an eyeball fit of these data in the last exercise. Use your fitted line (don't calculate!) to obtain the predicted value for each value of x. Then, by subtraction, obtain the residuals.
(a) Plot the residuals versus the fitted values. Comment on the plot.
(b) Obtain a stem-and-leaf plot of the residuals.
(c) Obtain the 5 basic descriptive statistics for the residuals. Are there any outliers?
2. Recall that you obtained the LS fit for the above data in the last problem set. Calculate the LS residual for Case 9 (x = 22, y = 64).
3. Recall that you obtained the Wilcoxon fit for the above data in the last problem set. Calculate the Wilcoxon residual for Case 9 (x = 22, y = 64). From which of the 3 fits, LS, Wilcoxon or eyeball, would you spot the outlier more readily? Why?


2001-01-01