Often we collect observations on several different variables for each subject. A simple example is a form, such as an application form, which is collected from a group of people. Each item on the form corresponds to a variable. For example, suppose it is a form that upperclassmen are filling out at a university. Items might include college GPA, major, ACT score, high school GPA, high school percentile, weight, height, gender, family income, etc. We may want to describe each variable separately using the descriptive statistics and plots that we have discussed, but often we also want to investigate the relationships between the variables.
For this example, we might be interested in the relationship between college GPA and high school GPA. In particular, we may want to predict college GPA in terms of high school GPA, high school percentile, ACT score, IQ, etc.
In this section, we shall consider a pair of variables. Other examples, besides those above, are:
We are interested in the relationship between X and Y. We may further be interested in predicting one variable in terms of the other. For this prediction problem we will always label the variables so that we are interested in predicting Y in terms of X.
As an example, we will consider Carrie's baseball data, which is found in Appendix A. The first and second columns of this data set contain the heights and weights of the baseball players. Let X denote the height of a ball player and Y denote his weight. As a first step, we plot the data with Y on the vertical axis and X on the horizontal axis. This is called a scatterplot of the data. For this data set the scatterplot is given in Figure 1.2.
Use the summary module to reproduce the above plot using Carrie's baseball data set.
To check that you understand the plot, find the point corresponding to the ball player who is 5'8" and weighs 175 pounds, and the point corresponding to the ball player who is 6'3" and weighs about 220 pounds. The relationship between weight and height is increasing: as height increases, weight tends to increase too.
Suppose we try to model the data. On the basis of the plot, a linear model is certainly worthy of a first try. Note that the model cannot be deterministic, because ball players who have the same height have many different weights (in fact, a sample of weights). For example, there are at least 7 ball players (some points overlap) who are 76" tall, with weights varying from about 175 pounds to 210 pounds. So the model has to allow for error; i.e., a model of the form:
Y = a + bX + error
When is a model good? We will discuss this important question in Part 2 of this section. For now we just want to fit the model; that is, obtain estimates of a and b. We will first consider a simple eyeball fit and then discuss more formal fits.
We have selected an easy eyeball method of fit. Pick two points on the plot so that the line passing through them gives a "fairly" good fit. Say the two points are (X1,Y1) and (X2,Y2). Then an estimate of the slope is
b = (Y2 - Y1)/(X2 - X1)
To estimate the intercept, simply take one of the points, say, (X1,Y1). Then estimate the intercept by solving the linear equation for a; that is
a = Y1 - bX1
Based on my first point, my estimate of the intercept is -336.8.
Thus we estimate a ball player of 0 height to weigh -336.8 pounds. Actually the newspapers often do this to make fun of scientists. But in this class you know the correct answer to such a farce. Right! The model is only good where we have data! We have no data around X=0, so we cannot predict there.
We thus have our prediction equation,
The scatterplot of the data superimposed with our eyeball fit is given in Figure 1.3. Note that you can obtain the predicted value for a given height, say 75", by drawing a vertical line starting at 75" on the horizontal axis and ending when it intersects our fitted line. Do this to determine the predicted weight of a ball player who is 72" tall.
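The two-point eyeball fit described above can be sketched in a few lines of Python. This is only an illustration: the function names are made up for this example, and the two points are hypothetical values chosen for easy arithmetic, not the points read off the baseball scatterplot.

```python
def two_point_fit(p1, p2):
    """Return (a, b) for the line Y = a + bX passing through p1 and p2."""
    (x1, y1), (x2, y2) = p1, p2
    if x1 == x2:
        raise ValueError("the two points must have different X values")
    b = (y2 - y1) / (x2 - x1)  # slope: rise over run between the two points
    a = y1 - b * x1            # intercept: solve Y1 = a + b*X1 for a
    return a, b


def predict(a, b, x):
    """Predicted Y at X = x for the fitted line Y = a + bX."""
    return a + b * x


# Hypothetical (height, weight) points, NOT taken from the baseball data.
a, b = two_point_fit((70, 170), (76, 200))
print(a, b)               # -180.0 5.0
print(predict(a, b, 72))  # 180.0
```

As in the text, the intercept here is only a device for writing down the line; a prediction is meaningful only for heights where there is data.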
We will present two more formal methods of fit. The first is the method of least squares, which we will often denote by LS. Consider again the data set consisting of the weights and heights of baseball players. For convenience the scatterplot of the data is given in Figure 1.4:
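Try eyeballing a fit of a straight line on this plot, say, Y = a + bX. Consider the point (77,190), the lowest point at height 77". It probably will not be on your fitted line, so in choosing your line you missed the point by the deviation
190 - (a + b(77))
Every data point has such a deviation from the chosen line. The LS fit is the line that minimizes the average of the squared deviations.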
The regression equation is
Weight = -213 + 5.49 Height
Hence LS estimates an increase of 5.49 pounds for every inch of increase in height. As an example in terms of prediction, the LS predicted weight of a ball player who is 75" tall is -213 + 5.49(75) = 198.75 pounds, and that of a ball player who is 70" tall is -213 + 5.49(70) = 171.30 pounds. Locate the points (75,198.75) and (70,171.30) on the above plot. Then draw the line determined by those two points. This is the LS fit. It should look like Figure 1.5:
Reproduce the above results using Carrie's baseball data set. Choose regression from the analysis menu after entering the data. Choose weight as a response variable and height as a predictor.
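For readers working outside the regression module, here is a minimal Python sketch of an LS fit using the standard closed-form formulas for a straight line (slope = Sxy/Sxx, intercept = mean of Y minus slope times mean of X). The function name `ls_fit` is an assumption made for this example and is not the module's interface. Because the baseball data in Appendix A is not reproduced here, the sketch fits the unlabeled X, Y values listed at the end of this section, used only as convenient sample input, and then checks the two predictions quoted above from the reported equation.

```python
def ls_fit(xs, ys):
    """Least squares fit of Y = a + bX; returns (a, b)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two equally long lists with at least 2 points")
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    if sxx == 0:
        raise ValueError("all X values are equal; the slope is undefined")
    b = sxy / sxx          # slope minimizing the sum of squared deviations
    a = y_bar - b * x_bar  # the LS line passes through (x_bar, y_bar)
    return a, b


# Sample input: the X, Y values listed at the end of this section.
x = [16, 15, 20, 13, 15, 17, 16, 21, 22, 23, 24, 18]
y = [32, 26, 40, 27, 30, 38, 34, 43, 64, 45, 46, 39]
a, b = ls_fit(x, y)
print(f"Y = {a:.2f} + {b:.2f} X")

# Predictions from the reported baseball equation Weight = -213 + 5.49 Height.
for height in (75, 70):
    print(height, round(-213 + 5.49 * height, 2))
```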
We will make frequent use of the LS fit in later chapters, but there is one problem with it: it is not robust. The LS fit is easily distorted by outliers. Let's look at this using the baseball data. Note that at height 68" there is one player whose weight is 175 pounds. Suppose the weight was recorded as 275 pounds. Although high, this weight is not inconceivable for a ball player. The LS fit of this changed data is:
The regression equation is
Weight = -88.2 + 3.82 Height
This is quite a change from the previous fit. In particular, the slope estimate has changed from 5.49 to 3.82, a difference of 1.67 pounds per inch. That is, because of one data point we now predict weight to increase 1.67 pounds less for each one-inch increase in height. We can also see the effect on the plot. See Figure 1.6.
Notice how the outlier pulled up the LS fit, resulting in a very poor fit to the bulk of the data. One data point drove the fit!
As an alternative to LS, we present the Wilcoxon fit. Recall that the LS fit minimizes the average squared deviation from the chosen line. An outlier will have a large deviation, and under the LS procedure its influence is made much greater by the squaring of this deviation. Because of the square, deviation times deviation, LS is weighting the large deviation by a large weight. The Wilcoxon, though, uses a much smaller weight in determining the chosen line. The Wilcoxon fit is less sensitive than the LS fit, at least for outliers in the Y-direction. For good data, with no outliers, the Wilcoxon fit is in close agreement with the LS fit. Thus the Wilcoxon fit is a robust fit.
The regression module gives the option of an LS fit or a Wilcoxon fit. The Wilcoxon fit of the good data results in:
Weight = -228 + 5.71 Height
Recall that the LS estimate of slope is 5.49 whereas the Wilcoxon estimate is 5.71, fairly close. The Wilcoxon fit can be used like the other fits for prediction. For instance, if a ball player is 75" tall then the Wilcoxon fit predicts a weight of
-228 + 5.71(75) = 200.25 pounds.
On the changed data, the Wilcoxon fit is
Weight = -225 + 5.67 Height
The change in slope estimates is very slight (0.04 pounds per inch). Unlike the LS fit, the Wilcoxon fit is not sensitive to the outlier.
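To see the contrast between the two fits in code, the following Python sketch computes a Wilcoxon-type slope and compares it with the LS slope on data containing one outlier. It is an illustration under stated assumptions, not the regression module's algorithm. It relies on the fact that, for a single predictor, minimizing the Wilcoxon rank dispersion of the residuals is equivalent to taking a weighted median of the pairwise slopes (Yj - Yi)/(Xj - Xi) with weights |Xj - Xi|. Estimating the intercept by the median of the residuals is one common choice and is an assumption here. The function names and the data are hypothetical: the weights lie exactly on the line Weight = -180 + 5 Height, and then one weight is raised by 100 pounds.

```python
from itertools import combinations
from statistics import median


def wilcoxon_fit(xs, ys):
    """Rank-based (Wilcoxon score) fit of Y = a + bX; returns (a, b)."""
    # Slope and weight |x2 - x1| for every pair of points with distinct X.
    pairs = [((y2 - y1) / (x2 - x1), abs(x2 - x1))
             for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2)
             if x1 != x2]
    if not pairs:
        raise ValueError("need at least two distinct X values")
    pairs.sort()
    half = sum(w for _, w in pairs) / 2
    running = 0.0
    for slope, w in pairs:
        running += w
        if running >= half:  # weighted median (lower one if not unique)
            b = slope
            break
    # Assumption: intercept taken as the median of the residuals Y - bX.
    a = median(y - b * x for x, y in zip(xs, ys))
    return a, b


def ls_fit(xs, ys):
    """Least squares fit of Y = a + bX, for comparison; returns (a, b)."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    b = sxy / sxx
    return y_bar - b * x_bar, b


# Hypothetical data, NOT the baseball data: exact line, then one outlier.
heights = list(range(68, 78))
weights = [-180 + 5 * h for h in heights]
changed = weights.copy()
changed[0] += 100  # the weight at 68" is recorded 100 pounds too high

print("LS, good data:         ", ls_fit(heights, weights))
print("LS, changed data:      ", ls_fit(heights, changed))
print("Wilcoxon, good data:   ", wilcoxon_fit(heights, weights))
print("Wilcoxon, changed data:", wilcoxon_fit(heights, changed))
```

On this artificial example the single changed weight moves the LS slope a long way, while the Wilcoxon slope stays at 5, which mirrors the behavior described above for the baseball data.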
 X   Y
16  32
15  26
20  40
13  27
15  30
17  38
16  34
21  43
22  64
23  45
24  46
18  39