
# Relationships Between Variables, Part 1: Linear Models

Often we collect observations on several different variables for each subject. A simple example is a form, such as an application form, which is collected from a group of people. Each item on the form corresponds to a variable. For example, suppose it is a form that upperclassmen are filling out at a university. Items might include college GPA, major, ACT score, high school GPA, high school percentile, weight, height, gender, family income, etc. We may want to describe each variable separately using the descriptive statistics and plots that we have discussed, but often we also want to investigate the relationships between the variables. For this example, we might be interested in the relationship between college GPA and high school GPA. In particular, we may want to predict college GPA in terms of high school GPA, high school percentile, ACT score, IQ, etc.

In this section, we shall consider a pair of variables. Other examples, besides those above, are:

1. X is the height of a person and Y is his/her weight.
2. X is the grade of a student on the first test and Y is his/her grade on the second test.
3. X is the points scored by an NFL team and Y is the team's win-loss percentage.
4. X is the points scored against an NFL team and Y is the team's win-loss percentage.

We are interested in the relationship between X and Y. We may further be interested in predicting one variable in terms of the other. For this prediction problem we will always label the variables so that we are interested in predicting Y in terms of X.

As an example, we will consider Carrie's baseball data, which is found in Appendix A. The first and second columns of this data set contain the heights and weights of the baseball players. Let X denote the height of a ball player and Y denote his weight. As a first step, we plot the data with Y on the vertical axis and X on the horizontal axis. This is called a scatterplot of the data. For this data set the scatterplot is given in Figure 1.2.

Use the summary module to reproduce the above plot using Carrie's baseball data set.

To check that you understand the plot, find the point on the plot corresponding to the ball player who is 5'8" tall and weighs 175 pounds, and the point corresponding to the ball player who is 6'3" tall and weighs about 220 pounds. The relationship between weight and height is increasing: as height increases, weight tends to increase too.

Suppose we try to model the data. On the basis of the plot, a linear model is certainly worth a first try. Note that the model cannot be deterministic, because ball players who have the same height have many different weights (in fact, a sample of weights). For example, there are at least 7 ball players (some points overlap) who are 76" tall, with weights varying from about 175 pounds to 210 pounds. So the model has to allow for error; i.e., a model of the form:

Y = a + bX + e

where X denotes the height of a ball player and Y denotes his weight. What are the other parts of the model statement? The variable e denotes random error; that is, if there were no error, Y would be a deterministic linear function of X. There are two parameters in the model. The most important parameter is the slope, b. It gives the expected change in weight for a 1-inch increase in height. The intercept, a, is the expected weight of a person who is 0" tall. This is absurd, so the intercept in this model has no practical meaning, but we need it to fix the line (there are infinitely many lines with the same slope). We want a model that fits the data well over the range of the X's, which in this case is between 68 and 78 inches. The model is only good where we have data.

When is a model good? We will discuss this important question in Part 2 of this section. For now we just want to fit the model; that is, obtain estimates of a and b. We will first consider a simple eyeball fit and then discuss more formal fits.

1. Eyeball Fit

We have selected an easy eyeball method of fit. Pick two points on the plot so that the line passing through them gives a "fairly" good fit. Say the two points are (X1,Y1) and (X2,Y2). Then an estimate of the slope is

b = (Y2 - Y1)/(X2 - X1)

For the baseball data, I chose the points (69,160) and (78,225). Hence my estimate of slope is

b = (225 - 160)/(78 - 69) = 65/9, or about 7.2

Thus I estimate 7.2 more pounds in weight for every inch in height.

To estimate the intercept, simply take one of the points, say, (X1,Y1). Then estimate the intercept by solving the linear equation for a; that is, a = Y1 - bX1. Based on my first point, my estimate is a = 160 - 7.2(69) = -336.8. Thus we estimate a ball player of 0 height to weigh -336.8 pounds. Newspapers often do this sort of thing to make fun of scientists. But in this class you know the correct answer to such a farce: the model is only good where we have data! We have no data around X=0, so we cannot predict there.

We thus have our prediction equation,

Predicted Weight = -336.8 + 7.2 Height

Suppose we want to predict the weight of a ball player who is 75" tall. Our prediction is -336.8 + 7.2(75) = 203.2 pounds.

The scatterplot of the data superimposed with our eyeball fit is given in Figure 1.3. Note that you can obtain the predicted value for a given height, say 75", by drawing a vertical line starting at 75" on the horizontal axis and ending when it intersects our fitted line. Do this to determine the predicted weight of a ball player who is 72" tall.
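The eyeball-fit arithmetic can be checked with a short Python sketch. The function name `two_point_fit` and the variable names are illustrative only; they are not part of the course software. Note that the text rounds the slope to 7.2 before computing the intercept, so the sketch does the same.

```python
def two_point_fit(p1, p2):
    """Return (intercept, slope) of the line through the points p1 and p2."""
    (x1, y1), (x2, y2) = p1, p2
    slope = (y2 - y1) / (x2 - x1)
    intercept = y1 - slope * x1
    return intercept, slope


# The two points chosen by eye in the text.
first_point = (69, 160)
second_point = (78, 225)

_, b = two_point_fit(first_point, second_point)

# Follow the text: round the slope to one decimal place first,
# then solve for the intercept using the first point.
b_rounded = round(b, 1)
a_rounded = round(first_point[1] - b_rounded * first_point[0], 1)

# Predicted weight of a ball player who is 75" tall.
predicted_75 = round(a_rounded + b_rounded * 75, 1)

print(f"slope = {b_rounded}, intercept = {a_rounded}")
print(f'predicted weight at 75" = {predicted_75} pounds')
```

Carrying the unrounded slope 65/9 through instead gives a slightly different intercept, which is one reason an eyeball fit should be read as a rough estimate.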

2. Least Squares Fit

We will present two formal methods. The first is the method of least squares, which we will often denote by LS. Consider again the data set consisting of the weights and heights of baseball players. For convenience, the scatterplot of the data is given again in Figure 1.4.

Try eyeballing a fit of a straight line on this plot, say, Y = a + bX. Consider the point (77,190), the lowest point at height 77". It probably will not be on your fitted line, so in choosing your line you missed the point by the deviation

190 - (a + b(77))

This deviation is an error determined by the fit. Since two points determine a line, in choosing your fit you will have committed many errors (at least 57, because there are 59 data points). As a goal in determining the fit, choose the line which minimizes these deviations or errors. It does not matter whether a deviation is positive or negative. The method of least squares minimizes the average of the squared deviations. It does result in equations for estimates of a and b, which we will give below. But at the moment let's just use it. The LS fit is:

The regression equation is
Weight = - 213 + 5.49 Height

Hence LS estimates an increase of 5.49 pounds for every inch of increase in height. As an example in terms of prediction, the LS predicted weight of a ball player who is 75" tall is -213 + 5.49(75) = 198.75 pounds. For a ball player who is 70" tall it is -213 + 5.49(70) = 171.30 pounds. Locate the points (75,198.75) and (70,171.30) on the above plot. Then draw the line determined by those two points. This is the LS fit. It should look like Figure 1.5.

Reproduce the above results using Carrie's baseball data set. Choose regression from the analysis menu after entering the data. Choose weight as a response variable and height as a predictor.
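For readers without the regression module, here is a minimal Python sketch of a least squares fit. It uses the standard closed-form LS estimates (slope equal to the sum of cross-deviations divided by the sum of squared X-deviations, and intercept equal to the mean of Y minus the slope times the mean of X); these may be written differently from the formulas given in class. Because Appendix A is not reproduced here, the sketch runs on the mouse data from Exercise 2.8.1, whose LS slope is quoted there as 2.405. All function and variable names are illustrative.

```python
def least_squares_fit(xs, ys):
    """Return (intercept, slope) minimizing the average squared deviation."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


def mean_squared_deviation(xs, ys, intercept, slope):
    """Average of the squared deviations Y - (a + bX) for a candidate line."""
    return sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys)) / len(xs)


# Mouse data from Exercise 2.8.1: length (cm) and weight (gm).
length_cm = [16, 15, 20, 13, 15, 17, 16, 21, 22, 23, 24, 18]
weight_gm = [32, 26, 40, 27, 30, 38, 34, 43, 64, 45, 46, 39]

a_ls, b_ls = least_squares_fit(length_cm, weight_gm)
print(f"LS fit: Weight = {a_ls:.2f} + {b_ls:.3f} Length")

# Any other line does worse on the LS criterion, e.g. a steeper one.
msd_ls = mean_squared_deviation(length_cm, weight_gm, a_ls, b_ls)
msd_other = mean_squared_deviation(length_cm, weight_gm, a_ls, b_ls + 0.5)
print(f"average squared deviation: LS = {msd_ls:.2f}, steeper line = {msd_other:.2f}")

# Prediction with the baseball LS fit reported in the text.
predicted_75 = -213 + 5.49 * 75
print(f'LS predicted weight at 75" = {predicted_75:.2f} pounds')
```

The comparison with the steeper line only illustrates the criterion; LS is defined as the line for which this average is smallest.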

We will make frequent use of the LS fit in later chapters, but there is one problem with it: it is not robust. The LS fit is easily distorted by outliers. Let's look at this using the baseball data. Note that at height 68" there is one player whose weight is 175 pounds. Suppose the weight was recorded as 275 pounds. Although high, this weight is not inconceivable for a ball player. The LS fit of this changed data is:

The regression equation is
Weight = - 88.2 + 3.82 Height

This is quite a change from the previous fit. In particular, the slope estimate has changed from 5.49 to 3.82, a difference of 1.67 pounds per inch. That is, because of one data point we now predict weight to increase 1.67 pounds less for each additional inch in height. We can also see the effect on the plot. See Figure 1.6.

Notice how the outlier pulled up the LS fit, resulting in a very poor fit to the bulk of the data. One data point drove the fit!
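The same effect can be reproduced on a small data set. The sketch below is only an illustration: it takes the mouse data from Exercise 2.8.1 and, mirroring the change from 175 to 275 pounds above, adds a hypothetical 100 gm to the weight of the shortest mouse. That altered value is an assumption made for the demonstration, not a recorded observation, and the function name is illustrative.

```python
def least_squares_fit(xs, ys):
    """Return (intercept, slope) of the least squares line."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Mouse data from Exercise 2.8.1: length (cm) and weight (gm).
length_cm = [16, 15, 20, 13, 15, 17, 16, 21, 22, 23, 24, 18]
weight_gm = [32, 26, 40, 27, 30, 38, 34, 43, 64, 45, 46, 39]

# Hypothetical recording error: the 13 cm mouse is entered as 127 gm, not 27 gm.
changed_weight_gm = list(weight_gm)
shortest = length_cm.index(min(length_cm))
changed_weight_gm[shortest] += 100

_, b_original = least_squares_fit(length_cm, weight_gm)
_, b_changed = least_squares_fit(length_cm, changed_weight_gm)

print(f"LS slope, original data: {b_original:.3f}")
print(f"LS slope, one value changed: {b_changed:.3f}")
```

With only 12 observations, a single bad value at the edge of the X range is enough to reverse the sign of the LS slope.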

3. Wilcoxon Fit

As an alternative to LS, we present the Wilcoxon fit. Recall that the LS fit minimizes the average squared deviation from the chosen line. An outlier will have a large deviation, and under the LS procedure its influence is made much greater by the squaring of this deviation. Because of the square (deviation times deviation), LS is weighting the large deviation by a large weight. The Wilcoxon fit, though, uses a much smaller weight in determining the chosen line. The Wilcoxon fit is less sensitive than the LS fit, at least to outliers in the Y-direction. For good data (no outliers), the Wilcoxon fit is in close agreement with the LS fit. Thus the Wilcoxon fit is a robust fit.

The regression module gives the option of an LS fit or a Wilcoxon fit. The Wilcoxon fit of the good data results in:

    Weight = -228 + 5.71 Height


Recall that the LS estimate of slope is 5.49 whereas the Wilcoxon estimate is 5.71, fairly close. The Wilcoxon fit can be used like the other fits for prediction. For instance, if a ball player is 75" tall then the Wilcoxon fit predicts a weight of -228 + 5.71(75) = 200.25 pounds.

On the changed data, the Wilcoxon fit is

    Weight = -225 + 5.67 Height

The change in slope estimates is very slight (about 0.04 pounds per inch, from 5.71 to 5.67). Unlike the LS fit, the Wilcoxon fit is not sensitive to the outlier.
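The text does not say how the regression module computes the Wilcoxon fit, so the following Python sketch rests on an assumption: it uses one standard formulation of the rank-based fit with Wilcoxon scores for a single predictor. In that formulation the slope minimizes the sum, over all pairs of points, of the absolute differences between their deviations, which makes it a weighted median of the pairwise slopes with weights |X2 - X1|; the intercept is then taken as the median of Y - bX. The heights and weights below are hypothetical values placed exactly on a line, with one weight then raised by 100 pounds; they are not Carrie's baseball data.

```python
from itertools import combinations
from statistics import median


def weighted_median(values, weights):
    """Smallest value at which the running weight reaches half the total."""
    half = sum(weights) / 2
    running = 0.0
    for value, weight in sorted(zip(values, weights)):
        running += weight
        if running >= half:
            return value
    raise ValueError("need at least one value")


def wilcoxon_fit(xs, ys):
    """Return (intercept, slope) of a rank-based fit (assumed formulation)."""
    slopes, weights = [], []
    for (x1, y1), (x2, y2) in combinations(zip(xs, ys), 2):
        if x1 != x2:  # pairs with equal X have no slope and zero weight
            slopes.append((y2 - y1) / (x2 - x1))
            weights.append(abs(x2 - x1))
    slope = weighted_median(slopes, weights)
    intercept = median(y - slope * x for x, y in zip(xs, ys))
    return intercept, slope


# Hypothetical "good" data lying exactly on Weight = -180 + 5 Height.
heights = list(range(68, 78))
weights_good = [-180 + 5 * h for h in heights]

# Same data with the weight at 68" recorded 100 pounds too high.
weights_changed = list(weights_good)
weights_changed[0] += 100

a_good, b_good = wilcoxon_fit(heights, weights_good)
a_changed, b_changed = wilcoxon_fit(heights, weights_changed)

print(f"good data:    Weight = {a_good:.1f} + {b_good:.2f} Height")
print(f"changed data: Weight = {a_changed:.1f} + {b_changed:.2f} Height")
```

On these artificial data the single outlier leaves the fitted line unchanged. On real data, such as the baseball example above, the estimates move a little rather than not at all.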

Exercise 2.8.1
1. Let X be the length (cm) of a laboratory mouse and let Y be its weight (gm). Consider the data for X and Y given below. Obtain a scatterplot of the data and comment on the plot.

     X    Y
    16   32
    15   26
    20   40
    13   27
    15   30
    17   38
    16   34
    21   43
    22   64
    23   45
    24   46
    18   39
18   39

2. For the data set in Problem #1, eyeball a linear fit, obtaining an estimate of the slope and the intercept.
   (a)
   (b) Use your plotted fit to predict the weight of a mouse that is 20 cm long.
   (c) Use your prediction equation to predict the weight of a mouse that is 25 cm long.
   (d) What does the estimate of slope mean in terms of the problem?
   (e) What does the estimate of intercept mean in terms of the problem?
3. Use the formulas given in class to determine the LS fit for the data given in Problem #1. (ANS: LS slope is 2.405.)
4.
5. Compare the LS fit with your eyeball fit. Which is a better fit? Why?
6. Use the LS prediction equation to predict the weight of a mouse that is 25 cm long.
7. What does the estimate of slope mean in terms of the problem?
8. Use the regression module to scatterplot the data and obtain the LS and Wilcoxon fits. Write the Wilcoxon fit down.
   (a) Plot the Wilcoxon fit on your plot in #1.
   (b) Compare the Wilcoxon and the LS fits. Which is a better fit? Why?
   (c) Use the Wilcoxon prediction equation to predict the weight of a mouse that is 25 cm long.
   (d) What does the estimate of slope mean in terms of the problem?
9. Consider the heights and weights of the baseball players in Carrie's baseball data (Appendix A). Obtain the scatterplot of height versus weight, the LS fit, and the Wilcoxon fit.


2001-01-01