Next: Data Sets Up: Regression : Second Pass Previous: Observational Studies

How Regression Got Its Name

In the mid to late 1800s, a scientist named Galton was working with large observational studies on humans. One of these data sets consisted of the heights of fathers and their first sons. In our terminology, let X be the height of the father and Y the height of his first son, and assume we want to predict Y in terms of X. When Galton plotted Y versus X, the scatter filled in a large oval. The trend was linear and increasing. The least squares fit went through the center of the data ($\bar{X}$, $\bar{Y}$), as it always does, with positive slope. For these data $\bar{X}$ is about the same as $\bar{Y}$; i.e., the average heights were about the same over these two adjacent generations (this was certainly true in the 1800s). The same was true of the scale (standard deviations).

Suppose the slope of the least squares fit were 1. Then, since the line goes through ($\bar{X}$, $\bar{Y}$) and these two averages are about the same, you would predict the height of the first son to be the same as the height of the father. Galton noticed, though, that the slope of the line was definitely (significantly) less than 1. Hence for fathers whose heights were above the average, the line predicts the son to be shorter than the father. Likewise, for fathers whose heights were below the average, the line predicts the son to be taller than the father. That is, taller-than-average fathers tend to have sons shorter than themselves, and shorter-than-average fathers tend to have sons taller than themselves. There is a regression towards the mean effect. That's how regression got its name. Actually it is a good thing that this phenomenon occurs. Why?
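The argument above can be written as a short worked equation. This is only a sketch: b denotes the least squares slope, $\hat{Y}$ the predicted height of the son, and m is a symbol introduced here (not in the original notes) for the common value of the two averages.

$$\hat{Y} = \bar{Y} + b\,(X - \bar{X}) \quad\Longrightarrow\quad \hat{Y} - m \approx b\,(X - m) \quad \text{when } \bar{X} \approx \bar{Y} \approx m.$$

So when 0 < b < 1, the predicted son's height lies on the same side of the average as the father's height, but only the fraction b as far from it.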

Does regression towards the mean occur for other data sets? It does. Suppose we have observational data (X's and Y's both random). Suppose the data follow the linear model

$Y = a + bX + e$,

where the errors e are independent of the X's. Now suppose that the variance of Y is the same as the variance of X. (This is the key assumption.) Then we can show that the absolute value of b is less than 1. Hence if b > 0 then the model exhibits regression towards the mean.
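A sketch of why the slope must be less than 1 in absolute value, writing the model as $Y = a + bX + e$. The symbols a (intercept), e (error), and $\sigma^2$ (the common variance) are notation assumed here, and the argument also assumes the errors have positive variance. Because e is independent of X, the variances add:

$$\mathrm{Var}(Y) = b^2\,\mathrm{Var}(X) + \mathrm{Var}(e).$$

Setting $\mathrm{Var}(Y) = \mathrm{Var}(X) = \sigma^2 > 0$ gives

$$b^2 = 1 - \frac{\mathrm{Var}(e)}{\sigma^2} < 1 \quad\text{whenever } \mathrm{Var}(e) > 0,$$

so $|b| < 1$.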

Here's an example with real data. The data consist of the scores 36 students made on two tests in their statistics course. These were hour exams (over 20 questions). Test 1 was the first test and Test 2 was taken about a month later. So we want to predict Test 2 scores in terms of Test 1. Here are the data:

Test 1

    12    17    16    18    12    12    12    20    18    18    11    13    15
    16    20    13    15    11     9    12    17    12    16    15    19    13
    13    16    18    12    12    15    11    14    12    13

Test 2 (these data are paired, in the same order, with Test 1)

    14    14    19    17    12    14    13    17    14    19    12    16    16
    19    15    14    11    13    14    17     9    12    13    12    20    18
    17    14    12     9    12    19    10    13    17    14
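As noted, these are paired data. The first student scored 12 on the first test and 14 on the second test. Here is a scatter plot of the data, with Test 2 (labeled C10) on the vertical axis and Test 1 (labeled C9) on the horizontal axis; a digit marks that many coincident points: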
          -
      20.0+                                                    *
          -                                *    2         *
  C10     -
          -                      *
          -                 2    *                        *         *
      16.0+                      *         *
          -                                                         *
          -  *              2    2              *    *    *
          -
          -            *    *         *         *
      12.0+            *    3              *              *
          -                                *
          -
          -            *
          -                 *                        *
       8.0+
            ------+---------+---------+---------+---------+---------+C9
               10.0      12.0      14.0      16.0      18.0      20.0
The averages are 14.4 and 14.5 for Tests 1 and 2, respectively, and the standard deviations are 2.86 and 2.93. The slope of the least squares fit is less than 1, which is not surprising since the standard deviations are about the same. Hence this data set exhibits regression towards the mean.
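The summary statistics and the least squares fit can be checked with a few lines of Python. This is a minimal sketch using only the standard library, not the course's own modules; the names `test1`, `test2`, and `least_squares` are chosen here for illustration.

```python
from statistics import mean, stdev

# Paired scores, in the order listed above.
test1 = [12, 17, 16, 18, 12, 12, 12, 20, 18, 18, 11, 13, 15,
         16, 20, 13, 15, 11, 9, 12, 17, 12, 16, 15, 19, 13,
         13, 16, 18, 12, 12, 15, 11, 14, 12, 13]
test2 = [14, 14, 19, 17, 12, 14, 13, 17, 14, 19, 12, 16, 16,
         19, 15, 14, 11, 13, 14, 17, 9, 12, 13, 12, 20, 18,
         17, 14, 12, 9, 12, 19, 10, 13, 17, 14]


def least_squares(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    xbar, ybar = mean(x), mean(y)
    sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
    sxx = sum((xi - xbar) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = ybar - slope * xbar  # forces the line through (xbar, ybar)
    return intercept, slope


intercept, slope = least_squares(test1, test2)

print("means:", round(mean(test1), 1), round(mean(test2), 1))
print("standard deviations:", round(stdev(test1), 2), round(stdev(test2), 2))
print("slope:", round(slope, 2))
```

The intercept is computed from the two averages, so the fitted line passes through the center of the data by construction.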

You can see it in the data too. Note that two students scored 20 on the first test. Both scored less than 20 on the second test. Note the four students who scored 18 on the first test. Three of these scored less than 18 on the second test while one scored higher. Likewise, notice the five students who scored 13 on Test 1. They all scored higher on the second test.
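These counts can be verified directly from the paired data. The sketch below is illustrative only; the helper name `compare_at` is hypothetical and not part of any course module.

```python
# Paired scores, in the order listed above.
test1 = [12, 17, 16, 18, 12, 12, 12, 20, 18, 18, 11, 13, 15,
         16, 20, 13, 15, 11, 9, 12, 17, 12, 16, 15, 19, 13,
         13, 16, 18, 12, 12, 15, 11, 14, 12, 13]
test2 = [14, 14, 19, 17, 12, 14, 13, 17, 14, 19, 12, 16, 16,
         19, 15, 14, 11, 13, 14, 17, 9, 12, 13, 12, 20, 18,
         17, 14, 12, 9, 12, 19, 10, 13, 17, 14]


def compare_at(score):
    """Count students with this Test 1 score whose Test 2 score was
    lower, the same, or higher. Returns (lower, same, higher)."""
    seconds = [y for x, y in zip(test1, test2) if x == score]
    lower = sum(y < score for y in seconds)
    same = sum(y == score for y in seconds)
    higher = sum(y > score for y in seconds)
    return lower, same, higher


for score in (20, 18, 13):
    print(score, compare_at(score))
# 20 (2, 0, 0)
# 18 (3, 0, 1)
# 13 (0, 0, 5)
```

Scores well above the Test 1 average (20 and 18) mostly drop on Test 2, while the scores of 13, which are below the average, all rise.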

As a final thought on regression towards the mean, the plot below shows the least squares fit contrasted with the line through points where the second coordinate is the same as the first coordinate (i.e., scores on second test exactly the same as on first).

          -
      20.0+                                                         2
          -                                                    A
          -
          -                                               4
          -                                          2              2
      16.0+                                     4         4    B
          -                                4    4    2
          -                           2    4
          -            3    9    5
          -  B                   5
      12.0+                 9
          -            3
          -
          -
          -  A
       8.0+
            ------+---------+---------+---------+---------+---------+
               10.0      12.0      14.0      16.0      18.0      20.0
In the plot, the line marked A is the line of equal scores and the line marked B is the least squares fit. The least squares estimate of the slope is .35, which is much less than 1. Sketch the fit and show that it goes through (14.4, 14.5).


Exercise 12.5.1  
1.
Use the regression module to obtain the Wilcoxon fit for the above data.
2.
The scores below are test scores for students in a Stat class over two tests, Test 1 and Test 2.
  Test 1
    37  17  23  40  37  39  35  29  32  40  26  39  34 
    29  38  21  36  38  14  27  34  38  25  18  39  37 
    36  12  34  26 
             
  Test 2  (Paired data, first student scored 37, 28 on tests 1 and 2 respectively)

    28  24  20  32  39  36  40  33  23  36  21  30  30 
    21  22  24  27  20   8  31  28  30  25  16  31  18 
    25   6  36  20
(a)
Plot the data, Test 2 versus Test 1.
(b)
Use the summary module to find the standard deviations of the two data sets. Do you think these data will exhibit the regression towards the mean effect?
(c)
Use the regression module to obtain the Wilcoxon fit. Do the data exhibit the regression towards the mean effect?



2001-01-01