Suppose the slope of the least squares fit were 1. Then, since the line goes
through $(\bar{x}, \bar{y})$, the average heights of the fathers and of the
sons, which are about the same, you would predict the height of the first
son to be the same as the height of the father. Galton noticed, though, that
the slope of the line was definitely (significantly) less than 1. Hence for
fathers whose heights were taller than the average, the line predicts the
son to be shorter than the father. Likewise, for fathers whose heights were
shorter than the average, the line predicts the son to be taller than the
father. That is, taller fathers tend to have sons shorter than themselves
and shorter fathers tend to have sons taller than themselves. There is a
regression towards the mean effect. That's how regression got its name.
Actually it is a good thing that this phenomenon occurs. Why?
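The argument about the slope can be checked numerically. The sketch below is only an illustration: the common average height (68 inches) and the slope (0.5) are hypothetical values chosen for the example, not Galton's numbers, and `predict_son` is an illustrative helper name.

```python
# Hypothetical values for illustration only; they are not Galton's numbers.
MEAN_HEIGHT = 68.0  # assumed common average height (inches) of fathers and sons
SLOPE = 0.5         # assumed least squares slope, chosen to be less than 1


def predict_son(father_height, slope=SLOPE,
                mean_father=MEAN_HEIGHT, mean_son=MEAN_HEIGHT):
    """Prediction from the line with the given slope through (mean_father, mean_son)."""
    return mean_son + slope * (father_height - mean_father)


for father in (64.0, 68.0, 72.0):
    son = predict_son(father)
    print(f"father {father:.1f} in. -> predicted son {son:.1f} in.")
```

With these assumed values, a 72-inch father gets a predicted son of 70 inches (shorter than the father, but still taller than average), and a 64-inch father gets a predicted son of 66 inches. Passing `slope=1.0` returns the father's own height, which is the "slope of 1" case described above.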
Does regression towards the mean occur for other data sets? It does. Suppose we have observational data (X's and Y's both random). Suppose the data follow the linear model
Here's an example with real data. The data consist of the scores 36 students made on two tests in their statistics course. These were hour exams (over 20 questions). Test 1 was the first test and Test 2 was taken about a month later. So we want to predict Test 2 scores in terms of Test 1. Here are the data:
Test 1
12 17 16 18 12 12 12 20 18 18 11 13 15
16 20 13 15 11 9 12 17 12 16 15 19 13
13 16 18 12 12 15 11 14 12 13
Test 2 (these data are paired, in the same order, with Test 1)
14 14 19 17 12 14 13 17 14 19 12 16 16
19 15 14 11 13 14 17 9 12 13 12 20 18
17 14 12 9 12 19 10 13 17 14
As I noted, these are paired data. The first student scored 12 on his first
test and 14 on his second test. Here is a scatter plot of the data:
[Scatter plot: Test 2 score (C10, vertical axis, 8.0 to 20.0) versus Test 1
score (C9, horizontal axis, 10.0 to 20.0). Each * marks one student; a digit
marks that many students at the same point.]
The averages are 14.4 and 14.5 for Tests 1 and 2, respectively. The standard
deviations are 2.86 and 2.93 for Tests 1 and 2, respectively. The least
squares fit, computed from the data listed above, is approximately

    predicted Test 2 = 9.43 + 0.35 (Test 1)
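The summary statistics and the fit can be reproduced from the paired scores listed above. The following is a minimal sketch in plain Python. The helper names `mean`, `sample_sd`, and `least_squares` are illustrative, and the sketch assumes the quoted standard deviations are sample standard deviations (divisor n - 1).

```python
import math

# Scores copied from the two listings above, in the same (paired) order.
test1 = [12, 17, 16, 18, 12, 12, 12, 20, 18, 18, 11, 13, 15,
         16, 20, 13, 15, 11, 9, 12, 17, 12, 16, 15, 19, 13,
         13, 16, 18, 12, 12, 15, 11, 14, 12, 13]
test2 = [14, 14, 19, 17, 12, 14, 13, 17, 14, 19, 12, 16, 16,
         19, 15, 14, 11, 13, 14, 17, 9, 12, 13, 12, 20, 18,
         17, 14, 12, 9, 12, 19, 10, 13, 17, 14]


def mean(values):
    return sum(values) / len(values)


def sample_sd(values):
    """Sample standard deviation (assumed divisor: n - 1)."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))


def least_squares(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    mx, my = mean(x), mean(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    intercept = my - slope * mx  # forces the line through (mean x, mean y)
    return intercept, slope


intercept, slope = least_squares(test1, test2)
print(f"means: {mean(test1):.1f}, {mean(test2):.1f}")
print(f"standard deviations: {sample_sd(test1):.2f}, {sample_sd(test2):.2f}")
print(f"fit: predicted Test 2 = {intercept:.2f} + {slope:.2f} * Test 1")
```

Because the intercept is defined as the mean of Test 2 minus the slope times the mean of Test 1, the fitted line passes through the point of averages by construction.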
You can see regression towards the mean in the data too. Note that two
students scored 20 on the first test. Both scored less than 20 on the second
test. Note the four students who scored 18 on the first test. Three of these
scored less than 18 on the second test while one scored higher. Likewise,
notice the five students who scored 13 on Test 1. They all scored higher on
the second test.
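The counts just described can be tabulated for every Test 1 score at once. This is a small sketch in plain Python; the function name `compare_by_first_score` and the labels "lower", "same", and "higher" are illustrative choices.

```python
from collections import defaultdict

# Scores copied from the two listings above, in the same (paired) order.
test1 = [12, 17, 16, 18, 12, 12, 12, 20, 18, 18, 11, 13, 15,
         16, 20, 13, 15, 11, 9, 12, 17, 12, 16, 15, 19, 13,
         13, 16, 18, 12, 12, 15, 11, 14, 12, 13]
test2 = [14, 14, 19, 17, 12, 14, 13, 17, 14, 19, 12, 16, 16,
         19, 15, 14, 11, 13, 14, 17, 9, 12, 13, 12, 20, 18,
         17, 14, 12, 9, 12, 19, 10, 13, 17, 14]


def compare_by_first_score(first, second):
    """For each Test 1 score, count students whose Test 2 score was
    lower than, the same as, or higher than their Test 1 score."""
    table = defaultdict(lambda: {"lower": 0, "same": 0, "higher": 0})
    for a, b in zip(first, second):
        if b < a:
            table[a]["lower"] += 1
        elif b > a:
            table[a]["higher"] += 1
        else:
            table[a]["same"] += 1
    return dict(table)


table = compare_by_first_score(test1, test2)
for score in sorted(table):
    counts = table[score]
    print(f"Test 1 = {score:2d}: lower {counts['lower']}, "
          f"same {counts['same']}, higher {counts['higher']}")
```

Reading the rows for 20, 18, and 13 reproduces the three observations in the paragraph above.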
As a final thought on regression towards the mean, the plot below shows the least squares fit contrasted with the line through points where the second coordinate is the same as the first coordinate (i.e., scores on second test exactly the same as on first).
[Plot: the least squares fit and the line Test 2 = Test 1, drawn on the same
axes as the scatter plot (Test 1 from 10.0 to 20.0 on the horizontal axis,
Test 2 from 8.0 to 20.0 on the vertical axis).]
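The contrast between the two lines can also be put in numbers. A line with intercept a and slope b meets the line Test 2 = Test 1 where a + b x = x, that is, at x = a / (1 - b). The sketch below, in plain Python with illustrative helper names (`least_squares`, `predict`), refits the data listed earlier and compares the fitted prediction with the "same score" line at a few Test 1 scores.

```python
# Scores copied from the first data set above, in the same (paired) order.
test1 = [12, 17, 16, 18, 12, 12, 12, 20, 18, 18, 11, 13, 15,
         16, 20, 13, 15, 11, 9, 12, 17, 12, 16, 15, 19, 13,
         13, 16, 18, 12, 12, 15, 11, 14, 12, 13]
test2 = [14, 14, 19, 17, 12, 14, 13, 17, 14, 19, 12, 16, 16,
         19, 15, 14, 11, 13, 14, 17, 9, 12, 13, 12, 20, 18,
         17, 14, 12, 9, 12, 19, 10, 13, 17, 14]


def least_squares(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


intercept, slope = least_squares(test1, test2)


def predict(first_score):
    """Fitted Test 2 score for a given Test 1 score."""
    return intercept + slope * first_score


# The fit meets the line "Test 2 = Test 1" where intercept + slope * x = x.
crossing = intercept / (1.0 - slope)
print(f"the two lines cross near a Test 1 score of {crossing:.1f}")

for first_score in (10, 12, 14, 16, 18, 20):
    predicted = predict(first_score)
    # "higher" if the fit lies above the same-score line here, else "lower"
    direction = "higher" if predicted > first_score else "lower"
    print(f"Test 1 = {first_score}: fit predicts {predicted:.1f} "
          f"({direction} than Test 1)")
```

Below the crossing point the fit lies above the same-score line, and above the crossing point it lies below it, which is the regression towards the mean pattern.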
The least squares estimate of the slope for these data is .35, which is much
less than 1. Sketch the fit (show that it goes through (14.4, 14.5)).
Finally, to contrast with the regression towards the mean effect seen above,
here is a second set of paired test scores:
Test 1
37 17 23 40 37 39 35 29 32 40 26 39 34
29 38 21 36 38 14 27 34 38 25 18 39 37
36 12 34 26
Test 2 (Paired data, first student scored 37, 28 on tests 1 and 2 respectively)
28 24 20 32 39 36 40 33 23 36 21 30 30
21 22 24 27 20 8 31 28 30 25 16 31 18
25 6 36 20
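The second data set can be summarized the same way as the first, so that its averages and slope can be set beside 14.4, 14.5, and .35. This is a minimal sketch in plain Python; the variable names `first` and `second` and the helper `least_squares` are illustrative.

```python
# Scores copied from the second data set above, in the same (paired) order.
first = [37, 17, 23, 40, 37, 39, 35, 29, 32, 40, 26, 39, 34,
         29, 38, 21, 36, 38, 14, 27, 34, 38, 25, 18, 39, 37,
         36, 12, 34, 26]
second = [28, 24, 20, 32, 39, 36, 40, 33, 23, 36, 21, 30, 30,
          21, 22, 24, 27, 20, 8, 31, 28, 30, 25, 16, 31, 18,
          25, 6, 36, 20]


def least_squares(x, y):
    """Return (intercept, slope) of the least squares line of y on x."""
    mx = sum(x) / len(x)
    my = sum(y) / len(y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    slope = sxy / sxx
    return my - slope * mx, slope


mean_first = sum(first) / len(first)
mean_second = sum(second) / len(second)
intercept, slope = least_squares(first, second)

print(f"number of students: {len(first)}")
print(f"averages: Test 1 = {mean_first:.1f}, Test 2 = {mean_second:.1f}")
print(f"fit: predicted Test 2 = {intercept:.2f} + {slope:.2f} * Test 1")
```

Unlike the first data set, the two averages here are not about the same, so the fitted line should be compared with the point of averages as well as with the slope of 1.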