In observational studies, cause and effect are hard (often impossible)
to establish. But associations and predictabilities among variables can
be investigated. Such associations and predictabilities may be further
studied in a lab setting.
Here's a simple example. Let Y be the weight of a baseball player and let X be the player's height. Recall the scatter plot:
[Scatter plot: Weight (vertical axis, ticks at 175, 200, 225) versus Height (horizontal axis, 68.0 to 78.0)]
Here is the data set:
Height (X)
74 75 77 73 69 73 78 76 77 78 76 72 73
73 74 75 72 75 76 76 72 76 68 73 69 76
77 74 75 73 79 72 75 70 75 78 73 75 74
71 73 76 73 75 73 73 74 72 73 71 71 71
73 74 76 71 76 71 70
Weight (Y). These data are paired with the heights in order: the 74 goes with the 218, and so on.
218 185 219 185 160 222 225 205 230 225 190
180 185 200 195 195 185 190 200 180 175 195
175 185 160 211 190 195 200 207 232 190 200
175 200 220 195 205 185 185 210 210 195 205
175 190 185 190 210 195 166 185 160 170 185
155 190 160 155
If we enter this data set into the data box and choose regression, we get the prediction equation (Wilcoxon):
To make better predictions, there may be other variables to consider.
In the height-weight data, a measure of body build would be useful. In
a more advanced class, we would discuss these issues.
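The height-weight fit can also be sketched in code. This is only a minimal sketch: it uses ordinary least squares, not the Wilcoxon fit mentioned above, so its coefficients will generally differ somewhat from the Wilcoxon prediction equation. The names `heights`, `weights`, and `least_squares_line` are made up for this illustration.

```python
# Heights (X) and weights (Y), copied in order from the data set above.
heights = [74, 75, 77, 73, 69, 73, 78, 76, 77, 78, 76, 72, 73,
           73, 74, 75, 72, 75, 76, 76, 72, 76, 68, 73, 69, 76,
           77, 74, 75, 73, 79, 72, 75, 70, 75, 78, 73, 75, 74,
           71, 73, 76, 73, 75, 73, 73, 74, 72, 73, 71, 71, 71,
           73, 74, 76, 71, 76, 71, 70]
weights = [218, 185, 219, 185, 160, 222, 225, 205, 230, 225, 190,
           180, 185, 200, 195, 195, 185, 190, 200, 180, 175, 195,
           175, 185, 160, 211, 190, 195, 200, 207, 232, 190, 200,
           175, 200, 220, 195, 205, 185, 185, 210, 210, 195, 205,
           175, 190, 185, 190, 210, 195, 166, 185, 160, 170, 185,
           155, 190, 160, 155]


def least_squares_line(x, y):
    """Return (intercept, slope) of the least-squares line y = a + b*x.

    Note: this is ordinary least squares, an assumption of this sketch,
    not the Wilcoxon fit used in the text.
    """
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sum of cross-products and sum of squares about the means.
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope


intercept, slope = least_squares_line(heights, weights)
print(f"predicted weight = {intercept:.1f} + {slope:.2f} * height")
```

To predict the weight of a player of a given height, plug the height into the printed equation. Predictions are only sensible for heights within the range of the data.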
We do need to emphasize one thing concerning observational studies: there must be a reason to explore associations and predictions. An example here is worth thousands of words. Let Y be the number of deaths per 100,000 in England for a year in the late 1800s and let X be the number of church weddings (in thousands) in England for that year. There is no reason to seek an association between these variables. But suppose we do. The data are given in Appendix A. The scatter plot of the data is:
[Scatter plot: deaths (vertical axis, ticks at 15.0, 18.0, 21.0) versus ch_weds (horizontal axis, 600 to 760)]
The relationship is linear. In fact the pattern is quite tight. It is clear
from the plot: to reduce deaths, reduce church weddings! That conclusion is
absurd, of course. A third variable is producing this pattern. It is time!
These data were recorded over many years. Here is a plot of the death rate
versus year:
[Scatter plot: deaths (vertical axis, ticks at 15.0, 18.0, 21.0) versus year (horizontal axis, 1860 to 1910)]
Great strides in science were made in these years (Louis Pasteur, etc.) that
helped the death rate to plummet. Here is a plot of church weddings versus
year:
[Scatter plot: ch_weds (vertical axis, ticks at 560, 630, 700, 770) versus year (horizontal axis, 1860 to 1910)]
Church weddings also dropped over these years. Hence both variables decrease
with respect to year, and so they show an increasing relationship when
plotted against each other. That solves the puzzle. Time is called a lurking
variable here.
In an observational study, make sure that a relationship between the variables you include makes sense. If a paradox occurs (such as the one between death rate and church wedding rate), look for a lurking variable.
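The effect of a lurking variable can be sketched with a small simulation. The numbers below are synthetic, generated only for illustration; they are not the Appendix A data. The sketch assumes two series that each decline linearly with year, plus independent noise. It then compares their raw correlation with the correlation left after the linear time trend is removed from each series. The names `correlation`, `residuals`, `raw_r`, and `adjusted_r` are made up for this illustration.

```python
import random


def correlation(x, y):
    """Pearson correlation coefficient of two equal-length lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / (sxx * syy) ** 0.5


def residuals(x, y):
    """Residuals of y after removing its least-squares linear trend in x."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    slope = (sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
             / sum((a - mean_x) ** 2 for a in x))
    return [b - (mean_y + slope * (a - mean_x)) for a, b in zip(x, y)]


# SYNTHETIC data (an assumption of this sketch, not the real records):
# both series fall with year and have independent random noise.
rng = random.Random(0)
years = list(range(1866, 1912))
deaths = [22.0 - 0.18 * (yr - 1866) + rng.gauss(0, 0.5) for yr in years]
weddings = [780.0 - 4.0 * (yr - 1866) + rng.gauss(0, 15.0) for yr in years]

# Raw association between the two series.
raw_r = correlation(weddings, deaths)

# Association after taking the time trend out of each series.
adjusted_r = correlation(residuals(years, weddings), residuals(years, deaths))

print(f"raw correlation:             {raw_r:.2f}")
print(f"correlation with year removed: {adjusted_r:.2f}")
```

In this setup the raw correlation is strong because both series follow the same downward trend, while the correlation after removing the trend is much weaker. Real data may behave less cleanly, since trends need not be linear.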
Age    1     2     2     3     3     4     6     7     8     10
Price  2.45  1.80  2.00  2.00  1.70  1.20  1.15  0.69  0.60  0.47
Year   50    51    52    53    54    55    56    57    58    59    60    61
Calls  0.44  0.47  0.47  0.59  0.66  0.73  0.81  0.88  1.06  1.20  1.35  1.49

Year   62    63    64     65     66     67     68     69     70    71    72    73
Calls  1.61  2.12  11.90  12.40  14.20  15.90  18.20  21.20  4.30  2.40  2.70  2.90