Next: How Regression Got Its Up: Regression : Second Pass Previous: Regression Experimental Designs

Observational Studies

Observational studies are performed to study the relationships among variables. Unlike controlled experimental designs, where only certain variables are allowed to vary (at prespecified levels), in observational studies the variables are simply observed and recorded. Often some of the variables are controlled as much as possible. Consider a long-term study of a drug in humans, where one variable that needs to be controlled is diet. Diet guidelines are set, but some of the subjects will probably break them from time to time (or maybe often). Contrast this with a lab setting, where the diet of animals can be controlled.

In observational studies, cause and effect are hard (often impossible) to establish. But associations and predictabilities among variables can be investigated. Such associations and predictabilities may be further studied in a lab setting.

Here's a simple example. Let Y be the weight of a baseball player and let X be the height of a baseball player. Recall the scatter plot which is given by:

 Weight   -
          -                                              *         *
       225+                                                   2
          -                          *    *              *    *
          -
          -                          2              2
          -                          *         2    *
       200+                          *         3    *
          -                *         2    2    *    *
          -                     2    *         *    2    *
          -                2    *    3    2    *    *
          -                     *                   *
       175+ *         *         *    *
          -                               *
          -                *
          -      2         *         *
          -           *    *
            +---------+---------+---------+---------+---------+------ Height
         68.0      70.0      72.0      74.0      76.0      78.0
Here is the data set:
Height (X)


    74    75    77    73    69    73    78    76    77    78    76    72    73
    73    74    75    72    75    76    76    72    76    68    73    69    76
    77    74    75    73    79    72    75    70    75    78    73    75    74
    71    73    76    73    75    73    73    74    72    73    71    71    71
    73    74    76    71    76    71    70

Weight (Y). The data are paired: the first height, 74, goes with the first weight, 218, and so on.


    218    185    219    185    160    222    225    205    230    225    190
    180    185    200    195    195    185    190    200    180    175    195
    175    185    160    211    190    195    200    207    232    190    200
    175    200    220    195    205    185    185    210    210    195    205
    175    190    185    190    210    195    166    185    160    170    185
    155    190    160    155
If we enter this data set into the data box and choose regression, we obtain the Wilcoxon fit of the prediction equation. There is an association between height and weight: an increasing relationship. We predict a baseball player's weight in terms of his height. A confidence interval for the slope parameter is (4.2, 7.2); hence, we predict the weight of a ball player to increase by between 4 and 7 pounds for each additional inch of height (4 to 7 pounds per inch). We are not saying that being taller causes a player to be heavier; that would be absurd. We are observing an association between height and weight: if a ball player is taller, then he is more likely to be heavier.
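For readers without access to the regression module, here is a minimal Python sketch of a rank-based (Wilcoxon) line fit to the height-weight data above. It rests on two assumptions that the text does not state: the slope is taken as the weighted median of the pairwise slopes (weights |x_j - x_i|), which minimizes the sum of absolute differences between pairs of residuals, and the intercept is taken as the median of the residuals. The function name `wilcoxon_line` is hypothetical, and the course's software may compute its fit differently.

```python
from statistics import median


def wilcoxon_line(x, y):
    """Rank-based (Wilcoxon) fit of y = intercept + slope * x.

    Assumption: slope = weighted median of pairwise slopes with
    weights |x_j - x_i|; intercept = median of the residuals.
    """
    pairs = []
    for i in range(len(x)):
        for j in range(i + 1, len(x)):
            dx = x[j] - x[i]
            if dx != 0:  # pairs with equal x carry no slope information
                pairs.append(((y[j] - y[i]) / dx, abs(dx)))
    if not pairs:
        raise ValueError("need at least two distinct x values")
    pairs.sort()
    half = sum(w for _, w in pairs) / 2
    running = 0.0
    for slope, w in pairs:
        running += w
        if running >= half:  # weighted median reached
            break
    intercept = median(yi - slope * xi for xi, yi in zip(x, y))
    return intercept, slope


# Height (X) and weight (Y), in the order listed in the text.
height = [
    74, 75, 77, 73, 69, 73, 78, 76, 77, 78, 76, 72, 73,
    73, 74, 75, 72, 75, 76, 76, 72, 76, 68, 73, 69, 76,
    77, 74, 75, 73, 79, 72, 75, 70, 75, 78, 73, 75, 74,
    71, 73, 76, 73, 75, 73, 73, 74, 72, 73, 71, 71, 71,
    73, 74, 76, 71, 76, 71, 70,
]
weight = [
    218, 185, 219, 185, 160, 222, 225, 205, 230, 225, 190,
    180, 185, 200, 195, 195, 185, 190, 200, 180, 175, 195,
    175, 185, 160, 211, 190, 195, 200, 207, 232, 190, 200,
    175, 200, 220, 195, 205, 185, 185, 210, 210, 195, 205,
    175, 190, 185, 190, 210, 195, 166, 185, 160, 170, 185,
    155, 190, 160, 155,
]

intercept, slope = wilcoxon_line(height, weight)
print(f"predicted weight = {intercept:.1f} + {slope:.2f} * height")
```

The printed slope is in pounds per inch, so it can be compared directly with the confidence interval quoted above.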

To make better predictions, there may be other variables to consider. In the height-weight data, a measure of body build would be useful. In a more advanced class, we would discuss these issues.

We do need to emphasize one thing concerning observational studies. There must be a reason to explore associations and predictions. An example here is worth thousands of words. Let Y be the number of deaths per 100,000 in England for a year in the late 1800s and let X be the number of church weddings (in thousands) in England for that year. There is no reason to seek an association between these variables. But suppose we do. The data are given in Appendix A. The scatter plot of the data is:

          -                                                      *
          -                                                    2
      21.0+                                                2 *
          -                                          *  3
  deaths  -                                         *2*
          -                                       3
          -                                  *323
      18.0+                                22
          -                            ***
          -                         * *
          -                      3 *
          -
      15.0+                **
          -             * *
          -            2
          -
          -     *
            --------+---------+---------+---------+---------+--------ch_weds
                  600       640       680       720       760
The relationship is linear. In fact the pattern is quite tight. It is clear from the plot: to reduce deaths, reduce church weddings! There is a variable here causing this pattern. It is time! These data are recorded over the years. Here is a plot of the death rate versus year:
          -       *
          -           *    *
      21.0+          * *  *
          -        **   *     *
  deaths  -              *  *  *           *
          -                  *  *         *
          -                       ******    ** *
      18.0+                      *       *         **
          -                             *        **
          -                                     *    *
          -                                   *       * * *
          -
      15.0+                                            * *
          -                                                **
          -                                                  * *
          -
          -                                                   *
            +---------+---------+---------+---------+---------+------year
         1860      1870      1880      1890      1900      1910
Great strides in science were made in these years (Louis Pasteur, etc.) that helped the death rate to plummet. Here is a plot of church weddings versus year:
          -       *
       770+           *    *
          -          * *  *
  ch_weds -        **         *            *
          -             **  ** *
          -                     *  ** *   * **
       700+                       *  * *       *   **
          -                      *      **
          -                                     ***
          -                                   *      *    *
          -                                           * *
       630+                                            * * *
          -                                                 ** *
          -
          -                                                   *
          -
       560+
            +---------+---------+---------+---------+---------+------year
         1860      1870      1880      1890      1900      1910
Church weddings also dropped over these years. Hence both variables decrease with respect to year and thus have an increasing relationship when plotted against each other. So that solves the puzzle. Time is called a lurking variable here.
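The effect of a lurking variable can be sketched with a small simulation. The numbers below are synthetic and are not the Appendix A data: two series are generated that each decline over time with independent noise, under assumed starting levels, slopes, and noise sizes chosen only to resemble the plots. Their raw correlation comes out strongly positive, while the correlation left after removing a straight-line time trend from each series is much weaker. The helper names `pearson` and `detrend` are hypothetical.

```python
import random
from statistics import mean


def pearson(u, v):
    """Pearson correlation coefficient of two equal-length lists."""
    mu, mv = mean(u), mean(v)
    suv = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    suu = sum((a - mu) ** 2 for a in u)
    svv = sum((b - mv) ** 2 for b in v)
    return suv / (suu * svv) ** 0.5


def detrend(t, v):
    """Residuals of v after a least-squares straight-line fit on t."""
    mt, mv = mean(t), mean(v)
    slope = (sum((a - mt) * (b - mv) for a, b in zip(t, v))
             / sum((a - mt) ** 2 for a in t))
    return [b - mv - slope * (a - mt) for a, b in zip(t, v)]


random.seed(1)  # fixed seed so the sketch is reproducible
years = list(range(1860, 1910))

# Synthetic series (NOT the Appendix A data): both fall with time,
# and their noise terms are independent of each other.
deaths = [22.0 - 0.1 * (yr - 1860) + random.gauss(0, 0.3) for yr in years]
weddings = [770.0 - 3.0 * (yr - 1860) + random.gauss(0, 10) for yr in years]

raw = pearson(deaths, weddings)
adjusted = pearson(detrend(years, deaths), detrend(years, weddings))

print(f"correlation of deaths with weddings:  {raw:.2f}")
print(f"same, after removing the time trend:  {adjusted:.2f}")
```

Removing the trend is one simple way to adjust for time; it assumes each series changes roughly linearly with year, which is only approximately true of the plots above.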

In an observational study, make sure that a relationship among the variables you include makes sense. If a paradox occurs (such as the association between death rate and church wedding rate), look for a lurking variable.


Exercise 12.4.1  
1. (From Bhattacharyya and Johnson (1977), Statistical Concepts and Methods, New York: Wiley.) Below are used-car prices (in thousands of dollars) for a foreign compact (1970s data), together with the cars' ages in years.
  Age    1     2     2     3     3     4     6     7     8     10
  Price  2.45  1.80  2.00  2.00  1.70  1.20  1.15  0.69  0.60  0.47
(a) Plot the data, Price versus Age. Comment on the car buyer's lament (depreciation).
(b) Use the regression module to obtain the Wilcoxon fit of a linear model to the data.
(c) Obtain a 95% confidence interval for the slope and interpret it in terms of the problem.
(d) Predict the price of an 11-year-old compact.
(e) What are some other X variables that would help predict price?
(f) If we had much older cars, would you expect to see a continued downhill trend? Why?

2. (From Hettmansperger and McKean (1998), Robust Nonparametric Statistical Methods, London: Arnold.) Below are the numbers of telephone calls (in tens of millions) made in Belgium for the years 1950-1973:
  Year   50     51     52     53     54     55     56     57     58     59     60     61
  Calls  0.44   0.47   0.47   0.59   0.66   0.73   0.81   0.88   1.06   1.20   1.35   1.49

  Year   62     63     64     65     66     67     68     69     70     71     72     73
  Calls  1.61   2.12   11.90  12.40  14.20  15.90  18.20  21.20  4.30   2.40   2.70   2.90
(a) Plot the data and comment on the plot. (A recording error was made in a few of the years. Find those years.)
(b) Use the regression module to obtain both the least squares and Wilcoxon fits of the data set.
(c) Plot these fits. Which would you use to predict the number of calls in 1974?



2001-01-01