
Measures of Scale

The range and the interquartile range (IQR) are measures of scale. The range is not robust, but the interquartile range is. For our three data sets, the interquartile ranges are 37.5, 17.5, and 68 for Samples 1, 2, and 3, respectively. These ratios agree with our quick glance at the boxplots above.

We also need an estimate of scale to use in conjunction with the mean. It is a measure of deviation from the mean. For instance, the value $x_1 - \bar{x}$ is the deviation of the first point from the mean. Hence, we have the $n$ deviations:

\begin{displaymath}x_1 - \bar{x},\; x_2 - \bar{x},\; \ldots,\; x_n - \bar{x}
\end{displaymath}

It does not matter here whether a deviation is negative or positive. One way to get rid of the sign is to square the deviation. That still leaves $n$ squared deviations, so we summarize them by taking their average, except that we divide by $n - 1$ rather than $n$. The resulting statistic is called the sample variance, and we usually use the symbol $s^2$ to represent it. However, the units of $s^2$ are squared units. For example, if our data consist of the weights in pounds of individuals, then $s^2$ will be in pounds squared. We rectify this by taking the square root, and we call the resulting statistic the sample standard deviation, $s$. In notation we have

\begin{displaymath}s = \sqrt{\frac{\sum_{i=1}^{n}(x_i - \bar{x})^2}{n-1}}
\end{displaymath}
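
For reference, the sample variance that was described in words can be written in the same notation, with $x_i$ denoting the $i$th data point; the sample standard deviation is its square root:

\begin{displaymath}s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2
\end{displaymath}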

Let's use the simple data set 11, 18, 6, 4, 8, 15, 22 as an example. The sample mean is 12; hence the deviations are -1, 6, -6, -8, -4, 3, and 10. The squared deviations are 1, 36, 36, 64, 16, 9, and 100. Thus $s^2 = 262/6 = 43.67$, so the sample standard deviation is $s = \sqrt{43.67} = 6.61$. Of course, the easy way to compute $s$ is to enter these data into the data box and choose summary from the analysis menu. Then check the variable name and the covariance button.
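
For readers working outside the summary module, the hand computation can be checked with a short script. This is only a sketch in Python (a language choice not made by the text), and the variable names are illustrative. It follows the definition step by step and then compares the result with the standard library's `statistics.stdev`, which also divides by $n - 1$.

```python
from math import isclose, sqrt
from statistics import stdev

# The simple data set from the text.
data = [11, 18, 6, 4, 8, 15, 22]
n = len(data)

xbar = sum(data) / n                    # sample mean
deviations = [x - xbar for x in data]   # x_i - xbar, one per data point
squared = [d ** 2 for d in deviations]  # squaring removes the sign

s2 = sum(squared) / (n - 1)             # sample variance: divide by n - 1, not n
s = sqrt(s2)                            # sample standard deviation

print("mean:", xbar)
print("deviations:", deviations)
print("sum of squared deviations:", sum(squared))
print("s^2 =", round(s2, 2), " s =", round(s, 2))

# Cross-check against the standard library implementation.
assert isclose(s, stdev(data))
```

Listing the deviations explicitly makes it easy to confirm that there is exactly one deviation per data point before squaring and summing.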

The sample standard deviation is not robust, as the table below shows dramatically. It uses the simple example with successive changes to the last data point.

                      Data                        median   mean   IQR  s
Set 1: 11    18     6     4     8    15    22       11     12     12   6.61
Set 2: 11    18     6     4     8    15    72       11     19.1   12   23.8
Set 3: 11    18     6     4     8    15   720       11    112     12   268
Set 4: 11    18     6     4     8    15  2200       11    323     12   828
Set 5: 11    18     6     4     8    15  7200       11   1037     12  2717
Set 6: 11    18     6     4     8    15 72000       11  10295     12 27210
Even the first change (22 to 72) brings almost a fourfold increase in noise as measured by $s$ (from 6.61 to 23.8), while the median and the interquartile range do not change at all. The interquartile range is robust.
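
The table can be reproduced with a few lines of code. The following is a hedged sketch in Python rather than the summary module the text uses. It assumes that the quartile convention of `statistics.quantiles` with its default `method="exclusive"` (Python 3.8 or later) matches the one behind the table; for these seven-point samples it gives quartiles of 6 and 18. The helper name `summarize` is hypothetical.

```python
from statistics import mean, median, quantiles, stdev

# First six data points are fixed; only the last point changes from set to set.
base = [11, 18, 6, 4, 8, 15]
last_values = [22, 72, 720, 2200, 7200, 72000]

def summarize(data):
    """Return the four statistics shown in the table for one sample."""
    # Assumption: the default "exclusive" quartile method matches the table's IQR.
    q1, _, q3 = quantiles(data, n=4)
    return {
        "median": median(data),
        "mean": mean(data),
        "IQR": q3 - q1,
        "s": stdev(data),  # divides by n - 1
    }

rows = [summarize(base + [last]) for last in last_values]

for i, (last, row) in enumerate(zip(last_values, rows), start=1):
    print(
        f"Set {i}: last={last:>6}  median={row['median']}  "
        f"mean={row['mean']:.1f}  IQR={row['IQR']}  s={row['s']:.1f}"
    )
```

Other quartile conventions can give a different IQR for the same data, but under any of them the quartiles of these samples do not depend on how large the last point becomes.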
What does s mean? We will answer that later in Chapter 5.


Exercise 2.7.1  
1.
Use the summary module to obtain these statistics for the two data sets in #1, Exercise 1.4. Using these statistics, obtain comparison boxplots of the two samples.
2.
Check the robustness of the statistics in the descriptive statistics command on the following two data sets using the summary module.
       Data set 1 

            102 131 137 63 42 12 23 49 63 21 
             56  68  35 63 62 19 85 38 76 29 
             31  16   0  8 47 40  2 44  8 16 
              7  43   2 50 22  1 51 34  4 78

        
       Data set 2 

            1020 131 137 63 42 12 23 49 63 21 
              56  68  35 63 62 19 85 38 76 29 
              31  16   0  8 47 40  2 44  8 16 
               7  43   2 50 22  1 51 34  4 78
Notice that in the second data set, the 102 was changed to 1020. Which statistics were robust to this change? Which were not?
3.
Same as the last exercise but change the 1020 to 10200.
4.
Same as the last exercise but change the 10200 to 102000.
5.
Did Manuel I shortchange the people by putting less silver in his later mintings? Try to answer this question by comparing the following two data sets (use comparison boxplots). The first data set is the amount of silver (percentage) in Manuel's first minting, while the second data set is the amount of silver (percentage) in Manuel's fourth minting.
       First:     5.9   6.8    6.4   7.0   6.6   7.7   7.2   6.9   6.2 
       Fourth:    5.3   5.6    5.5   5.1   6.2   5.8   5.8
6.
A drug compound (call it A) was tested using the LDL levels of quail. In the experiment, 30 quail were randomly chosen; 20 were assigned to a placebo and the other 10 to treatment with Drug A. The drug was mixed in their food. Other than this, the quail were treated the same. At the end of the treatment period, the low-density lipoprotein (LDL) levels of the quail were measured and are given below. Here smaller is definitely better. The data are real.
  Placebo:  64  49  54  64  97  66  76  44  71  89  
            70  72  71  55  60  62  46  77  86  71 
	    
   Drug A:  40  31  50  48 152  44  74  38  81  64
(a)
Obtain comparison dot plots of the data and try to decide whether Drug A was effective.
(b)
Obtain the descriptive statistics for each data set. Which measure (difference in means, difference in medians, or difference in HL) seems most appropriate here? Why?



2001-01-01