Next: Other Statistics Up: Descriptive Statistics Previous: Outliers and Box Plots

# Comparing Data Sets

As with dotplots, boxplots lend themselves to comparisons. Just make sure that the same number scale is used for each boxplot, then draw the boxplots in rows (or in columns). As an example, reconsider the subsample of Italian skull sizes given by

```
133    128    136    140    127    136    131    131    128    132    125
133    134    136    134    129    132    139    143    138
```

Recall that the 5 basic descriptive statistics are: 125, 129, 133, 136, and 143. Hence, h = 1.5(136-129) = 10.5 and the fences are LIF = 129-10.5 = 118.5 and UIF = 136+10.5 = 146.5. No observation falls outside the fences, so the adjacent points are the minimum and maximum, 125 and 143. Based on these statistics, the comparison boxplots are:
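The fence arithmetic above can be checked with a short script. This is only a sketch: the quartiles 129 and 136 are taken directly from the text rather than recomputed, since quartile conventions differ between books and software, and the helper names `fences` and `adjacent_points` are illustrative, not part of any module used in this course.

```python
# Italian skull-size subsample from the text.
italian = [133, 128, 136, 140, 127, 136, 131, 131, 128, 132, 125,
           133, 134, 136, 134, 129, 132, 139, 143, 138]

# Quartiles as quoted in the text (assumption: not recomputed here).
q1, q3 = 129, 136


def fences(q1, q3, k=1.5):
    """Return (LIF, UIF), the lower and upper inner fences."""
    h = k * (q3 - q1)          # h = 1.5 * interquartile range
    return q1 - h, q3 + h


def adjacent_points(data, lif, uif):
    """Return the smallest and largest observations inside the fences."""
    inside = [x for x in data if lif <= x <= uif]
    return min(inside), max(inside)


lif, uif = fences(q1, q3)
low_adj, high_adj = adjacent_points(italian, lif, uif)
outliers = sorted(x for x in italian if x < lif or x > uif)

print("LIF =", lif, "UIF =", uif)
print("adjacent points:", low_adj, high_adj)
print("outliers:", outliers)
```

Any observation listed under `outliers` would be plotted as a separate point beyond the whiskers; for the Italian subsample the list is empty.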

```
                                       --------------
Etruscan     *        -----------------I     +      I-------------
                                       --------------

                    -----------
Italian    ---------I    +    I-----------
                    -----------
           --+---------+---------+---------+---------+---------+----
           126.0     132.0     138.0     144.0     150.0     156.0
```

Use the summary module to obtain:

1. a boxplot for the example data given here. Enter the data in the first "DATA SETS" window.
```
126    132    138    140
141    141    142    143
144    144    144
```
2. a comparison boxplot for the above and the following data sets.
```
123   324   145   156   265
143   221   322   133   233
142   144   244
```

A final remark on this example is in order. Notice that the scales (noise levels) of the two data sets are about the same; i.e., the interquartile ranges are about the same, 8 and 7, and, ignoring the outlier, so are the ranges. We do not have much data here to comment on the shapes of the distributions, but based on the comparison dotplots above, symmetry cannot be discounted.

In light of this, what catches your eye as you look at the boxplots? There is a shift; that is, the Etruscan data are shifted up from the Italian data. If you draw a line connecting the Etruscan and Italian lower quartiles and another connecting their upper quartiles, the two lines will be almost parallel. The line connecting the medians will also be almost parallel to these lines. In fact, it is tempting to summarize the data with one number, the difference in the medians. In this case the difference is 146 - 133 = 13. This is called a location problem. These problems are characterized by the samples having similar shapes and scales (noise levels). In such cases, a convenient summary is a difference in locations or centers. Here, that difference is 13, so the Etruscan head sizes are shifted up 13 mm from the Italian head sizes.

Be very careful, though. This number 13 is based on just two samples. We also need a measure of sampling error. If this measure turns out to be greater than 13, then our estimate of shift loses a lot of meaning. In later chapters we will say it is insignificant. If the sampling error is small (less than 13 here), then our estimate of shift is meaningful. In later chapters we will say it is significant.
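As a small illustration of the shift estimate, the sketch below recomputes the Italian median from the subsample listed earlier and subtracts it from the Etruscan median. The Etruscan value 146 is taken from the text as given, because the Etruscan observations are not listed in this section.

```python
from statistics import median

# Italian skull-size subsample from the text.
italian = [133, 128, 136, 140, 127, 136, 131, 131, 128, 132, 125,
           133, 134, 136, 134, 129, 132, 139, 143, 138]

# Etruscan median as quoted in the text (assumption: not recomputed,
# since the Etruscan data do not appear in this section).
etruscan_median = 146

italian_median = median(italian)

# Estimated shift in location: difference of the two sample medians.
shift = etruscan_median - italian_median

print("Italian median:", italian_median)
print("Estimated shift (mm):", shift)
```

The result is only a point estimate; it says nothing about sampling error, which is the subject of later chapters.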

Exercise 2.6.1
1. A standardized exam was given to two groups of people. The first group took the exam under adverse conditions (room was too cold, room was dirty, proctor swore at them) while the second group took it under normal conditions. The data are given below. Determine the five basic statistics for the two groups, find the fences, and determine if there are any outliers. Then draw comparison boxplots for the two data sets. Are there any location differences? Scale differences?
```
Group 1:   153    150    132    123    148    146    140    154
           137    112

Group 2:   148    113     69    129    150    129    157    184
           143    167    141    179    124    130    166
```
2. Consider Carrie's baseball data. Obtain back-to-back stem-leaf plots of the height of the hitters and pitchers. Discuss the plots.
3. In the last problem, obtain the 5 basic descriptive statistics for the heights of the hitters and pitchers. Obtain the fences, and determine if there are any outliers. Then draw comparison boxplots for the two data sets. Are there any location differences? Scale differences?
4. Ten batteries from each of three brands (A, B, and C) were put on test to determine their lifetimes (in hours). Obtain comparison dotplots. Use these dotplots to obtain the 5 basic descriptive statistics for each brand. Bigger means better here. Which brand seems best, if any?
```
A:    41    289    214    102     38
      94    179     87    116    155

B:    39     65     22     64     22
     191     99     32    142    317

C:    24     95    139    122     41
     360    318     34     43     18
```


2001-01-01