Next: Outliers and Box Plots Up: Descriptive Statistics Previous: Sample Distributions for Continuous

The 5 Basic Descriptive Statistics for Continuous Data

Stem-leaf plots and histograms are useful descriptions of a sample, but often we want to describe or compare samples with a few descriptive statistics. The statistics we discuss next are commonly called the Five Basic Descriptive Statistics. First, alas, we need a little notation. We will be discussing samples throughout this course, and we often need a name for the items in a sample. So for a generic sample of size n, let's use

x1, x2, ..., xi, ..., xn

where

x1 denotes the first item (measurement) in the sample,
x2 denotes the second item (measurement) in the sample,
...
xi denotes the ith item (measurement) in the sample,
...
xn denotes the nth item (measurement) in the sample.

Using this notation we can now define the 5 basic descriptive statistics. We will illustrate these statistics with the sample of n=25 Etruscan skull sizes, given above, but repeated for convenience:

126    132    138    140    141    141    142    143    144    144    144
145    146    147    148    148    149    149    150    150    150    154
155    158    158
Note that we have ordered the data from low to high. If you had to choose some numbers to describe this data set, probably the first two you would pick are the Minimum and the Maximum. The minimum of a sample is the smallest measurement, i.e. the first ordered data point. We will denote the minimum by min. For the Etruscan data set min = 126mm. The maximum of a sample is the largest measurement, i.e. the nth ordered data point. We will denote the maximum by max. For the Etruscan data set max = 158mm. The data run from the min to the max, and we call the difference max - min the range. For the Etruscan data the range is 158 - 126 = 32mm. The range is a measure of scale, or dispersion, or noise. The range is extremely sensitive to outliers. Outliers are points that are far from the rest of the data. We will formally define "outlier" in the next section. A statistic is said to be robust if it is not sensitive to outliers. So the minimum, maximum, and range are not robust statistics.
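As a quick check of the arithmetic above, here is a minimal Python sketch (Python is an assumption of this example; it is not the class software) that orders the Etruscan sample and reads off the min, the max, and the range. The variable names are illustrative only.

```python
# Etruscan skull sizes (mm), copied from the listing in the text.
etruscan = [126, 132, 138, 140, 141, 141, 142, 143, 144, 144, 144,
            145, 146, 147, 148, 148, 149, 149, 150, 150, 150, 154,
            155, 158, 158]

ordered = sorted(etruscan)               # order the data from low to high
sample_min = ordered[0]                  # first ordered data point
sample_max = ordered[-1]                 # nth ordered data point
sample_range = sample_max - sample_min   # range = max - min

print(sample_min, sample_max, sample_range)  # 126 158 32
```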

Now that we have the range of the data, the next statistic is a measure of the center of the sample. We will use the median. The median is the middle ordered data point if the sample size is an odd number, and the average of the two middle ordered data points if the sample size is even. Thus 50% of the data is less than or equal to the median and 50% of the data is greater than or equal to the median. For the Etruscan data, upon ordering the data we get,

126    132    138    140    141    141    142    143    144    144    144
145    146    147    148    148    149    149    150    150    150    154
155    158    158

Since n is 25, (n+1)/2 is 13 and, hence, the median is the 13th ordered data point, or 146mm. We shall use Q2 to denote the median. So half of the Etruscans in the sample had a skull size less than or equal to 146mm and half of the Etruscans had a skull size greater than or equal to 146mm. The median is a measure of center. The median is very robust: roughly half the data would have to be outliers before the median is seriously affected.
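The odd/even rule for the median can be sketched in a few lines of Python. The function name `median` is chosen here for illustration and is not part of the class software.

```python
def median(sample):
    """Median by the rule in the text: the middle ordered point if n is odd,
    the average of the two middle ordered points if n is even."""
    ordered = sorted(sample)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # The (n+1)/2-th ordered point sits at 0-based index n // 2.
        return ordered[mid]
    # The n/2-th and (n/2 + 1)-th ordered points sit at indices mid-1 and mid.
    return (ordered[mid - 1] + ordered[mid]) / 2

etruscan = [126, 132, 138, 140, 141, 141, 142, 143, 144, 144, 144,
            145, 146, 147, 148, 148, 149, 149, 150, 150, 150, 154,
            155, 158, 158]

print(median(etruscan))  # 146
```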

We now have the range of the data and a measure of the center. How about the middle 50%? This goes from the First Quartile to the Third Quartile. The first quartile is the median of the first half of the data. We will denote it by Q1. 25% of the data is less than or equal to the first quartile and 75% of the data is greater than or equal to the first quartile. There are many rules for finding Q1. In this class we will be using the computer for large data sets, and the computer (the statistical software) will compute Q1. For class and tests let's use a very simple rule. To find which ordered data point to use, divide n by 4. If the result is an integer, pick out the ordered data point corresponding to that integer. If the result is a fraction, round up to the nearest integer and pick out the ordered data point corresponding to this integer. For the Etruscan data, 25/4 is 6.25; hence, we round up to 7. The 7th ordered data point is 142, so Q1 = 142mm for the Etruscan data set. The first quartile is a robust statistic.

The third quartile is the median of the second half of the data. We will denote it by Q3. 75% of the data is less than or equal to the third quartile and 25% of the data is greater than or equal to the third quartile. There are many rules for finding Q3. To find Q3 by hand, just use the integer we found for the first quartile, but this time count through the data from the high measurements to the low measurements. For the Etruscan data this is the 7th point from the top, so Q3 = 150mm (three points are tied at 150, in the 5th through 7th positions from the top; the 8th point from the top is 149). The third quartile is robust.
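The simple hand rule for the quartiles (divide n by 4, round up, count from the bottom for Q1 and from the top for Q3) can be written as a short Python sketch. The function name `class_quartiles` is hypothetical, and this implements only the class rule, not the rules used by statistical software.

```python
import math

def class_quartiles(sample):
    """Q1 and Q3 by the class rule: k = n/4 rounded up;
    Q1 is the kth ordered point from the bottom, Q3 the kth from the top."""
    ordered = sorted(sample)
    k = math.ceil(len(ordered) / 4)
    q1 = ordered[k - 1]   # kth point counting up from the minimum
    q3 = ordered[-k]      # kth point counting down from the maximum
    return q1, q3

etruscan = [126, 132, 138, 140, 141, 141, 142, 143, 144, 144, 144,
            145, 146, 147, 148, 148, 149, 149, 150, 150, 150, 154,
            155, 158, 158]

print(class_quartiles(etruscan))  # (142, 150)
```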

The difference between the quartiles is called the interquartile range  of the sample. It is denoted by IQR , so for the Etruscan data IQR = 150-142 = 8mm. IQR is also a measure of scale. It is not sensitive to outliers (25% of the data have to be outliers to affect IQR); hence, IQR is robust.
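To see the difference in robustness between the range and the IQR, the sketch below compares the Etruscan sample with a contaminated copy. The contaminated value 258 is purely hypothetical (imagine one 158 mistyped); it is not in the original data. The quartiles use the simple class rule (n/4, rounded up).

```python
import math

def range_and_iqr(sample):
    """Return (range, IQR), with quartiles found by the class rule."""
    ordered = sorted(sample)
    k = math.ceil(len(ordered) / 4)
    sample_range = ordered[-1] - ordered[0]
    iqr = ordered[-k] - ordered[k - 1]   # Q3 - Q1
    return sample_range, iqr

etruscan = [126, 132, 138, 140, 141, 141, 142, 143, 144, 144, 144,
            145, 146, 147, 148, 148, 149, 149, 150, 150, 150, 154,
            155, 158, 158]

# Hypothetical outlier: replace the last 158 with 258.
contaminated = etruscan[:-1] + [258]

print(range_and_iqr(etruscan))      # (32, 8)
print(range_and_iqr(contaminated))  # (132, 8)
```

One bad point moves the range from 32 to 132, while the IQR stays at 8.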

In summary, for the Etruscan data, the five basic descriptive statistics are: 126, 142, 146, 150 and 158mm. We want to put these summary statistics in a picture but first we need the concept of an outlier, which we will do in the next section.

Let's do one more example, which shows how we can very quickly get the 5 basic descriptive statistics from a stem-leaf plot. Consider the subsample of Italian skull sizes given by,

133    128    136    140    127    136    131    131    128    132    125
133    134    136    134    129    132    139    143    138

The stem leaf plot is

 Stem  Leaves     f   F  FTB
  12   87859      5   5
  13   31123442   8  13
  13   66698      5  18   7
  14   03         2  20   2
We have added three columns on the right side of the stem-leaf plot. The column labeled f is the frequency of the class, the column labeled F is the cumulative frequency of the class (the number of data points from the smallest value down through the end of the class), and the column labeled FTB is the cumulative frequency of the class from large numbers to small (the number of data points from the largest value back through the beginning of the class). Based on this plot and those columns the 5 basic descriptive statistics are a cinch. The minimum is 125 and the maximum is 143.
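The f, F, and FTB columns can be rebuilt with a short Python sketch. It assumes classes of width 5 (so a stem may be split into a low half, leaves 0-4, and a high half, leaves 5-9), which matches the plot above; the function name `stem_leaf_rows` is made up for this example. Unlike the plot, it fills in FTB for every class rather than only the top two.

```python
from collections import defaultdict

italian = [133, 128, 136, 140, 127, 136, 131, 131, 128, 132, 125,
           133, 134, 136, 134, 129, 132, 139, 143, 138]

def stem_leaf_rows(sample, width=5):
    """Return one (stem, leaves, f, F, FTB) tuple per non-empty class."""
    classes = defaultdict(list)
    for x in sample:
        classes[x // width].append(x % 10)   # leaf = last digit, in data order
    n = len(sample)
    rows = []
    cumulative = 0
    for key in sorted(classes):
        leaves = classes[key]
        f = len(leaves)            # frequency of the class
        ftb = n - cumulative       # points in this class and all higher classes
        cumulative += f            # F: points through the end of this class
        stem = key * width // 10
        rows.append((stem, "".join(map(str, leaves)), f, cumulative, ftb))
    return rows

for stem, leaves, f, F, ftb in stem_leaf_rows(italian):
    print(f"{stem:4d}   {leaves:<10s} {f:2d} {F:3d} {ftb:3d}")
```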

The sample size is 20 (the last number in column F), which is even. So the median is the average of the 10th (n/2) and 11th (n/2 + 1) ordered data points. To get these, look at column F. There are 5 data points down through the end of the first class and there are 13 data points down through the end of the second class; hence, both middle points must occur in the second class. In the second class the 6th through 11th ordered data points are 131, 131, 132, 132, 133, 133. Thus the median is .5(133 + 133) = 133.

Since 20/4 is 5, the first quartile is the 5th ordered data point, which is 129 (the largest data point in the first class, as dictated by column F). The third quartile is the 5th ordered data point counting from the top to the bottom. By the FTB column it is in the second class from the top. The 7th ordered point (from top to bottom) is 136, the 6th is 136, and the 5th is 136. So Q3 = 136.
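The hand calculation for the Italian subsample can be checked with a small Python sketch that applies the rules of this section: min, Q1, median, Q3, max, with the quartiles found by the n/4 round-up rule. The function name `five_basic` is illustrative only.

```python
import math

def five_basic(sample):
    """Return (min, Q1, median, Q3, max) using the class rules."""
    ordered = sorted(sample)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        q2 = ordered[mid]
    else:
        q2 = (ordered[mid - 1] + ordered[mid]) / 2
    k = math.ceil(n / 4)   # n/4, rounded up if fractional
    return ordered[0], ordered[k - 1], q2, ordered[-k], ordered[-1]

italian = [133, 128, 136, 140, 127, 136, 131, 131, 128, 132, 125,
           133, 134, 136, 134, 129, 132, 139, 143, 138]

print(five_basic(italian))  # (125, 129, 133.0, 136, 143)
```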

For small data sets we can get these statistics by hand. But for large data sets it is best if the computer gets them for us.
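Keep in mind that software does not necessarily use the simple class rule for quartiles. As an illustration only (this is Python's standard library, not the class software), `statistics.quantiles` interpolates between ordered data points, so its Q1 and Q3 may differ slightly from the hand rule even though the median agrees.

```python
import statistics

etruscan = [126, 132, 138, 140, 141, 141, 142, 143, 144, 144, 144,
            145, 146, 147, 148, 148, 149, 149, 150, 150, 150, 154,
            155, 158, 158]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = statistics.quantiles(etruscan, n=4)

print("software quartiles:", q1, q2, q3)
print("hand-rule quartiles:", 142, 146, 150)
```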

To run the class code for the descriptive statistics, choose the summary module and then choose Numerical Summaries after entering the data.

 
12 18 25 15 9 14 21 25 28 125

2001-01-01