Stem-leaf plots and histograms are useful descriptions of a sample, but
often we want to describe or compare samples with a few descriptive
statistics. The statistics we discuss next are commonly called the Five
Basic Descriptive Statistics.
First, alas, we need a little notation. We will be discussing samples
throughout this course and we often need to call the items something. So
for a generic sample of size n let's use
x1, x2, ..., xn
where
x1 denotes the first item (measurement) in the sample,
x2 denotes the second item (measurement) in the sample,
xi denotes the ith item (measurement) in the sample,
xn denotes the nth (last) item (measurement) in the sample.
Using this notation we can now define the 5 basic descriptive statistics. We will illustrate these statistics with the sample of n=25 Etruscan skull sizes, given above, but repeated for convenience:
126 132 138 140 141 141 142 143 144 144 144 145 146 147 148 148 149 149 150 150 150 154 155 158 158
Note that we have ordered the data from low to high. If you had to choose some numbers to describe this data set, probably the first two you would pick are the Minimum and the Maximum. The minimum of a sample is the smallest measurement, i.e. the first ordered data point. We will denote the minimum by min. For the Etruscan data set min = 126mm. The maximum of a sample is the largest measurement, i.e. the nth ordered data point. We will denote the maximum by max. For the Etruscan data set max = 158mm.
The data run from the min to the max, and we call the difference between them the range. For the Etruscan data the range is 158 - 126 = 32 mm. The range is a measure of scale, or dispersion or noise. The range is extremely sensitive to outliers. Outliers are points that are far from the rest of the data. We will formally define "outlier" in the next section. A statistic is said to be robust if it is not sensitive to outliers. So the minimum, maximum, and range are not robust statistics.
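As a quick check on the arithmetic, the minimum, maximum and range can be computed in a few lines. This is only a minimal Python sketch, not the class code; the variable names are made up for illustration.

```python
# Etruscan skull sizes (mm), as listed above.
etruscan = [126, 132, 138, 140, 141, 141, 142, 143, 144, 144, 144, 145, 146,
            147, 148, 148, 149, 149, 150, 150, 150, 154, 155, 158, 158]

ordered = sorted(etruscan)      # order the data from low to high
n = len(ordered)                # sample size

minimum = ordered[0]            # first ordered data point
maximum = ordered[n - 1]        # nth ordered data point
data_range = maximum - minimum  # difference between the max and the min

print(n, minimum, maximum, data_range)  # 25 126 158 32
```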
Now that we have the range of the data, the next statistic is a measure of the center of the sample. We will use the median. The median is the middle ordered data point if the sample size is an odd number and the average of the two middle ordered data points if the sample size is even. 50% of the data is less than or equal to the median and 50% of the data is greater than or equal to the median. For the Etruscan data, upon ordering the data we get,
126 132 138 140 141 141 142 143 144 144 144 145 146 147 148 148 149 149 150 150 150 154 155 158 158
Since n is 25, (n+1)/2 is 13 and, hence, the median is the 13th ordered data point, or 146 mm. We shall use Q2 to denote the median. So half of the Etruscans in the sample had a skull size less than or equal to 146mm and half of the Etruscans had a skull size greater than or equal to 146mm. The median is a measure of center. The median is very robust. Half the data would have to change for the median to change.
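The odd/even rule for the median translates directly into code. The following is a minimal Python sketch of that rule, not the class code; the function name `median` is just an illustrative choice.

```python
def median(sample):
    """Median by the rule in the text: the middle ordered data point if n is
    odd, the average of the two middle ordered data points if n is even."""
    ordered = sorted(sample)
    n = len(ordered)
    if n % 2 == 1:
        # the (n+1)/2-th ordered point; subtract 1 for 0-based indexing
        return ordered[(n + 1) // 2 - 1]
    # average of the (n/2)-th and (n/2 + 1)-th ordered points
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2


etruscan = [126, 132, 138, 140, 141, 141, 142, 143, 144, 144, 144, 145, 146,
            147, 148, 148, 149, 149, 150, 150, 150, 154, 155, 158, 158]

print(median(etruscan))  # 146
```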
We now have the range of the data and a measure of the center. How about the middle 50%? This goes from the First Quartile to the Third Quartile. The first quartile is the median of the first half of the data. We will denote it by Q1. 25% of the data is less than or equal to the first quartile and 75% of the data is greater than or equal to the first quartile. There are many rules for finding Q1. In this class we will be using the computer for large data sets and the computer (the statistical software) will compute Q1. For class and tests let's use a very simple rule. To find the ordered data point, divide n by 4. If the result is an integer, use the ordered data point corresponding to that integer. If the result is a fraction, round up to the next integer and use the ordered data point corresponding to that integer. For the Etruscan data, 25/4 is 6.25; hence, we round up to 7. The 7th ordered data point is 142, so Q1 = 142mm for the Etruscan data set. The first quartile is a robust statistic.
The third quartile is the median of the second half of the data. We will denote it by Q3. 75% of the data is less than or equal to the third quartile and 25% of the data is greater than or equal to the third quartile. There are many rules for finding Q3. To find Q3 by hand, just use the integer we found for the first quartile, but this time count through the data from the high measurements to the low measurements. For the Etruscan data that integer is 7, and the 7th ordered data point counting from the top is 150. Hence Q3 = 150mm (three points are tied at 150; the 8th point from the top is 149). The third quartile is robust.
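The simple hand rule for the quartiles (divide n by 4, round up if needed, then count from the bottom for Q1 and from the top for Q3) can be sketched in Python as follows. This is an illustration of the class rule only; statistical software often uses a different quartile rule, so its answers may differ slightly. The function name `quartiles` is hypothetical.

```python
import math


def quartiles(sample):
    """Q1 and Q3 by the simple class rule described in the text."""
    ordered = sorted(sample)
    n = len(ordered)
    k = math.ceil(n / 4)   # n/4, rounded up when it is a fraction
    q1 = ordered[k - 1]    # kth ordered point counting from the bottom
    q3 = ordered[n - k]    # kth ordered point counting from the top
    return q1, q3


etruscan = [126, 132, 138, 140, 141, 141, 142, 143, 144, 144, 144, 145, 146,
            147, 148, 148, 149, 149, 150, 150, 150, 154, 155, 158, 158]

print(quartiles(etruscan))  # (142, 150)
```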
The difference between the quartiles, Q3 - Q1, is called the interquartile range of the sample. It is denoted by IQR, so for the Etruscan data IQR = 150 - 142 = 8mm. IQR is also a measure of scale. It is not sensitive to outliers (25% of the data have to be outliers to affect IQR); hence, IQR is robust.
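To see the difference in robustness between the range and the IQR, here is a small Python sketch. The "contaminated" sample is hypothetical: it is the Etruscan data with the largest measurement replaced by a made-up value of 258, as if it had been mistyped. The function name `spread` is also just illustrative.

```python
import math


def spread(sample):
    """Return (range, IQR), with the quartiles found by the simple class rule."""
    ordered = sorted(sample)
    n = len(ordered)
    k = math.ceil(n / 4)
    data_range = ordered[-1] - ordered[0]
    iqr = ordered[n - k] - ordered[k - 1]
    return data_range, iqr


etruscan = [126, 132, 138, 140, 141, 141, 142, 143, 144, 144, 144, 145, 146,
            147, 148, 148, 149, 149, 150, 150, 150, 154, 155, 158, 158]

# Hypothetical contaminated copy: the last 158 replaced by a made-up 258.
contaminated = etruscan[:-1] + [258]

print(spread(etruscan))      # (32, 8)
print(spread(contaminated))  # (132, 8)
```

A single wild point changes the range from 32 to 132 but leaves the IQR at 8.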
In summary, for the Etruscan data, the five basic descriptive statistics are min = 126, Q1 = 142, Q2 = 146, Q3 = 150 and max = 158mm. We want to put these summary statistics in a picture, but first we need the concept of an outlier, which we introduce in the next section.
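Putting the pieces together, one function can return all five statistics at once. This is a minimal Python sketch that follows the hand rules of this section; it is not the class code, and the name `five_number_summary` is an assumption made for the example.

```python
import math


def five_number_summary(sample):
    """Return (min, Q1, Q2, Q3, max) using the hand rules from this section."""
    ordered = sorted(sample)
    n = len(ordered)
    k = math.ceil(n / 4)                  # position used for both quartiles
    if n % 2 == 1:
        q2 = ordered[(n + 1) // 2 - 1]    # middle ordered point
    else:
        q2 = (ordered[n // 2 - 1] + ordered[n // 2]) / 2  # average of middle two
    return ordered[0], ordered[k - 1], q2, ordered[n - k], ordered[-1]


etruscan = [126, 132, 138, 140, 141, 141, 142, 143, 144, 144, 144, 145, 146,
            147, 148, 148, 149, 149, 150, 150, 150, 154, 155, 158, 158]

print(five_number_summary(etruscan))  # (126, 142, 146, 150, 158)
```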
Let's do one more example, which shows how we can very quickly get the 5 basic descriptive statistics from a stem-leaf plot. Consider the subsample of Italian skull sizes given by,
133 128 136 140 127 136 131 131 128 132 125 133 134 136 134 129 132 139 143 138
The stem-leaf plot is
Stem  Leaves     f   F   FTB
12    87859      5   5
13    31123442   8   13
13    66698      5   18  7
14    03         2   20  2
We have added three columns on the right side of the stem-leaf plot. The column labeled f is the frequency of the class, the column labeled F is the cumulative frequency of the class (the number of data points down through the end of the class), and the column labeled FTB is the cumulative frequency of the class from large numbers to small (the number of data points from the top down through the beginning of the class). Based on this plot and those columns the 5 basic descriptive statistics are a cinch. The minimum is 125 and the maximum is 143.
The sample size is 20 (last number in column F), which is even. So the median is the average of the 10th (n/2) and the 11th (n/2 + 1) ordered data points. To get these look at column F. There are 5 data points down through the end of the first class and there are 13 data points down through the end of the second class; hence, the median must occur in the second class. In the second class the 6th through 11th ordered data points are 131, 131, 132, 132, 133, 133. Thus the median is .5(133 + 133) = 133.
Since 20/4 is 5, the first quartile is the 5th ordered data point, which is 129 (the largest data point in the first class, as dictated by column F). The third quartile is the 5th ordered data point counting from the top to the bottom. By the FTB column it is in the second class from the top. The 7th ordered data point (from top to bottom) is 136, the 6th is 136, and the 5th is 136. So Q3 = 136.
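The stem-leaf plot and its three extra columns can also be built by program. The sketch below is a hedged Python illustration, not the class code. It assumes each stem is split into two classes (leaves 0-4 and leaves 5-9) and that empty classes are simply not shown, which matches the plot above. It prints FTB for every class, whereas the plot above lists only the two entries that were needed.

```python
italian = [133, 128, 136, 140, 127, 136, 131, 131, 128, 132,
           125, 133, 134, 136, 134, 129, 132, 139, 143, 138]

# Group the leaves by split stem: (stem, 0) holds leaves 0-4,
# (stem, 1) holds leaves 5-9. Leaves stay in the order they appear.
classes = {}
for x in italian:
    stem, leaf = divmod(x, 10)
    classes.setdefault((stem, leaf // 5), []).append(leaf)

n = len(italian)
rows = []
cum_freq = 0
for key in sorted(classes):
    leaves = classes[key]
    freq = len(leaves)     # f: frequency of the class
    ftb = n - cum_freq     # FTB: points from the top down through the start of the class
    cum_freq += freq       # F: points down through the end of the class
    rows.append((key[0], "".join(str(d) for d in leaves), freq, cum_freq, ftb))

print(f"{'Stem':<6}{'Leaves':<10}{'f':>3}{'F':>4}{'FTB':>5}")
for stem, leaves, freq, cum, ftb in rows:
    print(f"{stem:<6}{leaves:<10}{freq:>3}{cum:>4}{ftb:>5}")
```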
For small data sets we can get these statistics by hand. But for large
data sets it is best if the computer gets them for us.
To run the class code for the descriptive statistics, enter the data, then choose the summary module and choose Numerical Summaries.
12 18 25 15 9 14 21 25 28 125