Frequently the most interesting points of a data set are the points that do not seem to belong; i.e., they seem to differ by a substantial amount from the rest of the data. We call these points outliers. These are often points worthy of investigation in order to understand why they differ. Such points can lead to significant discoveries.
For example, each year satellites measure the ozone level over Antarctica. In the early 1980s, however, scientists were so astounded to detect, by a flyover, a dramatic seasonal drop in ozone levels over Antarctica that they spent two years rechecking their satellite data. They discovered that the satellites had dutifully been recording the ozone collapse, but the computers had not raised an alert because they were programmed to reject such extreme data as anomalies; see R. Benedick, Scientific American, April 1992. The discovery of the drop in ozone levels has had a profound influence on manufacturing and society. If the computers had been programmed to flag the outliers instead of rejecting them, the scientists would have been alerted to investigate on the first occasion, and changes in manufacturing could have been made much sooner.
We have chosen the following simple rule for determining when a point is labeled an outlier. First determine the quartiles Q1 and Q3. Recall that the interquartile range (IQR), Q3 - Q1, is a measure of noise or scale for the data set. Points that lie more than one and a half IQRs below Q1 or above Q3 will be deemed potential outliers. I know what you are asking (you are so inquisitive): why this rule? Stay tuned for Chapter 5, when an answer will be provided.
In order to set up a formal mechanism, denote the above distance by h; i.e.,

h = 1.5(Q3 - Q1).

Next, denote the lower and upper inner fences by

LIF = Q1 - h and UIF = Q3 + h.
Hence points beyond these fences are potential outliers. Those points of the data set which are closest to the fences but still inside the fences are called the adjacent points. There are two adjacent points in a data set: the lower adjacent point (the point inside the fences but closest to LIF) and the upper adjacent point (the point inside the fences but closest to UIF).
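The fence rule above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name fence_summary is made up for this sketch, the quartiles are passed in rather than computed (so whatever quartile rule you prefer can be used), a point lying exactly on a fence is treated as inside, and the small data set is invented purely to exercise the code.

```python
def fence_summary(data, q1, q3):
    """Apply the 1.5 * IQR rule to data, given its quartiles q1 and q3.

    Hypothetical helper for illustration; assumes at least one point
    lies inside the fences.
    """
    h = 1.5 * (q3 - q1)          # one and a half interquartile ranges
    lif = q1 - h                 # lower inner fence
    uif = q3 + h                 # upper inner fence

    # Assumption: a point exactly on a fence counts as inside.
    inside = [x for x in data if lif <= x <= uif]
    outliers = sorted(x for x in data if x < lif or x > uif)

    return {
        "h": h,
        "LIF": lif,
        "UIF": uif,
        "lower_adjacent": min(inside),   # inside point closest to LIF
        "upper_adjacent": max(inside),   # inside point closest to UIF
        "outliers": outliers,
    }


# Invented toy numbers, with quartiles supplied by hand for the example.
toy = [2, 10, 11, 12, 13, 14, 15, 30]
summary = fence_summary(toy, q1=10.5, q3=14.5)
print(summary)
```

For these toy numbers h = 6, so the fences are 4.5 and 20.5; the points 2 and 30 fall beyond them, and the adjacent points are 10 and 15.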
We now have the ingredients to draw a boxplot of a data set. This is an easily drawn schematic of the data set which displays the five basic descriptive statistics and the outliers, if there are any. Simply draw a number line, as you did for the dotplot. Find the quartiles on the number line and draw a rectangle above the number line which encloses the quartiles; i.e., this box encloses the middle 50% of the data. Find the median on the number line and place a + in the box above the median. Next find the fences and adjacent points. Draw lines from the ends of the box to the adjacent points. Finally, indicate the outliers by *'s.
Sounds horrible, right? Now wait a minute. Consider the sample of n = 25 Etruscan skull sizes, given above:
126 132 138 140 141 141 142 143 144 144 144 145 146 147 148 148 149 149 150 150 150 154 155 158 158

Recall that the quartiles are 142 and 150 and the median is 146. Hence, h = 1.5(150 - 142) = 12. Thus the fences are: LIF = 142 - 12 = 130 and UIF = 150 + 12 = 162. Therefore the adjacent points are 132 and 158, and the point 126 is an outlier. So the boxplot is:
                             --------------
  *         -----------------I     +      I-------------
                             --------------
--+---------+---------+---------+---------+---------+----
126.0     132.0     138.0     144.0     150.0     156.0
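For readers who would rather let a computer do the drawing, here is a rough Python sketch that builds the box-and-whisker rows of such a text plot for the skull data. It is an illustration only, not the routine that produced the plot above: the function name text_boxplot, the scale of 10 columns per 6 units, and the two-column left margin are assumptions chosen to imitate that layout, the quartiles and median are passed in by hand, and the axis row is omitted.

```python
def text_boxplot(data, q1, median, q3, cols_per_unit=10 / 6, margin=2):
    """Return a three-line text boxplot based on the 1.5 * IQR fence rule.

    Hypothetical helper for illustration; the column scale and margin
    are assumptions, and the number-line axis is not drawn.
    """
    h = 1.5 * (q3 - q1)
    lif, uif = q1 - h, q3 + h

    # Assumption: a point exactly on a fence counts as inside.
    inside = [x for x in data if lif <= x <= uif]
    outliers = sorted(set(data) - set(inside))
    lo_adj, hi_adj = min(inside), max(inside)

    origin = min(data)

    def col(x):
        # Map a data value to a character column.
        return margin + round((x - origin) * cols_per_unit)

    width = col(max(data)) + 1
    c1, cm, c3 = col(q1), col(median), col(q3)

    # Top and bottom edges of the box span the quartiles.
    edge = [" "] * width
    for c in range(c1, c3 + 1):
        edge[c] = "-"

    # Middle row: whiskers out to the adjacent points, box sides, median.
    mid = [" "] * width
    for c in range(col(lo_adj), c1):
        mid[c] = "-"
    for c in range(c3 + 1, col(hi_adj) + 1):
        mid[c] = "-"
    mid[c1] = mid[c3] = "I"
    mid[cm] = "+"
    for x in outliers:
        mid[col(x)] = "*"

    edge_line = "".join(edge).rstrip()
    return "\n".join([edge_line, "".join(mid).rstrip(), edge_line])


skulls = [126, 132, 138, 140, 141, 141, 142, 143, 144, 144, 144, 145, 146,
          147, 148, 148, 149, 149, 150, 150, 150, 154, 155, 158, 158]
print(text_boxplot(skulls, q1=142, median=146, q3=150))
```

With the quartiles 142 and 150 and the median 146, the sketch places the * at 126, runs the whiskers out to 132 and 158, and puts the + inside the box at 146.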