
Box-and-Whisker Plot

Order the data from smallest to largest. What is the range of the first (or smaller) half of the data? Of the smallest quarter? Of the next quarter? The box-and-whisker plot, or boxplot, is a graphical display of how the data are distributed across these quarters. Consider once more the HS GPA variable in the Graduation Rate Data of Section 1.3. Since there are n=56 observations, a quarter of the data comprises 56/4=14 observations. The left whisker of the boxplot extends from 2.37 to 2.72. This means that the lowest 14 GPAs (one quarter of the data) lie within (2.37, 2.72). The next 14 observations lie within (2.72, 3.11). The third set of 14 observations lies within (3.11, 3.52). Finally, the largest 14 observations lie within (3.52, 4.00).
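The idea of splitting the ordered data into quarters can be sketched in Python. This is only an illustration: the function name quarter_ranges and the eight-value sample are hypothetical (the sample is not taken from the Graduation Rate Data), and the sketch assumes the sample size is a multiple of 4, as with n=56. It reports the smallest and largest observation in each quarter, whereas the intervals quoted for the boxplot use the quartiles as shared boundaries between neighboring quarters.

```python
def quarter_ranges(values):
    """Return (smallest, largest) for each quarter of the ordered data.

    Assumption of this sketch: len(values) is a positive multiple of 4.
    """
    ordered = sorted(values)
    n = len(ordered)
    if n == 0 or n % 4 != 0:
        raise ValueError("sketch needs a non-empty sample whose size is a multiple of 4")
    size = n // 4  # number of observations in each quarter
    return [(ordered[i * size], ordered[(i + 1) * size - 1]) for i in range(4)]


# Hypothetical sample of 8 values, used only to show the mechanics.
sample = [3.1, 2.4, 3.9, 2.8, 3.5, 2.6, 3.3, 3.0]

for k, (low, high) in enumerate(quarter_ranges(sample), start=1):
    print(f"quarter {k}: {low} to {high}")
```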






[Figure: Boxplot of High School GPA (box_hsgpa.ps)]






The five values (2.37, 2.72, 3.11, 3.52, 4.00) that divide the data into quarters mark the ends of the whiskers, the edges of the box, and the line inside the box. They are collectively called the five-number summary of the data, and are often denoted MIN, Q1, MED, Q3, and MAX, respectively.

1. MIN is the smallest of the ordered observations, and is called the minimum.
2. Q1 is the upper boundary of the first quarter, and is called the first quartile.
3. MED is the upper boundary of the second quarter, and is called the second quartile. However, it also divides the data into lower and upper halves, and is more often called the median.
4. Q3 is the upper boundary of the third quarter, and is called the third quartile.
5. MAX is the largest of the ordered observations, and is called the maximum.

Boxplots are quite useful for comparing two distributions side by side. Below, we present a boxplot of second-year GPA alongside the boxplot of high school GPA. The boxplots are drawn vertically this time, but the interpretation remains the same. Note that there is only a slight difference in location, as measured by the medians, but a large difference in spread between the two distributions. The most noteworthy feature of second-year GPA is its long lower tail (the left tail when the plot is drawn horizontally), which is evidence that some students are not doing very well in college. (Note: Some computing packages use a special symbol to denote outlying values, or outliers. The boxplot for second-year GPA has one extremely low outlier, denoted by a circle, so the left (or bottom) whisker ends at the second smallest observation.)

[Figure: Side-by-side boxplots of second-year GPA and high school GPA (box_sydbysyd.ps)]

Different statistical computing packages often compute the quartiles in different ways. In this class, we compute the quartiles as follows. First, arrange the observations from smallest (1st ordered observation) to largest (nth ordered observation). Then:

Q1 is the .25(n+1)st ordered observation.
MED is the .50(n+1)st ordered observation.
Q3 is the .75(n+1)st ordered observation.

If .25(n+1) is not an integer, Q1 is the average of the two adjacent ordered observations, that is, the ones at the positions just below and just above .25(n+1). The same rule applies to MED and Q3. Following are the 56 ordered observations of HS GPA used in the boxplot above.



  

      2.37   2.43   2.46   2.55   2.57   2.58   2.59   2.60   2.60   2.60   
      2.61   2.63   2.67   2.71   2.73   2.75   2.78   2.78   2.78   2.79   
      2.81   2.81   2.82   2.90   2.91   2.93   2.94   3.08   3.14   3.16
      3.19   3.20   3.21   3.29   3.32   3.33   3.35   3.36   3.36   3.36
      3.44   3.50   3.54   3.54   3.57   3.58   3.60   3.62   3.72   3.73
      3.76   3.76   3.77   3.83   3.86   4.00



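Since .25(56+1)=14.25, Q1 is computed as the average of the 14th and 15th ordered observations: (2.71+2.73)/2=2.72. Similarly, .50(56+1)=28.5, so MED is the average of the 28th and 29th ordered observations, (3.08+3.14)/2=3.11, and .75(56+1)=42.75, so Q3 is the average of the 42nd and 43rd ordered observations, (3.50+3.54)/2=3.52.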
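The quartile rule used in this class can be sketched in Python as a check on the arithmetic above. The helper names ordered_observation and five_number_summary are hypothetical, and the sketch assumes at least 3 observations so that every position falls between 1 and n. For comparison, it also calls statistics.quantiles from the Python standard library (Python 3.8 or later) with method="exclusive", which uses the same (n+1)-based positions but interpolates between the two neighbors instead of averaging them. This is one example of how a package can give slightly different quartiles.

```python
import math
import statistics

# The 56 ordered HS GPA observations listed above.
hs_gpa = [
    2.37, 2.43, 2.46, 2.55, 2.57, 2.58, 2.59, 2.60, 2.60, 2.60,
    2.61, 2.63, 2.67, 2.71, 2.73, 2.75, 2.78, 2.78, 2.78, 2.79,
    2.81, 2.81, 2.82, 2.90, 2.91, 2.93, 2.94, 3.08, 3.14, 3.16,
    3.19, 3.20, 3.21, 3.29, 3.32, 3.33, 3.35, 3.36, 3.36, 3.36,
    3.44, 3.50, 3.54, 3.54, 3.57, 3.58, 3.60, 3.62, 3.72, 3.73,
    3.76, 3.76, 3.77, 3.83, 3.86, 4.00,
]


def ordered_observation(ordered, position):
    """Value at a 1-based position in the ordered data.

    If the position is not an integer, average the two adjacent
    ordered observations, following the rule used in this class.
    """
    lower = math.floor(position)
    if position == lower:
        return ordered[lower - 1]
    return (ordered[lower - 1] + ordered[lower]) / 2


def five_number_summary(values):
    """Return (MIN, Q1, MED, Q3, MAX) using the .25(n+1) style positions."""
    ordered = sorted(values)
    n = len(ordered)
    if n < 3:
        raise ValueError("this sketch assumes at least 3 observations")
    q1 = ordered_observation(ordered, 0.25 * (n + 1))
    med = ordered_observation(ordered, 0.50 * (n + 1))
    q3 = ordered_observation(ordered, 0.75 * (n + 1))
    return ordered[0], q1, med, q3, ordered[-1]


class_summary = [round(v, 2) for v in five_number_summary(hs_gpa)]
print("class rule:", class_summary)

# Standard-library quartiles: same positions, but linear interpolation.
package_quartiles = statistics.quantiles(hs_gpa, n=4, method="exclusive")
print("statistics.quantiles:", [round(q, 3) for q in package_quartiles])
```

With the class rule, the summary agrees with the values computed by hand. The interpolating method gives the same median, 3.11, but puts Q1 near 2.715 and Q3 near 3.53, because the positions 14.25 and 42.75 lie a quarter and three quarters of the way between their neighbors rather than halfway.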



2003-09-08