
  
Confidence Intervals Based on Resampling

In this section we discuss a way of obtaining confidence intervals in a variety of situations. These are based on resampling and are often referred to as bootstrap percentile confidence intervals.

Recall from Section 7.2 that the endpoints of the confidence interval for the population mean, $\mu$, are actually estimates of the approximate (Central Limit Theorem) 2.5th and 97.5th percentiles of the distribution of $\bar{X}$. We will use this same idea, estimating percentiles of a sampling distribution, for our bootstrap confidence intervals.
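As a reminder of what that statement means, here is a sketch of the large-sample case. Section 7.2 is not reproduced here, so the multiplier 1.96 (rather than 2 or a $t$ critical value) and the symbols $s$ (sample standard deviation) and $n$ (sample size) are assumptions about its notation:

$$
\bar{X} \;\approx\; N\!\left(\mu,\ \frac{\sigma^2}{n}\right)
\quad\Longrightarrow\quad
\text{2.5th and 97.5th percentiles of } \bar{X} \;\approx\; \mu \mp 1.96\,\frac{\sigma}{\sqrt{n}},
\qquad
\text{estimated by } \bar{x} \mp 1.96\,\frac{s}{\sqrt{n}} .
$$

The estimated percentiles on the right are exactly the endpoints of the usual approximate 95% confidence interval for $\mu$.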

Consider the population median, $\theta$. We will obtain a bootstrap confidence interval for the median. This is often a parameter of interest because it divides the population in half. That is, half the time a population item is less than $\theta$ and half the time a population item is greater than $\theta$. For instance, suppose the population of interest is the incomes of American families. If you knew $\theta$, then you would know whether your family is in the bottom half or top half of American families when it comes to income.

As another example, suppose you were doing research on a new battery to power an electric automobile. Suppose the median lifetime of the current battery is 300 miles. You feel that the new battery is a vast improvement. Let $\theta$ be the median lifetime in miles of your new battery. You don't know $\theta$, but you would like to show it is an improvement; i.e., it is over 300 miles. How would you investigate this?

It's easy, right? You don't know the population (the lifetime in miles of a typical new battery), so you will have to use a sample. So you select 20 new batteries and put them on test. WAIT! It is extremely important that the batteries are:

1. Selected independently of one another.

2. Manufactured under the same conditions.

If these conditions are not met, get ready for GIGO: Garbage In, Garbage Out.

These assumptions are very important and they have to be followed. You can see why we are often dealing with small samples. In this case, we destroy the battery when we sample it. (You can recharge it, but a recharged battery is not in the population of interest!) How long a recharged battery lasts is of interest, but in the present experiment we are not measuring the effect of recharging. This may be a later experiment. Also, we are doing research on the battery, so you may be tempted to make modifications to the battery as we sample. Nope, not allowed, for this violates assumption 2. (In certain situations this can be done, but it is a much different experimental design; see the section on regression design.)

Continuing with our example, suppose you do select 20 new batteries at random and put them on test. (Twenty cars of the same type are selected, one of the new batteries is installed in each car, and they are driven over the same route.) Suppose the (sorted) lifetimes of the batteries in miles are:

196    204    233    256    258    313    315    322    403    408
483    510    538    559    586    722    806    875    930   1192
A dotplot of the sample is:
                   .
              : .:  :   :   . ....      .   .   .  .            .
           ---+---------+---------+---------+---------+---------+---C1
            200       400       600       800      1000      1200
The dotplot is skewed right and shows a lot of noise. With 20 observations, the sample median is the average of the 10th and 11th sorted values: $Q_2 = 0.5 \times (408+483) = 445.5$. This is above 300 (the median lifetime in miles of the old battery). But wait! Five of the new batteries lasted less than 300 miles. So are we sure? Of course not; $Q_2 = 445.5$ is just an estimate of the population median $\theta$. We need to know how much it missed by.
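The arithmetic in the previous paragraph can be checked with a few lines of Python. This is only a sketch; the variable names are illustrative and are not part of the class code.

```python
import statistics

# Battery lifetimes in miles (the sample of 20 given in the text).
lifetimes = [196, 204, 233, 256, 258, 313, 315, 322, 403, 408,
             483, 510, 538, 559, 586, 722, 806, 875, 930, 1192]

# With an even number of observations, the median is the average of the
# two middle sorted values (here the 10th and 11th: 408 and 483).
sample_median = statistics.median(lifetimes)

# How many of the new batteries failed before the old median of 300 miles?
below_300 = sum(1 for miles in lifetimes if miles < 300)

print("sample median:", sample_median)
print("batteries under 300 miles:", below_300)
```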

We of course need to know the distribution of $Q_2$, but we don't. Next, how about a Central Limit Theorem from which we could obtain the approximate distribution of $Q_2$? Such theorems exist, but the approximate standard error of $Q_2$ is not easy to estimate. How about estimating the 2.5th and 97.5th percentiles of the distribution of $Q_2$? Hey, now you are cooking!

Okay, we need the distribution of $Q_2$. But we don't know it. We could do it this way, though. Simply do the experiment over and over, say, 1000 times. For each of those times, calculate $Q_2$. Form a histogram of these 1000 $Q_2$'s and pick off the 2.5th and 97.5th percentiles. (This is the same as sorting the 1000 $Q_2$'s and selecting the 25th and 976th sorted $Q_2$'s.)

Back to the battery experiment! We just have to do 1000 experiments. WHAT! That's $1000 \times 20 = 20{,}000$ batteries that we need. Ridiculous. Of course, we can't do this.

So what's to do? Louder, I can't hear you. You got it! Resample the sample 1000 times. For each resample calculate the sample median. Form a histogram of these 1000 $Q_2$'s and pick off the 2.5th and 97.5th percentiles. (This is the same as sorting the 1000 $Q_2$'s and selecting the 25th and 976th sorted $Q_2$'s.) These percentiles do indeed estimate the percentiles of the true distribution of $Q_2$. It's such a simple idea and it works. You simply resample the sample. To ensure independence, you resample with replacement, and you use the same sample size as the original sample.
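The resampling recipe just described can be sketched in Python as follows. This is not the class code; the function names `bootstrap_medians` and `percentile_interval` and the `seed` argument are hypothetical names introduced only for illustration.

```python
import random
import statistics

# Battery lifetimes in miles (the sample of 20 given in the text).
lifetimes = [196, 204, 233, 256, 258, 313, 315, 322, 403, 408,
             483, 510, 538, 559, 586, 722, 806, 875, 930, 1192]


def bootstrap_medians(sample, n_resamples=1000, seed=None):
    """Return the sorted medians of n_resamples resamples of sample."""
    rng = random.Random(seed)  # the seed is only there for reproducibility
    medians = []
    for _ in range(n_resamples):
        # Resample WITH replacement, keeping the original sample size.
        resample = rng.choices(sample, k=len(sample))
        medians.append(statistics.median(resample))
    return sorted(medians)


def percentile_interval(sorted_values):
    """Return the k-th smallest and k-th largest values, k = ceil(2.5% of n).

    For n = 1000 these are the 25th and 976th sorted values;
    for n = 100 they are the 3rd and 98th.
    """
    n = len(sorted_values)
    k = (25 * n + 999) // 1000  # integer ceiling of 0.025 * n
    return sorted_values[k - 1], sorted_values[n - k]


medians = bootstrap_medians(lifetimes, n_resamples=1000, seed=2001)
lower, upper = percentile_interval(medians)
print(f"approximate 95% bootstrap percentile interval: ({lower}, {upper})")
```

Because the resamples are random, the endpoints change somewhat from run to run (and with the seed), so this sketch should not be expected to reproduce any particular interval exactly.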

For example, here is a resample of the sample of lifetimes of the batteries (I have sorted them):

196    196    204    204    204    204    256    258    315    315
315    322    483    538    559    559    806    875    875    930
The median of this resample is 315. Here is another resample:
196    204    256    256    256    313    313    403    483    510
538    538    538    538    538    586    722    806    875   1192
The median of this resample is 524. We only have 998 more resamples to go. Of course, the computer will do this for us.
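As a quick check on the two resamples printed above, the following sketch confirms that every resampled value comes from the original sample, that values repeat (because we sample with replacement), and that the two medians are 315 and 524. The variable names are illustrative only.

```python
import statistics
from collections import Counter

# Original sample of battery lifetimes (miles).
lifetimes = [196, 204, 233, 256, 258, 313, 315, 322, 403, 408,
             483, 510, 538, 559, 586, 722, 806, 875, 930, 1192]

# The two resamples printed in the text (already sorted).
resample_1 = [196, 196, 204, 204, 204, 204, 256, 258, 315, 315,
              315, 322, 483, 538, 559, 559, 806, 875, 875, 930]
resample_2 = [196, 204, 256, 256, 256, 313, 313, 403, 483, 510,
              538, 538, 538, 538, 538, 586, 722, 806, 875, 1192]

for name, resample in [("resample 1", resample_1), ("resample 2", resample_2)]:
    # A resample has the same size as the sample and uses only its values.
    assert len(resample) == len(lifetimes)
    assert set(resample) <= set(lifetimes)
    # Values drawn more than once show that sampling is with replacement.
    repeats = {value: count for value, count in Counter(resample).items() if count > 1}
    print(name, "median:", statistics.median(resample), "repeated values:", repeats)
```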

We will have two levels of class code (CC) for this.

1. Input the data and select the number of resamples (trials) that you want. For class use, the concept (what is a confidence interval) is more important than real use, so we will often just do 100 resamples. As you will see, the 100 resampled medians will be returned in order. Avoiding fractions, we will choose the 3rd and the 98th items from these 100 sorted resample medians as our 95% confidence interval for the median.

2. For the second level, 1000 resamples will be done. They will not be printed out. Just the 2.5th and 97.5th percentiles will be printed out.

I did 100 resamples by getting the class code (One-Sample bootstrap means and medians), dropping the data in the input box, entering my id, entering 100 for the number of resamples (bootstraps), and clicking submit. This gave me:
   286.5   313.0   314.0   314.0   315.0   315.0   317.5   317.5   318.5
   318.5   318.5   318.5   318.5   318.5   322.0   358.0   359.0   361.5
   361.5   362.5   362.5   362.5   362.5   362.5   362.5   365.0   365.0
   365.0   402.5   403.0   403.0   403.0   403.0   403.0   405.5   405.5
   405.5   405.5   405.5   405.5   408.0   408.0   408.0   408.0   408.0
   408.0   408.0   408.0   412.5   443.0   443.0   445.5   445.5   445.5
   445.5   445.5   459.0   459.0   459.0   470.5   483.0   483.0   483.0
   483.0   483.0   483.0   483.0   483.0   483.0   483.5   496.5   496.5
   496.5   496.5   496.5   496.5   510.0   510.0   510.0   510.0   510.0
   510.0   510.5   521.0   524.0   524.0   524.0   524.0   524.0   534.5
   534.5   538.0   538.0   538.0   548.5   548.5   559.0   559.0   572.5
   572.5
Hence my confidence interval is (314, 559): the 3rd sorted resample median is 314 and the 98th is 559. This is only based on 100 resamples, so let's use the terminology: we are fairly confident that the true population median is between 314 and 559. Note that the interval does not include 300.
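To make the "3rd and 98th" rule concrete, here is a small sketch that picks those two items out of the 100 sorted resample medians listed above. The list is typed in from that output; the name `medians_100` is illustrative only.

```python
# The 100 sorted resample medians returned by the class code (copied from the text).
medians_100 = [
    286.5, 313.0, 314.0, 314.0, 315.0, 315.0, 317.5, 317.5, 318.5,
    318.5, 318.5, 318.5, 318.5, 318.5, 322.0, 358.0, 359.0, 361.5,
    361.5, 362.5, 362.5, 362.5, 362.5, 362.5, 362.5, 365.0, 365.0,
    365.0, 402.5, 403.0, 403.0, 403.0, 403.0, 403.0, 405.5, 405.5,
    405.5, 405.5, 405.5, 405.5, 408.0, 408.0, 408.0, 408.0, 408.0,
    408.0, 408.0, 408.0, 412.5, 443.0, 443.0, 445.5, 445.5, 445.5,
    445.5, 445.5, 459.0, 459.0, 459.0, 470.5, 483.0, 483.0, 483.0,
    483.0, 483.0, 483.0, 483.0, 483.0, 483.0, 483.5, 496.5, 496.5,
    496.5, 496.5, 496.5, 496.5, 510.0, 510.0, 510.0, 510.0, 510.0,
    510.0, 510.5, 521.0, 524.0, 524.0, 524.0, 524.0, 524.0, 534.5,
    534.5, 538.0, 538.0, 538.0, 548.5, 548.5, 559.0, 559.0, 572.5,
    572.5,
]

# Python lists are 0-based, so the 3rd item is index 2 and the 98th is index 97.
lower = medians_100[2]
upper = medians_100[97]
print(f"interval from 100 resamples: ({lower}, {upper})")

# The old battery's median lifetime (300 miles) lies below the whole interval.
print("interval entirely above 300:", lower > 300)
```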

For the practical confidence interval based on 1000 resamples, I got the interval (314, 572.5). This is based on 1000 resamples, so we will use the terminology: we are 95% confident that the true population median is between 314 and 572.5. Note that the interval does not include 300. Because every value in the interval exceeds 300, we are confident that the new battery is an improvement. Is it a practical improvement? That is a question for the engineers to determine.

Next is a dotplot of the sample showing the location (X's) of the confidence interval:

                   .
              : .:  :   :   . ....      .   .   .  .            .
           ---+----X----+--------X+---------+---------+---------+---C1
            200       400       600       800      1000      1200
A dotplot of the 1000 resample medians is given by
Each dot represents 14 points


                            .
                            :
                            :
                            :
                        .   :  .
                   ..  :   :  :  :. :
                    ::  :   :  :. :: :::.
              . . . ::  :  .:..::.:: ::::::. . . .    .      .
           ---+---------+---------+---------+---------+---------+---
            240       360       480       600       720       840
It is fairly symmetric in the middle with an obvious tail to the right. It shows a central limit effect as we have seen with the sample mean.


Exercise 8.4.1  
1. Consider the following simple data set.
           77     79     81     91    106    114    126    132
Obtain 5 resamples of this data set using the previous resampling code which we used for probability. (Just sample with replacement the numbers: min = 1, max = 8, trials = 5, numbers to be drawn = 8.) These are the sample item numbers for the resample. For example, if the numbers you draw are:
      6       5       4       4       4       5       8       6
then your resample is: 114, 106, 91, 91, 91, 106, 132, 114.

Obtain 4 more resamples. Calculate the median of each. Compare your resampled medians with the sample median.

2. For the last problem, use the class code (One-Sample bootstrap means and medians (Sorted)) to obtain a 95% confidence interval for the true population median based on 100 resamples.

3. For problem #1, use the class code (One-Sample CI's for the mean and median) to obtain a 95% confidence interval for the true population median based on 1000 resamples. Dotplot your data set and locate the confidence interval and sample median on your plot.

4. Below are the weights of the pitchers in Carrie's baseball data set. Obtain the sample median. Use the class code (One-Sample CI's for the mean and median) to obtain a 95% confidence interval for the true population median weight of a professional baseball pitcher based on 1000 resamples. Locate your interval on the dotplot below and interpret your interval.
  160    175    180    185    185    185    190    190    195    195    195
  200    200    200    200    205    205    210    210    218    219    220
  222    225    225    232



                                 .      .  :
                 .         .  .  :   :  :  :   :  :    .... :    .
             -------+---------+---------+---------+---------+---------C21
                  165       180       195       210       225       240
5. Do the last problem for the weights of the hitters:
  155    155    160    160    160    166    170    175    175    175    180
  185    185    185    185    185    185    185    190    190    190    190
  190    195    195    195    195    200    205    207    210    211    230

                                   .
                                   :   .
                   .         .     :   :  :
               :   :   . .   :  .  :   :  :  .   .. ..           .
            +---------+---------+---------+---------+---------+-------C22
          150       165       180       195       210       225
6. Plot your CI's for the last two problems on the same real number line. What do you conclude about the true median weights of hitters and pitchers based on this plot?

7. Select one of your textbooks or a novel that you are reading. Select a passage at random (not dialogue). Then count the number of words in the first sentence of the passage. Record this number. Repeat this for 30 sentences. This is your sample of size 30. Dotplot your data and describe the shape. Determine the sample median. Next, obtain a 95% confidence interval for the true median sentence length. Locate the interval on your dotplot. What does it mean?



2001-01-01