Confidence Intervals Based on Resampling

Recall from Section 7.2 that the endpoints of the confidence interval
for the population mean
are actually estimates of the approximate
(Central Limit Theorem) 2.5*th* and 97.5*th* percentiles of the
distribution of the sample mean.
We will use this idea for our bootstrap confidence
intervals.

Consider the population median. We will obtain a bootstrap confidence interval for it. The median is often a
parameter of interest because it divides the population in half. That is,
half the time a population item is less than the median
and half the time a population item is greater than the median.
For instance,
suppose the population of interest is the income of American families.
If you knew the population median,
then you would know whether your family is
in the bottom half or top half of American families when it comes to income.

As another example, suppose you were doing research on a new battery
to power an electric automobile. Suppose the median lifetime in miles of the current
battery is 300 miles. You feel that the new battery is a vast improvement.
Consider the median lifetime in miles of your new battery.
You don't know this median,
but you would like to show it is an
improvement; i.e., that it is over 300 miles. How would you investigate this?

It's easy, right? You don't know the population (the lifetime in miles
of a typical new battery), so you will have to use a sample. So you select
20 new batteries and put them on test.
**WAIT!!!!!!** It is extremely
important that the batteries are:

1. **Selected independently of one another.**
2. **Manufactured under the same conditions.**

These assumptions are very important and they have to be followed. You
can see why we are often dealing with small samples. In this case, we are
destroying the battery when we sample it. (You can recharge it, but a recharged
battery is not in the population of interest!) How long a recharged battery
lasts is of interest, but in the present experiment we are not measuring
the effect of recharging. That may be a later experiment. Also, we are
doing research on the battery, so you may be tempted to make modifications
to the battery as we sample.
**Nope,** not allowed, for this violates assumption 2. (In certain situations this can be done, but it is a much different experimental design; see the section on regression design.)

Continuing with our example, suppose you do select 20 new batteries at random and put them on test (20 cars of the same type are selected, one of the new batteries is installed in each car, and they are driven over the same route). Suppose the (sorted) lifetimes of the batteries in miles are:

196 204 233 256 258 313 315 322 403 408 483 510 538 559 586 722 806 875 930 1192

A dotplot of the sample is:

[Dotplot of the 20 battery lifetimes (C1); axis from 200 to 1200 miles]

The dotplot is skewed right and shows a lot of noise. The sample median is *Q*_{2} = 445.5. This is above 300 (the lifetime in miles of the old battery). But wait! Five of the new batteries lasted less than 300 miles. So are we sure??? Of course not.
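As a quick check on these numbers, here is a minimal Python sketch (Python is an assumption made for illustration; it is not the class code) that computes the sample median and counts the lifetimes under 300 miles:

```python
from statistics import median

# Sorted battery lifetimes (miles) from the sample above.
lifetimes = [196, 204, 233, 256, 258, 313, 315, 322, 403, 408,
             483, 510, 538, 559, 586, 722, 806, 875, 930, 1192]

# With 20 items, the median is the average of the 10th and 11th sorted values.
sample_median = median(lifetimes)

# Count the batteries that lasted less than the old battery's 300 miles.
below_300 = sum(1 for x in lifetimes if x < 300)

print(sample_median)  # 445.5
print(below_300)      # 5
```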

We of course need to know the distribution of *Q*_{2},
but we don't. Next, how about a Central Limit Theorem from which we could
obtain the approximate distribution of *Q*_{2}? Such theorems
exist, but the approximate standard error of *Q*_{2} is not
easy to estimate. How about estimating the 2.5*th* and 97.5*th*
percentiles of the distribution of *Q*_{2}? Hey, now you are
cooking!

Okay, we need the distribution of *Q*_{2}. But we don't
know it. We could do it this way, though. Simply do the experiment over
and over, say, 1000 times. For each of those times, calculate *Q*_{2}.
Form a histogram of these 1000 *Q*_{2}'s and pick off the
2.5*th* and 97.5*th* percentiles. (This is the same as sorting
the 1000 *Q*_{2}'s and selecting the 25*th* and 976*th*
sorted *Q*_{2}'s.)

Back to the battery experiment! We just have to do 1000 experiments.
**WHAT!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!**.
**That's 20,000 batteries** (20 for each of the 1000 experiments) that we need. Ridiculous. Of course, we can't do
this.

So what's to do? Louder, I can't hear you. You got it!!!!!!!!!!!!!!!!!!
**Resample the sample 1000 times**. For each resample calculate the sample median.
Form a histogram of these 1000 *Q*_{2}'s and pick off the 2.5*th*
and 97.5*th* percentiles. (This is the same as sorting the 1000 *Q*_{2}'s
and selecting the 25*th* and 976*th* sorted *Q*_{2}'s.) These
percentiles do indeed estimate the percentiles of the true distribution
of *Q*_{2}. It's such a simple idea and it works. You simply
resample the sample. To ensure independence, you resample **with replacement**
and you use the same sample size.
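The resampling recipe can be sketched in a few lines of Python. This is only an illustrative sketch, not the class code: the function name `bootstrap_median_ci` and its parameters are hypothetical, and the endpoints are picked with the "25th and 976th sorted value" rule from the text. Because the resamples are random, the interval will differ a little from run to run unless a seed is fixed.

```python
import math
import random
from statistics import median

# Sorted battery lifetimes (miles) from the sample above.
lifetimes = [196, 204, 233, 256, 258, 313, 315, 322, 403, 408,
             483, 510, 538, 559, 586, 722, 806, 875, 930, 1192]


def bootstrap_median_ci(sample, n_resamples=1000, seed=None):
    """Approximate 95% percentile bootstrap interval for the median.

    Hypothetical helper for illustration; not the course's class code.
    """
    rng = random.Random(seed)  # fixing the seed makes a run repeatable
    n = len(sample)
    # Resample WITH replacement, same size as the sample; keep each median.
    medians = sorted(median(rng.choices(sample, k=n))
                     for _ in range(n_resamples))
    # Position of the lower endpoint: 25 for 1000 resamples, 3 for 100.
    k = math.ceil(25 * n_resamples / 1000)
    # k-th smallest and k-th largest, e.g. the 25th and 976th of 1000.
    return medians[k - 1], medians[n_resamples - k]


low, high = bootstrap_median_ci(lifetimes, n_resamples=1000, seed=1)
print(low, high)
```

Each resample median is either one of the original lifetimes or the average of two of them, which is why the bootstrap medians take only a limited set of values.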

For example, here is a resample of the sample of lifetimes of the batteries (I have sorted them):

196 196 204 204 204 204 256 258 315 315 315 322 483 538 559 559 806 875 875 930

The sample median is 315. Here is another resample:

196 204 256 256 256 313 313 403 483 510 538 538 538 538 538 586 722 806 875 1192

The sample median is 524. We only have 998 more resamples to go. Of course, the computer will do this for us.
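The two resamples shown can be checked with a short Python sketch (an illustration only; the variable names are made up here). It confirms that each resample has the same size as the sample, uses only values that appear in the sample, and has the stated median:

```python
from statistics import median

# Original sample of battery lifetimes (miles).
lifetimes = [196, 204, 233, 256, 258, 313, 315, 322, 403, 408,
             483, 510, 538, 559, 586, 722, 806, 875, 930, 1192]

# The two sorted resamples shown in the text.
resample_1 = [196, 196, 204, 204, 204, 204, 256, 258, 315, 315,
              315, 322, 483, 538, 559, 559, 806, 875, 875, 930]
resample_2 = [196, 204, 256, 256, 256, 313, 313, 403, 483, 510,
              538, 538, 538, 538, 538, 586, 722, 806, 875, 1192]

for resample in (resample_1, resample_2):
    # Same size as the sample, and every value comes from the sample.
    assert len(resample) == len(lifetimes)
    assert set(resample) <= set(lifetimes)
    print(median(resample))
# prints 315.0, then 524.0
```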

We will have two levels of class code (CC) for this.

1. Input the data and select the number of resamples (trials) that you want.
   For class use, the concept (what is a confidence interval) is more important
   than real use, so we will often just do 100 resamples. As you will see,
   the 100 resampled medians will be returned in order. Avoiding fractions,
   we will choose the 3*rd* and the 98*th* items from these sorted 100 resample medians as our 95% confidence interval for the median.
2. For the second level, 1000 resamples will be done. They will not be printed
   out. Just the 2.5*th* and 97.5*th* percentiles will be printed out.

With 100 resamples, the sorted resample medians were:

286.5 313.0 314.0 314.0 315.0 315.0 317.5 317.5 318.5 318.5
318.5 318.5 318.5 318.5 322.0 358.0 359.0 361.5 361.5 362.5
362.5 362.5 362.5 362.5 362.5 365.0 365.0 365.0 402.5 403.0
403.0 403.0 403.0 403.0 405.5 405.5 405.5 405.5 405.5 405.5
408.0 408.0 408.0 408.0 408.0 408.0 408.0 408.0 412.5 443.0
443.0 445.5 445.5 445.5 445.5 445.5 459.0 459.0 459.0 470.5
483.0 483.0 483.0 483.0 483.0 483.0 483.0 483.0 483.0 483.5
496.5 496.5 496.5 496.5 496.5 496.5 510.0 510.0 510.0 510.0
510.0 510.0 510.5 521.0 524.0 524.0 524.0 524.0 524.0 534.5
534.5 538.0 538.0 538.0 548.5 548.5 559.0 559.0 572.5 572.5

Hence my confidence interval, from the 3*rd* and 98*th* items, is (314.0, 559.0).
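Picking the 3rd and 98th items from the 100 sorted resample medians can be sketched in Python as follows (an illustration, not the class code; the list simply copies the values printed above):

```python
# The 100 sorted resample medians printed above.
medians_100 = [
    286.5, 313.0, 314.0, 314.0, 315.0, 315.0, 317.5, 317.5, 318.5, 318.5,
    318.5, 318.5, 318.5, 318.5, 322.0, 358.0, 359.0, 361.5, 361.5, 362.5,
    362.5, 362.5, 362.5, 362.5, 362.5, 365.0, 365.0, 365.0, 402.5, 403.0,
    403.0, 403.0, 403.0, 403.0, 405.5, 405.5, 405.5, 405.5, 405.5, 405.5,
    408.0, 408.0, 408.0, 408.0, 408.0, 408.0, 408.0, 408.0, 412.5, 443.0,
    443.0, 445.5, 445.5, 445.5, 445.5, 445.5, 459.0, 459.0, 459.0, 470.5,
    483.0, 483.0, 483.0, 483.0, 483.0, 483.0, 483.0, 483.0, 483.0, 483.5,
    496.5, 496.5, 496.5, 496.5, 496.5, 496.5, 510.0, 510.0, 510.0, 510.0,
    510.0, 510.0, 510.5, 521.0, 524.0, 524.0, 524.0, 524.0, 524.0, 534.5,
    534.5, 538.0, 538.0, 538.0, 548.5, 548.5, 559.0, 559.0, 572.5, 572.5,
]

assert len(medians_100) == 100
assert medians_100 == sorted(medians_100)

# Python lists are 0-indexed, so the 3rd item is index 2 and the 98th is index 97.
lower = medians_100[3 - 1]
upper = medians_100[98 - 1]
print((lower, upper))  # (314.0, 559.0)
```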

For the practical confidence interval based on 1000 resamples, I got
the interval *(314, 572.5)*. This is based on 1000 resamples, so we
will use the terminology:
*we are 95% confident that the true
population median is between 314 and 572.5.* Note that the interval does
not include 300. Because all values in the interval exceed 300, we are confident that the new battery is an improvement. Is it a practical improvement? This is a question for the engineers to determine.

Next is a dotplot of the sample showing the location (X's) of the confidence interval:

[Dotplot of the sample (C1) with X's marking the endpoints of the confidence interval; axis from 200 to 1200]

A dotplot of the 1000 resample medians is given by

[Dotplot of the 1000 resample medians; each dot represents 14 points; axis from 240 to 840]

It is fairly symmetric in the middle with an obvious tail to the right. It shows a central limit effect as we have seen with the sample mean.

1. Consider the following simple data set.

   77 79 81 91 106 114 126 132

   Obtain 5 resamples of this data set using the previous resampling code which we used for probability. (Just sample with replacement the numbers: min = 1, max = 8, trials = 5, numbers to be drawn = 8. These are the sample item numbers for the resample.) For example, if the numbers you draw are

   6 5 4 4 4 5 8 6

   then your resample is

   114, 106, 91, 91, 91, 106, 132, 114

   Obtain 4 more resamples. Calculate the median of each. Compare your resampled medians with the sample median.

2. For the last problem, use the class code (One-Sample bootstrap means and medians (Sorted)) to obtain a 95% confidence interval for the true population median based on 100 resamples.
3. For problem #1, use the class code (One-Sample CI's for the mean and median) to obtain a 95% confidence interval for the true population median based on 1000 resamples. Dotplot your data set and locate the confidence interval and sample median on your plot.
4. Below are the weights of the pitchers in Carrie's baseball data set. Obtain the sample median. Use the class code (One-Sample CI's for the mean and median) to obtain a 95%
   confidence interval for the true population median weight of a professional baseball pitcher based on 1000 resamples. Locate your interval on the dotplot
   below and interpret your interval.

   160 175 180 185 185 185 190 190 195 195 195 200 200 200 200 205 205 210 210 218 219 220 222 225 225 232

   [Dotplot of the pitcher weights (C21); axis from 165 to 240]

5. Do the last problem for the weights of the hitters:

   155 155 160 160 160 166 170 175 175 175 180 185 185 185 185 185 185 185 190 190 190 190 190 195 195 195 195 200 205 207 210 211 230

   [Dotplot of the hitter weights (C22); axis from 150 to 225]

6. Plot your CI's for the last two problems on the same real number line. What do you conclude about the true median weights of hitters and pitchers based on this plot?
7. Select one of your textbooks or a novel that you are reading. Select a passage at random (not dialogue). Then count the number of words in the first sentence of the passage. Record this number. Repeat this for 30 sentences. This is your sample of size 30. Dotplot your data and describe the shape. Determine the sample median. Next obtain a 95% confidence interval for the true median sentence length. Locate the interval on your dotplot. What does it mean?
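For problem 1, the step from drawn item numbers to a resample can be sketched in Python. This is an assumption made for illustration (the class resampling code is not reproduced here), and the variable names are hypothetical. Item numbers run from 1 to 8, so item number `i` picks out the `i`-th value of the data set.

```python
import random
from statistics import median

data = [77, 79, 81, 91, 106, 114, 126, 132]

# Item numbers from the example in problem 1 (1 = first item, 8 = last).
drawn = [6, 5, 4, 4, 4, 5, 8, 6]
resample = [data[i - 1] for i in drawn]
print(resample)          # [114, 106, 91, 91, 91, 106, 132, 114]
print(median(resample))  # 106.0

# Draw a fresh set of 8 item numbers with replacement (min = 1, max = 8).
new_drawn = [random.randint(1, len(data)) for _ in range(len(data))]
new_resample = [data[i - 1] for i in new_drawn]
print(new_drawn, new_resample, median(new_resample))
```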