Recall from Section 7.2 that the endpoints of the confidence interval
for the population mean are actually estimates of the approximate
(Central Limit Theorem) 2.5th and 97.5th percentiles of the
distribution of the sample mean.
We will use this idea for our bootstrap confidence
intervals.
Consider the population median. We will obtain a bootstrap confidence interval for the median. This is often a
parameter of interest because it divides the population in half. That is,
half the time a population item is less than the median
and half the time a population item is greater than the median.
For instance,
suppose the population of interest is the income of an American family.
If you knew the median income,
then you would know whether your family is
in the bottom half or top half of American families when it comes to income.
As another example, suppose you were doing research on a new battery
to power an electric automobile. Suppose the median lifetime in miles of the current
battery is 300 miles. You feel that the new battery is a vast improvement.
The parameter of interest is the median lifetime in miles of your new battery.
You don't know this median,
but you would like to show it is an
improvement; i.e., that it is over 300 miles. How would you investigate this?
It's easy, right? You don't know the population (the lifetime in miles of a typical new battery), so you will have to use a sample. So you select 20 new batteries and put them on test. WAIT!!! It is extremely important that the batteries are selected:
These assumptions are very important and they have to be followed. You
can see why we are often dealing with small samples. In this case, we are
destroying the battery when we sample it (you can recharge it, but a recharged
battery is not in the population of interest!). How long a recharged battery
lasts is of interest, but in the present experiment we are not measuring
the effect of recharging. This may be a later experiment. Also, we are
doing research on the battery, so we may be tempted to make modifications
to the battery as we sample.
Nope, not allowed, for this violates assumption 2. (In certain situations this can be done, but it is a much different experimental design; see the section on regression design.)
Continuing with our example, suppose you do select 20 new batteries at random and put them on test (20 cars of the same type are selected, one of the new batteries is installed in each car, and they are driven over the same route). Suppose the (sorted) lifetimes of the batteries in miles are:
196 204 233 256 258 313 315 322 403 408 483 510 538 559 586 722 806 875 930 1192
A dotplot of the sample is:
         .
   : .:  :   :   . ....      .   .   .  .            .
---+---------+---------+---------+---------+---------+---C1
  200       400       600       800       1000      1200
The dotplot is skewed right and shows a lot of noise. The sample median is Q2 = 445.5, the average of the 10th and 11th sorted lifetimes (408 and 483).
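The sample median can be checked directly from the 20 sorted lifetimes. The following is a minimal Python sketch; the variable names `lifetimes` and `q2` are chosen here for illustration and are not from the original text.

```python
from statistics import median

# Sorted battery lifetimes (miles) from the sample of 20 given above.
lifetimes = [196, 204, 233, 256, 258, 313, 315, 322, 403, 408,
             483, 510, 538, 559, 586, 722, 806, 875, 930, 1192]

# With an even sample size, the median is the average of the two
# middle sorted values (here the 10th and 11th: 408 and 483).
q2 = median(lifetimes)
print(q2)  # 445.5
```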
We of course need to know the distribution of Q2,
but we don't. Next, how about a Central Limit Theorem from which we could
obtain the approximate distribution of Q2? Such theorems
exist, but the approximate standard error of Q2 is not
easy to estimate. How about estimating the 2.5th and 97.5th
percentiles of the distribution of Q2? Hey, now you are
cooking!
Okay, we need the distribution of Q2. But we don't
know it. We could do it this way, though. Simply do the experiment over
and over, say, 1000 times. For each of those times, calculate Q2.
Form a histogram of these 1000 Q2's and pick off the
2.5th and 97.5th percentiles. (This is the same as sorting
the 1000 Q2's and selecting the 25th and 976th
sorted Q2's.)
Back to the battery experiment! We just have to do 1000 experiments.
WHAT!!!
That's 20 × 1000 = 20,000
batteries that we need. Ridiculous. Of course, we can't do
this.
So what's to do? Louder, I can't hear you. You got it!!!
Resample the sample 1000 times. For each resample calculate the sample median.
Form a histogram of these 1000 Q2's and pick off the 2.5th
and 97.5th percentiles. (This is the same as sorting the 1000 Q2's
and selecting the 25th and 976th sorted Q2's.) These
percentiles do indeed estimate the percentiles of the true distribution
of Q2. It's such a simple idea and it works. You simply
resample the sample. To ensure independence, you resample with replacement
and you use the same sample size.
For example, here is a resample of the sample of lifetimes of the batteries (I have sorted them):
196 196 204 204 204 204 256 258 315 315 315 322 483 538 559 559 806 875 875 930
The sample median is 315. Here is another resample:
196 204 256 256 256 313 313 403 483 510 538 538 538 538 538 586 722 806 875 1192
The sample median is 524. We only have 998 more resamples to go. Of course, the computer will do this for us.
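The resampling procedure described above can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the course's own resampling code: the function name `bootstrap_median_ci`, the `seed` argument, and the rule for scaling the 25th/976th positions to other resample counts are all introduced here.

```python
import random
from statistics import median

# Sorted battery lifetimes (miles) from the sample of 20.
lifetimes = [196, 204, 233, 256, 258, 313, 315, 322, 403, 408,
             483, 510, 538, 559, 586, 722, 806, 875, 930, 1192]

def bootstrap_median_ci(sample, n_resamples=1000, seed=None):
    """Percentile bootstrap interval for the median (illustrative sketch)."""
    rng = random.Random(seed)
    n = len(sample)
    # Resample with replacement, same size as the sample, and keep each median.
    medians = sorted(
        median(rng.choices(sample, k=n)) for _ in range(n_resamples)
    )
    # For 1000 resamples this picks the 25th and 976th sorted medians,
    # as in the text. Scaling k to other counts is an assumption.
    k = max(1, n_resamples * 25 // 1000)
    return medians[k - 1], medians[n_resamples - k]

low, high = bootstrap_median_ci(lifetimes, n_resamples=1000, seed=1)
print(low, high)
```

Because the resamples are random, the endpoints will vary from run to run unless a seed is fixed, and they need not match the interval reported in the text exactly.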
We will do this at two levels: first with 100 resamples, then with 1000 resamples. Here are the 100 sorted resample medians:
286.5 313.0 314.0 314.0 315.0 315.0 317.5 317.5 318.5 318.5
318.5 318.5 318.5 318.5 322.0 358.0 359.0 361.5 361.5 362.5
362.5 362.5 362.5 362.5 362.5 365.0 365.0 365.0 402.5 403.0
403.0 403.0 403.0 403.0 405.5 405.5 405.5 405.5 405.5 405.5
408.0 408.0 408.0 408.0 408.0 408.0 408.0 408.0 412.5 443.0
443.0 445.5 445.5 445.5 445.5 445.5 459.0 459.0 459.0 470.5
483.0 483.0 483.0 483.0 483.0 483.0 483.0 483.0 483.0 483.5
496.5 496.5 496.5 496.5 496.5 496.5 510.0 510.0 510.0 510.0
510.0 510.0 510.5 521.0 524.0 524.0 524.0 524.0 524.0 534.5
534.5 538.0 538.0 538.0 548.5 548.5 559.0 559.0 572.5 572.5
Hence my confidence interval is (314, 559). This is only based on 100 resamples, so let's use the terminology: we are fairly confident that the true population median is between 314 and 559. Note that the interval does not include 300.
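As a check on the interval (314, 559), the sketch below rebuilds the 100 sorted medians from (value, count) pairs and reads off two positions. Which positions the author used is not stated. Taking the 3rd and 98th sorted values is an assumption made here that happens to reproduce the interval (the 4th and 97th would give the same endpoints).

```python
# (value, count) pairs transcribed from the sorted list of 100 resample medians.
counts = [
    (286.5, 1), (313.0, 1), (314.0, 2), (315.0, 2), (317.5, 2), (318.5, 6),
    (322.0, 1), (358.0, 1), (359.0, 1), (361.5, 2), (362.5, 6), (365.0, 3),
    (402.5, 1), (403.0, 5), (405.5, 6), (408.0, 8), (412.5, 1), (443.0, 2),
    (445.5, 5), (459.0, 3), (470.5, 1), (483.0, 9), (483.5, 1), (496.5, 6),
    (510.0, 6), (510.5, 1), (521.0, 1), (524.0, 5), (534.5, 2), (538.0, 3),
    (548.5, 2), (559.0, 2), (572.5, 2),
]

# Expand the pairs back into the full sorted list of medians.
medians = [value for value, count in counts for _ in range(count)]

# Assumed positions: 3rd and 98th sorted medians (1-based).
lower = medians[3 - 1]
upper = medians[98 - 1]
print(len(medians), lower, upper)  # 100 314.0 559.0
```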
For the practical confidence interval based on 1000 resamples, I got
the interval (314, 572.5). This is based on 1000 resamples, so we
will use the terminology:
we are 95% confident that the true
population median is between 314 and 572.5. Note that the interval does
not include 300. Because all values in the interval exceed 300, we are confident that the new battery is an improvement. Is it a practical improvement? This is a question for the engineers to determine.
Next is a dotplot of the sample showing the location (X's) of the confidence interval:
         .
   : .:  :   :   . ....      .   .   .  .            .
---+----X----+--------X+---------+---------+---------+---C1
  200       400       600       800       1000      1200
A dotplot of the 1000 resample medians is given by
[Dotplot of the 1000 resample medians; each dot represents 14 points; horizontal axis labeled from 240 to 840.]
It is fairly symmetric in the middle with an obvious tail to the right.
It shows a central limit effect as we have seen with the sample mean.
77 79 81 91 106 114 126 132
Obtain 5 resamples of this data set using the previous resampling code which we used for probability. (Just sample with replacement the numbers: min = 1, max = 8, trials = 5, numbers to be drawn = 8. These are the sample item numbers for the resample.) For example, if the numbers you draw are:
6 5 4 4 4 5 8 6
then your resample is: 114, 106, 91, 91, 91, 106, 132, 114
Obtain 4 more resamples. Calculate the median of each. Compare your resampled medians with the sample median.
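The item-number approach in this exercise can be sketched in Python as follows. This is not the course's resampling code; the helper name `resample_by_item_number` and the fixed seed are assumptions made for illustration.

```python
import random
from statistics import median

# The 8 data values from the exercise, in the order given.
data = [77, 79, 81, 91, 106, 114, 126, 132]

def resample_by_item_number(sample, item_numbers):
    """Map 1-based item numbers to the corresponding sample values."""
    return [sample[i - 1] for i in item_numbers]

# The worked example from the text: item numbers 6 5 4 4 4 5 8 6.
example = resample_by_item_number(data, [6, 5, 4, 4, 4, 5, 8, 6])
print(example)          # [114, 106, 91, 91, 91, 106, 132, 114]
print(median(example))  # 106.0
print(median(data))     # 98.5 (the sample median, for comparison)

# Four more resamples: draw 8 item numbers between 1 and 8, with replacement.
rng = random.Random(0)  # fixed seed only so the run is repeatable
more = []
for _ in range(4):
    item_numbers = [rng.randint(1, 8) for _ in range(len(data))]
    more.append(resample_by_item_number(data, item_numbers))
for resample in more:
    print(resample, median(resample))
```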
160 175 180 185 185 185 190 190 195 195 195
200 200 200 200 205 205 210 210 218 219 220
222 225 225 232
                    .      .  :
    .         .  .  :   :  :  :   :  :    .... :    .
-------+---------+---------+---------+---------+---------C21
      165       180       195       210       225       240
155 155 160 160 160 166 170 175 175 175 180
185 185 185 185 185 185 185 190 190 190 190
190 195 195 195 195 200 205 207 210 211 230
                        .
                        :   .
        .         .     :   :  :
    :   :   . .   :  .  :   :  :  .   .. ..           .
 +---------+---------+---------+---------+---------+-------C22
150       165       180       195       210       225