Let's pick on the mean, \(\mu\).
That is, we have a population with unknown mean \(\mu\).
So we take a random sample of size \(n\) from this distribution,
\(X_1, X_2, \ldots, X_n\). Then our estimate of \(\mu\)
is the sample average \(\bar{X}\).
Income Example: Suppose we take a sample of 25 students from Smith University and record their family incomes. Suppose the incomes (in thousands of dollars) are:
28 29 35 42 42 44 50 52 54 56 59 78 84 90 95 101 108 116 121 122 133 150 158 167 235
The data have been sorted. So the lowest income is $28,000 and the highest income is $235,000. The average (either add up all the numbers or use the summary module) is 89.96, i.e., about $90,000. So now we need to determine how much our estimate missed by.
In general, our estimate of \(\mu\) is \(\bar{X}\). And we know something about the distribution of \(\bar{X}\). The Central Limit Theorem tells us that the distribution of \(\bar{X}\) is approximately normal with mean \(\mu\) (the population mean) and standard deviation \(\sigma/\sqrt{n}\) (\(\sigma\) is the population standard deviation). By the empirical rule, 95% of the time \(\bar{X}\) falls in the interval \(\mu - 1.96\,\sigma/\sqrt{n}\) to \(\mu + 1.96\,\sigma/\sqrt{n}\) (1.96 is more accurate than the 2 which we have been using). A picture of it is seen in Figure 7.1.
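If you want to see the Central Limit Theorem claim in action, here is a small simulation sketch in R. Everything in it is an assumption made for illustration and is not part of the notes: the population is taken to be exponential with mean 1 and standard deviation 1 (a skewed distribution), and the names `xbars`, `half_width`, and `inside` are made up here.

```r
# Simulate the sampling distribution of the sample mean (illustrative sketch).
# Assumption: exponential population with mean 1 and sd 1; not from the notes.
set.seed(1)
n     <- 25      # sample size
reps  <- 10000   # number of simulated samples
mu    <- 1       # population mean of Exp(rate = 1)
sigma <- 1       # population standard deviation of Exp(rate = 1)

# Each column of the matrix is one sample; colMeans gives one xbar per sample.
xbars <- colMeans(matrix(rexp(n * reps, rate = 1), nrow = n))

# Fraction of sample means that land within 1.96 * sigma / sqrt(n) of mu.
half_width <- 1.96 * sigma / sqrt(n)
inside     <- mean(abs(xbars - mu) < half_width)

mean(xbars)  # should be close to mu
sd(xbars)    # should be close to sigma / sqrt(n) = 0.2
inside       # should be close to 0.95
```

Even though the population here is far from normal, the fraction of sample averages inside the interval should come out near 95%.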
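We need an interval which we are fairly confident contains \(\mu\).
The interval in the above plot,
\[ \left( \mu - 1.96\frac{\sigma}{\sqrt{n}},\; \mu + 1.96\frac{\sigma}{\sqrt{n}} \right), \]
contains \(\bar{X}\) 95% of the time. Its endpoints are the 2.5 and 97.5 percentiles
of the distribution of \(\bar{X}\). But we can't use it because we don't know \(\mu\).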
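Well, if you don't know it, estimate it. Ignoring for the moment that \(\sigma\) is unknown too,
consider the interval
\[ \left( \bar{X} - 1.96\frac{\sigma}{\sqrt{n}},\; \bar{X} + 1.96\frac{\sigma}{\sqrt{n}} \right). \]
Oddly enough, this interval works. When will this interval not cover \(\mu\)? If
\(\bar{X} < \mu - 1.96\,\sigma/\sqrt{n}\),
then the right side of the interval, \(\bar{X} + 1.96\,\sigma/\sqrt{n}\),
will be less than \(\mu\).
This will happen 2.5% of the time. If
\(\bar{X} > \mu + 1.96\,\sigma/\sqrt{n}\),
then the left side of the interval, \(\bar{X} - 1.96\,\sigma/\sqrt{n}\),
will be greater than \(\mu\).
This will happen 2.5% of the time. If these
two things don't occur then the interval
will contain \(\mu\).
That is, this interval will contain \(\mu\)
95% of the time.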
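What's that? We don't know \(\sigma\)
so we can't use the interval!
That's right. We will replace \(\sigma\)
by the sample standard deviation \(s\).
Thus the interval we will use is:
\[ \left( \bar{X} - 1.96\frac{s}{\sqrt{n}},\; \bar{X} + 1.96\frac{s}{\sqrt{n}} \right). \]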
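The interval \(\bar{X} \pm 1.96\, s/\sqrt{n}\) is easy to wrap up as a small R function. This is only a sketch: `ci_mean` is a hypothetical helper name (it is not part of Rweb or base R), and the four data values are made up for illustration.

```r
# Large-sample 95% confidence interval for a mean: xbar +/- 1.96 * s / sqrt(n).
# ci_mean is a hypothetical helper name, not a function from Rweb or base R.
ci_mean <- function(x, z = 1.96) {
  n    <- length(x)
  xbar <- mean(x)
  se   <- sd(x) / sqrt(n)  # sd() uses the n - 1 divisor, same as var(x)^.5
  c(lower = xbar - z * se, upper = xbar + z * se)
}

# Made-up data, for illustration only.
ci_mean(c(2, 4, 6, 8))
```

The multiplier is left as an argument `z` so that a different confidence level could be plugged in; the default 1.96 matches the 95% interval above.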
Income Example: Let's apply this to the income example. Recall that the data are:
28 29 35 42 42 44 50 52 54 56 59 78 84 90 95 101 108 116 121 122 133 150 158 167 235
Recall the average income is 89.96. The sample standard deviation is (either do it by hand or check the numerical summaries button in the summary module):
Rweb:> # STANDARD DEVIATION of x
Rweb:> var(x)^.5
[1] 51.68
Hence \(s = 51.68\). Note that for the interval we actually need \(s/\sqrt{n}\), which is called the Standard Error of the Mean: \(s/\sqrt{n} = 51.68/\sqrt{25} = 10.33\). So the interval we want is:
(89.96 - 1.96*10.33, 89.96 + 1.96*10.33)
(69.71, 110.21)
So we estimate the mean family income of a Smith University student to be between $69,710 and $110,210. Our error of estimation is 1.96*10.33 = 20.25; i.e., $20,250. That seems like a lot. How can we reduce the error of estimation? A larger sample size; i.e., as \(n\) gets larger, \(s/\sqrt{n}\) gets smaller.
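Here is a sketch of the same income calculation done directly in R, using only base functions. The variable names (`x`, `se`, `err`, `ci`, `n_grid`) are chosen here for illustration. Because R carries more decimal places than the hand calculation, the endpoints can differ from the ones above in the second decimal. The last two lines show what the error of estimation would be at larger sample sizes, under the assumption that \(s\) stays about the same.

```r
# Family incomes (in thousands of dollars) from the income example.
x <- c(28, 29, 35, 42, 42, 44, 50, 52, 54, 56, 59, 78, 84, 90, 95,
       101, 108, 116, 121, 122, 133, 150, 158, 167, 235)

n    <- length(x)   # sample size, 25
xbar <- mean(x)     # sample average
s    <- var(x)^.5   # sample standard deviation
se   <- s / sqrt(n) # standard error of the mean
err  <- 1.96 * se   # error of estimation

ci <- c(xbar - err, xbar + err)
ci

# Error of estimation at larger sample sizes, ASSUMING s stays the same.
n_grid <- c(25, 100, 400)
1.96 * s / sqrt(n_grid)
```

Quadrupling the sample size cuts the error of estimation in half, since \(n\) enters through \(\sqrt{n}\).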
Interpretation. What is this interval? One way of thinking about
it is: the probability that the random interval
\[ \left( \bar{X} - 1.96\frac{s}{\sqrt{n}},\; \bar{X} + 1.96\frac{s}{\sqrt{n}} \right) \]
contains \(\mu\) is .95. What the heck does this mean? Think of it this way. This interval is the result of a Bernoulli
trial with probability of success .95. In practice, we have only one sample
and one interval. It will either catch \(\mu\)
or not. But it is the
outcome of a Bernoulli trial with probability of success .95. Hence, we
are fairly confident of success. So we call it a 95% confidence interval.
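The Bernoulli trial idea can be checked by simulation: draw many samples from a population whose mean we know, build the interval from each, and count how often it catches \(\mu\). The sketch below assumes a normal population with mean 90 and standard deviation 50; these values, like the names `covered` and `hw`, are made up for illustration.

```r
# Each simulated sample is one Bernoulli trial: did the interval catch mu?
# Assumption: normal population with mean 90 and sd 50 (illustrative values).
set.seed(2)
n     <- 25
reps  <- 10000
mu    <- 90
sigma <- 50

covered <- replicate(reps, {
  x  <- rnorm(n, mean = mu, sd = sigma)
  hw <- 1.96 * sd(x) / sqrt(n)             # half-width uses s, not sigma
  (mean(x) - hw < mu) && (mu < mean(x) + hw)
})

mean(covered)  # proportion of intervals that contain mu
```

The proportion should land close to .95. With \(n\) as small as 25 it typically comes out a little below .95, because \(s\) is only an estimate of \(\sigma\).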
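Other Remarks. There are two approximations in our confidence interval: the Central Limit Theorem only says that \(\bar{X}\) is approximately normal, and we replaced \(\sigma\) by its estimate \(s\).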
A final remark of considerable importance: The endpoints of our confidence interval are estimates of the 2.5 and 97.5 percentiles of the distribution of \(\bar{X}\), the estimator. This will be very important in the section after next.
10 12 16 18 24
Do this one by hand. The sample mean and standard deviation are easy to get: \(\bar{x} = 16\) and \(s = \sqrt{30} \approx 5.48\).
76 87 98 102 111 114 115 115 120 126
First boxplot the data. Next mark the sample average and the endpoints of the confidence interval on the plot. Here's some output from the summary module to do the confidence interval:
Rweb:> summary(variables)
       x
 Min.   : 76.0
 1st Qu.: 99.0
 Median :112.5
 Mean   :106.4
 3rd Qu.:115.0
 Max.   :126.0
Rweb:> # STANDARD DEVIATION of x
Rweb:> var(x)^.5
[1] 15.5863
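If you would rather have R draw the picture, the following is one possible sketch for the data set just above, using base graphics only. The layout choices (a horizontal boxplot, a solid line for the sample average, dashed lines for the interval endpoints) are assumptions made here, not something the summary module does for you.

```r
# Boxplot with the sample average and 95% confidence interval marked on it.
x <- c(76, 87, 98, 102, 111, 114, 115, 115, 120, 126)

xbar <- mean(x)
se   <- var(x)^.5 / sqrt(length(x))          # standard error of the mean
ci   <- c(xbar - 1.96 * se, xbar + 1.96 * se)

boxplot(x, horizontal = TRUE, xlab = "x")
abline(v = xbar, lty = 1)  # solid line: sample average
abline(v = ci,   lty = 2)  # dashed lines: endpoints of the interval
```

The same few lines work for any of the data sets in this section once `x` is replaced.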
6 8 14 30 31 32 51 57 87 87 109 145 156 171 342
First boxplot the data. Next mark the sample average and the endpoints of the confidence interval on the plot. Here's the output from the summary module to do the confidence interval:
Rweb:> summary(variables)
       x
 Min.   :  6.0
 1st Qu.: 30.5
 Median : 57.0
 Mean   : 88.4
 3rd Qu.:127.0
 Max.   :342.0
Rweb:> # STANDARD DEVIATION of x
Rweb:> var(x)^.5
[1] 88.8005
141 145 145 146 142 126 144 146 154 149 143 131
(Ans: ).
134 132 126 134 131 130 130 125 132 126
(Ans: ).