
   
Comparing the Averages of Two Independent Samples

Is there "grade inflation" at WMU? How does the average GPA of WMU students today compare with, say, 10 years ago? Suppose a random sample of 100 student records from 10 years ago yields a sample average GPA of 2.90 with a standard deviation of .40. A random sample of 100 current students yields a sample average of 2.98 with a standard deviation of .45. The difference between the two sample means is 2.98-2.90 = .08. Is this proof that GPAs are higher today than 10 years ago? Well, first we need to account for the fact that 2.98 and 2.90 are not the true averages, but are computed from random samples. Therefore, .08 is not the true difference, but simply an estimate of the true difference. Can this estimate miss by much? Fortunately, statistics has a way of measuring the expected size of the ``miss'' (or error of estimation). For our example, it is .06 (we show how to calculate this later). Therefore, we can state the bottom line of the study as follows: "The average GPA of WMU students today is estimated to be .08 higher than 10 years ago, give or take .06 or so."

We now show how to calculate the .06, the standard error of the estimate. But first, a note on terminology. The estimate .08=2.98-2.90 is a difference between averages (or means) of two independent random samples. "Independent" refers to the sampling luck-of-the-draw: the luck of the second sample is unaffected by the first sample. In other words, there were two independent chances to have gotten lucky or unlucky with the sampling. The likely size of the error of estimation in the .08 is called the standard error of the difference between independent means. We calculate it using the following formula:

 \begin{displaymath}
\mbox{SE of} \;\; (\overline{X}_1 - \overline{X}_2) = \sqrt{ \mbox{SE}_1^2 + \mbox{SE}_2^2}
\end{displaymath} (7.4)

where $\mbox{SE}_1=S_1/\sqrt{n_1}$ and $\mbox{SE}_2=S_2/\sqrt{n_2}$.

Note that $\mbox{SE}_1$ and $\mbox{SE}_2$ are the SE's of $\overline{X}_1$ and $\overline{X}_2$, respectively. The formula is easier to follow with the numbers from our example in place of the notation and subscripts. The sample mean 2.98 has standard error $.45/\sqrt{100}=.045$ (since SE $=S/\sqrt{n}$). Similarly, the sample mean 2.90 has standard error $.40/\sqrt{100}=.040$. Summarizing, we write the two mean estimates (with their SE's in parentheses) as

2.98 (SE=.045)
2.90 (SE=.040)
If two independent estimates are subtracted, formula (7.4) shows how to compute the SE of the difference:
2.98 - 2.90 (SE= $\sqrt{(.045)^2 + (.040)^2}$)
or .08 $\pm$ .06.

Remember the Pythagorean Theorem in geometry? Think of the two SE's as the lengths of the two legs of a right triangle (call them a and b). The SE of the difference then equals the length of the hypotenuse (SE of difference = $\sqrt{ a^2 + b^2}$).
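The Pythagorean picture can be checked numerically. The following is a minimal Python sketch, not part of the original text; the function names `se_mean` and `se_difference` are made up for illustration. It uses `math.hypot`, which returns the hypotenuse length $\sqrt{a^2+b^2}$.

```python
import math

def se_mean(s, n):
    """Standard error of a sample mean: S / sqrt(n)."""
    return s / math.sqrt(n)

def se_difference(se1, se2):
    """SE of the difference between two independent estimates.

    math.hypot(a, b) computes sqrt(a**2 + b**2), the hypotenuse
    of a right triangle with legs a and b.
    """
    return math.hypot(se1, se2)

# Grade inflation example: S = .45 and .40, n = 100 in each sample.
se_today = se_mean(0.45, 100)
se_past = se_mean(0.40, 100)
se_diff = se_difference(se_today, se_past)

print(f"SE today = {se_today:.3f}")
print(f"SE past  = {se_past:.3f}")
print(f"SE diff  = {se_diff:.3f}")
```

The SE of the difference comes out at roughly .06, the "give or take" figure quoted for the .08 estimate.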

We are now ready to state a confidence interval for the difference between two independent means.

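\fbox{ \parbox{5.5in}{
\vspace*{1ex}
{\bf $z$-Confidence Interval for the Difference Between Two Independent Means}

\begin{displaymath}
(\overline{X}_1 - \overline{X}_2) \pm z \; \sqrt{ \frac{S_1^2}{n_1} + \frac{S_2^2}{n_2} }
\end{displaymath}

where $z$ is the critical value for the desired confidence level. This interval is appropriate when $n_1$\space and $n_2$\space are large,
preferably at least 30.
\vspace*{1ex}
} }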

The correct z critical value for a 95% confidence interval is z=1.96. Therefore a 95% z-confidence interval for $\mu_1-\mu_2$ is

\begin{displaymath}2.98 - 2.90 \pm 1.96 \; \sqrt{(.045)^2 + (.040)^2}
\end{displaymath}

or (-.04, .20).
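The same interval can be reproduced in a few lines of Python. This is an illustrative sketch only: the function name `z_interval` and its parameter names are assumptions, and the critical value is taken from the standard library's `statistics.NormalDist` rather than from a table.

```python
import math
from statistics import NormalDist

def z_interval(mean1, s1, n1, mean2, s2, n2, confidence=0.95):
    """Large-sample z-confidence interval for mu_1 - mu_2."""
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    # SE of the difference: sqrt(SE_1^2 + SE_2^2).
    se = math.sqrt(s1**2 / n1 + s2**2 / n2)
    diff = mean1 - mean2
    return diff - z * se, diff + z * se

# Grade inflation example: today's sample first, then 10 years ago.
low, high = z_interval(2.98, 0.45, 100, 2.90, 0.40, 100)
print(f"95% z-interval: ({low:.2f}, {high:.2f})")
```

Because the interval contains 0, these two samples do not rule out the possibility that the true average GPA has not changed.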

There is a second procedure that is preferable when either $n_1$ or $n_2$ or both are small. However, this method requires two additional conditions to be satisfied (at least approximately):



Requirement R1: Both samples follow a normal-shaped histogram
Requirement R2: The population SD's $\sigma_1$ and $\sigma_2$ are equal.

Let $S_p$ denote a ``pooled'' estimate of the common SD, as follows:

\begin{displaymath}S_p= \sqrt{ \frac{ (n_1-1)S_1^2+(n_2-1)S_2^2}{ n_1+n_2-2 }}
\end{displaymath}

The following confidence interval is called a ``Pooled SD'' or ``Pooled Variance'' confidence interval.

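\fbox{ \parbox{5.5in}{
\vspace*{1ex}
{\bf $t$-Confidence Interval for the Difference Between Two Independent Means (Pooled SD)}

\begin{displaymath}
(\overline{X}_1 - \overline{X}_2) \pm t \; \sqrt{ \frac{S_p^2}{n_1} + \frac{S_p^2}{n_2} }
\end{displaymath}

where $t$ is the critical value from the t curve with $n_1+n_2-2$ degrees of freedom. This interval works best when both requirements R1 and R2 are satisfied.
\vspace*{1ex}
} }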

Returning to the grade inflation example, the pooled SD is

\begin{displaymath}S_p= \sqrt{ \frac{ (100-1)(.45^2) + (100-1)(.40^2)}{100+100-2}} = .426.
\end{displaymath}

Therefore, $\mbox{SE}_1=.426/\sqrt{100}=.0426$, $\mbox{SE}_2=.426/\sqrt{100}=.0426$, and the difference between means is estimated as

\begin{displaymath}2.98 - 2.90 \pm t \;\sqrt{(.0426)^2 + (.0426)^2}
\end{displaymath}

where the square root is the standard error of the difference. For a 95% confidence interval, the appropriate value from the t curve with 198 degrees of freedom is approximately 1.97, nearly the same as the z value of 1.96 because the degrees of freedom are large. Therefore a t-confidence interval for $\mu_1-\mu_2$ with confidence level .95 is

\begin{displaymath}2.98 - 2.90 \pm 1.97\; \sqrt{(.0426)^2 + (.0426)^2}
\end{displaymath}

or (-.04, .20).
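The pooled calculation can be sketched in Python as follows. The function name `pooled_t_interval` is hypothetical. Python's standard library has no t distribution, so this sketch assumes the critical value is looked up in a t table and passed in as `t_crit`; for 198 degrees of freedom that value is about 1.97.

```python
import math

def pooled_t_interval(mean1, s1, n1, mean2, s2, n2, t_crit):
    """Pooled-SD t-confidence interval for mu_1 - mu_2.

    t_crit is the critical value from the t curve with
    n1 + n2 - 2 degrees of freedom, supplied by the caller.
    Returns (low, high, pooled SD, degrees of freedom).
    """
    df = n1 + n2 - 2
    # Pooled estimate S_p of the common SD.
    sp = math.sqrt(((n1 - 1) * s1**2 + (n2 - 1) * s2**2) / df)
    # SE of the difference, with S_p in place of S_1 and S_2.
    se = math.sqrt(sp**2 / n1 + sp**2 / n2)
    diff = mean1 - mean2
    return diff - t_crit * se, diff + t_crit * se, sp, df

# Grade inflation example; 1.97 is the t-table value assumed for 198 df.
low, high, sp, df = pooled_t_interval(2.98, 0.45, 100, 2.90, 0.40, 100, t_crit=1.97)
print(f"pooled SD = {sp:.3f}, df = {df}")
print(f"95% t-interval: ({low:.2f}, {high:.2f})")
```

With samples this large, the pooled t-interval and the z-interval agree to two decimal places.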

Note that the t-confidence interval (7.8) with pooled SD looks like the z-confidence interval (7.7), except that $S_1$ and $S_2$ are replaced by $S_p$, and z is replaced by t. The following table summarizes the situations in which each method is recommended.

\begin{tabular}{l|l|l}
 & R1 and R2 are both satisfied & R1 or R2 or both not satisfied \\ \hline
Both samples are large & Use z or t & Use z \\
One or both samples small & Use t & Consult a statistician
\end{tabular}
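The recommendations above can also be written as a small lookup function. This is only a sketch: the name `recommended_method` is made up, and the cutoff of 30 for a "large" sample is an assumption taken from the "preferably at least 30" guideline, not a hard rule.

```python
def recommended_method(n1, n2, r1_ok, r2_ok, large=30):
    """Return the recommended interval method for two independent samples.

    r1_ok: both samples look roughly normal (Requirement R1).
    r2_ok: the population SDs are roughly equal (Requirement R2).
    large: assumed cutoff for a "large" sample (30 is a rule of thumb).
    """
    both_large = n1 >= large and n2 >= large
    requirements_ok = r1_ok and r2_ok
    if both_large:
        return "z or t" if requirements_ok else "z"
    return "t" if requirements_ok else "consult a statistician"

# Grade inflation example: two samples of 100.
print(recommended_method(100, 100, r1_ok=True, r2_ok=True))
# A small-sample case where the requirements are in doubt.
print(recommended_method(12, 15, r1_ok=False, r2_ok=True))
```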



2003-09-08