Following is the printout for running regression in Excel.
*********************************************************************************
SUMMARY OUTPUT
Regression Statistics
Multiple R 0.640955792
R Square 0.410824328
0.351906761
Standard Error 3283.378116
Observations 12
ANOVA
df SS MS F Significance F
Regression 1 75171487.75 75171488 6.972866 0.024707725
Residual 10 107805718.5 10780572
total 11 182977206.3
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 8136.150298 1519.8802 5.353152 0.000322 4749.645588 11522.65501
miles -0.051269039 0.0194155 -2.64062 0.024708 -0.09452957 -0.00800851
*********************************************************************************
In the previous section, we noted that 41% of this total price variability is explained by the car's mileage (since R2=SSR/SSTo=.41). This `R Square' is reported on the second line of the printout under `Regression Statistics'.
The correlation between Miles and Price (X and Y) is r=.640955792. The square of this value is r2=(.640955792)2=.410824328. This number should look familiar, because it is ALSO the value of SSR/SSTo, which explains why SSR/SSTo is called `R Square'. Note the use of upper case `R' instead of lower case `r' in the Excel printout. Use of the upper case R is standard notation in regression studies, while lower case r tends to be used in correlation studies.
Now move down to the Analysis of Variance (ANOVA) table . It contains three rows; "Regression" (or R for short), "Residual" (or E), and "Total" (or To). The sum of squares column (denoted "SS") gives SSR, SSE, and SSTo respectively. Mathematically, SSR and SSE are not comparable in size because they contain different amounts of information, also called "degrees of freedom" or simply "df". This is where the degrees of freedom column comes in. We have SSE=107.8M, but it contains 10 pieces of information, which averages to 10.78M per piece. On the other hand, SSR=75.1M contains only 1 piece of information, so the average is 75.1M per piece. These "average sum of squares per df" are called "mean squares" and reported under the MS column. Note that MS for Regression is 6.97 times the MS for Residual. This may be interpreted as follows. ``Prices of cars vary partly because they have different mileage, right? How much of the variability in price is due to (or explained by) mileage? Answer: the average variation EXPLAINED by Miles is 6.97 times larger than UNEXPLAINED .'' Is 6.97 statistically significant? The answer is Yes, because there is only a 2.47% chance of getting a ratio that large from pure chance variation. Since this P-value is smaller than 5%, the result is statistically significant.
A generic representation of the ANOVA table for simple regression is given below. As usual, the sample size is denoted by n. (For the Saturn price data, n=12.) The formulas for SSR, SSE and SSTo have already been discussed previously. MSR and MSE are obtained by division. F is the ratio between MSR and MSE. The P-value, or observed significance level of the F-ratio, is obtained from an F-table (which is beyond the scope of this class).
*****************************************************************
ANOVA
df SS MS F Significance F
Regression 1 SSR MSR=SSR/dfR F=MSR/MSE P-value
Residual n-2 SSE MSE=SSE/dfE
total n-1 SSTo
*****************************************************************
Below the ANOVA table in the Excel regression printout are the estimates of intercept and slope (under the column "coefficients"). Their standard errors are also reported along with a t-ratio, p-value for the t-ratio, and 95% confidence interval. The standard errors are interpreted the usual way, as follows. If another random sample of 12 cars were selected, the computed slope of the regression line would probably change. By how much? Answer: by approximately .019. The t-ratio is the estimate divided by its standard error, and the P-value measures how likely it is from chance alone to get a ratio that large.
Consider the following two statements:
The first statement gives no information on mileage of the car. In the absence of an X-value,
our best guess is the sample mean
.
The standard error for this prediction
is the standard deviation from the mean, which is simply the regular
sample standard deviation
.
The second statement, on the other hand, tells us the car mileage.
The predicted price for a car with 80,000 miles is obtained
from the regression equation as
.
The standard error for this prediction
is the standard deviation from predicted values, which turns out to be
the squareroot of the MSE
.
This is reported in the Excel
printout as "Standard Error" (4th line of printout under `Regression Statistics').
To summarize: if a prediction is based on the average, the appropriate give-or-take
is the sample SD. If a prediction is based on a regression line, the
appropriate give-or-take is the squareroot of the MSE.