
More on Simple Regression

How well does the prediction model work? We now discuss statistics that help us measure the predictive performance of the model.

Consider Car 7, which has Miles=65940 and Price=$8800. In the scatterplot, it is plotted as $(X,\; Y) = (65940,\; 8800)$. The regression line says that a car with X=65940 has the predicted Y value

\begin{displaymath}\mbox{PREDICTED Y} = \$8136 - .05127 (65940) = \$4755.47
\end{displaymath}

The observed value Y=$8800 is much higher than PREDICTED Y=$4755.47. How much higher? The difference is $8800-$4755.47 = $4044.53. This difference is called the residual.

Residual = Y - PREDICTED Y

Here is the complete table of values of the explanatory variable X, the observed values of Y, the predicted values PRED, and the residuals RES.


              Car  Miles(X) Price(Y)  PRED    RES=Y-PRED
           
               1     9300    7100    7659.35    -559.35
               2    10565   15500    7594.49    7905.51
               3    15000    4400    7367.11   -2967.11
               4    15000    4400    7367.11   -2967.11
               5    17764    5900    7225.41   -1325.41
               6    57000    4600    5213.81    -613.81
               7    65940    8800    4755.47    4044.53
               8    73676    2000    4358.85   -2358.85
               9    77006    2750    4188.13   -1438.13
              10    93739    2550    3330.24    -780.24
              11   146088     960     646.36     313.64
              12   153260    1025     278.66     746.34
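
The PRED and RES columns can be reproduced, up to rounding, by refitting the least squares line to the Miles and Price columns. The Python sketch below is an illustration only: the helper name fit_line and the variable names are not from the original notes, and the fitted coefficients carry more digits than the rounded values 8136 and -.05127 quoted earlier.

```python
# Miles (X) and Price (Y) for the 12 cars, copied from the table above.
miles = [9300, 10565, 15000, 15000, 17764, 57000,
         65940, 73676, 77006, 93739, 146088, 153260]
price = [7100, 15500, 4400, 4400, 5900, 4600,
         8800, 2000, 2750, 2550, 960, 1025]


def fit_line(x, y):
    """Return (intercept a, slope b) of the least squares line y = a + b*x.

    Hypothetical helper written for this illustration.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    b = sxy / sxx
    a = y_bar - b * x_bar
    return a, b


a, b = fit_line(miles, price)
pred = [a + b * x for x in miles]            # height of the line at each X
res = [y - p for y, p in zip(price, pred)]   # RES = Y - PRED

# Print one row per car: Car, Miles, Price, PRED, RES.
for car, (x, y, p, r) in enumerate(zip(miles, price, pred, res), start=1):
    print(f"{car:3d} {x:8d} {y:7d} {p:9.2f} {r:9.2f}")
```

Plugging the rounded coefficients 8136 and -.05127 directly into the line gives slightly different predictions, which is why the sketch refits the line rather than reusing them.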



In a scatterplot, Y is the height of the point. PREDICTED Y is the height of the line directly above or below that point (denoted by *'s in the scatterplot below). The residuals, which are computed by subtraction (RESIDUAL=Y-PREDICTED Y), tell us how far each point is above or below the line. Points above the regression line have a positive residual. Points below the line have a negative residual.

\epsfig{file=chokregline2.ps, width=5in, angle=-90}

Large residuals (ignoring the signs) indicate observations where the regression line gives poor predictions. For example, Car 7 has a large positive residual while Car 8 has a large negative residual. Based on its mileage, Car 7 has a predicted price of $4755.47; its actual price is $4044.53 higher (a residual of +$4044.53). Car 8 has a predicted price of $4358.85; its actual price is $2358.85 lower (a residual of -$2358.85).

The sum of the squared residuals  

\begin{displaymath}\mbox{SSE} = (Y_1-\mbox{PREDICTED}\;Y_1)^2 + \cdots + (Y_n-\mbox{PREDICTED}\;Y_n)^2
\end{displaymath}

is a measure of `overall size' of the residuals. In the Saturn Price data, $\mbox{SSE} = (-559.35)^2 + \cdots + (746.34)^2 = 107805718$. If a regression line is predicting poorly, there tend to be lots of large (positive and negative) residuals, and hence SSE tends to be large.
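
As a quick check of this sum, the following Python sketch squares and adds the twelve residuals from the RES column. The variable names are illustrative, and because the tabulated residuals are rounded to cents, the result should agree with the reported SSE only up to rounding.

```python
# Residuals RES = Y - PRED, copied from the table (rounded to cents).
residuals = [-559.35, 7905.51, -2967.11, -2967.11, -1325.41, -613.81,
             4044.53, -2358.85, -1438.13, -780.24, 313.64, 746.34]

# SSE is the sum of the squared residuals.
sse = sum(r ** 2 for r in residuals)
print(f"SSE = {sse:,.0f}")
```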

\fbox{ \parbox{5.5in}{
\vspace*{1ex}
A large value of SSE means that the regression line is predicting poorly.
A small value of SSE means that the regression line is predicting well.
\vspace*{1ex}
} }

The regression line predicts best when correlation is high. In the best case scenario where r=1 or r=-1, all the points will fall on a straight line. In this extreme case, all the residuals are 0, and SSE=0.

\fbox{ \parbox{5.5in}{
\vspace*{1ex}
When $r=1$ or $r=-1$, SSE equals 0. This is the best case for
regression prediction.
\vspace*{1ex}
} }
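
The best case can be illustrated with a small made-up data set whose points lie exactly on a line. In the Python sketch below, the data and the helper name sse_of_fit are assumptions introduced only for this illustration.

```python
def sse_of_fit(x, y):
    """Fit the least squares line to (x, y) and return its SSE.

    Hypothetical helper written for this illustration.
    """
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    b = (sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
         / sum((xi - x_bar) ** 2 for xi in x))
    a = y_bar - b * x_bar
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


# Made-up points lying exactly on the line Y = 1 + 2X, so r = 1.
x = [1, 2, 3, 4]
y = [3, 5, 7, 9]

sse = sse_of_fit(x, y)
print(sse)  # every residual is 0, so SSE is 0 (up to floating-point error)
```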

There is a more basic prediction method than regression: this method ignores the X-values altogether, and uses the sample average $\overline{Y}$ to predict each Y. This is called one-sample prediction, in contrast to regression prediction. In one-sample prediction, the errors of prediction are the differences $Y-\overline{Y}$, and their sum of squares is called the Total Sum of Squares, denoted SSTo:

\begin{displaymath}\mbox{SSTo} = (Y_1-\overline{Y})^2 + \cdots + (Y_n-\overline{Y})^2
\end{displaymath}

What happens in regression when r=0, i.e., when the X and Y variables are uncorrelated? In this `worst-case' scenario, the regression slope is computed as $b=r(\mbox{SD}_Y/\mbox{SD}_X) =0$. The computation for the intercept results in $a=\overline{Y} - b \overline{X} = \overline{Y}$. The resulting predicted values are $\mbox{Pred}\;Y= a + b X = \overline{Y}$. Therefore, when X and Y have 0 correlation, regression reduces to one-sample prediction ignoring the X-values, and SSE equals SSTo.

\fbox{ \parbox{5.5in}{
\vspace*{1ex}
When $r=0$, SSE equals SSTo. This is the worst case for
regression prediction.
\vspace*{1ex}
} }
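
The worst case can be checked numerically on a small made-up data set with zero correlation. The Python sketch below uses illustrative data and variable names that are not from the original notes. It fits the least squares line, then compares SSE with SSTo.

```python
# Made-up data with zero correlation between X and Y.
x = [-2, -1, 0, 1, 2]
y = [4, 1, 0, 1, 4]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least squares slope and intercept.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b = sxy / sxx            # slope is 0 because sxy is 0
a = y_bar - b * x_bar    # intercept reduces to the average of Y

# Regression errors versus one-sample errors.
sse = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
ssto = sum((yi - y_bar) ** 2 for yi in y)

print(b, a, sse, ssto)
```

With a zero slope, every predicted value equals the sample average of Y, so the two sums of squares coincide.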

In applications, the typical scenario falls somewhere in the middle: |r| is somewhere between 0 and 1, and SSE is somewhere between 0 and SSTo.



      |__________________|___________________________________|
      0                 SSE                                SSTo 

      <----------------- | ---------------------------------->
       BEST                                             WORST



Is SSE closer to 0 (best case) or to SSTo (worst case)? The distance between SSTo and SSE is a useful measure of the statistical performance of the prediction model. It is calculated by subtraction, and is called SSR:   

\fbox{ \parbox{5.5in}{
\vspace*{1ex}
$\mbox{SSR} = \mbox{SSTo} - \mbox{SSE}$
\vspace*{1ex}
} }

For the Saturn Price data, SSTo=182,977,206 and SSE=107,805,718. The difference between them is SSR=75,171,488.



      x______________________________x_______________________x
      0                         107,805,718             182,977,206
                                  (SSE)                   (SSTo)

                                     |_______________________|
                                              75,171,488
                                                (SSR)
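
The same subtraction can be sketched in Python. Here SSTo is recomputed from the Price column of the table, while SSE is taken as the value reported in the text; the variable names are illustrative.

```python
# Price (Y) for the 12 cars, copied from the table of values.
price = [7100, 15500, 4400, 4400, 5900, 4600,
         8800, 2000, 2750, 2550, 960, 1025]

y_bar = sum(price) / len(price)                # one-sample prediction
ssto = sum((y - y_bar) ** 2 for y in price)    # Total Sum of Squares

sse = 107_805_718                              # SSE as reported in the text
ssr = ssto - sse                               # SSR = SSTo - SSE

print(f"SSTo = {ssto:,.0f}")
print(f"SSR  = {ssr:,.0f}")
```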



Recall that large residuals indicate poor predictions by the straight line model which predicts Y (Price) given X (Miles). The residuals then represent a component of the car price that is not due to, or `not explained by', the car's mileage. Since SSE is a summary measure of residual size, and since SSTo=SSE+SSR, we may think of SSTo as having been partitioned into (i) a component not explained by mileage (SSE) and (ii) a component explained by mileage (SSR). The latter component is often expressed as a proportion or percentage:   

\begin{displaymath}R^2 = \frac{\mbox{SSR}}{\mbox{SSTo}}
\end{displaymath}

For the Saturn Price data, SSTo=182,977,206 and SSR=75,171,488. Since SSR/SSTo=.41, we say that 41% of total price variability is explained by the car's mileage. The remaining 59% is due to something else, probably the condition of the car, power and luxury options, time remaining in the auction, market luck-of-the-draw, etc.



                              SSTo=182,977,206
      |______________________________________________________|
                                                                   
            SSE=107,805,718               SSR=75,171,488
      |______________________________|_______________________|
                 (.59)                          (.41)
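
The two proportions in this partition follow directly from the reported sums of squares. A minimal Python sketch, with illustrative variable names:

```python
ssto = 182_977_206   # Total Sum of Squares, as reported in the text
sse = 107_805_718    # sum of squared residuals, as reported in the text
ssr = ssto - sse     # component explained by mileage

r_squared = ssr / ssto     # proportion explained by mileage
unexplained = sse / ssto   # proportion not explained by mileage

print(f"R^2         = {r_squared:.2f}")
print(f"unexplained = {unexplained:.2f}")
```

Because SSTo=SSE+SSR, the two proportions always add up to 1.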





 

2003-09-08