How well does the prediction model work? We now discuss statistics that help us measure the predictive performance of the model.
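Consider Car 7, which has Miles=65940 and Price=$8800.
In the scatterplot, it is plotted as the point (X=65940, Y=8800).
The regression line says that a car with X=65940 will have
predicted Y value PRED=4755.47.
The difference between the observed and the predicted value is the residual:
Residual = Y - PREDICTED Y = 8800 - 4755.47 = 4044.53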
Here is the complete table of values of the explanatory variable X, the
observed values of Y, the predicted values PRED, and the
residuals RES:
Car   Miles(X)   Price(Y)      PRED   RES=Y-PRED
  1       9300       7100   7659.35      -559.35
  2      10565      15500   7594.49      7905.51
  3      15000       4400   7367.11     -2967.11
  4      15000       4400   7367.11     -2967.11
  5      17764       5900   7225.41     -1325.41
  6      57000       4600   5213.81      -613.81
  7      65940       8800   4755.47      4044.53
  8      73676       2000   4358.85     -2358.85
  9      77006       2750   4188.13     -1438.13
 10      93739       2550   3330.24      -780.24
 11     146088        960    646.36       313.64
 12     153260       1025    278.66       746.34
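The PRED and RES columns can be reproduced from the Miles and Price columns alone. The following is a minimal Python sketch, assuming the regression line is the ordinary least-squares fit to these 12 cars; the variable names are illustrative and do not come from the text.

```python
# Saturn price data from the table: Miles (X) and Price (Y) for the 12 cars.
miles = [9300, 10565, 15000, 15000, 17764, 57000,
         65940, 73676, 77006, 93739, 146088, 153260]
price = [7100, 15500, 4400, 4400, 5900, 4600,
         8800, 2000, 2750, 2550, 960, 1025]

n = len(miles)
x_bar = sum(miles) / n
y_bar = sum(price) / n

# Least-squares slope and intercept (assumption: this is how the line was fitted).
sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(miles, price))
sxx = sum((x - x_bar) ** 2 for x in miles)
slope = sxy / sxx
intercept = y_bar - slope * x_bar

# Predicted price for each car, and residual = observed - predicted.
pred = [intercept + slope * x for x in miles]
res = [y - p for y, p in zip(price, pred)]

print("Car    Miles    Price       PRED        RES")
for car, (x, y, p, r) in enumerate(zip(miles, price, pred, res), start=1):
    print(f"{car:3d} {x:8d} {y:8d} {p:10.2f} {r:10.2f}")
```

The printed rows should agree with the table up to rounding in the last digit.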
In a scatterplot, Y is the height of the point. PREDICTED Y is the height of the line directly above or below that point (denoted by *'s in the scatterplot below). The residuals, which are computed by subtraction (RESIDUAL=Y-PREDICTED Y), tell us how far each point is above or below the line. Points above the regression line have positive residuals. Points below the line have negative residuals.
Large residuals (ignoring the signs) indicate observations where the regression line gives poor predictions. For example, Car 7 has a large positive residual while Car 8 has a large negative residual. Based on its mileage, Car 7 has a predicted price of about $4755; its actual price is about $4045 higher (a residual of +$4044.53). Car 8 has a predicted price of about $4359; its actual price is about $2359 lower (a residual of -$2358.85).
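The sum of the squared residuals is an overall measure of how far the points lie from the regression line. It is called the Error Sum of Squares, denoted SSE:
SSE = sum of (Y - PREDICTED Y)^2, taken over all observations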
The regression line predicts best when correlation is high. In the best case scenario where r=1 or r=-1, all the points will fall on a straight line. In this extreme case, all the residuals are 0, and SSE=0.
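As a rough check on the arithmetic, SSE can be computed directly from the RES column of the table. This sketch only squares and sums the tabulated residuals. Because those residuals are rounded to two decimals, the result should land close to, but not exactly on, the value SSE=107,805,718 reported for these data later in the section.

```python
# Residuals (RES = Y - PRED) for the 12 cars, copied from the table.
residuals = [-559.35, 7905.51, -2967.11, -2967.11, -1325.41, -613.81,
             4044.53, -2358.85, -1438.13, -780.24, 313.64, 746.34]

# SSE: square each residual, then add the squares.
sse = sum(r ** 2 for r in residuals)
print(f"SSE = {sse:,.0f}")

# Best case: if every point lay exactly on the line, every residual would be 0.
perfect_fit_residuals = [0.0] * len(residuals)
print(sum(r ** 2 for r in perfect_fit_residuals))
```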
There is a more basic prediction method than regression: this method
ignores the X-values altogether, and uses the sample average of the Y-values, YBAR,
to predict each Y. This is called one-sample prediction,
in contrast to regression prediction.
In one-sample prediction, the errors of prediction are Y - YBAR,
and their sum of squares is called the
Total Sum of Squares, denoted SSTo:
SSTo = sum of (Y - YBAR)^2, taken over all observations
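One-sample prediction is easy to carry out on the Saturn data. The sketch below uses the Price column of the table, predicts every price by the sample average, and sums the squared prediction errors; the variable names are illustrative.

```python
# Observed prices (Y) for the 12 cars, from the table.
price = [7100, 15500, 4400, 4400, 5900, 4600,
         8800, 2000, 2750, 2550, 960, 1025]

# One-sample prediction: the same predicted value (the average) for every car.
y_bar = sum(price) / len(price)

# Errors of prediction and their sum of squares (SSTo).
errors = [y - y_bar for y in price]
ssto = sum(e ** 2 for e in errors)

print(f"average price = {y_bar:.2f}")
print(f"SSTo = {ssto:,.0f}")
```

Mileage plays no role here: the prediction is the same for a car with 9,300 miles as for one with 153,260 miles.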
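What happens in regression when r=0, i.e. when the X and Y variables are uncorrelated? In this `worst-case' scenario, the regression slope is computed as slope = r x (standard deviation of Y)/(standard deviation of X) = 0! The computation for the intercept results in intercept = (average of Y) - slope x (average of X) = average of Y. The resulting predicted values are PRED = average of Y, for every observation. Therefore when X and Y have 0 correlation, regression reduces to one-sample prediction ignoring the X-values, and SSE equals SSTo.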
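The r=0 case can be illustrated with a small example. The four data points below are hypothetical (they are not from the Saturn data) and were chosen so that X and Y are exactly uncorrelated; the least-squares formulas are an assumption about how the line is fitted.

```python
# Hypothetical data, constructed so that the correlation between x and y is 0.
x = [1, 2, 3, 4]
y = [6, 4, 4, 6]

n = len(x)
x_bar = sum(x) / n
y_bar = sum(y) / n

# Least-squares slope and intercept.
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
slope = sxy / sxx                    # sxy is 0 here, so the slope is 0
intercept = y_bar - slope * x_bar    # reduces to the average of y

# Every predicted value equals the average of y.
pred = [intercept + slope * xi for xi in x]

sse = sum((yi - pi) ** 2 for yi, pi in zip(y, pred))
ssto = sum((yi - y_bar) ** 2 for yi in y)

print(slope, intercept, pred)
print(sse, ssto)
```

With a zero slope the fitted line is horizontal at the average of y, so regression prediction and one-sample prediction give the same errors.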
In applications, the typical scenario falls somewhere in the middle: |r| is somewhere between 0 and 1, and SSE is somewhere between 0 and SSTo.
|__________________|___________________________________|
0                 SSE                               SSTo
<----------------- | ---------------------------------->
BEST                                               WORST
Is SSE closer to 0 (best case) or to SSTo (worst case)? The distance between SSTo and SSE is a useful measure of the statistical performance of the prediction model. It is calculated by subtraction, and is called SSR:
SSR = SSTo - SSE
For the Saturn Price data, SSTo=182,977,206 and SSE=107,805,718. The difference between them is SSR=75,171,488.
x______________________________x_______________________x
0                         107,805,718             182,977,206
                             (SSE)                  (SSTo)
                               |_______________________|
                                   75,171,488 (SSR)
Recall that large residuals indicate poor predictions by the straight line model
which predicts Y (Price) given X (Miles).
The residuals then represent a component of the car price that is not due to,
or `not explained by', the car's mileage.
Since SSE is a summary measure of residual size, and since SSTo=SSE+SSR, we may think of
SSTo as having been partitioned into
(i) a component not explained by mileage (SSE)
and (ii) a component explained by mileage (SSR).
The latter component is often expressed as a proportion or percentage:
SSR/SSTo = 75,171,488/182,977,206 = .41, or 41%
                    SSTo=182,977,206
|______________________________________________________|
        SSE=107,805,718             SSR=75,171,488
|______________________________|_______________________|
             (.59)                       (.41)
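The partition of SSTo can be checked with a few lines of arithmetic. This sketch starts from the SSTo and SSE values quoted for the Saturn data and derives SSR and the two proportions; the variable names are illustrative.

```python
# Values quoted in the text for the Saturn price data.
ssto = 182_977_206
sse = 107_805_718

# SSR is obtained by subtraction.
ssr = ssto - sse

# Proportions of SSTo not explained and explained by mileage.
prop_unexplained = sse / ssto
prop_explained = ssr / ssto

print(f"SSR = {ssr:,}")
print(f"not explained by mileage: {prop_unexplained:.2f}")
print(f"explained by mileage:     {prop_explained:.2f}")
```

Because SSTo = SSE + SSR, the two proportions always add up to 1.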