How well does the prediction model work? We now discuss statistics that help us measure the predictive performance of the model.

Consider Car 7, which has Miles=65940 and Price=$8800.
In the scatterplot, it is plotted as the point (65940, 8800).
The regression line says that a car with *X*=65940 will have
predicted *Y* value PREDICTED Y = $4755.47.

The observed value Y=$8800 is much higher than PREDICTED Y= $4755.47. How much higher? The difference is $8800-$4755.47 = $4044.53. This is called the "Residual".

Residual = Y - PREDICTED Y

Here is the complete table of values of the explanatory variable *X*, the
observed values of *Y*, the predicted values *PRED*, and the
residuals *RES*.

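| Car | Miles (X) | Price (Y) | PRED | RES = Y - PRED |
|----:|----------:|----------:|--------:|---------------:|
| 1 | 9300 | 7100 | 7659.35 | -559.35 |
| 2 | 10565 | 15500 | 7594.49 | 7905.51 |
| 3 | 15000 | 4400 | 7367.11 | -2967.11 |
| 4 | 15000 | 4400 | 7367.11 | -2967.11 |
| 5 | 17764 | 5900 | 7225.41 | -1325.41 |
| 6 | 57000 | 4600 | 5213.81 | -613.81 |
| 7 | 65940 | 8800 | 4755.47 | 4044.53 |
| 8 | 73676 | 2000 | 4358.85 | -2358.85 |
| 9 | 77006 | 2750 | 4188.13 | -1438.13 |
| 10 | 93739 | 2550 | 3330.24 | -780.24 |
| 11 | 146088 | 960 | 646.36 | 313.64 |
| 12 | 153260 | 1025 | 278.66 | 746.34 |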
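The PRED and RES columns can be reproduced from the Miles and Price columns alone. The following is a minimal sketch in Python, assuming the regression line is the ordinary least-squares line; the variable names (`miles`, `price`, `slope`, `intercept`, `pred`, `res`) are ours and do not come from the notes.

```python
# Mileage (X) and price (Y) for the 12 cars in the table.
miles = [9300, 10565, 15000, 15000, 17764, 57000,
         65940, 73676, 77006, 93739, 146088, 153260]
price = [7100, 15500, 4400, 4400, 5900, 4600,
         8800, 2000, 2750, 2550, 960, 1025]

n = len(miles)
mean_x = sum(miles) / n
mean_y = sum(price) / n

# Least-squares slope and intercept (assumed fitting method).
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(miles, price))
sxx = sum((x - mean_x) ** 2 for x in miles)
slope = sxy / sxx
intercept = mean_y - slope * mean_x

# Predicted value and residual for every car.
pred = [intercept + slope * x for x in miles]
res = [y - p for y, p in zip(price, pred)]

for car, (x, y, p, r) in enumerate(zip(miles, price, pred, res), start=1):
    print(f"{car:>3} {x:>7} {y:>6} {p:>9.2f} {r:>9.2f}")
```

Each printed row should agree with the table up to rounding; Car 7, for instance, has a positive residual because its price sits above the fitted line.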

In a scatterplot, *Y* is the height of the point.
PREDICTED Y is the height of the line directly above
or below that point (denoted by *'s in the scatterplot below).
The residuals, which are computed by subtraction
(RESIDUAL=Y-PREDICTED Y),
tell us how far each point is above or below
the line.
Points above the regression line have positive residuals.
Points below the
line have negative residuals.

Large residuals (ignoring the signs) indicate observations where the regression line gives poor predictions. For example, Car 7 has a large positive residual while Car 8 has a large negative residual. Based on its mileage, Car 7 has a predicted price of about $4755; its actual price is about $4045 higher (a residual of +$4044.53). Car 8 has a predicted price of about $4359; its actual price is about $2359 lower (a residual of -$2358.85).

The sum of the squared residuals

$$\mathrm{SSE} = \sum (Y - \text{PREDICTED } Y)^2$$

is a measure of the "overall size" of the residuals. In the Saturn Price data, SSE=107,805,718. If a regression line is predicting poorly, there tend to be lots of large (positive and negative) residuals, and hence SSE tends to be large.
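As a quick check of the arithmetic, SSE can be computed directly from the RES column of the table. This is only a sketch: the list name `res` is ours, and because the tabulated residuals are rounded to two decimals, the result agrees with the SSE quoted in these notes (107,805,718) only up to rounding.

```python
# Residuals (RES = Y - PRED) copied from the table, rounded to 2 decimals.
res = [-559.35, 7905.51, -2967.11, -2967.11, -1325.41, -613.81,
       4044.53, -2358.85, -1438.13, -780.24, 313.64, 746.34]

# SSE: square each residual, then add them up.
sse = sum(r ** 2 for r in res)
print(f"SSE is approximately {sse:,.0f}")
```

Squaring removes the signs, so Car 2 (residual 7905.51) alone contributes more than half of the total.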

The regression line predicts best when the correlation is strong, that is, when |*r*| is close to 1.
In the best case scenario where *r*=1 or *r*=-1, all the points
will fall on a straight line. In this extreme case, all the residuals are 0,
and SSE=0.

There is a more basic prediction method than regression: this method
*ignores the X-values altogether*, and uses the sample average
to predict each *Y*. This is called *one-sample prediction*
in contrast to *regression prediction*.
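In one-sample prediction, the errors of prediction are
the differences $Y - \bar{Y}$, where $\bar{Y}$ is the sample average of the *Y* values,
and their sum of squares is called
*Total Sum of Squares*, denoted SSTo:

$$\mathrm{SSTo} = \sum (Y - \bar{Y})^2$$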
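One-sample prediction is easy to carry out by hand or in a few lines of code. The sketch below (variable names are ours) predicts every price by the sample average and adds up the squared errors.

```python
# Observed prices (Y) for the 12 cars; mileage is deliberately ignored.
price = [7100, 15500, 4400, 4400, 5900, 4600,
         8800, 2000, 2750, 2550, 960, 1025]

# One-sample prediction: every car gets the same prediction, the average.
mean_y = sum(price) / len(price)

# SSTo: sum of squared differences between each Y and the average.
ssto = sum((y - mean_y) ** 2 for y in price)
print(f"average price = {mean_y:.2f}, SSTo = {ssto:,.0f}")
```

The result should match the SSTo value quoted for the Saturn Price data (182,977,206) up to rounding.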

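What happens in regression when *r*=0, i.e. the *X* and *Y* variables are
uncorrelated? In this "worst-case" scenario, the
regression slope is computed as
$\text{slope} = r \cdot s_Y / s_X = 0$, where $s_X$ and $s_Y$ are the standard deviations of *X* and *Y*. The
computation for the intercept results in
$\text{intercept} = \bar{Y} - \text{slope} \cdot \bar{X} = \bar{Y}$, the sample average of *Y*.
The resulting predicted values are
PREDICTED $Y = \bar{Y}$ for every *X*.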
Therefore when *X* and *Y* have 0 correlation,
regression reduces to one-sample prediction ignoring
the *X*-values, and SSE equals SSTo.
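The *r*=0 case can be illustrated with a tiny made-up data set. The five (x, y) pairs below are hypothetical, chosen only so that the correlation is exactly zero; they are not part of the Saturn Price data, and the least-squares formulas are the standard ones.

```python
# Hypothetical data with zero correlation (not from the Saturn data).
x = [1, 2, 3, 4, 5]
y = [1, 3, 2, 3, 1]

n = len(x)
mean_x = sum(x) / n
mean_y = sum(y) / n

# Least-squares slope and intercept.
sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
sxx = sum((a - mean_x) ** 2 for a in x)
slope = sxy / sxx                  # zero, because sxy is zero
intercept = mean_y - slope * mean_x

# Regression predictions versus one-sample predictions.
pred = [intercept + slope * a for a in x]
sse = sum((b - p) ** 2 for b, p in zip(y, pred))
ssto = sum((b - mean_y) ** 2 for b in y)

print(slope, intercept, sse, ssto)
```

Here the slope is zero, every predicted value equals the average of the y values, and SSE and SSTo coincide.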

In applications, the typical scenario falls somewhere in the middle:
|*r*| is somewhere between 0 and 1, and SSE is somewhere between 0 and SSTo.

    |__________________|___________________________________|
    0                 SSE                                SSTo
    <----------------- | ---------------------------------->
    BEST                                               WORST

*Is SSE closer to 0 (best case) or to SSTo (worst case)?*
The distance between SSTo and SSE is a useful
measure of the statistical performance of the prediction model.
It is calculated by subtraction, and is called SSR:

$$\mathrm{SSR} = \mathrm{SSTo} - \mathrm{SSE}$$

For the Saturn Price data, SSTo=182,977,206 and SSE=107,805,718. The difference between them is SSR=75,171,488.

    x______________________________x_______________________x
    0                         107,805,718             182,977,206
                                 (SSE)                   (SSTo)
                                   |_______________________|
                                          75,171,488
                                             (SSR)
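SSR can also be computed directly, as the sum of squared differences between the predicted values and the sample average, rather than by subtraction. The sketch below uses the Price and PRED columns of the table to check that the two routes agree; the variable names are ours, and since PRED is rounded to two decimals the agreement holds only up to rounding.

```python
# Price (Y) and PRED columns copied from the table.
price = [7100, 15500, 4400, 4400, 5900, 4600,
         8800, 2000, 2750, 2550, 960, 1025]
pred = [7659.35, 7594.49, 7367.11, 7367.11, 7225.41, 5213.81,
        4755.47, 4358.85, 4188.13, 3330.24, 646.36, 278.66]

mean_y = sum(price) / len(price)

sse = sum((y - p) ** 2 for y, p in zip(price, pred))   # regression errors
ssto = sum((y - mean_y) ** 2 for y in price)           # one-sample errors
ssr_direct = sum((p - mean_y) ** 2 for p in pred)      # line vs. average

print(f"SSTo - SSE = {ssto - sse:,.0f}")
print(f"SSR direct = {ssr_direct:,.0f}")
```

Both printed values should be close to the SSR of 75,171,488 quoted for the Saturn Price data.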

Recall that large residuals indicate poor predictions by the straight line model
which predicts *Y* (Price) given *X* (Miles).
The residuals then represent a component of the car price that is not due to,
or "not explained by", the car's mileage.
Since SSE is a summary measure of residual size, and since SSTo=SSE+SSR, we may think of
SSTo as having been partitioned into
(i) a component *not explained* by mileage (SSE)
and (ii) a component *explained* by mileage (SSR).
The latter component is often expressed as a proportion or percentage:

$$\frac{\mathrm{SSR}}{\mathrm{SSTo}}$$

For the Saturn Price data, SSTo=182,977,206. Since SSR/SSTo=.41, we say that 41% of total price variability is explained by the car's mileage. The remaining 59% is due to something else, probably the condition of the car, power and luxury options, time remaining in the auction, market luck-of-the-draw, etc.

                       SSTo=182,977,206
    |______________________________________________________|
            SSE=107,805,718              SSR=75,171,488
    |______________________________|_______________________|
                 (.59)                       (.41)
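For a straight-line regression, the proportion SSR/SSTo is known to equal the square of the correlation coefficient *r*. The sketch below checks this on the Saturn Price data by computing *r* from the Miles and Price columns with the usual Pearson formula and comparing its square with the ratio of the totals quoted above. The variable names are ours.

```python
import math

# Mileage (X) and price (Y) for the 12 cars.
miles = [9300, 10565, 15000, 15000, 17764, 57000,
         65940, 73676, 77006, 93739, 146088, 153260]
price = [7100, 15500, 4400, 4400, 5900, 4600,
         8800, 2000, 2750, 2550, 960, 1025]

n = len(miles)
mean_x = sum(miles) / n
mean_y = sum(price) / n

# Pearson correlation coefficient r.
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(miles, price))
sxx = sum((x - mean_x) ** 2 for x in miles)
syy = sum((y - mean_y) ** 2 for y in price)
r = sxy / math.sqrt(sxx * syy)

# Proportion explained, from the totals quoted in the notes.
ssto = 182_977_206
ssr = 75_171_488
proportion = ssr / ssto

print(f"r = {r:.3f}, r squared = {r * r:.3f}, SSR/SSTo = {proportion:.3f}")
```

The correlation is negative (price falls as mileage rises), but squaring it gives the same proportion, about .41.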