When the X-variables are highly correlated, the coefficients lose their meaning. Consider the following hypothetical data:
X1 X2 Y 10 20 140 20 40 180 30 60 220 40 80 260 50 100 300
The data is fit perfectly by the model
|Model B: Y= 100 + 4 (X1) + 0 (X2)|
|Model C: Y= 100 + 0 (X1) + 2 (X2)|
|Model D: Y= 100 - 6 (X1) + 5 (X2)|
|Model E: Y= 100 + 10 (X1) - 3 (X2)|
Since a model with fits just as well as a model with , the beta coefficents have no real meaningful interpretation. In statistics, this coexistence of multiple correct models is called the nonidentifiability problem. It is a consequence of one of the X-variables being a linear combination of the others, also called the multicollinearity problem.
Moral: Avoid models with X-variables which are too strongly correlated with each other.
Of course, sometimes multicollinearity is unavoidable. In the Saturn price example, the two X-variables MILES and YEAR have correlation r=-.91, so the multicollineariy problem exists to some extent. Therefore, the coefficients have to be very carefully interpreted. One way to minimize this collinearity problem is to try to include cars which are new but have high mileage, or old but have low mileage. This will reduce the correlation between MILES and YEAR.