next up previous contents index
Next: Using Excel Up: Linear Regression Previous: Multiple Linear Regression

Optional: The Multicollinearity Problem

When the X-variables are highly correlated, the coefficients lose their meaning. Consider the following hypothetical data:



                           X1     X2      Y  

                           10     20     140
                           20     40     180
                           30     60     220
                           40     80     260
                           50    100     300



The data is fit perfectly by the model

\begin{displaymath}\mbox{Model A}: Y= 100 + 2 (X_1) + 1 (X_2).
\end{displaymath}

However, note that X2 = 2 (X1); that is, one X-variable is a linear function of the other. Thus the following models, all with different coefficients, also fit the data perfectly.

Model B: Y= 100 + 4 (X1) + 0 (X2)      
Model C: Y= 100 + 0 (X1) + 2 (X2)      
Model D: Y= 100 - 6 (X1) + 5 (X2)      
Model E: Y= 100 + 10 (X1) - 3 (X2)      
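A quick check, plugging the table's data into each of the five sets of coefficients, confirms that every model reproduces Y exactly:

```python
import numpy as np

# Data from the table above.
X1 = np.array([10, 20, 30, 40, 50])
X2 = np.array([20, 40, 60, 80, 100])
Y  = np.array([140, 180, 220, 260, 300])

# Coefficients (intercept, b1, b2) for Models A through E.
models = {
    "A": (100,  2,  1),
    "B": (100,  4,  0),
    "C": (100,  0,  2),
    "D": (100, -6,  5),
    "E": (100, 10, -3),
}

for name, (b0, b1, b2) in models.items():
    pred = b0 + b1 * X1 + b2 * X2
    print(name, np.allclose(pred, Y))   # every model prints True
```

Because X2 = 2 (X1), any trade of 2 units of the X1 coefficient for 1 unit of the X2 coefficient leaves the fitted values unchanged.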

Since a model with $\beta_1=4$ fits just as well as a model with $\beta_1=10$, the beta coefficients have no meaningful interpretation. In statistics, this coexistence of multiple equally correct models is called the nonidentifiability problem. It is a consequence of one of the X-variables being a linear combination of the others, which is called the multicollinearity problem.
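The nonidentifiability shows up numerically as a rank-deficient design matrix: with X2 = 2 (X1), the three columns (intercept, X1, X2) span only a two-dimensional space, so the least-squares equations have infinitely many solutions. A sketch with NumPy:

```python
import numpy as np

X1 = np.array([10, 20, 30, 40, 50])
X2 = np.array([20, 40, 60, 80, 100])
Y  = np.array([140, 180, 220, 260, 300])

# Design matrix with an intercept column.
X = np.column_stack([np.ones(5), X1, X2])

# Because X2 = 2*X1, the columns are linearly dependent:
# the matrix has rank 2 rather than 3.
print(np.linalg.matrix_rank(X))          # 2

# lstsq still returns one of the infinitely many solutions
# (the minimum-norm one), and it fits the data exactly.
beta, *_ = np.linalg.lstsq(X, Y, rcond=None)
print(np.allclose(X @ beta, Y))          # True
```

Which particular solution the software reports is an artifact of the algorithm, not a property of the data, which is exactly why the individual coefficients cannot be interpreted here.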

Moral: Avoid models with X-variables that are strongly correlated with each other.

Of course, sometimes multicollinearity is unavoidable. In the Saturn price example, the two X-variables MILES and YEAR have correlation r = -0.91, so the multicollinearity problem exists to some extent, and the coefficients must be interpreted with great care. One way to reduce the collinearity is to include cars that are new but have high mileage, or old but have low mileage; this lowers the correlation between MILES and YEAR.
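Checking for this kind of collinearity before fitting is straightforward. A minimal sketch (the mileage and year values below are made up for illustration, not the actual Saturn data):

```python
import numpy as np

# Hypothetical MILES and YEAR values: newer cars tend to have
# fewer miles, producing a strong negative correlation.
year  = np.array([1992, 1993, 1994, 1995, 1996, 1997])
miles = np.array([85000, 72000, 60000, 43000, 31000, 16000])

r = np.corrcoef(miles, year)[0, 1]
print(r)   # strongly negative, close to -1
```

A correlation this close to +1 or -1 between two predictors is a warning that their individual coefficients will be unstable and hard to interpret.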



2003-09-08