In this section, we discuss measures of the relationship between two variables X and Y. It is easiest to start with no relationship. What do we mean by no relationship? Suppose we had a lot of data on (X, Y) and obtained a scatterplot of Y versus X. If the plot were a random scatter, we would conclude that the variables X and Y are not related. What if they are related? Look at the six plots in Figure 1.10. In the first, we would probably conclude that X and Y are not related. Plot 2 we would characterize as a probable linear relationship, certainly exhibiting random error. Plot 3 is similar to Plot 2, although the pattern is not quite as tight. Plot 4 shows some negative drift. Plots 5 and 6 show the strongest relationships (tightest patterns) among the plots: Plot 5 shows a very strong circular relationship, while Plot 6 shows a very strong quadratic pattern. It seems that a measure of a relationship should depend on the type of relationship it is. In this section we are concerned, for the most part, with linear relationships, and we will consider measures of such a relationship. It should not be surprising that this measure will indicate no (linear) relationship for the two strongest relationships in the plots.
Consider Plot 2 again. We want to measure the linear relationship exhibited in this plot. Two simple lines will help a lot. On the x-axis locate the sample mean of the X's ($\bar{X}$) and draw a vertical line through this point. On the y-axis locate the sample mean of the Y's ($\bar{Y}$) and draw a horizontal line through this point. Figure 1.11 shows these lines.
The lines intersect at $(\bar{X}, \bar{Y})$ (locate it). This is our new center. Next, label the quadrants I, II, III, and IV, beginning at the upper right quadrant and continuing counter-clockwise. The coordinates of $(X, Y)$ relative to the new center are $(X - \bar{X},\, Y - \bar{Y})$.
The signs of these coordinates are (+,+), (-,+), (-,-), and (+,-) as we go around quadrants I, II, III, and IV, respectively. It is then easy to come up with many measures of linear relationship. A simple one counts the number of points with matching signs (those in quadrants I and III) and subtracts the number of points with differing signs (those in quadrants II and IV). High values of this measure indicate a positive linear relationship, while low values indicate a negative linear relationship.
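The quadrant-count idea can be sketched in a few lines; the data below are a hypothetical toy set, used only for illustration:

```python
# Sign-count measure of linear relationship: points in quadrants I and III
# (matching signs about the means) minus points in quadrants II and IV
# (opposite signs). Points lying exactly on a mean line count for neither.

def sign_count_measure(xs, ys):
    xbar = sum(xs) / len(xs)
    ybar = sum(ys) / len(ys)
    same = sum(1 for x, y in zip(xs, ys)
               if (x - xbar) * (y - ybar) > 0)   # quadrants I and III
    diff = sum(1 for x, y in zip(xs, ys)
               if (x - xbar) * (y - ybar) < 0)   # quadrants II and IV
    return same - diff

# A positive drift scores high, a negative drift scores low:
xs = [1, 2, 3, 4, 5]
print(sign_count_measure(xs, [2, 3, 3, 5, 6]))   # prints 4
print(sign_count_measure(xs, [6, 5, 3, 3, 2]))   # prints -4
```

The count is crude (it ignores how far each point sits from the center), which motivates the product-based measure considered next.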
Instead of counting like and unlike signs, we consider a measure which takes the product of these new coordinates. Thus we have n products, one for each point in the plot. Consider as a measure their average:
$$ s_{XY} = \frac{1}{n} \sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y}). $$
For a given data set, we can always make this measure larger (or smaller) by changing the units. Suppose we have a positive linear relationship and X is measured in feet. If we change the X's to inches, then $s_{XY}$ increases by the factor 12; if we change the X's to millimeters, then $s_{XY}$ increases by the factor 304.8. Thus we need to standardize our measure. In this chapter (we revisit this problem in Chapter 11), we insist on a measure whose absolute value cannot exceed 1. This is called the sample correlation coefficient, and it is simply $s_{XY}$ divided by the product of the standard deviations of the X's and the Y's (except that we divide by $n$ and not $n-1$); i.e.,
$$ r = \frac{s_{XY}}{s_X s_Y} = \frac{\sum_{i=1}^{n} (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_{i=1}^{n} (X_i - \bar{X})^2}\,\sqrt{\sum_{i=1}^{n} (Y_i - \bar{Y})^2}}. $$
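A minimal sketch of the sample correlation coefficient (with hypothetical data) also shows why the standardization works: changing feet to inches leaves r unchanged, because the factor of 12 cancels in numerator and denominator:

```python
import math

# Sample correlation r = s_XY / (s_X * s_Y); every variance-type quantity
# is divided by n, so the n's cancel in the ratio.

def correlation(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / n
    sx = math.sqrt(sum((x - xbar) ** 2 for x in xs) / n)
    sy = math.sqrt(sum((y - ybar) ** 2 for y in ys) / n)
    return sxy / (sx * sy)

feet = [5.1, 6.0, 5.5, 6.2, 5.8]     # hypothetical heights in feet
ys = [150.0, 180.0, 160.0, 190.0, 170.0]

r = correlation(feet, ys)
inches = [12 * x for x in feet]       # change of units
print(abs(correlation(inches, ys) - r) < 1e-9)  # prints True: r is unit-free
```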
The values of r for each of the plots in Figure 1.10 are indicated in Figure 1.12.
As we thought, the strongest relationships score 0 with our measure because they are both nonlinear. The best linear pattern is Plot 2, although Plot 3 is close. The negative drift, Plot 4, registers r = -.43, and the first plot shows little linearity, as initially thought.
We can do a bit more with the sample correlation coefficient. It is associated with the LS fit. It can be shown that
$$ \hat{b} = r\,\frac{s_Y}{s_X} $$
is the LS estimate of the slope. So r contains information on the fit.
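This identity is easy to check numerically; below is a sketch with hypothetical data, where the direct LS slope is computed as the sum of cross-products divided by the sum of squared x-deviations:

```python
import math

# Check that the LS slope estimate equals r * (s_Y / s_X).

def mean_sd(vs):
    n = len(vs)
    mean = sum(vs) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in vs) / n)  # divide by n
    return mean, sd

xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [1.2, 1.9, 3.3, 3.8, 5.1]

xbar, sx = mean_sd(xs)
ybar, sy = mean_sd(ys)
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys)) / len(xs)
r = sxy / (sx * sy)

# Direct LS slope: cross-products over squared x-deviations.
b_ls = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
        / sum((x - xbar) ** 2 for x in xs))

print(abs(b_ls - r * sy / sx) < 1e-9)   # prints True
```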
We can be more precise. Consider the variation (or noise) in the Y data. A measure of this variation is the sample variance $s_Y^2$ of the Y's. When we fit the linear model $Y = a + bX + e$, we should be able to account for some of this variability (X should be of help in predicting Y). In fact,
$$ r^2 $$
is the percentage of variation accounted for in the LS fit of Y versus X. We call this the coefficient of determination, and we often denote it by a capital $R^2$. Consider the values of $R^2$ for Plots 1-6: $R^2 = .007$ for Plot 1, so we have accounted for only .7% of the variation in Y; $R^2 = .66$ for Plot 2 (66% of the variation); $R^2 = .59$ for Plot 3 (59%); and $R^2 = .18$ for Plot 4 (18%). Of course, for the last two plots $R^2 = 0$. The value of $R^2$ can be obtained using the regression module.
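The claim that the LS fit accounts for the fraction $r^2$ of the variation can be verified numerically. Below is a sketch with hypothetical data; it computes the proportion of variation accounted for in the standard way, as 1 - SSE/SStot, and compares it to the square of r:

```python
import math

# Check that 1 - SSE/SStot for the LS fit coincides with r squared.

def r_squared(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sstot = sum((y - ybar) ** 2 for y in ys)
    return 1 - sse / sstot

def correlation(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sum((x - xbar) ** 2 for x in xs)
                           * sum((y - ybar) ** 2 for y in ys))

xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
ys = [2.1, 2.9, 4.2, 4.8, 6.3, 6.9]

print(abs(r_squared(xs, ys) - correlation(xs, ys) ** 2) < 1e-9)  # prints True
```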
The measures r and $R^2$ are not robust. We will consider alternative measures of r later; for now, we offer an alternative to $R^2$, labeled $R_W^2$. This is the measure that corresponds to the robust Wilcoxon fit, and it is not as sensitive to outliers as $R^2$. We show this for the baseball height and weight data. Recall that we changed the original data by inserting an outlier. The plots in Figure 1.13 show the original and changed data along with their $R^2$'s and $R_W^2$'s.
For the LS fit, notice that due to one outlier, the percentage of variation accounted for dropped from 50% to 19%. The measure corresponding to the robust Wilcoxon fit only changed from .44 to .39.
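The sensitivity of $R^2$ to a single outlier can be demonstrated with a small sketch (hypothetical data, not the baseball data):

```python
# One gross outlier collapses R^2 for the LS fit, illustrating that the
# measure is not robust.

def r_squared(xs, ys):
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    b = (sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
         / sum((x - xbar) ** 2 for x in xs))
    a = ybar - b * xbar
    sse = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    sstot = sum((y - ybar) ** 2 for y in ys)
    return 1 - sse / sstot

xs = list(range(10))
ys = [2.0 * x + 1.0 for x in xs]      # a clean linear pattern
ys_out = ys[:]
ys_out[-1] = -40.0                    # insert a single gross outlier

print(round(r_squared(xs, ys), 3))    # prints 1.0 for the clean data
print(r_squared(xs, ys_out) < 0.5)    # prints True: R^2 drops sharply
```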
x    y
16   32
15   26
20   40
13   27
15   30
17   38
16   34
21   43
22   64
23   45
24   46
18   39
[Three character scatterplots: C3 versus C1 (C1 ranging from -40 to 60), C3 versus C2 (C2 ranging from 30 to 80), and C3 versus C1 again; only the axis labels and ranges are recoverable from the text output.]