## Correlation

Sunday 28 February 2010 at 10:05 am

Correlation is a concept based upon the notion of co-variance. Two varying quantities are said to co-vary when, whenever one is increasing in value, the other does so too, and conversely when one is decreasing in value. Two correlated quantities vary together so that they actually seem to interact.

In statistical terms, when the varying quantities are described as random variables X and Y, covariance has a specific meaning. Covariance is the expected value (denoted E[.] here) of the product of the deviations of each variable from its mean:

Cov(X,Y) = E[(X - E[X]) (Y - E[Y])]

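When both variables take values larger than their respective means, or both take values smaller than their means, the product of their deviations is positive. When one of them takes a value higher than its mean while the other takes a value lower than its mean, the product is negative. The covariance is therefore positive when the first situation dominates on average, and negative when the second one does.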

Correlation is computed like the covariance, except that it is normalized so as to take values between -1 and 1, no matter which quantities are considered. It is defined as Cov(X,Y) / (std(X) std(Y)), where std(X) is the standard deviation of X (the square root of its variance), and is known as the Pearson product-moment correlation.
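To make the two definitions concrete, here is a minimal sketch in plain Python. The function names and the sample values are made up for illustration, and the expectation E[.] is approximated by the average over the sample.

```python
from math import sqrt


def mean(values):
    return sum(values) / len(values)


def covariance(xs, ys):
    """Average product of the deviations of xs and ys from their means."""
    mx, my = mean(xs), mean(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / len(xs)


def pearson(xs, ys):
    """Covariance normalized by the product of the standard deviations."""
    std_x = sqrt(covariance(xs, xs))  # Cov(X, X) is the variance of X
    std_y = sqrt(covariance(ys, ys))
    return covariance(xs, ys) / (std_x * std_y)


# Hypothetical sample values, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

cov_xy = covariance(xs, ys)
r_xy = pearson(xs, ys)
print(cov_xy, r_xy)
```

Multiplying `ys` by 10 multiplies the covariance by 10 but leaves the correlation unchanged, which is the point of the normalization.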

A correlation of +1 between two variables indicates perfect co-variation, while a value of -1 indicates perfect anti-variation: when one of the variables is increasing, the other is decreasing, and vice versa.

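With two binary variables, the correlation is closely related to the average value of X XOR Y, where XOR is the exclusive OR binary operator. For instance, when X and Y each take the values 0 and 1 with equal probability, the correlation equals 1 - 2 E[X XOR Y]: it is +1 when the two variables always agree, and -1 when they always differ.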
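The link with XOR can be checked numerically. The sketch below assumes the balanced case, where each variable takes the values 0 and 1 equally often; under that assumption the correlation works out to 1 - 2 × mean(X XOR Y). The sample bits are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mx) ** 2 for x in xs) / n
    var_y = sum((y - my) ** 2 for y in ys) / n
    return cov / sqrt(var_x * var_y)


# Hypothetical balanced binary samples: each has four 0s and four 1s.
x = [0, 0, 0, 0, 1, 1, 1, 1]
y = [0, 0, 0, 1, 1, 1, 1, 0]

# Fraction of positions where the two variables disagree.
mean_xor = sum(a ^ b for a, b in zip(x, y)) / len(x)
r = pearson(x, y)
print(r, 1 - 2 * mean_xor)
```

When the two variables are not balanced, this simple identity no longer holds exactly, so the sketch should be read as a special case.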

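As far as continuous variables are concerned, the correlation is directly linked to the (normalized) error made by a linear model when one of the variables is used to predict the other and vice versa. More precisely, for a least-squares linear fit, the squared error divided by the total variation of the predicted variable equals 1 minus the squared correlation.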
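As a small numerical check of that link, the following sketch fits a least-squares line to hypothetical sample values and compares the normalized squared error of the fit with 1 minus the squared correlation. The variable names are illustrative only.

```python
def mean(values):
    return sum(values) / len(values)


# Hypothetical sample values, chosen only for illustration.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]
mx, my = mean(xs), mean(ys)

# Sums of products of deviations (un-normalized covariance and variances).
sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
sxx = sum((x - mx) ** 2 for x in xs)
syy = sum((y - my) ** 2 for y in ys)

# Least-squares line: y is approximated by slope * x + intercept.
slope = sxy / sxx
intercept = my - slope * mx

# Squared error of the linear model, normalized by the total variation of y.
residual = sum((y - (slope * x + intercept)) ** 2 for x, y in zip(xs, ys))
normalized_error = residual / syy

r = sxy / (sxx * syy) ** 0.5
print(normalized_error, 1 - r ** 2)
```

Both printed values agree up to rounding, and swapping the roles of `xs` and `ys` gives the same normalized error, since the correlation is symmetric.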

In that context, a correlation of +1 indicates a perfect linear relationship between both variables, that is, one can be expressed as some positive factor times the other, plus some bias, like, for instance, Celsius and Fahrenheit temperature scales. A value of -1 refers to the same property except that in that case, the factor is negative. When two variables exhibit a correlation of zero, it means that their relationship is not describable by a linear model. This can be either because there is no relation at all (variables are independent) or because their relation is nonlinear. For instance, the correlation between X and X^2 over a symmetrical interval (i.e. centered on zero) is zero while X and X^2 are obviously not independent. The correlation between X and X^2 over a non-negative interval (an interval whose bounds are both non-negative) is not zero, but it is not +1 either; it takes an intermediate value.
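The X versus X^2 example is easy to reproduce. This sketch uses evenly spaced integer samples as a stand-in for the intervals, which is an assumption made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mx) ** 2 for x in xs) / n
    var_y = sum((y - my) ** 2 for y in ys) / n
    return cov / sqrt(var_x * var_y)


# Samples centered on zero: the positive and negative halves cancel out.
symmetric = [-3, -2, -1, 0, 1, 2, 3]
r_symmetric = pearson(symmetric, [x ** 2 for x in symmetric])

# Non-negative samples: the relation is increasing but not linear.
non_negative = [0, 1, 2, 3, 4, 5, 6]
r_non_negative = pearson(non_negative, [x ** 2 for x in non_negative])

print(r_symmetric, r_non_negative)
```

The first value is zero and the second lies strictly between 0 and 1, even though X^2 is fully determined by X in both cases.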

To overcome that limitation, Spearman (C. Spearman, "The proof and measurement of association between two things", Amer. J. Psychol., 15 (1904) pp. 72–101) proposed another definition of the correlation, not based on the values of the variables, but on the ordering of the instances according to those variables. In practice, the Spearman correlation is computed as the correlation between the ranks of each instance when sorted according to each variable. When both orderings match perfectly, the correlation is +1 and it means that there is a perfect monotonic relationship between both variables. When one increases, the other one increases too, but not necessarily in a linear way. The Spearman correlation between X and X^2 over a non-negative interval is then exactly +1. Kendall's tau coefficient (Kendall, M. (1948) Rank Correlation Methods, Charles Griffin & Company Limited) is computed slightly differently, but it expresses the same idea.
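A minimal sketch of the rank-based computation follows. It assumes there are no tied values (ties would need averaged ranks, which are not handled here), and the helper names are hypothetical.

```python
from math import sqrt


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mx) ** 2 for x in xs) / n
    var_y = sum((y - my) ** 2 for y in ys) / n
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """Rank of each value (1 = smallest). Assumes all values are distinct."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = rank
    return result


def spearman(xs, ys):
    """Pearson correlation computed on the ranks instead of the values."""
    return pearson(ranks(xs), ranks(ys))


# Non-negative samples in arbitrary order, and their squares.
xs = [4, 0, 6, 2, 5, 1, 3]
ys = [x ** 2 for x in xs]

rho = spearman(xs, ys)
r = pearson(xs, ys)
print(rho, r)
```

Here the rank-based value is 1 because squaring preserves the ordering of non-negative numbers, while the value-based correlation stays below 1.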

Correlation can also be computed between one variable and the same variable to which a delay is applied. It is then called auto-correlation (M.B. Priestley, Spectral Analysis and Time Series, London, New York: Academic Press, 1982). Rather than a single value, the auto-correlation produces a function of the delay. The local maxima of that function indicate potential periodicity.
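The idea can be sketched on a synthetic signal. The estimator below simply correlates the overlapping parts of the signal and its delayed copy; other estimators of the auto-correlation exist, and the sine wave with a period of 8 samples is an arbitrary choice for illustration.

```python
from math import pi, sin, sqrt


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mx) ** 2 for x in xs) / n
    var_y = sum((y - my) ** 2 for y in ys) / n
    return cov / sqrt(var_x * var_y)


def autocorrelation(signal, lag):
    """Correlation between the signal and itself delayed by `lag` samples (lag >= 1)."""
    return pearson(signal[:-lag], signal[lag:])


# Synthetic periodic signal: a sine wave repeating every 8 samples.
period = 8
signal = [sin(2 * pi * t / period) for t in range(64)]

# The auto-correlation is a function of the delay, here for lags 1 to 16.
acf = {lag: autocorrelation(signal, lag) for lag in range(1, 17)}
for lag, value in acf.items():
    print(lag, round(value, 3))
```

The function peaks near +1 at delays of 8 and 16 samples and dips near -1 at a delay of 4, which is how the period shows up.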

Correlation can also be defined between groups of variables. It is then called canonical correlation and is strongly linked to Principal Component Analysis (Kanti V. Mardia, J. T. Kent and J. M. Bibby (1979). Multivariate Analysis. Academic Press).

Finally, it is very important to distinguish correlation from causality. Two variables can be correlated while sharing absolutely no causal link, that is, neither one is actually influencing the other. A typical example is that if you compute the correlation between the number of crimes committed in a city per year and the number of churches in that city, you will most probably find a rather large value. This is simply because both are roughly proportional to the number of inhabitants of the city.
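The effect of such a common factor can be simulated. Everything in the sketch below is synthetic: the populations, the per-inhabitant rates and the noise levels are arbitrary assumptions, not real figures. Crimes and churches are generated independently of each other, each one depending only on the population.

```python
import random
from math import sqrt


def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / n
    var_x = sum((x - mx) ** 2 for x in xs) / n
    var_y = sum((y - my) ** 2 for y in ys) / n
    return cov / sqrt(var_x * var_y)


rng = random.Random(0)  # fixed seed so the run is repeatable

# Synthetic cities: none of these numbers come from real data.
populations = [rng.uniform(10_000, 1_000_000) for _ in range(200)]
# Both counts scale with the population, with independent noise on each rate.
crimes = [p * 0.05 * rng.uniform(0.8, 1.2) for p in populations]
churches = [p * 0.001 * rng.uniform(0.8, 1.2) for p in populations]

# Correlation between the raw counts.
raw = pearson(crimes, churches)
# Correlation between the per-inhabitant rates, with population factored out.
per_capita = pearson(
    [c / p for c, p in zip(crimes, populations)],
    [ch / p for ch, p in zip(churches, populations)],
)
print(round(raw, 2), round(per_capita, 2))
```

The raw counts come out strongly correlated, while the per-inhabitant rates show little correlation, which is consistent with the population driving both quantities in this toy setting.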