Understanding the Covariance Matrix

The covariance matrix is a confusing yet powerful object that shows up frequently in statistics and probability theory. Here I'll give a more intuitive, geometric explanation of the covariance matrix and of the way it describes the shape of a data set.

Introduction

Before we get started, we shall take a quick look at the difference between covariance and variance. Variance measures the variation of a single random variable (like the height of a person in a population), whereas covariance is a measure of how much two random variables vary together (like the height of a person and the weight of a person in a population). The formula for the variance is given by

$$\sigma^2_x = \frac{1}{n-1} \sum^{n}_{i=1}(x_i - \bar{x})^2$$

where $n$ is the number of samples (e.g. the number of people) and $\bar{x}$ is the mean of the random variable $x$ (represented as a vector). The covariance $\sigma(x, y)$ of two random variables $x$ and $y$ is given by

$$\sigma(x, y) = \frac{1}{n-1} \sum^{n}_{i=1}(x_i - \bar{x})(y_i - \bar{y})$$

with $n$ samples. The variance $\sigma^2_x$ of a random variable $x$ can also be expressed as the covariance with itself by $\sigma(x, x) = \sigma^2_x$.
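
The two formulas translate almost directly into NumPy. Here is a minimal sketch with made-up height and weight values (any two equally long arrays would do), checked against np.var and np.cov:

import numpy as np

# Heights (m) and weights (kg) of five hypothetical people
height = np.array([1.62, 1.75, 1.80, 1.68, 1.71])
weight = np.array([58.0, 72.0, 85.0, 63.0, 70.0])

n = len(height)
var_height = np.sum((height - height.mean())**2) / (n - 1)                      # sample variance
cov_hw = np.sum((height - height.mean()) * (weight - weight.mean())) / (n - 1)  # sample covariance

print(var_height, np.var(height, ddof=1))    # both give the same value
print(cov_hw, np.cov(height, weight)[0, 1])  # matches the off-diagonal entry of np.cov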

Covariance Matrix

With the covariance we can calculate the entries of the covariance matrix, which is a square matrix given by $C_{i,j} = \sigma(x_i, x_j)$ where $C \in \mathbb{R}^{d \times d}$ and $d$ describes the dimension or number of random variables of the data (e.g. the number of features like height, width, weight, ...). Also the covariance matrix is symmetric since $\sigma(x_i, x_j) = \sigma(x_j, x_i)$. The diagonal entries of the covariance matrix are the variances and the other entries are the covariances. For this reason the covariance matrix is sometimes called the variance-covariance matrix. The calculation of the covariance matrix can also be expressed as

$$C = \frac{1}{n-1} \sum^{n}_{i=1}(X_i - \bar{X})(X_i - \bar{X})^T$$

where our data set is expressed by the matrix $X \in \mathbb{R}^{d \times n}$, whose $d$ rows are the random variables and whose $n$ columns are the samples, $X_i$ is the $i$-th sample (column) and $\bar{X}$ is the vector of means. Following from this equation, the covariance matrix of a data set with zero mean simplifies to $C = \frac{XX^T}{n-1}$, using the positive semi-definite matrix $XX^T$.
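
This matrix form is how the covariance matrix is computed in the code later in this article. Here is a minimal sketch with made-up two-dimensional data (variables in rows, samples in columns), compared against np.cov:

import numpy as np

# d = 2 variables (rows) and n = 5 samples (columns), made-up values
X_example = np.array([[ 0.5, -1.2,  0.3,  0.9, -0.5],
                      [ 1.0,  0.2, -0.8, -0.3, -0.1]])
X_example = X_example - X_example.mean(axis=1, keepdims=True)  # subtract the mean of each row (variable)

n = X_example.shape[1]
C = X_example.dot(X_example.T) / (n - 1)   # covariance matrix from the matrix product

print(C)
print(np.cov(X_example))                   # np.cov treats rows as variables and agrees with C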

In this article we will focus on the two-dimensional case, but everything generalizes readily to higher-dimensional data. Following from the previous equations, the covariance matrix for two dimensions is given by

$$C = \begin{pmatrix} \sigma(x, x) & \sigma(x, y) \\ \sigma(y, x) & \sigma(y, y) \end{pmatrix}$$

We want to show how linear transformations affect the data set and, as a result, the covariance matrix $C$. First we will generate random points with mean values $\bar{x} = 0$, $\bar{y} = 0$ at the origin and unit variance $\sigma^2_x = \sigma^2_y = 1$, which is also called white noise and has the identity matrix as its covariance matrix.

import numpy as np
import matplotlib.pyplot as plt

# Normally distributed x and y vectors with mean 0 and standard deviation 1
x = np.random.normal(0, 1, 500)
y = np.random.normal(0, 1, 500)
X = np.vstack((x, y)).T

plt.scatter(X[:, 0], X[:, 1])
plt.title('Generated Data')
plt.axis('equal')
plt.show()

Figure: Generated Data (scatter plot of the white-noise points)

This case would mean that $x$ and $y$ are independent (and hence uncorrelated) and the covariance matrix $C$ is

$$C = \begin{pmatrix} \sigma_x^2 & 0 \\ 0 & \sigma_y^2 \end{pmatrix} = \begin{pmatrix} 1 & 0 \\ 0 & 1 \end{pmatrix}$$

We can check this by calculating the covariance matrix:

# Covariance
def cov(x, y):
    xbar, ybar = x.mean(), y.mean()
    return np.sum((x - xbar)*(y - ybar))/(len(x) - 1)

# Covariance matrix (expects the variables in rows and the samples in columns)
def cov_mat(X):
    return np.array([[cov(X[0], X[0]), cov(X[0], X[1])],
                     [cov(X[1], X[0]), cov(X[1], X[1])]])

# Calculate covariance matrix 
cov_mat(X.T) # (or with np.cov(X.T))
array([[ 1.00666865,  0.0243458 ],
       [ 0.0243458 ,  0.88681178]])

This approximately gives us our expected covariance matrix with variances $\sigma_x^2 \approx \sigma_y^2 \approx 1$.

Linear Transformations of the Data Set

Next we will look at how transformations affect our data and the covariance matrix $C$. We will transform our data with the following scaling matrix

$$S = \begin{pmatrix} s_x & 0 \\ 0 & s_y \end{pmatrix}$$

where the transformation simply scales the $x$ and $y$ components by multiplying them by $s_x$ and $s_y$ respectively. What we expect is that the covariance matrix $C$ of our transformed data set will simply be

$$C = \begin{pmatrix} (s_x \sigma_x)^2 & 0 \\ 0 & (s_y \sigma_y)^2 \end{pmatrix}$$

which means that we can extract the scaling matrix from our covariance matrix by calculating $S = \sqrt{C}$, and that the data is transformed by $Y = SX$.

# Center the matrix at the origin
X = X - np.mean(X, 0)  # center the data at the origin (zero empirical mean)

# Scaling matrix
sx, sy = 0.7, 3.4
Scale = np.array([[sx, 0], [0, sy]])

# Apply scaling matrix to X
Y = X.dot(Scale)

plt.scatter(Y[:, 0], Y[:, 1])
plt.title('Transformed Data')
plt.axis('equal')
plt.show()

# Calculate covariance matrix
cov_mat(Y.T)

Figure: Transformed Data (points scaled by sx and sy)

array([[  0.49326764,   0.05794301],
       [  0.05794301,  10.25154418]])

We can see that this does in fact approximately match our expectation, with $s_x^2\,\sigma_x^2 \approx 0.49 \cdot 1.01 \approx 0.49$ and $s_y^2\,\sigma_y^2 \approx 11.56 \cdot 0.89 \approx 10.25$ for the variances in the $x$ and $y$ direction. This relation holds when the data is only scaled in the $x$ and $y$ directions, but it gets more involved for other linear transformations.
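
Since the covariance matrix of the scaled data is approximately diagonal, the scaling can be recovered as the element-wise square root of its diagonal, in line with the $S = \sqrt{C}$ relation above. A quick sketch of this check, reusing cov_mat and the scaled data Y from above (the exact values will vary with the random sample):

# Recover the scaling matrix from the (approximately diagonal) covariance matrix
C_scaled = cov_mat(Y.T)
S_est = np.sqrt(np.diag(np.diag(C_scaled)))  # keep only the diagonal entries, then take the square root

print(S_est)  # close to [[0.7, 0], [0, 3.4]], up to the sampling noise in the original data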

Now we will apply a linear transformation in the form of a transformation matrix $T$ to the data set, which will be composed of a two-dimensional rotation matrix $R$ and the previous scaling matrix $S$ as follows

$$T = RS$$

where the rotation matrix $R$ is given by

$$R = \begin{pmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \end{pmatrix}$$

where $\theta$ is the rotation angle. The transformed data is then calculated by $Y = TX$ or $Y = RSX$ (in the code below the data points are stored as rows of the matrix, so the transformation is applied from the right).

# Scaling matrix
sx, sy = 0.7, 3.4
Scale = np.array([[sx, 0], [0, sy]])

# Rotation matrix
theta = 0.77*np.pi
c, s = np.cos(theta), np.sin(theta)
Rot = np.array([[c, -s], [s, c]])

# Transformation matrix (applied to the row vectors of X from the right)
T = Scale.dot(Rot)

# Apply transformation matrix to X
Y = X.dot(T)

plt.scatter(Y[:, 0], Y[:, 1])
plt.title('Transformed Data')
plt.axis('equal')
plt.show()

# Calculate covariance matrix
cov_mat(Y.T)

Figure: Transformed Data (points scaled and rotated)

array([[ 4.70340162, -4.83340262],
       [-4.83340262,  6.0414102 ]])

This leads to the question of how to decompose the covariance matrix $C$ into a rotation matrix $R$ and a scaling matrix $S$.

Eigen Decomposition of the Covariance Matrix

Eigen decomposition is one connection between a linear transformation and the covariance matrix. An eigenvector is a vector whose direction remains unchanged when a linear transformation is applied to it. It can be expressed as

$$Cv = \lambda v$$

where $v$ is an eigenvector of $C$ and $\lambda$ is the corresponding eigenvalue. If we put all eigenvectors into the columns of a matrix $V$ and all eigenvalues as the entries of a diagonal matrix $L$, we can write for our covariance matrix $C$ the following equation

$$CV = VL$$

where the covariance matrix can be represented as

$$C = VLV^{-1}$$

which can also be obtained by singular value decomposition. The eigenvectors are unit vectors representing the direction of the largest variance of the data, while the eigenvalues represent the magnitude of this variance in the corresponding directions. This means $V$ represents a rotation matrix and $\sqrt{L}$ represents a scaling matrix, so we can represent the covariance matrix as

$$C = RSSR^{-1}$$

where the rotation matrix $R = V$ and the scaling matrix $S = \sqrt{L}$. From the previous linear transformation $T = RS$ we can derive

$$C = RSSR^{-1} = TT^T$$

because $T^T = (RS)^T = S^T R^T = SR^{-1}$, due to the properties $R^{-1} = R^T$ since $R$ is orthogonal and $S = S^T$ since $S$ is a diagonal matrix. This equation enables us to compute the covariance matrix from a linear transformation. In order to calculate the linear transformation that corresponds to a given covariance matrix, one must instead calculate the eigenvectors and eigenvalues of the covariance matrix $C$. This can be done by calculating

$$T = V\sqrt{L}$$

where $V$ is the previous matrix whose columns are the eigenvectors of $C$ and $L$ is the previous diagonal matrix consisting of the corresponding eigenvalues. The transformation matrix can also be computed by the Cholesky decomposition with $T = L_C$, where $L_C$ is the lower-triangular Cholesky factor of $C = L_C L_C^T$.
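
To make the relation $C = TT^T$ concrete, here is a small sketch that builds $T = V\sqrt{L}$ from the eigen decomposition of the measured covariance matrix of Y and checks that it reproduces $C$; the Cholesky factor gives an alternative (triangular) transformation matrix with the same property:

C = cov_mat(Y.T)

# Eigen decomposition of the covariance matrix: C = V L V^-1
eVa, eVe = np.linalg.eig(C)
V, L = eVe, np.diag(eVa)

# Transformation matrix from the eigen decomposition: T = V sqrt(L)
T_eig = V.dot(np.sqrt(L))
print(np.allclose(T_eig.dot(T_eig.T), C))    # True: T T^T reproduces C

# Alternative: the Cholesky factor of C also satisfies T T^T = C
T_chol = np.linalg.cholesky(C)
print(np.allclose(T_chol.dot(T_chol.T), C))  # True as well; T_eig and T_chol differ by a rotation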

We can see the basis vectors of the transformation matrix by showing each eigenvector $v$ scaled by $\sigma = \sqrt{\lambda}$. By multiplying $\sigma$ with $3$ we cover approximately $99.7\%$ of the points according to the three-sigma rule, if we would draw an ellipse with the two basis vectors and count the points inside the ellipse.

C = cov_mat(Y.T)

# Eigenvalues (eVa) and eigenvectors (eVe, as columns) of the covariance matrix
eVa, eVe = np.linalg.eig(C)

plt.scatter(Y[:, 0], Y[:, 1])
# Draw each eigenvector scaled by 3 standard deviations (3 * sqrt of its eigenvalue)
for e, v in zip(eVa, eVe.T):
    plt.plot([0, 3*np.sqrt(e)*v[0]], [0, 3*np.sqrt(e)*v[1]], 'r-', lw=2)
plt.title('Transformed Data')
plt.axis('equal')
plt.show()

Figure: Transformed Data with the scaled eigenvectors drawn as red lines

We can now recover the transformation matrix $T = V\sqrt{L}$ from the covariance matrix and use the inverse $T^{-1}$ to decorrelate (whiten) the data.

C = cov_mat(Y.T)

# Calculate eigenvalues and eigenvectors of the covariance matrix
eVa, eVe = np.linalg.eig(C)

# Transformation matrix from the eigen decomposition, T = V sqrt(L)
# (the transpose accounts for the data points being stored as rows of Y)
R, S = eVe, np.diag(np.sqrt(eVa))
T = R.dot(S).T

# Transform data with inverse transformation matrix T^-1
Z = Y.dot(np.linalg.inv(T))

plt.scatter(Z[:, 0], Z[:, 1])
plt.title('Uncorrelated Data')
plt.axis('equal')
plt.show()

# Covariance matrix of the uncorrelated data
cov_mat(Z.T)

Figure: Uncorrelated Data (whitened points)

An interesting use of the covariance matrix is in the Mahalanobis distance, which is used when measuring multivariate distances with covariance. It does that by calculating the uncorrelated distance between a point $x$ and a multivariate normal distribution with the following formula

$$D_M(x) = \sqrt{(x - \mu)^T C^{-1} (x - \mu)}$$

where $\mu$ is the mean and $C$ is the covariance matrix of the multivariate normal distribution (the set of points assumed to be normally distributed). A derivation of the Mahalanobis distance with the use of the Cholesky decomposition can be found in this article.
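
As a small illustration of the formula, the following sketch computes the Mahalanobis distance of an arbitrary query point to the transformed data set Y from above, which simply serves as the reference distribution here:

# Mahalanobis distance of a query point to the distribution of the points in Y
point = np.array([2.0, -3.0])   # arbitrary query point, for illustration only
mu = Y.mean(axis=0)             # mean of the distribution
C = cov_mat(Y.T)                # covariance matrix of the distribution

diff = point - mu
D_M = np.sqrt(diff.dot(np.linalg.inv(C)).dot(diff))
print(D_M)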

Conclusion

In this article we saw the relationship between the covariance matrix and linear transformations, which is an important building block for understanding and using PCA, SVD, the Bayes classifier, the Mahalanobis distance and other topics in statistics and pattern recognition. I found the covariance matrix to be a helpful cornerstone for understanding the many concepts and methods in pattern recognition and statistics.

Many of the matrix identities can be found in The Matrix Cookbook. The relationship between SVD, PCA and the covariance matrix is elegantly shown in this question and this question. Some further explanations on PCA with Python can be found in the article on Change of Basis.