Unveiling Dimensionality Reduction - A Beginner's Guide to Principal Component Analysis

#Probability #Mathematics

Introduction

Imagine for a second you were transplanted onto Olvera Street in LA. It’s a Tuesday, but today is a little different. There’s a spark in the air. You’re not quite sure what to make of it, but you know that today, something great is going to happen. You walk around aimlessly for a while, until your mind begins to get distracted by this huge sense of hunger. “Dang – if only I could have some tacos”, you think to yourself. Good thing you’re in the right place.

At long last, your eyes spot the gentrified “Taco de Matrix” – Where Flavor Spans Matrices 🌮📐. It only ever shows up on March 14.

You remember the handwritten spreadsheet you left in your pocket journal. As you search yourself, loose change, used napkins, and other miscellaneous items fall out of your trench coat. Today is not the day we tackle your hoarding problem; we have bigger fish to fry. Recall what you wrote!

$$ \bf{X} = \begin{bmatrix} 1 & 2 & 3 \\\ 4 & 5 & 6 \\\ 7 & 8 & 9 \end{bmatrix} $$

It’s a time series of prices.

That is, each row in \(\bf{X}\) is a data point in \(\mathbb{R}^3\), and there are three observations (or samples). The entry \(x_{i,j}\) refers to the \(j\) -th menu item’s price in year \(i\). It looks like over the years, Ernesto changed the price of the “Vector Verde” tacos from 2 dollars, to 5 dollars, and finally to 8 dollars.

The “Orthogonal Carnitas” similarly increased from 3 dollars, to 6, and finally to 9 dollars. You have these prices represented in your matrix as columns \(\vec{x}_j\), and it hurts to see them go up!

Matrix as Data

First, how does the linear algebra machinery relate to what we are doing? Gregory Gundersen provides a useful matrix-as-data interpretation: we can represent a new piece of data as a linear combination of existing data.

In our leading example, think for a second: how might we model a new menu item’s price history \(\vec{c}\)? We can do it with a linear combination!

Don’t forget to check out my previous tutorial on Matrix Multiplication as a refresher for what comes ahead!

Our Linear model:

$$ \begin{aligned} \vec{c}_{\text{'Sine Cosine Churros'}} &= \beta_1\vec{x}_{\text{'Linear Refried Beans'}} + \beta_2\vec{x}_{\text{'Vector Verde Tacos'}} \\\ &\quad + \beta_3\vec{x}_{\text{'Orthogonal Carnitas'}} \end{aligned} $$

Depending on the coefficient vector \(\vec{\beta} = (\beta_1, \beta_2, \beta_3)\), our prediction for the price of the upcoming ‘Sine Cosine Churros’ can be linearly related to the existing data we already have. Now let’s get back to our dataset \(\bf{X}\).
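
To make this concrete, here is a minimal NumPy sketch of that linear model. The matrix comes straight from the spreadsheet above; the values in beta are made up purely for illustration.

import numpy as np

# Rows are years, columns are the existing menu items.
X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

# Hypothetical coefficients -- one weight per existing menu item.
beta = np.array([0.2, 0.5, 0.3])

# Predicted price history for the 'Sine Cosine Churros':
# a linear combination of the existing columns.
c_churros = X @ beta
print(c_churros)  # roughly [2.1 5.1 8.1]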

Dimension Reduction

Suppose our dataset \(\bf{X}\) had millions of features (of size \(P\)); it could be hard to know which attributes are actually useful for a predictive task.

Moreover, some algorithms can be very expensive to compute on such a large dataset. Without losing too much information in our data, we would like to reduce the number of features so that our dataset becomes a lower-dimensional matrix with \(K < P\) columns:

$$ \bf{X} \in \mathbb{R}^{N \times P} \overset{\text{Dimension Reduction}}{\to} \bf{Y} \in \mathbb{R}^{N \times K} $$

The intuition behind this is that some prices can exhibit a lot of collinearity with one another. For example, if the ‘Vector Verde Tacos’ and the ‘Orthogonal Carnitas’ both use the same carne asada, their prices will probably move together. Instead, we can approximate \(\bf{X}\) with \(\bf{Y}\), where each column now has an interpretation as a ‘food category’ instead of an individual item. This compresses our data, and we’re able to mitigate some of the problems we were encountering.

Principal Component Analysis

A commonly used method to perform dimension reduction is principal component analysis (PCA). The method assumes the variables have a linear relationship to each other, since the algorithm relies on covariance. It follows five steps. Let’s get started.

1. Normalizing our data

First, let’s calculate the mean of every feature (column) contained in our matrix:

$$ \mathbb{E}(\bf{X}) = \begin{bmatrix} 4 \\\ 5 \\\ 6 \end{bmatrix}^T $$

Recall that for samples we are measuring the variance statistic (here in its divide-by-\(n\) form):

$$ \begin{aligned}&\hspace{1em} S_n^2 = \frac{\sum_{i = 1}^{n}(x_i - \bar{x})^2}{n} \\&= \frac{\sum_{i = 1}^{n}(x_i^2 - 2\bar{x}x_i + \bar{x}^2)}{n} \\&= \frac{\sum_{i = 1}^{n}x_i^2 - 2\bar{x}\sum_{i = 1}^{n}x_i + n\bar{x}^2}{n} \\&= \frac{\sum_{i = 1}^{n}x_i^2 - \bar{x}\sum_{i = 1}^{n}x_i}{n} \\&= \frac{\sum_{i = 1}^{n}x_i^2}{n} - \bar{x}^2 \end{aligned} $$

Let’s calculate the standard deviation for each column of \(\bf{X}\) as well:

  • Recall that standard deviation is the square root of the variance
$$ Std(\bf{X}) = \begin{bmatrix} \sqrt{\frac{(1)^2 + (4)^2 + (7)^2}{3} - (4)^2} \\\ \sqrt{\frac{(2)^2 + (5)^2 + (8)^2}{3} - (5)^2} \\\ \sqrt{\frac{(3)^2 + (6)^2 + (9)^2}{3} - (6)^2} \end{bmatrix}^T = \begin{bmatrix} \sqrt{22 - 16} \\\ \sqrt{31 - 25} \\\ \sqrt{42 - 36} \end{bmatrix}^T \approx \begin{bmatrix} 2.449 \\\ 2.449 \\\ 2.449 \end{bmatrix}^T $$

Now that we have computed the mean and standard deviation of our data, let’s normalize it:

$$ \bar{\bf{X}} = \begin{bmatrix} \frac{1 - 4}{\sqrt{6}} & \frac{2 - 5}{\sqrt{6}} & \frac{3 - 6}{\sqrt{6}} \\\ \frac{4 - 4}{\sqrt{6}} & \frac{5 - 5}{\sqrt{6}} & \frac{6 - 6}{\sqrt{6}} \\\ \frac{7 - 4}{\sqrt{6}} & \frac{8 - 5}{\sqrt{6}} & \frac{9 - 6}{\sqrt{6}} \end{bmatrix} \approx \begin{bmatrix} - 1.224 & - 1.224 & - 1.224 \\\ 0 & 0 & 0 \\\ 1.224 & 1.224 & 1.224 \end{bmatrix} $$
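
If you want to double-check the arithmetic, here is a short NumPy sketch of step 1. Note that np.std uses the divide-by-n formula from above by default.

import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]], dtype=float)

mean = X.mean(axis=0)      # [4. 5. 6.]
std = X.std(axis=0)        # divide-by-n form: every entry is sqrt(6), roughly 2.449
X_bar = (X - mean) / std   # rows of roughly -1.2247, 0, and 1.2247
print(X_bar)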

2. Computing Covariance Between Features

Now that we have normalized data, we can compute the covariance, which is the average product of two distinct features’ signed distances from their mean values, for every pair of features.

Recall that in this setting, \(\bar{\bf{X}}\) is a \(3 \times 3\) matrix: three samples, each with three features. Computing covariance here means comparing entire columns (features) rather than single numbers.

In other words, we treat each feature in the dataset \(\bar{\bf{X}}\) as a random variable. Since we are working with vectors, we want an analogue of multiplication, and the inner product is the natural choice.

Each entry of the covariance matrix is the dot product of one column of \(\bar{\bf{X}}\) with another, scaled by \(\frac{1}{n-1}\).

More precisely, a single matrix multiplication performs all of these ‘multiplication analogues’ in parallel:

$$ \begin{aligned} \bf{\Sigma} := Cov(\bar{\bf{X}}) &= \frac{1}{n - 1}\bar{\bf{X}}^T\bar{\bf{X}} \\&= \frac{1}{3 - 1} \begin{bmatrix} - 1.224 & 0 & 1.224 \\\ - 1.224 & 0 & 1.224 \\\ -1.224 & 0 & 1.224 \end{bmatrix} \begin{bmatrix} - 1.224 & - 1.224 & - 1.224 \\\ 0 & 0 & 0 \\\ 1.224 & 1.224 & 1.224 \end{bmatrix} \\&= \frac{1}{2} \begin{bmatrix} 3 & 3 & 3 \\\ 3 & 3 & 3 \\\ 3 & 3 & 3 \end{bmatrix} \\&= \begin{bmatrix} 1.5 & 1.5 & 1.5 \\\ 1.5 & 1.5 & 1.5 \\\ 1.5 & 1.5 & 1.5 \end{bmatrix} \end{aligned} $$
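
The same covariance matrix can be checked in NumPy, either with the explicit 1/(n - 1) formula or with np.cov, which divides by n - 1 by default and treats the rows of its input as variables (hence the transpose).

import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]], dtype=float)
X_bar = (X - X.mean(axis=0)) / X.std(axis=0)

n = X_bar.shape[0]
sigma_manual = X_bar.T @ X_bar / (n - 1)       # the formula above; every entry is 1.5
sigma_numpy = np.cov(X_bar.T)                  # same result via np.cov
print(np.allclose(sigma_manual, sigma_numpy))  # True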

It turns out there is a really nice property of having our data as a matrix: the covariance matrix \(\bf{\Sigma}\) can be decomposed to tell us where all the variance between features comes from.

3. Eigen Decomposition

Recall that eigenvectors are special directions of a matrix: along them, the linear transformation acts as pure scaling. The matrix merely stretches or shrinks such a vector instead of rotating it.

In the context of this covariance matrix, the eigenvectors are the principal components, the directions that explain where the variance comes from. The eigenvalues tell us the scaling factor, i.e. how much variance each principal component is responsible for.

Therefore, we solve the characteristic equation of our matrix \(\bf{\Sigma}\):

$$ \begin{aligned} det\left( \bf{\Sigma} - \lambda I \right) &= 0 \\& \text{LHS: } det\left ( \begin{bmatrix} 1.5 & 1.5 & 1.5 \\\ 1.5 & 1.5 & 1.5 \\\ 1.5 & 1.5 & 1.5 \end{bmatrix} - \begin{bmatrix} \lambda & 0 & 0 \\\ 0 & \lambda & 0 \\\ 0 & 0 & \lambda \end{bmatrix} \right) \\&= det \left(\begin{bmatrix} 1.5 - \lambda & 1.5 & 1.5 \\\ 1.5 & 1.5 - \lambda & 1.5 \\\ 1.5 & 1.5 & 1.5 - \lambda \end{bmatrix} \right) \\&= (1.5 - \lambda)\left| \begin{matrix} 1.5 - \lambda & 1.5 \\\ 1.5 & 1.5 - \lambda \end{matrix} \right| - 1.5\left| \begin{matrix} 1.5 & 1.5 \\\ 1.5 & 1.5 - \lambda \end{matrix} \right| \\& + 1.5 \left| \begin{matrix} 1.5 & 1.5 - \lambda \\\ 1.5 & 1.5 \end{matrix} \right| \end{aligned} $$

Setting this determinant equal to zero gives a cubic polynomial. I’ll skip the algebra, but solving it for the eigenvalues gives \(\lambda_1 = 4.5, \lambda_2 = \lambda_3 = 0\). Now let’s plug each eigenvalue back into \((\bf{\Sigma} - \lambda I)\vec{v} = 0\) to find its eigenvectors.

Solving for the eigenvectors of \(\lambda_1 = 4.5\):

  • We row reduce and get the following:
$$ \begin{aligned} \begin{bmatrix} -3 & 1.5 & 1.5 \\\ 1.5 & -3 & 1.5 \\\ 1.5 & 1.5 & -3 \end{bmatrix} \begin{bmatrix} v_1 \\\ v_2 \\\ v_3 \end{bmatrix} = \begin{bmatrix} 0 \\\ 0 \\\ 0 \end{bmatrix} \to \begin{bmatrix} -6 & 3 & 3\\\ 0 & -\frac{9}{2} & \frac{9}{2} \\\ 0 & 0 & 0 \end{bmatrix} \end{aligned} $$

Parameterizing \(v_1 = t\), we find that \(v_2 = t\) and \(v_3 = t\), so our eigenvector is:

$$ \vec{v} = t \begin{bmatrix} 1 \\\ 1 \\\ 1 \end{bmatrix} $$

Solving for the eigenvectors of \(\lambda = 0\):

$$ \begin{aligned} \begin{bmatrix} 1.5 & 1.5 & 1.5 \\\ 1.5 & 1.5 & 1.5 \\\ 1.5 & 1.5 & 1.5 \end{bmatrix} \begin{bmatrix} v_1 \\\ v_2 \\\ v_3 \end{bmatrix} = \begin{bmatrix} 0 \\\ 0 \\\ 0 \end{bmatrix} \to \begin{bmatrix} 1 & 1 & 1\\\ 0 & 0 & 0 \\\ 0 & 0 & 0 \end{bmatrix} \end{aligned} $$

So \(v_1 = -v_2 - v_3\) . We parameterize \(v_2 = t\) and \(v_3 = w\) and get the following:

$$ \begin{aligned} \vec{v} = t \begin{bmatrix} - 1 \\\ 1 \\\ 0 \end{bmatrix} + w \begin{bmatrix} -1 \\\ 0 \\\ 1 \end{bmatrix} \end{aligned} $$

Finally, we have our three basis eigenvectors.

$$ \text{Eigenvectors} = \begin{aligned} \left \{ \begin{bmatrix} 1 \\\ 1 \\\ 1 \end{bmatrix} , \begin{bmatrix} - 1 \\\ 1 \\\ 0 \end{bmatrix} , \begin{bmatrix} -1 \\\ 0 \\\ 1 \end{bmatrix} \right\} \end{aligned} $$
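
Rather than expanding the cubic by hand, we can also let NumPy do the eigendecomposition for us. Since \(\bf{\Sigma}\) is symmetric, np.linalg.eigh is a natural choice; keep in mind it returns unit-length eigenvectors, so they are scaled (and possibly sign-flipped) versions of the basis vectors above.

import numpy as np

sigma = np.full((3, 3), 1.5)                       # the covariance matrix from step 2
eigenvalues, eigenvectors = np.linalg.eigh(sigma)  # eigh is for symmetric matrices
print(eigenvalues)          # roughly [0. 0. 4.5], in ascending order
print(eigenvectors[:, -1])  # roughly [0.577 0.577 0.577], i.e. [1, 1, 1] / sqrt(3)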

4. Sorting and Selecting k Eigenvectors

Originally we had a dimension of \(P = 3\); however, we want to select only the top \(K = 1\) eigenvectors to form a matrix. Since \(\lambda = 4.5\) is the largest eigenvalue, we choose its corresponding eigenvector.

$$ \begin{aligned} \bf{V}_1 &= \left [ \vec{v}_1\right ] \\&= \begin{bmatrix} 1 \\\ 1 \\\ 1 \end{bmatrix} \end{aligned} $$

In general, we form a \(\bf{V}_k\) matrix with rank \(K\), whose columns are the principal components we want to keep. In practice, each eigenvector is usually normalized to unit length first, which is what NumPy does for us below.
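
As a small sketch of this step, reusing the eigh decomposition from the previous snippet: sort the eigenvalues in descending order and keep the eigenvectors attached to the top K of them.

import numpy as np

sigma = np.full((3, 3), 1.5)
eigenvalues, eigenvectors = np.linalg.eigh(sigma)

k = 1
idx = np.argsort(eigenvalues)[::-1]  # indices of eigenvalues, largest first
V_k = eigenvectors[:, idx[:k]]       # keep the top-k eigenvectors as columns
print(V_k.shape)                     # (3, 1)
print(V_k.ravel())                   # roughly [0.577 0.577 0.577], up to sign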

5. Transform Data into New Subspace

Finally, we transform the normalized data \(\bf{\bar{X}}\) into the new subspace \(\bf{Y}\) using the matrix of eigenvectors.

$$ \begin{aligned} \bf{Y} &= \bar{\bf{X}}\bf{V}_k \\&= \begin{bmatrix} - 1.224 & - 1.224 & - 1.224 \\\ 0 & 0 & 0 \\\ 1.224 & 1.224 & 1.224 \end{bmatrix} \begin{bmatrix} 1 \\\ 1 \\\ 1 \end{bmatrix} \\& \approx \begin{bmatrix} -3.674 \\\ 0 \\\ 3.674 \end{bmatrix} \end{aligned} $$
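
Here is a quick check of this projection, both with the raw [1, 1, 1] eigenvector used in the hand calculation and with its unit-length version (which is what NumPy returns). The two results differ only by a factor of sqrt(3).

import numpy as np

X = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]], dtype=float)
X_bar = (X - X.mean(axis=0)) / X.std(axis=0)

v_raw = np.array([1.0, 1.0, 1.0])
v_unit = v_raw / np.linalg.norm(v_raw)  # [1, 1, 1] / sqrt(3)

print(X_bar @ v_raw)   # roughly [-3.674  0.  3.674], matching the hand calculation
print(X_bar @ v_unit)  # roughly [-2.121  0.  2.121], sqrt(3) times smaller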

Congratulations, we end up with a new matrix \(\bf{Y}\) that approximates our original dataset \(\bf{X}\). Let’s think of our new matrix as having a single column representing all of the ‘Taco de Matrix’ menu items together.

Now, we can use the new column in \(\bf{Y}\) to make predictions. Since there is only a single column, we can use a simple linear model to predict the price of the upcoming ‘Sine Cosine Churros’. This involves a single coefficient \(\beta\) multiplying the compressed feature \(\vec{y} \in \bf{Y}\).

New model:

$$ \begin{aligned} \vec{c}_{\text{'Sine Cosine Churros'}} &= \beta\,\vec{y}_{\text{'Taco de Matrix Menu'}} \end{aligned} $$

Python Implementation

The numerical computation that is embedded into the NumPy package you’re using is inherently subject to the small errors and vicissitudes of floating point numerical representations. Such errors and approximations are unavoidable with numerical computing.

import numpy as np

def pca(X: np.ndarray, k: int) -> np.ndarray:

    # Step 1: Standardize the data
    X_std = (X - np.mean(X, axis=0)) / np.std(X, axis=0)
    # Step 2: Calculate the covariance matrix
    covariance_matrix = np.cov(X_std.T)
    # Step 3: Calculate the eigenvectors and eigenvalues
    eigenvalues, eigenvectors = np.linalg.eig(covariance_matrix)
    # Step 4: Sort the eigenvectors by their eigenvalues
    idx = eigenvalues.argsort()[::-1]
    eigenvectors = eigenvectors[:, idx]
    top_k_eigenvectors = eigenvectors[:, :k]
    # Step 5: Transform the data into the new subspace
    new_data = X_std.dot(top_k_eigenvectors)
    return new_data


X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(X)

> [[1 2 3]
 [4 5 6]
 [7 8 9]]
Y = pca(X, 1)
print(Y)

> [[-2.12132034e+00]
 [ 0.00000000e+00]
 [ 2.12132034e+00]]

The values are \(\pm 2.121\) rather than the \(\pm 3.674\) we computed by hand because NumPy returns unit-length eigenvectors, so the hand result is simply scaled down by a factor of \(\sqrt{3}\).
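
In practice you would rarely write PCA by hand. Here is a quick sketch using scikit-learn (assuming it is installed); note that sklearn’s PCA centers the data but does not rescale it, so we standardize first to match the steps above.

import numpy as np
from sklearn.decomposition import PCA

X = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]], dtype=float)
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

Y = PCA(n_components=1).fit_transform(X_std)
print(Y)  # roughly [[-2.121], [0.], [2.121]], possibly with the sign flipped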

Conclusion

In conclusion, if we want to reduce the number of features in a dataset, we can estimate a new, smaller dataset using PCA. It is a form of data reduction, and it relies on the same linearity assumptions that go into computing covariance between all the columns of the original dataset.

This mitigates a lot of the problems that come with high dimensional datasets.