Principal Component Analysis

09/17/2025

Principal Component Analysis
- The objective of PCA is to find the axes along which the variance of the data is largest, and then project the data onto the lower-dimensional space spanned by those axes.
- PCA is not merely a dimensionality-reduction procedure; it finds the axes that best explain the essential structure of the data.
- Covariance Matrix
  - For the centered (zero-mean) data matrix \(\hat{X} \in \mathbb{R}^{n \times d}\), the covariance matrix \(\Sigma\) is defined as follows, where \(n\) is the number of samples, \(d\) is the number of features, \(\hat{X}_i\) denotes the \(i\)-th column (feature) of \(\hat{X}\), and \(\Sigma_{ij} = \operatorname{Cov}(\hat{X}_i, \hat{X}_j)\) is the covariance between feature \(i\) and feature \(j\). \[ \begin{aligned} \,\\ \Sigma &= \frac{1}{n}\hat{X}^{\top}\hat{X} \quad \in \mathbb{R}^{d \times d} \,\\ \,\\ \Sigma &= \frac{1}{n} \begin{bmatrix} \operatorname{dot}(\hat{X}_1, \hat{X}_1) & \operatorname{dot}(\hat{X}_1, \hat{X}_2) & \cdots & \operatorname{dot}(\hat{X}_1, \hat{X}_d) \\ \operatorname{dot}(\hat{X}_2, \hat{X}_1) & \operatorname{dot}(\hat{X}_2, \hat{X}_2) & \cdots & \operatorname{dot}(\hat{X}_2, \hat{X}_d) \\ \vdots & \vdots & \ddots & \vdots \\ \operatorname{dot}(\hat{X}_d, \hat{X}_1) & \operatorname{dot}(\hat{X}_d, \hat{X}_2) & \cdots & \operatorname{dot}(\hat{X}_d, \hat{X}_d) \end{bmatrix} \,\\ \end{aligned} \]
  - Let the data matrix \(X\) be as follows, where each row is a sample and each column is a variable. \[ \,\\ X = \begin{bmatrix} -6 & 2 \\ 6 & 4 \\ 3 & -5 \\ -5 & -4 \\ -5 & -9 \end{bmatrix} \,\\ \,\\ \] For this data, we obtain the centered data matrix \(\hat{X}\), which has zero mean in every column, by subtracting the column means \(\mu = [ -1.4, -2.4 ]\) from each row: \[ \,\\ \hat{X} = \begin{bmatrix} -4.6 & 4.4 \\ 7.4 & 6.4 \\ 4.4 & -2.6 \\ -3.6 & -1.6 \\ -3.6 & -6.6 \end{bmatrix} \,\\ \,\\ \] The covariance matrix then follows from the definition above: \[ \,\\ \Sigma = \frac{1}{5} \begin{bmatrix} 121.2 & 45.2 \\ 45.2 & 113.2 \end{bmatrix} = \begin{bmatrix} \color{red}{24.24} & \color{green}{9.04} \\ \color{green}{9.04} & \color{blue}{22.64} \end{bmatrix} \,\\ \,\\ \] From a geometric perspective, the diagonal elements \(\Sigma_{11}\) and \(\Sigma_{22}\) are the variances along the \(\color{red}{\text{x-axis}}\) and \(\color{blue}{\text{y-axis}}\), respectively. The symmetric off-diagonal elements \(\Sigma_{12}\) and \(\Sigma_{21}\) are the \(\color{green}{\text{covariance}}\) between x and y, which indicates how much x and y change together.
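The worked example above can be reproduced with a few lines of NumPy. This is a minimal sketch, not part of the original notes; the variable names (`X_hat`, `Sigma`, `Sigma_np`) are chosen here for illustration. It uses the \(\frac{1}{n}\) normalization from the definition above, so the cross-check with `np.cov` passes `bias=True` (NumPy's default divides by \(n - 1\) instead).

```python
import numpy as np

# Data matrix from the example: rows are samples, columns are variables.
X = np.array([[-6, 2],
              [6, 4],
              [3, -5],
              [-5, -4],
              [-5, -9]], dtype=float)

# Center each column by subtracting its mean.
mu = X.mean(axis=0)
X_hat = X - mu

# Covariance matrix with the 1/n normalization used in the text.
n = X_hat.shape[0]
Sigma = X_hat.T @ X_hat / n

# Cross-check with NumPy; bias=True selects 1/n instead of 1/(n - 1).
Sigma_np = np.cov(X, rowvar=False, bias=True)

print("mu =", mu)
print("Sigma =\n", Sigma)
```

The printed `mu` and `Sigma` should agree with \(\mu\) and \(\Sigma\) in the example, up to floating-point rounding.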
- Eigenvalues and Eigenvectors of Covariance Matrix
  - Because of the geometric properties of the covariance matrix, we can find the principal component with the maximum variance by computing its eigenvalues and eigenvectors. Each eigenvalue gives the variance of the data along its corresponding eigenvector.
  - As noted above, the covariance matrix is symmetric, and an important property of real symmetric matrices is that eigenvectors belonging to distinct eigenvalues are orthogonal to each other. A real symmetric \(d \times d\) matrix always has \(d\) mutually orthogonal eigenvectors, so the eigendecomposition of the covariance matrix yields a complete set of orthogonal axes.
  - Why does the eigenvalue of each eigenvector represent the variance?
    
    For the centered data matrix \(\hat{X} \in \mathbb{R}^{n \times d}\), projecting all samples onto a direction \(v \in \mathbb{R}^{d}\) gives: \[ \,\\ \hat{X}v \in \mathbb{R}^{n \times 1}. \,\\ \,\\ \] Because every column of \(\hat{X}\) has zero mean, the projection \(\hat{X}v\) also has zero mean, so its variance is simply the mean of its squared entries. Using the same \(\frac{1}{n}\) normalization as in the definition above: \[ \,\\ \begin{aligned} \text{Var}(\hat{X}v) &= \frac{1}{n} (\hat{X}v)^\top \hat{X}v \,\\ \,\\ &= \frac{1}{n} v^\top \hat{X}^\top \hat{X} v \,\\ \,\\ &= v^\top \left( \frac{1}{n} \hat{X}^\top \hat{X} \right) v \end{aligned} \,\\ \,\\ \] The covariance matrix is defined as \( \Sigma = \frac{1}{n}\hat{X}^\top \hat{X} \), so this can be rewritten as: \[ \,\\ \text{Var}(\hat{X}v) = v^\top \Sigma \, v \,\\ \,\\ \] If \(v\) is an eigenvector of \(\Sigma\) with eigenvalue \(\lambda\), the defining relation \(\Sigma v = \lambda v\) gives: \[ \begin{aligned} \,\\ v^\top \Sigma \, v &= v^\top \lambda v \,\\ \,\\ &= \lambda \, v^\top v \,\\ \,\\ \end{aligned} \] Normalizing \(v\) so that \(\|v\| = 1\) gives \(v^\top v = 1\). Therefore: \[ \begin{aligned} \,\\ \text{Var}(\hat{X}v) &= \lambda v^{\top} v \,\\ \,\\ &= \lambda \cdot 1 \,\\ \,\\ &= \lambda \,\\ \,\\ \end{aligned} \] So the eigenvalue \(\lambda\) is exactly the variance of the data along its corresponding unit eigenvector \(v\).
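The result \(\text{Var}(\hat{X}v) = \lambda\) can be checked numerically on the example data. The following is a sketch under the same \(\frac{1}{n}\) convention; the names `eigvals`, `eigvecs`, `projected`, `var_along` and `z` are illustrative assumptions, not from the original notes. It uses `np.linalg.eigh`, which is intended for symmetric matrices and returns eigenvalues in ascending order with unit-norm eigenvectors as columns, so the results are reordered to put the largest variance first.

```python
import numpy as np

# Example data and its centered version (same as in the covariance example).
X = np.array([[-6, 2],
              [6, 4],
              [3, -5],
              [-5, -4],
              [-5, -9]], dtype=float)
X_hat = X - X.mean(axis=0)
n = X_hat.shape[0]
Sigma = X_hat.T @ X_hat / n

# Eigendecomposition of the symmetric covariance matrix.
eigvals, eigvecs = np.linalg.eigh(Sigma)

# Sort so the first column is the direction of maximum variance.
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]
eigvecs = eigvecs[:, order]

# Project every sample onto each eigenvector (columns of eigvecs).
projected = X_hat @ eigvecs

# The projections have zero mean, so the 1/n variance is the mean square.
var_along = (projected ** 2).mean(axis=0)

# Reduce to one dimension by keeping only the first principal component.
z = X_hat @ eigvecs[:, :1]

print("eigenvalues:", eigvals)
print("variance along eigenvectors:", var_along)
```

The two printed arrays should match, and the eigenvalues should sum to the trace of \(\Sigma\) (24.24 + 22.64 = 46.88), since the total variance is preserved by the change of axes.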
