Mathematics for Machine Learning

04/18/2024

Mathematics for Machine Learning, Marc Peter Disenroth, A. Aldo Faisal, Cheng Soon Ong

1. Introduction

We believe that the mathematical foundations of machine learning are important in order to understand fundamental principles upon which more complicated machine learning systems are built. Understanding these principles can facilitate creating new machine learning solutions, understanding and debugging existing approaches, and learning about the inherent assumptions and limitations of the methodologies we are working with.

A model is typically used to describe a process for generating data, similar to the dataset at hand. Therefore, good models can also be thought of as simplified versions of the real (unknown) data-generating process, capturing aspects that are relevant for modeling the data and extracting hidden patterns from it.

Let us summarize the main concepts of machine learning that we cover in this book

We represent data as vectors
We choose an appropriate model, either using the probabilistic or optimization view
We learn from available data by using numerical optimization methods with the aim that the model performs well on data not used for training.

2. Linear Algebra

Linear algebra is the study of vectors and certain rules to manipulate vectors. The vectors many of us know from school are called "geometric vectors", which are usually denoted by a small arrow above the letter, e.g., \(\overrightarrow{x}\) and \(\overrightarrow{y}\). Interpreting vectors as geometric vectors enables us to use our intuitions about direction and magnitude to reason about mathematical operations.

Here are some examples of such vector objects:

Geometric vectors. Two geometric vectors \(\overrightarrow{\boldsymbol{x}}\), \(\overrightarrow{\boldsymbol{y}}\) can be added, such that \(\overrightarrow{\boldsymbol{x}} + \overrightarrow{\boldsymbol{y}} = \overrightarrow{\boldsymbol{z}}\) is another geometric vector. Furthermore, multiplication by a scalar \(\lambda \overrightarrow{\boldsymbol{x}}, \lambda \in \mathbf{R} \), is also a geometric vector. Interpreting vectors as geometric vectors enables us to use our intuitions about direction and magnitude to reason about mathematical operations.
Polynomials are also vectors. Two polynomials can be added together, which results in another polynomial; and they can be multiplied by a scalar \(\lambda \in \mathbb{R}\) and the result is a polynomial as well. Note that polynomials are very different from geometric vectors. While geometric vectors are concrete “drawings”, polynomials are abstract concepts. However, they are both vectors in the sense.
Audio signals are vectors. Audio signals are represented as a series of numbers. We can add audio signals together, and their sum is a new audio signal. If we scale an audio signal, we also obtain an audio signal.
Elements of \(\mathbb{R}^n\) are vectors (tuples of \(n\) real numbers). \(\mathbb{R}^n \) is more abstract than polynomials, and it is the concept we focus on in this book. For instance, \[ \,\\ \boldsymbol{a} = \begin{bmatrix} 1 \\ 2 \\ 3 \end{bmatrix} \in \mathbb{R}^3 \,\\ \] Adding two vectors \(\boldsymbol{a}, \boldsymbol{b} \in \mathbb{R}^n \) component-wise results in another vector: \(\boldsymbol{a}+\boldsymbol{b}=\boldsymbol{c} \in \mathbb{R}^n \). Moreover, multiplying \(\boldsymbol{a} \in \mathbb{R}^n \) by \(\lambda \in \mathbb{R} \) results in a scaled vector \(\lambda a \in \mathbb{R}^n\). Considering vectors as elements of \(\mathbb{R}^n\) has an additional benefit that it loosely corresponds to arrays of real numbers on a computer. Many programming languages support array operations, which allow for convenient implementation of algorithms that involve vector operations.

Linear algebra focuses on the similarities between these vector concepts. We will largely focus on vectors in \(\mathbb{R}^n\) since most algorithms in linear algebra are formulated in \(\mathbb{R}^n\)

One major idea in mathematics is the idea of "closure". This is the question: What is the set of all things that can result from my proposed operations? In the case of vectors: What is the set of vectors that can result by starting with a small set of vectors, and adding them to each other and scaling them? This results in a vector space. The concept of a vector space and its properties underlie much of machine learning.

Linear algebra plays an important role in machine learning and general mathematics. The concepts introduced in this chapter are further expanded to include the idea of geometry.

A mind map of the concepts where Linear Algebra is used in other parts

2.1 Systems of Linear Equations

Systems of linear equations play a central part of linear algebra. Many problems can be formulated as systems of linear equations, and linear algebra gives us the tools for solving them.

Example 2.2
The system of linear equations \[ \,\\ \begin{matrix} x_1 & + & x_2 & + & x_3 & = & 3 & (1) \\ x_1 & - & x_2 & + & 2x_3 & = & 2 & (2) \\ 2x_1 & & & + & 3x_3 & = & 1 & (3) \end{matrix} \,\\ \] has no solution: Adding the first two equations yields \(2x_1+3_3 = 5 \), which constradicts the third equation \((3)\).

Let us have a look at the system of linear equations \[ \,\\ \begin{matrix} x_1 & + & x_2 & + & x_3 & = & 3 & (1) \\ x_1 & - & x_2 & + & 2x_3 & = & 2 & (2) \\ & & x_2 & + & x_3 & = & 2 & (3) \end{matrix} \,\\ \] From the first and third equation, it follows that \(x_1 = 1\). From \((1)+(2)\), we get \(2x_1 + 3x_3 = 5 \), i.e., \(x_3 = 1\). From \((3)\), we then get that \(x_2 = 1\). Therefore \((1,1,1)\) is the only possible and unique solution.

As a third example, we consider \[ \,\\ \begin{matrix} x_1 & + & x_2 & + & x_3 & = & 3 & (1) \\ x_1 & - & x_2 & + & 2x_3 & = & 2 & (2) \\ 2x_1 & & & + & 3x_3 & = & 5 & (3) \end{matrix} \,\\ \] Since \((1)+(2)=(3)\), we can obmit the third equation. From \((1)\) and \((2)\), we get \(2_x1 = 5-3x_3 \) and \(2x_2 = 1 + x_3\). We can define \(x_3 = a \in \mathbb{R}\) as a free variable, such that any triplet is a solution of the system of linear equations, i.e., we can obtain a solution set that contains infinitely many solutions. \[ \,\\ \left( \frac{5}{2}-\frac{3}{2}a, \quad \frac{1}{2}+\frac{1}{2}a, \quad a \right) \,\\ \]

In general, for a real-valued system of linear equations we obtain either no, exactly one, or infinitely many solutions.

Remark (Geometric Interpretation of Systems of Linear Equations). In a system of linear equations with two variables \(x_1, x_2\), each linear equation define a line on the \(x_1, x_2\text{-plane}\). Since a solution to a system of linear equations must satisfy all equations simultaneously, the solution set is the intersection of these lines. This intersection set can be a line (if the linear equations describe the same line), a point, or empty (when the lines are parallel). \[ \,\\ \begin{align} 4x_1 + 4x_2 &= 5 \\ 2x_1 - 4x_2 &= 1 \end{align} \,\\ \] For the above system, we can get geometric intuition of the solution by plotting both equations as lines in the \(x_1, x_2\)-plane. The point where the two lines intersect represents the unique solution to the system.

The solution space of a system of two linear equations with two variables can be geometrically interpreted as the… — The solution space of a system of two linear equations with two variables
can be geometrically interpreted as the intersection of two lines

2.2 Matrices

Matrices play a central role in linear algebra. They can be used to compactly represent systems of linear equations, but they also represent linear functions (linear mappings).

Definition 2.1 (Matrix). With \(m, n \in \mathbb{N} \) a real-valued \((m,n)\) matrix \(A\) is an \(m\cdot n\)-tuple of elements \(a_{ij}, i=1, ..., m, \, j=1, ..., n\), which ordered according to a rectangular scheme consisting of \(m\) rows and \(n\) columns: \[ \,\\ A = \begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & & \vdots \\ a_{m1} & a_{m2} & \cdots & a_{mn} \end{bmatrix} ,\quad a_{ij} \in \mathbb{R} \,\\ \] \((1, n)\) or \((n, 1)\) matrices are called row/column vectors. \(\mathbb{R}^{m \times n}\) is the set of all real-valued \((m,n)\)-matrices. \(A\in \mathbb{R}^{m \times n} \) can be equivalently represented as \(a \in \mathbb{R}^{mn} \) by stacking all \(n\) columns of the matrix into a long vector.

2.2.1 Matrix Addition and Multiplication

The sum of two matrices \(A \in \mathbb{R}^{m \times n} \), \(B \in \mathbb{R}^{m \times n} \) is defined as the element-wise sum. For matrices \(A \in \mathbb{R}^{m \times n} \), \(B \in \mathbb{R}^{n \times k} \), the elements of the product \(C = AB \in \mathbb{R}^{m \times k} \) are computed as \[ \,\\ c_{ij} = \sum_{l=1}^{n} a_{il} b_{lj}, \quad i = 1, ..., m, \quad j = 1, ..., k \,\\ \] This means, to compute element \(c_{ij}\) we multipy the elements of the \(i\)th row of \(A\) with the \(j\)th column of \(B\) and sum them up. We will call this the dot product of the corresponding row and column.

Remark. Matrices can only be multiplied if their "neighboring" dimensions match. For instance, an \(n \times k \)-matrix \(A\) can be multiplied with a \(k \times m \)-matrix \(B\), but only from the left side: \[ \,\\ \underbrace{A}_{n \times k} \; \underbrace{B}_{k \times m} = \underbrace{C}_{n \times m} \,\\ \] The product \(BA\) is not defined if \(m \neq n\) since the neighboring dimensions do not match.

Remark. Matrix multiplication is not defined as an element-wise operation on matrix elements, i.e., \(c_{ij} \neq a_{ij} b_{ij} \), and is called a Hadamard product.

Example 2.3
For \( A = \begin{bmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \end{bmatrix} \in \mathbb{R}^{2 \times 3}, \quad B = \begin{bmatrix} 0 & 2 \\ 1 & -1 \\ 0 & 1 \end{bmatrix} \in \mathbb{R}^{3 \times 2} \),　we obtain \[ \,\\ AB = \begin{bmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \end{bmatrix} \begin{bmatrix} 0 & 2 \\ 1 & -1 \\ 0 & 1 \end{bmatrix} = \begin{bmatrix} 2 & 3 \\ 2 & 5 \end{bmatrix} \in \mathbb{R}^{2 \times 2} \,\\ \,\\ BA = \begin{bmatrix} 0 & 2 \\ 1 & -1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 1 & 2 & 3 \\ 3 & 2 & 1 \end{bmatrix} = \begin{bmatrix} 6 & 4 & 2 \\ -2 & 0 & 2 \\ 3 & 2 & 1 \end{bmatrix} \in \mathbb{R}^{3 \times 3} \,\\ \]

From this example, we can already see that the matrix multiplication is not commutative, i.e., \(AB \neq BA\).

Definition 2.2 (Identity Matrix). In \(\mathbb{R}^{n\times m} \), we define the identity matrix as the \(n \times m \)-matrix containing \(1\) on the diagonal and \(0\) everywhere else. \[ \,\\ I_n = \begin{bmatrix} 1 & 0 & \cdots & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 0 & \cdots & 1 \end{bmatrix} \in \mathbb{R}^{n\times n} \,\\ \] Now that we defined matrix multiplication, matrix addition and the identity matrix, let us have a look at some properties of matrices:

Associativity: \[ \forall A \in \mathbb{R}^{m\times n}, \quad B \in \mathbb{R}^{n \times p}, \quad C \in \mathbb{R}^{p \times q}: \,\, (AB)C = A(BC) \]
Distributivity: \[ \begin{align} \forall A, B \in \mathbb{R}^{m \times n}, \quad C, D \in \mathbb{R}^{n \times p}:& \,\, (A+B)C = AC+BC \\ & \,\, A(C+D) = AC+AD \end{align} \]
Multiplication with the identity matrix: \[ \forall A \in \mathbb{R}^{m \times n}: \,\, I_m A = AI_n = A \]

2.2.2 Inverse and Transpose

Definition 2.3 (Inverse). Consider a square matrix \(A \in \mathbb{R}^{n \times n} \). Let matrix \(B \in \mathbb{R}^{n \times n} \) have he property that \(AB = I_n = BA\). \(B\) is called the inverse of \(A\) and denoted by \(A^{-1}\).

Unfortunately, not every matrix \(A\) possesses an inverse \(A^{-1}\). If the inverse does exist, \(A\) is called regular/invertible/nonsingular, otherwise singular/noninvertible. When the matrix inverse exists, it is unique.

Remark (Existence of the Inverse of a \(2 \times 2\) matrix). Consider a matrix \[ \,\\ A := \begin{bmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{bmatrix} \in \mathbb{R}^{2 \times 2} \,\\ \] If we multiply \(A\) with \[ \,\\ B := \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix} \,\\ \] we obtain \[ \,\\ \begin{align} AB = \begin{bmatrix} a_{11}a_{22} - a_{12}a_{21} & 0 \\ 0 & a_{11}a_{22} - a_{12}a_{21} \end{bmatrix} &= (a_{11}a_{22} - a_{12}a_{21}) \, I \\ &= \det (A) \, I \end{align} \,\\ \] Therefore \[ \,\\ A^{-1} = \frac{1}{a_{11}a_{22}-a_{12}a_{21}} \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix} \,\\ \] if and only if \(\det(A) \neq 0\). Furthermore we can generally use the determinant to check whether a matrix is invertible.

Example 2.4 (Inverse Matrix)
The matrices \[ \,\\ A = \begin{bmatrix} 1 & 2 & 1 \\ 4 & 4 & 5 \\ 6 & 7 & 7 \end{bmatrix} , \quad B = \begin{bmatrix} -7 & -7 & 6 \\ 2 & 1 & -1 \\ 4 & 5 & -4 \end{bmatrix} \,\\ \,\\ \begin{align} AB &= \begin{bmatrix} -7+4+4 & -7+2+5 & 6-2-4 \\ -28+8+20 & -28+4+25 & 24-4-20 \\ -42+14+28 & -42+7+35 & 36-7-28 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = I \,\\ \,\\ BA &= \begin{bmatrix} -7-28+36 & -14-28+42 & -7-35+42 \\ 2+4-6 & 4+4-7 & 2+5-7 \\ 4+20-24 & 8+20-28 & 4+25-28 \end{bmatrix} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} = I \end{align} \,\\ \] are inverse to each other since \(AB = I = BA\)

Definition 2.4 (Transpose). For \(A \in \mathbb{R}^{m \times n} \) the matrix \(B \in \mathbb{R}^{n \times m} \) with \(b_{ij}=a_{ji} \) is called the transpose of \(A\). We write \(B = A^{\top} \).

In general, \(A^{\top}\) can be obtained by writing the columns of \(A\) as the rows of \(A^{\top}\). The following are some important properties of inverses and transposes: \[ \,\\ \begin{align} AA^{-1} &= I = A^{-1}A \\ \,\\ (AB)^{-1} &= B^{-1}A^{-1} \\ \,\\ (A+B)^{-1} &\neq A^{-1} + B^{-1} \\ \,\\ (A^{\top})^{\top} &= A \\ \,\\ (A+B)^{\top} &= A^{\top}+B^{\top} \\ \,\\ (AB)^{\top} &= B^{\top} + A^{\top} \end{align} \,\\ \] Definition 2.5 (Symmetric Matrix). A matrix \(A \in \mathbb{R}^{n \times n} \) is symmetric if \(A = A^{\top} \).

Note that, if \(A\) is invertible, then so is \(A^{\top}\), and \((A^{-1})^{\top} = (A^{\top})^{-1} =: A^{-\top} \).

Remark (Sum and Product of Symmetric Matrices). The sum of symmetric matrices \(A, B \in \mathbb{R}^{n\times n} \) is always symmetric. However, although their product is always defined, it is generally not symmetric: \[ \,\\ \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} = \begin{bmatrix} 1 & 1 \\ 0 & 0 \end{bmatrix} \,\\ \]

2.2.3 Multiplication by a Scalar

Let us look at what happens to matrices when they are multiplied by a scalar \(\lambda \in \mathbb{R}\). Let \(A \in \mathbb{R}^{m\times n} \). Then \(\lambda A = K, \, K_{ij} = \lambda a_{ij} \). Pratically, \(\lambda\) scales each element of \(A\). For \(\lambda, \psi \in \mathbb{R} \), the following holds:

Associativity: \[ (\lambda \psi)C = \lambda(\psi C), C \in \mathbb{R}^{m \times n} \\ \]
\(\lambda(BC) = (\lambda B)C = B(\lambda C) = (BC)\lambda, \quad B \in \mathbb{R}^{m \times n},\quad C \in \mathbb{R}^{n \times k}\)
Note that this allows us to move scalar values around.
\( (\lambda C)^{\top} = C^{\top}\lambda^{\top} = C^{\top}\lambda = \lambda C^{\top} \) since \(\lambda = \lambda^{\top}, \forall \lambda \in \mathbb{R} \)
Distributivity: \[ (\lambda + \psi)C = \lambda C + \psi C, \quad C \in \mathbb{R}^{m \times n} \\ \lambda (B+C) = \lambda B + \lambda C, \quad B, C \in \mathbb{R}^{m \times n} \]

Example 2.5 (Distributivity)
If we define \[ \,\\ C := \begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}, \,\\ \] then for any \(\lambda, \psi \in \mathbb{R}\) we obtain \[ \,\\ \begin{align} (\lambda + \psi)C &= \begin{bmatrix} (\lambda + \psi)1 & (\lambda + \psi)2 \\ (\lambda + \psi)3 & (\lambda + \psi)4 \end{bmatrix} = \begin{bmatrix} \lambda + \psi & 2\lambda + 2\psi \\ 3\lambda + 3\psi & 4\lambda + 4\psi \end{bmatrix} \,\\ \,\\ &= \begin{bmatrix} \lambda & 2\lambda \\ 3\lambda & 4\lambda \end{bmatrix} + \begin{bmatrix} \psi & 2\psi \\ 3\psi & 4\psi \end{bmatrix} = \lambda C + \psi C \end{align} \,\\ \]

2.3 Solving Systems of Linear Equations

2.3.1 Particular and General Solution

Before discussing how to generally solve systems of linear equations, let us have a look at an example. Consider the system of equations \[ \,\\ \begin{bmatrix} 1 & 0 & 8 & -4 \\ 0 & 1 & 2 & 12 \end{bmatrix} \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \end{bmatrix} = \begin{bmatrix} 42 \\ 8 \end{bmatrix} \,\\ \] The system has two equations and four unknowns. Therefore, in general we would expect infinitely many solutions. This system of equations is in a particularly easy form, where the first two columns consist of a \(1\) and a \(0\). A solution to the problem can be found found immediately by taking 42 times the first column and 8 times the second column so that \[ \,\\ b = \begin{bmatrix} 42 \\ 8 \end{bmatrix} = 42 \begin{bmatrix} 1 \\ 0 \end{bmatrix} + 8 \begin{bmatrix} 0 \\ 1 \end{bmatrix} \,\\ \] Therefore, a solution is \([42, 8, 0, 0]^{\top}\). This solution is called a particular solution or special solution. However, this is not the only solution of this system of linear equations. To capture all the other solutions, we need to be creative in generating 0 in a non-trivial way using the columns of the matrix \[ \,\\ c_3 = \begin{bmatrix} 8 \\ 2 \end{bmatrix} = 8 \begin{bmatrix} 1 \\ 0 \end{bmatrix} + 2 \begin{bmatrix} 0 \\ 1 \end{bmatrix} \,\\ \] so that \(8c_1 + 2c_2 = 1c_3 \), and \( 8c_1 + 2c_2 - 1c_3 + 0c_4 = 0\). Therefore \((x_1, x_2, x_3, x_4) = (8, 2, -1, 0)\). In fact, any scaling of this solution by \(\lambda_1 \in \mathbb{R}\) produces the 0 vector, i.e., \[ \,\\ \begin{bmatrix} 1 & 0 & 8 & -4 \\ 0 & 1 & 2 & 12 \end{bmatrix} \left( \lambda_1 \begin{bmatrix} 8 \\ 2 \\ -1 \\ 0 \end{bmatrix} \right) = \lambda_1(8c_1 + 2c_2-c_3) = 0 \,\\ \] Following the same line of reasoning, we express the fourth column of the matrix using the first two columns and generate another set as \[ \,\\ \begin{bmatrix} 1 & 0 & 8 & -4 \\ 0 & 1 & 2 & 12 \end{bmatrix} \left( \lambda_2 \begin{bmatrix} -4 \\ 12 \\ 0 \\ -1 \end{bmatrix} \right) = \lambda_2(-4c_1 + 12c_2-c_4) = 0 \,\\ \] for any \(\lambda \in \mathbb{R} \). Putting everthing together, we obtain all solutions of the equation system, which is called the general solution, as the set \[ \,\\ \left\{ x \in \mathbb{R}^4 : x = \begin{bmatrix} 42 \\ 8 \\ 0 \\ 0 \end{bmatrix} + \lambda_1 \begin{bmatrix} 8 \\ 2 \\ -1 \\ 0 \end{bmatrix} + \lambda_2 \begin{bmatrix} -4 \\ 12 \\ 0 \\ -1 \end{bmatrix} ,\quad \lambda_1, \, \lambda_2 \in \mathbb{R} \right\} \,\\ \] Remark. The general approach we followed consisted of the following three steps:

Find a particular solution to \(Ax = b\).
Find all solutions to \(Ax = 0\).
Combine the solutions from steps 1. and 2. to the general solution.

The system of linear equations in the preceding example was easy to solve because the matrix. However, general equation systems are not of this simple form. Fortunately, there exists a constructive algorithmic way of transforming any system of linear equations into this particularly simple form: Gaussian elimination. Key to Gaussian elimination are elementary transformations of systems of linear equations, which transform the equation system into a simple form. Then, we can apply the three steps to the simple form

2.3.2 Elementary Transformations

Exchange of two equations (rows in the matrix representing the system of equations)
Multiplication of an equation (row) with a constant \(\lambda \in \mathbb{R} \setminus \{0\} \)
Addition of two equations(rows)

Example 2.6
For \(a \in \mathbb{R}\), we seek all solutions of the following system of equations: \[ \,\\ \begin{array}{*{11}{r}} -2x_1 & + & 4x_2 & - & 2x_3 & - & x_4 & + & 4x_5 & = & -3 \\ 4x_1 & - & 8x_2 & + & 3x_3 & - & 3x_4 & + & x_5 & = & 2 \\ x_1 & - & 2x_2 & + & x_3 & - & x_4 & + & x_5 & = & 0 \\ x_1 & - & 2x_2 & & & - & 3x_4 & + & 4x_5 & = & a \end{array} \,\\ \] the system can be converted into compact matrix \(Ax = b\) as an augmented matrix (in the form \([A \mid b]\)) \[ \left[ \begin{array}{rrrrr|r} -2 & 4 & -2 & -1 & 4 & -3 \\ 4 & -8 & 3 & -3 & 1 & 2 \\ 1 & -2 & 1 & -1 & 1 & 0 \\ 1 & -2 & 0 & -3 & 4 & a \end{array} \right] \begin{array}{l} \text{Swap with } R_3 \\ \\ \text{Swap with } R_1 \\ \\ \end{array} \] Swapping \(R_1\) and \(R_3\) leads to \[ \,\\ \left[ \begin{array}{rrrrr|r} 1 & -2 & 1 & -1 & 1 & 0 \\ 4 & -8 & 3 & -3 & 1 & 2 \\ -2 & 4 & -2 & -1 & 4 & -3 \\ 1 & -2 & 0 & -3 & 4 & a \end{array} \right] \begin{array}{l} \\ -4R_1 \\ +2R_1 \\ -R_1 \end{array} \,\\ \] When we now apply the transformations, we obtain \[ \,\\ \begin{align} &\left[ \begin{array}{rrrrr|r} 1 & -2 & 1 & -1 & 1 & 0 \\ 0 & 0 & -1 & 1 & -3 & 2 \\ 0 & 0 & 0 & -3 & 6 & -3 \\ 0 & 0 & -1 & -2 & 3 & a \end{array} \right] \begin{array}{l} \\ \\ \\ -R_2 - R_3 \end{array} \,\\ \,\\ \rightarrow \,\, &\left[ \begin{array}{rrrrr|r} 1 & -2 & 1 & -1 & 1 & 0 \\ 0 & 0 & -1 & 1 & -3 & 2 \\ 0 & 0 & 0 & -3 & 6 & -3 \\ 0 & 0 & 0 & 0 & 0 & a + 1 \end{array} \right] \begin{array}{l} \\ \times \, (-1) \\ \times \, (-\frac{1}{3}) \\ \end{array} \,\\ \,\\ \rightarrow \,\, &\left[ \begin{array}{rrrrr|r} 1 & -2 & 1 & -1 & 1 & 0 \\ 0 & 0 & 1 & -1 & 3 & -2 \\ 0 & 0 & 0 & 1 & -2 & 1 \\ 0 & 0 & 0 & 0 & 0 & a + 1 \end{array} \right] \begin{array}{l} \\ \\ \\ \end{array} \end{align} \,\\ \] This matrix is in a convenient form, the row-echelon form. Reverting this compact notation back into the explicit notation with the variables we seek, we obtain \[ \,\\ \begin{array}{*{11}{r}} x_1 & - & 2x_2 & + & x_3 & - & x_4 & + & x_5 & = & 0 \\ & & & & x_3 & - & x_4 & + & 3x_5 & = & -2 \\ & & & & & & x_4 & - & 2x_5 & = & 1 \\ & & & & & & & & 0 & = & a + 1 \end{array} \,\\ \] Only for \(a = =1\) this system can be solved. A particular solution is \[ \,\\ \begin{bmatrix} x_1 \\ x_2 \\ x_3 \\ x_4 \\ x_5 \end{bmatrix} = \begin{bmatrix} 2 \\ 0 \\ -1 \\ 1 \\ 0 \end{bmatrix} \,\\ \] From the row-echelon form, we can get free variables \(x_2, x_5\) \[ \,\\ \begin{array}{*{11}{r}} x_1 & = & 2x_2 & + & 2x_5 \\ x_2 & = & x_2 & & \\ x_3 & = & & - & x_5 \\ x_4 & = & & & 2x_5 \\ x_5 & = & & & x_5 \end{array} \,\\ \] Then, substitute them as \(x_2 = \lambda_1\) and \(x_5 = \lambda_2\) \[ \,\\ \begin{array}{*{11}{r}} x_1 & = & 2\lambda_1 & + & 2\lambda_2 \\ x_2 & = & \lambda_1 & & \\ x_3 & = & & - & \lambda_2 \\ x_4 & = & & & 2\lambda_2 \\ x_5 & = & & & \lambda_2 \end{array} \,\\ \] The general solution, which capture the set of all possible solutions, is \[ \,\\ \left\{ x \in \mathbb{R}^5 : x = \begin{bmatrix} 2 \\ 0 \\ -1 \\ 1 \\ 0 \end{bmatrix} + \lambda_1 \begin{bmatrix} 2 \\ 1 \\ 0 \\ 0 \\ 0 \end{bmatrix} + \lambda_2 \begin{bmatrix} 2 \\ 0 \\ -1 \\ 2 \\ 1 \end{bmatrix} ,\quad \lambda_1, \, \lambda_2 \in \mathbb{R} \right\} \,\\ \]

Remark (Povots and Staircase Structure). The leading coefficient of a row (first nonzero number from the left) is called the pivot and is always strictly to the right of the pivot of the row above it. Therefore, any equation system in row-echelon form always has a "staircase" structure.

Definition 2.6 (Row-Echelon Form). A matrix is in row-echeon form if

All rows that contain only zeros are at the bottom of the matrix; all rows that contain at least one nonzero element are on top of rows that contain only zeros.
Looking at nonzero rows only, the first nonzero number from the left (also called the pivot or the leading coefficient) is always strictly to the right of the pivot of the row above it.

Remark (Basic and Free Variables). The variables corresponding to the pivots in the row-echelon form are called basic variables and the other variables are free variables.

Remark (Obtaining a Particular Solution). We express the right-hand side of the equation system using the pivot columns, such that, \(b = \sum_{i=1}^P \lambda_i p_i \), where \(p_i, \, i=1, ..., P\), are the pivot columns. In the previous example (2.6), we would try to find \(\lambda_1, \lambda_2, \lambda_3 \) so that \[ \,\\ \lambda_1 \begin{bmatrix} 1\\ 0\\ 0\\ 0 \end{bmatrix} + \lambda_2 \begin{bmatrix} 1\\ 1\\ 0\\ 0 \end{bmatrix} + \lambda_3 \begin{bmatrix} -1\\ -1\\ 1\\ 0 \end{bmatrix} = \begin{bmatrix} 0\\ -2\\ 1\\ 0 \end{bmatrix} \,\\ \] From here, we find relatively directly that \(\lambda_1 = 2, \lambda_2 = -1, \lambda_3 = 1\). When we put everything together, we must not forget the non-pivot columns for which we set the coefficients implicitly to 0, i.e., \([\lambda_1, 0, \lambda_2, \lambda_3, 0] \). Therefore, we get the particular solution \(x = [2, 0, -1, 1, 0]^{\top} \).

Remark (Reduced Row Echelon Form). An equation is in reduced row-echelon form if

It is in row-echelon form.
Every pivot is 1.
The pivot is the only nonzero entry in its column.

Gaussian Elimination. Gaussian elimination is an algorithm that performs elementary transformations to bring a system of linear equations into reduced row-echelon form.

Example 2.7
Verify that the following matrix is in reduced row-echelon form: \[ \,\\ A = \begin{bmatrix} 1 & 3 & 0 & 0 & 3 \\ 0 & 0 & 1 & 0 & 9 \\ 0 & 0 & 0 & 1 & -4 \\ \end{bmatrix} \,\\ \] The key idea for finding the solutions of \(Ax = 0\) is to look at the non-pivot columns, which we will need to express as a (linear) combination of the pivot columns. In other words,

\[ \text{non-pivot column} = \text{some combination of pivot columns} \]
Returning to the example, we want to solve \(Ax = 0\). Here, we can define \(A\) as a collection of column vectors as \(A = [c_1, c_2, c_3, c_4, c_5]\), and \(Ax\) is to find all \(x\) for the equation below: \[ \,\\ \begin{align} x_1 c_1 + x_2 c_2 + x_3 c_3 + x_4 c_4 + x_5 c_5 &= 0 \,\\ \,\\ x_1 \begin{bmatrix} 1\\ 0\\ 0 \end{bmatrix} + x_2 \begin{bmatrix} 3\\ 0\\ 0 \end{bmatrix} + x_3 \begin{bmatrix} 0\\ 1\\ 0 \end{bmatrix} + x_4 \begin{bmatrix} 0\\ 0\\ 1 \end{bmatrix} + x_5 \begin{bmatrix} 3\\ 9\\ 4 \end{bmatrix} &= 0 \end{align} \,\\ \] In the matrix \(A\), the first non-pivot column \(c_2\) is \[ \,\\ c_2 = \begin{bmatrix} 3\\ 0\\ 0 \end{bmatrix} = 3c_1 \,\\ \] The above is the first dependency. The second non-pivot column \(c_5\) is the second: \[ \,\\ c_5 = \begin{bmatrix} 3\\ 9\\ -4 \end{bmatrix} = 3c_1 + 9c_3 - 4c_4 \,\\ \] From the relationship between vectors below, we can get the right expressions behind rightarrow \[ \,\\ \begin{align} \begin{aligned} 3c_1 - c_2 + 0c_3 + 0c_4 + 0c_5 &= 0 \quad \Rightarrow \quad 3x_1 - x_2 + 0x_3 + 0x_4 + 0x_5 = 0 \\ 3c_1 + 0c_2 + 9c_3 - 4c_4 - c_5 &= 0 \quad \Rightarrow \quad 3x_1 + 0x_2 + 9x_3 - 4x_4 - x_5 = 0 \end{aligned} \end{align} \,\\ \] To summarize, all solutions of \(Ax = 0, x\in \mathbb{R}^5 \) are given by \[ \,\\ \left\{ x \in \mathbb{R}^5: x = \lambda_1 \begin{bmatrix} 3\\ -1\\ 0\\ 0\\ 0 \end{bmatrix} + \lambda_2 \begin{bmatrix} 3\\ 0\\ 9\\ -4\\ -1 \end{bmatrix}, \quad \lambda_1, \lambda_2 \in \mathbb{R} \right\} \,\\ \] where the free variables \(x_2, x_5\) are \(\lambda_1\) and \(\lambda_2\) respectively.

2.3.3 The Minus-1 Trick

We introduce a practical trick for reading out the solutions \(x\), where \(A \in \mathbb{R}^{k \times n}, x \in \mathbb{R}^n \) . To start, we assume that \(A\) is in reduced row-echelon form without any rows that just contain zeros. We extend the matrix \(A\) to an \(n \times n\)-matrix \(\tilde{A}\).

Example 2.8 (Minus-1 Trick)
Let us revisit the matrix below, which is already in reduced REF: \[ \,\\ A = \begin{bmatrix} 1 & 3 & 0 & 0 & 3 \\ 0 & 0 & 1 & 0 & 9 \\ 0 & 0 & 0 & 1 & -4 \\ \end{bmatrix} \,\\ \] we now augment this matrix to a \(5 \times 5\) matrix by adding rows at the places where the pivots on the diagonal are missing and obtain \[ \,\\ \tilde{A} = \begin{bmatrix} 1 & 3 & 0 & 0 & 3 \\ \color{blue}{0} & \color{blue}{-1} & \color{blue}{0} & \color{blue}{0} & \color{blue}{0} \\ 0 & 0 & 1 & 0 & 9 \\ 0 & 0 & 0 & 1 & -4 \\ \color{blue}{0} & \color{blue}{0} & \color{blue}{0} & \color{blue}{0} & \color{blue}{-1} \end{bmatrix} \,\\ \] From this form, we can immediately read out the solutions of \(Ax = 0\), which is the same as in the Example 2.7 \[ \,\\ \left\{ x \in \mathbb{R}^5: x = \lambda_1 \begin{bmatrix} 3\\ -1\\ 0\\ 0\\ 0 \end{bmatrix} + \lambda_2 \begin{bmatrix} 3\\ 0\\ 9\\ -4\\ -1 \end{bmatrix}, \quad \lambda_1, \lambda_2 \in \mathbb{R} \right\} \,\\ \]

Calculating the Inverse. To compute the inverse \(A^{-1}\) of \(A \in \mathbb{R}^{n\times n} \), we need to find a matrix \(X\) that satisfies \(AX = I_n\). Then \(X = A^{-1} \). We can write this down as a set of simultaneous linear equations \(AX = I_n\). If we bring the augmented equation system into row-echelon form, we can read out the inverse on the right-hand side of the equation system.

Example 2.9 (Calculating an Inverse Matrix by Gaussian Elimination)
To determine the inverse of \[ \,\\ A = \begin{bmatrix} 1 & 0 & 2 & 0 \\ 1 & 1 & 0 & 0 \\ 1 & 2 & 0 & 1 \\ 1 & 1 & 1 & 1 \end{bmatrix} \,\\ \] we write down the augmented matrix \[ \,\\ \left[ \begin{array}{cccc|cccc} 1 & 0 & 2 & 0 & 1 & 0 & 0 & 0 \\ 1 & 1 & 0 & 0 & 0 & 1 & 0 & 0 \\ 1 & 2 & 0 & 1 & 0 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 & 0 & 0 & 0 & 1 \end{array} \right] \,\\ \] and use Gaussian elimination to bring it into reduced row-echelon form \[ \,\\ \begin{align} &\left[ \begin{array}{cccc|cccc} 1 & 0 & 2 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & -2 & 0 & -1 & 1 & 0 & 0 \\ 1 & 2 & 0 & 1 & 0 & 0 & 1 & 0 \\ 1 & 1 & 1 & 1 & 0 & 0 & 0 & 1 \end{array} \right] \rightarrow \left[ \begin{array}{cccc|cccc} 1 & 0 & 2 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & -2 & 0 & -1 & 1 & 0 & 0 \\ 0 & 1 & -1 & 0 & 0 & 0 & 1 & -1 \\ 1 & 1 & 1 & 1 & 0 & 0 & 0 & 1 \end{array} \right] \,\\ \,\\ \rightarrow &\left[ \begin{array}{cccc|cccc} 1 & 0 & 2 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & -2 & 0 & -1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 0 & 1 & -1 & 1 & -1 \\ 1 & 1 & 1 & 1 & 0 & 0 & 0 & 1 \end{array} \right] \rightarrow \left[ \begin{array}{cccc|cccc} 1 & 0 & 2 & 0 & 1 & 0 & 0 & 0 \\ 0 & 1 & 0 & 0 & 1 & -1 & 2 & -2 \\ 0 & 0 & 1 & 0 & 1 & -1 & 1 & -1 \\ 1 & 1 & 1 & 1 & 0 & 0 & 0 & 1 \end{array} \right] \,\\ \,\\ \rightarrow &\left[ \begin{array}{cccc|cccc} 1 & 0 & 0 & 0 & -1 & 2 & -2 & 2 \\ 0 & 1 & 0 & 0 & 1 & -1 & 2 & -2 \\ 0 & 0 & 1 & 0 & 1 & -1 & 1 & -1 \\ 1 & 1 & 1 & 1 & 0 & 0 & 0 & 1 \end{array} \right] \rightarrow \left[ \begin{array}{cccc|cccc} 1 & 0 & 0 & 0 & -1 & 2 & -2 & 2 \\ 0 & 1 & 0 & 0 & 1 & -1 & 2 & -2 \\ 0 & 0 & 1 & 0 & 1 & -1 & 1 & -1 \\ 0 & 0 & 0 & 1 & -1 & 0 & -1 & 2 \end{array} \right] \end{align} \,\\ \] such that the desired inverse is given as its right-hand side: \[ \,\\ A^{-1} = \left[ \begin{array}{cccc} -1 & 2 & -2 & 2 \\ 1 & -1 & 2 & -2 \\ 1 & -1 & 1 & -1 \\ -1 & 0 & -1 & 2 \end{array} \right] \,\\ \]

Each elementary row operation \(E\) can be represented as a left multiplication of a matrix \(A\), that is \(EA\). Performing multiple elementary row operations \(E_1, E_2, ..., E_k\) is equivalent to \[ \,\\ E_k\, E_{k-1} \cdots E_1 \, A = I \,\\ \] In other words, transforming \(A\) into the identity matrix \(I\) by using elementary transformations is equivalent to multiplying \(A\) by the matrix \(E\), where \(E\) is defined as \[ \,\\ E = E_k\, E_{k-1} \cdots E_1 \ \,\\ \] Therefore, \[ \,\\ \begin{align} EA &= I \\ E &= A^{-1} \end{align} \,\\ \]

2.3.4 Algorithms for Solving a System off Linear Equations

Gaussian elimination plays an important role when 1) computing determinants, 2) checking whether a set of vectors is linearly independent, 3) computing the inverse of a matrix, 4) computing the rank of a matrix, and 4) determining a basis of a vector space.

2.4 Vector Spaces

In the following, we will have a closer look at vector spaces, i.e., a structured space in which vectors live.

In the beginning of this chapter, we informally characterized vectors as objects that can be added together and multiplied by a scalar, and they remain objects of the same type Now, we are ready to formalize this, and we will start by introducing the concept of a group, which is a set of elements and an operation defined on these elements that keeps some structure of the set intact.

2.4.1 Groups

Definition 2.7 (Group). Consider a set \(\mathcal{G}\) and an operation \(\otimes : \mathcal{G} \times \mathcal{G} \rightarrow \mathcal{G} \) defined on \(\mathcal{G}\). Then \(G := (\mathcal{G}, \otimes)\) is called a group if the following hold:

Closure of \(\mathcal{G}\) under \(\otimes : \forall x, y \in \mathcal{G}: x \otimes y \in \mathcal{G} \)
Associativity: \(\forall x, y, z \in \mathcal{G}: (x \otimes y) \otimes z = x \otimes (y \otimes z) \)
Neutral element (Identity element): \(\exists e \in \mathcal{G} \, \forall x \in \mathcal{G}: x \otimes e = x\) and \(e \otimes x = x \)
Inverse element: \(\forall x \in \mathcal{G} \, \exists y \in \mathcal{G}: x \otimes y = e \) and \(y \otimes x = e \) where \(e\) is the neutral element (identity element). We often write \(x^{-1}\) to denote the inverse element of \(x\).

Remark. The inverse element is defined with respect to the operation \(\otimes\) and does not necessarily mean \(\frac{1}{x}\). If additionally \(\forall x, y \in \mathcal{G}: x \otimes y = y \otimes x \), then \(G = (\mathcal{G}, \otimes)\) is an Abelian group (commutative).

Example 2.10 (Groups)
Let us have a look at some examples of sets with associated operations and see whether they are groups:

\( (\mathbb{Z}, +) \) is an Abelian group.
\( (\mathbb{N}_0, +) \) is not a group, where \(\mathbb{N}_0 := \mathbb{N} \cup \{0\} \): Although \( (\mathbb{N}_0, +) \) possesses a neutral element \((0)\), the inverse elements are missing.
\( (\mathbb{Z}, \cdot) \) is not a group: Although \( (\mathbb{Z}, \cdot) \) contains a neutral element \((1)\), the inverse elements for any \(z \in \mathbb{Z} \setminus \{\pm 1\} \) are missing. In addition, \(0 \in \mathbb{Z} \) cannot have an inverse element since the equation \(0 \cdot x = 1 \) cannot be satisfied for any \(x \in \mathbb{Z}\).
\( (\mathbb{R}, \cdot) \) is not a group since \(0\) does not possess an inverse element.
\( (\mathbb{R} \setminus \{0\}, \cdot) \) is Abelian.
\( (\mathbb{R}^n, +), (\mathbb{Z}^n, + ), n \in \mathbb{N} \) are Abelian if \(+\) is defined componentwise, i.e., \[ \,\\ (x_1, \cdots, x_n) + (y_1, \cdots, y_n) = (x_1 + y_1, \cdots, x_n + y_n) \,\\ \] Then, \( (x_1, \cdots, x_n)^{-1} := (-x_1, \cdots, -x_n) \) is the inverse element and \(e = (0, \cdots, 0) \) is the neutral element. For example, if \( x = (1, 2, 3, 4, 5) \), then its inverse is \[ \,\\ x^{-1} = (-1, -2, -3, -4, -5) \,\\ \] and by adding them componentwise, we obtain the neutral element: \[ \,\\ (1-1, \,\, 2-2, \,\, 3-3, \,\, 4-4, \,\, 5-5) = (0, \,\, 0, \,\, 0, \,\, 0, \,\, 0). \,\\ \]
\( (\mathbb{R}^{m \times n}, +) \), the set of \(m \times n\)-matrices is Abelian (with componentwise addition)
\( (\mathbb{R}^{n \times n}, \cdot ) \), the set of \(n \times n\)-matrices with matrix multiplication
- Closure and associativity follow directly from the definition of matrix multiplication \[ \,\\ (A \cdot B) \cdot C = A\cdot (B \cdot C) \,\\ \]
- The identity matrix \(I_n\) is the neutral element \[ \,\\ I_n \cdot A = A \cdot I_n = A \,\\ \]
- For a regular matrix \( A \in \mathbb{R}^{n \times n} \), the inverse element \( A^{-1} \) exists. If \( A \) is singular, it does not have an inverse element.
  
  The set of all regular (invertible) matrices forms a group under matrix multiplication, denoted by \( GL(n, \mathbb{R}) = \{ A \in \mathbb{R}^{n \times n} \mid \det(A) \neq 0 \} \), which is called the General Linear Group.
  
  Therefore, \( (GL(n, \mathbb{R}), \cdot) \) is a group. However, since matrix multiplication is not commutative, the group is not Abelian.

2.4.2 Vector Spaces

Definition 2.9 (Vector Space). A real-valued vector space \(V = \mathcal{V, +, \cdot}\) is a set \(\mathcal{V} \) with two operations \[ \,\\ \begin{align} + &: \mathcal{V} \times \mathcal{V} \rightarrow \mathcal{V} \\ \cdot &: \mathbb{R} \times \mathcal{V} \rightarrow \mathcal{V} \end{align} \,\\ \] where \( (\mathcal{V}, +) \) is an abelian group. The elements \(x \in V\) are called vectors. The neutral element of \( (\mathcal{V}, +) \) is the ero vector \( [0, ..., 0]^{\top} \), and the operation \(+\) is called vector addition. The elements \(\lambda \in \mathbb{R} \) are called scalars and the operation \( \cdot \) is a multiplication by scalars.

Remark. A vector multiplication \(ab, \,\, a, b \in \mathbb{R}^n\), is not defined. By treating vectors as \(n \times 1\) matrices, we can use the matrix multiplication. Only the following multiplications for vectors are defined: \(ab^{\top} \in \mathbb{R}^{n\times n}\) (outer product), \( a^{\top}b \in \mathbb{R} \) (inner/scalar/dot product). \[ \,\\ a = \begin{bmatrix} 0\\ 1\\ 2 \end{bmatrix}, \,\, b = \begin{bmatrix} 3\\ 4\\ 5 \end{bmatrix} \,\\ \,\\ \begin{align} ab^{\top} &= \begin{bmatrix} 0\\ 1\\ 2 \end{bmatrix} \begin{bmatrix} 3&4&5 \end{bmatrix} = \begin{bmatrix} 0 & 0 & 0 \\ 3 & 4 & 5 \\ 6 & 8 & 10 \end{bmatrix} \,\\ \,\\ a^{\top}b &= \begin{bmatrix} 0&1&2 \end{bmatrix} \begin{bmatrix} 3\\ 4\\ 5 \end{bmatrix} = 14 \end{align} \,\\ \]

2.4.3 Vector Subspaces

Intuitively, they are sets contained in the original vector space with the property that when we perform vector space operations on elements within this subspace, we will never leave it. In this sense, they are "closed".

Definition 2.10 (Vector Subspace). Let \(V = (\mathcal{V}, +, \cdot) \) be a vector space and \(\mathcal{U} \subseteq \mathcal{V}, \mathcal{U} \neq \emptyset \). Then \(U = (\mathcal{U}, +, \cdot) \) is called vector subspace of \(V\) (or linear subspace) if \(U\) is a vector space with the vector space operations \(+\) and \(\cdot\) restricted to \(\mathcal{U} \times \mathcal{U} \) and \( \mathbb{R} \times \mathcal{U} \). We write \( U \subseteq V \) to denote a subspace \(U\) of \(V\).

If \(\mathcal{U} \subseteq \mathcal{V} \) and \(V\) is a vector space, then \(U\) naturally inherits the operations of \(V\), that is, \[ \,\\ \begin{align} + &: \mathcal{U} \times \mathcal{U} \rightarrow \mathcal{U} \\ \cdot &: \mathbb{R} \times \mathcal{U} \rightarrow \mathcal{U} \end{align} \,\\ \] For \(U\) to be a subspace of \(V\), the following conditions must hold:

neutral element (zero vector): \[ \mathcal{U} \neq \emptyset, \, \boldsymbol{0} \in \mathcal{U} \]
closure: \[ \begin{align} &\forall \lambda \in \mathbb{R} \, \forall x \in \mathcal{U} : \lambda x \in \mathcal{U} \,\\ &\forall x, y \in \mathcal{U} : x + y \in \mathcal{U} \end{align} \]

Example 2.12 (Vector Subspaces)

For every vector space \(V\), the trivial subspaces are \(V\) itself and \( \{\boldsymbol{0}\}\).
Only example \(D\) in the figure below is a subspace of \(\mathbb{R}^2 \). In \(A\) and \(C\), the closure property is violated; \(B\) does not contain \(\boldsymbol{0}\).

Only is a subspace — Only \(D\) is a subspace

Remark. Every subspace \(U \subseteq (\mathbb{R}^n, +, \cdot) \) is the solution space of a homogeneous system of linear equations \(Ax = 0\) for \(x \in \mathbb{R}^n\). \[ \,\\ A(\boldsymbol{0}) = \boldsymbol{0} \,\\ \,\\ A(x_1) = \boldsymbol{0}, \,\, A(x_2) = \boldsymbol{0} \,\\ \,\\ A(x_1 + x_2) = A(x_1) + A(x_2) = \boldsymbol{0} + \boldsymbol{0} = \boldsymbol{0} \,\\ \,\\ A(\lambda x_1) = \lambda A(X_1) = \lambda \cdot \boldsymbol{0} = \boldsymbol{0} \,\\ \]

2.5 Linear Independence

The closure property guarantees that we end up with another vector in the same vector space. It is possible to find a set of vectors with which we can represent every vector in the vector space by adding them together and scaling them. This set of vectors is a basis. Before we get concepts of the basis, we will need to introduce the concepts of linear combinations and linear independence.

Definition 2.11 (Linear Combination). Consider a vector space \(V\) and a finite number of vectors \(x_1, ..., x_k \in V \). Then, every \(v \in V\) of the form \[ \,\\ v = \lambda_1 x_1 + \cdots + \lambda_k x_k = \sum_{i=1}^k \lambda_i x_i \in V \,\\ \] with \(\lambda_1, ..., \lambda_k \in \mathbb{R} \) is a linear combination of the vectors \(x_1, ..., x_k \).

The \(\boldsymbol{0} \)-vector can always be written as the linear combination of \(k\) vectors \(x_1, ..., x_k \) because \(\boldsymbol{0} = \sum_{i=1}^k 0x_i \) is always true, and is trivial. \[ \,\\ \boldsymbol{0} = 0x_1 + \cdots + 0x_k \,\\ \] In the following, we are interested in non-trivial linear combinations of a set of vectors to represet \(\boldsymbol{0}\), i.e., linear combinations of vectors \(x_1, ..., x_k \), where not all coefficients \(\lambda_i\)in are \(0\).

Definition 2.12 (Linear (In)dependence). Let us consider a vector space \(V\) with \(k\in \mathbb{N}\) and \(x_1, ..., x_k \in V \). If there is a non-trivial linear combination, such that \(\boldsymbol{0} = \sum_{i=1}^k \lambda_i x_i \) with at least one \(\lambda_i \neq 0\), the vectors are linearly dependent. \[ \,\\ \begin{align} x_3 &= 2x_1 - 3x_2 \\ 0 &= 2x_1 - 3x_2 - x_3 \end{align} \,\\ \] In contrast, \(\boldsymbol{0} = \lambda_1 x_1 + \cdots + \lambda_k x_k\), only the trivial solution exists, i.e., \(\lambda_1 = \cdots = \lambda_k = 0\), the vectors \(x_1, \dots, x_k\) are said to be linearly independent. \[ \,\\ x_1 = \begin{bmatrix} 1\\ 0\\ \end{bmatrix}, \,\, x_2 = \begin{bmatrix} 0\\ 1\\ \end{bmatrix} \,\\ \,\\ \lambda_1 \begin{bmatrix} 1\\ 0\\ \end{bmatrix} + \lambda_2 \begin{bmatrix} 0\\ 1\\ \end{bmatrix} = 0 \begin{bmatrix} 1\\ 0\\ \end{bmatrix} + 0 \begin{bmatrix} 0\\ 1\\ \end{bmatrix} = \boldsymbol{0} \,\\ \] Remark. The following properties are useful to find out whether vectors are linearly independent:

\(k\) vectors are either linearly dependent or independent. There is no third option.
If at least one of the vectors \(x_1, ..., x_k \) is \(\boldsymbol{0}\) then they are linearly dependent. The same vectors exist, they are linearly dependent.
The vectors \( \{ x_1, ..., x_k : x_i \neq 0, i = 1, ..., k \}, k \ge 2 \), are linearly dependent if and only if (at least) one of them is a linear combination of the others. In particular, if one vector is a multiple of another vector, i.e., \(x_i = \lambda x_j, \lambda \in \mathbb{R} \ \) then the vectors linearly dependent.
A practical way of checking whether vectors are linearly independent is to use Gaussian elimination: Write all vectors as columns of a matrix \(A\) and perform Gaussian elimination until the matrix is in row echelon form (the reduced row-echelon form is unnecessary here).
The non-pivot columns can be expressed as linear combinations of the pivot columns. For example, the row-echelon form \[ x_1 = \begin{bmatrix} 1\\ 0\\ 0 \end{bmatrix}, \quad x_2 = \begin{bmatrix} 2\\ 1\\ 0 \end{bmatrix}, \quad x_3 = \begin{bmatrix} 3\\ 1\\ 0 \end{bmatrix} \,\\ \,\\ A = \begin{bmatrix} 1 & 2 & 3 \\ 0 & 1 & 1 \\ 0 & 0 & 0 \end{bmatrix} \] tells us that the first and the second columns are pivot columns. The third column is a non-pivot column because it can be expressed \(x_1 + x_2\).
All column vectors are linearly independent if and only if all columns are pivot columns.

Example 2.14
Consider \(\mathbb{R}^4\) with \[ \,\\ x_1 = \begin{bmatrix} 1 \\ 2 \\ -3 \\ 4 \\ \end{bmatrix}, \quad x_2 = \begin{bmatrix} 1 \\ 1 \\ 0 \\ 2 \\ \end{bmatrix}, \quad x_3 = \begin{bmatrix} -1 \\ -2 \\ 1 \\ 1 \\ \end{bmatrix}, \quad A = \begin{bmatrix} 1 & 1 & -1 \\ 2 & 1 & -2 \\ -3 & 0 & 1 \\ 4 & 2 & 1 \end{bmatrix} \,\\ \] To check whether they are linearly independent, perform elementary row operations until we identify the pivot columns: \[ \,\\ \begin{align} A &= \begin{bmatrix} 1 & 1 & -1 \\ 2 & 1 & -2 \\ -3 & 0 & 1 \\ 4 & 2 & 1 \end{bmatrix} \,\\ \,\\ & \Rightarrow \begin{bmatrix} 1 & 1 & -1 \\ 0 & 1 & 0 \\ -3 & 0 & 1 \\ 4 & 2 & 1 \end{bmatrix} \Rightarrow \begin{bmatrix} 1 & 1 & -1 \\ 0 & 1 & 0 \\ 0 & 3 & -2 \\ 0 & 2 & -5 \end{bmatrix} \Rightarrow \begin{bmatrix} 1 & 1 & -1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 5 \end{bmatrix} \,\\ \,\\ &\Rightarrow \begin{bmatrix} 1 & 1 & -1 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \end{bmatrix} \end{align} \,\\ \] Here, every column of the matrix is a pivot column. Therefore, there is no non-trivial solution, and we require \(\lambda_1 = 0, \lambda_2 = 0, \lambda_3 = 0 \) to solve the equation system. Hence, the vectors \(x_1, x_2, x_3 \) are linearly independent.

Remark. In a vector space \(V\), \(m\) linear combinations of \(k\) vectors are linearly dependent if \(m > k\).

2.6 Basis and Rank

In a vector space \(V\) , we are particularly interested in sets of vectors \(\mathcal{A}\) that possess the property that any vector \(v \in V\) can be obtained by a linear combination of vectors in \(\mathcal{A}\).

2.6.1 Generating Set and Basis

Definition 2.13 (Generating Set and Span). Consider a vector space \(V - (\mathcal{V}, +, \cdot) \) and set of vectors \(\mathcal{A} = \{ a_1, ..., a_k\} \subseteq \mathcal{V} \). If every vector \(v \in \mathcal{V} \) can be expressed as a linear combination of \(a_1, ..., a_k \), \(\mathcal{A}\) is called a generating set of \(V\). The set of all linear combinations of vectors in \(\mathcal{A} \) is called the span of \(\mathcal{A}\). If \(\mathcal{A}\) spans the vector space \(V\), we write \[ \,\\ \begin{align} V &= \text{span}[\mathcal{A}] \\ &= \text{span}[a_1, ..., a_k] \end{align} \,\\ \] Definition 2.14 (Basis). Consider a vector space \(V = (\mathcal{V}, +, \cdot) \) and \(\mathcal{A} \subseteq \mathcal{V}\). A generating set \(\mathcal{A}\) of \(V\) is called minimal if there exists no smaller set \(\tilde{\mathcal{A}} \subsetneq \mathcal{A} \subseteq \mathcal{V}\) that spans \(V\). Every linearly independent generating set of \(V\) is minimal and is called a basis of \(V\). For example, consider the following two generating sets in \(\mathbb{R}^2\). \[ \,\\ \mathcal{A}_1 = \left\{ \begin{bmatrix}1 \\ 0 \end{bmatrix}, \begin{bmatrix}0 \\ 1 \end{bmatrix} \right\}, \quad \mathcal{A}_2 = \left\{ \begin{bmatrix}1 \\ 0 \end{bmatrix}, \begin{bmatrix}0 \\ 1 \end{bmatrix}, \begin{bmatrix}1 \\ 1 \end{bmatrix} \right\} \,\\ \] However, there is an unnecessary vector \( [1, 1]^{\top} \in \mathcal{A}_2\) since \( [1, 1]^{\top} = [1, 0]^{\top} + [0, 1]^{\top} \). Therefore, even if we remove \([1, 1]^{\top}\) from \(\mathcal{A}_2\), the remaining vectors can still generate \(\mathbb{R}^2\) Hence, \(\mathcal{A}_2\) is a generating set but not a minimal, while \(\mathcal{A}_1\) is a minimal generating set, that is a basis of \(\mathbb{R}^2\).

Let \(V = (\mathcal{V}, +, \cdot) \) be a vector space and \(\mathcal{B} \subseteq \mathcal{V}, \mathcal{B} \neq \emptyset \). Then, the following statements are equivalent:

\(\mathcal{B}\) is a basis of \(V\).
\(\mathcal{B}\) is a minimal generating set.
\(\mathcal{B}\) is a maximal linearly independent set of vectors in \(V\), i.e., adding any other vector to this set will make it linearly dependent.
Every vector \(x \in V \) is a linear combination of vectors from \(\mathcal{B}\), and every linear combination is unique.

Example 2.16

In \(\mathbb{R}^n\), the standard basis is \[ \mathcal{B} = \left\{ \begin{bmatrix} 1 \\ 0 \\ \vdots \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ \vdots \\ 0 \end{bmatrix}, \ldots, \begin{bmatrix} 0 \\ 0 \\ \vdots \\ 1 \end{bmatrix} \right\} \]
in \(\mathbb{R}^3\), the standard basis is \[ \mathcal{B} = \left\{ \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \\ 1 \end{bmatrix} \right\} \]
Different bases in \(\mathbb{R}^3\) are \[ \mathcal{B}_1 = \left\{ \begin{bmatrix} 1 \\ 0 \\ 0 \\ \end{bmatrix}, \begin{bmatrix} 1 \\ 1 \\ 0 \\ \end{bmatrix}, \begin{bmatrix} 1 \\ 1 \\ 1 \\ \end{bmatrix} \right\}, \quad \mathcal{B}_2 = \left\{ \begin{bmatrix} 0.5 \\ 0.8 \\ 0.4 \\ \end{bmatrix}, \begin{bmatrix} 1.8 \\ 0.3 \\ 0.3 \\ \end{bmatrix}, \begin{bmatrix} -2.2 \\ -1.3 \\ 3.5 \\ \end{bmatrix} \right\}, \quad \]
The set \[ \mathcal{A} = \left\{ \begin{bmatrix} 1 \\ 2 \\ 3 \\ 4 \\ \end{bmatrix}, \begin{bmatrix} 2 \\ -1 \\ 0 \\ 2 \\ \end{bmatrix}, \begin{bmatrix} 1 \\ 1 \\ 0 \\ 4 \\ \end{bmatrix} \right\} \] is linearly independent, but not a generating set (and no basis) of \(\mathbb{R}^4\): For instance, the vector \([0,0,0,1]^{\top} \) cannot be obtained by a linear combination of elements in \(\mathcal{A}\).

Remark. Every vector space \(V\) possesses a basis \(\mathcal{B}\). The Example 2.16 shows that there can be many bases of a vector space \(V\), i.e., there is no unique basis. For example, the following are all bases of \(\mathbb{R}^2\): \[ \,\\ \mathcal{B}_1 = \left\{ \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \end{bmatrix} \right\}, \quad \mathcal{B}_2 = \left\{ \begin{bmatrix} 1 \\ 1 \end{bmatrix}, \begin{bmatrix} 1 \\ -1 \end{bmatrix} \right\} \,\\ \] However, all bases possess the same number of elements, the basis vectors. In other words, in \(\mathbb{R}^n\) space, every basis consists of exactly \(n\) linearly independent vectors. This means that any set of \(n\) linearly independent vectors in \(\mathbb{R}^n\) automatically forms a basis, and no smaller or larger set of vectors can do so. Therefore, all possible bases of \(\mathbb{R}^n\) contain exactly \(n\) basis vectors.

We only consider finite-dimensional vector spaces \(V\). In this case, the dimension of \(V\) is the numer of basis vectors of \(V\), and we write \(\dim(V)\). \[ \mathcal{B} = \left\{ \begin{bmatrix} 1 \\ 0 \\ 0 \\ \end{bmatrix}, \begin{bmatrix} 0 \\ 1 \\ 0 \\ \end{bmatrix}, \begin{bmatrix} 0 \\ 0 \\ 1 \\ \end{bmatrix} \right\}, \quad \dim(\mathcal{V}) = 3 \] If \(U \subseteq V\) is a subspace of \(V\), then \(\dim(U) \le \dim(V) \) and \(\dim(U) = \dim(V) \) if and only if \(U = V\). Intuitively, the dimension of a vector space can be thought of as the number of independent directions in this vector space.

Remark. A basis of a subspace \(U = \text{span}[x_1, ..., x_m] \subseteq \mathbb{R}^n\) can be found by executing the following steps:

Write the spanning vectors as columns of a matrix \(A\).
Determine the row-echelon form of \(A\).
The spanning vectors associated with the pivot columns are a basis of \(U\).

Example 2.17 (Determining a Basis)
Let \(U \subseteq \mathbb{R}^3\) be spanned by four vectors \[ \,\\ x_1 = \begin{bmatrix} 1\\ 0\\ 0\\ \end{bmatrix} , \quad x_2 = \begin{bmatrix} 2\\ 0\\ 0\\ \end{bmatrix} , \quad x_3 = \begin{bmatrix} 0\\ 1\\ 1\\ \end{bmatrix} , \quad x_4 = \begin{bmatrix} 0\\ 2\\ 2\\ \end{bmatrix} \,\\ \] we are interested in finding out wich vectors can form a basis for \(U\). For this, we need to check whether \(x_1, ..., x_4\) are linearly independent. \[ \,\\ [x_1, x_2, x_3, x_4] = \begin{bmatrix} 1 & 2 & 0 & 0\\ 0 & 0 & 1 & 2\\ 0 & 0 & 1 & 2\\ \end{bmatrix} \,\, \xrightarrow{\text{RREF}} \,\, \begin{bmatrix} 1 & 2 & 0 & 0 \\ 0 & 0 & 1 & 2 \\ 0 & 0 & 0 & 0 \\ \end{bmatrix} \,\\ \] There are two pivot columns (\(x_1, x_3 \)). Since the pivot columns indicate which set of vectors is linearly independent, we see form the row-echelon form that \(x_1, x_3\) are linearly independent (because the system of linear equations \(\lambda_1 x_1 + \lambda_3 x_3 = 0\) can be only solved with \(\lambda_1 = \lambda_3 = 0\)). Therefore \(\{ x_1, x_3\} \) is a basis of \(U\).

2.6.2 Rank

The number of linearly independent columns of a matrix \(A \in \mathbb{R}^{n \times n} \) equals to the number of linearly independent rows. It is called the rank of \(A\) and is denoted by \[ \,\\ \text{rk}(A) \,\\ \] Remark. The rank of a matrix has some important properties:

\(\text{rk}(A) = \text{rk}(A^{\top}) \), i.e., the column rank equals to the row rank.
The columns of \(A \in \mathbb{R}^{m \times n} \) span a subspace \(U \subseteq \mathbb{R}^{m} \) with \(\dim(U) =\text{rk}(A)\). Later we will call this subspace the image or range. A basis of \(U\) can be found by applying Gaussian elimination to \(A\) to identify the pivot columns.
The rows of \(A\in \mathbb{R}^{m\times n} \) span subspace \(W \subseteq \mathbb{R}^n \) with \(\dim(W) = \text{rk}(A) \). A basis of \(W\) can be found by applying Gaussian elimination to \(A^{\top}\).
A matrix \(A \in \mathbb{R}^{m \times n} \) has full rank if its rank equals the largeest possible rank for a matrix of the same dimensions. \[ \text{rk}(A) = \min (m, n) \] Otherwise \(A\) is rank deficient. Let \(B \in \mathbb{R}^{4\times3} \) \[ B = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 1 & 1 & 1 \\ \end{bmatrix} \,\, \xrightarrow{R4-R3-R2-R1} \,\, \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \\ 0 & 0 & 0 \\ \end{bmatrix} \] The rank of matrix \(B\) is \(\text{rk}(B) = 3\), \(\text{rk}(B) = 3 = \min(4, 3) \). Thus \(B\) is full rank.

2.7 Linear Mappings

We will study mappings on vector spaces that preserve their structure, which will allow us to define the concept of a coordinate. In the beginning of the chapter, we said that vectors are objects that can be added together and multiplied by a scalar, and the resulting object is still a vector. We wish to preserve this property when applying the mapping: \[ \,\\ \begin{align} \begin{aligned} \Phi (x + y) &= \Phi(x) + \Phi(y) \quad &\text{Additivity} \\ \Phi(\lambda x) &= \lambda \Phi(x) \quad &\text{Homogeneity} \end{aligned} \end{align} \,\\ \] The mapping \(\Phi\) is linear if both conditions are satisfied.

Definition 2.15 (Linear Mapping). For vector spaces \(V, W\), a mapping \(\Phi : V \rightarrow W \) is called a linear mapping (or vector space homomorphism / linear transformation) if \[ \,\\ \forall x, y \in V \, \forall \lambda, \psi \in \mathbb{R} : \Phi(\lambda x + \psi y) = \lambda (\Phi(x)) + \psi (\Phi(y)) \,\\ \] We can represent linear mappings as matrices.

Definition 2.16 (Injective, Surjective, Bijective). Consider a mapping \(\Phi : \mathcal{V} \rightarrow \mathcal{W} \), where \(\mathcal{V}, \mathcal{W}\) can be arbitarary sets. Then \(\Phi\) is called

Injective if \(\forall x, y \in \mathcal{V} : \Phi(x) = \Phi(y) \Rightarrow x = y \).
Surjective if \(\Phi(\mathcal{V}) = \mathcal{W} \).
Bijective if it is injective and surjective.

A is bijective \(\Phi\) can be undone, i.e. there exists a mapping \(\Psi : \mathcal{W} \rightarrow \mathcal{V} \) so that \(\Psi \circ \Phi(x) = x \). This mapping \(\Psi \) is then called the inverse of \(\Phi\) and normally denoted by \(\Phi^{-1}\).

Isomorphism: \(\Phi : V \rightarrow W\) is called an isomorphism if \(\Phi\) is linear and bijective.
Endomorphism: \(\Phi : V \rightarrow V\) is called an endomorphism if \(\Phi\) is linear.
Automorphism: \(\Phi : V \rightarrow V\) is called an automorphism if \(\Phi\) is linear and bijective.
We define \(\text{id}_V: V \rightarrow V, x \mapsto x \) as the identity mapping or identity automorphism in \(V\).

Theorem 2.17. Finite-dimensional vector spaces \(V\) and \(W\) are isomorphic if and only if \(\dim(V) = \dim(W)\). Intuitively, this means that vector spaces of the same dimension are kind of the same thing, as they can be transformed into each other without incurring any loss.

Remark. Consider vector spaces \(V, W, X\). Then:

For linear mappings \(\Phi : V \rightarrow W \) and \(\Psi : W \rightarrow X \), the mapping \(\Psi \circ \Phi : V \rightarrow W \) is also linear.
If \(\Phi : V \rightarrow W\) is an isomorphism, then \(\Phi^{-1} : W \rightarrow V \) is an isomorphism too.

2.7.1 Matrix Representation of Linear Mappings

Any \(n\)-dimensional vector space is isomorphic to \(\mathbb{R}^n\) (Therorem 2.17). In the following, the order of the basis vectors will be important. Therefore, we write \[ \,\\ B = (b_1, ..., b_n) \,\\ \] and call ordered basis of \(V\).

Definition 2.18 (Coordinates). Consider a vector space \(V\) and an ordered basis \(B = (b_1, ..., b_n) \) of \(V\). For any \(x \in V\), we obtain a unique representation (linear combination) \[ \,\\ x = \alpha_1 b_1 + \cdots + \alpha_n b_n \,\\ \] of \(x\) with respect to \(B\). Then \(\alpha_1, ..., \alpha_n\) are the coordinates of \(x\) with respect to \(B\), and the vector \[ \,\\ \boldsymbol{\alpha} = \begin{bmatrix} \alpha_1 \\ \vdots \\ \alpha_n \end{bmatrix} \in \mathbb{R}^n \,\\ \] is the coordinate vector (coordinate representation) of \(x\) with respect to the ordered basis \(B\). A basis effectively defines a coordinate system. We are familiar with the Cartesian coordinate system in two dimensions, which is spanned by the standard basis vectors \(e_1\), \(e_2\). In this coordinate system, a vector \(x \in \mathbb{R}^2\) has a representation that tells us how to linearly combine \(e_1\) and \(e_2\) to obtain \(x\). However, any basis of \(\mathbb{R}^2\) defines a valid coordinate system, and the same vector \(x\) from before may have a different coordinate representation in the \((b_1, b_2)\) basis.

Two different coordinate systems defined by two sets of basis vectors

In the figure above, the coordinates of \(x\) with respect to the standard basis \(e_1, e_2\) is \([2, 2]^{\top}\). However, with respect to the basis \(b_1, b_2\) the same vector \(x\) is represented as \([1.09, 0.72]^{\top}\), i.e., \(x = 1.09b_1 + 0.72b_2\).

Example 2.20
Consider the vector \(x \in \mathbb{R}^2\) with coordinates \( [2, 3]^{\top} \) with respect to the standard basis \((e_1, e_2)\) of \(\mathbb{R}^2\). This standard basis allows us to write \(\color{blue}{x = 2e_1 + 3e_2}\). However, we do not have to choose the standard basis to represent the vector \(x\).

Different coordinate representations of a vector , depending on the choice of basis — Different coordinate representations of a vector \(x\), depending on the choice of basis

If we use \( b_1 = [1, -1]^{\top},\ b_2 = [1, 1]^{\top} \) as the basis, the vector \(x\) can be represented as \(\color{orange}{x = -\frac{1}{2} b_1 + \frac{5}{2} b_2}\).

Definition 2.19 (Transformation Matrix). Consider vector spaces \(V, W\) with corresponding ordered bases \(B = (b_1, ..., b_n) \) and \(C = (c_1, ..., c_m)\). Moreover, we consider a linear mapping \(\Phi : V \rightarrow W \). For \(j\in \{1,...,n\}\) \[ \,\\ \Phi(b_j) = \alpha_{1j} c_1 + \cdots + \alpha_{mj} c_m = \sum_{i=1}^m \alpha{ij}c_i \,\\ \] is the unique representation of \(\Phi(b_j)\) with respect to \(C\). where, \(\alpha_{ij}\) is defined as \[ \,\\ A_{\Phi}(i, j) = \alpha_{ij} \,\\ \] For example, consider the bases of \(V\) and \(W\) as follows: \[ \,\\ \begin{align} B = \begin{bmatrix} 1 & 0\\ 0 & 1\\ \end{bmatrix} \quad &\Rightarrow \quad b_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \,\, b_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \,\\ \,\\ C = \begin{bmatrix} 1 & 2 \\ 3 & 0 \\ 0 & -1 \\ \end{bmatrix} \quad &\Rightarrow \quad c_1 = \begin{bmatrix} 1 \\ 3 \\ 0 \end{bmatrix}, \,\, c_2 = \begin{bmatrix} 2 \\ 0 \\ -1 \end{bmatrix} \end{align} \,\\ \] Here, the unique representation of \(\Phi(b_1)\) is \[ \,\\ \begin{align} \Phi(b_1) &= \begin{bmatrix} 1 & 2\\ 3 & 0\\ 0 & -1 \end{bmatrix} \begin{bmatrix} 1\\ 0 \end{bmatrix} = \begin{bmatrix} 1\\ 3\\ 0 \end{bmatrix} \,\\ \,\\ &= 1 \cdot c_1 + 0 \cdot c_2 = c_1 \end{align} \,\\ \] The coordinates of \(\Phi(b_j)\) with respect to the ordered basis \(C\) of \(W\) are the \(j\)-th column of \(A_{\Phi}\). Consider vector spaces \(V, W\) with ordered bases \(B, C\) and a linear mapping \(\Phi: V \rightarrow W\) with transformation matrix \(A_{\Phi}\). If \(\hat{x}\) is the coordinate vector of \(x\in V\) with respect to \(B\) and \(\hat{y}\) the coordinate vector of \(y = \Phi(x) \in W \) with respect to \(C\), then \[ \,\\ \hat{y} = A_{\Phi}\hat{x} \,\\ \] This means that the transformation matrix can be used to map coordinates with respect to an ordered basis in \(V\) to coordinates with respect to an ordered basis in \(W\).

Example 2.22 (Linear Transformations of Vectors)
We consider three linear transformations of a set of vectors in \(\mathbb{R}^2\) with \[ \,\\ A_1 = \begin{bmatrix} \cos(\frac{\pi}{4}) & -\sin(\frac{\pi}{4}) \\ \sin(\frac{\pi}{4}) & \cos(\frac{\pi}{4}) \end{bmatrix}, \quad A_2 = \begin{bmatrix} 2 & 0 \\ 0 & 1 \end{bmatrix}, \quad A_3 = \frac{1}{2} \begin{bmatrix} 3 & -1 \\ 1 & -1 \end{bmatrix} \,\\ \] The figure below shows three examples of linear transformations applied to a set of vectors. In the figure, the original data on the far left shows the vectors in \(\mathbb{R}^2\). The vectors are arranged in a square, with each dot corresponding to \((x_1, x_2)\)-coordinates. When we use matrix \(A_1\) to apply linear transformations to the original vectors, we obtain the second vectors, which is rotated by \(45^{\small{\circ}}\). If we apply \(A_2\), the data will be stretched along the x-axis by a factor of \(2\).

From the left, Original data · Rotation by 45° · Stretch along the horizontal axis · General linear mapping — From the left,
Original data · Rotation by 45° · Stretch along the horizontal axis · General linear mapping

2.7.2 Basis Change

Remark. In the following, we will have a closer look at how transformation matrices of a linear mapping \(\Phi: V \rightarrow W \) change if we change the bases in \(V\) and \(W\). We effectively get different coordinate representations of the identity mapping \(\text{id}_{V}\). In the context of Figure 4, this would mean to map coordinates with respect to \((e_1, e_2)\) onto coordinates with respect to \((b1, b2)\) without changing the vector \(x\). By changing the basis and correspondingly the representation of vectors, the transformation matrix with respect to this new basis can have a particularly simple form that allows for straightforward computation.

Consider a vector \(x\) with coordinates \((2, 3)\) in the standard basis of \(\mathbb{R}^2\). In this basis, the vector \(x\) is represented as \(2 e_1 + 3 e_2\). If we choose the basis \(a = [1, 0]^{\top} = c_1\) and \(b = [1, 3]^{\top} = c_2\), \(C = \{[c_1, c_2]\}\) then \(x\) can be represented as \(c_1 + c_2\).

Basis Change . Different coordinate representations — Basis Change.
Different coordinate representations \[ \,\\ \left[ x \right]_E = \begin{bmatrix} 2 \\ 3 \end{bmatrix}, \quad \left[ x \right]_C = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \,\\ \]

Remark (Basis Change Transformation Matrix). A basis change transformation matrix is a matrix that converts the coordinate representation of a vector \(x\) from one basis \(B = (b_1, b_2)\) to another basis \(\hat{B} = (\hat{b_1}, \hat{b_2})\). In general the vector \(x\) can be written as \[ \,\\ x = b_1 x + b_2 x = \hat{b_1} \hat{x} + \hat{b_2} \hat{x} \,\\ \] If each vector of \(B\) is expressed in terms of \(\hat{B}\) as \[ \,\\ b_1 = p \hat{b_1} + r \hat{b_2} \\ b_2 = q \hat{b_1} + s \hat{b_2} \,\\ \] then \[ \,\\ [x]_{\hat{B}} = \begin{bmatrix} p & q \\ r & s \end{bmatrix}^{-1} [x]_{B} \,\\ \] The matrix \[ \,\\ [P]^{B}_{\hat{B}} = \begin{bmatrix} p & q \\ r & s \end{bmatrix}^{-1} \,\\ \] is called the basis change transformation matrix from \(B\) to \(\hat{B}\). For example, consider the coordinates \( (x, y) \) in the standard basis \(E\) of \(\mathbb{R}^2\), and the coordinates \(\hat{x}, \hat{y}\) in the basis \(C = \{ [1, 0]^{\top}, [1, 3]^{\top} \}\). \[ \,\\ \begin{align} x \begin{bmatrix} 1\\ 0 \end{bmatrix} + y \begin{bmatrix} 0\\ 1 \end{bmatrix} &= \hat{x} \begin{bmatrix} 1\\ 0 \end{bmatrix} + \hat{y} \begin{bmatrix} 1\\ 3 \end{bmatrix} \,\\ \,\\ \Rightarrow \begin{bmatrix} 1 & 0\\ 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} &= \begin{bmatrix} 1 & 1 \\ 0 & 3 \end{bmatrix} \begin{bmatrix} \hat{x} \\ \hat{y} \end{bmatrix} \end{align} \,\\ \] Therefore, \[ \,\\ \begin{align} \begin{bmatrix} \hat{x} \\ \hat{y} \end{bmatrix} &= \begin{bmatrix} 1 & 0 \\ 0 & 1 \\ \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 0 & 3 \\ \end{bmatrix}^{-1} \begin{bmatrix} x \\ y \end{bmatrix} \,\\ \,\\ &= \frac{1}{3} \begin{bmatrix} 3 & -1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} x \\ y \end{bmatrix} \end{align} \,\\ \] Here, \( \frac{1}{3}\begin{bmatrix} 3 & -1 \\ 0 & 1 \end{bmatrix} \) is the basis change transformation matrix, denoted by \( [P]_E^C \). In other words, it is the matrix that maps the coordinates in \(E\) to those in \(C\). To express the vector \((2, 3)\) in the standard basis \(E\) (Fig. 7) as coordinates relative to the basis \(C\), we can multiply it by \([P]_E^C\) as obtained above to get its representation in basis \(B\): \[ \,\\ \begin{align} [P]_E^C \begin{bmatrix} 2 \\ 3 \end{bmatrix} &= \frac{1}{3} \begin{bmatrix} 3 & -1 \\ 0 & 1 \end{bmatrix} \begin{bmatrix} 2 \\ 3 \end{bmatrix} \,\\ \,\\ &= \begin{bmatrix} 1 \\ 1 \end{bmatrix} \end{align} \,\\ \]

2.7.3 Image and Kernel

The image and kernel of a linear mapping are vector subspaces with certain important properties.

Definition 2.23 (Image and Kernel).

For \(\Phi : V \rightarrow W\), we define the kernel/nullspace \[ \,\\ \ker(\Phi) := \Phi^{-1}(0w) = \{ v \in V: \Phi(v) = 0w \} \,\\ \] and the image/range \[ \,\\ \text{Im}(\Phi) := \Phi(V) = \{ w \in W | \exists v \in V: \Phi(v) = w \} \,\\ \] Intuitively, the kernel is the set of vectors \(v \in V\) that \(\Phi\) maps onto the neutral element \(\boldsymbol{0}_W \in W\). The image is the set of vectors \(w \in W\) that can be reached by \(\Phi\) from any vector in \(V\).

Remark. Consider a linear mapping \(\Phi : V \rightarrow W \), where \(V, W\) are vector spaces:

It always holds that \(\Phi(\boldsymbol{0}_V) = \boldsymbol{0}_W \) and, therefore, \(\boldsymbol{0}_V \in \ker (\Phi) \). In particular, the null space is never empty.
\(\text{Im}(\Phi) \subseteq W\) is a subspace of \(W\), and \(\ker(\Phi) \subseteq V\) is a subspace of \(V\).
\(\Phi\) is injective (one to one) if and only if \(\ker(\Phi) = \{\boldsymbol{0}\}\).
\(\text{rk}(A) = \dim(\text{Im}(\Phi)) \)
The kernel/null space \(\ker(\Phi)\) is the general solution to the homogeneous system of inear equations \(\boldsymbol{A}x = \boldsymbol{0}\)
\(\dim(\ker(\Phi)) + \dim(\text{Im})(\Phi) = \dim(V) \)
If \(\dim(V) = \dim(W)\), the the three-way equivalence \[ \Phi \text{ is injective} \Longleftrightarrow \Phi \text{ is surjective} \Longleftrightarrow \Phi \text{ is bijective} \]

3. Analytic Geometry

In this chapter, we will look at geometric vectors and compute their lengths and distances or angles between two vectors. Inner products and their corresponding norms and metrics capture the intuitive notions of similarity and distances.

3.1 Norms

Definition 3.1 (Norm). A norm on a vector space \(V\) is a function \[ \,\\ \begin{align} \| \cdot \| : V &\rightarrow \mathbb{R}, \\ x &\mapsto \| x \| \end{align} \,\\ \] which assigns each vector \(x\) its length \(\| x \| \in \mathbb{R} \), such that for all \(\lambda \in \mathbb{R}\) and \(x, y \in V\) the following hold:

Absolutely homogeneous: \(\| \lambda x \| = |\lambda| \| x \| \)
Triangle inequality: \( \| x + y \| \leq \| x \| + \| y \| \)
Positive definite: \( \| x \| \ge 0 \) and \(\| x \| = 0 \Longleftrightarrow x = \boldsymbol{0} \)

For any triangle, the sum of the lengths of any two sides must be greater than or equal to the length of the remaining side.

Example 3.1 (Manhattan Norm)
The Manhattan norm on \(\mathbb{R}^n\) is defined for \(x \in \mathbb{R}^n \) as \[ \,\\ \| x \|_1 := \sum_{i=1}^n | x_i | \,\\ \] where \( | \cdot | \) is the absolute value, \(x_i\) is the \(i^{th} \) element of the vector \(x\). The left side of Fig.9 below shows all vectors \(x \in \mathbb{R}^2\) with \(\|x \|_1 = 1\). The Manhattan norm is also called L1 norm.

Example 3.2 (Euclidean Norm)
The Euclidean norm of \(x \in \mathbb{R}^n \) is defined as \[ \,\\ \| x \|_2 := \sqrt{ \sum_{i=1}^n x_i^2 } = \sqrt{x^{\top}x} \,\\ \] and computes the Euclidean distance of \(x\) from the origin. The right side of Fig.9 shows all vectors \(x \in \mathbb{R}^2 \) with \( \|x\|_2 = 1 \). The Euclidean norm is also called L2 norm.

From the left, Manhattan norm · Euclidean norm The red lines indiceate the set of vectors with norm 1 — From the left, Manhattan norm · Euclidean norm
The red lines indiceate the set of vectors with norm 1

3.2 Inner Products

A major purpose of inner products is to determine whether vectors are orthogonal to each other.

3.2.1 Dot Product

We may already be familiar with a particular type of inner product, the scalar product/dot product in \(\mathbb{R}^n\), which is given by \[ \,\\ \boldsymbol{x}^{\top}\boldsymbol{y} = \sum_{i=1}^nx_i y_i \,\\ \]

3.2.2 General Inner Products

Definition 3.2. Let \(V\) be a vector space and \(\Omega : V \times V \rightarrow \mathbb{R} \) be a mapping that takes two vectors and maps them onto a real number. Then

\(\Omega\) is called symmetric if \(\Omega(x, y) = \Omega(y, x), \forall x, y \in V \).
\(\Omega\) is called positive definite if \[ \forall x \in V \setminus \{\boldsymbol{0}\} : \Omega(x, x) > 0, \quad \Omega(\boldsymbol{0}, \boldsymbol{0}) = 0. \]

3.2.3 Symmetric, Positive Definite Matrices

Consider an \(n\)-dimensional vector space \(V\) with an inner product \( \langle \cdot, \cdot \rangle : V \times V \rightarrow \mathbb{R} \) and an ordered basis \(B = b_1, \cdots, b_n \) of \(V\). Any vectors \(x, y \in V \) can be written as linear combinations of the basis vectors so that \(x = \sum_{i=1}^n \psi_i b_i \in V \) and \( y = \sum_{j=1}^n \lambda_j b_j \in V \) for suitable \(\psi_i, \lambda_j \in \mathbb{R} \). Due to the bilinearity of the inner product, it holds for all \(x, y \in V\) that \[ \,\\ \langle x, y \rangle = \left\langle \sum_{i=1}^n \psi_i b_i,\,\, \sum_{j=1}^n \lambda_j b_j \right\rangle = \sum_{i=1}^n \sum_{j=1}^n \psi_i \langle b_i, b_j \rangle \lambda_j \,\\ \] The mid expression expanded by bilinearity as \[ \,\\ \left\langle \sum_{i=1}^n \psi_i b_i,\,\, \sum_{j=1}^n \lambda_j b_j \right\rangle = \sum_{i=1}^n \psi_i \left\langle b_i, \sum_{j=1}^n \lambda_j b_j \right\rangle \,\\ \,\\ \left\langle b_i, \sum_{j=1}^n \lambda_j b_j \right\rangle = \sum_{j=1}^n \lambda_j \langle b_i, b_j \rangle \,\\ \,\\ \left\langle \sum_{i=1}^n \psi_i b_i,\,\, \sum_{j=1}^n \lambda_j b_j \right\rangle = \sum_{i=1}^n \sum_{j=1}^n \psi_i \langle b_i, b_j \rangle \lambda_j \,\\ \] In the matrix form, where \(\hat{x} = (\psi_1, \cdots, \psi_n)^\top\), \(\hat{y} = (\lambda_1, \cdots, \lambda_n)^\top\), \[ \,\\ \langle x, y \rangle = \hat{x}^{\top}A\hat{y} = \sum_{i, j} \psi_i A_{ij} \lambda_j \,\\ \] where, \(\hat{x}, \hat{y}\) are the coordinates of \(x\) and \(y\) with respect to the basis \(B\). \(A_{ij}\) is \(\langle b_i, b_j \rangle \): \[ \,\\ A = \begin{bmatrix} \langle b_1, b_1 \rangle & \langle b_1, b_2 \rangle & \cdots \\ \langle b_2, b_1 \rangle & \langle b_2, b_2 \rangle & \cdots \\ \vdots & \vdots & \ddots \end{bmatrix} \,\\ \] The inner product \( \langle \cdot, \cdot \rangle \) is uniquely determined through \(A\). The symmetry of the inner product also means that \(A\) is symmetric. Furthermore, the positive definiteness of the inner product implies that \[ \,\\ \forall x \in V \setminus \{ 0 \} : x^\top Ax > 0 \,\\ \]
Definition 3.4 (Symmetric, Positive Definite Matrix). A symmetric matrix \(A \in \mathbb{R}^{n \times n} \) that satisfies the positive definiteness above is called symmetric, positive definite, or just positive definite. If \(x^\top Ax \ge 0\), then \(A\) is called symmetric, positive semidefinite.

Example 3.4 (Geometric Intuition of Symmetric, Positive Definite Matrices)
Consider the matrices, and \(x\) \[ \,\\ A = \begin{bmatrix} 3 & -1 \\ -1 & 2 \end{bmatrix} , \quad B = \begin{bmatrix} -3 & -1 \\ -1 & 4 \end{bmatrix} , \quad x = \begin{bmatrix} 1 \\ 1 \end{bmatrix} \,\\ \] Results of the \(x^\top Ax\) and \(x^\top Bx\) are as follows respectively \[ \,\\ x^\top Ax = \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} 3 & -1 \\ -1 & 2 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = 3 \,\\ \,\\ x^\top Bx = \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} 3 & -1 \\ -1 & 4 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = -1 \,\\ \] In a geometric perspective, positive definite means that the angle between any vector \(x\) and the vector transformed by the matrix (here, \(Ax\) or \(Bx\)) remains between -90 and 90 degrees. The figure below intuitively illustrates the concept of positive definite.

Geometric intuition of Positive Definite

The following properties hold if \(A\in\mathbb{R}^{n \times n}\) is symmetric and positive definite:

The null space (kernel) of \(A\) consists only of \(\boldsymbol{0}\) because \(x^\top Ax > 0, \) for all \(x \neq \boldsymbol{0}\). Therefore \[ \ker(A) = \{\boldsymbol{0}\} \]

3.3 Lengths and Distances

Remark (Cauchy-Schwarz Inequality). For an inner product vector space \((V, \langle \cdot, \cdot \rangle) \) the induced norm \(\| \cdot \|\) satisfies the Cauchy-Schwarz inequality \[ \,\\ | \langle x, y \rangle |^2 \le \langle x, x \rangle \langle y, y \rangle \Longleftrightarrow | \langle x, y \rangle | \le \sqrt{ \langle x, x \rangle \langle y, y \rangle } = \|x\| \|y\| \,\\ \] Consider two vectors \(x, y \in \mathbb{R}^2 \) \[ \,\\ x = [1, 2], \quad y = [3, 4]. \,\\ \] First, compute the inner product: \[ \,\\ \begin{align} \langle x, y \rangle &= 1 \times 3 + 2 \times 4 \\ &= 11 \end{align} \,\\ \] Then, compute the norms: \[ \,\\ \begin{align} \|x\| = \sqrt{1^2 + 2^2} &= \sqrt{5} \approx 2.23606797749979 \\ \|y\| = \sqrt{3^2 + 4^2} &= 5 \\ \end{align} \,\\ \] Now compare both sides of the Cauchy-Schwarz inequality: \[ \,\\ | \langle x, y \rangle | = 11, \quad \|x\| \|y\| = 5\sqrt{5} \approx 11.180339887498949 \,\\ \] Therefore, \[ \,\\ | \langle x, y \rangle | = 11 \le 11.180339887498949 = \|x\| \|y\| \,\\ \]

Definition 3.6 (Distance and Metric). Consider an inner product space \( (V, \langle \cdot, \cdot \rangle) \). Then \[ \,\\ d(x, y) := \|x - y\| = \sqrt{ \langle x-y , x-y \rangle } \,\\ \] is called the distance between \(x\) and \(y\) for \(x, y \in V\). If we use the dot product as the inner product, then the distance is called Euclidean distance.

The mapping \(d\) is called a metric \[ \,\\ \begin{align} d : V \times V &\rightarrow \mathbb{R} \\ (x, y) &\mapsto d(x, y) \end{align} \,\\ \] A metric \(d\) satisfies the following:

\(d\) is positive definite, i.e., \(d(x, y) \ge 0 \) for all \(x, y \in V\) and \(d(x, y) = 0 \Longleftrightarrow x = y \).
\(d\) is symmetric, i.e., \(d(x, y) = d(y, x) \) for all \(x, y \in V\).
Triangle inequality: \(d(x,z) \le d(x,y) + d(y,z) \) for all \(x, y, z \in V\).

Remark. We observe that \(\langle x, y \rangle\ \) and \(d(x, y) \) behave in opposite directions. Very similar \(x\) and \(y\) will result in a large value for the inner product and a small value for the metric.

3.4 Angles and Orthogonality

We use the Cauchy-Schwarz inequality to define angles \(\theta\) in inner product spaces between two vectors \(x, y\). Assume that \(x \neq 0, y \neq 0\). First, divide both sides of the Cauchy-Schwarz inequality by \( \|x\| \|y\| \). Two vectors are not zero vectors. therefore, \(\|x\| \ge 0, \|y\| \ge 0 \), and the direction of inequality are not changed. \[ \,\\ |\langle x, y \rangle| \le \|x\| \|y\| \Longleftrightarrow \frac{ | \langle x, y \rangle | }{ \|x\|\|y\| } \le 1 \,\\ \] Based on the definition of absolute value \( |a| \le 1 \Longleftrightarrow -1 \le a \le 1 \), we get \[ \,\\ -1 \le \frac{ \langle x, y \rangle }{ \|x\| \|y\| } \le 1 \,\\ \] Therefore, there exists a unique \(\theta \in [0, \pi] \). The number \(\theta\) is the angle between the two vectors \(x, y\). \[ \,\\ \cos \theta = \frac{ \langle x, y \rangle }{ \|x\| \|y\| } \,\\ \] Intuitively, the angle between two vectors tells us how similar their orientations are. Based on the above, the inner product can be rewritten as \[ \,\\ \langle x, y \rangle = \|x\| \|y\| \cos \theta \,\\ \]

Example 3.6 (Angle between Vectors)
Let us compute the angle between \(x = [1, 1]^\top \in \mathbb{R}^2 \) and \( y = [-1, 1]^\top \in \mathbb{R}^2 \). We use the dot product as the inner product. Then we get \[ \,\\ \begin{align} \cos \theta &= \frac{ \langle x, y \rangle }{ \|x\| \|y\| } \\ &= \frac{ 1 \times (-1) + 1 \times 1 }{ \sqrt{1^2 + 1^2} \times \sqrt{(-1)^2 + 1^2} } \\ &= \frac{0}{2} \\ &= 0 \end{align} \,\\ \] and the angle between the two vectors is \(\arccos (0) = \frac{\pi}{2} = 90^{\small{\circ}}\)

A key feature of the inner product is that it allows us to characterize vectors that are orthogonal.

Definition 3.7 (Orthogonality). Two vectors \(x\) and \(y\) are orthogonoal if and only if \( \langle x, y, \rangle = 0\) and we write \(x \perp y \). If \( \| x \| = \| y \| = 1 \), i.e., the vectors are unit vectors, then \(x\) and \(y\) are orthonormal. An implication of this definition is that the \(\boldsymbol{0}\)-vector is orthogonal to every vector in the vector space.

Remark. Geometrically we can think of orthogonal vectors as having a right angle with respect to a specific inner product.

Definition 3.8 (Orthogonal Matrix). A square matrix \(A \in \mathbb{R}^{n \times n} \) is an orthogonal matrix if and only if its columns are orthonormal so that \[ \,\\ AA^\top = I = A^\top A \,\\ \] which implies that \[ \,\\ A^\top = A^{-1} \,\\ \] i.e., the inverse is obtained by simply transposing the matrix.

Transformations by orthogonal matrices are special because the length of a vector \(x\) is not changed when transforming it using an orthogonal matrix \(A\). Moreover, the angle between any two vectors \(x, y\) is also unchanged when transforming both of them using an orthogonal matrix \(A\). Assuming the dot product as the inner product, the angle between \(Ax\) and \(Ay\) is given as \[ \,\\ \begin{align} \cos \theta &= \frac{ ( Ax )^\top ( Ay ) }{ \| Ax \| \| Ay \| } \,\\ \,\\ &= \frac{ x^\top A^\top Ay }{ \sqrt{ (Ax)^\top Ax (Ay)^\top Ay } } \,\\ \,\\ &= \frac{ x^\top A^\top Ay }{ \sqrt{ (x^\top A^\top A x) (y^\top A^\top A y) }} \,\\ \,\\ &= \frac{ x^\top y }{ \sqrt{ (x^\top x) (y^\top y) } } \,\\ \,\\ &= \frac{ x^\top y }{ \|x\| \|y\| } \end{align} \,\\ \] which gives exactly the angle between \(x\) and \(y\). This means that ortohonal matrices \(A\) with \( A^\top = A^{-1} \) preserve both angles and distances. It turns out that orthogonal matrices define transformations that are rotations.

3.5 Orthonormal Basis

We will discuss the special case where the basis vectors are orthogonal to each other and where the length of each basis vector is 1. We will call this basis then an orthonormal basis.

Definition 3.9 (Orthonormal Basis). Consider an \(n\)-dimensional vector space \(V\) and a basis \( \{ \boldsymbol{b}_1, \cdots, \boldsymbol{b}_n \} \) of \(V\). \[ \,\\ \begin{align} \langle \boldsymbol{b}_i, \boldsymbol{b}_j \rangle &= 0 \quad \text{for} \,\, i \neq j \\ \langle \boldsymbol{b}_i, \boldsymbol{b}_i \rangle &= 1 \end{align} \,\\ \] for all \(i, j = 1, \cdots, n\) then the basis is called an orthonormal basis (ONB).

Exampe 3.8 (Orthonormal Basis)
The standard basis for a Euclidean vector space \( \mathbb{R}^n \) is an orthonormal basis, where the inner product is the dot product of vectors. In \(\mathbb{R}^2\), consider the orthogonal vectors \(b_1, b_2\) \[ \,\\ b_1 = [1, 1]^\top, \quad b_2 = [1, -1]^\top \,\\ \] To make \(b_1\) and \(b_2\) orthonormal, divide \(b_1\) by \( \| b_1 \| \), and divdie \( b_2 \) by \( \|b_1\| \). \[ \,\\ \begin{align} \tilde{b}_1 &= \frac{1}{\| b_1 \|} [1, 1]^\top = \frac{1}{\sqrt{2}} [1, 1]^\top \,\\ \,\\ \tilde{b}_2 &= \frac{1}{\| b_2 \|} [1, -1]^\top = \frac{1}{\sqrt{2}} [1, -1]^\top \end{align} \,\\ \] Then, compute \( \tilde{b}_1^\top \tilde{b}_2 \) and \( \| \tilde{b}_1 \|, \| \tilde{b}_2 \| \) \[ \,\\ \tilde{b}_1^\top \tilde{b}_2 = \begin{bmatrix} \frac{1}{\sqrt{2}} & \frac{1}{\sqrt{2}} \end{bmatrix} \begin{bmatrix} \frac{1}{\sqrt{2}} \\ -\frac{1}{\sqrt{2}} \end{bmatrix} = \frac{1}{2} - \frac{1}{2} = 0 \,\\ \,\\ \,\\ \| \tilde{b}_1 \| = \sqrt{\left( \frac{1}{\sqrt{2}} \cdot \frac{1}{\sqrt{2}} \right) + \left( \frac{1}{\sqrt{2}} \cdot \frac{1}{\sqrt{2}} \right)} = \sqrt{ \frac{1}{2} + \frac{1}{2} } = 1 \,\\ \,\\ \,\\ \| \tilde{b}_2 \| = \sqrt{\left( \frac{1}{\sqrt{2}} \cdot \frac{1}{\sqrt{2}} \right) + \left( -\frac{1}{\sqrt{2}} \cdot (-\frac{1}{\sqrt{2}}) \right)} = \sqrt{ \frac{1}{2} + \frac{1}{2} } = 1 \,\\ \] Therefore, the vectors \( \tilde{b}_1, \tilde{b}_2 \) form an orthonormal basis.

3.6 Orthogonal Complement

Consider a \(D\)-dimensional vector space \(V\) and an \(M\)-dimensional subspace \(U \subseteq V \). Then its orthogonal complement \(U^\perp\) is a \( (D-M) \)-dimensional subspace of \(V\) and contains all vectors in \(V\) that are orthogonal to every vector in \(U\). Furthermore, \(U \cap U^\perp = \{ \boldsymbol{0}\} \).

3.7 Inner Product of Functions

We can think of a vector \(x\in \mathbb{R}^n\) as a function with \(n\) function values. An inner product of two functions \( u: \mathbb{R} \rightarrow \mathbb{R} \) and \( v: \mathbb{R} \rightarrow \mathbb{R} \) can be defined as the definite integral \[ \,\\ \langle u, v \rangle := \int_a^b u(x)v(x)\, dx \,\\ \]

3.8 Orthogonal Projections

In machine learning, we often deal with data that is high-dimensional. High-dimensional data is often hard to analyze or visualize.

However, high-dimensional data quite often possesses the property that only a few dimensions contain most information, and most other dimensions are not essential to describe key properties of the data. When we compress or visualize high-dimensional data, we will lose information. To minimize this compression loss, we ideally find the most informative dimensions in the data.

In this chapter, we will discuss some of the fundamental tools for data compression. More specifically, we can project the original high-dimensional data onto a lower-dimensional feature space.

For example, machine learning algorithms, such as principal component analysis (PCA), and deep auto-encoders, heavily exploit the idea of dimensionality reduction.

Definition 3.10 (Projection). Let \(V\) be a vector space and \(U \subseteq V\) a subspace of \(V\). A linear mapping \(\Phi: V \rightarrow U\) is called a projection if the following holds \[ \,\\ \Phi^2 = \Phi \,\\ \] Since linear mappings can be expressed by transformation matrices, the matrix representation also satisfies the property above. \[ \,\\ P^2_\pi = P_\pi \,\\ \]

3.8.1 Projection onto One-Dimensional Subspaces (Lines)

Using geometric arguments, let us characterize some properties of the projection \(\pi_U (x)\) , where \(b \in \mathbb{R}^n\), and the subspace is \(U \subseteq \mathbb{R}^n\) spanned by \(b\).

The projection \(\pi_U (x) \) is closest to \(x\), where "closest" implies that the distance \( \| x - \pi_U (x) \| \) is minimal.
The vector \(\pi_U (x) - x\) is orthogonal to \(U\).
The orthogonality condition yields \( \langle \pi_U (x) - x, b \rangle = 0 \).
The projection \(\pi_U (x)\) of \(x\) onto \(U\) must be an element of \(U\). Hence, \(\pi_U (x) = \lambda b\) for some \(\lambda \in \mathbb{R}\).
The projected vector can be obtained by the dot product and normalized basis \(\tilde{b} = \frac{b}{\|b\|} \) \[ \pi_U (x) = (x^\top \tilde{b}) \tilde{b} \]

In the following steps, we determine the \(\lambda\), the projection \(\pi_U (x) \in U\), and the projection matrix \(P_{\pi}\) that maps any \(x \in \mathbb{R}^n\) onto \(U\).

Finding \(\lambda\). The orthogonality condition yields \[ \,\\ \langle x-\pi_U (x), b \rangle = 0 \Longleftrightarrow \langle x - \lambda b, b \rangle = 0 \,\\ \] In the vector form, the right expression is \[ \,\\ \langle x - \lambda b, b \rangle = 0 \,\, \Longrightarrow \,\, (x - \lambda b)^\top b = 0 \,\\ \] By expanding it, we get \[ \,\\ \begin{align} (x - \lambda b)^\top b &= (x^\top - (\lambda b)^\top) b \,\\ \,\\ &= x^\top b - \lambda (b^\top b) \,\\ \,\\ &= 0 \end{align} \,\\ \] Now, solve the expression for \(\lambda\) \[ \,\\ x^\top b - \lambda (b^\top b) = 0 \,\\ \,\\ \lambda = \frac{x^\top b}{b^\top b} = \frac{b^\top x}{\|b\|^2} \,\\ \] Therefore, \[ \,\\ \pi_U (x) = \lambda b = \frac{b^\top x}{\|b\|^2} b \,\\ \] Hence, \[ \,\\ \pi_U (x) = \lambda b = b \lambda = P_\pi (x) = b\frac{b^\top x}{\|b\|^2} = \frac{b b^\top}{\|b\|^2}x \,\\ \,\\ P_\pi = \frac{b b^\top}{\|b\|^2} \,\\ \] Here, the \(bb^\top\) is a symmetric matrix, and \(\|b\|^2 = \langle b, b \rangle \) is a scalar.

Remark. The projection matrix \(\pi_U (x) \in \mathbb{R}^n \) is still an \(n\)-dimensional vector, not a scalar.

Example 3.10 (Projection onto a Line)
Find the projeection matrix \(P_\pi\) onto the line through the origin spanned by \(b = [1, 2, 2]^\top \). \(b\) is a direction and a basis of the one dimensional subspace. Based on the above expression, \(P_\pi\) for \(b\) is \[ \,\\ \begin{align} P_\pi &= \frac{b b^\top}{\|b\|^2} = \frac{b b^\top}{b^\top b} \,\\ \,\\ b^\top b &= 1 + 4 + 4 = 9 \,\\ \,\\ b b^\top &= \begin{bmatrix} 1 \\ 2 \\ 2 \\ \end{bmatrix} \begin{bmatrix} 1& 2& 2 \end{bmatrix} = \begin{bmatrix} 1 & 2 & 2 \\ 2 & 4 & 4 \\ 2 & 4 & 4 \\ \end{bmatrix} \,\\ \,\\ P_\pi &= \frac{1}{9} \begin{bmatrix} 1 & 2 & 2 \\ 2 & 4 & 4 \\ 2 & 4 & 4 \\ \end{bmatrix} \end{align} \,\\ \] Let us now choose a particular \(x\) and see whether it lies in the subspace spanned by \(b\). The illustration below shows the result of applying the projection matrix \(P_\pi\) to \(x\).

3.8.2 Projection onto General Subspaces

In the following, we look at orthogonal projections of vectors \(x \mathbb{R}^n\) onto lower-dimensional subspaces \(U\subseteq \mathbb{R}^n\) with \(\dim(U) = m \ge 1 \).

Assume that \( (b_1, \cdots, b_m) \) is an ordered basis of \(U\). Any projeection \(\pi_U (x)\) onto \(U\) is necessarily an element of \(U\). Therefore, they can be represented as linear combinations of the basis vectors \(\pi_U (x) = \sum_{i=1}^m \lambda_i b_i \).

We follow three-step procedure to find the projection \(\pi_U (x)\) and the projection matrix \(P_\pi\):

Find the coordinates \(\lambda_1, \cdots, \lambda_m\) of the projection, such that the linear combination is closest to \(x \mathbb{R}^n\). \[ \begin{align} &\pi_U (x) = \sum_{i=1}^m \lambda_i b_i = \boldsymbol{B}\boldsymbol{\lambda} \,\\ \,\\ &\boldsymbol{B} = [b_1, \cdots, b_m] \in \mathbb{R}^{n \times m}, \quad \boldsymbol{\lambda} = [\lambda_1, \cdots, \lambda_m]^\top \in \mathbb{R}^m \end{align} \,\\ \] The "closest" means "minimum distance", which implies that the vector connecting \(\pi_U (x) \in U\) and \(x \in \mathbb{R}^n\) must be orthogonal to all basis vectors of \(U\). \[ \langle b_1, x - \pi_U (x) \rangle = b_1^\top(x - \pi_U (x)) = 0 \\ \vdots \\ \langle b_m, x - \pi_U (x) \rangle = b_m^\top(x - \pi_U (x)) = 0 \,\\ \] In the matrix form, with \(\pi_U (x) = \boldsymbol{B\lambda}\) we can written as \[ \boldsymbol{B}^\top (x - \boldsymbol{B\lambda}) = 0 \,\\ \,\\ \boldsymbol{B}^\top x - \boldsymbol{B}^\top \boldsymbol{B\lambda} = 0 \,\\ \] Then, solve the expression for \(\boldsymbol{\lambda}\) \[ \boldsymbol{B}^\top x = \boldsymbol{B}^\top \boldsymbol{B\lambda} \,\\ \,\\ (\boldsymbol{B}^\top \boldsymbol{B})^{-1} \boldsymbol{B}^\top x = (\boldsymbol{B}^\top \boldsymbol{B})^{-1} \boldsymbol{B}^\top \boldsymbol{B\lambda} \,\\ \,\\ \boldsymbol{\lambda} = (\boldsymbol{B}^\top \boldsymbol{B})^{-1} \boldsymbol{B}^\top x \,\\ \]
Find the projection \(\pi_U (x) \in U\). We already established that \(\pi_U (x) = \boldsymbol{B\lambda}\) \[ \pi_U (x) = \boldsymbol{B\lambda} = \boldsymbol{B} (\boldsymbol{B}^\top \boldsymbol{B})^{-1} \boldsymbol{B}^\top x \,\\ \]
Find the projection matrix \(P_\pi\). From the above, we can obtain the projection matrix immediately with \(\pi_U (x) = P_\pi (x)\) \[ P_{\pi} = \boldsymbol{B\lambda} = \boldsymbol{B} (\boldsymbol{B}^\top \boldsymbol{B})^{-1} \boldsymbol{B}^\top \,\\ \]

Example 3.11 (Projection onto a Two-dimensional Subspace)
For a subspace \(U\) with ordered basis \(b_1 = [2, 0, 0]^\top\) and \(b_2 = [0, 2, 2]^\top\), we can write the basis vectors of \(U\) into a maxtrix \(\boldsymbol{B}\) \[ \,\\ \boldsymbol{B} = \begin{bmatrix} 2 & 0 \\ 0 & 2 \\ 0 & 2 \\ \end{bmatrix} \,\\ \] Then, we can compute the projection matrix \( P_\pi \) \[ \,\\ P_\pi = \boldsymbol{B} (\boldsymbol{B}^\top \boldsymbol{B})^{-1} \boldsymbol{B}^\top = \frac{1}{2} \begin{bmatrix} 2 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 1 & 1 \\ \end{bmatrix} \,\\ \] We can now apply the projection matrix to \( x_1=[0, -3, 1]^\top, x_2=[-1, 0, 3]^\top \). The illustration below shows the result of applying \(P_\pi (x)\).

To verify the results, we can check whether the vector \(P_\pi (x) - x\) is orthogonal to all basis vectors of \(U\). Here, the \(X\) is consisting of \(x_1, x_2\). \[ \,\\ \begin{align} \boldsymbol{X} &= \begin{bmatrix} 0 & -1 \\ -3 & 0 \\ 1 & 3 \\ \end{bmatrix} \,\\ \,\\ P_\pi (\boldsymbol{X}) &= \frac{1}{2} \begin{bmatrix} 2 & 0 & 0 \\ 0 & 1 & 1 \\ 0 & 1 & 1 \\ \end{bmatrix} \begin{bmatrix} 0 & -1 \\ -3 & 0 \\ 1 & 3 \\ \end{bmatrix} = \begin{bmatrix} 0 & -1 \\ -1 & 1.5 \\ -1 & 1.5 \end{bmatrix} \,\\ \,\\ P_\pi(\boldsymbol{X}) - \boldsymbol{X} &= \begin{bmatrix} 0 & -1 \\ -3 & 0 \\ 1 & 3 \\ \end{bmatrix} - \begin{bmatrix} 0 & -1 \\ -1 & 1.5 \\ -1 & 1.5 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ -2 & -1.5 \\ 2 & 1.5 \end{bmatrix} \,\\ \,\\ \boldsymbol{B}^\top (P_\pi(\boldsymbol{X}) - \boldsymbol{X}) &= \begin{bmatrix} 2 & 0 & 0 \\ 0 & 2 & 2 \end{bmatrix} \begin{bmatrix} 0 & 0 \\ -2 & -1.5 \\ 2 & 1.5 \end{bmatrix} = \begin{bmatrix} 0 & 0 \\ 0 & 0 \\ \end{bmatrix} \end{align} \,\\ \]

Remark. If the basis is an orthonormal basis (ONB), the projection equation \(\pi_U (x)\) is simplified to \[ \,\\ \begin{align} \pi_U (x) &= \boldsymbol{B} (\boldsymbol{B}^\top \boldsymbol{B})^{-1} \boldsymbol{B}^\top x \,\\ \,\\ &= \boldsymbol{B} I \boldsymbol{B}^\top x \,\\ \,\\ &= \boldsymbol{B} \boldsymbol{B}^\top x \end{align} \,\\ \] since an ONB satisfies \(\boldsymbol{B}^\top \boldsymbol{B} = \boldsymbol{I}\). This means that we no longer have to compute the inverse, which saves computation time.

3.8.3 Gram-Schmidt Orthogonalization

Projections are at the core of the Gram-Schmidt method that allows us to constructively transform any basis \((b_1, \cdots, b_n) \) of \(n\)-dimensional vector space \(V\) into an orthogonal/orthonormal basis \((u_1, \cdots, u_n)\) of \(V\).

The Gram-Schmidt orthogonalization method iteratively constructs an orthogonal basis \((u_1, \cdots, u_n) \) from any basis \( (b_1, \cdots, b_n) \) of \(V\) as follows: \[ \,\\ \begin{align} &u_1 := b_1 \\ &u_2 := b2 - \text{proj}_{u_1} (b2) \\ &u_3 := b3 - \text{proj}_{u_1} (b3) - \text{proj}_{u_2} (b3) \,\\ \,\\ &\quad \quad \vdots \,\\ \,\\ &u_n := b_n - \text{proj}_{u_1} (b_n) - \text{proj}_{u_2} (b_n) - \text{proj}_{u_3} (b_n) - \,\, \cdots \,\, - \text{proj}_{u_{n-1}} (b_n) \end{align} \,\\ \]

Example 3.12 (Gram-Schmidt Orthogonalization)
Consider a basis \((b_1, b_2)\) of \(\mathbb{R}^2\), where \[ \,\\ b_1 = \begin{bmatrix} 2 \\ 0 \\ \end{bmatrix}, \quad b_2 = \begin{bmatrix} 1 \\ 1 \\ \end{bmatrix} \,\\ \]
Using the Gram-Schmidt method, \[ \,\\ \begin{align} &u_1 := b_1 = \begin{bmatrix} 2 \\ 0 \end{bmatrix} \,\\ \,\\ &u_2 := b_2 - \mathrm{proj}_{u_1}(b_2) \end{align} \,\\ \] From Example 3.10 (Projection onto a Line), we can compute the projection as \(\text{proj}_{u_1} (b_2) = \frac{u_1 u_1^\top}{u_1^\top u_1} b_2\). Therefore, \(u_2\) is \[ \,\\ \begin{align} u_1^\top u_1 &= \begin{bmatrix} 2 & 0 \end{bmatrix} \begin{bmatrix} 2 \\ 0 \end{bmatrix} = 4 \,\\ \,\\ u_1 u_1^\top &= \begin{bmatrix} 2 \\ 0 \end{bmatrix} \begin{bmatrix} 2 & 0 \end{bmatrix} = \begin{bmatrix} 4 & 0 \\ 0 & 0 \end{bmatrix} \,\\ \,\\ \,\\ u_2 &= b_2 - \frac{u_1 u_1^{\top}}{u_1^{\top} u_1} b_2 \,\\ \,\\ &= \begin{bmatrix} 1 \\ 1 \end{bmatrix} - \frac{1}{4} \begin{bmatrix} 4 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} \,\\ \,\\ &= \begin{bmatrix} 1 \\ 1 \end{bmatrix} \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} \,\\ \,\\ &= \begin{bmatrix} 0 \\ 1 \end{bmatrix} \end{align} \,\\ \]
The figure below illustrates the process of constructing the orthogonal basis \(u_1, u_2\) from \(b_1, b_2\) using the Gram-Schmidt method.

Gram-Schmidt Orthogonalization in \(\mathbb{R}^2\)

In \(\mathbb{R}^3\), let \(b_1, b_2, b_3 \) are as follows \[ \,\\ b_1 = \begin{bmatrix} 2 \\ 2 \\ 0 \\ \end{bmatrix}, \quad b_2 = \begin{bmatrix} -1 \\ -1 \\ 3 \end{bmatrix}, \quad b_3 = \begin{bmatrix} 0 \\ 2 \\ 0 \end{bmatrix} \,\\ \]
Applying Gram-Schmidt method, we can also obtain orthogonal basis \(u_1, u_2, u_3\) in \(\mathbb{R}^3\) \[ \,\\ \begin{align} u1 &= b1 \,\\ \,\\ \text{proj}_{u_1} (b_2) &= \frac{u_1 u_1^\top}{u_1^\top u_1} b_2\ = \begin{bmatrix} -1 \\ -1 \\ 0 \end{bmatrix} \,\\ \,\\ u_2 &= b_2 - \text{proj}_{u_1} (b_2) = \begin{bmatrix} 0 \\ 0 \\ 3 \end{bmatrix} \,\\ \,\\ \text{proj}_{u_1} (b_3) &= \frac{u_1 u_1^\top}{u_1^\top u_1} b_3\ = \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix} \,\\ \,\\ \text{proj}_{u_2} (b_3) &= \frac{u_2 u_2^\top}{u_2^\top u_2} b_3\ = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} \,\\ \,\\ u3 &= b_3 - \text{proj}_{u_1} (b_3) - \text{proj}_{u_2} (b_3) = \begin{bmatrix} -1 \\ 1 \\ 0 \end{bmatrix} \end{align} \,\\ \]
The figure below shows the process of constructing the orthogonal basis \(u_1, u_2, u_3\) from \(b_1, b_2, b_3\) in \(\mathbb{R}^3\).

To check whether the constructed orthogonal basis vectors \(u_1, u_2, u_3\) are perpendicular, we can compute \(U^\top U\), where \(U\) is the matrix whose columns are \(u_1, u_2, u_3\). \[ \,\\ U^\top U = \begin{bmatrix} 2 & 2 & 0 \\ 0 & 0 & 3 \\ -1 & 1 & 0 \\ \end{bmatrix} \begin{bmatrix} 2 & 0 & -1 \\ 2 & 0 & 1 \\ 0 & 3 & 0 \\ \end{bmatrix} = \begin{bmatrix} 8 & 0 & 0 \\ 0 & 9 & 0 \\ 0 & 0 & 2 \\ \end{bmatrix} \,\\ \]
The matrix derived from \(U^\top U\) shows that all off-diagonal elements are zero. This means that all basis vectors are orthogonal to each other, and when performing the inner product with themselves, only the inner product produces a non-zero value.

3.9 Rotations

A rotation is a linear mapping that rotates a plane by an angle \(\theta\) about the origin, i.e., the origin is a fixed point.

3.9.1 Rotations in \(\mathbb{R}^2\)

Consider the standard basis \(e_1 = [1, 0]^\top, e_2 = [0, 1]^\top \) of \(\mathbb{R}^2\). We aim to rotate this coordinate system by an angle \(\theta\) as illustrated in the figure below.

Note that the rotated vectors are still linearly independent and, therefore, are a basis of \(\mathbb{R}^2\). This means that the rotation performs a basis change. Trigonometry allows us to determine the coordinates of the rotated axes with respect to the standard basis in \(\mathbb{R}^2\). We obtain \[ \,\\ \Phi (e_1) = [ \cos(\theta), \sin(\theta) ]^\top ,\quad \Phi (e_2) = [ -\sin(\theta), \cos(\theta) ]^\top \,\\ \] In the matrix form, the axes can be represented as \(R(\theta)\) \[ \,\\ R(\theta) = \begin{bmatrix} \cos(\theta) & -\sin(\theta) \\ \sin(\theta) & \cos(\theta) \\ \end{bmatrix} \,\\ \] You can check whether the rotated basis remains orthogonal by computing the inner product: \(\Phi(e_1)^\top \Phi(e_2) = 0\).

3.9.2 Rotations in \(\mathbb{R}^3 \)

In \(\mathbb{R}^3\), there are three rotations about the three standard basis vectors. We can obtain a general rotation matrix \(R\) by combining the images of the standard basis. Consider the standard basis \(e_1, e_2, e_3\) in \(\mathbb{R}^3\) \[ \,\\ e_1 = [1, 0, 0]^\top, \quad e_2 = [0, 1, 0]^\top, \quad e_3 = [0, 0, 1]^\top \,\\ \] Rotation about the \(e_1\)-axis \(R_1(\theta)\) is \[ \,\\ R_1(\theta) = \begin{bmatrix} 1 & 0 & 0 \\ 0 & \cos(\theta) & -\sin(\theta) \\ 0 & \sin(\theta) & \cos(\theta) \\ \end{bmatrix} \,\\ \] Here, the \(e_1\) coordinate is fixed, and the counterclockwise rotation is performed in the \(e_2, e_3\) plane.

Rotation about the \(e_2\)-axis \(R_2(\theta)\) is \[ \,\\ R_2(\theta) = \begin{bmatrix} \cos(\theta) & 0 & -\sin(\theta) \\ 0 & 1 & 0 \\ \sin(\theta) & 0 & \cos(\theta) \end{bmatrix} \,\\ \] In this matrix, rotation is performed in the \(e_1, e_3\) plane

Rotation about the \(e_3\)-axis \(R_3(\theta)\) is \[ \,\\ R_3(\theta) = \begin{bmatrix} \cos(\theta) & -\sin(\theta) & 0 \\ \sin(\theta) & \cos(\theta) & 0\\ 0 & 0 & 1 \\ \end{bmatrix} \,\\ \] The figure below illustrates rotations in \(\mathbb{R}^3\).

3.9.3 Rotations in \(n\) Dimensions

The generalization of rotations from 2D and 3D to n-dimensional Euclidean vector spaces can be intuitively described as fixing \(n - 2\) dimensions and restrict the rotation to a two-dimensional plane in the \(n\)-dimensional space. As in the three-dimensional case, we can rotate any plane (two-dimensional subspace of \(\mathbb{R}^n\)).

3.9.4 Properties of Rotations

Rotations exhibit a number of useful properties, which can be derived by considering them as orthogonal matrices.

Rotations preserve distances, i.e., \( \| x - y \| = \| R_{\theta}(x) - R_{\theta}(y) \| \).
Rotations preserve angles, i.e., the angle between \(R_{\theta}(x)\) and \(R_{\theta}(y) \) equals the angle between \(x\) and \(y\).
Rotations in three (or more) dimensions are generally not commutative. Therefore, the order in which rotataions are applied is important. Only in two dimensions vector rotataions are commutative.

4. Matrix Decompositions

In this chapter, we present three aspects of matrices: how to summarize matrices, how matrices can be decomposed, and how these decompositions can be used for matrix approximations.

4.1 Determinant and Trace

A determinant is a mathematical object in the analysis and solution of systems of linear equations. Determinants are only defined for square matrices \(A\in \mathbb{n \times n}\). The determinant of a square matrix \(A\in \mathbb{n \times n}\) is a function that maps \(A\) onto a real number.

Example 4.1 (Testing for Matrix Invertibility)
Let us begin with exploring if a square matrix \(A\) is invertible. For \(2 \times 2\) matrices, by the definition of the inverse, we know that \(AA^{-1} = I\). Then with the definition 2.3 (Inverse), the inverse of \(A\) is \[ \,\\ A^{-1} = \frac{1}{a_{11}a_{22}-a_{12}a_{21}} \begin{bmatrix} a_{22} & -a_{12} \\ -a_{21} & a_{11} \end{bmatrix} \,\\ \] Hence, \(A\) is invertible if and only if \[ \,\\ a_{11}a_{22} - a_{12}a_{21} \neq 0 \,\\ \] This quantity is the determinant of \(A\in \mathbb{R}^{2\times 2}\), i.e., \[ \,\\ \det(A) = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11}a_{22} - a_{12}a_{21} \,\\ \] This example points at the relationship between determinants and the existence of inverse matrices.

Theorem 4.1. For any square matrix \(A\in \mathbb{R}^{n \times n}\) it holds that \(A\) is invertible if and only if \(\det(A) \neq 0\).

For \(n = 1\), \[ \,\\ \det(A) = det(a_{11}) = a_{11} \,\\ \] For \(n = 2\), \[ \,\\ \det(A) = \begin{vmatrix} a_{11} & a_{12} \\ a_{21} & a_{22} \end{vmatrix} = a_{11}a_{22} - a_{12}a_{21} \,\\ \] For \(n = 3\) (known as Sarrus' rule), \[ \,\\ \begin{align} \det(A) = \begin{vmatrix} a_{11} & a_{12} & a_{13} \\ a_{21} & a_{22} & a_{23} \\ a_{31} & a_{32} & a_{33} \\ \end{vmatrix} = \color{red}{a_{11}a_{22}a_{33} + a_{12}a_{23}a_{31} + a_{13}a_{21}a_{32}} \\ \color{blue}{-a_{13}a_{22}a_{31} - a_{11}a_{23}a_{32} - a_{12}a_{21}a_{33}} \end{align} \,\\ \] Sarrus' rule can be understood intuitively using the augmented matrix. \[ \,\\ \left[ \begin{array}{ccc|cc} \color{red}{a_{11}} & \color{red}{a_{12}} & \color{red}{a_{13}} & a_{11} & a_{12} \\ a_{21} & \color{red}{a_{22}} & \color{red}{a_{23}} & \color{red}{a_{21}} & a_{22} \\ a_{31} & a_{32} & \color{red}{a_{33}} & \color{red}{a_{31}} & \color{red}{a_{32}} \\ \end{array} \right] \quad \left[ \begin{array}{ccc|cc} a_{11} & a_{12} & \color{blue}{a_{13}} & \color{blue}{a_{11}} & \color{blue}{a_{12}} \\ a_{21} & \color{blue}{a_{22}} & \color{blue}{a_{23}} & \color{blue}{a_{21}} & a_{22} \\ \color{blue}{a_{31}} & \color{blue}{a_{32}} & \color{blue}{a_{33}} & a_{31} & a_{32} \\ \end{array} \right] \,\\ \]

Example 4.2 (Determinants as Measure of Volume)
It turns out that the determinant \(\det(A)\) is the signed volume of an n-dimensional parallelepiped formed by columns of the matrix \(A\).

For \(n=2\), the columns of the matrix form a parallelogram. If the angle between vectors is smaller, the area of a parallelogram shrinks, too. In particular, if \(b, g\) are linearly dependent so that \(b = \lambda g \) for some \(\lambda \in \mathbb{R} \), they no longer form a two-dimensional parallelogram. Therefore, the corresponding area is \(0\).

Parallelogram for , Parallelepiped for . — Parallelogram for \(n=2\), Parallelepiped for \(n=3\).

The sign of the determinant indicates the orientation of the spanning vectors. If flipping the order \(b, g\) to \(g, b\), the columns of \(A = [b, g]\) becomes \(A = [g, b]\) and the \(A\) has negative area. For example, let \(b\) and \(g\) be standard basis \(b = e_1 = [1, 0]^\top, g = e_2 = [0, 1]^\top\). \[ \,\\ \begin{align} A &= \begin{bmatrix} 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad \det{A} = 1 - 0 = 1 \,\\ \,\\ \tilde{A} &= \begin{bmatrix} 0 & 1 \\ 1 & 0 \end{bmatrix}, \quad \det(\tilde{A}) = 0 - 1 = -1 \end{align} \,\\ \]
In \(\mathbb{R}^3\), the absolute value of the determinant of the \(3 \times 3\) matrix \([r, g, b] \) is the volume of the solid. Thus, the determinant acts as a function that measures the signed volume formed by column vectors composed in a matrix. Let's say that the three linearly independent vectors \(r, g, b\in \mathbb{R}^3\) are given as \[ \,\\ r = \begin{bmatrix} 1 \\ 0 \\ 0 \end{bmatrix}, \quad g = \begin{bmatrix} 1 \\ 1 \\ 0 \end{bmatrix}, \quad b = \begin{bmatrix} 1\\1\\1 \end{bmatrix} \,\\ \] The vectors given can be represented matrix form as \[ \,\\ A = [r, g, b] = \begin{bmatrix} 1 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 1 \end{bmatrix}, \quad \det(A) = 1 \,\\ \]

Theorem 4.2 (Laplace Expansion) Consider a matrix \(A\in\mathbb{R}^{n\times n}\). Then for all \(j=1, \cdots, n\)

Expansion along column \(j\) \[ \,\\ \det(A) = \sum_{k=1}^n (-1)^{k+j} a_{kj} \det(A_{k, j}) \,\\ \]
Expansion along row \(j\) \[ \,\\ \det(A) = \sum_{n=1}^k (-1)^{k+j} a_{jk} \det(A_{j, k}) \,\\ \]

Here, \(A_{k, j}\) is the submatrix of \(A\) that we obtain when deleting row \(k\) and column \(j\).

Example 4.3 (Laplace Expansion)
Let \(A\) be \[ \,\\ A = \begin{bmatrix} 2 & 1 & 3 \\ 0 & 2 & 1 \\ 0 & 0 & 2 \end{bmatrix} \,\\ \] To compute the determinant of \(A\), we can use the Laplace Expansion along the first column \[ \,\\ \begin{align} \det(A) &= (-1)^{1+1} \cdot 2 \begin{pmatrix} 2 & 1 \\ 0 & 2 \end{pmatrix} + (-1)^{2+1} \cdot 0 \begin{pmatrix} 1 & 3 \\ 0 & 2 \end{pmatrix} + (-1)^{3+1} \cdot 0 \begin{pmatrix} 1 & 3 \\ 2 & 1 \end{pmatrix} \,\\ \,\\ &= 2 \cdot 4 + 0 + 0 = 8 \end{align} \,\\ \] Additionally, if the matrix is a triangular matrix, computing the determinant of \(A\) is simplified to multiplying all the diagonal elements. In this case, \(2 \times 2 \times 2 = 8\). \[ \,\\ \det(\tilde{A}) = \prod_{i=1}^n \tilde{A}_{ii} \,\\ \] where, \(\tilde{A}\) is a triangular matrix.

For \(A\in\mathbb{R}^{n\times n}\), the determinant satisfies the following properties

\(\det(AB) = \det(A) \cdot \det(B) \).
\(\det(A) = \det(A^\top) \).
If \(A\) is invertible (regular), then \(\det(A^{-1}) = \frac{1}{\det(A)} \).
Adding a multiple of a column/row to another one does not change \(\det(A)\).
Multiplication of a column/row with \(\lambda \in \mathbb{R}\) scales \(\det(A)\) by \(\lambda\). In particular, \(\det(\lambda A) = \lambda^n \det(A) \).
Swapping two rows/columns changes the sign of \(\det(A)\).

Theorem 4.3. A square matrix \(A\ in \mathbb{R}^{n \times n}\) has \(\det(A) \neq 0\) if and only if \(\text{rk}(A) = n\). In other words, \(A\) is invertible if and only if it is full rank.

Definition 4.4. The trace of a square matrix \(A\in \mathbb{R}^{n\times n}\) is defined as \[ \,\\ \text{tr}(A) := \sum_{i=1}^n A_ii \,\\ \] The trace is the sum of the diagonal elements of \(A\). The trace staisfies the following properties

\(\text{tr}(A) + \text{tr}(B) = \text{tr}(A + B) \), where \(A, B \in \mathbb{R}^{n \times n} \)
\(\text{tr}(\alpha A) = \alpha \text{tr}(A) \), where \(\alpha \in \mathbb{R}, A \in \mathbb{R}^{n \times n} \)
\(\text{tr} (I_n) = n \)
\(\text{tr}(AB) = \text{tr}(BA) \), where \(A \in \mathbb{R}^{n \times k}, B \in \mathbb{R}^{k \times n} \)

The only function that satisfies the above four properties together is trace.

The trace is invariant under cyclic permutations, \[ \,\\ \text{tr}(AKL) = \text{tr}(KLA) \,\\ \] for the matrices \( A \in \mathbb{R}^{a \times k}, K \in \mathbb{R}^{k \times l}, L \in \mathbb{R}^{l \times a} \). The traces of similar matrices \(A, B\) are the same due to the cyclic permutations. The similar matrix can be defined as \( B = P^{-1} A P \), where \( A, B, P \in \mathbb{R}^{n \times n} \) \[ \,\\ \begin{align} \text{tr}(B) &= \text{tr}(P^{-1}AP) \,\\ \,\\ &= \text{tr}(APP^{-1}) \,\\ \,\\ &= \text{tr}(AI) \,\\ \,\\ &= \text{tr}(A) \end{align} \,\\ \] Hence, while matrix representations of linear mappings are basis dependent the trace of a linear mapping \(\Phi\) is independent of the basis.

Consider a mapping \(\Phi: \mathbb{R}^2 \rightarrow \mathbb{R}^2 \) given by \[ \,\\ \Phi (x, y) = (2x + y, x + 3y) \,\\ \] With respect to the standard basis \(\mathcal{B} = \{ [1, 0]^\top, [0, 1]^\top \}\), the mapping \(\Phi\) is represented by \[ \,\\ \Phi_{\mathcal{B}} = A = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}, \quad \text{tr}(A) = 2+3 = 5 \,\\ \] Now, choose a new basis \(\mathcal{C} = \{ [1, 1]^\top, [1, 0]^\top \}\). Then, \[ \,\\ P = \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix}, \quad P^{-1} = \begin{bmatrix} 0 & 1 \\ 1 & -1 \end{bmatrix} \,\\ \] Therefore, \(B = P^{-1}AP \) is

\[ \,\\ \begin{align} B = P^{-1}AP = \begin{bmatrix} 0 & 1 \\ 1 & -1 \end{bmatrix} \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix} \begin{bmatrix} 1 & 1 \\ 1 & 0 \end{bmatrix} = \begin{bmatrix} 4 & 1 \\ -1 & 1 \end{bmatrix}, \quad &\text{tr}(B) = 4 + 1 = 5 \,\\ \,\\ B = APP^{-1} = AI = A = \begin{bmatrix} 2 & 1 \\ 1 & 3 \end{bmatrix}, \quad &\text{tr}(A) = 2 + 3 = 5 \end{align} \,\\ \]

Although the matrices \(A\) and \(B\) are different, they are similar, and by the cyclic permutation property of the trace, we have \[ \,\\ \text{tr}(A) = \text{tr}(B) = 5 \,\\ \]

4.2 Eigenvalues and Eigenvectors

We can interpret linear mappings and their associated transformation matrices by performing an "eigen" analysis.

Definition 4.6. Let \(A \in \mathbb{R}^{n \times n} \) be a square matrix. Then \(\lambda \mathbb{R}\) is an eigenvalue of \(A\) and \(x\in \mathbb{R}^n \setminus \{0\}\) is the corresponding eigenvector of \(A\) if \[ \,\\ Ax = \lambda x \,\\ \] We call this equation the eigenvalue equation.

Remark. The following statements are equivalent:

There exists an \(x\in \mathbb{R}^n \setminus \{0\} \) with \(Ax = \lambda x \) or, equivalently, \( (A - \lambda I_n)x = 0 \) can be s olved non-trivially, i.e., \(x \neq 0\).
\(\text{rk}(A - \lambda I_n) < n \).
\(\det(A - \lambda I_n) = 0 \).

Definition 4.7 (Colinearity and Codirection). Two vectors that point in the same direction are called codirected. Two vectors are collinear if they point in the same or the opposite direction.

Remark (Non-uniqueness of eigenvectors). If \(x\) is an eigenvector of \(A\) associated with eigenvalue \(\lambda\), then for any \(c\in \mathbb{R} \setminus \{0\}\) it holds that \(cx\) is an eigenvector of \(A\) \[ \,\\ A(cx) = cAx = c\lambda x = \lambda(cx) \,\\ \]
Definition 4.10 (Eigenspace and Eigenspectrum). For \(A\ \in \mathbb{R}^{n \times n}\), the set of all eigenvectors of \(A\) associated with an eigenvalue \(\lambda\) spans a subspace of \(\mathbb{R}^{n}\), which is called the eigenspace of \(A\) with respect to \(\lambda\) and is denote by \(E_{\lambda}\). The set of all eigenvalues of \(A\) is called the eigenspectrum, or just spectrum of \(A\).

If \(\lambda\) is an eigenvalue of \(A\in \mathbb{R}^{n\times n} \), then the corresponding eigenspace \(E_{\lambda}\) is the solution of the linear equations \((A - \lambda I)x = 0 \). Geometrically, the eigenvector corresponding to a nonzero eigenvalue points in a direction that is stretched by the linear mapping.

The properties of eigenvalues and eigenvectors include:

A matrix \(A\) and its transpose \(A^\top\) possess the same eigenvalues, but not necessarily the same eigenvectors.
The eigenspace \(E_{\lambda}\) is the null space of \(A - \lambda I \) since
\[ Ax - \lambda x \,\, \Longleftrightarrow \,\, Ax - \lambda x = 0 \,\, \Longleftrightarrow \,\, (A - \lambda I) x = 0 \,\, \Longleftrightarrow \,\, x \in \ker(A - \lambda I) \]
If \(\det(A - \lambda I = 0)\), then the linear transformation \(A - \lambda I\) is singular. In a geometric sense, a zero determinant means that the transformation flattens the space into a lower-dimensional subspace.

Example 4.5 (Computing Eigenvalues, Eigenvectors, and Eigenspaces) Let us find the eigenvalues and eigenvectors of the matrix \(A \in \mathbb{R}^{2 \times 2} \) \[ \,\\ A = \begin{bmatrix} 4 & 2 \\ 1 & 3 \end{bmatrix} \,\\ \] By the definition of an eigenvector \(x \neq 0\) and eigenvalue \(\lambda\) of \(A\), we have \(Ax = \lambda x \Longleftrightarrow (A - \lambda I)x = 0\). Since \(x \neq 0\), the matrix \((A - \lambda I)\) must be singular, i.e., its determinant must be zero. Therefore, we need to compute the roots of the characteristic polynomial to find the eigenvalues.

The characteristic polynomial is \[ \,\\ \begin{align} \det(A - \lambda I) &= \det \left( \begin{bmatrix} 4 & 2 \\ 1 & 3 \end{bmatrix} - \begin{bmatrix} \lambda & 0 \\ 0 & \lambda \end{bmatrix} \right) \,\\ \,\\ &= \det \left( \begin{bmatrix} 4 - \lambda & 2 \\ 1 & 3 - \lambda \end{bmatrix} \right) \,\\ \,\\ &= (4 - \lambda)(3 - \lambda) - 2 \,\\ \,\\ &= \lambda^2 - 7 \lambda + 10 \end{align} \,\\ \] As the determinant of \((A - \lambda I)\) must be zero, the roots of characteristic equation are \[ \,\\ \begin{align} \lambda^2 - 7 \lambda + 10 &= 0 \,\\ \,\\ (2 - \lambda)(5 - \lambda) &= 0 \end{align} \,\\ \] Therefore, \(\lambda = 2\) and \(\lambda = 5\).

Using the characteristic polynomial and the eigenvalues computed above, we can find the eigenvectors \[ \,\\ \begin{bmatrix} 4 - \lambda & 2 \\ 1 & 3 - \lambda \end{bmatrix} x = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \,\\ \] For \(\lambda = 5\), we obtain \[ \,\\ \begin{bmatrix} 4 - 5 & 2 \\ 1 & 3 - 5 \end{bmatrix} \begin{bmatrix} x_1 \\ x_ 2 \end{bmatrix} = \begin{bmatrix} -1 & 2 \\ 1 & -2 \end{bmatrix} \begin{bmatrix} x_1 \\ x_ 2 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \,\\ \] Solving the linear equation below, \[ \,\\ \begin{matrix} - & x_1 & + & 2x_2 & = & 0 \\ & x_1 & - & 2x_2 & = & 0 \end{matrix} \,\\ \] we obtain the solution \(x_1 = 2x_2\). Therefore, the eigenvector corresponding the eigenvalue \(\lambda = 5\) spans \[ \,\\ e_1 = \text{span} \left[ \begin{bmatrix} 2 \\ 1 \end{bmatrix} \right] \,\\ \] For \(\lambda = 2\), using the same method, we obtain \[ \,\\ e_2 = \text{span} \left[ \begin{bmatrix} -1 \\ 1 \end{bmatrix} \right] \,\\ \] The figure below provides geometric intuition of Eigenvalues, Eigenvectors.

For the vector \(x = [0, -1]^\top\), applying the linear transformation matrix \(A\) to \(x\) results in \(Ax = [-2, -3]^\top\). In contrast, the eigenvectors painted in green are only scaled by their eigenvalues and do not change direction.

Theorem 4.12. The eigenvectors \(x_1, \cdots, x_n\) of a matrix \(A \in \mathbb{R}^{n \times n} \) with \(n\) distinct eigenvalues \(\lambda_1, \cdots, \lambda_n \) are linearly independent. This theorem states that the eigenvectors of a matrix with \(n\) distinct eigenvalues form a basis of \(\mathbb{R}^n\).

Definition 4.13. A square matrix \(A \in \mathbb{R}^{n \times n}\) is defective if it possesses fewer that \(n\) linearly independent eigenvectors. Looking at the eigenspaces of a defective matrix, it follows that the sum of the sum of the dimensions of the eigenspaces is less than \(n\).

Example 4.8 (Linear combination of the eigenvectors corresponding to the repeated eigenvalues)
Let the symmetric matrix \(A\) be \[ \,\\ A = \begin{bmatrix} 3 & 2 & 2 \\ 2 & 3 & 2 \\ 2 & 2 & 3 \end{bmatrix} \,\\ \] The eigenvalues of \(A\) are \(\lambda_1 = 1\) and \(\lambda_2 = 7\), where \(\lambda_1\) is a repeated eigenvalue. For \(\lambda_1 = 1 \,\, \text{(multiplicity 2)} \), its eigenvectors \(E_1, E_2\) are \[ \,\\ E_1 = \text{span}\left[ \begin{bmatrix} -1 \\ 1 \\ 0 \end{bmatrix} \right], \quad E_2 = \text{span}\left[ \begin{bmatrix} -1 \\ 0 \\ 1 \end{bmatrix} \right] \,\\ \] and for \(\lambda_2 = 7\), its eigenvector is \[ \,\\ E_3 = \text{span}\left[ \begin{bmatrix} 1 \\ 1 \\ 1 \end{bmatrix} \right] \,\\ \] Computing a linear combination of eigenvectors corresponding to the repeated eigenvalue \(\lambda_1\) also yields an eigenvector. \[ \,\\ \begin{align} E_1 + E_2 &= \begin{bmatrix} -1 \\ 1 \\ 0 \end{bmatrix} + \begin{bmatrix} -1 \\ 0 \\ 1 \end{bmatrix} = \begin{bmatrix} -2 \\ 1 \\ 1 \end{bmatrix} \,\\ \,\\ A(E_1+E_2) &= \begin{bmatrix} 3 & 2 & 2 \\ 2 & 3 & 2 \\ 2 & 2 & 3 \end{bmatrix} \begin{bmatrix} -2 \\ 1 \\ 1 \end{bmatrix} = \begin{bmatrix} -2 \\ 1 \\ 1 \end{bmatrix} \end{align} \,\\ \]

Theorem 4.14. Given a matrix \(A \in \mathbb{R}^{m \times n}\), we can always obtain a symmetric, positive semidefinite matrix \(S \in \mathbb{R}^{n \times n}\) by defining \[ \,\\ S := A^\top A \,\\ \] Symmetry requires \( S = S^\top \), and by inserting, we obtain \(S = A^\top A = (A^\top A)^\top = A^\top (A^\top)^\top = S^\top\). Moreover, positive semidefiniteness requires that \(x^\top Sx \ge 0 \) and inserting, we obtain \( x^\top Sx = x^\top A^\top Ax = (x^\top A^\top) (Ax) = (Ax)^\top (Ax) \ge 0 \).

Theorem 4.15 (Spectral Theorem). If \(A\in\mathbb{R}^{n\times n}\) is symmetric, there exists an orthonormal basis of the corresponding vector space \(V\) consisting of eigenvectors of \(A\).

Theorem 4.16. The determinant of a matrix \(A \in \mathbb{R}^{n\times n}\) is the product of its eigenvalues, where \(\lambda_i\) are possibly repeated eigenvalues of \(A\). \[ \,\\ \det(A) = \prod_{i=1}^n \lambda_i \,\\ \]
Theorem 4.17. The trace of a matrix \(A\in\mathbb{R}^{n\times n}\) is the sum of its eigenvalues, where \(\lambda_i\) are possibly repeated eigenvalues of \(A\). \[ \,\\ \text{tr}(A) = \sum_{i=1}^n \lambda_i \,\\ \]

4.3 Cholesky Decomposition

Theorem 4.18 (Cholesky Decomposition). A symmetric, positive definite matrix \(A\) can be factorized into a product \(A = LL^\top\), where \(L\) is a lower triangular matrix with positive diagonal elements and \(L\) is unique.

\[ \,\\ \begin{bmatrix} a_{11} & \cdots & a_{1n} \\ \vdots & \ddots & \vdots \\ a_{n1} & \cdots & a_{nn} \end{bmatrix} = \begin{bmatrix} l_{11} & \cdots & 0 \\ \vdots & \ddots & \vdots \\ l_{n1} & \cdots & l_{nn} \end{bmatrix} \begin{bmatrix} l_{11} & \cdots & l_{1n} \\ \vdots & \ddots & \vdots \\ 0 & \cdots & l_{nn} \end{bmatrix} \,\\ \]

The Cholesky decomposition also allows us to compute determinants very efficiently. Given the Cholesky decomposition \(A=LL^\top\), we know that \(\det(A) = \det(L)\det(L^\top) = \det(L)^2 \). Since \(L\) is a triangular matrix, the determinant is simply the product of its diagonal entries so that \(\det(A) = \prod_i l_{ii}^2 \).

4.4 Eigendecomposition and Diagonalization

In this section, we will discuss how to transform matrices into diagonal form. Recall that two matrices \(A, D\) are similar if there exists an invertible matrix \(P\), such that \(D = P^{-1}AP\). More specifically, we will look at matrices \(A\) that are similar to diagonal matrices \(D\) that contain the eigenvalues of \(A\) on the diagonal.

Definition 4.19 (Diagonalizable). A matrix \(A\in \mathbb{R}^{n\times n}\) is diagonalizable if there exists an invertible matrix \(P\in \mathbb{R}^{n\times n}\) such that \[ \,\\ \begin{align} D = P^{-1}AP \,\, \Longleftrightarrow \,\, &PD = PP^{-1}AP \\ &PD = IAP \\ &PD = AP \end{align} \,\\ \] We can see that this statement holds because \[ \,\\ \begin{align} AP &= A[p_1, \cdots, p_n] = [Ap_1, \cdots, Ap_n] \,\\ \,\\ PD &= [p_1, \cdots, p_n] \begin{bmatrix} \lambda_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \lambda_n \end{bmatrix} = [\lambda_1 p_1, \cdots , \lambda_n p_n] \end{align} \,\\ \] Thus, \(PD = AP\) implies that \[ \,\\ \lambda_1 p_1 = A p_1 \,\\ \vdots \,\\ \lambda_n p_n = A p_n \,\\ \] Therefore, by the definition of Eigenvectors and Eigenvalues \(Ax = \lambda x\), the columns of \(P\) must be eigenvectors of \(A\). The definition of diagonalization requires that \(P \in \mathbb{R}^{n\times n} \) is invertible, i.e., \(P\) has full rank. This requires us to have \(n\) linearly independent eigenvectors \(p_1, \cdots, p_n\), i.e., the \(p_i\) form a basis of \(\mathbb{R}^n\).

Theorem 4.20 (Eigendecomposition). A square matrix \(A \in \mathbb{R}^{n \times n} \) can be factorized from \(D = P^{-1}AP\) to \(A = PDP^{-1}\), \[ \,\\ \begin{align} D = P^{-1}AP \,\, \Longleftrightarrow \,\, &PD = PP^{-1}AP \\ &PD = IAP \\ &PD = AP \\ &PDP^{-1} = APP^{-1} \\ &PDP^{-1} = AI \\ &PDP^{-1} = A \end{align} \,\\ \] where \(P \in \mathbb{R}^{n\times n} \) consists of the eigenvectors of \(A\) as column vectors, and \(D\) is a diagonal matrix whose diagonal entries are the eigenvalues of \(A\).

Theorem 4.20 implies that only non-defective matrices can be diagonalized.

Theorem 4.21. A symmetric matrix \(S\in \mathbb{R}^{n\times n}\) can always be diagonalized by the Spectral Theorem. The theorem guarantees the existence of an orthonormal basis of \(\mathbb{R}^n\) consisting of eigenvectors of \(S\).

Let \(P\) be the matrix whose columns are these orthonormal eigenvectors. Then \(P\) is an orthogonal matrix satisfying \(P^{-1} = P^\top\), and \[ \,\\ D = P^\top SP \,\\ \]

4.41 Geometric Intuition for the Eigendecomposition

Eigendecomposition shows that a linear transformation represented by a matrix can be decomposed into three processes: 'rotate', 'stretch', and 'rotate'.

4.5 Singular Value Decomposotion

The singular value decomposition (SVD) of a matrix is a central matrix decomposition method in linear algebra. It can be applied to all matrices, not only to square matrices, and it always exists.

Theorem 4.22 (SVD Theorem). Let \(A\in\mathbb{R}^{m\times n}\) be a rectangular matrix. The SVD of \(A\) is a decomposition of the form \[ \,\\ A = U \Sigma V^\top \,\, \Longleftrightarrow \,\, AV = U \Sigma \,\\ \] with an orthogonal matrix \(U \in \mathbb{R}^{m \times m}\), an orthogonal matrix \(V \in \mathbb{R}^{n \times n} \), and \(\Sigma\) is an \(m \times n\) diagonal matrix.

The singular value matrix \(\Sigma\) is unique. The matrix \(\Sigma\) is of the same size as \(A\). This means that \(\Sigma\) has a diagonal submatrix that contains the singular values and needs additional zero padding. If \(m > n\), then the \(\Sigma\) has the following shape \[ \,\\ \Sigma = \begin{bmatrix} \sigma_1 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & \sigma_n \\ 0 & \cdots & 0 \\ \vdots & & \vdots \\ 0 & \cdots & 0 \end{bmatrix} \,\\ \] If \(m < n \), the matrix \(\Sigma\) has \[ \,\\ \Sigma = \begin{bmatrix} \sigma_1 & 0 & 0 & 0 & \cdots & 0 \\ 0 & \ddots & 0 & \vdots & & \vdots \\ 0 & 0 & \sigma_m & 0 & \cdots & 0 \end{bmatrix} \,\\ \]

4.5.2 Construction of the SVD

The SVD of a general matrix shares some similarities with the eigendecomposition of a square matrix

Remark. Compare the eigendecomposition of an SPD matrix \[ \,\\ S = S^\top = PDP^\top \,\\ \] with the corresponding SVD \[ \,\\ S = U\Sigma V^\top \,\\ \] If we set \[ \,\\ U = P = V, \quad D = \Sigma \,\\ \] we see that the SVD of SPD matrices is their eigendecomposition.

Computing the SVD of \(A \in \mathbb{R}^{m \times n} \) is equivalent to finding two sets of orthonormal bases \(U = (u_1, \cdots, u_m) \) and \(V = (v_1, \cdots, v_n) \). From these ordered bases, we will construct the matrices \(U \) and \(V\). The plan is to start with construction the orthonormal set of right-singular vectors \( v_1, \cdots, v_n \in \mathbb{R}^{m} \). Thereafter, we will link the two and require that the orthogonality of the \(v_i\) is preserved under the transformation of \(A\).

Let us begin with constructing the right-singular vectors. \[ \,\\ S = A^\top A = PDP^\top = P \begin{bmatrix} \lambda_1 & \cdots & 0 \\ \vdots & \ddots & \vdots \\ 0 & \cdots & \lambda_n \end{bmatrix} P^\top \,\\ \] where \(P\) is an orthogonal matrix, which is composed of the orthonormal eigenbasis. The \(\lambda_i \ge 0\) are the eigenvalues of \(A^\top A\). Injecting the theorem 4.22 into the above yields \[ \,\\ A^\top A = (U\Sigma V^\top)^\top (U\Sigma V^\top) = V\Sigma^\top U^\top U \Sigma V^\top \,\\ \] where \(U, V\) are orthogonal matrices. Therefore, with \(U^\top U = I\), we obtain \[ \,\\ \begin{align} A^\top A \in \mathbb{R}^{n \times n} &= V \Sigma^\top I \Sigma V^\top \,\\ \,\\ &= V \Sigma^\top \Sigma V^\top \,\\ \,\\ &= V \begin{bmatrix} \sigma_1^2 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & \sigma_n^2 \end{bmatrix} V^\top \end{align} \,\\ \] Now, we can identify \[ \,\\ \begin{align} V^\top &= P^\top \,\\ \,\\ \sigma_i^2 &= \lambda_i \end{align} \,\\ \] Therefore, the eigenvectors of \(A^\top A\) that compose \(P\) are the right-singular vectors \(V\). The eigenvalues of \(A^\top A\) are the squared singular values of \(\Sigma\). To obtain the left-singular vectors U, we follow a similar procedure. \[ \,\\ \begin{align} AA^\top \in \mathbb{R}^{m \times m} &= (U\Sigma V^\top)(U\Sigma V^\top)^\top = U\Sigma V^\top V \Sigma^\top U^\top \,\\ \,\\ &= U \begin{bmatrix} \sigma_1^2 & 0 & 0 \\ 0 & \ddots & 0 \\ 0 & 0 & \sigma_m^2 \end{bmatrix} U^\top \end{align} \,\\ \] The orthonormal eigenvectors of \(AA^\top\) are the left-singular vectors U.

4.5.3 Eigenvalue Decompsition vs. Singular Valude Decomposition

Let us consider the eigendecomposition \(A = PDP^{-1} \) and the SVD \(A = U\Sigma V^\top \) and review the core elements of the past sections.

The SVD always exists for any matrix \(\mathbb{R}^{m \times n} \). The eigendecomposition is only defined for square matrices \(\mathbb{R}^{n \times n} \) and only exists if we can find a basis of eigenvectors of \(\mathbb{R}^n\).
The vectors in the eigendecomposition matrix \(P\) are not necessarily orthogonal, i.e., the change of basis is not a simple rotation and scaling. On the other hand, the vectors in the matrices \(U\) and \(V\) in the SVD are orthonormal, so they do represent rotations.
In the SVD, the left- and right-singular vector matrices \(U\) and \(V\) are generally not inverse of each other (they perform basis changes in different vector spaces). In the eigendecomposition, the basis change matrices \(P\) and \(P^{-1} \) are inverses of each other.
In the SVD, the entries in the diagonal matrix \(\Sigma\) are all real and nonnegative.
For symmetric matrices, the eigendecomposition and the SVD are one and the same.
The SVD and the eigendecomposition are closely related through their projections
- The left-singular vectors of \(A\) are eigenvectors of \(AA^\top\)
- The right-singular vectors of \(A\) are eigenvectors of \(A^\topA\)
- The nonzero singular values of A are the square roots of the nonzero eigenvalues of both \(AA^\top\) and \(A^\top A\).