Linear Algebra for Programmers (프로그래머를 위한 선형대수)
When thinking of data as space, intuition works 💡
We live in a three-dimensional space. To handle problems in a 3D world, we need terms that describe this 'space' well.
Examples include computer graphics, car navigation, and games.
The vector spaces in linear algebra abstract certain properties of our real space. Therefore, linear algebra provides convenient terms and concepts to describe space.
For instance, if you're thinking about 'how to draw a 3D object on a 2D plane,' you encounter the problem of 'what 2D image appears when you move and rotate the viewpoint in 3D space.'
Linear algebra plays a fundamental role in solving such problems.
However, we don't learn linear algebra just to solve problems in space.
You'll often want to handle data composed of multiple numbers, not just single values.
These cases might not be directly related to 'space,' so you can handle them without explicitly thinking about 'space.'
But if you interpret this data as 'points in high-dimensional space,' you can use your
spatial intuition.
Although we can only recognize three-dimensional space, many 'general n-dimensional phenomena' can be intuitively understood by
reasoning in three-dimensional space.
This interpretation is effective in data analysis. When it comes to 'space' problems, linear algebra steps in.
\[
Data = \begin{bmatrix}
-4.310345 & 4.655172 & -0.517241 & -1.551724 & 3.965517 & 3.275862 & 4.310345 & \cdots \\
-4.310345 & -4.310345 & -4.310345 & -4.310345 & -4.310345 & -4.310345 & -4.310345 & \cdots \\
0.701754 & 4.350436 & 2.368040 & 4.004554 & 2.602387 & 3.394398 & 3.603163 & \cdots \\
\end{bmatrix}
\]
Thinking of \( Data \) as points in 3D space makes the meaning of the data easier to understand.
From the left, Data points · Data points projection · Intuitive understanding of data shape
Vector and Space
🧬 Basis
A basis is a set of vectors that defines a
coordinate system for a vector space. By choosing a set of basis vectors, we can specify the position of any vector within that space relative to these basis vectors.
For example, consider the standard basis vectors \( e_{1} \) and \( e_{2} \) in \( \mathbb{R}^2 \).
These vectors form a convenient coordinate system. Any vector \( v \) in this space can be expressed as a
linear combination of these basis vectors.
For example, \( v_{1} \) below is decomposed into 2 units of \( e_{1} \) and 3 units of \( e_{2} \).
\[
e_{1} = [1, 0]^{T} \quad e_{2} = [0, 1]^{T}
\\
\\
v_{1} = [2, 3]^{T} = 2e_{1} + 3e_{2} \quad v_{2} = [3.3, 1.3]^{T} = 3.3e_{1} + 1.3e_{2}
\]
Standard basis vectors in \( \mathbb{R}^2 \)
Basis vectors do not need to be
orthogonal or have unit length. Let's consider a different set of basis vectors in \( \mathbb{R}^2 \)!
With the specific basis below, we can express any vector in terms of these new basis vectors. Here, \( v_{2} \) is decomposed into 1.2 units of \( e_{1} \) and 2.7 units of \( e_{2} \).
\[
e_{1} = [1, -0.1]^{T} \quad e_{2} = [1, 1]^{T}
\\
\\
v_{1} = [4, 1.8]^{T} = 2e_{1} + 2e_{2} \quad v_{2} = [3.9, 2.58]^{T} = 1.2e_{1} + 2.7e_{2}
\]
Specific basis vectors in \( \mathbb{R}^2 \)
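The decomposition above can be checked numerically. Finding the coordinates of a vector in a non-standard basis amounts to solving a linear system whose columns are the basis vectors; here is a minimal NumPy sketch using the basis and \( v_{2} \) from the example above:

```python
import numpy as np

# Columns of E are the (non-orthogonal) basis vectors e1 and e2 from the example.
E = np.array([[1.0, 1.0],
              [-0.1, 1.0]])

v2 = np.array([3.9, 2.58])

# Solving E @ c = v2 gives the coordinates of v2 in this basis.
c = np.linalg.solve(E, v2)
print(c)  # → [1.2 2.7]
```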
🧬 Conditions for being a Basis
In linear algebra, a basis for a vector space is a set of vectors that satisfy two fundamental conditions:
-
Spanning the Vector Space:
A set of vectors spans a vector space if every vector in that space can be expressed as a
linear combination of the vectors in the set.
This means that any vector in the space can be reached by scaling and adding the basis vectors.
-
Linear Independence:
The vectors in the set must be linearly independent, which means that no vector in the set can be written as a linear combination of the others.
Put simply, each vector contributes a
unique dimension to the space that cannot be duplicated by any combination of the other vectors.
Imagine a vector space as a room, and the basis vectors as different directions you can travel within that room.
If you can move in every possible direction (
span), then the basis vectors cover the entire space.
If \( \{ v_{1}, v_{2}, \ldots, v_{n} \} \) is a set of vectors in a vector space \( V \), then \( V \) is spanned by these vectors if every vector \( v \in V \) can be expressed as \( v = c_{1}v_{1} + c_{2}v_{2} + \ldots + c_{n}v_{n} \), where \( c_{1}, c_{2}, \ldots, c_{n} \) are scalars.
Example of whether a set of vectors is a basis or not
From the left, ✔️ · ❌ · ❌ · ❌
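Both conditions can be checked at once: \( n \) vectors form a basis of \( \mathbb{R}^{n} \) exactly when the matrix having them as columns has full rank. A small NumPy sketch (the helper name `is_basis` is mine, not from the text):

```python
import numpy as np

# n vectors form a basis of R^n iff the matrix whose columns are those
# vectors is square and has full rank (spanning + linear independence).
def is_basis(*vectors):
    M = np.column_stack(vectors)
    return M.shape[0] == M.shape[1] and np.linalg.matrix_rank(M) == M.shape[0]

print(is_basis([1, 0], [0, 1]))          # independent and spanning
print(is_basis([1, 2], [2, 4]))          # collinear: not independent
print(is_basis([1, 0], [0, 1], [1, 1]))  # 3 vectors in R^2: not a basis
```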
🧬 Dimension
The number of elements (or coordinates) needed to describe a vector in a given space corresponds to the dimension of that space.
For instance, in 3-dimensional space, a vector is typically represented as (x, y, z), requiring three coordinates. Therefore,
Dimension = the number of basis vectors = the number of coordinates
Matrix and Mapping
🧬 Matrix is mapping
When an \( m \times n \) matrix \( A \) is multiplied by an \( n \)-dimensional vector \( x \),
an \( m \)-dimensional vector \( y = Ax \) is obtained.
In other words, specifying matrix \( A \) determines a mapping that transforms one vector into another.
This is the most important role of a matrix.
From now on, when you see a matrix, do not simply think of it as a collection of numbers,
but rather as a given
mapping.
If you can think of 'how the entire space changes', it will help you intuitively understand linear algebra.
Let's look at the animations below.
These animations depict the transformation of data and the standard basis vectors by matrix mappings.
Each animation uses the following matrix as its
transformation matrix. The entire space is mapped from the standard basis to \(A1\) and \(A2\) respectively:
\[
\text{A1} =
\begin{bmatrix}
1 & -0.7 \\
-0.3 & 0.6
\end{bmatrix}
\]
\[
\text{A2} =
\begin{bmatrix}
-1 & 0.3 \\
-0.3 & -1.2
\end{bmatrix}
\]
how the entire space changes
Let's look at matrix \( A1 \). It was mentioned above that a matrix is a mapping.
What \( A1 \) represents is the transformation of the existing space's basis into the basis represented by \( A1 \).
Therefore, \( [1, 0]^T \) is mapped to \( [1, -0.3]^T \), and \( [0, 1]^T \) is mapped to \( [-0.7, 0.6]^T \).
In other words, the first column of \( A1 \) represents the destination of \( [1, 0]^T \), and the second column represents the destination of \( [0, 1]^T \).
Imagining where each vector moves helps you visualize the form of the mapping.
In summary, an \(m \times n\) matrix \(A\) represents a mapping that moves an \(n\)-dimensional space to an \(m\)-dimensional space.
The first column of \(A\) represents the
destination of \(e_1 = (1, 0, 0, \ldots)^T\), and the second column of \(A\) represents the destination of \(e_2 = (0, 1, 0, \ldots)^T\).
The matrix \( A \) accepts vectors from an \( n \)-dimensional space as inputs and transforms them into vectors in an \( m \)-dimensional space.
For example, if we have an \( n \)-dimensional vector \( x \), the
matrix-vector product \( Ax \) results in an \( m \)-dimensional vector.
\[
\begin{bmatrix}
1 & 0 \\
0 & 1
\end{bmatrix}
\rightarrow
\begin{bmatrix}
2 & 1 \\
1 & 3
\end{bmatrix}
\]
\[
\begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{bmatrix}
\rightarrow
\begin{bmatrix}
2 & 9 & 4 \\
7 & 5 & 3 \\
6 & 8 & 1
\end{bmatrix}
\]
\[
\vdots
\]
\[
\begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
0 & 0 & 1 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1
\end{bmatrix}
\rightarrow
\begin{bmatrix}
a_{11} & a_{12} & a_{13} & \cdots & a_{1n} \\
a_{21} & a_{22} & a_{23} & \cdots & a_{2n} \\
a_{31} & a_{32} & a_{33} & \cdots & a_{3n} \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
a_{n1} & a_{n2} & a_{n3} & \cdots & a_{nn}
\end{bmatrix}
\]
Destinations of the standard basis vectors
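The 'columns are destinations' picture is easy to verify numerically. A quick NumPy check using the matrix \( A1 \) from above:

```python
import numpy as np

A1 = np.array([[1.0, -0.7],
               [-0.3, 0.6]])

e1 = np.array([1.0, 0.0])
e2 = np.array([0.0, 1.0])

# Each standard basis vector is sent to the corresponding column of A1.
print(A1 @ e1)  # → [ 1.  -0.3]  (first column)
print(A1 @ e2)  # → [-0.7  0.6]  (second column)
```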
🧬 Matrix multiplication
Multiplication between matrices is a combination of mappings.
If there are matrices \( B \in \mathbb{R}^{m \times n} \) and \( A \in \mathbb{R}^{n \times k} \),
their product \( BA \) is defined as follows:
\[
\begin{bmatrix}
b_{11} & \cdots & b_{1n} \\
\vdots & & \vdots \\
b_{m1} & \cdots & b_{mn}
\end{bmatrix}
\begin{bmatrix}
a_{11} & \cdots & a_{1k} \\
\vdots & & \vdots \\
a_{n1} & \cdots & a_{nk}
\end{bmatrix}
\]
\[
=
\begin{bmatrix}
b_{11}a_{11} + \cdots + b_{1n}a_{n1} & \cdots & b_{11}a_{1k} + \cdots + b_{1n}a_{nk} \\
\vdots & & \vdots \\
b_{m1}a_{11} + \cdots + b_{mn}a_{n1} & \cdots & b_{m1}a_{1k} + \cdots + b_{mn}a_{nk}
\end{bmatrix}
\]
\[
\begin{bmatrix}
2 & 7 \\
9 & 5 \\
4 & 3
\end{bmatrix}
\begin{bmatrix}
1 & 3 \\
2 & -1
\end{bmatrix}
=
\begin{bmatrix}
(2 \cdot 1 + 7 \cdot 2) & (2 \cdot 3 + 7 \cdot -1) \\
(9 \cdot 1 + 5 \cdot 2) & (9 \cdot 3 + 5 \cdot -1) \\
(4 \cdot 1 + 3 \cdot 2) & (4 \cdot 3 + 3 \cdot -1)
\end{bmatrix}
=
\begin{bmatrix}
16 & -1 \\
19 & 22 \\
10 & 9
\end{bmatrix}
\]
What we need to pay attention to here is the dimension of the matrix multiplication result.
The matrix multiplication result in the above case where multiplying a \( 3 \times 2 \) matrix by a \( 2 \times 2 \) matrix
has dimensions of \( 3 \times 2 \).
In other words, the product of an \( m \times n \) matrix and an \( n \times k \) matrix is an \( m \times k \) matrix.
Next, let's think about the multiplication of three or more matrices.
In the matrix multiplication, the
associative law is valid.
Therefore,
\[
C(BA) = (CB)A = CBA
\]
\[
D(C(BA)) = D((CB)A) = (D(CB))A = ((DC)B)A = ((DC)(BA)) = (DCBA)
\]
The order of applying matrix multiplication is right-to-left.
In the above, \( CBA \) means that \( BA \) is computed first, and then the result is multiplied by \( C \); as a mapping, \( A \) is applied first, then \( B \), then \( C \).
But in matrix multiplication, the
commutative law does not hold. In general, \( AB \neq BA \).
Depending on the sizes of matrices \( A \) and \( B \), the multiplication may not be defined.
For example, consider two matrices where \( A \) is a \( 2 \times 3 \) matrix and \( B \) is a \( 3 \times 5 \) matrix.
\[
\begin{bmatrix}
* & * & * \\
* & * & *
\end{bmatrix}
\begin{bmatrix}
* & * & * & * & * \\
* & * & * & * & * \\
* & * & * & * & *
\end{bmatrix}
=
\begin{bmatrix}
* & * & * & * & * \\
* & * & * & * & *
\end{bmatrix}
\]
\[
\begin{bmatrix}
* & * & * & * & * \\
* & * & * & * & * \\
* & * & * & * & *
\end{bmatrix}
\begin{bmatrix}
* & * & * \\
* & * & *
\end{bmatrix}
\rightarrow ❌
\]
Even if both \( AB \) and \( BA \) are defined, the results are generally different.
Let's think about the following matrices:
\[
A = \begin{bmatrix}
0 & -1
\\ 1 & 0
\end{bmatrix}
B = \begin{bmatrix}
2 & 0
\\ 0 & 1
\end{bmatrix}
\]
Different results of matrix multiplication for \( AB \) and \( BA \)
\[
AB = \begin{bmatrix}
0 & -1 \\
2 & 0
\end{bmatrix}
BA = \begin{bmatrix}
0 & -2 \\
1 & 0
\end{bmatrix}
\]
The matrix \(A\)
rotates the space by 90 degrees relative to the standard basis,
and matrix \(B\)
extends the space horizontally relative to the standard basis.
Therefore, the mapping \(AB\) first extends the space horizontally and then rotates it 90 degrees.
On the other hand, the mapping \(BA\) first rotates the space 90 degrees and then extends it horizontally.
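A quick NumPy sketch of this non-commutativity, using the rotation \(A\) and the horizontal stretch \(B\) from above:

```python
import numpy as np

A = np.array([[0, -1],
              [1,  0]])  # 90-degree rotation
B = np.array([[2, 0],
              [0, 1]])   # horizontal stretch

print(A @ B)  # stretch first, then rotate → [[0 -1] [2 0]]
print(B @ A)  # rotate first, then stretch → [[0 -2] [1 0]]
```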
🧬 Zero matrix,
Identity matrix,
Diagonal matrix
The matrix consisting of all zeros is called the
zero matrix.
\[
O_{2, 3} = \begin{bmatrix}
0 & 0 & 0 \\
0 & 0 & 0
\end{bmatrix}
O_{3} = \begin{bmatrix}
0 & 0 & 0 \\
0 & 0 & 0 \\
0 & 0 & 0
\end{bmatrix}
\]
The zero matrix mapping moves all elements in the vector space to the origin.
This is because for any vector \( x \), \( Ox = 0 \).
The zero matrix mapping process
\[
\begin{bmatrix}
1 & 0 \\
0 & 1
\end{bmatrix}
\rightarrow
\begin{bmatrix}
0 & 0 \\
0 & 0
\end{bmatrix}
\]
Among square matrices, the
identity matrix is the matrix whose diagonal elements are all ones and whose remaining elements are all zeros.
The identity matrix is expressed as \( I \).
\[
I_{2} = \begin{bmatrix}
1 & 0 \\
0 & 1
\end{bmatrix}
I_{3} = \begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{bmatrix}
I_{n} = \begin{bmatrix}
1 & 0 & 0 & \cdots & 0 \\
0 & 1 & 0 & \cdots & 0 \\
\vdots & \vdots & \vdots & \ddots & \vdots \\
0 & 0 & 0 & \cdots & 1
\end{bmatrix}
\]
The identity matrix \( I \) represents the transformation that does not change the vector space.
In other words, the identity matrix moves any vector \( x \) to the same vector \( x \).
This is because for any vector \( x \), \( Ix = x \).
The identity matrix mapping process
There is no change in the vector space
\[
\begin{bmatrix}
1 & 0 \\
0 & 1
\end{bmatrix}
\rightarrow
\begin{bmatrix}
1 & 0 \\
0 & 1
\end{bmatrix}
\]
In a square matrix, all elements located at the diagonal are called the
diagonal elements.
If the non-diagonal elements are all 0, this matrix is called a
diagonal matrix.
\[
\begin{bmatrix}
2 & 0 \\
0 & 5
\end{bmatrix}
\begin{bmatrix}
-1.3 & 0 & 0 \\
0 & \sqrt{7} & 0 \\
0 & 0 & 1/\pi
\end{bmatrix}
\begin{bmatrix}
3 & 0 & 0 & 0 & 0 \\
0 & 1 & 0 & 0 & 0 \\
0 & 0 & 4 & 0 & 0 \\
0 & 0 & 0 & 1 & 0 \\
0 & 0 & 0 & 0 & 5
\end{bmatrix}
\]
The mapping for the diagonal matrix is
scaling.
Each diagonal element is a scaling factor for the corresponding basis vector.
The mapping process for the diagonal matrix
\[
\begin{bmatrix}
1 & 0 \\
0 & 1
\end{bmatrix}
\\
\\
\downarrow
\\
\\
\begin{bmatrix}
0.7 & 0 \\
0 & 1.5
\end{bmatrix}
\begin{bmatrix}
0 & 0 \\
0 & 1.5
\end{bmatrix}
\begin{bmatrix}
-0.7 & 0 \\
0 & -1.5
\end{bmatrix}
\]
In figure 2 above, where \( D \) has a 0 diagonal element, the vector space is
flattened.
In figure 3, where \( D \) has negative diagonal elements, the vector space is
reflected.
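A small NumPy sketch of the three diagonal cases (scaling, flattening, reflection); the diagonal values mirror the figure above:

```python
import numpy as np

D_scale = np.diag([0.7, 1.5])      # scale x by 0.7, y by 1.5
D_flat = np.diag([0.0, 1.5])       # a zero factor flattens the space
D_reflect = np.diag([-0.7, -1.5])  # negative factors reflect it

v = np.array([1.0, 1.0])
print(D_scale @ v)    # → [0.7 1.5]
print(D_flat @ v)     # → [0.  1.5]
print(D_reflect @ v)  # → [-0.7 -1.5]
```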
🧬 Inverse matrix
The
inverse matrix is the matrix that undoes the mapping performed by a transformation matrix \(A\).
For a square matrix \(A\), the matrix corresponding to the inverse mapping is called the inverse matrix of \(A\).
It is denoted as \(A^{-1}\).
For a vector \(x\), if \(Ax = y\), then \(A^{-1}y = x\); conversely, for a vector \(y\), if \(A^{-1}y = x\), then \(Ax = y\).
This concept can be expressed as:
\[
x \overset{A}{\xrightarrow{\hspace{0.7cm}}} y
\\
x \underset{A^{-1}}{\xleftarrow{\hspace{0.7cm}}} y
\]
In other words, applying \(A\) and then \(A^{-1}\) returns every vector to where it started, and the same holds for \(A^{-1}\) followed by \(A\).
That is, \(A^{-1}A = AA^{-1} = I\).
By leveraging this concept, we can construct a
local coordinate system.
An inverse matrix may or may not exist; when it exists, it is unique. A matrix whose inverse exists is called regular / invertible / nonsingular; otherwise it is singular / noninvertible.
Thinking intuitively, no inverse matrix exists when the space is flattened.
Imagine a 2D space where a transformation (matrix \(A\)) maps two different points (e.g., \(x_1\), \(x_2\)) to the same point \(y\).
The space is "flattened," meaning the transformation is not bijective.
In other words, flattening the space means one output corresponds to multiple inputs, so the input cannot be recovered uniquely.
Visualization for the existence of inverse matrix
\[
A = \begin{bmatrix}
1 & 0.2 \\
0.5 & -1.5
\end{bmatrix}
A^{-1} = \begin{bmatrix}
0.9375 & 0.125 \\
0.3125 & -0.625
\end{bmatrix}
\\
\\
A = \begin{bmatrix}
2 & -1 \\
1 & -0.5
\end{bmatrix}
A^{-1} = \color{red}{\text{np.linalg.LinAlgError}}
\]
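Both cases above can be reproduced with NumPy: `np.linalg.inv` returns the inverse of a regular matrix and raises `LinAlgError` for a singular one.

```python
import numpy as np

A = np.array([[1.0, 0.2],
              [0.5, -1.5]])
A_inv = np.linalg.inv(A)
print(np.allclose(A_inv @ A, np.eye(2)))  # → True

# A matrix that flattens the space (det = 0) has no inverse.
S = np.array([[2.0, -1.0],
              [1.0, -0.5]])
try:
    np.linalg.inv(S)
except np.linalg.LinAlgError as err:
    print("no inverse:", err)
```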
🧬 Transposed matrix
For a matrix \(A\), the
transposed matrix flips rows and columns; it is denoted as \(A^{T}\).
In other words, the \(A_{ij}\) element is located at the \(A^{T}_{ji}\) position.
\[ B = \begin{bmatrix} 2 & 9 & 4 \\ 7 & 5 & 3 \end{bmatrix} \rightarrow B^{T} = \begin{bmatrix} 2 & 7 \\ 9 & 5 \\ 4 & 3 \end{bmatrix} \]
\[ A = \begin{bmatrix} 1.2 & \color{green}{-0.8} \\ \color{red}{0.3} & 1.7 \end{bmatrix} \rightarrow A^{T} = \begin{bmatrix} 1.2 & \color{red}{0.3} \\ \color{green}{-0.8} & 1.7 \end{bmatrix} \]
Visualization for the transformation matrix \(A^{T}\)
For a diagonal matrix \(D\), the transposed matrix is the same as the original matrix.
It has nonzero entries only on the diagonal, where the \(ij\) and \(ji\) indices coincide.
\[
D^{T} = D
\\
\\
D = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix} \rightarrow D^{T} = \begin{bmatrix} 1 & 0 & 0 \\ 0 & 1 & 0 \\ 0 & 0 & 1 \end{bmatrix}
\]
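A quick NumPy check of the transpose, using the matrix \(B\) from above:

```python
import numpy as np

B = np.array([[2, 9, 4],
              [7, 5, 3]])
print(B.T)  # rows and columns flipped: shape (2, 3) becomes (3, 2)

# The (i, j) element of B is the (j, i) element of B.T.
print(B[0, 1], B.T[1, 0])  # → 9 9
```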
Affine Transformation
🧬 Affine transformation
An affine transformation is a combination of a
linear transformation and a
translation.
Mathematically, an affine transformation of a vector \(x\) can be represented as:
\[
y = Ax + b
\]
where \(A\) is a matrix representing the linear transformation, and \(b\) is a vector representing the translation.
The importance of affine transformations lies in their ability to not only rotate, scale, and shear data, as linear transformations do, but also to translate it. This makes affine transformations extremely powerful and versatile in mapping and manipulating data in space.
In the context of neural networks, affine transformations are used in the layers of the network as: \(y = Wx + b\).
Each layer applies an affine transformation followed by a
non-linear activation function.
By combining linear and non-linear functions, neural networks can learn rich representations of the data.
Let's take a look at how the original data can be transformed using an affine transformation with the hyperbolic tangent activation function \(\sigma = \tanh\). The transformation can be expressed as:
\[
z = \sigma(Wx + b)
\]
For example, assume the following values for \(A(=W)\), \(b\), and \(x\), where \(x\) is a set of 2D points:
\[
A = \begin{bmatrix}
0.5 & 1 \\
1 & 0.5
\end{bmatrix}
b = \begin{bmatrix}
-0.9 \\
0.9
\end{bmatrix}
x = \text{circular points}
\]
In the provided figure, you can see the effect of an affine transformation on the original data.
The first plot shows the original data points, the second plot shows the data after a linear transformation \(Ax\),
and the third plot shows the result of applying the affine transformation with \(Ax + b\).
The last plot shows the result of applying the affine transformation with \(\sigma(Ax + b)\).
The thing to notice is that the data after the affine transformation followed by the non-linear function is
separable by a single straight line.
In other words, a neural network can learn this data by applying such a transformation followed by a non-linear function.
Visualization from \(Ax\) to \(\sigma(Ax + b)\)
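The pipeline from \(Ax\) to \(\sigma(Ax + b)\) can be sketched in a few lines of NumPy; the circle of sample points stands in for the "circular points" mentioned above:

```python
import numpy as np

A = np.array([[0.5, 1.0],
              [1.0, 0.5]])
b = np.array([-0.9, 0.9])

# x: points on the unit circle, one point per column.
theta = np.linspace(0, 2 * np.pi, 100)
x = np.stack([np.cos(theta), np.sin(theta)])  # shape (2, 100)

linear = A @ x                # rotate / scale / shear
affine = linear + b[:, None]  # ... plus translation
z = np.tanh(affine)           # non-linear squashing into (-1, 1)
print(z.shape)  # → (2, 100)
```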
Determinant
🧬 Determinant = magnification
Let's assume that the following \(2 \times 2\) matrix is given, shown with concrete values and in symbolic form.
\[
A = \begin{bmatrix}
1.5 & 0 \\
0 & 0.5
\end{bmatrix}
A = \begin{bmatrix}
a & b \\
c & d
\end{bmatrix}
\]
We already have an intuition for the transformation such a simple matrix performs.
The matrix \(A\) scales the x-axis by \(1.5\) and the y-axis by \(0.5\).
Since no axes are inclined, the unit square becomes a rectangle with a width of \(1.5\) and a height of \(0.5\).
For a 2D square matrix, the determinant can be calculated as \(ad - bc\).
In this case, \(ad - bc = 1.5 \times 0.5 - 0 \times 0 = 0.75\).
This means that the determinant tells us the
amount of magnification after the transformation.
Next, assume that the following 2D square matrix with inclined axes is given.
Its determinant can still be calculated with the \(ad - bc\) formula above.
\[
A = \begin{bmatrix}
1 & -0.3 \\
-0.7 & 0.6
\end{bmatrix}
\mathbf{det}A = ad - bc = 0.39
\]
The matrix \(A\) transforms the space as follows and the determinant of \(A\) is calculated as:
Determinant of \(A\) = an area of a parallelogram
This property is also applicable to 3D space. In 3D space, the determinant can be interpreted as a volume of a parallelepiped.
\[
A = \begin{bmatrix}
1 & 0 & 0.2 \\
0 & 1 & 0.4 \\
0 & 0 & 0.9
\end{bmatrix}
\mathbf{det}A = \text{volume of a parallelepiped} = 0.9
\]
Determinant of \(A\) = volume of a parallelepiped
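Both determinants can be checked with `np.linalg.det`; the values match the area and volume computed above, up to floating-point noise:

```python
import numpy as np

A = np.array([[1.0, -0.3],
              [-0.7, 0.6]])
print(np.linalg.det(A))   # ad - bc = 0.6 - 0.21, i.e. about 0.39

A3 = np.array([[1.0, 0.0, 0.2],
               [0.0, 1.0, 0.4],
               [0.0, 0.0, 0.9]])
print(np.linalg.det(A3))  # volume of the parallelepiped, about 0.9
```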
If the geometry is mirrored, the determinant has a
negative sign.
You can get an intuition for this by seeing the following cases where \(A\), \(B\), \(C\), and \(D\) are given.
Here, one thing to note is that rotation and reflection are different.
\[
A = \begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{bmatrix}
B = \begin{bmatrix}
1 & 0 & 0 \\
0 & 1.3 & 0 \\
0 & 0 & -1
\end{bmatrix}
\\
\\
C = \begin{bmatrix}
-1 & 0 & 0 \\
0 & 0.6 & 0 \\
0 & 0 & 1.2
\end{bmatrix}
D = \begin{bmatrix}
-1.5 & 0 & 0 \\
0 & -0.7 & 0 \\
0 & 0 & 1
\end{bmatrix}
\]
Determinants of \(A\), \(B\), \(C\), and \(D\)
As I mentioned above, the determinant is the volume of a parallelepiped.
Therefore a transformation matrix that
flattens the space has a determinant of 0.
Let's look at the following figure to understand how the determinant changes when the space is flattened.
Change of \(\mathbf{det}A\). From the left,
\[
A = \begin{bmatrix}
-1 & 0 & 0\\
0 & 1 & 0\\
0 & 0 & 1
\end{bmatrix}
A = \begin{bmatrix}
1 & 0 & 0\\
0 & 1 & 0\\
0 & 0 & -1
\end{bmatrix}
\]
🧬 Properties of a determinant
The 'Determinant = magnification' perspective gives us \(\mathbf{det}I = 1\), and
previously, I mentioned that the inverse matrix undoes the transformation matrix.
From these properties, \(\mathbf{det}A \cdot \mathbf{det}A^{-1} = 1 \). In other words, \(\mathbf{det}A^{-1} = 1/\mathbf{det}A\).
We can also see that if \(\mathbf{det}A = 0\), then \(A^{-1}\) does not exist.
For a diagonal matrix, \(\mathbf{det}(\mathbf{diag}(a_{1}, a_{2}, \ldots, a_{n})) = a_{1} \cdot a_{2} \cdot \ldots \cdot a_{n}\).
This is because the diagonal matrix is a transformation matrix that scales the axes.
If the columns of \(A\) are not linearly independent, the determinant of \(A\) is also 0.
The determinant has the property that its value does not change even if a scalar multiple of one column(or row) is added to another.
Let me assume there are two matrices, \(A\) and \(B\).
The matrix \(B\) is obtained by multiplying the second column of \(A\) by a scalar \(s\) and then adding it to the third column of \(A\).
\[
s = 1.3
\\
\\
A = \begin{bmatrix}
1 & 0.5 & -0.3 \\
0 & 0.7 & -0.3 \\
0 & 0 & 1.5
\end{bmatrix}
B = \begin{bmatrix}
1 & 0.5 & -0.3 + 0.5 \cdot s \\
0 & 0.7 & -0.3 + 0.7 \cdot s \\
0 & 0 & 1.5 + 0 \cdot s
\end{bmatrix}
\]
A visualization of the two matrices is the following.
Adding a scalar multiple of one column to another shears the parallelepiped without changing its base or height.
So, the determinant of \(B\) is the same as the determinant of \(A\).
The matrices \(A\) and \(B\) have the same determinant
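A quick NumPy check of this shearing property, using \(A\) and \(s\) from the example above:

```python
import numpy as np

s = 1.3
A = np.array([[1.0, 0.5, -0.3],
              [0.0, 0.7, -0.3],
              [0.0, 0.0, 1.5]])

B = A.copy()
B[:, 2] += s * B[:, 1]  # add s times the second column to the third

# The shear leaves the determinant unchanged.
print(np.linalg.det(A), np.linalg.det(B))  # both about 1.05
```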
A matrix with a shape like \(A\) above is called an
upper triangular matrix.
The upper triangular matrix has all 0s below the diagonal elements.
If the matrix is an upper triangular matrix, the determinant is the product of the diagonal elements.
This follows because the volume of a parallelepiped is the
product of the height and the area of the bottom face,
and the determinant of a matrix is the volume (in 2D, the area) of a parallelepiped.
This property can be applied to a lower triangular matrix as well.
\[
A = \begin{bmatrix}
1 & 0.5 & -0.3 \\
0 & 0.7 & -0.3 \\
0 & 0 & 1.5
\end{bmatrix}
A = \begin{bmatrix}
a_{11} & a_{12} & a_{13} \\
0 & a_{22} & a_{23} \\
0 & 0 & a_{33}
\end{bmatrix}
\\
\\
\mathbf{det}A = a_{11} \cdot a_{22} \cdot a_{33}
\]
For an \(n\)-dimensional square matrix \(A\), if each element of the matrix is multiplied by a scalar \(s\),
the determinant of the resulting matrix is \(s^{n} \cdot \mathbf{det}A\).
That is, \(\mathbf{det}(sA) = s^{n} \cdot \mathbf{det}A\).
This is because all the columns are multiplied by a scalar \(s\).
The following example gives an intuition for this property.
\[
s = 2
\\
\\
A = \begin{bmatrix}
0.5 & 0 & 0 \\
0 & 0.5 & 0 \\
0 & 0 & 0.5
\end{bmatrix}
B = \begin{bmatrix}
0.5 & 0 & 0 \\
0 & 0.5 & 0 \\
0 & 0 & 0.5 \cdot s
\end{bmatrix}
\\
\\
C = \begin{bmatrix}
0.5 & 0 & 0 \\
0 & 0.5 \cdot s & 0 \\
0 & 0 & 0.5 \cdot s
\end{bmatrix}
D = \begin{bmatrix}
0.5 \cdot s & 0 & 0 \\
0 & 0.5 \cdot s & 0 \\
0 & 0 & 0.5 \cdot s
\end{bmatrix}
\]
Visualization of the property for \(s^{n} \cdot \mathbf{det}A\)
\(\mathbf{det}A = 0.5 \cdot 0.5 \cdot 0.5 = 0.125\)
\(\mathbf{det}B = 0.5 \cdot 0.5 \cdot (0.5 \cdot s) = 0.250\)
\(\mathbf{det}C = 0.5 \cdot (0.5 \cdot s) \cdot (0.5 \cdot s) = 0.500\)
\(\mathbf{det}D = (0.5 \cdot s) \cdot (0.5 \cdot s) \cdot (0.5 \cdot s) = 1.000\)
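The \(s^{n}\) scaling behavior is easy to verify numerically. A small NumPy check with the matrix \(A\) from the example (here \(n = 3\)):

```python
import numpy as np

s = 2.0
A = np.diag([0.5, 0.5, 0.5])

# Multiplying every element by s scales the determinant by s**n (n = 3).
print(np.linalg.det(A))      # 0.5**3 = 0.125
print(np.linalg.det(s * A))  # 2**3 * 0.125 = 1.0
```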
Rank
🧬 Kernel
For an input vector \(x = (x_{1}, ..., x_{n})^{T} \) and an output vector \(y = (y_{1}, ..., y_{m})^{T} \), what happens if the dimensions of \(x\) and \(y\) are different?
This is the problem of finding the \(n\) unknowns in \(x\) from the \(m\) clues in \(y\).
First, let's think about the case where the dimension of \(y\) is less than the dimension of \(x\) (\(m < n\)).
In the \(y = Ax\), the \(x\) has \(n\)-dimension and \(y\) has \(m\)-dimension, so the size of \(A\) is \(m \times n\).
In other words, \(A\) is wider than it is tall.
\[
y = Ax
\rightarrow
\begin{bmatrix}
* \\
*
\end{bmatrix}
=
\begin{bmatrix}
* & * & * \\
* & * & *
\end{bmatrix}
\begin{bmatrix}
* \\
* \\
*
\end{bmatrix}
\]
We must remember the 'Matrix is mapping' section again.
The matrix \(A\) like the above is a mapping to transform \(x\) in the 3-dimensional space to \(y\) in the 2-dimensional space.
Let me give an example:
\[
A = \begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0
\end{bmatrix}
\quad
x = \begin{bmatrix}
0 & 1 & 1 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 1 & 1 \\
0 & 0 & 0 & 0 & 1 & 1 & 1 & 1
\end{bmatrix}
\\
\\
y = Ax = \begin{bmatrix}
0 & 1 & 1 & 0 & 0 & 1 & 1 & 0 \\
0 & 0 & 1 & 1 & 0 & 0 & 1 & 1
\end{bmatrix}
\]
In this example, \(A\) is \(2 \times 3\) matrix and \(x\) is \(3 \times 8\) matrix.
So, the result \(Ax\) is a \(2 \times 8\) matrix; that is, the dimension of \(Ax\) is lower than the dimension of \(x\).
In the image below, you can see that matrix \(A\) maps higher dimensional data into a lower dimensional space.
The case that the dimension of \(A\) is lower than the dimension of \(x\)
From the left, \(x\) · \(Ax\) · \(\mathbf{Ker}A\)
\[
\mathbf{Ker}A
= \left\{
t \begin{bmatrix}
0 \\
0 \\
1
\end{bmatrix}
\;\middle|\; t \in \mathbb{R}
\right\}
\]
The
kernel of \(A\) is the set of vectors \(x\) where \(Ax = 0\), expressed as \(\mathbf{Ker}A\); that is, \(\mathbf{Ker}A = \left\{v \in V\text{ }|\text{ }Av = 0\right\}\).
A non-trivial kernel means information is
lost: everything in \(\mathbf{Ker}A\) is squashed to the origin.
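A small NumPy check that the vertical direction really lies in \(\mathbf{Ker}\,A\) for the projection matrix above:

```python
import numpy as np

# The projection that drops the third coordinate (the matrix A above).
A = np.array([[1.0, 0.0, 0.0],
              [0.0, 1.0, 0.0]])

# Every multiple of [0, 0, 1]^T is mapped to the zero vector,
# so that whole direction lies in Ker A: its information is lost.
v = np.array([0.0, 0.0, 1.0])
print(A @ v)        # → [0. 0.]
print(A @ (5 * v))  # → [0. 0.]
```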
🧬 Image
In the kernel section above, we covered the case where the output of \(A\) has a lower dimension than the vector \(x\).
This time, let's look at the opposite case, where the output of \(A\) has a higher dimension than the vector \(x\).
\[
y = Ax
\rightarrow
\begin{bmatrix}
* \\
* \\
*
\end{bmatrix}
=
\begin{bmatrix}
* & * \\
* & * \\
* & *
\end{bmatrix}
\begin{bmatrix}
* \\
*
\end{bmatrix}
\]
Once again, we must remember the 'Matrix is mapping' section.
Let me give an example where the matrix \(A\) is \(3 \times 2\) and \(x\) is \(2 \times 5\).
The matrix \(A\) maps lower dimensional data into higher dimensional space.
Let's see how the matrix \(A\) maps the vector \(x\) into a higher dimensional space:
\[
A = \begin{bmatrix}
1 & 0 \\
0 & 1 \\
0.2 & 1
\end{bmatrix}
x = \begin{bmatrix}
0 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 0
\end{bmatrix}
\\
\\
y = Ax = \begin{bmatrix}
0 & 1 & 1 & 0 & 0 \\
0 & 0 & 1 & 1 & 0 \\
0 & 0.2 & 1.2 & 1 & 0
\end{bmatrix}
\]
The case that the dimension of \(A\) is higher than the dimension of \(x\)
The 2D square on the left of the image is mapped into 3D space, but the number of basis vectors is still 2.
Therefore the space that can be spanned is still a 2D plane. In other words, it cannot cover the entire 3D space.
The
image \(\mathbf{Im}A\) is the set of vectors that \(A\) can actually reach, that is, the space spanned by the columns of \(A\).
The mapping from 1D to 3D behaves the same way. Let's see the case below where the matrix \(A\) is \(3 \times 1\) and \(x\) is \(1 \times 2\).
In this case, we can realize that the image of \(A\) is a straight line both before and after the transformation.
\[
A = \begin{bmatrix}
0.7 \\
0.3 \\
1.3
\end{bmatrix}
x = \begin{bmatrix}
2 & -2
\end{bmatrix}
\\
\\
y = Ax = \begin{bmatrix}
1.4 & -1.4 \\
0.6 & -0.6 \\
2.6 & -2.6
\end{bmatrix}
\]
The case that the dimension of \(A\) is higher than the dimension of \(x\)
Finally, let me describe the concept of the
image once again. For a matrix \(A \in \mathbb{R}^{m \times n}\), \(\mathbf{Im}\, A \) represents the
range of the transformation \(y = Ax\).
For \(A = (a_{1}, \, ..., \, a_{n}) \) and \(x = (x_{1}, \, ..., \, x_{n})^{T} \), the output \(y\) can be written as \(y = x_{1}a_{1} + \, \cdots \, + x_{n}a_{n} \).
As \(x\) varies, the set of values \(x_{1}a_{1} + \, \cdots \, + x_{n}a_{n}\) can take is exactly \(\mathbf{Im}\,A \): the set of vectors that can be created by linearly combining \(a_{1}, \, ..., \, a_{n} \).
This is expressed as \(\mathbf{span}\{a_{1}, \, ..., \, a_{n}\}\), and is called the linear
subspace created by the vectors \(a_{1}, \, ..., \, a_{n} \).
If all the \(a_{i}\) are 0, the span is just a point (the origin).
Similarly, if all the column vectors lie on a line, the span is that line; if they lie on a plane, the span is that plane.
🧬 Singular matrix
As we have seen, it is important for \(x\) and \(y\) to have the same dimension.
Then, if \(x\) and \(y\) have the same dimension, is everything okay in all cases?
The answer is no. Let's see the example below:
\[
A = \begin{bmatrix}
0.8 & -0.6 \\
0.4 & -0.3
\end{bmatrix}
\]
This \(2 \times 2\) matrix flattens the space (its second row is half its first row), so \(y\) cannot determine \(x\) uniquely.
Additionally, because the space is flattened, the outputs cannot cover the entire plane.
In other words, whether the properties of a matrix are good or bad cannot be determined based on its dimensions alone.
The keys are a kernel and an image.
🧬 Surjective, Injective, Bijective
To summarize the concepts above, the keys are as follows:
- Can \(\mathbf{Im}\, A\) cover the entire space?
- Is \(\mathbf{Ker}\, A\) only \(\{0\}\)? In other words, does each result \(y\) come from a unique \(x\)?
If the first condition is satisfied, mapping \(y = Ax\) is
surjective.
If the second condition is satisfied, mapping \(y = Ax\) is
injective.
If both conditions are satisfied, mapping \(y = Ax\) is
bijective.
🧬 Properties of Rank
As I mentioned above, the dimension of the
image \(\mathbf{Im}\, A\) is called the rank of a matrix \(A\).
It is expressed as \(\mathbf{rank}\, A\). By the dimension theorem (rank-nullity), \(\mathbf{dim}\, \mathbf{Ker} A + \mathbf{rank}\, A = n\).
The first condition mentioned earlier, 'Can \(\mathbf{Im}\, A\) cover the entire space?', asks whether \(\mathbf{Im}\, A\) has the full \(m\) dimensions of the output space of a transformation matrix \(A\).
The above concepts can be expressed with the rank as follows:
- \(\mathbf{rank}\, A = n \Leftrightarrow A\) is injective (the rank equals the dimension of \(x\))
- \(\mathbf{rank}\, A = m \Leftrightarrow A\) is surjective (the rank equals the dimension of the output space)
For an \(m \times n\) matrix \(A\), the properties below are intuitively natural.
This is because the dimension of \(x\) is \(n\) and the dimension of the output space is \(m\).
That is, \(\mathbf{dim}\, \mathbf{Im} A\) can be no bigger than \(m\) or \(n\).
\[
\mathbf{rank}\, A \leq m \quad
\mathbf{rank}\, A \leq n
\]
Additionally, if \(P\) and \(Q\) are regular (nonsingular, invertible) matrices, the rank does not change when multiplying \(A\) by them.
Since a regular matrix does not flatten the space, the rank of \(A\) is the same before and after the multiplication.
\[
\mathbf{rank}(PA) \, = \mathbf{rank}\, A \quad
\mathbf{rank}(AQ) \, = \mathbf{rank}\, A
\]
For general matrices \(A\) and \(B\), square or not, the properties below are intuitively natural.
This is because \(\mathbf{rank}(BA)\) arises in two steps:
- First, the input space is mapped by \(A\)
- Then, the transformed space is mapped by \(B\)
After the first step, the image has dimension \(\mathbf{rank}\, A\), and no subsequent transformation can increase that dimension.
Likewise, since the second step maps through \(B\), the resulting space cannot exceed \(\mathbf{rank}\, B\).
\[
\mathbf{rank}(BA) \, \leq \mathbf{rank}\, A \quad
\mathbf{rank}(BA) \, \leq \mathbf{rank}\, B
\]
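Both facts, invariance under regular matrices and the \(\mathbf{rank}(BA)\) bound, can be verified numerically. The NumPy sketch below uses illustrative values of my own choosing and zeroes one row of \(A\) to force \(\mathbf{rank}\, A = 2\):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
A[1] = 0.0                                    # zero a row: rank A drops to 2
P = np.array([[2.0, 0.0, 0.0],
              [1.0, 1.0, 0.0],
              [0.0, 0.0, 3.0]])               # regular: det = 6 != 0
B = rng.standard_normal((3, 3))               # generic matrix, almost surely rank 3

rank = np.linalg.matrix_rank
print(rank(A), rank(P @ A), rank(A @ P))      # regular factors preserve the rank
print(rank(B @ A) <= min(rank(A), rank(B)))   # True: BA cannot exceed either rank
```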
Let's look at the above concept intuitively with the following example:
\[
A' = \begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 0 \\
0 & 0 & 1
\end{bmatrix}
A = \begin{bmatrix}
1.2 & 0 & 0 \\
0 & 0 & 0 \\
0 & 0 & 1.2
\end{bmatrix}
B = \begin{bmatrix}
0.8 & 0 & 0 \\
0.5 & 0.7 & 0.8 \\
0 & 0 & 0.8
\end{bmatrix}
\]
Changes of the \(\mathbf{rank}\)
From the left, \(\mathbf{rank}\, A'\) · \(\mathbf{rank}\, A\) · \(\mathbf{rank}\, BA\)
In this example, the original space is 3-dimensional, with \(\mathbf{rank}\, A' = 3\).
When multiplying by the matrix \(A\), whose second column consists of zeros, the information along that direction is
lost. Consequently, the space is flattened to a 2-dimensional space.
Therefore, since \(\mathbf{rank}\, A\) drops to 2, \(\mathbf{rank}(BA)\) cannot be bigger than 2.
🧬 Extracting rank
In the
image section, we learned that the \(\mathbf{Im}\, A = \mathbf{span}\, \{a_{1}, \, ..., \, a_{n}\}\).
The dimension of this span is exactly \(\mathbf{rank}\, A\).
In other words, \(\mathbf{rank}\, A\) is the number of vectors that can serve as a
basis for the space represented by \(A\).
If the column vectors \(a_{1}, \, ..., \, a_{n}\) of \(A\) are linearly
independent, so the space is not flattened, then \(\mathbf{rank}\, A = n\).
Otherwise, \(\mathbf{rank}\, A < n\).
For example, there are the following matrices, and the space for the matrices is visualized as:
\[
A = \begin{bmatrix}
-0.5 & 0.75 & -1.5 \\
-0.5 & 0.75 & -1.5 \\
-0.5 & 0.75 & -1.5
\end{bmatrix}
\\
\\
B = \begin{bmatrix}
0.5 & -1 & -1.35 & 0.7\\
1.3 & -0.8 & -1.71 & -0.34\\
1 & 0 & -0.7 & -1
\end{bmatrix}
\\
\\
C = \begin{bmatrix}
0.5 & 1.5 & 0.7 & -0.75 & -2.25 \\
1 & -1.7 & 0.8 & -1.5 & 0.2 \\
0 & 0 & 1.1 & 0 & 0
\end{bmatrix}
\]
The number of the vectors that can be the basis
From the left, \(\mathbf{rank}\, A\) · \(\mathbf{rank}\, B\) · \(\mathbf{rank}\, C\)
The example above shows that the number of columns in a matrix is not necessarily equal to the number of basis vectors.
First of all, the matrix \(A\) has three column vectors \(a_{1}, a_{2}, a_{3}\), but it can also be expressed as \(A = [a_{1}, \, -a_{1} \cdot 1.5, \, a_{1} \cdot 3]\).
Therefore only one vector can serve as a basis: \(\mathbf{Im}\, A = \mathbf{span}\{a_{1},\, a_{2},\, a_{3}\}\) is spanned by \(a_{1}\) alone, so \(\mathbf{rank}\, A = 1\). Since the basis is a single vector, \(\mathbf{Im}\, A\) is a straight line.
Similarly, the second matrix \(B\) can be rewritten as \(B = [b_{1},\, b_{2},\ -b_{1} \cdot 0.7 + b_{2},\ -b_{2} \cdot 1.2 - b_{1}] \) where \(b_{1} = [0.5, 1.3, 1]^{T}, \, b_{2} = [-1, -0.8, 0]^{T}\).
In this case, \(\mathbf{Im}\, B = \mathbf{span}\{b_{1},\, b_{2},\, b_{3},\, b_{4}\}\) is spanned by \(b_{1}\) and \(b_{2}\), so \(\mathbf{rank}\, B = 2\). Because the rank is 2, \(\mathbf{Im}\, B\) is a plane.
For the matrix \(C\), \(C = [c_{1}, \, c_{2}, \, c_{3}, \, -c_{1} \cdot 1.5, \, -c_{2} + c_{4}] \) where \(c_{1} = [0.5, 1, 0]^{T}, \, c_{2} = [1.5, -1.7, 0]^{T}, \, c_{3} = [0.7, 0.8, 1.1]^{T} \).
In this case, \(\mathbf{Im}\, C = \mathbf{span}\{c_{1},\, c_{2},\, c_{3},\, c_{4},\, c_{5}\}\) is spanned by \(c_{1}, c_{2}, c_{3}\), so \(\mathbf{rank}\, C = 3\), and the image is parallelepiped-shaped, filling the whole 3-dimensional space.
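The three ranks can be confirmed numerically. This NumPy snippet (the library call is an illustrative choice; the matrices are the ones above) reproduces the values 1, 2, and 3:

```python
import numpy as np

A = np.array([[-0.5, 0.75, -1.5]] * 3)        # every row is the same: a line
B = np.array([[0.5, -1.0, -1.35,  0.7],
              [1.3, -0.8, -1.71, -0.34],
              [1.0,  0.0, -0.7,  -1.0]])      # two independent columns: a plane
C = np.array([[0.5,  1.5, 0.7, -0.75, -2.25],
              [1.0, -1.7, 0.8, -1.5,   0.2],
              [0.0,  0.0, 1.1,  0.0,   0.0]]) # three independent columns: full space

for name, M in [("A", A), ("B", B), ("C", C)]:
    print(name, np.linalg.matrix_rank(M))
# A 1
# B 2
# C 3
```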
This time, let's find out how to calculate the rank by hand.
We can determine the rank by leveraging the property that the rank remains unchanged after multiplying by a regular matrix.
The idea is to simplify the matrix with the following operations, none of which change its rank:
- Multiply a specific row (or column) by a constant \(C\) where \(C \neq 0\)
- Multiply a specific row (or column) by a constant \(C\) and add it to another row (or column)
- Swap two rows (or columns)
Each operation corresponds to multiplying \(A\) by an elementary matrix \(B\): performed on the rows, it is expressed as \(BA\).
Performed on the columns, it is expressed as \(AB\).
For example:
- Multiply the second column by 5:
\[
\begin{bmatrix}
2 & \color{purple}{\mathbf{3}} & 3 & 9 \\
3 & \color{purple}{\mathbf{4}} & 2 & 9 \\
-2 & \color{purple}{\mathbf{-2}} & 3 & 2
\end{bmatrix}
\cdot
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & \color{purple}{\mathbf{5}} & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
\end{bmatrix}
=
\begin{bmatrix}
2 & \color{purple}{\mathbf{15}} & 3 & 9 \\
3 & \color{purple}{\mathbf{20}} & 2 & 9 \\
-2 & \color{purple}{\mathbf{-10}} & 3 & 2
\end{bmatrix}
\]
- Multiply the second column by 10 and add it to the first column:
\[
\begin{bmatrix}
2 & \color{purple}{\mathbf{3}} & 3 & 9 \\
3 & \color{purple}{\mathbf{4}} & 2 & 9 \\
-2 & \color{purple}{\mathbf{-2}} & 3 & 2
\end{bmatrix}
\cdot
\begin{bmatrix}
1 & 0 & 0 & 0 \\
\color{purple}{\mathbf{10}} & 1 & 0 & 0 \\
0 & 0 & 1 & 0 \\
0 & 0 & 0 & 1 \\
\end{bmatrix}
=
\begin{bmatrix}
\color{purple}{\mathbf{32}} & 3 & 3 & 9 \\
\color{purple}{\mathbf{43}} & 4 & 2 & 9 \\
\color{purple}{\mathbf{-22}} & -2 & 3 & 2
\end{bmatrix}
\]
- Swap the second column and the fourth column:
\[
\begin{bmatrix}
2 & \color{purple}{\mathbf{3}} & 3 & \color{purple}{\mathbf{9}} \\
3 & \color{purple}{\mathbf{4}} & 2 & \color{purple}{\mathbf{9}} \\
-2 & \color{purple}{\mathbf{-2}} & 3 & \color{purple}{\mathbf{2}}
\end{bmatrix}
\cdot
\begin{bmatrix}
1 & 0 & 0 & 0 \\
0 & 0 & 0 & \color{purple}{\mathbf{1}} \\
0 & 0 & 1 & 0 \\
0 & \color{purple}{\mathbf{1}} & 0 & 0
\end{bmatrix}
=
\begin{bmatrix}
2 & \color{purple}{\mathbf{9}} & 3 & \color{purple}{\mathbf{3}} \\
3 & \color{purple}{\mathbf{9}} & 2 & \color{purple}{\mathbf{4}} \\
-2 & \color{purple}{\mathbf{2}} & 3 & \color{purple}{\mathbf{-2}}
\end{bmatrix}
\]
In fact, it's not necessary to explicitly write the transformation matrix when determining the rank.
It is important to understand that the rank does not change when any of the above operations is applied to rows or columns, and that each operation can be interpreted as multiplication on the left or the right.
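As a sanity check, the three operations above can be reproduced by building the elementary matrices explicitly. The NumPy sketch below (an illustrative encoding, not required by the hand method) right-multiplies the same \(3 \times 4\) example:

```python
import numpy as np

M = np.array([[ 2.0,  3.0, 3.0, 9.0],
              [ 3.0,  4.0, 2.0, 9.0],
              [-2.0, -2.0, 3.0, 2.0]])

E1 = np.eye(4)
E1[1, 1] = 5.0                  # scale the second column by 5
E2 = np.eye(4)
E2[1, 0] = 10.0                 # add 10 x (second column) to the first column
E3 = np.eye(4)[:, [0, 3, 2, 1]] # swap the second and fourth columns

print((M @ E1)[:, 1])           # [ 15.  20. -10.]
print((M @ E2)[:, 0])           # [ 32.  43. -22.]
print((M @ E3)[:, 1])           # [9. 9. 2.]

# Every elementary matrix is invertible, so none of these change the rank.
rank = np.linalg.matrix_rank
print(all(rank(M @ E) == rank(M) for E in (E1, E2, E3)))  # True
```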
Now, let's calculate the rank for the following matrix \(A\):
\(
A = \begin{bmatrix}
1 & 4 & 7 \\
2 & 5 & 8 \\
3 & 6 & 9
\end{bmatrix}
\quad \text{Multiply the first row by -2 and -3 and add to the second and third rows respectively}
\)
\(
\quad
\rightarrow
\begin{bmatrix}
1 & 4 & 7 \\
0 & -3 & -6 \\
0 & -6 & -12
\end{bmatrix} \quad \text{Multiply the first column by -4 and -7 and add to the second and third columns respectively}
\)
\(
\quad
\rightarrow
\begin{bmatrix}
1 & 0 & 0 \\
0 & -3 & -6 \\
0 & -6 & -12
\end{bmatrix} \quad \text{Multiply the second row by}\, -\frac{1}{3}
\)
\(
\quad
\rightarrow
\begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 2 \\
0 & -6 & -12
\end{bmatrix} \quad \text{Multiply the second row by 6 and add it to the third row}
\)
\(
\quad
\rightarrow
\begin{bmatrix}
1 & 0 & 0 \\
0 & 1 & 2 \\
0 & 0 & 0
\end{bmatrix} \quad \text{Multiply the second column by -2 and add it to the third column}
\)
\(
\quad
\rightarrow
\begin{bmatrix}
\color{purple}{\mathbf{1}} & 0 & 0 \\
0 & \color{purple}{\mathbf{1}} & 0 \\
0 & 0 & 0
\end{bmatrix} \quad \mathbf{rank}\, A = \color{purple}{\mathbf{2}}
\)
After repeating this process until only 1s and 0s remain on the diagonal, the
number of 1s on the diagonal is the rank.
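The same sweep can be automated. Below is a minimal sketch of a Gaussian-elimination rank function (the name `rank_by_elimination` and the tolerance are my own choices), applied to the matrix just reduced by hand:

```python
import numpy as np

def rank_by_elimination(M, tol=1e-10):
    """Rank via elementary row operations: count the pivots after sweeping."""
    M = np.array(M, dtype=float)
    rows, cols = M.shape
    rank, col = 0, 0
    while rank < rows and col < cols:
        pivot = np.argmax(np.abs(M[rank:, col])) + rank  # best pivot in this column
        if abs(M[pivot, col]) < tol:
            col += 1                                     # no pivot here, move on
            continue
        M[[rank, pivot]] = M[[pivot, rank]]              # swap rows
        M[rank] /= M[rank, col]                          # scale the pivot to 1
        for r in range(rows):
            if r != rank:
                M[r] -= M[r, col] * M[rank]              # clear the rest of the column
        rank += 1
        col += 1
    return rank

A = [[1, 4, 7], [2, 5, 8], [3, 6, 9]]
print(rank_by_elimination(A))  # 2, matching the hand calculation
```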
Eigenvalues and eigenvectors
🧬 Geometric interpretation
The geometric meaning of an eigenvector is that 'multiplying by \(A\) only stretches it; its direction does not change'.
For a square matrix \(A\), the \(\lambda\) and \(x\) that satisfy \(Ax = \lambda x\) with \(x \neq 0\) are called an eigenvalue and an eigenvector, respectively.
The following animation gives a geometric understanding of the
eigenvectors.
In the case of the matrix \( A \), you can observe that the eigenvectors are just stretched, and their directions remain unchanged.
The eigenvalue \(\lambda\) is the
expansion factor.
eigenvectors of \(A\), \(B\)
\[
A = \begin{bmatrix}
2 & -2 & 0 \\
0 & -0.5 & -1 \\
0 & -2 & 1.8
\end{bmatrix}
B = \begin{bmatrix}
2.3 & -0.8 \\
-0.2 & 1.5
\end{bmatrix}
\]
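The stretching behaviour can be checked directly: for each eigenpair of the example matrix \(A\), \(Ax\) should equal \(\lambda x\). A small sketch using `numpy.linalg.eig` (an illustrative choice of tool):

```python
import numpy as np

A = np.array([[2.0, -2.0,  0.0],
              [0.0, -0.5, -1.0],
              [0.0, -2.0,  1.8]])

# eig returns the eigenvalues and a matrix whose COLUMNS are the eigenvectors
eigvals, eigvecs = np.linalg.eig(A)
for lam, v in zip(eigvals, eigvecs.T):
    # multiplying by A only stretches v by the factor lam
    print(np.allclose(A @ v, lam * v))  # True for every eigenpair
```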
🧬 Properties of the eigenvalues and the eigenvectors
Assuming \(\lambda\) and \(x\) are an eigenvalue and an eigenvector of \(A\) respectively:
- The matrix \(A\) has 0 as an eigenvalue if and only if \(A\) is singular
- \(\alpha x\) is also an eigenvector with eigenvalue \(\lambda\) for any \(\alpha \neq 0\)
- The eigenvalues of a diagonal matrix are its diagonal elements
- The eigenvalues of an upper (or lower) triangular matrix are its diagonal elements
- The determinant of a matrix \(A\) is equal to the product of its eigenvalues
The equation \(Ax = \lambda x\) can be rewritten as \((A - \lambda I)x = 0\).
By this equation and the property of \(x \neq 0\), we can get an equation that \(\mathbf{det}(A - \lambda I) = 0\).
Suppose \(A - \lambda I\) had an inverse matrix, and multiply both sides by it:
\((A - \lambda I)^{-1} \cdot (A - \lambda I)x = (A - \lambda I)^{-1} \cdot 0 \).
This simplifies to \(I \cdot x = (A - \lambda I)^{-1} \cdot 0 \rightarrow x = 0 \).
Therefore, if \(\mathbf{det}(A - \lambda I) \neq 0\), the condition \(x \neq 0\) cannot be satisfied.
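A few of the listed properties can be spot-checked numerically on the \(2 \times 2\) matrix \(B\) from the example above (the NumPy calls are an illustrative choice):

```python
import numpy as np

A = np.array([[ 2.3, -0.8],
              [-0.2,  1.5]])   # matrix B from the eigenvector example
lam, V = np.linalg.eig(A)

# det A equals the product of the eigenvalues
print(np.isclose(np.linalg.det(A), np.prod(lam)))   # True

# a scaled eigenvector is still an eigenvector with the same eigenvalue
v = 3.0 * V[:, 0]
print(np.allclose(A @ v, lam[0] * v))               # True

# each eigenvalue makes A - lam*I singular: det(A - lam*I) = 0
print(all(np.isclose(np.linalg.det(A - l * np.eye(2)), 0.0) for l in lam))  # True
```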
References