2 Prerequisites

Before diving into the course, it’s important to have a solid understanding of the following foundational concepts. These are organized into five key topics: general math, linear algebra, calculus, probability, and statistics.

You can review some of these requirements in Chapter 1 of the textbook.

2.1 General Math

You should be familiar with the summation operator \(\sum\). This operator is defined as follows:

\[\sum_{i=1}^n x_i = x_1 + x_2 + \ldots + x_n \]

Key properties of the summation operator include:

  • Linearity: \[\sum_{i=1}^N (a + b x_i) = aN + b \sum_{i=1}^N x_i\]

  • Additivity: \[\sum_{i=1}^N (x_i + y_i) = \sum_{i=1}^N x_i + \sum_{i=1}^N y_i\]
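
As a quick numerical illustration of these two properties, the short Python snippet below (using NumPy; the specific numbers are arbitrary) evaluates both sides of each identity:

```python
import numpy as np

rng = np.random.default_rng(0)
N = 10
a, b = 2.0, -0.5
x = rng.normal(size=N)
y = rng.normal(size=N)

# Linearity: sum_i (a + b*x_i) = a*N + b * sum_i x_i
lhs_lin = np.sum(a + b * x)
rhs_lin = a * N + b * np.sum(x)

# Additivity: sum_i (x_i + y_i) = sum_i x_i + sum_i y_i
lhs_add = np.sum(x + y)
rhs_add = np.sum(x) + np.sum(y)

print(np.isclose(lhs_lin, rhs_lin), np.isclose(lhs_add, rhs_add))  # True True
```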

2.2 Linear Algebra

You should be familiar with the following linear algebra concepts:

2.2.1 Linear Independence

Linear independence is a fundamental concept in linear algebra that describes a set of vectors where no vector can be written as a linear combination of the others. In other words, the vectors are not “redundant,” meaning none of the vectors depends on any other in the set.

2.2.1.1 Definition:

A set of vectors \(\{ \mathbf{v}_1, \mathbf{v}_2, \ldots, \mathbf{v}_n \}\) in a vector space is linearly independent if the only solution to the equation:

\[ c_1 \mathbf{v}_1 + c_2 \mathbf{v}_2 + \cdots + c_n \mathbf{v}_n = \mathbf{0} \]

is when all the scalar coefficients \(c_1, c_2, \ldots, c_n\) are zero, i.e., \(c_1 = c_2 = \cdots = c_n = 0\).

If any of the coefficients can be non-zero while still satisfying this equation, then the vectors are linearly dependent.

2.2.1.2 Example:

Consider two vectors in \(\mathbb{R}^2\):

\[ \mathbf{v}_1 = \begin{bmatrix} 1 \\ 0 \end{bmatrix}, \quad \mathbf{v}_2 = \begin{bmatrix} 0 \\ 1 \end{bmatrix} \]

These vectors are linearly independent because there is no way to express one as a multiple of the other. The only solution to:

\[ c_1 \begin{bmatrix} 1 \\ 0 \end{bmatrix} + c_2 \begin{bmatrix} 0 \\ 1 \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \end{bmatrix} \]

is \(c_1 = 0\) and \(c_2 = 0\).

In contrast, if:

\[ \mathbf{v}_1 = \begin{bmatrix} 1 \\ 2 \end{bmatrix}, \quad \mathbf{v}_2 = \begin{bmatrix} 2 \\ 4 \end{bmatrix} \]

These vectors are linearly dependent because \(\mathbf{v}_2 = 2 \mathbf{v}_1\); that is, \(\mathbf{v}_2\) can be expressed as a linear combination (here, a scalar multiple) of \(\mathbf{v}_1\).
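
In practice, linear independence can be checked numerically by stacking the vectors as columns of a matrix and computing its rank; the following NumPy sketch does this for the two pairs above:

```python
import numpy as np

# Independent pair: the standard basis of R^2
V1 = np.column_stack(([1, 0], [0, 1]))
# Dependent pair: v2 = 2 * v1
V2 = np.column_stack(([1, 2], [2, 4]))

# A set of column vectors is linearly independent exactly when
# the rank of the matrix equals the number of columns.
print(np.linalg.matrix_rank(V1) == V1.shape[1])  # True  -> independent
print(np.linalg.matrix_rank(V2) == V2.shape[1])  # False -> dependent
```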

2.2.1.3 Key Points:

  • Linearly independent vectors carry distinct information and cannot be derived from each other.
  • Linearly dependent vectors are redundant because one or more can be expressed as a combination of others.
  • In a set of linearly independent vectors, removing any vector reduces the span of the set.

2.2.1.4 Importance:

  • Linear independence is crucial in determining the rank of a matrix.
  • In systems of equations, linear independence of the rows or columns determines if the system has a unique solution.
  • In vector spaces, the dimension of the space is the maximum number of linearly independent vectors.

2.2.2 Column Space of a Matrix

The column space of a matrix is the set of all possible linear combinations of its columns. If you have a matrix \(\mathbf{A}\) with \(n\) rows and \(p\) columns, the column space of \(\mathbf{A}\), denoted as Col(\(\mathbf{A}\)), consists of all vectors in \(\mathbb{R}^n\) that can be expressed as a linear combination of the columns of \(\mathbf{A}\).

2.2.2.1 Definition:

Given a matrix \(\mathbf{A}\) with columns \(\mathbf{a}_1, \mathbf{a}_2, \dots, \mathbf{a}_p\), the column space of \(\mathbf{A}\) is defined as:

\[ \text{Col}(\mathbf{A}) = \left\{ \mathbf{y} \in \mathbb{R}^n \mid \mathbf{y} = \mathbf{A} \mathbf{c} \text{ for some } \mathbf{c} \in \mathbb{R}^p \right\} \]

This means the column space is the span of the columns of \(\mathbf{A}\), or equivalently, all vectors that can be written as \(\mathbf{y} = c_1 \mathbf{a}_1 + c_2 \mathbf{a}_2 + \dots + c_p \mathbf{a}_p\), where \(c_1, c_2, \dots, c_p\) are scalars.

2.2.2.2 Properties:

  • The column space of \(\mathbf{A}\) is a subspace of \(\mathbb{R}^n\).
  • The dimension of the column space of \(\mathbf{A}\) is called the rank of the matrix and corresponds to the number of linearly independent columns in \(\mathbf{A}\).
  • The column space provides valuable information about the linear independence and span of the columns of a matrix.

2.2.2.3 Geometric Interpretation:

In geometric terms, the column space represents the set of all possible vectors that can be “reached” by linearly combining the columns of the matrix. For example:

  • For a matrix with 2 columns in \(\mathbb{R}^3\), the column space is a plane through the origin if the columns are linearly independent, and a line (or just the origin) if they are dependent.
  • For a matrix with 3 columns in \(\mathbb{R}^2\), the column space is all of \(\mathbb{R}^2\) if at least two of the columns are linearly independent, and a line (or the origin) otherwise; three vectors in \(\mathbb{R}^2\) can never all be linearly independent.

2.2.3 Rank of a Matrix

The rank of a matrix is the dimension of its column space, which is the number of linearly independent columns in the matrix. Alternatively, it is also the dimension of the row space, which is the number of linearly independent rows.

2.2.3.1 Definition:

For a matrix \(\mathbf{A}\), the rank is defined as:

\[ \text{rank}(\mathbf{A}) = \dim(\text{Col}(\mathbf{A})) = \dim(\text{Row}(\mathbf{A})) \]

This is the maximum number of linearly independent rows or columns in the matrix. In other words, it tells you how many of the matrix’s columns (or rows) are not redundant and cannot be written as a linear combination of the others.

2.2.3.2 Key Points:

  • The rank of a matrix \(\mathbf{A}\) is denoted as rank(\(\mathbf{A}\)).
  • It measures the number of independent directions in the column space or row space.
  • Full rank: A matrix is said to have full rank if its rank is equal to the smaller of the number of rows or columns. For an \(m \times n\) matrix:
    • If \(\text{rank}(\mathbf{A}) = m\) (number of rows), it has full row rank.
    • If \(\text{rank}(\mathbf{A}) = n\) (number of columns), it has full column rank.
  • Rank-deficient: If the rank of the matrix is less than the smaller of the number of rows or columns, the matrix is called rank-deficient, meaning that some of its rows or columns are linearly dependent.
  • \(\text{rank}(\mathbf{A}) = \text{rank}(\mathbf{A}') = \text{rank}(\mathbf{A}' \mathbf{A}) = \text{rank}(\mathbf{A}\mathbf{A}')\)

2.2.3.3 Example:

Consider the matrix: \[ \mathbf{A} = \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \\ 7 & 8 & 9 \end{bmatrix} \]

The rank of \(\mathbf{A}\) is 2 because two of the rows (or columns) are linearly independent, but the third row (or column) is a linear combination of the others.
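
This can be confirmed numerically, for example with NumPy's matrix_rank:

```python
import numpy as np

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]])

print(np.linalg.matrix_rank(A))  # 2
# The third row is a linear combination of the first two: row3 = 2*row2 - row1
print(np.allclose(A[2], 2 * A[1] - A[0]))  # True
```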

2.2.3.4 Properties:

  • The rank of a matrix is always less than or equal to the minimum of the number of rows and columns: \[ \text{rank}(\mathbf{A}) \leq \min(m, n) \]
  • The rank of a matrix is equal to the number of non-zero singular values in its singular value decomposition (SVD).
  • In square matrices, the rank gives insight into whether the matrix is invertible. A square matrix is invertible if and only if it has full rank.

2.2.4 Full Rank Matrix

A full rank matrix is a matrix in which the rank is equal to the largest possible value for that matrix, meaning:

  • For an \(m \times n\) matrix \(A\), the rank is the maximum number of linearly independent rows or columns.
    • If the rank is equal to \(m\) (the number of rows), the matrix has full row rank.
    • If the rank is equal to \(n\) (the number of columns), the matrix has full column rank.

2.2.4.1 For a square matrix (\(m = n\)):

  • A square matrix is full rank if its rank is equal to its dimension, i.e., if the matrix is invertible.
  • In this case, \(\text{rank}(\mathbf{A}) = n\), meaning all rows and columns are linearly independent, and the matrix has an inverse.

2.2.4.2 For a rectangular matrix (\(m \neq n\)):

  • A matrix is full rank if the rank equals the smaller of the number of rows or columns. For an \(m \times n\) matrix, the rank is at most \(\min(m, n)\).
    • If the matrix has full row rank, all rows are linearly independent.
    • If the matrix has full column rank, all columns are linearly independent.

2.2.4.3 Example:

Consider the matrix:

\[ \mathbf{A}= \begin{bmatrix} 1 & 2 & 3 \\ 4 & 5 & 6 \end{bmatrix} \]

This is a \(2 \times 3\) matrix. Since its two rows are linearly independent, it has full row rank, with rank = 2 (the number of rows). However, it does not have full column rank: the rank (2) is smaller than the number of columns (3), so its three columns cannot all be linearly independent.

2.2.4.4 Key Properties:

  • A full rank matrix has no redundant rows or columns (no row or column can be written as a linear combination of others).
  • A square matrix with full rank is invertible (non-singular).
  • For a rectangular matrix, full rank implies the matrix has maximal independent information in terms of its rows or columns.

2.2.4.5 Importance:

  • Full rank matrices are crucial in solving systems of linear equations. A system \(\mathbf{A}\mathbf{x} = \mathbf{b}\) has a unique solution if \(\mathbf{A}\) is a square, full rank matrix.
  • In linear algebra and machine learning, the rank provides insight into the dimensionality and the independence of the data or transformation matrix.

2.2.5 Inverse Matrix

An inverse matrix of a square matrix \(\mathbf{A}\), denoted as \(\mathbf{A}^{-1}\), is a matrix that, when multiplied by \(\mathbf{A}\), results in the identity matrix \(I\). This relationship is expressed as:

\[ \mathbf{A}\mathbf{A}^{-1} = \mathbf{A}^{-1} \mathbf{A}= \mathbf{I} \]

where \(\mathbf{I}\) is the identity matrix, and its diagonal elements are 1, with all off-diagonal elements being 0.

2.2.5.1 Conditions for a Matrix to Have an Inverse:

  • The matrix \(\mathbf{A}\) must be square, meaning it has the same number of rows and columns.
  • The matrix \(\mathbf{A}\) must be non-singular, meaning its determinant is non-zero (\(|\mathbf{A}| \neq 0\)).

2.2.5.2 Properties of the Inverse Matrix:

  1. Uniqueness: If a matrix has an inverse, it is unique.
  2. Inverse of a Product: The inverse of the product of two matrices \(\mathbf{A}\) and \(\mathbf{B}\) is given by \((\mathbf{A}\mathbf{B})^{-1} = \mathbf{B}^{-1} \mathbf{A}^{-1}\).
  3. Inverse of the Inverse: \((\mathbf{A}^{-1})^{-1} = \mathbf{A}\).
  4. Transpose of the Inverse: \((\mathbf{A}^{-1})' = (\mathbf{A}')^{-1}\).

2.2.5.3 Special Case 2 by 2 Matrix

For a \(2 \times 2\) matrix:

\[ \mathbf{A}= \begin{bmatrix} a & b \\ c & d \end{bmatrix} \]

The inverse of \(\mathbf{A}\) (if \(|\mathbf{A}|=\det(\mathbf{A}) \neq 0\)) is:

\[ \mathbf{A}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix} \]

where \(ad - bc\) is the determinant of the matrix \(\mathbf{A}\).
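
The closed-form inverse is easy to compare against a general-purpose numerical inverse; a small NumPy sketch (the specific matrix is arbitrary):

```python
import numpy as np

A = np.array([[3.0, 1.0],
              [2.0, 4.0]])
a, b, c, d = A.ravel()

det = a * d - b * c                      # must be non-zero for the inverse to exist
A_inv_formula = (1.0 / det) * np.array([[d, -b],
                                        [-c, a]])

print(np.allclose(A_inv_formula, np.linalg.inv(A)))  # True
print(np.allclose(A @ A_inv_formula, np.eye(2)))     # True
```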

2.2.5.4 Special Case 2 by 2 Block Matrix

The inverse of a \(2 \times 2\) block matrix can be expressed under certain conditions. Let’s consider a block matrix \(\mathbf{M}\) of the form:

\[ \mathbf{M} = \begin{bmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D} \end{bmatrix} \]

where \(\mathbf{A}\) and \(\mathbf{D}\) are themselves square matrices, and \(\mathbf{B}\) and \(\mathbf{C}\) are matrices (not necessarily square).

Then the inverse of \(\mathbf{M}\) is given by:

\[ \mathbf{M}^{-1} = \begin{bmatrix} \mathbf{A}^{-1} + \mathbf{A}^{-1} \mathbf{B} \mathbf{S}^{-1} \mathbf{C} \mathbf{A}^{-1} & -\mathbf{A}^{-1} \mathbf{B} \mathbf{S}^{-1} \\ -\mathbf{S}^{-1} \mathbf{C} \mathbf{A}^{-1} & \mathbf{S}^{-1} \end{bmatrix} \]

where \(\mathbf{S}\) is the Schur complement of \(\mathbf{A}\) and is defined as:

\[ \mathbf{S} = \mathbf{D} - \mathbf{C} \mathbf{A}^{-1} \mathbf{B} \]

2.2.5.4.1 Conditions for the Inverse to Exist:

  • \(\mathbf{A}\) must be invertible,
  • The Schur complement \(\mathbf{S}\) must also be invertible.

2.2.5.4.2 Explanation of the Terms:

  • \(\mathbf{A}^{-1}\): The inverse of matrix \(\mathbf{A}\),
  • \(\mathbf{S}^{-1}\): The inverse of the Schur complement \(\mathbf{S}\), which can be interpreted as the “effective” part of matrix \(\mathbf{D}\) once the contribution of \(\mathbf{A}\) has been removed.

This formula generalizes the concept of inverting a matrix when it’s partitioned into blocks.
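
As an illustrative numerical check of the block formula, the sketch below builds a random partitioned matrix (constructed so that \(\mathbf{A}\) is invertible, with the Schur complement assumed invertible for these draws) and compares the block expression with a direct inversion:

```python
import numpy as np

rng = np.random.default_rng(1)
p, q = 3, 2
Ra = rng.normal(size=(p, p))
Rd = rng.normal(size=(q, q))
A = Ra @ Ra.T + np.eye(p)      # symmetric positive definite, hence invertible
D = Rd @ Rd.T + np.eye(q)      # same
B = rng.normal(size=(p, q))
C = rng.normal(size=(q, p))

M = np.block([[A, B],
              [C, D]])

A_inv = np.linalg.inv(A)
S = D - C @ A_inv @ B          # Schur complement of A (assumed invertible here)
S_inv = np.linalg.inv(S)

M_inv_blocks = np.block([
    [A_inv + A_inv @ B @ S_inv @ C @ A_inv, -A_inv @ B @ S_inv],
    [-S_inv @ C @ A_inv,                     S_inv],
])

print(np.allclose(M_inv_blocks, np.linalg.inv(M)))  # True (when A and S are invertible)
```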

2.2.5.5 Sherman-Morrison Formula

The Sherman-Morrison formula provides a way to compute the inverse of a matrix after it has been updated by a rank-one modification. Specifically, it addresses the situation where a matrix \(\mathbf{A}\) has been updated by the outer product of two vectors \(\mathbf{u}\) and \(\mathbf{v}\). The formula is:

\[ (\mathbf{A} + \mathbf{u} \mathbf{v}^T)^{-1} = \mathbf{A}^{-1} - \frac{\mathbf{A}^{-1} \mathbf{u} \mathbf{v}^T \mathbf{A}^{-1}}{1 + \mathbf{v}^T \mathbf{A}^{-1} \mathbf{u}} \]

2.2.5.5.1 Requirements:

  • \(\mathbf{A}\) must be an invertible matrix.
  • The scalar \(1 + \mathbf{v}^T \mathbf{A}^{-1} \mathbf{u}\) must not be zero.

2.2.5.5.2 Explanation of the terms:

  • \(\mathbf{A}\) is an invertible \(n \times n\) matrix.
  • \(\mathbf{u}\) and \(\mathbf{v}\) are \(n \times 1\) column vectors.
  • The outer product \(\mathbf{u} \mathbf{v}^T\) is an \(n \times n\) rank-one matrix.
  • The term \(1 + \mathbf{v}^T \mathbf{A}^{-1} \mathbf{u}\) is a scalar.

This formula is useful in situations where you need to efficiently update the inverse of a matrix after a low-rank modification, rather than recomputing the inverse from scratch.
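
A small numerical check of the formula (NumPy; the matrix and vectors are arbitrary, and the denominator is assumed non-zero for these draws):

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
R = rng.normal(size=(n, n))
A = R @ R.T + np.eye(n)                  # invertible by construction
u = rng.normal(size=(n, 1))
v = rng.normal(size=(n, 1))

A_inv = np.linalg.inv(A)
denom = 1.0 + (v.T @ A_inv @ u).item()   # assumed non-zero for these draws

updated_inv = A_inv - (A_inv @ u @ v.T @ A_inv) / denom
print(np.allclose(updated_inv, np.linalg.inv(A + u @ v.T)))  # True
```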

2.2.6 Positive Definite Matrix

A positive definite matrix is a symmetric matrix \(\mathbf{A}\) where, for any non-zero vector \(\mathbf{x}\), the following condition holds:

\[ \mathbf{x}' \mathbf{A}\mathbf{x} > 0 \]

2.2.6.1 Key Properties:

  1. Symmetry: The matrix \(\mathbf{A}\) must be symmetric, meaning \(\mathbf{A}= \mathbf{A}'\).
  2. Positive quadratic form: For any non-zero vector \(\mathbf{x}\), the quadratic form \(\mathbf{x}' \mathbf{A}\mathbf{x}\) must yield a positive value.

2.2.6.2 Characteristics of a Positive Definite Matrix:

  • All the eigenvalues of a positive definite matrix are positive.
  • The determinants of the leading principal minors (submatrices) of the matrix are positive.
  • The diagonal elements of a positive definite matrix are positive.
  • \(\mathbf{A}\) is invertible.
  • \(\mathbf{A}^{-1}\) is also a positive definite matrix.

2.2.6.3 Example:

The matrix: \[ \mathbf{A}= \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \] is positive definite, because for any non-zero vector \(\mathbf{x}\), \(\mathbf{x}' \mathbf{A}\mathbf{x} > 0\). For instance, if \(\mathbf{x} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}\), then:

\[ \mathbf{x}' \mathbf{A}\mathbf{x} = \begin{bmatrix} 1 & 1 \end{bmatrix} \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix} = 6 > 0 \]
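
In practice, positive definiteness is usually verified through the eigenvalues or a Cholesky factorization rather than by testing the quadratic form for every vector; a brief NumPy sketch for this example:

```python
import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 2.0]])

# All eigenvalues of a symmetric positive definite matrix are positive.
print(np.all(np.linalg.eigvalsh(A) > 0))  # True (eigenvalues are 1 and 3)

# The quadratic form is positive for the example vector x = (1, 1)'.
x = np.array([1.0, 1.0])
print(x @ A @ x)  # 6.0

# A Cholesky factorization succeeds only for positive definite matrices.
L = np.linalg.cholesky(A)
print(np.allclose(L @ L.T, A))  # True
```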

2.2.7 Singular Value Decomposition

Singular Value Decomposition (SVD) is a fundamental matrix factorization technique used in linear algebra to break down a matrix into three distinct components. It provides valuable insight into the structure of a matrix and is widely used in applications like data compression, signal processing, and dimensionality reduction.

2.2.7.1 Definition:

For any real (or complex) matrix \(\mathbf{A}\) of size \(m \times n\), the Singular Value Decomposition is given by:

\[ \mathbf{A} = \mathbf{U} \mathbf{\Sigma} \mathbf{V}' \]

where:

  • \(\mathbf{U}\) is an \(m \times m\) orthogonal matrix, whose columns are called the left singular vectors.
  • \(\mathbf{\Sigma}\) is an \(m \times n\) diagonal matrix, where the diagonal entries are the singular values of \(\mathbf{A}\). The singular values are always non-negative and arranged in decreasing order.
  • \(\mathbf{V}\) is an \(n \times n\) orthogonal matrix, whose columns are called the right singular vectors.

2.2.7.2 Interpretation of the Components:

  • \(\mathbf{U}\) represents the orthonormal basis for the column space of \(\mathbf{A}\).
  • \(\mathbf{V}\) represents the orthonormal basis for the row space of \(\mathbf{A}\).
  • \(\mathbf{\Sigma}\) contains the singular values, which provide information about the importance or magnitude of the corresponding singular vectors. Large singular values indicate directions with significant data spread, while small or zero singular values correspond to directions with little or no data variation.

2.2.7.3 Geometric Interpretation:

SVD can be viewed geometrically as a transformation in three steps:

  1. \(\mathbf{V}'\) applies a rotation or reflection in the input space.
  2. \(\mathbf{\Sigma}\) stretches or compresses the data along the coordinate axes.
  3. \(\mathbf{U}\) applies a final rotation or reflection in the output space.

2.2.7.4 Key Points:

  • Rank: The number of non-zero singular values in \(\mathbf{\Sigma}\) equals the rank of the matrix \(\mathbf{A}\).
  • Dimensionality Reduction: By truncating small singular values in \(\mathbf{\Sigma}\), we can approximate \(\mathbf{A}\) with a lower-rank matrix, which is useful in compressing data while retaining most of its structure.
  • Condition Number: The ratio of the largest to the smallest non-zero singular value gives the condition number of the matrix, which indicates how sensitive a matrix is to numerical errors or perturbations.

2.2.7.5 Example:

For a matrix \(\mathbf{A}\) of size \(3 \times 2\), the SVD would look like:

\[ \mathbf{A} = \mathbf{U} \begin{bmatrix} \sigma_1 & 0 \\ 0 & \sigma_2 \\ 0 & 0 \end{bmatrix} \mathbf{V}' \]

where \(\sigma_1\) and \(\sigma_2\) are the singular values, and \(\mathbf{U}\) and \(\mathbf{V}\) are orthogonal matrices.
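
The NumPy sketch below computes the SVD of an arbitrary \(3 \times 2\) matrix and verifies the factorization, the orthogonality of \(\mathbf{U}\) and \(\mathbf{V}\), and the rank interpretation:

```python
import numpy as np

A = np.array([[1.0, 0.0],
              [0.0, 2.0],
              [3.0, 4.0]])

# U is 3x3, Vt is 2x2; the singular values are returned as a 1-D array
# in decreasing order.
U, s, Vt = np.linalg.svd(A, full_matrices=True)

Sigma = np.zeros(A.shape)
Sigma[:len(s), :len(s)] = np.diag(s)

print(np.allclose(U @ Sigma @ Vt, A))     # True: A = U Sigma V'
print(np.allclose(U.T @ U, np.eye(3)))    # U is orthogonal
print(np.allclose(Vt @ Vt.T, np.eye(2)))  # V is orthogonal
print(np.sum(s > 1e-12))                  # 2: number of non-zero singular values = rank
```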

2.2.7.6 Applications of SVD:

  • Dimensionality Reduction: SVD is widely used in Principal Component Analysis (PCA) for reducing the dimensionality of large datasets.
  • Low-Rank Approximations: In data compression, SVD helps to approximate matrices with fewer dimensions while maintaining the core structure.
  • Solving Linear Systems: In cases where a matrix is close to singular, SVD can be used to solve linear systems more stably.
  • Latent Semantic Analysis (LSA): In natural language processing, SVD is used to reduce the dimensionality of word-document matrices to capture latent relationships between words and documents.

2.2.8 Eigendecomposition

Eigendecomposition is a matrix factorization technique used in linear algebra, where a square matrix is decomposed into its eigenvalues and eigenvectors. It is applicable to square matrices and provides deep insight into the matrix’s structure, particularly in understanding transformations, systems of linear equations, and differential equations.

2.2.8.1 Definition:

Given a square matrix \(\mathbf{A}\) of size \(n \times n\), eigendecomposition is a factorization of the matrix into the following form:

\[ \mathbf{A} = \mathbf{V} \mathbf{\Lambda} \mathbf{V}^{-1} \]

where:

  • \(\mathbf{V}\) is the matrix of eigenvectors of \(\mathbf{A}\), and each column of \(\mathbf{V}\) is an eigenvector.
  • \(\mathbf{\Lambda}\) is a diagonal matrix of eigenvalues of \(\mathbf{A}\), with each diagonal element corresponding to an eigenvalue of \(\mathbf{A}\).
  • \(\mathbf{V}^{-1}\) is the inverse of the matrix of eigenvectors.

2.2.8.2 Eigenvalues and Eigenvectors:

  • Eigenvalue (\(\lambda\)): A scalar \(\lambda\) is an eigenvalue of \(\mathbf{A}\) if there exists a non-zero vector \(\mathbf{v}\) such that:

    \[ \mathbf{A} \mathbf{v} = \lambda \mathbf{v} \]

    In this case, \(\mathbf{v}\) is called the eigenvector corresponding to the eigenvalue \(\lambda\).

  • Eigenvector: A non-zero vector \(\mathbf{v}\) that remains parallel to itself (i.e., only scaled) when multiplied by \(\mathbf{A}\) is called an eigenvector.

2.2.8.3 Conditions for Eigendecomposition:

  • A matrix \(\mathbf{A}\) is diagonalizable (i.e., it can be factored into \(\mathbf{A} = \mathbf{V} \mathbf{\Lambda} \mathbf{V}^{-1}\)) if and only if it has \(n\) linearly independent eigenvectors.
  • Not all matrices are diagonalizable. However, if \(\mathbf{A}\) has \(n\) distinct eigenvalues, then it is guaranteed to be diagonalizable.
  • Symmetric matrices are always diagonalizable.

2.2.8.4 Geometric Interpretation:

Eigendecomposition reveals the directions (eigenvectors) along which the matrix transformation \(\mathbf{A}\) acts as a simple scaling by the eigenvalues. Geometrically:

  • Eigenvectors point in directions that remain invariant under the transformation by \(\mathbf{A}\).
  • The corresponding eigenvalues tell us how much the matrix stretches or compresses vectors in the direction of those eigenvectors.

2.2.8.5 Example:

For a matrix \(\mathbf{A}\): \[ \mathbf{A} = \begin{bmatrix} 4 & 1 \\ 2 & 3 \end{bmatrix} \] The eigenvalues \(\lambda_1 = 5\) and \(\lambda_2 = 2\) can be found by solving the characteristic equation \(\det(\mathbf{A} - \lambda \mathbf{I}) = 0\). Corresponding eigenvectors can then be computed, allowing the matrix to be diagonalized as:

\[ \mathbf{A} = \mathbf{V} \mathbf{\Lambda} \mathbf{V}^{-1} \]

where \(\mathbf{\Lambda} = \text{diag}(5, 2)\) and \(\mathbf{V}\) is the matrix of eigenvectors.
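
A short NumPy check of this example (note that np.linalg.eig does not return eigenvalues in any particular order):

```python
import numpy as np

A = np.array([[4.0, 1.0],
              [2.0, 3.0]])

eigvals, V = np.linalg.eig(A)       # columns of V are eigenvectors
print(np.sort(eigvals))             # [2. 5.]

Lam = np.diag(eigvals)
# Verify the factorization A = V Lambda V^{-1}
print(np.allclose(V @ Lam @ np.linalg.inv(V), A))       # True

# Verify A v = lambda v for the first eigenpair
print(np.allclose(A @ V[:, 0], eigvals[0] * V[:, 0]))   # True
```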

2.2.8.6 Applications of Eigendecomposition:

  • Diagonalization: Eigendecomposition allows matrices to be diagonalized, simplifying many computations (such as raising matrices to powers).
  • Principal Component Analysis (PCA): In data science, eigendecomposition is used in PCA to find directions of maximum variance in data.
  • Solving Differential Equations: Eigenvalues and eigenvectors are useful in solving systems of linear differential equations.
  • Quantum Mechanics: In physics, eigenvalues and eigenvectors describe the measurable properties (like energy levels) of systems.

In summary, eigendecomposition is a powerful tool in linear algebra that provides insight into how a matrix transforms space, offering valuable properties through its eigenvalues and eigenvectors.

2.2.9 Idempotent Matrix

An idempotent matrix is a matrix that, when multiplied by itself, yields the same matrix. In other words, a matrix \(\mathbf{M}\) is idempotent if it satisfies the condition:

\[ \mathbf{M}^2 = \mathbf{M} \]

2.2.9.1 Key Properties of Idempotent Matrices:

  1. Eigenvalues: The eigenvalues of an idempotent matrix are either 0 or 1. This is because for an eigenvector \(\mathbf{v}\) with eigenvalue \(\lambda\), the equation \(\mathbf{M}^2 \mathbf{v} = \mathbf{M} \mathbf{v}\) simplifies to \(\lambda^2 \mathbf{v} = \lambda \mathbf{v}\), meaning \(\lambda(\lambda - 1) = 0\), so \(\lambda = 0\) or \(\lambda = 1\).

  2. Rank: The rank of an idempotent matrix \(\mathbf{M}\) is equal to the trace of the matrix (the sum of the diagonal elements), which is also the number of eigenvalues equal to 1.

  3. Projection Interpretation: Idempotent matrices often represent projection matrices in linear algebra. A projection matrix projects vectors onto a subspace, and applying the projection multiple times doesn’t change the result beyond the first application, which is why it satisfies \(\mathbf{M}^2 = \mathbf{M}\).

2.2.9.2 Examples:

  1. Identity Matrix: The identity matrix \(\mathbf{I}\) is idempotent because: \[ \mathbf{I}^2 = \mathbf{I} \]

  2. Zero Matrix: The zero matrix \(\mathbf{0}\) is also idempotent because: \[ \mathbf{0}^2 = \mathbf{0} \]

  3. Projection Matrix: Consider a projection matrix onto the x-axis in 2D: \[ \mathbf{P} = \begin{bmatrix} 1 & 0 \\ 0 & 0 \end{bmatrix} \] This matrix is idempotent since: \[ \mathbf{P}^2 = \mathbf{P} \]

2.2.9.3 Use in Statistics:

Idempotent matrices are commonly used in statistics, particularly in the context of regression analysis. For example, the hat matrix \(\mathbf{H}\) in linear regression, which transforms the observed values into the predicted values, is idempotent:

\[ \mathbf{H} = \mathbf{X}(\mathbf{X}^\top \mathbf{X})^{-1}\mathbf{X}^\top \] where \(\mathbf{X}\) is the design matrix.
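
The sketch below constructs a hat matrix for a small simulated design matrix (purely illustrative) and checks idempotency, the trace-equals-rank property, and the 0/1 eigenvalue property:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p = 20, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design matrix with intercept

H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix

print(np.allclose(H @ H, H))           # idempotent: H^2 = H
print(np.isclose(np.trace(H), p))      # trace = rank = number of columns of X
eigvals = np.linalg.eigvalsh(H)        # H is symmetric
print(np.all(np.isclose(eigvals, 0) | np.isclose(eigvals, 1)))  # eigenvalues are 0 or 1
```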

In summary, idempotent matrices have unique properties and are frequently encountered in linear algebra, projections, and statistical applications.

2.2.10 Determinant of a Matrix

The determinant of a matrix is a scalar value that provides important information about the properties of a square matrix. It is a fundamental concept in linear algebra, often used to determine whether a matrix is invertible, as well as to describe geometric transformations and volume scaling.

2.2.10.1 Key Characteristics of the Determinant:

  1. Square Matrices Only: The determinant is only defined for square matrices (i.e., matrices with the same number of rows and columns).

  2. Invertibility: A matrix is invertible (i.e., it has an inverse) if and only if its determinant is non-zero. If the determinant is zero, the matrix is singular and does not have an inverse.

  3. Geometric Interpretation: The determinant represents the scaling factor of the linear transformation described by the matrix. For example, in two or three dimensions, the determinant tells you how much the matrix scales area or volume:

    • A determinant of 1 means the matrix preserves the area (in 2D) or volume (in 3D).
    • A determinant greater than 1 means the matrix scales the area or volume by that factor.
    • A negative determinant indicates that the transformation also includes a reflection.
  4. Significance of Zero Determinant: If the determinant is zero, it means that the matrix maps some vectors to a lower-dimensional space. For instance, in 2D, it might map points onto a line, collapsing the area to zero.

2.2.10.2 Definition:

For a 2x2 matrix: \[ \mathbf{A} = \begin{bmatrix} a & b \\ c & d \end{bmatrix} \] The determinant of \(\mathbf{A}\) is: \[ \det(\mathbf{A}) = ad - bc \] This formula gives the area scaling factor for the linear transformation represented by the matrix \(\mathbf{A}\).

For a 3x3 matrix: \[ \mathbf{B} = \begin{bmatrix} a & b & c \\ d & e & f \\ g & h & i \end{bmatrix} \] The determinant of \(\mathbf{B}\) is: \[ \det(\mathbf{B}) = a(ei - fh) - b(di - fg) + c(dh - eg) \] This formula can be extended to higher dimensions using recursive expansion (called cofactor expansion or Laplace expansion).

2.2.10.3 Applications of the Determinant:

  1. Solving Linear Systems: The determinant is used in Cramer’s Rule, a method for solving systems of linear equations. If the determinant of the coefficient matrix is non-zero, the system has a unique solution.

  2. Eigenvalues and Eigenvectors: The determinant plays a key role in computing the eigenvalues of a matrix. The determinant of a matrix \(\mathbf{A} - \lambda \mathbf{I}\), where \(\lambda\) is a scalar and \(\mathbf{I}\) is the identity matrix, gives the characteristic equation whose solutions are the eigenvalues.

  3. Volume and Area Calculations: In geometry, the determinant helps calculate the area (in 2D) or volume (in 3D) of a region after applying a linear transformation.

  4. Singular Value Decomposition (SVD) and Principal Component Analysis (PCA): The determinant is important in these techniques for understanding data structures, transformations, and dimensionality reduction.

2.2.10.4 Properties of the Determinant

For a general block matrix of the form: \[ \mathbf{M} = \begin{bmatrix} \mathbf{A} & \mathbf{B} \\ \mathbf{C} & \mathbf{D} \end{bmatrix} \] where \(\mathbf{A}\), \(\mathbf{B}\), \(\mathbf{C}\), and \(\mathbf{D}\) are submatrices, the determinant formula is more complex. If \(\mathbf{D}\) is invertible, we can use the Schur complement to compute the determinant: \[ \det(\mathbf{M}) = \det(\mathbf{D}) \det(\mathbf{A} - \mathbf{B} \mathbf{D}^{-1} \mathbf{C}) \] Here, \(\mathbf{A} - \mathbf{B} \mathbf{D}^{-1} \mathbf{C}\) is called the Schur complement of \(\mathbf{D}\) in \(\mathbf{M}\).

For any invertible matrix \(\mathbf{X}\in \mathbb{R}^{p \times p}\) and matrices \(\mathbf{A}\in\mathbb{R}^{p\times q}\) and \(\mathbf{B}\in\mathbb{R}^{q\times p}\), we have:

\[\det(\mathbf{X}+ \mathbf{A}\mathbf{B}) = \det(\mathbf{X})\det(\mathbf{I}_q + \mathbf{B}\mathbf{X}^{-1}\mathbf{A}) \]
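
Both determinant identities are straightforward to verify numerically; the sketch below checks the second one (sometimes called the matrix determinant lemma) for arbitrary random matrices:

```python
import numpy as np

rng = np.random.default_rng(4)
p, q = 4, 2
R = rng.normal(size=(p, p))
X = R @ R.T + np.eye(p)                  # invertible by construction
A = rng.normal(size=(p, q))
B = rng.normal(size=(q, p))

lhs = np.linalg.det(X + A @ B)
rhs = np.linalg.det(X) * np.linalg.det(np.eye(q) + B @ np.linalg.inv(X) @ A)

print(np.isclose(lhs, rhs))  # True
```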

2.3 Calculus

Key calculus topics include:

2.3.1 Gradient

The gradient of a function is a vector that contains the partial derivatives of the function with respect to each of its variables. It points in the direction of the steepest ascent of the function, and its magnitude indicates the rate of change in that direction.

For a scalar function \(f(x_1, x_2, \ldots, x_n)\), where \(x_1, x_2, \ldots, x_n\) are the variables, the gradient is defined as:

\[ \nabla f = \frac{d}{d \mathbf{x}} f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} \]

2.3.1.1 Key Points:

  • Direction: The gradient points in the direction of the greatest increase of the function.
  • Magnitude: The magnitude of the gradient represents how fast the function increases in that direction.
  • Zero Gradient: If \(\nabla f = 0\), it indicates that the function has a critical point, which could be a local minimum, maximum, or saddle point.

2.3.1.2 Example:

For a function \(f(x, y) = x^2 + y^2\), the gradient is:

\[ \nabla f = \begin{bmatrix} \frac{\partial}{\partial x} (x^2 + y^2) \\ \frac{\partial}{\partial y} (x^2 + y^2) \end{bmatrix} = \begin{bmatrix} 2x \\ 2y \end{bmatrix} \]

This shows that the gradient points outward from the origin, and its magnitude increases as \(x\) and \(y\) increase.
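
Gradient formulas like this one can be sanity-checked against central finite differences; an illustrative sketch:

```python
import numpy as np

def f(v):
    x, y = v
    return x**2 + y**2

def grad_f(v):
    x, y = v
    return np.array([2 * x, 2 * y])

def numerical_grad(func, v, h=1e-6):
    """Central-difference approximation of the gradient."""
    g = np.zeros_like(v, dtype=float)
    for i in range(len(v)):
        e = np.zeros_like(v, dtype=float)
        e[i] = h
        g[i] = (func(v + e) - func(v - e)) / (2 * h)
    return g

v0 = np.array([1.5, -2.0])
print(grad_f(v0))                                       # [ 3. -4.]
print(np.allclose(grad_f(v0), numerical_grad(f, v0)))   # True
```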

2.3.1.3 Applications:

  • In optimization, the gradient is used to find the minimum or maximum of a function (e.g., in gradient descent, a common optimization algorithm).
  • In vector calculus, the gradient is used to describe the slope or rate of change of scalar fields (such as temperature, pressure, or altitude in physical applications).

2.3.2 Hessian Matrix

The Hessian matrix is a square matrix of second-order partial derivatives of a scalar-valued function. It describes the local curvature of a multivariable function and is used to assess the nature of critical points (i.e., whether they are minima, maxima, or saddle points).

For a scalar function \(f(x_1, x_2, \ldots, x_n)\), the Hessian matrix \(\mathbf{H}\) is defined as:

\[ \mathbf{H}(f) = \frac{d}{d \mathbf{x} d\mathbf{x}'} f(\mathbf{x}) =\begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix} \]

2.3.2.1 Key Properties:

  • The Hessian is symmetric if the second-order partial derivatives are continuous (by Clairaut’s theorem, also called Schwarz’s theorem).
  • It provides important information about the local behavior of the function, particularly around critical points where the gradient is zero.
  • Eigenvalues of the Hessian matrix determine the type of critical points:
    • If all eigenvalues are positive, the function has a local minimum.
    • If all eigenvalues are negative, the function has a local maximum.
    • If some eigenvalues are positive and others are negative, the function has a saddle point.

2.3.2.2 Example:

For a function \(f(x, y) = x^2 + xy + y^2\), the Hessian matrix is:

\[ \mathbf{H}(f) = \begin{bmatrix} \frac{\partial^2 f}{\partial x^2} & \frac{\partial^2 f}{\partial x \partial y} \\ \frac{\partial^2 f}{\partial y \partial x} & \frac{\partial^2 f}{\partial y^2} \end{bmatrix} = \begin{bmatrix} 2 & 1 \\ 1 & 2 \end{bmatrix} \]

2.3.2.3 Applications:

  • In optimization, the Hessian is used to assess the convexity or concavity of a function, which helps in identifying the nature of critical points.
  • In machine learning, it is used to optimize loss functions and can be part of second-order optimization methods like Newton’s method.
  • In economics and engineering, the Hessian helps in analyzing systems involving multiple variables and understanding how they interact with each other.

2.3.3 Matrix Calculus

You need to know the following matrix calculus operations:

\[ \frac{d}{d \mathbf{x}} \left(\mathbf{c}'\mathbf{x}\right) \] \[ \frac{d}{d \mathbf{x}} \left(\mathbf{x}'\mathbf{A}\mathbf{x}\right) \] \[ \frac{d}{d \mathbf{x} d\mathbf{x}'} \left(\mathbf{x}'\mathbf{A}\mathbf{x}\right) \]

Let \(\mathbf{c}\) be a constant vector and \(\mathbf{x}\) be a variable vector, both of size \(n \times 1\). We want to compute the derivative of the product:

\[ f(\mathbf{x}) = \mathbf{c}' \mathbf{x} \] Where: \[ \mathbf{c}' \mathbf{x} = \sum_{i=1}^{n} c_i x_i \]

To differentiate \(f(\mathbf{x}) = \mathbf{c}' \mathbf{x}\) with respect to the variable vector \(\mathbf{x}\), we take the derivative of each component separately:

\[ \nabla f = \frac{d}{d \mathbf{x}} f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} = \begin{bmatrix} \frac{\partial }{\partial x_1} (\mathbf{c}' \mathbf{x}) \\ \frac{\partial }{\partial x_2} (\mathbf{c}' \mathbf{x}) \\ \vdots \\ \frac{\partial }{\partial x_n} (\mathbf{c}' \mathbf{x}) \end{bmatrix} = \begin{bmatrix} \frac{\partial }{\partial x_1} \left(\sum_{i=1}^{n} c_i x_i\right) \\ \frac{\partial }{\partial x_2} \left(\sum_{i=1}^{n} c_i x_i\right) \\ \vdots \\ \frac{\partial }{\partial x_n} \left(\sum_{i=1}^{n} c_i x_i\right) \end{bmatrix} \]

Since \(\mathbf{c}\) is a constant vector, the derivative of each term \(c_i x_i\) is simply \(c_i\), that is:

\[ \frac{d}{d x_j} \left(\sum_{i=1}^{n} c_i x_i\right) = c_j \]

Thus, the derivative of the entire sum is the vector:

\[ \frac{d}{d \mathbf{x}} \left( \mathbf{c}' \mathbf{x} \right) = \begin{bmatrix} c_1 \\ c_2 \\ \vdots \\ c_n \end{bmatrix} = \mathbf{c} \]

Now, let’s go through the derivative of the quadratic form \(f(\mathbf{x}) = \mathbf{x}' \mathbf{A}\mathbf{x}\), where:

  • \(\mathbf{x}\) is a variable vector of size \(n \times 1\),
  • \(\mathbf{A}\) is a constant, symmetric matrix of size \(n \times n\).

\[ f(\mathbf{x}) = \mathbf{x}' \mathbf{A}\mathbf{x} \] First, expand the quadratic form:

\[ f(\mathbf{x}) = \sum_{i=1}^{n} \sum_{j=1}^{n} x_i a_{ij} x_j \] Then

\[ \nabla f = \frac{d}{d \mathbf{x}} f(\mathbf{x}) = \begin{bmatrix} \frac{\partial f}{\partial x_1} \\ \frac{\partial f}{\partial x_2} \\ \vdots \\ \frac{\partial f}{\partial x_n} \end{bmatrix} = \begin{bmatrix} \frac{\partial }{\partial x_1} (\mathbf{x}' \mathbf{A}\mathbf{x}) \\ \frac{\partial }{\partial x_2} (\mathbf{x}' \mathbf{A}\mathbf{x}) \\ \vdots \\ \frac{\partial }{\partial x_n} (\mathbf{x}' \mathbf{A}\mathbf{x}) \end{bmatrix} = \begin{bmatrix} \frac{\partial }{\partial x_1} \left(\sum_{i=1}^{n} \sum_{j=1}^{n} x_i a_{ij} x_j\right) \\ \frac{\partial }{\partial x_2} \left(\sum_{i=1}^{n} \sum_{j=1}^{n} x_i a_{ij} x_j\right) \\ \vdots \\ \frac{\partial }{\partial x_n} \left(\sum_{i=1}^{n} \sum_{j=1}^{n} x_i a_{ij} x_j\right) \end{bmatrix} \] For each component \(x_k\) in the vector \(\mathbf{x}\), the derivative of \(f(\mathbf{x})\) is:

\[ \frac{\partial}{\partial x_k} f(\mathbf{x}) = \frac{\partial}{\partial x_k} \left( \sum_{i=1}^{n} \sum_{j=1}^{n} x_i a_{ij} x_j \right) = \sum_{i=1}^{n} \sum_{j=1}^{n} \frac{\partial}{\partial x_k} x_i a_{ij} x_j \]

Each term \(x_i a_{ij} x_j\) has two components that depend on \(\mathbf{x}\):

  • If \(i = j = k\), the derivative with respect to \(x_k\) is:

\[ \frac{\partial}{\partial x_k} (x_i a_{ij} x_j) = 2 a_{kk} x_k \]

  • If \(i \neq j\) and \(i = k\), the derivative with respect to \(x_k\) is:

\[ \frac{\partial}{\partial x_k} (x_i a_{ij} x_j) = a_{kj} x_j \]

  • Similarly, if \(i \neq j\) and \(j = k\), the derivative with respect to \(x_k\) is:

\[ \frac{\partial}{\partial x_k} (x_i a_{ij} x_j) = a_{ik} x_i \]

  • Finally, if \(i \neq k\) and \(j \neq k\), then:

\[ \frac{\partial}{\partial x_k} (x_i a_{ij} x_j) = 0 \]

Then

\[ \frac{\partial}{\partial x_k} f(\mathbf{x}) = 2 a_{kk} x_k + \sum_{i \neq k} a_{ik} x_i + \sum_{j \neq k} a_{kj} x_j \]

Now since \(\mathbf{A}\) is symmetric (\(a_{ij} = a_{ji}\)), then:

\[\begin{align*} \frac{\partial}{\partial x_k} f(\mathbf{x}) &= 2 a_{kk} x_k + \sum_{i \neq k} a_{ik} x_i + \sum_{i \neq k} a_{ik} x_i \\ &= 2 a_{kk} x_k + 2\sum_{i \neq k} a_{ik} x_i \\ &= 2 \left(\sum_{i \neq k} a_{ik} x_i + a_{kk}x_k \right) \\ &= 2 \left(\sum_{i = 1}^n a_{ki} x_i\right) \end{align*}\]

Then:

\[ \nabla f = \begin{bmatrix} \frac{\partial }{\partial x_1} \left(\sum_{i=1}^{n} \sum_{j=1}^{n} x_i a_{ij} x_j\right) \\ \frac{\partial }{\partial x_2} \left(\sum_{i=1}^{n} \sum_{j=1}^{n} x_i a_{ij} x_j\right) \\ \vdots \\ \frac{\partial }{\partial x_n} \left(\sum_{i=1}^{n} \sum_{j=1}^{n} x_i a_{ij} x_j\right) \end{bmatrix} = \begin{bmatrix} 2 \sum_{i = 1}^n a_{1i} x_i \\ 2 \sum_{i = 1}^n a_{2i} x_i \\ \vdots \\ 2 \sum_{i = 1}^n a_{ni} x_i \end{bmatrix} = 2 \mathbf{A}\mathbf{x} \]

Finally, we turn to the second derivative.

In general, the Hessian matrix of a scalar function \(f(\mathbf{x})\), where \(\mathbf{x} \in \mathbb{R}^n\) is a vector of variables, is a matrix that contains all the second-order partial derivatives of the function. It is defined as:

\[ \mathbf{H}(f) = \frac{d^2 f}{d\mathbf{x}d\mathbf{x}'} = \begin{bmatrix} \frac{\partial^2 f}{\partial x_1^2} & \frac{\partial^2 f}{\partial x_1 \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_1 \partial x_n} \\ \frac{\partial^2 f}{\partial x_2 \partial x_1} & \frac{\partial^2 f}{\partial x_2^2} & \cdots & \frac{\partial^2 f}{\partial x_2 \partial x_n} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^2 f}{\partial x_n \partial x_1} & \frac{\partial^2 f}{\partial x_n \partial x_2} & \cdots & \frac{\partial^2 f}{\partial x_n^2} \end{bmatrix} = \begin{bmatrix} \frac{\partial }{\partial x_1}\left(\frac{d f}{d\mathbf{x}}\right)' \\ \frac{\partial }{\partial x_2}\left(\frac{d f}{d\mathbf{x}}\right)' \\ \vdots \\ \frac{\partial }{\partial x_n}\left(\frac{d f}{d\mathbf{x}}\right)' \end{bmatrix} = \begin{bmatrix} \frac{\partial }{\partial x_1}\left(2\mathbf{A}\mathbf{x}\right)' \\ \frac{\partial }{\partial x_2}\left(2\mathbf{A}\mathbf{x}\right)' \\ \vdots \\ \frac{\partial }{\partial x_n}\left(2\mathbf{A}\mathbf{x}\right)' \end{bmatrix} \]

Now

\[ \frac{\partial }{\partial x_k}\left(2\mathbf{A}\mathbf{x}\right) = 2\frac{\partial }{\partial x_k}\begin{bmatrix} \sum_{i = 1}^n a_{1i} x_i \\ \sum_{i = 1}^n a_{2i} x_i \\ \vdots \\ \sum_{i = 1}^n a_{ni} x_i \end{bmatrix} = 2 \begin{bmatrix} \frac{\partial }{\partial x_k} \left(\sum_{i = 1}^n a_{1i} x_i \right)\\ \frac{\partial }{\partial x_k} \left(\sum_{i = 1}^n a_{2i} x_i \right)\\ \vdots \\ \frac{\partial }{\partial x_k} \left(\sum_{i = 1}^n a_{ni} x_i \right) \end{bmatrix} = 2 \begin{bmatrix} a_{k1} \\ a_{k2} \\ \vdots \\ a_{kn} \end{bmatrix} \]

Then

\[ \mathbf{H}(f) = \frac{d^2 f}{d\mathbf{x}d\mathbf{x}'} = \begin{bmatrix} 2\begin{bmatrix} a_{11} \\ a_{12} \\ \vdots \\ a_{1n} \end{bmatrix}' \\ 2\begin{bmatrix} a_{21} \\ a_{22} \\ \vdots \\ a_{2n} \end{bmatrix}' \\ \vdots \\ 2\begin{bmatrix} a_{n1} \\ a_{n2} \\ \vdots \\ a_{nn} \end{bmatrix}' \end{bmatrix} = 2\begin{bmatrix} a_{11} & a_{12} & \cdots & a_{1n} \\ a_{21} & a_{22} & \cdots & a_{2n} \\ \vdots & \vdots & \ddots & \vdots \\ a_{n1} & a_{n2} & \cdots & a_{nn} \end{bmatrix} = 2 \mathbf{A} \]
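
All three results of this subsection (\(\frac{d}{d\mathbf{x}}(\mathbf{c}'\mathbf{x}) = \mathbf{c}\), \(\frac{d}{d\mathbf{x}}(\mathbf{x}'\mathbf{A}\mathbf{x}) = 2\mathbf{A}\mathbf{x}\) for symmetric \(\mathbf{A}\), and the Hessian \(2\mathbf{A}\)) can be confirmed with finite differences; a minimal sketch with arbitrary data:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 4
c = rng.normal(size=n)
M = rng.normal(size=(n, n))
A = (M + M.T) / 2                     # symmetric A
x0 = rng.normal(size=n)

def num_grad(f, x, h=1e-6):
    """Central-difference gradient of a scalar function f at x."""
    g = np.zeros(n)
    for i in range(n):
        e = np.zeros(n); e[i] = h
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

# d(c'x)/dx = c
print(np.allclose(num_grad(lambda x: c @ x, x0), c))
# d(x'Ax)/dx = 2Ax
print(np.allclose(num_grad(lambda x: x @ A @ x, x0), 2 * A @ x0))
# Hessian of x'Ax is 2A: differentiate each component of the gradient 2Ax
H = np.column_stack([num_grad(lambda x, i=i: (2 * A @ x)[i], x0) for i in range(n)])
print(np.allclose(H, 2 * A, atol=1e-6))
```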

2.4 Probability

Key probability concepts to understand include:

2.4.1 Expected Value

The expected value (or mean) of a random vector is a fundamental concept in multivariate statistics. Just as the expected value of a random variable provides a measure of the “center” or “average” of the distribution, the expected value of a random vector captures the central location of a multidimensional distribution.

2.4.1.1 Definition

Let \(\mathbf{X}\) be a random vector in \(\mathbb{R}^n\), represented as: \[ \mathbf{X} = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix}, \] where \(X_1, X_2, \dots, X_n\) are its components. The expected value of \(\mathbf{X}\), denoted by \(\mathbb{E}[\mathbf{X}]\), is defined as the vector of the expected values of each component: \[ \mathbb{E}[\mathbf{X}] = \begin{bmatrix} \mathbb{E}[X_1] \\ \mathbb{E}[X_2] \\ \vdots \\ \mathbb{E}[X_n] \end{bmatrix}. \]

This vector \(\mathbb{E}[\mathbf{X}]\) is also called the mean vector of \(\mathbf{X}\).

2.4.1.2 Key Properties of the Expected Value of a Random Vector

  1. Linearity: For any scalar \(a\) and random vectors \(\mathbf{X}\) and \(\mathbf{Y}\), \[ \mathbb{E}[a\mathbf{X} + \mathbf{Y}] = a\mathbb{E}[\mathbf{X}] + \mathbb{E}[\mathbf{Y}]. \]

  2. Expectation of a Constant Vector: If \(\mathbf{c}\) is a constant vector in \(\mathbb{R}^n\), then the expectation is simply the vector itself: \[ \mathbb{E}[\mathbf{c}] = \mathbf{c}. \]

  3. Expectation of a Linear Transformation: If \(\mathbf{A}\) is an \(m \times n\) constant matrix, then for a random vector \(\mathbf{X}\) in \(\mathbb{R}^n\), \[ \mathbb{E}[\mathbf{A} \mathbf{X}] = \mathbf{A} \mathbb{E}[\mathbf{X}]. \]

2.4.2 Variance

2.4.2.1 Definition

Let \(\mathbf{X}\) be a random vector in \(\mathbb{R}^n\), represented as: \[ \mathbf{X} = \begin{bmatrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{bmatrix}, \] where each \(X_i\) is a random variable. The variance-covariance matrix of \(\mathbf{X}\), denoted by \(\mathbb{V}(\mathbf{X})\) or \(\boldsymbol{\Sigma}\), is defined as: \[ \mathbb{V}(\mathbf{X}) = \mathbb{E}[(\mathbf{X} - \mathbb{E}[\mathbf{X}])(\mathbf{X} - \mathbb{E}[\mathbf{X}])']. \] The \((i, j)\) entry of \(\mathbb{V}(\mathbf{X})\) is given by \(\mathbb{C}(X_i, X_j)\), representing the covariance between components \(X_i\) and \(X_j\).

If \(\mathbf{X}\) has \(n\) components, \(\mathbb{V}(\mathbf{X})\) will be an \(n \times n\) symmetric matrix where:

  • Diagonal entries represent the variances of each component, i.e., \(\mathbb{V}(X_i) = \mathbb{C}(X_i, X_i)\).
  • Off-diagonal entries represent the covariances between different components, i.e., \(\mathbb{C}(X_i, X_j)\) for \(i \neq j\).

2.4.2.2 Key Properties of the Variance-Covariance Matrix

  1. Symmetry: The variance-covariance matrix is symmetric, meaning: \[ \mathbb{V}(\mathbf{X}) = \mathbb{V}(\mathbf{X})^T. \]

  2. Non-negativity: The variance-covariance matrix is positive semi-definite, which implies: \[ \mathbf{z}^T \mathbb{V}(\mathbf{X}) \mathbf{z} \geq 0 \quad \text{for any vector } \mathbf{z} \in \mathbb{R}^n. \] This property indicates that variances (the diagonal entries) are always non-negative.

  3. Variance of a Linear Transformation: If \(\mathbf{A}\) is an \(m \times n\) matrix, then the variance of the transformed random vector \(\mathbf{A} \mathbf{X}\) is given by: \[ \mathbb{V}(\mathbf{A} \mathbf{X}) = \mathbf{A} \, \mathbb{V}(\mathbf{X}) \, \mathbf{A}^T. \]

  4. Variance of Independent Random Variables: If the components of \(\mathbf{X}\) are mutually independent, the off-diagonal entries of \(\mathbb{V}(\mathbf{X})\) (the covariances) are zero, resulting in a diagonal covariance matrix.

  5. Variance of the Sum of Two Arbitrary Random Vectors: If \(\mathbf{X}\) and \(\mathbf{Y}\) are two arbitrary random vectors in \(\mathbb{R}^n\), the variance of their sum is given by: \[ \mathbb{V}(\mathbf{X} + \mathbf{Y}) = \mathbb{V}(\mathbf{X}) + \mathbb{V}(\mathbf{Y}) + 2 \, \mathbb{C}(\mathbf{X}, \mathbf{Y}), \] where \(\mathbb{C}(\mathbf{X}, \mathbf{Y})\) is the cross-covariance matrix between \(\mathbf{X}\) and \(\mathbf{Y}\): \[ \mathbb{C}(\mathbf{X}, \mathbf{Y}) = \mathbb{E}\left[(\mathbf{X} - \mathbb{E}[\mathbf{X}])(\mathbf{Y} - \mathbb{E}[\mathbf{Y}])'\right]. \]

  6. Variance of the Sum of Two Independent Random Vectors: If \(\mathbf{X}\) and \(\mathbf{Y}\) are two independent random vectors in \(\mathbb{R}^n\), then the variance of their sum is given by the sum of their individual variances: \[ \mathbb{V}(\mathbf{X} + \mathbf{Y}) = \mathbb{V}(\mathbf{X}) + \mathbb{V}(\mathbf{Y}). \] Since \(\mathbf{X}\) and \(\mathbf{Y}\) are independent, their covariance \(\mathbb{C}(\mathbf{X}, \mathbf{Y})\) is zero, simplifying the variance of the sum.

  7. Trace of the Variance: A related measure of the total variability of a random vector \(\mathbf{X}\) is given by \[ \text{tr}(\mathbb{V}[\mathbf{X}]) = \mathbb{E}[(\mathbf{X} - \mathbb{E}[\mathbf{X}])'(\mathbf{X} - \mathbb{E}[\mathbf{X}])] \]
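
Properties 3 and 6 above are easy to check by simulation; the following sketch (NumPy, with arbitrary dimensions, covariance matrix, and transformation) compares empirical covariance matrices with the stated identities, up to Monte Carlo error:

```python
import numpy as np

rng = np.random.default_rng(6)
n, draws = 3, 500_000

Sigma = np.array([[2.0, 0.5, 0.3],
                  [0.5, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])
X = rng.multivariate_normal(np.zeros(n), Sigma, size=draws)   # one row per draw
A = np.array([[1.0, -1.0, 0.0],
              [0.5,  0.5, 2.0]])

# Property 3: V(AX) = A V(X) A'
emp_var_AX = np.cov((X @ A.T).T)
print(np.allclose(emp_var_AX, A @ Sigma @ A.T, atol=0.1))      # True (up to simulation error)

# Property 6: for independent X and Y, V(X + Y) = V(X) + V(Y)
Y = rng.multivariate_normal(np.zeros(n), np.eye(n), size=draws)
emp_var_sum = np.cov((X + Y).T)
print(np.allclose(emp_var_sum, Sigma + np.eye(n), atol=0.1))   # True (up to simulation error)
```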

2.4.3 Cross-Covariance Matrix

The cross-covariance matrix measures the covariance between two random vectors, providing insights into the linear relationships between different components of these vectors. Given two random vectors \(\mathbf{X} \in \mathbb{R}^n\) and \(\mathbf{Y} \in \mathbb{R}^m\), the cross-covariance matrix between \(\mathbf{X}\) and \(\mathbf{Y}\) captures how each component of \(\mathbf{X}\) varies with each component of \(\mathbf{Y}\).

2.4.3.1 Definition

Let \(\mathbf{X}\) and \(\mathbf{Y}\) be two random vectors with mean vectors \(\mathbb{E}[\mathbf{X}] = \boldsymbol{\mu}_{\mathbf{X}}\) and \(\mathbb{E}[\mathbf{Y}] = \boldsymbol{\mu}_{\mathbf{Y}}\). The cross-covariance matrix of \(\mathbf{X}\) and \(\mathbf{Y}\) is defined as:

\[ \mathbb{C}(\mathbf{X}, \mathbf{Y}) = \mathbb{E}[(\mathbf{X} - \boldsymbol{\mu}_{\mathbf{X}})(\mathbf{Y} - \boldsymbol{\mu}_{\mathbf{Y}})^T]. \]

This matrix is of dimension \(n \times m\), where each element \((\mathbb{C}(\mathbf{X}, \mathbf{Y}))_{ij}\) represents the covariance between the \(i\)-th component of \(\mathbf{X}\) and the \(j\)-th component of \(\mathbf{Y}\):

\[ (\mathbb{C}(\mathbf{X}, \mathbf{Y}))_{ij} = \mathbb{C}(X_i, Y_j) = \mathbb{E}[(X_i - \mu_{X_i})(Y_j - \mu_{Y_j})]. \]

2.4.3.2 Key Properties

  1. Symmetry in Covariance:
    • If \(\mathbf{X} = \mathbf{Y}\), then \(\mathbb{C}(\mathbf{X}, \mathbf{Y})\) reduces to the covariance matrix of \(\mathbf{X}\), which is symmetric. For distinct vectors \(\mathbf{X}\) and \(\mathbf{Y}\), the cross-covariance matrix \(\mathbb{C}(\mathbf{X}, \mathbf{Y})\) is generally not symmetric.
  2. Relationship with Joint Covariance Matrix:
    • If \(\mathbf{Z} = \begin{bmatrix} \mathbf{X} \\ \mathbf{Y} \end{bmatrix}\) is a combined random vector, then the variance-covariance matrix of \(\mathbf{Z}\) is: \[ \mathbb{V}(\mathbf{Z}) = \begin{bmatrix} \mathbb{V}(\mathbf{X}) & \mathbb{C}(\mathbf{X}, \mathbf{Y}) \\ \mathbb{C}(\mathbf{Y}, \mathbf{X}) & \mathbb{V}(\mathbf{Y}) \end{bmatrix}. \]
  3. Linearity:
    • If \(a\) and \(b\) are constants and \(\mathbf{X}_1\) and \(\mathbf{X}_2\) are random vectors of the same dimension as \(\mathbf{X}\), then: \[ \mathbb{C}(a\mathbf{X}_1 + b\mathbf{X}_2, \mathbf{Y}) = a \mathbb{C}(\mathbf{X}_1, \mathbf{Y}) + b \mathbb{C}(\mathbf{X}_2, \mathbf{Y}). \]

Furthermore, if \(\mathbf{A}\) and \(\mathbf{B}\) are matrices of compatible dimensions, and \(\mathbf{X}\) and \(\mathbf{Y}\) are random vectors then the cross-covariance of \(\mathbf{A}\mathbf{X}\) and \(\mathbf{B}\mathbf{Y}\) is given by:

\[ \mathbb{C}(\mathbf{A}\mathbf{X}, \mathbf{B}\mathbf{Y}) = \mathbf{A} \mathbb{C}(\mathbf{X}, \mathbf{Y}) \mathbf{B}^T. \]

  4. Zero Cross-Covariance and Independence:
    • If \(\mathbb{C}(\mathbf{X}, \mathbf{Y}) = \mathbf{0}\), it implies that \(\mathbf{X}\) and \(\mathbf{Y}\) are uncorrelated, but it does not necessarily imply independence unless \(\mathbf{X}\) and \(\mathbf{Y}\) are jointly normally distributed.

2.4.4 Multivariate Normal Distribution

The multivariate normal distribution generalizes the concept of the normal distribution to multiple dimensions, describing the behavior of random vectors whose elements are jointly normally distributed. It is widely used in statistics and machine learning due to its well-behaved properties and its applicability to modeling correlations between variables.

2.4.4.1 Definition

A random vector \(\mathbf{X} = (X_1, X_2, \dots, X_n)^T\) is said to follow a multivariate normal distribution if it has a probability density function of the form:

\[ f(\mathbf{X}) = \frac{1}{(2\pi)^{n/2} |\boldsymbol{\Sigma}|^{1/2}} \exp\left( -\frac{1}{2} (\mathbf{X} - \boldsymbol{\mu})^T \boldsymbol{\Sigma}^{-1} (\mathbf{X} - \boldsymbol{\mu}) \right), \]

where:

  • \(\mathbf{X} \in \mathbb{R}^n\) is the random vector,
  • \(\boldsymbol{\mu} \in \mathbb{R}^n\) is the mean vector,
  • \(\boldsymbol{\Sigma} \in \mathbb{R}^{n \times n}\) is the covariance matrix, assumed to be symmetric and positive definite.

This distribution is denoted as \(\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\).

2.4.4.2 Key Properties

  1. Marginal Distributions:
    • Any subset of the components of \(\mathbf{X}\) is also normally distributed.
    • If \(\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\), then any partition of \(\mathbf{X}\) results in a marginal distribution that is also multivariate normal.
  2. Affine Transformation:
    • For a matrix \(\mathbf{A}\) and vector \(\mathbf{b}\) of appropriate dimensions, the affine transformation \(\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b}\) also follows a multivariate normal distribution, \(\mathbf{Y} \sim \mathcal{N}(\mathbf{A} \boldsymbol{\mu} + \mathbf{b}, \mathbf{A} \boldsymbol{\Sigma} \mathbf{A}^T)\).
  3. Conditional Distributions:
    • For a partitioned vector \(\mathbf{X} = (\mathbf{X}_1, \mathbf{X}_2)^T\), with corresponding partitions of the mean vector and covariance matrix: \[ \boldsymbol{\mu} = \begin{pmatrix} \boldsymbol{\mu}_1 \\ \boldsymbol{\mu}_2 \end{pmatrix}, \quad \boldsymbol{\Sigma} = \begin{pmatrix} \boldsymbol{\Sigma}_{11} & \boldsymbol{\Sigma}_{12} \\ \boldsymbol{\Sigma}_{21} & \boldsymbol{\Sigma}_{22} \end{pmatrix}, \]
    • the conditional distribution \(\mathbf{X}_1 | \mathbf{X}_2 = \mathbf{x}_2\) is also multivariate normal: \[ \mathbf{X}_1 | \mathbf{X}_2 = \mathbf{x}_2 \sim \mathcal{N}(\boldsymbol{\mu}_1 + \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} (\mathbf{x}_2 - \boldsymbol{\mu}_2), \boldsymbol{\Sigma}_{11} - \boldsymbol{\Sigma}_{12} \boldsymbol{\Sigma}_{22}^{-1} \boldsymbol{\Sigma}_{21}). \]
  4. Independence and Uncorrelatedness:
    • For a multivariate normal distribution, zero covariance implies independence. If two components \(X_i\) and \(X_j\) (or two subvectors) have a covariance of zero, they are independent.
  5. Moment Generating Function:
    • The moment generating function of \(\mathbf{X} \sim \mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})\) is: \[ M_{\mathbf{X}}(\mathbf{t}) = \exp\left( \mathbf{t}^T \boldsymbol{\mu} + \frac{1}{2} \mathbf{t}^T \boldsymbol{\Sigma} \mathbf{t} \right). \]
  6. Maximum Entropy:
    • Among all distributions with a given mean vector and covariance matrix, the multivariate normal distribution has the maximum entropy, making it the most “uninformative” or spread out.
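
As an illustrative check of the affine transformation property (property 2 above), the sketch below simulates draws from a multivariate normal and compares the empirical mean and covariance of \(\mathbf{Y} = \mathbf{A}\mathbf{X} + \mathbf{b}\) with \(\mathbf{A}\boldsymbol{\mu} + \mathbf{b}\) and \(\mathbf{A}\boldsymbol{\Sigma}\mathbf{A}^T\) (the particular \(\boldsymbol{\mu}\), \(\boldsymbol{\Sigma}\), \(\mathbf{A}\), and \(\mathbf{b}\) are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(7)
mu = np.array([1.0, -2.0, 0.5])
Sigma = np.array([[2.0, 0.6, 0.3],
                  [0.6, 1.0, 0.2],
                  [0.3, 0.2, 1.5]])

A = np.array([[1.0, 0.0, 2.0],
              [0.0, 1.0, -1.0]])
b = np.array([0.5, 1.0])

X = rng.multivariate_normal(mu, Sigma, size=500_000)
Y = X @ A.T + b                      # Y = A X + b, one row per draw

print(np.allclose(Y.mean(axis=0), A @ mu + b, atol=0.05))   # mean: A mu + b
print(np.allclose(np.cov(Y.T), A @ Sigma @ A.T, atol=0.1))  # cov:  A Sigma A'
```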

2.4.5 \(\chi^2\) Distribution

The chi-squared distribution is a probability distribution that arises naturally in statistics, particularly in hypothesis testing and estimation problems. It is defined as the distribution of a sum of the squares of independent standard normal random variables.

Formally, let \(Z_1, Z_2, \dots, Z_k\) be \(k\) independent random variables, each following a standard normal distribution (mean \(0\) and variance \(1\)). Then, the sum of their squares:

\[ Q = Z_1^2 + Z_2^2 + \cdots + Z_k^2 \]

follows a chi-squared distribution with \(k\) degrees of freedom. We denote this as:

\[ Q \sim \chi^2_k \]

2.4.5.1 Key Properties of the Chi-Squared Distribution

  1. Degrees of Freedom (\(k\)): The parameter \(k\) determines the shape of the distribution. Larger \(k\) values result in a distribution that becomes more symmetric and approaches a normal distribution (as \(k \to \infty\)).

  2. Mean and Variance:

    • Mean: \(\mathbb{E}[Q] = k\)
    • Variance: \(\mathbb{V}[Q] = 2k\)
  3. Additivity: If \(Q_1 \sim \chi^2_{k_1}\) and \(Q_2 \sim \chi^2_{k_2}\) are independent, then \(Q_1 + Q_2 \sim \chi^2_{k_1 + k_2}\).

  4. Special Case: For \(k = 1\), the chi-squared distribution is simply the square of a standard normal variable, \(Q = Z^2\).

The chi-squared distribution is widely used in statistical tests, such as the chi-squared test for independence or goodness-of-fit, and in the construction of confidence intervals for variances.
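
A quick simulation of this construction (sums of squared standard normals), checking the stated mean and variance:

```python
import numpy as np

rng = np.random.default_rng(8)
k, draws = 5, 500_000

Z = rng.standard_normal(size=(draws, k))
Q = np.sum(Z**2, axis=1)          # each row gives one chi-squared(k) draw

print(Q.mean())   # approximately k = 5
print(Q.var())    # approximately 2k = 10
```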

2.4.6 \(t\) Distribution

The t-distribution, also known as Student’s t-distribution, is a probability distribution that arises frequently in hypothesis testing, particularly when the sample size is small, and the population standard deviation is unknown. It is defined as the distribution of a standard normal random variable divided by the square root of an independent chi-squared random variable, scaled by its degrees of freedom.

2.4.6.1 Definition

Let:

  • \(Z\) be a standard normal random variable (\(Z \sim N(0, 1)\)),
  • \(Q\) be a chi-squared random variable with \(k\) degrees of freedom (\(Q \sim \chi^2_k\)), independent of \(Z\).

Then, the random variable:

\[ T = \frac{Z}{\sqrt{\frac{Q}{k}}} \]

follows a t-distribution with \(k\) degrees of freedom. We write:

\[ T \sim t_k \]

2.4.6.2 Key Properties of the t-Distribution

  1. Degrees of Freedom (\(k\)): The parameter \(k\) determines the shape of the t-distribution. As \(k\) increases, the t-distribution approaches the standard normal distribution.

  2. Symmetry: The t-distribution is symmetric about zero, similar to the normal distribution.

  3. Tails: The t-distribution has heavier tails than the normal distribution, meaning it gives more probability to extreme values. This reflects increased uncertainty when estimating the population mean with small sample sizes.

  4. Moments:

    • Mean: \(\mathbb{E}[T] = 0\) for \(k > 1\)
    • Variance: \(\mathbb{V}[T] = \frac{k}{k-2}\) for \(k > 2\)
    • Higher moments exist only for \(k > m\), where \(m\) is the order of the moment.
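
A brief simulation of the construction \(T = Z / \sqrt{Q/k}\), checking the mean and the variance \(k/(k-2)\):

```python
import numpy as np

rng = np.random.default_rng(9)
k, draws = 10, 500_000

Z = rng.standard_normal(draws)
Q = np.sum(rng.standard_normal(size=(draws, k))**2, axis=1)  # chi-squared(k)
T = Z / np.sqrt(Q / k)

print(T.mean())                 # approximately 0
print(T.var(), k / (k - 2))     # both approximately 1.25
```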

2.4.6.3 Applications

The t-distribution is widely used in:

  1. t-tests for hypothesis testing, such as testing means of small samples.
  2. Constructing confidence intervals for a population mean when the population standard deviation is unknown.
  3. Regression analysis, where it appears in tests for regression coefficients.

The t-distribution plays a fundamental role in statistics, bridging the gap between small-sample and large-sample inference.

2.4.7 \(F\) Distribution

The F-distribution, also known as Fisher-Snedecor distribution, arises frequently in statistics, especially in hypothesis testing and variance analysis (ANOVA). It is defined as the distribution of the ratio of two independent chi-squared random variables, each divided by their respective degrees of freedom.

2.4.7.1 Definition

Let:

  • \(Q_1 \sim \chi^2_{k_1}\), a chi-squared random variable with \(k_1\) degrees of freedom,
  • \(Q_2 \sim \chi^2_{k_2}\), a chi-squared random variable with \(k_2\) degrees of freedom,
  • \(Q_1\) and \(Q_2\) are independent.

Then, the random variable:

\[ F = \frac{\frac{Q_1}{k_1}}{\frac{Q_2}{k_2}} \]

follows an F-distribution with \(k_1\) and \(k_2\) degrees of freedom. We write:

\[ F \sim F(k_1, k_2) \]

2.4.7.2 Key Properties of the F-Distribution

  1. Degrees of Freedom: The parameters \(k_1\) and \(k_2\) determine the shape of the F-distribution.

    • \(k_1\) (numerator degrees of freedom) is associated with the variability of the first chi-squared variable.
    • \(k_2\) (denominator degrees of freedom) is associated with the variability of the second chi-squared variable.
  2. Support: \(F\) is defined for \(F \geq 0\).

  3. Asymmetry: The F-distribution is not symmetric; it is skewed to the right, especially for small degrees of freedom. As \(k_1\) and \(k_2\) increase, the distribution becomes more symmetric and concentrates around 1.

  4. Mean:

    • \(\mathbb{E}[F] = \frac{k_2}{k_2 - 2}\) for \(k_2 > 2\).
  5. Variance:

    • \(\mathbb{V}[F] = \frac{2k_2^2 (k_1 + k_2 - 2)}{k_1 (k_2 - 2)^2 (k_2 - 4)}\) for \(k_2 > 4\).
  6. Special Cases:

    • When \(k_1 = 1\), the F-distribution is equivalent to the square of a t-distribution with \(k_2\) degrees of freedom: \(F(1, k_2) = t_{k_2}^2\).
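
A brief simulation of the ratio construction, checking the support and the mean \(k_2/(k_2 - 2)\):

```python
import numpy as np

rng = np.random.default_rng(10)
k1, k2, draws = 4, 12, 200_000

Q1 = np.sum(rng.standard_normal(size=(draws, k1))**2, axis=1)   # chi-squared(k1)
Q2 = np.sum(rng.standard_normal(size=(draws, k2))**2, axis=1)   # chi-squared(k2)
F = (Q1 / k1) / (Q2 / k2)

print(F.min() >= 0)              # True: support is [0, infinity)
print(F.mean(), k2 / (k2 - 2))   # both approximately 1.2
```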

2.4.7.3 Applications

  1. ANOVA (Analysis of Variance): The F-statistic is used to test whether several groups share the same mean, by comparing between-group and within-group variability.
  2. Model Comparison: In regression, the F-distribution is used to compare nested models, assessing whether additional predictors improve the fit of the model.
  3. Hypothesis Testing: The F-distribution appears in tests of equality of variances (Levene’s test or Bartlett’s test).

The F-distribution is essential in statistics for comparing variances and testing model adequacy, making it a cornerstone of many inferential procedures.

2.5 Statistics

Essential statistical concepts include:

2.5.1 Bias of an Estimator

The bias of an estimator \(\hat{\mathbf{\theta}}\) for a vector parameter \(\mathbf{\theta}\) is defined as the difference between the expected value of the estimator and the true value of the parameter. Formally, if \(\mathbf{\theta} \in \mathbb{R}^n\) is a vector parameter, then the bias of the estimator \(\hat{\mathbf{\theta}}\) is:

\[ \text{Bias}(\hat{\mathbf{\theta}}) = \mathbb{E}[\hat{\mathbf{\theta}}] - \mathbf{\theta}. \]

2.5.1.1 Key Points

  1. Interpretation: Bias measures how far, on average, the estimator \(\hat{\mathbf{\theta}}\) is from the true parameter \(\mathbf{\theta}\). If the bias is zero, the estimator is said to be unbiased.

  2. Component-wise Bias: For each component of the estimator \(\hat{\mathbf{\theta}} = (\hat{\theta}_1, \dots, \hat{\theta}_n)^T\), the bias can be expressed as: \[ \text{Bias}(\hat{\theta}_i) = \mathbb{E}[\hat{\theta}_i] - \theta_i, \quad \text{for each } i = 1, \dots, n. \] This component-wise breakdown is helpful for understanding how each part of the vector estimator deviates from the true values.

  3. Total Bias: The total bias can be viewed as a vector, summarizing the systematic error in each dimension of the estimator.

2.5.1.2 Example

In practice, if \(\hat{\mathbf{\theta}}\) is the sample mean vector of a random vector \(\mathbf{X}\), and \(\mathbf{\theta} = \mathbb{E}[\mathbf{X}]\), then \(\text{Bias}(\hat{\mathbf{\theta}})\) will be zero, making the sample mean an unbiased estimator of the population mean.

2.5.2 Unbiased Estimator

An unbiased estimator of a vector parameter is an estimator that, on average, accurately estimates the true value of the parameter. Formally, let \(\mathbf{\theta} \in \mathbb{R}^n\) be a vector parameter of interest, and let \(\hat{\mathbf{\theta}}\) be an estimator of \(\mathbf{\theta}\). Then, \(\hat{\mathbf{\theta}}\) is an unbiased estimator of \(\mathbf{\theta}\) if the following condition holds:

\[ \mathbb{E}[\hat{\mathbf{\theta}}] = \mathbf{\theta} \quad \forall \boldsymbol{\theta}\in \mathbb{R}^n. \]

2.5.2.1 Key Points

  1. Component-wise Unbiasedness: Each component of \(\hat{\mathbf{\theta}}\), say \(\hat{\theta}_i\), must be an unbiased estimator of the corresponding component \(\theta_i\) of \(\mathbf{\theta}\): \[ \mathbb{E}[\hat{\theta}_i] = \theta_i, \quad \text{for all } i = 1, \dots, n. \]

  2. No Systematic Bias: Unbiasedness implies that, on average, the estimator neither overestimates nor underestimates the true value of \(\mathbf{\theta}\) across repeated sampling.

  3. Applications: Unbiased estimators are crucial in statistics, as they provide estimations that are theoretically centered around the true parameter values, though they may vary due to sampling variation.

An example is the sample mean \(\mathbf{\bar{X}}\) as an estimator for the population mean \(\mathbf{\mu}\) of a random vector \(\mathbf{X}\), where \(\mathbb{E}[\mathbf{\bar{X}}] = \mathbf{\mu}\).

2.5.3 Mean Square Error of an Estimator

The Mean Square Error (MSE) of an estimator \(\hat{\mathbf{\theta}}\) for a vector parameter \(\mathbf{\theta}\) measures the average squared deviation of the estimator from the true parameter. Formally, for a parameter vector \(\mathbf{\theta} \in \mathbb{R}^n\) and an estimator \(\hat{\mathbf{\theta}}\), the MSE is defined as:

\[ \mathbb{M}(\hat{\mathbf{\theta}}) = \mathbb{E} \left[ \|\hat{\mathbf{\theta}} - \mathbf{\theta}\|^2 \right], \]

where \(\|\cdot\|\) denotes the Euclidean (or 2-norm) of a vector. Expanding this expression, we can write:

\[ \mathbb{M}(\hat{\mathbf{\theta}}) = \mathbb{E} \left[ (\hat{\mathbf{\theta}} - \mathbf{\theta})^T (\hat{\mathbf{\theta}} - \mathbf{\theta}) \right]. \]

2.5.3.1 Key Components

The MSE can be decomposed into two main parts: variance and squared bias.

\[ \mathbb{M}(\hat{\mathbf{\theta}}) = \text{tr}(\text{Var}(\hat{\mathbf{\theta}})) + \|\text{Bias}(\hat{\mathbf{\theta}})\|^2, \]

where:

  • Variance: \(\text{Var}(\hat{\mathbf{\theta}}) = \mathbb{E}[(\hat{\mathbf{\theta}} - \mathbb{E}[\hat{\mathbf{\theta}}])(\hat{\mathbf{\theta}} - \mathbb{E}[\hat{\mathbf{\theta}}])^T]\), representing the variability in the estimator.
  • Bias: \(\text{Bias}(\hat{\mathbf{\theta}}) = \mathbb{E}[\hat{\mathbf{\theta}}] - \mathbf{\theta}\), representing the systematic deviation from the true parameter.

2.5.3.2 Key Properties

  1. Unbiased Estimator: If \(\hat{\mathbf{\theta}}\) is unbiased, then \(\text{Bias}(\hat{\mathbf{\theta}}) = 0\), and the MSE is simply the trace of the covariance matrix of \(\hat{\mathbf{\theta}}\): \[ \mathbb{M}(\hat{\mathbf{\theta}}) = \text{tr}(\text{Var}(\hat{\mathbf{\theta}})). \]

  2. Bias-Variance Trade-off: The MSE quantifies the balance between bias and variance. Lowering bias often increases variance and vice versa, a key concept in statistical estimation.

  3. Overall Error Metric: The MSE provides a comprehensive measure of the estimator’s accuracy, taking both variance and bias into account, making it an essential criterion in evaluating estimators.
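
The decomposition \(\mathbb{M}(\hat{\mathbf{\theta}}) = \text{tr}(\text{Var}(\hat{\mathbf{\theta}})) + \|\text{Bias}(\hat{\mathbf{\theta}})\|^2\) can be illustrated by simulation; the sketch below compares the sample mean (unbiased) with a deliberately shrunk, and therefore biased, version of it (all names and settings are arbitrary and for illustration only):

```python
import numpy as np

rng = np.random.default_rng(11)
theta = np.array([1.0, -2.0, 0.5])      # true mean vector
n, reps = 30, 50_000

# Each replication: draw a sample of size n and compute two estimators.
X = rng.normal(loc=theta, scale=1.0, size=(reps, n, 3))
theta_hat = X.mean(axis=1)              # sample mean (unbiased)
theta_shrunk = 0.8 * theta_hat          # shrunk estimator (biased)

def mse_and_decomposition(est):
    """Monte Carlo MSE and its variance-plus-squared-bias decomposition."""
    mse = np.mean(np.sum((est - theta)**2, axis=1))
    bias = est.mean(axis=0) - theta
    var_tr = np.trace(np.cov(est.T))
    return mse, var_tr + np.sum(bias**2)

# Sample mean: both numbers ~0.10 (unbiased, so MSE = tr(Var) = 3/30)
print(mse_and_decomposition(theta_hat))
# Shrunk estimator: both numbers ~0.27 (variance ~0.064 plus squared bias ~0.21)
print(mse_and_decomposition(theta_shrunk))
```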