8 Mean and Variance Assumptions
So far we have not made any probabilistic assumptions about the elements of our model: the errors have been treated as unknown but non-random quantities.
Now we will assume that the errors are random variables. However, we will not specify their complete distribution; we will limit ourselves to making assumptions about the mean and variance of the errors.
8.1 Mean Assumptions
We will begin by making an assumption about the mean of the errors. Specifically, we will assume that:
\[ \mathbb{E}[e_i] = 0 \in \mathbb{R}\quad \forall i\in\{1,\ldots,n\}\]
This can be expressed in vector form as follows:
\[ \mathbb{E}[\mathbf{e}] = \mathbf{0}\in \mathbb{R}^n\]
Note that \(\mathbf{X}\) is a known constant and \(\boldsymbol{\beta}\) is an unknown constant (an unknown parameter). This implies that \(\mathbf{y}\) is a random vector, and therefore any function of \(\mathbf{y}\) is a random variable. In particular, our estimates:
\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \mathbf{y}\] \[ \hat{\mathbf{y}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \mathbf{y}= \mathbf{H}\mathbf{y}\] \[ \hat{\mathbf{e}} = \mathbf{y}- \hat{\mathbf{y}} = (\mathbf{I}- \mathbf{H})\mathbf{y}\]
are random vectors. We can then try to compute the mean of these quantities. This should be possible since all three estimates are linear combinations of \(\mathbf{y}\) and we can compute the mean of \(\mathbf{y}\).
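To make these quantities concrete, here is a minimal NumPy sketch that computes \(\hat{\boldsymbol{\beta}}\), \(\hat{\mathbf{y}}\), and \(\hat{\mathbf{e}}\) directly from the formulas above. The sample size, design matrix, coefficients, and (normal) error distribution are arbitrary illustrative choices, not part of the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3                        # arbitrary sample size and number of columns
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])  # design with intercept
beta = np.array([1.0, 2.0, -0.5])   # "true" coefficients (unknown in practice)
e = rng.normal(scale=1.5, size=n)   # errors with mean 0 and common variance
y = X @ beta + e

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y        # (X'X)^{-1} X' y
H = X @ XtX_inv @ X.T               # hat matrix
y_hat = H @ y                       # fitted values
e_hat = y - y_hat                   # residuals, (I - H) y

print(beta_hat)
```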
8.1.1 Expectation of \(\mathbf{y}\)
\[\begin{align*} \mathbb{E}[\mathbf{y}] &= \mathbb{E}[\mathbf{X}\boldsymbol{\beta}+ \mathbf{e}] \\ &= \mathbf{X}\boldsymbol{\beta}+ \mathbb{E}[\mathbf{e}] \\ &= \mathbf{X}\boldsymbol{\beta} \end{align*}\]
8.1.2 Expectation of \(\hat{\boldsymbol{\beta}}\)
\[\begin{align*} \mathbb{E}[\hat{\boldsymbol{\beta}}] &= \mathbb{E}[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \mathbf{y}] \\ &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbb{E}[\mathbf{y}] \\ &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \mathbf{X}\boldsymbol{\beta}\\ &= \boldsymbol{\beta}\\ \end{align*}\]
So, \(\hat{\boldsymbol{\beta}}\) is an unbiased estimator of \(\boldsymbol{\beta}\).
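Unbiasedness can be illustrated by simulation: keeping \(\mathbf{X}\) and \(\boldsymbol{\beta}\) fixed and redrawing the errors many times, the average of the estimates should be close to \(\boldsymbol{\beta}\). A small sketch, where the design, coefficients, and (normal) error distribution are arbitrary choices of mine:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p, sigma = 40, 3, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -0.5])
XtX_inv_Xt = np.linalg.inv(X.T @ X) @ X.T

# Average beta_hat over many independent error draws (X and beta held fixed).
reps = 20_000
estimates = np.empty((reps, p))
for r in range(reps):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    estimates[r] = XtX_inv_Xt @ y

print(estimates.mean(axis=0))  # close to beta = [1, 2, -0.5]
```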
8.1.3 Expectation of \(\hat{\mathbf{y}}\)
\[\begin{align*} \mathbb{E}[\hat{\mathbf{y}}] &= \mathbb{E}[\mathbf{X}\hat{\boldsymbol{\beta}}] \\ &= \mathbf{X}\mathbb{E}[\hat{\boldsymbol{\beta}}] \\ &= \mathbf{X}\boldsymbol{\beta}\\ &= \mathbb{E}[\mathbf{y}] \\ \end{align*}\]
So \(\mathbf{y}\) and \(\hat{\mathbf{y}}\) have the same mean.
8.1.4 Expectation of \(\hat{\mathbf{e}}\)
\[\begin{align*} \mathbb{E}[\hat{\mathbf{e}}] &= \mathbb{E}[\mathbf{y}- \hat{\mathbf{y}}] \\ &= \mathbb{E}[\mathbf{y}] - \mathbb{E}[\hat{\mathbf{y}}] \\ &= \mathbf{0} \end{align*}\]
So the residuals \(\hat{\mathbf{e}}\) have mean \(\mathbf{0}\), just like the errors \(\mathbf{e}\).
Without further assumptions, we cannot obtain many additional results. Next, we move to assumptions on the variance of the errors.
8.2 Variance Assumptions
While the assumptions on the mean are assumptions about the first moment of the errors, now we will make assumptions about the second moment of the errors. In particular, we will assume that all the errors have the same (finite) variance and are uncorrelated. That is:
\[ \mathbb{V}[e_i] = \sigma^2 < \infty \quad \forall i \in \{1,\ldots,n\}, \quad \text{and} \quad \mathbb{C}[e_i, e_j] = 0 \quad \forall i \neq j\]
We can express this assumption in vector form as:
\[ \mathbb{V}[\mathbf{e}] = \sigma^2 \mathbf{I}\quad \sigma^2 < \infty \]
As we did before, we can try to compute the variance of all the random quantities we have.
8.2.1 Variance of \(\mathbf{y}\)
\[\begin{align*} \mathbb{V}[\mathbf{y}] &= \mathbb{V}[\mathbf{X}\boldsymbol{\beta}+ \mathbf{e}] \\ &= \mathbb{V}[\mathbf{e}] \\ &= \sigma^2 \mathbf{I} \end{align*}\]
So, the observations \(\mathbf{y}\) and the errors \(\mathbf{e}\) have the same variance. This makes sense since they only differ by a non-random element \(\mathbf{X}\boldsymbol{\beta}\).
8.2.2 Variance of \(\hat{\boldsymbol{\beta}}\)
\[\begin{align*} \mathbb{V}[\hat{\boldsymbol{\beta}}] &= \mathbb{V}[(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \mathbf{y}] \\ &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \mathbb{V}[\mathbf{y}] \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \\ &= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' (\sigma^2 \mathbf{I}) \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}'\mathbf{X})^{-1} \\ \end{align*}\]
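The formula \(\mathbb{V}[\hat{\boldsymbol{\beta}}] = \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}\) can be checked by simulation: redraw the errors many times, keep \(\mathbf{X}\) fixed, and compare the sample covariance of the estimates to the formula. A minimal sketch with arbitrary simulation settings of my own choosing:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma = 40, 3, 2.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, 2.0, -0.5])
XtX_inv = np.linalg.inv(X.T @ X)

reps = 50_000
estimates = np.empty((reps, p))
for r in range(reps):
    y = X @ beta + rng.normal(scale=sigma, size=n)
    estimates[r] = XtX_inv @ X.T @ y

empirical = np.cov(estimates, rowvar=False)   # sample covariance of the beta_hat draws
theoretical = sigma**2 * XtX_inv
print(np.max(np.abs(empirical - theoretical)))  # small, shrinks as reps grows
```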
8.2.3 Variance of \(\hat{\mathbf{y}}\)
\[\begin{align*} \mathbb{V}[\hat{\mathbf{y}}] &= \mathbb{V}[\mathbf{H}\mathbf{y}] \\ &= \mathbf{H}\mathbb{V}[\mathbf{y}] \mathbf{H}' \\ &= \mathbf{H}(\sigma^2 \mathbf{I}) \mathbf{H}& \text{since $\mathbf{H}$ is symmetric} \\ &= \sigma^2 \mathbf{H}\mathbf{H}\\ &= \sigma^2 \mathbf{H}& \text{since $\mathbf{H}$ is idempotent} \\ \end{align*}\]
So, while \(\mathbf{y}\) and \(\hat{\mathbf{y}}\) have the same mean, they do not have the same variance. We will see that the variance of \(\hat{\mathbf{y}}\) is “smaller” than the variance of \(\mathbf{y}\). I put “smaller” in quotes since for matrices it is not immediately clear what it means for one to be smaller or bigger than another.
Note, for now, that the diagonal elements of these matrices satisfy the following:
\[[\mathbb{V}[\hat{\mathbf{y}}]]_{ii} = \sigma^2 h_{ii} \leq \sigma^2 = [\mathbb{V}[\mathbf{y}]]_{ii}\] since the leverages \(h_{ii}\) satisfy \(0 \leq h_{ii} \leq 1\).
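A quick numerical check of this inequality; the design matrix and the value of \(\sigma^2\) below are arbitrary choices for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma = 30, 4, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
H = X @ np.linalg.inv(X.T @ X) @ X.T

h = np.diag(H)                           # leverages h_ii
print(h.min() >= 0, h.max() <= 1)        # True True: 0 <= h_ii <= 1
print(np.all(sigma**2 * h <= sigma**2))  # True: Var[y_hat_i] <= Var[y_i]
print(np.isclose(h.sum(), p))            # True: trace(H) = p (used again when estimating sigma^2)
```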
8.2.4 Variance of \(\hat{\mathbf{e}}\)
\[\begin{align*} \mathbb{V}[\hat{\mathbf{e}}] &= \mathbb{V}[(\mathbf{I}- \mathbf{H}) \mathbf{y}] \\ &= (\mathbf{I}- \mathbf{H}) \mathbb{V}[\mathbf{y}] (\mathbf{I}- \mathbf{H})' \\ &= (\mathbf{I}- \mathbf{H}) (\sigma^2 \mathbf{I}) (\mathbf{I}- \mathbf{H}) & \text{since $(\mathbf{I}- \mathbf{H})$ is symmetric} \\ &= \sigma^2 (\mathbf{I}- \mathbf{H})(\mathbf{I}- \mathbf{H}) \\ &= \sigma^2 (\mathbf{I}- \mathbf{H}) & \text{since $(\mathbf{I}- \mathbf{H})$ is idempotent} \\ \end{align*}\]
Then we can see that \(\mathbf{y}\), \(\hat{\mathbf{y}}\) and \(\hat{\mathbf{e}}\) have variances that are idempotent matrices multiplied by the scalar \(\sigma^2\).
8.3 Cross-Covariances
Now that we have introduced the variance assumption, we can compute the cross-covariances of our estimators. This will help us later in the course.
There are several cross-covariances we could compute, but for now I will only check two.
8.3.1 Cross-covariance of \(\hat{\mathbf{y}}\) and \(\hat{\mathbf{e}}\)
\[\begin{align*} \mathbb{C}[\hat{\mathbf{y}}, \hat{\mathbf{e}}] &=\mathbb{C}[\mathbf{H}\mathbf{y}, (\mathbf{I}- \mathbf{H})\mathbf{y}] \\ &=\mathbf{H}\mathbb{C}[\mathbf{y},\mathbf{y}] (\mathbf{I}- \mathbf{H})' \\ &=\mathbf{H}\mathbb{V}[\mathbf{y}] (\mathbf{I}- \mathbf{H}) & \text{since $\mathbf{I}- \mathbf{H}$ is symmetric and $\mathbb{C}[\mathbf{y},\mathbf{y}] = \mathbb{V}[\mathbf{y}]$} \\ &=\mathbf{H}(\sigma^2 \mathbf{I}) (\mathbf{I}- \mathbf{H}) & \text{since $\mathbb{V}[\mathbf{y}] = \sigma^2 \mathbf{I}$} \\ &=\sigma^2 \mathbf{H}(\mathbf{I}- \mathbf{H}) \\ &=\sigma^2 (\mathbf{H}- \mathbf{H}\mathbf{H}) \\ &=\sigma^2 (\mathbf{H}- \mathbf{H}) & \text{since $\mathbf{H}$ is idempotent} \\ &=\mathbf{0}\in \mathbb{R}^{n \times n} \\ \end{align*}\]
Then, the residuals \(\hat{\mathbf{e}}\) and the fitted values \(\hat{\mathbf{y}}\) are uncorrelated. We obtained a similar result before, although it is not the same (back then we could not even have talked about expectations and covariances, since we had no random variables). That result was:
\[ \hat{\mathbf{e}}'\hat{\mathbf{y}} = 0 \in \mathbb{R}\]
Notice the difference in dimensions: the cross-covariance above is the \(n \times n\) zero matrix, while the earlier result is the scalar zero.
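Both facts are easy to check numerically: the orthogonality \(\hat{\mathbf{e}}'\hat{\mathbf{y}} = 0\) holds exactly in every sample (up to floating point), while the zero cross-covariance is a statement about repeated sampling. A sketch with arbitrary simulation settings:

```python
import numpy as np

rng = np.random.default_rng(4)
n, p, sigma = 25, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = np.array([1.0, -1.0, 0.5])
H = X @ np.linalg.inv(X.T @ X) @ X.T

# Exact orthogonality in a single sample: e_hat' y_hat = 0 (a scalar).
y = X @ beta + rng.normal(scale=sigma, size=n)
y_hat, e_hat = H @ y, (np.eye(n) - H) @ y
print(e_hat @ y_hat)  # ~0 up to floating point

# Zero cross-covariance across repeated samples: an n x n matrix of (near) zeros.
reps = 20_000
Y = X @ beta + rng.normal(scale=sigma, size=(reps, n))   # each row is one sample of y
Yhat, Ehat = Y @ H.T, Y @ (np.eye(n) - H).T
cross = (Yhat - Yhat.mean(0)).T @ (Ehat - Ehat.mean(0)) / (reps - 1)
print(np.max(np.abs(cross)))  # small, shrinks as reps grows
```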
8.3.2 Cross-covariance of \(\hat{\mathbf{e}}\) and \(\hat{\boldsymbol{\beta}}\)
\[\begin{align*} \mathbb{C}[\hat{\mathbf{e}}, \hat{\boldsymbol{\beta}}] &= \mathbb{C}[(\mathbf{I}- \mathbf{H}) \mathbf{y}, (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}' \mathbf{y}] \\ &= (\mathbf{I}- \mathbf{H}) \mathbb{C}[\mathbf{y}, \mathbf{y}] \mathbf{X}(\mathbf{X}' \mathbf{X})^{-1} \\ &= (\mathbf{I}- \mathbf{H}) \mathbb{V}[\mathbf{y}] \mathbf{X}(\mathbf{X}' \mathbf{X})^{-1} \\ &= (\mathbf{I}- \mathbf{H}) (\sigma^2 \mathbf{I}) \mathbf{X}(\mathbf{X}' \mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{I}- \mathbf{H}) \mathbf{X}(\mathbf{X}' \mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{I}- \mathbf{X}(\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}') \mathbf{X}(\mathbf{X}' \mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}- \mathbf{X}(\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}'\mathbf{X})(\mathbf{X}' \mathbf{X})^{-1} \\ &= \sigma^2 (\mathbf{X}- \mathbf{X})(\mathbf{X}' \mathbf{X})^{-1} \\ &= \mathbf{0}\in \mathbb{R}^{n \times p} \\ \end{align*}\]
So the residuals and the estimate of \(\boldsymbol{\beta}\) are uncorrelated. We will use this result in the next chapter.
8.4 Gauss-Markov Theorem
This theorem justifies the use of the Ordinary Least Squares (OLS) estimator, since it is the “best” estimator in a certain sense.
8.4.1 Assumptions
The theorem assumes that:
- The errors have the same finite variance.
- The errors are uncorrelated.
These assumptions are equivalent to our earlier assumption:
\[ \mathbb{V}[\mathbf{e}] = \sigma^2 \mathbf{I}\quad \sigma^2 < \infty \]
8.4.2 Statement
In the linear regression model, the OLS estimator \(\hat{\boldsymbol{\beta}} = (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}' \mathbf{y}\) has the smallest variance among all unbiased linear estimators.
That is, if \(\tilde{\boldsymbol{\beta}}\) is any unbiased linear estimator, then:
\[ \mathbb{V}[\tilde{\boldsymbol{\beta}}] - \mathbb{V}[\hat{\boldsymbol{\beta}}] \]
is positive semi-definite.
Because of this, the OLS estimator is called the Best Linear Unbiased Estimator (BLUE), where “best” means smallest variance.
8.4.3 Proof
Let \(\tilde{\boldsymbol{\beta}}\) be an unbiased linear estimator of \(\boldsymbol{\beta}\). Then, we can write \(\tilde{\boldsymbol{\beta}}\) as follows:
\[\tilde{\boldsymbol{\beta}} = \mathbf{A}\mathbf{y}\]
where \(\mathbf{A}\) is a constant matrix of the appropriate dimensions. We can then rewrite \(\tilde{\boldsymbol{\beta}}\) as follows:
\[\begin{align*} \tilde{\boldsymbol{\beta}} &=(\mathbf{A}- (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}' + (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}') \mathbf{y}\\ &=(\mathbf{A}- (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}')\mathbf{y}+ (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}' \mathbf{y}\\ &=\mathbf{D}\mathbf{y}+ \hat{\boldsymbol{\beta}} & \text{with $\mathbf{D}= \mathbf{A}- (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}'$} \\ \end{align*}\]
Now, since \(\tilde{\boldsymbol{\beta}}\) is unbiased, we have that:
\[\mathbb{E}[\tilde{\boldsymbol{\beta}}] = \boldsymbol{\beta}\quad \forall \boldsymbol{\beta}\in \mathbb{R}^p \]
And
\[\begin{align*} \mathbb{E}[\tilde{\boldsymbol{\beta}}] &= \mathbb{E}[\mathbf{D}\mathbf{y}+ \hat{\boldsymbol{\beta}}] \quad \forall \boldsymbol{\beta}\in \mathbb{R}^p \\ &= \mathbf{D}\mathbb{E}[\mathbf{y}] + \mathbb{E}[\hat{\boldsymbol{\beta}}] \quad \forall \boldsymbol{\beta}\in \mathbb{R}^p \\ &= \mathbf{D}\mathbf{X}\boldsymbol{\beta}+ \boldsymbol{\beta}\quad \forall \boldsymbol{\beta}\in \mathbb{R}^p & \text{since $\mathbb{E}[\mathbf{y}] = \mathbf{X}\boldsymbol{\beta}$ and $\mathbb{E}[\hat{\boldsymbol{\beta}}] = \boldsymbol{\beta}$} \\ \end{align*}\]
Then
\[\begin{align*} &\mathbb{E}[\tilde{\boldsymbol{\beta}}] = \boldsymbol{\beta}\quad \forall \boldsymbol{\beta}\in \mathbb{R}^p \\ &\implies \mathbf{D}\mathbf{X}\boldsymbol{\beta}+ \boldsymbol{\beta}= \boldsymbol{\beta}\quad \forall \boldsymbol{\beta}\in \mathbb{R}^p \\ &\implies \mathbf{D}\mathbf{X}\boldsymbol{\beta}= \mathbf{0}\quad \forall \boldsymbol{\beta}\in \mathbb{R}^p \\ &\implies \mathbf{D}\mathbf{X}= \mathbf{0}\\ \end{align*}\]
Now, we can analyze the variance of \(\tilde{\boldsymbol{\beta}}\).
\[\begin{align*} \mathbb{V}[\tilde{\boldsymbol{\beta}}] &= \mathbb{V}[\mathbf{D}\mathbf{y}+ \hat{\boldsymbol{\beta}}] \\ &= \mathbb{V}[\mathbf{D}\mathbf{y}+ (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}' \mathbf{y}] && \text{since $\hat{\boldsymbol{\beta}} = (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}' \mathbf{y}$} \\ &= \mathbb{V}[(\mathbf{D}+ (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}') \mathbf{y}] \\ &= (\mathbf{D}+ (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}') \mathbb{V}[\mathbf{y}] (\mathbf{D}+ (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}')' \\ &= (\mathbf{D}+ (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}') \mathbb{V}[\mathbf{y}] (\mathbf{D}' + \mathbf{X}(\mathbf{X}' \mathbf{X})^{-1}) \\ &= (\mathbf{D}+ (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}') (\sigma^2 \mathbf{I}) (\mathbf{D}' + \mathbf{X}(\mathbf{X}' \mathbf{X})^{-1}) && \text{since $\mathbb{V}[\mathbf{y}] = \sigma^2 \mathbf{I}$} \\ &= \sigma^2 (\mathbf{D}+ (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}') (\mathbf{D}' + \mathbf{X}(\mathbf{X}' \mathbf{X})^{-1}) \\ &= \sigma^2 (\mathbf{D}\mathbf{D}' + (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}' \mathbf{D}' \\ &\quad + \mathbf{D}\mathbf{X}(\mathbf{X}' \mathbf{X})^{-1} + (\mathbf{X}' \mathbf{X})^{-1} \mathbf{X}' \mathbf{X}(\mathbf{X}' \mathbf{X})^{-1}) \\ &= \sigma^2 (\mathbf{D}\mathbf{D}' + (\mathbf{X}' \mathbf{X})^{-1}) && \text{since $\mathbf{D}\mathbf{X}= \mathbf{0}$} \\ &= \sigma^2 (\mathbf{X}' \mathbf{X})^{-1} + \sigma^2 (\mathbf{D}\mathbf{D}') \\ &= \mathbb{V}[\hat{\boldsymbol{\beta}}] + \sigma^2 (\mathbf{D}\mathbf{D}') && \text{since $\mathbb{V}[\hat{\boldsymbol{\beta}}] = \sigma^2 (\mathbf{X}' \mathbf{X})^{-1}$} \\ \end{align*}\]
Since \(\mathbf{D}\mathbf{D}'\) is positive semi-definite, we have that:
\[ \mathbb{V}[\tilde{\boldsymbol{\beta}}] - \mathbb{V}[\hat{\boldsymbol{\beta}}] = \sigma^2 (\mathbf{D}\mathbf{D}') \]
is positive semi-definite. This concludes the proof.
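To illustrate the theorem numerically, the sketch below compares OLS with one particular alternative unbiased linear estimator: a weighted least-squares estimator with arbitrary weights of my own choosing (any fixed positive weights give an unbiased linear estimator here). The eigenvalue check confirms that the variance difference is positive semi-definite, and the trace comparison anticipates the MSE discussion below:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma = 30, 3, 1.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])

# OLS: A_ols = (X'X)^{-1} X', with variance sigma^2 (X'X)^{-1}.
A_ols = np.linalg.inv(X.T @ X) @ X.T
V_ols = sigma**2 * np.linalg.inv(X.T @ X)

# An alternative unbiased linear estimator: weighted least squares with
# arbitrary positive weights (unbiased because A_wls @ X = I, i.e. D X = 0).
W = np.diag(rng.uniform(0.5, 2.0, size=n))
A_wls = np.linalg.inv(X.T @ W @ X) @ X.T @ W
print(np.allclose(A_wls @ X, np.eye(p)))          # True: unbiasedness condition
V_wls = sigma**2 * A_wls @ A_wls.T                # A V[y] A' with V[y] = sigma^2 I

# Gauss-Markov: the difference of variances must be positive semi-definite.
diff = V_wls - V_ols
print(np.linalg.eigvalsh(diff).min() >= -1e-10)   # True (up to round-off)
print(np.trace(V_wls) >= np.trace(V_ols))         # True: OLS also has smaller MSE
```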
So, as it turns out, the OLS estimator, which came from minimizing the sum of squared errors without any assumptions about expectations or variances, is a very good estimator, since it is:
- Unbiased
- Among the unbiased and linear estimators, it has the smallest variance.
Another common measure of performance is the Mean Square Error (MSE) of an estimator with respect to a parameter. Using the Gauss-Markov Theorem we can conclude that, among the unbiased and linear estimators, the OLS estimator has the smallest MSE.
Since both estimators are unbiased, we have that:
\[ \mathbb{M}(\hat{\boldsymbol{\beta}}) = \text{tr}(\mathbb{V}[\hat{\boldsymbol{\beta}}]) + \|\text{Bias}(\hat{\boldsymbol{\beta}})\|^2 = \text{tr}(\mathbb{V}[\hat{\boldsymbol{\beta}}]) \] \[ \mathbb{M}(\tilde{\boldsymbol{\beta}}) = \text{tr}(\mathbb{V}[\tilde{\boldsymbol{\beta}}]) + \|\text{Bias}(\tilde{\boldsymbol{\beta}})\|^2 = \text{tr}(\mathbb{V}[\tilde{\boldsymbol{\beta}}]) \]
Then, by the Gauss-Markov theorem:
\[\begin{align*} \mathbb{M}[\tilde{\boldsymbol{\beta}}] - \mathbb{M}[\hat{\boldsymbol{\beta}}] &= \text{tr}(\mathbb{V}[\tilde{\boldsymbol{\beta}}]) - \text{tr}(\mathbb{V}[\hat{\boldsymbol{\beta}}]) \\ &= \text{tr}(\mathbb{V}[\tilde{\boldsymbol{\beta}}] - \mathbb{V}[\hat{\boldsymbol{\beta}}]) && \text{since the trace operator is linear} \\ &\geq 0 \\ \end{align*}\]
where we use the fact that \(\mathbb{V}[\tilde{\boldsymbol{\beta}}] - \mathbb{V}[\hat{\boldsymbol{\beta}}]\) is positive semi-definite by the Gauss-Markov theorem and that the trace of a positive semi-definite matrix is non-negative.
8.5 Estimate of \(\sigma^2\)
Since we have introduced a new parameter \(\sigma^2\), we might be interested in estimating it. We will propose an estimator of \(\sigma^2\) of the following form:
\[\hat{\sigma}^2 = a \hat{\mathbf{e}}'\hat{\mathbf{e}}\]
where we will choose the scalar \(a\) later.
In particular, we want to compute the expectation of this estimator. To do so, we will take a somewhat roundabout route. First we will show that:
\[(\mathbf{y}- \mathbb{E}[\mathbf{y}])'(\mathbf{y}- \mathbb{E}[\mathbf{y}]) = \hat{\mathbf{e}}'\hat{\mathbf{e}} + (\hat{\mathbf{y}} - \mathbb{E}[\hat{\mathbf{y}}])'(\hat{\mathbf{y}} - \mathbb{E}[\hat{\mathbf{y}}])\]
and we will use this fact to compute the expectation of \(\hat{\mathbf{e}}'\hat{\mathbf{e}}\).
\[\begin{align*} (\mathbf{y}- \mathbb{E}[\mathbf{y}])'& (\mathbf{y}- \mathbb{E}[\mathbf{y}]) \\ &= (\mathbf{y}- \mathbf{X}\boldsymbol{\beta})'(\mathbf{y}- \mathbf{X}\boldsymbol{\beta}) && \text{since $\mathbb{E}[\mathbf{y}] = \mathbf{X}\boldsymbol{\beta}$} \\ &= (\mathbf{y}- \mathbf{X}\hat{\boldsymbol{\beta}} + \mathbf{X}\hat{\boldsymbol{\beta}} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y}- \mathbf{X}\hat{\boldsymbol{\beta}} + \mathbf{X}\hat{\boldsymbol{\beta}} - \mathbf{X}\boldsymbol{\beta}) \\ &= (\mathbf{y}- \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{y}- \mathbf{X}\hat{\boldsymbol{\beta}}) + 2(\mathbf{y}- \mathbf{X}\hat{\boldsymbol{\beta}})'(\mathbf{X}\hat{\boldsymbol{\beta}} - \mathbf{X}\boldsymbol{\beta}) \\ &\quad + (\mathbf{X}\hat{\boldsymbol{\beta}} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{X}\hat{\boldsymbol{\beta}} - \mathbf{X}\boldsymbol{\beta}) \\ &= (\mathbf{y}- \hat{\mathbf{y}})'(\mathbf{y}- \hat{\mathbf{y}}) + 2(\mathbf{y}- \hat{\mathbf{y}})'(\mathbf{X}\hat{\boldsymbol{\beta}} - \mathbf{X}\boldsymbol{\beta}) \\ &\quad + (\hat{\mathbf{y}} - \mathbb{E}[\hat{\mathbf{y}}])'(\hat{\mathbf{y}} - \mathbb{E}[\hat{\mathbf{y}}]) \\ &= \hat{\mathbf{e}}'\hat{\mathbf{e}} + 2\hat{\mathbf{e}}'(\mathbf{X}\hat{\boldsymbol{\beta}} - \mathbf{X}\boldsymbol{\beta}) + (\hat{\mathbf{y}} - \mathbb{E}[\hat{\mathbf{y}}])'(\hat{\mathbf{y}} - \mathbb{E}[\hat{\mathbf{y}}]) \\ &= \hat{\mathbf{e}}'\hat{\mathbf{e}} + 2\hat{\mathbf{e}}'\mathbf{X}(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}) + (\hat{\mathbf{y}} - \mathbb{E}[\hat{\mathbf{y}}])'(\hat{\mathbf{y}} - \mathbb{E}[\hat{\mathbf{y}}]) \\ &= \hat{\mathbf{e}}'\hat{\mathbf{e}} + (\hat{\mathbf{y}} - \mathbb{E}[\hat{\mathbf{y}}])'(\hat{\mathbf{y}} - \mathbb{E}[\hat{\mathbf{y}}]) && \text{since $\hat{\mathbf{e}}'\mathbf{X}=\mathbf{0}$} \\ \end{align*}\]
Since
\[\mathbb{E}[(\mathbf{y}- \mathbb{E}[\mathbf{y}])'(\mathbf{y}- \mathbb{E}[\mathbf{y}])] = \text{tr}(\mathbb{V}[\mathbf{y}]) = \text{tr}(\sigma^2 \mathbf{I}) = \sigma^2 \text{tr}(\mathbf{I}) = \sigma^2 n\] \[\mathbb{E}[(\hat{\mathbf{y}} - \mathbb{E}[\hat{\mathbf{y}}])'(\hat{\mathbf{y}} - \mathbb{E}[\hat{\mathbf{y}}])] = \text{tr}(\mathbb{V}[\hat{\mathbf{y}}]) = \text{tr}(\sigma^2 \mathbf{H}) = \sigma^2 \text{tr}(\mathbf{H}) = \sigma^2 p\] where we use the fact that the trace of an idempotent matrix is equal to its rank, and \(\text{rank}(\mathbf{H}) = \text{rank}(\mathbf{X}) = p\). Then we get that:
\[\begin{align*} (\mathbf{y}- \mathbb{E}[\mathbf{y}])'& (\mathbf{y}- \mathbb{E}[\mathbf{y}]) = \hat{\mathbf{e}}'\hat{\mathbf{e}} + (\hat{\mathbf{y}} - \mathbb{E}[\hat{\mathbf{y}}])'(\hat{\mathbf{y}} - \mathbb{E}[\hat{\mathbf{y}}]) \\ &\implies \mathbb{E}[(\mathbf{y}- \mathbb{E}[\mathbf{y}])'(\mathbf{y}- \mathbb{E}[\mathbf{y}])] = \mathbb{E}[\hat{\mathbf{e}}'\hat{\mathbf{e}}] + \mathbb{E}[(\hat{\mathbf{y}} - \mathbb{E}[\hat{\mathbf{y}}])'(\hat{\mathbf{y}} - \mathbb{E}[\hat{\mathbf{y}}])] \\ &\implies \text{tr}(\mathbb{V}[\mathbf{y}]) = \mathbb{E}[\hat{\mathbf{e}}'\hat{\mathbf{e}}] + \text{tr}(\mathbb{V}[\hat{\mathbf{y}}]) \\ &\implies \sigma^2 n = \mathbb{E}[\hat{\mathbf{e}}'\hat{\mathbf{e}}] + \sigma^2 p \\ &\implies \mathbb{E}[\hat{\mathbf{e}}'\hat{\mathbf{e}}] = \sigma^2 n - \sigma^2 p \\ &\implies \mathbb{E}[\hat{\mathbf{e}}'\hat{\mathbf{e}}] = \sigma^2 (n - p) \\ \end{align*}\]
Then, if we set \(a = \frac{1}{n-p}\), our proposed estimator \(\hat{\sigma}^2\) is unbiased, since:
\[\begin{align*} \mathbb{E}[\hat{\sigma}^2] &= \mathbb{E}[a \hat{\mathbf{e}}'\hat{\mathbf{e}}] \\ &= a \mathbb{E}[\hat{\mathbf{e}}'\hat{\mathbf{e}}] \\ &= a \sigma^2 (n-p) \\ &= \frac{1}{n-p} \sigma^2 (n-p) \\ &= \sigma^2 \\ \end{align*}\]
We could choose other values for \(a\); however, the resulting estimator would not be unbiased.
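A simulation check of this result (the settings are arbitrary illustrative choices): averaging \(\hat{\mathbf{e}}'\hat{\mathbf{e}}\) over many redraws of the errors gives roughly \(\sigma^2 (n-p)\), so dividing by \(n-p\) yields an unbiased estimator, while dividing by, say, \(n\) does not:

```python
import numpy as np

rng = np.random.default_rng(6)
n, p, sigma2 = 30, 4, 4.0
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta = rng.normal(size=p)
I_minus_H = np.eye(n) - X @ np.linalg.inv(X.T @ X) @ X.T

reps = 50_000
rss = np.empty(reps)
for r in range(reps):
    y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)
    e_hat = I_minus_H @ y
    rss[r] = e_hat @ e_hat             # e_hat' e_hat

print(rss.mean(), sigma2 * (n - p))    # E[e_hat' e_hat] ~= sigma^2 (n - p)
print((rss / (n - p)).mean())          # ~= sigma2: unbiased with a = 1/(n - p)
print((rss / n).mean())                # < sigma2: biased with a = 1/n
```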