9 Normality Assumption
9.1 Introduction
Up to this point, we have treated the linear regression model primarily as an optimization problem, where the goal was to find parameter estimates that minimize the sum of squared errors. By making only mild assumptions about the error term—such as a zero mean and constant variance—we were able to establish key results like the Gauss–Markov theorem, which guarantees that the Ordinary Least Squares (OLS) estimator is the Best Linear Unbiased Estimator (BLUE) under those conditions.
We now take a further step by introducing a distributional assumption on the error term. Specifically, we assume that the errors follow a Normal distribution. This assumption is much stronger than those introduced before, but it provides powerful analytical advantages. In particular, it allows us to:
- Derive the sampling distributions of the estimators,
- Conduct statistical inference (e.g., confidence intervals and hypothesis tests),
- Formulate and maximize the likelihood function, leading to Maximum Likelihood Estimates (MLE) of the parameters.
Formally, we assume:
\[ \mathbf{e} \sim N(\mathbf{0}, \sigma^2 \mathbf{I}) \]
The Normal distribution is completely characterized by its mean and variance. Therefore, once we know the expected value and variance of an estimator, assuming normality allows us to fully determine its distribution. Since in previous chapters we have already computed the mean and variance of the OLS estimators, the normality assumption now enables us to describe their entire probabilistic behavior.
Furthermore, the assumption of normality allows us to derive the likelihood function of the observed data, which serves as the foundation for Maximum Likelihood Estimation. This framework not only provides an alternative route to parameter estimation but also forms the basis for many modern extensions of regression analysis, including Bayesian regression and generalized linear models.
9.2 Maximum Likelihood Estimation
To obtain the maximum likelihood estimates of \(\boldsymbol{\beta}\) and \(\sigma^2\), we first need the distribution of \(\mathbf{y}\). From the regression model
\[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}, \]
and the assumption \(\mathbf{e} \sim N(\mathbf{0}, \sigma^2 \mathbf{I})\), we see that \(\mathbf{y}\) is simply a linear transformation of a multivariate normal random vector. Therefore, \(\mathbf{y}\) is also normally distributed, with mean and variance given by:
\[ \mathbf{y} \sim N(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}). \]
Consequently, the likelihood function of \(\boldsymbol{\beta}\) and \(\sigma^2\), given \(\mathbf{y}\), is:
\[ \mathcal{L}(\boldsymbol{\beta}, \sigma^2 \mid \mathbf{y}) = N(\mathbf{y} \mid \mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{I}), \]
which, when written explicitly, becomes:
\[\begin{align*} \mathcal{L}(\boldsymbol{\beta}, \sigma^2 | \mathbf{y}) &= (2 \pi)^{-\frac{n}{2}} |\sigma^2 \mathbf{I}|^{-\frac{1}{2}} \exp\left\{ -\frac{1}{2}(\mathbf{y}- \mathbf{X}\boldsymbol{\beta})'(\sigma^2 \mathbf{I})^{-1}(\mathbf{y}- \mathbf{X}\boldsymbol{\beta}) \right\} \\ &= (2 \pi)^{-\frac{n}{2}} (\sigma^2)^{-\frac{n}{2}} |\mathbf{I}|^{-\frac{1}{2}} \exp\left\{ -\frac{1}{2}(\mathbf{y}- \mathbf{X}\boldsymbol{\beta})'\frac{\mathbf{I}}{\sigma^2}(\mathbf{y}- \mathbf{X}\boldsymbol{\beta}) \right\} \\ &= (2 \pi \sigma^2)^{-\frac{n}{2}} \exp\left\{ -\frac{1}{2 \sigma^2}(\mathbf{y}- \mathbf{X}\boldsymbol{\beta})'(\mathbf{y}- \mathbf{X}\boldsymbol{\beta}) \right\} \\ \end{align*}\]
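Because the covariance matrix is diagonal, this multivariate density factors into a product of univariate normal densities. A minimal numerical sanity check of that equivalence, using numpy on simulated data (all sizes and values here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 3
X = rng.normal(size=(n, p))
beta = np.array([1.0, -2.0, 0.5])
sigma2 = 0.5
y = X @ beta + rng.normal(scale=np.sqrt(sigma2), size=n)

# Closed-form log-likelihood from the last line of the derivation.
resid = y - X @ beta
loglik_matrix = -0.5 * n * np.log(2 * np.pi * sigma2) - resid @ resid / (2 * sigma2)

# Same quantity as a sum of univariate normal log-densities,
# valid because the errors are independent with common variance.
loglik_sum = np.sum(-0.5 * np.log(2 * np.pi * sigma2) - resid**2 / (2 * sigma2))

assert np.isclose(loglik_matrix, loglik_sum)
```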
9.2.1 Decomposition of the Quadratic Form
Recall that in the previous chapter we showed the following decomposition:
\[ (\mathbf{y} - \mathbb{E}[\mathbf{y}])'(\mathbf{y} - \mathbb{E}[\mathbf{y}]) = \hat{\mathbf{e}}'\hat{\mathbf{e}} + (\hat{\mathbf{y}} - \mathbb{E}[\hat{\mathbf{y}}])'(\hat{\mathbf{y}} - \mathbb{E}[\hat{\mathbf{y}}]), \]
which implies that
\[ (\mathbf{y} - \mathbf{X}\boldsymbol{\beta})'(\mathbf{y} - \mathbf{X}\boldsymbol{\beta}) = \hat{\mathbf{e}}'\hat{\mathbf{e}} + (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})' \mathbf{X}'\mathbf{X} (\hat{\boldsymbol{\beta}} - \boldsymbol{\beta}). \]
Substituting this into the likelihood function gives:
\[\begin{align*} \mathcal{L}(\boldsymbol{\beta}, \sigma^2 | \mathbf{y}) &= (2 \pi \sigma^2)^{-\frac{n}{2}} \exp\left\{ -\frac{1}{2 \sigma^2}(\mathbf{y}- \mathbf{X}\boldsymbol{\beta})'(\mathbf{y}- \mathbf{X}\boldsymbol{\beta}) \right\} \\ &= (2 \pi \sigma^2)^{-\frac{n}{2}} \exp\left\{ -\frac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{2 \sigma^2} -\frac{(\hat{\boldsymbol{\beta}} - \boldsymbol{\beta})'\mathbf{X}'\mathbf{X}( \hat{\boldsymbol{\beta}} - \boldsymbol{\beta})}{2 \sigma^2} \right\} \\ \end{align*}\]
This expression is convenient for optimization. Notice that, for any fixed \(\sigma^2\), the likelihood is maximized with respect to \(\boldsymbol{\beta}\) when \(\boldsymbol{\beta} = \hat{\boldsymbol{\beta}}\), since that value nullifies the second exponential term. Therefore, the MLE for \(\boldsymbol{\beta}\) coincides with the OLS estimator.
9.2.2 Estimation of \(\sigma^2\)
Once we have \(\boldsymbol{\beta} = \hat{\boldsymbol{\beta}}\), the likelihood simplifies to a function of \(\sigma^2\) only:
\[ \mathcal{L}(\sigma^2 | \mathbf{y}, \boldsymbol{\beta}= \hat{\boldsymbol{\beta}}) = (2 \pi \sigma^2)^{-\frac{n}{2}} \exp\left\{ -\frac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{2\sigma^2} \right\} \]
Taking logs gives the log-likelihood function:
\[ \ell(\sigma^2 \mid \mathbf{y}, \boldsymbol{\beta} = \hat{\boldsymbol{\beta}}) = -\frac{n}{2}\log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{2\sigma^2}. \]
Differentiating with respect to \(\sigma^2\) and setting the derivative equal to zero yields:
\[ \tilde{\sigma}^2 = \frac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{n}. \]
A second derivative check confirms that this value indeed maximizes the log-likelihood. Thus, \(\tilde{\sigma}^2\) is the maximum likelihood estimator of \(\sigma^2\).
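The steps above can be checked numerically: fit OLS (which coincides with the MLE of \(\boldsymbol{\beta}\)), compute \(\tilde{\sigma}^2 = \hat{\mathbf{e}}'\hat{\mathbf{e}}/n\), and verify that it beats nearby values of \(\sigma^2\) in the profiled log-likelihood. A sketch with simulated data:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 100, 4
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + rng.normal(size=n)

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)   # OLS estimate = MLE of beta
e_hat = y - X @ beta_hat
sigma2_mle = e_hat @ e_hat / n                  # MLE of sigma^2

def profile_loglik(s2):
    # Log-likelihood with beta fixed at its MLE, as a function of sigma^2.
    return -0.5 * n * np.log(2 * np.pi * s2) - (e_hat @ e_hat) / (2 * s2)

# The MLE should beat nearby values of sigma^2.
assert profile_loglik(sigma2_mle) > profile_loglik(sigma2_mle * 1.1)
assert profile_loglik(sigma2_mle) > profile_loglik(sigma2_mle * 0.9)
```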
9.2.3 Remarks on Bias and Practical Use
It is important to note that \(\tilde{\sigma}^2\) is biased as an estimator of \(\sigma^2\). In contrast, the usual OLS variance estimator
\[ \hat{\sigma}^2 = \frac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{n - p} \]
is unbiased under the Gauss–Markov assumptions. Both estimators are useful: \(\tilde{\sigma}^2\) arises naturally in likelihood-based methods and is convenient for theoretical developments, while \(\hat{\sigma}^2\) is preferred for unbiased estimation and inferential procedures.
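A small Monte Carlo experiment illustrates the bias: averaging both estimators over many simulated samples, \(\tilde{\sigma}^2\) systematically underestimates \(\sigma^2\) while \(\hat{\sigma}^2\) does not (a sketch, numpy and illustrative sizes assumed):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p, sigma2 = 30, 3, 4.0
X = rng.normal(size=(n, p))
P = X @ np.linalg.inv(X.T @ X) @ X.T   # hat matrix for a fixed design
M = np.eye(n) - P                       # residual-maker matrix

mle, unbiased = [], []
for _ in range(4000):
    y = rng.normal(scale=np.sqrt(sigma2), size=n)  # true beta = 0
    rss = y @ M @ y                                 # residual sum of squares
    mle.append(rss / n)                             # sigma^2 tilde (MLE)
    unbiased.append(rss / (n - p))                  # sigma^2 hat (unbiased)

# E[rss] = sigma^2 (n - p): the MLE underestimates, the OLS version does not.
assert abs(np.mean(unbiased) - sigma2) < 0.15
assert np.mean(mle) < np.mean(unbiased)
```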
In the following sections, we will build on these results to derive the exact sampling distributions of the OLS estimators and to establish the basis for hypothesis testing within the classical linear model framework.
9.3 Distribution of the Estimates
In the previous chapters, we derived the mean and variance of the least squares estimators. We now go further and, under the assumption of normally distributed errors, obtain their sampling distributions, which fully characterize the stochastic behavior of these estimators. This is possible because any linear combination of normally distributed random variables is itself normal. Since most of our estimators are linear functions of the observed data, their distributions are readily available.
9.3.1 Distribution of \(\hat{\boldsymbol{\beta}}\), \(\hat{\mathbf{y}}\), and \(\hat{\mathbf{e}}\)
The least squares estimators \(\hat{\boldsymbol{\beta}}\), \(\hat{\mathbf{y}}\), and \(\hat{\mathbf{e}}\) are all linear transformations of the response vector \(\mathbf{y}\). Therefore, they are normally distributed, and the mean and variance expressions derived earlier determine their full distributions:
\[ \hat{\boldsymbol{\beta}} \sim N(\boldsymbol{\beta}, \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}) \]
\[ \hat{\mathbf{y}} \sim N(\mathbf{X}\boldsymbol{\beta}, \sigma^2 \mathbf{H}) \]
\[ \hat{\mathbf{e}} \sim N(\mathbf{0}, \sigma^2 (\mathbf{I}- \mathbf{H})) \]
These results reveal that both fitted values and residuals are multivariate normal, each with a distinct covariance structure determined by the projection matrices \(\mathbf{H}\) and \(\mathbf{I}- \mathbf{H}\). The matrix \(\mathbf{H}\) projects \(\mathbf{y}\) onto the column space of \(\mathbf{X}\), while \((\mathbf{I}- \mathbf{H})\) projects it onto its orthogonal complement.
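These covariance structures can be verified empirically: simulating many responses from a fixed design and computing the sample covariance of the residual vectors should recover \(\sigma^2(\mathbf{I}-\mathbf{H})\). A simulation sketch (illustrative sizes):

```python
import numpy as np

rng = np.random.default_rng(3)
n, p, sigma2 = 10, 2, 1.0
X = rng.normal(size=(n, p))
H = X @ np.linalg.inv(X.T @ X) @ X.T   # hat (projection) matrix
mu = X @ np.ones(p)                     # fixed mean, beta = (1, ..., 1)

R = 20000
E = np.empty((R, n))
for r in range(R):
    y = mu + rng.normal(scale=np.sqrt(sigma2), size=n)
    E[r] = (np.eye(n) - H) @ y          # residual vector for this sample

# Empirical covariance of residuals should approach sigma^2 (I - H).
emp_cov = np.cov(E, rowvar=False)
assert np.max(np.abs(emp_cov - sigma2 * (np.eye(n) - H))) < 0.05
```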
Note that the estimator of the error variance, \(\hat{\sigma}^2\), is not a linear function of \(\mathbf{y}\), so its distribution must be derived differently.
9.3.2 Distribution of \(\hat{\sigma}^2\)
The estimator \(\hat{\sigma}^2\) measures the variability of the residuals around the fitted model. Its distribution is central to many inferential procedures, such as constructing confidence intervals and hypothesis tests for the regression coefficients. However, because \(\hat{\sigma}^2\) involves the quadratic form \(\hat{\mathbf{e}}'\hat{\mathbf{e}}\), its derivation is more involved. We will obtain its distribution in three steps:
- Express \(\frac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{\sigma^2}\) as a quadratic form of a standard normal vector with an idempotent matrix.
- Show that such a quadratic form follows a chi-squared distribution with degrees of freedom equal to the rank of the idempotent matrix.
- Relate this result to the distribution of \(\hat{\sigma}^2\) itself.
9.3.2.1 Step 1: Expressing \(\frac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{\sigma^2}\) as a Quadratic Form
We begin by rewriting the residual vector in terms of the error vector \(\mathbf{e}\):
\[\begin{align*} (\mathbf{I}- \mathbf{H}) \mathbf{y} &= (\mathbf{I}- \mathbf{H}) (\mathbf{X}\boldsymbol{\beta}+ \mathbf{e}) && \text{since $\mathbf{y}= \mathbf{X}\boldsymbol{\beta}+ \mathbf{e}$} \\ &= \mathbf{I}\mathbf{X}\boldsymbol{\beta}- \mathbf{H}\mathbf{X}\boldsymbol{\beta}+ \mathbf{I}\mathbf{e}- \mathbf{H}\mathbf{e}\\ &= \mathbf{X}\boldsymbol{\beta}- \mathbf{X}\boldsymbol{\beta}+ \mathbf{e}- \mathbf{H}\mathbf{e}&& \text{since $\mathbf{H}\mathbf{X}= \mathbf{X}$} \\ &= \mathbf{e}- \mathbf{H}\mathbf{e}\\ &= (\mathbf{I}- \mathbf{H}) \mathbf{e}\\ \end{align*}\]
Then:
\[\begin{align*} \mathbf{y}' (\mathbf{I}- \mathbf{H}) \mathbf{y} &= \mathbf{y}' (\mathbf{I}- \mathbf{H}) (\mathbf{I}- \mathbf{H}) \mathbf{y}&& \text{since $(\mathbf{I}- \mathbf{H})$ is idempotent} \\ &= \mathbf{e}' (\mathbf{I}- \mathbf{H}) (\mathbf{I}- \mathbf{H}) \mathbf{e}&& \text{since $(\mathbf{I}- \mathbf{H}) \mathbf{y}= (\mathbf{I}- \mathbf{H}) \mathbf{e}$} \\ &= \mathbf{e}' (\mathbf{I}- \mathbf{H}) \mathbf{e}&& \text{since $(\mathbf{I}- \mathbf{H})$ is idempotent} \\ \end{align*}\]
Thus,
\[\begin{align*} \frac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{\sigma^2} &= \frac{\mathbf{y}' (\mathbf{I}- \mathbf{H}) \mathbf{y}}{\sigma^2} && \text{since $\hat{\mathbf{e}}'\hat{\mathbf{e}} = \mathbf{y}' (\mathbf{I}- \mathbf{H}) \mathbf{y}$} \\ &= \frac{\mathbf{e}' (\mathbf{I}- \mathbf{H}) \mathbf{e}}{\sigma^2} && \text{since $\mathbf{y}' (\mathbf{I}- \mathbf{H}) \mathbf{y}= \mathbf{e}' (\mathbf{I}- \mathbf{H}) \mathbf{e}$} \\ &= \left(\frac{\mathbf{e}}{\sqrt{\sigma^2}}\right)' (\mathbf{I}- \mathbf{H}) \frac{\mathbf{e}}{\sqrt{\sigma^2}} \\ \end{align*}\]
Finally, because \(\frac{\mathbf{e}}{\sqrt{\sigma^2}}\) is a scaled version of the normal vector \(\mathbf{e}\), it remains normally distributed with zero mean and identity covariance:
\[ \mathbb{E}\left[\frac{\mathbf{e}}{\sqrt{\sigma^2}}\right] = \mathbf{0}, \qquad \mathbb{V}\left[\frac{\mathbf{e}}{\sqrt{\sigma^2}}\right] = \mathbf{I} \]
Hence,
\[ \frac{\mathbf{e}}{\sqrt{\sigma^2}} \sim N(\mathbf{0}, \mathbf{I}) \]
This vector is known as a standard multivariate normal, completing Step 1.
9.3.2.2 Step 2: The Chi-Squared Distribution of a Quadratic Form
Let \(\mathbf{z}\in \mathbb{R}^n\) be a standard multivariate normal vector and \(\mathbf{M}\in \mathbb{R}^{n \times n}\) an idempotent matrix of rank \(m\). We now show that:
\[ \mathbf{z}' \mathbf{M}\mathbf{z}\sim \chi^2_m \]
To prove this, we use the spectral decomposition of \(\mathbf{M}\):
\[ \mathbf{M}= \mathbf{V}\boldsymbol{\Sigma}\mathbf{V}' \]
where \(\mathbf{V}\) is an orthogonal matrix and \(\boldsymbol{\Sigma}\) is diagonal. Since \(\mathbf{M}\) is idempotent, its eigenvalues are all \(0\) or \(1\), so the diagonal of \(\boldsymbol{\Sigma}\) consists of \(m\) ones and \(n-m\) zeros. Without loss of generality, we can assume that the first \(m\) diagonal entries are equal to \(1\) and the remaining \(n-m\) entries are equal to \(0\).
Then, first note that \(\mathbf{V}' \mathbf{z}\) is a linear transformation of a normal vector. We will show that \(\mathbf{V}' \mathbf{z}\in \mathbb{R}^{n}\) is also standard normal.
Note that:
\[ \mathbb{E}[\mathbf{V}' \mathbf{z}] = \mathbf{V}' \mathbb{E}[\mathbf{z}] = \mathbf{V}' \mathbf{0}= \mathbf{0}\] \[ \mathbb{V}[\mathbf{V}' \mathbf{z}] = \mathbf{V}' \mathbb{V}[\mathbf{z}] \mathbf{V}= \mathbf{V}' \mathbf{I}\mathbf{V}= \mathbf{V}' \mathbf{V}= \mathbf{I}\]
Then \(\mathbf{V}' \mathbf{z}\) is also standard normal. Writing \(\mathbf{w}= \mathbf{V}' \mathbf{z}\), the components \(w_1,\ldots,w_n\) of \(\mathbf{w}\) are independent univariate standard normal random variables.
Then
\[\begin{align*} \mathbf{z}' \mathbf{M}\mathbf{z} &= \mathbf{z}' (\mathbf{V}\boldsymbol{\Sigma}\mathbf{V}') \mathbf{z}&& \text{using the spectral decomposition of $\mathbf{M}$} \\ &= (\mathbf{V}' \mathbf{z})' \boldsymbol{\Sigma}(\mathbf{V}' \mathbf{z}) \\ &= \mathbf{w}' \boldsymbol{\Sigma}\mathbf{w}&& \text{since $\mathbf{w}= \mathbf{V}' \mathbf{z}$} \\ &= \sum_{i=1}^n [\boldsymbol{\Sigma}]_{ii} w_i^2 \\ &= \sum_{i=1}^{m} w_i^2 && \text{since only the first $m$ entries are equal to $1$} \\ &\sim \chi^2_m && \text{by definition of the $\chi^2$ distribution} \\ \end{align*}\]
This fundamental result connects linear algebra and probability theory and is essential to statistical inference in linear regression.
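The result can be illustrated numerically by building an idempotent projection matrix of known rank and checking that \(\mathbf{z}'\mathbf{M}\mathbf{z}\) has the mean \(m\) and variance \(2m\) of a \(\chi^2_m\) distribution. A simulation sketch:

```python
import numpy as np

rng = np.random.default_rng(4)
n, m = 8, 3
# Build an idempotent projection M of rank m from a random basis.
A = rng.normal(size=(n, m))
M = A @ np.linalg.inv(A.T @ A) @ A.T
assert np.allclose(M @ M, M)            # idempotent
assert round(np.trace(M)) == m          # rank = trace for projections

# Simulate many standard normal vectors and form z' M z for each.
Z = rng.normal(size=(50000, n))
q = np.einsum('ri,ij,rj->r', Z, M, Z)

# chi^2_m has mean m and variance 2m.
assert abs(q.mean() - m) < 0.1
assert abs(q.var() - 2 * m) < 0.3
```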
9.3.2.3 Step 3: Distribution of \(\hat{\sigma}^2\)
Combining the results from Steps 1 and 2, we conclude that:
\[ \frac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{\sigma^2} \sim \chi^2_{n-p} \]
since the idempotent matrix \((\mathbf{I}- \mathbf{H})\) has rank \(n-p\). Therefore:
\[ \hat{\sigma}^2 = \frac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{n-p} = \frac{\sigma^2}{n-p}\frac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{\sigma^2} \sim \frac{\sigma^2}{n-p}\chi^2_{n-p} \]
This result shows that the estimator of \(\sigma^2\) is scaled chi-squared distributed, a fact that will be crucial when constructing confidence intervals and hypothesis tests.
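A quick simulation confirms the scaled chi-squared behavior: across replications, \(\hat{\mathbf{e}}'\hat{\mathbf{e}}/\sigma^2\) should have mean \(n-p\) and variance \(2(n-p)\). An illustrative sketch:

```python
import numpy as np

rng = np.random.default_rng(5)
n, p, sigma2 = 25, 5, 2.0
X = rng.normal(size=(n, p))
H = X @ np.linalg.inv(X.T @ X) @ X.T
I = np.eye(n)
mu = X @ np.ones(p)   # fixed mean, beta = (1, ..., 1)

stats = []
for _ in range(20000):
    y = mu + rng.normal(scale=np.sqrt(sigma2), size=n)
    rss = y @ (I - H) @ y
    stats.append(rss / sigma2)           # should behave like chi^2_{n-p}

stats = np.array(stats)
df = n - p
assert abs(stats.mean() - df) < 0.2      # chi^2_{n-p} mean
assert abs(stats.var() - 2 * df) < 1.5   # chi^2_{n-p} variance
```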
9.3.3 Independence of \(\hat{\mathbf{e}}\) and \(\hat{\mathbf{y}}\)
Previously, we showed that \(\hat{\mathbf{e}}\) and \(\hat{\mathbf{y}}\) are uncorrelated:
\[ \mathbb{C}[\hat{\mathbf{e}}, \hat{\mathbf{y}}] = \mathbf{0} \]
However, uncorrelatedness does not necessarily imply independence. In general, two random variables can have zero covariance and still exhibit nonlinear dependence.
Under the normality assumption, this distinction disappears: if two normally distributed vectors are uncorrelated, they are also independent. Since both \(\hat{\mathbf{e}}\) and \(\hat{\mathbf{y}}\) are linear combinations of the normal vector \(\mathbf{y}\), they are jointly normal. Therefore, the absence of correlation between them implies that they are independent:
- \(\hat{\mathbf{e}}\) depends only on \((\mathbf{I}- \mathbf{H})\mathbf{y}\), the projection onto the residual space.
- \(\hat{\mathbf{y}}\) depends only on \(\mathbf{H}\mathbf{y}\), the projection onto the column space of \(\mathbf{X}\).
These two subspaces are orthogonal, and under normality, orthogonality implies statistical independence. Consequently, any statistic that is a function of \(\hat{\mathbf{e}}\) (such as \(\hat{\sigma}^2\)) is independent of any statistic that is a function of \(\hat{\mathbf{y}}\) (such as \(\hat{\boldsymbol{\beta}}\)).
This independence property plays a pivotal role in classical regression inference, as it underlies the derivation of \(t\) and \(F\) distributions for hypothesis testing.
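Independence can also be probed empirically: across many simulated samples, the correlation between a coefficient estimate and \(\hat{\sigma}^2\) should be near zero (a sketch; exact zero holds only in theory):

```python
import numpy as np

rng = np.random.default_rng(6)
n, p = 20, 3
X = rng.normal(size=(n, p))
XtX_inv = np.linalg.inv(X.T @ X)
beta = np.array([1.0, 0.0, -1.0])

b1, s2 = [], []
for _ in range(20000):
    y = X @ beta + rng.normal(size=n)
    bh = XtX_inv @ X.T @ y               # OLS coefficients
    e = y - X @ bh                        # residuals
    b1.append(bh[0])                      # first coefficient estimate
    s2.append(e @ e / (n - p))            # variance estimate

# Independence implies zero correlation between beta_hat and sigma_hat^2.
assert abs(np.corrcoef(b1, s2)[0, 1]) < 0.03
```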
9.4 Interval Estimation
So far, we have derived point estimates for several quantities of interest—namely the regression coefficients \(\boldsymbol{\beta}\), the errors \(\mathbf{e}\) (through the residuals), and the error variance \(\sigma^2\). While point estimates provide single “best guesses” of the true parameters, they do not convey how much uncertainty is associated with these estimates.
Since we have already obtained the sampling distributions of these estimators, we can now use this probabilistic information to construct interval estimators, which express a range of plausible values for the unknown parameters, given the observed data.
9.4.1 Confidence Intervals for the Coefficients
Recall that under the classical linear model assumptions, the ordinary least squares estimator satisfies
\[ \hat{\boldsymbol{\beta}} \sim N(\boldsymbol{\beta}, \sigma^2 (\mathbf{X}'\mathbf{X})^{-1}) \]
This result implies that each individual coefficient estimator \(\hat{\beta}_i\) follows a univariate normal distribution:
\[ \hat{\beta}_i \sim N(\beta_i, \sigma^2 [(\mathbf{X}'\mathbf{X})^{-1}]_{ii}) \]
The quantity
\[ \sigma^2_{\beta_i} = \sigma^2 [(\mathbf{X}'\mathbf{X})^{-1}]_{ii} \]
represents the variance of \(\hat{\beta}_i\), which depends both on the noise variance \(\sigma^2\) and on the geometry of the design matrix \(\mathbf{X}\).
Thus, we can rewrite the distribution as
\[ \hat{\beta}_i \sim N \left(\beta_i, \sigma^2_{\beta_i} \right) \]
Since \(\sigma^2\) is unknown, this distribution cannot be used directly for inference. To overcome this, we first standardize the estimator by subtracting its mean and dividing by its true standard deviation:
\[ t^0_{\beta_i}=\frac{\hat{\beta}_i - \beta_i}{\sqrt{\sigma^2_{\beta_i}}} \]
This standardized statistic follows the standard normal distribution, since \(\hat{\beta}_i\) is normal with the mean and variance derived above. However, it still involves the unknown \(\sigma^2\).
Therefore, we replace \(\sigma^2\) with its unbiased estimator \(\hat{\sigma}^2\) to form
\[ t_{\beta_i} = \frac{\hat{\beta}_i - \beta_i}{\sqrt{\hat{\sigma}^2_{\beta_i}}} \]
where \(\hat{\sigma}^2_{\beta_i} = \hat{\sigma}^2 [(\mathbf{X}'\mathbf{X})^{-1}]_{ii}\), so \(t_{\beta_i}\) does not depend on \(\sigma^2\). Let us compute the distribution of this quantity. First, we rewrite the statistic as follows:
\[ t_{\beta_i} = \frac{\hat{\beta}_i - \beta_i}{\sqrt{\hat{\sigma}^2_{\beta_i}}} = \frac{\sqrt{\frac{1}{\sigma^2}}}{\sqrt{\frac{1}{\sigma^2}}}\frac{\hat{\beta}_i - \beta_i}{\sqrt{\hat{\sigma}^2[(\mathbf{X}'\mathbf{X})^{-1}]_{ii}}} = \frac{\frac{\left(\hat{\beta}_i - \beta_i\right)}{\sqrt{\sigma^2[(\mathbf{X}'\mathbf{X})^{-1}]_{ii}}}}{ \sqrt{\frac{\hat{\sigma}^2}{\sigma^2}}} = \frac{\frac{\left(\hat{\beta}_i - \beta_i\right)}{\sqrt{\sigma^2_{\beta_i}}}}{ \sqrt{\frac{(n-p)\frac{\hat{\sigma}^2}{\sigma^2}}{n-p}}} = \frac{t^0_{\beta_i}}{ \sqrt{\frac{\frac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{\sigma^2}}{n-p}}}\]
Now, we know \(t^0_{\beta_i}\) is standard normal distributed, and from the distribution of \(\hat{\sigma}^2\) we have that:
\[ \frac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{\sigma^2} \sim \chi^2_{n-p} \] and from the independence of \(\hat{\boldsymbol{\beta}}\) and \(\hat{\mathbf{e}}\) we have that any function of both variables is independent, in particular
\[ t^0_{\beta_i} \quad \text{and} \quad \frac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{\sigma^2} \] are independent. Therefore the ratio \(t_{\beta_i}\) follows a Student’s \(t\) distribution with \(n - p\) degrees of freedom:
\[ t_{\beta_i} \sim t_{n-p} \]
This fundamental result enables us to construct confidence intervals for each regression coefficient. Now, let \(t \sim t_m\) be a random variable with a \(t\) distribution with \(m\) degrees of freedom, and define the upper-tail critical value
\[ t_m\left(a\right) \quad \text{such that} \quad \mathbb{P}\left(t > t_m\left(a\right) \right) = a\] for any \(a\in[0,1]\)
Then, we have that:
\[\begin{align*} \mathbb{P} &\left( -t_{n-p}\left(\frac{\alpha}{2}\right) \leq t_{\beta_i} \leq t_{n-p}\left(\frac{\alpha}{2}\right) \right) = 1 - \alpha && \text{by symmetry of the $t$ distribution} \\ &\implies \mathbb{P}\left( -t_{n-p}\left(\frac{\alpha}{2}\right) \leq \frac{\hat{\beta}_i - \beta_i}{\sqrt{\hat{\sigma}^2_{\beta_i}}} \leq t_{n-p}\left(\frac{\alpha}{2}\right) \right) = 1 - \alpha && \text{since $t_{\beta_i} = \frac{\hat{\beta}_i - \beta_i}{\sqrt{\hat{\sigma}^2_{\beta_i}}}$} \\ &\implies \mathbb{P}\left( -t_{n-p}\left(\frac{\alpha}{2}\right)\sqrt{\hat{\sigma}^2_{\beta_i}} \leq \hat{\beta}_i - \beta_i \leq t_{n-p}\left(\frac{\alpha}{2}\right)\sqrt{\hat{\sigma}^2_{\beta_i}} \right) = 1 - \alpha \\ &\implies \mathbb{P}\left( -t_{n-p}\left(\frac{\alpha}{2}\right)\sqrt{\hat{\sigma}^2_{\beta_i}} \leq \beta_i - \hat{\beta}_i \leq t_{n-p}\left(\frac{\alpha}{2}\right)\sqrt{\hat{\sigma}^2_{\beta_i}} \right) = 1 - \alpha \\ &\implies \mathbb{P}\left( \hat{\beta}_i - t_{n-p}\left(\frac{\alpha}{2}\right)\sqrt{\hat{\sigma}^2_{\beta_i}} \leq \beta_i \leq \hat{\beta}_i + t_{n-p}\left(\frac{\alpha}{2}\right)\sqrt{\hat{\sigma}^2_{\beta_i}} \right) = 1 - \alpha \\ \end{align*}\]
So
\[ \left(\hat{\beta}_i - t_{n-p}\left(\frac{\alpha}{2}\right)\sqrt{\hat{\sigma}^2_{\beta_i}}, \hat{\beta}_i + t_{n-p}\left(\frac{\alpha}{2}\right)\sqrt{\hat{\sigma}^2_{\beta_i}} \right) \]
This random interval depends on the sample through \(\hat{\beta}_i\) and \(\hat{\sigma}^2_{\beta_i}\) and contains the true parameter \(\beta_i\) with probability \(1 - \alpha\). Once data are observed, the interval becomes fixed, and while it either includes or excludes \(\beta_i\), the confidence level reflects the long-run frequency with which such intervals contain the true value under repeated sampling.
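The construction can be validated by a coverage experiment: across repeated samples, the interval should contain the true coefficient with frequency close to the nominal level. The sketch below approximates the \(t\) critical value by simulation to stay dependency-free; in practice one would use a library quantile function:

```python
import numpy as np

rng = np.random.default_rng(7)
n, p, alpha = 30, 3, 0.05
X = rng.normal(size=(n, p))
XtX_inv = np.linalg.inv(X.T @ X)
beta = np.array([2.0, -1.0, 0.5])

# Upper alpha/2 critical value of t_{n-p}, approximated by simulation.
tcrit = np.quantile(rng.standard_t(n - p, size=400000), 1 - alpha / 2)

covered = 0
R = 5000
for _ in range(R):
    y = X @ beta + rng.normal(size=n)
    bh = XtX_inv @ X.T @ y
    e = y - X @ bh
    s2 = e @ e / (n - p)                          # unbiased variance estimate
    half = tcrit * np.sqrt(s2 * XtX_inv[1, 1])    # half-width for beta_1
    covered += (bh[1] - half <= beta[1] <= bh[1] + half)

# Coverage should be close to 1 - alpha = 0.95.
assert abs(covered / R - 0.95) < 0.02
```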
9.4.2 Confidence Intervals for the Expected Mean of a New Observation \(\mathbf{x}_{new}\)
In many applications, the researcher is not only interested in the regression coefficients themselves but also in the expected value of the response for a new vector of predictors \(\mathbf{x}_{new}\).
The expected response is given by
\[ \mathbb{E}[y_{new}] = \mathbf{x}_{new}' \boldsymbol{\beta}\]
and its natural estimator is
\[ \mathbf{x}_{new}' \hat{\boldsymbol{\beta}} \]
Since \(\hat{\boldsymbol{\beta}}\) is normally distributed, this linear combination is also normal with
\[ \mathbf{x}_{new}' \hat{\boldsymbol{\beta}} \sim N \left(\mathbf{x}_{new}' \boldsymbol{\beta}, \sigma^2 \mathbf{x}_{new}' (\mathbf{X}' \mathbf{X})^{-1} \mathbf{x}_{new} \right) \]
Replacing \(\sigma^2\) with \(\hat{\sigma}^2\) and applying the same reasoning as before, we obtain the statistic
\[ t_{\mathbf{x}_{new}'\boldsymbol{\beta}} = \frac{\mathbf{x}_{new}'\hat{\boldsymbol{\beta}} - \mathbf{x}_{new}'\boldsymbol{\beta}}{\sqrt{\hat{\sigma}^2_{\mathbf{x}_{new}'\boldsymbol{\beta}}}}=\frac{\frac{\mathbf{x}_{new}'\hat{\boldsymbol{\beta}} - \mathbf{x}_{new}'\boldsymbol{\beta}}{\sqrt{\sigma^2_{\mathbf{x}_{new}'\boldsymbol{\beta}}}}}{\sqrt{\frac{\frac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{\sigma^2}}{n-p}}} \sim t_{n-p} \] where \(\hat{\sigma}^2_{\mathbf{x}_{new}'\boldsymbol{\beta}} = \hat{\sigma}^2 \mathbf{x}_{new}' (\mathbf{X}' \mathbf{X})^{-1} \mathbf{x}_{new}\). This quantity is distributed as a \(t\) with \(n-p\) degrees of freedom since:
\[\frac{\mathbf{x}_{new}'\hat{\boldsymbol{\beta}} - \mathbf{x}_{new}'\boldsymbol{\beta}}{\sqrt{\sigma^2_{\mathbf{x}_{new}'\boldsymbol{\beta}}}} \sim N(0, 1)\] \[ \frac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{\sigma^2} \sim \chi^2_{n-p}\] and these random variables are independent, since one is a function of \(\hat{\boldsymbol{\beta}}\) and the other a function of \(\hat{\mathbf{e}}\).
Then we can conclude that:
\[\begin{align*} \mathbb{P} &\left( -t_{n-p}\left(\frac{\alpha}{2}\right) \leq t_{\mathbf{x}_{new}'\boldsymbol{\beta}} \leq t_{n-p}\left(\frac{\alpha}{2}\right) \right) = 1 - \alpha \\ &\implies \mathbb{P}\left( \mathbf{x}_{new}'\hat{\boldsymbol{\beta}} - t_{n-p}\left(\frac{\alpha}{2}\right)\sqrt{\hat{\sigma}^2_{\mathbf{x}_{new}'\boldsymbol{\beta}}} \leq \mathbf{x}_{new}'\boldsymbol{\beta}\leq \mathbf{x}_{new}'\hat{\boldsymbol{\beta}} + t_{n-p}\left(\frac{\alpha}{2}\right)\sqrt{\hat{\sigma}^2_{\mathbf{x}_{new}'\boldsymbol{\beta}}} \right) = 1 - \alpha \\ \end{align*}\]
so, the random interval is given by:
\[ \left( \mathbf{x}_{new}'\hat{\boldsymbol{\beta}} - t_{n-p}\left(\frac{\alpha}{2}\right)\sqrt{\hat{\sigma}^2_{\mathbf{x}_{new}'\boldsymbol{\beta}}} , \mathbf{x}_{new}'\hat{\boldsymbol{\beta}} + t_{n-p}\left(\frac{\alpha}{2}\right)\sqrt{\hat{\sigma}^2_{\mathbf{x}_{new}'\boldsymbol{\beta}}} \right) \] is a random interval that captures \(\mathbf{x}_{new}'\boldsymbol{\beta}\) with probability \(1 - \alpha\).
This interval quantifies uncertainty about the mean response at \(\mathbf{x}_{new}\), not about an individual observation, which would require adding the variance of the random error term.
9.4.3 Confidence Intervals for Linear Combinations of \(\boldsymbol{\beta}\)
A particularly elegant aspect of the linear model is that it allows inference on any linear combination of the coefficients. Consider a vector \(\mathbf{a}\in \mathbb{R}^p\) and the parameter of interest \(\mathbf{a}' \boldsymbol{\beta}\).
This formulation encompasses several important special cases:
- \(\mathbf{a}= (0, \ldots, 0, 1, 0, \ldots, 0)\) yields \(\mathbf{a}' \boldsymbol{\beta}= \beta_i\), the \(i\)-th coefficient;
- \(\mathbf{a}= \mathbf{x}_{new}\) yields \(\mathbf{a}' \boldsymbol{\beta}= \mathbf{x}_{new}' \boldsymbol{\beta}\), the expected mean at a new data point.
Since \(\mathbf{a}' \hat{\boldsymbol{\beta}}\) is a linear combination of normally distributed estimators, it follows that
\[ \mathbf{a}' \hat{\boldsymbol{\beta}} \sim N(\mathbf{a}' \boldsymbol{\beta}, \sigma^2 \mathbf{a}' (\mathbf{X}'\mathbf{X})^{-1} \mathbf{a}) \]
and replacing \(\sigma^2\) by its estimator leads to a \(t_{n-p}\) distribution for the corresponding standardized statistic. Therefore, the general form of a \((1 - \alpha)\) confidence interval for any linear combination is
\[ \left( \mathbf{a}'\hat{\boldsymbol{\beta}} - t_{n-p}\left(\frac{\alpha}{2}\right)\sqrt{\hat{\sigma}^2_{\mathbf{a}'\boldsymbol{\beta}}} , \mathbf{a}'\hat{\boldsymbol{\beta}} + t_{n-p}\left(\frac{\alpha}{2}\right)\sqrt{\hat{\sigma}^2_{\mathbf{a}'\boldsymbol{\beta}}} \right) \]
where \(\hat{\sigma}^2_{\mathbf{a}'\boldsymbol{\beta}} = \hat{\sigma}^2 \mathbf{a}' (\mathbf{X}' \mathbf{X})^{-1} \mathbf{a}\).
This result unifies the previous confidence intervals into a single, compact framework and highlights the power and flexibility of the linear model for statistical inference.
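The general interval is straightforward to implement. The helper below (a hypothetical name, not from the text) takes the critical value \(t_{n-p}(\alpha/2)\) as an argument so any \(t\) quantile routine can supply it, and recovers both special cases purely by the choice of \(\mathbf{a}\):

```python
import numpy as np

def linear_combo_ci(a, X, y, tcrit):
    # (1 - alpha) confidence interval for a'beta; tcrit is the upper
    # alpha/2 critical value of t_{n-p} (hypothetical helper).
    n, p = X.shape
    XtX_inv = np.linalg.inv(X.T @ X)
    beta_hat = XtX_inv @ X.T @ y
    e = y - X @ beta_hat
    s2 = e @ e / (n - p)                              # unbiased sigma^2 estimate
    center = a @ beta_hat
    half = tcrit * np.sqrt(s2 * (a @ XtX_inv @ a))    # estimated sd of a'beta_hat
    return center - half, center + half

rng = np.random.default_rng(8)
n, p = 40, 3
X = rng.normal(size=(n, p))
y = X @ np.array([1.0, 2.0, 3.0]) + rng.normal(size=n)

# a = (0, 1, 0) recovers the interval for a single coefficient;
# a = x_new would give the mean-response interval at a new point.
lo, hi = linear_combo_ci(np.array([0.0, 1.0, 0.0]), X, y, tcrit=2.03)
assert lo < hi
```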
9.5 Hypothesis Testing
We will approach hypothesis testing using an implausibility framework. This involves formulating a null hypothesis, \(H_0\), and assuming it to be true. Next, we calculate a test statistic that follows a specific distribution under the null hypothesis. By comparing the observed value of the statistic to this distribution, we assess how plausible it is to observe such a value if \(H_0\) is true.
9.5.1 Testing for the Overall Regression
For this hypothesis, we will use the following notation:
\[ \mathbf{X}^* = [\mathbf{1}\mathbf{X}] \quad \text{and} \quad \boldsymbol{\beta}^* = [\beta_0, \boldsymbol{\beta}]' \in \mathbb{R}^{p}\] that is, the \(*\) indicates the design matrix augmented with the intercept column, so that all the independent variables are included. Throughout, \(\mathbf{X}\) is assumed to be of full rank.
Our first test is to see if the Linear Regression framework is useful at all. That is, we want to test \(\mathcal{H}_0: \boldsymbol{\beta}= \mathbf{0}\). Before designing our test statistic we will show the following auxiliary results:
- \(SS_{reg} = \mathbf{y}'(\mathbf{H}- \mathbf{H}_0)\mathbf{y}\).
- \(\mathbf{H}\mathbf{H}_0 = \mathbf{H}_0 \mathbf{H}= \mathbf{H}_0\).
- \((\mathbf{H}- \mathbf{H}_0)\) is idempotent.
- \(\mathbf{y}'(\mathbf{H}- \mathbf{H}_0)\mathbf{y}\) and \(\mathbf{y}'(\mathbf{I}- \mathbf{H})\mathbf{y}\) are independent.
- Under the null hypothesis \(\mathcal{H}_0: \boldsymbol{\beta}= \mathbf{0}\), \(\frac{\mathbf{y}'(\mathbf{H}- \mathbf{H}_0)\mathbf{y}}{\sigma^2}\) is distributed like a \(\chi^2_{p-1}\).
For auxiliary result 1, we have that:
\[\begin{align*} SS_{tot} &= SS_{reg} + SS_{res} \\ &\implies \mathbf{y}'(\mathbf{I}- \mathbf{H}_0)\mathbf{y}= SS_{reg} + \mathbf{y}'(\mathbf{I}- \mathbf{H})\mathbf{y}&& \text{since $SS_{tot} = \mathbf{y}'(\mathbf{I}- \mathbf{H}_0)\mathbf{y}$ and $SS_{res} = \mathbf{y}'(\mathbf{I}- \mathbf{H})\mathbf{y}$.} \\ &\implies SS_{reg} = \mathbf{y}'(\mathbf{I}- \mathbf{H}_0)\mathbf{y}- \mathbf{y}'(\mathbf{I}- \mathbf{H})\mathbf{y}&& \\ &\implies SS_{reg} = \mathbf{y}'(\mathbf{I}- \mathbf{H}_0 - \mathbf{I}+ \mathbf{H})\mathbf{y}&& \\ &\implies SS_{reg} = \mathbf{y}'(\mathbf{H}- \mathbf{H}_0)\mathbf{y}&& \\ \end{align*}\]
For auxiliary result 2, we have that:
First, note that:
\[\begin{align*} \mathbf{H}\mathbf{H}_0 &= \mathbf{H}\mathbf{1}(\mathbf{1}' \mathbf{1})^{-1} \mathbf{1}' && \\ &= \mathbf{1}(\mathbf{1}' \mathbf{1})^{-1} \mathbf{1}' && \text{since $\mathbf{H}\mathbf{1}= \mathbf{1}$.} \\ &= \mathbf{H}_0 && \\ \end{align*}\]
Moreover, since both \(\mathbf{H}\) and \(\mathbf{H}_0\) are symmetric, \(\mathbf{H}_0 \mathbf{H}= (\mathbf{H}\mathbf{H}_0)' = \mathbf{H}_0' = \mathbf{H}_0\).
For auxiliary result 3, we have that:
\[\begin{align*} (\mathbf{H}- \mathbf{H}_0)(\mathbf{H}- \mathbf{H}_0) &= \mathbf{H}\mathbf{H}- \mathbf{H}\mathbf{H}_0 - \mathbf{H}_0 \mathbf{H}+ \mathbf{H}_0 \mathbf{H}_0 && \\ &= \mathbf{H}- \mathbf{H}\mathbf{H}_0 - \mathbf{H}_0 \mathbf{H}+ \mathbf{H}_0 && \text{since $\mathbf{H}_0$ and $\mathbf{H}$ are idempotent.} \\ &= \mathbf{H}- \mathbf{H}_0 - \mathbf{H}_0 + \mathbf{H}_0 && \text{since $\mathbf{H}\mathbf{H}_0 = \mathbf{H}_0 \mathbf{H}= \mathbf{H}_0$.} \\ &= \mathbf{H}- \mathbf{H}_0 && \\ \end{align*}\]
so, \((\mathbf{H}- \mathbf{H}_0)\) is idempotent.
For auxiliary result 4, first we have that:
\[\begin{align*} \mathbb{C}[(\mathbf{H}- \mathbf{H}_0) \mathbf{y}, (\mathbf{I}- \mathbf{H}) \mathbf{y}] &= (\mathbf{H}- \mathbf{H}_0) \mathbb{C}[\mathbf{y},\mathbf{y}] (\mathbf{I}- \mathbf{H}) \\ &= (\mathbf{H}- \mathbf{H}_0) \mathbb{V}[\mathbf{y}] (\mathbf{I}- \mathbf{H}) \\ &= \sigma^2 (\mathbf{H}- \mathbf{H}_0) (\mathbf{I}- \mathbf{H}) \\ &= \sigma^2 (\mathbf{H}- \mathbf{H}_0 - \mathbf{H}\mathbf{H}+ \mathbf{H}_0 \mathbf{H}) \\ &= \sigma^2 (\mathbf{H}- \mathbf{H}_0 - \mathbf{H}+ \mathbf{H}_0 \mathbf{H}) && \text{since $\mathbf{H}$ is idempotent.} \\ &= \sigma^2 (\mathbf{H}- \mathbf{H}_0 - \mathbf{H}+ \mathbf{H}_0) && \text{since $\mathbf{H}_0 \mathbf{H}= \mathbf{H}_0$.} \\ &= \sigma^2 \mathbf{0}&& \\ &= \mathbf{0}&& \\ \end{align*}\]
This tells us that \((\mathbf{H}- \mathbf{H}_0)\mathbf{y}\) and \((\mathbf{I}- \mathbf{H})\mathbf{y}\) are uncorrelated. Since both are normally distributed, zero correlation implies independence, and therefore any functions of these two quantities are independent. Note that:
\[\begin{align*} \mathbf{y}'(\mathbf{H}- \mathbf{H}_0)\mathbf{y} &= \mathbf{y}'(\mathbf{H}- \mathbf{H}_0)(\mathbf{H}- \mathbf{H}_0)(\mathbf{H}- \mathbf{H}_0)\mathbf{y}&& \text{since $(\mathbf{H}- \mathbf{H}_0)$ is idempotent.} \end{align*}\]
Then, \(\mathbf{y}'(\mathbf{H}- \mathbf{H}_0)\mathbf{y}\) is a quadratic function of \((\mathbf{H}- \mathbf{H}_0)\mathbf{y}\). Similarly, \(\mathbf{y}'(\mathbf{I}- \mathbf{H})\mathbf{y}\) is a quadratic function of \((\mathbf{I}- \mathbf{H})\mathbf{y}\). Therefore, \(\mathbf{y}'(\mathbf{H}- \mathbf{H}_0)\mathbf{y}\) and \(\mathbf{y}'(\mathbf{I}- \mathbf{H})\mathbf{y}\) are independent.
For auxiliary result 5, we have that:
\[\begin{align*} (\mathbf{H}- \mathbf{H}_0)\mathbf{y} &= (\mathbf{H}- \mathbf{H}_0)(\mathbf{X}^* \boldsymbol{\beta}^* + \mathbf{e}) && \text{since $\mathbf{y}= \mathbf{X}^* \boldsymbol{\beta}^* + \mathbf{e}$.} \\ &= (\mathbf{H}- \mathbf{H}_0)([\mathbf{1}\mathbf{X}] [\beta_0, \boldsymbol{\beta}]' + \mathbf{e}) && \text{since $\mathbf{X}^* = [\mathbf{1}\mathbf{X}] \quad \text{and} \quad \boldsymbol{\beta}^* = [\beta_0, \boldsymbol{\beta}]'$.} \\ &= (\mathbf{H}- \mathbf{H}_0)(\mathbf{1}\beta_0 + \mathbf{X}\boldsymbol{\beta}+ \mathbf{e}) && \\ &= (\mathbf{H}- \mathbf{H}_0)(\mathbf{1}\beta_0) + (\mathbf{H}- \mathbf{H}_0)(\mathbf{X}\boldsymbol{\beta}+ \mathbf{e}) && \\ &= (\mathbf{H}\mathbf{1}- \mathbf{H}_0 \mathbf{1})\beta_0 + (\mathbf{H}- \mathbf{H}_0)(\mathbf{X}\boldsymbol{\beta}+ \mathbf{e}) && \\ &= (\mathbf{1}- \mathbf{1})\beta_0 + (\mathbf{H}- \mathbf{H}_0)(\mathbf{X}\boldsymbol{\beta}+ \mathbf{e}) && \\ &= (\mathbf{H}- \mathbf{H}_0)(\mathbf{X}\boldsymbol{\beta}+ \mathbf{e}) && \\ &= (\mathbf{H}- \mathbf{H}_0)\mathbf{e}&& \text{iff $\mathcal{H}_0: \boldsymbol{\beta}= \mathbf{0}$ for any full rank $\mathbf{X}$.} \end{align*}\]
That is, for any full rank \(\mathbf{X}\), we have that:
\[ (\mathbf{H}- \mathbf{H}_0)\mathbf{y}= (\mathbf{H}- \mathbf{H}_0)\mathbf{e}\iff \mathcal{H}_0: \boldsymbol{\beta}= \mathbf{0}\]
Then:
\[\begin{align*} \mathbf{e}\sim N(0, \sigma^2 \mathbf{I}) &\implies \mathbf{e}'(\mathbf{H}- \mathbf{H}_0)\mathbf{e}\sim \sigma^2 \chi^2_{p-1} && \text{since $(\mathbf{H}- \mathbf{H}_0)$ is idempotent of rank $p-1$}. \\ &\implies \frac{\mathbf{e}'(\mathbf{H}- \mathbf{H}_0)\mathbf{e}}{\sigma^2} \sim \chi^2_{p-1} && \\ &\implies \frac{\mathbf{e}'(\mathbf{H}- \mathbf{H}_0)(\mathbf{H}- \mathbf{H}_0)(\mathbf{H}- \mathbf{H}_0)\mathbf{e}}{\sigma^2} \sim \chi^2_{p-1} && \text{since $(\mathbf{H}- \mathbf{H}_0)$ is idempotent}. \\ &\implies \frac{\mathbf{y}'(\mathbf{H}- \mathbf{H}_0)(\mathbf{H}- \mathbf{H}_0)(\mathbf{H}- \mathbf{H}_0)\mathbf{y}}{\sigma^2} \sim \chi^2_{p-1} && \text{iff the null hypothesis holds}. \\ &\implies \frac{\mathbf{y}'(\mathbf{H}- \mathbf{H}_0)\mathbf{y}}{\sigma^2} \sim \chi^2_{p-1} && \text{since $(\mathbf{H}- \mathbf{H}_0)$ is idempotent}. \\ \end{align*}\]
That is:
\[ \frac{\mathbf{y}'(\mathbf{H}- \mathbf{H}_0)\mathbf{y}}{\sigma^2} \sim \chi^2_{p-1} \iff \mathcal{H}_0: \boldsymbol{\beta}= \mathbf{0}\]
With these results, we propose the following statistic:
\[ F_{\boldsymbol{\beta}= 0} = \frac{\frac{SS_{reg}}{p-1}}{\frac{SS_{res}}{n-p}} \]
and we will show that this statistic follows an \(F_{p-1,n-p}\) distribution only under the null hypothesis.
\[\begin{align*} F_{\boldsymbol{\beta}= 0} &= \frac{\frac{SS_{reg}}{p-1}}{\frac{SS_{res}}{n-p}} && \\ &= \frac{\frac{\mathbf{y}'(\mathbf{H}- \mathbf{H}_0)\mathbf{y}}{p-1}}{\frac{\mathbf{y}'(\mathbf{I}- \mathbf{H})\mathbf{y}}{n-p}} && \text{since $SS_{reg} = \mathbf{y}'(\mathbf{H}- \mathbf{H}_0)\mathbf{y}$ and $SS_{res}=\mathbf{y}'(\mathbf{I}- \mathbf{H})\mathbf{y}$} \\ &= \frac{\frac{\mathbf{y}'(\mathbf{H}- \mathbf{H}_0)\mathbf{y}}{\sigma^2}\frac{1}{p-1}}{\frac{\mathbf{y}'(\mathbf{I}- \mathbf{H})\mathbf{y}}{\sigma^2}\frac{1}{n-p}} && \\ &\sim \frac{\frac{\chi^2_{p-1}}{p-1}}{\frac{\chi^2_{n-p}}{n-p}} && \text{since $\frac{\mathbf{y}'(\mathbf{H}- \mathbf{H}_0)\mathbf{y}}{\sigma^2} \sim \chi^2_{p-1}$ under the null hypothesis and $\frac{\mathbf{y}'(\mathbf{I}- \mathbf{H})\mathbf{y}}{\sigma^2} \sim \chi^2_{n-p}$.} \\ &\sim F_{p-1,n-p} && \text{since $\mathbf{y}'(\mathbf{H}- \mathbf{H}_0)\mathbf{y}$ and $\mathbf{y}'(\mathbf{I}- \mathbf{H})\mathbf{y}$ are independent.} \end{align*}\]
So, once we observe the value of this statistic, we can compare it against this distribution. Call \(F^*_{\boldsymbol{\beta}= 0}\) the observed value, and consider a random variable \(F \sim F_{p-1,n-p}\); then we can compute the probability of observing the value of the statistic (or a more extreme value).
\[ \mathbb{P}(F \geq F^*_{\boldsymbol{\beta}= 0}) \] Depending on how small or large this probability is, we can reject or not reject the null hypothesis. This value is called a p-value.
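To make the overall F-test concrete, here is a minimal numerical sketch on simulated data (all names and values below are illustrative, not from the text): it builds the hat matrices \(\mathbf{H}\) and \(\mathbf{H}_0\), forms \(SS_{reg}\) and \(SS_{res}\) as quadratic forms, and computes the statistic and its p-value.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n, p = 50, 3                                  # n observations, p columns in X* (incl. intercept)
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_true = np.array([1.0, 0.5, -0.5])        # hypothetical coefficients
y = X @ beta_true + rng.normal(size=n)

H = X @ np.linalg.inv(X.T @ X) @ X.T          # hat matrix of the full model
H0 = np.ones((n, n)) / n                      # projection onto the intercept column
SS_reg = y @ (H - H0) @ y
SS_res = y @ (np.eye(n) - H) @ y

F = (SS_reg / (p - 1)) / (SS_res / (n - p))
p_value = stats.f.sf(F, p - 1, n - p)         # P(F_{p-1, n-p} >= F*)
print(F, p_value)
```

Note that `stats.f.sf` computes the upper-tail probability directly, which is exactly \(\mathbb{P}(F \geq F^*_{\boldsymbol{\beta}=0})\).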
9.5.2 Testing if one variable is not relevant
We can test whether a particular variable is relevant for the regression; the null hypothesis is \(\mathcal{H}_0: \beta_i = 0\). We will use the same strategy: build a test statistic that has a known distribution only under the null hypothesis.
For this hypothesis we propose the following test statistic:
\[ t_{\beta_i = 0} = \frac{\hat{\beta}_i}{\sqrt{\hat{\sigma}^2 [(\mathbf{X}'\mathbf{X})^{-1}]_{ii}}} \] First note that:
\[\begin{align*} \hat{\boldsymbol{\beta}} \sim N \left(\boldsymbol{\beta}, \sigma^2(\mathbf{X}'\mathbf{X})^{-1}\right) &\implies \hat{\beta}_i \sim N(\beta_i, \sigma^2 [(\mathbf{X}'\mathbf{X})^{-1}]_{ii}) \\ &\implies \frac{\hat{\beta}_i}{\sqrt{\sigma^2 [(\mathbf{X}'\mathbf{X})^{-1}]_{ii}}} \sim N \left(\frac{\beta_i}{\sqrt{\sigma^2 [(\mathbf{X}'\mathbf{X})^{-1}]_{ii}}}, 1 \right) \\ &\implies \frac{\hat{\beta}_i}{\sqrt{\sigma^2 [(\mathbf{X}'\mathbf{X})^{-1}]_{ii}}} \sim N \left( 0, 1 \right) && \iff \mathcal{H}_0: \beta_i = 0 \\ \end{align*}\]
Then we have:
\[\begin{align*} t_{\beta_i = 0} &= \frac{\hat{\beta}_i}{\sqrt{\hat{\sigma}^2 [(\mathbf{X}'\mathbf{X})^{-1}]_{ii}}} \\ &= \frac{\frac{\hat{\beta}_i }{\sqrt{\sigma^2 [(\mathbf{X}'\mathbf{X})^{-1}]_{ii}}}}{\frac{\sqrt{\hat{\sigma}^2 [(\mathbf{X}'\mathbf{X})^{-1}]_{ii}}}{\sqrt{\sigma^2 [(\mathbf{X}'\mathbf{X})^{-1}]_{ii}}}} \\ &= \frac{\frac{\hat{\beta}_i }{\sqrt{\sigma^2 [(\mathbf{X}'\mathbf{X})^{-1}]_{ii}}}}{\sqrt{\frac{\hat{\sigma}^2}{\sigma^2}}} \\ &= \frac{\frac{\hat{\beta}_i }{\sqrt{\sigma^2 [(\mathbf{X}'\mathbf{X})^{-1}]_{ii}}}}{\sqrt{\frac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{\sigma^2}\frac{1}{n-p}}} && \text{since $\hat{\sigma}^2 = \frac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{n-p}$} \\ &\sim \frac{N \left(\frac{\beta_i}{\sqrt{\sigma^2 [(\mathbf{X}'\mathbf{X})^{-1}]_{ii}}}, 1 \right)}{\sqrt{\frac{\chi^2_{n-p}}{n-p}}} && \text{since $\frac{\hat{\beta}_i}{\sqrt{\sigma^2 [(\mathbf{X}'\mathbf{X})^{-1}]_{ii}}}$ is normal with unit variance and $\frac{\hat{\mathbf{e}}'\hat{\mathbf{e}}}{\sigma^2} \sim \chi^2_{n-p}$} \\ &\sim \frac{N \left(0, 1 \right)}{\sqrt{\frac{\chi^2_{n-p}}{n-p}}} && \iff \mathcal{H}_0: \beta_i = 0 \\ &\sim t_{n-p} && \text{since $\hat{\beta}_i$ and $\hat{\sigma}^2$ are independent}. \\ \end{align*}\]
Then, under the null hypothesis we have that:
\[ t_{\beta_i = 0} \sim t_{n-p}\] So, if we call \(t_{\beta_i = 0}^*\) the observed value of \(t_{\beta_i = 0}\), and if we let \(t\) be distributed as \(t_{n-p}\), we can compute:
\[ \mathbb{P}(|t| \geq |t_{\beta_i = 0}^*|) \] and depending on this value, we can reject or fail to reject the null hypothesis.
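A minimal numerical sketch of this t-test, again on simulated data (the design, coefficients, and seed below are illustrative assumptions): it computes \(\hat{\boldsymbol{\beta}}\), \(\hat{\sigma}^2\), the statistic \(t_{\beta_i=0}\), and the two-sided p-value, doubling the one-tailed probability by the symmetry of the \(t_{n-p}\) distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
n, p = 40, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = X @ np.array([1.0, 2.0, 0.0]) + rng.normal(size=n)   # beta_2 = 0: H0 true for i = 2

XtX_inv = np.linalg.inv(X.T @ X)
beta_hat = XtX_inv @ X.T @ y                  # OLS estimate
resid = y - X @ beta_hat
sigma2_hat = resid @ resid / (n - p)          # unbiased variance estimate

i = 2                                         # test H0: beta_i = 0
t_stat = beta_hat[i] / np.sqrt(sigma2_hat * XtX_inv[i, i])
p_value = 2 * stats.t.sf(abs(t_stat), n - p)  # P(|t_{n-p}| >= |t*|)
print(t_stat, p_value)
```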
9.5.3 Testing if a Subgroup of the Variables is Relevant
For this test, we can assume, without loss of generality, that the variables whose relevance we want to test are the first \(k\). So we can partition the design matrix as:
\[ \mathbf{X}= [\mathbf{X}_1\ \mathbf{X}_2] \] where the variables to test are in \(\mathbf{X}_1\) and the rest of the variables are in \(\mathbf{X}_2\) (possibly including the intercept). Similarly, we partition \(\boldsymbol{\beta}= [\boldsymbol{\beta}_1'\ \boldsymbol{\beta}_2']'\).
This test is similar to the first test once we express it accordingly. We will consider two linear regressions: the full model including all variables, and the reduced model including only the variables indexed by \(2\), i.e., excluding the variables under test. With this we can build the following test statistic:
\[ F_{\boldsymbol{\beta}_1=\mathbf{0}} = \frac{\frac{SS_{res,2} - SS_{res}}{k}}{\frac{SS_{res}}{n-p}} \] Then note the following:
\[\begin{align*} SS_{res,2} - SS_{res} &= \mathbf{y}'(\mathbf{I}- \mathbf{H}_2)\mathbf{y}- \mathbf{y}'(\mathbf{I}- \mathbf{H}) \mathbf{y}\\ &= \mathbf{y}'(\mathbf{I}- \mathbf{H}_2 - \mathbf{I}+ \mathbf{H}) \mathbf{y}\\ &= \mathbf{y}'(\mathbf{H}- \mathbf{H}_2) \mathbf{y}\\ \end{align*}\]
Again, we will see that \((\mathbf{H}- \mathbf{H}_2)\) is idempotent and \((\mathbf{H}- \mathbf{H}_2)\mathbf{y}= (\mathbf{H}- \mathbf{H}_2)\mathbf{e}\) only under the null hypothesis.
First, let us see that \((\mathbf{H}- \mathbf{H}_2)\) is idempotent. First note that:
\[ \mathbf{H}\mathbf{H}_2 = \mathbf{H}_2 \mathbf{H}= \mathbf{H}_2\] since \(\mathbf{H}_2\) is the projection matrix onto the column space of \(\mathbf{X}_2\), which is a subspace of the column space of \(\mathbf{X}\). Then:
\[\begin{align*} (\mathbf{H}- \mathbf{H}_2)(\mathbf{H}- \mathbf{H}_2) &= \mathbf{H}\mathbf{H}- \mathbf{H}_2 \mathbf{H}- \mathbf{H}\mathbf{H}_2 + \mathbf{H}_2 \mathbf{H}_2 \\ &= \mathbf{H}- \mathbf{H}_2 \mathbf{H}- \mathbf{H}\mathbf{H}_2 + \mathbf{H}_2 && \text{since $\mathbf{H}_2$ and $\mathbf{H}$ are idempotent}. \\ &= \mathbf{H}- \mathbf{H}_2 - \mathbf{H}_2 + \mathbf{H}_2 && \text{since $\mathbf{H}\mathbf{H}_2 = \mathbf{H}_2 \mathbf{H}= \mathbf{H}_2$}. \\ &= \mathbf{H}- \mathbf{H}_2 && \\ \end{align*}\]
then \((\mathbf{H}- \mathbf{H}_2)\) is idempotent.
Now let us see that \((\mathbf{H}- \mathbf{H}_2)\mathbf{y}= (\mathbf{H}- \mathbf{H}_2)\mathbf{e}\) under the null hypothesis. First, let us note that:
\[ \mathbf{H}\mathbf{X}_2 = \mathbf{X}_2 \] since the space generated by \(\mathbf{X}_2\) is a subspace of the space generated by \(\mathbf{X}\), because \(\mathbf{X}\) contains the columns of \(\mathbf{X}_2\). We also note that:
\[ \mathbf{H}_2 \mathbf{X}_2 = \mathbf{X}_2 \] since \(\mathbf{H}_2\) is the projection matrix onto the space generated by the columns of \(\mathbf{X}_2\). We note that these results can also be proven algebraically.
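These projection identities can also be checked numerically. The sketch below (with an arbitrary simulated design; the shapes and seed are illustrative) verifies \(\mathbf{H}\mathbf{X}_2 = \mathbf{X}_2\), \(\mathbf{H}_2\mathbf{X}_2 = \mathbf{X}_2\), and the idempotency of \(\mathbf{H}- \mathbf{H}_2\).

```python
import numpy as np

rng = np.random.default_rng(3)
n = 20
X2 = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # reduced model (incl. intercept)
X = np.column_stack([rng.normal(size=(n, 2)), X2])           # full model [X1 X2]

def hat(M):
    # projection matrix onto the column space of M
    return M @ np.linalg.inv(M.T @ M) @ M.T

H, H2 = hat(X), hat(X2)
print(np.allclose(H @ X2, X2))                   # H X2 = X2
print(np.allclose(H2 @ X2, X2))                  # H2 X2 = X2
print(np.allclose((H - H2) @ (H - H2), H - H2))  # (H - H2) is idempotent
```

With a full-rank simulated design, all three checks should hold up to floating-point tolerance.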
Then:
\[\begin{align*} (\mathbf{H}- \mathbf{H}_2)\mathbf{y} &= (\mathbf{H}- \mathbf{H}_2)(\mathbf{X}\boldsymbol{\beta}+ \mathbf{e}) \\ &= (\mathbf{H}- \mathbf{H}_2)([\mathbf{X}_1\ \mathbf{X}_2] [\boldsymbol{\beta}_1' \ \boldsymbol{\beta}_2']' + \mathbf{e}) \\ &= (\mathbf{H}- \mathbf{H}_2)(\mathbf{X}_1 \boldsymbol{\beta}_1 + \mathbf{X}_2 \boldsymbol{\beta}_2 + \mathbf{e}) \\ &= (\mathbf{H}- \mathbf{H}_2)(\mathbf{X}_2 \boldsymbol{\beta}_2) + (\mathbf{H}- \mathbf{H}_2)(\mathbf{X}_1 \boldsymbol{\beta}_1 + \mathbf{e}) \\ &= (\mathbf{H}\mathbf{X}_2 - \mathbf{H}_2\mathbf{X}_2)\boldsymbol{\beta}_2 + (\mathbf{H}- \mathbf{H}_2)(\mathbf{X}_1 \boldsymbol{\beta}_1 + \mathbf{e}) \\ &= (\mathbf{X}_2 - \mathbf{X}_2)\boldsymbol{\beta}_2 + (\mathbf{H}- \mathbf{H}_2)(\mathbf{X}_1 \boldsymbol{\beta}_1 + \mathbf{e}) && \text{since $\mathbf{H}\mathbf{X}_2 = \mathbf{X}_2$ and $\mathbf{H}_2 \mathbf{X}_2 = \mathbf{X}_2$} \\ &= (\mathbf{H}- \mathbf{H}_2)(\mathbf{X}_1 \boldsymbol{\beta}_1 + \mathbf{e}) && \\ &= (\mathbf{H}- \mathbf{H}_2)\mathbf{e}&& \iff \mathcal{H}_0: \boldsymbol{\beta}_1 = \mathbf{0}\\ \end{align*}\]
So, if \(\mathbf{X}_1\) is full rank, then we have that:
\[ (\mathbf{H}- \mathbf{H}_2)\mathbf{y}= (\mathbf{H}- \mathbf{H}_2)\mathbf{e}\iff \mathcal{H}_0: \boldsymbol{\beta}_1 = \mathbf{0}\]
Then we can proceed to see what is the distribution of our test statistic under the null hypothesis.
\[\begin{align*} F_{\boldsymbol{\beta}_1=\mathbf{0}} &= \frac{\frac{SS_{res,2} - SS_{res}}{k}}{\frac{SS_{res}}{n-p}} && \\ &= \frac{\frac{\mathbf{y}'(\mathbf{H}- \mathbf{H}_2)\mathbf{y}}{k}}{\frac{\mathbf{y}'(\mathbf{I}- \mathbf{H})\mathbf{y}}{n-p}} && \text{since $SS_{res,2} - SS_{res} = \mathbf{y}'(\mathbf{H}- \mathbf{H}_2)\mathbf{y}$ and $SS_{res}=\mathbf{y}'(\mathbf{I}- \mathbf{H})\mathbf{y}$} \\ &= \frac{\frac{\mathbf{y}'(\mathbf{H}- \mathbf{H}_2)\mathbf{y}}{\sigma^2}\frac{1}{k}}{\frac{\mathbf{y}'(\mathbf{I}- \mathbf{H})\mathbf{y}}{\sigma^2}\frac{1}{n-p}} && \\ &= \frac{\frac{\mathbf{y}'(\mathbf{H}- \mathbf{H}_2)(\mathbf{H}- \mathbf{H}_2)(\mathbf{H}- \mathbf{H}_2)\mathbf{y}}{\sigma^2}\frac{1}{k}}{\frac{\mathbf{y}'(\mathbf{I}- \mathbf{H})\mathbf{y}}{\sigma^2}\frac{1}{n-p}} && \text{since $(\mathbf{H}- \mathbf{H}_2)$ is idempotent}. \\ &= \frac{\frac{\mathbf{e}'(\mathbf{H}- \mathbf{H}_2)(\mathbf{H}- \mathbf{H}_2)(\mathbf{H}- \mathbf{H}_2)\mathbf{e}}{\sigma^2}\frac{1}{k}}{\frac{\mathbf{y}'(\mathbf{I}- \mathbf{H})\mathbf{y}}{\sigma^2}\frac{1}{n-p}} && \iff \mathcal{H}_0: \boldsymbol{\beta}_1 = \mathbf{0}\\ &= \frac{\frac{\mathbf{e}'(\mathbf{H}- \mathbf{H}_2)\mathbf{e}}{\sigma^2}\frac{1}{k}}{\frac{\mathbf{y}'(\mathbf{I}- \mathbf{H})\mathbf{y}}{\sigma^2}\frac{1}{n-p}} && \text{since $(\mathbf{H}- \mathbf{H}_2)$ is idempotent}. \\ &\sim \frac{\frac{\chi^2_{k}}{k}}{\frac{\chi^2_{n-p}}{n-p}} && \text{since $(\mathbf{H}- \mathbf{H}_2)$ is idempotent of rank $k$ and $\frac{\mathbf{e}}{\sqrt{\sigma^2}} \sim N(0, \mathbf{I})$}. \\ &\sim F_{k,n-p} && \text{since the two quadratic forms are independent.} \end{align*}\]
So, before we observe the data, \(F_{\boldsymbol{\beta}_1=\mathbf{0}}\) has a \(F_{k,n-p}\) distribution. Then, once we observe the data, call \(F_{\boldsymbol{\beta}_1=\mathbf{0}}^*\) the observed value of the statistic, and let \(F\) be distributed as an \(F_{k,n-p}\), we can compute:
\[ \mathbb{P}(F \geq F_{\boldsymbol{\beta}_1=\mathbf{0}}^*) \] and reject the null hypothesis if this probability is small, and not reject it if this probability is large.
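The partial F-test can be sketched numerically as follows (simulated data; the split into \(\mathbf{X}_1\) and \(\mathbf{X}_2\), the coefficients, and the seed are illustrative assumptions): fit the full and reduced models via their hat matrices, form \(SS_{res,2} - SS_{res}\), and compare against the \(F_{k,n-p}\) distribution.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)
n, k = 60, 2
X1 = rng.normal(size=(n, k))                                  # variables under test
X2 = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])   # kept variables (incl. intercept)
X = np.column_stack([X1, X2])
p = X.shape[1]
y = X2 @ np.array([1.0, 0.5, -0.5]) + rng.normal(size=n)      # H0 true: beta_1 = 0

def hat(M):
    # projection matrix onto the column space of M
    return M @ np.linalg.inv(M.T @ M) @ M.T

H, H2 = hat(X), hat(X2)
SS_res = y @ (np.eye(n) - H) @ y          # residual SS, full model
SS_res2 = y @ (np.eye(n) - H2) @ y        # residual SS, reduced model

F = ((SS_res2 - SS_res) / k) / (SS_res / (n - p))
p_value = stats.f.sf(F, k, n - p)         # P(F_{k, n-p} >= F*)
print(F, p_value)
```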
9.6 Power Analysis
Power analysis in multiple linear regression extends the general principles of hypothesis testing to the setting where several predictors jointly explain variability in a response variable. Because many inferential questions in regression involve testing whether one or more regression coefficients differ from zero, power analysis provides a framework for assessing how sensitive these tests are to detecting true effects of practical importance. It also informs decisions about sample size, model specification, and the expected precision of future studies.
Recall the multiple linear regression model \[ \mathbf{y} = \mathbf{X}\boldsymbol{\beta} + \mathbf{e}, \] with \(\mathbf{e} \sim N(\mathbf{0}, \sigma^2 \mathbf{I})\). In this framework, power analysis concerns the probability of rejecting a null hypothesis about one or more components of \(\boldsymbol{\beta}\) under specific alternatives.
9.6.1 Power for Testing a Single Coefficient
A common hypothesis test asks whether a particular predictor has any linear association with the response, controlling for the other predictors: \[ H_0: \beta_j = 0 \qquad\text{vs.}\qquad H_1: \beta_j \neq 0. \]
The usual test statistic takes the form \[ t_{\beta_j} = \frac{\hat{\beta}_j - \beta_j}{\sqrt{\hat{\sigma}^2_{\beta_j}}}, \] which, under the null hypothesis, follows a \(t\) distribution with \(n - p\) degrees of freedom. Under an alternative \(\beta_j = \theta_1 \neq 0\), the statistic follows a noncentral \(t\) distribution, whose noncentrality parameter depends on:
- the magnitude of the effect \(\beta_j\),
- the sampling variability of \(\hat{\beta}_j\) (involving \((\mathbf{X}'\mathbf{X})^{-1}\)),
- the error variance \(\sigma^2\), and
- the sample size.
In particular, the noncentrality parameter is given by:
\[ \mu_{\beta_j} = \frac{\beta_j}{\sqrt{\sigma^2 [(\mathbf{X}'\mathbf{X})^{-1}]_{jj}}} \]
If \(R\) is the rejection region for the two-sided test, then the power at \(\beta_j = \theta_1\) is \[ \pi(\theta_1) = \mathbb{P}(t_{\beta_j} \in R \mid \beta_j = \theta_1), \] which increases with larger sample sizes, larger effect sizes, smaller noise variance, and designs that reduce the variance of \(\hat{\beta}_j\). The structure of the design matrix therefore plays a crucial role: multicollinearity inflates \(\hat{\sigma}^2_{\beta_j}\) and can dramatically reduce power.
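A minimal power computation for this two-sided t-test can be sketched with the noncentral \(t\) distribution. The design values below are hypothetical, and `XtX_inv_jj` stands in for \([(\mathbf{X}'\mathbf{X})^{-1}]_{jj}\), which in practice comes from the actual design matrix.

```python
import numpy as np
from scipy import stats

# Hypothetical design values (assumptions, not from the text)
n, p = 50, 3
alpha = 0.05
beta_j = 0.5          # effect size under the alternative
sigma2 = 1.0          # error variance
XtX_inv_jj = 0.04     # [(X'X)^{-1}]_{jj} for the tested coefficient

nc = beta_j / np.sqrt(sigma2 * XtX_inv_jj)   # noncentrality parameter
df = n - p
t_crit = stats.t.ppf(1 - alpha / 2, df)      # two-sided critical value

# Power = P(|T| > t_crit) when T follows a noncentral t with parameter nc
power = stats.nct.sf(t_crit, df, nc) + stats.nct.cdf(-t_crit, df, nc)
print(power)
```

Increasing `beta_j` or decreasing `XtX_inv_jj` (a more informative design) raises `nc` and hence the power.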
9.6.2 Power for Testing Multiple Coefficients (Partial F-Tests)
Many inferential questions in regression concern the joint contribution of several predictors. For example, \[ H_0: \beta_{j_1} = \beta_{j_2} = \cdots = \beta_{j_k} = 0 \quad\text{versus}\quad H_1:\ \text{at least one }\beta_{j_r} \neq 0, \] is evaluated using a partial F-test. The test statistic is of the form \[ F = \frac{\left(\text{SSR}_{\text{full}} - \text{SSR}_{\text{reduced}}\right)/k}{\hat{\sigma}^2}, \] which follows an \(F_{k,\, n-p}\) distribution under \(H_0\). Under an alternative model in which the set of coefficients differs from zero, the statistic follows a noncentral \(F\) distribution, with noncentrality parameter \[ \lambda = \frac{(\mathbf{C}\boldsymbol{\beta})' \left[\mathbf{C}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{C}'\right]^{-1} (\mathbf{C}\boldsymbol{\beta})}{\sigma^2}, \] where \(\mathbf{C}\) selects the coefficients under test.
The power for the partial F-test is then \[ \pi(\lambda) = \mathbb{P}(F > F_{k,\, n-p}(1 - \alpha) \mid \lambda), \] where \(F_{k,\, n-p}(1-\alpha)\) is the \((1-\alpha)\) quantile of the central \(F\) distribution. In this setting, the noncentrality parameter encodes the joint effect size of the tested predictors. The power increases with:
- larger joint effect sizes,
- lower error variance,
- larger sample sizes, and
- predictor configurations that reduce multicollinearity.
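The power of the partial F-test can be sketched with the noncentral \(F\) distribution. The values of \(k\), \(n\), \(p\), and the noncentrality parameters below are hypothetical; in practice \(\lambda\) would be computed from \(\mathbf{C}\), \(\boldsymbol{\beta}\), \((\mathbf{X}'\mathbf{X})^{-1}\), and \(\sigma^2\) as in the formula above.

```python
from scipy import stats

# Hypothetical test dimensions and noncentrality values (assumptions)
k, n, p = 2, 60, 5
alpha = 0.05

f_crit = stats.f.ppf(1 - alpha, k, n - p)            # central F critical value
power_weak = stats.ncf.sf(f_crit, k, n - p, 2.0)     # small joint effect (lambda = 2)
power_strong = stats.ncf.sf(f_crit, k, n - p, 10.0)  # larger joint effect (lambda = 10)
print(power_weak, power_strong)
```

As expected, the power is monotone in \(\lambda\): the larger joint effect yields strictly higher power at the same significance level.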
9.6.3 The Role of the Design Matrix
Unlike simpler settings such as one-sample or two-sample tests, power in multiple regression depends critically on the geometry of the predictors. The matrix \((\mathbf{X}'\mathbf{X})^{-1}\) directly determines how much information the data provide about each coefficient and about linear combinations of coefficients. As a consequence:
- highly correlated predictors inflate variances and reduce power,
- balanced and orthogonal designs maximize power, and
- adding irrelevant predictors can decrease power for testing the important ones.
Thus, power analysis in regression provides insight not only into sample size requirements but also into the effect of model specification and predictor structure.
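The effect of predictor geometry on power can be illustrated directly through \([(\mathbf{X}'\mathbf{X})^{-1}]_{jj}\). In the sketch below (simulated data; the 0.1 noise scale and seed are illustrative), a predictor nearly collinear with another inflates this diagonal entry, and hence the variance of its estimated slope, relative to a roughly orthogonal predictor.

```python
import numpy as np

rng = np.random.default_rng(4)
n = 100
z = rng.normal(size=n)                    # a predictor kept in the model
x_orth = rng.normal(size=n)               # roughly orthogonal to z
x_corr = z + 0.1 * rng.normal(size=n)     # nearly collinear with z

def slope_var_factor(x):
    # [(X'X)^{-1}]_{jj} for the coefficient of x in the model [1, z, x]
    X = np.column_stack([np.ones(n), z, x])
    return np.linalg.inv(X.T @ X)[2, 2]

print(slope_var_factor(x_orth), slope_var_factor(x_corr))
```

The collinear configuration yields a much larger variance factor, which shrinks the noncentrality parameter and reduces the power of the corresponding t-test.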
9.6.4 Practical Applications
Power analysis in multiple linear regression is essential in the following contexts:
- Experimental design: Determining the sample size needed to detect expected effects.
- Study planning with observational data: Evaluating whether available data can reasonably support planned inferences.
- Model assessment: Understanding how collinearity or the inclusion of unnecessary predictors affects inferential strength.
- Interpretation of non-significant results: Distinguishing between evidence supporting the null hypothesis and insufficient statistical sensitivity.
In summary, power analysis for multiple linear regression extends classical power concepts to a multivariate linear setting, where the design matrix, error variance, and effect size structure collectively determine the sensitivity of hypothesis tests. This framework is fundamental for constructing well-designed studies and for obtaining reliable inferential conclusions about the regression coefficients.