6 Multiple Linear Regression
6.1 Introduction
Multiple regression is a statistical technique used to model the relationship between a dependent variable and two or more independent variables. It extends simple linear regression by allowing for a more complex analysis of how various factors impact an outcome. The general form of the multiple regression equation is:
\[ y_i = \beta_0 + \beta_1 x_{1,i} + \beta_2 x_{2,i} + \ldots + \beta_p x_{p,i} + e_i \quad i=\{1,\ldots,n\} \]
Where:
- \(y_i\) is the dependent variable.
- \(\beta_0\) is the intercept.
- \(\beta_1, \beta_2, \ldots, \beta_p\) are the coefficients of the independent variables \(x_1, x_2, \ldots, x_p\).
- \(e_i\) represents the error term.
This technique is widely used across various fields, including economics, social sciences, and healthcare, to control for multiple factors and enhance prediction accuracy. However, it requires careful attention to assumptions. Beyond linearity, we now have to consider issues such as independence, multicollinearity, and variable selection, which can affect the results.
We have already seen an example of multiple linear regression when we worked with polynomial regression. However, multiple linear regression is more general.
6.2 Example
Consider, for example, the task of explaining a country’s GDP using other economic variables such as inflation, unemployment, reference interest rate, government spending (as a percentage of GDP), and exports (as a percentage of GDP).
In this case, visualization is not as straightforward, and visually inspecting these relationships is less practical. Nevertheless, with this number of variables we can still explore the relationships between them visually, as follows:
# Reads Data
dat <- read.csv(file = "Gdp Data.csv")
# Plot the scatterplots for each pair of variables
pairs(dat)
Here we can see that some independent variables are more related to GDP and some independent variables are more related among themselves. This is valuable information that will help us develop the right linear model with these variables.
We can also observe the correlation between these variables as follows:
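The chunk that produced the matrix below is not shown; as a minimal sketch, it can be reproduced with the cor function applied to the data frame dat read above:
# Correlation matrix between all pairs of variables
print(cor(dat))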
## gdp inf une int gov exp
## gdp 1.0000000 0.875131082 -0.74874795 0.6964256 0.22172279 0.173602651
## inf 0.8751311 1.000000000 -0.78173033 0.8292061 0.31103644 0.005685918
## une -0.7487479 -0.781730327 1.00000000 -0.3642453 -0.16674407 0.010553855
## int 0.6964256 0.829206121 -0.36424525 1.0000000 0.21389456 0.015699798
## gov 0.2217228 0.311036436 -0.16674407 0.2138946 1.00000000 0.018475446
## exp 0.1736027 0.005685918 0.01055386 0.0156998 0.01847545 1.000000000
We can also fit a simple linear regression with each of the independent variables.
Inflation Rate
# Fits with Inflation
outRegInf <- lm(gdp ~ inf, data = dat)
varVal <- dat$inf
out <- outRegInf
varNam <- "Inflation Rate"
# Plots Regression Line and Scatterplot and residuals plot
par(mfrow = c(1, 2))
plot(x = varVal,
y = dat$gdp,
xlab = varNam,
ylab = "GDP")
abline(a = out$coefficients[1],
b = out$coefficients[2],
col = 'red',
lwd = 2)
plot(x = varVal,
y = out$residuals,
xlab = varNam,
ylab = "Residuals")
abline(h = 0,
lwd = 2)
Unemployment Rate
# Fits with Unemployment
outRegUne <- lm(gdp ~ une, data = dat)
varVal <- dat$une
out <- outRegUne
varNam <- "Unemplyment Rate"
# Plots Regression Line and Scatterplot and residuals plot
par(mfrow = c(1, 2))
plot(x = varVal,
y = dat$gdp,
xlab = varNam,
ylab = "GDP")
abline(a = out$coefficients[1],
b = out$coefficients[2],
col = 'red',
lwd = 2)
plot(x = varVal,
y = out$residuals,
xlab = varNam,
ylab = "Residuals")
abline(h = 0,
lwd = 2)
Interest Rate
# Fits with Interest Rate
outRegInt <- lm(gdp ~ int, data = dat)
varVal <- dat$int
out <- outRegInt
varNam <- "Interest Rate"
# Plots Regression Line and Scatterplot and residuals plot
par(mfrow = c(1, 2))
plot(x = varVal,
y = dat$gdp,
xlab = varNam,
ylab = "GDP")
abline(a = out$coefficients[1],
b = out$coefficients[2],
col = 'red',
lwd = 2)
plot(x = varVal,
y = out$residuals,
xlab = varNam,
ylab = "Residuals")
abline(h = 0,
lwd = 2)
Government Spending
# Fits with Government Spending
outRegGov <- lm(gdp ~ gov, data = dat)
varVal <- dat$gov
out <- outRegGov
varNam <- "Goverment Spending"
# Plots Regression Line and Scatterplot and residuals plot
par(mfrow = c(1, 2))
plot(x = varVal,
y = dat$gdp,
xlab = varNam,
ylab = "GDP")
abline(a = out$coefficients[1],
b = out$coefficients[2],
col = 'red',
lwd = 2)
plot(x = varVal,
y = out$residuals,
xlab = varNam,
ylab = "Residuals")
abline(h = 0,
lwd = 2)
Exports
# Fits with Exports
outRegExp <- lm(gdp ~ exp, data = dat)
varVal <- dat$exp
out <- outRegExp
varNam <- "Exports"
# Plots Regression Line and Scatterplot and residuals plot
par(mfrow = c(1, 2))
plot(x = varVal,
y = dat$gdp,
xlab = varNam,
ylab = "GDP")
abline(a = out$coefficients[1],
b = out$coefficients[2],
col = 'red',
lwd = 2)
plot(x = varVal,
y = out$residuals,
xlab = varNam,
ylab = "Residuals")
abline(h = 0,
lwd = 2)
All of them seem like good candidates for a linear relationship with GDP; however, when we use them all together, a more careful analysis is needed.
We can see the summary reports for the individual regressions and the regression with all independent variables as follows:
outRegAll <- lm(gdp ~ inf + une + int + gov + exp, data = dat)
# Summary All
print("All Independent Variables")
## [1] "All Independent Variables"
##
## Call:
## lm(formula = gdp ~ inf + une + int + gov + exp, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.56610 -0.38300 -0.00634 0.36630 1.22542
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.555301 0.380440 4.088 6.41e-05 ***
## inf 0.312218 0.090012 3.469 0.000647 ***
## une -0.377334 0.129275 -2.919 0.003938 **
## int 0.177827 0.128312 1.386 0.167403
## gov -0.008483 0.010361 -0.819 0.413950
## exp 0.064657 0.011930 5.420 1.80e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5024 on 190 degrees of freedom
## Multiple R-squared: 0.8096, Adjusted R-squared: 0.8046
## F-statistic: 161.6 on 5 and 190 DF, p-value: < 2.2e-16
## [1] "Only Inflation Rate"
##
## Call:
## lm(formula = gdp ~ inf, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.40960 -0.38896 0.03562 0.37998 1.33364
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.02551 0.08705 11.78 <2e-16 ***
## inf 0.49489 0.01965 25.19 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5514 on 194 degrees of freedom
## Multiple R-squared: 0.7659, Adjusted R-squared: 0.7646
## F-statistic: 634.5 on 1 and 194 DF, p-value: < 2.2e-16
## [1] "Only Unemployment Rate"
##
## Call:
## lm(formula = gdp ~ une, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.42276 -0.49693 0.02667 0.49525 2.79562
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 6.0868 0.2046 29.74 <2e-16 ***
## une -1.0400 0.0661 -15.73 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7554 on 194 degrees of freedom
## Multiple R-squared: 0.5606, Adjusted R-squared: 0.5584
## F-statistic: 247.5 on 1 and 194 DF, p-value: < 2.2e-16
## [1] "Only Interest Rate"
##
## Call:
## lm(formula = gdp ~ int, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.27807 -0.50801 -0.00257 0.50336 2.69719
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.31600 0.32323 -4.071 6.8e-05 ***
## int 0.86483 0.06398 13.517 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8178 on 194 degrees of freedom
## Multiple R-squared: 0.485, Adjusted R-squared: 0.4824
## F-statistic: 182.7 on 1 and 194 DF, p-value: < 2.2e-16
## [1] "Only Government Spending"
##
## Call:
## lm(formula = gdp ~ gov, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4370 -0.6442 -0.1258 0.7429 3.1839
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.37117 0.51450 2.665 0.00835 **
## gov 0.06464 0.02041 3.167 0.00179 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.111 on 194 degrees of freedom
## Multiple R-squared: 0.04916, Adjusted R-squared: 0.04426
## F-statistic: 10.03 on 1 and 194 DF, p-value: 0.001789
## [1] "Only Exports"
##
## Call:
## lm(formula = gdp ~ exp, data = dat)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.0084 -0.6679 -0.1133 0.6581 3.0810
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.32709 0.27818 8.365 1.16e-14 ***
## exp 0.06540 0.02664 2.455 0.015 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.122 on 194 degrees of freedom
## Multiple R-squared: 0.03014, Adjusted R-squared: 0.02514
## F-statistic: 6.028 on 1 and 194 DF, p-value: 0.01496
As we can see, the values of the coefficients can change between the simple linear regressions and the multiple linear regression. If the changes are very dramatic (like a change in the sign of a coefficient), further inspection of that variable is necessary.
6.3 Least Squares Estimation
For least squares estimation, we need to solve the problem:
\[ \min_{\boldsymbol{\beta}} Q(\boldsymbol{\beta}) = \sum_{i=1}^n (y_i - \hat{y}_i(\boldsymbol{\beta}))^2 = (\mathbf{y}- \hat{\mathbf{y}})'(\mathbf{y}- \hat{\mathbf{y}}) = (\mathbf{y}- \mathbf{X}\boldsymbol{\beta})'(\mathbf{y}- \mathbf{X}\boldsymbol{\beta}) \] Representing the problem in matrix notation allows us to use the same expression to solve it as with simple linear regression. The solution is obtained in exactly the same way, and is given by:
\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}' \mathbf{X})^{-1}\mathbf{X}'\mathbf{y} \] however in this case:
\[ \hat{\boldsymbol{\beta}} = \left(\hat{\beta}_0, \hat{\beta}_1, \hat{\beta}_2,\ldots,\hat{\beta}_p\right)' \] This is one reason why working in matrix form is very useful.
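To make the matrix formula concrete, here is a minimal sketch that computes \(\hat{\boldsymbol{\beta}}\) for the GDP data directly from the normal equations and compares it with lm (assuming the data frame dat loaded earlier):
# Design matrix with a column of ones and the GDP independent variables
Xd <- cbind(1, as.matrix(dat[, c("inf", "une", "int", "gov", "exp")]))
yv <- dat$gdp
# Least squares estimate: solve (X'X) beta = X'y
betaHat <- solve(t(Xd) %*% Xd, t(Xd) %*% yv)
# Compare with the estimates produced by lm
print(cbind(betaHat, coef(lm(gdp ~ inf + une + int + gov + exp, data = dat))))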
6.4 Properties of the Estimates
As with simple linear regression, we can consider several estimates:
- \(\hat{\mathbf{y}} = \mathbf{X}\hat{\boldsymbol{\beta}}\) the estimates of the observations,
- \(\hat{\mathbf{e}} = \mathbf{y}- \hat{\mathbf{y}} = \mathbf{y}- \mathbf{X}\hat{\boldsymbol{\beta}}\) the estimates of the errors.
We also note that:
\[ \hat{\mathbf{y}} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \mathbf{y}= \mathbf{H}\mathbf{y} \] where \(\mathbf{H} = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\) is called the hat matrix (because it transforms \(\mathbf{y}\) into \(\hat{\mathbf{y}}\)) or the projection matrix.
We will see that:
- \(\hat{\boldsymbol{\beta}}\) is a linear combination of \(\mathbf{y}\).
- The sum of the estimated errors is equal to zero, \(\sum_{i=1}^n \hat{e_i} = 0\).
- \(\hat{\mathbf{e}}\) and \(\mathbf{x}_j\) are orthogonal for \(j=\{1,\ldots,p\}\).
- \(\hat{\mathbf{e}}\) and \(\hat{\mathbf{y}}\) are orthogonal.
- \(\bar{y} = \hat{\bar{y}}\).
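Before proving these properties, we can check them numerically on the full GDP model fitted above; a quick sketch (assuming dat and outRegAll are still in memory):
# Numerical check of the properties of the least squares estimates
eHat <- outRegAll$residuals
yHat <- outRegAll$fitted.values
Xmat <- as.matrix(dat[, c("inf", "une", "int", "gov", "exp")])
print(sum(eHat))                    # sum of residuals, ~ 0
print(round(t(Xmat) %*% eHat, 10))  # each x_j is orthogonal to the residuals
print(sum(eHat * yHat))             # residuals orthogonal to fitted values, ~ 0
print(mean(dat$gdp) - mean(yHat))   # mean of y equals mean of the fitted values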
To see that \(\hat{\boldsymbol{\beta}}\) is a linear combination of \(y\), we need to express \(\hat{\boldsymbol{\beta}}\) as follows:
\[ \hat{\boldsymbol{\beta}} = \mathbf{A}\mathbf{y} \]
for some matrix \(\mathbf{A}\). This is very easy to do: we just let \(\mathbf{A}= (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\), so:
\[ \hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}= \mathbf{A}\mathbf{y} \] Now to see that the sum of the estimated errors is equal to zero, \(\sum_{i=1}^n \hat{e_i} = 0\), we notice that we need to show that:
\[ \hat{\mathbf{e}}' \mathbf{1}= 0 \]
To do so we notice that:
\[\begin{align*} \hat{\boldsymbol{\beta}} = (\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y} &\implies (\mathbf{X}'\mathbf{X})\hat{\boldsymbol{\beta}} = \mathbf{X}'\mathbf{y}\\ &\implies \mathbf{X}'\mathbf{y}- \mathbf{X}'\mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{0}\\ &\implies \mathbf{X}'\left(\mathbf{y}- \hat{\mathbf{y}}\right) = \mathbf{0}\\ &\implies \mathbf{X}'\hat{\mathbf{e}} = \mathbf{0} \end{align*}\]
Now focusing on the product \(\mathbf{X}'\hat{\mathbf{e}}\) we have that:
\[ \mathbf{X}'\hat{\mathbf{e}} = \left[\begin{matrix} \mathbf{1}' \\ \mathbf{x}_1' \\ \mathbf{x}_2' \\ \vdots \\ \mathbf{x}_p' \end{matrix}\right] \hat{\mathbf{e}} = \left[\begin{matrix} \mathbf{1}' \hat{\mathbf{e}} \\ \mathbf{x}_1' \hat{\mathbf{e}} \\ \mathbf{x}_2' \hat{\mathbf{e}} \\ \vdots \\ \mathbf{x}_p' \hat{\mathbf{e}} \end{matrix}\right] \] So we have that:
\[ \left[\begin{matrix} \mathbf{1}' \hat{\mathbf{e}} \\ \mathbf{x}_1' \hat{\mathbf{e}} \\ \mathbf{x}_2' \hat{\mathbf{e}} \\ \vdots \\ \mathbf{x}_p' \hat{\mathbf{e}} \end{matrix}\right] = \left[\begin{matrix} 0 \\ 0 \\ 0 \\ \vdots \\ 0 \end{matrix}\right] \] So from the first entry of this result, we have that:
\[ \mathbf{1}' \hat{\mathbf{e}} = 0 \] which is the result we wanted to prove.
Now, to show that \(\hat{\mathbf{e}}\) and \(\mathbf{x}_j\) are orthogonal for \(j=\{1,\ldots,p\}\), we again use:
\[ \left[\begin{matrix} \mathbf{1}' \hat{\mathbf{e}} \\ \mathbf{x}_1' \hat{\mathbf{e}} \\ \mathbf{x}_2' \hat{\mathbf{e}} \\ \vdots \\ \mathbf{x}_p' \hat{\mathbf{e}} \end{matrix}\right] = \left[\begin{matrix} 0 \\ 0 \\ 0 \\ \vdots \\ 0 \end{matrix}\right] \]
and notice that entries 2 to \(p+1\) prove this result, that is:
\[ \mathbf{x}_j ' \hat{\mathbf{e}} = 0 \quad j=\{1,\ldots,p\} \] Now, to show that \(\hat{\mathbf{e}}\) and \(\hat{\mathbf{y}}\) are orthogonal, we show that:
\[ \hat{\mathbf{e}}'\hat{\mathbf{y}} = 0 \] Now
\[\begin{align*} \hat{\mathbf{e}}'\hat{\mathbf{y}} &= (\mathbf{y}- \hat{\mathbf{y}})'\hat{\mathbf{y}} \\ &= \left(\mathbf{y}- \mathbf{X}\hat{\boldsymbol{\beta}}\right)'\mathbf{X}\hat{\boldsymbol{\beta}} \\ &= \left(\mathbf{y}- \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}\right)'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}\\ &= \mathbf{y}' \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}- \mathbf{y}' \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \mathbf{y}\\ &= \mathbf{y}' \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\mathbf{y}- \mathbf{y}' \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \mathbf{y}\\ &= 0 \end{align*}\]
Finally, to show that \(\bar{y} = \hat{\bar{y}}\), we use:
\[ \hat{\mathbf{e}}'\mathbf{1}= (\mathbf{y}- \hat{\mathbf{y}})'\mathbf{1}= \mathbf{y}'\mathbf{1}- \hat{\mathbf{y}}'\mathbf{1}= \sum_{i=1}^ny_i - \sum_{i=1}^n\hat{y}_i = n\bar{y} - n\hat{\bar{y}} \] since \(\hat{\mathbf{e}}'\mathbf{1}= 0\), then we have that
\[ n\bar{y} - n\hat{\bar{y}} = 0 \implies n\bar{y} = n\hat{\bar{y}} \implies \bar{y} = \hat{\bar{y}} \]
6.5 Multiple \(R^2\)
As with simple linear regression, we can explain the total variability by decomposing it into two parts: the regression variability and the error variability.
First, we define these concepts:
Total Sum of Squares \(SS_{tot}\):
The total sum of squares measures the total variability in \(\mathbf{y}\):\[ SS_{tot} = (\mathbf{y} - \bar{y} \mathbf{1})' (\mathbf{y} - \bar{y} \mathbf{1}) \]
Residual Sum of Squares \(SS_{res}\):
The residual sum of squares measures the unexplained variability in the regression model:\[ SS_{res} = (\mathbf{y} - \hat{\mathbf{y}})' (\mathbf{y} - \hat{\mathbf{y}}) \]
Explained Sum of Squares \(SS_{reg}\)
The explained sum of squares measures how much of the total variability is explained by the regression model. It is the difference between the predicted values and the mean of \(\mathbf{y}\):
\[ SS_{reg} = (\hat{\mathbf{y}} - \bar{y} \mathbf{1})' (\hat{\mathbf{y}} - \bar{y} \mathbf{1}) \] As with simple linear regression, it can be shown that:
\[ SS_{tot} = SS_{reg} + SS_{res} \] To see this, we start from \(SS_{tot}\) and use the adding-and-subtracting trick:
\[\begin{align*} SS_{tot} &= (\mathbf{y} - \bar{y} \mathbf{1})' (\mathbf{y} - \bar{y} \mathbf{1}) \\ &= (\mathbf{y} - \hat{\mathbf{y}} + \hat{\mathbf{y}} - \bar{y} \mathbf{1})' (\mathbf{y} - \hat{\mathbf{y}} + \hat{\mathbf{y}} - \bar{y} \mathbf{1}) \\ &= (\mathbf{y} - \hat{\mathbf{y}})' (\mathbf{y} - \hat{\mathbf{y}}) + (\mathbf{y} - \hat{\mathbf{y}})' (\hat{\mathbf{y}} - \bar{y} \mathbf{1}) + (\hat{\mathbf{y}} - \bar{y} \mathbf{1})' (\mathbf{y} - \hat{\mathbf{y}}) + (\hat{\mathbf{y}} - \bar{y} \mathbf{1})' (\hat{\mathbf{y}} - \bar{y} \mathbf{1}) \end{align*}\]
Now, notice that:
\[ (\mathbf{y} - \hat{\mathbf{y}})' (\hat{\mathbf{y}} - \bar{y} \mathbf{1}) = \hat{\mathbf{e}}' (\hat{\mathbf{y}} - \bar{y} \mathbf{1}) = \hat{\mathbf{e}}'\hat{\mathbf{y}} - \bar{y}\hat{\mathbf{e}}' \mathbf{1} = 0 - \bar{y}\cdot 0 = 0 \] And similarly \((\hat{\mathbf{y}} - \bar{y} \mathbf{1})' (\mathbf{y} - \hat{\mathbf{y}}) = 0\), so:
\[ SS_{tot} = (\mathbf{y} - \hat{\mathbf{y}})' (\mathbf{y} - \hat{\mathbf{y}}) + (\hat{\mathbf{y}} - \bar{y} \mathbf{1})' (\hat{\mathbf{y}} - \bar{y} \mathbf{1}) = SS_{res} + SS_{reg} \] The multiple \(R^2\) is the variability explained by the regression with respect to the total variability and can be expressed as:
\[ R^2 = \frac{SS_{reg}}{SS_{tot}} \] or using the previous expression
\[ 1 = \frac{SS_{tot}}{SS_{tot}} = \frac{SS_{reg}}{SS_{tot}} + \frac{SS_{res}}{SS_{tot}} = R^2 + \frac{SS_{res}}{SS_{tot}} \implies R^2 = 1 - \frac{SS_{res}}{SS_{tot}} \]
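These decompositions are easy to verify numerically; a short sketch for the full GDP model fitted earlier (assuming outRegAll and dat from before):
# Sum-of-squares decomposition and R^2 for the full model
yv    <- dat$gdp
yHat  <- outRegAll$fitted.values
SStot <- sum((yv - mean(yv))^2)
SSres <- sum((yv - yHat)^2)
SSreg <- sum((yHat - mean(yv))^2)
print(SStot - (SSreg + SSres))          # ~ 0
print(c(SSreg / SStot,                  # R^2 as SS_reg / SS_tot
        1 - SSres / SStot,              # R^2 as 1 - SS_res / SS_tot
        summary(outRegAll)$r.squared))  # R^2 reported by lm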
Finally, we work on the expressions of \(SS_{res}\) and \(SS_{tot}\), to express them in terms of projection matrices.
First note that:
\[ \mathbf{y}- \hat{\mathbf{y}} = \mathbf{y}- \mathbf{X}\hat{\boldsymbol{\beta}} = \mathbf{y}- \mathbf{H}\mathbf{y}= (\mathbf{I}- \mathbf{H})\mathbf{y} \] and also notice that \((\mathbf{I}- \mathbf{H})\) is symmetric and:
\[ (\mathbf{I}- \mathbf{H})(\mathbf{I}- \mathbf{H}) = \mathbf{I}-\mathbf{H}- \mathbf{H}+ \mathbf{H}\mathbf{H} \] and
\[ \mathbf{H}\mathbf{H}= \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' = \mathbf{H} \] this means \(\mathbf{H}\) is idempotent. In fact, all projection matrices are idempotent.
Then, we have that:
\[ (\mathbf{I}- \mathbf{H})(\mathbf{I}- \mathbf{H}) = \mathbf{I}-\mathbf{H}- \mathbf{H}+ \mathbf{H}= \mathbf{I}-\mathbf{H} \] which makes \(\mathbf{I}- \mathbf{H}\) also idempotent. Therefore:
\[ SS_{res} = (\mathbf{y}- \hat{\mathbf{y}})'(\mathbf{y}- \hat{\mathbf{y}}) = ((\mathbf{I}- \mathbf{H})\mathbf{y})'((\mathbf{I}- \mathbf{H})\mathbf{y}) = \mathbf{y}'(\mathbf{I}- \mathbf{H})'(\mathbf{I}- \mathbf{H})\mathbf{y}= \mathbf{y}'(\mathbf{I}- \mathbf{H})\mathbf{y} \] And we can do a similar trick for the \(SS_{tot}\) by writing \(\bar{y} \mathbf{1}\) as a result of projecting \(\mathbf{y}\) with a design matrix \(\mathbf{1}\):
\[ \bar{y} \mathbf{1}= \mathbf{1}\bar{y} = \mathbf{1}\frac{1}{n} \sum_{i=1}^n y_i = \mathbf{1}\frac{1}{n}\mathbf{1}' \mathbf{y}= \mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}' \mathbf{y} \] where we use the fact that \(\mathbf{1}'\mathbf{1}= n\).
We call \(\mathbf{H}_0 = \mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}'\), since \(\mathbf{1}(\mathbf{1}'\mathbf{1})^{-1}\mathbf{1}'\) is a projection matrix. And since it is a projection matrix it is idempotent (it is also not difficult to check this manually) and \(\mathbf{I}- \mathbf{H}_0\) is also idempotent.
So we can do:
\[ \mathbf{y}- \bar{y}\mathbf{1}= \mathbf{y}- \mathbf{H}_0 \mathbf{y}= (\mathbf{I}-\mathbf{H}_0)\mathbf{y} \]
\[ SS_{tot} = (\mathbf{y}- \bar{y}\mathbf{1})'(\mathbf{y}- \bar{y}\mathbf{1}) = ((\mathbf{I}- \mathbf{H}_0)\mathbf{y})'((\mathbf{I}- \mathbf{H}_0)\mathbf{y}) = \mathbf{y}'(\mathbf{I}- \mathbf{H}_0)'(\mathbf{I}- \mathbf{H}_0)\mathbf{y}= \mathbf{y}'(\mathbf{I}- \mathbf{H}_0)\mathbf{y} \]
so the \(R^2\) can be expressed as follows:
\[ R^2 = 1 - \frac{\mathbf{y}'(\mathbf{I}- \mathbf{H})\mathbf{y}}{\mathbf{y}'(\mathbf{I}- \mathbf{H}_0)\mathbf{y}} \] When written like this, it is easy to see that:
\[ \mathbf{y}'(\mathbf{I}- \mathbf{H})\mathbf{y}= \min_\boldsymbol{\beta}(\mathbf{y}- \mathbf{X}\boldsymbol{\beta})'(\mathbf{y}- \mathbf{X}\boldsymbol{\beta}) \] that is, the optimal value of this minimization problem, since we are plugging in the optimal value \(\hat{\boldsymbol{\beta}}\). And
\[ \mathbf{y}'(\mathbf{I}- \mathbf{H}_0)\mathbf{y}= \min_{\beta_0} (\mathbf{y}- \mathbf{X}_0 \beta_0)'(\mathbf{y}- \mathbf{X}_0 \beta_0) \]
where \(\mathbf{X}_0\) is just a matrix with one column \(\mathbf{1}\).
Now, we also have that:
\[ \min_\boldsymbol{\beta}(\mathbf{y}- \mathbf{X}\boldsymbol{\beta})'(\mathbf{y}- \mathbf{X}\boldsymbol{\beta}) \leq \min_{\beta_0} (\mathbf{y}- \mathbf{X}_0 \beta_0)'(\mathbf{y}- \mathbf{X}_0 \beta_0) \]
therefore
\[ \mathbf{y}'(\mathbf{I}- \mathbf{H})\mathbf{y}\leq \mathbf{y}'(\mathbf{I}- \mathbf{H}_0)\mathbf{y} \] and since both of them are quadratic forms, we have that:
\(\mathbf{y}'(\mathbf{I}- \mathbf{H}_0)\mathbf{y}, \mathbf{y}'(\mathbf{I}- \mathbf{H})\mathbf{y}\geq 0\)
then:
\[0 \leq \frac{\mathbf{y}'(\mathbf{I}- \mathbf{H})\mathbf{y}}{\mathbf{y}'(\mathbf{I}- \mathbf{H}_0)\mathbf{y}} \leq 1\] then:
\[0 \leq R^2 \leq 1\]
Where we use the fact that all symmetric idempotent matrices are symmetric positive semi-definite.
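We can also verify the projection-matrix expressions numerically; a brief sketch (assuming dat from before):
# R^2 through the projection matrices H and H0
yv <- dat$gdp
Xd <- cbind(1, as.matrix(dat[, c("inf", "une", "int", "gov", "exp")]))
nn <- length(yv)
H  <- Xd %*% solve(t(Xd) %*% Xd) %*% t(Xd)   # hat matrix
H0 <- matrix(1 / nn, nn, nn)                 # 1 (1'1)^{-1} 1'
print(max(abs(H %*% H - H)))                 # H is idempotent, ~ 0
R2 <- 1 - drop(t(yv) %*% (diag(nn) - H) %*% yv) /
          drop(t(yv) %*% (diag(nn) - H0) %*% yv)
print(R2)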
Another interpretation of \(R^2\) is the percentage of the variability explained by the multiple regression relative to a “poor man’s regression” in which you don’t have independent variables (that is, you are independent-variable poor). In this way, we can define
\[ \bar{y} \mathbf{1}= \hat{\mathbf{y}}_0 \] the “poor man’s prediction”, of which \(\mathbf{H}_0\) is its projection matrix (or hat matrix).
6.6 Geometric Interpretation of Multiple Linear Regression
Multiple linear regression can be thought of as projecting \(\mathbf{y}\) onto the column space of the design matrix \(\mathbf{X}\). The following diagram pictures multiple linear regression.
Here we can see several components:
- \(\mathbf{y}\) is the vector of observations. It is a vector in \(\mathbb{R}^n\).
- The grey hyper-plane is the column space generated by \(\mathbf{X}\), a sub-space of \(\mathbb{R}^n\).
- The multiple regression prediction \(\hat{\mathbf{y}}\) of \(\mathbf{y}\) is the projection of \(\mathbf{y}\) onto the space generated by the columns of \(\mathbf{X}\).
- The poor man’s prediction \(\hat{\mathbf{y}}_0\) is also in the column space of \(\mathbf{X}\) (since one of the columns is \(\mathbf{1}\)), but in most cases it is different from \(\hat{\mathbf{y}}\) (the closest vector in the column space of \(\mathbf{X}\) to \(\mathbf{y}\)).
- We also notice the differences:
- \(\mathbf{y}- \hat{\mathbf{y}}\), whose squared length is \(SS_{res}\),
- \(\mathbf{y}- \hat{\mathbf{y}}_0\), whose squared length is \(SS_{tot}\),
- \(\hat{\mathbf{y}} - \hat{\mathbf{y}}_0\), whose squared length is \(SS_{reg}\).
6.7 Centered and Standardized Variables
6.7.1 Centered Variables
Like with simple linear regression we can center and standardize our variables.
For this section, let us use the notation:
\[\mathbf{X}= \left[\mathbf{x}_1, \mathbf{x}_2, \ldots \mathbf{x}_p \right]\] a matrix with \(p\) variables in which each column is a variable \(\mathbf{x}_i\). In the context of linear regression, you can think of this as the design matrix without the column of ones. In this way, we will rename the design matrix as
\[\mathbf{X}_{*} = \left[\mathbf{1}\mathbf{X}\right] \] we introduce this notation, since we don’t want to center or standardize the column of ones.
To center matrix \(\mathbf{X}\) we need to remove the mean of every column. Notice that the vector of means is given by:
\[\bar{\mathbf{x}} = \left[\begin{matrix} \bar{\mathbf{x}}_1 \\ \bar{\mathbf{x}}_2 \\ \vdots \\ \bar{\mathbf{x}}_p \end{matrix}\right] = \left[\begin{matrix} \frac{1}{n} \mathbf{x}_1'\mathbf{1}\\ \frac{1}{n} \mathbf{x}_2'\mathbf{1}\\ \vdots \\ \frac{1}{n} \mathbf{x}_p'\mathbf{1} \end{matrix}\right] = \frac{1}{n} \left[\begin{matrix} \mathbf{x}_1'\mathbf{1}\\ \mathbf{x}_2'\mathbf{1}\\ \vdots \\ \mathbf{x}_p'\mathbf{1} \end{matrix}\right] = \frac{1}{n} \mathbf{X}' \mathbf{1}\]
then the centered data \(\mathbf{X}_c\) is given by:
\[\mathbf{X}_c = \mathbf{X}- \left[\begin{matrix} \bar{\mathbf{x}}_1 & \bar{\mathbf{x}}_2 & \dots & \bar{\mathbf{x}}_p\\ \bar{\mathbf{x}}_1 & \bar{\mathbf{x}}_2 & \dots & \bar{\mathbf{x}}_p\\ \vdots & \vdots & \ddots & \vdots \\ \bar{\mathbf{x}}_1 & \bar{\mathbf{x}}_2 & \dots & \bar{\mathbf{x}}_p \end{matrix}\right] = \mathbf{X}- \mathbf{1}\bar{\mathbf{x}}' = \mathbf{X}- \mathbf{1}\left(\frac{1}{n} \mathbf{X}' \mathbf{1}\right)' = \mathbf{X}- \frac{1}{n}\mathbf{1}\mathbf{1}' \mathbf{X}= \left(\mathbf{I}- \frac{1}{n}\mathbf{1}\mathbf{1}' \right)\mathbf{X}\]
We call \[\mathbf{C}= \left(\mathbf{I}- \frac{1}{n}\mathbf{1}\mathbf{1}' \right) = \mathbf{I}- \mathbf{H}_0 \] the centering matrix, since it centers the variables of matrix \(\mathbf{X}\). Note also that \(\mathbf{C}\) centers any matrix with \(n\) rows; in particular, a vector of size \(n\) is also centered by \(\mathbf{C}\). So we can center the dependent variable \(\mathbf{y}\) in the same way:
\[\mathbf{y}_c = \mathbf{C}\mathbf{y}\]
6.7.2 Sample Covariance
Having defined the centered matrix \(\mathbf{X}_c\), we can define the sample covariance of \(\mathbf{X}\), \(\mathbf{S}_{XX} \in \mathbb{R}^{p \times p}\), as follows:
\[ \mathbf{S}_{XX} = \frac{1}{n-1} \left(\mathbf{X}- \mathbf{1}\bar{\mathbf{x}}'\right)'\left(\mathbf{X}- \mathbf{1}\bar{\mathbf{x}}'\right) = \frac{1}{n-1} \mathbf{X}_c'\mathbf{X}_c = \frac{1}{n-1} \mathbf{X}'\mathbf{C}' \mathbf{C}\mathbf{X}\] Now, since \(\mathbf{C}= \mathbf{I}- \mathbf{H}_0\) is idempotent and symmetric we have that:
\[ \mathbf{S}_{XX} = \frac{1}{n-1} \mathbf{X}'\mathbf{C}' \mathbf{C}\mathbf{X}= \frac{1}{n-1} \mathbf{X}' \mathbf{C}\mathbf{X}= \frac{1}{n-1} \mathbf{X}_c'\mathbf{X}= \frac{1}{n-1} \mathbf{X}'\mathbf{X}_c \] So the sample covariance is the same for the original variables and the centered variables, and it is enough to center only one of the two factors.
We can also define the covariance vector between variables \(\mathbf{x}_1, \mathbf{x}_2, \ldots \mathbf{x}_p\) and variable \(\mathbf{y}\), \(\mathbf{S}_{Xy} \in \mathbb{R}^{p \times 1}\), as follows:
\[ \mathbf{S}_{Xy} = \frac{1}{n-1} \left(\mathbf{X}- \mathbf{1}\bar{\mathbf{x}}'\right)'\left(\mathbf{y}- \mathbf{1}\bar{y}\right) = \frac{1}{n-1} \mathbf{X}_c'\mathbf{y}_c = \frac{1}{n-1} \mathbf{X}'\mathbf{C}' \mathbf{C}\mathbf{y}\]
and, in the same way than before, we have that:
\[\mathbf{S}_{Xy} = \frac{1}{n-1} \mathbf{X}' \mathbf{C}' \mathbf{C}\mathbf{y}= \frac{1}{n-1} \mathbf{X}' \mathbf{C}\mathbf{y}= \frac{1}{n-1} \mathbf{X}_c'\mathbf{y}= \frac{1}{n-1} \mathbf{X}'\mathbf{y}_c\]
so you don’t need to center both variables; as long as you center one of them, the result will be the same.
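We can check this numerically; a small sketch (assuming the data frame dat from before):
# It suffices to center one of the two sides
Xm <- as.matrix(dat[, c("inf", "une", "int", "gov", "exp")])
yv <- dat$gdp
nn <- nrow(Xm)
Cc <- diag(nn) - matrix(1 / nn, nn, nn)   # centering matrix
a1 <- t(Cc %*% Xm) %*% yv / (nn - 1)      # center X only
a2 <- t(Xm) %*% (Cc %*% yv) / (nn - 1)    # center y only
print(max(abs(a1 - a2)))                  # ~ 0
print(max(abs(a1 - cov(Xm, yv))))         # matches R's cov()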
With these measures, we can focus on splitting the vector of estimated coefficients \(\hat{\boldsymbol{\beta}}\) into the estimate for the intercept and the estimates for the independent variables, as follows:
\[\hat{\boldsymbol{\beta}} = \left[\begin{matrix} \hat{\beta}_0 \\ \hat{\boldsymbol{\beta}}_{-0} \end{matrix}\right]\]
with \(\hat{\beta}_0\) the estimate of the intercept and \(\hat{\boldsymbol{\beta}}_{-0}\) the coefficients for all independent variables. That is:
\[\hat{\boldsymbol{\beta}}_{-0} = \left[\begin{matrix} \hat{\beta}_1 \\ \hat{\beta}_2 \\ \vdots \\ \hat{\beta}_p \end{matrix}\right] \in \mathbb{R}^{p \times 1}\]
Under the new notation, \(\mathbf{X}_{*}\) for the design matrix, we have that:
\[ \hat{\boldsymbol{\beta}} = \left(\mathbf{X}_{*}'\mathbf{X}_{*}\right)^{-1}\mathbf{X}_{*}'\mathbf{y}\] so:
\[ \hat{\boldsymbol{\beta}} = \left[\begin{matrix} \hat{\beta}_0 \\ \hat{\boldsymbol{\beta}}_{-0} \end{matrix}\right] = \left( \left[\mathbf{1}\mathbf{X}\right]' \left[\mathbf{1}\mathbf{X}\right] \right)^{-1} \left[\mathbf{1}\mathbf{X}\right]' \mathbf{y}\]
so we need to compute \(\left( \left[\mathbf{1}\mathbf{X}\right]' \left[\mathbf{1}\mathbf{X}\right] \right)^{-1}\).
We start by computing:
\[\left[\mathbf{1}\mathbf{X}\right]' \left[\mathbf{1}\mathbf{X}\right] = \left[\begin{matrix} \mathbf{1}' \\ \mathbf{X}' \end{matrix}\right] \left[\mathbf{1}\mathbf{X}\right] = \left[\begin{matrix} \mathbf{1}' \mathbf{1}& \mathbf{1}' \mathbf{X}\\ \mathbf{X}'\mathbf{1}& \mathbf{X}' \mathbf{X} \end{matrix}\right] = \left[\begin{matrix} n & n\bar{\mathbf{x}}' \\ n\bar{\mathbf{x}} & \mathbf{X}' \mathbf{X} \end{matrix}\right] \] Now, we need to invert a 2 by 2 block matrix (luckily there is a formula for this). The formula is in the prerequisites section; note that in this case one of the blocks is of height 1, since the first value we are looking for, \(\hat{\beta}_0\), is a scalar. After applying the formula, we have:
\[\begin{align*} \left( \left[\mathbf{1}\mathbf{X}\right]' \left[\mathbf{1}\mathbf{X}\right] \right)^{-1} &= \left[\begin{matrix} n^{-1} + n^{-1} n\bar{\mathbf{x}}' \left(\mathbf{X}' \mathbf{X}- n\bar{\mathbf{x}} n^{-1} n\bar{\mathbf{x}}' \right)^{-1}n\bar{\mathbf{x}} n^{-1} & -n^{-1} n\bar{\mathbf{x}}' \left(\mathbf{X}' \mathbf{X}- n\bar{\mathbf{x}} n^{-1} n\bar{\mathbf{x}}' \right)^{-1} \\ -\left(\mathbf{X}' \mathbf{X}- n\bar{\mathbf{x}} n^{-1} n\bar{\mathbf{x}}' \right)^{-1}n\bar{\mathbf{x}} n^{-1} & \left(\mathbf{X}' \mathbf{X}- n\bar{\mathbf{x}} n^{-1} n\bar{\mathbf{x}}' \right)^{-1} \end{matrix}\right] \\ &= \left[\begin{matrix} n^{-1} + \bar{\mathbf{x}}' \left(\mathbf{X}' \mathbf{X}- n\bar{\mathbf{x}}\bar{\mathbf{x}}' \right)^{-1}\bar{\mathbf{x}} & -\bar{\mathbf{x}}' \left(\mathbf{X}' \mathbf{X}- n\bar{\mathbf{x}} \bar{\mathbf{x}}' \right)^{-1} \\ -\left(\mathbf{X}' \mathbf{X}- n\bar{\mathbf{x}} \bar{\mathbf{x}}' \right)^{-1}\bar{\mathbf{x}} & \left(\mathbf{X}' \mathbf{X}- n\bar{\mathbf{x}} \bar{\mathbf{x}}' \right)^{-1} \end{matrix}\right] \end{align*}\]
Now, notice that:
\[\begin{align*} \mathbf{X}' \mathbf{X}- n\bar{\mathbf{x}} \bar{\mathbf{x}}' &= \mathbf{X}' \mathbf{X}- n\left(\frac{\mathbf{X}'\mathbf{1}}{n}\right)\left(\frac{\mathbf{X}'\mathbf{1}}{n}\right)' \\ &= \mathbf{X}' \mathbf{X}- \frac{1}{n}\mathbf{X}'\mathbf{1}\mathbf{1}' \mathbf{X}\\ &= \mathbf{X}'\left(\mathbf{I}- \frac{1}{n}\mathbf{1}\mathbf{1}'\right) \mathbf{X}\\ &= \mathbf{X}'\mathbf{C}\mathbf{X}\\ &= (n-1)\mathbf{S}_{XX} \end{align*}\]
then:
\[\begin{align*} \left( \left[\mathbf{1}\mathbf{X}\right]' \left[\mathbf{1}\mathbf{X}\right] \right)^{-1} &= \left[\begin{matrix} n^{-1} + \bar{\mathbf{x}}' \left((n-1)\mathbf{S}_{XX} \right)^{-1}\bar{\mathbf{x}} & -\bar{\mathbf{x}}' \left((n-1)\mathbf{S}_{XX} \right)^{-1} \\ -\left((n-1)\mathbf{S}_{XX} \right)^{-1}\bar{\mathbf{x}} & \left((n-1)\mathbf{S}_{XX} \right)^{-1} \end{matrix}\right] \\ &= \left[\begin{matrix} n^{-1} + \frac{1}{n-1}\bar{\mathbf{x}}' \mathbf{S}_{XX}^{-1} \bar{\mathbf{x}} & -\frac{1}{n-1}\bar{\mathbf{x}}' \mathbf{S}_{XX}^{-1} \\ -\frac{1}{n-1}\mathbf{S}_{XX}^{-1} \bar{\mathbf{x}} & \frac{1}{n-1}\mathbf{S}_{XX}^{-1} \end{matrix}\right] \end{align*}\]
In a similar way we can easily compute:
\[\left[\mathbf{1}\mathbf{X}\right]' \mathbf{y}= \left[\begin{matrix} \mathbf{1}' \mathbf{y}\\ \mathbf{X}'\mathbf{y} \end{matrix}\right] = \left[\begin{matrix} n\bar{y} \\ \mathbf{X}'\mathbf{y} \end{matrix}\right]\]
Then, we have that:
\[\begin{align*} \hat{\boldsymbol{\beta}} &= \left[\begin{matrix} \hat{\beta}_0 \\ \hat{\boldsymbol{\beta}}_{-0} \end{matrix}\right] \\ &= \left[\begin{matrix} n^{-1} + \frac{1}{n-1}\bar{\mathbf{x}}' \mathbf{S}_{XX}^{-1} \bar{\mathbf{x}} & -\frac{1}{n-1}\bar{\mathbf{x}}' \mathbf{S}_{XX}^{-1} \\ -\frac{1}{n-1}\mathbf{S}_{XX}^{-1} \bar{\mathbf{x}} & \frac{1}{n-1}\mathbf{S}_{XX}^{-1} \end{matrix}\right] \left[\begin{matrix} n\bar{y} \\ \mathbf{X}'\mathbf{y} \end{matrix}\right] \\ &= \left[\begin{matrix} n^{-1}n\bar{y} + \frac{n\bar{y}}{n-1}\bar{\mathbf{x}}' \mathbf{S}_{XX}^{-1} \bar{\mathbf{x}} - \frac{1}{n-1}\bar{\mathbf{x}}' \mathbf{S}_{XX}^{-1}\mathbf{X}'\mathbf{y}\\ -\frac{n\bar{y}}{n-1}\mathbf{S}_{XX}^{-1} \bar{\mathbf{x}} + \frac{1}{n-1}\mathbf{S}_{XX}^{-1}\mathbf{X}'\mathbf{y} \end{matrix}\right] \end{align*}\]
Then, working first with the second row-block:
\[\begin{align*} \hat{\boldsymbol{\beta}}_{-0} &= -\frac{n\bar{y}}{n-1}\mathbf{S}_{XX}^{-1} \bar{\mathbf{x}} + \frac{1}{n-1}\mathbf{S}_{XX}^{-1}\mathbf{X}'y \\ &= \frac{1}{n-1}\mathbf{S}_{XX}^{-1}\mathbf{X}'y -\frac{1}{n-1}\mathbf{S}_{XX}^{-1} \bar{\mathbf{x}}n\bar{y} \\ &= \frac{1}{n-1}\mathbf{S}_{XX}^{-1}\left(\mathbf{X}'y - n\bar{\mathbf{x}}\bar{y}\right) \end{align*}\]
Now, we note that:
\[\begin{align*} \mathbf{X}'\mathbf{y}- n\bar{\mathbf{x}}\bar{y} &= \mathbf{X}'\mathbf{y}- n\bar{\mathbf{x}}\bar{y} \\ &= \mathbf{X}'\mathbf{y}- n\left(\frac{\mathbf{X}'\mathbf{1}}{n}\right)\left(\frac{\mathbf{1}'\mathbf{y}}{n}\right) \\ &= \mathbf{X}'\mathbf{y}- \frac{1}{n}\mathbf{X}'\mathbf{1}\mathbf{1}'\mathbf{y}\\ &= \mathbf{X}' \left(\mathbf{I}- \frac{1}{n} \mathbf{1}\mathbf{1}' \right) \mathbf{y}\\ &= \mathbf{X}' \mathbf{C}\mathbf{y}\\ &= (n-1)\mathbf{S}_{Xy} \end{align*}\]
Then, we have that:
\[\hat{\boldsymbol{\beta}}_{-0} = \frac{1}{n-1}\mathbf{S}_{XX}^{-1}(n-1)\mathbf{S}_{Xy} = \mathbf{S}_{XX}^{-1}\mathbf{S}_{Xy}\] An equivalent result to that of simple linear regression, both using the covariance matrix of the independent variables and the covariance vector of the independent variables and the dependent variable.
This way of writing the coefficients shows the influence of each component:
- \(\mathbf{S}_{XX}^{-1}\): The relationship between the independent variables.
- \(\mathbf{S}_{Xy}\): The relationship between the independent variables and the dependent variables.
Also notice that, since centering doesn’t change these quantities (\(\mathbf{S}_{X_cX_c} = \mathbf{S}_{XX}\) and \(\mathbf{S}_{X_cy_c} = \mathbf{S}_{Xy}\)), centering the independent variables or the dependent variable (or both) doesn’t change the values of the coefficients of the independent variables.
Now, we can work more easily with the intercept estimate:
\[\begin{align*} \hat{\beta}_0 &= n^{-1}n\bar{y} + \frac{n\bar{y}}{n-1}\bar{\mathbf{x}}' \mathbf{S}_{XX}^{-1} \bar{\mathbf{x}} - \frac{1}{n-1}\bar{\mathbf{x}}' \mathbf{S}_{XX}^{-1}\mathbf{X}'\mathbf{y}\\ &= \bar{y} + \frac{1}{n-1}\bar{\mathbf{x}}'\mathbf{S}_{XX}^{-1} n\bar{\mathbf{x}}\bar{y} - \frac{1}{n-1}\bar{\mathbf{x}}'\mathbf{S}_{XX}^{-1}\mathbf{X}'\mathbf{y}\\ &= \bar{y} + \frac{1}{n-1}\bar{\mathbf{x}}'\mathbf{S}_{XX}^{-1} \left(n\bar{\mathbf{x}}\bar{y} - \mathbf{X}'\mathbf{y}\right) \\ &= \bar{y} - \frac{1}{n-1}\bar{\mathbf{x}}'\mathbf{S}_{XX}^{-1} \left(\mathbf{X}'\mathbf{y}- n\bar{\mathbf{x}}\bar{y} \right) \\ &= \bar{y} - \frac{1}{n-1}\bar{\mathbf{x}}'\mathbf{S}_{XX}^{-1} (n-1)\mathbf{S}_{Xy} \\ &= \bar{y} - \bar{\mathbf{x}}'\mathbf{S}_{XX}^{-1} \mathbf{S}_{Xy} \\ &= \bar{y} - \bar{\mathbf{x}}' \hat{\boldsymbol{\beta}}_{-0} \end{align*}\]
Again, an equivalent result to that of simple linear regression. As in simple linear regression, centering the independent variables does affect the intercept estimate: since then \(\bar{\mathbf{x}}=\mathbf{0}\), the intercept estimate after centering the independent variables is \(\bar{y}\), the mean of the dependent variable. And if we also center the dependent variable, then \(\bar{y}=0\), so the estimate of the intercept is \(0\) as well. Therefore, if you are centering all variables, it is not necessary to add the column of ones to the design matrix, since the estimate of the intercept is \(0\).
6.7.3 Standardized Variables
In the same way we worked with centered variables, we can work with standardized variables to define the sample correlations.
The standardization of \(\mathbf{X}\), \(\mathbf{X}_s\), is given by:
\[\begin{align*} \mathbf{X}_s &= \left[\begin{matrix} \frac{x_{11} - \bar{\mathbf{x}}_1}{S_{x_1x_1}^{1/2}} & \frac{x_{12} - \bar{\mathbf{x}}_2}{S_{x_2x_2}^{1/2}} & \dots & \frac{x_{1p} - \bar{\mathbf{x}}_p}{S_{x_px_p}^{1/2}} \\ \frac{x_{21} - \bar{\mathbf{x}}_1}{S_{x_1x_1}^{1/2}} & \frac{x_{22} - \bar{\mathbf{x}}_2}{S_{x_2x_2}^{1/2}} & \dots & \frac{x_{2p} - \bar{\mathbf{x}}_p}{S_{x_px_p}^{1/2}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{x_{n1} - \bar{\mathbf{x}}_1}{S_{x_1x_1}^{1/2}} & \frac{x_{n2} - \bar{\mathbf{x}}_2}{S_{x_2x_2}^{1/2}} & \dots & \frac{x_{np} - \bar{\mathbf{x}}_p}{S_{x_px_p}^{1/2}} \end{matrix}\right] \\ &= \left[\begin{matrix} x_{11} - \bar{\mathbf{x}}_1 & x_{12} - \bar{\mathbf{x}}_2 & \dots & x_{1p} - \bar{\mathbf{x}}_p \\ x_{21} - \bar{\mathbf{x}}_1 & x_{22} - \bar{\mathbf{x}}_2 & \dots & x_{2p} - \bar{\mathbf{x}}_p \\ \vdots & \vdots & \ddots & \vdots \\ x_{n1} - \bar{\mathbf{x}}_1 & x_{n2} - \bar{\mathbf{x}}_2 & \dots & x_{np} - \bar{\mathbf{x}}_p \end{matrix}\right] \left[\begin{matrix} \frac{1}{S_{x_1x_1}^{1/2}} & 0 & \dots & 0 \\ 0 & \frac{1}{S_{x_2x_2}^{1/2}} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \frac{1}{S_{x_px_p}^{1/2}} \end{matrix}\right] \\ &= \mathbf{X}_c \mathbf{D}_X = \mathbf{C}\mathbf{X}\mathbf{D}_X \end{align*}\]
where
\[ \mathbf{D}_X = \left[\begin{matrix} \frac{1}{S_{x_1x_1}^{1/2}} & 0 & \dots & 0 \\ 0 & \frac{1}{S_{x_2x_2}^{1/2}} & \dots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \dots & \frac{1}{S_{x_px_p}^{1/2}} \end{matrix}\right]\]
is the matrix that standardizes \(\mathbf{X}\). Notice that unlike \(\mathbf{C}\), that centers any matrix with the appropriate number of rows, \(\mathbf{D}_X\) only standardizes \(\mathbf{X}\).
6.7.4 Sample Correlation Matrix
We define the sample correlation matrix as:
\[ r_{XX} = \frac{\mathbf{X}_s'\mathbf{X}_s}{n-1} \] where we break a little our notation convention of using bold capital letters for matrices. The entry at row \(i\) and column \(j\) of \(r_{XX}\) is given by:
\[ \left[r_{XX}\right]_{ij} = \frac{S_{x_i x_j}}{S_{x_ix_i}^{1/2}S_{x_jx_j}^{1/2}} \]
We can also define the sample correlation between \(\mathbf{X}\) and \(\mathbf{y}\) as follows:
\[r_{Xy} = \frac{\mathbf{X}_s' \mathbf{y}_s}{n-1} \]
where \(\mathbf{y}_s\) is the standardization of \(\mathbf{y}\).
With these definitions in hand, we can see what the coefficients look like with standardized variables. Since standardized variables are also centered, it is not necessary to include the column of ones in the design matrix, as the intercept estimate is always 0. Then the estimate of the coefficients of the independent variables is given by:
\[\hat{\boldsymbol{\beta}_s} = \left(\mathbf{X}_s' \mathbf{X}_s \right)^{-1} \mathbf{X}_s' \mathbf{y}_s = \frac{1}{n-1}r_{XX}^{-1}(n-1)r_{Xy}=r_{XX}^{-1}r_{Xy}\] Notice that the coefficient estimates do change for standardized variables, since in general \(r_{XX}\neq\mathbf{S}_{XX}\) and \(r_{Xy}\neq\mathbf{S}_{Xy}\). In the case of standardized variables, the coefficient estimates depend in part on the correlations between the independent variables, \(r_{XX}\), and the correlations between the independent and dependent variables, \(r_{Xy}\).
Working with standardized variables is useful, since standardized variables are unit-less, so the magnitudes of the estimated coefficients are comparable.
We can see these results in practice with our GDP data:
# Read Data
dat <- read.csv("Gdp data.csv")
# Design Matrix Independent Variables
X <- as.matrix(dat[, -1])
# Dependent Variable
y <- dat$gdp
# Number of Observations
n <- nrow(X)
# Design Matrix with the column of Ones
Z <- cbind(rep(1, n), X)
# Centering
# Vector of Ones
v1 <- rep(1, n)
# Centering Matrix
C <- diag(n) - (1/n) * v1 %*% t(v1)
# Independent Variables Centered
Xc <- C %*% X
# Dependent Variable Centered
yc <- C %*% y
# Design Matrix with Independent Variables Centered
Zc <- cbind(rep(1, n), Xc)
# Checks that the Variables are actually centered
print(round(colMeans(Xc), 8))
## inf une int gov exp
## 0 0 0 0 0
## [1] 0
## [1] 2.981131
# Compute the Estimates of the Coefficients with Original Variables
b <- solve(t(Z) %*% Z, t(Z) %*% y)
# Compute the Estimates of the Coefficients with Centered Independent Variables Only
bs1 <- solve(t(Zc) %*% Zc, t(Zc) %*% y)
# Compute the Estimates of the Coefficients with All Variables Centered
bs2 <- solve(t(Zc) %*% Zc, t(Zc) %*% yc)
# Compute the Estimates of the Coefficients with All variables Centered and no column of ones
bs3 <- solve(t(Xc) %*% Xc, t(Xc) %*% yc)
# Shows the Estimated Coefficients Side-by-Side
print(round(cbind(b, bs1, bs2, c(0, bs3)), 8))
## [,1] [,2] [,3] [,4]
## 1.55530096 2.98113131 0.00000000 0.00000000
## inf 0.31221813 0.31221813 0.31221813 0.31221813
## une -0.37733403 -0.37733403 -0.37733403 -0.37733403
## int 0.17782709 0.17782709 0.17782709 0.17782709
## gov -0.00848329 -0.00848329 -0.00848329 -0.00848329
## exp 0.06465692 0.06465692 0.06465692 0.06465692
Here we can see that the estimated coefficients for the independent variables do not change; however, the estimate for the intercept changes depending on whether the independent variables are centered, the dependent variable is centered, or both. Also, notice that these are the same estimates that we obtained using the lm function of R.
We can also check that computing the sample covariance matrix using our formula results in the same quantities as using the cov function in R.
# Sample Covariance of X
SXX <- t(Xc) %*% Xc / (n-1)
# Sample Covariance between X and y
SXy <- t(Xc) %*% yc / (n-1)
# Shows the comparison in covariance matrices
print(SXX)
## inf une int gov exp
## inf 4.03991957 -1.28576595 1.52551017 2.4374124 0.03447976
## une -1.28576595 0.66963355 -0.27282160 -0.5319862 0.02605596
## int 1.52551017 -0.27282160 0.83778484 0.7633037 0.04335484
## gov 2.43741235 -0.53198621 0.76330368 15.2006711 0.21732195
## exp 0.03447976 0.02605596 0.04335484 0.2173220 9.10237220
## inf une int gov exp
## inf 4.03991957 -1.28576595 1.52551017 2.4374124 0.03447976
## une -1.28576595 0.66963355 -0.27282160 -0.5319862 0.02605596
## int 1.52551017 -0.27282160 0.83778484 0.7633037 0.04335484
## gov 2.43741235 -0.53198621 0.76330368 15.2006711 0.21732195
## exp 0.03447976 0.02605596 0.04335484 0.2173220 9.10237220
## [,1]
## inf 1.9993285
## une -0.6964324
## int 0.7245455
## gov 0.9825766
## exp 0.5953308
## [,1]
## inf 1.9993285
## une -0.6964324
## int 0.7245455
## gov 0.9825766
## exp 0.5953308
Finally, we can test our new formulas for the estimates:
# Computes the estimates of the coefficients using the covariance matrices
b1 <- solve(SXX, SXy)
b0 <- mean(y) - t(colMeans(X)) %*% b1
# Shows the Estimates Side by Side
print(round(cbind(b, c(b0, b1)), 8))
## [,1] [,2]
## 1.55530096 1.55530096
## inf 0.31221813 0.31221813
## une -0.37733403 -0.37733403
## int 0.17782709 0.17782709
## gov -0.00848329 -0.00848329
## exp 0.06465692 0.06465692
In the same way, we can work with the sample correlations:
# Standardizing matrix of X
DX <- diag(1/sqrt(diag(SXX)))
# Standardize X
Xs <- Xc %*% DX
# Standardize y (sample standard deviation), needed for the correlations below
ys <- drop(yc) / sd(y)
# Shows that Xs is indeed standardized
print(round(colMeans(Xs), 8))
## [1] 0 0 0 0 0
## [1] 1 1 1 1 1
## [1] 0
## [1] 1
# Sample Correlation of X
rXX <- t(Xs) %*% Xs / (n-1)
# Sample Correlation between X and y
rXy <- t(Xs) %*% ys / (n-1)
# Shows the comparison in correlation matrices
print(rXX)
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1.000000000 -0.78173033 0.8292061 0.31103644 0.005685918
## [2,] -0.781730327 1.00000000 -0.3642453 -0.16674407 0.010553855
## [3,] 0.829206121 -0.36424525 1.0000000 0.21389456 0.015699798
## [4,] 0.311036436 -0.16674407 0.2138946 1.00000000 0.018475446
## [5,] 0.005685918 0.01055386 0.0156998 0.01847545 1.000000000
## inf une int gov exp
## inf 1.000000000 -0.78173033 0.8292061 0.31103644 0.005685918
## une -0.781730327 1.00000000 -0.3642453 -0.16674407 0.010553855
## int 0.829206121 -0.36424525 1.0000000 0.21389456 0.015699798
## gov 0.311036436 -0.16674407 0.2138946 1.00000000 0.018475446
## exp 0.005685918 0.01055386 0.0156998 0.01847545 1.000000000
## [,1]
## [1,] 0.8751311
## [2,] -0.7487479
## [3,] 0.6964256
## [4,] 0.2217228
## [5,] 0.1736027
## [,1]
## inf 0.8751311
## une -0.7487479
## int 0.6964256
## gov 0.2217228
## exp 0.1736027
and we can test that:
\[\hat{\boldsymbol{\beta}_s} = r_{XX}^{-1}r_{Xy}\]
# Computes the estimates of the coefficients using the correlation matrices
bs1 <- solve(t(Xs) %*% Xs, t(Xs) %*% ys)
bs2 <- solve(rXX, rXy)
# Shows the Estimates Side by Side
print(round(cbind(bs1, bs2), 8))
## [,1] [,2]
## [1,] 0.55210259 0.55210259
## [2,] -0.27165637 -0.27165637
## [3,] 0.14319884 0.14319884
## [4,] -0.02909852 -0.02909852
## [5,] 0.17161988 0.17161988
We can also contrast the effects of the standardization on the coefficients:
# Computes the estimates of the coefficients with original and standardized variables using lm
bs1 <- lm(y ~ X)$coefficients
bs2 <- lm(ys ~ Xs)$coefficients
# Shows the Estimates Side by Side
print(round(cbind(bs1, bs2), 8))
## bs1 bs2
## (Intercept) 1.55530096 0.00000000
## Xinf 0.31221813 0.55210259
## Xune -0.37733403 -0.27165637
## Xint 0.17782709 0.14319884
## Xgov -0.00848329 -0.02909852
## Xexp 0.06465692 0.17161988
First, we can observe the 0 intercept estimate for the standardized variables, so it is not necessary to estimate it; we could have omitted it by using lm(ys ~ Xs - 1) instead of lm(ys ~ Xs). Next, we observe the change in magnitudes of the estimated coefficients of the independent variables.
6.8 Variable Cross-Effects
For this sub-section we will work with standardized values so there is no need to estimate the intercept. Since all variables are standardized, we will not use the \(\mathbf{X}_s\) notation, but instead only \(\mathbf{X}\), to make notation less confusing.
The same goes with \(\hat{\boldsymbol{\beta}}_s\), it will be only \(\hat{\boldsymbol{\beta}}\).
The idea is to analyze the estimated coefficients when you divide the independent variables into two groups, \(1\) and \(2\). So we can divide the design matrix in two:
\[\mathbf{X}= [\mathbf{X}_1 \mathbf{X}_2]\] Then, we can compute the coefficient estimates:
\[\hat{\boldsymbol{\beta}} = \left[\begin{matrix} \hat{\boldsymbol{\beta}}_1 \\ \hat{\boldsymbol{\beta}}_2 \end{matrix}\right] = \left([\mathbf{X}_1 \mathbf{X}_2]'[\mathbf{X}_1 \mathbf{X}_2]\right)^{-1}[\mathbf{X}_1 \mathbf{X}_2]' \mathbf{y}\]
So, we can work with these computations in the same way we did before:
\[\begin{align*} [\mathbf{X}_1 \mathbf{X}_2]'[\mathbf{X}_1 \mathbf{X}_2] &= \left[\begin{matrix} \mathbf{X}_1' \\ \mathbf{X}_2' \end{matrix}\right] [\mathbf{X}_1 \mathbf{X}_2] \\ &= \left[\begin{matrix} \mathbf{X}_1'\mathbf{X}_1 & \mathbf{X}_1'\mathbf{X}_2 \\ \mathbf{X}_2'\mathbf{X}_1 & \mathbf{X}_2'\mathbf{X}_2 \end{matrix}\right] \\ &= (n-1) \left[\begin{matrix} \frac{\mathbf{X}_1'\mathbf{X}_1}{n-1} & \frac{\mathbf{X}_1'\mathbf{X}_2}{n-1} \\ \frac{\mathbf{X}_2'\mathbf{X}_1}{n-1} & \frac{\mathbf{X}_2'\mathbf{X}_2}{n-1} \end{matrix}\right] \\ &= (n-1) \left[\begin{matrix} r_{X_1X_1} & r_{X_1X_2} \\ r_{X_2X_1} & r_{X_2X_2} \end{matrix}\right] \end{align*}\]
In the same way, we have that:
\[\begin{align*} [\mathbf{X}_1 \mathbf{X}_2]'\mathbf{y} &= \left[\begin{matrix} \mathbf{X}_1' \\ \mathbf{X}_2' \end{matrix}\right] \mathbf{y}\\ &= \left[\begin{matrix} \mathbf{X}_1'\mathbf{y}\\ \mathbf{X}_2'\mathbf{y} \end{matrix}\right] \\ &= (n-1) \left[\begin{matrix} \frac{\mathbf{X}_1'\mathbf{y}}{n-1} \\ \frac{\mathbf{X}_2'\mathbf{y}}{n-1} \end{matrix}\right] \\ &= (n-1) \left[\begin{matrix} r_{X_1y} \\ r_{X_2y} \end{matrix}\right] \end{align*}\]
Then we have that:
\[ \hat{\boldsymbol{\beta}} = \left[\begin{matrix} \hat{\boldsymbol{\beta}}_1 \\ \hat{\boldsymbol{\beta}}_2 \end{matrix}\right] = \left((n-1) \left[\begin{matrix} r_{X_1X_1} & r_{X_1X_2} \\ r_{X_2X_1} & r_{X_2X_2} \end{matrix}\right]\right)^{-1} (n-1) \left[\begin{matrix} r_{X_1y} \\ r_{X_2y} \end{matrix}\right] = \left[\begin{matrix} r_{X_1X_1} & r_{X_1X_2} \\ r_{X_2X_1} & r_{X_2X_2} \end{matrix}\right]^{-1} \left[\begin{matrix} r_{X_1y} \\ r_{X_2y} \end{matrix}\right] \]
Now we can compute the inverse of the 2 by 2 block matrix as before, but first we define:
\[r_{X_1|X_2} = r_{X_1X_1} - r_{X_1X_2}r_{X_2X_2}^{-1}r_{X_2X_1}\] which is the Schur complement. It can also be seen as the sample correlation matrix of \(X_1\) after accounting for its relationship with \(X_2\).
\[ \left[\begin{matrix} r_{X_1X_1} & r_{X_1X_2} \\ r_{X_2X_1} & r_{X_2X_2} \end{matrix}\right]^{-1} = \left[\begin{matrix} r_{X_1|X_2}^{-1} & -r_{X_1|X_2}^{-1}r_{X_1X_2}r_{X_2X_2}^{-1} \\ -r_{X_2X_2}^{-1}r_{X_2X_1}r_{X_1|X_2}^{-1} & r_{X_2X_2}^{-1} + r_{X_2X_2}^{-1}r_{X_2X_1}r_{X_1|X_2}^{-1}r_{X_1X_2}r_{X_2X_2}^{-1} \end{matrix}\right] \] Then, we have that:
\[\begin{align*} \hat{\boldsymbol{\beta}} &= \left[\begin{matrix} \hat{\boldsymbol{\beta}}_1 \\ \hat{\boldsymbol{\beta}}_2 \end{matrix}\right] \\ &= \left[\begin{matrix} r_{X_1|X_2}^{-1} & -r_{X_1|X_2}^{-1}r_{X_1X_2}r_{X_2X_2}^{-1} \\ -r_{X_2X_2}^{-1}r_{X_2X_1}r_{X_1|X_2}^{-1} & r_{X_2X_2}^{-1} + r_{X_2X_2}^{-1}r_{X_2X_1}r_{X_1|X_2}^{-1}r_{X_1X_2}r_{X_2X_2}^{-1} \end{matrix}\right] \left[\begin{matrix} r_{X_1y} \\ r_{X_2y} \end{matrix}\right] \\ &= \left[\begin{matrix} r_{X_1|X_2}^{-1}r_{X_1y} -r_{X_1|X_2}^{-1}r_{X_1X_2}r_{X_2X_2}^{-1}r_{X_2y} \\ -r_{X_2X_2}^{-1}r_{X_2X_1}r_{X_1|X_2}^{-1}r_{X_1y} + r_{X_2X_2}^{-1}r_{X_2y} + r_{X_2X_2}^{-1}r_{X_2X_1}r_{X_1|X_2}^{-1}r_{X_1X_2}r_{X_2X_2}^{-1}r_{X_2y} \end{matrix}\right] \end{align*}\]
Since the variables in groups 1 and 2 can be switched and have no special characteristics, we only need to analyze the structure of \(\hat{\boldsymbol{\beta}}_1\) in relation to the variables of group 2; the results are analogous for \(\hat{\boldsymbol{\beta}}_2\).
Then we have that:
\[ \hat{\boldsymbol{\beta}}_1 = r_{X_1|X_2}^{-1}r_{X_1y} -r_{X_1|X_2}^{-1}r_{X_1X_2}r_{X_2X_2}^{-1}r_{X_2y} = r_{X_1|X_2}^{-1}\left(r_{X_1y} -r_{X_1X_2}r_{X_2X_2}^{-1}r_{X_2y}\right) \] Now, suppose that we want to fit the following linear models:
\[\mathbf{y}= \mathbf{X}_1 \boldsymbol{\beta}_1 + \mathbf{e}\quad \text{and} \quad \mathbf{y}= \mathbf{X}_2 \boldsymbol{\beta}_2 + \mathbf{e}\] that is, a linear model of \(\mathbf{y}\) using only the variables in group 1 for one model, and only the variables in group 2 for the second model. Then, the coefficient estimates will be:
\[\tilde{\boldsymbol{\beta}}_1 = r_{X_1X_1}^{-1}r_{X_1y} \quad \text{and} \quad \tilde{\boldsymbol{\beta}}_2 = r_{X_2X_2}^{-1}r_{X_2y}\] Note that, in general, these estimates will be different from the estimates using all the variables, that is:
\[\tilde{\boldsymbol{\beta}}_1 \neq \hat{\boldsymbol{\beta}}_1 \quad \text{and} \quad \tilde{\boldsymbol{\beta}}_2 \neq \hat{\boldsymbol{\beta}}_2\] Then, we can re-write our coefficient estimate for \(\boldsymbol{\beta}_1\) as follows:
\[ \hat{\boldsymbol{\beta}}_1 = r_{X_1|X_2}^{-1}\left(r_{X_1y} -r_{X_1X_2}\tilde{\boldsymbol{\beta}}_2\right) \] Now, if \(r_{X_1X_2} = \mathbf{0}\), that is the variables in group 1 and group 2 are uncorrelated, then
\[r_{X_1|X_2} = r_{X_1X_1} - r_{X_1X_2}r_{X_2X_2}^{-1}r_{X_2X_1} = r_{X_1X_1}\] so: \[\hat{\boldsymbol{\beta}}_1 = r_{X_1|X_2}^{-1}\left(r_{X_1y} -r_{X_1X_2}\tilde{\boldsymbol{\beta}}_2\right) = r_{X_1X_1}^{-1}r_{X_1y} = \tilde{\boldsymbol{\beta}}_1\] So, when the variables in one group are uncorrelated with those in the other, the coefficient estimates are the same in the full and partial models.
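As an illustration of the uncorrelated case, here is a small sketch with simulated (hypothetical) data in which the standardized columns are exactly uncorrelated, so the full-model and partial-model estimates coincide:
# Simulated example: uncorrelated regressors give identical full and partial estimates
set.seed(1)
nSim <- 200
Zr <- scale(matrix(rnorm(nSim * 4), nSim, 4), scale = FALSE)  # centered random columns
Qo <- qr.Q(qr(Zr))                                            # orthogonal (still centered) columns
Xo <- scale(Qo)                                               # standardized, pairwise uncorrelated
yy <- drop(Xo %*% c(1, -2, 0.5, 3)) + rnorm(nSim)
bFull <- coef(lm(yy ~ Xo - 1))         # all four variables
bPart <- coef(lm(yy ~ Xo[, 1:2] - 1))  # group 1 only
print(round(cbind(bFull[1:2], bPart), 10))  # identical up to floating-point error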
We can also deduce that even when the sample correlation between \(X_1\) and \(y\) is \(\mathbf{0}\), that is \(r_{X_1y} = \mathbf{0}\), the estimated coefficients will not be \(\mathbf{0}\) in general. In fact, we have that:
\[ \hat{\boldsymbol{\beta}}_1 = -r_{X_1|X_2}^{-1}r_{X_1X_2}\tilde{\boldsymbol{\beta}}_2 \]
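We can also verify the block expression for \(\hat{\boldsymbol{\beta}}_1\) on the GDP data; a sketch with a hypothetical split, group 1 = {inf, une} and group 2 = {int, gov, exp} (assuming the standardized Xs, ys, n, rXX and rXy from before):
# Block formula for the group-1 coefficients
X1 <- Xs[, 1:2]
X2 <- Xs[, 3:5]
r11 <- t(X1) %*% X1 / (n - 1)
r12 <- t(X1) %*% X2 / (n - 1)
r22 <- t(X2) %*% X2 / (n - 1)
r1y <- t(X1) %*% ys / (n - 1)
r2y <- t(X2) %*% ys / (n - 1)
b2Tilde <- solve(r22, r2y)                       # partial model with group 2 only
schur   <- r11 - r12 %*% solve(r22) %*% t(r12)   # r_{X1|X2}
b1Hat   <- solve(schur, r1y - r12 %*% b2Tilde)
# Compare with the first block of the full-model estimates
print(cbind(b1Hat, solve(rXX, rXy)[1:2]))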
6.8.1 Single Variable Cross-Effects
In the special case where group 1 consists of only 1 variable, we have that:
\(\hat{\beta}_1, \quad r_{X_1X_1} = r_{x_1x_1}=1, \quad r_{X_1|X_2}=r_{x_1|X_2}, \quad r_{X_1y}=r_{x_1y}=\tilde{\beta}_1\)
are scalars, while
\[r_{X_1X_2} = r_{x_1X_2}=r_{X_2X_1}' = r_{X_2x_1}', \quad r_{X_2y}, \quad \tilde{\boldsymbol{\beta}}_2\] are vectors of size \(p-1\).
Now, consider the following linear model:
\[\mathbf{x}_1 = \mathbf{X}_2 \boldsymbol{\alpha}_2 + \mathbf{e}\]
that is, fitting the single variable in group 1, as the dependent variable, and the variables in group 2, as the independent variables. Then we have that the estimated coefficients are:
\[ \hat{\boldsymbol{\alpha}}_2 = r_{X_2X_2}^{-1}r_{X_2x_1}\] then, we notice the following
\[r_{x_1|X_2} = r_{X_1X_1} - r_{X_1X_2}r_{X_2X_2}^{-1}r_{X_2X_1} = r_{x_1x_1} - r_{x_1X_2}r_{X_2X_2}^{-1}r_{X_2x_1} = 1 - r_{x_1X_2}\hat{\boldsymbol{\alpha}}_2 = 1 - r_{x_1X_2}r_{X_2X_2}^{-1}r_{X_2X_2}\hat{\boldsymbol{\alpha}}_2 = 1 - \hat{\boldsymbol{\alpha}}_2'r_{X_2X_2}\hat{\boldsymbol{\alpha}}_2 = 1 - R^2_1 \]
where, \(R^2_1\) is the multiple coefficient of determination for model:
\[\mathbf{x}_1 = \mathbf{X}_2 \boldsymbol{\alpha}_2 + \mathbf{e}\]
that is, how much of \(\mathbf{x}_1\) is explained by \(\mathbf{X}_2\).
Then, we have that:
\[\hat{\beta}_1 = r_{x_1|X_2}^{-1}\left(r_{x_1y} -r_{x_1X_2}\tilde{\boldsymbol{\beta}}_2\right) = \frac{1}{1-R^2_1} \left(\tilde{\beta}_1 -r_{x_1X_2}\tilde{\boldsymbol{\beta}}_2\right)\] We name \(\frac{1}{1 - R^2_1}\) as:
\[VIF_1 = \frac{1}{1 - R^2_1} \] the bigger it is, the more distorted the value of \(\hat{\beta}_1\). This number increases as \(R^2_1\) increases.
If \(R^2_1 = 0\), that is \(\mathbf{X}_2\) explains nothing of \(\mathbf{x}_1\), then
\[ VIF_1 = 1 \] that represents no distortion. If \(R^2_1 \to 1\), then:
\[ VIF_1 \to \infty \] which represents infinite distortion.
The name \(VIF\) stands for variance inflation factor. We will see later how it relates to the variance of the estimate \(\hat{\beta}_1\).
Of course, variable 1 didn’t have any particular property, so we can generalize this to any variable \(k\).
On the other hand, we also have the effect of \(r_{x_1X_2}\tilde{\boldsymbol{\beta}}_2\) on the coefficient: it adjusts the single-variable estimate \(\tilde{\beta}_1\) using the correlations between \(x_1\) and the group-2 variables and their coefficients \(\tilde{\boldsymbol{\beta}}_2\).
6.8.1.1 Single Variable Cross-effects example
We can see the effects the other variables have in changing the coefficients in our GDP data set, by computing \(R^2_k\) for each variable.
# Re-names Standardized Variables
y <- ys
X <- Xs
# Coefficients for the Full Model
b <- lm(y ~ X -1)$coefficients
# Individual Coefficients
# Number of Variables
P <- ncol(X)
# Vector to Save Individual Coefficients
bi <- numeric(length = P)
for(k in 1:P){
bi[k] <- lm(y ~ X[, k] - 1)$coefficients
}
# Coefficients without variable k
bt <- matrix(data = NA, nrow = P-1, ncol = P)
for(k in 1:P){
bt[, k] <- lm(y ~ X[, -k] - 1)$coefficients
}
# Coefficient of Determination of X_k ~ X_{-k} models and coefficients
R2k <- numeric(length = P)
for(k in 1:P){
outReg <- lm(X[, k] ~ X[, -k] - 1)
R2k[k] <- summary(outReg)$r.squared
}
# Cross product of correlation and coefficients
rbt <- numeric(length = P)
rXX <- cor(X)
for(k in 1:P){
rbt[k] <- rXX[k, -k] %*% bt[, k]
}
# Formats and Show Info Table
tab <- cbind(b, bi, R2k, 1/(1-R2k), rbt)
colnames(tab) <- c("b_full", "b_ind", "R2k", "VIF", "rb_ind")
rownames(tab) <- colnames(dat)[2:6]
knitr::kable(tab)
|     | b_full | b_ind | R2k | VIF | rb_ind |
|-----|--------|-------|-----|-----|--------|
| inf | 0.5521026 | 0.8751311 | 0.9604527 | 25.286192 | 0.8532969 |
| une | -0.2716564 | -0.7487479 | 0.8843302 | 8.645299 | -0.7173255 |
| int | 0.1431988 | 0.6964256 | 0.9061532 | 10.655663 | 0.6829869 |
| gov | -0.0290985 | 0.2217228 | 0.2067576 | 1.260649 | 0.2448050 |
| exp | 0.1716199 | 0.1736027 | 0.0008505 | 1.000851 | 0.0021287 |
6.9 Outliers and Leverage
It is harder to identify outliers visually in multiple linear regression, especially if the number of variables is large. The effect of an observation \(i\) on the linear regression fit will depend on two things:
- The size of the residual for observation \(i\), that is \(\hat{e}_i^2\).
- How “close” is observation \(i\) to other observations.
Note: For this section, we are going to use \(X_i\in\mathbb{R}^{1 \times p}\) as the \(i\)-th row of the design matrix, that is, the \(i\)-th observation of the independent variables.
6.9.1 Leverage
In multiple linear regression, leverage refers to the influence that a particular data point has on the estimation of the regression model. Specifically, it measures how far an observation’s values for the independent variables deviate from the mean of those variables. Leverage identifies data points that could potentially have a significant effect on the regression line, particularly those that are outliers in terms of the input variables.
Leverage values are calculated from the hat matrix \(\mathbf{H}\), which is derived from the design matrix \(\mathbf{X}\) used in the regression. The diagonal elements of the hat matrix, \(h_{ii}\), represent the leverage of each data point. These values range from 0 to 1, where:
- A leverage value close to 0 indicates that the point is not influential in determining the regression line.
- A leverage value closer to 1 suggests that the point has a higher influence on the fitted model.
High-leverage points are those whose independent variable values are far from the average, and they may disproportionately affect the slope and intercept of the regression line. These points can distort the regression model, leading to biased estimates if not handled properly.
A high-leverage point may not necessarily be an outlier in the response variable (dependent variable), but it could have a significant impact on the overall model fit, making leverage a critical diagnostic in regression analysis.
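In practice, the leverages of a fitted model can be extracted directly in R; a brief sketch for the full GDP model fitted earlier (assuming outRegAll from before):
# Leverages (diagonal of the hat matrix) for the full GDP model
h <- hatvalues(outRegAll)
print(range(h))   # every leverage lies between 0 and 1
print(sum(h))     # equals the number of columns of the design matrix (intercept + 5 variables = 6)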
6.9.1.1 Properties of Leverages
Let us analyze the structure of the hat matrix.
First, let us see that \(\mathbf{H}\) is symmetric and positive semi-definite. Pick \(\mathbf{z} \in \mathbb{R}^n\). Recall that \(\mathbf{X}'\mathbf{X}\) is positive definite, and therefore so is \((\mathbf{X}'\mathbf{X})^{-1}\). Then
\[\begin{align*} z'Hz &= \mathbf{z}'\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'z \\ &= (\mathbf{X}'\mathbf{z})'(\mathbf{X}'\mathbf{X})^{-1}(\mathbf{X}'\mathbf{z}) \\ &= \mathbf{w}' (\mathbf{X}'\mathbf{X})^{-1} \mathbf{w}& \mathbf{w}= \mathbf{X}'\mathbf{z}\\ &\geq 0 \end{align*}\]
Therefore, \(\mathbf{H}\) is positive semi-definite. This implies that all the eigenvalues of \(\mathbf{H}\) are non-negative, that is, positive or \(0\).
Next, let us see that the diagonal elements are between 0 and 1.
To see this, consider leverage \(i\) (\(h_{ii}\)) and the standard basis vectors \(v_j \in \mathbb{R}^n\), which are zero in every entry except for a one in entry \(j\). Then:
\[h_{ij} = v_i '\mathbf{H}v_j\] in particular, leverage \(i\) is given by:
\[h_{ii} = v_i '\mathbf{H}v_i\] Now, consider \(\mathbf{H}v_i\) and notice that:
\[\begin{align*} ||v_i - \mathbf{H}v_i||_2^2 &= (v_i - \mathbf{H}v_i)'(v_i - \mathbf{H}v_i) \\ &= v_i'v_i - v_i'\mathbf{H}v_i - v_i'\mathbf{H}v_i + v_i' \mathbf{H}\mathbf{H}v_i \\ &= v_i'v_i - v_i'\mathbf{H}\mathbf{H}v_i - v_i'\mathbf{H}\mathbf{H}v_i + v_i' \mathbf{H}\mathbf{H}v_i \\ &= v_i'v_i - v_i'\mathbf{H}\mathbf{H}v_i \\ &= ||v_i||_2^2 - ||\mathbf{H}v_i||_2^2 \\ &= 1 - ||\mathbf{H}v_i||_2^2 \end{align*}\]
Then
\[\begin{align*} ||v_i - \mathbf{H}v_i||_2^2 = 1 - ||\mathbf{H}v_i||_2^2 & \implies ||v_i - \mathbf{H}v_i||_2^2 + ||\mathbf{H}v_i||_2^2 = 1 \\ & \implies ||\mathbf{H}v_i||_2^2 \leq 1 \\ & \implies v_i' \mathbf{H}\mathbf{H}v_i \leq 1 \\ & \implies v_i' \mathbf{H}v_i \leq 1 \\ & \implies h_{ii} \leq 1 \\ \end{align*}\]
Since \(h_{ii}= v_i'\mathbf{H}\mathbf{H}v_i = ||\mathbf{H}v_i||_2^2\) is a squared norm, we also have \(h_{ii} \geq 0\). Therefore:
\[ 0 \leq h_{ii} \leq 1 \] So, leverages are bounded.
Furthermore, we have that:
\[ \sum_{i=1}^n h_{ii} = p \]
To see this, we will need the following properties:
- \(\text{rank}(\mathbf{H}) = p.\)
- All the eigenvalues of \(\mathbf{H}\) are either 0 or 1.
As usual, we assume that \(\text{rank}(\mathbf{X}) = p\) and \(n>p\), that is, there are more observations than independent variables. Then
\[\begin{align*} \text{rank}(\mathbf{X}) = p &\implies \text{rank}(\mathbf{X}'\mathbf{X}) = p \\ &\implies \text{rank}\left((\mathbf{X}'\mathbf{X})^{-1}\right) = p \\ &\implies \text{rank}\left(\mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}'\right) = p \\ &\implies \text{rank}\left(\mathbf{H}\right) = p \\ \end{align*}\]
Now, let \(\lambda\) be an eigenvalue of \(\mathbf{H}\), with corresponding eigenvector \(\mathbf{v}\). Then:
\[\begin{align*} \mathbf{H}\mathbf{v}= \lambda \mathbf{v} &\implies \mathbf{H}\mathbf{H}\mathbf{v}= \mathbf{H}\lambda \mathbf{v}= \lambda \mathbf{H}\mathbf{v}= \lambda^2 \mathbf{v}\\ &\implies \mathbf{H}\mathbf{v}= \lambda^2 \mathbf{v}\quad \text{since } \mathbf{H}\mathbf{H}= \mathbf{H}\\ &\implies \lambda \mathbf{v}= \lambda^2 \mathbf{v}\\ &\implies \lambda = 0 \lor \lambda = 1 \end{align*}\]
Now, since \(\mathbf{H}\) has rank \(p\), it must have \(p\) non-zero eigenvalues. And since all of its non-zero eigenvalues are equal to 1, we have that:
\(\sum_{i=1}^n \lambda_i = p\)
That is, the sum of the eigenvalues is \(p\). We also know that, for any square matrix, the sum of the eigenvalues is equal to the trace of the matrix. Then:
\[ \text{trace}(\mathbf{H}) = p \implies \sum_{i=1}^n h_{ii} = p\] This gives us a useful reference point for the leverages. You can compute a proportional leverage as:
\[ \frac{h_{ii}}{p}\] the proportion of the leverage \(i\) out of “all” the leverage. Or in terms of “observations”, you can compute:
\[\frac{n h_{ii}}{p}\] If this number is several times bigger than \(1\), the observation has considerable leverage.
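These properties are easy to verify numerically. The following is a minimal sketch, using the same `Gdp data.csv` file and a design matrix with an intercept column, as in the code later in this section: it builds the hat matrix, checks that every leverage lies in \([0,1]\), checks that the leverages sum to \(p\), and flags observations whose leverage is, say, more than three times the average \(p/n\).
# Reads Data and builds the design matrix with an intercept column
dat <- read.csv(file = "Gdp data.csv")
X <- cbind(1, as.matrix(dat[, -1]))
n <- nrow(X)
p <- ncol(X)
# Hat matrix and leverages
H <- X %*% solve(t(X) %*% X, t(X))
h <- diag(H)
# All leverages are between 0 and 1
all(h >= 0 & h <= 1)
# The leverages sum to the number of columns of X
sum(h)
# Observations with leverage more than 3 times the average p / n
which(n * h / p > 3)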
Another way to characterize the leverages can be found as follows:
Recall that the hat matrix is given by:
\[\mathbf{H}= \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \] and we can express the design matrix as follows:
\[ \mathbf{X}= \left[\begin{matrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{matrix}\right] \]
and
\[ \mathbf{X}' = [X_1' X_2' \ldots X_n'] \]
Then, the hat matrix is given by:
\[\begin{align*} \mathbf{H} &= \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1}\mathbf{X}' \\ &= \mathbf{X}(\mathbf{X}'\mathbf{X})^{-1} [X_1' X_2' \ldots X_n'] \\ &= \mathbf{X}[(\mathbf{X}'\mathbf{X})^{-1}X_1' (\mathbf{X}'\mathbf{X})^{-1}X_2' \ldots (\mathbf{X}'\mathbf{X})^{-1}X_n'] \\ &= \left[\begin{matrix} X_1 \\ X_2 \\ \vdots \\ X_n \end{matrix}\right] [(\mathbf{X}'\mathbf{X})^{-1}X_1' (\mathbf{X}'\mathbf{X})^{-1}X_2' \ldots (\mathbf{X}'\mathbf{X})^{-1}X_n'] \\ &= \left[\begin{matrix} X_1 (\mathbf{X}'\mathbf{X})^{-1}X_1' & X_1 (\mathbf{X}'\mathbf{X})^{-1}X_2' & \ldots & X_1 (\mathbf{X}'\mathbf{X})^{-1}X_n' \\ X_2 (\mathbf{X}'\mathbf{X})^{-1}X_1' & X_2 (\mathbf{X}'\mathbf{X})^{-1}X_2' & \ldots & X_2 (\mathbf{X}'\mathbf{X})^{-1}X_n' \\ \vdots & \vdots & \vdots \\ X_n (\mathbf{X}'\mathbf{X})^{-1}X_1' & X_n (\mathbf{X}'\mathbf{X})^{-1}X_2' & \ldots & X_n (\mathbf{X}'\mathbf{X})^{-1}X_n' \\ \end{matrix}\right] \end{align*}\]
We are going to use this expression later, but we can already notice that:
\[h_{ij} = X_i (\mathbf{X}'\mathbf{X})^{-1}X_j'\] so, in particular, the leverages are given by:
\[h_{ii} = X_i (\mathbf{X}'\mathbf{X})^{-1}X_i'\]
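As a quick sanity check (a sketch that reuses the matrices `X` and `H` built in the leverage sketch above), a single entry of the hat matrix can be reproduced from this row-wise expression:
# Entry (i, j) of the hat matrix from the row-wise formula
i <- 2
j <- 5
Xi <- X[i, , drop = FALSE]   # 1 x p row of the design matrix
Xj <- X[j, , drop = FALSE]
c(Xi %*% solve(t(X) %*% X, t(Xj)))
# Matches the corresponding entry of H
H[i, j]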
6.9.2 Outliers identification
Identifying outliers in multiple regression is hard. One way to measure the influence of an observation is to see how much the predicted values change when that observation is removed. We will call:
- \(\mathbf{X}_{(i)}\) the design matrix, when the \(i\)-th observation is removed.
- \(\mathbf{y}_{(i)}\) the dependent variables without the observation \(i\).
- \(\hat{\mathbf{y}}_{(i)}\) the predicted values when the \(i\)-th observation is removed.
- \(\mathbf{H}_{(i)}\) the hat matrix when the \(i\)-th observation is removed.
Then a measure of influence can be:
\[||\hat{\mathbf{y}}- \hat{\mathbf{y}}_{(i)}||^2_2 = (\hat{\mathbf{y}}- \hat{\mathbf{y}}_{(i)})'(\hat{\mathbf{y}}- \hat{\mathbf{y}}_{(i)}) = \sum_{j=1}^n (\hat{y}_j - \hat{y}_{(i),j})^2\] Notice that:
\[\hat{\mathbf{y}} = \mathbf{H}\mathbf{y}\] and, similarly
\[\hat{\mathbf{y}}_{(i)} = \mathbf{H}_{(i)} \mathbf{y}_{(i)}\]
Now, let us compute:
\[\begin{align*} \hat{\mathbf{y}} &= \mathbf{H}\mathbf{y}\\ &= \left[\begin{matrix} X_1 (\mathbf{X}'\mathbf{X})^{-1}X_1' & X_1 (\mathbf{X}'\mathbf{X})^{-1}X_2' & \ldots & X_1 (\mathbf{X}'\mathbf{X})^{-1}X_n' \\ X_2 (\mathbf{X}'\mathbf{X})^{-1}X_1' & X_2 (\mathbf{X}'\mathbf{X})^{-1}X_2' & \ldots & X_2 (\mathbf{X}'\mathbf{X})^{-1}X_n' \\ \vdots & \vdots & \vdots \\ X_n (\mathbf{X}'\mathbf{X})^{-1}X_1' & X_n (\mathbf{X}'\mathbf{X})^{-1}X_2' & \ldots & X_n (\mathbf{X}'\mathbf{X})^{-1}X_n' \\ \end{matrix}\right] \left[\begin{matrix} y_1 \\ y_2 \\ \vdots \\ y_n \end{matrix}\right] \\ &= \left[\begin{matrix} \sum_{k=1}^n X_1 (\mathbf{X}'\mathbf{X})^{-1}X_k'y_k \\ \sum_{k=1}^n X_2 (\mathbf{X}'\mathbf{X})^{-1}X_k'y_k \\ \vdots \\ \sum_{k=1}^n X_n (\mathbf{X}'\mathbf{X})^{-1}X_k'y_k \end{matrix}\right] \end{align*}\]
Then
\[\hat{y}_j = \sum_{k=1}^n X_j (\mathbf{X}'\mathbf{X})^{-1}X_k'y_k = X_j (\mathbf{X}'\mathbf{X})^{-1} \sum_{k=1}^n X_k'y_k\] In a similar way, when observation \(i\) is removed:
\[\hat{y}_{(i)j} = \sum_{k \neq i} X_j (\mathbf{X}_{(i)}'\mathbf{X}_{(i)})^{-1}X_k'y_k = X_j (\mathbf{X}_{(i)}'\mathbf{X}_{(i)})^{-1} \sum_{k \neq i} X_k'y_k\] Notice that it is not necessary to use the subscript \((i)\) on the individual rows \(X_k\) and responses \(y_k\), since they refer to single observations.
Now, we will relate \((\mathbf{X}'\mathbf{X})^{-1}\) to \((\mathbf{X}_{(i)}'\mathbf{X}_{(i)})^{-1}\) using the Sherman–Morrison formula. First notice that:
\[\mathbf{X}' \mathbf{X}= \mathbf{X}_{(i)}'\mathbf{X}_{(i)} + X_i'X_i\]
then
\[\mathbf{X}_{(i)}'\mathbf{X}_{(i)} = \mathbf{X}' \mathbf{X}- X_i'X_i\]
Now we apply the Sherman–Morrison formula.
\[\begin{align*} \left(\mathbf{X}_{(i)}'\mathbf{X}_{(i)}\right)^{-1} = \left(\mathbf{X}' \mathbf{X}- X_i'X_i\right)^{-1} &= \left(\mathbf{X}' \mathbf{X}+ (-X_i')X_i\right)^{-1} \\ &= \left(\mathbf{X}' \mathbf{X}\right)^{-1} - \frac{\left(\mathbf{X}' \mathbf{X}\right)^{-1}(-X_i')X_i\left(\mathbf{X}' \mathbf{X}\right)^{-1}}{1 + X_i\left(\mathbf{X}' \mathbf{X}\right)^{-1}(-X_i')} \\ &= \left(\mathbf{X}' \mathbf{X}\right)^{-1} + \frac{\left(\mathbf{X}' \mathbf{X}\right)^{-1}X_i'X_i\left(\mathbf{X}' \mathbf{X}\right)^{-1}}{1 - X_i\left(\mathbf{X}' \mathbf{X}\right)^{-1}X_i'} \\ &= \left(\mathbf{X}' \mathbf{X}\right)^{-1} + \frac{\left(\mathbf{X}' \mathbf{X}\right)^{-1}X_i'X_i\left(\mathbf{X}' \mathbf{X}\right)^{-1}}{1 - h_{ii}} \end{align*}\]
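A numerical check of this rank-one update (a minimal, self-contained sketch using the same design matrix with an intercept column as above):
# Design matrix with intercept, as in the leverage sketch above
X <- cbind(1, as.matrix(read.csv("Gdp data.csv")[, -1]))
# Removes observation i and inverts directly
i <- 3
Xi <- X[i, , drop = FALSE]
XtXinv <- solve(t(X) %*% X)
Adirect <- solve(t(X[-i, ]) %*% X[-i, ])
# Sherman-Morrison form of the same inverse
hii <- c(Xi %*% XtXinv %*% t(Xi))
Aupdate <- XtXinv + (XtXinv %*% t(Xi) %*% Xi %*% XtXinv) / (1 - hii)
# The two versions agree up to numerical error
max(abs(Adirect - Aupdate))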
Then,
\[\hat{y}_{(i)j} = X_j \left(\mathbf{X}_{(i)}'\mathbf{X}_{(i)}\right)^{-1} \sum_{k \neq i}X_k'y_k = X_j\left(\mathbf{X}' \mathbf{X}\right)^{-1}\sum_{k \neq i}X_k'y_k + \frac{X_j\left(\mathbf{X}' \mathbf{X}\right)^{-1}X_i'X_i\left(\mathbf{X}' \mathbf{X}\right)^{-1}\sum_{k \neq i}X_k'y_k}{1 - h_{ii}}\] Now
\[\begin{align*} X_j\left(\mathbf{X}' \mathbf{X}\right)^{-1}\sum_{k \neq i}X_k'y_k &= X_j\left(\mathbf{X}' \mathbf{X}\right)^{-1}\left(\sum_{k \neq i}X_k'y_k + X_i'y_i - X_i'y_i\right) \\ &= X_j\left(\mathbf{X}' \mathbf{X}\right)^{-1}\left(\sum_{k = 1}^nX_k'y_k - X_i'y_i\right) \\ &= X_j\left(\mathbf{X}' \mathbf{X}\right)^{-1}\sum_{k = 1}^nX_k'y_k - X_j\left(\mathbf{X}' \mathbf{X}\right)^{-1}X_i'y_i \\ &= \hat{y}_j - h_{ij}y_i \end{align*}\]
and
\[\begin{align*} X_j\left(\mathbf{X}' \mathbf{X}\right)^{-1}X_i'X_i\left(\mathbf{X}' \mathbf{X}\right)^{-1}\sum_{k \neq i}X_k'y_k &= h_{ij}X_i\left(\mathbf{X}' \mathbf{X}\right)^{-1}\sum_{k \neq i}X_k'y_k \\ &= h_{ij}\sum_{k \neq i}X_i\left(\mathbf{X}' \mathbf{X}\right)^{-1}X_k'y_k \\ &= h_{ij}\sum_{k \neq i}h_{ik}y_k \end{align*}\]
Then
\[\begin{align*} \hat{y}_{(i)j} = \hat{y}_j - h_{ij}y_i + \frac{h_{ij}\sum_{k \neq i}h_{ik}y_k}{1 - h_{ii}} &= \hat{y}_j - h_{ij}\left(y_i - \frac{\sum_{k \neq i}h_{ik}y_k}{1 - h_{ii}} \right) \\ &= \hat{y}_j - h_{ij}\left(\frac{y_i - y_ih_{ii}}{1 - h_{ii}} - \frac{\sum_{k \neq i}h_{ik}y_k}{1 - h_{ii}} \right) \\ &= \hat{y}_j - h_{ij}\left(\frac{y_i - y_ih_{ii} - \sum_{k \neq i}h_{ik}y_k}{1 - h_{ii}} \right) \\ &= \hat{y}_j - h_{ij}\left(\frac{y_i - \sum_{k=1}^n h_{ik}y_k}{1 - h_{ii}} \right) \\ &= \hat{y}_j - h_{ij}\left(\frac{y_i - \hat{y}_i}{1 - h_{ii}} \right) \\ &= \hat{y}_j - h_{ij}\left(\frac{\hat{e}_i}{1 - h_{ii}} \right) \end{align*}\]
Then
\[\begin{align*} \hat{y}_{(i)j} - \hat{y}_j = - h_{ij}\left(\frac{\hat{e}_i}{1 - h_{ii}} \right) &\implies \left(\hat{y}_{(i)j} - \hat{y}_j \right)^2= \left(h_{ij}\left(\frac{\hat{e}_i}{1 - h_{ii}} \right) \right)^2 = \frac{\hat{e}_i^2}{(1 - h_{ii})^2}h_{ij}^2 \\ &\implies \left(\hat{\mathbf{y}}_{(i)} - \hat{\mathbf{y}} \right)'\left(\hat{\mathbf{y}}_{(i)} - \hat{\mathbf{y}} \right) = \sum_{j=1}^n \frac{\hat{e}_i^2}{(1 - h_{ii})^2}h_{ij}^2 \\ &\implies ||\hat{\mathbf{y}}_{(i)} - \hat{\mathbf{y}}||_2^2 = \frac{\hat{e}_i^2}{(1 - h_{ii})^2} \sum_{j=1}^n h_{ij}^2 \end{align*}\]
Now for any symmetric and idempotent matrix \(\mathbf{M}\in \mathbb{R}^{n \times n}\) we have that:
\[m_{ii} = \sum_{j = 1}^n m_{ij}^2\] Then
\[||\hat{\mathbf{y}}_{(i)} - \hat{\mathbf{y}}||_2^2 = \frac{\hat{e}_i^2}{(1 - h_{ii})^2} h_{ii} = \frac{h_{ii}}{(1 - h_{ii})^2}\hat{e}_i^2 \] That is, the change in the predictions is the product of a function of the leverage of observation \(i\) and a function of its residual.
We can see this with our GDP dataset. We generate new observations that combine two leverage levels (one at the extreme of the observed independent variables and one at their center) with several error levels, and compare the change in the predictions with the formula above.
# Reads Data
dat <- read.csv(file = "Gdp data.csv")
# Extracts the Data
y <- as.numeric(dat[, 1])
X <- as.matrix(dat[, -1])
n <- nrow(X)
X <- cbind(rep(1, n), X)
# Obtains the Estimates
b <- solve(t(X) %*% X, t(X) %*% y)
# Predicted Values
yh <- X %*% b
# Errors
eh <- abs(y - yh)
# Obtains the Minimum, Median and Maximum Errors
err <- quantile(eh, probs = c(0, 0.5, 1))
# Generates New Observations
# An observation at one extreme of the Observed Independent Variables
X1 <- apply(X = X, MARGIN = 2, FUN = max)
# An observation at the center of the Observed Independent Variables
X2 <- apply(X = X, MARGIN = 2, FUN = mean)
# Dependent Variables with different Error levels
y1 <- c(t(X1) %*% b) + c(0, err, err[3] * 2)
y2 <- c(t(X2) %*% b) + c(0, err, err[3] * 2)
# Number of Error Levels
m <- length(y1)
# Predictions without the Observations
yhi <- yh
bi <- b
yi <- y
Xi <- X
# Table with Changes
tab <- matrix(data = NA, nrow = 0, ncol = 4)
for(i in 1:m){
  # Complete Data Set
  X <- rbind(Xi, X2)
  y <- c(yi, y2[i])
  # Coefficients
  b <- solve(t(X) %*% X, t(X) %*% y)
  # Predictions
  yh <- c(X %*% b)
  yhi <- c(rbind(Xi, X2) %*% bi)
  # Hat matrix
  H <- X %*% solve(t(X) %*% X, t(X))
  # Leverage New Observation
  h <- H[n+1, n+1]
  # Change in Predictions
  preCha <- sum((yh - yhi)^2)
  # Leverage Component
  levCom <- h / ((1 - h)^2)
  # Error Component
  errCom <- (y[n+1] - yh[n+1])^2
  # Saves Values
  tab <- rbind(tab, c(preCha, levCom, errCom, levCom * errCom))
}
for(i in 1:m){
  # Complete Data Set
  X <- rbind(Xi, X1)
  y <- c(yi, y1[i])
  # Coefficients
  b <- solve(t(X) %*% X, t(X) %*% y)
  # Predictions
  yh <- c(X %*% b)
  yhi <- c(rbind(Xi, X1) %*% bi)
  # Hat matrix
  H <- X %*% solve(t(X) %*% X, t(X))
  # Leverage New Observation
  h <- H[n+1, n+1]
  # Change in Predictions
  preCha <- sum((yh - yhi)^2)
  # Leverage Component
  levCom <- h / ((1 - h)^2)
  # Error Component
  errCom <- (y[n+1] - yh[n+1])^2
  # Saves Values
  tab <- rbind(tab, c(preCha, levCom, errCom, levCom * errCom))
}
tab <- data.frame(tab)
tab <- cbind(c(rep("Low", 5), rep("High", 5)), tab)
tab <- cbind(rep(c("Minimum", "Low", "Medium", "High", "Very High"), 2), tab)
colnames(tab) <- c("Error Level", "Leverage Level", "Sum of Sq. Diff.", "Lev. Comp.", "Err. Comp.", "Using Formula")
knitr::kable(tab, digits = 5)
| Error Level | Leverage Level | Sum of Sq. Diff. | Lev. Comp. | Err. Comp. | Using Formula |
|---|---|---|---|---|---|
| Minimum | Low | 0.00000 | 0.00513 | 0.00000 | 0.00000 |
| Low | Low | 0.00000 | 0.00513 | 0.00002 | 0.00000 |
| Medium | Low | 0.00070 | 0.00513 | 0.13617 | 0.00070 |
| High | Low | 0.01245 | 0.00513 | 2.42785 | 0.01245 |
| Very High | Low | 0.04980 | 0.00513 | 9.71139 | 0.04980 |
| Minimum | High | 0.00000 | 0.86603 | 0.00000 | 0.00000 |
| Low | High | 0.00001 | 0.86603 | 0.00001 | 0.00001 |
| Medium | High | 0.04918 | 0.86603 | 0.05679 | 0.04918 |
| High | High | 0.87684 | 0.86603 | 1.01248 | 0.87684 |
| Very High | High | 3.50735 | 0.86603 | 4.04991 | 3.50735 |
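The same decomposition can also be computed for each of the original observations directly from a fitted model, since base R exposes the leverages through `hatvalues()`. The following is a minimal sketch: it computes \(\frac{h_{ii}}{(1-h_{ii})^2}\hat{e}_i^2\) for every row of the GDP data and checks one of them against a brute-force refit without that row.
# Full fit on the original data
fit <- lm(gdp ~ inf + une + int + gov + exp, data = dat)
h <- hatvalues(fit)
e <- residuals(fit)
# Influence measure from the closed-form expression
infl <- h / (1 - h)^2 * e^2
# Brute-force check for observation i: refit without it and predict all rows
i <- 1
fiti <- lm(gdp ~ inf + une + int + gov + exp, data = dat[-i, ])
sum((predict(fiti, newdata = dat) - fitted(fit))^2)
# Matches the closed-form value
infl[i]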
6.10 Stability of the Solution
The results of the regression can be affected by numerical problems or by small changes in the data. In general, the results depend on the stability of \(\mathbf{X}'\mathbf{X}\). One way to measure the worst-case scenario is the condition number of the matrix \(\mathbf{X}'\mathbf{X}\), denoted \(\kappa(\mathbf{X}'\mathbf{X})\), which is equal to:
\[ \kappa(\mathbf{X}' \mathbf{X}) = \frac{\lambda_{max}}{\lambda_{min}} \] where, \(\lambda_{max}\) and \(\lambda_{min}\) are the maximum and minimum eigenvalues of \(\mathbf{X}' \mathbf{X}\).
There are some considerations about the condition number:
- The condition number is a measure of the worst-case scenario; a matrix with a bigger condition number is not necessarily more unstable in practice than a matrix with a smaller one.
- \(\lambda_{max} = \max_i \sigma_i^2\) and \(\lambda_{min} = \min_i \sigma_i^2\), where \(\sigma_i\) are the singular values of \(\mathbf{X}\) (see the sketch after this list).
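Both points relate \(\kappa\) to the spectrum of \(\mathbf{X}'\mathbf{X}\). A minimal sketch checking the eigenvalue and singular-value characterizations on the GDP independent variables (re-reading the file, as the later chunks do):
# Independent variables of the GDP data (no intercept)
X <- as.matrix(read.csv("Gdp data.csv")[, -1])
A <- t(X) %*% X
# Ratio of the largest to the smallest eigenvalue of X'X
ev <- eigen(A, symmetric = TRUE)$values
ev[1] / ev[length(ev)]
# Exact condition number computed by kappa()
kappa(A, exact = TRUE)
# Squared ratio of the largest to the smallest singular value of X
(max(svd(X)$d) / min(svd(X)$d))^2
Note that `kappa()` called with its defaults returns an estimate of the condition number, which is why `exact = TRUE` is used here for the comparison.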
The condition number will be affected by:
- Disparities in the magnitude of the variables.
- The level of multicollinearity.
- Influential observations.
Since the condition number is affected by the magnitude of the variables, it is common to work with centered and standardized variables when inspecting these issues.
Another approach is to check the determinant of the matrix. In general, a matrix is more unstable the closer its determinant is to 0. Indeed, if the determinant is 0, then \(\mathbf{X}'\mathbf{X}\) is not invertible and the solution \[ \hat{\boldsymbol{\beta}} = \left(\mathbf{X}'\mathbf{X}\right)^{-1}\mathbf{X}' \mathbf{y}\] is not applicable. However, just by looking at the determinant it is hard to know what counts as close to 0. Consider, for example, the determinant of a multiple of the identity:
\(\det(c\mathbf{I}_n) = c^n\)
If \(c<1\), this makes the determinant arbitrarily close to zero as \(n\) grows. However, the matrix \(c\mathbf{I}_n\) has a condition number of \(1\).
It is better to evaluate the change in the determinant after some normalization to avoid these scenarios.
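The point is easy to illustrate numerically (a small sketch): a scaled identity matrix can have a determinant arbitrarily close to zero while being perfectly conditioned.
# A perfectly conditioned matrix with a tiny determinant
s <- 0.1
A <- s * diag(20)
det(A)
# Its condition number is exactly 1
kappa(A, exact = TRUE)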
Consider, for a moment, the standardized case, with estimated coefficients:
\[ \hat{\boldsymbol{\beta}}=r_{XX}^{-1}r_{Xy}\] then we can see how the determinant of the correlation matrix changes when we remove a variable. Without loss of generality, we can assume we remove variable 1. Then, the correlation matrix will be given by:
\[r_{XX} = \left[ \begin{matrix} r_{X_2X_2} & r_{X_2x_{1}} \\ r_{x_{1}X_2} & r_{x_{1}x_{1}} \\ \end{matrix} \right]\]
Then, using the properties of the determinant of a block matrix, and the results of the Single Variable Cross-Effects subsection, we have that:
\[\begin{align*} \det(r_{XX}) &= \det \left( \left[ \begin{matrix} r_{X_2X_2} & r_{X_2x_{1}} \\ r_{x_{1}X} & r_{x_{1}x_{1}} \\ \end{matrix} \right]\right) \\ &= \det(r_{X_2X_2})\det(r_{x_{1}x_{1}} - r_{X_2x_{1}}r_{X_2X_2}^{-1}r_{x_{1}X_2}) \\ &= \det(r_{X_2X_2})\det(r_{x_{1}|X_2}) \\ &= \det(r_{X_2X_2})\det(1-R^2_{1}) \\ &= \det(r_{X_2X_2})(1-R^2_{1}) \end{align*}\]
We can see that the determinant gets closer to 0 the more correlated the first variable is with the rest of the variables, that is:
\[R^2_1 \to 1 \implies \det(r_{XX}) \to 0\]
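This identity can be verified with the GDP correlations (a minimal sketch, using the same column-wise standardization as in the cross-effects example):
# Correlation matrix of the independent variables
X <- as.matrix(read.csv("Gdp data.csv")[, -1])
rXX <- cor(X)
# R^2 of regressing the first (standardized) variable on the rest, no intercept
Xs <- scale(X)
R21 <- summary(lm(Xs[, 1] ~ Xs[, -1] - 1))$r.squared
# Both sides of the identity
det(rXX)
det(rXX[-1, -1]) * (1 - R21)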
In a similar way, assuming that we are working with centered values, we can analyze what happens if we remove an observation \(X_{i}\). Here we analyze \(\mathbf{X}'\mathbf{X}\). Again, using properties of the determinant and the results of the Outliers Identification subsection, we have that:
\[\begin{align*} \det(\mathbf{X}_{(i)}'\mathbf{X}_{(i)}) &= \det(\mathbf{X}'\mathbf{X}- X_i'X_i) \\ &= \det(\mathbf{X}'\mathbf{X})\det(1 - X_i(\mathbf{X}'\mathbf{X})^{-1}X_i') \\ &= \det(\mathbf{X}'\mathbf{X})(1 - h_{ii}) \end{align*}\]
So, the determinant deteriorates the more leverage observation \(i\) has.
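And a similar check for the effect of removing an observation (a sketch in which the centering is done once on the full data set, matching the derivation above):
# Centered independent variables (centered once, on the full data)
X <- as.matrix(read.csv("Gdp data.csv")[, -1])
Xc <- scale(X, center = TRUE, scale = FALSE)
# Leverage of observation i in the centered design
i <- 4
hii <- c(Xc[i, , drop = FALSE] %*% solve(t(Xc) %*% Xc, t(Xc[i, , drop = FALSE])))
# Both sides of det(Xc_(i)'Xc_(i)) = det(Xc'Xc)(1 - h_ii)
det(t(Xc[-i, ]) %*% Xc[-i, ])
det(t(Xc) %*% Xc) * (1 - hii)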
We can see examples of the previous operations with our GDP data.
# Read Data
dat <- read.csv("Gdp data.csv")
# Design Matrix Independent Variables
X <- as.matrix(dat[, -1])
n <- nrow(X)
# Centering
# Vector of Ones
v1 <- rep(1, n)
# Centering Matrix
C <- diag(n) - (1/n) * v1 %*% t(v1)
# Independent Variables Centered
Xc <- C %*% X
# Sample Covariance Matrix (SXX; recomputed here so the chunk is self-contained)
SXX <- t(Xc) %*% Xc / (n - 1)
# Standardizing matrix of X
DX <- diag(1/sqrt(diag(SXX)))
# Standardize X
Xs <- Xc %*% DX
# Condition numbers
print(paste0("Condition number of X'X = ", round(kappa(t(X) %*% X), 2)))
## [1] "Condition number of X'X = 28072.42"
## [1] "Condition number of Xc'Xc = 1186.14"
## [1] "Condition number of Xs'Xs = 105.68"
As we can see, the condition number improves with centering and with standardization. However, this is not guaranteed in general.
Now we evaluate the effect of each variable on the determinant.
P <- ncol(X)
vecDet <- numeric(length = P)
vecR2 <- numeric(length = P)
# Computes
for(i in 1:P){
  vecDet[i] <- det(cor(X[, -i]))
  vecR2[i] <- summary(lm(Xs[, i] ~ Xs[, -i] - 1))$r.squared
}
# Formats and Show Info Table
tab <- cbind(rep(det(cor(X)), P), vecDet, vecR2)
colnames(tab) <- c("r_{X'X}", "r_{X_(i)'X_(i)}", "R2")
rownames(tab) <- colnames(dat)[2:6]
knitr::kable(tab)
| | det(r_XX) | det(r_XX) without var. k | R2_k |
|---|---|---|---|
| inf | 0.0323914 | 0.8190564 | 0.9604527 |
| une | 0.0323914 | 0.2800338 | 0.8843302 |
| int | 0.0323914 | 0.3451523 | 0.9061532 |
| gov | 0.0323914 | 0.0408342 | 0.2067576 |
| exp | 0.0323914 | 0.0324190 | 0.0008505 |
Finally, we can evaluate the effect of new observations depending on their leverage:
# Read Data
dat <- read.csv("Gdp data.csv")
# Design Matrix Independent Variables
X <- as.matrix(dat[, -1])
P <- ncol(X)
n <- nrow(X)
# Centering
# Vector of Ones
v1 <- rep(1, n)
# Centering Matrix
C <- diag(n) - (1/n) * v1 %*% t(v1)
# Independent Variables Centered
Xc <- C %*% X
# Sample Covariance Matrix (SXX; recomputed here so the chunk is self-contained)
SXX <- t(Xc) %*% Xc / (n - 1)
# Standardizing matrix of X
DX <- diag(1/sqrt(diag(SXX)))
# Standardize X
Xs <- Xc %*% DX
# New Observations
# Very Leveraged Observation
Xn <- t(rep(3, P))
# Mildly Leveraged Observation
Xn <- rbind(Xn, t(rep(1, P)))
# No Leverage
Xn <- rbind(Xn, t(rep(0, P)))
# Number of New Observations
m <- nrow(Xn)
# Computes the Effects of the New Observation
vecDet <- numeric(length = m)
vecLev <- numeric(length = m)
for(i in 1:m){
  Z <- rbind(Xs, Xn[i, ])
  vecDet[i] <- det(t(Z) %*% Z)
  vecLev[i] <- t(Xn[i, ]) %*% solve(t(Z) %*% Z, Xn[i, ])
}
# Formats and Show Info Table
tab <- cbind(rep(det(t(Xs) %*% Xs), 3), vecDet, vecLev, vecLev * n / P)
colnames(tab) <- c("det(Xs_(i)'Xs_(i))","det(Xs'Xs)", "Leverage", "Obs. Equivalent")
rownames(tab) <- c("Big Leverage", "Medium Leverage", "No Leverage")
knitr::kable(tab)
| | det(Xs'Xs) | det(Xs'Xs) with new obs. | Leverage | Obs. Equivalent |
|---|---|---|---|---|
| Big Leverage | 9132788636 | 18778457801 | 0.5136561 | 20.135319 |
| Medium Leverage | 9132788636 | 10204529654 | 0.1050260 | 4.117019 |
| No Leverage | 9132788636 | 9132788636 | 0.0000000 | 0.000000 |