Intermediate Econometrics

12-14th 2025 - Vincenzo Gioia

Linear Model

Limits and extensions

  • Although a powerful tool to explore the relationship between the covariates and the outcome, the linear model is based on the following assumptions:
  1. Linearity: The expected value of \(Y\) is a linear function of the explanatory variables \(\mathbb{E}(Y_i) = \mu_i = \mathbf{x}^\top_i \boldsymbol{\beta}, \quad i = 1, \ldots, n\)
  2. Normality: The variables \(Y_i\) have distribution \(Y_i \sim \mathcal{N}(\mu_i, \sigma^2)\) (for the errors, \(\varepsilon_i \sim \mathcal{N}(0, \sigma^2)\))
  3. Homoskedasticity: The variance of \(Y_i\) (or of the error term) does not depend on \(i\)
  4. Independence: \(Y_i\) is independent of \(Y_j\) for each pair \(i,j\) with \(i\neq j\) (equivalently in terms of errors)
  5. Linear independence between explanatory variables: the model matrix \(X\) is not stochastic and has full rank (\(p\))
  • If the assumptions are not aligned with the data, model-based inference can be misleading

Linear Model

Linearity

  • If we have evidence of non-linearity, we can explore the following options
  • Transforming the predictors: \(Y_i = \beta_1 + \beta_2 g(x_{i2}) + \ldots + \beta_p g(x_{ip}) + \varepsilon_i\)
  1. Several transformations (log, square root, inverse, …)
  2. Polynomial regression
  • Transforming the outcome: \(f(Y_i) = \beta_1 + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} + \varepsilon^*_i\)
  1. For instance \(f(Y_i) = \log(Y_i)\); in such a case we are assuming that \(\log (Y_i)\) (and \(\varepsilon^*_i\)) are normally distributed, so \(Y_i\) and \(\varepsilon_i = {\rm exp}(\varepsilon^*_i)\) are distributed according to the lognormal distribution
  • Transforming both the predictors and the outcome

Interpretation

  • We lose the simplicity of interpretation (the coefficients of the linear model must be interpreted on the transformed scale); the sketch below illustrates the three options
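  • A minimal sketch in R of the three options, on simulated data (the data frame dat and its variables are illustrative only, not part of the course material):
set.seed(1)
dat <- data.frame(x = runif(200, 1, 10))
dat$y <- exp(1 + 0.5 * log(dat$x) + rnorm(200, sd = 0.2))

fit_gx  <- lm(y ~ log(x), data = dat)        # transformed predictor
fit_pol <- lm(y ~ poly(x, 2), data = dat)    # polynomial regression
fit_fy  <- lm(log(y) ~ x, data = dat)        # transformed outcome (lognormal Y)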

Linear Model

Normality

  • Even without assuming normality of the errors, the OLS estimator has good properties
  • However, without the normality assumption, the inferential results (confidence intervals and hypothesis tests) cannot be obtained easily
  • Sometimes we try to recover the normality assumption via transformations, for instance \(\log()\); two alternatives are
  1. The Box-Cox transformation, a very popular choice which includes the logarithmic transformation as a special case (not covered in detail; a sketch is given below)
  2. Considering a different class of models: Generalized Linear Models (GLMs), which also include the normal linear model and where the response variable is assumed to follow a certain distribution (we will see some of them)
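  • A minimal sketch of the Box-Cox profile likelihood via MASS::boxcox, on simulated data (illustrative only); \(\lambda = 0\) corresponds to the log transformation:
library(MASS)
set.seed(1)
x <- runif(200, 1, 10)
y <- exp(1 + 0.3 * x + rnorm(200, sd = 0.3))   # log(y) is linear in x
bc <- boxcox(lm(y ~ x), lambda = seq(-1, 1, by = 0.1))  # plots the profile likelihood
bc$x[which.max(bc$y)]                          # lambda close to 0 suggests the log transform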

Linear Model

Homoskedasticity

  • Let’s suppose that \(V(\varepsilon_i) = V(Y_i) = \sigma^2_i\), for \(i=1, \ldots, n\): the variance is not constant \(\implies\) heteroskedasticity
  1. The OLS estimator \(\hat \beta\) still has mean \(\beta\) (unbiased)
  2. The OLS estimator \(\hat \beta\) is no longer efficient; indeed its variance-covariance matrix is \[V(\hat \beta) = V((X^\top X)^{-1}X^\top Y) = (X^\top X)^{-1}X^\top V(Y) X (X^\top X)^{-1}\]
  3. The distribution is still normal: \(\hat \beta \sim \mathcal{N}(\beta, V(\hat \beta))\)

Problems

  • Despite normality, heteroskedasticity implies that the usual procedures for hypothesis tests and confidence intervals are no longer valid

Linear Model

How to deal with heteroskedasticity

  1. Transforming the outcome (some transformations are able to stabilize the variance, e.g. square root, logarithm, inverse square root, arcsine of the square root). Consider the following data

Linear Model

How to deal with heteroskedasticity

  1. Transforming the outcome: consider the square root

Linear Model

Heteroskedasticity

  1. Change the estimation approach: GLS (generalized least squares)
  • The model structure remains the same: \[ Y = X\beta + \varepsilon\] but we replace the homoskedasticity assumption (\(V(\varepsilon) = \sigma^2I\)) with \[V(\varepsilon) = \sigma^2 \Omega\] where \(\Omega\) is a known diagonal matrix (obviously not the identity)

Linear Model

Heteroskedasticity

  1. Change the estimation approach: GLS (generalized least squares)
  • The log-likelihood for (\(\beta, \sigma^2\)) is, up to an additive constant, \[\ell(\beta, \sigma^2) = - \frac{n}{2}\log \sigma^2 - \frac{1}{2\sigma^2} (y - X\beta)^\top \Omega^{-1} (y - X\beta)\]
  • The maximum likelihood estimate of \(\beta\) is \[\hat \beta = \arg\min_\beta (y - X\beta)^\top \Omega^{-1}(y - X\beta) = (X^\top \Omega^{-1} X)^{-1}X^\top \Omega^{-1}y\]
  • So it is easy to derive \(V(\hat \beta)\) (\(\hat \beta\) is still normally distributed)
  • Since \(\Omega\) is diagonal, by using generalized least squares we are minimizing the function \[RSS_{g}(\beta) = \sum_{i=1}^{n}\frac{1}{\omega_{ii}} (y_i - x^\top_i \beta)^2\] where \(\omega_{ii}\) is the \(i\)-th diagonal element of \(\Omega\)

Linear Model

Heteroskedasticity

  1. Change the estimation approach: GLS (generalized least squares)
  • In summary, the contributions to the sum of squares are weighted by \(1/\omega_{ii}\): the greater \(\omega_{ii}\) (i.e. the greater the variance of the error \(\varepsilon_i\) for the \(i\)-th observation), the lower the weight given to that contribution in the sum (see the sketch below)
  • Note: If \(\Omega\) is not diagonal, generalized least squares can also be used to deal with dependence among the errors, in addition to heteroskedasticity
  • The case illustrated above belongs to the class of models with non-spherical disturbances
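  • A minimal sketch on simulated data (illustrative only): with a diagonal \(\Omega\), GLS coincides with weighted least squares, i.e. lm() with weights \(1/\omega_{ii}\):
set.seed(1)
n <- 200
x <- runif(n)
omega <- exp(2 * x)                       # known diagonal elements of Omega
y <- 1 + 2 * x + rnorm(n, sd = sqrt(omega))
coef(lm(y ~ x, weights = 1 / omega))      # GLS/WLS estimate of beta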

Econometric Example

wage1 dataset

  • The dataset is extracted from Wooldridge’s Introductory Econometrics (2016)

  • Goal:

  1. detect heteroskedasticity (non-constant error variance) in a simple wage regression
  2. show how it can be mitigated by applying a logarithmic transformation to the dependent variable, or alternatively by estimating the model using Generalized Least Squares (GLS).
  • Estimate a linear regression: \[wage_i = \beta_0 + \beta_1 educ_i + \beta_2 exper_i + \beta_3 tenure_i + \varepsilon_i\]

  • Test for heteroskedasticity, fit the model via OLS and GLS, then consider a transformation of the outcome

Econometric Example

wage1 dataset

  • The dataset wage1 is included in the wooldridge R package

  • It is a classic example in labor economics

  • It contains cross-sectional data on 526 working individuals in the U.S.

  • Variables

  1. wage: Hourly wage (in USD)
  2. educ: Years of education
  3. exper: Years of labor market experience
  4. tenure: Years with current employer
  5. nonwhite, female, etc.: Demographic indicators

Econometric Example

library(wooldridge)
library(sandwich)
library(nlme)

data("wage1")
str(wage1)
'data.frame':   526 obs. of  24 variables:
 $ wage    : num  3.1 3.24 3 6 5.3 ...
 $ educ    : int  11 12 11 8 12 16 18 12 12 17 ...
 $ exper   : int  2 22 2 44 7 9 15 5 26 22 ...
 $ tenure  : int  0 2 0 28 2 8 7 3 4 21 ...
 $ nonwhite: int  0 0 0 0 0 0 0 0 0 0 ...
 $ female  : int  1 1 0 0 0 0 0 1 1 0 ...
 $ married : int  0 1 0 1 1 1 0 0 0 1 ...
 $ numdep  : int  2 3 2 0 1 0 0 0 2 0 ...
 $ smsa    : int  1 1 0 1 0 1 1 1 1 1 ...
 $ northcen: int  0 0 0 0 0 0 0 0 0 0 ...
 $ south   : int  0 0 0 0 0 0 0 0 0 0 ...
 $ west    : int  1 1 1 1 1 1 1 1 1 1 ...
 $ construc: int  0 0 0 0 0 0 0 0 0 0 ...
 $ ndurman : int  0 0 0 0 0 0 0 0 0 0 ...
 $ trcommpu: int  0 0 0 0 0 0 0 0 0 0 ...
 $ trade   : int  0 0 1 0 0 0 1 0 1 0 ...
 $ services: int  0 1 0 0 0 0 0 0 0 0 ...
 $ profserv: int  0 0 0 0 0 1 0 0 0 0 ...
 $ profocc : int  0 0 0 0 0 1 1 1 1 1 ...
 $ clerocc : int  0 0 0 1 0 0 0 0 0 0 ...
 $ servocc : int  0 1 0 0 0 0 0 0 0 0 ...
 $ lwage   : num  1.13 1.18 1.1 1.79 1.67 ...
 $ expersq : int  4 484 4 1936 49 81 225 25 676 484 ...
 $ tenursq : int  0 4 0 784 4 64 49 9 16 441 ...
 - attr(*, "time.stamp")= chr "25 Jun 2011 23:03"

Econometric Example

summary(wage1[,1:4])
      wage             educ           exper           tenure      
 Min.   : 0.530   Min.   : 0.00   Min.   : 1.00   Min.   : 0.000  
 1st Qu.: 3.330   1st Qu.:12.00   1st Qu.: 5.00   1st Qu.: 0.000  
 Median : 4.650   Median :12.00   Median :13.50   Median : 2.000  
 Mean   : 5.896   Mean   :12.56   Mean   :17.02   Mean   : 5.105  
 3rd Qu.: 6.880   3rd Qu.:14.00   3rd Qu.:26.00   3rd Qu.: 7.000  
 Max.   :24.980   Max.   :18.00   Max.   :51.00   Max.   :44.000  
par(mfrow=c(1,4))
for(i in 1:4) hist(wage1[,i], main = names(wage1)[i], breaks = 30)

Econometric Example

library(PerformanceAnalytics)
chart.Correlation(wage1[,c(2,3,4,1)])

Econometric Example

OLS

  • educ: Each additional year of education increases the hourly wage by about 0.60 dollars per hour, on average, holding the other predictors fixed
  • exper and tenure also have positive but smaller effects
ols <- lm(wage ~ educ + exper + tenure, data = wage1)
summary(ols)

Call:
lm(formula = wage ~ educ + exper + tenure, data = wage1)

Residuals:
    Min      1Q  Median      3Q     Max 
-7.6068 -1.7747 -0.6279  1.1969 14.6536 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept) -2.87273    0.72896  -3.941 9.22e-05 ***
educ         0.59897    0.05128  11.679  < 2e-16 ***
exper        0.02234    0.01206   1.853   0.0645 .  
tenure       0.16927    0.02164   7.820 2.93e-14 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.084 on 522 degrees of freedom
Multiple R-squared:  0.3064,    Adjusted R-squared:  0.3024 
F-statistic: 76.87 on 3 and 522 DF,  p-value: < 2.2e-16

Econometric Example

wage1 dataset

  • Residual variance increases for higher predicted wages — a clear sign of heteroskedasticity
plot(ols, 1)

Econometric example

Testing for heteroskedasticity (Breusch-Pagan Test)

  • Model: \(y_i = x^\top_i \beta + \varepsilon_i\), with \(\varepsilon_i \sim \mathcal{N}(0, \sigma^2_i)\), \(i = 1, \ldots, n\)

  • Test statistic: \[LM = \frac{1}{2}f^{\top} W (W^\top W)^{-1} W^\top f\] where

  1. \(f\) is an \(n\)-dimensional vector with elements \(e^2_i/\hat \sigma^2 - 1\), where \(e_i\) is the \(i\)-th residual of the OLS regression and \(\hat \sigma^2\) is the estimate of the variance of the error term
  2. \(W\) is a matrix of covariates: we assume that \(\sigma^2_i\) is a function of \(J\) covariates denoted by \(\mathbf{w}_i\), that is \(\sigma^2_i = h(\mathbf{w}^\top_i\delta)\), where the first element of the vector \(\mathbf{w}_i\) is equal to 1
  • Under \(H_0: \sigma^2_1 = \sigma^2_2 = \ldots = \sigma^2_n = \sigma^2\) (homoskedasticity), the test statistic is distributed according to a chi-squared distribution with \(J\) degrees of freedom; a hand computation is sketched below
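  • A minimal sketch of the statistic computed by hand for the ols fit above, taking \(W\) as the model matrix of the regression; note that this is the original (non-studentized) version, while lmtest::bptest() reports by default the studentized (Koenker) variant, so the values differ slightly:
e <- residuals(ols)
sigma2_hat <- mean(e^2)                   # ML estimate of the error variance
f <- e^2 / sigma2_hat - 1
W <- model.matrix(ols)                    # intercept plus the three covariates
LM <- 0.5 * drop(t(f) %*% W %*% solve(crossprod(W)) %*% t(W) %*% f)
pchisq(LM, df = ncol(W) - 1, lower.tail = FALSE)   # p-value under H0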

Econometric example

wage1 dataset

  • Breusch-Pagan test: the p-value is close to 0, providing clear evidence for rejecting \(H_0\) (thus confirming the results of the graphical inspection)
library(lmtest)
bptest(ols)

    studentized Breusch-Pagan test

data:  ols
BP = 43.096, df = 3, p-value = 2.349e-09

Econometric example

wage1 dataset

  • Let’s fit a GLS model; the R function is gls from the nlme package (in addition to the model formula and the data, you must provide the weights)
  • In this case we are saying that the standard deviation of the errors is proportional to \(\hat y_i^\delta\), with \(\delta\) a parameter to be estimated (several variance functions are available)
pred <- predict(ols)
glsFit <- gls(wage ~ educ + exper + tenure, data = wage1,
          weights = varPower(form = ~ pred))
summary(glsFit)
Generalized least squares fit by REML
  Model: wage ~ educ + exper + tenure 
  Data: wage1 
       AIC      BIC    logLik
  2612.449 2637.995 -1300.225

Variance function:
 Structure: Power of variance covariate
 Formula: ~pred 
 Parameter estimates:
    power 
0.7879627 

Coefficients:
                Value Std.Error   t-value p-value
(Intercept) 1.0550929 0.3850044  2.740470  0.0063
educ        0.2887735 0.0282548 10.220328  0.0000
exper       0.0160721 0.0081646  1.968499  0.0495
tenure      0.1594773 0.0202684  7.868274  0.0000

 Correlation: 
       (Intr) educ   exper 
educ   -0.899              
exper  -0.758  0.544       
tenure  0.080 -0.167 -0.273

Standardized residuals:
       Min         Q1        Med         Q3        Max 
-1.9517615 -0.6652424 -0.2411887  0.4001136  5.1639173 

Residual standard error: 0.7339895 
Degrees of freedom: 526 total; 522 residual

Econometric Example

wage1 dataset

  • The results are quite different (especially the estimated coefficient for educ)
  • If there were no heteroskedasticity we would expect similar estimated regression coefficients and standard errors
  • In the following slide, the residuals are plotted
cbind(coef(ols), coef(glsFit))
                   [,1]       [,2]
(Intercept) -2.87273482 1.05509285
educ         0.59896507 0.28877352
exper        0.02233952 0.01607207
tenure       0.16926865 0.15947728
cbind(summary(ols)$coefficients[,2], sqrt(diag(glsFit$varBeta)))
                  [,1]        [,2]
(Intercept) 0.72896429 0.385004357
educ        0.05128355 0.028254819
exper       0.01205685 0.008164635
tenure      0.02164461 0.020268395

Econometric Example

par(mfrow=c(1,3))
plot(fitted(ols), residuals(ols), xlab = "Fitted values (OLS)",
     ylab = "Residuals", main = "Heteroskedasticity in OLS")
abline(h = 0, col = "red", lty = 2)

plot(fitted(ols), rstandard(ols), xlab = "Fitted values (OLS)",
     ylab = "Standardized Residuals", main = "Heteroskedasticity in OLS")
abline(h = 0, col = "red", lty = 2)

plot(fitted(glsFit), resid(glsFit, type="normalized"),     xlab = "Fitted values (GLS)",
     ylab = "Standardized residuals", main = "Variance stabilized in GLS")
abline(h = 0, col = "red", lty = 2)

Econometric Example

ols_log <- lm(log(wage) ~ educ + exper + tenure, data = wage1)
bptest(ols_log)

    studentized Breusch-Pagan test

data:  ols_log
BP = 10.761, df = 3, p-value = 0.01309
plot(ols_log,1)

Econometric Example

pred <- predict(ols_log)
gls_log <- gls(log(wage) ~ educ + exper + tenure, data = wage1,
          weights = varPower(form = ~ pred))
par(mfrow=c(1,2))
plot(fitted(ols_log), residuals(ols_log), xlab = "Fitted values (OLS)",
     ylab = "Residuals", main = "OLS")
abline(h = 0, col = "red", lty = 2)

plot(fitted(gls_log), resid(gls_log, type="normalized"),     xlab = "Fitted values (GLS)",
     ylab = "Standardized residuals", main = "GLS")
abline(h = 0, col = "red", lty = 2)

Econometric Example

wage1 dataset

  • The results are no longer very different (especially the estimated coefficient for educ)
cbind(coef(ols_log), coef(gls_log))
                   [,1]        [,2]
(Intercept) 0.284359555 0.391014486
educ        0.092028987 0.083188829
exper       0.004121109 0.004344837
tenure      0.022067217 0.022288778
cbind(summary(ols_log)$coefficients[,2], sqrt(diag(gls_log$varBeta)))
                   [,1]        [,2]
(Intercept) 0.104190378 0.096951203
educ        0.007329923 0.006902781
exper       0.001723277 0.001684173
tenure      0.003093649 0.003226156

Linear Model

Comparing the models

  • We discussed that we can compare models in terms of accuracy by analyzing the RSE or the \(R^2\), but only if the models are on the same scale
  • However, for GLS models the \(R^2\) is no longer available because the variance decomposition is no longer unique (some pseudo-\(R^2\) measures exist)
  • We can compare them using the RSE (but only among models having the same scale of the response)
cbind(summary(ols)$sigma, summary(glsFit)$sigma)
         [,1]      [,2]
[1,] 3.084476 0.7339895
cbind(summary(ols_log)$sigma, summary(gls_log)$sigma)
         [,1]      [,2]
[1,] 0.440862 0.3702997

Linear Model

Akaike Information Criteria

  • It is based on the log-likelihood of the fitted model and a penalization term

\[AIC = 2(p+1) - 2\,\ell(\hat {\boldsymbol \beta}, \hat \sigma^2)\]

  • In a linear model it is simply

\[AIC = 2(p+1) + n \log {\hat \sigma}^2\]

  • To compare models fitted on different scales of the response, you must add to the log-likelihood the Jacobian term of the transformation (for the log transformation, this amounts to adding \(2\sum_{i=1}^{n}\log y_i\) to the AIC, as done in the code below)

  • The lower, the better

Linear Model

Akaike Information Criteria

  • The comparison
cbind(AIC(ols), AIC(glsFit))
         [,1]     [,2]
[1,] 2683.662 2612.449
cbind(AIC(ols_log), AIC(gls_log))
         [,1]     [,2]
[1,] 637.0955 669.0064
AIC(ols_log) + sum(2*log(wage1$wage))
[1] 2344.774
AIC(gls_log) + sum(2*log(wage1$wage))
[1] 2376.685

Linear Model

Multicollinearity Problem

  • It refers to situations in which two or more predictors are closely related to one another
  • See the example below: while Limit and Age do not have an obvious relationship, Limit and Rating are highly correlated (we say that they are collinear)
library(ISLR2)
data(Credit)
par(mfrow=c(1,2))
with(Credit, plot(Limit, Age))
with(Credit, plot(Limit, Rating))

Linear Model

Multicollinearity Problem

  • Collinearity can pose problems in the context of regressions, since it can be difficult to separate the individual effects of collinear variables on the response
  • In other words, since Limit and Rating tend to increase or decrease together, it can be difficult to determine how each one separately is associated with the response (Balance)
  • Collinearity reduces the accuracy of the estimates: it causes the standard error of \(\hat \beta_j\) to grow
  • Remember that the t-statistic for testing the nullity of a single coefficient is obtained by dividing the estimate \(\hat \beta_j\) by its standard error
  • Consequently, collinearity results in a decline of the t-statistic and we may fail to reject \(H_0: \beta_j=0\)

Linear Model

Multicollinearity Problem

  • We fit the simple linear regression models
lmAge <- lm(Balance ~ Age, data = Credit)
summary(lmAge)$coefficients
               Estimate Std. Error    t value     Pr(>|t|)
(Intercept) 517.2922247  77.851531 6.64459921 1.002280e-10
Age           0.0489114   1.335991 0.03661057 9.708139e-01
lmLimit <- lm(Balance ~ Limit, data = Credit)
summary(lmLimit)$coefficients
                Estimate   Std. Error   t value      Pr(>|t|)
(Intercept) -292.7904955 26.683414516 -10.97275  1.184152e-24
Limit          0.1716373  0.005066234  33.87867 2.530581e-119
lmRating <- lm(Balance ~ Rating, data = Credit)
summary(lmRating)$coefficients
              Estimate Std. Error   t value      Pr(>|t|)
(Intercept) -390.84634 29.0685146 -13.44569  3.073181e-34
Rating         2.56624  0.0750891  34.17594 1.898899e-120

Linear Model

Multicollinearity Problem

  • The estimated coefficient for Limit is almost unchanged when including Age in the model, as is its standard error
lmAge <- lm(Balance ~ Age, data = Credit)
summary(lmAge)$coefficients
               Estimate Std. Error    t value     Pr(>|t|)
(Intercept) 517.2922247  77.851531 6.64459921 1.002280e-10
Age           0.0489114   1.335991 0.03661057 9.708139e-01
lmLimit <- lm(Balance ~ Limit, data = Credit)
summary(lmLimit)$coefficients
                Estimate   Std. Error   t value      Pr(>|t|)
(Intercept) -292.7904955 26.683414516 -10.97275  1.184152e-24
Limit          0.1716373  0.005066234  33.87867 2.530581e-119
lmAgeLimit <- lm(Balance ~ Age + Limit, data = Credit)
summary(lmAgeLimit)$coefficients
               Estimate   Std. Error   t value      Pr(>|t|)
(Intercept) -173.410901 43.828387048 -3.956589  9.005366e-05
Age           -2.291486  0.672484540 -3.407492  7.226468e-04
Limit          0.173365  0.005025662 34.495944 1.627198e-121

Linear Model

Multicollinearity Problem

  • Here we can see a drastic change in both the estimated coefficients and their associated standard errors (the standard error of Limit increases about 12 times, implying a p-value of 0.70)
  • The importance of the Limit variable is masked due to the presence of collinearity
lmRating <- lm(Balance ~ Rating, data = Credit)
summary(lmRating)$coefficients
              Estimate Std. Error   t value      Pr(>|t|)
(Intercept) -390.84634 29.0685146 -13.44569  3.073181e-34
Rating         2.56624  0.0750891  34.17594 1.898899e-120
lmLimit <- lm(Balance ~ Limit, data = Credit)
summary(lmLimit)$coefficients
                Estimate   Std. Error   t value      Pr(>|t|)
(Intercept) -292.7904955 26.683414516 -10.97275  1.184152e-24
Limit          0.1716373  0.005066234  33.87867 2.530581e-119
lmLimitRating <- lm(Balance ~ Limit + Rating, data = Credit)
summary(lmLimitRating)$coefficients
                 Estimate  Std. Error    t value     Pr(>|t|)
(Intercept) -377.53679536 45.25417619 -8.3425846 1.213565e-15
Limit          0.02451438  0.06383456  0.3840298 7.011619e-01
Rating         2.20167217  0.95229387  2.3119672 2.129053e-02

Linear Model

Multicollinearity Problem

  • A simple way to detect a potential problem of collinearity is to inspect the correlation matrix
  • Unfortunately, not all the problems related to collinearity can be detected by analyzing the correlation matrix: it is possible for collinearity to exist between three or more variables even if no pair of variables has a particularly high correlation (we call this situation multicollinearity)
cor(Credit[, c("Age", "Limit", "Rating")])
             Age     Limit    Rating
Age    1.0000000 0.1008879 0.1031650
Limit  0.1008879 1.0000000 0.9968797
Rating 0.1031650 0.9968797 1.0000000

Linear Model

Multicollinearity Problem

  • An alternative is to explore the Variance Inflation Factor (VIF): the ratio between the variance of \(\hat \beta_j\) when fitting the full model and the variance of \(\hat \beta_j\) when fitting the model with that predictor alone
  • Its lowest value is 1, and values above 5 to 10 indicate a problematic amount of multicollinearity
  • In this case, the presence of multicollinearity is clear (values around 160); a hand computation is sketched after the output below
lmFull <- lm(Balance ~ Age + Limit + Rating, data = Credit)
library(car)
vif(lmFull)
       Age      Limit     Rating 
  1.011385 160.592880 160.668301 
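  • A minimal sketch of the VIF for Limit computed by hand, as \(1/(1-R^2_j)\) from regressing Limit on the remaining predictors (it should match car::vif above):
# R^2 from regressing Limit on the other predictors of the full model
r2_limit <- summary(lm(Limit ~ Age + Rating, data = Credit))$r.squared
1 / (1 - r2_limit)   # VIF for Limit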

Linear Model

Multicollinearity Problem

  • Possible solutions to deal with this problem
  1. Removing one of the variables (for instance Rating)
  2. Combining the predictors into a single one
  3. Using regularization techniques (not covered here)
vif(lmAgeLimit)
     Age    Limit 
1.010283 1.010283 

Linear Model

Endogeneity

  • The unbiasedness and the consistency of the OLS estimator rest on the hypothesis that the conditional expectation of the error is constant (and can be set to zero if the model contains an intercept)

  • Consider the simple linear model: \[y_i = \beta_1 + \beta_2 x_i + \varepsilon_i\] with \[\mathbb{E}(Y_i|x_i) = \mathbb{E}[\beta_1+\beta_2x_i + \varepsilon_i|x_i] = \beta_1 +\beta_2x_i + \mathbb{E}[\varepsilon_i| x_i] = \beta_1 +\beta_2x_i\] if \(\mathbb{E}(\varepsilon_i|x_i) = 0\)

  • When \(x\) is correlated with \(\varepsilon\), we face endogeneity

  • Leads to biased and inconsistent OLS estimates

  • Common in observational data

Linear Model

Why endogeneity matters

  • The same property can also be described using the covariance between the covariate and the error, which can be written, using the law of iterated expectations (tower property): \[\text{cov}(x, \varepsilon) = \mathbb{E}[(x-\mu_x)\varepsilon] = \mathbb{E}_x[\mathbb{E}_\varepsilon[(x-\mu_x)\varepsilon |x]] = \mathbb{E}_x[(x-\mu_x)\mathbb{E}_\varepsilon[\varepsilon |x]]\]

  • Key consequence: \[\text{cov}(x,\varepsilon) \neq 0 \Rightarrow \hat{\beta}_{2,OLS} \xrightarrow{p} \beta_2 + \frac{\text{cov}(x,\varepsilon)}{\text{var}(x)}\] where \(\xrightarrow{p}\) stands for convergence in probability

  • Indeed, considering a simple linear regression model (and replacing sample moments with their population limits), \[\hat{\beta}_{2,OLS}=\frac{\text{cov}(x,y)}{\text{var}(x)} = \frac{\text{cov}(x,\beta_1+\beta_2x+\varepsilon)}{\text{var}(x)} = \beta_2 + \frac{\text{cov}(x,\varepsilon)}{\text{var}(x)}\]

Linear Model

Why endogeneity matters

  • Cases that we may encounter in applications (see the simulation sketch below):
  1. if \(\text{cov}(x,\varepsilon)>0 \implies\) OLS overestimates \(\beta_2\)
  2. if \(\text{cov}(x,\varepsilon)<0 \implies\) OLS underestimates \(\beta_2\)
  3. if \(\text{cov}(x,\varepsilon)=0 \implies\) OLS is consistent (converges to \(\beta_2\))
  • OLS incorrectly attributes variation in \(y\) to \(x\)
  • Standard errors are no longer meaningful
  • Inference becomes unreliable
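  • A minimal simulation sketch (simulated data, illustrative only) of the inconsistency formula above:
set.seed(1)
n <- 10000
u <- rnorm(n)                   # common shock
x <- rnorm(n) + u               # covariate correlated with the error
eps <- rnorm(n) + u             # error term: cov(x, eps) = 1, var(x) = 2
y <- 1 + 2 * x + eps            # true beta2 = 2
coef(lm(y ~ x))[2]              # close to 2 + cov(x, eps)/var(x) = 2.5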

Linear Model

Endogeneity

  • If the conditional expectation of \(\varepsilon\) is a constant, \(\mathbb{E}_\varepsilon[\varepsilon |x] = \mu_\varepsilon\) (not necessarily 0), the covariance is \[cov(x, \varepsilon) = \mu_\varepsilon \mathbb{E}_x[x-\mu_x] = 0\]

  • Stated differently, \(x\) is supposed to be exogenous, i.e. \(x\) is assumed to be uncorrelated with \(\varepsilon\)

  • Endogeneity arises when \(\text{cov}(x, \varepsilon)\neq 0\)

  • Sources of endogeneity:

  1. There are errors in the variables
  2. There are omitted variables
  3. Simultaneity bias

Linear Model

1. Errors in the variables (outcome)

  • Data used in economics, especially micro-data, are prone to measurement errors (in both the outcome and the predictors)

  • Suppose that the model that we seek to estimate is \[y^{*}_i = \beta_1 + \beta_2 x^{*}_i + \varepsilon^{*}_i\] where the covariate is exogenous (\(\text{cov}(x^*, \varepsilon^*)=0\))

  • Suppose that the response is observed with error, namely that the observed value of the response is \(y_i\), with \[y^{*}_i = y_i - \nu_i\] where \(\nu_i\) is the measurement error of the response. Then

\[y_i= \beta_1 + \beta_2 x^*_i + (\varepsilon^*_i + \nu_i) \]

  • The error of the model is \(\varepsilon_i = \varepsilon^*_i + \nu_i\), which is still uncorrelated with \(x^*\) if \(\nu\) is uncorrelated with \(x^*\), i.e. if the measurement error of the response is uncorrelated with the covariate

  • The measurement error only increases the size of the error term, which implies that the coefficients are estimated less precisely and that the \(R^2\) is lower compared to a model with a correctly measured response

Linear Model

1. Errors in the variables (predictor)

  • Let’s suppose now that the covariate is observed with error, namely that the observed value of the covariate is \(x_i\), with \[x^{*}_i = x_i - \nu_i\] where \(\nu_i\) is the measurement error of the covariate

  • If the measurement error is uncorrelated with the true value of the covariate, the variance of the observed covariate is \[\sigma^2_x = \sigma^{*2}_x + \sigma^2_{\nu}\] and the covariance between the observed covariate and the measurement error is equal to the variance of the measurement error, that is \(\text{cov}(x, \nu) = \mathbb{E}[(x^* + \nu - \mu_x) \nu] = \sigma^2_\nu\), because the measurement error is uncorrelated with the true covariate

  • So, rewriting the model in terms of \(x\), we get \[y_i= \beta_1 + \beta_2 x_i + \varepsilon_i\] with \(\varepsilon_i= \varepsilon^{*}_i - \beta_2 \nu_i\)

  • The error of the model is correlated with \(x\), as \(\text{cov}(x,\varepsilon) = \text{cov}(x^* + \nu, \varepsilon^* - \beta_2 \nu) = -\beta_2\sigma^2_\nu\)

Linear Model

1. Errors in the variables (predictor)

  • The OLS estimator can be written as usual as

\[\hat \beta_2 = \frac{\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n}(x_i - \bar x)^2} = \beta_2 + \frac{\sum_{i=1}^{n}(x_i - \bar x)\varepsilon_i}{\sum_{i=1}^{n}(x_i - \bar x)^2}\]

  • Taking the expectation, we have \(\mathbb{E}[(x-\bar x) \varepsilon] = -\beta_2\sigma^2_\nu\) and the expected value of the estimator is \[\mathbb{E}(\hat \beta_2) = \beta_2 \left(1- \frac{\sigma^2_\nu}{\sum_{i=1}^{n}(x_i - \bar x)^2/n} \right) = \beta_2\left(1-\frac{\sigma^2_\nu}{\hat \sigma^2_x}\right)\]

  • The OLS estimator is biased, and the term in brackets is one minus the share of the variance of \(x\) that is due to measurement error

Linear Model

1. Errors in the variables (predictor)

\[\mathbb{E}(\hat \beta_2) = \beta_2 \left(1- \frac{\sigma^2_\nu}{\sum_{i=1}^{n}(x_i - \bar x)^2/n} \right)= \beta_2\left(1-\frac{\sigma^2_\nu}{\hat \sigma^2_x}\right)\]

  • Then, \(|\mathbb{E}(\hat \beta_2)| < |\beta_2|\)

  • This is called attenuation bias: it can be either a downward or an upward bias depending on the sign of \(\beta_2\)

  • This bias clearly does not vanish in large samples. As \(n\) grows, the empirical variances/covariances converge to the population ones, and the estimator therefore converges to \(\beta_2(1 - \sigma^2_\nu/\sigma^2_x)\), as the simulation sketch below illustrates
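  • A minimal simulation sketch (simulated data, illustrative only) of the attenuation bias:
set.seed(1)
n <- 10000
x_star <- rnorm(n, sd = 2)          # true covariate, variance 4
nu <- rnorm(n, sd = 1)              # measurement error, variance 1
x <- x_star + nu                    # observed covariate
y <- 1 + 2 * x_star + rnorm(n)      # true beta2 = 2
coef(lm(y ~ x))[2]                  # close to 2 * (1 - 1/5) = 1.6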

Linear Model

2. Omitted variables bias

  • Suppose that the true model is: \[y_i = \beta_1 + \beta_2x_i + \beta_3 z_i + \varepsilon_i\] where \(E[\varepsilon|x,z]=0\) and the model can be estimated consistently using OLS

  • Consider that \(z\) is unobserved

  • The model to be estimated is \[y_i = \beta_1 + \beta_2x_i + \varepsilon^*_i\] where \(\varepsilon^*_i = \varepsilon_i + \beta_3 z_i\)

Linear Model

2. Omitted variables bias

  • The omission of a relevant covariate (\(\beta_3 \neq 0\)) has two consequences
  1. The variance of the error is \[\sigma^2_{\varepsilon^*} = \beta^2_3 \sigma^2_z + \sigma^2_{\varepsilon}\] and it is greater than that of the initial model in which \(z\) is observed and used as a covariate
  2. The covariance between the error and \(x\) is \[\text{cov}(x, \varepsilon^*) = \beta_3 \text{cov}(x,z)\] so, if the covariate is correlated with the omitted variable, the covariate and the error of the model are correlated

Linear Model

2. Omitted variables bias

  • As the variance of the OLS estimator is proportional to the variance of the errors, the omission of a relevant covariate will always induce a less precise estimation of the slopes and a lower \(R^2\)

  • Moreover, if the omitted covariate is correlated with the covariate used in the regression, the estimation will be biased and inconsistent

  • This omitted variable bias can be computed as follows:

\[\hat \beta_2 = \beta_2 + \frac{\sum_{i=1}^{n}(x_i -\bar x)(\beta_3z_i + \varepsilon_i)}{\sum_{i=1}^{n}(x_i - \bar x)^2} = \beta_2 + \beta_3 \frac{\sum_{i=1}^{n}(x_i -\bar x)(z_i - \bar z)}{\sum_{i=1}^{n}(x_i - \bar x)^2} + \frac{\sum_{i=1}^{n}(x_i -\bar x)\varepsilon_i}{\sum_{i=1}^{n}(x_i - \bar x)^2}\]

  • Taking the conditional expectation, the last term disappears, so that \[\mathbb{E}(\hat \beta_2|x,z) = \beta_2 + \beta_3 \widehat{\text{cov}}(x,z)/\hat\sigma^2_x\]
  1. an upward bias if the covariance between \(x\) and \(z\) and \(\beta_3\) have the same sign
  2. a downward bias if they have opposite signs

Linear Model

2. Omitted variables bias

  • There is an upward bias if the covariance between \(x\) and \(z\) and \(\beta_3\) have the same sign, and a downward bias if they have opposite signs
  • As \(n \to \infty\) the OLS estimator converges to \[\hat \beta_2 \xrightarrow{p}\beta_2 + \beta_3 \text{cov}(x,z)/\text{var}(x) = \beta_2 + \beta_3 \times \beta^{*}_2\] where
  • \(\beta^{*}_2\) is the true value of the slope of the regression of \(z\) on \(x\)
  • This formula makes clear what \(\hat \beta_2\) really estimates in a linear regression (see the simulation sketch below):
  1. The direct effect of \(x\) on \(y\)
  2. The indirect effect of \(x\) on \(y\), which is the product of the slope of \(z\) on \(x\) (\(\beta^{*}_2\)) times the effect of \(z\) on \(y\) (\(\beta_3\))
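  • A minimal simulation sketch (simulated data, illustrative only) of the omitted variable bias:
set.seed(1)
n <- 10000
z <- rnorm(n)
x <- 0.5 * z + rnorm(n)             # x correlated with the omitted z
y <- 1 + 2 * x + 3 * z + rnorm(n)   # true model: beta2 = 2, beta3 = 3
coef(lm(y ~ x + z))[2]              # consistent, close to 2
coef(lm(y ~ x))[2]                  # close to 2 + 3 * cov(x,z)/var(x) = 3.2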

Econometric Example

2. Returns from education

  • A classic example of omitted variable bias occurs in the estimation of a Mincer earnings function, which relates wage (\(w\)), education (\(e\)) and experience (\(s\))

\[\log(w_i) = \beta_1 + \beta_2 e_i + \beta_3 s_i + \beta_4 s^2_i + \varepsilon_i\] where \(\beta_2\) is the percentage increase of the wage for one more year of education, holding the other predictors fixed. Indeed \[ \beta_2 = \frac{d \log(w)}{d e} = \frac{dw/w}{de}\]

  • Numerous studies of the Mincer function deal with this problem of endogeneity of the education level

Econometric Example

2. Returns from education

  • Sample of 303 white males taken from the National Longitudinal Survey of Youth in 1992

  • Variables:

  1. wage: Log hourly wage
  2. educ: Education (years of schooling)
  3. AFQT: Ability based on standardized AFQT test score
  4. educSibl: Education of oldest sibling (years of schooling)
  5. exper: Experience (total weeks of labor market experience)
  6. tenure: Tenure (weeks on current job)
  7. educM: Mother’s education (years of schooling)
  8. educF: Father’s education (years of schooling)
  9. urban: Dummy variable for urban residence
  10. home: Broken home dummy (dummy for living with both parents at age 14)
data <- read.table("kpt.dat")
colnames(data) <- c("wage", "educ", "AFQT", "educSibl", "exper",
                    "tenure", "educM", "educF", "urban", "home")

Econometric Example

2. Returns from education

  • Let’s fit an OLS regression: one more year of education increases the average wage by about 10%, holding experience fixed
data$exper <- data$exper/52

fit1 <- lm(wage ~ educ + poly(exper, 2), data = data)
summary(fit1)$coefficients
                 Estimate Std. Error  t value     Pr(>|t|)
(Intercept)     1.1114866 0.15949528 6.968774 2.045533e-11
educ            0.1000376 0.01179666 8.480160 1.049101e-15
poly(exper, 2)1 2.3912857 0.45538678 5.251109 2.874745e-07
poly(exper, 2)2 0.4821923 0.45418215 1.061672 2.892415e-01

Econometric Example

2. Returns from education

  • Concern: individuals have different abilities (\(a\)), and higher ability has a positive effect on wage, but may also have a positive effect on education

  • If so, adding ability in the model \[\log(w_i) = \beta_1 + \beta_2 e_i + \beta_3 s_i + \beta_4 s^2_i + \beta_5 a_i + \varepsilon_i\]

will provide \(\beta_5>0\), and regressing ability on education, that is \[a_i = \beta^*_1+ \beta^*_2 e_i + \varepsilon^*_i\] will provide \(\beta^*_2>0\)

  • Thus \[\hat \beta_2 \xrightarrow{p} \beta_2 + \beta_5 \times \beta^*_2>\beta_2\] and the OLS estimator is upward biased

Econometric Example

2. Returns from education

  • This is the case because one more year of education
  1. increases, for a given level of ability, the expected wage, this effect being \(\beta_2\)
  2. is associated, on average, with a higher level of ability, this effect being \(\beta^*_2\), so the wage will also be higher
  • In our data set we have a measure of ability (AFQT), the standardized AFQT test score

  • If we introduce ability in the regression, education is no longer endogenous and least squares will give a consistent estimate of the effect of education on wage

  • We first check that education and ability are positively correlated:

with(data, cor(educ, AFQT))
[1] 0.6055756

Econometric Example

2. Returns from education

  • Adding ability as a covariate should decrease the coefficient on education
  • The effect of one more year of education is now an increase of about 8.9% of the wage (compared to 10% before)
  • The relatively small decrease suggests that the omitted variable bias is not very large in this example (conditional on education and experience, the estimated effect of ability on wage is small)
fit2 <- lm(wage ~ educ + poly(exper, 2) + AFQT, data = data)
summary(fit2)$coefficients
                  Estimate Std. Error  t value     Pr(>|t|)
(Intercept)     1.23953056 0.19077860 6.497220 3.428526e-10
educ            0.08874818 0.01498128 5.923938 8.660934e-09
poly(exper, 2)1 2.30936000 0.45993514 5.021056 8.861583e-07
poly(exper, 2)2 0.44963379 0.45459292 0.989091 3.234210e-01
AFQT            0.04693974 0.03844733 1.220884 2.230949e-01

Linear Model

3. Simultaneity bias

  • Often in economics, the phenomenon of interest is not described by a single equation, but by a system of equations
  • Consider for example a market equilibrium. The two equations relate the quantity demanded/supplied (\(q^d\) and \(q^s\)) to the unit price and to some covariates specific to the demand and to the supply side of the market
  • The equilibrium on the loan market is then defined by a system of three equations:

\[\begin{cases} q^d = \beta^d_1 + \beta^d_2 p + \beta^d_3 d +\varepsilon^d \\q^s = \beta^s_1 + \beta^s_2 p + \beta^s_3 s +\varepsilon^s\\q^d = q^s\end{cases}\] where

  1. \(q\) is the quantity (in logarithm)
  2. \(p\) is the price
  3. \(d\) and \(s\) are two rates, specific to the demand and to the supply side respectively

Linear Model

3. Simultaneity bias

\[\begin{cases} q^d = \beta^d_1 + \beta^d_2 p + \beta^d_3 d +\varepsilon^d \\q^s = \beta^s_1 + \beta^s_2 p + \beta^s_3 s +\varepsilon^s\\q^d = q^s\end{cases}\]

  • The demand curve should be decreasing: \(\beta^d_2<0\)

  • The supply curve should be increasing: \(\beta^s_2>0\)

  • By fitting the OLS regression we can identify the correct signs…

  • However, the fit of the supply equation is very bad

Linear Model

3. Simultaneity bias

  • What is actually observed for each observation in the sample is a price-quantity combination at an equilibrium

  • A positive shock on the demand equation will move the demand curve upward and will lead to a new equilibrium with a higher equilibrium quantity \(q'\) and also a higher equilibrium price \(p'\) (except in the special case of a perfectly elastic, horizontal supply curve, i.e. an infinite price elasticity of supply, for which the price would not change)

  • This means that \(p\) is correlated with \(\varepsilon^{d}\), which leads to a bias in the estimation of \(\beta^d_2\) via OLS

  • The same reasoning applies of course to the supply curve

Linear Model

3. Instrumental variable estimator

  • Instrumental variable regression can eliminate the bias when \(\mathbb{E}(\varepsilon|x)\neq 0\) by using an instrumental variable (say, \(z\))
  • We will only briefly cover the theoretical part (see the iv2 slides); in particular, we will see an application in R, along the lines of the sketch below
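  • A minimal preview sketch, not the course's official code: assuming the AER package is available, an IV fit for the returns-to-education data above, where the parents’ and sibling’s education are used as illustrative instruments for educ
library(AER)   # provides ivreg()
# Instruments for educ (illustrative choice): educM, educF, educSibl;
# the exogenous regressors poly(exper, 2) appear on both sides of the bar
ivFit <- ivreg(wage ~ educ + poly(exper, 2) |
                 educM + educF + educSibl + poly(exper, 2), data = data)
summary(ivFit)$coefficients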