12-14th 2025 - Vincenzo Gioia
Limits and extensions
Linearity
Interpretation
Normality
Homoskedasticity
Problems
How to deal with heteroskedasticity
Heteroskedasticity
wage1 dataset
The dataset is extracted from Wooldridge’s Introductory Econometrics (2016)
Goal:
Estimate a linear regression: \[wage_i = \beta_0 + \beta_1 educ_i + \beta_2 exper_i + \beta_3 tenure_i + \varepsilon_i\]
Test for heteroskedasticity, fit the model via OLS and GLS, then consider a transformation of the outcome
wage1 dataset
The dataset wage1 is included in the wooldridge R package
It is a classic example in labor economics
It contains cross-sectional data on 526 working individuals in the U.S.
Variables
'data.frame': 526 obs. of 24 variables:
$ wage : num 3.1 3.24 3 6 5.3 ...
$ educ : int 11 12 11 8 12 16 18 12 12 17 ...
$ exper : int 2 22 2 44 7 9 15 5 26 22 ...
$ tenure : int 0 2 0 28 2 8 7 3 4 21 ...
$ nonwhite: int 0 0 0 0 0 0 0 0 0 0 ...
$ female : int 1 1 0 0 0 0 0 1 1 0 ...
$ married : int 0 1 0 1 1 1 0 0 0 1 ...
$ numdep : int 2 3 2 0 1 0 0 0 2 0 ...
$ smsa : int 1 1 0 1 0 1 1 1 1 1 ...
$ northcen: int 0 0 0 0 0 0 0 0 0 0 ...
$ south : int 0 0 0 0 0 0 0 0 0 0 ...
$ west : int 1 1 1 1 1 1 1 1 1 1 ...
$ construc: int 0 0 0 0 0 0 0 0 0 0 ...
$ ndurman : int 0 0 0 0 0 0 0 0 0 0 ...
$ trcommpu: int 0 0 0 0 0 0 0 0 0 0 ...
$ trade : int 0 0 1 0 0 0 1 0 1 0 ...
$ services: int 0 1 0 0 0 0 0 0 0 0 ...
$ profserv: int 0 0 0 0 0 1 0 0 0 0 ...
$ profocc : int 0 0 0 0 0 1 1 1 1 1 ...
$ clerocc : int 0 0 0 1 0 0 0 0 0 0 ...
$ servocc : int 0 1 0 0 0 0 0 0 0 0 ...
$ lwage : num 1.13 1.18 1.1 1.79 1.67 ...
$ expersq : int 4 484 4 1936 49 81 225 25 676 484 ...
$ tenursq : int 0 4 0 784 4 64 49 9 16 441 ...
- attr(*, "time.stamp")= chr "25 Jun 2011 23:03"
wage educ exper tenure
Min. : 0.530 Min. : 0.00 Min. : 1.00 Min. : 0.000
1st Qu.: 3.330 1st Qu.:12.00 1st Qu.: 5.00 1st Qu.: 0.000
Median : 4.650 Median :12.00 Median :13.50 Median : 2.000
Mean : 5.896 Mean :12.56 Mean :17.02 Mean : 5.105
3rd Qu.: 6.880 3rd Qu.:14.00 3rd Qu.:26.00 3rd Qu.: 7.000
Max. :24.980 Max. :18.00 Max. :51.00 Max. :44.000
OLS
Call:
lm(formula = wage ~ educ + exper + tenure, data = wage1)
Residuals:
Min 1Q Median 3Q Max
-7.6068 -1.7747 -0.6279 1.1969 14.6536
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -2.87273 0.72896 -3.941 9.22e-05 ***
educ 0.59897 0.05128 11.679 < 2e-16 ***
exper 0.02234 0.01206 1.853 0.0645 .
tenure 0.16927 0.02164 7.820 2.93e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 3.084 on 522 degrees of freedom
Multiple R-squared: 0.3064, Adjusted R-squared: 0.3024
F-statistic: 76.87 on 3 and 522 DF, p-value: < 2.2e-16
wage1 dataset
Testing for heteroskedasticity (Breusch-Pagan test)
Model \(y_i = x^\top_i \beta + \varepsilon_i\), with \(\varepsilon_i \sim \mathcal{N}(0, \sigma^2_i)\), \(i = 1, \ldots, n\)
Test statistic: \[LM = \frac{1}{2} f^{\top} W (W^\top W)^{-1} W^\top f\] where \(f\) has elements \(f_i = \hat\varepsilon_i^2/\hat\sigma^2 - 1\) (with \(\hat\sigma^2 = \sum_i \hat\varepsilon_i^2/n\)) and \(W\) is the matrix of variables, including a constant, suspected of driving the variance; under the null of homoskedasticity, \(LM\) is asymptotically \(\chi^2\) with degrees of freedom equal to the number of columns of \(W\) minus one
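In R this test can be run with bptest() from the lmtest package; a minimal sketch (the package is not loaded in the original code):
library(lmtest)
library(wooldridge)
data("wage1")
ols <- lm(wage ~ educ + exper + tenure, data = wage1)   # OLS fit as above
bptest(ols, studentize = FALSE)   # original Breusch-Pagan form of the statistic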
wage1 dataset
library(nlme)  # gls() and varPower() come from the nlme package
pred <- predict(ols)
glsFit <- gls(wage ~ educ + exper + tenure, data = wage1,
              weights = varPower(form = ~ pred))
summary(glsFit)
Generalized least squares fit by REML
Model: wage ~ educ + exper + tenure
Data: wage1
AIC BIC logLik
2612.449 2637.995 -1300.225
Variance function:
Structure: Power of variance covariate
Formula: ~pred
Parameter estimates:
power
0.7879627
Coefficients:
Value Std.Error t-value p-value
(Intercept) 1.0550929 0.3850044 2.740470 0.0063
educ 0.2887735 0.0282548 10.220328 0.0000
exper 0.0160721 0.0081646 1.968499 0.0495
tenure 0.1594773 0.0202684 7.868274 0.0000
Correlation:
(Intr) educ exper
educ -0.899
exper -0.758 0.544
tenure 0.080 -0.167 -0.273
Standardized residuals:
Min Q1 Med Q3 Max
-1.9517615 -0.6652424 -0.2411887 0.4001136 5.1639173
Residual standard error: 0.7339895
Degrees of freedom: 526 total; 522 residual
wage1 dataset
Coefficient estimates, OLS (column 1) vs. GLS (column 2):
[,1] [,2]
(Intercept) -2.87273482 1.05509285
educ 0.59896507 0.28877352
exper 0.02233952 0.01607207
tenure 0.16926865 0.15947728
Standard errors, OLS (column 1) vs. GLS (column 2):
[,1] [,2]
(Intercept) 0.72896429 0.385004357
educ 0.05128355 0.028254819
exper 0.01205685 0.008164635
tenure 0.02164461 0.020268395
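For reference, side-by-side tables of this kind can be produced along these lines (a sketch, assuming the fitted objects ols and glsFit above):
cbind(coef(ols), coef(glsFit))                            # point estimates
cbind(sqrt(diag(vcov(ols))), sqrt(diag(vcov(glsFit))))    # standard errors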
par(mfrow=c(1,3))
plot(fitted(ols), residuals(ols), xlab = "Fitted values (OLS)",
ylab = "Residuals", main = "Heteroskedasticity in OLS")
abline(h = 0, col = "red", lty = 2)
plot(fitted(ols), rstandard(ols), xlab = "Fitted values (OLS)",
ylab = "Standardized Residuals", main = "Heteroskedasticity in OLS")
abline(h = 0, col = "red", lty = 2)
plot(fitted(glsFit), resid(glsFit, type="normalized"), xlab = "Fitted values (GLS)",
ylab = "Standardized residuals", main = "Variance stabilized in GLS")
abline(h = 0, col = "red", lty = 2)
# ols_log: OLS fit on the log scale (definition not shown above; assumed from the GLS formula below)
ols_log <- lm(log(wage) ~ educ + exper + tenure, data = wage1)
pred <- predict(ols_log)
gls_log <- gls(log(wage) ~ educ + exper + tenure, data = wage1,
weights = varPower(form = ~ pred))
par(mfrow=c(1,2))
plot(fitted(ols_log), residuals(ols_log), xlab = "Fitted values (OLS)",
ylab = "Residuals", main = "OLS")
abline(h = 0, col = "red", lty = 2)
plot(fitted(gls_log), resid(gls_log, type="normalized"), xlab = "Fitted values (GLS)",
ylab = "Standardized residuals", main = "GLS")
abline(h = 0, col = "red", lty = 2)
wage1 dataset
Coefficient estimates for the log-wage model, OLS (column 1) vs. GLS (column 2):
[,1] [,2]
(Intercept) 0.284359555 0.391014486
educ 0.092028987 0.083188829
exper 0.004121109 0.004344837
tenure 0.022067217 0.022288778
Standard errors for the log-wage model, OLS (column 1) vs. GLS (column 2):
[,1] [,2]
(Intercept) 0.104190378 0.096951203
educ 0.007329923 0.006902781
exper 0.001723277 0.001684173
tenure 0.003093649 0.003226156
Comparing the models
Akaike Information Criterion
\[AIC = 2(p+1) - 2\,\ell(\hat {\boldsymbol \beta}; \hat \sigma^2)\]
which, for the Gaussian linear model and up to an additive constant, becomes \[AIC = 2(p+1) + n \log {\hat \sigma}^2\]
To compare models whose responses are on different scales (e.g., wage vs. \(\log(wage)\)), the log-likelihoods must first be brought to a common scale by adding the Jacobian term of the transformation (for the log transform, \(-\sum_i \log y_i\))
Lower values indicate a better model
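As a sketch, assuming the fitted objects ols and ols_log above, the adjustment for the log transform can be applied directly to the AIC values:
AIC(ols)                                   # model for wage
# For the model fitted to log(wage): log f_Y(y) = log f_{log Y}(log y) - log y,
# so its AIC on the wage scale is AIC(ols_log) + 2 * sum(log(wage))
AIC(ols_log) + 2 * sum(log(wage1$wage))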
Akaike Information Criterion
Multicollinearity Problem
Simple regression on Age:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 517.2922247 77.851531 6.64459921 1.002280e-10
Age 0.0489114 1.335991 0.03661057 9.708139e-01
Simple regression on Limit:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -292.7904955 26.683414516 -10.97275 1.184152e-24
Limit 0.1716373 0.005066234 33.87867 2.530581e-119
Simple regression on Rating:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -390.84634 29.0685146 -13.44569 3.073181e-34
Rating 2.56624 0.0750891 34.17594 1.898899e-120
Multicollinearity Problem
Simple regression on Age:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 517.2922247 77.851531 6.64459921 1.002280e-10
Age 0.0489114 1.335991 0.03661057 9.708139e-01
Simple regression on Limit:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -292.7904955 26.683414516 -10.97275 1.184152e-24
Limit 0.1716373 0.005066234 33.87867 2.530581e-119
Regression on Age and Limit:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -173.410901 43.828387048 -3.956589 9.005366e-05
Age -2.291486 0.672484540 -3.407492 7.226468e-04
Limit 0.173365 0.005025662 34.495944 1.627198e-121
Multicollinearity Problem
Simple regression on Rating:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -390.84634 29.0685146 -13.44569 3.073181e-34
Rating 2.56624 0.0750891 34.17594 1.898899e-120
Simple regression on Limit:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -292.7904955 26.683414516 -10.97275 1.184152e-24
Limit 0.1716373 0.005066234 33.87867 2.530581e-119
Regression on Limit and Rating:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -377.53679536 45.25417619 -8.3425846 1.213565e-15
Limit 0.02451438 0.06383456 0.3840298 7.011619e-01
Rating 2.20167217 0.95229387 2.3119672 2.129053e-02
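A standard numerical diagnostic for this problem is the variance inflation factor, \(VIF_j = 1/(1-R^2_j)\), where \(R^2_j\) is obtained by regressing predictor \(j\) on the remaining predictors. A minimal sketch, assuming a data frame credit holding the Age, Limit and Rating columns used above (the object name is an assumption):
# Auxiliary regression of Limit on the other predictors, then its VIF
r2_limit <- summary(lm(Limit ~ Age + Rating, data = credit))$r.squared
1 / (1 - r2_limit)       # values far above 10 signal severe collinearity
# car::vif() applied to a fitted model with all predictors returns the same quantities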
Multicollinearity Problem
Endogeneity
The unbiasedness and the consistency of the OLS estimator rest on the hypothesis that the conditional expectation of the error is constant (and can be set to zero if the model contains an intercept)
Consider the simple linear model: \[y_i = \beta_1 + \beta_2 x_i + \varepsilon_i\] with \[\mathbb{E}(Y_i|x_i) = \mathbb{E}[\beta_1+\beta_2x_i + \varepsilon_i|x_i] = \beta_1 +\beta_2x_i + \mathbb{E}[\varepsilon_i| x_i] = \beta_1 +\beta_2x_i\] if \(\mathbb{E}(\varepsilon_i|x_i) = 0\)
When \(x\) is correlated with \(\varepsilon\), we face endogeneity
Leads to biased and inconsistent OLS estimates
Common in observational data
Why endogeneity matters
The same property can also be stated in terms of the covariance between the covariate and the error, which, using the law of iterated expectations (tower property), can be written as \[\text{cov}(x, \varepsilon) = \mathbb{E}[(x-\mu_x)\varepsilon] = \mathbb{E}_x[\mathbb{E}_\varepsilon[(x-\mu_x)\varepsilon \mid x]] = \mathbb{E}_x[(x-\mu_x)\mathbb{E}_\varepsilon[\varepsilon \mid x]]\]
Key consequence: \[\text{cov}(x,\varepsilon) \neq 0 \Rightarrow \hat{\beta}_{2,OLS} \xrightarrow{p} \beta_2 + \frac{\text{cov}(x,\varepsilon)}{\text{var}(x)}\] where \(\xrightarrow{p}\) denotes convergence in probability
Indeed, in the simple linear regression model the OLS slope is the ratio of the sample covariance to the sample variance, so \[\hat{\beta}_{2,OLS} \xrightarrow{p} \frac{\text{cov}(x,y)}{\text{var}(x)} = \frac{\text{cov}(x,\beta_1+\beta_2x+\varepsilon)}{\text{var}(x)} = \beta_2 + \frac{\text{cov}(x,\varepsilon)}{\text{var}(x)}\]
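A small simulation (a sketch, not part of the original material) shows the OLS slope converging to \(\beta_2 + \text{cov}(x,\varepsilon)/\text{var}(x)\) rather than to \(\beta_2\):
set.seed(1)
n <- 1e5
u <- rnorm(n)
x <- rnorm(n) + 0.5 * u      # covariate correlated with the error: cov(x, eps) = 0.5
eps <- u                     # var(x) = 1.25
y <- 1 + 2 * x + eps
coef(lm(y ~ x))["x"]         # approx. 2 + 0.5 / 1.25 = 2.4 instead of 2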
Endogeneity
If the conditional expectation of \(\varepsilon\) is a constant, \(\mathbb{E}_\varepsilon[\varepsilon |x] = \mu_\varepsilon\) (not necessarily 0), the covariance is \[cov(x, \varepsilon) = \mu_\varepsilon \mathbb{E}_x[x-\mu_x] = 0\]
Stated differently, \(x\) is assumed to be exogenous, i.e., uncorrelated with \(\varepsilon\)
Endogeneity arises when \(\text{cov}(x, \varepsilon)\neq 0\)
Sources of endogeneity:
1. Errors in the variables (outcome)
Data used in economics, especially micro-data, are prone to measurement error (in both the outcome and the predictors)
Suppose that the model we seek to estimate is \[y^{*}_i = \beta_1 + \beta_2 x^{*}_i + \varepsilon^{*}_i\] where the covariate is exogenous (\(\text{cov}(x^*, \varepsilon^*)=0\))
Suppose that the response is observed with error, so that the true value relates to the observed one through \[y^{*}_i = y_i - \nu_i,\] where \(\nu_i\) is the measurement error of the response. Then
\[y_i= \beta_1 + \beta_2 x^*_i + (\varepsilon^*_i + \nu_i) \]
The error of the model is \(\varepsilon_i = \varepsilon^*_i + \nu_i\), which is still uncorrelated with \(x^*\) provided \(\nu\) is, i.e., provided the measurement error of the response is uncorrelated with the covariate
The measurement error only inflates the variance of the error term, which implies that the coefficients are estimated less precisely and that the \(R^2\) is lower compared to a model with a correctly measured response
1. Errors in the variables (predictor)
Suppose now that the covariate is observed with error, so that the true value relates to the observed one through \[x^{*}_i = x_i - \nu_i,\] where \(\nu_i\) is the measurement error of the covariate.
If the measurement error is uncorrelated with the true value of the covariate, the variance of the observed covariate is \[\sigma^2_x = \sigma^{*2}_x + \sigma^2_{\nu}\] and the covariance between the observed covariate and the measurement error equals the variance of the measurement error, that is \(\text{cov}(x, \nu) = \mathbb{E}[(x^* + \nu - \mu_x)\nu] = \sigma^2_\nu\), because the measurement error is uncorrelated with the true covariate
So, rewriting the model in terms of \(x\), we get \[y_i= \beta_1 + \beta_2 x_i + \varepsilon_i\] with \(\varepsilon_i= \varepsilon^{*}_i - \beta_2 \nu_i\)
The error of the model is correlated with \(x\), as \(\text{cov}(x,\varepsilon) = \text{cov}(x^* + \nu,\; \varepsilon^* - \beta_2 \nu) = -\beta_2\sigma^2_\nu\)
1. Errors in the variables (predictor)
\[\hat \beta_2 = \frac{\sum_{i=1}^{n}(x_i - \bar x)(y_i - \bar y)}{\sum_{i=1}^{n}(x_i - \bar x)^2} = \beta_2 + \frac{\sum_{i=1}^{n}(x_i - \bar x)\varepsilon_i}{\sum_{i=1}^{n}(x_i - \bar x)^2}\]
Taking the expectation, we have \(\mathbb{E}[(x-\bar x) \varepsilon] = -\beta_2\sigma^2_\nu\) and the expected value of the estimator is \[\mathbb{E}(\hat \beta_2) = \beta_2 \left(1- \frac{\sigma^2_\nu}{\sum_{i=1}^{n}(x_i - \bar x)^2/n} \right) = \beta_2\left(1-\frac{\sigma^2_\nu}{\hat \sigma^2_x}\right)\]
The OLS estimator is biased, and the term in brackets is one minus the share of the variance of the observed covariate that is due to measurement error
1. Errors in the variables (predictor)
\[\mathbb{E}(\hat \beta_2) = \beta_2 \left(1- \frac{\sigma^2_\nu}{\sum_{i=1}^{n}(x_i - \bar x)^2/n} \right)= \beta_2\left(1-\frac{\sigma^2_\nu}{\hat \sigma^2_x}\right)\]
Then \(|\mathbb{E}(\hat \beta_2)| < |\beta_2|\)
This is called attenuation bias: the estimate is shrunk toward zero, so whether the bias is downward or upward depends on the sign of \(\beta_2\)
This bias does not vanish in large samples: as \(n\) grows, the empirical variances and covariances converge to their population counterparts, so the estimator converges to \(\beta_2(1 - \sigma^2_\nu/\sigma^2_x)\)
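A short simulation (a sketch, not from the original slides) makes the attenuation factor visible:
set.seed(2)
n <- 1e5
x_true <- rnorm(n, sd = 2)      # true covariate, variance 4
nu <- rnorm(n)                  # measurement error, variance 1
x_obs <- x_true + nu            # observed covariate, variance 5
y <- 1 + 2 * x_true + rnorm(n)
coef(lm(y ~ x_obs))["x_obs"]    # approx. 2 * (1 - 1/5) = 1.6 instead of 2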
2. Omitted variables bias
Suppose that the true model is: \[y_i = \beta_1 + \beta_2x_i + \beta_3 z_i + \varepsilon_i\] where \(E[\varepsilon|x,z]=0\) and the model can be estimated consistently using OLS
Consider that \(z\) is unobserved
The model to be estimated is \[y_i = \beta_1 + \beta_2x_i + \varepsilon^*_i\] where \(\varepsilon^*_i = \varepsilon_i + \beta_3 z_i\)
2. Omitted variables bias
As the variance of the OLS estimator is proportional to the variance of the errors, omission of a relevant covariate will always induce a less precise estimation of the slopes and a lower \(R^2\)
Moreover, if the omitted covariate is correlated with the covariate used in the regression, the estimation will be biased and inconsistent
This omitted variable bias can be computed as follows:
\[\hat \beta_2 = \beta_2 + \frac{\sum_{i=1}^{n}(x_i -\bar x)(\beta_3z_i + \varepsilon_i)}{\sum_{i=1}^{n}(x_i - \bar x)^2} = \beta_2 + \beta_3 \frac{\sum_{i=1}^{n}(x_i -\bar x)(z_i - \bar z)}{\sum_{i=1}^{n}(x_i - \bar x)^2} + \frac{\sum_{i=1}^{n}(x_i -\bar x)\varepsilon_i}{\sum_{i=1}^{n}(x_i - \bar x)^2}\] so that \(\hat\beta_2 \xrightarrow{p} \beta_2 + \beta_3\,\text{cov}(x,z)/\text{var}(x)\): the bias vanishes only if \(\beta_3 = 0\) or if \(x\) and \(z\) are uncorrelated
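A quick simulation (a sketch, not in the original slides) reproduces the omitted-variable bias \(\beta_3\,\text{cov}(x,z)/\text{var}(x)\):
set.seed(3)
n <- 1e5
z <- rnorm(n)
x <- 0.8 * z + rnorm(n)               # cov(x, z) = 0.8, var(x) = 1.64
y <- 1 + 2 * x + 1.5 * z + rnorm(n)   # true model includes z
coef(lm(y ~ x))["x"]                  # approx. 2 + 1.5 * 0.8 / 1.64 = 2.73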
2. Returns from education
\[\log(w_i) = \beta_1 + \beta_2 e_i + \beta_3 s_i + \beta_4 s^2_i + \varepsilon_i\] where \(\beta_2\) is the percentage increase of the wage for one more year of education, holding fixed the other predictors. Indeed \[ \beta_2 = \frac{d \log(w)}{d e} = \frac{dw/w}{de}\]
2. Returns from education
Sample of 303 white males taken from the National Longitudinal Survey of Youth in 1992
Variables:
2. Returns from education
data$exper <- data$exper/52
fit1 <- lm(wage ~ educ + poly(exper, 2), data = data)
summary(fit1)$coefficients
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.1114866 0.15949528 6.968774 2.045533e-11
educ 0.1000376 0.01179666 8.480160 1.049101e-15
poly(exper, 2)1 2.3912857 0.45538678 5.251109 2.874745e-07
poly(exper, 2)2 0.4821923 0.45418215 1.061672 2.892415e-01
2. Returns from education
Concern: individuals differ in ability (\(a\)); higher ability has a positive effect on wage, but may also have a positive effect on education
If so, adding ability in the model \[\log(w_i) = \beta_1 + \beta_2 e_i + \beta_3 s_i + \beta_4 s^2_i + \beta_5 a_i + \varepsilon_i\]
would give \(\beta_5>0\); regressing education on ability, that is \[e_i = \beta^*_1+ \beta^*_2 a_i + \varepsilon^*_i,\] would give \(\beta^*_2>0\)
2. Returns from education
In our dataset we have a measure of ability (AFQT), the standardized AFQT test score.
If we introduce ability in the regression, education is no longer endogenous and least squares gives a consistent estimate of the effect of education on wage.
We first check that education and ability are positively correlated:
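A minimal sketch of this check and of the augmented fit (the object name fit2 is hypothetical; the data frame data and the columns educ, exper and AFQT are taken from the code and output shown above):
cor(data$educ, data$AFQT)                                     # expected to be positive
fit2 <- lm(wage ~ educ + poly(exper, 2) + AFQT, data = data)  # add ability to the regression
summary(fit2)$coefficients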
2. Returns from education
Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.23953056 0.19077860 6.497220 3.428526e-10
educ 0.08874818 0.01498128 5.923938 8.660934e-09
poly(exper, 2)1 2.30936000 0.45993514 5.021056 8.861583e-07
poly(exper, 2)2 0.44963379 0.45459292 0.989091 3.234210e-01
AFQT 0.04693974 0.03844733 1.220884 2.230949e-01
3. Simultaneity bias
\[\begin{cases} q^d = \beta^d_1 + \beta^d_2 p + \beta^d_3 d +\varepsilon^d \\ q^s = \beta^s_1 + \beta^s_2 p + \beta^s_3 s +\varepsilon^s \\ q^d = q^s\end{cases}\] where \(q^d\) and \(q^s\) are the demanded and supplied quantities, \(p\) is the price, and \(d\) and \(s\) are exogenous demand and supply shifters
3. Simultaneity bias
\[\begin{cases} q^d = \beta^d_1 + \beta^d_2 p + \beta^d_3 d +\varepsilon^d \\ q^s = \beta^s_1 + \beta^s_2 p + \beta^s_3 s +\varepsilon^s \\ q^d = q^s\end{cases}\]
The demand curve should be decreasing: \(\beta^d_2<0\)
The supply curve should be increasing: \(\beta^s_2>0\)
By fitting the OLS regression we can identify the correct signs…
However, the fit of the supply equation is very poor
3. Simultaneity bias
What is actually observed for each observation in the sample is a price-quantity combination at an equilibrium.
A positive shock on the demand equation will move the demand curve upward and will lead to a new equilibrium with a higher equilibrium quantity \(q'\) and a higher equilibrium price \(p'\) (except in the special case of a horizontal supply curve, i.e., an infinite price elasticity of supply, in which only the quantity adjusts).
This means that \(p\) is correlated with \(\varepsilon^{d}\), which leads to a bias in the estimation of \(\beta^d_2\) via OLS
The same reasoning applies of course to the supply curve.
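A small simulation of this system (a sketch, not part of the original slides; intercepts set to zero and unit shifter effects) makes the bias in the OLS estimate of the demand slope visible:
set.seed(4)
n <- 1e4
d <- rnorm(n); s <- rnorm(n)                 # exogenous demand and supply shifters
eps_d <- rnorm(n); eps_s <- rnorm(n)         # structural errors
b2d <- -1; b2s <- 1                          # true price effects: demand -1, supply +1
p <- (d - s + eps_d - eps_s) / (b2s - b2d)   # equilibrium price solving q^d = q^s
q <- b2s * p + s + eps_s                     # equilibrium quantity (supply equation)
coef(lm(q ~ p + d))["p"]                     # OLS demand slope, biased upward from -1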
3. Instrumental variable estimator