29th October 2025 - Vincenzo Gioia
Some uses of regression models:
Framework:
Question: is the former (the outcome) related to the latter (the covariates) and, if so, how?
Note
The latter (the covariates) are of practical interest only insofar as they are connected to the former (the outcome)
Predictive
Obtain a tool for predicting the value of the variable of interest given the values of the explanatory variables (e.g., because these are easier to measure or can be observed before the response)
Interpretative
The main interest is to determine which explanatory variables have the strongest relationship with the response, and in which direction that relationship goes
The probability distribution of the outcome depends on the covariates
\[[\text{outcome}] \sim f(y; [\text{covariates}]) \]
Note
General structure, whose specific form depends on the type of outcome
Different models for different types of outcome
Note
Under certain conditions the linear regression model can be used for quantitative discrete variables
Data Matrix
General structure including outcome and covariates
\[ {\small \begin{array}{c| c c c c c} \text{Unit} & y & x_1 & x_2 & \cdots & x_p \\ \hline 1 & y_1 & x_{11} & x_{12} & \cdots & x_{1p} \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ i & y_i & x_{i1} & x_{i2} & \cdots & x_{ip} \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ n & y_n & x_{n1} & x_{n2} & \cdots & x_{np} \\ \end{array} } \]
\[Y_i \sim f(y_i; x_{i1}, \ldots, x_{ip}), \quad i = 1, \ldots, n\]
Additive structure
\[h(Y_i) = g(x_{i1}, \ldots, x_{ip}) + \varepsilon_i\]
Linearity of \(g(\cdot)\)
\[g(x_{i1}, \ldots, x_{ip}) = \beta_1 g_1(x_{i1}) + \ldots + \beta_p g_p(x_{ip})\]
Linear model
\[h(Y_i) = \beta_1 g_1(x_{i1}) + \ldots + \beta_p g_p(x_{ip}) + \varepsilon_i\]
Examples
\[Y_i = \beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i\] \[Y_i = \beta_1 +\beta_2 x^2_{i2} + \beta_3 \sqrt{x_{i3}} + \varepsilon_i\] \[\log(Y_i) = \beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i\]
\[\log(Y_i) = \beta_1 + \beta_2 \log(x_{i2}) + \beta_3 \log(x_{i3}) + \varepsilon_i\]
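In R, models of this kind can be written directly in the formula interface of lm(); a minimal sketch, assuming a hypothetical data frame dat with columns y, x2 and x3:

```r
# Hypothetical data frame `dat` with columns y, x2, x3 (names are illustrative)
fit1 <- lm(y ~ x2 + x3, data = dat)                 # Y = b1 + b2*x2 + b3*x3 + e
fit2 <- lm(y ~ I(x2^2) + sqrt(x3), data = dat)      # Y = b1 + b2*x2^2 + b3*sqrt(x3) + e
fit3 <- lm(log(y) ~ x2 + x3, data = dat)            # log(Y) = b1 + b2*x2 + b3*x3 + e
fit4 <- lm(log(y) ~ log(x2) + log(x3), data = dat)  # log(Y) = b1 + b2*log(x2) + b3*log(x3) + e
```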
Relationship should not be interpreted as Cause-and-Effect
When we write a model in which one variable (\(y\)) is a function of another (\(x\)), it is very tempting to interpret it as \(x\) causes \(y\)
Linear model: \(\quad Y_i = \beta_1 + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} + \varepsilon_i\)
Matrix representation \(Y = X \mathbf{\beta} + \varepsilon\)
\[ {\small \begin{array}{c| c c c c c} \text{Unit} & x_1 & x_2 & \cdots & x_p \\ \hline 1 & 1 & x_{12} & \cdots & x_{1p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ i & 1 & x_{i2} & \cdots & x_{ip} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ n & 1 & x_{n2} & \cdots & x_{np} \\ \end{array} } \]
Linear model: \(Y = X \mathbf{\beta} + \varepsilon\)
Assumptions
Note
We do not make distributional assumptions on the error components: we only specify their first two moments, the so-called second-order hypotheses
Linearity
\[ \begin{pmatrix}Y_1 \\ \vdots \\ Y_i \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1p}\\ \vdots & \ddots & \vdots \\ x_{i1} & \cdots & x_{ip}\\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}\begin{pmatrix}\beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix}\varepsilon_1 \\ \vdots \\ \varepsilon_i \\ \vdots \\ \varepsilon_n \end{pmatrix} \] \[ \begin{pmatrix}n \times 1\end{pmatrix} \quad \quad \quad \begin{pmatrix} n \times p \end{pmatrix} \quad \quad \begin{pmatrix} p \times 1 \end{pmatrix} \quad \quad \begin{pmatrix}n \times 1\end{pmatrix} \]
Second-order hypotheses for the error term
\[ \mathbb{E}(\varepsilon) = 0 \quad \quad V(\varepsilon) = \begin{pmatrix}\sigma^2 & 0 & \cdots & 0\\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{pmatrix} = \sigma^2 I\]
Note
Linear independence between explanatory variables
Least Squares (LS) Estimator (OLS: ordinary least squares)
\[\hat\beta_{OLS}= (X^\top X)^{-1}X^\top Y\]
The LS estimator is obtained by minimizing the residual sum of squares
\[{\rm RSS}(\beta) = (Y-X\beta)^\top (Y-X\beta) = Y^\top Y -2\beta^\top X^\top Y +\beta^\top X^\top X \beta\] \[ \frac{\partial}{\partial \beta} {\rm RSS}(\beta)= -2 X^\top Y + 2 X^\top X \beta\] \[ \frac{\partial}{\partial \beta} {\rm RSS}(\beta)=0 \implies \hat \beta_{OLS} = (X^\top X)^{-1}X^\top Y\]
LS estimate
\[\hat \beta_{OLS} = (X^\top X)^{-1}X^\top y\]
Predicted values
\[\hat y = X \hat \beta_{OLS} = X (X^\top X)^{-1}X^\top y = Py\] where \(P = X (X^\top X)^{-1}X^\top\) is called the projection matrix (symmetric and idempotent)
Residuals
\[e = y - \hat y = (I-P)y\]
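A small simulated sketch of these quantities (the data-generating values below are made up purely for illustration):

```r
set.seed(1)
n  <- 50
x2 <- runif(n)
x3 <- runif(n)
y  <- 1 + 2 * x2 - x3 + rnorm(n, sd = 0.5)   # simulated response

X <- cbind(1, x2, x3)                        # design matrix with intercept column
P <- X %*% solve(t(X) %*% X) %*% t(X)        # projection matrix

all.equal(P, t(P))          # symmetric
all.equal(P %*% P, P)       # idempotent

y_hat <- P %*% y            # predicted values
e     <- y - y_hat          # residuals, (I - P) y
```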
Properties of \(\hat\beta_{OLS}\)
We need an estimate of \(\sigma^2\) (which is unknown)
The idea is to use the residuals as substitutes for the errors and to use their variance as an estimator of \(\sigma^2\)
\(\hat \sigma^2 = \frac{1}{n} e^\top e\), which is biased
An unbiased estimator is given by \[S^2 = \frac{1}{n-p} e^\top e\]
Since \(V(\hat \beta_{OLS}) = \sigma^2 (X^\top X)^{-1}\), the estimated variance/covariance matrix of \(\hat \beta_{OLS}\) is \[\hat V (\hat \beta_{OLS}) = S^2(X^\top X)^{-1}\]
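A standalone sketch of the variance estimation step, again with made-up simulated data:

```r
set.seed(1)
n <- 50
p <- 3
X <- cbind(1, runif(n), runif(n))            # made-up design matrix (p = 3 columns)
y <- X %*% c(1, 2, -1) + rnorm(n, sd = 0.5)  # simulated response

beta_hat <- solve(t(X) %*% X, t(X) %*% y)    # OLS estimate
e     <- y - X %*% beta_hat                  # residuals
S2    <- sum(e^2) / (n - p)                  # unbiased estimate of sigma^2
V_hat <- S2 * solve(t(X) %*% X)              # estimated var/cov matrix of beta_hat
sqrt(diag(V_hat))                            # standard errors of the coefficients
```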
How well the predicted values \(\hat y\) represent the observed data \(y\)
Measure of goodness of fit: \(R^2\) coefficient \[R^2 = \frac{\sum_{i=1}^{n}(\hat y_i - \bar y)^2}{\sum_{i=1}^{n}(y_i - \bar y)^2} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat y_i)^2}{\sum_{i=1}^{n}(y_i - \bar y)^2}\]
\(R^2 \in [0,1]\)
It represents the fraction of variability of \(Y\) explained by the model
Deviance decomposition
\[{\rm Total\, deviance} = {\rm Model\, deviance} + {\rm Residual \, deviance}\]
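The decomposition (which holds when the model contains an intercept) can be verified numerically; a sketch, assuming some already fitted lm object, here called fit:

```r
# Sketch assuming an already fitted lm object `fit` (with intercept)
y     <- model.response(model.frame(fit))   # observed response
y_hat <- fitted(fit)                        # predicted values
y_bar <- mean(y)

tot_dev   <- sum((y - y_bar)^2)             # total deviance
model_dev <- sum((y_hat - y_bar)^2)         # model deviance
resid_dev <- sum((y - y_hat)^2)             # residual deviance

all.equal(tot_dev, model_dev + resid_dev)   # decomposition check
R2 <- model_dev / tot_dev                   # equivalently 1 - resid_dev / tot_dev
```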
Credit Card Balance Data
We consider just a simple example, regressing Balance on Student and Limit
Exploratory purposes (exploratory plots of the Credit data)
The lm() function: fit a linear model on your dataset
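A minimal sketch, assuming the Credit data come from the ISLR R package:

```r
# install.packages("ISLR")   # assumed source of the Credit data
library(ISLR)

fit <- lm(Balance ~ Limit + Student, data = Credit)  # Balance regressed on Limit and Student
summary(fit)                                         # returns the output reported further below
```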
Obtaining the parameter LS estimates by hand
\[\hat \beta_{OLS} = (X^\top X)^{-1}X^\top y\]
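The same estimates can be reproduced by hand from the design matrix; a sketch, assuming the fit object and Credit data above:

```r
X <- model.matrix(~ Limit + Student, data = Credit)  # n x p design matrix (dummy coding for Student)
y <- Credit$Balance

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y         # (X'X)^{-1} X'y
beta_hat
coef(fit)                                            # coincides with the by-hand computation
```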
```
Call:
lm(formula = Balance ~ Limit + Student, data = Credit)

Residuals:
    Min      1Q  Median      3Q     Max 
-637.77 -116.90    6.04  130.92  434.24 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.347e+02  2.307e+01  -14.51   <2e-16 ***
Limit        1.720e-01  4.331e-03   39.70   <2e-16 ***
StudentYes   4.044e+02  3.328e+01   12.15   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 199.7 on 397 degrees of freedom
Multiple R-squared:  0.8123,  Adjusted R-squared:  0.8114 
F-statistic: 859.2 on 2 and 397 DF,  p-value: < 2.2e-16
```
Quantities in the first and second columns of the coefficient table (Estimate and Std. Error):
\[\hat \beta_{OLS} = (X^\top X)^{-1}X^\top y\] \[\sqrt{[\hat V (\hat \beta_{OLS})]_{jj}} = \sqrt{[S^2(X^\top X)^{-1}]_{jj}}\]
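These two columns can be reproduced by hand; a sketch, using the fit object above:

```r
X  <- model.matrix(fit)                          # design matrix used by lm()
S2 <- sum(residuals(fit)^2) / df.residual(fit)   # S^2 = e'e / (n - p)

cbind(Estimate  = coef(fit),
      Std.Error = sqrt(diag(S2 * solve(t(X) %*% X))))
# equivalently, the standard errors are sqrt(diag(vcov(fit)))
```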
Residuals: \(e = y - \hat y\)
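A sketch of calls that would produce the output below (two equivalent ways of obtaining the residuals, their summary, and the residual standard error), assuming the fit object above:

```r
head(residuals(fit))                 # residuals extracted from the lm fit
head(Credit$Balance - fitted(fit))   # equivalently, e = y - y_hat

summary(residuals(fit))              # five-number summary (the mean is zero by construction)

sqrt(sum(residuals(fit)^2) / df.residual(fit))   # residual standard error, by hand
summary(fit)$sigma                               # same value reported by summary()
```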
```
        1          2          3          4          5          6 
 47.66441 -309.30693 -301.84343 -335.51929 -176.32798  102.01744 

        1          2          3          4          5          6 
 47.66441 -309.30693 -301.84343 -335.51929 -176.32798  102.01744 

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-637.771 -116.900    6.045    0.000  130.916  434.236 

[1] 199.6745
[1] 199.6745
```
\(R^2\) coefficient and the adjusted R-squared (\(R^2_c\))
\[R^2 = \frac{\sum_{i=1}^{n}(\hat y_i - \bar y)^2}{\sum_{i=1}^{n}(y_i - \bar y)^2} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat y_i)^2}{\sum_{i=1}^{n}(y_i - \bar y)^2}\] \[R^2_c = R^2-\frac{p-1}{n-p} (1- R^2)\]
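Both quantities from the summary can be recomputed directly; a sketch using the Credit fit above:

```r
n <- nrow(Credit)
p <- length(coef(fit))
y <- Credit$Balance

R2  <- 1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
R2c <- R2 - (p - 1) / (n - p) * (1 - R2)

c(R2 = R2, R2c = R2c)          # compare with Multiple and Adjusted R-squared
summary(fit)$r.squared
summary(fit)$adj.r.squared
```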
Remember: This is just a toy example
Obtaining the remaining quantities of the summary
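A hedged sketch of how those remaining quantities (t values, p-values, and the F statistic) could be reproduced from the fit above; their formal justification relies on the distributional assumptions discussed below:

```r
est  <- coef(fit)
se   <- sqrt(diag(vcov(fit)))
tval <- est / se                                    # "t value" column
pval <- 2 * pt(-abs(tval), df = df.residual(fit))   # "Pr(>|t|)" column

# F statistic for the global test that all slopes are zero
y    <- Credit$Balance
rss0 <- sum((y - mean(y))^2)          # RSS of the intercept-only model
rss1 <- sum(residuals(fit)^2)         # RSS of the fitted model
p    <- length(est)
n    <- length(y)
Fstat <- ((rss0 - rss1) / (p - 1)) / (rss1 / (n - p))
```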
Explore violations of the linear model assumptions
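In R, a first check is usually based on the default diagnostic plots of the fitted model (a minimal sketch, using the fit object above):

```r
# Residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(fit)

# Residuals against a single explanatory variable (Credit example)
par(mfrow = c(1, 1))
plot(Credit$Limit, residuals(fit), xlab = "Limit", ylab = "Residuals")
abline(h = 0, lty = 2)
```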
What do we need?
Problems of Statistical Inference
By deriving the OLS estimator we have obtained a point estimator and derived its variance (which provides information on how far the estimator is from the unknown parameter \(\beta\))
To obtain the remaining inferential results (interval estimation, hypothesis tests), we need to either
Use the asymptotic theory of least squares or resampling techniques, or
Introduce a further assumption: the errors are independent and distributed according to a \(\mathcal{N}(0, \sigma^2)\)
Assumptions
Linearity \[Y_i = \beta_1 + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} + \varepsilon_i\]
Errors have mean zero, are homoscedastic, normally distributed and independent \[\varepsilon_i \sim \mathcal{N}(0, \sigma^2), \quad \text{independent}, \quad i = 1, \ldots, n\] \[\varepsilon \sim \mathcal{N}_n(0, \Sigma), \quad \text{where} \quad \Sigma = \sigma^2 I\]
Linear independence between explanatory variables
Note
From 1) and 2), we get that the \(Y_i\) are independent with \[Y_i \sim \mathcal{N}(\mu_i, \sigma^2), \quad\text{where} \quad \mu_i = \beta_1 + \beta_2x_{i2} + \ldots +\beta_p x_{ip}, \quad i=1, \ldots, n\]
By introducing the normality assumption for the errors we can derive the distribution of the estimators and, from it, the remaining inferential results (confidence intervals and hypothesis tests)
Note
On Friday, we will introduce the likelihood function and derive the remaining quantities