29th October 2025 - Vincenzo Gioia
Some uses of regression models:
Framework:
Question: is the former (the outcome) related to the latter (the covariates) and, if so, how?
Note
The latter (the covariates) are of practical interest only insofar as they are connected to the former (the outcome)
Predictive
Obtain a tool for predicting the value of the variable of interest given the values of the explanatory variables (e.g., because these are easier to measure or can be observed before the response)
Interpretative
The main interest is to determine which explanatory variables have the strongest relationship with the response, and in which direction that relationship goes
The probability distribution of the outcome depends on the covariates
\[[\text{outcome}] \sim f(y; [\text{covariates}]) \]
Note
General structure, whose specific form depends on the type of outcome
Different models for different types of outcome
Note
Under certain conditions the linear regression model can be used for quantitative discrete variables
Data Matrix
General structure including outcome and covariates
\[ {\small \begin{array}{c| c c c c c} \text{Unit} & y & x_1 & x_2 & \cdots & x_p \\ \hline 1 & y_1 & x_{11} & x_{12} & \cdots & x_{1p} \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ i & y_i & x_{i1} & x_{i2} & \cdots & x_{ip} \\ \vdots & \vdots & \vdots & \vdots & \ddots & \vdots \\ n & y_n & x_{n1} & x_{n2} & \cdots & x_{np} \\ \end{array} } \]
\[Y_i \sim f(y_i; x_{i1}, \ldots, x_{ip}), \quad i = 1, \ldots, n\]
Additive structure
\[h(Y_i) = g(x_{i1}, \ldots, x_{ip}) + \varepsilon_i\]
Linearity of \(g(\cdot)\)
\[g(x_{i1}, \ldots, x_{ip}) = \beta_1 g_1(x_{i1}) + \ldots + \beta_p g_p(x_{ip})\]
Linear model
\[h(Y_i) = \beta_1 g_1(x_{i1}) + \ldots + \beta_p g_p(x_{ip}) + \varepsilon_i\]
Examples
\[Y_i = \beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i\] \[Y_i = \beta_1 +\beta_2 x^2_{i2} + \beta_3 \sqrt{x_{i3}} + \varepsilon_i\] \[\log(Y_i) = \beta_1 + \beta_2 x_{i2} + \beta_3 x_{i3} + \varepsilon_i\]
\[\log(Y_i) = \beta_1 + \beta_2 \log(x_{i2}) + \beta_3 \log(x_{i3}) + \varepsilon_i\]
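In R, models of this kind can be written directly in the formula interface of lm(); a minimal sketch, assuming a hypothetical data frame dat with columns y, x2 and x3:

```r
# Hypothetical data frame `dat` with columns y, x2, x3 (names are illustrative)
fit1 <- lm(y ~ x2 + x3, data = dat)                 # Y = b1 + b2*x2 + b3*x3 + e
fit2 <- lm(y ~ I(x2^2) + sqrt(x3), data = dat)      # Y = b1 + b2*x2^2 + b3*sqrt(x3) + e
fit3 <- lm(log(y) ~ x2 + x3, data = dat)            # log(Y) = b1 + b2*x2 + b3*x3 + e
fit4 <- lm(log(y) ~ log(x2) + log(x3), data = dat)  # log(Y) = b1 + b2*log(x2) + b3*log(x3) + e
```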
Relationship should not be interpreted as Cause-and-Effect
When we write a model in which one variable (\(y\)) is a function of another (\(x\)), it is very tempting to interpret it as \(x\) causes \(y\)
Linear model: \(\quad Y_i = \beta_1 + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} + \varepsilon_i\)
Matrix representation \(Y = X \mathbf{\beta} + \varepsilon\)
\[ {\small \begin{array}{c| c c c c c} \text{Unit} & x_1 & x_2 & \cdots & x_p \\ \hline 1 & 1 & x_{12} & \cdots & x_{1p} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ i & 1 & x_{i2} & \cdots & x_{ip} \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ n & 1 & x_{n2} & \cdots & x_{np} \\ \end{array} } \]
Linear model: \(Y = X \mathbf{\beta} + \varepsilon\)
Assumptions
Note
We do not make distributional assumptions on the error components: we only specify their first two moments, the so-called second-order hypotheses
Linearity
\[ \begin{pmatrix}Y_1 \\ \vdots \\ Y_i \\ \vdots \\ Y_n \end{pmatrix} = \begin{pmatrix} x_{11} & \cdots & x_{1p}\\ \vdots & \ddots & \vdots \\ x_{i1} & \cdots & x_{ip}\\ \vdots & \ddots & \vdots \\ x_{n1} & \cdots & x_{np} \end{pmatrix}\begin{pmatrix}\beta_1 \\ \vdots \\ \beta_p \end{pmatrix} + \begin{pmatrix}\varepsilon_1 \\ \vdots \\ \varepsilon_i \\ \vdots \\ \varepsilon_n \end{pmatrix} \] \[ \begin{pmatrix}n \times 1\end{pmatrix} \quad \quad \quad \begin{pmatrix} n \times p \end{pmatrix} \quad \quad \begin{pmatrix} p \times 1 \end{pmatrix} \quad \quad \begin{pmatrix}n \times 1\end{pmatrix} \]
Second-order hypotheses for the error term
\[ \mathbb{E}(\varepsilon) = 0 \quad \quad V(\varepsilon) = \begin{pmatrix}\sigma^2 & 0 & \cdots & 0\\ 0 & \sigma^2 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & \sigma^2 \end{pmatrix} = \sigma^2 I\]
Note
Linear independence between explanatory variables
Least Squares (LS) Estimator (OLS: ordinary least squares)
\[\hat\beta_{OLS}= (X^\top X)^{-1}X^\top Y\]
The LS estimator is obtained by minimizing the residual sum of squares
\[{\rm RSS}(\beta) = (Y-X\beta)^\top (Y-X\beta) = Y^\top Y -2\beta^\top X^\top Y +\beta^\top X^\top X \beta\] \[ \frac{\partial}{\partial \beta} {\rm RSS}(\beta)= -2 X^\top Y + 2 X^\top X \beta\] \[ \frac{\partial}{\partial \beta} {\rm RSS}(\beta)=0 \implies \hat \beta_{OLS} = (X^\top X)^{-1}X^\top Y\]
LS estimate
\[\hat \beta_{OLS} = (X^\top X)^{-1}X^\top y\]
Predicted values
\[\hat y = X \hat \beta_{OLS} = X (X^\top X)^{-1}X^\top y = Py\] where \(P = X (X^\top X)^{-1}X^\top\) is called the projection matrix (symmetric and idempotent)
Residuals
\[e = y - \hat y = (I-P)y\]
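A small simulated sketch of these quantities (the data-generating values below are made up purely for illustration):

```r
set.seed(1)
n  <- 50
x2 <- runif(n)
x3 <- runif(n)
y  <- 1 + 2 * x2 - x3 + rnorm(n, sd = 0.5)   # simulated response

X <- cbind(1, x2, x3)                        # design matrix with intercept column
P <- X %*% solve(t(X) %*% X) %*% t(X)        # projection matrix

all.equal(P, t(P))          # symmetric
all.equal(P %*% P, P)       # idempotent

y_hat <- P %*% y            # predicted values
e     <- y - y_hat          # residuals, (I - P) y
```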
Properties of \(\hat\beta_{OLS}\)
We need an estimate of \(\sigma^2\) (which is unknown)
The idea is to use the residuals as substitutes for the errors and to use their variance as an estimator of \(\sigma^2\)
\(\hat \sigma^2 = \frac{1}{n} e^\top e\), which is biased
An unbiased estimator is given by \[S^2 = \frac{1}{n-p} e^\top e\]
Since \(V(\hat \beta_{OLS}) = \sigma^2 (X^\top X)^{-1}\), the estimated variance/covariance matrix of \(\hat \beta_{OLS}\) is \[\hat V (\hat \beta_{OLS}) = S^2(X^\top X)^{-1}\]
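A standalone sketch of the variance estimation step, again with made-up simulated data:

```r
set.seed(1)
n <- 50
p <- 3
X <- cbind(1, runif(n), runif(n))            # made-up design matrix (p = 3 columns)
y <- X %*% c(1, 2, -1) + rnorm(n, sd = 0.5)  # simulated response

beta_hat <- solve(t(X) %*% X, t(X) %*% y)    # OLS estimate
e     <- y - X %*% beta_hat                  # residuals
S2    <- sum(e^2) / (n - p)                  # unbiased estimate of sigma^2
V_hat <- S2 * solve(t(X) %*% X)              # estimated var/cov matrix of beta_hat
sqrt(diag(V_hat))                            # standard errors of the coefficients
```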
How well the predicted values \(\hat y\) represent the observed data \(y\)
Measure of goodness of fit: \(R^2\) coefficient \[R^2 = \frac{\sum_{i=1}^{n}(\hat y_i - \bar y)^2}{\sum_{i=1}^{n}(y_i - \bar y)^2} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat y_i)^2}{\sum_{i=1}^{n}(y_i - \bar y)^2}\]
\(R^2 \in [0,1]\)
It represents the fraction of variability of \(Y\) explained by the model
Deviance decomposition
\[{\rm Total\, deviance} = {\rm Model\, deviance} + {\rm Residual \, deviance}\]
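The decomposition (which holds when the model contains an intercept) can be verified numerically; a sketch, assuming some already fitted lm object, here called fit:

```r
# Sketch assuming an already fitted lm object `fit` (with intercept)
y     <- model.response(model.frame(fit))   # observed response
y_hat <- fitted(fit)                        # predicted values
y_bar <- mean(y)

tot_dev   <- sum((y - y_bar)^2)             # total deviance
model_dev <- sum((y_hat - y_bar)^2)         # model deviance
resid_dev <- sum((y - y_hat)^2)             # residual deviance

all.equal(tot_dev, model_dev + resid_dev)   # decomposition check
R2 <- model_dev / tot_dev                   # equivalently 1 - resid_dev / tot_dev
```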
Credit Card Balance Data
We consider just a simple example, regressing Balance on Student and Limit
Exploratory purposes (exploratory plots of the Credit data)
The lm() function: fit a linear model on your dataset
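A minimal sketch, assuming the Credit data come from the ISLR R package:

```r
# install.packages("ISLR")   # assumed source of the Credit data
library(ISLR)

fit <- lm(Balance ~ Limit + Student, data = Credit)  # Balance regressed on Limit and Student
summary(fit)                                         # returns the output reported further below
```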
Obtaining the parameter LS estimates by hand
\[\hat \beta_{OLS} = (X^\top X)^{-1}X^\top y\]
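The same estimates can be reproduced by hand from the design matrix; a sketch, assuming the fit object and Credit data above:

```r
X <- model.matrix(~ Limit + Student, data = Credit)  # n x p design matrix (dummy coding for Student)
y <- Credit$Balance

beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y         # (X'X)^{-1} X'y
beta_hat
coef(fit)                                            # coincides with the by-hand computation
```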
```
Call:
lm(formula = Balance ~ Limit + Student, data = Credit)

Residuals:
    Min      1Q  Median      3Q     Max 
-637.77 -116.90    6.04  130.92  434.24 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.347e+02  2.307e+01  -14.51   <2e-16 ***
Limit        1.720e-01  4.331e-03   39.70   <2e-16 ***
StudentYes   4.044e+02  3.328e+01   12.15   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 199.7 on 397 degrees of freedom
Multiple R-squared:  0.8123,  Adjusted R-squared:  0.8114 
F-statistic: 859.2 on 2 and 397 DF,  p-value: < 2.2e-16
```
Quantities in the first and second columns of the coefficient table (Estimate and Std. Error):
\[\hat \beta_{OLS} = (X^\top X)^{-1}X^\top y\] \[\sqrt{[\hat V (\hat \beta_{OLS})]_{jj}} = \sqrt{[S^2(X^\top X)^{-1}]_{jj}}\]
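These two columns can be reproduced by hand; a sketch, using the fit object above:

```r
X  <- model.matrix(fit)                          # design matrix used by lm()
S2 <- sum(residuals(fit)^2) / df.residual(fit)   # S^2 = e'e / (n - p)

cbind(Estimate  = coef(fit),
      Std.Error = sqrt(diag(S2 * solve(t(X) %*% X))))
# equivalently, the standard errors are sqrt(diag(vcov(fit)))
```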
Residuals: \(e = y - \hat y\)
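A sketch of calls that would produce the output below (two equivalent ways of obtaining the residuals, their summary, and the residual standard error), assuming the fit object above:

```r
head(residuals(fit))                 # residuals extracted from the lm fit
head(Credit$Balance - fitted(fit))   # equivalently, e = y - y_hat

summary(residuals(fit))              # five-number summary (the mean is zero by construction)

sqrt(sum(residuals(fit)^2) / df.residual(fit))   # residual standard error, by hand
summary(fit)$sigma                               # same value reported by summary()
```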
```
        1          2          3          4          5          6 
 47.66441 -309.30693 -301.84343 -335.51929 -176.32798  102.01744 

        1          2          3          4          5          6 
 47.66441 -309.30693 -301.84343 -335.51929 -176.32798  102.01744 

    Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
-637.771 -116.900    6.045    0.000  130.916  434.236 

[1] 199.6745
[1] 199.6745
```
\(R^2\) coefficient and the adjusted R-squared (\(R^2_c\))
\[R^2 = \frac{\sum_{i=1}^{n}(\hat y_i - \bar y)^2}{\sum_{i=1}^{n}(y_i - \bar y)^2} = 1 - \frac{\sum_{i=1}^{n}(y_i - \hat y_i)^2}{\sum_{i=1}^{n}(y_i - \bar y)^2}\] \[R^2_c = R^2-\frac{p-1}{n-p} (1- R^2)\]
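Both quantities from the summary can be recomputed directly; a sketch using the Credit fit above:

```r
n <- nrow(Credit)
p <- length(coef(fit))
y <- Credit$Balance

R2  <- 1 - sum(residuals(fit)^2) / sum((y - mean(y))^2)
R2c <- R2 - (p - 1) / (n - p) * (1 - R2)

c(R2 = R2, R2c = R2c)          # compare with Multiple and Adjusted R-squared
summary(fit)$r.squared
summary(fit)$adj.r.squared
```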
Remember: This is just a toy example
Obtaining the remaining quantities of the summary
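A hedged sketch of how those remaining quantities (t values, p-values, and the F statistic) could be reproduced from the fit above; their formal justification relies on the distributional assumptions discussed below:

```r
est  <- coef(fit)
se   <- sqrt(diag(vcov(fit)))
tval <- est / se                                    # "t value" column
pval <- 2 * pt(-abs(tval), df = df.residual(fit))   # "Pr(>|t|)" column

# F statistic for the global test that all slopes are zero
y    <- Credit$Balance
rss0 <- sum((y - mean(y))^2)          # RSS of the intercept-only model
rss1 <- sum(residuals(fit)^2)         # RSS of the fitted model
p    <- length(est)
n    <- length(y)
Fstat <- ((rss0 - rss1) / (p - 1)) / (rss1 / (n - p))
```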
Explore violations of the linear model assumptions
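In R, a first check is usually based on the default diagnostic plots of the fitted model (a minimal sketch, using the fit object above):

```r
# Residuals vs fitted, normal Q-Q, scale-location, residuals vs leverage
par(mfrow = c(2, 2))
plot(fit)

# Residuals against a single explanatory variable (Credit example)
par(mfrow = c(1, 1))
plot(Credit$Limit, residuals(fit), xlab = "Limit", ylab = "Residuals")
abline(h = 0, lty = 2)
```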
What do we need?
Problems of Statistical Inference
By deriving the OLS estimator we have obtained a point estimator and derived its variance (which provides information on how far the estimator is from the unknown parameter \(\beta\))
To obtain the remaining inferential results (interval estimation, hypothesis tests), we need to either
Use the asymptotic theory of least squares or resampling techniques, or
Introduce a further assumption: the errors are independent and distributed according to a \(\mathcal{N}(0, \sigma^2)\)
Assumptions
Linearity \[Y_i = \beta_1 + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} + \varepsilon_i\]
Errors have mean zero, are homoscedastic, normally distributed and independent \[\varepsilon_i \sim \mathcal{N}(0, \sigma^2), \quad \text{independent}, \quad i = 1, \ldots, n\] \[\varepsilon \sim \mathcal{N}_n(0, \Sigma), \quad \text{where} \quad \Sigma = \sigma^2 I\]
Linear independence between explanatory variables
Note
From 1) and 2), we get that the \(Y_i\) are independent with \[Y_i \sim \mathcal{N}(\mu_i, \sigma^2), \quad\text{where} \quad \mu_i = \beta_1 + \beta_2x_{i2} + \ldots +\beta_p x_{ip}, \quad i=1, \ldots, n\]
By introducing the normality assumption for the errors we can derive the distribution of the estimators and, from it, the remaining inferential results (confidence intervals and hypothesis tests)
Note
On Friday, we will introduce the likelihood function and derive the remaining quantities