Intermediate Econometrics

31st October 2025 - Vincenzo Gioia

Statistical Model in Brief

Statistical Model: The Simplest One

  • Independent observations (on \(n\) statistical units) about a random variable of interest
  • The sample is composed of the random variables \(Y_1, \ldots, Y_n\), which are independent and identically distributed according to a certain probability distribution (known up to an unknown parameter) \[Y_i \sim f(\cdot; \theta), \quad \theta \in \Theta \subset \mathbb{R}^d, \quad d \geq 1, \quad i =1, \ldots, n\]

Assumption

  • The random variable is described by a probability mass function (discrete r.v.) or density function (continuous r.v.)
  • \(\theta\): \(d\)-dimensional parameter taking values in the parametric space \(\Theta\)

Statistical Model: The Normal Model

Normal/Gaussian Random Variable

A random variable \(X\) is said to be Normal (or Gaussian) with mean \(E[X] = \mu\) and variance \(V(X) = \sigma^2\) if the density function takes the functional form

\[f_X(x; \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi} \sigma}e^{-\frac{1}{2\sigma^2}(x-\mu)^2}\]

  • We denote it with \(X \sim \mathcal{N}(\mu, \sigma^2)\)

Standard Normal

A particular normal random variable is the standard normal, that is a normal with mean 0 and variance 1

\[f_Z(z) = \phi(z) = \frac{1}{\sqrt{2 \pi}}e^{-\frac{z^2}{2}}\]

  • We (usually) denote it by \(Z \sim \mathcal{N}(0,1)\); it is the core building block for all normal distributions

The Normal Distribution

Relationship between Standard Normal and a Generic Normal

  • If \(Z \sim \mathcal{N}(0,1)\) then \(X = \mu + \sigma Z\) is distributed as \(X\sim \mathcal{N}(\mu, \sigma^2)\)
  • Vice versa, if \(X\sim \mathcal{N}(\mu, \sigma^2)\) then the r.v. \(Z=\frac{X-\mu}{\sigma} \sim \mathcal{N}(0,1)\)

Cumulative Distribution Function

  • \(F_X(x) = P(X \leq x) = P(\mu + \sigma Z \leq x) = P\left(Z \leq \frac{x-\mu}{\sigma} \right) = \Phi(\frac{x-\mu}{\sigma})\) where \(\Phi(\cdot)\) is the cumulative distribution function of \(Z\) (whose values are obtained numerically)

Quantiles of Order \(p \in (0,1)\) of \(X \sim \mathcal{N}(\mu, \sigma^2)\)

  • \(x_p = \mu + \sigma z_p\) where \(z_p= \Phi^{-1}(p)\) is the quantile of order \(p\) of \(Z\)
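A quick numerical check of both relations (a minimal sketch; the choices \(\mu = 100\), \(\sigma = 10\), and \(p = 0.9\) are arbitrary):

# Phi((x - mu)/sigma) equals the N(mu, sigma^2) cdf evaluated at x
pnorm((110 - 100)/10)            # 0.8413447
pnorm(110, mean = 100, sd = 10)  # 0.8413447
# x_p = mu + sigma * z_p equals the N(mu, sigma^2) quantile of order p
100 + 10 * qnorm(0.9)            # 112.81552
qnorm(0.9, mean = 100, sd = 10)  # 112.81552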

The Normal distribution

Characteristics of \(f_X(x)\)

  • It is symmetric, with the line \(x = \mu\) as its axis of symmetry
  • It is increasing on the interval \((-\infty, \mu)\) and decreasing on the interval \((\mu, +\infty)\)
  • It has two inflection points at \(x = \mu - \sigma\) and \(x = \mu + \sigma\)
  • It is concave on the interval \((\mu-\sigma, \mu + \sigma)\) and convex elsewhere
  • It has the x-axis as an asymptote
curve(dnorm(x, mean = 100, sd = 10), xlim = c(70, 130), ylab = "density")
abline(v = 100, lty = "dashed", col = "red")
abline(v = c(90, 110), lty = "dashed", col = "blue")

The Normal distribution

Location Shift

  • \(Z \sim \mathcal{N}(0,1)\)
  • \(X = \mu + Z \sim \mathcal{N}(\mu, 1)\)
  • Example considering \(\mu \in \{-5,0,5,10\}\)
curve(dnorm(x, mean = -5, sd = 1), xlim = c(-10, 15), ylab = "density")
curve(dnorm(x, mean = 0, sd = 1), add = TRUE, col = "darkgreen")
curve(dnorm(x, mean = 5, sd = 1), add = TRUE, col = "gold")
curve(dnorm(x, mean = 10, sd = 1), add = TRUE, col = "darkred")

The Normal distribution

Scaling

  • \(Z \sim \mathcal{N}(0,1)\)
  • \(X = \sigma Z \sim \mathcal{N}(0, \sigma^2)\)
  • Example considering \(\sigma \in \{0.5,1,2,5\}\)
curve(dnorm(x, mean = 0, sd = 0.5), xlim = c(-10, 10), ylab = "density")
curve(dnorm(x, mean = 0, sd = 1), add = TRUE, col = "darkgreen")
curve(dnorm(x, mean = 0, sd = 2), add = TRUE, col = "gold")
curve(dnorm(x, mean = 0, sd = 5), add = TRUE, col = "darkred")

The Normal distribution

Location Shift and Scaling

  • \(Z \sim \mathcal{N}(0,1)\)
  • \(X = \mu + \sigma Z \sim \mathcal{N}(\mu, \sigma^2)\)
  • Example considering all the combinations of \(\mu \in \{0,5\}\) and \(\sigma \in \{1,2\}\)
curve(dnorm(x, mean = 0, sd = 1), xlim = c(-10, 10), ylab = "density")
curve(dnorm(x, mean = 0, sd = 2), add = TRUE, col = "darkgreen")
curve(dnorm(x, mean = 5, sd = 1), add = TRUE, col = "gold")
curve(dnorm(x, mean = 5, sd = 2), add = TRUE, col = "darkred")

The Normal distribution

The Cumulative Distribution Function \(F_X(x)\)

  • \(Z \sim \mathcal{N}(0,1)\)
  • \(X = \mu + \sigma Z \sim \mathcal{N}(\mu, \sigma^2)\)
  • Example considering all the combinations of \(\mu \in \{0,5\}\) and \(\sigma \in \{1,2\}\)
curve(pnorm(x, mean = 0, sd = 1), xlim = c(-10, 10), ylab = "cumulative distribution function")
curve(pnorm(x, mean = 0, sd = 2), add = TRUE, col = "darkgreen")
curve(pnorm(x, mean = 5, sd = 1), add = TRUE, col = "gold")
curve(pnorm(x, mean = 5, sd = 2), add = TRUE, col = "darkred")

The Normal distribution

The Density Function \(f_X(x; \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi} \sigma}e^{-\frac{1}{2\sigma^2}(x-\mu)^2}\)

  • In R, the density function is obtained by means of the dnorm() function
curve(dnorm(x, mean = 100, sd = 10), xlim = c(80,120), ylab = "density")
dnorm(110, mean = 100, sd = 10)
[1] 0.02419707
y <- dnorm(110, mean = 100, sd = 10)
segments(x0 = 110, y0 = 0, x1 = 110, y1 = y, col = "red", lwd = 2)
segments(x0 = 80, y0 = y, x1 = 110, y1 = y, col = "red", lwd = 2)
points(110, y, col = "red", pch = 19)

The Normal distribution

The Cumulative Distribution Function \(F_X(x; \mu, \sigma^2) = P(X \leq x) = \Phi\left(\frac{x-\mu}{\sigma}\right)\)

  • In R, the cumulative distribution function is obtained by means of the pnorm() function
curve(pnorm(x, mean = 100, sd = 10), xlim = c(80,120), ylab = "cumulative distribution function")
pnorm(110, mean = 100, sd = 10)
[1] 0.8413447
y <- pnorm(110, mean = 100, sd = 10)
segments(x0 = 110, y0 = 0, x1 = 110, y1 = y, col = "red", lwd = 2)
segments(x0 = 80, y0 = y, x1 = 110, y1 = y, col = "red", lwd = 2)
points(110, y, col = "red", pch = 19)

The Normal distribution

Quantiles of Order \(p \in (0,1)\) of \(X \sim \mathcal{N}(\mu, \sigma^2)\)

  • \(x_p = \mu + \sigma z_p\) where \(z_p= \Phi^{-1}(p)\) is the quantile of order \(p\) of \(Z\)
  • In R, the quantiles are obtained by means of the qnorm() function
curve(pnorm(x, mean = 100, sd = 10), xlim = c(80,120), ylab = "cumulative distribution function")
qnorm(c(0.1,0.5,0.9), mean = 100, sd = 10) 
[1]  87.18448 100.00000 112.81552
x <- qnorm(c(0.1,0.5,0.9), mean = 100, sd = 10)
segments(x0 = c(0,0,0), y0 = c(0.1, 0.5, 0.9), x1 = x, y1 = c(0.1, 0.5, 0.9), col = "red", lwd = 2)
segments(x0 = x, y0 = 0, x1 = x, y1 = c(0.1, 0.5, 0.9), col = "red", lwd = 2)

The Normal distribution

Random Generation

  • We can generate random draws from the \(\mathcal{N}(\mu, \sigma^2)\) distribution
set.seed(14)
sim <- rnorm(1000, mean = 100, sd = 10)
hist(sim, breaks = 40, freq = FALSE)
curve(dnorm(x, mean = 100, sd = 10), add = TRUE)
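As a quick sanity check (a sketch; any large number of draws would do), the sample moments and the empirical cumulative distribution function should be close to their theoretical counterparts:

mean(sim); sd(sim)  # close to 100 and 10, respectively
plot(ecdf(sim))
curve(pnorm(x, mean = 100, sd = 10), add = TRUE, col = "red")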

Statistical Model: The Normal Model

The Normal (or Gaussian) Model

  • We assume that \(Y_1, \ldots, Y_n\) are random variables independent and identically distributed according to the \(\mathcal{N}(\mu, \sigma^2)\) distribution
  • Parameter: \(\theta = (\mu, \sigma^2) \in \Theta = \mathbb{R} \times \mathbb{R}_{+}\)
  • The most popular estimators of \(\mu\) and \(\sigma^2\) are the sample mean and the sample variance, respectively: \[\overline Y = \frac{1}{n} \sum_{i=1}^{n} Y_i \quad \quad S^2 = \frac{1}{n-1} \sum_{i=1}^{n} (Y_i-\overline Y)^2\]
  • Both estimators are unbiased and consistent (unbiased, with variance tending to zero as \(n \to \infty\), hence converging in probability to the true value)
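A small Monte Carlo check of unbiasedness (a sketch; the sample size, number of replications, and parameter values below are arbitrary choices):

set.seed(1)
est <- replicate(5000, {
  y <- rnorm(20, mean = 5, sd = 2)  # a sample of size n = 20 from N(5, 4)
  c(mean(y), var(y))                # sample mean and sample variance
})
rowMeans(est)  # averages over replications: close to (5, 4)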

The Normal Model: Inferential Results

Inference on \(\theta=(\mu, \sigma^2)\)

  • \(\overline Y\) and \(S^2\) are independent
  • \(\overline Y \sim \mathcal{N}(\mu, \sigma^2/n)\) and \((n-1)S^2/\sigma^2 \sim \chi^2_{n-1}\), where \(\chi^2_{\nu}\) is the Chi-square distribution with \(\nu\) degrees of freedom

Pivotal Quantity (Focus on \(\mu\))

  • \(T= \frac{\overline Y - \mu}{S/\sqrt{n}} \sim t_{n-1}\), where \(t_\nu\) denotes the Student’s t distribution with \(\nu\) degrees of freedom

Confidence Interval for \(\mu\) at level \(1-\alpha\)

  • For \(0<\alpha <1\) \[{\rm Pr}(t_{n-1;\alpha/2} \leq T \leq t_{n-1;1-{\alpha/2}})=1-\alpha \]
  • The realization of the confidence interval is \[IC^{1-\alpha}_{\mu} = (\bar y - t_{n-1; 1-\alpha/2}\, s/\sqrt{n},\; \bar y + t_{n-1; 1-\alpha/2}\, s/\sqrt{n})\]
  • where \(t_{n-1;\alpha/2} = -t_{n-1;1-\alpha/2}\) and \(t_{n-1;1-\alpha/2}\) are the quantiles of order \(\alpha/2\) and \(1-\alpha/2\) of a Student’s t distribution with \(n-1\) degrees of freedom, respectively.
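The pivotal distribution can be verified by simulation (a sketch with arbitrary settings: samples of size \(n = 10\) from \(\mathcal{N}(5, 4)\)):

set.seed(2)
Tstat <- replicate(5000, {
  y <- rnorm(10, mean = 5, sd = 2)
  (mean(y) - 5) / (sd(y) / sqrt(10))  # pivotal quantity T
})
hist(Tstat, breaks = 40, freq = FALSE)
curve(dt(x, df = 9), add = TRUE, col = "red")  # t with n - 1 = 9 df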

Statistical Model: Inferential Results

Confidence Intervals: Example

  • Let us generate \(B=1000\) samples of size \(n=10\) from a \(\mathcal{N}(\mu=5, \sigma^2 = 4)\) and fix \(\alpha=0.05\)
mu <- 5; sigma <- 2; B <- 1000; n <- 10; alpha <- 0.05
CI <- matrix(0, B, 2)
l <- rep(0, B)

set.seed(23)
q0975 <- qt(1 - alpha/2, n - 1)

for (i in 1 : B){
 x <- rnorm(n, mu, sigma)
 CI[i, ] <- mean(x) + c(-q0975, q0975) * sd(x)/sqrt(n)
 l[i] <- (mu > CI[i, 1] & mu < CI[i, 2])  # 1 if the CI covers the true mean
}

Statistical Model: Inferential Results

# Empirical coverage probability
mean(l)
[1] 0.956
# Plot of the (first 100) CIs
plot(1, xlim = c(0,10), ylim = c(0,11), type = "n", 
     xlab = expression(mu), ylab = "", yaxt  = "n", 
     main = "100 CIs for the mean (unknown variance)", cex.main = 1.2)
abline(v = mu)

d <- 0
for (i in 1 : 100){
  d <- d + 0.1 
  lines(seq(CI[i, 1], CI[i, 2], length = 100), rep(d, 100), col = (l[i] + 1))
}

# number of intervals (out of the 100) including the true parameter value
sum(l[1 : 100]) 
[1] 94

The Normal Model: Inferential Results

Hypothesis test

\[\begin{cases} H_0: \mu = \mu_0 \\ H_1: \mu \neq \mu_0 \end{cases}\]

  • The rejection region of level \(\alpha\) associated with the test is \[R_{\alpha} = \left\{(y_1, \ldots, y_n): \left| \frac{\bar y - \mu_0}{s/\sqrt{n}}\right| > t_{n-1;1-\alpha/2} \right\}\]

  • The p-value (the probability, under the specified statistical model, that a statistical summary of the data would be equal to or more extreme than its observed value) is \[\text{p-value} = {\rm Pr}\left(|T| > \left| \frac{\bar y - \mu_0}{s/\sqrt{n}}\right|\right)\]

The Normal Model: Inferential Results

Example

To check whether the declared weight on a package corresponds to the truth, a pasta manufacturer analyzes the weight of a sample of \(n=10\) packages of pasta. The packages have a declared weight of \(500 g\), and it is reasonable to assume that the weight of a package is a normally distributed random variable. To verify that the machine does not produce packages with a different weight, it is checked whether the sample mean weight deviates too much from \(500\); if it does, production is suspended.

To decide when the sample mean weight is too far from 500, a 5% rejection region is used. The hypothesis system is: \[\begin{cases}H_0: \mu = 500 \\H_1: \mu\neq 500 \end{cases}\] The sample has given the following measurements: \[497.0621,\, 497.7690, \, 497.2631,\, 498.6688,\, 494.6227,\, 496.2185,\, 498.2635,\, 502.7199,\, 497.4339,\, 494.2883 \]

The Normal Model: Inferential Results

y <- c(497.0621, 497.7690, 497.2631, 498.6688, 494.6227, 496.2185, 498.2635, 502.7199, 497.4339, 494.2883)
t_obs <- (mean(y) - 500)/(sd(y)/sqrt(length(y))) 
curve(dt(x, 9), xlim = c(-5, 5), ylim = c(0, 0.4), 
      main = "Rejection region", col = "blue", 
      lwd = 2, xlab = "",  ylab = expression(t[9]),  yaxs="i")
cord.x0 <- c(-5, seq(-5, qt(0.025, 9), 0.01),  qt(0.025, 9))
cord.y0 <- c(0, dt(seq(-5,qt(0.025, 9), 0.01), 9), 0)
cord.x1 <- c(qt(0.975, 9),seq(qt(0.975, 9), 5, 0.01), 5)
cord.y1 <- c(0, dt(seq(qt(0.975, 9), 5, 0.01), 9), 0)
polygon(cord.x0, cord.y0, col = "darkred", border = NA )
polygon(cord.x1, cord.y1, col = "darkred", border = NA )

abline(v = t_obs, lty = 2, lwd = 2, col = "red")
text(0, 0.2, paste("Accept", expression(H0)))
text(c(-2.7,2.7), 0.08, paste("Reject", expression(H0)), col = "darkred")
text(as.double(t_obs) - 0.15, 0.02, "t", col = "red", cex = 1.2)

The Normal Model: Inferential Results

curve(dt(x, 9), xlim = c(-5, 5), ylim = c(0, 0.4), 
      main = "P-value", col = "blue", 
      lwd = 2, xlab = "",  ylab = expression(t[9]),  yaxs="i")
cord.x0 <- c(-5, seq(-5, t_obs, 0.01), t_obs)
cord.y0 <- c(0, dt(seq(-5,t_obs, 0.01), 9), 0)
polygon(cord.x0, cord.y0, col = "darkgreen", border = NA )

abline(v = t_obs, lty = 2, lwd = 2, col = "red")
text(as.double(t_obs) - 0.15, 0.02, "t", col = "red", cex = 1.2)
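The manual statistic and the corresponding p-value can be cross-checked against R’s built-in one-sample test (t.test() implements exactly this procedure):

t_obs                        # observed test statistic
2 * pt(-abs(t_obs), df = 9)  # two-sided p-value
t.test(y, mu = 500)          # should report the same t statistic and p-value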

Statistical Model: Inferential Results

Likelihood Inference

  • Let’s assume we are interested in making inference on an unknown parameter \(\theta\)

  • Given a random sample \(y=(y_1, \ldots, y_n)\) of size \(n\) from \(Y\), distributed according to the model \(f(y; \theta)\) with \(\theta \in \Theta \subset \mathbb{R}^d, d \geq 1\), we define the likelihood function for \(\theta\) as \(\mathcal{L}(\theta) = f(y_1, \ldots, y_n; \theta)\)

  • It is not a probability function: it is a function of \(\theta\), defined on \(\Theta\) and taking values in \(\mathbb{R}^{+}\)

  • Under random sampling, by the independence of the random variables (and their identical distribution), we have \[\mathcal{L}(\theta) = f(y_1, \ldots, y_n; \theta) = \prod_{i=1}^{n}f(y_i; \theta)\]

  • For convenience, it is better to work with the log-likelihood function

\[\ell(\theta) = \log \mathcal{L}(\theta) = \sum_{i=1}^{n} \log f(y_i; \theta)\]

Statistical Model: Inferential Results

Maximum Likelihood

  • A value \(\hat \theta \in \Theta\) such that \(\ell(\hat \theta) \geq \ell(\theta)\) for all \(\theta \in \Theta\) is called the maximum likelihood estimate (the random variable \(\hat \theta(Y)\) is called the maximum likelihood estimator)
  • Under regularity conditions (which we assume throughout), the maximum likelihood estimate is obtained by solving \[\ell'(\theta) = \frac{\partial \ell(\theta)}{\partial \theta} = 0\] where the vector \(\ell'(\theta)\) is the score vector
  • Finally, \(\hat \theta\) is consistent and has asymptotic distribution \(\hat \theta \stackrel{\cdot} \sim \mathcal{N}(\theta, j(\hat \theta)^{-1})\), where \[j(\theta) = -\ell''(\theta) = -\frac{\partial^2 \ell(\theta)}{\partial \theta \partial \theta^\top}\] is the observed information matrix

Statistical Model: Inferential Results

Hypothesis Test

  • To test the hypothesis \(H_0: \theta = \theta_0\) versus \(H_1: \theta \neq \theta_0\), the test statistic is the log-likelihood ratio \[W(\theta_0) = 2(\ell(\hat \theta) - \ell(\theta_0))\] whose asymptotic null distribution is \(\chi^2_d\)

  • Confidence Region: \(\{\theta: W(\theta) \leq \chi^2_{d;1-\alpha}\}\)

  • In the normal linear model, assuming normality of the errors, \(W(\theta)\) is a monotonically increasing function of an F (Fisher) or t (Student’s t) statistic, and the results are reported in those terms
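An empirical check of the asymptotic \(\chi^2_d\) distribution in the normal model, where \(d = 2\) (a sketch; the simulation settings are arbitrary and the MLEs are plugged in in closed form):

set.seed(1)
loglik <- function(mu, s2, y) sum(dnorm(y, mu, sqrt(s2), log = TRUE))
W <- replicate(2000, {
  y <- rnorm(30, mean = 5, sd = 2)          # true theta_0 = (5, 4)
  hat <- c(mean(y), mean((y - mean(y))^2))  # closed-form MLEs
  2 * (loglik(hat[1], hat[2], y) - loglik(5, 4, y))
})
mean(W > qchisq(0.95, df = 2))  # rejection rate under H0: approximately 0.05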

The Normal model: Likelihood Inference

Likelihood Function

\[\mathcal{L}(\theta) = \mathcal{L}(\mu, \sigma^2) = \prod_{i=1}^{n}f(y_i; \mu, \sigma^2)= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2\sigma^2} (y_i - \mu)^2} = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2}\] \[\ell(\mu, \sigma^2) = \log \mathcal{L}(\theta) = -\frac{n}{2} \log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\]

Score Vector: \(\ell'(\theta) = (s_1, s_2)\)

\[s_1 = \frac{\partial \ell(\theta)}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i-\mu)\] \[s_2 = \frac{\partial \ell(\theta)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(y_i-\mu)^2\]

The Normal model: Likelihood Inference

Maximum Likelihood Estimate

  • From \(s_1 = 0\) we obtain \(\hat \mu = \bar y\)
  • From \(s_2 = 0\) (and substituting \(\mu\) with \(\hat \mu\)), we obtain \(\hat \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar y)^2\)

Note

  • The latter is biased, but it can be expressed in terms of the unbiased variance estimator through the relationship \(\hat \sigma^2 = \frac{(n-1)S^2}{n}\)
  • In this case, we have an explicit solution for the maximum likelihood estimates. In other cases, we must obtain the estimates numerically (for instance, via the Newton-Raphson algorithm); a sketch follows
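A minimal Newton-Raphson sketch for the normal model, using the score from the previous slide together with the corresponding observed information (the information entries are derived here, not shown in the slides; the starting values and the simulated data are arbitrary):

nr_normal <- function(y, mu = 0, s2 = 1, tol = 1e-8, maxit = 100) {
  n <- length(y)
  for (it in 1:maxit) {
    # score vector (s1, s2)
    s <- c(sum(y - mu) / s2,
           -n / (2 * s2) + sum((y - mu)^2) / (2 * s2^2))
    # observed information j(theta) = minus the Hessian of the log-likelihood
    j <- matrix(c(n / s2, sum(y - mu) / s2^2,
                  sum(y - mu) / s2^2,
                  -n / (2 * s2^2) + sum((y - mu)^2) / s2^3), 2, 2)
    step <- solve(j, s)  # Newton-Raphson update
    mu <- mu + step[1]; s2 <- s2 + step[2]
    if (sqrt(sum(step^2)) < tol) break
  }
  c(mu = mu, sigma2 = s2, iterations = it)
}

set.seed(1)
y <- rnorm(50, mean = 10, sd = 2)
nr_normal(y, mu = mean(y), s2 = 1)  # converges to (ybar, mean((y - ybar)^2))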

The Normal model: Likelihood Inference

Optimization

  • In R, there are several routines for numerical optimization (which, by convention, minimize a function): optim(), nlminb(), \(\ldots\)
  • We use one of them to obtain the maximum likelihood estimates, comparing the result with the analytic solution
  • First, we write down the negative log-likelihood function (up to an additive constant)
nllik <- function(par, data) {
  mu <- par[1]       # mean
  sigma2 <- par[2]   # variance
  n <- length(data)
  # negative log-likelihood, dropping the additive constant (n/2) log(2 pi)
  nll <-  0.5 * n * log(sigma2) + sum((data - mu)^2) / (2 * sigma2)
  return(nll)
}

Statistical Model: The Normal model

Optimization

  • Let’s generate a sample of size \(n\) from a \(\mathcal{N}(\mu = 10, \sigma^2 = 4)\)
  • Minimize the negative log-likelihood using nlminb(), which requires the starting values for the algorithm, the function to minimize, the data, and the bounds of the parameter space (if the parameters are constrained)
set.seed(13)
n <- 50
x <- rnorm(n, mean = 10, sd = 2)
start_vals <- c(mu = 0, sigma2 = 1)

fit <- nlminb(start = start_vals,
              objective = nllik,
              data = x,
              lower = c(-Inf, 1e-6))
fit$par
      mu   sigma2 
9.947162 3.726903 
mean(x)
[1] 9.947163
var(x)*(n-1)/n 
[1] 3.726902
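The asymptotic variance \(j(\hat\theta)^{-1}\) seen earlier can also be approximated numerically; a sketch using optimHess(), which computes the Hessian of the objective at the optimum (here the objective is the negative log-likelihood, so its Hessian is the observed information):

J <- optimHess(fit$par, nllik, data = x)  # observed information at the MLE
sqrt(diag(solve(J)))                      # asymptotic standard errors
# For comparison, the analytic standard error of mu-hat is sqrt(sigma2-hat/n)
sqrt(fit$par[2]/n)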

The Normal Model: Linear Regression

Linear Regression Model

  • The settings above can be straightforwardly extended to linear regression models
  • Let’s consider a simple linear regression \(Y_i = \beta_1 + \beta_2 x_i +\varepsilon_i\), where \(\varepsilon_i\sim\mathcal{N}(0, \sigma^2)\) independently
  • Note that the difference is that the observations are no longer identically distributed, since \(Y_i \sim \mathcal{N}(\mu_i, \sigma^2)\) with \(\mu_i=\beta_1 + \beta_2 x_i\)
  • Let’s adapt the negative log-likelihood function to this case
nllik_lm <- function(par, x, y) {
  beta0 <- par[1]    # intercept
  beta1 <- par[2]    # slope
  sigma2 <- par[3]   # error variance
  mu <- beta0 + beta1 * x
  n <- length(y)
  # negative log-likelihood, dropping the additive constant (n/2) log(2 pi)
  nll <-  0.5 * n * log(sigma2) + sum((y - mu)^2) / (2 * sigma2)
  return(nll)
}

The Normal Model: Linear Regression

Linear Regression Model

  • Let’s generate some data and fit the model
n <- 100
x <- runif(n, 0, 10)
beta0 <- 2; beta1 <- 0.8; sigma <- 1.5
y <- beta0 + beta1 * x + rnorm(n, 0, sigma)
start_vals <- c(beta0 = 0, beta1 = 0, sigma2 = 1)

fit <- nlminb(
  start = start_vals,
  objective = nllik_lm,
  x = x,
  y = y,
  lower = c(-Inf, -Inf, 1e-6)
)

fit$par
    beta0     beta1    sigma2 
2.1476594 0.7798585 2.6494339 
fitLM <- lm(y ~ x)
fitLM$coefficients
(Intercept)           x 
  2.1476602   0.7798585 
sum(fitLM$residuals^2)/n
[1] 2.649434

The Normal Model: Linear Regression

Linear Regression Model

  • As we noticed, the \(\beta\) estimates obtained via maximum likelihood are the same as those obtained via OLS.
  • Indeed, from \[\mathcal{L}(\beta, \sigma^2) = \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi} \sigma} e^{-\frac{1}{2\sigma^2} (y_i - \beta_1 x_{i1} - \ldots - \beta_p x_{ip})^2}\] \[ \propto (\sigma^2)^{-n/2} e^{-\frac{1}{2\sigma^2} \sum_{i=1}^{n}(y_i - \beta_1 x_{i1} - \ldots - \beta_p x_{ip})^2}\] \[ \propto (\sigma^2)^{-n/2} e^{-\frac{1}{2\sigma^2} (y - X \beta)^\top (y-X\beta)}\] \[\ell(\beta, \sigma^2) = - \frac{n}{2}\log(\sigma^2)-\frac{1}{2\sigma^2} (y - X \beta)^\top (y-X\beta)\]
  • Note that the terms not depending on the parameters have been dropped, hence the proportionality symbol \(\propto\)

The Normal Model: Linear Regression

\(\beta\) Estimate

  • For fixed \(\sigma^2\), \(\ell(\beta, \sigma^2)\) is maximized when the residual sum of squares \(RSS(\beta) = (y - X \beta)^\top (y-X\beta)\) is minimized, that is

\[\hat \beta = \operatorname*{argmin}_{\beta} \mathrm{RSS}(\beta) = \operatorname*{argmax}_{\beta} \ell({\beta}, \sigma^2)\]

  • Then, the maximum likelihood estimate is equal to the OLS estimate
    \[\hat \beta_{ML} = \hat \beta_{OLS} = (X^\top X)^{-1} X^\top y\]
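A direct check of this formula (a sketch reusing the x and y simulated above; note that lm() itself uses a QR decomposition rather than the explicit inverse):

X <- cbind(1, x)  # design matrix with intercept
beta_hat <- solve(t(X) %*% X) %*% t(X) %*% y
drop(beta_hat)    # matches coef(lm(y ~ x))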

The Normal model: Linear Regression

\(\sigma^2\) Estimate

  • By substituting \(\hat \beta\) into \(\ell(\beta, \sigma^2)\) we have \[\ell(\hat \beta, \sigma^2) = - \frac{n}{2}\log(\sigma^2)-\frac{1}{2\sigma^2} (y - X \hat \beta)^\top (y-X\hat \beta)\] \[= - \frac{n}{2}\log(\sigma^2)-\frac{1}{2\sigma^2} RSS(\hat \beta) = - \frac{n}{2}\log(\sigma^2)-\frac{1}{2\sigma^2}\sum_{i=1}^{n}e^2_i\]
  • Differentiating with respect to \(\sigma^2\) and equating to zero, we obtain

\[\hat \sigma^2= \frac{1}{n}\sum_{i=1}^{n}e^2_i = \frac{1}{n} e^\top e\]

The Normal model: Linear Regression

Distribution of \(\hat \beta\)

  • Since \(Y\sim \mathcal{N}_n(\mu, \sigma^2I)\) with \(\mu = X\beta\), the estimator \(\hat \beta = (X^\top X)^{-1} X^\top Y\) is a linear transformation of a multivariate normal. So

\[{\hat \beta} \sim \mathcal{N}_p(\beta, \sigma^2(X^\top X)^{-1})\]

  • By the properties of the multivariate normal distribution, the marginals are also normally distributed; in particular

\[\hat \beta_r \sim \mathcal{N} (\beta_r, \sigma^2(X^\top X)^{-1}_{rr}), \quad r=1, \ldots, p\] where \((X^\top X)^{-1}_{rr}\) denotes the \(r\)-th diagonal element of \((X^\top X)^{-1}\).

The Normal model: Linear Regression

Distribution of \(\hat \sigma^2\)

  • The estimator \(\hat \sigma^2\) is biased, while the unbiased estimator is \[S^2 = \frac{n}{n-p}\hat \sigma^2\] for which \[\frac{(n-p)S^2}{\sigma^2} \sim \chi^2_{n-p}\]

Joint Distribution of \((\hat \beta, \hat \sigma^2)\)

  • \(\hat \beta\) is independent of \(\hat \sigma^2\), and hence \(\hat \beta\) is independent of \(S^2\)

Normal Linear Regression: Inference

Inference on a Single Coefficient

  • From the results just outlined, we can obtain pivotal quantities to make inference on a single coefficient \[t_r = \frac{\hat \beta_r - \beta_r}{\sqrt{S^2(X^\top X)^{-1}_{rr}}} \sim t_{n-p}\]
  • Confidence interval of level \(1-\alpha\) for \(\beta_r\) \[IC^{1-\alpha}_{\beta_r}=(\hat \beta_r - t_{n-p; 1- \alpha/2} \sqrt{S^2((X^\top X)^{-1}_{rr})}, \hat \beta_r + t_{n-p; 1- \alpha/2} \sqrt{S^2((X^\top X)^{-1}_{rr})})\]
  • Hypothesis test (nullity of a single coefficient): \[\begin{cases}H_0: \beta_r = 0 \\H_1: \beta_r \neq 0 \end{cases}\] \[t_r = \hat \beta_r/\sqrt{S^2(X^\top X)^{-1}_{rr}} \stackrel{H_0} \sim t_{n-p}\] \[\mathcal{R}_\alpha = \{|t^{obs}_r| > t_{n-p; 1-\alpha/2}\}\] \[\text{p-value} = P(|t_{n-p}| > |t^{obs}_{r}|) = 2P(t_{n-p}\leq -|t^{obs}_{r}|)\]

Normal Linear Regression: Inference

library(ISLR)
data("Credit")
lmFit <- lm(Balance ~ Limit + Student, data = Credit)
summary(lmFit)$coefficients
                Estimate   Std. Error   t value      Pr(>|t|)
(Intercept) -334.7299372 23.069301674 -14.50976  1.417949e-38
Limit          0.1719538  0.004330826  39.70463 2.558391e-140
StudentYes   404.4036438 33.279679039  12.15167  4.181612e-29
X <- model.matrix(Balance ~ Limit + Student, data = Credit)
n <- nrow(X)
p <- ncol(X)
hat_s2 <- sum(residuals(lmFit)^2)/(nrow(Credit) - p)

tval <- lmFit$coefficients/(sqrt(hat_s2 * diag(solve(t(X) %*% X))))
tval
(Intercept)       Limit  StudentYes 
  -14.50976    39.70463    12.15167 
2*pt(-abs(tval), n - p)
  (Intercept)         Limit    StudentYes 
 1.417949e-38 2.558391e-140  4.181612e-29 
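The confidence-interval formula can be checked in the same way (a sketch using the objects computed above; confint() is R’s built-in equivalent):

se <- sqrt(hat_s2 * diag(solve(t(X) %*% X)))  # standard errors of the coefficients
cbind(lmFit$coefficients - qt(0.975, n - p) * se,
      lmFit$coefficients + qt(0.975, n - p) * se)
confint(lmFit)  # should agree with the manual computation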

Normal Linear Regression: Inference

Inference on a Pool of Coefficients (A Special Case)

  • Let’s suppose we want to assess the model overall, that is, to test whether none of the predictors explains the variability of \(Y\). The hypothesis system is \[\begin{cases}H_0: \beta_2 = \ldots = \beta_p = 0 \\ H_1: \beta_r \neq 0 \quad \text{for at least one} \,\, r = 2, \ldots, p\end{cases}\]
  • In this case we distinguish between \[ \text{Full model } (\mathcal{M}_1): Y_i = \beta_1 + \beta_2 x_{i2} + \ldots + \beta_p x_{ip} + \varepsilon_i\] \[ \text{Reduced model } (\mathcal{M}_0): Y_i = \beta_1 + \varepsilon_i\]
  • Under \(\mathcal{M}_1\) the estimate of the error variance is \(\hat \sigma^2 = \frac{1}{n}e^\top e\), while under \(\mathcal{M}_0\) it is simply \(\tilde \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar y)^2\)
  • The test statistic is \[F = \frac{\frac{\tilde \sigma^2 - \hat \sigma^2}{p-1}}{\frac{\hat \sigma^2}{n-p}} \stackrel{H_0}\sim F_{p-1, n-p}\]
  • The p-value is given by \(\text{p-value} = {\rm Pr}(F_{p-1,n-p} > f^{obs})\)

Normal Linear Regression: Inference

hat_sigma2 <- sum(residuals(lmFit)^2)/n
tilde_sigma2 <- var(Credit$Balance)*(n-1)/n
fval <- ((tilde_sigma2 - hat_sigma2)/(p-1))/((hat_sigma2)/(n-p))
fval
[1] 859.1893
summary(lmFit)$fstatistic
   value    numdf    dendf 
859.1893   2.0000 397.0000 
pf(fval, p-1, n-p, lower = FALSE)  
[1] 5.893567e-145
summary(lmFit)

Call:
lm(formula = Balance ~ Limit + Student, data = Credit)

Residuals:
    Min      1Q  Median      3Q     Max 
-637.77 -116.90    6.04  130.92  434.24 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -3.347e+02  2.307e+01  -14.51   <2e-16 ***
Limit        1.720e-01  4.331e-03   39.70   <2e-16 ***
StudentYes   4.044e+02  3.328e+01   12.15   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 199.7 on 397 degrees of freedom
Multiple R-squared:  0.8123,    Adjusted R-squared:  0.8114 
F-statistic: 859.2 on 2 and 397 DF,  p-value: < 2.2e-16
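The same F test can be reproduced with anova() by explicitly comparing the reduced and full models (a sketch using the standard R interface):

m0 <- lm(Balance ~ 1, data = Credit)  # reduced model: intercept only
anova(m0, lmFit)                      # reports the same F = 859.19 and its p-value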