31st October 2025 - Vincenzo Gioia
Statistical Model: The Simplest One
Assumption
Normal/Gaussian Random Variable
A random variable \(X\) is said to be Normal (or Gaussian) with mean \(E[X] = \mu\) and variance \(V(X) = \sigma^2\) if its density function is
\[f_X(x; \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi} \sigma}e^{-\frac{1}{2\sigma^2}(x-\mu)^2}\]
Standard Normal
A particular normal random variable is the standard normal, that is, a normal with mean 0 and variance 1:
\[f_Z(z) = \phi(z) = \frac{1}{\sqrt{2 \pi}}e^{-\frac{z^2}{2}}\]
Relationship between Standard Normal and a Generic Normal
If \(X \sim \mathcal{N}(\mu, \sigma^2)\), then \(Z = (X - \mu)/\sigma \sim \mathcal{N}(0, 1)\); conversely, \(X = \mu + \sigma Z\).
Cumulative Distribution Function
Quantiles of Order \(p \in (0,1)\) of \(X \sim \mathcal{N}(\mu, \sigma^2)\)
Characteristics of \(f_X(x)\)
The density is bell-shaped and symmetric about \(\mu\), with inflection points at \(\mu \pm \sigma\).
Location Shift
Changing \(\mu\) shifts the density along the \(x\)-axis without changing its shape.
Scaling
Changing \(\sigma\) stretches or compresses the density around \(\mu\) without moving its center.
Location Shift and Scaling
Together, \(\mu\) and \(\sigma\) determine the position and the spread of the density.
The Cumulative Distribution Function \(F_X(x)\)
The Density Function \(f_X(x; \mu, \sigma^2) = \frac{1}{\sqrt{2 \pi} \sigma}e^{-\frac{1}{2\sigma^2}(x-\mu)^2}\)
curve(dnorm(x, mean = 100, sd = 10), xlim = c(80,120), ylab = "density")
dnorm(110, mean = 100, sd = 10)
[1] 0.02419707
y <- dnorm(110, mean = 100, sd = 10)
segments(x0 = 110, y0 = 0, x1 = 110, y1 = y, col = "red", lwd = 2)
segments(x0 = 80, y0 = y, x1 = 110, y1 = y, col = "red", lwd = 2)
points(110, y, col = "red", pch = 19)
The Cumulative Distribution Function \(F_X(x; \mu, \sigma^2) = P(X \leq x) = \Phi\left(\frac{x-\mu}{\sigma}\right)\)
curve(pnorm(x, mean = 100, sd = 10), xlim = c(80,120), ylab = "cumulative distribution function")
pnorm(110, mean = 100, sd = 10)
[1] 0.8413447
y <- pnorm(110, mean = 100, sd = 10)
segments(x0 = 110, y0 = 0, x1 = 110, y1 = y, col = "red", lwd = 2)
segments(x0 = 80, y0 = y, x1 = 110, y1 = y, col = "red", lwd = 2)
points(110, y, col = "red", pch = 19)
Quantiles of Order \(p \in (0,1)\) of \(X \sim \mathcal{N}(\mu, \sigma^2)\)
curve(pnorm(x, mean = 100, sd = 10), xlim = c(80,120), ylab = "cumulative distribution function")
qnorm(c(0.1,0.5,0.9), mean = 100, sd = 10)
[1]  87.18448 100.00000 112.81552
x <- qnorm(c(0.1,0.5,0.9), mean = 100, sd = 10)
segments(x0 = c(0,0,0), y0 = c(0.1, 0.5, 0.9), x1 = x, y1 = c(0.1, 0.5, 0.9), col = "red", lwd = 2)
segments(x0 = x, y0 = 0, x1 = x, y1 = c(0.1, 0.5, 0.9), col = "red", lwd = 2)
Random Generation
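A sketch of random generation from this model (the calls and the seed are assumptions; the original slide's code is not shown):
set.seed(1)                                      # assumption: for reproducibility
sample_norm <- rnorm(1000, mean = 100, sd = 10)  # 1000 draws from N(100, 100)
hist(sample_norm, freq = FALSE, main = "Random generation from N(100, 100)")
curve(dnorm(x, mean = 100, sd = 10), add = TRUE, col = "red", lwd = 2)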
The Normal (or Gaussian) Model
Inference on \(\theta=(\mu, \sigma^2)\)
Pivotal Quantity (Focus on \(\mu\))
\[T = \frac{\bar Y - \mu}{S/\sqrt{n}} \sim t_{n-1}\]
Confidence Interval for \(\mu\) at level \(1-\alpha\)
\[\left(\bar y - t_{n-1;1-\alpha/2}\, \frac{s}{\sqrt{n}}, \;\; \bar y + t_{n-1;1-\alpha/2}\, \frac{s}{\sqrt{n}}\right)\]
Confidence Intervals: Example
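The simulation that produced the coverage below is not shown in this excerpt; a minimal sketch consistent with the plot code (the seed, the true parameter values, and the number of replications are assumptions):
set.seed(123)                          # assumption: seed not shown
mu <- 5; sigma <- 2                    # assumed true parameters
n <- 10; R <- 1000                     # assumed sample size and replications
CI <- matrix(NA, R, 2)                 # CI[i, ] = i-th confidence interval
l <- numeric(R)                        # 1 if the i-th interval covers mu
for (i in 1:R) {
  y <- rnorm(n, mu, sigma)
  half <- qt(0.975, n - 1) * sd(y) / sqrt(n)
  CI[i, ] <- c(mean(y) - half, mean(y) + half)
  l[i] <- (CI[i, 1] <= mu) & (mu <= CI[i, 2])
}
mean(l)                                # empirical coverage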
[1] 0.956
# Plot of the (first 100) CIs
plot(1, xlim = c(0,10), ylim = c(0,11), type = "n",
xlab = expression(mu), ylab = "", yaxt = "n",
main = "100 CIs for the mean (unknown variance)", cex.main = 1.2)
abline(v = mu)
d <- 0
for (i in 1 : 100){
d <- d + 0.1
lines(seq(CI[i, 1], CI[i, 2], length = 100), rep(d, 100), col = (l[i] + 1))
}
[1] 94
Hypothesis test
\[\begin{cases} H_0: \mu = \mu_0 \\ H_1: \mu \neq \mu_0 \end{cases}\]
The rejection region of level \(\alpha\) associated with the test is \[R_{\alpha} = \left\{(y_1, \ldots, y_n): \left| \frac{\bar y - \mu_0}{s/\sqrt{n}}\right| > t_{n-1;1-\alpha/2} \right\}\]
The p-value (the probability under a specified statistical model that a statistical summary of the data would be equal to or more extreme than its observed value) is \[\text{p-value} = \Pr\left(|T| > \left| \frac{\bar y - \mu_0}{s/\sqrt{n}}\right|\right), \quad T \sim t_{n-1}\]
Example
To check whether the declared weight on a package corresponds to the truth, a pasta manufacturer analyzes the weight of a sample of \(n=10\) packages of pasta. The packages have a declared weight of \(500\) g, and it is reasonable to assume that the weight of a package is a normally distributed random variable. To verify that the machine does not produce packages with a different weight, it is checked whether the sample mean weight deviates too much from \(500\); if it does, production is suspended.
To decide when the sample mean weight is too far from 500, a 5% rejection region is used. The hypothesis system is: \[\begin{cases}H_0: \mu = 500 \\H_1: \mu\neq 500 \end{cases}\] The sample has given the following measurements: \[497.0621,\, 497.7690, \, 497.2631,\, 498.6688,\, 494.6227,\, 496.2185,\, 498.2635,\, 502.7199,\, 497.4339,\, 494.2883 \]
y <- c(497.0621, 497.7690, 497.2631, 498.6688, 494.6227, 496.2185, 498.2635, 502.7199, 497.4339, 494.2883)
t_obs <- (mean(y) - 500)/(sd(y)/sqrt(length(y)))
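As a check (not in the original code), the two-sided p-value can be computed directly, and t.test() reproduces both the statistic and the p-value:
2 * pt(-abs(t_obs), df = length(y) - 1)   # two-sided p-value by hand
t.test(y, mu = 500)                       # one-sample t test of H0: mu = 500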
curve(dt(x, 9), xlim = c(-5, 5), ylim = c(0, 0.4),
main = "Rejection region", col = "blue",
lwd = 2, xlab = "", ylab = expression(t[9]), yaxs="i")
cord.x0 <- c(-5, seq(-5, qt(0.025, 9), 0.01), qt(0.025, 9))
cord.y0 <- c(0, dt(seq(-5,qt(0.025, 9), 0.01), 9), 0)
cord.x1 <- c(qt(0.975, 9),seq(qt(0.975, 9), 5, 0.01), 5)
cord.y1 <- c(0, dt(seq(qt(0.975, 9), 5, 0.01), 9), 0)
polygon(cord.x0, cord.y0, col = "darkred", border = NA )
polygon(cord.x1, cord.y1, col = "darkred", border = NA )
abline(v = t_obs, lty = 2, lwd = 2, col = "red")
text(0, 0.2, paste("Accept", expression(H0)))
text(c(-2.7,2.7), 0.08, paste("Reject", expression(H0)), col = "darkred")
text(as.double(t_obs) - 0.15, 0.02, "t", col = "red", cex = 1.2)
curve(dt(x, 9), xlim = c(-5, 5), ylim = c(0, 0.4),
main = "P-value", col = "blue",
lwd = 2, xlab = "", ylab = expression(t[9]), yaxs="i")
cord.x0 <- c(-5, seq(-5, t_obs, 0.01), t_obs)
cord.y0 <- c(0, dt(seq(-5,t_obs, 0.01), 9), 0)
polygon(cord.x0, cord.y0, col = "darkgreen", border = NA )
abline(v = t_obs, lty = 2, lwd = 2, col = "red")
text(as.double(t_obs) - 0.15, 0.02, "t", col = "red", cex = 1.2)
Likelihood Inference
Suppose we are interested in inference on an unknown parameter \(\theta\).
Given a random sample \(y=(y_1, \ldots, y_n)\) of size \(n\) from \(Y\), distributed according to the model \(f(y; \theta)\) with \(\theta \in \Theta \subset \mathbb{R}^d\), \(d \geq 1\), the likelihood function for \(\theta\) is defined as \(\mathcal{L}(\theta) = f(y_1, \ldots, y_n; \theta)\), viewed as a function of \(\theta\) for the observed \(y\).
It is not a probability function: it is a function of \(\theta\), defined on \(\Theta\) and taking values in \(\mathbb{R}^{+}\).
Under random sampling, since the random variables are independent and identically distributed, we have \[\mathcal{L}(\theta) = f(y_1, \ldots, y_n; \theta) = \prod_{i=1}^{n}f(y_i; \theta)\]
For convenience, it is often easier to work with the log-likelihood function
\[\ell(\theta) = \log \mathcal{L}(\theta) = \sum_{i=1}^{n} \log f(y_i; \theta)\]
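As a concrete illustration (a sketch, not part of the original slides): for a normal sample, \(\ell(\theta)\) is the sum of log-densities, one line in R via dnorm(..., log = TRUE).
set.seed(2)
y_sim <- rnorm(50, mean = 5, sd = 2)     # illustrative simulated sample
loglik <- function(mu, sigma2, y) {
  sum(dnorm(y, mean = mu, sd = sqrt(sigma2), log = TRUE))
}
loglik(5, 4, y_sim)   # log-likelihood at the true parameter values
loglik(0, 4, y_sim)   # much lower: parameter values far from the data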
Maximum Likelihood
Hypothesis Test
To test the hypothesis \(H_0: \theta = \theta_0\) versus \(H_1: \theta \neq \theta_0\), the test statistic is the log-likelihood ratio \[W(\theta_0) = 2(\ell(\hat \theta) - \ell(\theta_0))\] whose asymptotic null distribution is \(\chi^2_d\)
Confidence Region: \(\{\theta: W(\theta) \leq \chi^2_{d;1-\alpha}\}\)
In the normal linear model, assuming normality of the errors, \(W(\theta)\) is a monotone increasing function of an \(F\) statistic (Fisher) or of a \(t\) statistic (Student's \(t\)), and the results are reported in those terms, as in the sketch below
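For the pasta example above, \(W(\mu_0)\) can be computed by profiling out \(\sigma^2\) (a sketch, assuming the vector y from that example): since \(\hat\sigma^2(\mu_0) = \frac{1}{n}\sum_i (y_i - \mu_0)^2\), the statistic reduces to \(W(\mu_0) = n \log(\mathrm{RSS}_0/\mathrm{RSS}_1)\), a monotone increasing function of the squared t statistic.
n <- length(y)
rss1 <- sum((y - mean(y))^2)            # RSS at the unrestricted MLE
rss0 <- sum((y - 500)^2)                # RSS with mu fixed at mu0 = 500
W <- n * log(rss0 / rss1)               # profile log-likelihood ratio
pchisq(W, df = 1, lower.tail = FALSE)   # asymptotic p-value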
Likelihood Function
\[\mathcal{L}(\theta) = \mathcal{L}(\mu, \sigma^2) = \prod_{i=1}^{n}f(y_i; \mu, \sigma^2)= \prod_{i=1}^{n} \frac{1}{\sqrt{2\pi}\sigma}e^{-\frac{1}{2\sigma^2} (y_i - \mu)^2} = \left(\frac{1}{2\pi\sigma^2}\right)^{n/2} e^{-\frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2}\] \[\ell(\mu, \sigma^2) = \log \mathcal{L}(\theta) = -\frac{n}{2} \log(2\pi) - \frac{n}{2}\log(\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^{n}(y_i-\mu)^2\]
Score Vector: \(\ell'(\theta) = (s_1, s_2)\)
\[s_1 = \frac{\partial \ell(\theta)}{\partial \mu} = \frac{1}{\sigma^2}\sum_{i=1}^{n}(y_i-\mu)\] \[s_2 = \frac{\partial \ell(\theta)}{\partial \sigma^2} = -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4}\sum_{i=1}^{n}(y_i-\mu)^2\]
Maximum Likelihood Estimate
Setting the score to zero, \(s_1 = 0\) gives \(\hat \mu = \bar y\), and \(s_2 = 0\) then gives \(\hat \sigma^2 = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar y)^2\).
Note
Optimization
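Numerical maximization of \(\ell(\mu, \sigma^2)\) with optim(), using the simulated sample y_sim from the sketch above (a minimal sketch; the function name and starting values are assumptions). optim() minimizes, so the negative log-likelihood is passed:
nll_norm <- function(par, y) {
  # negative normal log-likelihood; par = c(mu, sigma2)
  -sum(dnorm(y, mean = par[1], sd = sqrt(par[2]), log = TRUE))
}
opt <- optim(par = c(0, 1), fn = nll_norm, y = y_sim,
             method = "L-BFGS-B", lower = c(-Inf, 1e-6))
opt$par                                        # numerical MLE of (mu, sigma2)
c(mean(y_sim), mean((y_sim - mean(y_sim))^2))  # closed-form MLEs, for comparison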
Linear Regression Model
n <- 100
x <- runif(n, 0, 10)
beta0 <- 2; beta1 <- 0.8; sigma <- 1.5
y <- beta0 + beta1 * x + rnorm(n, 0, sigma)
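The objective nllik_lm passed to nlminb() below is not defined in this excerpt; a minimal version consistent with the call and with the parameter names in start_vals (an assumption):
nllik_lm <- function(par, x, y) {
  # negative log-likelihood of y = beta0 + beta1 * x + eps, eps ~ N(0, sigma2)
  mu <- par[1] + par[2] * x
  -sum(dnorm(y, mean = mu, sd = sqrt(par[3]), log = TRUE))
}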
start_vals <- c(beta0 = 0, beta1 = 0, sigma2 = 1)
fit <- nlminb(
start = start_vals,
objective = nllik_lm,
x = x,
y = y,
lower = c(-Inf, -Inf, 1e-6)
)
fit$par
    beta0     beta1    sigma2 
2.1476594 0.7798585 2.6494339
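The outputs below compare these estimates with lm(): the first two values are the OLS coefficients and the third is the MLE \(\hat\sigma^2 = \mathrm{RSS}/n\). The producing calls are assumptions consistent with the printed output:
coef(lm(y ~ x))               # OLS coefficients, equal to the MLE of (beta0, beta1)
mean(residuals(lm(y ~ x))^2)  # MLE of sigma2: RSS/n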
(Intercept) x
2.1476602 0.7798585
[1] 2.649434
Linear Regression Model
\(\beta\) Estimate
\[\hat \beta = \operatorname*{argmin}_{\beta} \mathrm{RSS}(\beta) = \operatorname*{argmax}_{\beta} \ell({\beta}, \sigma^2)\]
\(\sigma^2\) Estimate
\[\hat \sigma^2= \frac{1}{n}\sum_{i=1}^{n}e^2_i = \frac{1}{n} e^\top e\]
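In matrix form, \(\hat\beta = (X^\top X)^{-1} X^\top y\). A short check on the simulated data above (a sketch; the variable names are assumptions):
X <- cbind(1, x)                                 # design matrix with intercept
beta_hat <- solve(crossprod(X), crossprod(X, y)) # solve the normal equations
drop(beta_hat)                                   # matches coef(lm(y ~ x))
e <- y - X %*% beta_hat                          # residuals
sum(e^2) / length(y)                             # MLE of sigma2: e'e / n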
Distribution of \(\hat \beta\)
\[{\hat \beta} \sim \mathcal{N}_p(\beta, \sigma^2(X^\top X)^{-1})\]
\[\hat \beta_r \sim \mathcal{N} (\beta_r, \sigma^2(X^\top X)^{-1}_{rr}), \quad r=1, \ldots, p\] where \((X^\top X)^{-1}_{rr}\) denotes the \(r\)-th diagonal element of \((X^\top X)^{-1}\).
Distribution of \(\hat \sigma^2\)
Since \(n\hat\sigma^2 = \mathrm{RSS}\), we have \(n\hat\sigma^2/\sigma^2 \sim \chi^2_{n-p}\).
Joint Distribution of \((\hat \beta, \hat \sigma^2)\)
\(\hat\beta\) and \(\hat\sigma^2\) are independent, which yields the Student's \(t\) distribution of the standardized coefficients used below.
Inference on a Single Coefficient
library(ISLR)
data("Credit")
lmFit <- lm(Balance ~ Limit + Student, data = Credit)
summary(lmFit)$coefficients
               Estimate   Std. Error   t value      Pr(>|t|)
(Intercept) -334.7299372 23.069301674 -14.50976 1.417949e-38
Limit 0.1719538 0.004330826 39.70463 2.558391e-140
StudentYes 404.4036438 33.279679039 12.15167 4.181612e-29
X <- model.matrix(Balance ~ Limit + Student, data = Credit)
n <- nrow(X)
p <- ncol(X)
hat_s2 <- sum(residuals(lmFit)^2)/(nrow(Credit) - p)
tval <- lmFit$coefficients/(sqrt(hat_s2 * diag(solve(t(X) %*% X))))
tval
(Intercept)       Limit  StudentYes 
  -14.50976    39.70463    12.15167
-14.50976 39.70463 12.15167
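The two-sided p-values below follow from the \(t_{n-p}\) reference distribution (the producing call is an assumption consistent with the printed output):
2 * pt(abs(tval), df = n - p, lower.tail = FALSE)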
(Intercept) Limit StudentYes
1.417949e-38 2.558391e-140 4.181612e-29
Inference on a Pool of Coefficients (A Special Case)
hat_sigma2 <- sum(residuals(lmFit)^2)/n
tilde_sigma2 <- var(Credit$Balance)*(n-1)/n
fval <- ((tilde_sigma2 - hat_sigma2)/(p-1))/((hat_sigma2)/(n-p))
fval
[1] 859.1893
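The same statistic is stored in the model summary (assumed call, consistent with the printed values):
summary(lmFit)$fstatistic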
value numdf dendf
859.1893 2.0000 397.0000
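and its p-value follows from the \(F_{p-1,\, n-p}\) reference distribution (assumed call):
pf(fval, df1 = p - 1, df2 = n - p, lower.tail = FALSE)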
[1] 5.893567e-145
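The full summary reports the same t statistics, F statistic, and p-values (assumed call):
summary(lmFit)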
Call:
lm(formula = Balance ~ Limit + Student, data = Credit)
Residuals:
Min 1Q Median 3Q Max
-637.77 -116.90 6.04 130.92 434.24
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) -3.347e+02 2.307e+01 -14.51 <2e-16 ***
Limit 1.720e-01 4.331e-03 39.70 <2e-16 ***
StudentYes 4.044e+02 3.328e+01 12.15 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 199.7 on 397 degrees of freedom
Multiple R-squared: 0.8123, Adjusted R-squared: 0.8114
F-statistic: 859.2 on 2 and 397 DF, p-value: < 2.2e-16