Central Limit Theorem (CLT)

Let \(X_1, X_2, \ldots, X_n\) be a random sample of independent and identically distributed (i.i.d.) random variables (r.v.s) of size \(n\) from a distribution with mean \(\mu\) and variance \(\sigma^2\). Then, for large \(n\), the sample average \(\overline X_n= (\sum_{i=1}^{n} X_i)/n\) satisfies \[\overline X_n \, \dot \sim \, \mathcal N \bigg (\mu, \frac{\sigma^2}{n} \bigg)\] where \(\dot \sim\) denotes "approximately distributed as" (convergence in distribution). The theorem can also be stated in terms of \(T_n=\sum_{i=1}^{n}X_i=n \overline X_n\), that is \[T_n \, \dot \sim \, \mathcal{N}(n \mu, n \sigma^2) \]

The CLT justifies the approximate normality of a r.v. that can be viewed as the sum of many other independent r.v.s.
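
As a minimal numerical sketch of the statement (with arbitrarily chosen \(n = 50\) and \(\mathcal{E}(1)\) draws, for which \(\mu = \sigma^2 = 1\)), the simulated sample means should have mean close to \(1\) and standard deviation close to \(1/\sqrt{50}\):

set.seed(1)
B <- 10000; n <- 50
xbar <- replicate(B, mean(rexp(n, rate = 1)))  # B simulated sample means
c(mean(xbar), 1)          # empirical vs theoretical mean
c(sd(xbar), 1/sqrt(n))    # empirical vs theoretical standard deviation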

Approximation with CLT: application

The approximation above is useful in statistics for computing certain quantities. For instance, let \(X\) and \(Y\) be two independent r.v.s such that \(X \sim \mathcal{B}_i(n_1,p_1)\) and \(Y \sim \mathcal{B}_i(n_2,p_2)\): how can we easily compute the probability \(\rm{Pr}(X>Y)\)?

The normal approximation of the binomial distribution is the simplest way to do it.

\[X \, \approx \, \mathcal{N}(\mu_X = n_1p_1, \sigma^2_X=n_1p_1(1-p_1)) \hspace{1cm} Y \, \approx \, \mathcal{N}(\mu_Y=n_2p_2, \sigma^2_Y=n_2p_2(1-p_2)) \] Denoting by \(W= X-Y\), we know that \(W\) is also approximately normal, with
\[\mu_W = \mu_X - \mu_Y\] and, by independence, \[ \sigma^2_W=\sigma^2_X+\sigma^2_Y \] Then we have \(W \, \approx \, \mathcal{N}(\mu_W=n_1p_1-n_2p_2, \, \sigma^2_W=n_1p_1(1-p_1)+n_2p_2(1-p_2))\)

Approximation with CLT: real application with waterpolo goals

Tomorrow, two professional Italian waterpolo teams, Posillipo and Pro Recco, compete against each other. Let \(X\) and \(Y\) be the r.v.s denoting the goals scored by the two teams. Let’s assume that \(X\) and \(Y\) are independent and follow binomial distributions, namely \(X \sim \mathcal{B}_i(n_1,p_1)\) and \(Y \sim \mathcal{B}_i(n_2,p_2)\):

  • \(X\): number of shots converted into goals out of the total number of shots \(n_1\), each with probability \(p_1\) (Posillipo)

  • \(Y\): number of shots converted into goals out of the total number of shots \(n_2\), each with probability \(p_2\) (Pro Recco)

Before a match, the numbers of shots (\(n_1,n_2\)) are unknown. We treat the quantities \(p_1, p_2, n_1, n_2\) as known, for instance fixing them based on historical data:
\(p_1=0.5, \quad p_2=0.7, \quad n_1=20, \quad n_2=20.\)

What is the probability that Posillipo will win the next match against Pro Recco: \(\rm{Pr}(X>Y)=\rm{Pr}(X-Y>0)=?\)

Denoting by \(W=X-Y\) and using the formulas above, we can approximate \(W \, \approx \, \mathcal N(\mu_W=\mu_X-\mu_Y, \sigma^2_W=\sigma^2_X+\sigma^2_Y)\).

p1 <- 0.5; p2 <- 0.7; n1 <- 20; n2 <- 20
mean_w <- n1*p1 - n2*p2                      # mu_W = mu_X - mu_Y
sigma2_w <- n1*p1*(1 - p1) + n2*p2*(1 - p2)  # sigma^2_W = sigma^2_X + sigma^2_Y
Prob_win_Posillipo <- pnorm(0, mean = mean_w, sd = sqrt(sigma2_w), lower.tail = FALSE)  # Pr(W > 0)
Prob_win_Posillipo

#Alternatively
#Prob_win_Posillipo <- 1 - pnorm(0, mean = mean_w, sd = sqrt(sigma2_w))
## [1] 0.09362452
# pmfs of X (black) and Y (red), with the normal approximation of W = X - Y (blue)
plot(1, ylim = c(0,0.3), xlim = c(-12,20), ylab = "f(x)", xlab = "x", type = "n")
points(0:n1, dbinom(0:n1, n1, p1), pch = 21, bg = 1)
text(10.25, 0.20, "X", cex = 2, col = "black")
points(0:n2, dbinom(0:n2, n2, p2), pch = 21, bg = 2)
text(14.25, 0.23, "Y", cex = 2, col = "red")

curve(dnorm(x, p1*n1 - p2*n2, sqrt(n1*p1*(1 - p1) + n2*p2*(1 - p2))), add = TRUE, col = "blue", lty = "dashed")
text(-4.0, 0.16, "W", cex = 2, col = 4)
segments(0, 0, 0, dnorm(0, mean_w, sqrt(sigma2_w)), col = "black")

Waterpolo goals: what if we use a Poisson distribution?

Rather than specifying in advance the unknown total numbers of shots and the shot-conversion probabilities (four parameters in all), we may consider a different distribution that accounts only for the scoring intensity, regardless of the number of shots.

In this respect, the Poisson distribution seems suitable. We assume two independent Poisson distributions for the numbers of goals in the upcoming match, namely \(X \sim \mathcal{P}(\lambda_1)\) and \(Y \sim \mathcal{P}(\lambda_2)\).

We need to specify the rates \(\lambda_1\) and \(\lambda_2\). Based on previous knowledge, we fix \(\lambda_1 = 5\) and \(\lambda_2 = 7.5\).

What is the probability that Posillipo will win the next match against Pro Recco: \(\rm{Pr}(X>Y)=\rm{Pr}(X-Y>0)=?\)

The difference between two independent Poisson r.v.s follows a known probability distribution, the Poisson difference or Skellam distribution. Then \(W=X-Y\sim \mathcal{PD}(\lambda_1, \lambda_2)\), with \(\mu_W=\lambda_1 - \lambda_2\) and \(\sigma^2_W=\lambda_1 + \lambda_2\) (see [https://www.jstor.org/stable/2981372?seq=1#metadata_info_tab_contents]).

library(skellam)
lambda1 <- 5; lambda2 <- 7.5

Prob_win_Posillipo_S <- pskellam(0, lambda1, lambda2, lower.tail=FALSE)
Prob_win_Posillipo_S
## [1] 0.1961603
# pmfs of X (black) and Y (red), with the Skellam distribution of W = X - Y (blue)
plot(1, ylim = c(0,0.3), xlim = c(-12,20), ylab = "f(x)", xlab = "x", type = "n")
points(0:n1, dpois(0:n1, lambda1), pch = 21, bg = 1)
text(4, 0.25, "X", cex = 2, col = "black")
points(0:n2, dpois(0:n2, lambda2), pch = 21, bg = 2)
text(7, 0.20, "Y", cex = 2, col = "red")
points(-12:20, dskellam(-12:20, lambda1, lambda2), col = "blue")
text(-3.0, 0.16, "W", cex = 2, col = 4)
segments(0, 0, 0, dskellam(0, lambda1, lambda2), col = "black")

Monte Carlo (MC) simulation

Simulation is a well-known technique designed to approximate a process and retrieve general results by observing the process several times. We rely on the so-called MC simulation, a wide class of simulation procedures based on sampling i.i.d. values from a process — precisely, from the underlying “true” probability distribution of the process — and computing numerical outputs. The steps of an MC simulation are:

  • generate \(B\) independent samples from the assumed “true” distribution
  • compute the quantity of interest (e.g., a sample statistic) on each sample
  • use the \(B\) resulting values to approximate the distribution of that quantity, or summaries of it

We will examine the behaviour of some sample statistics (the mean and the variance) through their sampling distributions.
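
As a minimal illustration of these steps (using an example where the target is known exactly, so the MC answer can be checked), we can approximate \(\rm{Pr}(X > 2)\) for \(X \sim \mathcal{E}(1)\) and compare it with the true value \(e^{-2}\):

set.seed(123)
B <- 10000
x <- rexp(B, rate = 1)   # sample B i.i.d. values from the "true" distribution
mean(x > 2)              # MC approximation of Pr(X > 2)
pexp(2, rate = 1, lower.tail = FALSE)  # exact value, exp(-2)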

Some relationships among probability distributions: the inverse sampling method

Let \(U_1, \ldots, U_n\) be independent r.v.s distributed as \(Unif(0,1)\). We can verify that the r.v.

\[X=-\frac{\log(U_1)}{\lambda} \sim \mathcal{E}(\lambda) \]

This follows from the inverse sampling method: the exponential quantile function is \(F^{-1}(u)=-\log(1-u)/\lambda\), and \(1-U_1\) is itself distributed as \(Unif(0,1)\).

Briefly:

  • generate a value from a standard uniform
  • obtain \(x\) from the previous expression, having fixed \(\lambda\) in advance
  • repeat the process \(B\) times
B <- 10000
lambda <- 2
x <- rep(0, B)

for(i in 1:B){
 x[i] <- -log(runif(1,0,1))/lambda
}

library(MASS) # provides hist.scott
par(mfrow = c(1,2))
hist.scott(x,  xlab = "x", prob = TRUE)
curve(dexp(x, rate = lambda), from = 1e-16, col = "red", add = TRUE)

plot(ecdf(x))
curve(pexp(x, rate = lambda), col = "red", add = TRUE)

Now, we verify that the r.v. \(Z=\min\{X_1, \ldots, X_n\} \, \sim \, \mathcal{E}(n \lambda)\), with \(X_i=-\log(U_i)/\lambda\), \(i=1, \ldots, n\).

(Recall that if \(X_1, \ldots, X_n\) are independent r.v.s such that \(X_i \, \sim \, \mathcal{E}(\lambda_i)\), then \(Z=\min\{X_1, \ldots, X_n\} \, \sim \, \mathcal{E}(\sum_{i=1}^{n}\lambda_i)\).)

B <- 10000
lambda <- 2
n <- 10
x <- rep(0, n)
z <- rep(0, B)

for(i in 1:B){
  x <- -log(runif(n,0,1))/lambda
  z[i] <- min(x)
}

par(mfrow = c(1,2))
hist.scott(z, xlab="z", prob = TRUE)
curve(dexp(x, rate = n*lambda), from = 1e-16, col = "red", add = TRUE)

plot(ecdf(z), xlab="z")
curve(pexp(x, rate = n*lambda), col = "red", add = TRUE)

It’s your turn Now, let \(U_1, \ldots, U_n\) be independent r.v.s distributed as \(Unif(0,1)\), and let \(X_i=-\log(U_i)/\lambda\), \(i=1, \ldots, n\). Verify that the r.v. \(Y= \sum_{i=1}^{n} X_i\, \sim \, \mathcal{Ga}(n, \lambda)\).

B <- 10000
lambda <- 2
n <- 10
x <- rep(0, n)
y <- rep(0, B)

for(i in 1:B){
  x <- -log(runif(n,0,1))/lambda
  y[i] <- sum(x)
}

par(mfrow = c(1,2))
hist.scott(y,  xlab = "y", prob = TRUE)
curve(dgamma(x, shape = n, rate = lambda), col = "red", add = TRUE)
#Alternatively
#curve(dgamma(x, n, scale = 1/lambda), col="red", add=TRUE)

plot(ecdf(y), xlab = "y")
curve(pgamma(x, shape = n, rate = lambda), col = "red", add = TRUE)

Now, we try increasing \(n\):

B <- 10000
lambda <- 2
n <- 30
x <- rep(0, n)
y <- rep(0, B)

for(i in 1:B){
  x <- -log(runif(n,0,1))/lambda
  y[i] <- sum(x)
}

par(mfrow=c(1,2))
hist.scott(y,  xlab = "y", prob = TRUE)
curve(dgamma(x, shape = n, rate = lambda), col="red", add=TRUE)
curve(dnorm(x, mean = n/lambda, sd = sqrt(n/lambda^2)), col="blue", add=TRUE)

plot(ecdf(y), xlab = "y")
curve(pgamma(x, shape = n, rate = lambda), col = "red", add=TRUE)
curve(pnorm(x, mean = n/lambda, sd = sqrt(n/lambda^2)), col = "blue", add = TRUE)

Exercise (optional) Generate values from \(X \, \sim \, \mathrm{Weibull}(\alpha,\mu)\), with \(\alpha, \mu > 0\), using the inverse sampling method. The cumulative distribution function is \[F_X(x)=\begin{cases} 0 & x \leq 0 \\ 1-e^{-(x/\mu)^\alpha} & x > 0\end{cases} \]
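
A possible solution sketch: solving \(u = 1-e^{-(x/\mu)^\alpha}\) for \(x\) gives \(x=\mu(-\log(1-u))^{1/\alpha}\), so it suffices to plug standard uniform draws into this expression (here \(\alpha = 2\) and \(\mu = 1.5\) are arbitrary illustrative values).

B <- 10000
alpha <- 2; mu <- 1.5              # arbitrary illustrative values
u <- runif(B)
x <- mu * (-log(1 - u))^(1/alpha)  # inverse of the Weibull cdf

par(mfrow = c(1,2))
hist.scott(x, xlab = "x", prob = TRUE)
curve(dweibull(x, shape = alpha, scale = mu), col = "red", add = TRUE)

plot(ecdf(x))
curve(pweibull(x, shape = alpha, scale = mu), col = "red", add = TRUE)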

Normal and Chi-square distributions

Let \(X_1, X_2, \ldots, X_n\) be i.i.d. r.v.s from \(X \sim \mathcal{N}(0,1)\) and let \(T_n=\sum_{i=1}^{n}{X^2_i}\). We know that \(T_n \sim \chi^2_n\). Assuming \(n=10\), verify this relationship via Monte Carlo simulation.

n <- 10 # n observations
B <- 10000 # B simulations
chisq.mc <- rep(0, B)

set.seed(1234)

for(j in 1:B){
 x <- rnorm(n, 0, 1)
 chisq.mc[j] <-  sum(x^2) 
}

par(mfrow = c(1, 2))
hist.scott(chisq.mc, prob = TRUE, xlab = "x")
curve(dchisq(x, df = n), add = TRUE, col = "red", lwd = 1.5)


plot(ecdf(chisq.mc), xlim = c(0,25))
curve(pchisq(x, df = n), add = TRUE, col = "red", lwd = 1)

It’s your turn Following the setup already seen (\(T_{10} \sim \chi^2_{10}\)), write a function that compares the following sample statistics with their theoretical counterparts:

  • sample mean
  • sample variance
  • sample quantiles of order 0.01 and 0.05

Such a function should:

  • take as input the number of simulations, say \(B\)
  • return as output a table with the sample and the theoretical results (4 rows and 2 columns)
compare_MC_chisq <- function(B){
  set.seed(B)
  n <- 10 # n observations
  chisq.mc <- rep(0, B)
  for(j in 1:B){
    x <- rnorm(n, 0, 1)
    chisq.mc[j] <-  sum(x^2) 
  }
  mean_chisq <- mean(chisq.mc)
  var_chisq <- var(chisq.mc)
  quantile_chisq <- quantile(chisq.mc, probs = c(0.01, 0.05))
  
  chisq.truevalues <- c(n, 2*n, qchisq(p = c(0.01, 0.05), df=n))
  chisq.mc.results <- c(mean_chisq, var_chisq, quantile_chisq)
  
  return(data.frame(MC = chisq.mc.results, TrueVal= chisq.truevalues, row.names = c("mean","var", "q0.01", "q0.05")))
  
  #res <- matrix(0,4,2)
  #colnames(res) <- c("MC_res", "Trueval")
  #res[, 1] <- chisq.mc.results
  #res[, 2] <- chisq.truevalues
  #rownames(res) <- c("mean", "var", "q0.01", "q0.05")
  #return(res)
}

compare_MC_chisq(100)
##              MC   TrueVal
## mean  10.613318 10.000000
## var   23.329811 20.000000
## q0.01  2.832739  2.558212
## q0.05  4.019324  3.940299
compare_MC_chisq(1000)
##              MC   TrueVal
## mean   9.898044 10.000000
## var   19.532186 20.000000
## q0.01  2.213470  2.558212
## q0.05  4.018309  3.940299
compare_MC_chisq(10000)
##              MC   TrueVal
## mean   9.967561 10.000000
## var   19.746209 20.000000
## q0.01  2.697681  2.558212
## q0.05  4.004759  3.940299

Distribution of the sample mean

Let’s suppose:

  • \(x_1, \ldots, x_n\) random sample from \(\mathcal{N}(0,1)\)
  • \(x_1, \ldots, x_n\) random sample from \(t_3\)
  • \(x_1, \ldots, x_n\) random sample from \(\mathcal{Unif}(0,1)\)

What is the distribution of the sample mean \(\overline X= (\sum_{i=1}^{n}X_i)/n\)? Before answering, let’s investigate this issue via Monte Carlo simulation. Fix \(B=1000\) and \(n=30\). Remember that:

  • if \(X \sim t_k\) then \(\mu_X=0\), provided that \(k > 1\), and \(\sigma^2_X=k/(k-2)\), provided that \(k > 2\)
  • if \(X \sim \mathcal{Unif}(a,b)\) then \(\mu_X=(a+b)/2\) and \(\sigma^2_X=(b-a)^2/12\)
n <- 30 # n observations
B <- 1000 # B simulations

set.seed(1234)

# sample generations
samples <- array(0, dim = c(n, 3, B))
for(i in 1:B){
  samples[, 1, i] <- rnorm(n, 0, 1)
  samples[, 2, i] <- rt(n, df = 3)
  samples[, 3, i] <- runif(n, 0, 1)
}

#Sample statistics
samples_stat <- array(0, dim = c(2, 3, B))
for(j in 1:3){
  #sample mean
  samples_stat[1, j, ] <- apply(samples[,j,], 2, mean)
  #sample variance
  samples_stat[2, j, ] <- apply(samples[,j,], 2, var)
}

par(mfrow = c(1, 3))
hist.scott(samples_stat[1,1,], prob = TRUE, xlab = "x", main = "N(0,1)")
curve(dnorm(x, 0, 1/sqrt(n)), col = "red", lwd = 2, add = TRUE)

hist.scott(samples_stat[1,2,], prob = TRUE, xlab = "x", main = "t3")
curve(dnorm(x, 0, sqrt(3/n)), col = "red", lwd = 2, add = TRUE)

hist.scott(samples_stat[1,3,], prob = TRUE, xlab = "x", main = "U(0,1)")
curve(dnorm(x, 1/2, 1/sqrt(12*n)), col = "red", lwd = 2, add = TRUE)

From the theory:

  • in the Gaussian case the sample mean follows the normal distribution exactly, namely \(\overline X \, \sim \, \mathcal{N}(\mu, \sigma^2/n)\)
  • for the other distributions, by the CLT, this result holds asymptotically, that is \(\overline X \, \dot \sim \, \mathcal{N}(\mu, \sigma^2/n)\)

Exercise (optional) Try using the empirical cumulative distribution function and the quantile-quantile plot as alternatives to the histogram check.
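
A possible sketch for the Gaussian case (the \(t_3\) and uniform cases are analogous, with the corresponding means and variances):

par(mfrow = c(1,2))
plot(ecdf(samples_stat[1, 1, ]), main = "N(0,1)")       # ecdf of the B sample means
curve(pnorm(x, 0, 1/sqrt(n)), col = "red", add = TRUE)  # theoretical cdf

qqnorm(samples_stat[1, 1, ])                            # normal quantile-quantile plot
qqline(samples_stat[1, 1, ], col = "red")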

Distribution of the sample variance

From theory, \(S^2 = \frac{1}{n-1}{\sum_{i=1}^{n}(X_i - \overline X )^2}\) is an unbiased estimator (\(E(\hat \theta)= \theta\)) of the variance. The function \(\mathsf{var()}\), applied to a sample vector, returns this estimate. In the example above, we have already stored the sample variances. If the assumed true model is normal, we know that the distribution of the sample variance is \[ \frac{n-1}{\sigma^2} S^2 \, \sim \, \chi^2_{n-1} \]

Otherwise, see [https://www.tandfonline.com/doi/abs/10.1080/00031305.2014.966589], Results 13 and 14, for the asymptotic distributions. Namely, \[\frac{S^2}{\sigma^2} k \, \dot \sim \, \chi^2_{k}\] where \(k=\frac{2\sigma^4}{Var(S^2)}\).

sigma2_n <- 1
sigma2_t <- 3 #3/(3-2)
sigma2_u <- 1/12

par(mfrow=c(1,3))
hist.scott(samples_stat[2,1, ], prob=TRUE, xlab=expression(s^2), main = "N(0,1)")
curve((n-1)/sigma2_n *dchisq(x * (n-1)/sigma2_n, df = n - 1), add=TRUE, col="red", lwd=2)

hist.scott(samples_stat[2,2, ], prob=TRUE, xlab=expression(s^2), main = "t(3)", xlim = c(0,25))
curve((n-1)/sigma2_t * dchisq(x * (n-1)/sigma2_t, df = n - 1), add=TRUE, col="red", lwd=2)
k_t <- 2*sigma2_t^2/var(samples_stat[2, 2, ])  # k = 2*sigma^4/Var(S^2), with Var(S^2) estimated via MC
curve(k_t/sigma2_t * dchisq(x * k_t/sigma2_t, df = k_t), add=TRUE, col="green", lwd=2)

hist.scott(samples_stat[2,3, ],  prob=TRUE, xlab=expression(s^2), main = "U(0,1)")
curve((n-1)/sigma2_u * dchisq(x * (n-1)/sigma2_u, df = n - 1), add=TRUE, col="red", lwd=2)
k_u <- 2*sigma2_u^2/var(samples_stat[2, 3, ])  # k = 2*sigma^4/Var(S^2), with Var(S^2) estimated via MC
curve(k_u/sigma2_u * dchisq(x * k_u/sigma2_u, df = k_u), add=TRUE, col="green", lwd=2)

It’s your turn The estimator \(S^2_b=\frac{1}{n} \sum_{i=1}^{n}{(X_i - \overline X)^2}\) is a biased (though asymptotically unbiased) estimator of the variance. Check the biased nature of \(S^2_b\) via MC simulation, generating \(n=10\) i.i.d. values from a normal distribution. Compare the results with the unbiased estimator \(S^2\).

set.seed(1234)
B <- 1000
n <- 10
mu <- 5
sigma <- 2

s2 <- rep(0, B)
s2_b <- rep(0, B)

for(i in 1:B){
 y <- rnorm(n, mu, sigma)
 s2[i] <- var(y)                # unbiased estimate
 s2_b[i] <- s2[i] * (n - 1)/n   # biased estimate
}

s2_m <- mean(s2)      # MC average of the unbiased estimates
s2_b_m <- mean(s2_b)  # MC average of the biased estimates

par(mfrow = c(1,2))
hist.scott(s2, xlab = expression(s^2), 
           main = expression(s^2), prob = TRUE)
abline(v = sigma^2, col = "red")  
abline(v = s2_m, col = "blue", lty=2)  

hist.scott(s2_b, xlab = expression(s[b]^2), 
           main = expression(s[b]^2), prob = TRUE)
abline(v = sigma^2, col = "red")  
abline(v = s2_b_m, col = "blue", lty=2)
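
Beyond the plots, we can compare the two Monte Carlo averages directly with the true value \(\sigma^2 = 4\): the average of \(s^2_b\) should fall systematically below it, roughly by the factor \((n-1)/n = 0.9\).

# numerical check of the bias (values depend on the seed)
c(unbiased = s2_m, biased = s2_b_m, true = sigma^2)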