Let \(X_1, X_2, \ldots, X_n\) be a sample of \(n\) independent and identically distributed (i.i.d.) random variables (r.v.) from a distribution with mean \(\mu\) and variance \(\sigma^2\). Then, for large \(n\), the sample average \(\overline X_n= (\sum_{i=1}^{n} X_i)/n\) satisfies \[\overline X_n \, \dot \sim \, \mathcal N \bigg (\mu, \frac{\sigma^2}{n} \bigg)\] where \(\dot \sim\) denotes "approximately distributed as" (convergence in distribution). The theorem can equivalently be stated in terms of \(T_n=\sum_{i=1}^{n}X_i=n \overline X_n\), that is \[T_n \, \dot \sim \, \mathcal{N}(n \mu, n \sigma^2) \]
The CLT justifies the normal approximation for any r.v. that can be viewed as the sum of many other r.v.
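For a concrete feel of the approximation, here is a minimal sketch (assuming, just for illustration, i.i.d. \(\mathcal{E}(\lambda)\) summands, for which \(T_n \sim \mathcal{Ga}(n, \lambda)\) exactly): we compare the exact probability \(\rm{Pr}(T_n \leq n/\lambda)\) with its CLT approximation.
# Sketch: exact vs CLT-approximate cdf of T_n = sum of n i.i.d. Exp(lambda) r.v.
# (here T_n ~ Gamma(n, lambda) exactly; n and lambda are arbitrary choices)
n <- 30; lambda <- 2
t0 <- n/lambda                                            # evaluation point (the mean of T_n)
exact <- pgamma(t0, shape = n, rate = lambda)             # exact Pr(T_n <= t0)
clt <- pnorm(t0, mean = n/lambda, sd = sqrt(n/lambda^2))  # CLT approximation
c(exact = exact, CLT = clt)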
The approximation above is useful in statistics for computing some quantities. For instance, let \(X\) and \(Y\) be two independent r.v. such that \(X \sim \mathcal{B}_i(n_1,p_1)\) and \(Y \sim \mathcal{B}_i(n_2,p_2)\): how can we easily compute the probability \(\rm{Pr}(X>Y)\)?
The normal approximation of the binomial distribution is the simplest way to do it:
\[X \, \approx \, \mathcal{N}(\mu_X = n_1p_1, \sigma^2_X=n_1p_1(1-p_1)) \hspace{1cm} Y \, \approx \, \mathcal{N}(\mu_Y=n_2p_2, \sigma^2_Y=n_2p_2(1-p_2)) \] Denoting \(W= X-Y\), we know that \(W\) is also (approximately) a normal r.v. with
\[\mu_W = \mu_X - \mu_Y\] and, by independence \[ \sigma^2_W=\sigma^2_X+\sigma^2_Y \] Then, we have \(W \, \approx \, \mathcal{N}(\mu_W=n_1p_1-n_2p_2, \, \sigma^2_W=n_1p_1(1-p_1)+n_2p_2(1-p_2))\)
Tomorrow, two professional Italian water polo teams, Posillipo and Pro Recco, will compete against each other. Let \(X\) and \(Y\) be the r.v. denoting the goals scored by the two teams. Let's assume that \(X\) and \(Y\) are independent and follow Binomial distributions, namely \(X \sim \mathcal{B}_i(n_1,p_1)\) and \(Y \sim \mathcal{B}_i(n_2,p_2)\):
\(X\): number of shots converted into goals out of the total number of shots \(n_1\), each with probability \(p_1\) (Posillipo)
\(Y\): number of shots converted into goals out of the total number of shots \(n_2\), each with probability \(p_2\) (Pro Recco)
Before a match, the numbers of shots (\(n_1,n_2\)) are unknown. We treat the quantities \(p_1, p_2, n_1, n_2\) as known, for instance fixing them on the basis of historical data:
\(p_1=0.5, \quad p_2=0.7, \quad n_1=20, \quad n_2=20.\)
What is the probability that Posillipo will win the next match against Pro Recco, \(\rm{Pr}(X>Y)=\rm{Pr}(X-Y>0)\)?
Denoting with \(W=X-Y\) and using the above formulas, we can approximate \(W \, \approx \, \mathcal N(\mu_W=\mu_X-\mu_Y, \sigma^2_W=\sigma^2_X+\sigma^2_Y)\).
p1 <- 0.5; p2 <- 0.7; n1 <- 20; n2 <- 20
mean_w <- n1*p1 - n2*p2                        # mean of W = X - Y
sigma2_w <- n1*p1*(1 - p1) + n2*p2*(1 - p2)    # variance of W (by independence)
Prob_win_Posillipo <- pnorm(0, mean = mean_w, sd = sqrt(sigma2_w), lower.tail = FALSE)  # Pr(W > 0)
Prob_win_Posillipo
#Alternatively
#Prob_win_Posillipo <- 1 - pnorm(0, mean = mean_w, sd = sqrt(sigma2_w))
## [1] 0.09362452
plot(1, ylim = c(0,0.3), xlim = c(-12,20), ylab = "f(x)", xlab = "x", type = "n")
points(0:n1, dbinom(0:n1, n1, p1), pch = 21, bg=1)
text(10.25, 0.20, "X", cex = 2, col="black")
points(0:n2, dbinom(0:n2, n2, p2), pch = 21, bg=2)
text(14.25, 0.23, "Y", cex = 2, col="red")
curve(dnorm(x, p1*n1 - p2*n2, sqrt(n1*p1*(1 - p1) + n2*p2*(1 - p2)) ), add = TRUE, col = "blue", lty= "dashed")
text(-4.0, 0.16, "W", cex = 2, col= 4)
segments(0,0, 0, dnorm(0, mean_w, sqrt(sigma2_w)), col = "black")
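Since both supports are finite, we can also compute \(\rm{Pr}(X>Y)\) exactly by conditioning on \(Y\) and summing over its support, which gives a sense of the error of the normal approximation (a minimal sketch):
# Exact Pr(X > Y) for independent Binomials: sum_y Pr(Y = y) * Pr(X > y)
y_vals <- 0:n2
Prob_win_exact <- sum(dbinom(y_vals, n2, p2) * pbinom(y_vals, n1, p1, lower.tail = FALSE))
Prob_win_exact   # to be compared with the normal approximation above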
Rather than specifying in advance the total numbers of shots and the conversion probabilities (four parameters overall), we may consider a different distribution that accounts only for the scoring intensity, regardless of the number of shots.
In this respect, the Poisson distribution seems suitable. We assume two independent Poisson distributions for the numbers of goals in the upcoming match, namely \(X \sim \mathcal{P}(\lambda_1)\) and \(Y \sim \mathcal{P}(\lambda_2)\).
We need to specify the rates \(\lambda_1\) and \(\lambda_2\). Based on previous knowledge, we fix \(\lambda_1 = 5\) and \(\lambda_2 = 7.5\).
What is the probability that Posillipo will win the next match against Pro Recco, \(\rm{Pr}(X>Y)=\rm{Pr}(X-Y>0)\)?
The difference between two independent Poisson r.v. follows a known probability distribution, the Poisson difference or Skellam distribution. Then \(W=X-Y\sim \mathcal{PD}(\lambda_1, \lambda_2)\), with \(\mu_W=\lambda_1 - \lambda_2\) and \(\sigma^2_W=\lambda_1 + \lambda_2\) (see [https://www.jstor.org/stable/2981372?seq=1#metadata_info_tab_contents]).
library(skellam)
lambda1 <- 5; lambda2 <- 7.5
Prob_win_Posillipo_S <- pskellam(0, lambda1, lambda2, lower.tail = FALSE)  # Pr(W > 0)
Prob_win_Posillipo_S
## [1] 0.1961603
plot(1, ylim = c(0,0.3), xlim = c(-12,20), ylab = "f(x)", xlab = "x", type = "n")
points(0:n1, dpois(0:n1, lambda1), pch = 21, bg=1)
text(4, 0.25, "X", cex = 2, col="black")
points(0:n2, dpois(0:n2, lambda2), pch = 21, bg=2)
text(7, 0.20, "Y", cex = 2, col="red")
points(-12:20, dskellam(-12:20, lambda1, lambda2),col = "blue")
text(-3.0, 0.16, "W", cex = 2, col= 4)
segments(0,0, 0, dskellam(0, lambda1, lambda2), col = "black")
Simulation is a well-known technique designed to approximate a process and retrieve general results by (notionally) observing the process several times. We rely on the so-called Monte Carlo (MC) simulation, a wide class of simulation procedures based on sampling i.i.d. values from a process (precisely, from the underlying "true" probability distribution of the process) and computing numerical outputs. The steps of an MC simulation are:
We will study the behaviour of some sample statistics (mean and variance) through their sampling distributions.
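As a first minimal sketch of these steps, we can re-estimate \(\rm{Pr}(X>Y)\) of the Poisson example above by simulation and compare it with the exact Skellam value (the seed and the number of replications \(B\) are arbitrary choices):
# Minimal MC sketch: estimate Pr(X > Y) for the Poisson goals example
set.seed(123)
B <- 10000                       # number of MC replications
x_sim <- rpois(B, lambda1)       # simulated Posillipo goals
y_sim <- rpois(B, lambda2)       # simulated Pro Recco goals
mean(x_sim > y_sim)              # MC estimate, to be compared with the pskellam result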
Let \(U_1, \ldots, U_n\) be independent r.v. distributed as \(Unif(0,1)\). We can verify that the r.v.
\[X=-\frac{\log(U_1)}{\lambda} \sim \mathcal{E}(\lambda) \]
Briefly:
B <- 10000     # number of MC replications
lambda <- 2
x <- rep(0, B)
for(i in 1:B){
x[i] <- -log(runif(1, 0, 1))/lambda   # X = -log(U)/lambda ~ Exp(lambda)
}
par(mfrow = c(1,2))
hist.scott(x, xlab = "x", prob = TRUE)
curve(dexp(x, rate = lambda), from = 1e-16, col = "red", add = TRUE)
plot(ecdf(x))
curve(pexp(x, rate = lambda), col = "red", add = TRUE)
Now, we verify that the r.v. \(Z=\min\{X_1, \ldots, X_n\} \, \sim \, \mathcal{E}(n \lambda)\), with \(X_i=-\log(U_i)/\lambda\), \(i=1, \ldots, n\).
(Recall that if \(X_1, \ldots, X_n\) are independent r.v. such that \(X_i \, \sim \, \mathcal{E}(\lambda_i)\), then \(Z=\min\{X_1, \ldots, X_n\} \, \sim \, \mathcal{E}(\sum_{i=1}^{n}\lambda_i)\).)
B <- 10000
lambda <- 2
n <- 10
x <- rep(0, n)
z <- rep(0, B)
for(i in 1:B){
x <- -log(runif(n,0,1))/lambda
z[i] <- min(x)
}
par(mfrow = c(1,2))
hist.scott(z, xlab="z", prob = TRUE)
curve(dexp(x, rate = n*lambda), from = 1e-16, col = "red", add = TRUE)
plot(ecdf(z), xlab="z")
curve(pexp(x, rate = n*lambda), col = "red", add = TRUE)
It’s your turn Now, let \(U_1, \ldots, U_n\) be independent r.v. distributed as \(Unif(0,1)\), and let \(X_i=-\log(U_i)/\lambda\), \(i=1, \ldots, n\). Verify that the r.v. \(Y= \sum_{i=1}^{n} X_i\, \sim \, \mathcal{Ga}(n, \lambda)\).
B <- 10000
lambda <- 2
n <- 10
x <- rep(0, n)
y <- rep(0, B)
for(i in 1:B){
x <- -log(runif(n,0,1))/lambda
y[i] <- sum(x)
}
par(mfrow = c(1,2))
hist.scott(y, xlab = "y", prob = TRUE)
curve(dgamma(x, shape = n, rate = lambda), col = "red", add = TRUE)
#Alternatively
#curve(dgamma(x, n, scale = 1/lambda), col="red", add=TRUE)
plot(ecdf(y), xlab = "y")
curve(pgamma(x, shape = n, rate = lambda), col = "red", add = TRUE)
Now, we increase \(n\): by the CLT, the \(\mathcal{Ga}(n, \lambda)\) distribution of \(Y\) approaches a normal with mean \(n/\lambda\) and variance \(n/\lambda^2\).
B <- 10000
lambda <- 2
n <- 30
x <- rep(0, n)
y <- rep(0, B)
for(i in 1:B){
x <- -log(runif(n,0,1))/lambda
y[i] <- sum(x)
}
par(mfrow=c(1,2))
hist.scott(y, xlab = "y", prob = TRUE)
curve(dgamma(x, shape = n, rate = lambda), col="red", add=TRUE)
curve(dnorm(x, mean = n/lambda, sd = sqrt(n/lambda^2)), col="blue", add=TRUE)
plot(ecdf(y), xlab = "y")
curve(pgamma(x, shape = n, rate = lambda), col = "red", add=TRUE)
curve(pnorm(x, mean = n/lambda, sd = sqrt(n/lambda^2)), col = "blue", add = TRUE)
Exercise (optional) Generate values from \(X \, \sim \, \mathcal{Weibull}(\alpha,\mu)\), with \(\alpha, \mu > 0\), using the inverse transform (inversion) method. The cumulative distribution function is \[F_X(x)=\begin{cases} 0 & x \leq 0 \\ 1-e^{-(x/\mu)^\alpha} & x > 0\end{cases} \]
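A minimal sketch of one possible solution, assuming the inverse \(F_X^{-1}(u)=\mu(-\log(1-u))^{1/\alpha}\) and arbitrary values \(\alpha=2\), \(\mu=1.5\) chosen only for the check:
# Sketch: inverse transform sampling from a Weibull(alpha, mu)
# (alpha = shape, mu = scale; the parameter values below are arbitrary)
set.seed(123)
B <- 10000
alpha <- 2; mu <- 1.5
u <- runif(B)
x <- mu * (-log(1 - u))^(1/alpha)    # F^{-1}(u) = mu * (-log(1 - u))^(1/alpha)
hist.scott(x, prob = TRUE, xlab = "x")
curve(dweibull(x, shape = alpha, scale = mu), col = "red", add = TRUE)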
Let \(X_1, X_2, \ldots, X_n\) be i.i.d. r.v. from \(X \sim \mathcal{N}(0,1)\) and let \(T_n=\sum_{i=1}^{n}{X^2_i}\). We know that \(T_n \sim \chi^2_n\). Assuming \(n=10\), verify this relationship via Monte Carlo simulation.
n <- 10 # n observations
B <- 10000 # B simulations
chisq.mc <- rep(0, B)
set.seed(1234)
for(j in 1:B){
x <- rnorm(n, 0, 1)
chisq.mc[j] <- sum(x^2)
}
par(mfrow = c(1, 2))
hist.scott(chisq.mc, prob = TRUE, xlab = "x")
curve(dchisq(x, df = n), add = TRUE, col = "red", lwd = 1.5)
plot(ecdf(chisq.mc), xlim = c(0,25))
curve(pchisq(x, df = n), add = TRUE, col = "red", lwd = 1)
It’s your turn Following the setup already seen (\(T_{10} \sim \chi^2_{10}\)), write a function that compares the MC estimates of the mean, the variance and the 0.01 and 0.05 quantiles of \(T_{10}\) with the corresponding true values of the \(\chi^2_{10}\) distribution, for a given number of simulations \(B\).
Such a function could be:
compare_MC_chisq <- function(B){
set.seed(B)
n <- 10 # n observations
chisq.mc <- rep(0, B)
for(j in 1:B){
x <- rnorm(n, 0, 1)
chisq.mc[j] <- sum(x^2)
}
mean_chisq <- mean(chisq.mc)
var_chisq <- var(chisq.mc)
quantile_chisq <- quantile(chisq.mc, probs = c(0.01, 0.05))
chisq.truevalues <- c(n, 2*n, qchisq(p = c(0.01, 0.05), df=n))
chisq.mc.results <- c(mean_chisq, var_chisq, quantile_chisq)
return(data.frame(MC = chisq.mc.results, TrueVal= chisq.truevalues, row.names = c("mean","var", "q0.01", "q0.05")))
#res <- matrix(0,4,2)
#colnames(res) <- c("MC_res", "Trueval")
#res[, 1] <- chisq.mc.results
#res[, 2] <- chisq.truevalues
#rownames(res) <- c("mean", "var", "q0.01", "q0.05")
#return(res)
}
compare_MC_chisq(100)
## MC TrueVal
## mean 10.613318 10.000000
## var 23.329811 20.000000
## q0.01 2.832739 2.558212
## q0.05 4.019324 3.940299
compare_MC_chisq(1000)
## MC TrueVal
## mean 9.898044 10.000000
## var 19.532186 20.000000
## q0.01 2.213470 2.558212
## q0.05 4.018309 3.940299
compare_MC_chisq(10000)
## MC TrueVal
## mean 9.967561 10.000000
## var 19.746209 20.000000
## q0.01 2.697681 2.558212
## q0.05 4.004759 3.940299
Let’s suppose we generate \(n\) i.i.d. values from each of the following distributions: a standard normal \(\mathcal{N}(0,1)\), a Student’s \(t\) with 3 degrees of freedom (\(t_3\)) and a uniform \(\mathcal{U}(0,1)\).
What is the distribution of the sample mean \(\overline X= (\sum_{i=1}^{n}X_i)/n\)? Before answering, let’s investigate this issue via Monte Carlo simulation. Fix \(B=1000\) and \(n=30\). Remember that \(E(X)=0\) and \(Var(X)=1\) for the \(\mathcal{N}(0,1)\); \(E(X)=0\) and \(Var(X)=3/(3-2)=3\) for the \(t_3\); \(E(X)=1/2\) and \(Var(X)=1/12\) for the \(\mathcal{U}(0,1)\).
n <- 30 # n observations
B <- 1000 # B simulations
set.seed(1234)
# sample generations
samples <- array(0, dim = c(n, 3, B))
for(i in 1:B){
samples[, 1, i] <- rnorm(n, 0, 1)
samples[, 2, i] <- rt(n, df = 3)
samples[, 3, i] <- runif(n, 0, 1)
}
#Sample statistics
samples_stat <- array(0, dim = c(2, 3, B))
for(j in 1:3){
#sample mean
samples_stat[1, j, ] <- apply(samples[,j,], 2, mean)
#sample variance
samples_stat[2, j, ] <- apply(samples[,j,], 2, var)
}
par(mfrow = c(1, 3))
hist.scott(samples_stat[1,1,], prob =TRUE, xlab = "x", main = "N(0,1)")
curve(dnorm(x, 0, 1/sqrt(n)), col = "red", lwd = 2, add = TRUE)
hist.scott(samples_stat[1,2,], prob =TRUE, xlab = "x", main = "t3")
curve(dnorm(x, 0, sqrt(3/n)), col = "red", lwd = 2, add = TRUE)
hist.scott(samples_stat[1,3,], prob =TRUE, xlab = "x", main = "U(0,1)")
curve(dnorm(x, 1/2, 1/sqrt(12*n)), col = "red", lwd = 2, add = TRUE)
From the theory: the sample mean is exactly \(\mathcal{N}(\mu, \sigma^2/n)\) when sampling from a normal distribution, and approximately so, by the CLT, when sampling from the \(t_3\) and the \(\mathcal{U}(0,1)\).
Exercise (optional) Try to use the empirical cumulative distribution function and the quantile-quantile plot as an alternative to the histogram check.
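A possible sketch for the \(\mathcal{N}(0,1)\) case (the same idea applies to the other two columns of samples_stat):
# Sketch: ecdf and normal QQ-plot checks for the sample means of the N(0,1) case
par(mfrow = c(1, 2))
plot(ecdf(samples_stat[1, 1, ]), main = "ecdf of the sample means")
curve(pnorm(x, 0, 1/sqrt(n)), col = "red", add = TRUE)
qqnorm(samples_stat[1, 1, ])              # points should lie roughly on a line
qqline(samples_stat[1, 1, ], col = "red")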
From theory, \(S^2 = \frac{1}{n-1}{\sum_{i=1}^{n}(X_i - \overline X )^2}\) is an unbiased estimator (\(E(\hat \theta)= \theta\)) of the variance. The \(\mathsf{var()}\) function applied to a sample vector returns this estimate. In the example above, we have already stored the sample variances. If the assumed true model is normal, we know that the distribution of the sample variance is \[ \frac{n-1}{\sigma^2} S^2 \, \sim \chi^2_{n-1} \]
Otherwise, see [https://www.tandfonline.com/doi/abs/10.1080/00031305.2014.966589], results 13 and 14, for the asymptotic distributions. Namely, \[\frac{S^2}{\sigma^2} k \, \dot \sim \, \chi^2_{k}\] where \(k=\frac{2\sigma^4}{Var(S^2)}\).
sigma2_n <- 1
sigma2_t <- 3 #3/(3-2)
sigma2_u <- 1/12
par(mfrow=c(1,3))
hist.scott(samples_stat[2,1, ], prob=TRUE, xlab=expression(s^2), main = "N(0,1)")
curve((n-1)/sigma2_n *dchisq(x * (n-1)/sigma2_n, df = n - 1), add=TRUE, col="red", lwd=2)
hist.scott(samples_stat[2,2, ], prob=TRUE, xlab=expression(s^2), main = "t(3)", xlim = c(0,25))
curve((n-1)/sigma2_t *dchisq(x * (n-1)/sigma2_t, df = n - 1), add=TRUE, col="red", lwd=2)
curve(2*sigma2_t/var(samples_stat[2,2, ]) * dchisq(x * 2*sigma2_t/var(samples_stat[2,2, ]) , df = ( 2*sigma2_t^2/var(samples_stat[2, 2, ]))), add=TRUE, col="green", lwd=2)
hist.scott(samples_stat[2,3, ], prob=TRUE, xlab=expression(s^2), main = "U(0,1)")
curve((n-1)/sigma2_u *dchisq(x * (n-1)/sigma2_u, df = n - 1), add=TRUE, col="red", lwd=2)
curve(2*sigma2_u/var(samples_stat[2,3, ]) * dchisq(x * 2*sigma2_u/var(samples_stat[2,3, ]) , df = ( 2*sigma2_u^2/var(samples_stat[2, 3, ]))), add=TRUE, col="green", lwd=2)
It’s your turn The estimator \(S^2_b=\frac{1}{n} \sum_{i=1}^{n}{(X_i - \overline X)^2}\) is a biased (though asymptotically unbiased) estimator of the variance. Check the biased nature of \(S^2_b\) via MC simulation, generating \(n=10\) i.i.d. values from a normal distribution. Compare the results with the unbiased estimator \(S^2\).
set.seed(1234)
B <- 1000
n <- 10
mu <- 5
sigma <- 2
s2 <- rep(0, B)
s2_b <- rep(0, B)
for(i in 1:B){
y <- rnorm(n, mu, sigma)
s2[i] <- var(y)
s2_b[i] <- s2[i] *(n - 1)/n
}
s2_m <- mean(s2)       # MC mean of the unbiased estimator
s2_b_m <- mean(s2_b)   # MC mean of the biased estimator
par(mfrow = c(1, 2))
hist.scott(s2, xlab = expression(s^2),
main = expression(s^2), prob = TRUE)
abline(v = sigma^2, col = "red")
abline(v = s2_m, col = "blue", lty=2)
hist.scott(s2_b, xlab = expression(s[b]^2),
main = expression(s[b]^2), prob = TRUE)
abline(v = sigma^2, col = "red")
abline(v = s2_b_m, col = "blue", lty=2)
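A quick numeric comparison of the two MC means against the true value \(\sigma^2 = 4\) (recall that \(E(S^2_b) = \frac{n-1}{n}\sigma^2 = 3.6\) here) makes the bias explicit:
# MC means of the two estimators vs the true variance
c(S2 = s2_m, S2_b = s2_b_m, true = sigma^2)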