--- title: "Lab1" author: "V Gioia, L Egidi, M Stefanucci" date: "19/10/2021" output: html_document: css: style.css toc: yes pdf_document: highligth: tango toc: yes institute: University of Trieste subtitle: Statistical Methods for Data Science fontsize: 10 pt --- ```{r global_options, include=FALSE} knitr::opts_chunk$set(fig.align = 'center', warning=FALSE, message=FALSE, fig.asp=0.625, dev='png', global.par = TRUE, dev.args=list(pointsize=10), fig.path = 'figs/') library(MASS) ``` ```{r setup, include=FALSE} library(knitr) local({ hook_plot = knit_hooks$get('plot') knit_hooks$set(plot = function(x, options) { paste0('\n\n----\n\n', hook_plot(x, options)) }) }) ``` # Central Limit Theorem (CLT) Let $X_1, X_2, \ldots, X_n$ be a sequence of independent and identically distributed (i.i.d.) random variables (r.v.) of size $n$ from a distribution with mean $\mu$ and variance $\sigma^2$. Then, for large n, the sample average $\overline X_n= (\sum_{i=1}^{n} X_i)/n$ $$\overline X_n \, \dot \sim \, \mathcal N \bigg (\mu, \frac{\sigma^2}{n} \bigg)$$ where $\dot \sim$ denotes the converge in distribution. The theorem could be stated in terms of $T_n=\sum_{i=1}^{n}X_i=n \overline X_n$, that is $$T_n \, \dot \sim \, \mathcal{N}(n \mu, n \sigma^2) $$ The CLT supports the normal distribution of a r.v. that can be viewed as the sum of other r.v. ## Approximation with CLT: application The approximation above is useful in statistics for computing some quantities. For instance, let $X$ and $Y$ two independent r.v., such that $X \sim \mathcal{B_i}(n_1,p_1)$ and $Y \sim \mathcal{B_i}(n_2,p_2)$, how can easily compute the probability $\rm{Pr}(X>Y)$?\ Normal approximation of the binomial distribution is the simplest way to do it. $$X \, \approx \, \mathcal{N}(\mu_X = n_1p_1, \sigma^2_X=n_1p_1(1-p_1)) \hspace{1cm} Y \, \approx \, \mathcal{N}(\mu_Y=n_2p_2, \sigma^2_Y=n_2p_2(1-p_2)) $$ Denoting with $W= X-Y$, we know that $W$ is also a normal r.v. with $$\mu_W = \mu_X - \mu_Y$$ and, by independence $$ \sigma^2_W=\sigma^2_X+\sigma^2_Y $$ Then, we have $W \, \approx \, \mathcal{N}(\mu_W=n_1p_1-n_2p_2, \, \sigma^2_W=n_1p_1(1-p_1)+n_2p_2(1-p_2))$ ## Approximation with CLT: real application with waterpolo goals Tomorrow, two professional Italian waterpolo teams, Posillipo and Pro Recco, compete against each other. Let $X$ and $Y$ the r.v. denoting the goals scored by both the teams. Let's assume that $X$ and $Y$ are independent and follow the Binomial distributions, namely $X \sim \mathcal{B}_i(n_1,p_1)$ and $Y \sim \mathcal{B}_i(n_2,p_2)$\ - $X$: number of shots converted in goals on the total number of shots $n_1$ with probability $p_1$ (Posillipo) \ - $Y$: number of shots converted in goals on the total number of shots $n_2$ with probability $p_2$ (Pro Recco) Before a match, the number of shots ($n_1,n_2$) are unknown. We treat the quantities $p_1, p_2, n_1, n_2$ as known, for instance fixing them upon historical data: \ $p_1=0.5, \quad p_2=0.7, \quad n_1=20, \quad n_2=20.$ What is the probability the Posillipo will win the next match against the Pro Recco: $\rm{Pr}(X>Y)=\rm{Pr}(X-Y>0)=?$ Denoting with $W=X-Y$ and using the above formulas, we can approximate $W \, \approx \, \mathcal N(\mu_W=\mu_X-\mu_Y, \sigma^2_W=\sigma^2_X+\sigma^2_Y)$. 
```{r binom_water, echo=TRUE, results='hold', fig.keep='high', fig.height=8, fig.width=5}
p1 <- 0.5; p2 <- 0.7; n1 <- 20; n2 <- 20
mean_w <- n1*p1 - n2*p2
sigma2_w <- n1*p1*(1 - p1) + n2*p2*(1 - p2)
Prob_win_Posillipo <- pnorm(0, mean = mean_w, sd = sqrt(sigma2_w), lower.tail = FALSE)
Prob_win_Posillipo
# Alternatively
# Prob_win_Posillipo <- 1 - pnorm(0, mean = mean_w, sd = sqrt(sigma2_w))
```

```{r binom_water3, echo=TRUE}
plot(1, ylim = c(0, 0.3), xlim = c(-12, 20), ylab = "f(x)", xlab = "x", type = "n")
points(0:n1, dbinom(0:n1, n1, p1), pch = 21, bg = 1)
text(10.25, 0.20, "X", cex = 2, col = "black")
points(0:n2, dbinom(0:n2, n2, p2), pch = 21, bg = 2)
text(14.25, 0.23, "Y", cex = 2, col = "red")
curve(dnorm(x, p1*n1 - p2*n2, sqrt(n1*p1*(1 - p1) + n2*p2*(1 - p2))),
      add = TRUE, col = "blue", lty = "dashed")
text(-4.0, 0.16, "W", cex = 2, col = 4)
segments(0, 0, 0, dnorm(0, mean_w, sqrt(sigma2_w)), col = "black")
```

## Waterpolo goals: what if we use a Poisson distribution?

Rather than specifying in advance the total numbers of shots and the shot conversion probabilities (four parameters in all), we may consider a different distribution that accounts only for the scoring intensity, regardless of the number of shots. In this respect, the Poisson distribution seems suitable. We assume two independent Poisson distributions for the numbers of goals in the upcoming match, namely: $X \sim \mathcal{P}(\lambda_1)$, $Y \sim \mathcal{P}(\lambda_2)$.

We need to specify the rates $\lambda_1$ and $\lambda_2$. Based on previous knowledge, we fix $\lambda_1 = 5$ and $\lambda_2 = 7.5$.

What is the probability that Posillipo will win the next match against Pro Recco: $\rm{Pr}(X > Y) = \rm{Pr}(X - Y > 0) = ?$

The difference between two independent Poisson r.v. follows a known probability distribution, the Poisson difference or Skellam distribution. Then $W = X - Y \sim \mathcal{PD}(\lambda_1, \lambda_2)$, with $\mu_W = \lambda_1 - \lambda_2$ and $\sigma^2_W = \lambda_1 + \lambda_2$ (see [https://www.jstor.org/stable/2981372?seq=1#metadata_info_tab_contents])\

```{r skellam_water, echo=TRUE, results='hold', fig.keep='high', fig.height=8, fig.width=5}
library(skellam)
lambda1 <- 5; lambda2 <- 7.5
Prob_win_Posillipo_S <- pskellam(0, lambda1, lambda2, lower.tail = FALSE)
Prob_win_Posillipo_S
```

```{r skellam_water2, echo=TRUE}
plot(1, ylim = c(0, 0.3), xlim = c(-12, 20), ylab = "f(x)", xlab = "x", type = "n")
points(0:n1, dpois(0:n1, lambda1), pch = 21, bg = 1)
text(4, 0.25, "X", cex = 2, col = "black")
points(0:n2, dpois(0:n2, lambda2), pch = 21, bg = 2)
text(7, 0.20, "Y", cex = 2, col = "red")
points(-12:20, dskellam(-12:20, lambda1, lambda2), col = "blue")
text(-3.0, 0.16, "W", cex = 2, col = 4)
segments(0, 0, 0, dskellam(0, lambda1, lambda2), col = "black")
```

# Monte Carlo (MC) simulation

Simulation is a well-known technique designed to approximate a process and retrieve general results by observing the process several times. We rely on the so-called MC simulation, a wide class of simulation procedures based on sampling i.i.d. values from a process (precisely, from the underlying "true" probability distribution of the process) and computing numerical outputs.

The steps of an MC simulation are:

- generate $n$ i.i.d. values from a process;
- compute a summary for this sample, a statistic;
- repeat the steps above $R$ times and obtain a sampling distribution for the statistic.

We will study the behaviour of some sample statistics (mean and variance) through their sampling distributions.
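As a first, simple illustration of these steps, we can cross-check the waterpolo winning probabilities obtained above by simulating the two assumed models directly. The following is a minimal sketch (the number of replications $R$ and the seed are arbitrary choices):

```{r mc_waterpolo_check, echo=TRUE}
set.seed(123)
R <- 100000
# Binomial model: proportion of simulated matches in which Posillipo scores more goals
x_bin <- rbinom(R, size = n1, prob = p1)
y_bin <- rbinom(R, size = n2, prob = p2)
mean(x_bin > y_bin)
# Poisson model: same probability under the scoring-intensity parametrisation
x_pois <- rpois(R, lambda1)
y_pois <- rpois(R, lambda2)
mean(x_pois > y_pois)
```

The Poisson-based estimate should closely match the exact Skellam probability computed with `pskellam()`, while the Binomial-based estimate gives an idea of how accurate the CLT approximation computed with `pnorm()` is.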
## Some relationships among probability distributions: inverse sampling method

Let $U_1, \ldots, U_n$ be independent r.v. distributed as $Unif(0,1)$. We can verify that the r.v.

$$X = -\frac{\log(U_1)}{\lambda} \sim \mathcal{E}(\lambda)$$

Briefly:

- generate a value from a standard uniform;
- obtain $x$ from the previous expression, having fixed $\lambda$ in advance;
- repeat the process $B$ times.

```{r exp, echo=TRUE}
B <- 10000
lambda <- 2
x <- rep(0, B)
for(i in 1:B){
  x[i] <- -log(runif(1, 0, 1))/lambda
}
par(mfrow = c(1, 2))
hist.scott(x, xlab = "x", prob = TRUE)
curve(dexp(x, rate = lambda), from = 1e-16, col = "red", add = TRUE)
plot(ecdf(x))
curve(pexp(x, rate = lambda), col = "red", add = TRUE)
```

Now, we verify that the r.v. $Z = \min\{X_1, \ldots, X_n\} \, \sim \, \mathcal{E}(n \lambda)$, with $X_i = -\log(U_i)/\lambda$, $i = 1, \ldots, n$\
(in general, if $X_1, \ldots, X_n$ are independent r.v. such that $X_i \, \sim \, \mathcal{E}(\lambda_i)$, then $Z = \min\{X_1, \ldots, X_n\} \, \sim \, \mathcal{E}(\sum_{i=1}^{n}\lambda_i)$).

```{r min_exp, echo=TRUE}
B <- 10000
lambda <- 2
n <- 10
x <- rep(0, n)
z <- rep(0, B)
for(i in 1:B){
  x <- -log(runif(n, 0, 1))/lambda
  z[i] <- min(x)
}
par(mfrow = c(1, 2))
hist.scott(z, xlab = "z", prob = TRUE)
curve(dexp(x, rate = n*lambda), from = 1e-16, col = "red", add = TRUE)
plot(ecdf(z), xlab = "z")
curve(pexp(x, rate = n*lambda), col = "red", add = TRUE)
```

**It's your turn**

Now, let $U_1, \ldots, U_n$ be independent r.v. distributed as $Unif(0,1)$, and let $X_i = -\log(U_i)/\lambda$, $i = 1, \ldots, n$. Verify that the r.v. $Y = \sum_{i=1}^{n} X_i \, \sim \, \mathcal{Ga}(n, \lambda)$

```{r sum_exp, echo=TRUE}
B <- 10000
lambda <- 2
n <- 10
x <- rep(0, n)
y <- rep(0, B)
for(i in 1:B){
  x <- -log(runif(n, 0, 1))/lambda
  y[i] <- sum(x)
}
par(mfrow = c(1, 2))
hist.scott(y, xlab = "y", prob = TRUE)
curve(dgamma(x, shape = n, rate = lambda), col = "red", add = TRUE)
# Alternatively
# curve(dgamma(x, n, scale = 1/lambda), col = "red", add = TRUE)
plot(ecdf(y), xlab = "y")
curve(pgamma(x, shape = n, rate = lambda), col = "red", add = TRUE)
```

Now, we try to increase $n$.

```{r sum_exp2, echo=TRUE}
B <- 10000
lambda <- 2
n <- 30
x <- rep(0, n)
y <- rep(0, B)
for(i in 1:B){
  x <- -log(runif(n, 0, 1))/lambda
  y[i] <- sum(x)
}
par(mfrow = c(1, 2))
hist.scott(y, xlab = "y", prob = TRUE)
curve(dgamma(x, shape = n, rate = lambda), col = "red", add = TRUE)
curve(dnorm(x, mean = n/lambda, sd = sqrt(n/lambda^2)), col = "blue", add = TRUE)
plot(ecdf(y), xlab = "y")
curve(pgamma(x, shape = n, rate = lambda), col = "red", add = TRUE)
curve(pnorm(x, mean = n/lambda, sd = sqrt(n/lambda^2)), col = "blue", add = TRUE)
```

**Exercise (optional)**

Generate values from $X \, \sim \, \mathcal{Weibull}(\alpha, \mu)$, with $\alpha, \mu > 0$, using the inverse sampling method. The cumulative distribution function is

$$F_X(x) = \begin{cases} 0 & x \leq 0 \\ 1 - e^{-(x/\mu)^\alpha} & x > 0 \end{cases}$$

## Normal and Chi-square distributions

Let $X_1, X_2, \ldots, X_n$ be i.i.d. r.v. from $X \sim \mathcal{N}(0,1)$ and let $T_n = \sum_{i=1}^{n}{X^2_i}$. We know that $T_n \sim \chi^2_n$. Assuming $n = 10$, verify this relationship via Monte Carlo simulation.
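Recall that, if $T_n \sim \chi^2_n$, then

$$E(T_n) = n, \qquad Var(T_n) = 2n,$$

values we will also use below as theoretical references for the Monte Carlo summaries.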
```{r normal_chisq, echo=TRUE}
n <- 10     # n observations
B <- 10000  # B simulations
chisq.mc <- rep(0, B)
set.seed(1234)
for(j in 1:B){
  x <- rnorm(n, 0, 1)
  chisq.mc[j] <- sum(x^2)
}
par(mfrow = c(1, 2))
hist.scott(chisq.mc, prob = TRUE, xlab = "x")
curve(dchisq(x, df = n), add = TRUE, col = "red", lwd = 1.5)
plot(ecdf(chisq.mc), xlim = c(0, 25))
curve(pchisq(x, df = n), add = TRUE, col = "red", lwd = 1)
```

**It's your turn**

Following the setup already seen ($T_{10} \sim \chi^2_{10}$), write a function that compares the following sample statistics with their theoretical counterparts:

- sample mean
- sample variance
- sample quantiles of order 0.01 and 0.05

Such a function

- takes as **input** the number of simulations, say $B$
- returns as **output** a table with the sample and the theoretical results (4 rows and 2 columns)

```{r normal_chisq2, echo=TRUE}
compare_MC_chisq <- function(B){
  set.seed(B)
  n <- 10  # n observations
  chisq.mc <- rep(0, B)
  for(j in 1:B){
    x <- rnorm(n, 0, 1)
    chisq.mc[j] <- sum(x^2)
  }
  mean_chisq <- mean(chisq.mc)
  var_chisq <- var(chisq.mc)
  quantile_chisq <- quantile(chisq.mc, probs = c(0.01, 0.05))
  chisq.truevalues <- c(n, 2*n, qchisq(p = c(0.01, 0.05), df = n))
  chisq.mc.results <- c(mean_chisq, var_chisq, quantile_chisq)
  return(data.frame(MC = chisq.mc.results, TrueVal = chisq.truevalues,
                    row.names = c("mean", "var", "q0.01", "q0.05")))
  # Alternatively, with a matrix:
  # res <- matrix(0, 4, 2)
  # colnames(res) <- c("MC_res", "Trueval")
  # res[, 1] <- chisq.mc.results
  # res[, 2] <- chisq.truevalues
  # rownames(res) <- c("mean", "var", "q0.01", "q0.05")
  # return(res)
}
compare_MC_chisq(100)
compare_MC_chisq(1000)
compare_MC_chisq(10000)
```

## Distribution of the sample mean

Let's suppose:

- $x_1, \ldots, x_n$ is a random sample from $\mathcal{N}(0,1)$
- $x_1, \ldots, x_n$ is a random sample from $t_3$
- $x_1, \ldots, x_n$ is a random sample from $\mathcal{Unif}(0,1)$

What is the distribution of the sample mean $\overline X = (\sum_{i=1}^{n} X_i)/n$? Before answering, let's investigate this issue via Monte Carlo simulation. Fix $B = 1000$ and $n = 30$.
Remember that:

- if $X \sim t_k$, then $\mu_X = 0$, provided that $k > 1$, and $\sigma^2_X = k/(k-2)$, provided that $k > 2$
- if $X \sim \mathcal{Unif}(a,b)$, then $\mu_X = (a+b)/2$ and $\sigma^2_X = (b-a)^2/12$

```{r sample_mean, echo=TRUE}
n <- 30   # n observations
B <- 1000 # B simulations
set.seed(1234)

# sample generation
samples <- array(0, dim = c(n, 3, B))
for(i in 1:B){
  samples[, 1, i] <- rnorm(n, 0, 1)
  samples[, 2, i] <- rt(n, df = 3)
  samples[, 3, i] <- runif(n, 0, 1)
}

# sample statistics
samples_stat <- array(0, dim = c(2, 3, B))
for(j in 1:3){
  # sample mean
  samples_stat[1, j, ] <- apply(samples[, j, ], 2, mean)
  # sample variance
  samples_stat[2, j, ] <- apply(samples[, j, ], 2, var)
}

par(mfrow = c(1, 3))
hist.scott(samples_stat[1, 1, ], prob = TRUE, xlab = "x", main = "N(0,1)")
curve(dnorm(x, 0, 1/sqrt(n)), col = "red", lwd = 2, add = TRUE)
hist.scott(samples_stat[1, 2, ], prob = TRUE, xlab = "x", main = "t3")
curve(dnorm(x, 0, sqrt(3/n)), col = "red", lwd = 2, add = TRUE)
hist.scott(samples_stat[1, 3, ], prob = TRUE, xlab = "x", main = "U(0,1)")
curve(dnorm(x, 1/2, 1/sqrt(12*n)), col = "red", lwd = 2, add = TRUE)
```

From the theory:

- in the Gaussian case the sample mean follows the normal distribution exactly, namely $\overline X \, \sim \, \mathcal{N}(\mu, \sigma^2/n)$
- for the other distributions, by the CLT, this result holds asymptotically, that is $\overline X \, \dot \sim \, \mathcal{N}(\mu, \sigma^2/n)$

**Exercise (optional)**

Try to use the empirical cumulative distribution function and the quantile-quantile plot as alternatives to the histogram check.

## Distribution of the sample variance

From the theory, $S^2 = \frac{1}{n-1}{\sum_{i=1}^{n}(X_i - \overline X)^2}$ is an unbiased estimator ($E(\hat \theta) = \theta$) of the variance. The function $\mathsf{var()}$ applied to a sample vector returns this estimate. In the example above, we have already stored the sample variances.

If the assumed true model is normal, we know that the distribution of the sample variance is

$$\frac{n-1}{\sigma^2} S^2 \, \sim \chi^2_{n-1}$$

Otherwise, see [https://www.tandfonline.com/doi/abs/10.1080/00031305.2014.966589], results 13 and 14, for the asymptotic distributions. Namely,

$$\frac{S^2}{\sigma^2} k \, \dot \sim \, \chi^2_{k}$$

where $k = \frac{2\sigma^4}{Var(S^2)}$.

```{r sample_variance, echo=TRUE}
sigma2_n <- 1
sigma2_t <- 3     # 3/(3-2)
sigma2_u <- 1/12

par(mfrow = c(1, 3))
hist.scott(samples_stat[2, 1, ], prob = TRUE, xlab = expression(s^2), main = "N(0,1)")
curve((n-1)/sigma2_n * dchisq(x * (n-1)/sigma2_n, df = n - 1), add = TRUE, col = "red", lwd = 2)
hist.scott(samples_stat[2, 2, ], prob = TRUE, xlab = expression(s^2), main = "t(3)", xlim = c(0, 25))
curve((n-1)/sigma2_t * dchisq(x * (n-1)/sigma2_t, df = n - 1), add = TRUE, col = "red", lwd = 2)
curve(2*sigma2_t/var(samples_stat[2, 2, ]) * dchisq(x * 2*sigma2_t/var(samples_stat[2, 2, ]),
      df = 2*sigma2_t^2/var(samples_stat[2, 2, ])), add = TRUE, col = "green", lwd = 2)
hist.scott(samples_stat[2, 3, ], prob = TRUE, xlab = expression(s^2), main = "U(0,1)")
curve((n-1)/sigma2_u * dchisq(x * (n-1)/sigma2_u, df = n - 1), add = TRUE, col = "red", lwd = 2)
curve(2*sigma2_u/var(samples_stat[2, 3, ]) * dchisq(x * 2*sigma2_u/var(samples_stat[2, 3, ]),
      df = 2*sigma2_u^2/var(samples_stat[2, 3, ])), add = TRUE, col = "green", lwd = 2)
```

**It's your turn**

The estimator $S^2_b = \frac{1}{n} \sum_{i=1}^{n}{(X_i - \overline X)^2}$ is a biased estimator (asymptotically unbiased). Check the biased nature of $S^2_b$ via MC simulation, generating $n = 10$ i.i.d. values from a normal distribution.
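For reference, recall that under i.i.d. sampling

$$E(S^2_b) = \frac{n-1}{n}\,\sigma^2,$$

so the bias is $E(S^2_b) - \sigma^2 = -\sigma^2/n$, which vanishes as $n$ grows: this is what the simulation below should reveal.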
Compare the results with the unbiased estimator $S^2$.

```{r biased_sample_variance, echo=TRUE}
set.seed(1234)
B <- 1000
n <- 10
mu <- 5
sigma <- 2
s2 <- rep(0, B)
s2_b <- rep(0, B)
for(i in 1:B){
  y <- rnorm(n, mu, sigma)
  s2[i] <- var(y)
  s2_b[i] <- s2[i] * (n - 1)/n
}
s2_m <- mean(s2)
s2_b_m <- mean(s2_b)

par(mfrow = c(1, 2))
hist.scott(s2, xlab = expression(s^2), main = expression(s^2), prob = TRUE)
abline(v = sigma^2, col = "red")
abline(v = s2_m, col = "blue", lty = 2)
hist.scott(s2_b, xlab = expression(s[b]^2), main = expression(s[b]^2), prob = TRUE)
abline(v = sigma^2, col = "red")
abline(v = s2_b_m, col = "blue", lty = 2)
```
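As a quick numerical check, a minimal sketch reusing the quantities computed in the chunk above compares the Monte Carlo averages of the two estimators with the true variance $\sigma^2 = 4$:

```{r biased_variance_check, echo=TRUE}
# Monte Carlo estimates of E(S^2) and E(S^2_b) versus the true value sigma^2
round(c(true_sigma2 = sigma^2, mean_s2 = s2_m, mean_s2_b = s2_b_m), 3)
```

The unbiased estimator should be centred around 4, while the biased one should fall systematically below it, roughly by $\sigma^2/n = 0.4$.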