--- title: "Lab1" author: "V Gioia, L Egidi, M Stefanucci" date: "19/10/2021" output: html_document: css: style.css toc: yes pdf_document: highligth: tango toc: yes institute: University of Trieste subtitle: Statistical Methods for Data Science fontsize: 10 pt --- ```{r global_options, include=FALSE} knitr::opts_chunk$set(fig.align = 'center', warning=FALSE, message=FALSE, fig.asp=0.625, dev='png', global.par = TRUE, dev.args=list(pointsize=10), fig.path = 'figs/') library(MASS) ``` ```{r setup, include=FALSE} library(knitr) local({ hook_plot = knit_hooks$get('plot') knit_hooks$set(plot = function(x, options) { paste0('\n\n----\n\n', hook_plot(x, options)) }) }) ``` # Central Limit Theorem (CLT) Let $X_1, X_2, \ldots, X_n$ be a sequence of independent and identically distributed (i.i.d.) random variables (r.v.) of size $n$ from a distribution with mean $\mu$ and variance $\sigma^2$. Then, for large n, the sample average $\overline X_n= (\sum_{i=1}^{n} X_i)/n$ $$\overline X_n \, \dot \sim \, \mathcal N \bigg (\mu, \frac{\sigma^2}{n} \bigg)$$ where $\dot \sim$ denotes the converge in distribution. The theorem could be stated in terms of $T_n=\sum_{i=1}^{n}X_i=n \overline X_n$, that is $$T_n \, \dot \sim \, \mathcal{N}(n \mu, n \sigma^2) $$ The CLT supports the normal distribution of a r.v. that can be viewed as the sum of other r.v. ## Approximation with CLT: application The approximation above is useful in statistics for computing some quantities. For instance, let $X$ and $Y$ two independent r.v., such that $X \sim \mathcal{B_i}(n_1,p_1)$ and $Y \sim \mathcal{B_i}(n_2,p_2)$, how can easily compute the probability $\rm{Pr}(X>Y)$?\ Normal approximation of the binomial distribution is the simplest way to do it. $$X \, \approx \, \mathcal{N}(\mu_X = n_1p_1, \sigma^2_X=n_1p_1(1-p_1)) \hspace{1cm} Y \, \approx \, \mathcal{N}(\mu_Y=n_2p_2, \sigma^2_Y=n_2p_2(1-p_2)) $$ Denoting with $W= X-Y$, we know that $W$ is also a normal r.v. with $$\mu_W = \mu_X - \mu_Y$$ and, by independence $$ \sigma^2_W=\sigma^2_X+\sigma^2_Y $$ Then, we have $W \, \approx \, \mathcal{N}(\mu_W=n_1p_1-n_2p_2, \, \sigma^2_W=n_1p_1(1-p_1)+n_2p_2(1-p_2))$ ## Approximation with CLT: real application with waterpolo goals Tomorrow, two professional Italian waterpolo teams, Posillipo and Pro Recco, compete against each other. Let $X$ and $Y$ the r.v. denoting the goals scored by both the teams. Let's assume that $X$ and $Y$ are independent and follow the Binomial distributions, namely $X \sim \mathcal{B}_i(n_1,p_1)$ and $Y \sim \mathcal{B}_i(n_2,p_2)$\ - $X$: number of shots converted in goals on the total number of shots $n_1$ with probability $p_1$ (Posillipo) \ - $Y$: number of shots converted in goals on the total number of shots $n_2$ with probability $p_2$ (Pro Recco) Before a match, the number of shots ($n_1,n_2$) are unknown. We treat the quantities $p_1, p_2, n_1, n_2$ as known, for instance fixing them upon historical data: \ $p_1=0.5, \quad p_2=0.7, \quad n_1=20, \quad n_2=20.$ What is the probability the Posillipo will win the next match against the Pro Recco: $\rm{Pr}(X>Y)=\rm{Pr}(X-Y>0)=?$ Denoting with $W=X-Y$ and using the above formulas, we can approximate $W \, \approx \, \mathcal N(\mu_W=\mu_X-\mu_Y, \sigma^2_W=\sigma^2_X+\sigma^2_Y)$. 
```{r binom_water, echo=TRUE, results='hold', fig.keep='high', fig.height=8, fig.width=5}
p1 <- 0.5; p2 <- 0.7; n1 <- 20; n2 <- 20
mean_w <- n1*p1 - n2*p2
sigma2_w <- n1*p1*(1 - p1) + n2*p2*(1 - p2)
Prob_win_Posillipo <- pnorm(0, mean = mean_w, sd = sqrt(sigma2_w), lower.tail = FALSE)
Prob_win_Posillipo
# Alternatively
# Prob_win_Posillipo <- 1 - pnorm(0, mean = mean_w, sd = sqrt(sigma2_w))
```

```{r binom_water3, echo=TRUE}
plot(1, ylim = c(0, 0.3), xlim = c(-12, 20), ylab = "f(x)", xlab = "x", type = "n")
points(0:n1, dbinom(0:n1, n1, p1), pch = 21, bg = 1)
text(10.25, 0.20, "X", cex = 2, col = "black")
points(0:n2, dbinom(0:n2, n2, p2), pch = 21, bg = 2)
text(14.25, 0.23, "Y", cex = 2, col = "red")
curve(dnorm(x, p1*n1 - p2*n2, sqrt(n1*p1*(1 - p1) + n2*p2*(1 - p2))),
      add = TRUE, col = "blue", lty = "dashed")
text(-4.0, 0.16, "W", cex = 2, col = 4)
segments(0, 0, 0, dnorm(0, mean_w, sqrt(sigma2_w)), col = "black")
```

## Waterpolo goals: what if we use a Poisson distribution?

Rather than specifying in advance the total numbers of shots and the shot conversion probabilities (four parameters in all), we may consider a different distribution that accounts only for the scoring intensity, regardless of the number of shots. In this respect, the Poisson distribution seems suitable. We assume two independent Poisson distributions for the numbers of goals in the upcoming match, namely: $X \sim \mathcal{P}(\lambda_1)$, $Y \sim \mathcal{P}(\lambda_2)$.

We need to specify the rates $\lambda_1$ and $\lambda_2$. Based on previous knowledge, we fix $\lambda_1 = 5$ and $\lambda_2 = 7.5$.

What is the probability that Posillipo will win the next match against Pro Recco: $\rm{Pr}(X > Y) = \rm{Pr}(X - Y > 0) = ?$

The difference between two independent Poisson r.v. follows a known probability distribution, the Poisson difference or Skellam distribution. Then $W = X - Y \sim \mathcal{PD}(\lambda_1, \lambda_2)$, with $\mu_W = \lambda_1 - \lambda_2$ and $\sigma^2_W = \lambda_1 + \lambda_2$ (see [https://www.jstor.org/stable/2981372?seq=1#metadata_info_tab_contents])\

```{r skellam_water, echo=TRUE, results='hold', fig.keep='high', fig.height=8, fig.width=5}
library(skellam)
lambda1 <- 5; lambda2 <- 7.5
Prob_win_Posillipo_S <- pskellam(0, lambda1, lambda2, lower.tail = FALSE)
Prob_win_Posillipo_S
```

```{r skellam_water2, echo=TRUE}
plot(1, ylim = c(0, 0.3), xlim = c(-12, 20), ylab = "f(x)", xlab = "x", type = "n")
points(0:n1, dpois(0:n1, lambda1), pch = 21, bg = 1)
text(4, 0.25, "X", cex = 2, col = "black")
points(0:n2, dpois(0:n2, lambda2), pch = 21, bg = 2)
text(7, 0.20, "Y", cex = 2, col = "red")
points(-12:20, dskellam(-12:20, lambda1, lambda2), col = "blue")
text(-3.0, 0.16, "W", cex = 2, col = 4)
segments(0, 0, 0, dskellam(0, lambda1, lambda2), col = "black")
```

# Monte Carlo (MC) simulation

Simulation is a well-known technique designed to approximate a process and retrieve general results by observing the process several times. We rely on the so-called MC simulation, a wide class of simulation procedures based on sampling i.i.d. values from a process (precisely, from the underlying "true" probability distribution of the process) and computing numerical outputs.

The steps of an MC simulation are:

- generate $n$ i.i.d. values from a process;
- compute a summary for this sample, a statistic;
- repeat the steps above $R$ times and obtain a sampling distribution for the statistic.

We will study the behaviour of some sample statistics (mean and variance) through their sampling distributions.
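As a first, simple illustration of these steps, we can cross-check the waterpolo winning probabilities obtained above by simulating the two assumed models directly. The following is a minimal sketch (the number of replications $R$ and the seed are arbitrary choices):

```{r mc_waterpolo_check, echo=TRUE}
set.seed(123)
R <- 100000
# Binomial model: proportion of simulated matches in which Posillipo scores more goals
x_bin <- rbinom(R, size = n1, prob = p1)
y_bin <- rbinom(R, size = n2, prob = p2)
mean(x_bin > y_bin)
# Poisson model: same probability under the scoring-intensity parametrisation
x_pois <- rpois(R, lambda1)
y_pois <- rpois(R, lambda2)
mean(x_pois > y_pois)
```

The Poisson-based estimate should closely match the exact Skellam probability computed with `pskellam()`, while the Binomial-based estimate gives an idea of how accurate the CLT approximation computed with `pnorm()` is.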
## Some relationships among probability distributions: inverse sampling method

Let $U_1, \ldots, U_n$ be independent r.v. distributed as $Unif(0,1)$. We can verify that the r.v.

$$X = -\frac{\log(U_1)}{\lambda} \sim \mathcal{E}(\lambda)$$

Briefly:

- generate a value from a standard uniform;
- obtain $x$ from the previous expression, having fixed $\lambda$ in advance;
- repeat the process $B$ times.

```{r exp, echo=TRUE}
B <- 10000
lambda <- 2
x <- rep(0, B)
for(i in 1:B){
  x[i] <- -log(runif(1, 0, 1))/lambda
}
par(mfrow = c(1, 2))
hist.scott(x, xlab = "x", prob = TRUE)
curve(dexp(x, rate = lambda), from = 1e-16, col = "red", add = TRUE)
plot(ecdf(x))
curve(pexp(x, rate = lambda), col = "red", add = TRUE)
```

Now, we verify that the r.v. $Z = \min\{X_1, \ldots, X_n\} \, \sim \, \mathcal{E}(n \lambda)$, with $X_i = -\log(U_i)/\lambda$, $i = 1, \ldots, n$\
(in general, if $X_1, \ldots, X_n$ are independent r.v. such that $X_i \, \sim \, \mathcal{E}(\lambda_i)$, then $Z = \min\{X_1, \ldots, X_n\} \, \sim \, \mathcal{E}(\sum_{i=1}^{n}\lambda_i)$).

```{r min_exp, echo=TRUE}
B <- 10000
lambda <- 2
n <- 10
x <- rep(0, n)
z <- rep(0, B)
for(i in 1:B){
  x <- -log(runif(n, 0, 1))/lambda
  z[i] <- min(x)
}
par(mfrow = c(1, 2))
hist.scott(z, xlab = "z", prob = TRUE)
curve(dexp(x, rate = n*lambda), from = 1e-16, col = "red", add = TRUE)
plot(ecdf(z), xlab = "z")
curve(pexp(x, rate = n*lambda), col = "red", add = TRUE)
```

**It's your turn**

Now, let $U_1, \ldots, U_n$ be independent r.v. distributed as $Unif(0,1)$, and let $X_i = -\log(U_i)/\lambda$, $i = 1, \ldots, n$. Verify that the r.v. $Y = \sum_{i=1}^{n} X_i \, \sim \, \mathcal{Ga}(n, \lambda)$

```{r sum_exp, echo=TRUE}
B <- 10000
lambda <- 2
n <- 10
x <- rep(0, n)
y <- rep(0, B)
for(i in 1:B){
  x <- -log(runif(n, 0, 1))/lambda
  y[i] <- sum(x)
}
par(mfrow = c(1, 2))
hist.scott(y, xlab = "y", prob = TRUE)
curve(dgamma(x, shape = n, rate = lambda), col = "red", add = TRUE)
# Alternatively
# curve(dgamma(x, n, scale = 1/lambda), col = "red", add = TRUE)
plot(ecdf(y), xlab = "y")
curve(pgamma(x, shape = n, rate = lambda), col = "red", add = TRUE)
```

Now, we try to increase $n$.

```{r sum_exp2, echo=TRUE}
B <- 10000
lambda <- 2
n <- 30
x <- rep(0, n)
y <- rep(0, B)
for(i in 1:B){
  x <- -log(runif(n, 0, 1))/lambda
  y[i] <- sum(x)
}
par(mfrow = c(1, 2))
hist.scott(y, xlab = "y", prob = TRUE)
curve(dgamma(x, shape = n, rate = lambda), col = "red", add = TRUE)
curve(dnorm(x, mean = n/lambda, sd = sqrt(n/lambda^2)), col = "blue", add = TRUE)
plot(ecdf(y), xlab = "y")
curve(pgamma(x, shape = n, rate = lambda), col = "red", add = TRUE)
curve(pnorm(x, mean = n/lambda, sd = sqrt(n/lambda^2)), col = "blue", add = TRUE)
```

**Exercise (optional)**

Generate values from $X \, \sim \, \mathcal{Weibull}(\alpha, \mu)$, with $\alpha, \mu > 0$, using the inverse sampling method. The cumulative distribution function is

$$F_X(x) = \begin{cases} 0 & x \leq 0 \\ 1 - e^{-(x/\mu)^\alpha} & x > 0 \end{cases}$$

## Normal and Chi-square distributions

Let $X_1, X_2, \ldots, X_n$ be i.i.d. r.v. from $X \sim \mathcal{N}(0,1)$ and let $T_n = \sum_{i=1}^{n}{X^2_i}$. We know that $T_n \sim \chi^2_n$. Assuming $n = 10$, verify this relationship via Monte Carlo simulation.
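Recall that, if $T_n \sim \chi^2_n$, then

$$E(T_n) = n, \qquad Var(T_n) = 2n,$$

values we will also use below as theoretical references for the Monte Carlo summaries.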
```{r normal_chisq, echo=TRUE}
n <- 10     # n observations
B <- 10000  # B simulations
chisq.mc <- rep(0, B)
set.seed(1234)
for(j in 1:B){
  x <- rnorm(n, 0, 1)
  chisq.mc[j] <- sum(x^2)
}
par(mfrow = c(1, 2))
hist.scott(chisq.mc, prob = TRUE, xlab = "x")
curve(dchisq(x, df = n), add = TRUE, col = "red", lwd = 1.5)
plot(ecdf(chisq.mc), xlim = c(0, 25))
curve(pchisq(x, df = n), add = TRUE, col = "red", lwd = 1)
```

**It's your turn**

Following the setup already seen ($T_{10} \sim \chi^2_{10}$), write a function that compares the following sample statistics with their theoretical counterparts:

- sample mean
- sample variance
- sample quantiles of order 0.01 and 0.05

Such a function

- takes as **input** the number of simulations, say $B$
- returns as **output** a table with the sample and the theoretical results (4 rows and 2 columns)

```{r normal_chisq2, echo=TRUE}
compare_MC_chisq <- function(B){
  set.seed(B)
  n <- 10  # n observations
  chisq.mc <- rep(0, B)
  for(j in 1:B){
    x <- rnorm(n, 0, 1)
    chisq.mc[j] <- sum(x^2)
  }
  mean_chisq <- mean(chisq.mc)
  var_chisq <- var(chisq.mc)
  quantile_chisq <- quantile(chisq.mc, probs = c(0.01, 0.05))
  chisq.truevalues <- c(n, 2*n, qchisq(p = c(0.01, 0.05), df = n))
  chisq.mc.results <- c(mean_chisq, var_chisq, quantile_chisq)
  return(data.frame(MC = chisq.mc.results, TrueVal = chisq.truevalues,
                    row.names = c("mean", "var", "q0.01", "q0.05")))
  # Alternatively, with a matrix:
  # res <- matrix(0, 4, 2)
  # colnames(res) <- c("MC_res", "Trueval")
  # res[, 1] <- chisq.mc.results
  # res[, 2] <- chisq.truevalues
  # rownames(res) <- c("mean", "var", "q0.01", "q0.05")
  # return(res)
}
compare_MC_chisq(100)
compare_MC_chisq(1000)
compare_MC_chisq(10000)
```

## Distribution of the sample mean

Let's suppose:

- $x_1, \ldots, x_n$ is a random sample from $\mathcal{N}(0,1)$
- $x_1, \ldots, x_n$ is a random sample from $t_3$
- $x_1, \ldots, x_n$ is a random sample from $\mathcal{Unif}(0,1)$

What is the distribution of the sample mean $\overline X = (\sum_{i=1}^{n} X_i)/n$? Before answering, let's investigate this issue via Monte Carlo simulation. Fix $B = 1000$ and $n = 30$.
Remember that:

- if $X \sim t_k$, then $\mu_X = 0$, provided that $k > 1$, and $\sigma^2_X = k/(k-2)$, provided that $k > 2$
- if $X \sim \mathcal{Unif}(a,b)$, then $\mu_X = (a+b)/2$ and $\sigma^2_X = (b-a)^2/12$

```{r sample_mean, echo=TRUE}
n <- 30   # n observations
B <- 1000 # B simulations
set.seed(1234)

# sample generation
samples <- array(0, dim = c(n, 3, B))
for(i in 1:B){
  samples[, 1, i] <- rnorm(n, 0, 1)
  samples[, 2, i] <- rt(n, df = 3)
  samples[, 3, i] <- runif(n, 0, 1)
}

# sample statistics
samples_stat <- array(0, dim = c(2, 3, B))
for(j in 1:3){
  # sample mean
  samples_stat[1, j, ] <- apply(samples[, j, ], 2, mean)
  # sample variance
  samples_stat[2, j, ] <- apply(samples[, j, ], 2, var)
}

par(mfrow = c(1, 3))
hist.scott(samples_stat[1, 1, ], prob = TRUE, xlab = "x", main = "N(0,1)")
curve(dnorm(x, 0, 1/sqrt(n)), col = "red", lwd = 2, add = TRUE)
hist.scott(samples_stat[1, 2, ], prob = TRUE, xlab = "x", main = "t3")
curve(dnorm(x, 0, sqrt(3/n)), col = "red", lwd = 2, add = TRUE)
hist.scott(samples_stat[1, 3, ], prob = TRUE, xlab = "x", main = "U(0,1)")
curve(dnorm(x, 1/2, 1/sqrt(12*n)), col = "red", lwd = 2, add = TRUE)
```

From the theory:

- in the Gaussian case the sample mean follows the normal distribution exactly, namely $\overline X \, \sim \, \mathcal{N}(\mu, \sigma^2/n)$
- for the other distributions, by the CLT, this result holds asymptotically, that is $\overline X \, \dot \sim \, \mathcal{N}(\mu, \sigma^2/n)$

**Exercise (optional)**

Try to use the empirical cumulative distribution function and the quantile-quantile plot as alternatives to the histogram check.

## Distribution of the sample variance

From the theory, $S^2 = \frac{1}{n-1}{\sum_{i=1}^{n}(X_i - \overline X)^2}$ is an unbiased estimator ($E(\hat \theta) = \theta$) of the variance. The function $\mathsf{var()}$ applied to a sample vector returns this estimate. In the example above, we have already stored the sample variances.

If the assumed true model is normal, we know that the distribution of the sample variance is

$$\frac{n-1}{\sigma^2} S^2 \, \sim \chi^2_{n-1}$$

Otherwise, see [https://www.tandfonline.com/doi/abs/10.1080/00031305.2014.966589], results 13 and 14, for the asymptotic distributions. Namely,

$$\frac{S^2}{\sigma^2} k \, \dot \sim \, \chi^2_{k}$$

where $k = \frac{2\sigma^4}{Var(S^2)}$.

```{r sample_variance, echo=TRUE}
sigma2_n <- 1
sigma2_t <- 3     # 3/(3-2)
sigma2_u <- 1/12

par(mfrow = c(1, 3))
hist.scott(samples_stat[2, 1, ], prob = TRUE, xlab = expression(s^2), main = "N(0,1)")
curve((n-1)/sigma2_n * dchisq(x * (n-1)/sigma2_n, df = n - 1), add = TRUE, col = "red", lwd = 2)
hist.scott(samples_stat[2, 2, ], prob = TRUE, xlab = expression(s^2), main = "t(3)", xlim = c(0, 25))
curve((n-1)/sigma2_t * dchisq(x * (n-1)/sigma2_t, df = n - 1), add = TRUE, col = "red", lwd = 2)
curve(2*sigma2_t/var(samples_stat[2, 2, ]) * dchisq(x * 2*sigma2_t/var(samples_stat[2, 2, ]),
      df = 2*sigma2_t^2/var(samples_stat[2, 2, ])), add = TRUE, col = "green", lwd = 2)
hist.scott(samples_stat[2, 3, ], prob = TRUE, xlab = expression(s^2), main = "U(0,1)")
curve((n-1)/sigma2_u * dchisq(x * (n-1)/sigma2_u, df = n - 1), add = TRUE, col = "red", lwd = 2)
curve(2*sigma2_u/var(samples_stat[2, 3, ]) * dchisq(x * 2*sigma2_u/var(samples_stat[2, 3, ]),
      df = 2*sigma2_u^2/var(samples_stat[2, 3, ])), add = TRUE, col = "green", lwd = 2)
```

**It's your turn**

The estimator $S^2_b = \frac{1}{n} \sum_{i=1}^{n}{(X_i - \overline X)^2}$ is a biased estimator (asymptotically unbiased). Check the biased nature of $S^2_b$ via MC simulation, generating $n = 10$ i.i.d. values from a normal distribution.
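For reference, recall that under i.i.d. sampling

$$E(S^2_b) = \frac{n-1}{n}\,\sigma^2,$$

so the bias is $E(S^2_b) - \sigma^2 = -\sigma^2/n$, which vanishes as $n$ grows: this is what the simulation below should reveal.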
Compare the results with the unbiased estimator $S^2$.

```{r biased_sample_variance, echo=TRUE}
set.seed(1234)
B <- 1000
n <- 10
mu <- 5
sigma <- 2
s2 <- rep(0, B)
s2_b <- rep(0, B)
for(i in 1:B){
  y <- rnorm(n, mu, sigma)
  s2[i] <- var(y)
  s2_b[i] <- s2[i] * (n - 1)/n
}
s2_m <- mean(s2)
s2_b_m <- mean(s2_b)

par(mfrow = c(1, 2))
hist.scott(s2, xlab = expression(s^2), main = expression(s^2), prob = TRUE)
abline(v = sigma^2, col = "red")
abline(v = s2_m, col = "blue", lty = 2)
hist.scott(s2_b, xlab = expression(s[b]^2), main = expression(s[b]^2), prob = TRUE)
abline(v = sigma^2, col = "red")
abline(v = s2_b_m, col = "blue", lty = 2)
```
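As a quick numerical check, a minimal sketch reusing the quantities computed in the chunk above compares the Monte Carlo averages of the two estimators with the true variance $\sigma^2 = 4$:

```{r biased_variance_check, echo=TRUE}
# Monte Carlo estimates of E(S^2) and E(S^2_b) versus the true value sigma^2
round(c(true_sigma2 = sigma^2, mean_s2 = s2_m, mean_s2_b = s2_b_m), 3)
```

The unbiased estimator should be centred around 4, while the biased one should fall systematically below it, roughly by $\sigma^2/n = 0.4$.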