Basic concepts of estimation

Point estimation

Let \(\hat \theta\) be the point estimator for the parameter \(\theta\). An estimator is said to be

  • unbiased if and only if \(E(\hat \theta)=\theta\)
  • consistent if \(\hat \theta \stackrel{P}\rightarrow \theta\) as \(n \rightarrow \infty\); for an (asymptotically) unbiased estimator, a sufficient condition is \(\text{var}(\hat \theta) \rightarrow 0\) as \(n \rightarrow \infty\).

There are two properties we would like an estimator to have: low bias and low variance. Since there is typically a trade-off between the two, we usually aim for both only to some extent; ideally, we target a small Mean Squared Error (MSE)

\[\text{MSE}(\hat \theta)=E\{(\hat \theta-\theta)^2 \}= \{E(\hat \theta) - \theta\}^2+\text{var}(\hat \theta)=(\text{Bias})^2+\text{Variance} \]

Application: Comparison of four estimators of the population mean

As we already know from the theory, the sample mean is an excellent estimator of the population mean, but here we compare it with three alternatives. Consider the case of a normal distribution \(\mathcal{N}(\mu, \sigma^2)\) and the following estimators:

  • sample mean: \(\hat \mu_1=\frac{1}{n}\sum_{i=1}^{n}x_i\)
  • sample median: \(\hat \mu_2= \text{Median}(x)\)
  • the semi-sum of the sample minimum and the maximum \(\hat \mu_3=\frac{\min(x)+\max(x)}{2}=\frac{x_{(1)}+x_{(n)}}{2}\)
  • the trimmed sample mean \(\hat \mu_4=\frac{1}{n-2}\sum_{i=2}^{n-1}x_{(i)}\)

with \(x_{(i)}\) the \(i\)-th order statistic.

We aim to compare the four estimators in terms of unbiasedness and efficiency.

# Initial settings:
# Number of replications B = 1000
# Sample size n = 10
# Population mean mu = 5
# Population standard deviation sigma = 2 (variance = 4)

B <- 1000
n <- 10
mu <- 5
sigma <- 2

# Save the results in a matrix:
# Column 1: sample mean 
# Column 2: sample median
# Column 3: semi-sum of minimum and maximum
# Column 4: trimmed sample mean
est <- matrix(0,B,4)

set.seed(1234)
# For each replication we generate a sample of size n from
# a normal r.v. with mean mu and standard deviation sigma
# and we compute the four estimates
for (i in 1:B){
 x <- rnorm(n, mu, sigma)
 est[i,1] <- mean(x)
 est[i,2] <- median(x)
 est[i,3] <- (min(x)+max(x))/2
 est[i,4] <- sum(sort(x)[-c(1,n)])/(n-2)
}

# By using boxplots, we plot the distributions of the sample estimates
# Blue horizontal line is the true population mean
par(mfrow=c(1,1), xaxt="n")
boxplot(est, main="Comparison between four estimators")
par(xaxt="s")
axis(1, 1:4, c(expression(hat(mu)[1]), expression(hat(mu)[2]), expression(hat(mu)[3]), expression(hat(mu)[4])) )
abline(h=5, lwd=2, col="blue")

# Bias
bias <- apply(est, 2, mean) - mu
bias
## [1]  0.012231786  0.032274611 -0.005109588  0.016567130
# Variances
variance <- apply(est, 2, var)
variance
## [1] 0.3674287 0.5111770 0.7202595 0.3870022

All the estimators appear unbiased, and the sample mean registers the lowest sample variance, which is a good approximation of \(\sigma^2/n=0.4\).
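As a quick numerical check of the MSE decomposition above, we can combine the bias and variance vectors just computed and compare the result with a direct Monte Carlo estimate of the MSE (a small sketch using the objects already in the workspace):

# MSE via the bias-variance decomposition
mse_decomp <- bias^2 + variance
# direct Monte Carlo estimate: average squared error over the B replications
mse_direct <- apply((est - mu)^2, 2, mean)
# the two versions agree up to the (B-1)/B factor in the variance estimate
round(rbind(mse_decomp, mse_direct), 4)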

It’s your turn. Let’s now check whether all the estimators are consistent. For this purpose \(n=10\) is far too small, so we increase it to, say, \(n=200\). Compare the simulated values for \(n=10\) with those obtained for \(n=200\) by using histograms. What do you expect?

# We repeat the previous steps considering n=200
est_200 <- matrix(0, B, 4)
n <- 200

for (i in 1:B){
 x <- rnorm(n, mu, sigma)
 est_200[i , 1] <- mean(x)
 est_200[i , 2] <- median(x)
 est_200[i , 3] <- (max(x)+min(x))/2
 est_200[i , 4] <- sum(sort(x)[-c(1,n)])/(n-2)
}

# We plot the histograms of the sample estimates for n=10 and n=200
# Blue vertical line is the true population mean
# hist.scott() is provided by the MASS package
library(MASS)

# n = 10
par(mfrow=c(2,2))
for(j in 1:4){
    hist.scott(est[,j], prob = TRUE, 
               main = substitute(hat(mu)[j],
                               list(j = j)),
               xlab = "", xlim = c(0, 10), cex.main = 1.5)
    abline(v = mu, col = "blue", lwd = 2)
}

# n= 200
for(j in 1:4){
 hist.scott(est_200[,j],  prob = TRUE, 
            main=substitute(hat(mu)[j] ,
                            list(j = j)), 
            xlab = "", xlim = c(0, 10), cex.main = 1.5)
 abline(v = mu, col = "blue", lwd = 2)
}
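Besides the histograms, a direct comparison of the Monte Carlo variances for \(n=10\) and \(n=200\) is informative: for a consistent estimator, the variability should shrink as \(n\) grows. A small sketch reusing the two matrices of estimates:

# Monte Carlo variances of the four estimators for n = 10 and n = 200
round(rbind(n_10  = apply(est, 2, var),
            n_200 = apply(est_200, 2, var)), 4)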

Interval estimation

A single point value is in most cases a crude estimate. Rather, we may construct an interval for our parameter, providing an entire set of plausible values for it. Interval estimation is based on pivotal quantities.

Normal case with known population variance

In the normal case with \(\sigma^2\) known, the pivotal quantity for \(\mu\) is defined as: \[ T(\mu)=\frac{\overline X - \mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0,1), \quad \forall \mu \in \mathbb{R}, \sigma^2 > 0.\]

Then, it follows that for \(0<\alpha <1\) \[\text{Pr}(-z_{1-\alpha/2} \leq T(\mu) \leq z_{1-{\alpha/2}})=1-\alpha \] and a confidence interval of level \(1-\alpha\) for \(\mu\) is given by \[\text{CI}_\mu=\left(\bar x - z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}, \ \bar x + z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}\right)\] where \(-z_{1-\alpha/2}=z_{\alpha/2}\) and \(z_{1-\alpha/2}\) are the quantiles of order \(\alpha/2\) and \(1-\alpha/2\) of the standard normal distribution, respectively. Monte Carlo (MC) simulation is well suited for exploring interval estimation as well:

#normal case with sigma^2 known (we take mu = 5 and sigma = 2 initialized above)
B <- 1000
n <- 10

# 1-alpha is the confidence level
# CI is a matrix where we save the confidence interval for each replication:
# -) first column: lower bound
# -) second column: upper bound
# l is a vector whose elements are TRUE (1) or FALSE (0) depending on whether the true parameter value lies within the interval

alpha <- 0.05
CI <- matrix(0,B,2)
l <- rep(0,B)

set.seed(1234)
for (i in 1:B){
 x <- rnorm(n, mu, sigma)
 CI[i,1] <- mean(x) - qnorm(1-alpha/2)*sigma/sqrt(n)
 CI[i,2] <- mean(x) + qnorm(1-alpha/2)*sigma/sqrt(n)
 l[i] <- (mu > CI[i,1] & mu < CI[i,2]) 
}

#Empirical coverage probability
mean(l)
## [1] 0.955
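The empirical coverage 0.955 should be judged against the Monte Carlo error of the simulation itself: with B independent replications, the estimated coverage has a binomial standard error. A small sketch, assuming the nominal level \(1-\alpha\) is correct:

# Monte Carlo standard error of the estimated coverage probability
se_cov <- sqrt((1 - alpha) * alpha / B)
# rough band of plausible values around the nominal level
c(1 - alpha - 2 * se_cov, 1 - alpha + 2 * se_cov)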
#Plot the first 100 CIs:
# black: intervals that do not include mu
# red: intervals that include mu

plot(1, xlim = c(0, 10), ylim = c(0, 11), type = "n", xlab = expression(mu), ylab = "",
     main = "100 CIs for the mean (known variance)", cex.main = 1.2)
abline(v = mu)

d <- 0
for (i in 1:100){
    d <- d + 0.1 
    lines(seq(CI[i,1], CI[i,2], length = 100), rep(d, 100), col = (l[i] + 1))
}

sum(l[1:100])
## [1] 94

Normal case when the population variance is unknown

For the mean \(\mu\) of a normal distribution when \(\sigma^2\) is unknown and estimated by the sample variance, the pivotal quantity is: \[ T(\mu)=\frac{\bar X - \mu}{s/\sqrt{n}} \sim t_{n-1}, \quad \forall \mu \in \mathbb{R}, \sigma^2>0 \]

Then, it follows that for \(0<\alpha <1\) \[\text{Pr}(t_{n-1;\alpha/2} \leq T(\mu) \leq t_{n-1;1-{\alpha/2}})=1-\alpha \] and a confidence interval of level \(1-\alpha\) for \(\mu\) is given by \[\text{CI}_\mu=\left(\bar x - t_{n-1; 1-\alpha/2} \frac{s}{\sqrt{n}}, \ \bar x + t_{n-1;1-\alpha/2} \frac{s}{\sqrt{n}}\right)\] where \(-t_{n-1;1-\alpha/2}=t_{n-1;\alpha/2}\) and \(t_{n-1;1-\alpha/2}\) are the quantiles of order \(\alpha/2\) and \(1-\alpha/2\) of the Student's t distribution with \(n-1\) degrees of freedom, respectively.

#normal case with sigma^2 unknown 
B <- 1000
n <- 10

alpha <- 0.05
CI <- matrix(0, B, 2)
l <- rep(0, B)

set.seed(1234)
for (i in 1:B){
 x <- rnorm(n, mu, sigma)
 CI[i,1] <- mean(x) - qt(1-alpha/2, n-1)*sd(x)/sqrt(n)
 # CI[i,1] <- mean(x) + qt(alpha/2, n-1)*sd(x)/sqrt(n)
 CI[i,2] <- mean(x) + qt(1-alpha/2, n-1)*sd(x)/sqrt(n)
 l[i] <- (mu > CI[i,1] & mu < CI[i,2]) 
}

#Empirical coverage probability
mean(l)
## [1] 0.959
plot(1, xlim = c(0,10), ylim = c(0,11), type = "n", xlab = expression(mu), ylab = "", 
     main = "100 CIs for the mean (unknown variance)", cex.main=1.2)
abline(v = mu)

d <- 0
for (i in 1:100){
    d <- d + 0.1 
    lines(seq(CI[i,1], CI[i,2], length = 100), rep(d, 100), col = (l[i] + 1))
}

sum(l[1:100])
## [1] 93
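As a side experiment (not part of the original code), we can check what happens if the normal quantile is used in place of \(t_{n-1;1-\alpha/2}\) when \(\sigma^2\) is unknown and \(n\) is small: the resulting intervals are too short and tend to undercover. A sketch:

# same simulation, but (wrongly) using the normal quantile with the estimated sd
set.seed(1234)
l_z <- replicate(B, {
  x <- rnorm(n, mu, sigma)
  abs(mean(x) - mu) < qnorm(1 - alpha/2) * sd(x)/sqrt(n)
})
mean(l_z)  # typically below 1 - alpha for n = 10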

Application with real data

The dataset pair65 in the DAAG package contains measurements, in mm, on the members of nine pairs of elastic bands. One member of each pair was placed in hot water for four minutes, while the other was left at ambient temperature. After ten minutes the measurements were recorded; below we also compute the difference for each pair:

library(DAAG)
pair_data_frame <- cbind(pair65, pair65[,1] - pair65[,2])
pair_data_frame <- cbind(c(1:9), pair_data_frame)
dimnames(pair_data_frame) <- list(1:9, c("pair","heated", "ambient", "difference"))
pair heated ambient difference
   1    244     225         19
   2    255     247          8
   3    253     249          4
   4    254     253          1
   5    251     245          6
   6    269     259         10
   7    248     242          6
   8    252     255         -3
   9    292     286          6

We have paired observations, heated and ambient, on \(n=9\) pairs. We are interested in the sample mean difference \[\overline d=\overline x_{H}-\overline x_{A} = 6.33,\] with \(s=6.10\) the sample standard deviation of the differences and \(s/\sqrt{n}=2.03\) its standard error. The pivotal quantity is then:

\[ T(\mu)= \frac{\overline D -\mu}{s / \sqrt{n}} \sim t_{8}. \]

We can now compute confidence intervals at the \(95\%\) and \(99\%\) levels.

n <- nrow(pair65)

#compute quantiles for student t with n-1 =8 degrees of freedom
alpha <- 0.05
q_inf_95 <- qt(alpha/2, df = n -1)
q_sup_95 <- qt(1-alpha/2, df = n-1)

alpha <- 0.01
q_inf_99 <- qt(alpha/2, df = n-1)
q_sup_99 <- qt(1-alpha/2, df = n-1)


#compute the confidence intervals
d <- mean(pair_data_frame[,4])
d
## [1] 6.333333
s <- sd(pair_data_frame[,4])
s
## [1] 6.103278
# 95% confidence interval
CI_95 <- c(d + q_inf_95*s/sqrt(n), d + q_sup_95*s/sqrt(n))
CI_95
## [1]  1.641939 11.024728
# 99% confidence interval
CI_99 <- c(d + q_inf_99*s/sqrt(n), d + q_sup_99*s/sqrt(n))
CI_99
## [1] -0.4929537 13.1596203
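As a cross-check (a sketch, not part of the original analysis), the same intervals can be obtained by applying t.test() to the column of differences:

# the confidence intervals returned by t.test() should match CI_95 and CI_99
t.test(pair_data_frame[, "difference"], conf.level = 0.95)$conf.int
t.test(pair_data_frame[, "difference"], conf.level = 0.99)$conf.int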
library(RColorBrewer)
plotclr <- brewer.pal(6,"YlOrRd")

par(mfrow = c(1, 2))
# Plot of the probability density function of the pivotal quantity (Student's t with 8 df) with the endpoints that enclose 95% of the probability 
curve(dt(x, 8), xlim = c(-6, 6), ylim = c(0, 0.4),
      main = "", col = "blue", lwd = 3, xlab = "t", 
      ylab = expression(t[8]),  yaxs="i")
cord.x <- c(q_inf_95, seq(q_inf_95, q_sup_95, 0.01), q_sup_95)
cord.y <- c(0, dt(seq(q_inf_95, q_sup_95, 0.01), 8), 0)
polygon(cord.x, cord.y, col = plotclr[3], border = NA )

# Plot of the probability density function of the pivotal quantity (Student's t with 8 df) with the endpoints that enclose 99% of the probability 
curve(dt(x, 8), xlim = c(-6, 6), ylim = c(0,0.4), 
      main = "", col = "blue", lwd = 3, xlab="t",
      ylab = expression(t[8]),  yaxs="i")
cord.x2 <- c(q_inf_99, seq(q_inf_99, q_sup_99, 0.01), q_sup_99)
cord.y2 <- c(0, dt(seq(q_inf_99, q_sup_99, 0.01), 8), 0)
polygon(cord.x2, cord.y2, col = plotclr[5], border = NA )

Basic concepts of hypothesis testing

The null hypothesis for the parameter \(\theta\) is usually expressed as \[\text{H}_0: \theta = \theta_0\]

Complementary to the choice of \(\text{H}_0\), we have to specify the alternative hypothesis \(\text{H}_1\), i.e. the values of the parameter which become plausible when \(\text{H}_0\) does not hold. Usually \(\text{H}_1\) is one of:

  • two-sided: \(\text{H}_1: \theta \neq \theta_0\);
  • one-sided: \(\text{H}_1: \theta > \theta_0\) or \(\text{H}_1: \theta < \theta_0\).
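In R, the chosen alternative translates directly into the alternative argument of the standard testing functions. A generic illustration, using a simulated sample rather than data from this section:

# testing H0: mu = 0 against the three usual alternatives
x <- rnorm(20)
t.test(x, mu = 0, alternative = "two.sided")  # H1: mu != 0
t.test(x, mu = 0, alternative = "greater")    # H1: mu >  0
t.test(x, mu = 0, alternative = "less")       # H1: mu <  0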

Application: test for the mean difference

Consider the number of followers of the 15 most followed accounts on Instagram, expressed in millions, with each total rounded to the nearest million, as of March 31, 2018 (source: Wikipedia). Among them there are

  • 6 musicians
  • 9 non-musicians, referred to as others
Musicians <- c( "Selena Gomez",  "Ariana Grande", "Beyonce" , "Taylor Swift",  "Justin Bieber", "Nicki Minaj")
Others <- c("Cristiano Ronaldo", "Kim Kardashian",  "Kylie Jenner", "Dwayne Johnson", "Neymar", "Lionel Messi", 
  "Kendall Jenner", "Kourtney Kardashian", "Kevin Hart")

Followers_M <- c(135, 118, 113, 107,  98,  86)
Followers_O <- c(123, 110, 106, 103,  91,  89,  89,  62,  58)

Musicians           Followers_M
Selena Gomez                135
Ariana Grande               118
Beyonce                     113
Taylor Swift                107
Justin Bieber                98
Nicki Minaj                  86

Others              Followers_O
Cristiano Ronaldo           123
Kim Kardashian              110
Kylie Jenner                106
Dwayne Johnson              103
Neymar                       91
Lionel Messi                 89
Kendall Jenner               89
Kourtney Kardashian          62
Kevin Hart                   58

We could set up a test with the following aim: do the musicians, on average, have the same number of followers as the non-musicians, or do the musicians have more followers?

We suppose that \(X_i \sim \mathcal{N}(\mu_m, \sigma^2_m)\), \(i=1, \ldots, n_1\), and \(Y_i \sim \mathcal{N}(\mu_o, \sigma^2_o)\), \(i=1, \ldots, n_2\), are independent normal samples, where \(n_1=6\) and \(n_2=9\) denote the number of musicians and others, respectively. We aim to compare their means, \(\mu_m\) and \(\mu_o\), through the following one-sided two-sample test

\[ \begin{cases} H_0: \mu_{m}-\mu_{o} = 0 \\ H_1: \mu_{m}-\mu_{o} > 0 \\ \end{cases} \]

Let \(\overline X\) and \(\overline Y\) be the sample means for the musicians and the others, respectively. After a preliminary \(\text{F-test}\) on the variances of the two groups (a var.test() sketch is given below), we may assume that \(\sigma^2_m=\sigma^2_o\), and then the test statistic, under \(\text{H}_0\), has the form

\[T = \frac{\overline X - \overline Y}{s\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} \underset{H_{0}}{\sim}t_{n_1+n_2-2}\] with \(n_1+n_2-2=13\). In the previous formula the pooled standard deviation is: \[s = \sqrt{\frac{(n_1-1)s^2_M+(n_2-1)s^2_O}{n_1+n_2-2}} \] (see Maindonald and Braun: pages 67 and 104)
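The preliminary F-test on the two variances mentioned above can be carried out with var.test(); a sketch (its output is not reported here):

# F-test for H0: sigma^2_m = sigma^2_o (equality of variances)
var.test(Followers_M, Followers_O)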

## by hand
n1 <- length(Followers_M)
n2 <- length(Followers_O)

mean_M <- mean(Followers_M)
mean_O <- mean(Followers_O)

#pooled variance: see Data Analysis and Graphics Using R An Example Based Approach, by Maindonald and Braun: pages 104 and 67
v <- (var(Followers_M)*(n1 - 1) + var(Followers_O)*(n2 - 1))/(n1 + n2 - 2) 
v
## [1] 392.4231
test.stat <- (mean_M - mean_O)/(sqrt(v*(1/n1 + 1/n2)))
test.stat
## [1] 1.64422
pt(test.stat, 13, lower.tail = FALSE)
## [1] 0.06204155
# by using the t.test() function
t <- t.test(Followers_M, Followers_O, mu = 0, alternative = "greater", var.equal = TRUE) 
t
## 
##  Two Sample t-test
## 
## data:  Followers_M and Followers_O
## t = 1.6442, df = 13, p-value = 0.06204
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -1.322964       Inf
## sample estimates:
## mean of x mean of y 
## 109.50000  92.33333

The p-value is 0.06204, slightly greater than \(\alpha=0.05\): at the 5% level we do not reject \(\text{H}_0\).
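Equivalently, the decision can be based on the comparison between the observed statistic and the 5% critical value of the \(t_{13}\) distribution (a quick sketch):

# one-sided critical value at level alpha = 0.05
qt(0.95, df = 13)   # approximately 1.77
test.stat           # 1.64 < 1.77: do not reject H0 at the 5% level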

curve(dt(x, 13), xlim = c(-5, 5), ylim = c(0, 0.4), 
      main = "p-values and rejection region", col = "blue", 
      lwd = 2, xlab = "x-y",  ylab = expression(t[13]),  yaxs="i")
cord.x <- c(qt(0.95, 13),seq(qt(0.95, 13), 5, 0.01), 5)
cord.y <- c(0, dt(seq(qt(0.95, 13), 5, 0.01), 13), 0)
polygon(cord.x, cord.y, col = plotclr[3], border = NA )


abline(v = t$statistic, lty = 2, lwd = 2, col = "red")
text(0, 0.2, paste("Accept", expression(H0)))
text(2.7, 0.08, paste("Reject", expression(H0)))
text(as.double(t$statistic) - 0.15, 0.02, "t", col = "red", cex = 1.2)

Pearson’s chi-squared test

This is a class of tests applied to sets of categorical data to evaluate whether an observed difference between the sets arose by chance. It is suitable for unpaired data from large samples. Pearson’s chi-squared test is used to assess three types of comparison:

  • goodness of fit: establishes whether an observed frequency differs from a theoretical distribution;
  • homogeneity: test if two or more sub-groups of a population share the same distribution of a single categorical variable;
  • independence: determines whether two categorical variables are associated with one another in the population.

Goodness of fit: Darts challenge against one friend

Suppose that \(n\) observations \(x_1, \ldots, x_n\) are divided among \(K\) cells. The following test statistic is then defined:

\[X^2 = \sum_{k=1}^{K} \frac{(O_k-E_k)^2}{E_k} \underset{H_0}{\sim} \chi^2_{K-1}, \] where

  • \(O_k\) are the observed frequencies for cell \(k\)
  • \(E_k\) are the expected frequencies for cell \(k\), under \(\text{H}_0\)

But what is \(H_0\) here? For illustration purposes, suppose you are playing darts against another friend.

(Figure: darts target)

You suspect that your friend is not a great darts player, and that his shots during the game will hit the lowest-scoring zones with high probability and the highest-scoring ones with low probability. Translated into probability terms, you divide the darts target into \(K=4\) zones and assign the following hitting probabilities:

  • Zone 1 (from 1 to 3 points): \(p_1=7/16\);
  • Zone 2 (from 4 to 6 points): \(p_2=5/16\);
  • Zone 3 (from 7 to 9 points): \(p_3=3/16\);
  • Zone 4 (the highest points in the middle of the target, let’s say 10, 25 and 50 points): \(p_4=1/16\).

Your null hypothesis is that, due to his limited darts skills, he has decreasing probabilities of hitting the best zones:

\[ \text{H}_0: p_1=7/16, \quad p_2=5/16, \quad p_3=3/16, \quad p_4=1/16.\]

Any significant deviation from the above probability distribution would lead to rejection of the null hypothesis. To check your assumption, you record your opponent's first \(n=50\) attempts \(x_1, \ldots, x_n\), coding \(x_i=k\) if the \(i\)-th shot hits the \(k\)-th zone.

# Initial settings:
# number of attempts n = 50
# number of zones K = 4
# vector of probabilities p = c( 7/16, 5/16, 3/16, 1/16)
n <- 50
K <- 4
p <- c( 7/16, 5/16, 3/16, 1/16)

# generate the values: let's suppose that we are sampling from p (i.e. H0 is true)
set.seed(1234)
x <- sample(1:K, n, replace = TRUE, prob = p)

# observed (absolute) frequencies
observed <- table(x)
observed
## x
##  1  2  3  4 
## 23 17  9  1
# expected (absolute) frequencies
expected <- n*p
expected
## [1] 21.875 15.625  9.375  3.125
# observed test statistic
X2 <- sum( (observed - expected)^(2)/expected)
X2
## [1] 1.638857
#manually compute the p-value
pchisq(X2, df = K - 1, lower.tail = FALSE )
## [1] 0.6506117
#same result with the chisq.test function
chisq.test(observed, p = p)
## 
##  Chi-squared test for given probabilities
## 
## data:  observed
## X-squared = 1.6389, df = 3, p-value = 0.6506

Yes, your assumption may be retained: your friend is hitting the target according to your probabilities… but he wants to play again and take on another challenge. Before starting, he asks for an energy drink. You are ready to count his next 50 attempts. Does the drink improve his performance?

# Let's suppose the new shots are generated according to a different vector of probabilities
p2 <- c(5/16, 3/16, 6/16, 2/16)
x_after_energy_drink <- sample(1:K, n, replace = TRUE, prob = p2)

new_observed <- table(x_after_energy_drink)
new_observed
## x_after_energy_drink
##  1  2  3  4 
## 12  7 26  5
# new test after the energy drink
chisq.test(new_observed, p = p)
## 
##  Chi-squared test for given probabilities
## 
## data:  new_observed
## X-squared = 39.826, df = 3, p-value = 1.16e-08

Your friend improved his performance! His new shots do not follow your assumed distribution: apparently, the energy drink was indeed needed.

Homogeneity: Darts challenge with more friends

Suppose now that five other friends join your opponent, for a total of \(M=6\) players. Do they all share the same distribution of hits over the four zones? The test statistic is now \[X^2=\sum_{k=1}^{K}\sum_{m=1}^{M} \frac{(O_{k,m}-E_{k,m})^2}{E_{k,m}} \underset{H_0}{\sim} \chi^2_{(K-1)(M-1)}.\] For each of them, you count the first 50 shots:

# Initial settings:
# number of attempts n = 50
# number of zones K = 4
# number of friends M = 6
n <- 50
K <- 4
M <- 6
x <- matrix(0, M, n)
p <- c(7/16, 5/16, 3/16, 1/16)

set.seed(123)
# generate the values
for (m in 1:M){
  x[m, ] <- sample(1:K, n, replace = TRUE, prob = p)
} 
# K x M table of observed frequencies (one column per friend)
observed_matrix <- apply(x, 1, table)
# for a matrix input, chisq.test() performs the homogeneity test (the p argument is ignored)
chisq.test(observed_matrix, p = p)
## 
##  Pearson's Chi-squared test
## 
## data:  observed_matrix
## X-squared = 12.743, df = 15, p-value = 0.6221

Yes, the test suggests that all of your friends hit the darts target homogeneously.

It’s your turn

What happens if a great player decides to join you, now? Try to simulate the data and perform the test again.

M <- 7 
n <- 50
K <- 4 
x <- matrix(0, M, n )
p <- c(7/16, 5/16, 3/16, 1/16)

set.seed(123)
# the 6 friends
for (m in 1:(M-1)){
 x[m, ] <- sample(1:K, n, replace=TRUE, prob = p )
} 

x[M,] <- sample(1:K, n, replace = TRUE, prob = c(1/16, 3/16, 5/16, 7/16)) #great player

observed_matrix <- apply(x, 1, table)
observed_matrix
##   [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## 1   21   23   22   21   23   25    3
## 2   15   14   16   21   17   11    8
## 3    9   12    7    7    6   11   17
## 4    5    1    5    1    4    3   22
# test homogeneity (again, p is ignored for matrix input)
chisq.test(observed_matrix, p = p)
## 
##  Pearson's Chi-squared test
## 
## data:  observed_matrix
## X-squared = 88.166, df = 18, p-value = 3.077e-11

Extra exercise: Correlation test

Sometimes it may be useful to assess the degree of association, or correlation, between paired samples, by using the Pearson, the Kendall’s \(\tau\) or the Spearman’s \(\rho\) correlation coefficient. Regardless of the adopted coefficient, the null hypothesis for a given correlation coefficient \(\rho\) is: \[ \text{H}_0: \rho=0.\] The test statistic is then defined as \[T=r\sqrt{\frac{n-2}{1-r^2}} \underset{H_0}{\sim} t_{n-2},\] where \(r=\text{corr}(X,Y)\) is the Pearson correlation coefficient. Suppose we have two samples of the same length, \(x_1, \ldots, x_n\) and \(y_1, \ldots, y_n\), and we want to measure the association between them. Once we compute the observed test statistic \(t_{obs}\), we may then compute the \(p\)-value, which, for a two-sided test, is obtained as:

\[p\text{-value}=2\,\text{Pr}_{H_0}(T \geq |t_{obs}|). \]
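Before turning to the data, here is a minimal sketch on simulated data (the vectors x and y below are purely illustrative) comparing the manual computation of \(t_{obs}\) and its \(p\)-value with cor.test():

set.seed(1)
x <- rnorm(30)
y <- 0.5 * x + rnorm(30)                # artificially associated with x
n <- length(x)
r <- cor(x, y)                          # Pearson correlation coefficient
t_obs <- r * sqrt((n - 2)/(1 - r^2))    # observed test statistic
2 * pt(abs(t_obs), df = n - 2, lower.tail = FALSE)  # two-sided p-value
cor.test(x, y)$p.value                  # same value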

Consider now some of the most followed Instagram accounts in 2018: for each of the owners, we also report the number of Twitter followers (in millions). Are the Instagram and Twitter accounts somehow associated? Perform a correlation test, compute the \(p\)-value and give an answer. Here are the data.

      Owners <- c( "Katy Perry", "Justin Bieber", "Taylor Swift", "Cristiano Ronaldo",
                   "Kim Kardashian", "Ariana Grande", "Selena Gomez", "Demi Lovato")
      Instagram <- c( 69, 98,107, 123, 110, 118, 135, 67)
      Twitter <- c( 109, 106, 86, 72, 59, 57, 56, 56)
      plot( Instagram, Twitter, pch=21, bg=2, xlim=c(60, 150), ylim=c(40, 120) )
      text( Instagram[-6], Twitter[-6]+5, Owners[-6], cex=0.8 )
      text( Instagram[6], Twitter[6]-5, Owners[6], cex=0.8 )

Owners            Instagram Twitter
Katy Perry               69     109
Justin Bieber            98     106
Taylor Swift            107      86
Cristiano Ronaldo       123      72
Kim Kardashian          110      59
Ariana Grande           118      57
Selena Gomez            135      56
Demi Lovato              67      56

Extra: Independent trials: weak law of large numbers

Consider Bernoulli trials \(X_1, \ldots, X_n\), with mean \(p\) and variance \(p(1-p)\). In this setting the weak law of large numbers holds: \[ \overline{X}_n \overset{P}{\rightarrow} p, \ \mbox{as } n \rightarrow \infty,\] where the notation \(\overset{P}{\rightarrow}\) means convergence in probability. We will write a function that takes \(n\) and \(p\) as input.

law_large_numb <- function(n, p){
  x<-rbinom(n, 1, p)
  return(mean(x))
}
law_large_numb <- Vectorize(law_large_numb)

We can graphically evaluate the convergence of the sample mean to \(p\), as a function of \(n\).

set.seed(12345) 
par(mfrow = c(1, 1))
curve(law_large_numb(x, p = 0.5), from = 1, to = 1000, xlab = "n", 
      ylab = "frequency")
abline(a = 0.5, b = 0, lwd = 2, col = "red")
legend(80, 0.9, lty = 1, col = c("black", "red"), 
       lwd  =c(1, 2), c( "X/n", "p"))

Relatively few trials are needed for the sample mean to settle near \(p\).
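To make the convergence in probability explicit, we can also estimate \(\Pr(|\overline X_n - p| > \varepsilon)\) for increasing \(n\); a sketch with the illustrative choice \(\varepsilon = 0.05\):

eps <- 0.05
p <- 0.5
for (n in c(10, 100, 1000, 10000)){
  xbar <- replicate(2000, mean(rbinom(n, 1, p)))
  cat("n =", n, " Pr(|Xbar - p| >", eps, ") ~", mean(abs(xbar - p) > eps), "\n")
}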