Basic concepts of estimation

Point estimation

Let \(\hat \theta\) be the point estimator for the parameter \(\theta\). An estimator is said to be

  • unbiased if and only if \(E(\hat \theta)=\theta\)
  • consistent if \(\hat \theta \stackrel{P}\rightarrow \theta\) as \(n \rightarrow \infty\); for an (asymptotically) unbiased estimator, a sufficient condition is \(\text{var}(\hat \theta) \rightarrow 0\) as \(n \rightarrow \infty\).

There are two properties we would like an estimator to have: low bias and low variance. Since there is typically a trade-off between the two, we usually aim for both only to some extent; ideally, we target a small Mean Squared Error (MSE)

\[\text{MSE}(\hat \theta)=E\{(\hat \theta-\theta)^2 \}= \{E(\hat \theta) - \theta\}^2+\text{var}(\hat \theta)=(\text{Bias})^2+\text{Variance} \]

Application: Comparison of four estimators of the population mean

As we already know from the theory, the sample mean is an excellent estimator of the population mean, but here we compare it with three alternatives. Consider the case of a normal distribution \(\mathcal{N}(\mu, \sigma^2)\) and the following estimators:

  • sample mean: \(\hat \mu_1=\frac{1}{n}\sum_{i=1}^{n}x_i\)
  • sample median: \(\hat \mu_2= \text{Median}(x)\)
  • the semi-sum of the sample minimum and the maximum \(\hat \mu_3=\frac{\min(x)+\max(x)}{2}=\frac{x_{(1)}+x_{(n)}}{2}\)
  • the trimmed sample mean \(\hat \mu_4=\frac{1}{n-2}\sum_{i=2}^{n-1}x_{(i)}\)

with \(x_{(i)}\) the \(i\)-th order statistic.

We aim to compare the four estimators in terms of unbiasedness and efficiency.

# Initial settings:
# Number of replications B = 1000
# Sample size n = 10
# Population mean mu = 5
# Population standard deviation sigma = 2 (variance = 4)

B <- 1000
n <- 10
mu <- 5
sigma <- 2

# Save the results in a matrix:
# Column 1: sample mean 
# Column 2: sample median
# Column 3: semi-sum of minimum and maximum
# Column 4: trimmed sample mean
est <- matrix(0,B,4)

set.seed(1234)
# For each replication we generate a sample of size n from
# a normal r.v. with mean mu and standard deviation sigma
# and we compute the four estimates
for (i in 1:B){
 x <- rnorm(n, mu, sigma)
 est[i,1] <- mean(x)
 est[i,2] <- median(x)
 est[i,3] <- (min(x)+max(x))/2
 est[i,4] <- sum(sort(x)[-c(1,n)])/(n-2)
}

# By using boxplots, we plot the distributions of the sample estimates
# Blue horizontal line is the true population mean
par(mfrow=c(1,1), xaxt="n")
boxplot(est, main="Comparison between four estimators")
par(xaxt="s")
axis(1, 1:4, c(expression(hat(mu)[1]), expression(hat(mu)[2]), expression(hat(mu)[3]), expression(hat(mu)[4])) )
abline(h=5, lwd=2, col="blue")

# Bias
bias <- apply(est, 2, mean) - mu
bias
## [1]  0.012231786  0.032274611 -0.005109588  0.016567130
# Variances
variance <- apply(est, 2, var)
variance
## [1] 0.3674287 0.5111770 0.7202595 0.3870022

All the estimators appear unbiased, and the sample mean registers the lowest sample variance, which is a good approximation of \(\sigma^2/n=0.4\).
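As a quick numerical check of the MSE decomposition above, we can combine the bias and variance vectors just computed and compare the result with a direct Monte Carlo estimate of the MSE (a small sketch using the objects already in the workspace):

# MSE via the bias-variance decomposition
mse_decomp <- bias^2 + variance
# direct Monte Carlo estimate: average squared error over the B replications
mse_direct <- apply((est - mu)^2, 2, mean)
# the two versions agree up to the (B-1)/B factor in the variance estimate
round(rbind(mse_decomp, mse_direct), 4)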

It’s your turn. Let’s now check whether all the estimators are consistent. For this purpose \(n=10\) is far too small, so we increase it to, say, \(n=200\). Compare the simulated values for \(n=10\) with those obtained for \(n=200\) by using histograms. What do you expect?

# We repeat the previous steps considering n=200
est_200 <- matrix(0, B, 4)
n <- 200

for (i in 1:B){
 x <- rnorm(n, mu, sigma)
 est_200[i , 1] <- mean(x)
 est_200[i , 2] <- median(x)
 est_200[i , 3] <- (max(x)+min(x))/2
 est_200[i , 4] <- sum(sort(x)[-c(1,n)])/(n-2)
}

# We plot the histograms of the sample estimates for n=10 and n=200
# Blue vertical line is the true population mean
# hist.scott() is provided by the MASS package
library(MASS)

# n = 10
par(mfrow=c(2,2))
for(j in 1:4){
    hist.scott(est[,j], prob = TRUE, 
               main = substitute(hat(mu)[j],
                               list(j = j)),
               xlab = "", xlim = c(0, 10), cex.main = 1.5)
    abline(v = mu, col = "blue", lwd = 2)
}

# n= 200
for(j in 1:4){
 hist.scott(est_200[,j],  prob = TRUE, 
            main=substitute(hat(mu)[j] ,
                            list(j = j)), 
            xlab = "", xlim = c(0, 10), cex.main = 1.5)
 abline(v = mu, col = "blue", lwd = 2)
}
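Besides the histograms, a direct comparison of the Monte Carlo variances for \(n=10\) and \(n=200\) is informative: for a consistent estimator, the variability should shrink as \(n\) grows. A small sketch reusing the two matrices of estimates:

# Monte Carlo variances of the four estimators for n = 10 and n = 200
round(rbind(n_10  = apply(est, 2, var),
            n_200 = apply(est_200, 2, var)), 4)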

Interval estimation

A single point value is in most cases a crude estimate. Rather, we may construct an interval for our parameter, providing an entire set of plausible values for it. Interval estimation is based on pivotal quantities.

Normal case with known population variance

In the normal case with \(\sigma^2\) known, the pivotal quantity for \(\mu\) is defined as: \[ T(\mu)=\frac{\overline X - \mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0,1), \quad \forall \mu \in \mathbb{R}, \sigma^2 > 0.\]

Then, it follows that for \(0<\alpha <1\) \[\text{Pr}(-z_{1-\alpha/2} \leq T(\mu) \leq z_{1-{\alpha/2}})=1-\alpha \] and a confidence interval of level \(1-\alpha\) for \(\mu\) is given by \[\text{CI}_\mu=\left(\bar x - z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}, \ \bar x + z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}\right)\] where \(-z_{1-\alpha/2}=z_{\alpha/2}\) and \(z_{1-\alpha/2}\) are the quantiles of order \(\alpha/2\) and \(1-\alpha/2\) of the standard normal distribution, respectively. Monte Carlo (MC) simulation is well suited for exploring interval estimation as well:

#normal case with sigma^2 known (we take mu = 5 and sigma = 2 initialized above)
B <- 1000
n <- 10

# 1-alpha is the confidence level
# CI is a matrix where we save the confidence interval for each replication:
# -) first column: lower bound
# -) second column: upper bound
# l is a vector whose elements are TRUE (1) or FALSE (0) depending on whether the true parameter value lies within the interval

alpha <- 0.05
CI <- matrix(0,B,2)
l <- rep(0,B)

set.seed(1234)
for (i in 1:B){
 x <- rnorm(n, mu, sigma)
 CI[i,1] <- mean(x) - qnorm(1-alpha/2)*sigma/sqrt(n)
 CI[i,2] <- mean(x) + qnorm(1-alpha/2)*sigma/sqrt(n)
 l[i] <- (mu > CI[i,1] & mu < CI[i,2]) 
}

#Empirical coverage probability
mean(l)
## [1] 0.955
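The empirical coverage 0.955 should be judged against the Monte Carlo error of the simulation itself: with B independent replications, the estimated coverage has a binomial standard error. A small sketch, assuming the nominal level \(1-\alpha\) is correct:

# Monte Carlo standard error of the estimated coverage probability
se_cov <- sqrt((1 - alpha) * alpha / B)
# rough band of plausible values around the nominal level
c(1 - alpha - 2 * se_cov, 1 - alpha + 2 * se_cov)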
#Plot the first 100 CIs:
# black: intervals that do not include mu
# red: intervals that include mu

plot(1, xlim = c(0, 10), ylim = c(0, 11), type = "n", xlab = expression(mu), ylab = "",
     main = "100 CIs for the mean (known variance)", cex.main = 1.2)
abline(v = mu)

d <- 0
for (i in 1:100){
    d <- d + 0.1 
    lines(seq(CI[i,1], CI[i,2], length = 100), rep(d, 100), col = (l[i] + 1))
}

sum(l[1:100])
## [1] 94

Normal case when the population variance is unknown

For the mean \(\mu\) of a normal distribution when \(\sigma^2\) is unknown and estimated by the sample variance, the pivotal quantity is: \[ T(\mu)=\frac{\bar X - \mu}{s/\sqrt{n}} \sim t_{n-1}, \quad \forall \mu \in \mathbb{R}, \sigma^2>0 \]

Then, it follows that for \(0<\alpha <1\) \[\text{Pr}(t_{n-1;\alpha/2} \leq T(\mu) \leq t_{n-1;1-{\alpha/2}})=1-\alpha \] and a confidence interval of level \(1-\alpha\) for \(\mu\) is given by \[\text{CI}_\mu=\left(\bar x - t_{n-1; 1-\alpha/2} \frac{s}{\sqrt{n}}, \ \bar x + t_{n-1;1-\alpha/2} \frac{s}{\sqrt{n}}\right)\] where \(-t_{n-1;1-\alpha/2}=t_{n-1;\alpha/2}\) and \(t_{n-1;1-\alpha/2}\) are the quantiles of order \(\alpha/2\) and \(1-\alpha/2\) of the Student's t distribution with \(n-1\) degrees of freedom, respectively.

#normal case with sigma^2 unknown 
B <- 1000
n <- 10

alpha <- 0.05
CI <- matrix(0, B, 2)
l <- rep(0, B)

set.seed(1234)
for (i in 1:B){
 x <- rnorm(n, mu, sigma)
 CI[i,1] <- mean(x) - qt(1-alpha/2, n-1)*sd(x)/sqrt(n)
 # CI[i,1] <- mean(x) + qt(alpha/2, n-1)*sd(x)/sqrt(n)
 CI[i,2] <- mean(x) + qt(1-alpha/2, n-1)*sd(x)/sqrt(n)
 l[i] <- (mu > CI[i,1] & mu < CI[i,2]) 
}

#Empirical coverage probability
mean(l)
## [1] 0.959
plot(1, xlim = c(0,10), ylim = c(0,11), type = "n", xlab = expression(mu), ylab = "", 
     main = "100 CIs for the mean (unknown variance)", cex.main=1.2)
abline(v = mu)

d <- 0
for (i in 1:100){
    d <- d + 0.1 
    lines(seq(CI[i,1], CI[i,2], length = 100), rep(d, 100), col = (l[i] + 1))
}

sum(l[1:100])
## [1] 93
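As a side experiment (not part of the original code), we can check what happens if the normal quantile is used in place of \(t_{n-1;1-\alpha/2}\) when \(\sigma^2\) is unknown and \(n\) is small: the resulting intervals are too short and tend to undercover. A sketch:

# same simulation, but (wrongly) using the normal quantile with the estimated sd
set.seed(1234)
l_z <- replicate(B, {
  x <- rnorm(n, mu, sigma)
  abs(mean(x) - mu) < qnorm(1 - alpha/2) * sd(x)/sqrt(n)
})
mean(l_z)  # typically below 1 - alpha for n = 10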

Application with real data

The dataset pair65 in the DAAG package contains measurements, in mm, on the members of nine pairs of elastic bands. One member of each pair was placed in hot water for four minutes, while the other was left at ambient temperature. After ten minutes the measurements were recorded; below we also compute the difference for each pair:

library(DAAG)
pair_data_frame <- cbind(pair65, pair65[,1] - pair65[,2])
pair_data_frame <- cbind(c(1:9), pair_data_frame)
dimnames(pair_data_frame) <- list(1:9, c("pair","heated", "ambient", "difference"))
pair heated ambient difference
   1    244     225         19
   2    255     247          8
   3    253     249          4
   4    254     253          1
   5    251     245          6
   6    269     259         10
   7    248     242          6
   8    252     255         -3
   9    292     286          6

We have paired observations, heated and ambient, on \(n=9\) pairs. We are interested in the sample mean difference \[\overline d=\overline x_{H}-\overline x_{A} = 6.33,\] with \(s=6.10\) the sample standard deviation of the differences and \(s/\sqrt{n}=2.03\) its standard error. The pivotal quantity is then:

\[ T(\mu)= \frac{\overline D -\mu}{s / \sqrt{n}} \sim t_{8}. \]

We can now compute confidence intervals at the \(95\%\) and \(99\%\) levels.

n <- nrow(pair65)

#compute quantiles for student t with n-1 =8 degrees of freedom
alpha <- 0.05
q_inf_95 <- qt(alpha/2, df = n -1)
q_sup_95 <- qt(1-alpha/2, df = n-1)

alpha <- 0.01
q_inf_99 <- qt(alpha/2, df = n-1)
q_sup_99 <- qt(1-alpha/2, df = n-1)


#compute the confidence intervals
d <- mean(pair_data_frame[,4])
d
## [1] 6.333333
s <- sd(pair_data_frame[,4])
s
## [1] 6.103278
# 95% confidence interval
CI_95 <- c(d + q_inf_95*s/sqrt(n), d + q_sup_95*s/sqrt(n))
CI_95
## [1]  1.641939 11.024728
# 99% confidence interval
CI_99 <- c(d + q_inf_99*s/sqrt(n), d + q_sup_99*s/sqrt(n))
CI_99
## [1] -0.4929537 13.1596203
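As a cross-check (a sketch, not part of the original analysis), the same intervals can be obtained by applying t.test() to the column of differences:

# the confidence intervals returned by t.test() should match CI_95 and CI_99
t.test(pair_data_frame[, "difference"], conf.level = 0.95)$conf.int
t.test(pair_data_frame[, "difference"], conf.level = 0.99)$conf.int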
library(RColorBrewer)
plotclr <- brewer.pal(6,"YlOrRd")

par(mfrow = c(1, 2))
# Plot of the probability density function of the pivotal quantity (Student's t with 8 df) with the endpoints that enclose 95% of the probability 
curve(dt(x, 8), xlim = c(-6, 6), ylim = c(0, 0.4),
      main = "", col = "blue", lwd = 3, xlab = "t", 
      ylab = expression(t[8]),  yaxs="i")
cord.x <- c(q_inf_95, seq(q_inf_95, q_sup_95, 0.01), q_sup_95)
cord.y <- c(0, dt(seq(q_inf_95, q_sup_95, 0.01), 8), 0)
polygon(cord.x, cord.y, col = plotclr[3], border = NA )

# Plot of the probability density function of the pivotal quantity (Student's t with 8 df) with the endpoints that enclose 99% of the probability 
curve(dt(x, 8), xlim = c(-6, 6), ylim = c(0,0.4), 
      main = "", col = "blue", lwd = 3, xlab="t",
      ylab = expression(t[8]),  yaxs="i")
cord.x2 <- c(q_inf_99, seq(q_inf_99, q_sup_99, 0.01), q_sup_99)
cord.y2 <- c(0, dt(seq(q_inf_99, q_sup_99, 0.01), 8), 0)
polygon(cord.x2, cord.y2, col = plotclr[5], border = NA )

Basic concepts of hypothesis testing

The null hypothesis for the parameter \(\theta\) is usually expressed as \[\text{H}_0: \theta = \theta_0\]

Complementary to the choice of \(\text{H}_0\), we have to specify the alternative hypothesis \(\text{H}_1\), i.e. the values of the parameter which become plausible when \(\text{H}_0\) does not hold. Usually \(\text{H}_1\) is one of:

  • two-sided: \(\text{H}_1: \theta \neq \theta_0\);
  • one-sided: \(\text{H}_1: \theta > \theta_0\) or \(\text{H}_1: \theta < \theta_0\).
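In R, the chosen alternative translates directly into the alternative argument of the standard testing functions. A generic illustration, using a simulated sample rather than data from this section:

# testing H0: mu = 0 against the three usual alternatives
x <- rnorm(20)
t.test(x, mu = 0, alternative = "two.sided")  # H1: mu != 0
t.test(x, mu = 0, alternative = "greater")    # H1: mu >  0
t.test(x, mu = 0, alternative = "less")       # H1: mu <  0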

Application: test for the mean difference

Consider the number of followers of the 15 most followed accounts on Instagram, expressed in millions, with each total rounded to the nearest million, as of March 31, 2018 (source: Wikipedia). Among them there are

  • 6 musicians
  • 9 non-musicians, referred to as others
Musicians <- c( "Selena Gomez",  "Ariana Grande", "Beyonce" , "Taylor Swift",  "Justin Bieber", "Nicki Minaj")
Others <- c("Cristiano Ronaldo", "Kim Kardashian",  "Kylie Jenner", "Dwayne Johnson", "Neymar", "Lionel Messi", 
  "Kendall Jenner", "Kourtney Kardashian", "Kevin Hart")

Followers_M <- c(135, 118, 113, 107,  98,  86)
Followers_O <- c(123, 110, 106, 103,  91,  89,  89,  62,  58)

Musicians           Followers_M
Selena Gomez                135
Ariana Grande               118
Beyonce                     113
Taylor Swift                107
Justin Bieber                98
Nicki Minaj                  86

Others              Followers_O
Cristiano Ronaldo           123
Kim Kardashian              110
Kylie Jenner                106
Dwayne Johnson              103
Neymar                       91
Lionel Messi                 89
Kendall Jenner               89
Kourtney Kardashian          62
Kevin Hart                   58

We could set up a test with the following aim: do the musicians, on average, have the same number of followers as the non-musicians, or do the musicians have more followers?

We suppose that \(X_i \sim \mathcal{N}(\mu_m, \sigma^2_m)\), \(i=1, \ldots, n_1\), and \(Y_i \sim \mathcal{N}(\mu_o, \sigma^2_o)\), \(i=1, \ldots, n_2\), are independent normal samples, where \(n_1=6\) and \(n_2=9\) denote the number of musicians and others, respectively. We aim to compare their means, \(\mu_m\) and \(\mu_o\), through the following one-sided two-sample test

\[ \begin{cases} H_0: \mu_{m}-\mu_{o} = 0 \\ H_1: \mu_{m}-\mu_{o} > 0 \\ \end{cases} \]

Let \(\overline X\) and \(\overline Y\) be the sample means for the musicians and the others, respectively. After a preliminary \(\text{F-test}\) on the variances of the two groups (a var.test() sketch is given below), we may assume that \(\sigma^2_m=\sigma^2_o\), and then the test statistic, under \(\text{H}_0\), has the form

\[T = \frac{\overline X - \overline Y}{s\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} \underset{H_{0}}{\sim}t_{n_1+n_2-2}\] with \(n_1+n_2-2=13\). In the previous formula the pooled standard deviation is: \[s = \sqrt{\frac{(n_1-1)s^2_M+(n_2-1)s^2_O}{n_1+n_2-2}} \] (see Maindonald and Braun: pages 67 and 104)
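The preliminary F-test on the two variances mentioned above can be carried out with var.test(); a sketch (its output is not reported here):

# F-test for H0: sigma^2_m = sigma^2_o (equality of variances)
var.test(Followers_M, Followers_O)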

## by hand
n1 <- length(Followers_M)
n2 <- length(Followers_O)

mean_M <- mean(Followers_M)
mean_O <- mean(Followers_O)

#pooled variance: see Data Analysis and Graphics Using R An Example Based Approach, by Maindonald and Braun: pages 104 and 67
v <- (var(Followers_M)*(n1 - 1) + var(Followers_O)*(n2 - 1))/(n1 + n2 - 2) 
v
## [1] 392.4231
test.stat <- (mean_M - mean_O)/(sqrt(v*(1/n1 + 1/n2)))
test.stat
## [1] 1.64422
pt(test.stat, 13, lower.tail = FALSE)
## [1] 0.06204155
# by using the t.test() function
t <- t.test(Followers_M, Followers_O, mu = 0, alternative = "greater", var.equal = TRUE) 
t
## 
##  Two Sample t-test
## 
## data:  Followers_M and Followers_O
## t = 1.6442, df = 13, p-value = 0.06204
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
##  -1.322964       Inf
## sample estimates:
## mean of x mean of y 
## 109.50000  92.33333

The p-value is 0.06204, slightly greater than \(\alpha=0.05\): at the 5% level we do not reject \(\text{H}_0\).
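Equivalently, the decision can be based on the comparison between the observed statistic and the 5% critical value of the \(t_{13}\) distribution (a quick sketch):

# one-sided critical value at level alpha = 0.05
qt(0.95, df = 13)   # approximately 1.77
test.stat           # 1.64 < 1.77: do not reject H0 at the 5% level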

curve(dt(x, 13), xlim = c(-5, 5), ylim = c(0, 0.4), 
      main = "p-values and rejection region", col = "blue", 
      lwd = 2, xlab = "x-y",  ylab = expression(t[13]),  yaxs="i")
cord.x <- c(qt(0.95, 13),seq(qt(0.95, 13), 5, 0.01), 5)
cord.y <- c(0, dt(seq(qt(0.95, 13), 5, 0.01), 13), 0)
polygon(cord.x, cord.y, col = plotclr[3], border = NA )


abline(v = t$statistic, lty = 2, lwd = 2, col = "red")
text(0, 0.2, paste("Accept", expression(H0)))
text(2.7, 0.08, paste("Reject", expression(H0)))
text(as.double(t$statistic) - 0.15, 0.02, "t", col = "red", cex = 1.2)

Pearson’s chi-squared test

This is a class of tests applied to sets of categorical data to evaluate whether an observed difference between the sets arose by chance. It is suitable for unpaired data from large samples. Pearson’s chi-squared test is used to assess three types of comparison:

  • goodness of fit: establishes whether an observed frequency differs from a theoretical distribution;
  • homogeneity: test if two or more sub-groups of a population share the same distribution of a single categorical variable;
  • independence: determines whether two categorical variables are associated with one another in the population.

Goodness of fit: Darts challenge against one friend

Suppose that \(n\) observations \(x_1, \ldots, x_n\) are divided among \(K\) cells. The following test statistic is then defined:

\[X^2 = \sum_{k=1}^{K} \frac{(O_k-E_k)^2}{E_k} \underset{H_0}{\sim} \chi^2_{K-1}, \] where

  • \(O_k\) are the observed frequencies for cell \(k\)
  • \(E_k\) are the expected frequencies for cell \(k\), under \(\text{H}_0\)

But what is \(H_0\) here? For illustration purposes, suppose you are playing darts against another friend.

(Figure: darts target)

You suspect that your friend is not a great darts player, and that his shots during the game will hit the lowest-scoring zones with high probability and the highest-scoring ones with low probability. Translated into probability terms, you divide the darts target into \(K=4\) zones and assign the following hitting probabilities:

  • Zone 1 (from 1 to 3 points): \(p_1=7/16\);
  • Zone 2 (from 4 to 6 points): \(p_2=5/16\);
  • Zone 3 (from 7 to 9 points): \(p_3=3/16\);
  • Zone 4 (the highest points in the middle of the target, let’s say 10, 25 and 50 points): \(p_4=1/16\).

Your null hypothesis is that, due to his limited darts skills, he has decreasing probabilities of hitting the best zones:

\[ \text{H}_0: p_1=7/16, \quad p_2=5/16, \quad p_3=3/16, \quad p_4=1/16.\]

Any significant deviation from the above probability distribution would lead to rejection of the null hypothesis. To check your assumption, you record your opponent's first \(n=50\) attempts \(x_1, \ldots, x_n\), coding \(x_i=k\) if the \(i\)-th shot hits the \(k\)-th zone.

# Initial settings:
# number of attempts n = 50
# number of zones K = 4
# vector of probabilities p = c( 7/16, 5/16, 3/16, 1/16)
n <- 50
K <- 4
p <- c( 7/16, 5/16, 3/16, 1/16)

# generate the values: let's suppose that we are sampling from p (i.e. H0 is true)
set.seed(1234)
x <- sample(1:K, n, replace = TRUE, prob = p)

# observed (absolute) frequencies
observed <- table(x)
observed
## x
##  1  2  3  4 
## 23 17  9  1
# expected (absolute) frequencies
expected <- n*p
expected
## [1] 21.875 15.625  9.375  3.125
# observed test statistic
X2 <- sum( (observed - expected)^(2)/expected)
X2
## [1] 1.638857
#manually compute the p-value
pchisq(X2, df = K - 1, lower.tail = FALSE )
## [1] 0.6506117
#same result with the chisq.test function
chisq.test(observed, p = p)
## 
##  Chi-squared test for given probabilities
## 
## data:  observed
## X-squared = 1.6389, df = 3, p-value = 0.6506

Yes, your assumption may be retained: your friend is hitting the target according to your probabilities… but he wants to play again and take on another challenge. Before starting, he asks for an energy drink. You are ready to count his next 50 attempts. Does the drink improve his performance?

# Let's suppose the new shots are generated according to a different vector of probabilities
p2 <- c(5/16, 3/16, 6/16, 2/16)
x_after_energy_drink <- sample(1:K, n, replace = TRUE, prob = p2)

new_observed <- table(x_after_energy_drink)
new_observed
## x_after_energy_drink
##  1  2  3  4 
## 12  7 26  5
# new test after the energy drink
chisq.test(new_observed, p = p)
## 
##  Chi-squared test for given probabilities
## 
## data:  new_observed
## X-squared = 39.826, df = 3, p-value = 1.16e-08

Your friend improved his performance! His new shots do not follow your assumed distribution: apparently, the energy drink was indeed needed.

Homogeneity: Darts challenge with more friends

Suppose now that five other friends join your opponent, for a total of \(M=6\) players. Do they all share the same distribution of hits over the four zones? The test statistic is now \[X^2=\sum_{k=1}^{K}\sum_{m=1}^{M} \frac{(O_{k,m}-E_{k,m})^2}{E_{k,m}} \underset{H_0}{\sim} \chi^2_{(K-1)(M-1)}.\] For each of them, you count the first 50 shots:

# Initial settings:
# number of attempts n = 50
# number of zones K = 4
# number of friends M = 6
n <- 50
K <- 4
M <- 6
x <- matrix(0, M, n)
p <- c(7/16, 5/16, 3/16, 1/16)

set.seed(123)
# generate the values
for (m in 1:M){
  x[m, ] <- sample(1:K, n, replace = TRUE, prob = p)
} 
# K x M table of observed frequencies (one column per friend)
observed_matrix <- apply(x, 1, table)
# for a matrix input, chisq.test() performs the homogeneity test (the p argument is ignored)
chisq.test(observed_matrix, p = p)
## 
##  Pearson's Chi-squared test
## 
## data:  observed_matrix
## X-squared = 12.743, df = 15, p-value = 0.6221

Yes, the test suggests that all of your friends hit the darts target homogeneously.

It’s your turn

What happens if a great player decides to join you, now? Try to simulate the data and perform the test again.

M <- 7 
n <- 50
K <- 4 
x <- matrix(0, M, n )
p <- c(7/16, 5/16, 3/16, 1/16)

set.seed(123)
# the 6 friends
for (m in 1:(M-1)){
 x[m, ] <- sample(1:K, n, replace=TRUE, prob = p )
} 

x[M,] <- sample(1:K, n, replace = TRUE, prob = c(1/16, 3/16, 5/16, 7/16)) #great player

observed_matrix <- apply(x, 1, table)
observed_matrix
##   [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## 1   21   23   22   21   23   25    3
## 2   15   14   16   21   17   11    8
## 3    9   12    7    7    6   11   17
## 4    5    1    5    1    4    3   22
# test homogeneity (again, p is ignored for matrix input)
chisq.test(observed_matrix, p = p)
## 
##  Pearson's Chi-squared test
## 
## data:  observed_matrix
## X-squared = 88.166, df = 18, p-value = 3.077e-11

Extra exercise: Correlation test

Sometimes it may be useful to assess the degree of association, or correlation, between paired samples, by using the Pearson, the Kendall’s \(\tau\) or the Spearman’s \(\rho\) correlation coefficient. Regardless of the adopted coefficient, the null hypothesis for a given correlation coefficient \(\rho\) is: \[ \text{H}_0: \rho=0.\] The test statistic is then defined as \[T=r\sqrt{\frac{n-2}{1-r^2}} \underset{H_0}{\sim} t_{n-2},\] where \(r=\text{corr}(X,Y)\) is the Pearson correlation coefficient. Suppose we have two samples of the same length, \(x_1, \ldots, x_n\) and \(y_1, \ldots, y_n\), and we want to measure the association between them. Once we compute the observed test statistic \(t_{obs}\), we may then compute the \(p\)-value, which, for a two-sided test, is obtained as:

\[p\text{-value}=2\,\text{Pr}_{H_0}(T \geq |t_{obs}|). \]
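Before turning to the data, here is a minimal sketch on simulated data (the vectors x and y below are purely illustrative) comparing the manual computation of \(t_{obs}\) and its \(p\)-value with cor.test():

set.seed(1)
x <- rnorm(30)
y <- 0.5 * x + rnorm(30)                # artificially associated with x
n <- length(x)
r <- cor(x, y)                          # Pearson correlation coefficient
t_obs <- r * sqrt((n - 2)/(1 - r^2))    # observed test statistic
2 * pt(abs(t_obs), df = n - 2, lower.tail = FALSE)  # two-sided p-value
cor.test(x, y)$p.value                  # same value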

Consider now some of the most followed Instagram accounts in 2018: for each of the owners, we also report the number of Twitter followers (in millions). Are the Instagram and Twitter accounts somehow associated? Perform a correlation test, compute the \(p\)-value and give an answer. Here are the data.

      Owners <- c( "Katy Perry", "Justin Bieber", "Taylor Swift", "Cristiano Ronaldo",
                   "Kim Kardashian", "Ariana Grande", "Selena Gomez", "Demi Lovato")
      Instagram <- c( 69, 98,107, 123, 110, 118, 135, 67)
      Twitter <- c( 109, 106, 86, 72, 59, 57, 56, 56)
      plot( Instagram, Twitter, pch=21, bg=2, xlim=c(60, 150), ylim=c(40, 120) )
      text( Instagram[-6], Twitter[-6]+5, Owners[-6], cex=0.8 )
      text( Instagram[6], Twitter[6]-5, Owners[6], cex=0.8 )

Owners            Instagram Twitter
Katy Perry               69     109
Justin Bieber            98     106
Taylor Swift            107      86
Cristiano Ronaldo       123      72
Kim Kardashian          110      59
Ariana Grande           118      57
Selena Gomez            135      56
Demi Lovato              67      56

Extra: Independent trials: weak law of large numbers

Consider Bernoulli trials \(X_1, \ldots, X_n\), with mean \(p\) and variance \(p(1-p)\). In this setting the weak law of large numbers holds: \[ \overline{X}_n \overset{P}{\rightarrow} p, \ \mbox{as } n \rightarrow \infty,\] where the notation \(\overset{P}{\rightarrow}\) means convergence in probability. We will write a function that takes \(n\) and \(p\) as input.

law_large_numb <- function(n, p){
  x<-rbinom(n, 1, p)
  return(mean(x))
}
law_large_numb <- Vectorize(law_large_numb)

We can graphically evaluate the convergence of the sample mean to \(p\), as a function of \(n\).

set.seed(12345) 
par(mfrow = c(1, 1))
curve(law_large_numb(x, p = 0.5), from = 1, to = 1000, xlab = "n", 
      ylab = "frequency")
abline(a = 0.5, b = 0, lwd = 2, col = "red")
legend(80, 0.9, lty = 1, col = c("black", "red"), 
       lwd  =c(1, 2), c( "X/n", "p"))

Relatively few trials are needed for the sample mean to settle near \(p\).
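To make the convergence in probability explicit, we can also estimate \(\Pr(|\overline X_n - p| > \varepsilon)\) for increasing \(n\); a sketch with the illustrative choice \(\varepsilon = 0.05\):

eps <- 0.05
p <- 0.5
for (n in c(10, 100, 1000, 10000)){
  xbar <- replicate(2000, mean(rbinom(n, 1, p)))
  cat("n =", n, " Pr(|Xbar - p| >", eps, ") ~", mean(abs(xbar - p) > eps), "\n")
}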