Let \(\hat \theta\) be a point estimator for the parameter \(\theta\).
There are some properties we would like an estimator to have: low variance and low bias. Since there is generally a tradeoff between unbiasedness and low variance, we usually seek both only to some extent; ideally, we target a small Mean Squared Error (MSE)
\[\text{MSE}(\hat \theta)=E\{(\hat \theta-\theta)^2 \}= \{E(\hat \theta) - \theta\}^2+\text{var}(\hat \theta)=(\text{Bias})^2+\text{Variance} \]
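The decomposition can be verified by simulation. Below is a small Monte Carlo sketch (an illustrative aside, using the same \(\mathcal{N}(5, 4)\) setting adopted later in these notes) checking that the directly estimated MSE of the sample mean matches \(\text{Bias}^2 + \text{Variance}\) up to simulation error.

```r
# Monte Carlo check of the MSE decomposition: estimate the mean of a
# N(5, 4) population with the sample mean, and verify that the direct
# MSE matches Bias^2 + Variance (up to simulation error).
set.seed(1)
B <- 5000; n <- 10; mu <- 5; sigma <- 2
est <- replicate(B, mean(rnorm(n, mu, sigma)))
mse        <- mean((est - mu)^2)   # direct Monte Carlo estimate of the MSE
bias       <- mean(est) - mu       # Monte Carlo estimate of the bias
decomposed <- bias^2 + var(est)    # Bias^2 + Variance
c(mse = mse, decomposed = decomposed)
```

The two quantities differ only by a term of order \(1/B\), due to the \(B-1\) denominator used by `var()`.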
As we already know from the theory, the sample mean is an excellent estimator for the population mean, but we will now consider other estimators as well. Consider the case of a normal distribution \(\mathcal{N}(\mu, \sigma^2)\) and the following estimators:
\[\hat\mu_1 = \frac{1}{n}\sum_{i=1}^{n} x_i, \quad \hat\mu_2 = \text{median}(x_1, \ldots, x_n), \quad \hat\mu_3 = \frac{x_{(1)}+x_{(n)}}{2}, \quad \hat\mu_4 = \frac{1}{n-2}\sum_{i=2}^{n-1} x_{(i)},\]
with \(x_{(i)}\) the \(i\)-th order statistic.
We aim to compare the four estimators in terms of unbiasedness and efficiency.
#Initial settings:
# Number of replications B = 1000
# Sample size n = 10
# Population mean mu = 5
# Population standard deviation sigma = 2 (variance = 4)
B <- 1000
n <- 10
mu <- 5
sigma <- 2
# Save the results in a matrix:
# Column 1: sample mean
# Column 2: sample median
# Column 3: semi-sum of minimum and maximum
# Column 4: trimmed sample mean
est <- matrix(0,B,4)
set.seed(1234)
# For each replication we generate a sample of size n = 10 from
# a normal r.v. with mean mu and standard deviation sigma
# and we compute the four estimates
for (i in 1:B){
x <- rnorm(n, mu, sigma)
est[i,1] <- mean(x)
est[i,2] <- median(x)
est[i,3] <- (min(x)+max(x))/2
est[i,4] <- sum(sort(x)[-c(1,n)])/(n-2)
}
# By using boxplots, we plot the distributions of the sample estimates
# Blue horizontal line is the true population mean
par(mfrow=c(1,1), xaxt="n")
boxplot(est, main="Comparison between four estimators")
par(xaxt="s")
axis(1, 1:4, c(expression(hat(mu)[1]), expression(hat(mu)[2]), expression(hat(mu)[3]), expression(hat(mu)[4])) )
abline(h=5, lwd=2, col="blue")
# Bias
bias <- apply(est, 2, mean) - mu
bias
## [1] 0.012231786 0.032274611 -0.005109588 0.016567130
# Variances
variance <- apply(est, 2, var)
variance
## [1] 0.3674287 0.5111770 0.7202595 0.3870022
All the estimators appear unbiased, and the sample mean registers the lowest sample variance, which is a good approximation of \(\sigma^2/n=0.4\).
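A related quantity worth a quick look is the relative efficiency of the median with respect to the mean. For normal data the asymptotic variance ratio \(\text{var}(\text{median})/\text{var}(\text{mean})\) is \(\pi/2 \approx 1.57\); the finite-sample ratio for \(n=10\) is somewhat smaller. A self-contained sketch under the same \(\mathcal{N}(5, 4)\) setting:

```r
# Relative efficiency of the sample median with respect to the sample mean:
# for normal data the asymptotic variance ratio var(median)/var(mean) is
# pi/2 ~= 1.57 (somewhat smaller for small n such as n = 10).
set.seed(42)
B <- 5000; n <- 10; mu <- 5; sigma <- 2
samples  <- matrix(rnorm(B * n, mu, sigma), B, n)
v_mean   <- var(apply(samples, 1, mean))
v_median <- var(apply(samples, 1, median))
v_median / v_mean   # the mean is the more efficient estimator
```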
It’s your turn. Let’s now check whether all the estimators are consistent. For checking this, \(n=10\) is extremely low and we need to increase it, say to \(n=200\). Compare the simulated values for \(n=10\) with those obtained for \(n=200\) by using histograms. What do you expect?
# We repeat the previous steps considering n=200
est_200 <- matrix(0, B, 4)
n <- 200
for (i in 1:B){
x <- rnorm(n, mu, sigma)
est_200[i , 1] <- mean(x)
est_200[i , 2] <- median(x)
est_200[i , 3] <- (max(x)+min(x))/2
est_200[i , 4] <- sum(sort(x)[-c(1,n)])/(n-2)
}
# We plot the histograms for n=10 and n=200 of the sample estimates
# Blue vertical line is the true population mean
# n = 10
# hist.scott() is provided by the MASS package
library(MASS)
par(mfrow=c(2,2))
for(j in 1:4){
hist.scott(est[,j], prob = TRUE,
main = substitute(hat(mu)[j],
list(j = j)),
xlab = "", xlim = c(0, 10), cex.main = 1.5)
abline(v = mu, col = "blue", lwd = 2)
}
# n= 200
for(j in 1:4){
hist.scott(est_200[,j], prob = TRUE,
main=substitute(hat(mu)[j] ,
list(j = j)),
xlab = "", xlim = c(0, 10), cex.main = 1.5)
abline(v = mu, col = "blue", lwd = 2)
}
Giving a single point value is in most cases a crude estimate. Rather, we may construct an interval for our parameter, giving an entire set of plausible values for the model parameter. Interval estimation is based on pivotal quantities.
In the normal case with \(\sigma^2\) known, the pivotal quantity for \(\mu\) is defined as: \[ T(\mu)=\frac{\overline X - \mu}{\sigma/\sqrt{n}} \sim \mathcal{N}(0,1), \quad \forall \mu \in \mathbb{R}, \sigma^2 > 0.\]
Then, it follows that for \(0<\alpha <1\) \[\text{Pr}(-z_{1-\alpha/2} \leq T(\mu) \leq z_{1-{\alpha/2}})=1-\alpha \] and a confidence interval of level \(1-\alpha\) for \(\mu\) is given by \[\text{CI}_\mu=\left(\bar x - z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}},\ \bar x + z_{1-\alpha/2} \frac{\sigma}{\sqrt{n}}\right)\] where \(-z_{1-\alpha/2}=z_{\alpha/2}\) and \(z_{1-\alpha/2}\) are the quantiles of order \(\alpha/2\) and \(1-\alpha/2\) of a standard normal distribution, respectively. Also for interval estimation, MC simulation is well suited for exploring the procedure:
#normal case with sigma^2 known (we take mu = 5 and sigma = 2 initialized above)
B <- 1000
n <- 10
# 1-alpha is the confidence level
# CI is matrix where we save the confidence intervals for each replication:
# -) first column: lower bound
# -) second column: upper bound
# l is a vector whose elements assume TRUE (1) or FALSE(0) depending on whether the true parameter value lies within the interval
alpha <- 0.05
CI <- matrix(0,B,2)
l <- rep(0,B)
set.seed(1234)
for (i in 1:B){
x <- rnorm(n, mu, sigma)
CI[i,1] <- mean(x) - qnorm(1-alpha/2)*sigma/sqrt(n)
CI[i,2] <- mean(x) + qnorm(1-alpha/2)*sigma/sqrt(n)
l[i] <- (mu > CI[i,1] & mu < CI[i,2])
}
#Empirical coverage probability
mean(l)
## [1] 0.955
#Plot the first 100 c.i.:
# black: intervals that do not include mu
# red: intervals that include mu
plot(1, xlim = c(0, 10), ylim = c(0, 11), type = "n", xlab = expression(mu), ylab = "",
main = "100 CIs for the mean (known variance)", cex.main = 1.2)
abline(v = mu)
d <- 0
for (i in 1:100){
d <- d + 0.1
lines(seq(CI[i,1], CI[i,2], length = 100), rep(d, 100), col = (l[i] + 1))
}
sum(l[1:100])
## [1] 94
For the mean \(\mu\) of a normal distribution when \(\sigma^2\) is unknown and estimated by the sample variance, the pivotal quantity is: \[ T(\mu)=\frac{\bar X - \mu}{s/\sqrt{n}} \sim t_{n-1}, \quad \forall \mu \in \mathbb{R}, \sigma^2>0 \]
Then, it follows that for \(0<\alpha <1\) \[\text{Pr}(t_{n-1;\alpha/2} \leq T(\mu) \leq t_{n-1;1-{\alpha/2}})=1-\alpha \] and a confidence interval of level \(1-\alpha\) for \(\mu\) is given by \[\text{CI}_\mu=\left(\bar x - t_{n-1; 1-\alpha/2} \frac{s}{\sqrt{n}},\ \bar x + t_{n-1;1-\alpha/2} \frac{s}{\sqrt{n}}\right)\] where \(-t_{n-1;1-\alpha/2}=t_{n-1;\alpha/2}\) and \(t_{n-1;1-\alpha/2}\) are the quantiles of order \(\alpha/2\) and \(1-\alpha/2\) of a Student's t distribution with \(n-1\) degrees of freedom, respectively.
#normal case with sigma^2 unknown
B <- 1000
n <- 10
alpha <- 0.05
CI <- matrix(0, B, 2)
l <- rep(0, B)
set.seed(1234)
for (i in 1:B){
x <- rnorm(n, mu, sigma)
CI[i,1] <- mean(x) - qt(1-alpha/2, n-1)*sd(x)/sqrt(n)
# CI[i,1] <- mean(x) + qt(alpha/2, n-1)*sd(x)/sqrt(n)
CI[i,2] <- mean(x) + qt(1-alpha/2, n-1)*sd(x)/sqrt(n)
l[i] <- (mu > CI[i,1] & mu < CI[i,2])
}
#Empirical coverage probability
mean(l)
## [1] 0.959
plot(1, xlim = c(0,10), ylim = c(0,11), type = "n", xlab = expression(mu), ylab = "",
main = "100 CIs for the mean (unknown variance)", cex.main=1.2)
abline(v = mu)
d <- 0
for (i in 1:100){
d <- d + 0.1
lines(seq(CI[i,1], CI[i,2], length = 100), rep(d, 100), col = (l[i] + 1))
}
sum(l[1:100])
## [1] 93
The dataset pair65 in the DAAG package contains the differences in mm between the members of nine pairs of elastic bands. One member of each pair was placed in hot water for four minutes, while the other was left at ambient temperature. After ten minutes, the differences were recorded:
library(DAAG)
pair_data_frame <- cbind(pair65, pair65[,1] - pair65[,2])
pair_data_frame <- cbind(c(1:9), pair_data_frame)
dimnames(pair_data_frame) <- list(1:9, c("pair","heated", "ambient", "difference"))
pair | heated | ambient | difference |
---|---|---|---|
1 | 244 | 225 | 19 |
2 | 255 | 247 | 8 |
3 | 253 | 249 | 4 |
4 | 254 | 253 | 1 |
5 | 251 | 245 | 6 |
6 | 269 | 259 | 10 |
7 | 248 | 242 | 6 |
8 | 252 | 255 | -3 |
9 | 292 | 286 | 6 |
We have two paired samples, heated and ambient, each of size \(n=9\). We are interested in the sample mean of the differences \[\overline d=\overline x_{H}-\overline x_{A} = 6.33,\] with \(s=6.10\) and \(s/\sqrt{n}=2.03\). The pivotal quantity is then:
\[ T(\mu)= \frac{\overline D -\mu}{s / \sqrt{n}} \sim t_{8}. \]
We can now compute the confidence intervals at the \(95\%\) and \(99\%\) levels.
n <- nrow(pair65)
#compute quantiles for student t with n-1 =8 degrees of freedom
alpha <- 0.05
q_inf_95 <- qt(alpha/2, df = n -1)
q_sup_95 <- qt(1-alpha/2, df = n-1)
alpha <- 0.01
q_inf_99 <- qt(alpha/2, df = n-1)
q_sup_99 <- qt(1-alpha/2, df = n-1)
#compute the confidence intervals
d <- mean(pair_data_frame[,4])
d
## [1] 6.333333
s <- sd(pair_data_frame[,4])
s
## [1] 6.103278
# 95% confidence interval
CI_95 <- c(d + q_inf_95*s/sqrt(n), d + q_sup_95*s/sqrt(n))
CI_95
## [1] 1.641939 11.024728
# 99% confidence interval
CI_99 <- c(d + q_inf_99*s/sqrt(n), d + q_sup_99*s/sqrt(n))
CI_99
## [1] -0.4929537 13.1596203
library(RColorBrewer)
plotclr <- brewer.pal(6,"YlOrRd")
par(mfrow = c(1, 2))
# Plot of the probability density function of the pivotal quantity (Student's t with 8 df) with endpoints that enclose 95% of the probability
curve(dt(x, 8), xlim = c(-6, 6), ylim = c(0, 0.4),
main = "", col = "blue", lwd = 3, xlab = "t",
ylab = expression(t[8]), yaxs="i")
cord.x <- c(q_inf_95, seq(q_inf_95, q_sup_95, 0.01), q_sup_95)
cord.y <- c(0, dt(seq(q_inf_95, q_sup_95, 0.01), 8), 0)
polygon(cord.x, cord.y, col = plotclr[3], border = NA )
# Plot of the probability density function of the pivotal quantity (Student's t with 8 df) with endpoints that enclose 99% of the probability
curve(dt(x, 8), xlim = c(-6, 6), ylim = c(0,0.4),
main = "", col = "blue", lwd = 3, xlab="t",
ylab = expression(t[8]), yaxs="i")
cord.x2 <- c(q_inf_99, seq(q_inf_99, q_sup_99, 0.01), q_sup_99)
cord.y2 <- c(0, dt(seq(q_inf_99, q_sup_99, 0.01), 8), 0)
polygon(cord.x2, cord.y2, col = plotclr[5], border = NA )
The null hypothesis for the parameter \(\theta\) is usually expressed as \[\text{H}_0: \theta = \theta_0\]
Complementary to the choice of \(\text{H}_0\), we have to specify the alternative hypothesis \(\text{H}_1\), namely the values of the parameter which become reasonable when \(\text{H}_0\) does not hold. Usually \(\text{H}_1\) is one of \[\text{H}_1: \theta \neq \theta_0, \qquad \text{H}_1: \theta > \theta_0, \qquad \text{H}_1: \theta < \theta_0.\]
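In R, these three alternatives correspond to the `alternative` argument of `t.test()`. A small sketch on simulated data (illustrative only, not part of the analysis below) also shows the familiar relation between the two-sided and one-sided p-values for a symmetric test statistic:

```r
# The three usual alternatives map onto the `alternative` argument of t.test()
set.seed(2024)
x <- rnorm(20, mean = 0.5, sd = 1)
p_two   <- t.test(x, mu = 0, alternative = "two.sided")$p.value  # H1: theta != theta0
p_great <- t.test(x, mu = 0, alternative = "greater")$p.value    # H1: theta >  theta0
p_less  <- t.test(x, mu = 0, alternative = "less")$p.value       # H1: theta <  theta0
# For the symmetric t statistic, the two-sided p-value is twice the
# smaller of the two one-sided p-values:
c(p_two, 2 * min(p_great, p_less))
```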
Consider the 15 most followed accounts on Instagram, with the number of followers expressed in millions and each total rounded to the nearest million, as of March 31, 2018 (Source: Wikipedia). Among them there are six musicians and nine others:
Musicians <- c( "Selena Gomez", "Ariana Grande", "Beyonce" , "Taylor Swift", "Justin Bieber", "Nicki Minaj")
Others <- c("Cristiano Ronaldo", "Kim Kardashian", "Kylie Jenner", "Dwayne Johnson", "Neymar", "Lionel Messi",
"Kendall Jenner", "Kourtney Kardashian", "Kevin Hart")
Followers_M <- c(135, 118, 113, 107, 98, 86)
Followers_O <- c(123, 110, 106, 103, 91, 89, 89, 62, 58)
Musicians | Followers_M |
---|---|
Selena Gomez | 135 |
Ariana Grande | 118 |
Beyonce | 113 |
Taylor Swift | 107 |
Justin Bieber | 98 |
Nicki Minaj | 86 |
Others | Followers_O |
---|---|
Cristiano Ronaldo | 123 |
Kim Kardashian | 110 |
Kylie Jenner | 106 |
Dwayne Johnson | 103 |
Neymar | 91 |
Lionel Messi | 89 |
Kendall Jenner | 89 |
Kourtney Kardashian | 62 |
Kevin Hart | 58 |
We could set up a test with the following aim: do the musicians, on average, have the same number of followers as the non-musicians? Or do the musicians have more followers?
We suppose that \(X_i \sim \mathcal{N}(\mu_m, \sigma^2_m)\), \(i=1, \ldots, n_1\) and \(Y_i \sim \mathcal{N}(\mu_o, \sigma^2_o)\), \(i=1, \ldots, n_2\) are independent normal samples, where \(n_1=6\) and \(n_2=9\) denote the number of musicians and others, respectively. We aim to compare their means, \(\mu_m\) and \(\mu_o\) through the following one-sided two-sample test
\[ \begin{cases} H_0: \mu_{m}-\mu_{o} = 0 \\ H_1: \mu_{m}-\mu_{o} > 0 \\ \end{cases} \]
Let \(\overline X\) and \(\overline Y\) be the sample means for the musicians and the others, respectively. After a preliminary \(\text{F-test}\) on the variances of the two groups, we may assume that \(\sigma^2_m=\sigma^2_o\); then the test statistic under \(\text{H}_0\) has the form
\[T = \frac{\overline X - \overline Y}{s\sqrt{\frac{1}{n_1}+\frac{1}{n_2}}} \underset{H_{0}}{\sim}t_{n_1+n_2-2}\] with \(n_1+n_2-2=13\). In the previous formula the pooled standard deviation is: \[s = \sqrt{\frac{(n_1-1)s^2_M+(n_2-1)s^2_O}{n_1+n_2-2}} \] (see Maindonald and Braun: pages 67 and 104)
## by hand
n1 <- length(Followers_M)
n2 <- length(Followers_O)
mean_M <- mean(Followers_M)
mean_O <- mean(Followers_O)
#pooled variance: see Data Analysis and Graphics Using R An Example Based Approach, by Maindonald and Braun: pages 104 and 67
v <- (var(Followers_M)*(n1 - 1) + var(Followers_O)*(n2 - 1))/(n1 + n2 - 2)
v
## [1] 392.4231
test.stat <- (mean_M - mean_O)/(sqrt(v*(1/n1 + 1/n2)))
test.stat
## [1] 1.64422
pt(test.stat, 13, lower.tail = FALSE)
## [1] 0.06204155
# by using the t.test() function
t <- t.test(Followers_M, Followers_O, mu = 0, alternative = "greater", var.equal = TRUE)
t
##
## Two Sample t-test
##
## data: Followers_M and Followers_O
## t = 1.6442, df = 13, p-value = 0.06204
## alternative hypothesis: true difference in means is greater than 0
## 95 percent confidence interval:
## -1.322964 Inf
## sample estimates:
## mean of x mean of y
## 109.50000 92.33333
The p-value is 0.06204, slightly greater than \(\alpha=0.05\), so \(\text{H}_0\) is not rejected at the 5% level.
curve(dt(x, 13), xlim = c(-5, 5), ylim = c(0, 0.4),
main = "p-values and rejection region", col = "blue",
lwd = 2, xlab = "x-y", ylab = expression(t[13]), yaxs="i")
cord.x <- c(qt(0.95, 13),seq(qt(0.95, 13), 5, 0.01), 5)
cord.y <- c(0, dt(seq(qt(0.95, 13), 5, 0.01), 13), 0)
polygon(cord.x, cord.y, col = plotclr[3], border = NA )
abline(v = t$statistic, lty = 2, lwd = 2, col = "red")
text(0, 0.2, paste("Accept", expression(H0)))
text(2.7, 0.08, paste("Reject", expression(H0)))
text(as.double(t$statistic) - 0.15, 0.02, "t", col = "red", cex = 1.2)
This is a class of tests applied to sets of categorical data to evaluate whether any observed difference between the sets arose by chance. It is suitable for unpaired data from large samples. Pearson’s chi-squared test is used to assess three types of comparison: goodness of fit, homogeneity, and independence.
Suppose that \(n\) observations \(x_1, \ldots, x_n\) are divided among \(K\) cells. The following test statistic is then defined:
\[X^2 = \sum_{k=1}^{K} \frac{(O_k-E_k)^2}{E_k} \underset{H_0}{\sim} \chi^2_{K-1}, \] where \(O_k\) is the observed count in the \(k\)-th cell and \(E_k = n p_k\) is the corresponding expected count under \(\text{H}_0\).
But what is \(H_0\) here? For illustration purposes, suppose you are playing darts against a friend.
You suspect that your friend is not a great darts player, and that during the game his shots will hit the lowest-scoring zones with high probability and the highest-scoring zone with low probability. Translated into probability terms, you divide the darts target into \(K=4\) zones and assign the following hitting probabilities.
Your null hypothesis is that, due to his moderate darts skills, he has decreasing probabilities of hitting the better zones:
\[ \text{H}_0: p_1=7/16, \quad p_2=5/16, \quad p_3=3/16, \quad p_4=1/16.\]
Any significant deviation from the above probability distribution would lead to rejection of the null hypothesis. To check your assumption, you record the first \(n=50\) attempts \(x_1, \ldots, x_n\) of your opponent, coding \(x_i=k\) if the \(i\)-th shot hits the \(k\)-th zone.
# Initial settings:
# number of attempts n = 50
# number of zones K = 4
# vector of probabilities p = c( 7/16, 5/16, 3/16, 1/16)
n <- 50
K <- 4
p <- c( 7/16, 5/16, 3/16, 1/16)
# generate the values: let's suppose we are sampling from p
set.seed(1234)
x <- sample(1:K, n, replace = TRUE, prob = p)
# observed (absolute) frequencies
observed <- table(x)
observed
## x
## 1 2 3 4
## 23 17 9 1
# expected (absolute) frequencies
expected <- n*p
expected
## [1] 21.875 15.625 9.375 3.125
# observed test statistic
X2 <- sum( (observed - expected)^(2)/expected)
X2
## [1] 1.638857
#manually compute the p-value
pchisq(X2, df = K - 1, lower.tail = FALSE )
## [1] 0.6506117
#same result with the chisq.test function
chisq.test(observed, p = p)
##
## Chi-squared test for given probabilities
##
## data: observed
## X-squared = 1.6389, df = 3, p-value = 0.6506
Yes, your assumption may be accepted: your friend is hitting the target according to your probabilities… but he wants to play again and take on another challenge. Before starting, he asks for an energy drink. You are ready to count his next 50 attempts. Does the drink improve his performance?
# Let's suppose we now generate according to a different vector of probabilities
p2 <- c(5/16, 3/16, 6/16, 2/16)
x_after_energy_drink <- sample(1:K, n, replace = TRUE, prob = p2)
new_observed <- table(x_after_energy_drink)
new_observed
## x_after_energy_drink
## 1 2 3 4
## 12 7 26 5
# new test after the energy drink
chisq.test(new_observed, p = p)
##
## Chi-squared test for given probabilities
##
## data: new_observed
## X-squared = 39.826, df = 3, p-value = 1.16e-08
Your friend improved his performance! His new shots no longer follow your assumed distribution: apparently, the energy drink really helped.
Suppose now that five other friends join you and the other guy, for a total of \(M=6\) friends. Do all of your friends share the same probabilities of hitting the four zones? We now have this test statistic: \[X^2=\sum_{k=1}^{K}\sum_{m=1}^{M} \frac{(O_{k,m}-E_{k,m})^2}{E_{k,m}} \underset{H_0}{\sim} \chi^2_{(K-1)(M-1)}\] For each of them, you count the first 50 shots.
# Initial settings:
# number of attempts n = 50
# number of zones K = 4
# number of friends M = 6
n <- 50
K <- 4
M <- 6
x <- matrix(0, M, n)
p <- c(7/16, 5/16, 3/16, 1/16)
set.seed(123)
# generate the values
for (m in 1:M){
x[m, ] <- sample(1:K, n, replace = TRUE, prob = p)
}
observed_matrix <- apply(x, 1, table) # K x M table of counts, one column per friend
# With a matrix input, chisq.test() performs the homogeneity test
# (the p argument is ignored in this case)
chisq.test(observed_matrix, p = p)
##
## Pearson's Chi-squared test
##
## data: observed_matrix
## X-squared = 12.743, df = 15, p-value = 0.6221
Yes, the test suggests that all of your friends hit the darts target homogeneously.
It’s your turn
What happens if a great player decides to join you, now? Try to simulate the data and perform the test again.
M <- 7
n <- 50
K <- 4
x <- matrix(0, M, n )
p <- c(7/16, 5/16, 3/16, 1/16)
set.seed(123)
# the 6 friends
for (m in 1:(M-1)){
x[m, ] <- sample(1:K, n, replace=TRUE, prob = p )
}
x[M,] <- sample(1:K, n, replace = TRUE, prob = c(1/16, 3/16, 5/16, 7/16)) #great player
observed_matrix <- apply(x, 1, table)
observed_matrix
## [,1] [,2] [,3] [,4] [,5] [,6] [,7]
## 1 21 23 22 21 23 25 3
## 2 15 14 16 21 17 11 8
## 3 9 12 7 7 6 11 17
## 4 5 1 5 1 4 3 22
# homogeneity test (with a matrix input, no probability vector is needed)
chisq.test(observed_matrix)
##
## Pearson's Chi-squared test
##
## data: observed_matrix
## X-squared = 88.166, df = 18, p-value = 3.077e-11
Sometimes it can be useful to assess the degree of association, or correlation, between paired samples, by using Pearson's, Kendall's \(\tau\), or Spearman's \(\rho\) correlation coefficient. Regardless of the adopted coefficient, the null hypothesis for a given correlation coefficient \(\rho\) is: \[ \text{H}_0: \rho=0.\] The test statistic is then defined as \[T=r\sqrt{\frac{n-2}{1-r^2}} \underset{H_0}{\sim} t_{n-2},\] where \(r=\text{corr}(X,Y)\) is the Pearson correlation coefficient. Suppose we have two samples of the same length, \(x_1, \ldots, x_n\) and \(y_1, \ldots, y_n\), and we want to measure the association between them. Once we compute the observed test statistic \(t_{obs}\), we may then compute the \(p\)-value, which, for a two-sided test, is obtained as:
\[p\text{-value}=2\,\text{Pr}_{H_0}(T \geq |t_{obs}|) \]
Consider now some of the most followed Instagram accounts in 2018: for each of the owners, we report also the number of Twitter followers (in millions). Are the Instagram and Twitter accounts somehow associated? Perform a correlation test, compute the \(p\)-value and give an answer. Here is the dataframe.
Owners <- c( "Katy Perry", "Justin Bieber", "Taylor Swift", "Cristiano Ronaldo",
"Kim Kardashian", "Ariana Grande", "Selena Gomez", "Demi Lovato")
Instagram <- c( 69, 98,107, 123, 110, 118, 135, 67)
Twitter <- c( 109, 106, 86, 72, 59, 57, 56, 56)
plot( Instagram, Twitter, pch=21, bg=2, xlim=c(60, 150), ylim=c(40, 120) )
text( Instagram[-6], Twitter[-6]+5, Owners[-6], cex=0.8 )
text( Instagram[6], Twitter[6]-5, Owners[6], cex=0.8 )
Owners | Instagram | Twitter |
---|---|---|
Katy Perry | 69 | 109 |
Justin Bieber | 98 | 106 |
Taylor Swift | 107 | 86 |
Cristiano Ronaldo | 123 | 72 |
Kim Kardashian | 110 | 59 |
Ariana Grande | 118 | 57 |
Selena Gomez | 135 | 56 |
Demi Lovato | 67 | 56 |
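One possible solution sketch: compute \(r\) and \(t_{obs}\) by hand, obtain the two-sided \(p\)-value from the \(t_{n-2}\) distribution, and compare with `cor.test()`, which implements the same Pearson test by default.

```r
# Correlation test for the Instagram/Twitter followers (in millions)
Instagram <- c(69, 98, 107, 123, 110, 118, 135, 67)
Twitter   <- c(109, 106, 86, 72, 59, 57, 56, 56)
n <- length(Instagram)
r <- cor(Instagram, Twitter)               # Pearson correlation coefficient
t_obs <- r * sqrt((n - 2) / (1 - r^2))     # observed test statistic
# two-sided p-value from the t distribution with n - 2 df
p_manual <- 2 * pt(abs(t_obs), df = n - 2, lower.tail = FALSE)
# cor.test() reproduces the same statistic and p-value
ct <- cor.test(Instagram, Twitter)
c(manual = p_manual, cor.test = ct$p.value)
```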
Consider Bernoulli trials \(X_1, \ldots, X_n\), with mean \(p\) and variance \(p(1-p)\). For this setting, the weak law of large numbers holds: \[ \overline{X}_n \overset{P}{\rightarrow} p, \ \mbox{as } n \rightarrow \infty,\] where the notation \(\overset{P}{\rightarrow}\) means convergence in probability. We will write a function that takes as input \(n\) and \(p\).
law_large_numb <- function(n, p){
x <- rbinom(n, 1, p)
return(mean(x))
}
law_large_numb <- Vectorize(law_large_numb)
We can graphically evaluate the convergence of the sample mean to \(p\), as a function of \(n\).
set.seed(12345)
par(mfrow = c(1, 1))
curve(law_large_numb(x, p = 0.5), from = 1, to = 1000, xlab = "n",
ylab = "frequency")
abline(a = 0.5, b = 0, lwd = 2, col = "red")
legend(80, 0.9, lty = 1, col = c("black", "red"),
lwd =c(1, 2), c( "X/n", "p"))
As the plot shows, the sample frequency settles near \(p\) already after a few hundred trials.
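The same convergence can be checked numerically: the deviation \(|\overline{X}_n - p|\) should shrink as \(n\) grows, at rate \(\sqrt{p(1-p)/n}\). A quick self-contained sketch (restating the function so the chunk runs on its own):

```r
# Numerical look at the law of large numbers: the deviation of the
# sample frequency from p shrinks as n grows.
law_large_numb <- function(n, p) mean(rbinom(n, 1, p))
set.seed(12345)
dev_small <- abs(law_large_numb(100,    0.5) - 0.5)  # sd ~ 0.05
dev_large <- abs(law_large_numb(100000, 0.5) - 0.5)  # sd ~ 0.0016
c(dev_small = dev_small, dev_large = dev_large)
```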