We can use the epiR library to do sample size calculations:
library(epiR)
Suppose we wish to test, at the 5% level of significance, the hypothesis that cholesterol means in a population are equal in two study years against the one-sided alternative that the mean is higher in the second of the two years.
Suppose that equal-sized samples will be taken in each year, but that these will not be from the same individuals (i.e. the two samples are drawn independently).
Our test is to have a power of 0.95 at detecting a difference of 0.5 mmol/L. The standard deviation of serum cholesterol is assumed to be 1.4 mmol/L.
Here we are in a very standard situation of random sampling with two independent samples: the worthwhile difference is 0.5 mmol/L, and since we don’t know the actual means in the two groups, we can supply any values for the treat and control arguments of the function, as long as their difference is 0.5:
epi.sscompc(treat = 5, control = 4.5, n = NA, sigma = 1.4, power = 0.95,
r = 1, design = 1, sided.test = 1, conf.level = 0.95)
## $n.total
## [1] 340
##
## $n.treat
## [1] 170
##
## $n.control
## [1] 170
##
## $power
## [1] 0.95
##
## $delta
## [1] 0.5
To satisfy the study requirements, 340 individuals need to be tested: 170 in the first year and 170 in the second year.
This is equivalent to asking for the power of the test, given n in each group, using the power.t.test function:
power.t.test(n = 170, delta = 0.5, sd = 1.4, sig.level = 0.05,
             type = "two.sample", alternative = "one.sided")
##
## Two-sample t test power calculation
##
## n = 170
## delta = 0.5
## sd = 1.4
## sig.level = 0.05
## power = 0.9496262
## alternative = one.sided
##
## NOTE: n is number in *each* group
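As a cross-check, we can also let power.t.test solve for n directly; it should return a value close to (and slightly above) the 170 per group found by epi.sscompc, since it uses the t rather than the normal distribution:
# Solve for n per group directly with the same inputs as above
power.t.test(power = 0.95, delta = 0.5, sd = 1.4, sig.level = 0.05,
             type = "two.sample", alternative = "one.sided")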
Now, instead of drawing two independent random samples, suppose we test the same subjects twice, for example five years apart. For the sake of simplicity we assume the same standard deviation as before, even though in this case the standard deviation of the distribution of the within-pair differences should be used:
power.t.test(power = 0.95, delta = 0.5, sd = 1.4, sig.level = 0.05,
             type = "paired", alternative = "one.sided")
##
## Paired t test power calculation
##
## n = 86.21841
## delta = 0.5
## sd = 1.4
## sig.level = 0.05
## power = 0.95
## alternative = one.sided
##
## NOTE: n is number of *pairs*, sd is std.dev. of *differences* within pairs
We now need 87 subjects, each measured twice (87 pairs of measurements).
Note that this sample size requirement is well below the 170 (per sample) of the unpaired case.
Provided that the correlation between the before and after measurements is high and positive (as is likely in practice), pairing will give a considerable saving in sample numbers.
However, there may be costs elsewhere. It may be difficult to keep track of the 87 individuals over the 5-year period. Furthermore, bias may also be present because the individuals concerned may be more health conscious simply by virtue of being studied. This may lead them to (say) consume less saturated fat than other members of the population. Sometimes, moreover, no data can be found from which to calculate the standard deviation of the differences.
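How much pairing helps depends on that correlation. Here is a minimal sketch, assuming equal per-measurement standard deviations of 1.4 mmol/L and a correlation rho between the two measurements, using the standard identity sd_diff = sqrt(2 * sd^2 * (1 - rho)):
sd_single <- 1.4
for (rho in c(0, 0.5, 0.8)) {
  # sd of within-pair differences for two equally variable measurements
  sd_diff <- sqrt(2 * sd_single^2 * (1 - rho))
  res <- power.t.test(power = 0.95, delta = 0.5, sd = sd_diff,
                      sig.level = 0.05, type = "paired",
                      alternative = "one.sided")
  cat("rho =", rho, "-> pairs needed:", ceiling(res$n), "\n")
}
The higher the correlation, the smaller the sd of the differences, and the fewer pairs are needed.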
Let’s now consider a randomized clinical trial (RCT) to evaluate the comparative efficacy of two HIV treatments.
Load the pwr library; note that here we are working with the difference between two proportions.
library(pwr)
Call the function with the required parameters. The effect size is calculated using a transformation of the difference between the proportions (see Cohen, J. (1988), Statistical Power Analysis for the Behavioral Sciences, 2nd ed., for details on the formulas used):
p1 <- 0.8
p2 <- 0.6
h.calc <- 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))
pwr.2p.test(h = h.calc, n = NULL, sig.level = 0.05, power = 0.80)
##
## Difference of proportion power calculation for binomial distribution (arcsine transformation)
##
## h = 0.4421432
## n = 80.29912
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: same sample sizes
81 patients per arm should be randomized.
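As a rough cross-check (a sketch using base R rather than pwr), power.prop.test works on the untransformed proportions and should give a similar requirement, roughly 81 per arm:
# Two-sample proportion test without the arcsine transformation
power.prop.test(p1 = 0.8, p2 = 0.6, power = 0.80, sig.level = 0.05)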
Note that if you use a one-sided test, the required sample size is lower (this is a general finding):
pwr.2p.test(h = h.calc, n = NULL, sig.level = 0.05, power = 0.80,
            alternative = "greater")
##
## Difference of proportion power calculation for binomial distribution (arcsine transformation)
##
## h = 0.4421432
## n = 63.25171
## sig.level = 0.05
## power = 0.8
## alternative = greater
##
## NOTE: same sample sizes
Of note, again: samples are sometimes drawn using stratification and/or clustering, in which case a complex sampling design has been used. This has a fundamental effect on sample size requirements compared with simple random sampling (SRS). A full discussion is well beyond the scope of this course, but in general we should expect sensible stratification to decrease the required sample size (lower variance), while clustering often increases it, for the same target power.
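As a minimal sketch of the clustering side: the SRS sample size is commonly inflated by the design effect DEFF = 1 + (m - 1) * ICC, where m is the cluster size and ICC the intra-cluster correlation (the values below are hypothetical, for illustration only):
n_srs <- 170                  # per-group SRS requirement from the first example
m <- 20                       # assumed average cluster size
icc <- 0.05                   # assumed intra-cluster correlation
deff <- 1 + (m - 1) * icc     # design effect
ceiling(n_srs * deff)         # inflated per-group sample size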
In the case of paired binary data (for example, pre-post treatment outcomes), the appropriate statistical test is McNemar’s test.
Sample size calculations for McNemar’s test are not included in the pwr library, so here is the R code:
pwr.mcnemar <- function(p10, p01, alpha, n, power) {
  pdisc <- p10 + p01   # probability of a discordant pair
  pdiff <- p10 - p01   # difference between the discordant probabilities
  if (missing(power) && !missing(n)) {
    # Solve for power given n (Connor 1987, p. 209)
    x1 <- (pdiff * sqrt(n) - qnorm(1 - alpha / 2) * sqrt(pdisc)) /
      sqrt(pdisc - pdiff ^ 2)
    x2 <- (-pdiff * sqrt(n) - qnorm(1 - alpha / 2) * sqrt(pdisc)) /
      sqrt(pdisc - pdiff ^ 2)
    power <- pnorm(x1) + pnorm(x2)
  } else if (missing(n) && !missing(power)) {
    # Solve for the number of pairs given the target power
    n <- ((qnorm(1 - alpha / 2) * sqrt(pdisc) + qnorm(power) *
             sqrt(pdisc - pdiff ^ 2)) / pdiff) ^ 2
  } else {
    stop("Must supply one of `n` or `power`, but not both.")
  }
  c("n" = n, "power" = power)
}
This function computes sample size or power for McNemar’s test for paired categorical data.
p10: probability of pre-test success and post-test failure.
p01: probability of pre-test failure and post-test success.
alpha: the specified significance level.
n: sample size; cannot be left blank if power is missing.
power: statistical power; cannot be left blank if n is missing.
H0: the success probability is the same pre and post (equivalently, p10 = p01). H1: the success probabilities differ.
| | Post success | Post failure |
|---|---|---|
| Pre success | p11 | p10 |
| Pre failure | p01 | p00 |
Reference for details: Connor, R. J. (1987). Sample size for testing differences in proportions for the paired-sample design. Biometrics 43(1):207–211 (see page 209).
pwr.mcnemar(p10=0.20, p01=0.30,alpha=0.05,power=0.90)
## n power
## 521.2043 0.9000
We therefore need 522 subjects, each measured pre and post treatment.
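As a sanity check, we can run the function in the opposite direction: with 522 pairs, the computed power should come out close to 0.90:
# Power given the rounded-up sample size
pwr.mcnemar(p10 = 0.20, p01 = 0.30, alpha = 0.05, n = 522)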
We want to see if there’s an association between gender and flossing teeth among college students. We randomly sample 100 students (males and females) and ask whether or not they floss daily. We want to carry out a chi-square test of association to determine if there’s an association between these two variables. We set our significance level to 0.01. To determine effect size we need to propose an alternative hypothesis, which in this case is a table of proportions. We propose the following:
prob <- matrix(c(0.1,0.2,0.4,0.3), ncol=2,
dimnames = list(c("M","F"),c("Floss","No Floss")))
prob
## Floss No Floss
## M 0.1 0.4
## F 0.2 0.3
This says we sample equal proportions of males and females, but believe that 10% more females than males floss (0.2 versus 0.1 of the total sample).
We now use the ES.w2 function to calculate effect size for chi-square tests of association:
ES.w2(prob)
## [1] 0.2182179
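To see what ES.w2 is doing, here is a minimal sketch that reproduces w by hand: under the null hypothesis of independence the cell probabilities are the products of the row and column marginals, and w is the square root of a chi-square-type discrepancy between the alternative and null tables:
p0 <- outer(rowSums(prob), colSums(prob))   # cell probabilities under independence
sqrt(sum((prob - p0)^2 / p0))               # should reproduce 0.2182179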
Of note: the degrees of freedom of the test are the product of (number of rows - 1) and (number of columns - 1): df = (2 - 1) * (2 - 1) = 1.
And now let’s calculate the power:
pwr.chisq.test(w = ES.w2(prob), N = 100, df = 1, sig.level = 0.01)
##
## Chi squared power calculation
##
## w = 0.2182179
## N = 100
## df = 1
## sig.level = 0.01
## power = 0.3469206
##
## NOTE: N is the number of observations
At only about 35%, this is not a very powerful experiment.
We can ask: how many students should I survey if I wish to achieve 90% power?
pwr.chisq.test(w = ES.w2(prob), power = 0.9, df = 1, sig.level = 0.01)
##
## Chi squared power calculation
##
## w = 0.2182179
## N = 312.4671
## df = 1
## sig.level = 0.01
## power = 0.9
##
## NOTE: N is the number of observations
About 313 students.
If you don’t suspect association in either direction, or you don’t feel like building a matrix with a specific hypothesis, you can try to use a conventional effect size.
For example, how many students should we sample to detect a small effect? [This range of values for the effect size was proposed by Cohen]:
cohen.ES(test = "chisq", size = "small")
##
## Conventional effect size from Cohen (1982)
##
## test = chisq
## size = small
## effect.size = 0.1
Therefore:
pwr.chisq.test(w = 0.1, power = 0.9, df = 1, sig.level = 0.01)
##
## Chi squared power calculation
##
## w = 0.1
## N = 1487.939
## df = 1
## sig.level = 0.01
## power = 0.9
##
## NOTE: N is the number of observations
1488 students need to be enrolled in our study. Perhaps more than we thought we might need!
We could consider reframing the question as a simpler two-sample proportion test: what sample size do we need to detect a small effect of gender on the proportion of students who floss, with 90% power and a significance level of 0.01?
pwr.2p.test(h = 0.2, sig.level = 0.01, power = 0.9)
##
## Difference of proportion power calculation for binomial distribution (arcsine transformation)
##
## h = 0.2
## n = 743.9694
## sig.level = 0.01
## power = 0.9
## alternative = two.sided
##
## NOTE: same sample sizes
About 744 per group.
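To get a feel for what a small h = 0.2 means on the proportion scale, we can invert the arcsine transformation; taking a baseline proportion of 0.5 purely for illustration:
pa <- 0.5                                   # hypothetical baseline proportion
pb <- sin((2 * asin(sqrt(pa)) - 0.2) / 2)^2 # proportion giving h = 0.2
pb                                          # about 0.40: roughly a 0.50 vs 0.40 comparison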
A graduate student is investigating the effectiveness of a fitness program. She wants to see if there is a correlation between the weight of a participant at the beginning of the program and the participant’s weight change after 6 months. She suspects there is a small positive linear relationship between these two quantities (say 0.1). She will measure this relationship with the correlation coefficient, r, and conduct a correlation test to determine if the estimated correlation is statistically greater than 0. How many subjects does she need to sample to detect this small positive (i.e., r > 0) relationship with 80% power and a 0.01 significance level?
Here there is nothing tricky about the effect size argument, r. It is simply the hypothesized correlation. It can take values ranging from -1 to 1.
cohen.ES(test = "r", size = "small")
##
## Conventional effect size from Cohen (1982)
##
## test = r
## size = small
## effect.size = 0.1
pwr.r.test(r = 0.1, sig.level = 0.01, power = 0.8, alternative = "greater")
##
## approximate correlation power calculation (arctangh transformation)
##
## n = 999.2054
## r = 0.1
## sig.level = 0.01
## power = 0.8
## alternative = greater
She needs to observe about 1000 subjects.
Note that the default is a two-sided test; we specified alternative = "greater" since we believe there is a small positive effect.
If she just wants to detect a small effect in either direction (positive or negative correlation), use the default settings of a two.sided test, which we can do by removing the alternative argument from the function:
pwr.r.test(r = 0.1, sig.level = 0.01, power = 0.8)
##
## approximate correlation power calculation (arctangh transformation)
##
## n = 1162.564
## r = 0.1
## sig.level = 0.01
## power = 0.8
## alternative = two.sided
Now she needs to observe 1163 subjects: as always, two-sided tests are more “expensive” in terms of sample size.
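For the curious, pwr.r.test is based on Fisher’s z (arctangh) transformation, and the required sample size can be approximated by hand as n ≈ ((z_alpha + z_beta) / atanh(r))^2 + 3:
r <- 0.1
z_beta <- qnorm(0.80)                                # power = 0.80
((qnorm(1 - 0.01)     + z_beta) / atanh(r))^2 + 3    # one-sided: about 1000
((qnorm(1 - 0.01 / 2) + z_beta) / atanh(r))^2 + 3    # two-sided: about 1163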