Clinical development is an integral part of pharmaceutical development: a lengthy and costly process aimed at providing an accurate and reliable assessment of the efficacy and safety of the pharmaceutical entities under investigation. Sample size calculation plays an important role in ensuring the success of studies conducted at the various phases of clinical development. It not only ensures the validity of a clinical trial, but also assures that the trial will have the desired power for correctly detecting a clinically meaningful difference of the pharmaceutical entity under study, if such a difference truly exists. Sample size calculation is usually conducted through a pre-specified power analysis: the purpose is to select a sample size that achieves the desired power for correctly detecting a pre-specified, clinically meaningful difference at a given level of significance.
In clinical research, during the planning stage of a clinical study, the following questions are of particular interest to the investigators:
how many subjects are needed in order to have a desired power for detecting a clinically meaningful difference (e.g., an 80% chance of correctly detecting it)?
what is the trade-off between cost-effectiveness and power if only a small number of subjects is available for the study, due to a limited budget and/or medical considerations?
For a given study, sample size calculation is usually performed based on some statistical criterion, such as controlling precision or the type I and/or type II error rates.
When we choose the sample size so as to obtain a desired precision at a fixed confidence level, we are performing the precision analysis for sample size calculation previously discussed.
As an alternative, when we focus on a specific effect size to be detected, a pre-study power analysis is usually conducted to estimate the sample size.
The concept of the pre-study power analysis is to select the sample size required to achieve a desired power for detecting a clinically/scientifically meaningful difference at fixed type I and type II error rates.
We can use the following libraries to perform sample size calculations:
library(epiR)
library(epiDisplay)
Suppose we wish to test, at the 5% level of significance, the hypothesis that cholesterol means in a population are equal in two study years against the one-sided alternative that the mean is higher in the second of the two years.
Suppose that equal-sized samples will be taken in each year, but that these will not be from the same individuals (i.e. the two samples are drawn independently).
Our test is to have a power of 0.95 at detecting a difference of 0.5 mmol/L (this is the effect size).
The standard deviation of serum cholesterol is assumed to be 1.4 mmol/L.
Here we are in a very standard situation of random sampling with two independent samples: the worthwhile difference is 0.5 and, since we don't know the actual means in the groups, we can substitute any values for 'treat' and 'control' in the function's input, so long as their difference is 0.5:
epi.sscompc(treat = 5, control = 4.5, n = NA, sigma = 1.4, power = 0.95,
r = 1, design = 1, sided.test = 1, conf.level = 0.95)
## $n.total
## [1] 340
##
## $n.treat
## [1] 170
##
## $n.control
## [1] 170
##
## $power
## [1] 0.95
##
## $delta
## [1] 0.5
To satisfy the study requirements 340 individuals need to be tested: 170 in the first year and 170 in the second year.
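As a rough cross-check (not necessarily how epi.sscompc is implemented internally, but the standard normal-approximation formula for comparing two independent means), the per-group sample size can be computed by hand:

```r
# Normal-approximation sample size for two independent means (one-sided test):
# n per group = 2 * ((z_{1-alpha} + z_{1-beta}) * sigma / delta)^2
alpha <- 0.05   # one-sided significance level
beta  <- 0.05   # 1 - power = 1 - 0.95
sigma <- 1.4    # assumed SD of serum cholesterol (mmol/L)
delta <- 0.5    # worthwhile difference (mmol/L)

n_per_group <- 2 * ((qnorm(1 - alpha) + qnorm(1 - beta)) * sigma / delta)^2
ceiling(n_per_group)  # 170, matching the epi.sscompc result
```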
This is equivalent to asking for the power of the test, given n in each group, using the power.t.test function:
power.t.test(n = 170, delta = 0.5, sd = 1.4,sig.level = 0.05,
type = "two.sample",alternative = "one.sided")
##
## Two-sample t test power calculation
##
## n = 170
## delta = 0.5
## sd = 1.4
## sig.level = 0.05
## power = 0.9496262
## alternative = one.sided
##
## NOTE: n is number in *each* group
Now, instead of using random sampling of two independent groups, we test the same subjects twice, for example five years apart; for the sake of simplicity we assume the same standard deviation as before (even though in this case the standard deviation of the distribution of differences should be used):
power.t.test(power=0.95, delta = 0.5, sd = 1.4,sig.level = 0.05,
type = "paired",alternative = "one.sided")
##
## Paired t test power calculation
##
## n = 86.21841
## delta = 0.5
## sd = 1.4
## sig.level = 0.05
## power = 0.95
## alternative = one.sided
##
## NOTE: n is number of *pairs*, sd is std.dev. of *differences* within pairs
We now need 87 subjects, each measured twice (87 pairs of measurements).
Note that the sample size requirement is well below the 170 (per sample) of the unpaired case.
Provided that the correlation between the before and after measurements is high and positive (as is likely in practice), pairing will give a considerable saving in sample numbers.
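The saving can be quantified: if the two measurements share a common standard deviation sigma and have correlation rho, the standard deviation of the within-pair differences is sqrt(2 * sigma^2 * (1 - rho)). A small sketch, using rho = 0.7 as a purely illustrative assumption:

```r
sigma <- 1.4  # common SD of each measurement (mmol/L)
rho   <- 0.7  # assumed correlation between the paired measurements (illustrative)

# SD of the within-pair differences shrinks as rho increases
sd_diff <- sqrt(2 * sigma^2 * (1 - rho))

# Required number of pairs, using the SD of the differences
power.t.test(power = 0.95, delta = 0.5, sd = sd_diff, sig.level = 0.05,
             type = "paired", alternative = "one.sided")$n
```

With rho = 0.5 the differences have the same SD as a single measurement; the higher the (positive) correlation, the smaller sd_diff and the fewer pairs needed.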
However, there may be costs elsewhere. It may be difficult to keep track of the 87 individuals over the 5-year period. Furthermore, bias may also be present because the individuals concerned may be more health conscious simply by virtue of being studied. This may lead them to (say) consume less saturated fat than other members of the population. Sometimes, moreover, no data can be found from which to calculate the standard deviation of the differences.
Let’s consider now a randomized clinical trial (RCT) to evaluate comparative efficacy of two HIV treatments:
Load the pwr library; note that here we consider the difference between two proportions.
library(pwr)
Call the function with the required parameters. The effect size is calculated using a transformation of the difference between proportions (see, for details on the formulas used: Cohen, J. (1988). Statistical Power Analysis for the Behavioral Sciences (2nd ed.)):
p1=0.8
p2=0.6
h.calc = 2*asin(sqrt(p1))-2*asin(sqrt(p2))
pwr.2p.test(h =h.calc , n =NULL, sig.level =0.05, power =0.80)
##
## Difference of proportion power calculation for binomial distribution (arcsine transformation)
##
## h = 0.4421432
## n = 80.29912
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: same sample sizes
81 patients per arm should be randomized.
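As a hand check, the standard formula behind the arcsine method gives the per-group size as n = 2 * ((z_{1-alpha/2} + z_{1-beta}) / h)^2:

```r
p1 <- 0.8
p2 <- 0.6
h  <- 2 * asin(sqrt(p1)) - 2 * asin(sqrt(p2))  # Cohen's effect size h

# n per group = 2 * ((z_{1-alpha/2} + z_{1-beta}) / h)^2
n_per_group <- 2 * ((qnorm(0.975) + qnorm(0.80)) / h)^2
ceiling(n_per_group)  # 81, matching pwr.2p.test
```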
Note that if you use the one-sided test the required sample size is lower (this is a general finding):
pwr.2p.test(h =h.calc , n =NULL, sig.level =0.05, power =0.80, alternative="greater")
##
## Difference of proportion power calculation for binomial distribution (arcsine transformation)
##
## h = 0.4421432
## n = 63.25171
## sig.level = 0.05
## power = 0.8
## alternative = greater
##
## NOTE: same sample sizes
Using a one-tailed test is a strong assumption: we should be quite "sure" that the alternative hypothesis lies in one direction only (otherwise we run the risk of failing to reject the null in cases where the experimental drug is inferior to the standard one).
Of note, again: samples are sometimes drawn using stratification and/or clustering, in which case a complex sampling design has been used. This has a fundamental effect on sample size requirements compared with simple random sampling (SRS). A full discussion is well beyond the scope of this course, but note that, to maintain the same power, sensible stratification generally decreases the required sample size (lower variance), while clustering often increases it.
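As a rough illustration of the clustering point, a common approximation multiplies the SRS sample size by the design effect deff = 1 + (m - 1) * rho, where m is the average cluster size and rho the intracluster correlation (the m and rho values below are purely illustrative assumptions):

```r
n_srs <- 81    # per-arm size from the SRS calculation above
m     <- 20    # assumed average cluster size (illustrative)
icc   <- 0.05  # assumed intracluster correlation (illustrative)

deff <- 1 + (m - 1) * icc           # design effect
n_cluster <- ceiling(n_srs * deff)  # inflated per-arm size under clustering
n_cluster  # 158
```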
In the case of paired binary data (for example, pre-post treatment outcomes), the statistical test to be used is McNemar's test.
McNemar sample size calculations are not included in the library, so here is the R code:
pwr.mcnemar <- function(p10, p01, alpha, n, power) {
  # p10, p01: probabilities of the two discordant cells
  pdisc <- p10 + p01  # probability of a discordant pair
  pdiff <- p10 - p01  # difference between the discordant probabilities
  if (missing(power) && !missing(n)) {
    # Solve for power given n
    x1 <- (pdiff * sqrt(n) - qnorm(1 - alpha / 2) * sqrt(pdisc)) /
      sqrt(pdisc - pdiff ^ 2)
    x2 <- (-pdiff * sqrt(n) - qnorm(1 - alpha / 2) * sqrt(pdisc)) /
      sqrt(pdisc - pdiff ^ 2)
    power <- pnorm(x1) + pnorm(x2)
  } else if (missing(n) && !missing(power)) {
    # Solve for n given power
    n <- ((qnorm(1 - alpha / 2) * sqrt(pdisc) + qnorm(power) *
             sqrt(pdisc - pdiff ^ 2)) / pdiff) ^ 2
  } else {
    stop("Must supply one of `n` or `power`, but not both.")
  }
  c("n" = n, "power" = power)
}
Computes sample size or power for McNemar’s test for paired categorical data.
p10 : Probability of pre-test success and post-test failure.
p01 : Probability of pre-test failure and post-test success.
alpha : Specified significance level
n : Sample size. Cannot be left blank if power is missing.
power: Statistical power. Cannot be left blank if n is missing.
H0: both groups have the same success probability. H1: the success probability differs between the groups.
| | Post success | Post failure |
|---|---|---|
| Pre success | p11 | p10 |
| Pre failure | p01 | p00 |
Reference for details: Connor R. J. 1987. Sample size for testing differences in proportions for the paired-sample design. Biometrics 43(1):207-211. page 209.
pwr.mcnemar(p10=0.20, p01=0.30,alpha=0.05,power=0.90)
## n power
## 521.2043 0.9000
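The result can be reproduced directly from Connor's closed-form expression for n (the same formula used inside the function above):

```r
p10 <- 0.20; p01 <- 0.30; alpha <- 0.05; power <- 0.90
pdisc <- p10 + p01  # probability of a discordant pair
pdiff <- p10 - p01  # difference between the discordant probabilities

n <- ((qnorm(1 - alpha / 2) * sqrt(pdisc) +
       qnorm(power) * sqrt(pdisc - pdiff^2)) / pdiff)^2
n  # ~521.2, so 522 pairs after rounding up
```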
We want to see if there’s an association between gender and flossing teeth among college students. We randomly sample 100 students (males and females) and ask whether or not they floss daily. We want to carry out a chi-square test of association to determine if there’s an association between these two variables. We set our significance level to 0.01. To determine effect size we need to propose an alternative hypothesis, which in this case is a table of proportions. We propose the following:
prob <- matrix(c(0.1,0.2,0.4,0.3), ncol=2,
dimnames = list(c("M","F"),c("Floss","No Floss")))
prob
## Floss No Floss
## M 0.1 0.4
## F 0.2 0.3
This says we sample equal proportions of males and females, but believe that 10% more females floss.
We now use the ES.w2 function to calculate effect size for chi-square tests of association:
ES.w2(prob)
## [1] 0.2182179
Of note: the degrees of freedom of the test are the product of (number of levels - 1) for each variable: df = (2 - 1) * (2 - 1) = 1
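ES.w2 measures the discrepancy between the hypothesized joint probabilities and the products of their margins (i.e. independence); a minimal by-hand version of the same computation:

```r
prob <- matrix(c(0.1, 0.2, 0.4, 0.3), ncol = 2)

# Expected cell probabilities under independence: outer product of the margins
expected <- outer(rowSums(prob), colSums(prob))

# Cohen's w = sqrt( sum( (p - p0)^2 / p0 ) )
w <- sqrt(sum((prob - expected)^2 / expected))
w  # ~0.2182, matching ES.w2(prob)
```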
And now let’s calculate the power:
pwr.chisq.test(w = ES.w2(prob), N = 100, df = 1, sig.level = 0.01)
##
## Chi squared power calculation
##
## w = 0.2182179
## N = 100
## df = 1
## sig.level = 0.01
## power = 0.3469206
##
## NOTE: N is the number of observations
At only about 35%, this is not a very powerful experiment.
We can ask: how many students should I survey if I wish to achieve 90% power?
pwr.chisq.test(w = ES.w2(prob), power = 0.9, df = 1, sig.level = 0.01)
##
## Chi squared power calculation
##
## w = 0.2182179
## N = 312.4671
## df = 1
## sig.level = 0.01
## power = 0.9
##
## NOTE: N is the number of observations
About 313 students.
If you don’t suspect association in either direction, or you don’t feel like building a matrix with a specific hypothesis, you can try to use a conventional effect size.
For example, how many students should we sample to detect a small effect? [This pre-specified range of values for the effect size was proposed by Cohen]:
cohen.ES(test = "chisq", size = "small")
##
## Conventional effect size from Cohen (1982)
##
## test = chisq
## size = small
## effect.size = 0.1
Therefore:
pwr.chisq.test(w = 0.1, power = 0.9, df = 1, sig.level = 0.01)
##
## Chi squared power calculation
##
## w = 0.1
## N = 1487.939
## df = 1
## sig.level = 0.01
## power = 0.9
##
## NOTE: N is the number of observations
1488 students need to be enrolled in our study. Perhaps more than we thought we might need!
We could consider reframing the question as a simpler two-sample proportion test. What sample size do we need to detect a small effect in gender on the proportion of students who floss with 90% power and a significance level of 0.01?
pwr.2p.test(h = 0.2, sig.level = 0.01, power = 0.9)
##
## Difference of proportion power calculation for binomial distribution (arcsine transformation)
##
## h = 0.2
## n = 743.9694
## sig.level = 0.01
## power = 0.9
## alternative = two.sided
##
## NOTE: same sample sizes
About 744 per group.
A graduate student is investigating the effectiveness of a fitness program. She wants to see if there is a correlation between the weight of a participant at the beginning of the program and the participant's weight change after 6 months. She suspects there is a small positive linear relationship between these two quantities (say 0.1; this is the effect size). She will measure this relationship with the correlation coefficient, r, and conduct a correlation test to determine if the estimated correlation is statistically greater than 0. How many subjects does she need to sample to detect this small positive relationship with 80% power and 0.01 significance level?
Here there is nothing tricky about the effect size argument, r. It is simply the hypothesized correlation. It can take values ranging from -1 to 1.
cohen.ES(test = "r", size = "small")
##
## Conventional effect size from Cohen (1982)
##
## test = r
## size = small
## effect.size = 0.1
pwr.r.test(r = 0.1, sig.level = 0.01, power = 0.8, alternative = "greater")
##
## approximate correlation power calculation (arctangh transformation)
##
## n = 999.2054
## r = 0.1
## sig.level = 0.01
## power = 0.8
## alternative = greater
She needs to observe about 1000 students.
The default is a two-sided test. We specify alternative = “greater” since we believe there could be only a small positive effect.
If she just wants to detect a small effect in either direction (positive or negative correlation), use the default settings of a two.sided test, which we can do by removing the alternative argument from the function:
pwr.r.test(r = 0.1, sig.level = 0.01, power = 0.8)
##
## approximate correlation power calculation (arctangh transformation)
##
## n = 1162.564
## r = 0.1
## sig.level = 0.01
## power = 0.8
## alternative = two.sided
Now she needs to observe 1163 students: as always, two-sided tests are more "expensive" in terms of sample size.
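As a rough cross-check of the one-sided calculation, the Fisher z (arctanh) approximation n ≈ ((z_{1-alpha} + z_{1-beta}) / atanh(r))^2 + 3 lands very close to the ~999 reported by pwr.r.test (which solves the problem more exactly):

```r
r     <- 0.1
alpha <- 0.01
power <- 0.80

# Fisher z approximation for a one-sided correlation test
n_approx <- ((qnorm(1 - alpha) + qnorm(power)) / atanh(r))^2 + 3
n_approx  # close to the ~999 from pwr.r.test
```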
A case-control study of the relationship between smoking and CHD (coronary heart disease) is planned. A sample of men with newly diagnosed CHD will be compared for smoking status with a sample of controls. Assuming an equal number of cases and controls, how many study subjects are required to detect an odds ratio of 2.0 (effect size) with 0.90 power, using a two-sided 0.05 test? Previous surveys have shown that around 30% of males without CHD are smokers.
We use the epi.sscc function, which requires the following arguments:
* OR = scalar, the expected study odds ratio.
* p0 = scalar, the prevalence of exposure among the controls.
* n = scalar, the total number of subjects in the study (i.e. the number of cases plus the number of controls).
* power = scalar, the required study power.
* r = scalar, the number in the control group divided by the number in the case group.
* phi.coef = scalar, the correlation between case and control exposures for matched pairs. Ignored when method = "unmatched".
* design = scalar, the design effect, used to take into account the possible presence of clustering in the random sampling of the data. The design effect is a measure of the variability between clusters and is generally calculated as the ratio of the variance under the complex sample design to the variance under simple random sampling.
* sided.test = use a two-sided test if you wish to evaluate whether the odds of exposure in cases is greater than or less than the odds of exposure in controls; use a one-sided test to evaluate only whether it is greater.
* conf.level = scalar, the level of confidence in the computed result.
* method = a character string defining the method to be used. Options are "unmatched" or "matched".
* fleiss = logical, indicating whether or not the Fleiss correction should be applied (a continuity correction factor is used when a continuous probability distribution is used to approximate a discrete one). This argument is ignored when method = "matched".
library(epiR)
epi.sscc(OR = 2.0, p0 = 0.30, n = NA, power = 0.90, r = 1, phi.coef = 0,
design = 1, sided.test = 2, conf.level = 0.95, method = "unmatched")
## $n.total
## [1] 376
##
## $n.case
## [1] 188
##
## $n.control
## [1] 188
##
## $power
## [1] 0.9
##
## $OR
## [1] 2
A total of 376 men need to be sampled: 188 cases and 188 controls.
Suppose we wish to determine the power to detect an odds ratio of 2.0 using a two-sided 0.05 test when 188 cases and 940 controls are available (that is, the ratio of controls to cases is 5:1). Assume the prevalence of smoking in males without CHD is 0.30. Here we use the Fleiss correction because we are dealing with an inverse problem (i.e. estimating power given the sample size). See for technical details: Fleiss JL et al., (2003). Statistical Methods for Rates and Proportions. Wiley, New York, 3rd edition.
n <- 188 + 940
epi.sscc(OR = 2.0, p0 = 0.30, n = n, power = NA, r = 5, phi.coef = 0,
design = 1, sided.test = 2, conf.level = 0.95, method = "unmatched",
fleiss = TRUE)
## $n.total
## [1] 1128
##
## $n.case
## [1] 188
##
## $n.control
## [1] 940
##
## $power
## [1] 0.9880212
##
## $OR
## [1] 2
The power of this study, with the given sample size allocation, is 0.99.
We wish to conduct a case-control study to assess whether bladder cancer may be associated with past exposure to cigarette smoking. Cases will be patients with bladder cancer and controls will be patients hospitalized for injury. It is assumed that 20% of controls will be smokers or past smokers, and we wish to detect an odds ratio of 2 with power 90%. Three controls will be recruited for every case.
How many subjects need to be enrolled in the study?
epi.sscc(OR = 2.0, p0 = 0.20, n = NA, power = 0.90, r = 3, phi.coef = 0,
design = 1, sided.test = 2, conf.level = 0.95, method = "unmatched")
## $n.total
## [1] 619
##
## $n.case
## [1] 155
##
## $n.control
## [1] 464
##
## $power
## [1] 0.9
##
## $OR
## [1] 2
A total of 619 subjects need to be enrolled in the study: 155 cases and 464 controls.
For cohort studies we use the epi.sscohortc function.
A cohort study (population-based) of smoking and coronary heart disease (CHD) in middle aged men is planned. A sample of men will be selected at random from the population and those that agree to participate will be asked to complete a questionnaire. The follow-up period will be 5 years. The investigators would like to be 0.90 confident of being able to detect when the relative risk of CHD is 1.4 (effect size) for smokers, using a one-sided 0.05 significance test. Previous evidence suggests that the incidence rate of death in non-smokers is 413 per 100000 person-years. Assuming equal numbers of smokers and non-smokers are sampled, how many men should be sampled overall?
irexp1 = 1.4 * (5 * 413)/100000
irexp0 = (5 * 413)/100000
epi.sscohortc(irexp1 = irexp1, irexp0 = irexp0, n = NA, power = 0.90, r = 1, design = 1, sided.test = 1, conf.level = 0.95)
## $n.total
## [1] 12130
##
## $n.exp1
## [1] 6065
##
## $n.exp0
## [1] 6065
##
## $power
## [1] 0.9
##
## $irr
## [1] 1.4
##
## $or
## [1] 1.411908
Over a 5-year period the estimated risk of death in the non-exposed (irexp0) is 0.02065. Thus, rounding up to an even total, 12130 men should be sampled (6065 smokers and 6065 non-smokers).
Since in this design we extract a simple random sample from the overall population (we do not sample separately from smokers and non-smokers), we may know in advance the expected prevalence of exposure, and therefore we can be more flexible, allowing imbalance between exposed and non-exposed.
The imbalance is set through the parameter r.
If the proportion of men smoking is one-third, then in a random sample we would expect to find n1 = n/3. Hence n2 = 2n/3 and thus r = n1/n2 = 0.5. Using this new value for r:
epi.sscohortc(irexp1 = irexp1, irexp0 = irexp0, n = NA, power = 0.90, r = 0.50, design = 1, sided.test = 1, conf.level = 0.95)
## $n.total
## [1] 13543.5
##
## $n.exp1
## [1] 4514.5
##
## $n.exp0
## [1] 9029
##
## $power
## [1] 0.9
##
## $irr
## [1] 1.4
##
## $or
## [1] 1.411908
Rounding up gives a total sample size requirement of 13544. We would expect that, after random sampling, about a third of these would be smokers. Notice that 13544 is not exactly divisible by 3, but any further rounding up would not be justifiable because we cannot guarantee the exact proportion of smokers.
Suppose, for example, that due to financial constraints we are only able to enroll 5000 subjects into the study described above.
What are the minimum and maximum detectable relative risks?
irexp0 = (5 * 413)/100000
epi.sscohortc(irexp1 = NA, irexp0 = irexp0, n = 5000, power = 0.90,
r = 1, design = 1, sided.test = 1, conf.level = 0.95)
## $n.total
## [1] 5000
##
## $n.exp1
## [1] 2500
##
## $n.exp0
## [1] 2500
##
## $power
## [1] 0.9
##
## $irr
## [1] 0.5047104 1.6537812
##
## $or
## numeric(0)
The minimum detectable relative risk above 1 is 1.65; the maximum detectable relative risk below 1 is 0.50.
A study is to be carried out to assess the effect of a new treatment for the reproductive period in dairy cattle. What is the required sample size if we expect the proportion of cows responding in the treatment (exposed) group to be 0.30 and the proportion responding in the control (unexposed) group to be 0.15 (a relative risk of 2)?
The required power for this study is 0.80 using a two-sided 0.05 test.
epi.sscohortc(irexp1 = 0.30, irexp0 = 0.15, n = NA, power = 0.80,
r = 1, design = 1, sided.test = 2, conf.level = 0.95)
## $n.total
## [1] 242
##
## $n.exp1
## [1] 121
##
## $n.exp0
## [1] 121
##
## $power
## [1] 0.8
##
## $irr
## [1] 2
##
## $or
## [1] 2.428571
A total of 242 cows are required: 121 in the treatment (exposed) group and 121 in the control (unexposed) group.
A case-control study is carried out to determine the efficacy of a vaccine for the prevention of childhood tuberculosis. Assume that 50% of the controls are not vaccinated. If the numbers of cases and controls are equal, what sample size is needed to detect, with 80% power and 5% type I error, an odds ratio of at least 2 in the target population?
We can use the function epi.sscc of the epiR library:
epi.sscc(OR = 2.0, p0 = 0.50, n = NA, power = 0.80, r = 1,
design = 1, sided.test = 2, conf.level = 0.95, method = "unmatched")
## $n.total
## [1] 274
##
## $n.case
## [1] 137
##
## $n.control
## [1] 137
##
## $power
## [1] 0.8
##
## $OR
## [1] 2
Or, equivalently, a sample size calculation for each group can be derived from testing the difference in proportions (the slight difference in the resulting number is due to the formulas' different approximations). Recall that p2 corresponds to the probability of being exposed (i.e. not vaccinated) given that you are a control (50%), and p1 corresponds to the probability of being exposed (i.e. not vaccinated) given that you are a case; p1 is derived from the formula that links the odds ratio, p2 and p1:
p2=0.5
p1=0.667
h.calc = 2*asin(sqrt(p1))-2*asin(sqrt(p2))
pwr.2p.test(h =h.calc , n =NULL, sig.level =0.05, power =0.80)
##
## Difference of proportion power calculation for binomial distribution (arcsine transformation)
##
## h = 0.3405441
## n = 135.3599
## sig.level = 0.05
## power = 0.8
## alternative = two.sided
##
## NOTE: same sample sizes
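For reference, the formula linking p1 to the odds ratio and p2 (used above with p1 rounded to 0.667) is p1 = OR * p2 / (1 + p2 * (OR - 1)):

```r
OR <- 2.0
p2 <- 0.5  # probability of exposure among controls

# Invert the odds ratio definition to get the exposure probability among cases
p1 <- OR * p2 / (1 + p2 * (OR - 1))
p1  # 2/3, i.e. the 0.667 used above
```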
A randomised trial is to be conducted comparing two new treatments aimed at increasing the weights of malnourished children with a control group. The minimal worthwhile benefit is an increase in mean weight of 2.5kg (effect size), with respect to the controls, and the standard deviations of weight changes are believed to be 3.5kg. What are the required sample sizes, assuming that the control group is twice as large as each of the two treatment groups and an 80% power is required for each comparison?
The worthwhile benefit is 2.5 kg and, since we don't know the actual means in the three groups, we can substitute hypothetical reasonable values for 'treat' and 'control', so long as the difference is 2.5. Also, given that we are performing two comparisons, a reasonable type I error level (alpha) would be 0.025 instead of the conventional 0.05 (the so-called alpha-inflation issue for multiple tests, addressed here with a Bonferroni correction).
The required sample sizes can then be obtained as follows:
epi.sscompc(treat = 20, control = 17.5, n = NA, sigma = 3.5, power = 0.80,
r = 2, design = 1, sided.test = 2, conf.level = 0.975)
## $n.total
## [1] 84
##
## $n.treat
## [1] 56
##
## $n.control
## [1] 28
##
## $power
## [1] 0.8
##
## $delta
## [1] 2.5
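A by-hand check with the standard unequal-allocation formula for two means, n_small = (1 + 1/r) * sigma^2 * (z_{1-alpha/2} + z_{1-beta})^2 / delta^2 (with allocation ratio r = 2 and the Bonferroni-adjusted alpha = 0.025), reproduces the 28/56 split above:

```r
sigma <- 3.5
delta <- 2.5
alpha <- 0.025  # Bonferroni-adjusted two-sided level
power <- 0.80
r     <- 2      # allocation ratio between the two group sizes

z <- qnorm(1 - alpha / 2) + qnorm(power)
n_small <- (1 + 1 / r) * sigma^2 * z^2 / delta^2
c(ceiling(n_small), ceiling(n_small) * r)  # 28 and 56, total 84 as above
```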
Or equivalently (the slight difference in the resulting number is due to the formulas' different approximations):
n.for.2means(mu1=17.5, mu2=20, sd1=3.5, sd2=3.5, ratio=2,alpha=0.025)
##
## Estimation of sample size for testing Ho: mu1==mu2
## Assumptions:
##
## alpha = 0.025
## power = 0.8
## n2/n1 = 2
## mu1 = 17.5
## mu2 = 20
## sd1 = 3.5
## sd2 = 3.5
##
## Estimated required sample size:
##
## n1 = 29
## n2 = 57
## n1 + n2 = 86
##
Since the calculation is symmetric in the two means, we can take the larger group as the controls: we should enroll a total of 115 subjects (57 controls + 29 x 2 in the two treatment arms).