Suppose that an estimate is desired of the average price of tablets of a tranquilizer. A random sample of pharmacies is selected. The estimate is required to be within 10 cents of the true average price with 95% confidence. Based on a small pilot study, the standard deviation in price can be estimated as 85 cents. How many pharmacies should be randomly selected?
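The required sample size follows from the standard normal-approximation formula for estimating a mean with margin of error E (recalled here for reference, with the values of the problem):
$$ n = \frac{z_{1-\alpha/2}^{2}\,\sigma^{2}}{E^{2}} = \frac{1.96^{2}\times 0.85^{2}}{0.10^{2}} \approx 277.6, $$
which is rounded up to 278.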
We can do this simple calculation by writing a few lines of code (it is not necessary to use a specific R library):
z_alpha <- 1.96   # critical value for 95% confidence
sigma <- 0.85     # anticipated standard deviation (85 cents, expressed in dollars)
prec <- 0.10      # required precision (10 cents, expressed in dollars)
n <- (z_alpha^2 * sigma^2) / prec^2
sample_size <- ceiling(n)   # round up to the next whole pharmacy
sample_size
## [1] 278
As a result, a sample of 278 pharmacies should be taken.
It is important to note that, to calculate n, an anticipated value of the population standard deviation is required.
If this is not known, a rough guide is to take the largest minus the smallest anticipated value of the measurement of concern and divide the difference by 4, as illustrated below.
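For instance, if tablet prices were anticipated to range roughly from 2.00 to 5.40 dollars (hypothetical values, used only to illustrate the rule of thumb), we would get:
max_price <- 5.40   # largest anticipated value (hypothetical)
min_price <- 2.00   # smallest anticipated value (hypothetical)
(max_price - min_price) / 4   # rough guess for sigma
## [1] 0.85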
The formula below provides the sample size needed for an interval estimate of a population proportion at the 0.95 confidence level, with margin of error E and planned proportion estimate p.
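This is the usual normal-approximation formula for a proportion:
$$ n = \frac{z_{1-\alpha/2}^{2}\; p\,(1-p)}{E^{2}}. $$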
Example: Using a 0.50 planned proportion estimate, find the sample size needed to achieve 0.05 margin of error for the estimate at 0.95 confidence level.
zstar <- qnorm(.975)   # critical value for 95% confidence
p <- 0.5               # planned proportion estimate
E <- 0.05              # margin of error
n <- zstar^2 * p * (1 - p) / E^2
minsamp <- ceiling(n)
minsamp
## [1] 385
It should be noted that this approach is valid if simple random sampling is used, which is rarely the case in an actual observational study. With a different sampling scheme, a design effect should be considered. This is a more advanced topic, not covered in this course; a minimal sketch of the adjustment is shown below just to give the idea.
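As a minimal sketch (assuming, purely for illustration, a design effect of 2, as might arise with a cluster design), the simple-random-sampling sample size is simply inflated by the design effect:
deff <- 2                          # hypothetical design effect, for illustration only
n_adjusted <- ceiling(n * deff)    # n from the proportion example above
n_adjusted
## [1] 769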
The diet data frame has 337 rows and 14 columns. The data concern a subsample of subjects drawn from larger cohort studies of the incidence of coronary heart disease (CHD). These subjects had all completed a 7-day weighed dietary survey while taking part in validation studies of dietary questionnaire methods. At the end of the observation period (follow-up) it was found that 46 CHD events had occurred in this group, thus allowing a study of the relationship between diet and the incidence of CHD. We now load the required R library and the dataset.
library(Epi)
## Warning: package 'Epi' was built under R version 4.4.3
data(diet)
First of all, we want to estimate the overall incidence rate of CHD; to do so, we need to compute the follow-up time in years for each subject in the study.
attach(diet)
y <- cal.yr(dox) - cal.yr(doe)   # follow-up in years: date of exit minus date of entry
Then we compute the total follow-up time of the study and the number of incident cases:
Y <- sum(y)     # total follow-up time (person-years)
D <- sum(chd)   # total number of CHD events
Finally, assuming a constant rate, we estimate the incidence rate as follows:
rate <- D/Y
rate
## [1] 0.009992031
c(round(rate, digits=3), round(rate-1.96*(sqrt(D)/Y),digits=3), round(rate+1.96*(sqrt(D)/Y), digits=3))
## [1] 0.010 0.007 0.013
Equivalently:
erf <- exp(1.96/sqrt(D))
c(round(rate, digits=3), round(rate/erf, digits=3), round(rate*erf, digits=3))
## [1] 0.010 0.007 0.013
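Both intervals rely on the same large-sample results (standard formulas, recalled here for reference): on the original scale the standard error of the estimated rate is $\sqrt{D}/Y = \hat\lambda/\sqrt{D}$, while on the log scale the standard error of $\log\hat\lambda$ is $1/\sqrt{D}$, which gives a multiplicative error factor:
$$ \hat\lambda = \frac{D}{Y}, \qquad \hat\lambda \pm 1.96\,\frac{\sqrt{D}}{Y} \quad \text{or, equivalently,} \quad \left(\hat\lambda/\mathrm{EF},\; \hat\lambda\times\mathrm{EF}\right) \text{ with } \mathrm{EF} = \exp\!\left(1.96/\sqrt{D}\right). $$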
We can also estimate the incidence rate directly using a Poisson regression model; in the model we need the log of the follow-up time for each person as the offset variable:
m1 <- glm(chd~1,offset=log(y), family=poisson, data=diet)
ci.exp(m1)
## exp(Est.) 2.5% 97.5%
## (Intercept) 0.009992031 0.007484355 0.01333992
The function glm with family=poisson fits a Poisson regression model: here with just one parameter, the intercept (specified as '1' in the formula), while including the log person-time as a covariate with a fixed coefficient of 1; this is what is called an offset.
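A one-line derivation (standard Poisson-offset reasoning, stated here for clarity) shows why the exponentiated intercept is exactly the rate D/Y: for subject $i$ with follow-up time $y_i$,
$$ \log E[D_i] = \beta_0 + \log y_i \;\Longrightarrow\; E[D_i] = e^{\beta_0}\, y_i, $$
so $e^{\beta_0}$ is the rate per person-year, and its maximum likelihood estimate equals $\sum_i D_i / \sum_i y_i = D/Y$.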
Another example of computation using the rate on the original scale: suppose 15 events are observed during 5532 person-years in a given study cohort. Let's now estimate the underlying incidence rate λ per 1000 person-years (so the person-time is expressed as 5.532) and get an approximate confidence interval:
D <- 15          # number of observed events
Y <- 5.532       # person-years, expressed in thousands
rate <- D / Y    # rate per 1000 person-years
SE.rate <- rate / sqrt(D)
c(round(rate, digits=3), round(SE.rate, digits=3), round(rate, digits=3) + c(-1.96, 1.96)*round(SE.rate, digits=3))
## [1] 2.711 0.700 1.339 4.083
# equivalent to:
c(round(rate, digits=3), round(rate-1.96*(sqrt(D)/Y),digits=3), round(rate+1.96*(sqrt(D)/Y), digits=3))
## [1] 2.711 1.339 4.084
Now, suppose we want to estimate the sample size required in a study to estimate an incidence rate with a pre-specified precision. Let's follow this example: based on data from previously conducted studies, we expect the rate to be about 50 per 10,000 person-years (pyrs).
We want to determine the size of the sample that will be required to estimate the incidence rate in that population within ±5 per 10,000 pyrs.
Here we require the margin of error (E = 5) to satisfy E = 1.96 × SE(rate), so the target standard error of the rate is:
se.rate <- (5/1.96)
Then, since SE(rate) = rate/√D, we can derive the number of cases needed as:
number.cases <- (50/se.rate)**2
number.cases
## [1] 384.16
Finally, we derive the person-time needed to observe that number of cases (expected cases = rate × person-time, so person-time = cases/rate; here the result is in units of 10,000 pyrs, hence the multiplication by 10,000 below):
person.years <- number.cases/50
person.years*10000
## [1] 76832
Therefore, we could follow 76,832 subjects for one year (or 38,416 for two years, etc.) in order to expect about 384 events and be able to estimate a 95% confidence interval with the required precision.
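Putting the steps together, this is just the precision-based formula for the required number of events (a standard calculation, shown here with λ = 50 and E = 5 per 10,000 pyrs):
$$ \mathrm{SE}(\hat\lambda) = \frac{\lambda}{\sqrt{D}} = \frac{E}{1.96} \;\Longrightarrow\; D = \left(\frac{1.96\,\lambda}{E}\right)^{2} = \left(\frac{1.96\times 50}{5}\right)^{2} \approx 384, \qquad Y = \frac{D}{\lambda} \approx 76{,}832 \text{ person-years}. $$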
Calculate the maximum sample size required to estimate the prevalence of respiratory tract infection, with a precision of 5%, in a target population consisting of children in a particular region of a developing country. N.B.: an estimate of the population prevalence is not known. However, we can obtain the range of sample sizes required for a wide range of values of p, say from 0.1 to 0.9.
Here we can use the R library epiDisplay, which provides many useful functions, for example n.for.survey, which is designed for cross-sectional studies aimed at estimating a prevalence.
library(epiDisplay)
library(epiR)
library(pwr)
p <- seq(0.1,0.9,0.1)
d <- 0.05
n.for.survey(p=p, delta=d)
## Sample size for survey.
## Assumptions:
## Confidence limit = 95 %
## Delta = 0.05 from the estimate.
##
## p delta n
## 1 0.1 0.05 138
## 2 0.2 0.05 246
## 3 0.3 0.05 323
## 4 0.4 0.05 369
## 5 0.5 0.05 384
## 6 0.6 0.05 369
## 7 0.7 0.05 323
## 8 0.8 0.05 246
## 9 0.9 0.05 138
We can see from the output above that the maximum sample size is required when p is equal to 0.5, as expected, since p(1 − p) is largest at p = 0.5. This holds for any study where the prevalence is not known beforehand and the precision is fixed: in such situations, the safest (most conservative) choice is to assume that p = 0.5. A quick cross-check with the formula used earlier is shown below.
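As a quick cross-check (using the same normal-approximation formula applied earlier for a proportion, not the epiDisplay internals; note that n.for.survey appears to round to the nearest integer, hence 384 for p = 0.5 versus the 385 obtained earlier with ceiling), base R reproduces the same column of sample sizes:
p <- seq(0.1, 0.9, 0.1)
d <- 0.05
round(qnorm(0.975)^2 * p * (1 - p) / d^2)
## [1] 138 246 323 369 384 369 323 246 138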