Suppose that an estimate is desired of the average price of tablets of a tranquilizer. A random sample of pharmacies is selected. The estimate is required to be within 10 cents of the true average price with 95% confidence. Based on a small pilot study, the standard deviation in price can be estimated as 85 cents. How many pharmacies should be randomly selected?
We can do this simple calculation without using any particular R library:
z_alpha <- 1.96   # standard normal quantile for 95% confidence
sigma <- 0.85     # anticipated standard deviation of the price (85 cents)
prec <- 0.10      # required absolute precision (10 cents)
n <- (z_alpha^2*sigma^2)/prec^2
sample_size <- ceiling(n)   # round up to the next integer
sample_size
## [1] 278
As a result, a sample of 278 pharmacies should be taken.
It is important to note that, to calculate n, an anticipated value of the population standard deviation is required. In the absence of such knowledge, a rough guide is provided by the largest minus the smallest anticipated value of the measurement of concern, divided by 4.
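As a minimal sketch of this rule of thumb (the price range below is purely hypothetical, chosen so that it reproduces the 85-cent value used above):
min_price <- 0.50   # hypothetical smallest anticipated price (dollars)
max_price <- 3.90   # hypothetical largest anticipated price (dollars)
sigma_guess <- (max_price - min_price)/4   # range/4 rule of thumb
sigma_guess
## [1] 0.85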
If instead we want to determine the sample size necessary to be 95% confident of estimating the average price of tablets within 5% of the true value, and we know, based on pilot survey data, that the true price should be about 1 dollar:
z_alpha <- 1.96
sigma <- 0.85
eps <- 0.05      # required relative precision (5% of the true mean)
hyp_mu <- 1      # anticipated true mean price (1 dollar)
n <- (z_alpha^2*sigma^2)/(eps^2*hyp_mu^2)
sample_size <- ceiling(n)
sample_size
## [1] 1111
Hence, 1111 pharmacies should be sampled in order to be 95% confident that the resulting estimate will fall between 0.95 and 1.05 dollars, if the true average price is 1 dollar.
The formula n = z^2 * p * (1 - p) / E^2 provides the sample size needed to estimate a population proportion at the 0.95 confidence level, with margin of error E and planned proportion estimate p, where z is the 0.975 quantile of the standard normal distribution.
Example: using a planned proportion estimate of 0.50, find the sample size needed to achieve a 0.05 margin of error for the estimate at the 0.95 confidence level.
zstar <- qnorm(.975)   # standard normal 0.975 quantile (about 1.96)
p <- 0.5               # planned proportion estimate
E <- 0.05              # required margin of error
n <- zstar^2*p*(1-p) / E^2
minsamp <- ceiling(n)
minsamp
## [1] 385
Another similar example: a district medical officer seeks to estimate the proportion of children in the district receiving appropriate childhood vaccinations. Assuming a simple random sample of a community is to be selected, how many children must be studied if the resulting estimate is to fall within 5% of the true proportion (relative precision) with 95% confidence? In the following, we assume a value of 0.5 for the unknown proportion:
zstar <- qnorm(.975)   # standard normal 0.975 quantile
p <- 0.5               # assumed value of the unknown proportion
eps <- 0.05            # required relative precision (5% of the true proportion)
n <- zstar^2*(1-p)/(p*eps^2)
minsamp <- ceiling(n)
minsamp
## [1] 1537
A sample of 1537 children would be needed.
It should be noted that this approach is valid if simple random sampling is used, which is rarely the case in an actual observational study. With a different sampling scheme, a design effect should be considered. This is a more advanced topic, not covered in this course.
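As a rough illustration only (the design effect value below is an arbitrary assumption, not something derived in this course), the simple-random-sampling sample size is inflated by the design effect:
n_srs <- 385   # sample size under simple random sampling (from the example above)
deff <- 2      # hypothetical design effect for a clustered design
ceiling(n_srs * deff)
## [1] 770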
The diet data frame has 337 rows and 14 columns. The data concern a subsample of subjects drawn from larger cohort studies of the incidence of coronary heart disease (CHD). These subjects had all completed a 7-day weighed dietary survey while taking part in validation studies of dietary questionnaire methods. Upon the closure of the MRC Social Medicine Unit, from where these studies were directed, it was found that 46 CHD events had occurred in this group, thus allowing a study of the relationship between diet and the incidence of CHD. We now load the R library required and the dataset.
library(Epi)
data(diet)
First of all, we want to estimate the overall incidence rate of CHD: we first compute the follow-up time in years for each subject in the study.
attach(diet)
y <- cal.yr(dox) - cal.yr(doe)   # date of exit minus date of entry, in years
Then we compute the total follow-up time of the study and the number of incident cases:
Y <- sum(y)     # total person-years of follow-up
D <- sum(chd)   # total number of CHD events
Finally, assuming a constant rate, we estimate the incidence rate as follows:
rate <- D/Y
rate
## [1] 0.009992031
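In other words, the estimated rate is about 10 CHD events per 1000 person-years of follow-up.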
We can use log(rate) to derive the 95% confidence interval: since the standard error of log(rate) is approximately 1/sqrt(D), the corresponding error factor is exp(1.96/sqrt(D)):
erf <- exp(1.96/sqrt(D))       # error factor on the multiplicative scale
c(rate, rate/erf, rate*erf)    # point estimate and 95% confidence limits
## [1] 0.009992031 0.007484256 0.013340094
Of note: as we have seen, the likelihood for a constant rate based on the number of events D and the risk time Y is proportional to a Poisson likelihood for the observation D with mean rate*Y. Hence we can also estimate the rate using a Poisson regression model; in the model we need the log of the follow-up time for each person as the offset variable:
m1 <- glm(chd~1,offset=log(y), family=poisson, data=diet)
ci.exp(m1)
## exp(Est.) 2.5% 97.5%
## (Intercept) 0.009992031 0.007484355 0.01333992
The function glm with family=poisson fits a Poisson regression model (see Block 3 for other examples), here with a single parameter, the intercept (denoted '1'), while including the log person-time as a covariate with a fixed coefficient of 1; this is what is called an offset.
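As a quick check (output not shown), the exponentiated intercept reproduces the crude rate D/Y computed above:
exp(coef(m1))   # should equal D/Y, about 0.00999 events per person-year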
Another example, working with the rate on the original scale: suppose 15 events are observed during 5532 person-years in a given study cohort. Let's now estimate the underlying incidence rate λ (expressed per 1000 person-years, hence Y = 5.532) and get an approximate confidence interval:
D <- 15           # number of events
Y <- 5.532        # person-time, in units of 1000 person-years
rate <- D / Y     # rate per 1000 person-years
SE.rate <- rate/sqrt(D)
c(rate, SE.rate, rate + c(-1.96, 1.96)*SE.rate )   # estimate, SE and 95% CI
## [1] 2.7114967 0.7001054 1.3392901 4.0837034
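For comparison (output not shown), the same interval can also be obtained with the error-factor approach used earlier for the diet data; being based on log(rate), it is asymmetric around the point estimate:
erf <- exp(1.96/sqrt(D))    # error factor, since the SE of log(rate) is 1/sqrt(D)
c(rate/erf, rate*erf)       # approximate 95% CI, per 1000 person-years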
Now, if we want to estimate the sample size required in a study to estimate an incidence rate with a pre-specified precision, consider this example: based on data from previously conducted studies, we expect the rate to be about 50 per 10,000 person-years. We want to determine the size of the sample that will be required to estimate the incidence rate in that population within ±5 per 10,000 person-years.
We are here requiring that the margin of precision be E = 1.96 * SE(rate), so we can derive the required standard error of the rate as:
se.rate <- 5/1.96    # required standard error, per 10,000 person-years
Then, since SE(rate) = rate/sqrt(number of cases), we can derive the number of cases needed by:
number.cases <- (50/se.rate)^2
number.cases
## [1] 384.16
Finally, we derive the person-time needed to observe that number of cases:
person.years <- number.cases/50   # person-time, in units of 10,000 person-years
person.years*10000                # converted to single person-years
## [1] 76832
Therefore, we could follow 76832 subjects for one year (or 38416 subjects for two years, and so on) in order to observe about 384 events and be able to estimate a 95% confidence interval of the required precision.
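The whole calculation can be wrapped in a small helper function; the name rate.samplesize and its arguments are our own, introduced here only as a sketch:
# Sketch: events and person-time needed to estimate a rate with absolute
# precision 'prec' (on the same scale as 'rate') at the given confidence level.
rate.samplesize <- function(rate, prec, conf = 0.95) {
  z <- qnorm(1 - (1 - conf)/2)
  se <- prec/z          # required standard error of the rate
  D <- (rate/se)^2      # expected number of events needed
  Y <- D/rate           # person-time needed, in the time units of the rate
  list(events = D, person.time = Y)
}
rate.samplesize(rate = 50/10000, prec = 5/10000)   # close to the 384 events and
                                                   # 76832 person-years found above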