Epidemiology is the study of the frequency, distribution and determinants of health-related states in populations and the application of such knowledge to control health problems (Disease Control and Prevention 2006).
This introduction provides examples on the way R can be used for initial descriptive epidemiological analyses, that is, to describe how the frequency of disease varies in the populations, places and times. Note that we will not discuss specific methods for spatial epidemiology in this course, that is an advanced and more specific topic.
Load the libraries that we will use in this session:
library(epiR)
library(Epi)
library(popEpi)
library(ggplot2);
library(scales);
library(lubridate)
library(AER)
library(MASS)
library(survival)
Descriptions of disease frequency involves reporting either the prevalence or incidence of disease.
Recall of the definitions: strictly speaking, prevalence equals the number of cases of a given disease or attribute that exists in a population at a specified point in time. Prevalence is therefore the proportion of a population that has a specific disease or attribute at a specified point in time.
Two types of prevalence are reported in the literature: (1) point prevalence : the proportion of a population in a diseased state at a single point in time, (2) period prevalence : the proportion of a population with a given disease or condition over a specific period of time (i.e. the number of existing cases at the start of a follow-up period plus the number of incident cases that occur during the follow-up period).
Incidence provides a measure of how frequently susceptible (at risk) individuals become disease cases as they are observed over time. An incident case occurs when an individual changes from being susceptible (at risk) to being diseased. The count of incident cases is the number of such events that occur in a population during a defined follow-up period.
There are two ways to express incidence:
Cumulative incidence is the proportion of initially susceptible individuals in a population who become new cases during a defined follow-up period.
Incidence rate (also known as incidence density) is the number of new cases of disease that occur per unit of individual time at risk during a follow-up period.
Of note: since we usually work with samples from the population of interest, in addition to reporting the point estimate of disease frequency, it is important to provide an indication of the uncertainty around that point estimate. The epi.conf function in the epiR package allows you to obtain confidence intervals for prevalence, cumulative incidence and incidence rates. In fact, these functions are useful for the confidence intervals calculations more than for the simple formulas of prevalence and incidence.
Let’s say we’re interested in the prevalence of disease X in a population comprised of 1000 individuals. Two hundred are tested and four returned a positive result. Assuming 100% test sensitivity and specificity (=i.e. the accuracy of the diagnostic tool, see later in this block for a detailed explanation of these concepts), what is the estimated prevalence of disease X in this population?
ncas <- 4; npop <- 200
tmp <- as.matrix(cbind(ncas, npop))
epi.conf(tmp, ctype = "prevalence", method = "exact", N = 1000, design = 1, conf.level = 0.95) * 100
## est lower upper
## 1 2 0.5475566 5.041361
The estimated prevalence of disease X in this population is 2.0 (95% confidence interval [CI] 0.55 - 5.0) cases per 100 individuals. How the confidence interval is computed? See the “from data to estimates” paragraph !
Of note: the design effect (option design) is used to adjust the confidence interval around a prevalence or incidence risk estimate in the presence of clustering i.e. when the sampling mechanism is not related to single subjects but to groups. The design effect is a measure of the variability between clusters and is calculated as the ratio of the variance calculated assuming a complex sample design divided by the variance calculated assuming simple random sampling.
As we have seen, the prevalence of a disease in a population is the fraction of the population that has the disease at a given time. The total number of persons in Country X alive with colon cancer on 1 January 2023 is 16281, and the number of persons in Country X on 1 January 2023 is 5534738, so the point prevalence of colon cancer is :
16281/5534738
## [1] 0.002941603
that is, 0.29%.
This is the empirical prevalence of colon cancer patients in Country X on 1 January 2023.
The theoretical prevalence is the probability that a randomly chosen person from the population has colorectal cancer.
This is an unknown quantity, but a reasonable estimate of it would be the empirical prevalence.
That is, however, just an estimate from a statistical model that asserts that colon cancer is randomly distributed among the population, with a probability p,and there is an uncertainty associated with the estimate which we can quantify in a confidence interval. This is an interval that captures the true prevalence with some pre-specified probability (usually set at 95%).
The data used to estimate p are X=number of persons with colorectal cancer, and N=number of persons in the population. In order to assess p we can fit a binomial model:
X <- 16281
N <- 5534738
mp <- glm(cbind(X, N-X)~1, family=binomial)
round(ci.pred(mp)*100,4)
## Estimate 2.5% 97.5%
## 1 0.2942 0.2897 0.2987
We can use the function glm in R that fits generalized linear models, in this case a model for a binomial outcome. The observed outcome must be specified as the number of successes (cases of cancer) and the number of failures (not-diseased persons). The prevalence as a function of explanatory variables is described by a logistic regression model, here specified by ~1 (without covariates) which means that the prevalence is constant across all observational units (not surprisingly since we only have one data point!). The ci.pred produces a prediction of the theoretical prevalence from the model. We will see in block 3 how the logistic regression model could be used to describe prevalence in epidemiological studies and to evaluate (multiple) factors associated with prevalence.
We will use a dataset of Diabetes prevalence in Denmark in 1-year age classes by sex, measured in 01-01-2010. This dataset has 200 rows and the following 4 variables:
A: Numeric, age, 0-99
sex: Sex, a factor with levels M F
X: Number of diabetic patients
N: Population size
data(pr)
str(pr)
## 'data.frame': 200 obs. of 4 variables:
## $ A : num 0 0 10 10 11 11 12 12 13 13 ...
## $ sex: Factor w/ 2 levels "M","F": 2 1 2 1 2 1 2 1 2 1 ...
## $ X : num 1 0 84 83 106 70 104 111 133 120 ...
## $ N : num 30743 32435 32922 34294 32890 ...
head(pr)
## A sex X N
## 1 0 F 1 30743
## 2 0 M 0 32435
## 3 10 F 84 32922
## 4 10 M 83 34294
## 5 11 F 106 32890
## 6 11 M 70 34644
We would like to know the prevalence at different ages and separately by sex. Since these data come from the entire population of Denmark, the interpretation here will not include the sampling variation (i.e. confidence intervals). We can compute the observed prevalence by age and sex as follows:
prt <- stat.table(index=list(A,sex), contents=ratio(X,N,100), data=pr)[1,,]
round(prt[70:75,],1)
## sex
## A M F
## 69 16.2 12.4
## 70 16.5 12.7
## 71 17.3 13.5
## 72 17.7 13.9
## 73 18.3 14.8
## 74 18.9 15.1
a few rows of the resulting array is printed, from age of 69 to 64, just to show some results.
We can plot these entries as a function of age, with one curve for each sex:
matplot(0:99+0.5, prt, type="l", lwd=2, col=c(4,2),lty=1, xlab="Age" )
abline(v=c(76,81)+0.5)
Ages between 76 and 81 correspond to the highest prevalences for men and women. We plotted prevalence against the midpoints of the age-classes so we add 0.5 to the position of the vertical lines too.
We use a dataset of US traffic fatalities data for the “lower 48” US states (i.e., excluding Alaska and Hawaii), annually for 1982 through 1988.
data(Fatalities)
head(Fatalities)
## state year spirits unemp income emppop beertax baptist mormon drinkage
## 1 al 1982 1.37 14.4 10544.15 50.69204 1.539379 30.3557 0.32829 19.00
## 2 al 1983 1.36 13.7 10732.80 52.14703 1.788991 30.3336 0.34341 19.00
## 3 al 1984 1.32 11.1 11108.79 54.16809 1.714286 30.3115 0.35924 19.00
## 4 al 1985 1.28 8.9 11332.63 55.27114 1.652542 30.2895 0.37579 19.67
## 5 al 1986 1.23 9.8 11661.51 56.51450 1.609907 30.2674 0.39311 21.00
## 6 al 1987 1.18 7.8 11944.00 57.50988 1.560000 30.2453 0.41123 21.00
## dry youngdrivers miles breath jail service fatal nfatal sfatal
## 1 25.0063 0.211572 7233.887 no no no 839 146 99
## 2 22.9942 0.210768 7836.348 no no no 930 154 98
## 3 24.0426 0.211484 8262.990 no no no 932 165 94
## 4 23.6339 0.211140 8726.917 no no no 882 146 98
## 5 23.4647 0.213400 8952.854 no no no 1081 172 119
## 6 23.7924 0.215527 9166.302 no no no 1110 181 114
## fatal1517 nfatal1517 fatal1820 nfatal1820 fatal2124 nfatal2124 afatal
## 1 53 9 99 34 120 32 309.438
## 2 71 8 108 26 124 35 341.834
## 3 49 7 103 25 118 34 304.872
## 4 66 9 100 23 114 45 276.742
## 5 82 10 120 23 119 29 360.716
## 6 94 11 127 31 138 30 368.421
## pop pop1517 pop1820 pop2124 milestot unempus emppopus gsp
## 1 3942002 208999.6 221553.4 290000.1 28516 9.7 57.8 -0.02212476
## 2 3960008 202000.1 219125.5 290000.2 31032 9.6 57.9 0.04655825
## 3 3988992 197000.0 216724.1 288000.2 32961 7.5 59.5 0.06279784
## 4 4021008 194999.7 214349.0 284000.3 35091 7.2 60.1 0.02748997
## 5 4049994 203999.9 212000.0 263000.3 36259 7.0 60.7 0.03214295
## 6 4082999 204999.8 208998.5 258999.8 37426 6.2 61.5 0.04897637
This data frame contain 336 observations on 34 variables. Traffic fatalities are extracted from the US Department of Transportation Fatal Accident Reporting System. There are a lot of information, for example the beer tax is the tax on a case of beer, which is an available measure of state alcohol taxes more generally. The drinking age variable (drinkage) indicates whether the minimum legal drinking age is 18, 19, or 20. The two binary punishment variables (jail and service) describe the state’s minimum sentencing requirements for an initial drunk driving conviction.Total vehicle miles traveled annually by state was obtained from the Department of Transportation. Personal income was obtained from the US Bureau of Economic Analysis, and the unemployment rate was obtained from the US Bureau of Labor Statistics.
Variables of interest for us now are:
We can calculate the cumulative incidence per state per year, and plot as time series:
table.deaths <- with(Fatalities,tapply(fatal, list(state,year), sum))
table.exp <- with(Fatalities,tapply(pop, list(state,year), sum))
inc.cum.year.state <-(table.deaths/table.exp)*10000
Plot just the first five states:
plot.ts(t(inc.cum.year.state[1:5,]), plot.type="single", xlab="years", ylab="Incidence", col=c(1:5), lty=c(1:5), ylim=c(1,4))
legend("topleft", col=c(1:5), lty=c(1:5), legend=c("al", "az", "ar", "ca", "co"),bty="n")
Note that here we are not showing the confidence intervals around these estimates.
A study was conducted by Feychting, Osterlund, and Ahlbom (1998) to report the frequency of cancer among the blind. A total of 136 diagnoses of cancer were made from 22050 person-years at risk (sampling from the population of interest). What was the incidence rate of cancer in this population?
ncas <- 136; ntar <- 22050
tmp <- as.matrix(cbind(ncas, ntar))
epi.conf(tmp, ctype = "inc.rate", method = "exact", N = 1000, design = 1, conf.level = 0.95) * 1000
## est lower upper
## ncas 6.1678 5.174806 7.295817
The incidence rate of cancer in this population was 6.2 (95% CI 5.2 to 7.3) cases per 1000 person-years at risk. How the confidence interval can be computed? …
In the case when the incidence rate is assumed constant, we can assume a Poisson likelihood that leads to the estimate : (the observed outcome is D, the number of events, and the amount of follow up time is Y):
\[\hat{\lambda}=D/Y\] \[s.e.{\log(\hat{\lambda})}=1/\sqrt(D)\]
From this formula we can derive 95% confidence intervals for the estimated rate.
Mortality is typically reported as a number of people that have died in a population of a certain size.
For example, 2648 persons died out of 150000 persons. This is a mortality of 17.6 per 1000 persons, apparently:
2648/150000
## [1] 0.01765333
Looking a bit closer at this, it is a bit illogical : the number of deaths must be recorded over some period of time, in this case presumably a year. The number of persons, on the other hand, refer to the population size at some specific date, presumably 1st January or July some year. So the two components, number of deaths and population size, should refer to a period and a specific time point, respectively.
The number of persons at risk of dying will not vary a lot over a year, so as a first approximation we might assume that the population was constant in the period.
Thus, the 150000 above refers to the average population over the period.
But, if for example the 2648 deaths were accumulated over a period of two years, the mortality would then be half of what we just computed, because the 150000 people had each two years to encounter death !!
This leads to the epidemiological concept of person-years also called risk time, time at risk, exposed time or even just exposure.
The person-years is basically the population size multiplied by the length of the observation interval.
So if the population size is 150000 and the observation period is 2 years, the number of person-years is 300000 and the mortality rate is:
2648/300
## [1] 8.826667
8.83 deaths per 1000 person-years.
Note that the rate is a scaled quantity, it has unit of measurement, in this example per 1000 py.
We might equally well have given the mortality rate as 88.3 per 10000 person-years.
Loading and viewing now the dataset Melanoma:
data(Melanoma)
head(Melanoma)
## time status sex age year thickness ulcer
## 1 10 3 1 76 1972 6.76 1
## 2 30 3 1 56 1968 0.65 0
## 3 35 2 1 41 1977 1.34 0
## 4 99 3 0 71 1968 2.90 0
## 5 185 1 1 52 1965 12.08 1
## 6 204 1 1 28 1971 4.84 1
The Melanoma data frame has data on 205 patients in Denmark with malignant melanoma. This data frame contains the following columns:
time : survival time in days, possibly censored (we will define in detail characteristics of survival data later on block 4).
status: 1 died from melanoma, 2 alive, 3 dead from other causes.
sex: 1 = male, 0 = female.
age: age in years.
year: year of operation.
thickness: tumour thickness in mm.
ulcer: 1 = presence, 0 = absence.
Now we want to calculate the incidence rate of death for melanoma using the person time epidemiological approach. (We will ignore now death for other causes, that represents a competing risk in the survival jargon…).
mm.deaths <-sum(Melanoma$status==1)
mm.deaths
## [1] 57
per.time <- sum((Melanoma$time)/365.25)
per.time
## [1] 1208.279
mortality.rate <-mm.deaths/per.time
incidence.rate <- round(100*mortality.rate,1)
incidence.rate
## [1] 4.7
If you report the incidence rate of deaths as 4.7 per 100 person-years, epidemiologists might understand, but most others will not. Person-time is epidemiologic jargon.
To convert this jargon to something understandable, simply replace person-years with persons per year. Reporting the results as 4.7 deaths per 100 persons per year sounds like English rather than jargon.
It also conveys the sense of the incidence rate as a dynamic process, the speed at which new cases of disease occur in the population.
In the R library(survival) there is a specific function that computes the person-years of follow-up time contributed by a cohort of subjects, also stratified into subgroups.
py <- pyears(Surv(time, status==1) ~1, data=Melanoma)
py$pyears
## [1] 1208.279
We can evaluate for example if the incidence rate of deaths seems different among men and women (we will only examine it from a descriptive point of view, without making any inference):
py.sex <- pyears(Surv(time, status==1) ~sex, data=Melanoma)
py.sex
## Call:
## pyears(formula = Surv(time, status == 1) ~ sex, data = Melanoma)
##
## Total number of events: 57
## Total number of person-years tabulated: 1208.279
## Total number of person-years off table: 0
## Observations in the data set: 205
py.sex$pyears
## sex
## 0 1
## 787.4415 420.8378
Now let’s re-do the calculations stratified by sex:
index.males <- Melanoma$sex==1
mm.deaths.males <-sum(Melanoma[index.males,]$status==1)
mm.deaths.males
## [1] 29
index.females <- Melanoma$sex==0
mm.deaths.females <-sum(Melanoma[index.females,]$status==1)
mm.deaths.females
## [1] 28
mortality.rate.males <-mm.deaths.males/py.sex$pyears[2]
incidence.rate.males <- round(100*mortality.rate.males,1)
mortality.rate.females <-mm.deaths.females/py.sex$pyears[1]
incidence.rate.females <- round(100*mortality.rate.females,1)
incidence.rate.males
## 1
## 6.9
incidence.rate.females
## 0
## 3.6
So, for males we observe 6.9 deaths per 100 persons per year, instead for females 3.6 deaths per 100 persons per year.
Disease Control, Centers for, and Prevention. 2006. Principles of Epidemiology in Public Health Practice: An Introduction to Applied Epidemiology and Biostatistics. Atlanta, Georgia: Centers for Disease Control; Prevention.
Feychting, M, B Osterlund, and A Ahlbom. 1998. “Reduced Cancer Incidence Among the Blind.” Epidemiology 9: 490-94.
Carstensen B., Epidemiology with R, 2021, Oxford University Press.