Epidemiology is the study of the frequency, distribution and determinants of health-related states in populations and the application of such knowledge to control health problems (Disease Control and Prevention 2006).

This introduction provides examples on the way R can be used for initial descriptive epidemiological analyses, that is, to describe how the frequency of disease varies in the populations, places and times. Note that we will not discuss specific methods for spatial epidemiology in this course, that is an advanced and more specific topic.

Prevalence and Incidence in the populations

Descriptions of disease frequency involves reporting either the prevalence or incidence of disease.

Recall of the definitions: strictly speaking, prevalence equals the number of cases of a given disease or attribute that exists in a population at a specified point in time. Prevalence is therefore the proportion of a population that has a specific disease or attribute at a specified point in time.

Two types of prevalence are reported in the literature: (1) point prevalence : the proportion of a population in a diseased state at a single point in time, (2) period prevalence : the proportion of a population with a given disease or condition over a specific period of time (i.e. the number of existing cases at the start of a follow-up period plus the number of incident cases that occur during the follow-up period).

Incidence provides a measure of how frequently susceptible (at risk) individuals become disease cases as they are observed over time. An incident case occurs when an individual changes from being susceptible (at risk) to being diseased. The count of incident cases is the number of such events that occur in a population during a defined follow-up period. There are two ways to express incidence:

Cumulative incidence is the proportion of initially susceptible individuals in a population who become new cases during a defined follow-up period.

Incidence rate (also known as incidence density) is the number of new cases of disease that occur per unit of individual time at risk during a defined follow-up period.

In addition to reporting the point estimate of disease frequency, it is important to provide an indication of the uncertainty around that point estimate. The epi.conf function in the epiR package allows you to obtain confidence intervals for prevalence, cumulative incidence and incidence rates. In fact, these functions are useful for the confidence intervals calculations more than for the simple formulas of prevalence and incidence.

Prevalence example

Let’s say we’re interested in the prevalence of disease X in a population comprised of 1000 individuals. Two hundred are tested and four returned a positive result. Assuming 100% test sensitivity and specificity (=i.e. the accuracy of the diagnostic tool, see later in this block for a detailed explanation), what is the estimated prevalence of disease X in this population?

ncas <- 4; npop <- 200
tmp <- as.matrix(cbind(ncas, npop))
epi.conf(tmp, ctype = "prevalence", method = "exact", N = 1000, design = 1, conf.level = 0.95) * 100

##   est     lower    upper
## 1   2 0.5475566 5.041361

The estimated prevalence of disease X in this population is 2.0 (95% confidence interval [CI] 0.55 - 5.0) cases per 100 individuals.

Of note: the design effect (option design) is used to adjust the confidence interval around a prevalence or incidence risk estimate in the presence of clustering i.e. when the sampling mechanism is not related to single subjects but to groups. The design effect is a measure of the variability between clusters and is calculated as the ratio of the variance calculated assuming a complex sample design divided by the variance calculated assuming simple random sampling.

From data to estimates…

As we have seen, the prevalence of a disease in a population is the fraction of the population that has the disease at a given time. The total number of persons in Country X alive with colon cancer on 1 January 2023 is 16281, and the number of persons in Country X on 1 January 2023 is 5534738, so the point prevalence of colon cancer is :

16281/5534738

## [1] 0.002941603

that is, 0.29%.

This is the empirical prevalence of colon cancer patients in Country X on 1 January 2023.

The theoretical prevalence is the probability that a randomly chosen person from the population has colorectal cancer.

This is an unknown quantity, but a reasonable estimate of it would be the empirical prevalence.

That is, however, just an estimate from a statistical model that asserts that colon cancer is randomly distributed among the population, with a probability p,and there is an uncertainty associated with the estimate which we can quantify in a confidence interval. This is an interval that captures the true prevalence with some pre-specified probability (usually set at 95%).

The data used to estimate p are X=number of persons with colorectal cancer, and N=number of persons in the population. In order to assess p we can fit a binomial model:

X <- 16281
N <- 5534738
mp <- glm(cbind(X, N-X)~1, family=binomial)
round(ci.pred(mp)*100,4)

##   Estimate   2.5%  97.5%
## 1   0.2942 0.2897 0.2987

We can use the function glm in R that fits generalized linear models, in this case a model for a binomial outcome. The observed outcome must be specified as the number of successes (cases of cancer) and the number of failures (not-diseased persons). The prevalence as a function of explanatory variables is described by a model (more of this in block 3!), here specified by ~1, which means that the prevalence is constant across all observational units (not surprisingly since we only have one data point!). The ci.pred produces a prediction of the theoretical prevalence from the model.

Another example : prevalence of diabetes by age and sex

We will use a dataset of Diabetes prevalence in Denmark in 1-year age classes by sex, measured in 01-01-2010. This dataset has 200 rows and the following 4 variables:

A: Numeric, age, 0-99

sex: Sex, a factor with levels M F

X: Number of diabetes patients

N: Population size

data(pr)
str(pr)

## 'data.frame':    200 obs. of  4 variables:
##  $ A  : num  0 0 10 10 11 11 12 12 13 13 ...
##  $ sex: Factor w/ 2 levels "M","F": 2 1 2 1 2 1 2 1 2 1 ...
##  $ X  : num  1 0 84 83 106 70 104 111 133 120 ...
##  $ N  : num  30743 32435 32922 34294 32890 ...

head(pr)

##    A sex   X     N
## 1  0   F   1 30743
## 2  0   M   0 32435
## 3 10   F  84 32922
## 4 10   M  83 34294
## 5 11   F 106 32890
## 6 11   M  70 34644

We would like to know the prevalence at different ages and separately by sex. Since these data come from the entire population of Denmark, the interpretation will not include the sampling variation (i.e. confidence intervals). We can compute the observed prevalence by age and sex as follows:

prt <- stat.table(index=list(A,sex), contents=ratio(X,N,100), data=pr)[1,,]
round(prt[70:75,],1)

##     sex
## A       M    F
##   69 16.2 12.4
##   70 16.5 12.7
##   71 17.3 13.5
##   72 17.7 13.9
##   73 18.3 14.8
##   74 18.9 15.1

a few rows of the resulting array is printed, from age of 69 to 64, just to show some results.

We can plot these entries as a function of age, with one curve for each sex:

matplot(0:99+0.5, prt, type="l", lwd=2, col=c(4,2),lty=1 )
abline(v=c(76,81)+0.5)

Ages between 76 and 81 correspond to the maximal prevalences for men and women. We plotted the prevalences against the midpoints of the age-classes so we add 0.5 to the position of the vertical lines too.

Incidence rate basic example

A study was conducted by Feychting, Osterlund, and Ahlbom (1998) to report the frequency of cancer among the blind. A total of 136 diagnoses of cancer were made from 22050 person-years at risk. What was the incidence rate of cancer in this population?

ncas <- 136; ntar <- 22050
tmp <- as.matrix(cbind(ncas, ntar))
epi.conf(tmp, ctype = "inc.rate", method = "exact", N = 1000, design = 1, conf.level = 0.95) * 1000

##         est    lower    upper
## ncas 6.1678 5.174806 7.295817

The incidence rate of cancer in this population was 6.2 (95% CI 5.2 to 7.3) cases per 1000 person-years at risk.

From data to estimates…again

In the case when the incidence rate is assumed constant, we can assume a Poisson likelihood that leads to the estimate : (the observed outcome is D, number of events, and the amount of follow up time is Y):

\[\hat{\lambda}=D/Y\] \[s.e.{\log(\hat{\lambda})}=1/\sqrt(D)\]

From this formula we can derive also approximate 95% confidence intervals for the estimated rate (see Block 2). A summary of the methods used for each of the confidence interval calculations in the epi.conf function are available in the “help”of the function.

Mortality rate

Mortality is typically reported as a number of people that have died in a population of a certain size. For example, 2648 persons died out of 150000 persons. This is a mortality of 17.6 per 1000 persons, apparently:

2648/150000

## [1] 0.01765333

Looking a bit closer at this, it is a bit illogical : the number of deaths must be recorded over some period of time, in this case presumably a year. The number of persons, on the other hand, refer to the population size at some specific date, presumably 1st January or July some year. So the two components, number of deaths and population size, refer to a period and a specific time point, respectively. The number of persons at risk of dying will not vary a lot over a year, so as a first approximation we might assume that the population was constant in the period. Thus, the 150000 above refers to the average population over the period. But, if the 2648 deaths were accumulated over a period of two years, the mortality would then be half of what we just computed, because the 150000 people had each two years to encounter death. This leads to the concept of person-years also called risk time, time at risk, exposed time or even just exposure. The person-years is the population size multiplied by the length of the observation interval. So since the population size is 150000 and the observation period is 2 years, the number of person-years is 300000 and the mortality rate is:

2648/300

## [1] 8.826667

8.83 deaths per 1000 person-years. Note that the rate is a scaled quantity, it has unit of measurement, in this example per 1000 py. We might equally well have given the mortality rate as 88.3 per 10000 person-years. This quantity is called the empirical mortality rate. The theoretical counterpart is the probability of dying in a small interval relative to the length of the interval. This is usually denoted by the Greek letter \(\lambda\):

\[\lambda=P\{{\sf death\ in} (t,t+dt)|{\tt alive\ at\ t\}}/dt\]

When we say small interval, we are approaching the formal definition as the limit of the quantity as dt tends to 0. The vertical bar denotes conditional probability (given). Note that the theoretical rate is also a scaled quantity; it is measured in units of \(time^{-1}\) (deaths per person time) because the denominator dt is measured in time units. As in the case of prevalence, there is an uncertainty associated with the estimate of the theoretical mortality rate, which again we can quantify in a confidence interval. The data used to estimate \(\lambda\) are D=number of deaths and Y=amount of follow up time. In order to assess this we can fit a Poisson model (other examples in block 3):

D <- 2648
Y <- 300000
mm <- glm(D~1,offset=log(Y), family=poisson)
round(ci.pred(mm,newdata=data.frame(Y=1000)),3)

##   Estimate  2.5% 97.5%
## 1    8.827 8.497 9.169

The observed outcome is D, number of deaths, and the amount of follow up time (Y) is specified in the offset argument. The mortality rate is described through a model for the number of events. The model is specified by ~1, which means that the mortality rate is constant over all observed units, specifically, that the number of deaths is proportional to the number of person-years as supplied in the offset argument. The log is used to linearize the model. The result is an estimate of the mortality rate per 1000 person-years.

Basic data structure to estimate incidence rates (more on block 4!)

Usually data are not aggregated but instead come from individuals subjects that are observed for a period of time (follow up) (see: cohort studies, block 2). The simplest example is following persons for the occurrence of death. The basic requirements for recordings in a follow-up study is: - date of entry to the study (the first date a person is known to be alive and at risk of having the event) - date of exit from the study (the last date a person is known to be alive and at risk of having the event/or the date of death) - The status of the person at the exit date - with or without the event.

It is assumed that the event terminates the follow-up, so we are implicitly referring to events like death or diagnosis of a chronic disease, where a person ceases to be at risk of the event if it occurs. More complex data structures and methods allow for recurrent events or competing risk (some basic ideas in block 4).

In the Epi package there is the dataset DMlate:

data(DMlate)
str(DMlate)

## 'data.frame':    10000 obs. of  7 variables:
##  $ sex  : Factor w/ 2 levels "M","F": 2 1 2 2 1 2 1 1 2 1 ...
##  $ dobth: num  1940 1939 1918 1965 1933 ...
##  $ dodm : num  1999 2003 2005 2009 2009 ...
##  $ dodth: num  NA NA NA NA NA ...
##  $ dooad: num  NA 2007 NA NA NA ...
##  $ doins: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ dox  : num  2010 2010 2010 2010 2010 ...

head(DMlate)

##        sex    dobth     dodm    dodth    dooad doins      dox
## 50185    F 1940.256 1998.917       NA       NA    NA 2009.997
## 307563   M 1939.218 2003.309       NA 2007.446    NA 2009.997
## 294104   F 1918.301 2004.552       NA       NA    NA 2009.997
## 336439   F 1965.225 2009.261       NA       NA    NA 2009.997
## 245651   M 1932.877 2008.653       NA       NA    NA 2009.997
## 216824   F 1927.870 2007.886 2009.923       NA    NA 2009.923

This dataset represents follow up of a random sample of 10000 diabetes patients in Denmark diagnosed after 1995, followed till the end of 2009. Random noise has been added to all dates in order to make persons untraceable. Each line represents the follow up of one person.

Sex: a factor with levels M F

dobth: Date of birth

dodm: Date of inclusion in the register

dodth: Date of death

dooad: Date of 2nd prescription of OAD

doins: Date of 2nd insulin prescription

dox: Date of exit from follow-up.

All dates are given in fractions of years, so 1998.000 corresponds to 1 January 1998 and 1998.997 to 31 December 1998.

All dates are randomly perturbed by a small amount, so no real persons have any of the combinations of the 6 dates in the dataset. But results derived from the data will be quite close to those that would be obtained if the entire ‘real’ diabetes register were used. The overall number of deaths, follow up time and mortality rates by sex can be shown using stat.table function:

stat.table(index=sex, contents=list(D=sum(!is.na(dodth)), Y=sum(dox-dodm), rate=ratio(!is.na(dodth),dox-dodm,1000)), margin=TRUE, data=DMlate)

##  --------------------------------- 
##  sex           D        Y    rate  
##  --------------------------------- 
##  M       1345.00 27614.21   48.71  
##  F       1158.00 26659.05   43.44  
##                                    
##  Total   2503.00 54273.27   46.12  
##  ---------------------------------

The overall mortality rate is 46.1 deaths per 1000 person-years. In making this calculation we are already implicitly imposed a statistical model describing the data, namely, the model which assumes a constant mortality rate for all persons during the study period. The single entries are from a model that assumes that mortality rates only differ by sex. While this model is clearly wrong for almost any purpose, it serves as an essential building block in derivation of more realistic models (see Block 4).

Different times …

Follow-up data basically consist of a time entry, a time exit and an indication of the status at exit. Note that time comes in two guises: one is the risk time (or exposure) and the other is the time-scale. The risk time is : “How long have you been alive” (i.e. exposed to potential death). The time scale is : “When were you alive” meaning for example age or calendar time. The risk time is part of the outcome measurement (it is the denominator of the rate), whereas the time-scale is an explanatory variable. We will usually be interested in how rates vary along some time-scale, age or calendar time, for example, so in most analyses we will using both risk time and time-scales.

References

Disease Control, Centers for, and Prevention. 2006. Principles of Epidemiology in Public Health Practice: An Introduction to Applied Epidemiology and Biostatistics. Atlanta, Georgia: Centers for Disease Control; Prevention.

Feychting, M, B Osterlund, and A Ahlbom. 1998. “Reduced Cancer Incidence Among the Blind.” Epidemiology 9: 490-94.

Carstensen B., Epidemiology with R, 2021, Oxford University Press.

Descriptive Epidemiology (Introduction)