Load R libraries
library(epiR)
library(Epi)
library(popEpi)
library(ggplot2);
library(scales);
library(lubridate)
library(AER)
library(MASS)
library(survival)
Let’s imagine to have a closed cohort of 1000 subjects. They
experience a mortality rate of 11 deaths per 1000 person-years over a
period of 50 years.
After 50 years, the cumulative incidence of death would expected to be
CI = IR X T = 11 X 50 = 550 deaths. In reality, there would only be 423
deaths after 50 years. The trick here is that the size of the population
at risk declines over time !! [This concept will be better
“formalized” in the survival analysis approach]. After the first year
there have been 11 deaths, and the population now has only 989 people,
not 1000 (and so on..)… This “decay” process could be modelled using the
exponential function, so that for example the relationship between
cumulative incidence and incidence rate could be expressed as :
\[CI=1-exp(-IR*t)\]
time <- seq(from=1, to=50)
cum.inc.base <- 0.011*time
cum.inc.exp <- 1-exp(-0.011*time)
plot(time, cum.inc.base, col="blue", xlab="Follow up (years)", ylab="Cumulative incidence")
points(time, cum.inc.exp, col="red")
lines(time, cum.inc.base,col="blue", lwd=2)
lines(time, cum.inc.exp,col="red", lwd=2)
The theoretical counterpart of the incidence rate is the probability of dying in a small interval relative to the length of the interval. This is usually denoted by the Greek letter \(\lambda\):
\[\lambda=P\{{\sf death\ in} (t,t+dt)|{\tt alive\ at\ t\}}/dt\]
When we say small interval, we are approaching the formal definition as the limit of the quantity as dt tends to 0. The vertical bar denotes conditional probability (given). Note that the theoretical rate is also a scaled quantity; it is measured in units of \(time^{-1}\) (deaths per person time) because the denominator dt is measured in time units. As in the case of prevalence, there is an uncertainty associated with the estimate of the theoretical mortality rate, which again we can quantify in a confidence interval. The data used to estimate \(\lambda\) are D=number of deaths and Y=amount of follow up time.
In order to model this quantity we can fit a Poisson regression model (other examples will follow in block 3): 2648 persons died out of 150000 persons observed for two years, therefore the number of person-years is 300000. Let’s fit a Poisson model to this data :
D <- 2648
Y <- 300000
mm <- glm(D~1,offset=log(Y), family=poisson)
round(ci.pred(mm,newdata=data.frame(Y=1000)),3)
## Estimate 2.5% 97.5%
## 1 8.827 8.497 9.169
The observed outcome is D, number of deaths, and the amount of follow up time (Y) is specified in the offset argument.
The estimated mortality rate is 8.83 deaths per 1000 person-years and now we also have the confidence interval around this estimate.
In this way, the incidence rate of mortality is described through a model for the number of events. The model is specified by ~1, which means that the mortality rate is assumed constant over all observed units, specifically, that the number of deaths is proportional to the number of person-years as supplied in the offset argument. The log is used to linearize the model.
Usually data are not aggregated but instead come from individuals subjects that are observed for a period of time (follow up). The simplest example is following persons for the occurrence of death. The basic requirements for recordings in a follow-up study is: - date of entry to the study (the first date a person is known to be alive and at risk of having the event) - date of exit from the study (the last date a person is known to be alive and at risk of having the event/or the date of death) - The status of the person at the exit date - with or without the event.
It is assumed that the event terminates the follow-up, so we are implicitly referring to events like death or diagnosis of a chronic disease, where a person ceases to be at risk of the event if it occurs. More complex data structures and methods allow for recurrent events or competing risk (some basic ideas will be described at the end of block 4).
In the Epi package there is the dataset DMlate:
data(DMlate)
str(DMlate)
## 'data.frame': 10000 obs. of 7 variables:
## $ sex : Factor w/ 2 levels "M","F": 2 1 2 2 1 2 1 1 2 1 ...
## $ dobth: num 1940 1939 1918 1965 1933 ...
## $ dodm : num 1999 2003 2005 2009 2009 ...
## $ dodth: num NA NA NA NA NA ...
## $ dooad: num NA 2007 NA NA NA ...
## $ doins: num NA NA NA NA NA NA NA NA NA NA ...
## $ dox : num 2010 2010 2010 2010 2010 ...
head(DMlate)
## sex dobth dodm dodth dooad doins dox
## 50185 F 1940.256 1998.917 NA NA NA 2009.997
## 307563 M 1939.218 2003.309 NA 2007.446 NA 2009.997
## 294104 F 1918.301 2004.552 NA NA NA 2009.997
## 336439 F 1965.225 2009.261 NA NA NA 2009.997
## 245651 M 1932.877 2008.653 NA NA NA 2009.997
## 216824 F 1927.870 2007.886 2009.923 NA NA 2009.923
This dataset represents follow up of a random sample of 10000 diabetes patients in Denmark diagnosed after 1995, followed till the end of 2009. Random noise has been added to all dates in order to make persons untraceable. Each line represents the follow up of one person.
We have the following variables in the dataset:
Sex: a factor with levels M F
dobth: Date of birth
dodm: Date of inclusion in the register
dodth: Date of death
dooad: Date of 2nd prescription of OAD (a drug)
doins: Date of 2nd insulin prescription (a drug)
dox: Date of exit from follow-up.
All dates are given in fractions of years, so 1998.000 corresponds to 1 January 1998 and 1998.997 to 31 December 1998.
All dates are randomly perturbed by a small amount, so no real persons have any of the combinations of the 6 dates in the dataset. But results derived from the data will be quite close to those that would be obtained if the entire ‘real’ diabetes register were used. The overall number of deaths, follow up time and mortality rates by sex can be shown using stat.table function:
stat.table(index=sex, contents=list(D=sum(!is.na(dodth)), Y=sum(dox-dodm), rate=ratio(!is.na(dodth),dox-dodm,1000)), margin=TRUE, data=DMlate)
## ---------------------------------
## sex D Y rate
## ---------------------------------
## M 1345.00 27614.21 48.71
## F 1158.00 26659.05 43.44
##
## Total 2503.00 54273.27 46.12
## ---------------------------------
The overall mortality rate is 46.1 deaths per 1000 person-years. Note that in making this calculation we are already implicitly imposed a statistical model describing the data, namely, the model which assumes a constant mortality rate for all persons during the study period. The single entries are from a model that assumes that mortality rates only differ by sex. While this model is clearly wrong for almost any purpose, it serves as an essential building block in derivation of more realistic models (see Block 4).
The following example does not require the use of any specific R library, we will do simply “by hand” calculations.
Age-specific data on the incidence of colon cancer in male and female populations of Finland during 1999 are given in the following table:
Let’s calculate the following summary measures:
We will compare and comment the results obtained in items 1 to 3.
World Standard Population:
First of all, we create the data matrix:
M <- matrix( c(10,1157,46.0,0.9,22,1109,41.9,2.0,0.44
,76,809,32.0,9.4,68,786,29.7,8.6,1.09
,305,455,18.0,67,288,524,19.8,55,1.22
,201,102,4.0,196,354,229,8.6,155,1.27), nrow=4, byrow=T )
M <- data.frame(M)
names(M) <- c("mca","mpy","mp","mr",
"fca","fpy","fp","fr","rr")
M
## mca mpy mp mr fca fpy fp fr rr
## 1 10 1157 46 0.9 22 1109 41.9 2.0 0.44
## 2 76 809 32 9.4 68 786 29.7 8.6 1.09
## 3 305 455 18 67.0 288 524 19.8 55.0 1.22
## 4 201 102 4 196.0 354 229 8.6 155.0 1.27
Then, we compute the crude incidence rates:
rates <- with( M, c( sum(mca)/sum(mpy)*100,
sum(fca)/sum(fpy)*100 ) )
rates[3] <- rates[1]/rates[2]
names(rates) <- c("M rate","F rate","M/F RR")
round( rates, 2 )
## M rate F rate M/F RR
## 23.46 27.64 0.85
Now we compute the age-standardized rates and their ratio using the male population as the standard:
wm <- with( M, mpy/sum(mpy) )
wm
## [1] 0.45858105 0.32065002 0.18034086 0.04042806
rates <-with( M, c( sum(mca/mpy*wm)*100,
sum(fca/fpy*wm)*100 ) )
rates[3] <- rates[1]/rates[2]
names(rates) <- c("M rate","F rate","M/F RR")
round( rates, 2 )
## M rate F rate M/F RR
## 23.46 19.85 1.18
And now we compute the age-standardized rates and their ratio using the World Standard Population:
WSP <- c(120,100,90,90,80,80,60,60,60,60,50,40,40,30,20,10,5,3,2)
WSP
## [1] 120 100 90 90 80 80 60 60 60 60 50 40 40 30 20 10 5 3 2
wt <- sum(WSP[1:7])
wt[2] <- sum(WSP[8:11])
wt[3] <- sum(WSP[12:15])
wt[4] <- sum(WSP[16:19])
wt <- wt/sum(wt)
wt
## [1] 0.62 0.23 0.13 0.02
rates <- with( M, c( sum(mca/mpy*wt)*100,
sum(fca/fpy*wt)*100 ) )
rates[3] <- rates[1]/rates[2]
names(rates) <- c("M rate","F rate","M/F RR")
round( rates, 2 )
## M rate F rate M/F RR
## 15.35 13.46 1.14
We can notice that when using the original data from the Finland population, males seem to have a low risk of colon cancer with respect to females. But this result is driven by the confounding effect of age, since females are older than males. When comparing the rates adjusting for the differences in age groups, the relative risk of males is instead higher than females.