Load R libraries

library(epiR)
library(Epi)
library(popEpi)
library(ggplot2)
library(scales)
library(lubridate)
library(AER)
library(MASS)
library(survival)

Relationship between Cumulative Incidence and Incidence Rate

Let’s imagine we have a closed cohort of 1000 subjects. They experience a mortality rate of 11 deaths per 1000 person-years over a period of 50 years.
After 50 years, the cumulative incidence of death would naively be expected to be CI = IR × T = 0.011 × 50 = 0.55, i.e. 550 deaths. In reality, there would only be about 423 deaths after 50 years. The trick here is that the size of the population at risk declines over time! [This concept will be formalized in the survival analysis approach.] After the first year there have been 11 deaths, so the population now has only 989 people, not 1000, and so on. This “decay” process can be modelled using the exponential function, so that the relationship between cumulative incidence and incidence rate can be expressed as:

\[CI = 1 - e^{-IR \cdot t}\]

time         <- seq(from=1, to=50)
cum.inc.base <- 0.011*time          # naive linear approximation: CI = IR*t
cum.inc.exp  <- 1-exp(-0.011*time)  # exponential formula

plot(time, cum.inc.base, col="blue", xlab="Follow up (years)", ylab="Cumulative incidence")
points(time, cum.inc.exp, col="red")
lines(time, cum.inc.base, col="blue", lwd=2)
lines(time, cum.inc.exp, col="red", lwd=2)
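As a quick numerical check of the figures quoted above, the naive linear formula predicts 550 deaths after 50 years, while the exponential formula gives about 423:

```r
# Naive linear prediction vs. exponential formula, for 1000 subjects
# with a rate of 0.011 per year over 50 years.
naive <- 0.011 * 50 * 1000             # 550 expected deaths
exact <- (1 - exp(-0.011 * 50)) * 1000
round(c(naive = naive, exact = exact)) # 550 vs 423
```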

“Theoretical” model for the incidence rate

The theoretical counterpart of the incidence rate is the probability of dying in a small interval relative to the length of the interval. This is usually denoted by the Greek letter \(\lambda\):

\[\lambda = P\{\text{death in } (t, t+dt) \mid \text{alive at } t\}/dt\]

When we say small interval, we are approaching the formal definition, which takes the limit of this quantity as dt tends to 0. The vertical bar denotes conditional probability (“given”). Note that the theoretical rate is also a scaled quantity; it is measured in units of \(time^{-1}\) (deaths per unit of person-time) because the denominator dt is measured in time units. As in the case of prevalence, there is an uncertainty associated with the estimate of the theoretical mortality rate, which again we can quantify in a confidence interval. The data used to estimate \(\lambda\) are D = number of deaths and Y = amount of follow-up time.
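A small simulation sketch (the cohort size, seed, and censoring time are hypothetical choices) illustrates that with a constant hazard the ratio D/Y recovers \(\lambda\): event times are exponential, and even with censoring the estimate stays close to the true rate.

```r
# Sketch: simulate exponential event times with constant hazard 0.011/year,
# censor administratively at 50 years, and estimate the rate as D/Y.
set.seed(1)
true.rate <- 0.011
tt <- rexp(100000, rate = true.rate) # theoretical times to death
d  <- as.numeric(tt <= 50)           # event indicator (1 = death observed)
y  <- pmin(tt, 50)                   # follow-up time, censored at 50 years
sum(d) / sum(y)                      # close to 0.011
```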

In order to model this quantity we can fit a Poisson regression model (other examples will follow in block 3): 2648 persons died out of 150000 persons observed for two years, so the number of person-years is 300000. Let’s fit a Poisson model to these data:

D  <- 2648
Y  <- 300000
mm <- glm(D ~ 1, offset = log(Y), family = poisson)
round(ci.pred(mm, newdata = data.frame(Y = 1000)), 3)
##   Estimate  2.5% 97.5%
## 1    8.827 8.497 9.169

The observed outcome is D, number of deaths, and the amount of follow up time (Y) is specified in the offset argument.

The estimated mortality rate is 8.83 deaths per 1000 person-years and now we also have the confidence interval around this estimate.

In this way, the incidence rate of mortality is described through a model for the number of events. The model is specified by ~1, which means that the mortality rate is assumed constant over all observed units; specifically, that the number of deaths is proportional to the number of person-years supplied in the offset argument. The log link makes the model linear on the log-rate scale, which is why the offset enters as log(Y).
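The model output can be checked by hand: for this intercept-only model the maximum likelihood estimate of the rate is simply D/Y, and a Wald interval on the log scale uses the standard error \(1/\sqrt{D}\):

```r
# Hand calculation of the rate and its 95% CI (log scale).
D    <- 2648
Y    <- 300000
rate <- D / Y * 1000            # deaths per 1000 person-years
ef   <- exp(1.96 / sqrt(D))     # "error factor" for the Wald interval
round(c(rate = rate, lo = rate / ef, hi = rate * ef), 3)
# rate = 8.827, CI 8.497 to 9.169, matching the ci.pred output above
```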

Basic data structure to estimate incidence rates (more on block 4!!)

Usually data are not aggregated but instead come from individual subjects that are observed for a period of time (follow-up). The simplest example is following persons for the occurrence of death. The basic requirements for recordings in a follow-up study are:

  - date of entry to the study (the first date a person is known to be alive and at risk of having the event)
  - date of exit from the study (the last date a person is known to be alive and at risk of having the event, or the date of death)
  - the status of the person at the exit date: with or without the event.

It is assumed that the event terminates the follow-up, so we are implicitly referring to events like death or diagnosis of a chronic disease, where a person ceases to be at risk once the event occurs. More complex data structures and methods allow for recurrent events or competing risks (some basic ideas will be described at the end of block 4).
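As a minimal illustration (with hypothetical data), these three recordings are enough to compute person-time at risk and a crude rate:

```r
# Hypothetical follow-up records: entry and exit dates as fractional years,
# status 1 = died at exit, 0 = alive at exit (censored).
d <- data.frame(id     = 1:3,
                doe    = c(2001.5, 2002.0, 2003.25),
                dox    = c(2009.997, 2005.5, 2009.997),
                status = c(0, 1, 0))
d$Y <- d$dox - d$doe                 # individual person-years at risk
sum(d$status) / sum(d$Y) * 1000      # crude rate per 1000 person-years
```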

In the Epi package there is the dataset DMlate:

data(DMlate)
str(DMlate)
## 'data.frame':    10000 obs. of  7 variables:
##  $ sex  : Factor w/ 2 levels "M","F": 2 1 2 2 1 2 1 1 2 1 ...
##  $ dobth: num  1940 1939 1918 1965 1933 ...
##  $ dodm : num  1999 2003 2005 2009 2009 ...
##  $ dodth: num  NA NA NA NA NA ...
##  $ dooad: num  NA 2007 NA NA NA ...
##  $ doins: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ dox  : num  2010 2010 2010 2010 2010 ...
head(DMlate)
##        sex    dobth     dodm    dodth    dooad doins      dox
## 50185    F 1940.256 1998.917       NA       NA    NA 2009.997
## 307563   M 1939.218 2003.309       NA 2007.446    NA 2009.997
## 294104   F 1918.301 2004.552       NA       NA    NA 2009.997
## 336439   F 1965.225 2009.261       NA       NA    NA 2009.997
## 245651   M 1932.877 2008.653       NA       NA    NA 2009.997
## 216824   F 1927.870 2007.886 2009.923       NA    NA 2009.923

This dataset represents the follow-up of a random sample of 10000 diabetes patients in Denmark, diagnosed after 1995 and followed until the end of 2009. Random noise has been added to all dates in order to make persons untraceable. Each line represents the follow-up of one person.

We have the following variables in the dataset: sex; dobth (date of birth); dodm (date of diagnosis of diabetes); dodth (date of death); dooad (date of first prescription of an oral antidiabetic drug); doins (date of first insulin prescription); dox (date of exit from follow-up).

All dates are given in fractions of years, so 1998.000 corresponds to 1 January 1998 and 1998.997 to 31 December 1998.

All dates are randomly perturbed by a small amount, so no real person has any of the combinations of the 6 dates in the dataset, but results derived from the data will be quite close to those that would be obtained if the entire “real” diabetes register were used. The overall number of deaths, follow-up time and mortality rates by sex can be shown using the stat.table function:

stat.table(index=sex,
           contents=list(D=sum(!is.na(dodth)),
                         Y=sum(dox-dodm),
                         rate=ratio(!is.na(dodth), dox-dodm, 1000)),
           margin=TRUE, data=DMlate)
##  --------------------------------- 
##  sex           D        Y    rate  
##  --------------------------------- 
##  M       1345.00 27614.21   48.71  
##  F       1158.00 26659.05   43.44  
##                                    
##  Total   2503.00 54273.27   46.12  
##  ---------------------------------

The overall mortality rate is 46.1 deaths per 1000 person-years. Note that in making this calculation we have already implicitly imposed a statistical model on the data, namely the model which assumes a constant mortality rate for all persons during the study period. The sex-specific entries come from a model that assumes that mortality rates differ only by sex. While this model is clearly wrong for almost any purpose, it serves as an essential building block in the derivation of more realistic models (see Block 4).
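The Total row can be reproduced by pooling the sex-specific entries (numbers taken from the stat.table output above):

```r
# Deaths and person-years by sex, from the stat.table output above.
D <- c(M = 1345, F = 1158)
Y <- c(M = 27614.21, F = 26659.05)
round(D / Y * 1000, 2)           # sex-specific rates: 48.71, 43.44
round(sum(D) / sum(Y) * 1000, 2) # overall rate: 46.12
```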

Example of standardized rates calculations

The following example does not require any specific R library; we will simply do the calculations “by hand”.

Age-specific data on the incidence of colon cancer in male and female populations of Finland during 1999 are given in the following table:

Let’s calculate the following summary measures:

  1. the crude incidence rate in both populations and their rate ratio (an anticipation of association measures): males vs. females;
  2. the age-standardized rates and their ratio using the male population as the standard, i.e. direct standardization with an internal standard;
  3. the age-standardized rates and their ratio using the World Standard Population, i.e. direct standardization with an external standard.

We will compare and comment on the results obtained in items 1 to 3.
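All the standardized rates computed below are weighted averages of the age-specific rates, with weights taken from the chosen standard population:

\[\text{rate}_{std} = \sum_i w_i \, \frac{D_i}{Y_i}, \qquad \sum_i w_i = 1\]

where \(D_i\) and \(Y_i\) are the cases and person-years in age group \(i\). The crude rate is the special case in which the weights are proportional to the study population’s own person-years.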

World Standard Population: the weights for the 5-year age groups are entered as the WSP vector below.

Let’s do it:

First of all, we create the data matrix:

# Rows: four age groups (0-34, 35-54, 55-74, 75+ years).
# Columns: cases (ca), person-years in thousands (py), percent of the
# population (p) and rate per 100,000 person-years (r), for males (m)
# and females (f); rr is the male/female age-specific rate ratio.
M <- matrix( c(10,1157,46.0,0.9,22,1109,41.9,2.0,0.44,
               76,809,32.0,9.4,68,786,29.7,8.6,1.09,
               305,455,18.0,67,288,524,19.8,55,1.22,
               201,102,4.0,196,354,229,8.6,155,1.27), nrow=4, byrow=TRUE )
M <- data.frame(M)
names(M) <- c("mca","mpy","mp","mr",
              "fca","fpy","fp","fr","rr")
M
##   mca  mpy mp    mr fca  fpy   fp    fr   rr
## 1  10 1157 46   0.9  22 1109 41.9   2.0 0.44
## 2  76  809 32   9.4  68  786 29.7   8.6 1.09
## 3 305  455 18  67.0 288  524 19.8  55.0 1.22
## 4 201  102  4 196.0 354  229  8.6 155.0 1.27

Then we compute the crude incidence rates (per 100,000 person-years, since the person-years in the table are recorded in thousands):

rates <- with( M, c( sum(mca)/sum(mpy)*100,
            sum(fca)/sum(fpy)*100 ) )
rates[3] <- rates[1]/rates[2]
names(rates) <- c("M rate","F rate","M/F RR")
round( rates, 2 )
## M rate F rate M/F RR 
##  23.46  27.64   0.85

Now we compute the age-standardized rates and their ratio using the male population as the standard:

wm <- with( M, mpy/sum(mpy) )
wm
## [1] 0.45858105 0.32065002 0.18034086 0.04042806
rates <-with( M, c( sum(mca/mpy*wm)*100,
            sum(fca/fpy*wm)*100 ) )
rates[3] <- rates[1]/rates[2]
names(rates) <- c("M rate","F rate","M/F RR")
round( rates, 2 )
## M rate F rate M/F RR 
##  23.46  19.85   1.18

And now we compute the age-standardized rates and their ratio using the World Standard Population:

WSP <- c(120,100,90,90,80,80,60,60,60,60,50,40,40,30,20,10,5,3,2)
WSP
##  [1] 120 100  90  90  80  80  60  60  60  60  50  40  40  30  20  10   5   3   2
wt    <- sum(WSP[1:7])   # ages 0-34
wt[2] <- sum(WSP[8:11])  # ages 35-54
wt[3] <- sum(WSP[12:15]) # ages 55-74
wt[4] <- sum(WSP[16:19]) # ages 75+
wt <- wt/sum(wt)
wt
## [1] 0.62 0.23 0.13 0.02
rates <- with( M, c( sum(mca/mpy*wt)*100,
            sum(fca/fpy*wt)*100 ) )
rates[3] <- rates[1]/rates[2]
names(rates) <- c("M rate","F rate","M/F RR")
round( rates, 2 )
## M rate F rate M/F RR 
##  15.35  13.46   1.14

We can notice that when using the crude rates from the Finnish population, males seem to have a lower risk of colon cancer than females. But this result is driven by the confounding effect of age, since the female population is older than the male one. When the rates are adjusted for the differences in age structure, the relative risk for males is instead higher than for females.
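This is confirmed by the stratum-specific rate ratios (the rr column of the table): males have higher rates in every age band except the youngest.

```r
# Age-specific male/female rate ratios, recomputed from the counts and
# person-years (in thousands) in the Finnish table above.
mca <- c(10, 76, 305, 201);  mpy <- c(1157, 809, 455, 102)
fca <- c(22, 68, 288, 354);  fpy <- c(1109, 786, 524, 229)
round((mca / mpy) / (fca / fpy), 2)  # 0.44 1.09 1.22 1.27
```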