Load the R libraries that we will use: note that in R it is possible to use different libraries for the same type of calculations; we will introduce here the survival library for the person years calculations. We will use a lot this package in Block 4.
library(Epi)
library(AER)
library(MASS)
library(survival)
We use a dataset of US traffic fatalities data for the “lower 48” US states (i.e., excluding Alaska and Hawaii), annually for 1982 through 1988.
data(Fatalities)
head(Fatalities)
## state year spirits unemp income emppop beertax baptist mormon drinkage
## 1 al 1982 1.37 14.4 10544.15 50.69204 1.539379 30.3557 0.32829 19.00
## 2 al 1983 1.36 13.7 10732.80 52.14703 1.788991 30.3336 0.34341 19.00
## 3 al 1984 1.32 11.1 11108.79 54.16809 1.714286 30.3115 0.35924 19.00
## 4 al 1985 1.28 8.9 11332.63 55.27114 1.652542 30.2895 0.37579 19.67
## 5 al 1986 1.23 9.8 11661.51 56.51450 1.609907 30.2674 0.39311 21.00
## 6 al 1987 1.18 7.8 11944.00 57.50988 1.560000 30.2453 0.41123 21.00
## dry youngdrivers miles breath jail service fatal nfatal sfatal
## 1 25.0063 0.211572 7233.887 no no no 839 146 99
## 2 22.9942 0.210768 7836.348 no no no 930 154 98
## 3 24.0426 0.211484 8262.990 no no no 932 165 94
## 4 23.6339 0.211140 8726.917 no no no 882 146 98
## 5 23.4647 0.213400 8952.854 no no no 1081 172 119
## 6 23.7924 0.215527 9166.302 no no no 1110 181 114
## fatal1517 nfatal1517 fatal1820 nfatal1820 fatal2124 nfatal2124 afatal
## 1 53 9 99 34 120 32 309.438
## 2 71 8 108 26 124 35 341.834
## 3 49 7 103 25 118 34 304.872
## 4 66 9 100 23 114 45 276.742
## 5 82 10 120 23 119 29 360.716
## 6 94 11 127 31 138 30 368.421
## pop pop1517 pop1820 pop2124 milestot unempus emppopus gsp
## 1 3942002 208999.6 221553.4 290000.1 28516 9.7 57.8 -0.02212476
## 2 3960008 202000.1 219125.5 290000.2 31032 9.6 57.9 0.04655825
## 3 3988992 197000.0 216724.1 288000.2 32961 7.5 59.5 0.06279784
## 4 4021008 194999.7 214349.0 284000.3 35091 7.2 60.1 0.02748997
## 5 4049994 203999.9 212000.0 263000.3 36259 7.0 60.7 0.03214295
## 6 4082999 204999.8 208998.5 258999.8 37426 6.2 61.5 0.04897637
This data frame contain 336 observations on 34 variables. Traffic fatalities are extracted from the US Department of Transportation Fatal Accident Reporting System. There are a lot of information, for example the beer tax is the tax on a case of beer, which is an available measure of state alcohol taxes more generally. The drinking age variable (drinkage) indicates whether the minimum legal drinking age is 18, 19, or 20. The two binary punishment variables (jail and service) describe the state’s minimum sentencing requirements for an initial drunk driving conviction.Total vehicle miles traveled annually by state was obtained from the Department of Transportation. Personal income was obtained from the US Bureau of Economic Analysis, and the unemployment rate was obtained from the US Bureau of Labor Statistics.
Variables of interest for us now are:
We can calculate the fatality cumulative incidence rate (number of traffic deaths per 10000 people living in that state in that year) as follows:
Fatalities$frate <- with(Fatalities, fatal/pop*10000)
head(Fatalities$frate)
## [1] 2.12836 2.34848 2.33643 2.19348 2.66914 2.71859
Then, we can calculate the cumulative incidence per state per year, and plot as time series.
table.deaths <- with(Fatalities,tapply(fatal, list(state,year), sum))
table.exp <- with(Fatalities,tapply(pop, list(state,year), sum))
inc.cum.year.state <-(table.deaths/table.exp)*10000
Plot just the first five states:
plot.ts(t(inc.cum.year.state[1:5,]), plot.type="single", xlab="years", ylab="Incidence", col=c(1:5), lty=c(1:5), ylim=c(1,4))
legend("topleft", col=c(1:5), lty=c(1:5), legend=c("al", "az", "ar", "ca", "co"),bty="n")
Note that here we are not showing the confidence intervals (assuming that these data come from random sampling…) around these estimates. We can use library epiR to do this :
library(epiR)
tmp <- cbind(Fatalities$fatal,Fatalities$pop)
pp <- epi.conf(tmp, ctype="inc.risk", method="exact")
inc.rates <- pp*10000
head(inc.rates)
## est lower upper
## 1 2.12836 1.986775 2.277370
## 2 2.34848 2.199966 2.504381
## 3 2.33643 2.188834 2.491359
## 4 2.19348 2.051104 2.343132
## 5 2.66914 2.512397 2.833098
## 6 2.71859 2.561012 2.883324
Loading and viewing now the dataset Melanoma:
data(Melanoma)
head(Melanoma)
## time status sex age year thickness ulcer
## 1 10 3 1 76 1972 6.76 1
## 2 30 3 1 56 1968 0.65 0
## 3 35 2 1 41 1977 1.34 0
## 4 99 3 0 71 1968 2.90 0
## 5 185 1 1 52 1965 12.08 1
## 6 204 1 1 28 1971 4.84 1
The Melanoma data frame has data on 205 patients in Denmark with malignant melanoma. This data frame contains the following columns:
time : survival time in days, possibly censored (we will define in detail characteristics of survival data later on block 4).
status: 1 died from melanoma, 2 alive, 3 dead from other causes.
sex: 1 = male, 0 = female.
age: age in years.
year: year of operation.
thickness: tumour thickness in mm.
ulcer: 1 = presence, 0 = absence.
Now we want to calculate the incidence rate of death for melanoma using the person time epidemiological approach. (We will ignore now death for other causes, that represents a competing risk in the survival jargon…).
mm.deaths <-sum(Melanoma$status==1)
mm.deaths
## [1] 57
per.time <- sum((Melanoma$time)/365.25)
per.time
## [1] 1208.279
mortality.rate <-mm.deaths/per.time
incidence.rate <- round(100*mortality.rate,1)
incidence.rate
## [1] 4.7
If you report the incidence rate of deaths as 4.7 per 100 person-years, epidemiologists might understand, but most others will not. Person-time is epidemiologic jargon.
To convert this jargon to something understandable, simply replace person-years with persons per year. Reporting the results as 4.7 deaths per 100 persons per year sounds like English rather than jargon.
It also conveys the sense of the incidence rate as a dynamic process, the speed at which new cases of disease occur in the population.
In the R library(survival) there is a specific function that computes the person-years of follow-up time contributed by a cohort of subjects, also stratified into subgroups.
py <- pyears(Surv(time, status==1) ~1, data=Melanoma)
py$pyears
## [1] 1208.279
We can evaluate for example if the incidence rate of deaths seems different among men and women (we will only examine it from a descriptive point of view, without making any inference):
py.sex <- pyears(Surv(time, status==1) ~sex, data=Melanoma)
py.sex
## Call:
## pyears(formula = Surv(time, status == 1) ~ sex, data = Melanoma)
##
## Total number of events: 57
## Total number of person-years tabulated: 1208.279
## Total number of person-years off table: 0
## Observations in the data set: 205
py.sex$pyears
## sex
## 0 1
## 787.4415 420.8378
Now let’s re-do the calculations stratified by sex:
index.males <- Melanoma$sex==1
mm.deaths.males <-sum(Melanoma[index.males,]$status==1)
mm.deaths.males
## [1] 29
index.females <- Melanoma$sex==0
mm.deaths.females <-sum(Melanoma[index.females,]$status==1)
mm.deaths.females
## [1] 28
mortality.rate.males <-mm.deaths.males/py.sex$pyears[2]
incidence.rate.males <- round(100*mortality.rate.males,1)
mortality.rate.females <-mm.deaths.females/py.sex$pyears[1]
incidence.rate.females <- round(100*mortality.rate.females,1)
incidence.rate.males
## 1
## 6.9
incidence.rate.females
## 0
## 3.6
So, for males we observe 6.9 deaths per 100 persons per year, instead for females 3.6 deaths per 100 persons per year.