Epidemiology is the study of the frequency, distribution and determinants of health-related states in populations and the application of such knowledge to control health problems (Centers for Disease Control and Prevention 2006).
This introduction provides examples of how R can be used for initial descriptive epidemiological analyses, that is, to describe how the frequency of disease varies across populations, places and times. Note that we will not discuss specific methods for spatial epidemiology in this course: that is an advanced and more specialized topic.
Load the libraries that we will use in this session:
library(epiR)
library(Epi)
library(popEpi)
library(ggplot2)
library(scales)
library(lubridate)
Descriptions of disease frequency involve reporting either the prevalence or the incidence of disease.
Recall the definitions: strictly speaking, prevalence equals the number of cases of a given disease or attribute that exist in a population at a specified point in time. Prevalence is therefore the proportion of a population that has a specific disease or attribute at a specified point in time.
Two types of prevalence are reported in the literature: (1) point prevalence: the proportion of a population in a diseased state at a single point in time; (2) period prevalence: the proportion of a population with a given disease or condition over a specific period of time (i.e. the number of existing cases at the start of a follow-up period plus the number of incident cases that occur during the follow-up period).
Incidence provides a measure of how frequently susceptible (at risk) individuals become disease cases as they are observed over time. An incident case occurs when an individual changes from being susceptible (at risk) to being diseased. The count of incident cases is the number of such events that occur in a population during a defined follow-up period. There are two ways to express incidence:
Cumulative incidence is the proportion of initially susceptible individuals in a population who become new cases during a defined follow-up period.
Incidence rate (also known as incidence density) is the number of new cases of disease that occur per unit of individual time at risk during a defined follow-up period.
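For example, suppose 20 of 400 initially susceptible subjects become cases during a follow-up that accumulates 1500 person-years (made-up numbers, just to fix ideas); the two measures would be computed as:
cases <- 20; at.risk <- 400; pyears <- 1500
cases/at.risk        # cumulative incidence: 0.05, i.e. 5 cases per 100 subjects
1000*cases/pyears    # incidence rate: 13.3 cases per 1000 person-years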
In addition to reporting the point estimate of disease frequency, it is important to provide an indication of the uncertainty around that point estimate. The epi.conf function in the epiR package allows you to obtain confidence intervals for prevalence, cumulative incidence and incidence rates. Indeed, the added value of these functions lies in the confidence interval calculations more than in the simple prevalence and incidence formulas.
Let’s say we’re interested in the prevalence of disease X in a population comprised of 1000 individuals. Two hundred are tested and four returned a positive result. Assuming 100% test sensitivity and specificity (i.e. the accuracy of the diagnostic tool; see later in this block for a detailed explanation), what is the estimated prevalence of disease X in this population?
ncas <- 4; npop <- 200
tmp <- as.matrix(cbind(ncas, npop))
epi.conf(tmp, ctype = "prevalence", method = "exact", N = 1000, design = 1, conf.level = 0.95) * 100
## est lower upper
## 1 2 0.5475566 5.041361
The estimated prevalence of disease X in this population is 2.0 (95% confidence interval [CI] 0.55 - 5.0) cases per 100 individuals.
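As a quick cross-check, the same exact (Clopper-Pearson) limits can be obtained with base R’s binom.test, which should essentially reproduce the method = "exact" output above (epi.conf additionally handles the design and finite-population adjustments):
round(binom.test(ncas, npop)$conf.int*100, 2)   # expected: about 0.55 to 5.04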
Of note: the design effect (option design) is used to adjust the confidence interval around a prevalence or incidence risk estimate in the presence of clustering, i.e. when the sampling mechanism selects groups of subjects rather than single individuals. The design effect is a measure of the variability between clusters and is calculated as the ratio of the variance calculated assuming a complex sample design to the variance calculated assuming simple random sampling.
As we have seen, the prevalence of a disease in a population is the fraction of the population that has the disease at a given time. The total number of persons in Country X alive with colon cancer on 1 January 2023 is 16281, and the number of persons in Country X on 1 January 2023 is 5534738, so the point prevalence of colon cancer is:
16281/5534738
## [1] 0.002941603
that is, 0.29%.
This is the empirical prevalence of colon cancer patients in Country X on 1 January 2023.
The theoretical prevalence is the probability that a randomly chosen person from the population has colon cancer.
This is an unknown quantity, but a reasonable estimate of it would be the empirical prevalence.
That is, however, just an estimate from a statistical model that asserts that colon cancer is randomly distributed among the population with a probability p, and there is an uncertainty associated with the estimate, which we can quantify in a confidence interval. This is an interval that captures the true prevalence with some pre-specified probability (usually set at 95%).
The data used to estimate p are X = number of persons with colon cancer and N = number of persons in the population. In order to estimate p we can fit a binomial model:
X <- 16281
N <- 5534738
mp <- glm(cbind(X, N-X)~1, family=binomial)
round(ci.pred(mp)*100,4)
## Estimate 2.5% 97.5%
## 1 0.2942 0.2897 0.2987
We use the function glm in R, which fits generalized linear models, in this case a model for a binomial outcome. The observed outcome must be specified as the number of successes (cases of cancer) and the number of failures (non-diseased persons). The prevalence as a function of explanatory variables is described by a model (more on this in block 3!), here specified by ~1, which means that the prevalence is constant across all observational units (not surprisingly, since we only have one data point!). The ci.pred function produces a prediction of the theoretical prevalence from the model.
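The same interval can be reproduced almost exactly by hand with the usual Wald formula, \(p \pm 1.96\sqrt{p(1-p)/N}\) (ci.pred works on the logit scale, so the limits differ only in the last decimals):
p <- X/N
se <- sqrt(p*(1 - p)/N)
round(100*c(est = p, lower = p - 1.96*se, upper = p + 1.96*se), 4)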
We will use a dataset of diabetes prevalence in Denmark in 1-year age classes by sex, measured on 1 January 2010. This dataset has 200 rows and the following 4 variables:
A: Numeric, age, 0-99
sex: Sex, a factor with levels M F
X: Number of diabetes patients
N: Population size
data(pr)
str(pr)
## 'data.frame': 200 obs. of 4 variables:
## $ A : num 0 0 10 10 11 11 12 12 13 13 ...
## $ sex: Factor w/ 2 levels "M","F": 2 1 2 1 2 1 2 1 2 1 ...
## $ X : num 1 0 84 83 106 70 104 111 133 120 ...
## $ N : num 30743 32435 32922 34294 32890 ...
head(pr)
## A sex X N
## 1 0 F 1 30743
## 2 0 M 0 32435
## 3 10 F 84 32922
## 4 10 M 83 34294
## 5 11 F 106 32890
## 6 11 M 70 34644
We would like to know the prevalence at different ages, separately by sex. Since these data cover the entire population of Denmark, the interpretation will not involve sampling variation (i.e. confidence intervals). We can compute the observed prevalence by age and sex as follows:
prt <- stat.table(index=list(A,sex), contents=ratio(X,N,100), data=pr)[1,,]
round(prt[70:75,],1)
## sex
## A M F
## 69 16.2 12.4
## 70 16.5 12.7
## 71 17.3 13.5
## 72 17.7 13.9
## 73 18.3 14.8
## 74 18.9 15.1
A few rows of the resulting array are printed, from age 69 to 74, just to show some results.
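The same table can also be obtained with base R’s tapply (a quick equivalent here, since each age-sex combination occupies exactly one row of pr):
prt2 <- with(pr, 100*tapply(X, list(A, sex), sum)/tapply(N, list(A, sex), sum))
round(prt2[70:75,], 1)   # same entries as above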
We can plot these entries as a function of age, with one curve for each sex:
matplot(0:99+0.5, prt, type="l", lwd=2, col=c(4,2),lty=1 )
abline(v=c(76,81)+0.5)
Ages between 76 and 81 correspond to the maximal prevalences for men and women. Since we plotted the prevalences against the midpoints of the age classes, we add 0.5 to the position of the vertical lines too.
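Since ggplot2 is already loaded, an equivalent plot can be sketched as follows (colours and labels are arbitrary choices):
ggplot(pr, aes(x = A + 0.5, y = 100*X/N, colour = sex)) +
  geom_line(linewidth = 1) +
  geom_vline(xintercept = c(76, 81) + 0.5, linetype = "dashed") +
  labs(x = "Age", y = "Prevalence of diabetes (%)")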
A study was conducted by Feychting, Osterlund, and Ahlbom (1998) to report the frequency of cancer among the blind. A total of 136 diagnoses of cancer were made from 22050 person-years at risk. What was the incidence rate of cancer in this population?
ncas <- 136; ntar <- 22050
tmp <- as.matrix(cbind(ncas, ntar))
epi.conf(tmp, ctype = "inc.rate", method = "exact", N = 1000, design = 1, conf.level = 0.95) * 1000
## est lower upper
## ncas 6.1678 5.174806 7.295817
The incidence rate of cancer in this population was 6.2 (95% CI 5.2 to 7.3) cases per 1000 person-years at risk.
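Here too base R offers a quick cross-check: poisson.test returns the exact rate estimate and its 95% CI given the event count and the person-time, and should essentially match the output above:
ptest <- poisson.test(ncas, T = ntar)
round(1000*c(est = unname(ptest$estimate), ptest$conf.int), 2)   # per 1000 person-years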
Mortality is typically reported as a number of people that have died in a population of a certain size. For example, 2648 persons died out of 150000 persons. This is, apparently, a mortality of about 17.7 per 1000 persons:
2648/150000
## [1] 0.01765333
Looking a bit closer at this, it is a bit illogical: the number of deaths must be recorded over some period of time, in this case presumably a year. The number of persons, on the other hand, refers to the population size at some specific date, presumably 1 January or 1 July of some year. So the two components, number of deaths and population size, refer to a period and to a specific time point, respectively. The number of persons at risk of dying will not vary much over a year, so as a first approximation we might assume that the population was constant over the period; thus, the 150000 above refers to the average population over the period. But if the 2648 deaths were accumulated over a period of two years, the mortality would be half of what we just computed, because each of the 150000 people had two years in which to encounter death. This leads to the concept of person-years, also called risk time, time at risk, exposed time or even just exposure. The person-years is the population size multiplied by the length of the observation interval. So, since the population size is 150000 and the observation period is 2 years, the number of person-years is 300000, and the mortality rate is:
2648/300 # 300000 person-years expressed in 1000s, so the result is per 1000 py
## [1] 8.826667
8.83 deaths per 1000 person-years. Note that the rate is a scaled quantity: it has a unit of measurement, in this example per 1000 py. We might equally well have given the mortality rate as 88.3 per 10000 person-years. This quantity is called the empirical mortality rate. The theoretical counterpart is the probability of dying in a small interval, relative to the length of the interval. This is usually denoted by the Greek letter \(\lambda\):
\[\lambda = P\{\textrm{death in } (t, t+dt) \mid \textrm{alive at } t\}/dt\]
When we say small interval, we are approaching the formal definition, i.e. the limit of the quantity as dt tends to 0. The vertical bar denotes conditional probability (given). Note that the theoretical rate is also a scaled quantity; it is measured in units of \(time^{-1}\) (deaths per person-time) because the denominator dt is measured in time units. As in the case of prevalence, there is an uncertainty associated with the estimate of the theoretical mortality rate, which again we can quantify in a confidence interval. The data used to estimate \(\lambda\) are D = number of deaths and Y = amount of follow-up time. In order to estimate it we can fit a Poisson model (other examples in block 3):
D <- 2648
Y <- 300000
mm <- glm(D~1,offset=log(Y), family=poisson)
round(ci.pred(mm,newdata=data.frame(Y=1000)),3)
## Estimate 2.5% 97.5%
## 1 8.827 8.497 9.169
The observed outcome is D, the number of deaths, and the amount of follow-up time (Y) is specified in the offset argument. The mortality rate is described through a model for the number of events. The model is specified by ~1, which means that the mortality rate is constant over all observed units; specifically, the number of deaths is proportional to the number of person-years supplied in the offset argument. The offset enters on the log scale because the Poisson model uses a log link, which linearizes the model. The result is an estimate of the mortality rate per 1000 person-years.
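For a constant-rate model these limits can also be reproduced by hand, using the standard error of the log rate, \(1/\sqrt{D}\), and an error factor on the rate scale:
rate <- 1000*D/Y                # deaths per 1000 person-years
ef <- exp(1.96/sqrt(D))         # error factor
round(c(est = rate, lower = rate/ef, upper = rate*ef), 3)   # matches ci.pred up to rounding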
The figure below shows the follow-up experience of the members of a small study cohort between 1 January 2004 and 30 June 2009, from entry into follow-up until death (black filled circles indicate death due to cancer and empty circles indicate death from other causes). Follow-up until the occurrence of cancer C is shown with a broken line. For those subjects contracting cancer C, follow-up after diagnosis is shown with a solid line.
Looking at the plot, we will calculate the values of the incidence rate of the disease and of various mortality measures:
What is the incidence rate (per 100 pyears) of cancer C during the period from 1 Jan 2004 to 31 Dec 2008?
What is the mortality rate from cancer C during the same period?
What is the mortality rate from all causes during the same period?
What is the prevalence of C on 30 September 2006, and on 31 December 2008?
Looking at the plot, the person-years from start of follow-up until cancer, death, or end of 2008, whichever comes first, would be summed as follows:
pt <- 2.5 + 3.5 + 1.5 + 3.0 + 4.5 + 0.5 +1.0 + 2.5 + 2.5 + 2.5 + 1.5 + 1.5
The total person-time is therefore pt = 27 years, and there are 5 new cases of cancer. Thus, the incidence rate is:
Cases <- 5
Irate <- 100*Cases/pt
Irate
## [1] 18.51852
For the mortality rates, person-time runs until death or censoring at the end of 2008: we add to pt the follow-up times after diagnosis of the five subjects contracting cancer:
pt2 <- pt + 2 + 1.5 + 1 + 0.5 + 0.5
pt2
## [1] 32.5
There was 1 death from cancer over 32.5 person-years, so the mortality rate from cancer is 3.1 per 100 person-years:
Dth.C <- 1
Mrate.C <- 100*Dth.C/pt2
round(Mrate.C, 1)
## [1] 3.1
To evaluate total mortality: 4 deaths over 32.5 person-years = 12.3 per 100 person-years:
Dth.all <- 4
Mrate.all <- 100*Dth.all/pt2
round( Mrate.all, 1 )
## [1] 12.3
The prevalences at the two time points are 1/7 = 14.3% and 3/5 = 60%:
N1 <- 7 ; N2 <- 5
D1 <- 1 ; D2 <- 3
P1 <- 100*D1/N1; P2 <- 100*D2/N2
round( c(P1, P2), 1)
## [1] 14.3 60.0
If we also want to add nice labels:
Prev <- c(D1, D2)/c(N1, N2)
names(Prev) <- c("30sep2006","31dec2008")
round( 100*Prev, 1 )
## 30sep2006 31dec2008
## 14.3 60.0
Comment: the simple formula for computing the incidence proportion or the mortality proportion for a given cause of death ignores the incidence of competing events, for instance death before getting cancer when assessing the incidence of cancer, and death from other causes when considering the cause-specific mortality proportion. When, however, the total mortality is estimated, there are no competing events. We will come back to these concepts in block 4.
The table below gives the size (in 1000s) of the male population in Finland aged 0-14 years (the age range of “childhood” in pediatrics) on 31 December of each year from 1991 to 2000.
The following numbers of cases describe the incidence of, and mortality from, acute leukaemia in this population for two calendar periods: the 5 years from 1993 to 1997, and the year 1999 only.
Person-years are obtained by the mid-population principle: for the period 1993-1997 we can take the mid-period population (i.e. the 1995 population) multiplied by 5 years, and for the year 1999 the average of the 1998 and 1999 populations is a reasonable choice. We shall form a vector containing the person-years of both periods, both expressed in 100,000s for later convenience:
py1 <- 5*(4.96)
py2 <- (4.85+4.81 )/2
py <- c(py1, py2)
names(py) <- c("1993-97","1999")
py
## 1993-97 1999
## 24.80 4.83
The incidence rates are 113/24.80 = 4.6 and 26/4.83 = 5.4, both per 100,000 py. Calculations are shortest when the pertinent quantities are arranged in vectors.
cases <- c(113, 26)
incid <- cases/py
round( incid, 2 )
## 1993-97 1999
## 4.56 5.38
The mortality rates are 0.9 and 0.6, respectively, both per 100,000 py:
dths <- c(22, 3)
mort <- dths/py
round( mort, 2 )
## 1993-97 1999
## 0.89 0.62
At first glance it looks as if the incidence has gone up while the mortality has gone down. However, the evidence is far too thin to draw any conclusions: a more formal assessment requires an appropriate statistical analysis of the error margin of the contrast between the observed rates of the two periods.
As the observed mortality-to-incidence ratios (M/I ratios) of leukaemia in these periods, 22/113 = 0.19 and 3/26 = 0.12, are quite low, this suggests that the great majority of children contracting leukaemia survived. However, a more appropriate assessment of the question requires proper survival analysis tools, based on following up those children with leukaemia over a sufficiently long period (we will see this later in block 4).
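Using the vectors defined above, the M/I ratios are obtained in one line:
round(dths/cases, 2)   # 0.19 for 1993-97, 0.12 for 1999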
The following example does not require any specific R library; we will simply do the calculations “by hand”.
Age-specific data on the incidence of colon cancer in male and female populations of Finland during 1999 are given in the following table:
Let’s calculate the following summary measures:
1. the crude incidence rates in males and females, and the male-to-female rate ratio;
2. the corresponding age-standardized rates and their ratio, using the male population as the standard;
3. the corresponding age-standardized rates and their ratio, using the World Standard Population.
We will then compare and comment on the results obtained in items 1 to 3.
World Standard Population:
Let’s do it:
First of all, we create the data matrix:
M <- matrix( c(10,1157,46.0,0.9,22,1109,41.9,2.0,0.44
,76,809,32.0,9.4,68,786,29.7,8.6,1.09
,305,455,18.0,67,288,524,19.8,55,1.22
,201,102,4.0,196,354,229,8.6,155,1.27), nrow=4, byrow=T )
M <- data.frame(M)
names(M) <- c("mca","mpy","mp","mr",
"fca","fpy","fp","fr","rr")
M
## mca mpy mp mr fca fpy fp fr rr
## 1 10 1157 46 0.9 22 1109 41.9 2.0 0.44
## 2 76 809 32 9.4 68 786 29.7 8.6 1.09
## 3 305 455 18 67.0 288 524 19.8 55.0 1.22
## 4 201 102 4 196.0 354 229 8.6 155.0 1.27
Then, we compute the crude incidence rates:
rates <- with( M, c( sum(mca)/sum(mpy)*100,
sum(fca)/sum(fpy)*100 ) )
rates[3] <- rates[1]/rates[2]
names(rates) <- c("M rate","F rate","M/F RR")
round( rates, 2 )
## M rate F rate M/F RR
## 23.46 27.64 0.85
Now we compute the age-standardized rates and their ratio using the male population as the standard:
wm <- with( M, mpy/sum(mpy) )
wm
## [1] 0.45858105 0.32065002 0.18034086 0.04042806
rates <-with( M, c( sum(mca/mpy*wm)*100,
sum(fca/fpy*wm)*100 ) )
rates[3] <- rates[1]/rates[2]
names(rates) <- c("M rate","F rate","M/F RR")
round( rates, 2 )
## M rate F rate M/F RR
## 23.46 19.85 1.18
And now we compute the age-standardized rates and their ratio using the World Standard Population:
WSP <- c(120,100,90,90,80,80,60,60,60,60,50,40,40,30,20,10,5,3,2)
WSP
## [1] 120 100 90 90 80 80 60 60 60 60 50 40 40 30 20 10 5 3 2
wt <- sum(WSP[1:7])
wt[2] <- sum(WSP[8:11])
wt[3] <- sum(WSP[12:15])
wt[4] <- sum(WSP[16:19])
wt <- wt/sum(wt)
wt
## [1] 0.62 0.23 0.13 0.02
rates <- with( M, c( sum(mca/mpy*wt)*100,
sum(fca/fpy*wt)*100 ) )
rates[3] <- rates[1]/rates[2]
names(rates) <- c("M rate","F rate","M/F RR")
round( rates, 2 )
## M rate F rate M/F RR
## 15.35 13.46 1.14
We can notice that, when using the original (crude) data from the Finnish population, males seem to have a lower risk of colon cancer than females. But this result is driven by the confounding effect of age, since the female population is older than the male population. When the rates are compared after adjusting for the differences in age structure, the relative risk of males is instead higher than that of females.
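To recap the mechanics, the direct standardization can be wrapped in a small helper function (a minimal sketch reusing the objects M and wt defined above):
std.rate <- function(cases, py, w) sum((w/sum(w))*cases/py)*100
asr <- with(M, c(M = std.rate(mca, mpy, wt), F = std.rate(fca, fpy, wt)))
round(c(asr, ratio = unname(asr["M"]/asr["F"])), 2)   # should reproduce 15.35, 13.46 and 1.14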
If you want to see other examples using specific R libraries check the following website: https://epirhandbook.com/en/standardised-rates.html
Standardized rates could be considered as adjusted summary estimators, controlling for example for the confounding related to different age structures in the populations under study. These estimators are weighted averages of stratum-specific estimators.
We have seen the method of (direct) standardization using an external standard population (leading, for example, to the comparative mortality ratio, CMR) and the (indirect) method based on the standardized morbidity ratio (standardized mortality ratio, SMR). These approaches are useful in simple descriptive analyses.
As we have seen, to compute the crude incidence rate we take the ratio of the total number of new cases to the total person-years; if we have data split into age groups, we can sum all the age-specific numbers of cases in the numerator and then divide by the sum of all the age-specific person-years.
In order to compute the age standardized rates (using the direct method) we apply instead the following formula:
\[ASR=\frac{\sum_{k=1}^{K} w_{k} \, r_{k}}{\sum_{k=1}^{K} w_{k}}\] where \(w_k\) are the weights of the standard population and \(r_k\) the age-specific rates. Load the dataset for the exercise: it contains the number of cases (D) and the age-specific incidence rates (in cases per 100,000 person-years) from the Danish Cancer Register for the period 1983-87, for colon cancer, rectum cancer and lung cancer, by sex.
library(here)
std.rates <- read.delim(here("datasets","std-rates.txt"))
head(std.rates)
## age sex typ D rate
## 1 0 M Colon 0 0.00
## 2 5 M Colon 2 0.25
## 3 10 M Colon 0 0.00
## 4 15 M Colon 3 0.30
## 5 20 M Colon 4 0.39
## 6 25 M Colon 13 1.36
std <- std.rates
Calculate the crude rates for each sex and site:
We are given that the number of male person-years is 2521177 and the number of female person-years is 2596061, so we just need the total number of cases to compute the crude rates; dividing by the person-years expressed in 100,000s gives rates per 100,000 py:
raw.rectum.m <- sum( subset( std, sex=="M" & typ=="Rectum" )$D ) / 25.21177
raw.colon.m <- sum( subset( std, sex=="M" & typ=="Colon" )$D ) / 25.21177
raw.lung.m <- sum( subset( std, sex=="M" & typ=="Lung" )$D ) / 25.21177
raw.rectum.f <- sum( subset( std, sex=="F" & typ=="Rectum" )$D ) / 25.96061
raw.colon.f <- sum( subset( std, sex=="F" & typ=="Colon" )$D ) / 25.96061
raw.lung.f <- sum( subset( std, sex=="F" & typ=="Lung" )$D ) / 25.96061
c(raw.colon.m,raw.colon.f)
## [1] 184.5170 221.3353
c(raw.lung.m,raw.lung.f)
## [1] 460.3009 192.0602
c(raw.rectum.m,raw.rectum.f)
## [1] 127.7975 96.0301
Calculate the standardized rates, standardized to the world standard population:
The standardized rates are just the observed age-specific rates multiplied by the defined weights and then summed over all ages. Therefore we first need to enter a vector of weights for the standardization (and make sure the weights sum to 1 by dividing by their sum):
wt <- c(120,100,90,90,80,80,60,60,60,60,50,40,40,30,20,10,5,3,2)
wt <- wt / sum(wt )
wt
## [1] 0.120 0.100 0.090 0.090 0.080 0.080 0.060 0.060 0.060 0.060 0.050 0.040
## [13] 0.040 0.030 0.020 0.010 0.005 0.003 0.002
Now we can compute the standardized rates:
std.colon.m <- sum( subset( std, sex=="M" & typ=="Colon" )$rate*wt)
std.lung.m <- sum( subset( std, sex=="M" & typ=="Lung" )$rate*wt )
std.rectum.m <- sum( subset( std, sex=="M" & typ=="Rectum" )$rate*wt )
std.rectum.f <- sum( subset( std, sex=="F" & typ=="Rectum" )$rate*wt )
std.colon.f <- sum( subset( std, sex=="F" & typ=="Colon" )$rate*wt )
std.lung.f <- sum( subset( std, sex=="F" & typ=="Lung" )$rate*wt )
c(std.colon.m,std.colon.f)
## [1] 22.06645 20.61557
c(std.rectum.m,std.rectum.f)
## [1] 15.88366 9.55478
c(std.lung.m,std.lung.f)
## [1] 58.51967 23.14566
There are facilities in R for doing these analyses in one go: first we compute the crude rates and to that end we need the person-years in the population as a vector, and then the total number of cases by sex and site:
Y <- c( 25.21177, 25.96061 )
names( Y ) <- c("M","F")
Y
## M F
## 25.21177 25.96061
D <- with( std, tapply( D, list(sex,typ), sum ) )
D
## Colon Lung Rectum
## F 5746 4986 2493
## M 4652 11605 3222
When we compute the rates we need the person-years in the same order as the rows of D; then we can divide the table by the person-years:
Y <- Y[2:1]
Y
## F M
## 25.96061 25.21177
round( D/Y, 1 )
## Colon Lung Rectum
## F 221.3 192.1 96.0
## M 184.5 460.3 127.8
Same for the standardized rates:
wst <- with( std, tapply( rate*wt, list(sex,typ), sum ) )
wst
## Colon Lung Rectum
## F 20.61557 23.14566 9.55478
## M 22.06645 58.51967 15.88366
Note that the rate ratios between men and women vary with the measure they are based on. The ratios based on standardized rates assume that the incidence rate ratio is the same throughout the age span, which it obviously is not. The more this assumption is violated, the larger the differences will be between the ratios based on the different approaches.
crude.rates <- round( D/Y, 1 )
round( crude.rates["M",]/crude.rates["F",], 2 )
## Colon Lung Rectum
## 0.83 2.40 1.33
round( wst["M",]/wst["F",], 2 )
## Colon Lung Rectum
## 1.07 2.53 1.66
Load the required library:
library(epiR)
We will use an example of data from the stress test used to detect the presence of coronary disease.
dat <- as.table(matrix(c(815,115,208,327), nrow = 2, byrow = TRUE))
colnames(dat) <- c("Dis+","Dis-")
rownames(dat) <- c("Test+","Test-")
dat
## Dis+ Dis-
## Test+ 815 115
## Test- 208 327
Let’s estimate the parameters of interest related to the performance of the stress test in detecting the disease. The following function computes true and apparent prevalence (apparent prevalence is here the number of subjects testing positive by a diagnostic test divided by the total number of subjects tested; true prevalence is the actual number of diseased subjects divided by the number of individuals in the population), sensitivity, specificity, positive and negative predictive values, and positive and negative likelihood ratios from count data provided in a 2 by 2 table. Exact binomial confidence limits are calculated for test sensitivity, specificity, and positive and negative predictive value. The positive likelihood ratio is calculated as \[LR+ = \frac{Sensitivity}{1-Specificity}\] It represents the probability of a person who has the disease testing positive divided by the probability of a person who does not have the disease testing positive. The negative likelihood ratio is calculated as \[LR- = \frac{1-Sensitivity}{Specificity}\] It represents the probability of a person who has the disease testing negative divided by the probability of a person who does not have the disease testing negative.
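To make these formulas concrete, sensitivity, specificity and the likelihood ratios can be computed by hand from the table dat defined above:
se <- dat["Test+","Dis+"]/sum(dat[,"Dis+"])   # sensitivity: 815/1023
sp <- dat["Test-","Dis-"]/sum(dat[,"Dis-"])   # specificity: 327/442
round(c(sens = se, spec = sp, LRpos = se/(1 - sp), LRneg = (1 - se)/sp), 2)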
rval <- epi.tests(dat, conf.level = 0.95)
print(rval)
## Outcome + Outcome - Total
## Test + 815 115 930
## Test - 208 327 535
## Total 1023 442 1465
##
## Point estimates and 95% CIs:
## --------------------------------------------------------------
## Apparent prevalence * 0.63 (0.61, 0.66)
## True prevalence * 0.70 (0.67, 0.72)
## Sensitivity * 0.80 (0.77, 0.82)
## Specificity * 0.74 (0.70, 0.78)
## Positive predictive value * 0.88 (0.85, 0.90)
## Negative predictive value * 0.61 (0.57, 0.65)
## Positive likelihood ratio 3.06 (2.61, 3.59)
## Negative likelihood ratio 0.27 (0.24, 0.31)
## False T+ proportion for true D- * 0.26 (0.22, 0.30)
## False T- proportion for true D+ * 0.20 (0.18, 0.23)
## False T+ proportion for T+ * 0.12 (0.10, 0.15)
## False T- proportion for T- * 0.39 (0.35, 0.43)
## Correctly classified proportion * 0.78 (0.76, 0.80)
## --------------------------------------------------------------
## * Exact CIs
Let’s load the libraries:
library(epiR)
library(tidyverse)
library(OptimalCutpoints)
The CK file reports the creatine kinase (CK, U/L) measurements in patients with unstable angina and acute myocardial infarction, two conditions that are clinically difficult to distinguish. A diagnostic test has been devised such that for CK values > 100 the patient is classified as a case of acute myocardial infarction, and for CK <= 100 as a case of unstable angina. Calculate the sensitivity, specificity, PPV and NPV of the diagnostic test. What changes when choosing CK = 80 as the threshold value?
load(here("datasets","CK.Rdata"))
Define the cutoff:
CK$ck.100 <- ifelse(CK$ck>100, 'CK > 100', 'CK <= 100')
Create the table:
pp <- table(CK$ck.100,CK$disease)
pp
##
## angina infarto
## CK <= 100 58 1
## CK > 100 34 26
Rearrange the table into the format required by epi.tests, with test positives in the first row and disease positives in the first column:
dat <- as.table(matrix(c(26,34,1,58), nrow = 2, byrow = TRUE))
colnames(dat) <- c("AMI+","AMI-")
rownames(dat) <- c("Test+","Test-")
dat
## AMI+ AMI-
## Test+ 26 34
## Test- 1 58
Calculate sensitivity, specificity, PPV, NPV of the diagnostic test:
rval <- epi.tests(dat, conf.level = 0.95)
print(rval)
## Outcome + Outcome - Total
## Test + 26 34 60
## Test - 1 58 59
## Total 27 92 119
##
## Point estimates and 95% CIs:
## --------------------------------------------------------------
## Apparent prevalence * 0.50 (0.41, 0.60)
## True prevalence * 0.23 (0.16, 0.31)
## Sensitivity * 0.96 (0.81, 1.00)
## Specificity * 0.63 (0.52, 0.73)
## Positive predictive value * 0.43 (0.31, 0.57)
## Negative predictive value * 0.98 (0.91, 1.00)
## Positive likelihood ratio 2.61 (1.98, 3.44)
## Negative likelihood ratio 0.06 (0.01, 0.40)
## False T+ proportion for true D- * 0.37 (0.27, 0.48)
## False T- proportion for true D+ * 0.04 (0.00, 0.19)
## False T+ proportion for T+ * 0.57 (0.43, 0.69)
## False T- proportion for T- * 0.02 (0.00, 0.09)
## Correctly classified proportion * 0.71 (0.62, 0.79)
## --------------------------------------------------------------
## * Exact CIs
Change the cutoff:
CK$ck.80 <- ifelse(CK$ck>80, 'CK > 80', 'CK <= 80')
Create the table:
pp <- table(CK$ck.80,CK$disease)
pp
##
## angina infarto
## CK <= 80 44 0
## CK > 80 48 27
dat <- as.table(matrix(c(27,48,0,44), nrow = 2, byrow = TRUE))
colnames(dat) <- c("AMI+","AMI-")
rownames(dat) <- c("Test+","Test-")
dat
## AMI+ AMI-
## Test+ 27 48
## Test- 0 44
rval <- epi.tests(dat, conf.level = 0.95)
print(rval)
## Outcome + Outcome - Total
## Test + 27 48 75
## Test - 0 44 44
## Total 27 92 119
##
## Point estimates and 95% CIs:
## --------------------------------------------------------------
## Apparent prevalence * 0.63 (0.54, 0.72)
## True prevalence * 0.23 (0.16, 0.31)
## Sensitivity * 1.00 (0.87, 1.00)
## Specificity * 0.48 (0.37, 0.58)
## Positive predictive value * 0.36 (0.25, 0.48)
## Negative predictive value * 1.00 (0.92, 1.00)
## Positive likelihood ratio 1.92 (1.58, 2.33)
## Negative likelihood ratio 0.00 (0.00, NaN)
## False T+ proportion for true D- * 0.52 (0.42, 0.63)
## False T- proportion for true D+ * 0.00 (0.00, 0.13)
## False T+ proportion for T+ * 0.64 (0.52, 0.75)
## False T- proportion for T- * 0.00 (0.00, 0.08)
## Correctly classified proportion * 0.60 (0.50, 0.69)
## --------------------------------------------------------------
## * Exact CIs
We can observe that sensitivity is improved but specificity is worse.
As part of a prostate cancer screening program, blood samples from 100,000 men were analyzed for a biomarker known to be related to the disease. In total 4000 tested positive but, fortunately, only 800 of these cases were then confirmed by biopsy (a much more reliable examination, but also long and expensive). Of the 96,000 negative screening tests, 100 individuals developed the tumor within 12 months and are therefore to be considered false negatives. Calculate and comment on the sensitivity, specificity, PPV and NPV of the diagnostic test.
Let’s organize the data in a 2x2 table:
dat <- as.table(matrix(c(800,3200,100,95900), nrow = 2, byrow = TRUE))
colnames(dat) <- c("Disease+","Disease-")
rownames(dat) <- c("Test+","Test-")
dat
## Disease+ Disease-
## Test+ 800 3200
## Test- 100 95900
Now let’s calculate the quantities of interest:
rval <- epi.tests(dat, conf.level = 0.95)
print(rval)
## Outcome + Outcome - Total
## Test + 800 3200 4000
## Test - 100 95900 96000
## Total 900 99100 100000
##
## Point estimates and 95% CIs:
## --------------------------------------------------------------
## Apparent prevalence * 0.04 (0.04, 0.04)
## True prevalence * 0.01 (0.01, 0.01)
## Sensitivity * 0.89 (0.87, 0.91)
## Specificity * 0.97 (0.97, 0.97)
## Positive predictive value * 0.20 (0.19, 0.21)
## Negative predictive value * 1.00 (1.00, 1.00)
## Positive likelihood ratio 27.53 (26.42, 28.68)
## Negative likelihood ratio 0.11 (0.10, 0.14)
## False T+ proportion for true D- * 0.03 (0.03, 0.03)
## False T- proportion for true D+ * 0.11 (0.09, 0.13)
## False T+ proportion for T+ * 0.80 (0.79, 0.81)
## False T- proportion for T- * 0.00 (0.00, 0.00)
## Correctly classified proportion * 0.97 (0.97, 0.97)
## --------------------------------------------------------------
## * Exact CIs
Sensitivity is 89%: 89% of people with the disease are positive to the test. Taking into account the severity of the disease, this value is not satisfactory. Specificity is 97%: 97% of healthy subjects were negative to the test. The positive predictive value (PPV) is 20%, meaning that only 20% of subjects positive to the test had the disease. This is a very low value, but if the severity of the disease is considered high, it could be reasonable to afford the costs of so many negative biopsies.
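The low PPV is a direct consequence of the low prevalence: by Bayes’ theorem, \(PPV = Se \cdot p / (Se \cdot p + (1-Sp)(1-p))\), where p is the prevalence. A quick sketch of how the PPV of this same test would behave in populations with higher prevalence (Se and Sp taken from the table above):
se <- 800/900; sp <- 95900/99100
ppv <- function(p) se*p/(se*p + (1 - sp)*(1 - p))
round(ppv(c(0.009, 0.05, 0.20)), 2)   # roughly 0.20, 0.59 and 0.87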
For this example we will use the R package pROC. It builds a ROC curve and returns a roc object, a list of class roc. This object can be printed, plotted, or passed to the functions auc, ci, smooth.roc and coords. Additionally, two roc objects can be compared with the function roc.test. Data can be provided as response, predictor, where the predictor is the numeric (or ordered) level of the evaluated risk factor and the response encodes the observation class (control or case). The levels argument specifies which response level must be taken as controls (first value of levels) or cases (second). It can safely be ignored when the response is encoded as 0 and 1, but it will frequently fail otherwise. By default, the first two values of levels(as.factor(response)) are taken, and the remaining levels are ignored. This means that if your response is coded “control” and “case”, the levels will be inverted.
library(pROC)
This dataset summarizes several clinical variables and one laboratory variable for 113 patients with an aneurysmal subarachnoid hemorrhage (aSAH). The purpose of the case presented here is to identify patients at risk of poor post-aSAH outcome, as they require specific healthcare management; therefore the clinical test must be highly specific. Detailed results of the study are reported in [1]. We only outline the features relevant to the ROC analysis.
data(aSAH)
head(aSAH)
## gos6 outcome gender age wfns s100b ndka
## 29 5 Good Female 42 1 0.13 3.01
## 30 5 Good Female 37 1 0.14 8.54
## 31 5 Good Female 42 1 0.10 8.09
## 32 5 Good Female 27 1 0.04 10.42
## 33 1 Poor Female 42 3 0.13 17.40
## 34 1 Poor Male 48 2 0.10 12.75
with(aSAH, table(gender, outcome))
## outcome
## gender Good Poor
## Male 22 20
## Female 50 21
Let’s estimate the ROC curve for the parameter s100b, a biomarker:
roc(aSAH$outcome, aSAH$s100b,levels=c("Good", "Poor"))
##
## Call:
## roc.default(response = aSAH$outcome, predictor = aSAH$s100b, levels = c("Good", "Poor"))
##
## Data: aSAH$s100b in 72 controls (aSAH$outcome Good) < 41 cases (aSAH$outcome Poor).
## Area under the curve: 0.7314
Inspect the ROC plot and report the 95% CI:
roc(aSAH$outcome, aSAH$s100b,percent=TRUE, plot=TRUE, ci=TRUE)
##
## Call:
## roc.default(response = aSAH$outcome, predictor = aSAH$s100b, percent = TRUE, ci = TRUE, plot = TRUE)
##
## Data: aSAH$s100b in 72 controls (aSAH$outcome Good) < 41 cases (aSAH$outcome Poor).
## Area under the curve: 73.14%
## 95% CI: 63.01%-83.26% (DeLong)
We can also inspect the confidence intervals around our estimates, point by point:
rocobj <- plot.roc(aSAH$outcome, aSAH$s100b)
ci.sp.obj <- ci.sp(rocobj, sensitivities=seq(0, 1, .01), boot.n=100)
plot(rocobj)
plot(ci.sp.obj, type="shape", col="blue")
We can also compare the performance of two different biomarkers:
roc.test(response = aSAH$outcome, predictor1= aSAH$wfns, predictor2 = aSAH$s100)
##
## DeLong's test for two correlated ROC curves
##
## data: aSAH$wfns and aSAH$s100 by aSAH$outcome (Good, Poor)
## Z = 2.209, p-value = 0.02718
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
## 0.01040618 0.17421442
## sample estimates:
## AUC of roc1 AUC of roc2
## 0.8236789 0.7313686
And plot the two ROC curves in one graph:
roc.s100b <- roc(aSAH$outcome, aSAH$s100b)
roc.wfns <- roc(aSAH$outcome, aSAH$wfns)
plot(roc.s100b)
plot(roc.wfns, add=TRUE, col="red")
Load the required library:
library(OptimalCutpoints)
The package OptimalCutpoints calculates optimal cutpoints in diagnostic tests. Several methods or criteria for selecting optimal cutoffs have been implemented, including methods based on cost-benefit analysis, on diagnostic test accuracy measures (sensitivity/specificity, predictive values and diagnostic likelihood ratios), or on prevalence (Lopez-Raton et al. 2014). As a basic example, we can use the Youden Index approach:
perf.s100b <-optimal.cutpoints(X ="s100b", status ="outcome", tag.healthy ="Good",
methods = "Youden", data =aSAH, control = control.cutpoints(),
ci.fit = TRUE,
conf.level = 0.95, trace = FALSE)
perf.s100b
##
## Call:
## optimal.cutpoints.default(X = "s100b", status = "outcome", tag.healthy = "Good",
## methods = "Youden", data = aSAH, control = control.cutpoints(),
## ci.fit = TRUE, conf.level = 0.95, trace = FALSE)
##
## Optimal cutoffs:
## Youden
## 1 0.2200
##
## Area under the ROC curve (AUC): 0.731 (0.63, 0.833)
Therefore, 0.22 is the selected cut-off for the biomarker, the one that maximizes the sum of sensitivity and specificity. But always pay attention to the shape of the functional relationship between the biomarker and the risk of the outcome… more on this in block 3.
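As a cross-check, pROC’s coords function with the Youden criterion should select essentially the same cut-off from the roc.s100b object fitted earlier (note that pROC reports the midpoint between consecutive observed biomarker values, so the threshold may differ slightly from the 0.22 above):
coords(roc.s100b, x = "best", best.method = "youden")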
Let’s load the libraries:
library(tidyverse)
library(pROC)
library(MASS)
library(OptimalCutpoints)
library(epiR)
Load the data:
diabetes <- read.csv(here("datasets","diabetes.csv"))
glimpse(diabetes)
## Rows: 768
## Columns: 9
## $ Pregnancies <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, …
## $ Glucose <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125…
## $ BloodPressure <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74…
## $ SkinThickness <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, …
## $ Insulin <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, …
## $ BMI <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.…
## $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2…
## $ Age <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3…
## $ Outcome <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, …
We transform Outcome from an integer variable into a factor with categories “No” and “Yes”:
diabetes$Outcome = factor(diabetes$Outcome, labels=c("No", "Yes"))
table(diabetes$Outcome)
##
## No Yes
## 500 268
Now we estimate the ROC curve for the glucose biomarker:
roc_gluc = roc(response = diabetes$Outcome,
predictor = diabetes$Glucose)
## Setting levels: control = No, case = Yes
## Setting direction: controls < cases
auc(roc_gluc)
## Area under the curve: 0.7881
We plot the curve:
g <- ggroc(roc_gluc)
g + theme_minimal() + ggtitle("Glucose ROC curve") +
geom_segment(aes(x = 1, xend = 0, y = 0, yend = 1), color="grey", linetype="dashed")
And finally we try to define an optimal cut-off, adopting the Youden Index approach:
perf.s100b <-optimal.cutpoints(X ="Glucose", status ="Outcome", tag.healthy ="No",
methods = "Youden", data =diabetes, control = control.cutpoints(),
ci.fit = TRUE,
conf.level = 0.95, trace = FALSE)
perf.s100b
##
## Call:
## optimal.cutpoints.default(X = "Glucose", status = "Outcome",
## tag.healthy = "No", methods = "Youden", data = diabetes,
## control = control.cutpoints(), ci.fit = TRUE, conf.level = 0.95,
## trace = FALSE)
##
## Optimal cutoffs:
## Youden
## 1 124.0000
##
## Area under the ROC curve (AUC): 0.788 (0.755, 0.822)
According to the Youden Index approach, the optimal cutoff for Glucose is 124.
Let’s see the sensitivity, specificity, PPV and NPV of this cut-off:
diabetes$gluc.124 <- ifelse(diabetes$Glucose>=124, 'Gluc >= 124', 'Gluc < 124')
pp <- table(diabetes$gluc.124,diabetes$Outcome)
pp
##
## No Yes
## Gluc < 124 366 80
## Gluc >= 124 134 188
dat <- as.table(matrix(c(188,134,80,366), nrow = 2, byrow = TRUE))
colnames(dat) <- c("Diab+","Diab-")
rownames(dat) <- c("Gluc+","Gluc-")
dat
## Diab+ Diab-
## Gluc+ 188 134
## Gluc- 80 366
rval <- epi.tests(dat, conf.level = 0.95)
print(rval)
## Outcome + Outcome - Total
## Test + 188 134 322
## Test - 80 366 446
## Total 268 500 768
##
## Point estimates and 95% CIs:
## --------------------------------------------------------------
## Apparent prevalence * 0.42 (0.38, 0.46)
## True prevalence * 0.35 (0.32, 0.38)
## Sensitivity * 0.70 (0.64, 0.76)
## Specificity * 0.73 (0.69, 0.77)
## Positive predictive value * 0.58 (0.53, 0.64)
## Negative predictive value * 0.82 (0.78, 0.86)
## Positive likelihood ratio 2.62 (2.22, 3.09)
## Negative likelihood ratio 0.41 (0.34, 0.49)
## False T+ proportion for true D- * 0.27 (0.23, 0.31)
## False T- proportion for true D+ * 0.30 (0.24, 0.36)
## False T+ proportion for T+ * 0.42 (0.36, 0.47)
## False T- proportion for T- * 0.18 (0.14, 0.22)
## Correctly classified proportion * 0.72 (0.69, 0.75)
## --------------------------------------------------------------
## * Exact CIs
It has to be noted that although sensitivity and specificity are considered the fundamental operating characteristics of a diagnostic test, there are nevertheless times when such measures may not be so useful in clinical practice, since clinical staff do not have prior information about the patient’s true disease status when applying the diagnostic test in practice.
Indeed, the problem tends to be just the opposite, and involves the need to ascertain the probability of the patient being healthy (or diseased) in a case where the test result is negative (or positive). Hence, strategies for selecting the optimal cutpoint based on predictive values can be more useful in certain situations.
From an applied point of view, it is usual to seek elevated positive predictive values in any case where treating false positives may have serious consequences, be these psychological, physical or economic (e.g., chemotherapy in cancer or AIDS).
Taking the diabetes example, in view of the fact that (1) diabetes is potentially curable (at least, there is a treatment that can stabilize patients), and (2) a false negative could delay the diagnosis, with consequences for the patient, whereas a false positive does not have particular consequences (the first-line treatment is just a change of diet and lifestyle), one would seek elevated negative predictive values.
Therefore, since the NPV of the estimated cutoff is 82%, we could be quite satisfied also with the result obtained with the Youden Index approach.
Just as an example, one could use the “MinValueNPV” criterion (here the minimum acceptable value of NPV is set at 90%):
perf.s100bb <-optimal.cutpoints(X ="Glucose", status ="Outcome", tag.healthy ="No",
methods = "MinValueNPV", data =diabetes, control = control.cutpoints(valueNPV=0.90),
ci.fit = TRUE)
perf.s100bb
##
## Call:
## optimal.cutpoints.default(X = "Glucose", status = "Outcome",
## tag.healthy = "No", methods = "MinValueNPV", data = diabetes,
## control = control.cutpoints(valueNPV = 0.9), ci.fit = TRUE)
##
## Optimal cutoffs:
## MinValueNPV
## 1 102.0000
##
## Area under the ROC curve (AUC): 0.788 (0.755, 0.822)
Let’s evaluate this new cutoff:
diabetes$gluc.102 <- ifelse(diabetes$Glucose>=102, 'Gluc >= 102', 'Gluc < 102')
pp <- table(diabetes$gluc.102,diabetes$Outcome)
pp
##
## No Yes
## Gluc < 102 202 21
## Gluc >= 102 298 247
dat <- as.table(matrix(c(247,298,21,202), nrow = 2, byrow = TRUE))
colnames(dat) <- c("Diab+","Diab-")
rownames(dat) <- c("Gluc+","Gluc-")
dat
## Diab+ Diab-
## Gluc+ 247 298
## Gluc- 21 202
rval <- epi.tests(dat, conf.level = 0.95)
print(rval)
## Outcome + Outcome - Total
## Test + 247 298 545
## Test - 21 202 223
## Total 268 500 768
##
## Point estimates and 95% CIs:
## --------------------------------------------------------------
## Apparent prevalence * 0.71 (0.68, 0.74)
## True prevalence * 0.35 (0.32, 0.38)
## Sensitivity * 0.92 (0.88, 0.95)
## Specificity * 0.40 (0.36, 0.45)
## Positive predictive value * 0.45 (0.41, 0.50)
## Negative predictive value * 0.91 (0.86, 0.94)
## Positive likelihood ratio 1.55 (1.43, 1.68)
## Negative likelihood ratio 0.19 (0.13, 0.30)
## False T+ proportion for true D- * 0.60 (0.55, 0.64)
## False T- proportion for true D+ * 0.08 (0.05, 0.12)
## False T+ proportion for T+ * 0.55 (0.50, 0.59)
## False T- proportion for T- * 0.09 (0.06, 0.14)
## Correctly classified proportion * 0.58 (0.55, 0.62)
## --------------------------------------------------------------
## * Exact CIs
Thus, the cutpoint obtained would be 102, with a PPV = 0.45 and an NPV = 0.91, which satisfies the required minimum of 90%.
An important task in epidemiology is to quantify the strength of association between exposures and outcomes. In this context the term exposure is taken to mean a variable whose association with the outcome is to be estimated.
Exposures can be harmful, beneficial or both harmful and beneficial (e.g. if an immunisable disease is circulating, exposure to immunising agents helps most recipients but may harm those who experience adverse reactions). The term outcome is used to describe all the possible results that may arise from exposure to a factor or from preventive or therapeutic interventions.
In the following examples we will consider only binary exposure factors (in block 3 we will extend to exposures on a quantitative scale, through the use of the appropriate regression models).
RR (relative risk) is the ratio of the risk of getting the disease among the exposed to that among the non-exposed. It indicates how many times the risk would increase had the subject changed status from non-exposed to exposed. The increase is expressed in folds, which mathematically corresponds to a multiplicative model.
Load the R libraries that we will use in this session:
library(epiR)
library(epitools)
First of all, let’s construct the contingency table of our study: the exposure factor is type of behaviour and the disease of interest is the occurrence of a coronary heart problem.
data.type <- matrix(c(178,79,1141,1486), 2, 2)
dimnames(data.type) <- list("Behavior type" = c("Type A", "Type B"),
"Occurrence of CHD Event" = c("Yes", "No"))
data.type
## Occurrence of CHD Event
## Behavior type Yes No
## Type A 178 1141
## Type B 79 1486
The aim is to calculate the relative risk and the corresponding 95% CI. (We will come back to the 95% CI later in Block 2, when discussing sample size determination.) Here we use a function from the epiR library, in which the relative risk is called the incidence risk ratio. See the help of the function for all the details. Of note, the option method refers to the study design that generated the data under study. We will see the differences between cohort, case-control and cross-sectional studies in more detail in Block 2.
epi.2by2(data.type, method = "cohort.count")
## Outcome + Outcome - Total Inc risk *
## Exposed + 178 1141 1319 13.50 (11.70 to 15.46)
## Exposed - 79 1486 1565 5.05 (4.02 to 6.25)
## Total 257 2627 2884 8.91 (7.90 to 10.01)
##
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc risk ratio 2.67 (2.07, 3.45)
## Inc odds ratio 2.93 (2.23, 3.87)
## Attrib risk in the exposed * 8.45 (6.31, 10.59)
## Attrib fraction in the exposed (%) 62.59 (51.75, 71.00)
## Attrib risk in the population * 3.86 (2.36, 5.37)
## Attrib fraction in the population (%) 43.35 (32.37, 52.55)
## -------------------------------------------------------------------
## Uncorrected chi2 test that OR = 1: chi2(1) = 62.919 Pr>chi2 = <0.001
## Fisher exact test that OR = 1: Pr>chi2 = <0.001
## Wald confidence limits
## CI: confidence interval
## * Outcomes per 100 population units
Therefore, the risk of a CHD event for behavior type “A” is 2.67 times that for type “B”.
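The point estimate can be verified by hand from the table:
risk.A <- 178/1319; risk.B <- 79/1565
round(risk.A/risk.B, 2)   # relative risk: 2.67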
As we have already seen in a dedicated note, the odds is a quantity closely related to probability: if p is the probability, p/(1-p) is known as the odds. First of all, let’s build the contingency table here too: now the exposure factor is coffee drinking (one or more cups per day vs zero) and the disease of interest is pancreatic cancer.
data.coffee <- matrix(c(347,20,555,88), 2, 2)
dimnames(data.coffee) <- list("Coffee Drinking (cups per day)" = c(">=1", "0"),
"Pancreatic cancer" = c("cases", "controls"))
data.coffee
## Pancreatic cancer
## Coffee Drinking (cups per day) cases controls
## >=1 347 555
## 0 20 88
We want to estimate the odds ratio and the corresponding 95% CI. The function oddsratio calculates the odds ratio by median-unbiased estimation (mid-p), conditional maximum likelihood estimation (Fisher), unconditional maximum likelihood estimation (Wald), or small-sample adjustment (small). The reason for these various approaches lies in the computation of the confidence intervals, which are calculated using exact methods (mid-p and Fisher), normal approximation (Wald), or normal approximation with small-sample adjustment (small). If you are interested in these boring statistical details, see the help of the function, which also documents the default values.
oddsratio(data.coffee, method="wald")
## $data
## Pancreatic cancer
## Coffee Drinking (cups per day) cases controls Total
## >=1 347 555 902
## 0 20 88 108
## Total 367 643 1010
##
## $measure
## odds ratio with 95% C.I.
## Coffee Drinking (cups per day) estimate lower upper
## >=1 1.000000 NA NA
## 0 2.750991 1.662391 4.552449
##
## $p.value
## two-sided
## Coffee Drinking (cups per day) midp.exact fisher.exact chi.square
## >=1 NA NA NA
## 0 2.335156e-05 3.02823e-05 4.622573e-05
##
## $correction
## [1] FALSE
##
## attr(,"method")
## [1] "Unconditional MLE & normal approximation (Wald) CI"
The odds of the disease in the exposed is 2.75 times that in the unexposed.
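Again, the estimate and its Wald limits can be reproduced by hand; the standard error of the log odds ratio is \(\sqrt{1/a + 1/b + 1/c + 1/d}\), where a, b, c and d are the four cell counts:
or <- (347*88)/(555*20)
se.logor <- sqrt(1/347 + 1/555 + 1/20 + 1/88)
round(c(OR = or, lower = exp(log(or) - 1.96*se.logor), upper = exp(log(or) + 1.96*se.logor)), 2)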
In this library, attributable risk is the risk of disease in the exposed group minus the risk of disease in the unexposed group (in our slides we have defined this as the excess risk). It provides a measure of the absolute increase or decrease in risk associated with exposure: the risk difference indicates the amount of risk gained or lost had the subject changed from non-exposed to exposed. The increase is absolute, and corresponds mathematically to an additive model. As we have discussed, the risk difference has more public health implications than the relative risk: a high relative risk may not be of public health importance if the disease is very rare, whereas the risk difference measures the direct health burden and the need for health services. The library also reports a quantity called attributable risk in the population, the difference between the risk in the total population and the risk in the unexposed. Of note, in this library the attributable fraction in the population (%) is the fraction of all cases of D in the population that can be attributed to E (we call it simply the attributable risk), and the attributable fraction in the exposed (%) is the proportion of cases attributable to the exposure in the exposed population (i.e. considering only the population on which the exposure can act; we call it the attributable risk in the exposed).
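Using the CHD example above, all these quantities can be reproduced by hand from the counts in data.type (risks per 100 subjects, matching the epi.2by2 output):
r.e <- 178/1319; r.u <- 79/1565; r.t <- 257/2884   # risks: exposed, unexposed, total
round(c(AR.exposed = 100*(r.e - r.u),
        AF.exposed = 100*(r.e - r.u)/r.e,
        AR.pop     = 100*(r.t - r.u),
        AF.pop     = 100*(r.t - r.u)/r.t), 2)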
We now use as an example the low infant birth weight data presented by Hosmer and Lemeshow (2000) and available in the MASS package in R. The birthwt data frame has 189 rows and 10 columns. The data were collected at Baystate Medical Center, Springfield, Massachusetts USA during 1986.
library(MASS)
bwt <- birthwt
head(bwt)
## low age lwt race smoke ptl ht ui ftv bwt
## 85 0 19 182 2 0 0 0 1 0 2523
## 86 0 33 155 3 0 0 0 0 3 2551
## 87 0 20 105 1 1 0 0 0 1 2557
## 88 0 21 108 1 1 0 0 1 2 2594
## 89 0 18 107 1 1 0 0 1 0 2600
## 91 0 21 124 3 0 0 0 0 0 2622
Each row of this data set represents data for one mother. We’re interested in the association between smoke (the mother’s smoking status during pregnancy) and low (delivery of a baby less than 2.5 kg bodyweight). It is important that the table you present to the epi.2by2 function is in the correct format: disease positives in the first column, disease negatives in the second column, exposure positives in the first row and exposure negatives in the second row.
bwt$low <- factor(bwt$low, levels = c(1,0))
bwt$smoke <- factor(bwt$smoke, levels = c(1,0))
bwt$race <- factor(bwt$race, levels = c(1,2,3))
Generate the 2 by 2 table. Exposure (rows) = smoke, outcome (columns) = low:
low.tab <- table(bwt$smoke, bwt$low, dnn = c("Smoke", "Low BW"))
low.tab
## Low BW
## Smoke 1 0
## 1 30 44
## 0 29 86
Compute the measures of association for smoking and delivery of a low birth weight baby:
epi.2by2(dat = low.tab, method = "cohort.count", conf.level = 0.95,
units = 100, outcome = "as.columns")
## Outcome + Outcome - Total Inc risk *
## Exposed + 30 44 74 40.54 (29.27 to 52.59)
## Exposed - 29 86 115 25.22 (17.58 to 34.17)
## Total 59 130 189 31.22 (24.69 to 38.35)
##
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc risk ratio 1.61 (1.06, 2.44)
## Inc odds ratio 2.02 (1.08, 3.78)
## Attrib risk in the exposed * 15.32 (1.61, 29.04)
## Attrib fraction in the exposed (%) 37.80 (5.47, 59.07)
## Attrib risk in the population * 6.00 (-4.33, 16.33)
## Attrib fraction in the population (%) 19.22 (-0.21, 34.88)
## -------------------------------------------------------------------
## Uncorrected chi2 test that OR = 1: chi2(1) = 4.924 Pr>chi2 = 0.026
## Fisher exact test that OR = 1: Pr>chi2 = 0.036
## Wald confidence limits
## CI: confidence interval
## * Outcomes per 100 population units
The estimated excess risk is 15.3 per 100, i.e. we would expect about 15 additional cases of low birth weight per 100 women if all women smoked, compared with none smoking. The attributable risk in the population is estimated at 6 per 100, the excess risk in the total population that can be attributed to smoking. The attributable fraction in the population is 19%, i.e. the fraction of all cases of low birth weight that can be attributed to smoking (the naive interpretation is that low birth weight could be reduced by 19% if all mothers stopped smoking). Finally, the attributable fraction in the exposed is estimated at 38%, i.e. 38% of the low birth weight cases that occurred among women who smoked could be attributed to smoking (assuming causality).
Assuming that the risk of some disease is 1 in 100,000 among unexposed people, create a table of attributable fractions for exposures that increase the risk to 1 in 50,000, 1 in 10,000 and 1 in 1000.
risk.unexp <- 1/100000
risk.exp <- c((1/50000), (1/10000), (1/1000))
rr <- risk.exp/risk.unexp
af <- 1 - (1/rr)
af.table <- cbind(unexposed = risk.unexp, exposed = risk.exp, rr, attrib.fx = af)
af.table
## unexposed exposed rr attrib.fx
## [1,] 1e-05 2e-05 2 0.50
## [2,] 1e-05 1e-04 10 0.90
## [3,] 1e-05 1e-03 100 0.99
The attributable fraction in the exposed increases from 50% in the case of 1 in 50,000, to 90% in the case of 1 in 10,000, to 99% in the case of 1 in 1000.
Let’s say that the probability (prevalence) of being exposed, for the previous data, is 5%. Add a column to the matrix created above to evaluate the population attributable risk.
p.exp <- 5/100
af.pop <- round((p.exp*(rr-1))/(1+p.exp*(rr-1)), digits=2)
af.table2 <- cbind(af.table, af.pop)
af.table2
## unexposed exposed rr attrib.fx af.pop
## [1,] 1e-05 2e-05 2 0.50 0.05
## [2,] 1e-05 1e-04 10 0.90 0.31
## [3,] 1e-05 1e-03 100 0.99 0.83
Given this prevalence of exposure, the population attributable risk increases, as expected, from 5% when the RR is relatively low (2) to 83% when the RR is very high (100!).
In the Scottish Health Cohort Study, 85 of 1821 people who lived in rented apartments had coronary artery disease, compared with 77 of 2477 people who owned their homes. Create a 2-by-2 matrix object from these data. Calculate the risks, relative risks, odds and odds ratios for these data, and comment on the results.
Let’s create the data matrix:
rent <- c(85, 1821 - 85)
own <- c(77, 2477 - 77)
twoxtwo <- rbind(rent, own)
twoxtwo
## [,1] [,2]
## rent 85 1736
## own 77 2400
Assign names to rows and columns:
rownames(twoxtwo) <- c("Rent", "Own")
colnames(twoxtwo) <- c("CAD+", "CAD-")
twoxtwo
## CAD+ CAD-
## Rent 85 1736
## Own 77 2400
Apply the function:
epi.2by2(twoxtwo, method = "cohort.count")
## Outcome + Outcome - Total Inc risk *
## Exposed + 85 1736 1821 4.67 (3.75 to 5.74)
## Exposed - 77 2400 2477 3.11 (2.46 to 3.87)
## Total 162 4136 4298 3.77 (3.22 to 4.38)
##
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc risk ratio 1.50 (1.11, 2.03)
## Inc odds ratio 1.53 (1.11, 2.09)
## Attrib risk in the exposed * 1.56 (0.37, 2.74)
## Attrib fraction in the exposed (%) 33.40 (9.89, 50.78)
## Attrib risk in the population * 0.66 (-0.23, 1.55)
## Attrib fraction in the population (%) 17.53 (3.34, 29.63)
## -------------------------------------------------------------------
## Uncorrected chi2 test that OR = 1: chi2(1) = 7.034 Pr>chi2 = 0.008
## Fisher exact test that OR = 1: Pr>chi2 = 0.009
## Wald confidence limits
## CI: confidence interval
## * Outcomes per 100 population units
The risk of coronary artery disease for people living in rented apartments is 1.5 times that of people who own their homes. Of note: this is clearly not a causal relationship; the exposure is a proxy for socio-economic status, which is linked to lifestyle and access to health facilities. The odds ratio is estimated as 1.53, similar to the relative risk; this is expected, since the disease is quite rare in this population, both among the exposed and among the unexposed.
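As the exercise asks explicitly for the risks and odds, these can also be computed by hand:
risk <- c(rent = 85/1821, own = 77/2477)
odds <- risk/(1 - risk)
round(100*rbind(risk, odds), 2)   # per 100 subjects
round(c(RR = unname(risk["rent"]/risk["own"]),
        OR = unname(odds["rent"]/odds["own"])), 2)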
Let’s load the University Group Diabetes Program (UGDP) data. This was a placebo-controlled, multi-center randomized clinical trial (we will see the definition of this kind of study design in block 2) from the early 1960s, intended to establish the efficacy of treatments for type 2 diabetes. There were a total of 409 patients: 204 received tolbutamide and 205 received placebo. 30 of the 204 tolbutamide patients died, as did 21 of the placebo patients. There are three variables of interest: Status (Survivor or Death), Treatment (Placebo or Tolbutamide) and Age Group (older or younger than 55).
Calculate the relative risk and the odds ratio for the association of tolbutamide with death in the overall population, and comment on the results.
ugdp <- read.csv(here("datasets","ugdp.txt"), header = T, stringsAsFactors = F)
head(ugdp)
## Status Treatment Agegrp
## 1 Death Tolbutamide <55
## 2 Death Tolbutamide <55
## 3 Death Tolbutamide <55
## 4 Death Tolbutamide <55
## 5 Death Tolbutamide <55
## 6 Death Tolbutamide <55
str(ugdp)
## 'data.frame': 409 obs. of 3 variables:
## $ Status : chr "Death" "Death" "Death" "Death" ...
## $ Treatment: chr "Tolbutamide" "Tolbutamide" "Tolbutamide" "Tolbutamide" ...
## $ Agegrp : chr "<55" "<55" "<55" "<55" ...
table(ugdp$Treatment)
##
## Placebo Tolbutamide
## 205 204
table(ugdp$Agegrp)
##
## <55 55+
## 226 183
tab <- table(ugdp$Treatment, ugdp$Status)
tab <- tab[2:1, 2:1]  # Tolbutamide in the first row, Survivor (the outcome) in the first column
tab
##
## Survivor Death
## Tolbutamide 174 30
## Placebo 184 21
epi.2by2(tab, method = "cohort.count")
## Outcome + Outcome - Total Inc risk *
## Exposed + 174 30 204 85.29 (79.68 to 89.85)
## Exposed - 184 21 205 89.76 (84.77 to 93.55)
## Total 358 51 409 87.53 (83.93 to 90.57)
##
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc risk ratio 0.95 (0.88, 1.02)
## Inc odds ratio 0.66 (0.37, 1.20)
## Attrib risk in the exposed * -4.46 (-10.85, 1.93)
## Attrib fraction in the exposed (%) -5.23 (-13.24, 2.21)
## Attrib risk in the population * -2.23 (-7.47, 3.02)
## Attrib fraction in the population (%) -2.54 (-6.28, 1.06)
## -------------------------------------------------------------------
## Uncorrected chi2 test that OR = 1: chi2(1) = 1.865 Pr>chi2 = 0.172
## Fisher exact test that OR = 1: Pr>chi2 = 0.181
## Wald confidence limits
## CI: confidence interval
## * Outcomes per 100 population units
We can note that the estimated odds ratio is 0.66, with a 95% confidence interval that includes 1; this means that we cannot conclude from this study that tolbutamide has a significant effect on survival with respect to placebo. We can also note that the RR estimated from these data is very different from the OR, and this is related to the violation of the “rarity of the disease” assumption in this case (the outcome of interest, surviving, is not rare!).