1 Descriptive epidemiology

Epidemiology is the study of the frequency, distribution and determinants of health-related states in populations and the application of such knowledge to control health problems (Centers for Disease Control and Prevention 2006).

This introduction provides examples of how R can be used for initial descriptive epidemiological analyses, that is, to describe how the frequency of disease varies across populations, places and times. Note that we will not discuss specific methods for spatial epidemiology in this course, as that is an advanced and more specialized topic.

Load the libraries that we will use in this session:

library(epiR)
library(Epi)
library(popEpi)
library(ggplot2)
library(scales)
library(lubridate)

1.1 Prevalence and incidence in populations

Descriptions of disease frequency involve reporting either the prevalence or the incidence of disease.

Recall the definitions: strictly speaking, the number of prevalent cases is the count of cases of a given disease or attribute that exist in a population at a specified point in time. Prevalence is therefore the proportion of a population that has a specific disease or attribute at a specified point in time.

Two types of prevalence are reported in the literature: (1) point prevalence: the proportion of a population in a diseased state at a single point in time; (2) period prevalence: the proportion of a population with a given disease or condition over a specific period of time (i.e. the number of existing cases at the start of a follow-up period plus the number of incident cases that occur during the follow-up period).

Incidence provides a measure of how frequently susceptible (at risk) individuals become disease cases as they are observed over time. An incident case occurs when an individual changes from being susceptible (at risk) to being diseased. The count of incident cases is the number of such events that occur in a population during a defined follow-up period. There are two ways to express incidence:

Cumulative incidence is the proportion of initially susceptible individuals in a population who become new cases during a defined follow-up period.

Incidence rate (also known as incidence density) is the number of new cases of disease that occur per unit of individual time at risk during a defined follow-up period.
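
As a minimal numeric sketch (with made-up numbers, not from any dataset used below), the two measures can be computed directly:

# Hypothetical example: 20 new cases among 500 initially at-risk subjects
# followed for 2 years, contributing 900 person-years in total
ncas <- 20
ncas / 500   # cumulative incidence: 0.04 over the 2-year period
ncas / 900   # incidence rate: about 0.022 cases per person-year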

In addition to reporting the point estimate of disease frequency, it is important to provide an indication of the uncertainty around that point estimate. The epi.conf function in the epiR package allows you to obtain confidence intervals for prevalence, cumulative incidence and incidence rates. In fact, this function is useful more for the confidence interval calculations than for the simple prevalence and incidence formulas themselves.

1.1.1 Prevalence example

Let’s say we’re interested in the prevalence of disease X in a population of 1000 individuals. Two hundred are tested and four return a positive result. Assuming 100% test sensitivity and specificity (i.e. a perfectly accurate diagnostic tool; see later in this block for a detailed explanation), what is the estimated prevalence of disease X in this population?

ncas <- 4; npop <- 200
tmp <- as.matrix(cbind(ncas, npop))
epi.conf(tmp, ctype = "prevalence", method = "exact", N = 1000, design = 1, conf.level = 0.95) * 100
##   est     lower    upper
## 1   2 0.5475566 5.041361

The estimated prevalence of disease X in this population is 2.0 (95% confidence interval [CI] 0.55 - 5.0) cases per 100 individuals.

Of note: the design effect (option design) is used to adjust the confidence interval around a prevalence or incidence risk estimate in the presence of clustering, i.e. when sampling is carried out on groups rather than on individual subjects. The design effect is a measure of the variability between clusters and is calculated as the ratio of the variance obtained under the complex sample design to the variance obtained under simple random sampling.
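
A minimal sketch of its effect, reusing the counts from the example above with a purely hypothetical design effect of 2 (e.g. subjects sampled by household rather than individually):

# Same call as before but with design = 2: the point estimate is
# unchanged while the confidence interval widens
epi.conf(tmp, ctype = "prevalence", method = "exact", N = 1000, design = 2, conf.level = 0.95) * 100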

1.1.2 From data to estimates…

As we have seen, the prevalence of a disease in a population is the fraction of the population that has the disease at a given time. The total number of persons in Country X alive with colon cancer on 1 January 2023 is 16281, and the number of persons in Country X on 1 January 2023 is 5534738, so the point prevalence of colon cancer is:

16281/5534738
## [1] 0.002941603

that is, 0.29%.

This is the empirical prevalence of colon cancer patients in Country X on 1 January 2023.

The theoretical prevalence is the probability that a randomly chosen person from the population has colon cancer.

This is an unknown quantity, but a reasonable estimate of it would be the empirical prevalence.

That is, however, just an estimate from a statistical model that asserts that colon cancer is randomly distributed among the population with some probability p, and there is an uncertainty associated with the estimate, which we can quantify in a confidence interval. This is an interval that captures the true prevalence with some pre-specified probability (usually set at 95%).

The data used to estimate p are X = number of persons with colon cancer and N = number of persons in the population. In order to estimate p we can fit a binomial model:

X <- 16281
N <- 5534738
mp <- glm(cbind(X, N-X)~1, family=binomial)
round(ci.pred(mp)*100,4)
##   Estimate   2.5%  97.5%
## 1   0.2942 0.2897 0.2987

We can use the function glm in R, which fits generalized linear models, in this case a model for a binomial outcome. The observed outcome must be specified as the number of successes (cases of cancer) and the number of failures (non-diseased persons). The prevalence as a function of explanatory variables is described by a model (more on this in block 3!), here specified by ~1, which means that the prevalence is constant across all observational units (not surprising, since we only have one data point!). The function ci.pred produces a prediction of the theoretical prevalence from the model.
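
As a quick sanity check (a sketch, not part of the original analysis), the exact binomial confidence interval from base R should be very close to the model-based one:

# Exact binomial CI for the prevalence, expressed in percent
binom.test(X, N)$conf.int * 100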

1.1.3 Another example: prevalence of diabetes by age and sex

We will use the pr dataset of diabetes prevalence in Denmark, in 1-year age classes by sex, measured on 1 January 2010. The dataset has 200 rows and the following 4 variables:

A: Numeric, age, 0-99

sex: Sex, a factor with levels M F

X: Number of diabetes patients

N: Population size

data(pr)
str(pr)
## 'data.frame':    200 obs. of  4 variables:
##  $ A  : num  0 0 10 10 11 11 12 12 13 13 ...
##  $ sex: Factor w/ 2 levels "M","F": 2 1 2 1 2 1 2 1 2 1 ...
##  $ X  : num  1 0 84 83 106 70 104 111 133 120 ...
##  $ N  : num  30743 32435 32922 34294 32890 ...
head(pr)
##    A sex   X     N
## 1  0   F   1 30743
## 2  0   M   0 32435
## 3 10   F  84 32922
## 4 10   M  83 34294
## 5 11   F 106 32890
## 6 11   M  70 34644

We would like to know the prevalence at different ages, separately by sex. Since these data cover the entire population of Denmark, there is no sampling variation to account for, so the interpretation will not involve confidence intervals. We can compute the observed prevalence by age and sex as follows:

prt <- stat.table(index=list(A,sex), contents=ratio(X,N,100), data=pr)[1,,]
round(prt[70:75,],1)
##     sex
## A       M    F
##   69 16.2 12.4
##   70 16.5 12.7
##   71 17.3 13.5
##   72 17.7 13.9
##   73 18.3 14.8
##   74 18.9 15.1

A few rows of the resulting array are printed, from age 69 to 74, just to show some results.

We can plot these entries as a function of age, with one curve for each sex:

matplot(0:99+0.5, prt, type="l", lwd=2, col=c(4,2),lty=1 )
abline(v=c(76,81)+0.5)

Ages between 76 and 81 correspond to the maximal prevalences for men and women. Since we plotted the prevalences against the midpoints of the age classes, we add 0.5 to the position of the vertical lines too.
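
Since ggplot2 was loaded at the start of the session, a roughly equivalent plot can also be sketched with it (the styling below is an arbitrary choice, not part of the original figure):

# Prevalence (%) by age, one curve per sex, plotted at the class midpoints
ggplot(pr, aes(x = A + 0.5, y = X / N * 100, colour = sex)) +
  geom_line() +
  labs(x = "Age", y = "Prevalence of diabetes (%)")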

1.2 Incidence rate basic example

A study was conducted by Feychting, Osterlund, and Ahlbom (1998) to report the frequency of cancer among the blind. A total of 136 diagnoses of cancer were made from 22050 person-years at risk. What was the incidence rate of cancer in this population?

ncas <- 136; ntar <- 22050
tmp <- as.matrix(cbind(ncas, ntar))
epi.conf(tmp, ctype = "inc.rate", method = "exact", N = 1000, design = 1, conf.level = 0.95) * 1000
##         est    lower    upper
## ncas 6.1678 5.174806 7.295817

The incidence rate of cancer in this population was 6.2 (95% CI 5.2 to 7.3) cases per 1000 person-years at risk.

1.3 Mortality rate

Mortality is typically reported as the number of people that have died in a population of a certain size. For example, 2648 persons died out of 150000 persons. This is, apparently, a mortality of 17.6 per 1000 persons:

2648/150000
## [1] 0.01765333

Looking a bit closer at this, it is a bit illogical: the number of deaths must be recorded over some period of time, in this case presumably a year. The number of persons, on the other hand, refers to the population size at some specific date, presumably 1 January or 1 July of some year. So the two components, number of deaths and population size, refer to a period and to a specific time point, respectively. The number of persons at risk of dying will not vary a lot over a year, so as a first approximation we might assume that the population was constant over the period. Thus, the 150000 above refers to the average population over the period. But if the 2648 deaths were accumulated over a period of two years, the mortality would be half of what we just computed, because the 150000 people each had two years to encounter death.

This leads to the concept of person-years, also called risk time, time at risk, exposed time or even just exposure. The person-years is the population size multiplied by the length of the observation interval. So, since the population size is 150000 and the observation period is 2 years, the number of person-years is 300000 and the mortality rate is:

2648/300000*1000
## [1] 8.826667

8.83 deaths per 1000 person-years. Note that the rate is a scaled quantity: it has a unit of measurement, in this example per 1000 person-years. We might equally well have given the mortality rate as 88.3 per 10000 person-years. This quantity is called the empirical mortality rate. The theoretical counterpart is the probability of dying in a small interval, relative to the length of the interval. This is usually denoted by the Greek letter \(\lambda\):

\[\lambda = P\{\text{death in } (t, t+dt) \mid \text{alive at } t\}/dt\]

When we say a small interval, we are approaching the formal definition as the limit of the quantity as dt tends to 0. The vertical bar denotes conditional probability (given). Note that the theoretical rate is also a scaled quantity; it is measured in units of \(time^{-1}\) (deaths per person-time), because the denominator dt is measured in time units. As in the case of prevalence, there is an uncertainty associated with the estimate of the theoretical mortality rate, which again we can quantify in a confidence interval. The data used to estimate \(\lambda\) are D = number of deaths and Y = amount of follow-up time. In order to estimate it we can fit a Poisson model (other examples in block 3):

D <- 2648
Y <- 300000
mm <- glm(D~1,offset=log(Y), family=poisson)
round(ci.pred(mm,newdata=data.frame(Y=1000)),3)
##   Estimate  2.5% 97.5%
## 1    8.827 8.497 9.169

The observed outcome is D, the number of deaths, and the amount of follow-up time (Y) is specified in the offset argument. The mortality rate is described through a model for the number of events. The model is specified by ~1, which means that the mortality rate is constant across all observed units; specifically, the number of deaths is proportional to the number of person-years supplied in the offset argument. The log of Y is used because the Poisson model is linear on the log scale (log link). The result is an estimate of the mortality rate per 1000 person-years.
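
As a cross-check (a sketch using base R, not part of the original analysis), an exact Poisson test gives a very similar interval:

# Exact CI for the rate, rescaled to deaths per 1000 person-years
poisson.test(D, T = Y)$conf.int * 1000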

1.4 Exercises

1.4.1 Exercise (a)

The figure below shows the follow-up experience of members of a small study cohort between 1 January 2004 and 30 June 2009, from entry into follow-up until death (black filled circles indicate death due to cancer and empty circles indicate death from other causes). Follow-up until the occurrence of cancer C is shown with a broken line. For those subjects contracting cancer C, follow-up after diagnosis is shown with a solid line.

Looking at the plot, we will calculate the values of the incidence rate of the disease and of various mortality measures:

  1. What is the incidence rate (per 100 pyears) of cancer C during the period from 1 Jan 2004 to 31 Dec 2008?

  2. What is the mortality rate from cancer C during the same period?

  3. What is the mortality rate from all causes during the same period?

  4. What is the prevalence of C on 30 September 2006, and on 31 December 2008?

1.4.2 Exercise (b)

The table below gives the size (in 1000s) of the male population in Finland aged 0-14 years (the age range of “childhood” in pediatrics) on 31 December of each year from 1991 to 2000.

The following numbers of cases describe the incidence of and mortality from acute leukaemia in this population for two calendar periods: the 5-year period 1993 to 1997, and the year 1999 only.

  1. Calculate the incidence rates of acute leukaemia in this population for the two periods. (Hint: person-years can be obtained by the mid-population principle, i.e. using the 1995 population for the first period, or taking an average across years; a generic sketch is given after this list.)
  2. Calculate similarly the mortality rates of leukaemia.
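
A minimal generic sketch of the mid-population principle, with purely hypothetical numbers (the actual Finnish figures are those referred to above):

# Hypothetical mid-period population of 480 (in 1000s of boys aged 0-14)
pop.mid <- 480
py <- pop.mid * 1000 * 5   # person-years over a hypothetical 5-year period
12 / py * 100000           # e.g. 12 hypothetical cases, per 100,000 person-years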

2 Standardized rates

2.1 Example of standardized rates calculations

The following example does not require any specific R library; we will simply do the calculations “by hand”.

Age-specific data on the incidence of colon cancer in male and female populations of Finland during 1999 are given in the following table:

Let’s calculate the following summary measures:

  1. the crude incidence rates in both populations and the rate ratio (an anticipation of association measures…): males vs. females;
  2. the age-standardized rates and their ratio, using the male population as the standard (direct standardization with an internal standard);
  3. the age-standardized rates and their ratio, using the World Standard Population (direct standardization with an external standard).

We will compare and comment on the results obtained in items 1 to 3.

World Standard Population:

Let’s do it:

First of all, we create the data matrix:

M <- matrix( c(10,1157,46.0,0.9,22,1109,41.9,2.0,0.44
,76,809,32.0,9.4,68,786,29.7,8.6,1.09
,305,455,18.0,67,288,524,19.8,55,1.22
,201,102,4.0,196,354,229,8.6,155,1.27), nrow=4, byrow=T )
M <- data.frame(M)
names(M) <- c("mca","mpy","mp","mr",
              "fca","fpy","fp","fr","rr")
M
##   mca  mpy mp    mr fca  fpy   fp    fr   rr
## 1  10 1157 46   0.9  22 1109 41.9   2.0 0.44
## 2  76  809 32   9.4  68  786 29.7   8.6 1.09
## 3 305  455 18  67.0 288  524 19.8  55.0 1.22
## 4 201  102  4 196.0 354  229  8.6 155.0 1.27

Then, we compute the crude incidence rates:

rates <- with( M, c( sum(mca)/sum(mpy)*100,
            sum(fca)/sum(fpy)*100 ) )
rates[3] <- rates[1]/rates[2]
names(rates) <- c("M rate","F rate","M/F RR")
round( rates, 2 )
## M rate F rate M/F RR 
##  23.46  27.64   0.85

Now we compute the age-standardized rates and their ratio using the male population as the standard:

wm <- with( M, mpy/sum(mpy) )
wm
## [1] 0.45858105 0.32065002 0.18034086 0.04042806
rates <-with( M, c( sum(mca/mpy*wm)*100,
            sum(fca/fpy*wm)*100 ) )
rates[3] <- rates[1]/rates[2]
names(rates) <- c("M rate","F rate","M/F RR")
round( rates, 2 )
## M rate F rate M/F RR 
##  23.46  19.85   1.18

And now we compute the age-standardized rates and their ratio using the World Standard Population:

WSP <- c(120,100,90,90,80,80,60,60,60,60,50,40,40,30,20,10,5,3,2)
WSP
##  [1] 120 100  90  90  80  80  60  60  60  60  50  40  40  30  20  10   5   3   2
wt <- c( sum(WSP[1:7]), sum(WSP[8:11]), sum(WSP[12:15]), sum(WSP[16:19]) )
wt <- wt/sum(wt)
wt
## [1] 0.62 0.23 0.13 0.02
rates <- with( M, c( sum(mca/mpy*wt)*100,
            sum(fca/fpy*wt)*100 ) )
rates[3] <- rates[1]/rates[2]
names(rates) <- c("M rate","F rate","M/F RR")
round( rates, 2 )
## M rate F rate M/F RR 
##  15.35  13.46   1.14

We can notice that, when using the crude rates from the Finnish population, males seem to have a lower risk of colon cancer than females. But this result is driven by the confounding effect of age, since the female population is older than the male population. When the rates are adjusted for the differences in age structure, the relative risk for males is instead higher than for females.

If you want to see other examples using specific R libraries check the following website: https://epirhandbook.com/en/standardised-rates.html

2.2 Exercises

Standardized rates can be considered adjusted summary estimators, controlling, for example, for the confounding related to different age structures in the populations under study. These estimators are weighted averages of stratum-specific estimators.

We have seen the method of (direct) standardization using an external standard population (CMR, comparative mortality ratio) and the (indirect) method of standardized morbidity ratio estimation (Standardized Mortality Ratio, SMR). These approaches are useful in simple descriptive analyses.

As we have seen, to compute the crude incidence rate we take the ratio of the total number of new cases to the total person-years; if the data are split into age groups, we can sum all the age-specific numbers of cases in the numerator and then divide by the sum of all the age-specific person-years.

In order to compute the age-standardized rates (using the direct method), we instead apply the following formula:

\[ASR=\frac{\sum_{k=1}^{K} weight_{k} \cdot rate_{k}}{\sum_{k=1}^{K} weight_{k}}\]

Load the dataset for the exercise: it contains the number of cases (D) and the age-specific incidence rates (in cases per 100,000 person-years) from the Danish Cancer Register for the period 1983-87, for colon cancer, rectum cancer and lung cancer, by sex.

library(here)
std.rates <- read.delim(here("datasets","std-rates.txt"))
head(std.rates)
##   age sex   typ  D rate
## 1   0   M Colon  0 0.00
## 2   5   M Colon  2 0.25
## 3  10   M Colon  0 0.00
## 4  15   M Colon  3 0.30
## 5  20   M Colon  4 0.39
## 6  25   M Colon 13 1.36
std <- std.rates
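
The ASR formula above translates directly into a small helper function; this is a sketch to get the exercises started, where rate and w are assumed to hold the age-specific rates and the standard-population weights in the same age order:

# Direct age standardization: weighted average of the age-specific rates
asr <- function(rate, w) sum(w * rate) / sum(w)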

2.2.1 Exercise (c)

Calculate the crude rates for each sex and site, given that the number of male person-years is 2521177 and the number of female person-years is 2596061.

2.2.2 Exercise (d)

Calculate the standardized rates, standardized to the world standard population.

3 Diagnostic tests

Load the required library:

library(epiR)

We will use an example of data from the stress test to detect the presence of coronary disease.

dat           <- as.table(matrix(c(815,115,208,327), nrow = 2, byrow = TRUE))
colnames(dat) <- c("Dis+","Dis-")
rownames(dat) <- c("Test+","Test-")
dat
##       Dis+ Dis-
## Test+  815  115
## Test-  208  327

Let’s estimate the parameters of interest related to the performance of the stress test in detecting the disease. The following function computes true and apparent prevalence (apparent prevalence is, in this case, the number of subjects testing positive by a diagnostic test divided by the total number of subjects tested; true prevalence is the actual number of diseased subjects divided by the number of individuals in the population), sensitivity, specificity, positive and negative predictive values, and positive and negative likelihood ratios from count data provided in a 2 by 2 table. Exact binomial confidence limits are calculated for test sensitivity, specificity, and positive and negative predictive value.

The positive likelihood ratio is calculated as \[LR+ = \frac{Sensitivity}{1-Specificity}\] It represents the probability of a person who has the disease testing positive divided by the probability of a person who does not have the disease testing positive.

The negative likelihood ratio is calculated as \[LR- = \frac{1-Sensitivity}{Specificity}\] It represents the probability of a person who has the disease testing negative divided by the probability of a person who does not have the disease testing negative.
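
These quantities can also be sketched by hand from the counts in dat, as a check on the formulas above:

se <- dat["Test+", "Dis+"] / sum(dat[, "Dis+"])  # sensitivity: 815/1023
sp <- dat["Test-", "Dis-"] / sum(dat[, "Dis-"])  # specificity: 327/442
se / (1 - sp)                                    # LR+, about 3.06
(1 - se) / sp                                    # LR-, about 0.27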

rval <- epi.tests(dat, conf.level = 0.95)
print(rval)
##           Outcome +    Outcome -      Total
## Test +          815          115        930
## Test -          208          327        535
## Total          1023          442       1465
## 
## Point estimates and 95% CIs:
## --------------------------------------------------------------
## Apparent prevalence *                  0.63 (0.61, 0.66)
## True prevalence *                      0.70 (0.67, 0.72)
## Sensitivity *                          0.80 (0.77, 0.82)
## Specificity *                          0.74 (0.70, 0.78)
## Positive predictive value *            0.88 (0.85, 0.90)
## Negative predictive value *            0.61 (0.57, 0.65)
## Positive likelihood ratio              3.06 (2.61, 3.59)
## Negative likelihood ratio              0.27 (0.24, 0.31)
## False T+ proportion for true D- *      0.26 (0.22, 0.30)
## False T- proportion for true D+ *      0.20 (0.18, 0.23)
## False T+ proportion for T+ *           0.12 (0.10, 0.15)
## False T- proportion for T- *           0.39 (0.35, 0.43)
## Correctly classified proportion *      0.78 (0.76, 0.80)
## --------------------------------------------------------------
## * Exact CIs

3.1 Example: CK dataset

Let’s load the libraries:

library(epiR)
library(tidyverse)
library(OptimalCutpoints)

The CK file reports the creatine kinase (CK, U/L) measurements in patients with unstable angina and acute myocardial infarction, two conditions that are clinically difficult to distinguish. A diagnostic test has been devised such that patients with CK values > 100 are classified as cases of acute myocardial infarction, and patients with CK values of 100 or less as cases of unstable angina. Calculate the sensitivity, specificity, PPV and NPV of the diagnostic test. What changes when choosing CK = 80 as the threshold value?

load(here("datasets","CK.Rdata"))

Define the cutoff:

CK$ck.100 <- ifelse(CK$ck>100, 'CK > 100', 'CK <= 100')

Create the table:

pp <- table(CK$ck.100,CK$disease)
pp
##            
##             angina infarto
##   CK <= 100     58       1
##   CK > 100      34      26

Rearrange the table so that disease positives are in the first column and test positives in the first row, as epi.tests expects:

dat <- as.table(matrix(c(26,34,1,58), nrow = 2, byrow = TRUE))
colnames(dat) <- c("AMI+","AMI-")
rownames(dat) <- c("Test+","Test-")
dat
##       AMI+ AMI-
## Test+   26   34
## Test-    1   58

Calculate sensitivity, specificity, PPV, NPV of the diagnostic test:

rval <- epi.tests(dat, conf.level = 0.95)
print(rval)
##           Outcome +    Outcome -      Total
## Test +           26           34         60
## Test -            1           58         59
## Total            27           92        119
## 
## Point estimates and 95% CIs:
## --------------------------------------------------------------
## Apparent prevalence *                  0.50 (0.41, 0.60)
## True prevalence *                      0.23 (0.16, 0.31)
## Sensitivity *                          0.96 (0.81, 1.00)
## Specificity *                          0.63 (0.52, 0.73)
## Positive predictive value *            0.43 (0.31, 0.57)
## Negative predictive value *            0.98 (0.91, 1.00)
## Positive likelihood ratio              2.61 (1.98, 3.44)
## Negative likelihood ratio              0.06 (0.01, 0.40)
## False T+ proportion for true D- *      0.37 (0.27, 0.48)
## False T- proportion for true D+ *      0.04 (0.00, 0.19)
## False T+ proportion for T+ *           0.57 (0.43, 0.69)
## False T- proportion for T- *           0.02 (0.00, 0.09)
## Correctly classified proportion *      0.71 (0.62, 0.79)
## --------------------------------------------------------------
## * Exact CIs

Change the cutoff:

CK$ck.80 <- ifelse(CK$ck>80, 'CK > 80', 'CK <= 80') 

Create the table:

pp <- table(CK$ck.80,CK$disease)
pp
##           
##            angina infarto
##   CK <= 80     44       0
##   CK > 80      48      27
dat <- as.table(matrix(c(27,48,0,44), nrow = 2, byrow = TRUE))
colnames(dat) <- c("AMI+","AMI-")
rownames(dat) <- c("Test+","Test-")
dat
##       AMI+ AMI-
## Test+   27   48
## Test-    0   44
rval <- epi.tests(dat, conf.level = 0.95)
print(rval)
##           Outcome +    Outcome -      Total
## Test +           27           48         75
## Test -            0           44         44
## Total            27           92        119
## 
## Point estimates and 95% CIs:
## --------------------------------------------------------------
## Apparent prevalence *                  0.63 (0.54, 0.72)
## True prevalence *                      0.23 (0.16, 0.31)
## Sensitivity *                          1.00 (0.87, 1.00)
## Specificity *                          0.48 (0.37, 0.58)
## Positive predictive value *            0.36 (0.25, 0.48)
## Negative predictive value *            1.00 (0.92, 1.00)
## Positive likelihood ratio              1.92 (1.58, 2.33)
## Negative likelihood ratio              0.00 (0.00, NaN)
## False T+ proportion for true D- *      0.52 (0.42, 0.63)
## False T- proportion for true D+ *      0.00 (0.00, 0.13)
## False T+ proportion for T+ *           0.64 (0.52, 0.75)
## False T- proportion for T- *           0.00 (0.00, 0.08)
## Correctly classified proportion *      0.60 (0.50, 0.69)
## --------------------------------------------------------------
## * Exact CIs

We can observe that sensitivity is improved but specificity is worse.

3.2 Exercise (e)

As part of a prostate cancer screening program, blood samples from 100,000 men were analyzed looking for a biomarker known to be related to the disease. In total 4000 tested positive, but fortunately only 800 of these cases were then confirmed by biopsy (a much more reliable examination, but also long and expensive). Of the 96,000 negative screening tests, 100 individuals then developed the tumor within 12 months and are therefore to be considered false negatives. Calculate and comment on the sensitivity, specificity, PPV and NPV of the diagnostic test.

4 ROC curves for diagnostic tests

For this example we will use the R package pROC. It builds a ROC curve and returns a roc object, a list of class roc. This object can be printed, plotted, or passed to the functions auc, ci, smooth.roc and coords. Additionally, two roc objects can be compared with the function roc.test.

Data can be provided as response, predictor, where the predictor is the numeric (or ordered) level of the evaluated risk factor, and the response encodes the observation class (control or case). The levels argument specifies which response value must be taken as controls (first value of levels) and which as cases (second value). It can safely be ignored when the response is encoded as 0 and 1, but it will frequently fail otherwise. By default, the first two values of levels(as.factor(response)) are taken, and the remaining levels are ignored. This means that if your response is coded “control” and “case”, the levels will be inverted.

library(pROC)

4.1 Example: Subarachnoid hemorrhage data

This dataset summarizes several clinical variables and one laboratory variable for 113 patients with an aneurysmal subarachnoid hemorrhage (aSAH). The purpose of the case presented here is to identify patients at risk of poor post-aSAH outcome, as they require specific healthcare management; therefore the clinical test must be highly specific. Detailed results of the study are reported in [1]. We only outline the features relevant to the ROC analysis.

data(aSAH)
head(aSAH)
##    gos6 outcome gender age wfns s100b  ndka
## 29    5    Good Female  42    1  0.13  3.01
## 30    5    Good Female  37    1  0.14  8.54
## 31    5    Good Female  42    1  0.10  8.09
## 32    5    Good Female  27    1  0.04 10.42
## 33    1    Poor Female  42    3  0.13 17.40
## 34    1    Poor   Male  48    2  0.10 12.75
with(aSAH, table(gender, outcome))
##         outcome
## gender   Good Poor
##   Male     22   20
##   Female   50   21

Let’s estimate the ROC curve for the parameter s100b, a biomarker:

roc(aSAH$outcome, aSAH$s100b,levels=c("Good", "Poor"))
## 
## Call:
## roc.default(response = aSAH$outcome, predictor = aSAH$s100b,     levels = c("Good", "Poor"))
## 
## Data: aSAH$s100b in 72 controls (aSAH$outcome Good) < 41 cases (aSAH$outcome Poor).
## Area under the curve: 0.7314

Inspect the ROC plot and report the 95% CI:

roc(aSAH$outcome, aSAH$s100b,percent=TRUE, plot=TRUE, ci=TRUE)

## 
## Call:
## roc.default(response = aSAH$outcome, predictor = aSAH$s100b,     percent = TRUE, ci = TRUE, plot = TRUE)
## 
## Data: aSAH$s100b in 72 controls (aSAH$outcome Good) < 41 cases (aSAH$outcome Poor).
## Area under the curve: 73.14%
## 95% CI: 63.01%-83.26% (DeLong)

We can also inspect the confidence intervals around our estimates, point by point:

rocobj <- plot.roc(aSAH$outcome, aSAH$s100b)
ci.sp.obj <- ci.sp(rocobj, sensitivities=seq(0, 1, .01), boot.n=100)
plot(rocobj)
plot(ci.sp.obj, type="shape", col="blue")

We can also compare the performance of two different biomarkers:

roc.test(response = aSAH$outcome, predictor1 = aSAH$wfns, predictor2 = aSAH$s100b)
## 
##  DeLong's test for two correlated ROC curves
## 
## data:  aSAH$wfns and aSAH$s100b by aSAH$outcome (Good, Poor)
## Z = 2.209, p-value = 0.02718
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
##  0.01040618 0.17421442
## sample estimates:
## AUC of roc1 AUC of roc2 
##   0.8236789   0.7313686

And plot the two ROC curves in one graph:

roc.s100b <- roc(aSAH$outcome, aSAH$s100b)
roc.wfns <- roc(aSAH$outcome, aSAH$wfns)
plot(roc.s100b)
plot(roc.wfns, add=TRUE, col="red")

4.1.1 Determining an “optimal” cut-off

Load the required library:

library(OptimalCutpoints)

The package OptimalCutpoints calculates optimal cutpoints in diagnostic tests. Several methods or criteria for selecting optimal cutoffs have been implemented, including methods based on cost-benefit analysis, diagnostic test accuracy measures (sensitivity/specificity, predictive values and diagnostic likelihood ratios) and prevalence (Lopez-Raton et al. 2014). As a basic example, we can use the Youden index approach:

perf.s100b <-optimal.cutpoints(X ="s100b", status ="outcome", tag.healthy ="Good",
                                    methods = "Youden", data =aSAH, control = control.cutpoints(), 
                                    ci.fit = TRUE, 
                                    conf.level = 0.95, trace = FALSE)
perf.s100b
## 
## Call:
## optimal.cutpoints.default(X = "s100b", status = "outcome", tag.healthy = "Good", 
##     methods = "Youden", data = aSAH, control = control.cutpoints(), 
##     ci.fit = TRUE, conf.level = 0.95, trace = FALSE)
## 
## Optimal cutoffs:
##   Youden
## 1 0.2200
## 
## Area under the ROC curve (AUC):  0.731 (0.63, 0.833)

Therefore, 0.22 is the selected cut-off for the biomarker, i.e. the value that maximizes the sum of sensitivity and specificity. But always pay attention to the shape of the functional relationship between the biomarker and the risk of the outcome… more on this in block 3.
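
As a cross-check (a sketch, assuming the roc.s100b object fitted earlier in this section is still available), the coords function in pROC can also return the threshold that maximizes the Youden index:

# Threshold maximizing sensitivity + specificity - 1 (the Youden index)
coords(roc.s100b, "best", best.method = "youden")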

4.2 Exercise (f)

Let’s load the libraries:

library(tidyverse)
library(pROC)
library(MASS)
library(OptimalCutpoints)
library(epiR)
library(here)

Load the data:

diabetes <- read.csv(here("datasets","diabetes.csv"))
glimpse(diabetes)
## Rows: 768
## Columns: 9
## $ Pregnancies              <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, …
## $ Glucose                  <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125…
## $ BloodPressure            <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74…
## $ SkinThickness            <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, …
## $ Insulin                  <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, …
## $ BMI                      <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.…
## $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2…
## $ Age                      <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3…
## $ Outcome                  <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, …

We transform Outcome from an integer variable into a factor with categories “No” and “Yes”:

diabetes$Outcome <- factor(diabetes$Outcome, labels=c("No", "Yes"))
table(diabetes$Outcome)
## 
##  No Yes 
## 500 268
  • estimate the ROC curve for the glucose biomarker;
  • plot the curve;
  • try to define an optimal cut-off, adopting the Youden Index approach;
  • check the sensitivity, specificity, PPV and NPV of this cut-off;
  • OPTIONAL: try to define a cut-off considering a different approach (e.g. “MinValueNPV” criterion \(\implies\) methods = "MinValueNPV")

5 Measures of association

5.1 Introduction

An important task in epidemiology is to quantify the strength of association between exposures and outcomes. In this context the term exposure is taken to mean a variable whose association with the outcome is to be estimated.

Exposures can be harmful, beneficial or both harmful and beneficial (e.g. if an immunisable disease is circulating, exposure to immunising agents helps most recipients but may harm those who experience adverse reactions). The term outcome is used to describe all the possible results that may arise from exposure to a factor or from preventive or therapeutic interventions.

In the following examples we will consider only binary exposure factors (in block 3 we will extend to exposures on a quantitative scale, through the use of the appropriate regression models).

5.2 Relative Risk

RR (relative risk) is the ratio of the risk of getting the disease among the exposed to that among the non-exposed. It indicates how many times the risk would increase had the subject changed from non-exposed to exposed. The increase is expressed in folds, so the relative risk corresponds mathematically to a multiplicative model.

Load the R libraries that we will use in this session:

library(epiR)
library(epitools)

First of all, let’s construct the contingency table of our study: the exposure factor is behavior type and the disease of interest is the occurrence of a coronary heart disease (CHD) event.

data.type <- matrix(c(178,79,1141,1486), 2, 2)
dimnames(data.type) <- list("Behavior type" = c("Type A", "Type B"),
                      "Occurrence of CHD Event" = c("Yes", "No"))
data.type
##              Occurrence of CHD Event
## Behavior type Yes   No
##        Type A 178 1141
##        Type B  79 1486

The aim is to calculate the relative risk and the corresponding 95% CI. (We will come back to the 95% CI later in Block 2, when discussing sample size determination.) Here we use a function from the epiR library, in which the relative risk is called the incidence risk ratio. See the help of the function for all the details. Of note, the option method refers to the study design that generated the data under study. We will look at the differences between cohort, case-control and cross-sectional studies in more detail in Block 2.

epi.2by2(data.type, method = "cohort.count")
##              Outcome +    Outcome -      Total                 Inc risk *
## Exposed +          178         1141       1319     13.50 (11.70 to 15.46)
## Exposed -           79         1486       1565        5.05 (4.02 to 6.25)
## Total              257         2627       2884       8.91 (7.90 to 10.01)
## 
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc risk ratio                                 2.67 (2.07, 3.45)
## Inc odds ratio                                 2.93 (2.23, 3.87)
## Attrib risk in the exposed *                   8.45 (6.31, 10.59)
## Attrib fraction in the exposed (%)            62.59 (51.75, 71.00)
## Attrib risk in the population *                3.86 (2.36, 5.37)
## Attrib fraction in the population (%)         43.35 (32.37, 52.55)
## -------------------------------------------------------------------
## Uncorrected chi2 test that OR = 1: chi2(1) = 62.919 Pr>chi2 = <0.001
## Fisher exact test that OR = 1: Pr>chi2 = <0.001
##  Wald confidence limits
##  CI: confidence interval
##  * Outcomes per 100 population units

Therefore, the risk of a CHD event for behavior type “A” is 2.67 times that for type “B”.
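
As a by-hand check, a quick sketch using the counts from the table above:

risk.A <- 178 / (178 + 1141)   # risk of a CHD event among type A
risk.B <- 79 / (79 + 1486)     # risk of a CHD event among type B
risk.A / risk.B                # incidence risk ratio, about 2.67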

5.3 Odds Ratio

As we have already seen in a dedicated note, the odds is closely related to probability: if p is the probability of an event, then p/(1-p) is known as the odds. First of all, also here let’s build the contingency table: now the exposure factor is coffee drinking (one or more cups per day vs. zero) and the disease of interest is pancreatic cancer.

data.coffee <- matrix(c(347,20,555,88), 2, 2)
dimnames(data.coffee) <- list("Coffee Drinking (cups per day)" = c(">=1", "0"),
                       "Pancreatic cancer" = c("cases", "controls"))
data.coffee
##                               Pancreatic cancer
## Coffee Drinking (cups per day) cases controls
##                            >=1   347      555
##                            0      20       88

We want to estimate the odds ratio and the corresponding 95% CI. The function oddsratio calculates the odds ratio by median-unbiased estimation (mid-p), conditional maximum likelihood estimation (Fisher), unconditional maximum likelihood estimation (Wald), and small sample adjustment (small). The reason for these various approaches lies in the computation of the confidence intervals, which are calculated using exact methods (mid-p and Fisher), normal approximation (Wald), and normal approximation with small sample adjustment (small). If you are interested in these boring statistical details, see the help of the function for the default values.

oddsratio(data.coffee, method="wald")
## $data
##                               Pancreatic cancer
## Coffee Drinking (cups per day) cases controls Total
##                          >=1     347      555   902
##                          0        20       88   108
##                          Total   367      643  1010
## 
## $measure
##                               odds ratio with 95% C.I.
## Coffee Drinking (cups per day) estimate    lower    upper
##                            >=1 1.000000       NA       NA
##                            0   2.750991 1.662391 4.552449
## 
## $p.value
##                               two-sided
## Coffee Drinking (cups per day)   midp.exact fisher.exact   chi.square
##                            >=1           NA           NA           NA
##                            0   2.335156e-05  3.02823e-05 4.622573e-05
## 
## $correction
## [1] FALSE
## 
## attr(,"method")
## [1] "Unconditional MLE & normal approximation (Wald) CI"

The odds of the disease among the exposed are 2.75 times the odds among the unexposed.
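
As a by-hand check, the odds ratio is the cross-product ratio of the 2-by-2 table above:

(347 * 88) / (555 * 20)   # odds ratio, about 2.75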

5.4 Absolute effect and attributable risks

In this library, the attributable risk is defined as the risk of disease in the exposed group minus the risk of disease in the unexposed group (we have defined it in our slides as the excess risk). This provides a measure of the absolute increase or decrease in risk associated with exposure. The risk difference indicates the amount of risk gained or lost had the subject changed from non-exposed to exposed. The increase is absolute and corresponds mathematically to an additive model. As we have discussed, the risk difference has more public health implications than the relative risk: a high relative risk may not be of public health importance if the disease is very rare. The risk difference, on the other hand, measures the direct health burden and the need for health services.

In this library there is also a quantity called Attrib risk in population: the difference between the risk in the total population and the risk in the unexposed. Of note, the library calls Attrib fraction in population (%) the fraction of all cases of D in the population that can be attributed to E (we call it simply the attributable risk). And finally, Attrib fraction in exposed (%) is the proportion of cases attributable to exposure in the exposed population (i.e. when considering only the population on which the exposure can act); we call it the attributable risk in the exposed.

We now use as an example the low infant birth weight data presented by Hosmer and Lemeshow (2000) and available in the MASS package in R. The birthwt data frame has 189 rows and 10 columns. The data were collected at Baystate Medical Center, Springfield, Massachusetts USA during 1986.

library(MASS)
bwt <- birthwt
head(bwt)
##    low age lwt race smoke ptl ht ui ftv  bwt
## 85   0  19 182    2     0   0  0  1   0 2523
## 86   0  33 155    3     0   0  0  0   3 2551
## 87   0  20 105    1     1   0  0  0   1 2557
## 88   0  21 108    1     1   0  0  1   2 2594
## 89   0  18 107    1     1   0  0  1   0 2600
## 91   0  21 124    3     0   0  0  0   0 2622

Each row of this data set represents data for one mother. We’re interested in the association between smoke (the mother’s smoking status during pregnancy) and low (delivery of a baby with a birth weight of less than 2.5 kg). It is important that the table you present to the epi.2by2 function is in the correct format: disease positives in the first column, disease negatives in the second column, exposure positives in the first row and exposure negatives in the second row.

bwt$low    <- factor(bwt$low, levels = c(1,0))
bwt$smoke  <- factor(bwt$smoke, levels = c(1,0))
bwt$race   <- factor(bwt$race, levels = c(1,2,3))

Generate the 2 by 2 table. Exposure (rows) = smoke, outcome (columns) = low:

low.tab <- table(bwt$smoke, bwt$low, dnn = c("Smoke", "Low BW"))
low.tab
##      Low BW
## Smoke  1  0
##     1 30 44
##     0 29 86

Compute the measures of association for smoking and delivery of a low birth weight baby:

epi.2by2(dat = low.tab, method = "cohort.count", conf.level = 0.95, 
   units = 100, outcome = "as.columns")
##              Outcome +    Outcome -      Total                 Inc risk *
## Exposed +           30           44         74     40.54 (29.27 to 52.59)
## Exposed -           29           86        115     25.22 (17.58 to 34.17)
## Total               59          130        189     31.22 (24.69 to 38.35)
## 
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc risk ratio                                 1.61 (1.06, 2.44)
## Inc odds ratio                                 2.02 (1.08, 3.78)
## Attrib risk in the exposed *                   15.32 (1.61, 29.04)
## Attrib fraction in the exposed (%)            37.80 (5.47, 59.07)
## Attrib risk in the population *                6.00 (-4.33, 16.33)
## Attrib fraction in the population (%)         19.22 (-0.21, 34.88)
## -------------------------------------------------------------------
## Uncorrected chi2 test that OR = 1: chi2(1) = 4.924 Pr>chi2 = 0.026
## Fisher exact test that OR = 1: Pr>chi2 = 0.036
##  Wald confidence limits
##  CI: confidence interval
##  * Outcomes per 100 population units

The estimated excess risk is 15.3 per 100, i.e. we would expect about 15 additional low birth weight deliveries per 100 births if all women smoked, compared with none smoking. The attributable risk in the population is estimated at 6 per 100, the absolute excess of cases in the total population that can be attributed to smoking. The attributable fraction in the population is 19%, i.e. the fraction of all cases of low birth weight that can be attributed to smoking (the naive interpretation is that low birth weight could be reduced by 19% if all mothers stopped smoking). Finally, the attributable fraction in the exposed is estimated at 38%, i.e. 38% of the low birth weight cases that occurred among women who smoked can be attributed to smoking (assuming causality).
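
These measures can be reproduced by hand from the counts in low.tab, as a quick check:

p1 <- 30/74    # risk of low birth weight among smokers
p0 <- 29/115   # risk among non-smokers
p  <- 59/189   # risk in the total population
(p1 - p0) * 100        # excess risk in the exposed, about 15.32 per 100
(p - p0) * 100         # attributable risk in the population, about 6.00
(p1 - p0) / p1 * 100   # attributable fraction in the exposed, about 37.80%
(p - p0) / p * 100     # attributable fraction in the population, about 19.22%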

5.5 Exercises

5.5.1 Exercise (g)

Assuming that the risk of some disease is 1 in 100,000 among unexposed people, create a table of attributable fractions for exposures that increase the risk to 1 in 50,000, 1 in 10,000 and 1 in 1000.

5.5.2 Exercise (h)

Let’s say that the probability (prevalence) of being exposed, for the previous data, is 5%. Add a column to the matrix created above to evaluate the population attributable risk.

5.5.3 Exercise (i)

In the Scottish Health Cohort Study, 85 of 1821 people who lived in rented apartments had coronary artery disease, compared with 77 of 2477 people who owned their homes. Create a 2-by-2 matrix object from these data. Calculate the risks, relative risks, odds and odds ratios for these data. Comment on the results.

5.5.4 Exercise (j)

Let’s load the University Group Diabetes Program (UGDP) data. This was a placebo-controlled, multi-center randomized clinical trial from the early 1960s (we will see the definition of this kind of study design in block 2), intended to establish the efficacy of treatments for type 2 diabetes. There were a total of 409 patients: 204 received tolbutamide and 205 received placebo. 30 of the 204 tolbutamide patients died, as did 21 of the placebo patients. There are three variables of interest: Status (Survivor or Death), Treatment (Placebo or Tolbutamide) and Age Group (older or younger than 55).

Calculate the relative risk and the odds ratio for the association of tolbutamide with death (for the overall population) and comment on the results.

6 Your turn! :)

6.1 Descriptive epidemiology

6.1.1 Exercise (a)

The figure below shows the follow-up experience of members of a small study cohort between 1 January 2004 and 30 June 2009, from entry into follow-up until death (black filled circles indicate death due to cancer and empty circles indicate death from other causes). Follow-up until the occurrence of cancer C is shown with a broken line. For those subjects contracting cancer C, follow-up after diagnosis is shown with a solid line.

Looking at the plot, we will calculate the values of the incidence rate of the disease and of various mortality measures:

  1. What is the incidence rate (per 100 pyears) of cancer C during the period from 1 Jan 2004 to 31 Dec 2008?

  2. What is the mortality rate from cancer C during the same period?

  3. What is the mortality rate from all causes during the same period?

  4. What is the prevalence of C on 30 September 2006, and on 31 December 2008?

6.1.2 Exercise (b)

The table below gives the size (in 1000s) of the male population in Finland aged 0-14 years (the age range of “childhood” in pediatrics) on 31 December of each year from 1991 to 2000.

The following numbers of cases describe the incidence of and mortality from acute leukaemia in this population for two calendar periods: the 5-year period 1993 to 1997, and the year 1999 only.

  1. Calculate the incidence rates of acute leukaemia in this population for the two periods. (Hint: person-years can be obtained by the mid-population principle, i.e. using the 1995 population for the first period, or taking an average across years.)
  2. Calculate similarly the mortality rates of leukaemia.

6.2 Standardized rates

Load the dataset for the exercise: it contains the number of cases (D) and the age-specific incidence rates (in cases per 100,000 person-years) from the Danish Cancer Register for the period 1983-87, for colon cancer, rectum cancer and lung cancer, by sex.

library(here)

std.rates <- read.delim(here("datasets","std-rates.txt"))
head(std.rates)
##   age sex   typ  D rate
## 1   0   M Colon  0 0.00
## 2   5   M Colon  2 0.25
## 3  10   M Colon  0 0.00
## 4  15   M Colon  3 0.30
## 5  20   M Colon  4 0.39
## 6  25   M Colon 13 1.36
std <- std.rates

6.2.1 Exercise (c)

Calculate the crude rates for each sex and site, given that the number of male person-years is 2521177 and the number of female person-years is 2596061.

6.2.2 Exercise (d)

Calculate the standardized rates, standardized to the world standard population.

6.3 Diagnostic tests

6.3.1 Exercise (e)

As part of a prostate cancer screening program, blood samples from 100,000 men were analyzed looking for a biomarker known to be related to the disease. In total 4000 tested positive, but fortunately only 800 of these cases were then confirmed by biopsy (a much more reliable examination, but also long and expensive). Of the 96,000 negative screening tests, 100 individuals then developed the tumor within 12 months and are therefore to be considered false negatives. Calculate and comment on the sensitivity, specificity, PPV and NPV of the diagnostic test.

6.4 ROC curves for diagnostic tests

6.4.1 Exercise (f)

Let’s load the libraries:

library(tidyverse)
library(pROC)
library(MASS)
library(OptimalCutpoints)
library(epiR)
library(here)

Load the data:

diabetes <- read.csv(here("datasets","diabetes.csv"))
glimpse(diabetes)
## Rows: 768
## Columns: 9
## $ Pregnancies              <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, …
## $ Glucose                  <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125…
## $ BloodPressure            <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74…
## $ SkinThickness            <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, …
## $ Insulin                  <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, …
## $ BMI                      <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.…
## $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2…
## $ Age                      <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3…
## $ Outcome                  <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, …

We transform Outcome from an integer variable into a factor with categories “No” and “Yes”:

diabetes$Outcome <- factor(diabetes$Outcome, labels=c("No", "Yes"))
table(diabetes$Outcome)
## 
##  No Yes 
## 500 268
  • estimate the ROC curve for the glucose biomarker;
  • plot the curve;
  • try to define an optimal cut-off, adopting the Youden Index approach;
  • check the sensitivity, specificity, PPV and NPV of this cut-off;
  • OPTIONAL: try to define a cut-off considering a different approach (e.g. “MinValueNPV” criterion \(\implies\) methods = "MinValueNPV")

6.5 Measures of association

6.5.1 Exercise (g)

Assuming that the risk of some disease is 1 in 100,000 among unexposed people, create a table of attributable fractions for exposures that increase the risk to 1 in 50,000, 1 in 10,000 and 1 in 1000.

6.5.2 Exercise (h)

Let’s say that the probability (prevalence) of being exposed, for the previous data, is 5%. Add a column to the matrix created above to evaluate the population attributable risk.

6.5.3 Exercise (i)

In the Scottish Health Cohort Study, 85 of 1821 people who lived in rented apartments had coronary artery disease, compared with 77 of 2477 people who owned their homes. Create a 2-by-2 matrix object from these data. Calculate the risks, relative risks, odds and odds ratios for these data. Comment on the results.

6.5.4 Exercise (j)

Let’s load the University Group Diabetes Program (UGDP) data. This was a placebo-controlled, multi-center randomized clinical trial from the early 1960s (we will see the definition of this kind of study design in block 2), intended to establish the efficacy of treatments for type 2 diabetes. There were a total of 409 patients: 204 received tolbutamide and 205 received placebo. 30 of the 204 tolbutamide patients died, as did 21 of the placebo patients. There are three variables of interest: Status (Survivor or Death), Treatment (Placebo or Tolbutamide) and Age Group (older or younger than 55).

Calculate the relative risk and the odds ratio for the association of tolbutamide with death (for the overall population) and comment on the results.