1 Causal Inference Introduction

Useful libraries:

library(tidyverse)
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr     1.1.4     ✔ readr     2.1.5
## ✔ forcats   1.0.0     ✔ stringr   1.5.1
## ✔ ggplot2   3.5.1     ✔ tibble    3.2.1
## ✔ lubridate 1.9.4     ✔ tidyr     1.3.1
## ✔ purrr     1.0.2     
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
library(ggplot2)

1.1 Estimating counterfactuals

Conceptually, the missing counterfactual outcome is the one that would have occurred under a different set of circumstances. In causal inference, we wish we could observe the counterfactual outcome that would have occurred in an alternate universe where the exposure status of a given observation was flipped. To do this, we attempt to control for all factors that are related to both the exposure and the outcome, so that we can construct (or estimate) such a counterfactual outcome.

Let’s suppose some happiness index from 1 to 10 exists. We are interested in assessing whether eating chocolate ice cream versus vanilla increases happiness. We have 10 individuals with two potential outcomes each: what their happiness would be if they ate chocolate ice cream (defined as y_chocolate in the code below), and what their happiness would be if they ate vanilla ice cream (defined as y_vanilla in the code below). We can define the true causal effect of eating chocolate ice cream (versus vanilla) on happiness for each individual as the difference between the two:

data <- data.frame(
  id = 1:10,
  y_chocolate = c(4, 4, 6, 5, 6, 5, 6, 7, 5, 6),
  y_vanilla = c(1, 3, 4, 5, 5, 6, 8, 6, 3, 5)
)

data <- data |>
  mutate(causal_effect = y_chocolate - y_vanilla)

data
##    id y_chocolate y_vanilla causal_effect
## 1   1           4         1             3
## 2   2           4         3             1
## 3   3           6         4             2
## 4   4           5         5             0
## 5   5           6         5             1
## 6   6           5         6            -1
## 7   7           6         8            -2
## 8   8           7         6             1
## 9   9           5         3             2
## 10 10           6         5             1
data |>
  summarize(
    avg_chocolate = mean(y_chocolate),
    avg_vanilla = mean(y_vanilla),
    avg_causal_effect = mean(causal_effect)
  )
##   avg_chocolate avg_vanilla avg_causal_effect
## 1           5.4         4.6               0.8

The causal effect of eating chocolate ice cream (versus vanilla) for individual 4 is 0, whereas the causal effect for individual 9 is 2. The average potential happiness after eating chocolate is 5.4 and the average potential happiness after eating vanilla is 4.6. The average treatment effect of eating chocolate (versus vanilla) ice cream among the ten individuals in this study is 0.8.

In reality, we cannot observe both potential outcomes: at any moment in time, each individual in our study can only eat one flavor of ice cream. Suppose we let our participants choose which ice cream they wanted to eat and each chose their favorite (i.e. they knew which flavor would make them “happier” and picked that one). Now what we observe is:

data_observed <- data |>
  mutate(
    exposure = case_when(
      # people who like chocolate more chose that
      y_chocolate > y_vanilla ~ "chocolate",
      # people who like vanilla more chose that
      y_vanilla >= y_chocolate ~ "vanilla"
    ),
    observed_outcome = case_when(
      exposure == "chocolate" ~ y_chocolate,
      exposure == "vanilla" ~ y_vanilla
    )
  ) |>
  # we can only observe the exposure and one potential outcome
  select(id, exposure, observed_outcome)
data_observed
##    id  exposure observed_outcome
## 1   1 chocolate                4
## 2   2 chocolate                4
## 3   3 chocolate                6
## 4   4   vanilla                5
## 5   5 chocolate                6
## 6   6   vanilla                6
## 7   7   vanilla                8
## 8   8 chocolate                7
## 9   9 chocolate                5
## 10 10 chocolate                6
data_observed |>
  group_by(exposure) |>
  summarise(avg_outcome = mean(observed_outcome))
## # A tibble: 2 × 2
##   exposure  avg_outcome
##   <chr>           <dbl>
## 1 chocolate        5.43
## 2 vanilla          6.33

Now, the observed average outcome among those who ate chocolate ice cream is 5.4 (close to the true average potential outcome), while the observed average outcome among those who ate vanilla is 6.3 – quite different from the actual average (4.6). The estimated causal effect here could be calculated as 5.4 - 6.3 = -0.9.

It turns out that these 10 participants chose which ice cream they wanted to eat and always chose their favorite! This artificially made it look like eating vanilla ice cream would increase happiness in this population, when in fact we know the opposite is true. The next section will discuss which assumptions need to hold in order for us to accurately estimate causal effects using observed data. As a sneak peek, our issue here was how the exposure was decided: if instead we randomized who ate chocolate versus vanilla ice cream, we would (on average, with a large enough sample) recover the true causal effect.

1.2 The magic of randomization

Since we are now doing something random, let’s set a seed so we observe the same result each time we run the code:

set.seed(11)
data_observed <- data |>
  mutate(
    # change the exposure to randomized, generate from a binomial distribution
    # with a probability 0.5 for being in either group
    exposure = case_when(
      rbinom(10, 1, 0.5) == 1 ~ "chocolate",
      TRUE ~ "vanilla"
    ),
    observed_outcome = case_when(
      exposure == "chocolate" ~ y_chocolate,
      exposure == "vanilla" ~ y_vanilla
    )
  ) |>
  # as before, we can only observe the exposure and one potential outcome
  select(id, exposure, observed_outcome)
data_observed |>
  group_by(exposure) |>
  summarise(avg_outcome = mean(observed_outcome))
## # A tibble: 2 × 2
##   exposure  avg_outcome
##   <chr>           <dbl>
## 1 chocolate        5.33
## 2 vanilla          4.71

As you can see, we have now estimated a causal effect close to the true one in this population… how did this happen? More on this in block 2 (Study Design) and block 3!
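
To see the “on average” part at work, here is a minimal simulation sketch: we repeat the random assignment many times on the same 10 individuals and average the resulting difference-in-means estimates, which should come out close to the true effect of 0.8.

set.seed(42)
estimates <- replicate(5000, {
  # randomly assign each individual to chocolate (TRUE) or vanilla (FALSE)
  chocolate <- rbinom(10, 1, 0.5) == 1
  if (all(chocolate) || all(!chocolate)) {
    NA_real_   # degenerate draw: everyone in the same arm, skip it
  } else {
    observed <- ifelse(chocolate, data$y_chocolate, data$y_vanilla)
    mean(observed[chocolate]) - mean(observed[!chocolate])
  }
})
mean(estimates, na.rm = TRUE)   # should be close to the true average causal effect of 0.8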

2 Descriptive epidemiology - introduction

Epidemiology is the study of the frequency, distribution and determinants of health-related states in populations and the application of such knowledge to control health problems (Centers for Disease Control and Prevention 2006).

This introduction provides examples of how R can be used for initial descriptive epidemiological analyses, that is, to describe how the frequency of disease varies across populations, places and times. Note that we will not discuss specific methods for spatial epidemiology in this course; that is an advanced and more specific topic.

Load the libraries that we will use in this session:

library(epiR)
library(Epi)
library(popEpi)
library(ggplot2)
library(scales)
library(lubridate)

2.1 Prevalence and Incidence in the populations

Describing disease frequency involves reporting either the prevalence or the incidence of disease.

Recall the definitions: strictly speaking, prevalence equals the number of cases of a given disease or attribute that exist in a population at a specified point in time. Prevalence is therefore the proportion of a population that has a specific disease or attribute at a specified point in time.

Two types of prevalence are reported in the literature: (1) point prevalence, the proportion of a population in a diseased state at a single point in time; (2) period prevalence, the proportion of a population with a given disease or condition over a specific period of time (i.e. the number of existing cases at the start of a follow-up period plus the number of incident cases that occur during the follow-up period).

Incidence provides a measure of how frequently susceptible (at risk) individuals become disease cases as they are observed over time. An incident case occurs when an individual changes from being susceptible (at risk) to being diseased. The count of incident cases is the number of such events that occur in a population during a defined follow-up period. There are two ways to express incidence:

Cumulative incidence is the proportion of initially susceptible individuals in a population who become new cases during a defined follow-up period.

Incidence rate (also known as incidence density) is the number of new cases of disease that occur per unit of individual time at risk during a defined follow-up period.

In addition to reporting the point estimate of disease frequency, it is important to provide an indication of the uncertainty around that point estimate. The epi.conf function in the epiR package allows you to obtain confidence intervals for prevalence, cumulative incidence and incidence rates. In fact, this function is more useful for the confidence interval calculations than for the simple prevalence and incidence formulas themselves.

2.1.1 Prevalence example

Let’s say we’re interested in the prevalence of disease X in a population comprised of 1000 individuals. Two hundred are tested and four return a positive result. Assuming 100% test sensitivity and specificity (i.e. perfect accuracy of the diagnostic tool; see later in this block for a detailed explanation), what is the estimated prevalence of disease X in this population?

ncas <- 4; npop <- 200
tmp <- as.matrix(cbind(ncas, npop))
epi.conf(tmp, ctype = "prevalence", method = "exact", N = 1000, design = 1, conf.level = 0.95) * 100
##   est     lower    upper
## 1   2 0.5475566 5.041361

The estimated prevalence of disease X in this population is 2.0 (95% confidence interval [CI] 0.55 - 5.0) cases per 100 individuals.

Of note: the design effect (option design) is used to adjust the confidence interval around a prevalence or incidence risk estimate in the presence of clustering, i.e. when the sampling mechanism relates to groups rather than to single subjects. The design effect is a measure of the variability between clusters and is calculated as the ratio of the variance under the complex sample design to the variance under simple random sampling.
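
Purely as an illustration, and assuming the design argument is honoured in the same way by this method, rerunning the same call with a hypothetical design effect of 2 should give a wider interval around the same point estimate:

# same data as above, but with an assumed design effect of 2 (clustered sampling)
epi.conf(tmp, ctype = "prevalence", method = "exact", N = 1000, design = 2, conf.level = 0.95) * 100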

2.1.2 From data to estimates…

As we have seen, the prevalence of a disease in a population is the fraction of the population that has the disease at a given time. The total number of persons in Country X alive with colon cancer on 1 January 2023 is 16281, and the number of persons in Country X on 1 January 2023 is 5534738, so the point prevalence of colon cancer is:

16281/5534738
## [1] 0.002941603

that is, 0.29%.

This is the empirical prevalence of colon cancer patients in Country X on 1 January 2023.

The theoretical prevalence is the probability that a randomly chosen person from the population has colon cancer.

This is an unknown quantity, but a reasonable estimate of it would be the empirical prevalence.

That is, however, just an estimate from a statistical model that asserts that colon cancer is randomly distributed in the population with probability p, and there is an uncertainty associated with the estimate, which we can quantify in a confidence interval. This is an interval that captures the true prevalence with some pre-specified probability (usually set at 95%).

The data used to estimate p are X = the number of persons with colon cancer and N = the number of persons in the population. In order to estimate p we can fit a binomial model:

X <- 16281
N <- 5534738
mp <- glm(cbind(X, N-X)~1, family=binomial)
round(ci.pred(mp)*100,4)
##   Estimate   2.5%  97.5%
## 1   0.2942 0.2897 0.2987

We use the glm function in R, which fits generalized linear models, in this case a model for a binomial outcome. The observed outcome must be specified as the number of successes (cases of cancer) and the number of failures (non-diseased persons). The dependence of the prevalence on explanatory variables is described by a model (more on this in block 3!), here specified by ~1, which means that the prevalence is constant across all observational units (not surprising, since we only have one data point!). The ci.pred function produces a prediction of the theoretical prevalence from the model.
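
As a quick cross-check, the exact (Clopper-Pearson) binomial interval from base R should be very close to the model-based interval above:

# exact 95% CI for the prevalence, expressed per 100 persons
round(binom.test(X, N)$conf.int * 100, 4)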

2.1.3 Another example : prevalence of diabetes by age and sex

We will use a dataset of diabetes prevalence in Denmark in 1-year age classes by sex, measured on 1 January 2010. This dataset has 200 rows and the following 4 variables:

A: Numeric, age, 0-99

sex: Sex, a factor with levels M F

X: Number of diabetes patients

N: Population size

data(pr)
str(pr)
## 'data.frame':    200 obs. of  4 variables:
##  $ A  : num  0 0 10 10 11 11 12 12 13 13 ...
##  $ sex: Factor w/ 2 levels "M","F": 2 1 2 1 2 1 2 1 2 1 ...
##  $ X  : num  1 0 84 83 106 70 104 111 133 120 ...
##  $ N  : num  30743 32435 32922 34294 32890 ...
head(pr)
##    A sex   X     N
## 1  0   F   1 30743
## 2  0   M   0 32435
## 3 10   F  84 32922
## 4 10   M  83 34294
## 5 11   F 106 32890
## 6 11   M  70 34644

We would like to know the prevalence at different ages and separately by sex. Since these data come from the entire population of Denmark, the interpretation will not include the sampling variation (i.e. confidence intervals). We can compute the observed prevalence by age and sex as follows:

prt <- stat.table(index=list(A,sex), contents=ratio(X,N,100), data=pr)[1,,]
round(prt[70:75,],1)
##     sex
## A       M    F
##   69 16.2 12.4
##   70 16.5 12.7
##   71 17.3 13.5
##   72 17.7 13.9
##   73 18.3 14.8
##   74 18.9 15.1

A few rows of the resulting array are printed, from age 69 to age 74, just to show some results.

We can plot these entries as a function of age, with one curve for each sex:

matplot(0:99+0.5, prt, type="l", lwd=2, col=c(4,2),lty=1 )
abline(v=c(76,81)+0.5)

Ages between 76 and 81 correspond to the maximal prevalences for men and women. We plotted the prevalences against the midpoints of the age-classes so we add 0.5 to the position of the vertical lines too.
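
The same figure can also be drawn with ggplot2 (loaded at the start of this session); this is just an alternative sketch using the pr data directly:

ggplot(pr, aes(x = A + 0.5, y = X / N * 100, colour = sex)) +
  geom_line(linewidth = 1) +
  geom_vline(xintercept = c(76, 81) + 0.5, linetype = "dashed") +
  labs(x = "Age (midpoint of 1-year class)", y = "Prevalence (%)", colour = "Sex")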

2.2 Incidence rate basic example

A study was conducted by Feychting, Osterlund, and Ahlbom (1998) to report the frequency of cancer among the blind. A total of 136 diagnoses of cancer were made from 22050 person-years at risk. What was the incidence rate of cancer in this population?

ncas <- 136; ntar <- 22050
tmp <- as.matrix(cbind(ncas, ntar))
epi.conf(tmp, ctype = "inc.rate", method = "exact", N = 1000, design = 1, conf.level = 0.95) * 1000
##         est    lower    upper
## ncas 6.1678 5.174806 7.295817

The incidence rate of cancer in this population was 6.2 (95% CI 5.2 to 7.3) cases per 1000 person-years at risk.

2.2.1 From data to estimates…again

When the incidence rate is assumed constant, we can assume a Poisson likelihood, which leads to the following estimate and standard error (the observed outcome is D, the number of events, and the amount of follow-up time is Y):

\[\hat{\lambda}=D/Y\] \[\mathrm{s.e.}\{\log(\hat{\lambda})\}=1/\sqrt{D}\]

From this formula we can also derive approximate 95% confidence intervals for the estimated rate (see Block 2). A summary of the methods used for each of the confidence interval calculations in the epi.conf function is available in the help of the function.
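
For example, plugging in the numbers from the study above (D = 136 events over Y = 22050 person-years) gives a normal-approximation interval on the log scale that is close to the exact interval from epi.conf; this is just the hand calculation behind the formulas:

D <- 136; Y <- 22050
rate <- D / Y * 1000                 # incidence rate per 1000 person-years
se.log <- 1 / sqrt(D)                # standard error of log(rate)
round(c(est = rate,
        lower = rate * exp(-1.96 * se.log),
        upper = rate * exp( 1.96 * se.log)), 2)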

2.3 Mortality rate

Mortality is typically reported as a number of people that have died in a population of a certain size. For example, 2648 persons died out of 150000 persons. This is, apparently, a mortality of 17.7 per 1000 persons:

2648/150000
## [1] 0.01765333

Looking a bit closer at this, it is a bit illogical: the number of deaths must be recorded over some period of time, in this case presumably a year. The number of persons, on the other hand, refers to the population size at some specific date, presumably 1 January or 1 July of some year. So the two components, number of deaths and population size, refer to a period and a specific time point, respectively. The number of persons at risk of dying will not vary a lot over a year, so as a first approximation we might assume that the population was constant in the period. Thus, the 150000 above refers to the average population over the period.

But if the 2648 deaths were accumulated over a period of two years, the mortality would be half of what we just computed, because each of the 150000 people had two years in which to encounter death. This leads to the concept of person-years, also called risk time, time at risk, exposed time or even just exposure. The person-years is the population size multiplied by the length of the observation interval. So, since the population size is 150000 and the observation period is 2 years, the number of person-years is 300000 and the mortality rate (per 1000 person-years) is:

2648/300
## [1] 8.826667

8.83 deaths per 1000 person-years. Note that the rate is a scaled quantity: it has a unit of measurement, in this example per 1000 person-years. We might equally well have given the mortality rate as 88.3 per 10000 person-years. This quantity is called the empirical mortality rate. The theoretical counterpart is the probability of dying in a small interval, relative to the length of the interval. This is usually denoted by the Greek letter \(\lambda\):

\[\lambda=P\{\text{death in } (t,t+dt) \mid \text{alive at } t\}/dt\]

When we say small interval, we are approaching the formal definition, which is the limit of this quantity as dt tends to 0. The vertical bar denotes conditional probability (given). Note that the theoretical rate is also a scaled quantity; it is measured in units of \(time^{-1}\) (deaths per person-time) because the denominator dt is measured in time units. As in the case of prevalence, there is an uncertainty associated with the estimate of the theoretical mortality rate, which again we can quantify in a confidence interval. The data used to estimate \(\lambda\) are D = number of deaths and Y = amount of follow-up time. In order to estimate \(\lambda\) we can fit a Poisson model (other examples in block 3):

D <- 2648
Y <- 300000
mm <- glm(D~1,offset=log(Y), family=poisson)
round(ci.pred(mm,newdata=data.frame(Y=1000)),3)
##   Estimate  2.5% 97.5%
## 1    8.827 8.497 9.169

The observed outcome is D, the number of deaths, and the amount of follow-up time (Y) is specified in the offset argument. The mortality rate is described through a model for the number of events. The model is specified by ~1, which means that the mortality rate is constant over all observed units, specifically that the number of deaths is proportional to the number of person-years supplied in the offset argument. The offset enters on the log scale because the Poisson model uses a log link. The result is an estimate of the mortality rate per 1000 person-years.
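
Equivalently, the rate and its confidence interval can be read off the exponentiated intercept of the fitted model with ci.exp from the Epi package (a sketch using the model mm above; multiplying by 1000 re-expresses the rate per 1000 person-years):

round(ci.exp(mm) * 1000, 3)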

2.3.1 Basic data structure to estimate incidence rates (more on block 4!)

Usually data are not aggregated but instead come from individual subjects that are observed for a period of time (follow-up) (see: cohort studies, block 2). The simplest example is following persons for the occurrence of death. The basic requirements for recordings in a follow-up study are:

- date of entry into the study (the first date a person is known to be alive and at risk of having the event)
- date of exit from the study (the last date a person is known to be alive and at risk of having the event, or the date of death)
- the status of the person at the exit date: with or without the event.

It is assumed that the event terminates the follow-up, so we are implicitly referring to events like death or diagnosis of a chronic disease, where a person ceases to be at risk of the event if it occurs. More complex data structures and methods allow for recurrent events or competing risk (some basic ideas in block 4).

In the Epi package there is the dataset DMlate:

data(DMlate)
str(DMlate)
## 'data.frame':    10000 obs. of  7 variables:
##  $ sex  : Factor w/ 2 levels "M","F": 2 1 2 2 1 2 1 1 2 1 ...
##  $ dobth: num  1940 1939 1918 1965 1933 ...
##  $ dodm : num  1999 2003 2005 2009 2009 ...
##  $ dodth: num  NA NA NA NA NA ...
##  $ dooad: num  NA 2007 NA NA NA ...
##  $ doins: num  NA NA NA NA NA NA NA NA NA NA ...
##  $ dox  : num  2010 2010 2010 2010 2010 ...
head(DMlate)
##        sex    dobth     dodm    dodth    dooad doins      dox
## 50185    F 1940.256 1998.917       NA       NA    NA 2009.997
## 307563   M 1939.218 2003.309       NA 2007.446    NA 2009.997
## 294104   F 1918.301 2004.552       NA       NA    NA 2009.997
## 336439   F 1965.225 2009.261       NA       NA    NA 2009.997
## 245651   M 1932.877 2008.653       NA       NA    NA 2009.997
## 216824   F 1927.870 2007.886 2009.923       NA    NA 2009.923

This dataset represents follow up of a random sample of 10000 diabetes patients in Denmark diagnosed after 1995, followed till the end of 2009. Random noise has been added to all dates in order to make persons untraceable. Each line represents the follow up of one person.

sex: a factor with levels M F

dobth: Date of birth

dodm: Date of inclusion in the register

dodth: Date of death

dooad: Date of 2nd prescription of OAD

doins: Date of 2nd insulin prescription

dox: Date of exit from follow-up.

All dates are given in fractions of years, so 1998.000 corresponds to 1 January 1998 and 1998.997 to 31 December 1998.

All dates are randomly perturbed by a small amount, so no real person has any of the combinations of the 6 dates in the dataset. But results derived from the data will be quite close to those that would be obtained if the entire ‘real’ diabetes register were used. The overall number of deaths, follow-up time and mortality rates by sex can be shown using the stat.table function:

stat.table(index = sex,
           contents = list(D = sum(!is.na(dodth)),
                           Y = sum(dox - dodm),
                           rate = ratio(!is.na(dodth), dox - dodm, 1000)),
           margin = TRUE, data = DMlate)
##  --------------------------------- 
##  sex           D        Y    rate  
##  --------------------------------- 
##  M       1345.00 27614.21   48.71  
##  F       1158.00 26659.05   43.44  
##                                    
##  Total   2503.00 54273.27   46.12  
##  ---------------------------------

The overall mortality rate is 46.1 deaths per 1000 person-years. In making this calculation we have already implicitly imposed a statistical model on the data, namely the model that assumes a constant mortality rate for all persons during the study period. The individual entries are from a model that assumes that mortality rates only differ by sex. While this model is clearly wrong for almost any purpose, it serves as an essential building block in the derivation of more realistic models (see Block 4).
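
The same overall figure can be reproduced with the constant-rate Poisson model of the previous section; a minimal sketch using the totals from the table above:

D <- sum(!is.na(DMlate$dodth))        # total number of deaths
Y <- sum(DMlate$dox - DMlate$dodm)    # total follow-up time (person-years)
m.all <- glm(D ~ 1, offset = log(Y), family = poisson)
round(ci.pred(m.all, newdata = data.frame(Y = 1000)), 2)  # deaths per 1000 person-years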

2.3.2 Different times …

Follow-up data basically consist of a time of entry, a time of exit and an indication of the status at exit. Note that time comes in two guises: one is the risk time (or exposure) and the other is the time-scale. The risk time is “how long have you been alive” (i.e. exposed to the risk of death). The time-scale is “when were you alive”, meaning for example age or calendar time. The risk time is part of the outcome measurement (it is the denominator of the rate), whereas the time-scale is an explanatory variable. We will usually be interested in how rates vary along some time-scale, age or calendar time for example, so in most analyses we will use both risk time and time-scales.

2.4 References

Centers for Disease Control and Prevention. 2006. Principles of Epidemiology in Public Health Practice: An Introduction to Applied Epidemiology and Biostatistics. Atlanta, Georgia: Centers for Disease Control and Prevention.

Feychting, M, B Osterlund, and A Ahlbom. 1998. “Reduced Cancer Incidence Among the Blind.” Epidemiology 9: 490-94.

Carstensen, B. 2021. Epidemiology with R. Oxford: Oxford University Press.

2.5 Exercises

2.5.1 Exercise (a)

The figure below shows the follow-up experience of members of a small study cohort between 1 January 2004 and 30 June 2009, from entry into follow-up until death (black filled circles indicate death due to cancer and empty circles indicate death from other causes). Follow-up until the occurrence of cancer C is shown with a broken line. For those subjects contracting cancer C, follow-up after diagnosis is shown with a solid line.

Looking at the plot, we will calculate the values of the incidence rate of the disease and of various mortality measures:

  1. What is the incidence rate (per 100 pyears) of cancer C during the period from 1 Jan 2004 to 31 Dec 2008?

  2. What is the mortality rate from cancer C during the same period?

  3. What is the mortality rate from all causes during the same period?

  4. What is the prevalence of C on 30 September 2006, and on 31 December 2008?

2.5.2 Exercise (b)

The table below gives the size (in 1000s) of the male population in Finland aged 0-14 years (the age range of “childhood” in pediatrics) on 31 December of each year from 1991 to 2000.

The following numbers of cases describe the incidence of and mortality from acute leukaemia in this population for two calendar periods: the 5 years from 1993 to 1997, and the year 1999 only.

  1. Calculate the incidence rates of acute leukaemia in this population for the two periods. (Key: person-years could be obtained by the mid-population principle, i.e. using population in 1995 for the first period or taking an average between years)
  2. Calculate similarly the mortality rates of leukaemia.

3 Standardized rates

3.1 Example of standardized rates calculations

The following example does not require any specific R library; we will simply do the calculations “by hand”.

Age-specific data on the incidence of colon cancer in male and female populations of Finland during 1999 are given in the following table:

Let’s calculate the following summary measures:

  1. the crude incidence rate in both populations and the rate ratio (an anticipation of association measures…): males vs. females;
  2. the age-standardized rates and their ratio using the male population as the standard, i.e. the “indirect method”;
  3. the age-standardized rates and their ratio using the World Standard Population, i.e. the “direct method”.

We will compare and comment on the results obtained in items 1 to 3.

World Standard Population:

Let’s do it:

First of all, we create the data matrix:

M <- matrix(c(10, 1157, 46.0, 0.9, 22, 1109, 41.9, 2.0, 0.44,
              76, 809, 32.0, 9.4, 68, 786, 29.7, 8.6, 1.09,
              305, 455, 18.0, 67, 288, 524, 19.8, 55, 1.22,
              201, 102, 4.0, 196, 354, 229, 8.6, 155, 1.27),
            nrow = 4, byrow = TRUE)
M <- data.frame(M)
names(M) <- c("mca","mpy","mp","mr",
              "fca","fpy","fp","fr","rr")
M
##   mca  mpy mp    mr fca  fpy   fp    fr   rr
## 1  10 1157 46   0.9  22 1109 41.9   2.0 0.44
## 2  76  809 32   9.4  68  786 29.7   8.6 1.09
## 3 305  455 18  67.0 288  524 19.8  55.0 1.22
## 4 201  102  4 196.0 354  229  8.6 155.0 1.27

Then, we compute the crude incidence rates:

rates <- with( M, c( sum(mca)/sum(mpy)*100,
            sum(fca)/sum(fpy)*100 ) )
rates[3] <- rates[1]/rates[2]
names(rates) <- c("M rate","F rate","M/F RR")
round( rates, 2 )
## M rate F rate M/F RR 
##  23.46  27.64   0.85

Now we compute the age-standardized rates and their ratio using the male population as the standard:

wm <- with( M, mpy/sum(mpy) )
wm
## [1] 0.45858105 0.32065002 0.18034086 0.04042806
rates <-with( M, c( sum(mca/mpy*wm)*100,
            sum(fca/fpy*wm)*100 ) )
rates[3] <- rates[1]/rates[2]
names(rates) <- c("M rate","F rate","M/F RR")
round( rates, 2 )
## M rate F rate M/F RR 
##  23.46  19.85   1.18

And now we compute the age-standardized rates and their ratio using the World Standard Population:

WSP <- c(120,100,90,90,80,80,60,60,60,60,50,40,40,30,20,10,5,3,2)
WSP
##  [1] 120 100  90  90  80  80  60  60  60  60  50  40  40  30  20  10   5   3   2
wt <- sum(WSP[1:7])
wt[2] <- sum(WSP[8:11])
wt[3] <- sum(WSP[12:15])
wt[4] <- sum(WSP[16:19])
wt <- wt/sum(wt)
wt
## [1] 0.62 0.23 0.13 0.02
rates <- with( M, c( sum(mca/mpy*wt)*100,
            sum(fca/fpy*wt)*100 ) )
rates[3] <- rates[1]/rates[2]
names(rates) <- c("M rate","F rate","M/F RR")
round( rates, 2 )
## M rate F rate M/F RR 
##  15.35  13.46   1.14

We can notice that, when using the original (crude) data from the Finnish population, males seem to have a lower risk of colon cancer than females. But this result is driven by the confounding effect of age, since the female population is older than the male population. When comparing rates adjusted for the differences in age structure, the rate for males is instead higher than for females.
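
The same directly standardized rates can also be computed with weighted.mean, which is numerically equivalent to the sums used above (a sketch reusing the weights wt already defined):

rates.wm <- with(M, c(M = weighted.mean(mca / mpy, wt),
                      F = weighted.mean(fca / fpy, wt))) * 100
round(rates.wm, 2)   # should reproduce 15.35 and 13.46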

If you want to see other examples using specific R libraries check the following website: https://epirhandbook.com/en/standardised-rates.html

3.2 Exercises

Standardized rates could be considered as adjusted summary estimators, controlling for example for the confounding related to different age structures in the populations under study. These estimators are weighted averages of stratum-specific estimators.

We have seen the method of (direct) standardization by using an external standard population (CMR, comparative mortality ratio) and the (indirect) method of the standardized morbidity ratio estimation (Standardized Mortality ratio, SMR). These approaches are useful in simple descriptive analyses.

As we have seen, to compute the crude incidence rate we take the ratio of the total number of new cases to the total person-years; if the data are split into age groups, we can sum all the age-specific numbers of cases in the numerator and then divide by the sum of all the age-specific person-years.

In order to compute the age standardized rates (using the direct method) we apply instead the following formula:

\[ASR=\frac{\sum_{k=1}^{K} \text{weight}_{k}\cdot \text{rate}_{k}}{\sum_{k=1}^{K} \text{weight}_{k}}\]

Load the dataset for the exercise: it contains the number of cases (D) and the age-specific incidence rates (in cases per 100,000 person-years) from the Danish Cancer Register for the period 1983-87 for colon cancer, rectum cancer and lung cancer, by sex.

library(here)
## here() starts at /Users/PaoloMacbook/Documents/PHD/esercitazioni_stat_learn_epi
std.rates <- read.delim(here("datasets","std-rates.txt"))
head(std.rates)
##   age sex   typ  D rate
## 1   0   M Colon  0 0.00
## 2   5   M Colon  2 0.25
## 3  10   M Colon  0 0.00
## 4  15   M Colon  3 0.30
## 5  20   M Colon  4 0.39
## 6  25   M Colon 13 1.36
std <- std.rates
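
Purely as a template for the exercises below (not a worked solution), the ASR formula above can be wrapped in a small helper function that takes age-specific rates and the corresponding standard-population weights:

asr <- function(rate, weight) {
  # directly age-standardized rate: weighted average of the age-specific rates
  sum(weight * rate) / sum(weight)
}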

3.2.1 Exercise (c)

Calculate the crude rates for each sex and site, given that the number of male person-years is 2521177 and the number of female person-years is 2596061.

3.2.2 Exercise (d)

Calculate the standardized rates, standardized to the world standard population.

4 Confidence Intervals - recap

4.1 CI for proportions (prevalence/cumulative incidence)

Useful libraries:

library(DescTools)
library(PropCIs)
library(epiR)
library(Epi)
library(popEpi)

The output of the binom.test function includes a confidence interval for a proportion, and the proportion of “success” as a decimal number.
The binom.test function in the base stats package uses the Clopper-Pearson (1934) method for confidence intervals.

This guarantees that the confidence level is at least conf.level, but in general does not give the shortest-length confidence intervals.

Example: as part of a survey, a researcher asks a class of 21 students if they have ever studied statistics before.

The following data are collected:

Experience count: Yes = 7; No = 14

Note that when calculating confidence intervals for a binomial variable, one level of the nominal variable is chosen to be the “success” level.

This is an arbitrary decision, but you should remember that the confidence interval is reported for the proportion of “success” responses.

binom.test(7, 21,0.5,alternative="two.sided",conf.level=0.95)
## 
##  Exact binomial test
## 
## data:  7 and 21
## number of successes = 7, number of trials = 21, p-value = 0.1892
## alternative hypothesis: true probability of success is not equal to 0.5
## 95 percent confidence interval:
##  0.1458769 0.5696755
## sample estimates:
## probability of success 
##              0.3333333

The BinomCI function in the DescTools package has several methods for calculating confidence intervals for a binomial proportion.

[see from the help of the function: method = c(“wilson”, “wald”, “agresti-coull”, “jeffreys”, “modified wilson”, “wilsoncc”,“modified jeffreys”, “clopper-pearson”, “arcsine”, “logit”, “witting”, “pratt”)]

BinomCI(7, 21,
        conf.level = 0.95,
        method = "clopper-pearson")
##            est    lwr.ci    upr.ci
## [1,] 0.3333333 0.1458769 0.5696755

The BinomCI function can also produce the confidence intervals for “success” and “failure” in one step.

observed = c(7, 14)
total = sum(observed)

BinomCI(observed, total,
        conf.level = 0.95,
        method = "clopper-pearson")
##           est    lwr.ci    upr.ci
## x.1 0.3333333 0.1458769 0.5696755
## x.2 0.6666667 0.4303245 0.8541231

The PropCIs package also has functions for calculating confidence intervals for a binomial proportion. The exactci function uses the Clopper-Pearson exact method.

exactci(7, 21, conf.level=0.95)
## 
## 
## 
## data:  
## 
## 95 percent confidence interval:
##  0.1458769 0.5696755

The blakerci function uses the Blaker exact method.

blakerci(7, 21,
         conf.level=0.95)
## 
## 
## 
## data:  
## 
## 95 percent confidence interval:
##  0.1523669 0.5510455

4.2 CI for the incidence rate

As already introduced in the descriptive epidemiology section, when the incidence rate is assumed constant we can assume a Poisson likelihood, which leads to the following estimate and standard error (the observed outcome is D, the number of events, and the amount of follow-up time is Y):

\[\hat{\lambda}=D/Y\] \[\mathrm{s.e.}\{\log(\hat{\lambda})\}=1/\sqrt{D}\]

From this formula, and using the normal approximation, we can also derive approximate 95% confidence intervals for the estimated rate. A summary of the methods used for the confidence interval calculations in the epi.conf function of the epiR library is available in the help of the function (see later in this section).

4.3 CI for the mean (numerical variable/outcome)

Let’s now do a simple simulation to demonstrate the computation of the confidence interval for the mean:

nSim <- 10000                        # number of simulated samples
n    <- 10                           # size of each sample
ciArray <- array(0, dim = c(nSim, 2))

for (i in 1:nSim){
  x <- rnorm(n, mean = 10, sd = 1)                                  # sample from N(10, 1)
  ciArray[i,] <- c(t.test(x)$conf.int[1], t.test(x)$conf.int[2])    # store the 95% CI
}

head(ciArray)
##          [,1]      [,2]
## [1,] 8.959856 10.208932
## [2,] 9.514130 10.228671
## [3,] 8.933576  9.874017
## [4,] 9.335909 10.699682
## [5,] 9.300846 10.637837
## [6,] 9.390400 11.075759

What proportion of these CIs include the “true” parameter?

mean(1*((ciArray[,1] < 10) & (ciArray[,2] > 10)))
## [1] 0.9546

By default with the t.test function in R we obtain a 95% confidence interval for the mean.

Remember: confidence intervals (in the frequentist approach) make sense in the long run …

4.4 The epi.conf function

As we have already seen, the function epi.conf estimates confidence intervals for means, proportions, incidence rates, etc. in one function. Let’s look at a couple of examples:

library(epiR)

4.4.1 Estimate CI for a mean

Method mean.single requires a vector as input.

dat <- rnorm(n = 100, mean = 0, sd = 1)
epi.conf(dat, ctype = "mean.single")
##          est        se      lower    upper
## 1 0.08172988 0.1040925 -0.1248122 0.288272

4.4.2 Estimate CI for the difference between two independent groups means

In statistics we distinguish between independent samples and paired samples. Roughly, independent samples are composed of different statistical units (for example different subjects) that are randomly selected and have “nothing” in common. Paired data refer instead to some correlation between the statistical units, for example repeated measures on the same subjects, or some induced correlation (we will introduce matching procedures later in the course, block 3).

Very often we want to compare the mean values of a numerical outcome of interest between two groups. In this case one option is to compute the mean difference with its 95% CI and to check whether the interval contains zero or not. Why? Because if the 95% CI contains zero, we cannot conclude in favor of a statistically significant difference; instead we conclude that the data offer no evidence of coming from different populations. Remember that this procedure also depends heavily on the variability around the means and on the sample size!

Method mean.unpaired requires a two-column data frame; the first column defining the groups must be a factor.

group <- c(rep("A", times = 5), rep("B", times = 5))
val = round(c(rnorm(n = 5, mean = 10, sd = 5), rnorm(n = 5, mean = 7, sd = 5)), digits = 0)
dat <- data.frame(group = group, val = val)
epi.conf(dat, ctype = "mean.unpaired")
##   est       se     lower    upper
## 1 3.2 3.307567 -4.427263 10.82726
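
As a cross-check (assuming, on our part, that epi.conf uses the pooled-variance interval for the mean.unpaired method), the classical two-sample t interval should be very similar:

t.test(val ~ group, data = dat, var.equal = TRUE)$conf.int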

4.4.3 Estimate CI for the difference between two paired groups means

Two paired samples: systolic blood pressure levels were measured in 16 middle-aged men before and after a standard exercise test. The mean rise in systolic blood pressure was 6.6 mmHg. The standard deviation of the difference was 6.0 mm Hg. The standard error of the mean difference was 1.49 mm Hg.

before <- c(148,142,136,134,138,140,132,144,128,170,162,150,138,154,126,116)
after <- c(152,152,134,148,144,136,144,150,146,174,162,162,146,156,132,126)
dat <- data.frame(before, after)
epi.conf(dat, ctype = "mean.paired", conf.level = 0.95)
##     est       se    lower    upper
## 1 6.625 1.491294 3.446382 9.803618

The 95% confidence interval for the population value of the mean systolic blood pressure increase after standard exercise was 3.4 to 9.8 mm Hg. Is there a significant effect of the exercise test? What do you think?

4.4.4 Estimate CI for a proportion (again :-) )

Out of 263 patients giving their views on the use of personal computers in general medical practice, 81 of them thought that the privacy of their medical file had been reduced.

pos <- 81
neg <- (263 - 81)
dat <- as.matrix(cbind(pos, neg))
round(epi.conf(dat, ctype = "prop.single"), digits = 3)
##       est    se lower upper
## pos 0.308 0.028 0.255 0.366

The 95% confidence interval for the population value of the proportion of patients thinking their privacy was reduced was from 0.255 to 0.366.

4.4.5 Estimate CI for the difference between two independent proportions

Adverse effects in 85 patients receiving either terbinafine or placebo treatment for dermatophyte onychomycosis are reported. Out of 56 patients receiving terbinafine, 5 experienced adverse effects. Out of 29 patients receiving placebo, none experienced adverse effects.

grp1 <- matrix(cbind(5, 51), ncol = 2)
grp2 <- matrix(cbind(0, 29), ncol = 2)
dat <- as.matrix(cbind(grp1, grp2))
round(epi.conf(dat, ctype = "prop.unpaired"), digits = 3)
##     est    se  lower upper
## 1 0.089 0.038 -0.038 0.193

The 95% confidence interval for the difference between the two groups is from -0.038 to +0.193. Is there a significantly different rate of adverse effects for terbinafine versus placebo treatment? What do you think?

4.4.6 Estimate CI for the difference between two paired proportions

In a reliability exercise, 41 patients were randomly selected from those who had undergone a stress test. The 41 sets of images were classified as normal or not by the laboratory and, independently, by clinical investigators from different centres. Of the 19 samples identified as ischaemic by clinical investigators 5 were identified as normal by the laboratory. Of the 22 samples identified as normal by clinical investigators 0 were identified as ischaemic by the laboratory.

dat <- as.matrix(cbind(14, 5, 0, 22))
round(epi.conf(dat, ctype = "prop.paired", conf.level = 0.95), digits = 3)
##     est    se lower upper
## 1 0.122 0.051 0.011 0.226

The 95% confidence interval for the difference in evaluations is 0.011 to 0.226, or approximately +1% to +23%. Is there a significantly different evaluation between the laboratory and the clinical investigators? What do you think?

4.4.7 Estimate CI for a prevalence

A herd of 1000 cattle were tested for brucellosis. Four out of 200 samples tested returned a positive result. Assuming 100% test sensitivity and specificity, what is the estimated prevalence of brucellosis in this group of animals?

pos <- 4; pop <- 200
dat <- as.matrix(cbind(pos, pop))
epi.conf(dat, ctype = "prevalence", method = "exact", N = 1000, 
   design = 1, conf.level = 0.95) * 100
##   est     lower    upper
## 1   2 0.5475566 5.041361

The estimated prevalence of brucellosis in this herd is 2 cases per 100 cattle (95% CI 0.55 – 5.0 cases per 100 cattle).

4.5 References

Clopper, C.; Pearson, E. S. (1934). “The use of confidence or fiducial limits illustrated in the case of the binomial”. Biometrika. 26 (4): 404-413.

Blaker, H. (2000). Confidence curves and improved exact confidence intervals for discrete distributions, Canadian Journal of Statistics 28 (4), 783-798.

5 Diagnostic tests

Load the required library:

library(epiR)

We will use an example of data from the stress test to detect the presence of coronary disease.

dat           <- as.table(matrix(c(815,115,208,327), nrow = 2, byrow = TRUE))
colnames(dat) <- c("Dis+","Dis-")
rownames(dat) <- c("Test+","Test-")
dat
##       Dis+ Dis-
## Test+  815  115
## Test-  208  327

Let’s estimate the parameters of interest related to the performance of the stress test in detecting the disease. The following function computes true and apparent prevalence (apparent prevalence is, in this case, the number of subjects testing positive by the diagnostic test divided by the total number of subjects tested; true prevalence is the actual number of diseased subjects divided by the number of individuals in the population), sensitivity, specificity, positive and negative predictive values, and positive and negative likelihood ratios from count data provided in a 2 by 2 table. Exact binomial confidence limits are calculated for test sensitivity, specificity, and positive and negative predictive value. The positive likelihood ratio is calculated as \[LR+ = \frac{Sensitivity}{1-Specificity}\] It represents the probability of a person who has the disease testing positive divided by the probability of a person who does not have the disease testing positive. The negative likelihood ratio is calculated as \[LR- = \frac{1-Sensitivity}{Specificity}\] It represents the probability of a person who has the disease testing negative divided by the probability of a person who does not have the disease testing negative.

rval <- epi.tests(dat, conf.level = 0.95)
print(rval)
##           Outcome +    Outcome -      Total
## Test +          815          115        930
## Test -          208          327        535
## Total          1023          442       1465
## 
## Point estimates and 95% CIs:
## --------------------------------------------------------------
## Apparent prevalence *                  0.63 (0.61, 0.66)
## True prevalence *                      0.70 (0.67, 0.72)
## Sensitivity *                          0.80 (0.77, 0.82)
## Specificity *                          0.74 (0.70, 0.78)
## Positive predictive value *            0.88 (0.85, 0.90)
## Negative predictive value *            0.61 (0.57, 0.65)
## Positive likelihood ratio              3.06 (2.61, 3.59)
## Negative likelihood ratio              0.27 (0.24, 0.31)
## False T+ proportion for true D- *      0.26 (0.22, 0.30)
## False T- proportion for true D+ *      0.20 (0.18, 0.23)
## False T+ proportion for T+ *           0.12 (0.10, 0.15)
## False T- proportion for T- *           0.39 (0.35, 0.43)
## Correctly classified proportion *      0.78 (0.76, 0.80)
## --------------------------------------------------------------
## * Exact CIs
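
As a by-hand cross-check of the point estimates above (a sketch; epi.tests remains the convenient way to also obtain the confidence limits), the definitions give:

tp <- 815; fp <- 115; fn <- 208; tn <- 327
se  <- tp / (tp + fn)        # sensitivity
sp  <- tn / (tn + fp)        # specificity
ppv <- tp / (tp + fp)        # positive predictive value
npv <- tn / (tn + fn)        # negative predictive value
round(c(sens = se, spec = sp, ppv = ppv, npv = npv,
        LR.pos = se / (1 - sp), LR.neg = (1 - se) / sp), 2)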

5.1 Comparing two binary diagnostic tests [Optional topic!!]

The comparison of the performance of two binary diagnostic tests is an important topic in clinical research. The most frequent type of sample design to compare two binary diagnostic tests is the paired design. This design consists of applying the two binary diagnostic tests to all of the individuals in a random sample, where the disease status of each individual is known through the application of a gold standard.

We will use for this example an R program written by Roldán-Nofuentes in 2020, called compbdt. You can download it from : https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-020-00988-y#MOESM1

The objective is to estimate the sensitivity and the specificity, the likelihood ratios and the predictive values of each diagnostic test applying the confidence intervals with the best asymptotic performance.

The program compares the sensitivities and specificities of the two diagnostic tests simultaneously, as well as the likelihood ratios and the predictive values, applying the global hypothesis tests with the best performance in terms of type I error and power.

When the global hypothesis test is significant, the causes of the significance are investigated solving the individual hypothesis tests and applying the multiple comparison method of Holm.

The most optimal confidence intervals are also calculated for the difference or ratio between the respective parameters.

Based on the data observed in the sample, the program also estimates the probability of making a type II error if the null hypothesis is not rejected, or estimates the power if the alternative hypothesis is accepted. We will review some basic ideas about statistical testing procedures in Block 2, related to sample size determination.

Type of error            \(H_{0}\) TRUE               \(H_{0}\) FALSE
Reject \(H_{0}\)            Type I error (\(\alpha\))     ok
Do not reject \(H_{0}\)     ok                            Type II error (\(\beta\))

We define the power of a statistical test as the probability of not making a Type II error, i.e. \(1-\beta\). The estimation of the probability of making a type II error allows the researcher to judge the reliability of the null hypothesis when it is not rejected.

We will apply this function to a real example on the diagnosis of coronary artery disease.

The diagnosis of coronary artery disease (CAD) was investigated using as diagnostic tests the exercise test (Test 1) and the clinical history of chest pain (Test 2), and the coronary angiography as the gold standard. The following table shows the frequencies obtained by applying three medical tests to a sample of 871 individuals.

We source the R program that we will use to do the calculations:

source(here('compbdt.R'))

and we run the program: the inputs are the frequencies of diseased (and non-diseased) patients for which Test 1 gives a positive or negative result and Test 2 gives a positive or negative result.

pp <- compbdt(473, 29, 81, 25, 22, 46, 44, 151)
## 
##           PREVALENCE OF THE DISEASE 
## 
## Estimated prevalence of the disease is  69.805 % and its standard error is 0.016 
## 
## 95 % confidence interval for the prevalence of the disease is ( 66.681 % ;  72.768 %) 
## 
## 
##           COMPARISON OF THE ACCURACIES (SENSITIVITIES AND SPECIFICITIES) 
## 
## Estimated sensitivity of Test 1 is  82.566 % and its standard error is 0.015 
## 
## 95 % confidence interval for the sensitivity of Test 1 is ( 79.363 % ;  85.389 %) 
## 
## Estimated sensitivity of Test 2 is  91.118 % and its standard error is 0.012 
## 
## 95 % confidence interval for the sensitivity of Test 1 is ( 88.61 % ;  93.148 %) 
## 
## Estimated specificity of Test 1 is  74.144 % and its standard error is 0.027 
## 
## 95 % confidence interval for the specificity of Test 1 is ( 68.557 % ;  79.087 %) 
## 
## Estimated specificity of Test 2 is  74.905 % and its standard error is 0.027 
## 
## 95 % confidence interval for the specificity of Test 1 is ( 69.358 % ;  79.787 %) 
## 
## 
## Wald test statistic for the global hypothesis test H0: (Se1 = Se2 and Sp1 = Sp2) is  25.662  
## 
##   Global p-value is  0  
## 
##   Applying the global Wald test (to an alpha error of 5 %), we reject the hypothesis H0: (Se1 = Se2 and Sp1 = Sp2) 
## 
##   Estimated power (to an alpha error of 5 %) is 99.8 %  
## 
##   Investigation of the causes of significance: 
## 
##    McNemar test statistic (with cc) for H0: Se1 = Se2 is  23.645  and the two-sided p-value is  0  
## 
##    McNemar test statistic (with cc) for H0: Sp1 = Sp2 is  0.011  and the two-sided p-value is  0.991  
## 
##    Applying the Holm method (to an alpha error of 5 %), we reject the hypothesis H0: Se1 = Se2 and we do not reject the hypothesis H0: Sp1 = Sp2 
## 
##    Sensitivity of Test 2 is significantly greater than sensitivity of Test 1 
## 
##     95 % confidence interval for the difference Se2 - Se1 is ( 5.192 % ;  11.857 %) 
## 
## 
## 
##           COMPARISON OF THE LIKELIHOOD RATIOS 
## 
## Estimated positive LR of Test 1 is  3.193  and its standard error is 0.339 
## 
## 95 % confidence interval for the positive LR of Test 1 is ( 2.61  ;  3.952 ) 
## 
## Estimated positive LR of Test 2 is  3.631  and its standard error is 0.39 
## 
## 95 % confidence interval for the positive LR of Test 1 is ( 2.962  ;  4.505 ) 
## 
## Estimated negative LR of Test 1 is  0.235  and its standard error is 0.022 
## 
## 95 % confidence interval for the negative LR of Test 1 is ( 0.195  ;  0.283 ) 
## 
## Estimated negative LR of Test 2 is  0.119  and its standard error is 0.016 
## 
## 95 % confidence interval for the negative LR of Test 2 is ( 0.09  ;  0.153 ) 
## 
## 
## Test statistic for the global hypothesis test H0: (PLR1 = PLR2 and NLR1 = NLR2) is  23.438  
## 
##   Global p-value is  0  
## 
##   Applying the global hypothesis test (to an alpha error of 5 %), we reject the hypothesis H0: (PLR1 = PLR2 and NLR1 = NLR2) 
## 
##   Estimated power (to an alpha error of 5 %) is 99.78 %  
## 
##   Investigation of the causes of significance: 
## 
##    Test statistic for H0: PLR1 = PLR2 is  0.898  and the two-sided p-value is  0.369  
## 
##    Test statistic for H0: NLR1 = NLR2 is  4.663  and the two-sided p-value is  0  
## 
##    Applying the Holm method (to an alpha error of 5 %), we do not reject the hypothesis H0: PLR1 = PLR2 and we reject the hypothesis H0: NLR1 = NLR2 
## 
##    Negative likelihood ratio of Test 1 is significantly greater than negative likelihood ratio of Test 2 
## 
##     95 % confidence interval for the ratio NLR1 / NLR2 is ( 1.412  ;  2.554 ) 
## 
## 
## 
##           COMPARISON OF THE PREDICTIVE VALUES 
## 
## Estimated positive PV of Test 1 is  88.07 % and its standard error is 0.014 
## 
## 95 % confidence interval for the positive PV of Test 1 is ( 85.17 % ;  90.498 %) 
## 
## Estimated positive PV of Test 2 is  89.355 % and its standard error is 0.012 
## 
## 95 % confidence interval for the positive PV of Test 2 is ( 86.698 % ;  91.562 %) 
## 
## Estimated negative PV of Test 1 is  64.784 % and its standard error is 0.028 
## 
## 95 % confidence interval for the negative PV of Test 1 is ( 59.246 % ;  69.976 %) 
## 
## Estimated negative PV of Test 2 is  78.486 % and its standard error is 0.026 
## 
## 95 % confidence interval for the negative PV of Test 2 is ( 73.024 % ;  83.151 %) 
## 
## 
## Wald test statistic for the global hypothesis test H0: (PPV1 = PPV2 and NPV1 = NPV2) is  25.944  
## 
##   Global p-value is  0  
## 
##   Applying the global hypothesis test (to an alpha error of 5 %), we reject the hypothesis H0: (PPV1 = PPV2 and NPV1 = NPV2) 
## 
##   Estimated power (to an alpha error of 5 %) is 99.26 %  
## 
##   Investigation of the causes of significance: 
## 
##    Weighted generalized score statistic for H0: PPV1 = PPV2 is  0.807  and the two-sided p-value is  0.369  
## 
##    Weighted generalized score statistic for H0: NPV1 = NPV2 is  22.502  and the two-sided p-value is  0  
## 
##    Applying the Holm method (to an alpha error of 5 %), we do not reject the hypothesis H0: PPV1 = PPV2 and we reject the hypothesis H0: NPV1 = NPV2 
## 
##    Negative PV of Test 2 is significantly greater than negative PV of Test 1 
## 
##     95 % confidence interval for the difference NPV2 - NPV1 is ( 8.041 % ;  19.363 %)

You will also find three text files in the working directory that you are currently using. The results obtained comparing the sensitivities and specificities are recorded in the file “Results_Comparison_Accuracies.txt”, those obtained when comparing the LRs are recorded in the file “Results_Comparison_LRs.txt”, and those obtained when comparing the PVs are recorded in the file “Results_Comparison_PVs.txt”. In R, an alternative to “compbdt” is the DTComPair package. The DTComPair package estimates the same parameters as “compbdt” and compares the parameters individually, i.e. solving each hypothesis test at an α error.

5.2 References

Leisenring, W., Alonzo, T., and Pepe, M. S. (2000). Comparisons of predictive values of binary medical diagnostic tests for paired designs. Biometrics, 56(2):345-51.

Kosinski, A.S. (2013). A weighted generalized score statistic for comparison of predictive values of diagnostic tests. Stat Med, 32(6):964-77.

Moskowitz, C.S., and Pepe, M.S. (2006). Comparing the predictive values of diagnostic tests: sample size and analysis for paired study designs. Clin Trials, 3(3):272-9.

Roldán-Nofuentes, J.A. Compbdt: an R program to compare two binary diagnostic tests subject to a paired design. BMC Med Res Methodol 20, 143 (2020). https://doi.org/10.1186/s12874-020-00988-y

5.3 Exercises

Let’s load the libraries:

library(epiR)
library(tidyverse)
library(OptimalCutpoints)

5.3.1 CK dataset

The CK file reports the creatine kinase (CK, U/L) measurements in patients with unstable angina and acute myocardial infarction, two conditions that are clinically difficult to distinguish. A diagnostic test has been devised such that for CK values > 100 the patient is classified as a case of acute myocardial infarction, and for CK ≤ 100 the patient is classified as a case of unstable angina. Calculate the sensitivity, specificity, PPV and NPV of the diagnostic test. What changes when choosing CK = 80 as the threshold value?

load(here("datasets","CK.Rdata"))

Define the cutoff:

CK$ck.100 <- ifelse(CK$ck>100, 'CK > 100', 'CK <= 100')

Create the table:

pp <- table(CK$ck.100,CK$disease)
pp
##            
##             angina infarto
##   CK <= 100     58       1
##   CK > 100      34      26

Rearrange the table so that the test-positive / disease-positive cell comes first, as required by epi.tests:

dat <- as.table(matrix(c(26,34,1,58), nrow = 2, byrow = TRUE))
colnames(dat) <- c("AMI+","AMI-")
rownames(dat) <- c("Test+","Test-")
dat
##       AMI+ AMI-
## Test+   26   34
## Test-    1   58

Calculate sensitivity, specificity, PPV, NPV of the diagnostic test:

rval <- epi.tests(dat, conf.level = 0.95)
print(rval)
##           Outcome +    Outcome -      Total
## Test +           26           34         60
## Test -            1           58         59
## Total            27           92        119
## 
## Point estimates and 95% CIs:
## --------------------------------------------------------------
## Apparent prevalence *                  0.50 (0.41, 0.60)
## True prevalence *                      0.23 (0.16, 0.31)
## Sensitivity *                          0.96 (0.81, 1.00)
## Specificity *                          0.63 (0.52, 0.73)
## Positive predictive value *            0.43 (0.31, 0.57)
## Negative predictive value *            0.98 (0.91, 1.00)
## Positive likelihood ratio              2.61 (1.98, 3.44)
## Negative likelihood ratio              0.06 (0.01, 0.40)
## False T+ proportion for true D- *      0.37 (0.27, 0.48)
## False T- proportion for true D+ *      0.04 (0.00, 0.19)
## False T+ proportion for T+ *           0.57 (0.43, 0.69)
## False T- proportion for T- *           0.02 (0.00, 0.09)
## Correctly classified proportion *      0.71 (0.62, 0.79)
## --------------------------------------------------------------
## * Exact CIs

Change the cutoff:

CK$ck.80 <- ifelse(CK$ck>80, 'CK > 80', 'CK <= 80') 

Create the table:

pp <- table(CK$ck.80,CK$disease)
pp
##           
##            angina infarto
##   CK <= 80     44       0
##   CK > 80      48      27
dat <- as.table(matrix(c(27,48,0,44), nrow = 2, byrow = TRUE))
colnames(dat) <- c("AMI+","AMI-")
rownames(dat) <- c("Test+","Test-")
dat
##       AMI+ AMI-
## Test+   27   48
## Test-    0   44
rval <- epi.tests(dat, conf.level = 0.95)
print(rval)
##           Outcome +    Outcome -      Total
## Test +           27           48         75
## Test -            0           44         44
## Total            27           92        119
## 
## Point estimates and 95% CIs:
## --------------------------------------------------------------
## Apparent prevalence *                  0.63 (0.54, 0.72)
## True prevalence *                      0.23 (0.16, 0.31)
## Sensitivity *                          1.00 (0.87, 1.00)
## Specificity *                          0.48 (0.37, 0.58)
## Positive predictive value *            0.36 (0.25, 0.48)
## Negative predictive value *            1.00 (0.92, 1.00)
## Positive likelihood ratio              1.92 (1.58, 2.33)
## Negative likelihood ratio              0.00 (0.00, NaN)
## False T+ proportion for true D- *      0.52 (0.42, 0.63)
## False T- proportion for true D+ *      0.00 (0.00, 0.13)
## False T+ proportion for T+ *           0.64 (0.52, 0.75)
## False T- proportion for T- *           0.00 (0.00, 0.08)
## Correctly classified proportion *      0.60 (0.50, 0.69)
## --------------------------------------------------------------
## * Exact CIs

We can observe that sensitivity improves but specificity worsens: lowering the cut-off catches every infarction, at the cost of more false positives among the angina patients.

5.4 Exercise (e)

As part of a prostate cancer screening program, blood samples from 100,000 men have been analyzed looking for a biomarker known to be related to the disease. In total 4,000 tested positive, but fortunately only 800 of these cases were then confirmed by biopsy (a much more reliable examination, but also long and expensive). Of the 96,000 negative screening tests, 100 individuals went on to develop the tumor within 12 months and are therefore to be considered false negatives. Calculate and comment on the sensitivity, specificity, PPV and NPV of the diagnostic test.
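One possible way to lay out these counts in the format used above (disease positives in the first column, test positives in the first row) is sketched below; the object name dat.e is arbitrary and the interpretation of the resulting estimates is left as the exercise.

# 800 true positives, 4000 - 800 = 3200 false positives,
# 100 false negatives, 96000 - 100 = 95900 true negatives
dat.e <- as.table(matrix(c(800, 3200, 100, 95900), nrow = 2, byrow = TRUE))
colnames(dat.e) <- c("Cancer+", "Cancer-")
rownames(dat.e) <- c("Test+", "Test-")
epi.tests(dat.e, conf.level = 0.95)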

6 ROC curves for diagnostic tests

For this example we will use the R package pROC. It builds a ROC curve and returns a roc object, a list of class roc. This object can be printed, plotted, or passed to the functions auc, ci, smooth.roc and coords. Additionally, two roc objects can be compared with the function roc.test. Data can be provided as response, predictor, where the predictor is the numeric (or ordered) level of the evaluated risk factor, and the response encodes the observation class (control or case). The levels argument specifies which response level must be taken as controls (first value) and which as cases (second). It can safely be ignored when the response is encoded as 0 and 1, but it will frequently fail otherwise. By default, the first two values of levels(as.factor(response)) are taken, and the remaining levels are ignored. This means that if your response is coded "control" and "case", the alphabetical default will invert them, so it is safer to specify levels explicitly (as we do below).

library(pROC)

6.1 Subarachnoid hemorrhage data

This dataset summarizes several clinical and one laboratory variable of 113 patients with an aneurysmal subarachnoid hemorrhage. The purpose of the case presented here is to identify patients at risk of poor post-aSAH outcome, as they require specific healthcare management; therefore the clinical test must be highly specific. Detailed results of the study are reported in [1]. We only outline the features relevant to the ROC analysis.

data(aSAH)
head(aSAH)
##    gos6 outcome gender age wfns s100b  ndka
## 29    5    Good Female  42    1  0.13  3.01
## 30    5    Good Female  37    1  0.14  8.54
## 31    5    Good Female  42    1  0.10  8.09
## 32    5    Good Female  27    1  0.04 10.42
## 33    1    Poor Female  42    3  0.13 17.40
## 34    1    Poor   Male  48    2  0.10 12.75
with(aSAH, table(gender, outcome))
##         outcome
## gender   Good Poor
##   Male     22   20
##   Female   50   21

Let’s estimate the ROC curve for the parameter s100b, a biomarker:

roc(aSAH$outcome, aSAH$s100b,levels=c("Good", "Poor"))
## 
## Call:
## roc.default(response = aSAH$outcome, predictor = aSAH$s100b,     levels = c("Good", "Poor"))
## 
## Data: aSAH$s100b in 72 controls (aSAH$outcome Good) < 41 cases (aSAH$outcome Poor).
## Area under the curve: 0.7314

Inspect the ROC plot and report the 95% CI:

roc(aSAH$outcome, aSAH$s100b,percent=TRUE, plot=TRUE, ci=TRUE)

## 
## Call:
## roc.default(response = aSAH$outcome, predictor = aSAH$s100b,     percent = TRUE, ci = TRUE, plot = TRUE)
## 
## Data: aSAH$s100b in 72 controls (aSAH$outcome Good) < 41 cases (aSAH$outcome Poor).
## Area under the curve: 73.14%
## 95% CI: 63.01%-83.26% (DeLong)

We can also inspect the confidence intervals around our estimates, point by point:

rocobj <- plot.roc(aSAH$outcome, aSAH$s100b)
ci.sp.obj <- ci.sp(rocobj, sensitivities=seq(0, 1, .01), boot.n=100)
plot(rocobj)
plot(ci.sp.obj, type="shape", col="blue")

We can also compare the performance of two different biomarkers:

roc.test(response = aSAH$outcome, predictor1 = aSAH$wfns, predictor2 = aSAH$s100b)
## 
##  DeLong's test for two correlated ROC curves
## 
## data:  aSAH$wfns and aSAH$s100b by aSAH$outcome (Good, Poor)
## Z = 2.209, p-value = 0.02718
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
##  0.01040618 0.17421442
## sample estimates:
## AUC of roc1 AUC of roc2 
##   0.8236789   0.7313686

And plot the two ROC curves in one graph:

roc.s100b <- roc(aSAH$outcome, aSAH$s100b)
roc.wfns <- roc(aSAH$outcome, aSAH$wfns)
plot(roc.s100b)
plot(roc.wfns, add=TRUE, col="red")

6.2 Determining an “optimal” cut-off

Load the required library:

library(OptimalCutpoints)

The OptimalCutpoints package calculates optimal cutpoints in diagnostic tests. Several methods or criteria for selecting optimal cut-offs have been implemented, including methods based on cost-benefit analysis, diagnostic test accuracy measures (sensitivity/specificity, predictive values and diagnostic likelihood ratios) or prevalence (Lopez-Raton et al. 2014). As a basic example, we can use the Youden Index approach:

perf.s100b <-optimal.cutpoints(X ="s100b", status ="outcome", tag.healthy ="Good",
                                    methods = "Youden", data =aSAH, control = control.cutpoints(), 
                                    ci.fit = TRUE, 
                                    conf.level = 0.95, trace = FALSE)
perf.s100b
## 
## Call:
## optimal.cutpoints.default(X = "s100b", status = "outcome", tag.healthy = "Good", 
##     methods = "Youden", data = aSAH, control = control.cutpoints(), 
##     ci.fit = TRUE, conf.level = 0.95, trace = FALSE)
## 
## Optimal cutoffs:
##   Youden
## 1 0.2200
## 
## Area under the ROC curve (AUC):  0.731 (0.63, 0.833)

Therefore, 0.22 is the selected cut-off for the biomarker, i.e. the value that maximizes the sum of sensitivity and specificity. But always pay attention to the shape of the functional relationship between the biomarker and the risk of the outcome… more on this in Block 3.
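As a quick cross-check (a minimal sketch reusing the roc.s100b object created above), pROC's coords() can report the sensitivity and specificity achieved at that threshold:

# sensitivity and specificity at the Youden-selected cut-off of 0.22
coords(roc.s100b, x = 0.22, input = "threshold",
       ret = c("threshold", "sensitivity", "specificity"))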

6.3 References

[1] Natacha Turck, Laszlo Vutskits, Paola Sanchez-Pena, Xavier Robin, Alexandre Hainard, Marianne Gex-Fabry, Catherine Fouda, Hadiji Bassem, Markus Mueller, Frédérique Lisacek, Louis Puybasset and Jean-Charles Sanchez (2010) “A multiparameter panel method for outcome prediction following aneurysmal subarachnoid hemorrhage”. Intensive Care Medicine 36(1), 107-115. DOI: 10.1007/s00134-009-1641-y.

[2] Lopez-Raton, M., Rodriguez-Alvarez, M.X, Cadarso-Suarez, C. and Gude-Sampedro, F. (2014). OptimalCutpoints: An R Package for Selecting Optimal Cutpoints in Diagnostic Tests. Journal of Statistical Software 61(8), 1-36. URL http://www.jstatsoft.org/v61/i08/.

6.4 Exercise (f)

Let’s load the libraries:

library(tidyverse)
library(pROC)
library(MASS)
library(OptimalCutpoints)
library(epiR)
library(here)  # needed for here() when reading the dataset below

Load the data:

diabetes <- read.csv(here("datasets","diabetes.csv"))
glimpse(diabetes)
## Rows: 768
## Columns: 9
## $ Pregnancies              <int> 6, 1, 8, 1, 0, 5, 3, 10, 2, 8, 4, 10, 10, 1, …
## $ Glucose                  <int> 148, 85, 183, 89, 137, 116, 78, 115, 197, 125…
## $ BloodPressure            <int> 72, 66, 64, 66, 40, 74, 50, 0, 70, 96, 92, 74…
## $ SkinThickness            <int> 35, 29, 0, 23, 35, 0, 32, 0, 45, 0, 0, 0, 0, …
## $ Insulin                  <int> 0, 0, 0, 94, 168, 0, 88, 0, 543, 0, 0, 0, 0, …
## $ BMI                      <dbl> 33.6, 26.6, 23.3, 28.1, 43.1, 25.6, 31.0, 35.…
## $ DiabetesPedigreeFunction <dbl> 0.627, 0.351, 0.672, 0.167, 2.288, 0.201, 0.2…
## $ Age                      <int> 50, 31, 32, 21, 33, 30, 26, 29, 53, 54, 30, 3…
## $ Outcome                  <int> 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0, 1, 0, 1, 1, …

We transform Outcome from an integer variable into a factor with categories “No” and “Yes”:

diabetes$Outcome = factor(diabetes$Outcome, labels=c("No", "Yes"))
table(diabetes$Outcome)
## 
##  No Yes 
## 500 268
  • estimate the ROC curve for the glucose biomarker (a starter sketch follows this list);
  • plot the curve;
  • try to define an optimal cut-off, adopting the Youden Index approach;
  • check the sensitivity, specificity, PPV and NPV of this cut-off;
  • OPTIONAL: try to define a cut-off considering a different approach (e.g. “MinValueNPV” criterion \(\implies\) methods = "MinValueNPV")
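As a starting point for the first two items, a sketch assuming the variable names shown by glimpse above (the object name roc.glucose is arbitrary):

# estimate and plot the ROC curve for Glucose, with "No" as controls and "Yes" as cases
roc.glucose <- roc(diabetes$Outcome, diabetes$Glucose, levels = c("No", "Yes"))
plot(roc.glucose)
roc.glucose$auc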

7 Odds and probability

7.1 Relationship between probabilities and odds

Much of the statistical modeling done in epidemiology has involved transforming probabilities (risks), which are constrained to lie between 0 and 1, into outcomes that are amenable to linear models, which can range from negative to positive infinity. One useful approach to this problem is the logistic transformation. It involves two steps. First, we take our risk measurement and extend it to include values beyond 1: we do this by taking the odds. Second, we take the log of the odds. The result is a quantity that can range from minus to plus infinity and is therefore suitable for a linear model. Odds are closely related to probability: if p is the probability, p/(1-p) is the odds. Conversely, the probability equals odds/(odds+1). While a probability always ranges from 0 to 1, an odds ranges from 0 to infinity.

risk    <- seq(0,0.99,by=0.01)
odds <- risk/(1-risk)
plot(risk, odds)

If we restrict our attention to probabilities below 10%, we can see that the two quantities are very close:

plot(risk[1:10], odds[1:10])
abline(0,1)

We can therefore use two simple functions to move from probability to odds and from odds to probability:

p2o <- function (p) p/(1-p)
o2p <- function (o) o/(1+o)
curve(log(x/(1 - x)), 0, 1)
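For instance, a quick round-trip check with the two helper functions just defined:

p2o(0.20)        # 0.20/0.80 = 0.25
o2p(p2o(0.20))   # back to 0.20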

7.2 Reminder: Odds are not Odds Ratios

Note that in logistic regression (see Block 3), the non-intercept coefficients of the independent variables represent (log) odds ratios, not odds. Suppose we have a binary independent variable X in the model (i.e. with only two possible values, 1 and 2); then the exponentiated regression coefficient corresponding to X is the odds ratio:

\[ OR = \frac{Odds_1}{Odds_2} = \frac{\frac{p_1}{1-p_1}}{\frac{p_2}{1-p_2}} \]

We will see the interpretation of the regression coefficients of the logistic regression also in the case when X is a numerical variable or a categorical variable with more than two levels.
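As a small illustration (a sketch using simulated data, not one of the course datasets), we can check that for a binary exposure the exponentiated logistic regression coefficient matches the odds ratio computed directly from the 2-by-2 table:

set.seed(123)
n <- 1000
x <- rbinom(n, 1, 0.5)                          # binary exposure (0/1)
y <- rbinom(n, 1, ifelse(x == 1, 0.30, 0.15))   # binary outcome

fit <- glm(y ~ x, family = binomial)
exp(coef(fit)["x"])                             # OR from the logistic model

tab <- table(x, y)
(tab[2, 2] * tab[1, 1]) / (tab[2, 1] * tab[1, 2])  # cross-product OR, essentially the same value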

7.3 The effectsize R package

The effectsize package also contains functions to convert among indices of effect size. This can be useful for meta-analyses, or for any comparison between different types of statistical analyses.

library(effectsize)

7.3.1 Converting Between Odds Ratios and Risk Ratios (Relative Risks)

As we have discussed, odds ratios, although popular, are not very intuitive to interpret. We don't often think about the chances of catching a disease in terms of odds; instead we tend to think in terms of the probability of some event, i.e. the risk.

For example, suppose we find that, for individuals suffering from a migraine, every bowl of brussels sprouts they eat increases their odds of the migraine subsiding within an hour by an \(OR = 3.5\). So, should people eat brussels sprouts to effectively reduce pain? Well, hard to say… Maybe if we look at the RR we'll get a clue.

We can convert between OR and RR with the following formula, but always pay attention to the fact that, depending on the baseline risk, these two measures can be quite different from each other!

\[ RR = \frac{OR}{1 - p_0 + p_0 \times OR} \]

where \(p_0\) is the base-rate risk, i.e. the probability of the event without the intervention (e.g., the probability of the migraine subsiding within an hour without eating any brussels sprouts).

If the base-rate risk is, say, 85% (very high!), we get an RR of:

OR <- 3.5
baserate <- 0.85

(RR <- oddsratio_to_riskratio(OR, baserate))
## [1] 1.12

That is, for every bowl of brussels sprouts, we increase the chances of reducing the migraine by a mere 12%! Is it worth it? That depends on your affinity for brussels sprouts…

Note that the base-rate risk is crucial here. If instead of 85% it was only 4%, then the RR would be:

oddsratio_to_riskratio(OR, 0.04)
## [1] 3.181818

That is, for every bowl of brussels sprouts, the chance of reducing the migraine becomes more than three times higher (an increase of about 218%)! Is it worth it? I guess that still depends on your affinity for brussels sprouts… If the outcome is not rare (>10%), it is better to use the RR directly (if the study design permits), as it provides a more accurate picture of the association.
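To see how strongly the implied RR depends on the baseline risk, here is a small sketch evaluating the same conversion over a grid of base rates, for the fixed OR of 3.5 used above:

# RR implied by OR = 3.5 at different baseline risks
baserates <- c(0.01, 0.04, 0.10, 0.25, 0.50, 0.85)
data.frame(
  p0 = baserates,
  RR = sapply(baserates, function(p0) oddsratio_to_riskratio(OR, p0))
)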

8 Measures of association

8.1 Introduction

An important task in epidemiology is to quantify the strength of association between exposures and outcomes. In this context the term exposure is taken to mean a variable whose association with the outcome is to be estimated.

Exposures can be harmful, beneficial or both harmful and beneficial (e.g. if an immunisable disease is circulating, exposure to immunising agents helps most recipients but may harm those who experience adverse reactions). The term outcome is used to describe all the possible results that may arise from exposure to a factor or from preventive or therapeutic interventions.

In the following examples we will consider only binary exposure factors (in block 3 we will extend to exposures on a quantitative scale, through the use of the appropriate regression models).

8.2 Relative Risk

The RR (relative risk) is the ratio of the risk of getting the disease among the exposed to that among the non-exposed. It indicates how many times the risk would increase had the subject changed their status from non-exposed to exposed. The increase is expressed as a multiplicative factor ("folds"), so the RR corresponds to a multiplicative model.

Load the R libraries that we will use in this session:

library(epiR)
library(epitools)

First of all, let’s construct the contingency table of our study: the exposure factor is type of behaviour and the disease of interest is the occurrence of a coronary heart problem.

data.type <- matrix(c(178,79,1141,1486), 2, 2)
dimnames(data.type) <- list("Behavior type" = c("Type A", "Type B"),
                      "Occurrence of CHD Event" = c("Yes", "No"))
data.type
##              Occurrence of CHD Event
## Behavior type Yes   No
##        Type A 178 1141
##        Type B  79 1486

The aim is to calculate the Relative Risk and the corresponding 95% CI. (We will come back to the 95% CI later in Block 2, when discussing sample size determination.) Here we use a function from the epiR library, where the Relative Risk is called the incidence risk ratio. See the help of the function for all the details. Of note, the option method refers to the study design that generated the data under study. We will look in more detail in Block 2 at the differences between cohort, case-control and cross-sectional studies.

epi.2by2(data.type, method = "cohort.count")
##              Outcome +    Outcome -      Total                 Inc risk *
## Exposed +          178         1141       1319     13.50 (11.70 to 15.46)
## Exposed -           79         1486       1565        5.05 (4.02 to 6.25)
## Total              257         2627       2884       8.91 (7.90 to 10.01)
## 
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc risk ratio                                 2.67 (2.07, 3.45)
## Inc odds ratio                                 2.93 (2.23, 3.87)
## Attrib risk in the exposed *                   8.45 (6.31, 10.59)
## Attrib fraction in the exposed (%)            62.59 (51.75, 71.00)
## Attrib risk in the population *                3.86 (2.36, 5.37)
## Attrib fraction in the population (%)         43.35 (32.37, 52.55)
## -------------------------------------------------------------------
## Uncorrected chi2 test that OR = 1: chi2(1) = 62.919 Pr>chi2 = <0.001
## Fisher exact test that OR = 1: Pr>chi2 = <0.001
##  Wald confidence limits
##  CI: confidence interval
##  * Outcomes per 100 population units

Therefore, the risk of a CHD event for behavior type "A" is 2.67 times that for type "B".
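As a sanity check, the same point estimate can be reproduced by hand from the counts in the table:

risk.A <- 178 / (178 + 1141)   # risk of CHD in behavior type A (exposed)
risk.B <- 79 / (79 + 1486)     # risk of CHD in behavior type B (unexposed)
risk.A / risk.B                # should be approximately 2.67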

8.3 Odds Ratio

As we have already seen in a dedicated note, the odds are closely related to the probability: if p is the probability, p/(1-p) is the odds. First of all, let's again build the contingency table: now the exposure factor is coffee drinking (one or more cups per day vs none) and the disease of interest is pancreatic cancer.

data.coffee <- matrix(c(347,20,555,88), 2, 2)
dimnames(data.coffee) <- list("Coffee Drinking (cups per day)" = c(">=1", "0"),
                       "Pancreatic cancer" = c("cases", "controls"))
data.coffee
##                               Pancreatic cancer
## Coffee Drinking (cups per day) cases controls
##                            >=1   347      555
##                            0      20       88

We want to estimate the odds ratio and the corresponding 95% CI. The function oddsratio calculates the odds ratio by median-unbiased estimation (mid-p), conditional maximum likelihood estimation (Fisher), unconditional maximum likelihood estimation (Wald), and with a small-sample adjustment (small). The reason for these various approaches lies in the computation of the confidence intervals, which are calculated using exact methods (mid-p and Fisher), the normal approximation (Wald), or the normal approximation with a small-sample adjustment (small). If you are interested in these boring statistical details, see the help of the function for the default values.

oddsratio(data.coffee, method="wald")
## $data
##                               Pancreatic cancer
## Coffee Drinking (cups per day) cases controls Total
##                          >=1     347      555   902
##                          0        20       88   108
##                          Total   367      643  1010
## 
## $measure
##                               odds ratio with 95% C.I.
## Coffee Drinking (cups per day) estimate    lower    upper
##                            >=1 1.000000       NA       NA
##                            0   2.750991 1.662391 4.552449
## 
## $p.value
##                               two-sided
## Coffee Drinking (cups per day)   midp.exact fisher.exact   chi.square
##                            >=1           NA           NA           NA
##                            0   2.335156e-05  3.02823e-05 4.622573e-05
## 
## $correction
## [1] FALSE
## 
## attr(,"method")
## [1] "Unconditional MLE & normal approximation (Wald) CI"

The odds ratio of the disease in the exposed with respect to the unexposed is 2.75.
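Again, the point estimate can be checked by hand as the cross-product of the 2-by-2 table:

(347 * 88) / (555 * 20)   # should be approximately 2.75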

8.4 Absolute effect and attributable risks

In this library, the attributable risk is the risk of disease in the exposed group minus the risk of disease in the unexposed group (what we have called the excess risk in our slides). This provides a measure of the absolute increase or decrease in risk associated with exposure: the risk difference tells us how much risk is gained or lost had the subject changed from non-exposed to exposed. The increase is absolute, and corresponds to an additive model. As we have discussed, the risk difference has more public health implications than the relative risk: a high relative risk may not be of public health importance if the disease is very rare, whereas the risk difference measures the direct health burden and the need for health services. The output of epi.2by2 also reports:

  • Attrib risk in the population: the difference between the risk in the total population and the risk in the unexposed;
  • Attrib fraction in the population (%): the fraction of all cases of D in the population that can be attributed to E (what we simply call the attributable risk);
  • Attrib fraction in the exposed (%): the proportion of cases attributable to exposure among the exposed (i.e. considering only the population on which the exposure can act); we call it the attributable risk in the exposed.

We now use as an example the low infant birth weight data presented by Hosmer and Lemeshow (2000) and available in the MASS package in R. The birthwt data frame has 189 rows and 10 columns. The data were collected at Baystate Medical Center, Springfield, Massachusetts USA during 1986.

library(MASS)
bwt <- birthwt
head(bwt)
##    low age lwt race smoke ptl ht ui ftv  bwt
## 85   0  19 182    2     0   0  0  1   0 2523
## 86   0  33 155    3     0   0  0  0   3 2551
## 87   0  20 105    1     1   0  0  0   1 2557
## 88   0  21 108    1     1   0  0  1   2 2594
## 89   0  18 107    1     1   0  0  1   0 2600
## 91   0  21 124    3     0   0  0  0   0 2622

Each row of this data set represents data for one mother. We’re interested in the association between smoke (the mother’s smoking status during pregnancy) and low (delivery of a baby less than 2.5 kg bodyweight). It is important that the table you present to the epi.2by2 function is in the correct format: disease positives in the first column, disease negatives in the second column, exposure positives in the first row and exposure negatives in the second row.

bwt$low    <- factor(bwt$low, levels = c(1,0))
bwt$smoke  <- factor(bwt$smoke, levels = c(1,0))
bwt$race   <- factor(bwt$race, levels = c(1,2,3))

Generate the 2 by 2 table. Exposure (rows) = smoke, outcome (columns) = low:

low.tab <- table(bwt$smoke, bwt$low, dnn = c("Smoke", "Low BW"))
low.tab
##      Low BW
## Smoke  1  0
##     1 30 44
##     0 29 86

Compute the measures of association for smoking and delivery of a low birth weight baby:

epi.2by2(dat = low.tab, method = "cohort.count", conf.level = 0.95, 
   units = 100, outcome = "as.columns")
##              Outcome +    Outcome -      Total                 Inc risk *
## Exposed +           30           44         74     40.54 (29.27 to 52.59)
## Exposed -           29           86        115     25.22 (17.58 to 34.17)
## Total               59          130        189     31.22 (24.69 to 38.35)
## 
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc risk ratio                                 1.61 (1.06, 2.44)
## Inc odds ratio                                 2.02 (1.08, 3.78)
## Attrib risk in the exposed *                   15.32 (1.61, 29.04)
## Attrib fraction in the exposed (%)            37.80 (5.47, 59.07)
## Attrib risk in the population *                6.00 (-4.33, 16.33)
## Attrib fraction in the population (%)         19.22 (-0.21, 34.88)
## -------------------------------------------------------------------
## Uncorrected chi2 test that OR = 1: chi2(1) = 4.924 Pr>chi2 = 0.026
## Fisher exact test that OR = 1: Pr>chi2 = 0.036
##  Wald confidence limits
##  CI: confidence interval
##  * Outcomes per 100 population units

The estimated excess risk is 15.3 per 100, i.e. we would expect about 15 additional low birth weight deliveries per 100 women if all women smoked, compared with none smoking. The attributable risk in the population is estimated at 6 per 100, the difference between the risk in the whole population and the risk in the non-smokers. The attributable fraction in the population is 19%, i.e. the fraction of all cases of low birth weight that can be attributed to smoking (the naive interpretation is that low birth weight could be reduced by 19% if all mothers stopped smoking). Finally, the attributable fraction in the exposed (what we call the attributable risk in the exposed) is estimated at 38%, i.e. 38% of the low birth weight deliveries among women who smoked could be attributed to smoking (assuming causality).
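These absolute measures can also be reproduced by hand from the 2-by-2 table (a quick check, expressed per 100 women as in the output above):

p1 <- 30 / 74          # risk of low birth weight among smokers
p0 <- 29 / 115         # risk among non-smokers
p  <- 59 / 189         # risk in the whole sample

100 * (p1 - p0)        # attributable (excess) risk in the exposed: ~15.3
100 * (p1 - p0) / p1   # attributable fraction in the exposed (%): ~37.8
100 * (p - p0)         # attributable risk in the population: ~6.0
100 * (p - p0) / p     # attributable fraction in the population (%): ~19.2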

8.5 Supplementary/Optional Material !!!

8.5.1 Number needed to treat

The Number Needed to Treat (NNT) is a measure of treatment effectiveness mostly used in clinical trials with binary outcomes. The NNT is the number of patients who need to be treated for one additional patient to benefit compared with the control. More precisely, the NNT is the number of patients who need to be treated to prevent one additional bad outcome. The NNT is usually calculated as the inverse of the absolute risk reduction (ARR), i.e. the difference between the rate of the bad outcome in the control group (CER, control event rate) and in the experimental group (EER, experimental event rate):

NNT = 1 / ARR = 1 / (CER - EER)

The calculation is straightforward when the outcome is a categorical event rate. (When dealing with continuous measures, however, it may be difficult to calculate the NNT.)

You may also want to calculate the Number Needed to Harm (NNH), the counterpart of the NNT. The NNH is the number of patients who need to receive the treatment for one additional adverse event to occur, such as a severe side effect or even death. The NNH is calculated from the Absolute Risk Increase (ARI), the difference between the rate of the adverse event in the experimental group and that in the control group:

ARI = Adv.Ev.Exp.R - Adv.Ev.C.R

NNH = 1 / ARI

A treatment should always have an NNH higher than its NNT, and a large ratio between the two (NNH much higher than NNT) is desirable.

From the NNT and the NNH the likelihood of being helped or harmed (LHH) can also be calculated, according to the formula:

LHH = (1/NNT) / (1/NNH)

An LHH of 3 means that patients are 3 times more likely to benefit from the treatment than to be harmed by it.

Let's input some data:

  • s.e: sample size in the experimental group
  • s.c: sample size in the control group
  • b.e: how many participants experienced the bad outcome in the experimental group
  • b.c: how many participants experienced the bad outcome in the control group

s.e <- 100  
s.c <- 100
b.e <- 20
b.c <- 60

Formula to calculate NNT from categorical data:

EER <- round((b.e/s.e),2)      
CER <- round((b.c/s.c),2)        
ARR <- CER - EER 
ARR
## [1] 0.4
nnt1 <- 1/ARR 
nnt1 <- round(nnt1,3)
nnt1
## [1] 2.5

Now another example to calculate NNH:

  • s.e: sample size in the experimental group
  • s.c: sample size in the control group
  • a.e: how many participants experienced the adverse event in the experimental group
  • a.c: how many participants experienced the adverse event in the control group

s.e <- 100 
s.c <- 100
a.e <- 30
a.c <- 10
Adv.Ev.Exp.R <- round((a.e/s.e),2)     
Adv.Ev.C.R   <- round((a.c/s.c),2)     
ARI          <- Adv.Ev.Exp.R - Adv.Ev.C.R
ARI 
## [1] 0.2
nnh <- 1/ARI 
nnh <- round(nnh,3) 
nnh
## [1] 5

Finally, let’s evaluate the LHH:

LHH <- (1/nnt1)/(1/nnh)
lhh <- round(LHH, 3)
lhh
## [1] 2
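The three quantities can also be wrapped in a small helper function (a sketch; the function name and argument names are just illustrative), starting from the raw counts:

nnt_nnh_lhh <- function(s.e, s.c, b.e, b.c, a.e, a.c) {
  ARR <- b.c / s.c - b.e / s.e   # absolute risk reduction for the bad outcome
  ARI <- a.e / s.e - a.c / s.c   # absolute risk increase for the adverse event
  c(NNT = 1 / ARR, NNH = 1 / ARI, LHH = ARR / ARI)
}
nnt_nnh_lhh(s.e = 100, s.c = 100, b.e = 20, b.c = 60, a.e = 30, a.c = 10)
# with the counts above this should give NNT = 2.5, NNH = 5, LHH = 2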

8.6 Exercises

8.6.1 Exercise (g)

Assuming that the risk of some disease is 1 in 100,000 among unexposed people, create a table of attributable fractions for exposures that increase the risk to 1 in 50,000, 1 in 10,000 and 1 in 1,000.

8.6.2 Exercise (h)

Let's say that the probability (prevalence) of being exposed in the previous data is 5%. Add a column to the matrix created above to evaluate the population attributable risk.

8.6.3 Exercise (i)

In the Scottish Health cohort Study, 85 of 1821 people who lived in rented apartments had coronary artery disease, compared with 77 of 2477 people who owned their homes. Create a 2-by-2 matrix object from these data. Calculate the risks, relative risks, odds and odds ratios, and comment on the results.

8.6.4 Exercise (j)

Let's load the University Group Diabetes Program (UGDP) data. This was a placebo-controlled, multi-center randomized clinical trial (we will see the definition of this kind of study design in Block 2) from the early 1960s, intended to establish the efficacy of treatments for type 2 diabetes. There were a total of 409 patients: 204 received tolbutamide and 205 received placebo. 30 of the 204 tolbutamide patients died, as did 21 of the placebo patients. There are three variables of interest: Status (Survivor or Death), Treatment (Placebo or Tolbutamide) and Age Group (older or younger than 55).

Calculate the relative risk and the odds ratio for the association of tolbutamide with death (for the overall population) and comment on the results.