Introduction

An important task in epidemiology is to quantify the strength of association between exposures and outcomes. In this context the term exposure is taken to mean a variable whose association with the outcome is to be estimated.

Exposures can be harmful, beneficial or both harmful and beneficial (e.g. if an immunisable disease is circulating, exposure to immunising agents helps most recipients but may harm those who experience adverse reactions). The term outcome is used to describe all the possible results that may arise from exposure to a factor or from preventive or therapeutic interventions.

In the following examples we will consider only binary exposure factors (in block 3 we will extend to exposures on a quantitative scale, through the use of the appropriate regression models).

Relative Risk

RR (relative risk) is the ratio of the risk of getting disease among the exposed compared with that among the non-exposed. It indicates how many times the risk would increase had the subject changed their status from non-exposed to exposed. The increment is considered in fold, thus has a mathematical notation of being a multiplicative model.

Load the R libraries that we will use in this session:

library(epiR)
library(epitools)

First of all, let’s construct the contingency table of our study: the exposure factor is type of behaviour and the disease of interest is the occurrence of a coronary heart problem.

data.type <- matrix(c(178,79,1141,1486), 2, 2)
dimnames(data.type) <- list("Behavior type" = c("Type A", "Type B"),
                      "Occurrence of CHD Event" = c("Yes", "No"))
data.type
##              Occurrence of CHD Event
## Behavior type Yes   No
##        Type A 178 1141
##        Type B  79 1486

The aim is to calculate the Relative Risk and corresponding 95% CI. (We will come back to the 95% CI also later in Block 2 when discussing sample size determination). Here we use a function from the epiR library, where they call the the Relative Risk Incidence risk ratio. See the help of the function for all the details. Of note, the option method refers to the study design that has generated the data under study. We will see in more detail in Block 2 differences between cohort/case-controls and cross-sectional studies.

epi.2by2(data.type, method = "cohort.count")
##              Outcome +    Outcome -      Total                 Inc risk *
## Exposed +          178         1141       1319     13.50 (11.70 to 15.46)
## Exposed -           79         1486       1565        5.05 (4.02 to 6.25)
## Total              257         2627       2884       8.91 (7.90 to 10.01)
## 
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc risk ratio                                 2.67 (2.07, 3.45)
## Odds ratio                                     2.93 (2.23, 3.87)
## Attrib risk in the exposed *                   8.45 (6.31, 10.59)
## Attrib fraction in the exposed (%)            62.59 (51.75, 71.00)
## Attrib risk in the population *                3.86 (2.36, 5.37)
## Attrib fraction in the population (%)         43.35 (32.37, 52.55)
## -------------------------------------------------------------------
## Uncorrected chi2 test that OR = 1: chi2(1) = 62.919 Pr>chi2 = <0.001
## Fisher exact test that OR = 1: Pr>chi2 = <0.001
##  Wald confidence limits
##  CI: confidence interval
##  * Outcomes per 100 population units

Therefore, the relative risk of a CHD event is 2.67 times for the behavior type “A” with respect to “B”.

Odds Ratio

As we have already seen in a dedicated note, odds has a meaning related with probability. If p is the probability, p/(1-p) is known as the odds. First of all, also here let’s build the contingency table: now the exposure factor is coffee drinking (equal or more than 1 cup per day vs zero) and the disease of interest is pancreatic cancer.

data.coffee <- matrix(c(347,20,555,88), 2, 2)
dimnames(data.coffee) <- list("Coffee Drinking (cups per day)" = c(">=1", "0"),
                       "Pancreatic cancer" = c("cases", "controls"))
data.coffee
##                               Pancreatic cancer
## Coffee Drinking (cups per day) cases controls
##                            >=1   347      555
##                            0      20       88

We want to estimate the odds ratio and corresponding 95% CI: the function oddsratio calculates the odds ratio by the median-unbiased estimation (mid-p), the conditional maximum likelihood estimation (Fisher), the unconditional maximum likelihood estimation (Wald), and the small sample adjustment (small). The reason of these various approach is related to the computation of the confidence intervals, that are calculated using exact methods (mid-p and Fisher), normal approximation (Wald), and normal approximation with small sample adjustment (small). If you are interested in these boring statistical details, see the help of the function for all the details about the default values.

oddsratio(data.coffee, method="wald")
## $data
##                               Pancreatic cancer
## Coffee Drinking (cups per day) cases controls Total
##                          >=1     347      555   902
##                          0        20       88   108
##                          Total   367      643  1010
## 
## $measure
##                               odds ratio with 95% C.I.
## Coffee Drinking (cups per day) estimate    lower    upper
##                            >=1 1.000000       NA       NA
##                            0   2.750991 1.662391 4.552449
## 
## $p.value
##                               two-sided
## Coffee Drinking (cups per day)   midp.exact fisher.exact   chi.square
##                            >=1           NA           NA           NA
##                            0   2.335156e-05  3.02823e-05 4.622573e-05
## 
## $correction
## [1] FALSE
## 
## attr(,"method")
## [1] "Unconditional MLE & normal approximation (Wald) CI"

The odds ratio of the disease in the exposed is 2.75 w.r.t the unexposed.

Absolute effect and attributable risks

In this library, they call attributable risk the risk of disease in the exposed group minus the risk of disease in the unexposed group (we have defined it in our slides the excess risk).This provides a measure of the absolute increase or decrease in risk associated with exposure.Risk difference suggests the amount of risk gained or lost had the subject changed from non-exposed to exposed. The increase is absolute, and has the mathematical notation of an additive model. As we have discussed, the risk difference has more public health implications than the relative risk. A high relative risk may not be of public health importance if the disease is very rare. The risk difference, on the other hand, measures direct health burden and the need for health services. In this library there is also a quantity called Attrib risk in population, the difference between the risk in the total population minus the risk in the unexposed. Of note, in this library they call Attrib fraction in population (%) the fraction of all cases of D in the population that can be attributed to E (we call it simply attributable risk). And finally, they call Attrib fraction in exposed (%) the proportion of cases attributable to exposure in the exposed population (i.e. when considering the only population on which exposure can act). (We call it attributable risk in the exposed).

We now use as an example the low infant birth weight data presented by Hosmer and Lemeshow (2000) and available in the MASS package in R. The birthwt data frame has 189 rows and 10 columns. The data were collected at Baystate Medical Center, Springfield, Massachusetts USA during 1986.

library(MASS)
bwt <- birthwt
head(bwt)
##    low age lwt race smoke ptl ht ui ftv  bwt
## 85   0  19 182    2     0   0  0  1   0 2523
## 86   0  33 155    3     0   0  0  0   3 2551
## 87   0  20 105    1     1   0  0  0   1 2557
## 88   0  21 108    1     1   0  0  1   2 2594
## 89   0  18 107    1     1   0  0  1   0 2600
## 91   0  21 124    3     0   0  0  0   0 2622

Each row of this data set represents data for one mother. We’re interested in the association between smoke (the mother’s smoking status during pregnancy) and low (delivery of a baby less than 2.5 kg bodyweight). It is important that the table you present to the epi.2by2 function is in the correct format: disease positives in the first column, disease negatives in the second column, exposure positives in the first row and exposure negatives in the second row.

bwt$low    <- factor(bwt$low, levels = c(1,0))
bwt$smoke  <- factor(bwt$smoke, levels = c(1,0))
bwt$race   <- factor(bwt$race, levels = c(1,2,3))

Generate the 2 by 2 table. Exposure (rows) = smoke, outcome (columns) = low:

low.tab <- table(bwt$smoke, bwt$low, dnn = c("Smoke", "Low BW"))
low.tab
##      Low BW
## Smoke  1  0
##     1 30 44
##     0 29 86

Compute the measures of association for smoking and delivery of a low birth weight baby:

epi.2by2(dat = low.tab, method = "cohort.count", conf.level = 0.95, 
   units = 100, outcome = "as.columns")
##              Outcome +    Outcome -      Total                 Inc risk *
## Exposed +           30           44         74     40.54 (29.27 to 52.59)
## Exposed -           29           86        115     25.22 (17.58 to 34.17)
## Total               59          130        189     31.22 (24.69 to 38.35)
## 
## Point estimates and 95% CIs:
## -------------------------------------------------------------------
## Inc risk ratio                                 1.61 (1.06, 2.44)
## Odds ratio                                     2.02 (1.08, 3.78)
## Attrib risk in the exposed *                   15.32 (1.61, 29.04)
## Attrib fraction in the exposed (%)            37.80 (5.47, 59.07)
## Attrib risk in the population *                6.00 (-4.33, 16.33)
## Attrib fraction in the population (%)         19.22 (-0.21, 34.88)
## -------------------------------------------------------------------
## Uncorrected chi2 test that OR = 1: chi2(1) = 4.924 Pr>chi2 = 0.026
## Fisher exact test that OR = 1: Pr>chi2 = 0.036
##  Wald confidence limits
##  CI: confidence interval
##  * Outcomes per 100 population units

The estimated excess risk is 15.3, i.e. we would expect the low birth weight to increase by 15% if all women smoked as compared to all women not smoking. The population attributable risk is estimated at 6%, the percent of cases in the total population that can be attributed to smoking. The attributable fraction in the population is 19%, i.e. the fraction of all cases of low birth weight that can be attributed to smoking (the naive interpretation is that low birth weight could be reduced by 19% if all mothers stop smoking). Finally, the attributable risk in the exposed is estimated at 38%, i.e. 38% of the low birth weight that occurred among women who smoked could be attributed to smoking (assuming causality).

Supplementary/Optional Material !!!

Number needed to treat

The Number Needed to Treat (NNT) is a measure of treatment effectiveness mostly used in clinical trials with binary outcomes. The NNT is the number of patients that need to be treated for one to benefit compared with a control. More precisely, the NNT is the number of patients that need to be treat to avoid some bad outcome. Usually, the NNT is calculated as the inverse of the absolute risk reduction, i.e., the difference between the experimental success event rate (EER) and the control success event rate (CER):

NNT = 1 / (EER - CER)

It is quite obvious that the calculation is easy for categorical event rates. (However, when you deal with continuous measures it may be difficult to calculate the NNT).

You may also want to calculate the Number Needed to Harm (NNH), which is the converse of the categorical NNT. The NNH is the number of patients that need to receive a treatment for one additional adverse event to occur, such as a severe side effect or a real harm, until the death. The NNH is calculated on the basis of the Absolute Risk Increase (ARI), which is the difference between the event rate for the adverse event in the experimental group and that in the control group.

ARI = Adv.Ev.Exp.R - Adv.Ev.C.R

NNH = 1 / ARI

A treatment should always have a NNH higher than the NNT, and a large ratio between the two (NNH consistently higher than the NNT) is desirable.

From the NNT and the NNH the likelihood of being helped or harmed (LHH) can also be calculated, according to the formula:

LHH = (1/NNT) / (1/NNH)

A LHH = 3 means that the patients are 3 time more likely to benefit from the treatment than to be harmed by it.

Let’s input some data :

s.e : sample size in the experimental group s.c : sample size in the control group b.e : How many participants experienced the bad outcome in the experimental group b.c : How many participants experienced the bad outcome in the control group

s.e <- 100  
s.c <- 100
b.e <- 20
b.c <- 60

Formula to calculate NNT from categorical data:

EER <- round((b.e/s.e),2)      
CER <- round((b.c/s.c),2)        
ARR <- CER - EER 
ARR
## [1] 0.4
nnt1 <- 1/ARR 
nnt1 <- round(nnt1,3)
nnt1
## [1] 2.5

Now another example to calculate NNH:

s.e : sample size in the experimental group s.c : sample size in the control group a.e : How many participants experienced the adverse event in the experimental group a.c : How many participants experienced the adverse event in the control group

s.e <- 100 
s.c <- 100
a.e <- 30
a.c <- 10
Adv.Ev.Exp.R <- round((a.e/s.e),2)     
Adv.Ev.C.R   <- round((a.c/s.c),2)     
ARI          <- Adv.Ev.Exp.R - Adv.Ev.C.R
ARI 
## [1] 0.2
nnh <- 1/ARI 
nnh <- round(nnh,3) 
nnh
## [1] 5

Finally, let’s evaluate the LHH:

 LHH = (1/nnt1)/(1/nnh)    
  
lhh <- round(LHH,3) 
lhh
## [1] 2