---
title: "R applications & Exercises - Block 1"
subtitle: "STATISTICAL LEARNING IN EPIDEMIOLOGY 2023/2024"
author: "Prof. Giulia Barbati & Paolo Dalena"
date: "22/03/2024"
output:
  html_document:
    toc: true
    number_sections: true
    theme: united
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```


# Causal Inference Introduction

Useful libraries:

```{r}
library(tidyverse)
library(ggplot2)
```


## Estimating counterfactuals

Conceptually, the *missing* counterfactual outcome is one that would have occurred under a different set of circumstances. In causal inference, we wish we could observe the conterfactual outcome that would have occurred in an *alternate universe* where the exposure status for a given observation was flipped. 
To do this, we attempt to control for all factors that are related to an exposure and outcome such that we can construct (or estimate) such a counterfactual outcome.
Let’s suppose some happiness index, from 1-10 exists. 
We are interested in assessing whether eating chocolate ice cream versus vanilla will increase happiness. 
We have 10 individuals with two potential outcomes for each, one is what their happiness would be if they ate chocolate ice cream, (defined as y_chocolate in the code below), and one is what their happiness would be if they ate vanilla ice cream (defined as y_vanilla in the code below). 
We can define the true causal effect of eating chocolate ice cream (versus vanilla) on happiness for each individual as the difference between the two:


```{r}
data <- data.frame(
  id = 1:10,
  y_chocolate = c(4, 4, 6, 5, 6, 5, 6, 7, 5, 6),
  y_vanilla = c(1, 3, 4, 5, 5, 6, 8, 6, 3, 5)
)

data <- data |>
  mutate(causal_effect = y_chocolate - y_vanilla)

data
```

```{r}
data |>
  summarize(
    avg_chocolate = mean(y_chocolate),
    avg_vanilla = mean(y_vanilla),
    avg_causal_effect = mean(causal_effect)
  )
```
The causal effect of eating chocolate ice cream (versus vanilla) for individual 4 is 0, whereas the causal effect for individual 9 is 2. The average potential happiness after eating chocolate is 5.4 and the average potential happiness after eating vanilla is 4.6. The average treatment effect of eating chocolate (versus vanilla) ice cream among the ten individuals in this study is 0.8.

In reality, we cannot observe both potential outcomes, in any moment in time, each individual in our study can only eat one flavor of ice cream. Suppose we let our participants *choose* which ice cream they wanted to eat and each choose their favorite (i.e. they knew which would make them “happier” and picked that one). Now what we observe is:

```{r}
data_observed <- data |>
  mutate(
    exposure = case_when(
      # people who like chocolate more chose that
      y_chocolate > y_vanilla ~ "chocolate",
      # people who like vanilla more chose that
      y_vanilla >= y_chocolate ~ "vanilla"
    ),
    observed_outcome = case_when(
      exposure == "chocolate" ~ y_chocolate,
      exposure == "vanilla" ~ y_vanilla
    )
  ) |>
  # we can only observe the exposure and one potential outcome
  select(id, exposure, observed_outcome)
data_observed
```

```{r}
data_observed |>
  group_by(exposure) |>
  summarise(avg_outcome = mean(observed_outcome))
```

Now, the observed average outcome among those who ate chocolate ice cream is 5.4 (the same as the true average potential outcome), while the observed average outcome among those who ate vanilla is 6.3 – quite different from the actual average (4.6). The estimated causal effect here could be calculated as 5.4 - 6.3 = -0.9.

It turns out here, these 10 participants chose which ice cream they wanted to eat and they always chose to eat their favorite! This *artificially* made it look like eating vanilla ice cream would increase the happiness in this population when in fact we know the opposite is true. 
The next section will discuss which assumptions need to be true in order to allow us to accurately estimate causal effects using observed data. As a sneak peak, our issue here was that how the exposure was decided, if instead we *randomized* who ate chocolate versus vanilla ice cream we would (on average, with a large enough sample) recover the true causal effect.

## The magic of randomization 

Since now we are doing something *random* so let's set a seed so we always observe the
same result each time we run the code:


```{r}
set.seed(11)
data_observed <- data |>
  mutate(
    # change the exposure to randomized, generate from a binomial distribution
    # with a probability 0.5 for being in either group
    exposure = case_when(
      rbinom(10, 1, 0.5) == 1 ~ "chocolate",
      TRUE ~ "vanilla"
    ),
    observed_outcome = case_when(
      exposure == "chocolate" ~ y_chocolate,
      exposure == "vanilla" ~ y_vanilla
    )
  ) |>
  # as before, we can only observe the exposure and one potential outcome
  select(id, exposure, observed_outcome)
data_observed |>
  group_by(exposure) |>
  summarise(avg_outcome = mean(observed_outcome))
```
As you can see, now we have correctly estimated the causal effect in this population... how this happened? More on this in block 2 (Study Design) and block 3 ! 


# Descriptive epidemiology - introduction

Epidemiology is the study of the frequency, distribution and determinants of health-related states in populations and the application of such knowledge to control health problems (Disease Control and Prevention 2006).

This introduction provides examples on the way R can be used for initial descriptive epidemiological analyses, that is, to describe how the frequency of disease varies in the populations, places and times. 
Note that we will not discuss specific methods for *spatial epidemiology* in this course, that is an advanced and more specific topic. 

Load the libraries that we will use in this session: 

```{r warning=FALSE,message=FALSE}
library(epiR)
library(Epi)
library(popEpi)
library(ggplot2)
library(scales)
library(lubridate)
```


## Prevalence and Incidence in the populations 

Descriptions of disease frequency involves reporting either the prevalence or incidence of disease.

Recall of the definitions: strictly speaking, *prevalence* equals the number of cases of a given disease or attribute that exists in a population at a specified point in time. Prevalence is therefore the proportion of a population that has a specific disease or attribute at a specified point in time. 

Two types of prevalence are reported in the literature: (1) point prevalence : the proportion of a population in a diseased state at a single point in time, (2) period prevalence : the proportion of a population with a given disease or condition over a specific period of time (i.e. the number of existing cases at the start of a follow-up period plus the number of incident cases that occur during the follow-up period).

*Incidence* provides a measure of how frequently susceptible (at risk) individuals become disease cases as they are observed over time. An incident case occurs when an individual changes from being susceptible (at risk) to being diseased. The count of incident cases is the number of such events that occur in a population during a defined follow-up period. There are two ways to express incidence:

*Cumulative incidence* is the proportion of initially susceptible individuals in a population who become new cases during a defined follow-up period.

*Incidence rate* (also known as incidence density) is the number of new cases of disease that occur *per unit of individual time at risk* during a defined follow-up period.

In addition to reporting the point estimate of disease frequency, it is important to provide an indication of the uncertainty around that point estimate. The *epi.conf* function in the epiR package allows you to obtain confidence intervals for prevalence, cumulative incidence and incidence rates. In fact, these functions are useful for the confidence intervals calculations more than for the simple formulas of prevalence and incidence. 

### Prevalence example 

Let's say we're interested in the prevalence of disease X in a population comprised of 1000 individuals. Two hundred are tested and four returned a positive result. Assuming 100% test sensitivity and specificity (=i.e. the accuracy of the diagnostic tool, see later in this block for a detailed explanation), what is the estimated prevalence of disease X in this population?

```{r}
ncas <- 4; npop <- 200
tmp <- as.matrix(cbind(ncas, npop))
epi.conf(tmp, ctype = "prevalence", method = "exact", N = 1000, design = 1, conf.level = 0.95) * 100
```

The estimated prevalence of disease X in this population is 2.0 (95% confidence interval [CI] 0.55 - 5.0) cases per 100 individuals.

Of note: the design effect (option design) is used to adjust the confidence interval around a prevalence or incidence risk estimate in the presence of *clustering* i.e. when the sampling mechanism is not related to single subjects but to groups. The design effect is a measure of the variability between clusters and is calculated as the ratio of the variance calculated assuming a complex sample design divided by the variance calculated assuming simple random sampling.


### From data to estimates...

As we have seen, the prevalence of a disease in a population is the fraction of the population that has the disease at a given time. The total number of persons in Country X alive with colon cancer on 1 January 2023 is 16281, and the number of persons in Country X on 1 January 2023 is 5534738, so the point prevalence of colon cancer is :

```{r}
16281/5534738
```

that is, 0.29%. 

This is the *empirical* prevalence of colon cancer patients in Country X on 1 January 2023.

The *theoretical* prevalence is the probability that a *randomly* chosen person from the population has colorectal cancer. 

This is an unknown quantity, but a reasonable estimate of it would be the empirical prevalence. 

That is, however, just an *estimate* from a statistical model that asserts that colon cancer is randomly distributed among the population, with a probability *p*,and there is an uncertainty associated with the estimate which we can quantify in a confidence interval. This is an interval that captures the true prevalence with some pre-specified probability (usually set at 95%). 

The data used to estimate *p* are X=number of persons with colorectal cancer, and N=number of persons in the population. In order to assess *p* we can fit a *binomial* model: 

```{r}
X <- 16281
N <- 5534738
mp <- glm(cbind(X, N-X)~1, family=binomial)
round(ci.pred(mp)*100,4)
```

We can use the function *glm* in R that fits generalized linear models, in this case a model for a binomial outcome. The observed outcome must be specified as the number of *successes* (cases of cancer) and the number of *failures* (not-diseased persons). The prevalence as a function of explanatory variables is described by a *model* (more of this in block 3!), here specified by ~1, which means that the prevalence is constant across all observational units (not surprisingly since we only have one data point!). The *ci.pred* produces a prediction of the theoretical prevalence from the model. 

### Another example : prevalence of diabetes by age and sex 

We will use a dataset of Diabetes prevalence in Denmark in 1-year age classes by sex, measured in 01-01-2010. This dataset has 200 rows and the following 4 variables:

A: Numeric, age, 0-99

sex: Sex, a factor with levels M F

X: Number of diabetes patients

N: Population size

```{r}
data(pr)
str(pr)
head(pr)
```
We would like to know the prevalence at different ages and separately by sex. 
Since these data come from the entire population of Denmark, the interpretation will not include the sampling variation (i.e. confidence intervals). We can compute the observed prevalence by age and sex as follows:

```{r}
prt <- stat.table(index=list(A,sex), contents=ratio(X,N,100), data=pr)[1,,]
round(prt[70:75,],1)
```

a few rows of the resulting array is printed, from age of 69 to 64, just to show some results. 

We can plot these entries as a function of age, with one curve for each sex: 

```{r}
matplot(0:99+0.5, prt, type="l", lwd=2, col=c(4,2),lty=1 )
abline(v=c(76,81)+0.5)
```


Ages between 76 and 81 correspond to the maximal prevalences for men and women. We plotted the prevalences against the midpoints of the age-classes so we add 0.5 to the position of the vertical lines too. 


## Incidence rate basic example 

A study was conducted by Feychting, Osterlund, and Ahlbom (1998) to report the frequency of cancer among the blind. A total of 136 diagnoses of cancer were made from 22050 *person-years* at risk. What was the incidence rate of cancer in this population?

```{r}
ncas <- 136; ntar <- 22050
tmp <- as.matrix(cbind(ncas, ntar))
epi.conf(tmp, ctype = "inc.rate", method = "exact", N = 1000, design = 1, conf.level = 0.95) * 1000
```

The incidence rate of cancer in this population was 6.2 (95% CI 5.2 to 7.3) cases per 1000 person-years at risk.

### From data to estimates...again 

In the case when the incidence rate is assumed constant, we can assume a Poisson likelihood that leads to the estimate : 
(the observed outcome is D, number of events, and the amount of follow up time is Y): 

$$\hat{\lambda}=D/Y$$
$$s.e.{\log(\hat{\lambda})}=1/\sqrt(D)$$

From this formula we can derive also approximate 95% confidence intervals for the estimated rate (see Block 2). 
A summary of the methods used for each of the confidence interval calculations in the *epi.conf* function are available in the "help"of the function. 


## Mortality rate 

Mortality is typically reported as a number of people that have died in a population of a certain size. For example, 2648 persons died out of 150000 persons. This is a mortality of 17.6 per 1000 persons, apparently:

```{r}
2648/150000
```

Looking a bit closer at this, it is a bit illogical : the number of deaths must be recorded over some period of time, in this case presumably a year. The number of persons, on the other hand, refer to the population size at some specific date, presumably 1st January or July some year. So the two components, number of deaths and population size, refer to a *period* and a specific *time point*, respectively. 
The number of persons at risk of dying will not vary a lot over a year, so as a first approximation we might assume that the population was constant in the period. Thus, the 150000 above refers to the *average* population over the period. But, if the 2648 deaths were accumulated over a period of two years, the mortality would then be half of what we just computed, because the 150000 people had each two years to encounter death. 
This leads to the concept of *person-years* also called *risk time*, *time at risk*, *exposed time* or even just *exposure*. 
The person-years is the population size multiplied by the length of the observation interval. So since the population size is 150000 and the 
observation period is 2 years, the number of person-years is 300000 and the mortality rate is:


```{r}
2648/300
```

8.83 deaths per 1000 person-years. 
Note that the rate is a scaled quantity, it has unit of measurement, in this example *per 1000 py*. 
We might equally well have given the mortality rate as 88.3 per 10000 person-years. 
This quantity is called the *empirical* mortality rate. The *theoretical* counterpart is the probability of dying in a small interval relative to the length of the interval. This is usually denoted by the Greek letter $\lambda$:

$$\lambda=P\{{\sf death\ in} (t,t+dt)|{\tt alive\ at\ t\}}/dt$$

When we say *small* interval, we are approaching the formal definition as the limit of the quantity as *dt* tends to 0. The vertical bar denotes conditional probability (*given*). Note that the theoretical rate is also a scaled quantity; it is measured in units of $time^{-1}$ (deaths per person time) because the denominator *dt* is measured in time units. As in the case of prevalence, there is an *uncertainty* associated with the estimate of the theoretical mortality rate, which again we can quantify in a confidence interval. The data used to estimate $\lambda$ are D=number of deaths and Y=amount of follow up time. In order to assess this we can fit a Poisson model (other examples in block 3):

```{r}
D <- 2648
Y <- 300000
mm <- glm(D~1,offset=log(Y), family=poisson)
round(ci.pred(mm,newdata=data.frame(Y=1000)),3)
```

The observed outcome is D, number of deaths, and the amount of follow up time (Y) is specified in the *offset* argument. The mortality rate is described through a model for the number of events. The model is specified by ~1, which means that the mortality rate is constant over all observed units, specifically, that the number of deaths is proportional to the number of person-years as supplied in the offset argument. The log is used to linearize the model. The result is an estimate of the mortality rate per 1000 person-years.

### Basic data structure to estimate incidence rates (more on block 4!) 

Usually data are not aggregated but instead come from individuals subjects that are observed for a period of time (follow up) (see: cohort studies, block 2). The simplest example is following persons for the occurrence of death. 
The basic requirements for recordings in a follow-up study is: 
- date of entry to the study (the first date a person is known to be alive and at risk of having the event)
- date of exit from the study (the last date a person is known to be alive and at risk of having the event/or the date of death)
- The status of the person at the exit date - with or without the event.  

It is assumed that the event terminates the follow-up, so we are implicitly referring to events like death or diagnosis of a chronic disease, where a person ceases to be at risk of the event if it occurs. More complex data structures and methods allow for recurrent events or competing risk (some basic ideas in block 4). 

In the Epi package there is the dataset DMlate: 

```{r}
data(DMlate)
str(DMlate)
head(DMlate)
```

This dataset represents follow up of a random sample of 10000 diabetes patients in Denmark diagnosed after 1995, followed till the end of 2009.
Random noise has been added to all dates in order to make persons untraceable. Each line represents the follow up of one person. 

Sex: a factor with levels M F

dobth: Date of birth

dodm: Date of inclusion in the register

dodth: Date of death

dooad: Date of 2nd prescription of OAD

doins: Date of 2nd insulin prescription

dox: Date of exit from follow-up.

All dates are given in fractions of years, so 1998.000 corresponds to 1 January 1998 and 1998.997 to 31 December 1998.

All dates are randomly perturbed by a small amount, so no real persons have any of the combinations of the 6 dates in the dataset. 
But results derived from the data will be quite close to those that would be obtained if the entire 'real' diabetes register were used.
The overall number of deaths, follow up time and mortality rates by sex can be shown using *stat.table* function: 

```{r}
stat.table(index=sex, contents=list(D=sum(!is.na(dodth)), Y=sum(dox-dodm), rate=ratio(!is.na(dodth),dox-dodm,1000)), margin=TRUE, data=DMlate)
```

The overall mortality rate is 46.1 deaths per 1000 person-years. In making this calculation we are already implicitly imposed a statistical model describing the data, namely, the model which assumes a constant mortality rate for all persons during the study period. The single entries are from a model that assumes that mortality rates only differ by sex. While this model is clearly wrong for almost any purpose, it serves as an essential building block in derivation of more realistic models (see Block 4). 


### Different times ...

Follow-up data basically consist of a time entry, a time exit and an indication of the status at exit. 
Note that time comes in two guises: one is the *risk* time (or *exposure*) and the other is the time-scale.
The risk time is : "*How long* have you been alive" (i.e. exposed to potential death).
The time scale is : "*When* were you alive" meaning for example age or calendar time. 
The risk time is part of the outcome measurement (it is the denominator of the rate), whereas the time-scale is an *explanatory* variable. 
We will usually be interested in how rates vary along some time-scale, age or calendar time, for example, so in most analyses we will using both risk time and time-scales. 


## References 

Disease Control, Centers for, and Prevention. 2006. Principles of Epidemiology in Public Health Practice: An Introduction to Applied Epidemiology and Biostatistics. Atlanta, Georgia: Centers for Disease Control; Prevention.
  
Feychting, M, B Osterlund, and A Ahlbom. 1998. "Reduced Cancer Incidence Among the Blind." Epidemiology 9: 490-94.

Carstensen B., Epidemiology with R, 2021, Oxford University Press. 

## Exercises

### Exercise (a) 
The figure below shows the follow-up experience of members of a small study cohort between
1 January 2004 to 30 June 2009,  from entry to follow-up until death (black filled circles indicate death due to cancer
and empty circles indicate death for other causes). Follow-up until the occurrence of cancer C is
shown with a broken line. For those subjects contracting cancer C, follow-up after diagnosis is shown with a solid line. 

![](images/ex1.jpg)

Looking at the plot, we will calculate the values of the incidence rate of the disease and of various mortality
measures:

1. What is the incidence rate (per 100 pyears) of cancer C during the period from 1 Jan 2004 to 31 Dec 2008?

2. What is the mortality rate from cancer C during the same period?

3. What is the mortality rate from all causes during the same period?

4. What is the prevalence of C on 30 September 2006, and on 31 December 2008?


### Exercise (b)
In the table below are given the size (in 1000s) of the male population in Finland aged 0-14
years (the age range of "childhood" in pediatrics) on the 31 December in each year from 1991
to 2000.

![](images/ex2.jpg)

The following numbers of cases describe the incidence of and mortality from acute leukaemia
in this population for two calendar periods: 5 years 1993 to 1997, and
year 1999 only.

![](images/ex2b.jpg)

1. Calculate the incidence rates of acute leukaemia in this population for the two periods.
(Key: person-years could be obtained by the mid-population principle, i.e. using population in 1995 for the first period or taking an average between years)
2. Calculate similarly the mortality rates of leukaemia.


# Standardized rates

## Example of standardized rates calculations 

The following example does not require the use of any specific R library, we will do simply "by hand" calculations. 

Age-specific data on the incidence of colon cancer in male and female populations of Finland
during 1999 are given in the following table:

![](images/exSTD1.jpg)


Let's calculate the following summary measures:

1. crude incidence rate in both populations and the *rate ratio* (anticipation of association measures...): males vs. females,
2. age-standardized rates and their ratio using the male population as the standard,i.e. the "indirect method"; 
3. age-standardized rates and their ratio using the World Standard Population, i.e. the "direct method"

We will compare and comment the results obtained in items 1 to 3.

World Standard Population: 

![](images/wsp.jpg)

**Let's do it:**

First of all, we create the data matrix: 

```{r}
M <- matrix( c(10,1157,46.0,0.9,22,1109,41.9,2.0,0.44
,76,809,32.0,9.4,68,786,29.7,8.6,1.09
,305,455,18.0,67,288,524,19.8,55,1.22
,201,102,4.0,196,354,229,8.6,155,1.27), nrow=4, byrow=T )
M <- data.frame(M)
names(M) <- c("mca","mpy","mp","mr",
              "fca","fpy","fp","fr","rr")
M
```


Then, we compute the *crude* incidence rates: 

```{r}
rates <- with( M, c( sum(mca)/sum(mpy)*100,
            sum(fca)/sum(fpy)*100 ) )
rates[3] <- rates[1]/rates[2]
names(rates) <- c("M rate","F rate","M/F RR")
round( rates, 2 )
```


Now we compute the *age-standardized* rates and their *ratio* using the male population as the standard:

```{r}
wm <- with( M, mpy/sum(mpy) )
wm
rates <-with( M, c( sum(mca/mpy*wm)*100,
            sum(fca/fpy*wm)*100 ) )
rates[3] <- rates[1]/rates[2]
names(rates) <- c("M rate","F rate","M/F RR")
round( rates, 2 )
```

And now we compute the age-standardized rates and their ratio using the World Standard Population: 

```{r}
WSP <- c(120,100,90,90,80,80,60,60,60,60,50,40,40,30,20,10,5,3,2)
WSP

wt <- sum(WSP[1:7])
wt[2] <- sum(WSP[8:11])
wt[3] <- sum(WSP[12:15])
wt[4] <- sum(WSP[16:19])
wt <- wt/sum(wt)
wt

rates <- with( M, c( sum(mca/mpy*wt)*100,
            sum(fca/fpy*wt)*100 ) )
rates[3] <- rates[1]/rates[2]
names(rates) <- c("M rate","F rate","M/F RR")
round( rates, 2 )
```


We can notice that when using the original data from the Finland population, males seem to have a low risk of colon cancer with respect to females. 
But this result is driven by the *confounding* effect of age, since females are older than males. When comparing the rates adjusting for the differences in age groups, the relative risk of males is instead higher than females. 


If you want to see other examples using specific R libraries check the following website: https://epirhandbook.com/en/standardised-rates.html

## Exercises

Standardized rates could be considered as adjusted summary estimators, controlling for example for the confounding related to different age structures in the populations under study. 
These estimators are weighted averages of stratum-specific estimators.

We have seen the method of (direct) standardization by using an external standard population (CMR, comparative mortality ratio) and the (indirect) method of the standardized morbidity ratio estimation (Standardized Mortality ratio, SMR). These approaches are useful in simple descriptive analyses. 

As we have seen, to compute the crude incidence rate we evaluate the ratio between the total number of new cases over the total person-years; if we have data splitted in age groups, we can sum at the numerator all the age-specific numbers of cases and then divide by the sum of all the age-specific person-years. 

In order to compute the age standardized rates (using the direct method) we apply instead the following formula: 


$$ASR=\frac{\sum_{k=1}^{K} weight_{k}*rate_{k}}{\sum_{k=1}^{K} weight_{k}}$$
Load the dataset for the exercice: is the number of cases (D) and the age-specific incidence rates (in cases per 100,000
person-years) from the Danish Cancer Register for the period 1983-87 for colon cancer, rectum cancer and lung cancer, by sex.

```{r}
library(here)

std.rates <- read.delim(here("datasets","std-rates.txt"))
head(std.rates)
std <- std.rates
```


### Exercise (c)

Calculate the crude rates for each sex and site, given that the number of male-person-years is 2521177 and the female is
2596061.

### Exercise (d)

Calculate the standardized rates, standardized to the world standard population.

# Confidence Intervals - recap

## CI for proportions (prevalence/cumulative incidence)

Useful libraries: 

```{r warning=FALSE,message=FALSE}
library(DescTools)
library(PropCIs)
library(epiR)
library(Epi)
library(popEpi)
```

The binom.test function output includes a confidence interval for a proportion, 
and the proportion of "success" as a decimal number.  
The binom.test function in the native *stats* package uses the Clopper-Pearson method for confidence intervals (1934).

This guarantees that the confidence level is at least *conf.level*, but in general does not give the shortest-length confidence intervals.

Example: as part of a survey, a researcher asks a class of 21 students if they have ever studied statistics before.  

The following are the data are collected:

Experience Count: Yes= 7; No=14

Note that when calculating confidence intervals for a binomial variable, one level of the nominal variable is chosen to be the "success" level.  

This is an arbitrary decision, but you should be cautious to remember that the confidence interval is reported for the proportion of "success" responses.  


```{r}
binom.test(7, 21,0.5,alternative="two.sided",conf.level=0.95)
```


The *BinomCI* function in the *DescTools* package has several methods for calculating confidence intervals for a binomial proportion.

[see from the help of the function: method = c("wilson", "wald", "agresti-coull", "jeffreys",
                   "modified wilson", "wilsoncc","modified jeffreys",
                   "clopper-pearson", "arcsine", "logit", "witting", "pratt")]

```{r}
BinomCI(7, 21,
        conf.level = 0.95,
        method = "clopper-pearson")
```

The BinomCI function can also produce the confidence intervals for "success" and "failure" in one step.

```{r}
observed = c(7, 14)
total = sum(observed)

BinomCI(observed, total,
        conf.level = 0.95,
        method = "clopper-pearson")
```

Also the *PropCIs* package has functions for calculating confidence intervals for a binomial proportion.The *exactci* function uses the Clopper-Pearson exact method.

```{r}
exactci(7, 21, conf.level=0.95)
```

The *blakerci* function uses the Blaker exact method.

```{r}
blakerci(7, 21,
         conf.level=0.95)
```


## CI for the incidence rate 

As we already introduced in Epi Intro, in the case when the incidence rate is assumed constant, we can assume a Poisson likelihood that leads to the estimate : 
(the observed outcome is D, number of events, and the amount of follow up time is Y): 

$$\hat{\lambda}=D/Y$$
$$s.e.{\log(\hat{\lambda})}=1/\sqrt(D)$$

From this formula and using the normal approximation we can derive also approximate 95% confidence intervals for the estimated rate. 
A summary of the methods used for confidence interval calculations in the *epi.conf* function of the epiR library are available in the "help"of the function (refer to later in this section) . 


## CI for the mean (numerical variable/outcome)

Let's do now a simple simulation to demonstrate the computation of the confidence interval for the mean: 

```{r}
nSim <- 10000
n    <- 10
ciArray <- array(0, dim=c(nSim,2))

for (i in 1:nSim){
  x <- rnorm(n, mean=10, sd=1)
  ciArray[i,] <- c(t.test(x)$conf.int[1],t.test(x)$conf.int[2]) 
}

head(ciArray)
```

What proportion of these CIs include the "true" parameter?

```{r}
mean(1*((ciArray[,1] < 10) & (ciArray[,2] > 10)))
```


By default with the *t.test* function in R we obtain a 95% confidence interval for the mean. 

Remember: confidence intervals (in the frequentist approach) make sense in the long run ...

## The Epi.conf function 

As we have already explored, the function *epi.conf* estimate confidence intervals for means, proportions, incidence, etc... in one function. 
Let's see a couple of examples: 

```{r warning=FALSE,message=FALSE}
library(epiR)
```


### Estimate CI for a mean 

Method *mean.single* requires a vector as input.

```{r}
dat <- rnorm(n = 100, mean = 0, sd = 1)
epi.conf(dat, ctype = "mean.single")
```


### Estimate CI for the difference between two independent groups means 

In statistics we distinguish between *independent* samples and *paired* samples.
Roughly, independent samples are composed by *different* statistical units (for example different subjects) that are randomly selected and that have "nothing" in common. Paired data refer instead to the concept of a *correlation* between the statistical units, as for example repeated measures on the same subjects, or some induced correlation (we will introduce *matching* procedures later in the course, block 3). 

Very often we want to compare mean values of a numerical outcome of interest between two groups, and in this case an option could be to compute the mean difference and the corresponding 95% CI and to check if the interval contain zero or not, Why ? Because if the 95% CI contain zero, then we can not conclude in favor of a statistical significant difference, but instead we conclude that data offer no evidence of coming from different populations. Remind that this procedure heavily depend also on variability around the means and on sample size!! 

Method *mean.unpaired* requires a two-column data frame; the first column defining the groups must be a factor. 

```{r}
group <- c(rep("A", times = 5), rep("B", times = 5))
val = round(c(rnorm(n = 5, mean = 10, sd = 5), rnorm(n = 5, mean = 7, sd = 5)), digits = 0)
dat <- data.frame(group = group, val = val)
epi.conf(dat, ctype = "mean.unpaired")
```


### Estimate CI for the difference between two paired groups means  

Two paired samples: systolic blood pressure levels were measured in 16 middle-aged men
before and after a standard exercise test. The mean rise in systolic 
blood pressure was 6.6 mmHg. The standard deviation of the difference
was 6.0 mm Hg. The standard error of the mean difference was 1.49 mm Hg.

```{r}
before <- c(148,142,136,134,138,140,132,144,128,170,162,150,138,154,126,116)
after <- c(152,152,134,148,144,136,144,150,146,174,162,162,146,156,132,126)
dat <- data.frame(before, after)
dat <- data.frame(cbind(before, after))
epi.conf(dat, ctype = "mean.paired", conf.level = 0.95)
```

The 95% confidence interval for the population value of the mean 
systolic blood pressure increase after standard exercise was 3.4 to 9.8 mm Hg.
Is there a significant effect of the exercise test? What do you think? 

### Estimate CI for a proportion (again :-) )

Out of 263 patients giving their views on the use of personal computers in 
general medical practice, 81 of them thought that the privacy of their medical file
had been reduced.

```{r}
pos <- 81
neg <- (263 - 81)
dat <- as.matrix(cbind(pos, neg))
round(epi.conf(dat, ctype = "prop.single"), digits = 3)
```

The 95% confidence interval for the population value of the proportion
of patients thinking their privacy was reduced was from 0.255 to 0.366. 

### Estimate CI for the difference between two independent proportions 

Adverse effects in 85 patients receiving either terbinafine or placebo treatment for dermatophyte onchomychois are reported.
Out of 56 patients receiving terbinafine, 5 patients experienced
adverse effects. Out of 29 patients receiving a placebo, none experienced
adverse effects.

```{r}
grp1 <- matrix(cbind(5, 51), ncol = 2)
grp2 <- matrix(cbind(0, 29), ncol = 2)
dat <- as.matrix(cbind(grp1, grp2))
round(epi.conf(dat, ctype = "prop.unpaired"), digits = 3)
```


The 95% confidence interval for the difference between the two groups is
from -0.038 to +0.193.Is there a significant different rate of adverse effects of the terbinafine vs the placebo treatment ? What do you think? 

### Estimate CI for the difference between two paired proportions 

In a reliability exercise, 41 patients were randomly selected from those
who had undergone a stress test. The 41 sets of images were
classified as normal or not by the laboratory and, 
independently, by clinical investigators from different centres.
Of the 19 samples identified as ischaemic by clinical investigators 
5 were identified as normal by the laboratory. Of the 22 samples 
identified as normal by clinical investigators 0 were identified as 
ischaemic by the laboratory. 

![](images/table.jpg)

```{r}
dat <- as.matrix(cbind(14, 5, 0, 22))
round(epi.conf(dat, ctype = "prop.paired", conf.level = 0.95), digits = 3)
```

The 95% confidence interval for the difference in evaluations is 0.011 to 0.226 or approximately +1% to +23%.
Is there a significant different evaluation between the laboratory and clinical investigators? What do you think?

### Estimate CI for a prevalence 

A herd of 1000 cattle were tested for brucellosis. Four samples out of 200
test returned a positive result. Assuming 100% test sensitivity and 
specificity, what is the estimated prevalence of brucellosis in this 
group of animals?

```{r}
pos <- 4; pop <- 200
dat <- as.matrix(cbind(pos, pop))
epi.conf(dat, ctype = "prevalence", method = "exact", N = 1000, 
   design = 1, conf.level = 0.95) * 100
```

The estimated prevalence of brucellosis in this herd is 2 cases
per 100 cattle (95% CI 0.54 -- 5.0 cases per 100 cattle).


## References 


Clopper, C.; Pearson, E. S. (1934). "The use of confidence or fiducial limits illustrated in the case of the binomial". Biometrika. 26 (4): 404-413.

Blaker, H. (2000). Confidence curves and improved exact confidence intervals for discrete distributions, Canadian Journal of Statistics 28 (4), 783-798. 

# Diagnostic tests

Load the required library: 

```{r warning=FALSE,message=FALSE}
library(epiR)
```

We will use an an example of data from the stress test to detect the presence of coronary disease. 

```{r warning=FALSE,message=FALSE}
dat           <- as.table(matrix(c(815,115,208,327), nrow = 2, byrow = TRUE))
colnames(dat) <- c("Dis+","Dis-")
rownames(dat) <- c("Test+","Test-")
dat
```


Let's estimate the parameters of interest related to the performance of the stress test in detecting the disease: the following function computes true and apparent prevalence (*apparent prevalence* is in this case the number of subjects testing positive by a diagnostic test divided by the total number of subjects in the sample tested; *true prevalence* is the actual number of diseased subjects divided by the number of individuals in the population), sensitivity, specificity, positive and negative predictive values, and positive and negative likelihood ratios from count data provided in a 2 by 2 table.
Exact binomial confidence limits are calculated for test sensitivity, specificity, and positive and negative predictive value.
The positive likelihood ratio is calculated as $$LR+ = Sensitivity/1-Specificity$$
It represents the probability of a person who has the disease testing positive divided by the probability of a person who does not have the disease testing positive.
The negative likelihood ratio is calculated as $$LR- = 1-Sensitivity/Specificity$$
It represents the probability of a person who has the disease testing negative divided by the probability of a person who does not have the disease testing negative.

```{r}
rval <- epi.tests(dat, conf.level = 0.95)
print(rval)
```


## Comparing two binary diagnostic tests [Optional topic!!]

The comparison of the performance of two binary diagnostic tests is an important topic in clinical research. 
The most frequent type of sample design to compare two binary diagnostic tests is the *paired* design.
This design consists of applying the two binary diagnostic tests to all of the individuals in a random sample, where
the disease status of each individual is known through the application of a gold standard. 

We will use for this example an R program written by Roldán-Nofuentes in 2020, called *compbdt*. You can download it from : https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-020-00988-y#MOESM1

The objective is to estimate the sensitivity and the specificity, the likelihood ratios and the predictive
values of each diagnostic test applying the confidence intervals with the best asymptotic performance. 

The program compares the sensitivities and specificities of the two diagnostic tests simultaneously, as well as the likelihood ratios
and the predictive values, applying the global hypothesis tests with the best performance in terms of type I error and
power. 

When the global hypothesis test is significant, the causes of the significance are investigated solving the
individual hypothesis tests and applying the *multiple comparison* method of Holm. 

The most optimal confidence intervals are also calculated for the difference or ratio between the respective parameters. 

Based on the data observed in the sample, the program also estimates the probability of making *a type II error* if the null hypothesis is not rejected,
or estimates the power if the if the alternative hypothesis is accepted. We will review some basic ideas about statistical testing procedure in Block 2, related to sample size determination.  


|  Type of Errors         | $H_{0}$ TRUE  | $H_{0}$ FALSE |
|---------:|:-----------:|:------:|
| Reject $H_{0}$   | Type I error $\alpha$      | ok   |
| Do not reject $H_{0}$ | ok         |  Type II error $\beta$   |


We define as *power* of a statistical test the probability of not incurring in Type II error, i.e. 1-$\beta$. 
The estimation of the probability of making a type II error allows the researcher to decide about the reliability of the null hypothesis when this hypothesis is not rejected.

We will apply this function to a real example on the diagnosis of coronary artery disease.

The diagnosis of coronary artery
disease (CAD) was investigated using as diagnostic tests the exercise
test (Test 1) and the clinical history of chest pain (Test
2), and the coronary angiography as the gold standard.
The following table shows the frequencies obtained by applying
three medical tests to a sample of 871 individuals.


![](images/data_tests.png)

we source the R program that we will use to do the calculations: 

```{r warning=FALSE,message=FALSE}
source(here('compbdt.R'))
```

and we run the program: the input are the frequencies of diseased (non-diseased) patients in which the
Test 1 gives a result positive and negative and Test 2 gives a result positive and negative.

```{r warning=FALSE,message=FALSE}
pp <- compbdt(473, 29, 81, 25, 22, 46, 44, 151)
```
 
You will also find three text files in the folder that you are currently using. 
The results obtained comparing the sensitivities and specificities are recorded in the file “Results_Comparison_Accuracies.txt”, those obtained when comparing the LRs are recorded in the file “Results_Comparison_LRs.txt”, and those obtained when comparing the PVs are recorded in the file “Results_Comparison_PVs.txt”.
In R, an alternative program to “compbdt” is the *DTComPair* package. The *DTComPair* package estimates the same parameters as the “compbdt” and compares the parameters individually, i.e. solving each hypothesis test to an α error. 

## References

Leisenring, W., Alonzo, T., and Pepe, M. S. (2000). Comparisons of predictive values of binary medical diagnostic tests for paired designs. Biometrics, 56(2):345-51.

Kosinski, A.S. (2013). A weighted generalized score statistic for comparison of predictive values of diagnostic tests. Stat Med, 32(6):964-77.

Moskowitz, C.S., and Pepe, M.S. (2006). Comparing the predictive values of diagnostic tests: sample size and analysis for paired study designs. Clin Trials, 3(3):272-9.

Roldán-Nofuentes, J.A. Compbdt: an R program to compare two binary diagnostic tests subject to a paired design. BMC Med Res Methodol 20, 143 (2020). https://doi.org/10.1186/s12874-020-00988-y

## Exercises

Let's load the libraries:

```{r, warning=F,message=FALSE}

library(epiR)
library(tidyverse)
library(OptimalCutpoints)
```

### CK dataset 

The CK file reports the creatine kinase (CK, U/L) measurements in patients
with unstable angina and acute myocardial infarction, two conditions that
are clinically difficult to distinguish. A diagnostic test has been devised
such that for CK values > 100 the patient is classified as a case of
acute myocardial infarction, for CK < 100 the patient is classified
as a case of unstable angina. Calculate sensitivity, specificity, PPV,
NPV of the diagnostic test.
What changes when choosing CK = 80 as the threshold value?

```{r,warning=F}
load(here("datasets","CK.Rdata"))
```

Define the cutoff:

```{r,warning=F}
CK$ck.100 <- ifelse(CK$ck>100, 'CK > 100', 'CK <= 100')
```

Create the table:

```{r,warning=F}
pp <- table(CK$ck.100,CK$disease)
pp
```

Reverse the order:

```{r,warning=F}
dat <- as.table(matrix(c(26,34,1,58), nrow = 2, byrow = TRUE))
colnames(dat) <- c("AMI+","AMI-")
rownames(dat) <- c("Test+","Test-")
dat
```

Calculate sensitivity, specificity, PPV,
NPV of the diagnostic test:

```{r,warning=F}
rval <- epi.tests(dat, conf.level = 0.95)
print(rval)
```

Change the cutoff:

```{r,warning=F}
CK$ck.80 <- ifelse(CK$ck>80, 'CK > 80', 'CK <= 80') 
```

Create the table:

```{r,warning=F}
pp <- table(CK$ck.80,CK$disease)
pp
```


```{r,warning=F}
dat <- as.table(matrix(c(27,48,0,44), nrow = 2, byrow = TRUE))
colnames(dat) <- c("AMI+","AMI-")
rownames(dat) <- c("Test+","Test-")
dat

rval <- epi.tests(dat, conf.level = 0.95)
print(rval)
```

We can observe that sensibility is improved but specificity is worse. 

## Exercise (e)

As part of a prostate cancer screening program, blood samples from 100.000 men have been analyzed
looking for a biomarker known to be related to the disease.  In total 4000 tested positive, but fortunately only 800 of these cases were then confirmed with biopsy (a much more reliable examination but also long and expensive). Of the 96.000 negative screening tests, 100 individuals
have then developed the tumor within 12 months and are therefore to be considered false negatives.
Calculate and comment sensitivity, specificity, PPV, NPV of the diagnostic test.

# ROC curves for diagnostic tests

For this example we will use the R package *pROC*. 
It builds a ROC curve and returns a *roc* object, a list of class *roc*. 
This object can be printed, plotted, or passed to the functions *auc*, *ci*, *smooth.roc* and *coords*. 
Additionally, two roc objects can be compared with the function *roc.test*.
Data can be provided as response, predictor, where the predictor is the numeric (or ordered) level of 
the evaluated risk factor, and the response encodes the observation class (control or case). 
The level argument specifies which response level must be taken as controls (first value of level) 
or cases (second). It can safely be ignored when the response is encoded as 0 and 1, but 
it will frequently fail otherwise. By default, the first two values of *levels(as.factor(response))* 
are taken, and the remaining levels are ignored. This means that if your response is coded 
"control" and "case", the levels will be inverted.

```{r warning=FALSE,message=FALSE}
library(pROC)
```

## Subarachnoid hemorrhage data

This dataset summarizes several clinical and one laboratory variable of 113 patients with an aneurysmal subarachnoid hemorrhage. The purpose of the case presented here is to identify patients at risk of poor post-aSAH outcome, as they require specific healthcare management; therefore the clinical test must be highly specific. 
Detailed results of the study are reported in [1]. We only outline the features relevant to the ROC analysis.

```{r}
data(aSAH)
head(aSAH)
```

```{r warning=FALSE,message=FALSE}
with(aSAH, table(gender, outcome))
```

Let's estimate the ROC curve for the parameter s100b, a biomarker: 

```{r warning=FALSE,message=FALSE}
roc(aSAH$outcome, aSAH$s100b,levels=c("Good", "Poor"))
```

Inspect the ROC plot and report the 95% CI: 

```{r warning=FALSE,message=FALSE}
roc(aSAH$outcome, aSAH$s100b,percent=TRUE, plot=TRUE, ci=TRUE)

```

We can also inspect the confidence intervals around our estimates, point by point: 

```{r warning=FALSE,message=FALSE}
rocobj <- plot.roc(aSAH$outcome, aSAH$s100b)
ci.sp.obj <- ci.sp(rocobj, sensitivities=seq(0, 1, .01), boot.n=100)
plot(rocobj)
plot(ci.sp.obj, type="shape", col="blue")
```

We can also compare the performance of two different biomarkers: 

```{r warning=FALSE,message=FALSE}
roc.test(response = aSAH$outcome, predictor1= aSAH$wfns, predictor2 = aSAH$s100)

```

And plot the two ROC curves in one graph: 

```{r warning=FALSE,message=FALSE}
roc.s100b <- roc(aSAH$outcome, aSAH$s100b)
roc.wfns <- roc(aSAH$outcome, aSAH$wfns)
plot(roc.s100b)
plot(roc.wfns, add=TRUE, col="red")
```


## Determining an "optimal" cut-off

Load the required library: 

```{r warning=FALSE,message=FALSE}
library(OptimalCutpoints)
```

The package *Optimal.cutpoints* calculates optimal cutpoints in diagnostic tests. Several methods or criteria for selecting optimal cutoffs have been implemented, including methods based on cost-benefit analysis and diagnostic test accuracy measures (Sensitivity/Specificity, Predictive Values and Diagnostic Likelihood Ratios) or prevalence (Lopez-Raton et al. 2014). As a basic example, we can use the Youden Index approach: 

```{r warning=FALSE,message=FALSE}
perf.s100b <-optimal.cutpoints(X ="s100b", status ="outcome", tag.healthy ="Good",
                                    methods = "Youden", data =aSAH, control = control.cutpoints(), 
                                    ci.fit = TRUE, 
                                    conf.level = 0.95, trace = FALSE)
perf.s100b
```

Therefore, 0.22 is the selected cut-off for the biomarker that maximizes the sum of Sensitivity and Specificity. 
But, pay always attention to the *shape* of the functional relationship between the biomarker and the risk of outcome...more on this in block 3. 


## References 

[1] Natacha Turck, Laszlo Vutskits, Paola Sanchez-Pena, Xavier Robin, Alexandre Hainard, Marianne Gex-Fabry, Catherine Fouda, Hadiji Bassem, Markus Mueller, Frédérique Lisacek, Louis Puybasset and Jean-Charles Sanchez (2010) "A multiparameter panel method for outcome prediction following aneurysmal subarachnoid hemorrhage". Intensive Care Medicine 36(1), 107-115. DOI: 10.1007/s00134-009-1641-y.

[2] Lopez-Raton, M., Rodriguez-Alvarez, M.X, Cadarso-Suarez, C. and Gude-Sampedro, F. (2014). OptimalCutpoints: An R Package for Selecting Optimal Cutpoints in Diagnostic Tests. Journal of Statistical Software 61(8), 1-36. URL http://www.jstatsoft.org/v61/i08/.


## Exercise (f)

Let's load the libraries:

```{r, warning=F, message=F}
library(tidyverse)
library(pROC)
library(MASS)
library(OptimalCutpoints)
library(epiR)
```


Load the data:


```{r}
diabetes <- read.csv(here("datasets","diabetes.csv"))
glimpse(diabetes)
```


We transform *Outcome* from an integer variable into a factor with categories "No" and "Yes":
 
 
```{r}
diabetes$Outcome = factor(diabetes$Outcome, labels=c("No", "Yes"))
table(diabetes$Outcome)
```
 
 
- estimate the ROC curve for the glucose biomarker;
- plot the curve;
- try to define an *optimal* cut-off, adopting the Youden Index approach;
- check the sensitivity, specificity, PPV and NPV of this cut-off;
- OPTIONAL: try to define a cut-off considering a different approach (e.g. "MinValueNPV" criterion $\implies$ `methods = "MinValueNPV"`)

# Odds and probability 

## Relationship between probabilities and odds 


Much of the statistical modeling done in epidemiology has involved transforming
probabilities (risks), which are constrained to 0 and 1, into outcomes that are amenable to *linear models* that can range from negative to positive infinity.
One useful approach to this problem has been the logistic transformation. It involves
two steps. First, we take our risk measurement, and extend it to include values
beyond 1. We do this by taking the odds. Second, we will take the log of the odds. This effectively establishes a linear model that can range from minus to plus infinity.
Odds has a meaning related with probability. If *p* is the probability, *p/(1-p)* is
known as the odds. Conversely, the probability would be equal to *odds/(odds+1)*.
While a probability always ranges from 0 to 1, an odds ranges from 0 to infinity.


```{r}
risk    <- seq(0,0.99,by=0.01)
odds <- risk/(1-risk)
plot(risk, odds)
```

If we restrict our attention to the range of probability less than 10% we can note that these two quantities are very close: 

```{r}
plot(risk[1:10], odds[1:10])
abline(0,1)
```

We can therefore use two simple functions to move from probability to odds and from odds to probability:

```{r}
p2o <- function (p) p/(1-p)
o2p <- function (o) o/(1+o)
```


```{r}
curve(log(x/(1 - x)), 0, 1)
```


## Remind: Odds are *not* Odds Ratios 


Note that in logistic regression (see block 3), the non-intercept coefficients of the independent variables represent the (log) odds ratios, not the odds.
Let's suppose to have an independent variable X in the model that is binary (i.e. has only two possible values 1 and 2), then the regression coefficient corresponding to X is: 

$$
OR = \frac{Odds_1}{Odds_2} = \frac{\frac{p_1}{1-p_1}}{\frac{p_2}{1-p_2}}
$$

We will see the interpretation of the regression coefficients of the logistic regression also in the case when X is a numerical variable or a categorical variable with more than two levels. 

## The effectsize R package 

The *effectsize* package contains also functions to convert among indices of effect size. 
This can be useful for meta-analyses, or any comparison between different types of statistical analyses.

```{r}
library(effectsize)
```

### Converting Between Odds Ratios and Risk Ratios (Relative Risks) 

As we have discussed, odds ratio, although popular, are not very intuitive in their interpretations.
We don't often think about the chances of catching a disease in terms of *odds*,
instead we tend to think in terms of *probability* or some event - or
the *risk*. 

For example, if we find that for individual suffering from a migraine, for every
bowl of brussels sprouts they eat, their odds of reducing the migraine
increase by an $OR = 3.5$ over a period of an hour. So, should people eat
brussels sprouts to effectively reduce pain? Well, hard to say... Maybe if we
look at *RR* we'll get a clue.

We can convert between *OR* and *RR* with the following formula, but always pay attention to the fact that depending on the baseline risk these two measures can be quite different one from the other!

$$
RR = \frac{OR}{(1 - p0 + (p0 \times OR))}  
$$
Where $p0$ is the base-rate risk - the probability of the event without the
intervention (e.g., what is the probability of the migraine subsiding within an
hour without eating any brussels sprouts). 

If it the base-rate risk is, say, 85% (very high!), we get a *RR* of:

```{r}
OR <- 3.5
baserate <- 0.85

(RR <- oddsratio_to_riskratio(OR, baserate))
```

That is - for every bowl of brussels sprouts, we increase the chances of
reducing the migraine by a mere 12%! Is if worth it? Depends on you affinity to
brussels sprouts...

Note that the base-rate risk is crucial here. If instead of 85% it was only 4%,
then the *RR* would be:

```{r}
oddsratio_to_riskratio(OR, 0.04)
```

That is - for every bowl of brussels sprouts, we increase the chances of
reducing the migraine by a whopping 318%! Is if worth it? I guess that still
depends on your affinity to brussels sprouts...
If the outcome is not rare (>10%), it is better to use RR directly (if the study design permits!), as it provides a more accurate estimate of the association.

# Measures of association

## Introduction 

An important task in epidemiology is to quantify the *strength* of association between exposures and outcomes. In this context the term *exposure* is taken to mean a variable whose association with the outcome is to be estimated.

Exposures can be harmful, beneficial or both harmful and beneficial (e.g. if an immunisable disease is circulating, exposure to immunising agents helps most recipients but may harm those who experience adverse reactions). The term *outcome* is used to describe all the possible results that may arise from exposure to a factor or from preventive or therapeutic interventions.

In the following examples we will consider only binary exposure factors (in block 3 we will extend to exposures on a quantitative scale, through the use of the appropriate regression models). 


## Relative Risk 

RR (relative risk) is the ratio of the risk of getting disease among the exposed compared with that among the non-exposed. 
It indicates how many times the risk would increase had the subject changed their status from non-exposed to exposed. The increment is considered in fold, thus has a mathematical notation of being a *multiplicative model*.

Load the R libraries that we will use in this session: 

```{r warning=FALSE,message=FALSE}
library(epiR)
library(epitools)
```

First of all, let's construct the *contingency table* of our study: the exposure factor is type of behaviour and the disease of interest is the occurrence of a coronary heart problem. 

```{r}
data.type <- matrix(c(178,79,1141,1486), 2, 2)
dimnames(data.type) <- list("Behavior type" = c("Type A", "Type B"),
                      "Occurrence of CHD Event" = c("Yes", "No"))
data.type
```

The aim is to calculate the Relative Risk and corresponding 95% CI. (We will come back to the 95% CI also later in Block 2 when discussing sample size determination). 
Here we use a function from the *epiR* library, where they call the the Relative Risk *Incidence risk ratio*. See the help of the function for all the details. Of note, the option *method* refers to the *study design* that has generated the data under study. We will see in more detail in Block 2 differences between cohort/case-controls and cross-sectional studies. 

```{r}
epi.2by2(data.type, method = "cohort.count")
```

Therefore, the relative risk of a CHD event is 2.67 times for the behavior type "A" with respect to "B". 

## Odds Ratio

As we have already seen in a dedicated note, odds has a meaning related with probability. If *p* is the probability, *p/(1-p)* is known as the odds.
First of all, also here let's build the contingency table: now the exposure factor is coffee drinking (equal or more than 1 cup per day vs zero) and the disease of interest is pancreatic cancer. 

```{r}
data.coffee <- matrix(c(347,20,555,88), 2, 2)
dimnames(data.coffee) <- list("Coffee Drinking (cups per day)" = c(">=1", "0"),
                       "Pancreatic cancer" = c("cases", "controls"))
data.coffee
```

We want to estimate the odds ratio and corresponding 95% CI: the function *oddsratio* calculates the odds ratio by the median-unbiased estimation (mid-p), the conditional maximum likelihood estimation (Fisher), the unconditional maximum likelihood estimation (Wald), and the small sample adjustment (small). The reason of these various approach is related to the computation of the confidence intervals, that are calculated using exact methods (mid-p and Fisher), normal approximation (Wald), and normal approximation with small sample adjustment (small). If you are interested in these boring statistical details, see the help of the function for all the details about the default values. 

```{r}
oddsratio(data.coffee, method="wald")
```

The odds ratio of the disease in the exposed is 2.75 w.r.t the unexposed.  

## Absolute effect and attributable risks 

In this library, they call *attributable risk* the risk of disease in the exposed group minus the risk of disease in the unexposed group (we have defined it in our slides the *excess risk*).This provides a measure of the absolute increase or decrease in risk associated with exposure.Risk difference suggests the amount of risk gained or lost had the subject *changed* from non-exposed to exposed. The increase is absolute, and has the
mathematical notation of an additive model. As we have discussed, the risk difference has more public health implications than the relative risk. A high relative risk may not be of public health importance if the disease is very rare. The risk difference, on the other hand, measures direct health burden and the need for health
services.
In this library there is also a quantity called *Attrib risk in population*, the difference between the risk in the total population minus the risk in the unexposed. 
Of note, in this library they call *Attrib fraction in population (%)* the fraction of all cases of D in the population that can be attributed to E (we call it simply *attributable risk*).
And finally, they call *Attrib fraction in exposed (%)* the proportion of cases attributable to exposure in the exposed population (i.e. when considering the only population on which exposure can act).
(We call it *attributable risk in the exposed*).

We now use as an example the low infant birth weight data presented by Hosmer and Lemeshow (2000) and available in the MASS package in R. The birthwt data frame has 189 rows and 10 columns. The data were collected at Baystate Medical Center, Springfield, Massachusetts USA during 1986.

```{r}
library(MASS)
bwt <- birthwt
head(bwt)
```

Each row of this data set represents data for one mother. We're interested in the association between smoke (the mother's smoking status during pregnancy) and low (delivery of a baby less than 2.5 kg bodyweight). It is important that the table you present to the *epi.2by2* function is in the correct format: disease positives in the first column, disease negatives in the second column, exposure positives in the first row and exposure negatives in the second row. 


```{r}
bwt$low    <- factor(bwt$low, levels = c(1,0))
bwt$smoke  <- factor(bwt$smoke, levels = c(1,0))
bwt$race   <- factor(bwt$race, levels = c(1,2,3))
```

Generate the 2 by 2 table. Exposure (rows) = smoke, outcome (columns) = low:

```{r}
low.tab <- table(bwt$smoke, bwt$low, dnn = c("Smoke", "Low BW"))
low.tab
```

Compute the measures of association for smoking and delivery of a low birth weight baby:

```{r}
epi.2by2(dat = low.tab, method = "cohort.count", conf.level = 0.95, 
   units = 100, outcome = "as.columns")
```

The estimated excess risk is 15.3, i.e. we would expect the low birth weight to increase by 15% if all women smoked as compared to all women not smoking. The population attributable risk is estimated at 6%, the percent of cases in the total population that can be attributed to smoking. 
The attributable fraction in the population is 19%, i.e. the fraction of all cases of low birth weight that can be attributed to smoking (the naive interpretation is that low birth weight could be reduced by 19% if all mothers stop smoking). Finally, the attributable risk in the exposed is estimated at 38%, i.e. 38% of the low birth weight that occurred among women who smoked could be attributed to smoking (assuming causality). 


## Supplementary/Optional Material !!!
   
### Number needed to treat 

The Number Needed to Treat (NNT) is a measure of treatment effectiveness mostly used in clinical trials with binary outcomes. The NNT is the number of patients that need to be treated for one to benefit compared with a control. More precisely, the NNT is the number of patients that need to be treat to avoid some bad outcome. Usually, the NNT is calculated as the inverse of the absolute risk reduction, i.e., the difference between the experimental *success* event rate (EER) and the control *success* event rate (CER):

NNT = 1 / (EER - CER)

It is quite obvious that the calculation is easy for categorical event rates. 
(However, when you deal with continuous measures it may be difficult to calculate the NNT).

You may also want to calculate the Number Needed to Harm (NNH), which is the converse of the categorical NNT. 
The NNH is the number of patients that need to receive a treatment for one additional adverse event to occur, such as a severe side effect or a real harm, until the death. The NNH is calculated on the basis of the Absolute Risk Increase (ARI), which is the difference between the event rate *for the adverse event* in the experimental group and that in the control group.

ARI = Adv.Ev.Exp.R - Adv.Ev.C.R

NNH = 1 / ARI

A treatment should always have a NNH higher than the NNT, and a large ratio between the two (NNH consistently higher than the NNT) is desirable.

From the NNT and the NNH the likelihood of being helped or harmed (LHH) can also be calculated, according to the formula:

LHH = (1/NNT) / (1/NNH)

A LHH = 3 means that the patients are 3 time more likely to benefit from the treatment than to be harmed by it.

Let's input some data : 

s.e : sample size in the experimental group
s.c : sample size in the control group
b.e : How many participants experienced the bad outcome in the experimental group
b.c : How many participants experienced the bad outcome in the control group

```{r}
s.e <- 100  
s.c <- 100
b.e <- 20
b.c <- 60
```

Formula to calculate NNT from categorical data:

```{r}
EER <- round((b.e/s.e),2)      
CER <- round((b.c/s.c),2)        
ARR <- CER - EER 
ARR
```


```{r}
nnt1 <- 1/ARR 
nnt1 <- round(nnt1,3)
nnt1
```


Now another example to calculate NNH: 

s.e : sample size in the experimental group
s.c : sample size in the control group
a.e : How many participants experienced the adverse event in the experimental group
a.c : How many participants experienced the adverse event in the control group


```{r}
s.e <- 100 
s.c <- 100
a.e <- 30
a.c <- 10
```
  

```{r}
Adv.Ev.Exp.R <- round((a.e/s.e),2)     
Adv.Ev.C.R   <- round((a.c/s.c),2)     
ARI          <- Adv.Ev.Exp.R - Adv.Ev.C.R
ARI 
```

```{r}
nnh <- 1/ARI 
nnh <- round(nnh,3) 
nnh
```

Finally, let's evaluate the LHH:

```{r}
 LHH = (1/nnt1)/(1/nnh)    
  
lhh <- round(LHH,3) 
lhh
```


## Exercises

### Exercise (g)

Assuming that the risk of some disease is 1 in 100.000 among unexposed people,
create a table of attributable fractions for exposures that increase the risk to 1 in
50.000, 1 in 10.000 and 1 in 1000.

### Exercise (h)

Let's say that the probability (prevalence) to be exposed for the previous data is 5%. Add a column to the matrix created above for evaluating the population attributable risk. 


### Exercise (i)

In the Scottish Health cohort Study 85 of 1821 people who lived in rented apartments had coronary artery disease, compared to 77 of 2477 people who owned their homes. 
Create a 2-by-2 matrix object from this data. Calculate the risks, relative risks, odds and odds ratios for these data. Comment the results.

### Exercise (j)

Let's load the University Group Diabetes Program (UGDP) data. This
was a placebo-controlled, multi-center randomized clinical trial (we will see in block 2 definition of this kind of study design) from the early 1960's
intended to establish the efficacy of treatments for type 2 diabetes. There were a total of 409 patients.
204 patients received tolbutamide. 205 patients received placebo. 30 of the 204
tolbutamide patients died. 21 of the placebo patients died.
There are three variables of interest: Status (Survivor or Death), Treatment
(Placebo or Tolbutamide) and Age Group (older or younger than 55). 

Calculate the relative risk and the odds ratio for the association of tolbutamide with death and comment the results (for the overall population). Comment the results.