library(tidyverse)
library(ggplot2)
An association is present if the probability of occurrence of an event or characteristic, or the quantity of a variable, varies with the occurrence of one or more other events, the presence of one or more other characteristics, or the quantity of one or more other variables (Porta M, A dictionary of Epidemiology).
Examples :
Suppose now that A is a binary risk factor (smoking yes/no) and Y is a binary outcome (disease yes/no).
A note : in this example and in the following we will disregard statistical issues (no need for p-values, confidence intervals etc..) just for didactic purposes, like if we are measuring all the target population instead of a sample.
Imagine that we aere investigating a population of 1000 subjects and we observe these data :
# Set seed for reproducibility
set.seed(42)
# Define sample size
n <- 1000
A <- c(rep(1, 30), rep(1, 150), rep(0, 70), rep(0, 750))
Y <- c(rep(1, 30), rep(0, 150), rep(1, 70), rep(0, 750))
# Combine into a dataframe
df <- data.frame(A, Y)
df <- df[sample(nrow(df)), ]
# Check of the proportions
# Create the basic table
cont_table <- table(A, Y)
# Assign the specific names for rows (A) and columns (Y)
rownames(cont_table) <- c("A=0", "A=1")
colnames(cont_table) <- c("Y=0", "Y=1")
# Display as counts
print("Counts:")
## [1] "Counts:"
print(cont_table)
## Y
## A Y=0 Y=1
## A=0 750 70
## A=1 150 30
# Display as proportions
print("Proportions:")
## [1] "Proportions:"
prop_table <- prop.table(cont_table)
print(prop_table)
## Y
## A Y=0 Y=1
## A=0 0.75 0.07
## A=1 0.15 0.03
Among all subjects, 3% are both exposed and have the outcome: the joint probability of exposure \(A\) and outcome \(Y\) is denoted as \(P(A \cap Y)\).
Based on our simulation parameters: \[P(A=1 \cap Y=1) = 0.03\] Among all subjects, 10% have the outcome: the marginal probability of \[P(Y=1) = 0.10\] Among the exposed subjects,17% have the outcome : \[P(Y=1 \mid A=1) = \frac{P(A=1 \cap Y=1)}{P(A=1)} = \frac{0.03}{0.03 + 0.15} \approx 0.17\]
The conditional probability of having the outcome, for exposed subjects, is 0.17.
When the proportion of individuals who develop the outcome in the exposed equals the proportion of individuals who develop the outcome in the unexposed, we say that exposure and outcome are independent, that Y is not associated with A, or that A does not impact Y :
\[P(Y=1 \mid A=1) =P(Y=1\mid A=0)=P(Y=1)\]
A and Y are dependent or associated when instead:
\[P(Y=1 \mid A=1) \neq P(Y=1\mid A=0)\neq P(Y=1)\]
Independence is represented by:
\(A \perp \!\!\! \perp Y\) or \(Y \perp \!\!\! \perp A\)
In this example, are A and Y associated?
prop.table(table(df$A, df$Y))
##
## 0 1
## 0 0.75 0.07
## 1 0.15 0.03
\[P(Y=1 \mid A=1) = \frac{P(A=1 \cap Y=1)}{P(A=1)} = \frac{0.03}{0.03 + 0.15} \approx 0.17\] \[P(Y=1 \mid A=0) = \frac{0.07}{0.07 + 0.75} \approx 0.08\] \[P(Y=1) = 0.10\]
\[P(Y=1 \mid A=1) \neq P(Y=1\mid A=0) \neq P(Y=1)\]
So we conclude that A and Y are associated.
Possible explanations for the observed association:
It is not possible to disentangle only from the data in which situation we are…we need a priori knowledge !!! More on this in Block 2 and 3.
Conceptually, the missing counterfactual outcome is one that would have occurred under a different set of circumstances. In causal inference, we wish we could observe the counterfactual outcome that would have occurred in an alternate universe where the exposure status for a given observation was flipped. To do this, we attempt to control for all factors that are related to an exposure and outcome such that we can construct (or estimate) such a counterfactual outcome.
Let’s suppose some happiness index, from 1-10 exists.
We are interested in assessing whether eating chocolate ice cream versus vanilla will increase happiness.
We have 10 individuals with two potential outcomes for each, one is what their happiness would be if they ate chocolate ice cream, (defined as y_chocolate in the code below), and one is what their happiness would be if they ate vanilla ice cream (defined as y_vanilla in the code below).
We can define the true causal effect of eating chocolate ice cream (versus vanilla) on happiness for each individual as the difference between the two:
data <- data.frame(
id = 1:10,
y_chocolate = c(4, 4, 6, 5, 6, 5, 6, 7, 5, 6),
y_vanilla = c(1, 3, 4, 5, 5, 6, 8, 6, 3, 5)
)
data <- data |>
mutate(causal_effect = y_chocolate - y_vanilla)
data
## id y_chocolate y_vanilla causal_effect
## 1 1 4 1 3
## 2 2 4 3 1
## 3 3 6 4 2
## 4 4 5 5 0
## 5 5 6 5 1
## 6 6 5 6 -1
## 7 7 6 8 -2
## 8 8 7 6 1
## 9 9 5 3 2
## 10 10 6 5 1
data |>
summarize(
avg_chocolate = mean(y_chocolate),
avg_vanilla = mean(y_vanilla),
avg_causal_effect = mean(causal_effect)
)
## avg_chocolate avg_vanilla avg_causal_effect
## 1 5.4 4.6 0.8
The causal effect of eating chocolate ice cream (versus vanilla) for individual 4 is 0, whereas the causal effect for individual 9 is 2.
The average potential happiness after eating chocolate is 5.4 and the average potential happiness after eating vanilla is 4.6.
The average treatment effect of eating chocolate (versus vanilla) ice cream among the ten individuals in this study is 0.8.
In reality, we cannot observe both potential outcomes, in any moment in time, each individual in our study can only eat one flavor of ice cream.
Suppose that we let our participants choose which ice cream they wanted to eat and each choose their favorite (i.e. they knew which would make them “happier” and picked that one).
Now what we observe is:
data_observed <- data |>
mutate(
exposure = case_when(
# people who like chocolate more chose that
y_chocolate > y_vanilla ~ "chocolate",
# people who like vanilla more chose that
y_vanilla >= y_chocolate ~ "vanilla"
),
observed_outcome = case_when(
exposure == "chocolate" ~ y_chocolate,
exposure == "vanilla" ~ y_vanilla
)
) |>
# we can only observe the exposure and one potential outcome!
select(id, exposure, observed_outcome)
data_observed
## id exposure observed_outcome
## 1 1 chocolate 4
## 2 2 chocolate 4
## 3 3 chocolate 6
## 4 4 vanilla 5
## 5 5 chocolate 6
## 6 6 vanilla 6
## 7 7 vanilla 8
## 8 8 chocolate 7
## 9 9 chocolate 5
## 10 10 chocolate 6
data_observed |>
group_by(exposure) |>
summarise(avg_outcome = mean(observed_outcome))
## # A tibble: 2 × 2
## exposure avg_outcome
## <chr> <dbl>
## 1 chocolate 5.43
## 2 vanilla 6.33
Now, the observed average outcome among those who ate chocolate ice cream is 5.4 (the same as the true average potential outcome), while the observed average outcome among those who ate vanilla is 6.3 – quite different from the actual average (4.6). The estimated effect here could be calculated as 5.4 - 6.3 = -0.9.
It turns out here, these 10 participants chose which ice cream they wanted to eat and they always chose to eat their favorite!
This artificially made it look like eating vanilla ice cream would increase the happiness in this population when in fact we know the opposite is true.
The next section will discuss which assumptions need to be true in order to allow us to accurately estimate causal effects using observed data.
As a sneak peak, our issue here was that how the exposure was decided, if instead we randomized who ate chocolate versus vanilla ice cream we would (on average, with a large enough sample) recover the true causal effect.
Since now we are doing something random so let’s set a seed so we always observe the same result each time we run the code:
set.seed(11)
data_observed <- data |>
mutate(
# change the exposure to randomized, generate from a binomial distribution
# with a probability 0.5 for being in either group
exposure = case_when(
rbinom(10, 1, 0.5) == 1 ~ "chocolate",
TRUE ~ "vanilla"
),
observed_outcome = case_when(
exposure == "chocolate" ~ y_chocolate,
exposure == "vanilla" ~ y_vanilla
)
) |>
# as before, we can only observe the exposure and one potential outcome
select(id, exposure, observed_outcome)
data_observed |>
group_by(exposure) |>
summarise(avg_outcome = mean(observed_outcome) )
## # A tibble: 2 × 2
## exposure avg_outcome
## <chr> <dbl>
## 1 chocolate 5.33
## 2 vanilla 4.71
As you can see, now we are very close to the true average causal effect in this population. How this happened???? More on this in block 2 (Study Design) and block 3 !