In the slides we introduced the basic idea of the non-parametric g-formula as a way to estimate the ATE (Average Treatment Effect) in observational studies with a binary exposure, a binary outcome and a binary confounder. In real applications the situation can be much more complicated: there may be multiple confounders (possibly on numerical scales), a continuous exposure, and so on.
In these situations the basic idea in causal inference is to switch to a parametric approach that relies on regression modelling (we will see examples in Block 3). Of course, these statistical solutions work only if we assume that the regression model is correct (i.e. corresponds to the data generating mechanism) and that we are considering all relevant confounders.
The data we use now come from a study in which the treatment (T) was a drug intended to reduce alcohol drinking in women living with HIV.
Here, however, we are not interested in the treatment T: the scientific question of interest is whether reducing drinking has an effect on suppressed viral load.
Let’s load the dataset:
load("C:\\Users\\13474\\OneDrive - Università degli Studi di Trieste\\Stat_Learning_Epi\\AA 24_25\\Block 2\\R codes\\whatifdat.RData")
head(whatifdat)
## T A H Y
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0
In the dataset T = 1 indicates the drug and T = 0 placebo; A = 1 denotes reduced drinking (at most 7 drinks per week; A = 0 indicates not reduced), H = 1 denotes unsuppressed HIV viral load (≥ 200 copies/ml) at baseline, and Y = 1 denotes unsuppressed HIV viral load at month four (our outcome of interest).
We want to investigate whether reducing drinking (A = 1 vs A = 0) is associated with unsuppressed viral load at month 4 (Y = 1): we observe that 27.6% of those who reduced drinking have unsuppressed viral load at month 4, versus 40% of those who did not (p-value = 0.1, which can be viewed as borderline given the small sample size; a quick check of this p-value is sketched after the table below).
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
##     filter, lag
## The following objects are masked from 'package:base':
##
##     intersect, setdiff, setequal, union
library(knitr)
library(ggplot2)
library(boot)
library(gmodels)
CrossTable(whatifdat$A, whatifdat$Y)
##
##
## Cell Contents
## |-------------------------|
## | N |
## | Chi-square contribution |
## | N / Row Total |
## | N / Col Total |
## | N / Table Total |
## |-------------------------|
##
##
## Total Observations in Table: 165
##
##
## | whatifdat$Y
## whatifdat$A | 0 | 1 | Row Total |
## -------------|-----------|-----------|-----------|
## 0 | 36 | 24 | 60 |
## | 0.549 | 1.160 | |
## | 0.600 | 0.400 | 0.364 |
## | 0.321 | 0.453 | |
## | 0.218 | 0.145 | |
## -------------|-----------|-----------|-----------|
## 1 | 76 | 29 | 105 |
## | 0.314 | 0.663 | |
## | 0.724 | 0.276 | 0.636 |
## | 0.679 | 0.547 | |
## | 0.461 | 0.176 | |
## -------------|-----------|-----------|-----------|
## Column Total | 112 | 53 | 165 |
## | 0.679 | 0.321 | |
## -------------|-----------|-----------|-----------|
##
##
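As a quick check of the crude comparison quoted above (27.6% vs 40%), we can reproduce the p-value with a standard chi-square test on the same 2×2 table. This is just an illustrative sketch, not part of the original output.

# Crude (unadjusted) association between A and Y; without the continuity
# correction the chi-square test gives a p-value close to 0.10, as quoted above.
prop.table(table(whatifdat$A, whatifdat$Y), margin = 1)   # row proportions: P(Y | A)
chisq.test(table(whatifdat$A, whatifdat$Y), correct = FALSE)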
A problem with this comparison is that the apparent effect of reducing drinking might be confounded. Specifically, it may depend on H, i.e. the unsuppressed HIV viral load measured at baseline (pre-treatment/exposure condition).
xtabs(~Y+A+H, data=whatifdat)
## , , H = 0
##
## A
## Y 0 1
## 0 30 63
## 1 6 7
##
## , , H = 1
##
## A
## Y 0 1
## 0 6 13
## 1 18 22
If we stratify with respect to H, in order to obtain the conditional (stratum-specific) effects, we obtain the following (note that the symbol E denotes the expectation, which for a binary Y is simply the probability that Y = 1):
E(Y|H = 0, A = 0) = 6/36 = 0.167, and E(Y|H = 0, A = 1) = 7/70 = 0.100.
Thus for the group with suppressed viral load at baseline (H=0), reducing drinking leads to a reduction in the chance of unsuppressed viral load at 4 months from 16.7% to 10%.
Similarly we estimate E(Y|H = 1, A = 0) = 18/24 = 0.750, and E(Y|H = 1, A = 1) = 22/35 = 0.629.
Thus for the group with unsuppressed viral load at baseline (H=1), reducing drinking leads to a reduction in the chance of unsuppressed viral load at 4 months from 75% to 62.9%.
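The stratum-specific risks computed above by hand can also be obtained directly from the three-way table, for instance with prop.table(). The following is a minimal sketch; the object names tab and cond_risk are just illustrative.

# Conditional risks P(Y = 1 | A, H): normalize the three-way table over Y
# (dimension 1), i.e. within each combination of A and H, then keep the Y = 1 slice.
tab <- xtabs(~ Y + A + H, data = whatifdat)
cond_risk <- prop.table(tab, margin = c(2, 3))["1", , ]
cond_risk   # matrix with rows = A (0/1) and columns = H (0/1), matching the values above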
Now, if we want to estimate the global average treatment effect (ATE) taking into account the confounder H, we can apply the non-parametric g-formula as follows: 106/165 is the relative weight of the H = 0 group, 59/165 is the relative weight of the H = 1 group, and the standardized risks under A = 1 and under A = 0 are:
effect_A.1 <- (7/70)*(106/165)+(22/35)*(59/165)
effect_A.0 <- (6/36)*(106/165)+(18/24)*(59/165)
effect_A.1
## [1] 0.2890043
effect_A.0
## [1] 0.3752525
We now observe that, in the study population (given the observed distribution of the confounder), reducing drinking leads to an expected reduction of the proportion with unsuppressed viral load from 37.5% to 28.9%.
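The same standardized risks can be computed programmatically, rather than typing the counts by hand. Below is a minimal sketch using dplyr (loaded above); the object names risk_AH, w_H and std_risks are just illustrative.

# Non-parametric g-formula computed from the data: standardize the
# stratum-specific risks P(Y = 1 | A, H) over the marginal distribution of H.
risk_AH <- whatifdat %>%
  group_by(A, H) %>%
  summarise(risk = mean(Y), .groups = "drop")    # P(Y = 1 | A, H)

w_H <- whatifdat %>%
  count(H) %>%
  mutate(w = n / sum(n))                         # P(H = h)

std_risks <- risk_AH %>%
  left_join(w_H, by = "H") %>%
  group_by(A) %>%
  summarise(std_risk = sum(risk * w))            # sum_h P(Y = 1 | A, H = h) * P(H = h)

std_risks   # should reproduce 0.375 for A = 0 and 0.289 for A = 1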
The non-parametric g-formula, while a powerful tool in causal inference, does come with certain limitations, particularly when applied in real-world scenarios. Here’s a breakdown of some key challenges:
High Dimensionality: A core issue arises when dealing with numerous confounders. As the number of variables increases, the data becomes increasingly sparse. This “curse of dimensionality” can make it very difficult to obtain reliable estimates, as there may be insufficient data to accurately estimate the required conditional probabilities. Essentially, when you have many variables, you need a very large sample size to have enough data in each combination of those variables.
Standard Errors: When dealing with the non-parametric g-formula, calculating standard errors presents a significant challenge. Due to the non-parametric nature of the estimation, traditional analytical methods for standard error calculation are often not applicable (we are not dealing with simple proportions, since weights related to the confounding variable(s) are also involved). The most widely used method for estimating standard errors in this context is bootstrapping: we repeatedly resample from the observed data with replacement and recalculate the g-formula estimate on each resampled dataset; the standard deviation of these replicated estimates then serves as an estimate of the standard error. This method is particularly valuable because it makes minimal assumptions about the underlying data distribution, but it can be computationally expensive, especially in large datasets.
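As an illustration of the bootstrap approach, the following sketch uses the boot package loaded earlier: the statistic recomputes the g-formula standardized risk difference on each resampled dataset. The function name gform_rd and the choice of R = 1000 replications are arbitrary, illustrative choices, and the sketch assumes every (A, H) stratum remains non-empty in each resample (very likely with these data).

# Bootstrap standard error and percentile CI for the g-formula risk difference.
gform_rd <- function(data, indices) {
  d <- data[indices, ]                           # resampled dataset
  risk <- function(a) {
    # standardized risk under "set A = a": average P(Y = 1 | A = a, H = h) over P(H = h)
    p_h <- prop.table(table(d$H))
    sum(sapply(names(p_h), function(h) {
      mean(d$Y[d$A == a & d$H == as.numeric(h)]) * p_h[h]
    }))
  }
  risk(1) - risk(0)
}

set.seed(123)
boot_out <- boot(data = whatifdat, statistic = gform_rd, R = 1000)
boot_out                           # bootstrap SE of the standardized risk difference
boot.ci(boot_out, type = "perc")   # percentile confidence interval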