G-computation

In the slides we have introduced the basic idea of the non-parametric g-formula as a way to estimate the ATE (Average Treatment Effect) in observational studies when there is a binary exposure, a binary outcome and a binary confounder variable. In real applications we can have much more complicated situations, in which there are multiple confounders (possibly on numerical scales), as well as continuous exposure… etc.

In these situations the basic idea in causal inference is to switch to a parametric approach that rely on regression modelling (we will see examples in Block 3). Obviously the statistical solutions work if we assume that the regression model is correct (i.e. correspond to the data generating mechanism) and that we are considering all possible confounders.

The data we use now come from a study in which the treatment (T) was a drug that should reduce alcohol drinking in women living with HIV.

But here we are not interested in the treatment T, the scientific question of interest is whether reducing drinking has an effect on suppressed viral load.

Let’s load the dataset:

load("C:\\Users\\13474\\OneDrive - Università degli Studi di Trieste\\Stat_Learning_Epi\\AA 24_25\\Block 2\\R codes\\whatifdat.RData")
head(whatifdat)
##   T A H Y
## 1 0 0 0 0
## 2 0 0 0 0
## 3 0 0 0 0
## 4 0 0 0 0
## 5 0 0 0 0
## 6 0 0 0 0

In the dataset T = 1 indicates the drug and T = 0 placebo; A = 1 denotes reduced drinking (less than or equal to 7 drinks per week) (A=0 indicates not reduced), H = 1 denotes unsuppressed HIV viral load (≥ 200 copies/ml) at baseline, and Y = 1 denotes unsuppressed HIV viral load at month four (our outcome of interest).

We want to investigate whether reducing drinking (A=1 vs A=0) is associated with unsuppressed viral load at month 4 (Y=1): we observe that 27.6% who reduced drinking have unsuppressed viral load at month 4, versus 40% who did not (p value = 0.1, borderline significant if we consider the small sample size).

library(dplyr)
## 
## Caricamento pacchetto: 'dplyr'
## I seguenti oggetti sono mascherati da 'package:stats':
## 
##     filter, lag
## I seguenti oggetti sono mascherati da 'package:base':
## 
##     intersect, setdiff, setequal, union
library(knitr)
library(ggplot2)
library(boot)
library(gmodels)
CrossTable(whatifdat$A, whatifdat$Y)
## 
##  
##    Cell Contents
## |-------------------------|
## |                       N |
## | Chi-square contribution |
## |           N / Row Total |
## |           N / Col Total |
## |         N / Table Total |
## |-------------------------|
## 
##  
## Total Observations in Table:  165 
## 
##  
##              | whatifdat$Y 
##  whatifdat$A |         0 |         1 | Row Total | 
## -------------|-----------|-----------|-----------|
##            0 |        36 |        24 |        60 | 
##              |     0.549 |     1.160 |           | 
##              |     0.600 |     0.400 |     0.364 | 
##              |     0.321 |     0.453 |           | 
##              |     0.218 |     0.145 |           | 
## -------------|-----------|-----------|-----------|
##            1 |        76 |        29 |       105 | 
##              |     0.314 |     0.663 |           | 
##              |     0.724 |     0.276 |     0.636 | 
##              |     0.679 |     0.547 |           | 
##              |     0.461 |     0.176 |           | 
## -------------|-----------|-----------|-----------|
## Column Total |       112 |        53 |       165 | 
##              |     0.679 |     0.321 |           | 
## -------------|-----------|-----------|-----------|
## 
## 

A problem with this comparison is that the apparent effect of reducing drinking might be confounded. Specifically, it may depend on H, i.e. the unsuppressed HIV viral load measured at baseline (pre-treatment/exposure condition).

xtabs(~Y+A+H, data=whatifdat)
## , , H = 0
## 
##    A
## Y    0  1
##   0 30 63
##   1  6  7
## 
## , , H = 1
## 
##    A
## Y    0  1
##   0  6 13
##   1 18 22

If we stratify with respect to H, in order to obtain the conditional average treatment effect, we obtain : (note that here I use the symbol “E” that is “Expectation” like “Probability (that Y=1)”):

E(Y|H = 0, A = 0) = 6/36 = 0.167, and E(Y|H = 0, A = 1) = 7/70 = 0.100.

Thus for the group with suppressed viral load at baseline (H=0), reducing drinking leads to a reduction in the chance of unsuppressed viral load at 4 months from 16.7% to 10%.

Similarly we estimate E(Y|H = 1, A = 0) = 18/24 = 0.750, and E(Y|H = 1, A = 1) = 22/35 = 0.629.

Thus for the group with unsuppressed viral load at baseline (H=1), reducing drinking leads to a reduction in the chance of unsuppressed viral load at 4 months from 75% to 62.9%.

Now, if we want to estimate the global average treatment effect (ATE) taking into account the confounder H, we can apply the non-parametric g-formula as follows: 106/165 is the relative weight of the H=0 group, 59/165 is the relative weight of the H=1 group, and the standardized effect of A=1 and A=0 is:

effect_A.1 <- (7/70)*(106/165)+(22/35)*(59/165)
effect_A.0 <- (6/36)*(106/165)+(18/24)*(59/165)
effect_A.1
## [1] 0.2890043
effect_A.0
## [1] 0.3752525

We observe now that reducing drinking leads to an expected average reduction of unsuppressed viral load in the study population (given the observed distribution of the confounder) from 37.5% to 28.9%.

The non-parametric g-formula, while a powerful tool in causal inference, does come with certain limitations, particularly when applied in real-world scenarios. Here’s a breakdown of some key challenges: