Causal inference versus prediction

Conceptually, in prediction, in a certain sense we make comparisons between outcomes across different combinations of values of input variables to predict the probability of an outcome. In causal inference, we ask what would happen to an outcome y as a result of a treatment or intervention. Predictive inference relates to comparisons between units (different groups/subjects). Causal inference addresses comparisons of different treatments when applied to the same unit.

The FEV dataset

The FEV, which is an acronym for Forced Expiratory Volume, is a measure of how much air a person can exhale (in liters) during a forced breath. In this dataset, the FEV of 606 children, between the ages of 6 and 17, were measured. The dataset also provides additional information on these children: their age, their height, their gender and, most importantly, whether the child is a smoker or a non-smoker (the exposure of interest). This is an observational cross-sectional study.

The goal of this study was to find out whether or not smoking has an effect on the FEV of children.

Load the required libraries

library(tidyverse)
library(DataExplorer)
library(SmartEDA)
library(ggplot2)
library(ggstatsplot)

Set the working directory and import data

setwd('C:\\Users\\Giulia Barbati\\OneDrive - Università degli Studi di Trieste\\Stat_Learning_Epi\\Block 3\\R codes')
fev <- read_tsv("fev.txt")
head(fev)
## # A tibble: 6 × 5
##     age   fev height gender smoking
##   <dbl> <dbl>  <dbl> <chr>    <dbl>
## 1     9  1.71   57   f            0
## 2     8  1.72   67.5 f            0
## 3     7  1.72   54.5 f            0
## 4     9  1.56   53   m            0
## 5     9  1.90   57   m            0
## 6     8  2.34   61   f            0

There are a few things in the formatting of the data that can be improved upon:

  1. Both the gender and smoking can be transformed to factors.
  2. The height variable is written in inches. Inches are hard to interpret. Let’s add a new column, height_cm, with the values converted to centimeter using the mutate function. (For this example we will not use this variable however).
fev <- fev %>%
  mutate(gender = as.factor(gender)) %>%
  mutate(smoking = as.factor(smoking)) %>%
  mutate(height_cm = height*2.54)

head(fev)
## # A tibble: 6 × 6
##     age   fev height gender smoking height_cm
##   <dbl> <dbl>  <dbl> <fct>  <fct>       <dbl>
## 1     9  1.71   57   f      0            145.
## 2     8  1.72   67.5 f      0            171.
## 3     7  1.72   54.5 f      0            138.
## 4     9  1.56   53   m      0            135.
## 5     9  1.90   57   m      0            145.
## 6     8  2.34   61   f      0            155.

Data Exploration

Now, let’s make a first explorative boxplot, showing only the FEV for both smoking categories.

fev %>%
  ggplot(aes(x=smoking,y=fev,fill=smoking)) +
  scale_fill_manual(values=c("dimgrey","firebrick")) +
  theme_bw() +
  geom_boxplot(outlier.shape=NA) + 
  geom_jitter(width = 0.2, size=0.1) +
  ggtitle("Boxplot of FEV versus smoking") +
  ylab("fev (l)") +
  xlab("smoking status")

Did you expect these results??

It appears that children that smoke have a higher median FEV than children that do not smoke. Should we change legislations worldwide and make smoking obligatory for children??

Maybe there is something else going on in the data. Now, we will generate a similar plot, but we will stratify the data based on age (age as factor).

fev %>%
  ggplot(aes(x=as.factor(age),y=fev,fill=smoking)) +
  geom_boxplot(outlier.shape=NA) +
  geom_point(width = 0.2, size = 0.1, position = position_jitterdodge()) +
  theme_bw() +
  scale_fill_manual(values=c("dimgrey","firebrick")) +
  ggtitle("Boxplot of FEV versus smoking, stratified on age") +
  ylab("fev (l)") +
  xlab("Age (years)")

This plot seems to already give us a more plausible picture. First, it seems that we do not have any smoking children of ages 6, 7 or 8. Second, when looking at the results per age “category”, it seems no longer the case that smokers have a much higher FEV than non-smokers; for the higher ages, the contrary seems true.

This shows that taking into account confounders (in this case) is crucial! If we simply analyse the dataset based on the smoking status and FEV values only, our inference might be incorrect:

fit1   <- lm(fev~smoking, data=fev)     # wrong beta0star, beta1star
fit2   <- lm(fev~age+smoking, data=fev) # "true" beta0, beta2, beta1
fitage <- lm(age~smoking, data=fev)     #  gamma0, gamma1
fit1
## 
## Call:
## lm(formula = fev ~ smoking, data = fev)
## 
## Coefficients:
## (Intercept)     smoking1  
##      2.6346       0.6054
fit2
## 
## Call:
## lm(formula = fev ~ age + smoking, data = fev)
## 
## Coefficients:
## (Intercept)          age     smoking1  
##      0.2040       0.2479      -0.2358
fitage
## 
## Call:
## lm(formula = age ~ smoking, data = fev)
## 
## Coefficients:
## (Intercept)     smoking1  
##       9.804        3.393
beta0     <- coef(fit2)[1]
beta2     <- coef(fit2)[2]
beta1     <- coef(fit2)[3]
gamma0    <- coef(fitage)[1]
gamma1    <- coef(fitage)[2]
beta0star <- coef(fit1)[1]
beta1star <- coef(fit1)[2]

check that: beta0star = beta0 + beta2*gamma0 :

beta0 + beta2*gamma0
## (Intercept) 
##    2.634628
beta0star 
## (Intercept) 
##    2.634628

and also check that (wrong) beta1star = beta1 + beta2*gamma1:

beta1 + beta2*gamma1
##  smoking1 
## 0.6054053
beta1star
##  smoking1 
## 0.6054053

Therefore, a beneficial estimated smoking effect is obtained when age is ignored. If the causal inference assumptions hold (see slides) we can consider beta1 as the average smoking effect in the population under study.

Factors that affect causal inference estimates

Imbalance and lack of complete overlap can make causal inference difficult. Remind : imbalance is when treatment groups differ with respect to an important covariate. Lack of complete overlap: when some combination of treatment level and covariate level is lacking (no observations, violation of the positivity assumption).

To explain fev, sex seems to matter, especially among older individuals:

plot(fev~age, col=gender, data=fev)
legend("topleft", pch=1, col=1:2, levels(fev$gender))

Another way to plot the same thing is:

fev %>%
  ggplot(aes(x=as.factor(age),y=fev,fill=smoking)) +
  geom_boxplot(outlier.shape=NA) +
  geom_point(width = 0.2, size = 0.1, position = position_jitterdodge()) +
  theme_bw() +
  scale_fill_manual(values=c("dimgrey","firebrick")) +
  ggtitle("Boxplot of FEV versus smoking, stratified on age and gender") +
  ylab("fev (l)") +
  xlab("smoking status") + 
  facet_grid(rows = vars(gender))

Especially for higher ages, the median FEV is higher for males as compared to females. [This could suggest a kind of interaction between gender and age, that could be explored in the regression model even if the interpretation of such interaction could be quite tricky (many levels for age!)].

Moreover, there is a slight gender imbalance among age categories:

counts      <- table(fev$gender, as.factor(fev$age))
percentages <- round(prop.table(table(fev$gender, as.factor(fev$age)),2),digits=3)

barplot(percentages, main="Gender distribution",
  xlab="ages", col=c("pink", "darkblue"),
  legend = rownames(counts))

For imbalanced samples, simple comparisons of sample means between groups are not good estimates of treatment/risk factors effects. A model adjustment is of course one way to better estimate a treatment effect, where we add the covariate to the model. In this case for example we can add also gender in the regression model:

fit3   <- lm(fev~age+smoking+gender, data=fev)
summary(fit3)
## 
## Call:
## lm(formula = fev ~ age + smoking + gender, data = fev)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.39229 -0.35677 -0.03623  0.33644  1.91571 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.088007   0.097635   0.901   0.3677    
## age          0.243707   0.009523  25.591  < 2e-16 ***
## smoking1    -0.180489   0.080950  -2.230   0.0261 *  
## genderm      0.295794   0.044683   6.620 7.97e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5465 on 602 degrees of freedom
## Multiple R-squared:  0.5681, Adjusted R-squared:  0.566 
## F-statistic:   264 on 3 and 602 DF,  p-value: < 2.2e-16

So now the effect of smoking is estimated conditional on both age and gender, and as expected is a negative effect on FEV; one possible problem however with these estimates could be that for some combinations of age-gender-smoking we do not have any observed data, so that the extrapolation of the regression model could be not so reliable. Also in this case (no interaction) if causal assumptions hold we can interpret the effect of smoking also as a marginal effect.

index           <- fev$smoking==0
counts.noS      <- table(fev[index,]$gender, as.factor(fev[index,]$age))
percentages.noS <- round(prop.table(table(fev[index,]$gender, as.factor(fev[index,]$age)),2),digits=3)

index1           <- fev$smoking==1
counts.S      <- table(fev[index1,]$gender, as.factor(fev[index1,]$age))
percentages.S <- round(prop.table(table(fev[index1,]$gender, as.factor(fev[index1,]$age)),2),digits=3)

par(mfrow=c(1,2))
barplot(percentages.noS, main="Gender distribution among non-smokers",cex.main=0.8, 
  xlab="ages", col=c("pink", "darkblue"),
  legend = rownames(percentages.noS))
barplot(percentages.S, main="Gender distribution among smokers",cex.main=0.8, 
  xlab="ages", col=c("pink", "darkblue"),
  legend = rownames(percentages.S))

Observe that in fact there is also a lack of “smoking” children below age 9: lack of complete overlap is when there are no observations at all for some combination(s) of treatment levels / covariate levels.

For lack of complete overlap, there is no data available for some comparisons. This requires extrapolation using a model to make comparisons. This is in fact the job that the regression model does, but again extrapolation is always a risky business for the model in regions where no data are available. This is an even more serious problem than imbalance.

Matching is a possible strategy in these situations to overcome (avoid) imbalance, even if some data will be discarded (see in the propensity score methods examples).

Supplemental material

Of note: we do not cover here how to obtain a marginal effect estimate from a regression model when for example there is an interaction. If you are interested in more details about these topics, you can explore the R library lmw at https://cran.r-project.org/web/packages/lmw/index.html and also the R library arm at https://cran.r-project.org/web/packages/arm/index.html.