Conceptually, in prediction, in a certain sense we make comparisons between outcomes across different combinations of values of input variables to predict the probability of an outcome. In causal inference, we ask what would happen to an outcome y as a result of a treatment or intervention. Predictive inference relates to comparisons between units (different groups/subjects). Causal inference addresses comparisons of different treatments when applied to the same unit.
The FEV, which is an acronym for Forced Expiratory Volume, is a measure of how much air a person can exhale (in liters) during a forced breath. In this dataset, the FEV of 606 children, between the ages of 6 and 17, were measured. The dataset also provides additional information on these children: their age, their height, their gender and, most importantly, whether the child is a smoker or a non-smoker (the exposure of interest). This is an observational cross-sectional study.
The goal of this study was to find out whether or not smoking has an effect on the FEV of children.
Load the required libraries
library(tidyverse)
library(DataExplorer)
library(SmartEDA)
library(ggplot2)
library(ggstatsplot)
Set the working directory and import data
setwd('C:\\Users\\Giulia Barbati\\OneDrive - Università degli Studi di Trieste\\Stat_Learning_Epi\\Block 3\\R codes')
fev <- read_tsv("fev.txt")
head(fev)
## # A tibble: 6 × 5
## age fev height gender smoking
## <dbl> <dbl> <dbl> <chr> <dbl>
## 1 9 1.71 57 f 0
## 2 8 1.72 67.5 f 0
## 3 7 1.72 54.5 f 0
## 4 9 1.56 53 m 0
## 5 9 1.90 57 m 0
## 6 8 2.34 61 f 0
There are a few things in the formatting of the data that can be improved upon:
gender
and smoking
can be
transformed to factors.height
variable is written in inches. Inches are
hard to interpret. Let’s add a new column, height_cm
, with
the values converted to centimeter using the mutate
function. (For this example we will not use this variable however).fev <- fev %>%
mutate(gender = as.factor(gender)) %>%
mutate(smoking = as.factor(smoking)) %>%
mutate(height_cm = height*2.54)
head(fev)
## # A tibble: 6 × 6
## age fev height gender smoking height_cm
## <dbl> <dbl> <dbl> <fct> <fct> <dbl>
## 1 9 1.71 57 f 0 145.
## 2 8 1.72 67.5 f 0 171.
## 3 7 1.72 54.5 f 0 138.
## 4 9 1.56 53 m 0 135.
## 5 9 1.90 57 m 0 145.
## 6 8 2.34 61 f 0 155.
Now, let’s make a first explorative boxplot, showing only the FEV for both smoking categories.
fev %>%
ggplot(aes(x=smoking,y=fev,fill=smoking)) +
scale_fill_manual(values=c("dimgrey","firebrick")) +
theme_bw() +
geom_boxplot(outlier.shape=NA) +
geom_jitter(width = 0.2, size=0.1) +
ggtitle("Boxplot of FEV versus smoking") +
ylab("fev (l)") +
xlab("smoking status")
Did you expect these results??
It appears that children that smoke have a higher median FEV than children that do not smoke. Should we change legislations worldwide and make smoking obligatory for children??
Maybe there is something else going on in the data. Now, we will generate a similar plot, but we will stratify the data based on age (age as factor).
fev %>%
ggplot(aes(x=as.factor(age),y=fev,fill=smoking)) +
geom_boxplot(outlier.shape=NA) +
geom_point(width = 0.2, size = 0.1, position = position_jitterdodge()) +
theme_bw() +
scale_fill_manual(values=c("dimgrey","firebrick")) +
ggtitle("Boxplot of FEV versus smoking, stratified on age") +
ylab("fev (l)") +
xlab("Age (years)")
This plot seems to already give us a more plausible picture. First, it seems that we do not have any smoking children of ages 6, 7 or 8. Second, when looking at the results per age “category”, it seems no longer the case that smokers have a much higher FEV than non-smokers; for the higher ages, the contrary seems true.
This shows that taking into account confounders (in this case) is crucial! If we simply analyse the dataset based on the smoking status and FEV values only, our inference might be incorrect:
fit1 <- lm(fev~smoking, data=fev) # wrong beta0star, beta1star
fit2 <- lm(fev~age+smoking, data=fev) # "true" beta0, beta2, beta1
fitage <- lm(age~smoking, data=fev) # gamma0, gamma1
fit1
##
## Call:
## lm(formula = fev ~ smoking, data = fev)
##
## Coefficients:
## (Intercept) smoking1
## 2.6346 0.6054
fit2
##
## Call:
## lm(formula = fev ~ age + smoking, data = fev)
##
## Coefficients:
## (Intercept) age smoking1
## 0.2040 0.2479 -0.2358
fitage
##
## Call:
## lm(formula = age ~ smoking, data = fev)
##
## Coefficients:
## (Intercept) smoking1
## 9.804 3.393
beta0 <- coef(fit2)[1]
beta2 <- coef(fit2)[2]
beta1 <- coef(fit2)[3]
gamma0 <- coef(fitage)[1]
gamma1 <- coef(fitage)[2]
beta0star <- coef(fit1)[1]
beta1star <- coef(fit1)[2]
check that: beta0star = beta0 + beta2*gamma0 :
beta0 + beta2*gamma0
## (Intercept)
## 2.634628
beta0star
## (Intercept)
## 2.634628
and also check that (wrong) beta1star = beta1 + beta2*gamma1:
beta1 + beta2*gamma1
## smoking1
## 0.6054053
beta1star
## smoking1
## 0.6054053
Therefore, a beneficial estimated smoking effect is obtained when age is ignored. If the causal inference assumptions hold (see slides) we can consider beta1 as the average smoking effect in the population under study.
Imbalance and lack of complete overlap can make causal inference difficult. Remind : imbalance is when treatment groups differ with respect to an important covariate. Lack of complete overlap: when some combination of treatment level and covariate level is lacking (no observations, violation of the positivity assumption).
To explain fev, sex seems to matter, especially among older individuals:
plot(fev~age, col=gender, data=fev)
legend("topleft", pch=1, col=1:2, levels(fev$gender))
Another way to plot the same thing is:
fev %>%
ggplot(aes(x=as.factor(age),y=fev,fill=smoking)) +
geom_boxplot(outlier.shape=NA) +
geom_point(width = 0.2, size = 0.1, position = position_jitterdodge()) +
theme_bw() +
scale_fill_manual(values=c("dimgrey","firebrick")) +
ggtitle("Boxplot of FEV versus smoking, stratified on age and gender") +
ylab("fev (l)") +
xlab("smoking status") +
facet_grid(rows = vars(gender))
Especially for higher ages, the median FEV is higher for males as compared to females. [This could suggest a kind of interaction between gender and age, that could be explored in the regression model even if the interpretation of such interaction could be quite tricky (many levels for age!)].
Moreover, there is a slight gender imbalance among age categories:
counts <- table(fev$gender, as.factor(fev$age))
percentages <- round(prop.table(table(fev$gender, as.factor(fev$age)),2),digits=3)
barplot(percentages, main="Gender distribution",
xlab="ages", col=c("pink", "darkblue"),
legend = rownames(counts))
For imbalanced samples, simple comparisons of sample means between groups are not good estimates of treatment/risk factors effects. A model adjustment is of course one way to better estimate a treatment effect, where we add the covariate to the model. In this case for example we can add also gender in the regression model:
fit3 <- lm(fev~age+smoking+gender, data=fev)
summary(fit3)
##
## Call:
## lm(formula = fev ~ age + smoking + gender, data = fev)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.39229 -0.35677 -0.03623 0.33644 1.91571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.088007 0.097635 0.901 0.3677
## age 0.243707 0.009523 25.591 < 2e-16 ***
## smoking1 -0.180489 0.080950 -2.230 0.0261 *
## genderm 0.295794 0.044683 6.620 7.97e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5465 on 602 degrees of freedom
## Multiple R-squared: 0.5681, Adjusted R-squared: 0.566
## F-statistic: 264 on 3 and 602 DF, p-value: < 2.2e-16
So now the effect of smoking is estimated conditional on both age and gender, and as expected is a negative effect on FEV; one possible problem however with these estimates could be that for some combinations of age-gender-smoking we do not have any observed data, so that the extrapolation of the regression model could be not so reliable. Also in this case (no interaction) if causal assumptions hold we can interpret the effect of smoking also as a marginal effect.
index <- fev$smoking==0
counts.noS <- table(fev[index,]$gender, as.factor(fev[index,]$age))
percentages.noS <- round(prop.table(table(fev[index,]$gender, as.factor(fev[index,]$age)),2),digits=3)
index1 <- fev$smoking==1
counts.S <- table(fev[index1,]$gender, as.factor(fev[index1,]$age))
percentages.S <- round(prop.table(table(fev[index1,]$gender, as.factor(fev[index1,]$age)),2),digits=3)
par(mfrow=c(1,2))
barplot(percentages.noS, main="Gender distribution among non-smokers",cex.main=0.8,
xlab="ages", col=c("pink", "darkblue"),
legend = rownames(percentages.noS))
barplot(percentages.S, main="Gender distribution among smokers",cex.main=0.8,
xlab="ages", col=c("pink", "darkblue"),
legend = rownames(percentages.S))
Observe that in fact there is also a lack of “smoking” children below age 9: lack of complete overlap is when there are no observations at all for some combination(s) of treatment levels / covariate levels.
For lack of complete overlap, there is no data available for some comparisons. This requires extrapolation using a model to make comparisons. This is in fact the job that the regression model does, but again extrapolation is always a risky business for the model in regions where no data are available. This is an even more serious problem than imbalance.
Matching is a possible strategy in these situations to overcome (avoid) imbalance, even if some data will be discarded (see in the propensity score methods examples).
Of note: we do not cover here how to obtain a marginal effect estimate from a regression model when for example there is an interaction. If you are interested in more details about these topics, you can explore the R library lmw at https://cran.r-project.org/web/packages/lmw/index.html and also the R library arm at https://cran.r-project.org/web/packages/arm/index.html.