Load the required libraries:
library(haven)
library(tidyverse)
library(magrittr)
library(ggdag)
library(ggplot2)
library(marginaleffects)
Data comes from SHARE, the Survey of Health, Ageing and Retirement in Europe (https://share-eric.eu/data/). It is a multidisciplinary and cross-national panel database of data on health, socio-economic status and social and family networks of about 140000 individuals aged 50 or older (around 380000 interviews). SHARE covers 27 European countries and Israel.
The dataset we use here contains only part of the data the survey produced.
The research question is whether a stressful period could cause the
occurrence of muscular weakness in the over 50 population: therefore we
can consider it from the point of view of an explanatory
(causal) model, having a specific exposure of interest in mind : the
target is to estimate an effect.
This question could be of interest since muscular weakness can be
considered a proxy of poor physical health. Muscular strength was
defined as the hand grip measured with a dynamo-meter and the weakness
was defined as having a grip strength below a threshold calculated
according to BMI and sex.
The exposure group was defined as subjects who underwent a stressful period between the first and the last time they had been interviewed; all other subject were defined as non-exposed.
Of note, an inclusion criteria was to have a normal hand strength at the start of the study. Once the subjects were divided in the exposed and non-exposed group, the hand grip measurement after two years from the exposure definition was used to define the outcome.
In this example we will consider the following variables:
load("Logistic Regression examples.RData")
Before starting to fit any regression model we have to properly code categorical variables. This is usually done in the initial data analysis (IDA) and the descriptive statistics phase.
The first step consists in identifying which variables are categorical. This may seem as an easy step but it isn’t always. If fact the choice for some variables (i.e. indexes and scales) is not obvious and it requires either knowledge of the data we are analysing or a bit more effort on the data modelling.
The important message is that we can’t assume that it is correct to model a variable as numerical only because it was of type numerical/integer in the dataset imported in R.
Once we have identified the variables that we think are categorical we can process them. We can have three cases that leads to a slightly different procedure according to the R type of the variable:
Binary variables can either be treated as numerical, usually dummy 0/1 variables, or as factors. It does not make any difference in terms of the results obtained from the model but we have to be aware of how they are coded for a correct interpretation of the model.
In general, the model will always have 1 parameter for a binary variable. However, we have to choose which level the parameter should represent by deciding which group to code with 1. The choice should be made according to how we want the model to be interpreted. For example, for the treatment variable, 0 is usually the placebo or the control treatment and 1 is usually the experimental treatment. Alternatively, in the case of the covariates/exposure variables, we could simply use a statistical criteria: for a better stability of the model it is always better to use the most frequent category as a reference.
For example, it makes more sense to consider as the reference level of paid_job subjects who have had a paid job. This will help with the stability of the model but also it will ease the interpretation of the results. Therefore, we create a new variable no_paid_job:
datiSHAREStress$no_paid_job <- ifelse(datiSHAREStress$paid_job==0,1,0)
In this case we must transform the variable in a factor, assign labels to the factor levels and choose a reference level. The reference level will be the group all the others levels will be compared to. In fact when a factor variable with c levels is added to a model, R internally transforms it in c-1 dummy variables and a model parameter will be estimated for each of them. Again, is important we choose the reference level. In this case we have ses which we have to transform into a factor
datiSHAREStress$ses <- as.factor(datiSHAREStress$ses)
We can now check how R has coded the variable:
str(datiSHAREStress$ses)
## Factor w/ 4 levels "1","2","3","4": 4 4 4 4 4 4 4 2 4 4 ...
contrasts(datiSHAREStress$ses)
## 2 3 4
## 1 0 0 0
## 2 1 0 0
## 3 0 1 0
## 4 0 0 1
By default, R uses the group 1 as the reference level. With the contrasts() function we can see how the variable will be parametrized in the regression model. In the rows we have the variable levels and on the columns the parameters generated for this variable.
It is also useful to assign more meaningful labels:
datiSHAREStress$ses=factor(datiSHAREStress$ses,
labels = c("Great Difficulties",
"Some Difficulties",
"Fairly Easily",
"Easily"))
Last but not least we can set the reference category of our choice. In this case we choose “Easily” which is the most frequent category.
table(datiSHAREStress$ses)
##
## Great Difficulties Some Difficulties Fairly Easily Easily
## 191 813 1997 3148
datiSHAREStress$ses=relevel(datiSHAREStress$ses,ref="Easily")
If the variable is of type character the steps are similar to the previous ones with the exception that we won’t need to explicitly set the labels. As a reminder, by default R uses as a reference level the first level in alphabetical order.
We skip this part here, since I assume that you are already able to produce descriptive statistics of a dataset… Remind that you should pay particular attention to outliers, missing values, continuous variable’s distribution shapes, rare categories in categorical ones!
A common method used also in causal modelling as a first step to explore associations is fitting many models with one covariate at the time. This is also a way to begin to explore the dataset, even if as discussed it is not the suggested method to select the variables to be included in the final model, especially in the causal framework, where instead a DAG (Directed Acyclic Graph) should be discussed with experts (see later). For this example, we can start with the exposure to stress, our variable of interest. So we start by fitting the model:
fit_uni_stress <- glm(low_grip~stress,family = binomial,data=datiSHAREStress)
As a reminder, the glm function automatically uses the logit as link function unless we state otherwise.
The results of the model can be obtained with
summary(fit_uni_stress)
##
## Call:
## glm(formula = low_grip ~ stress, family = binomial, data = datiSHAREStress)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.5660 0.0501 -51.217 <2e-16 ***
## stress -0.5174 0.4204 -1.231 0.218
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3142.8 on 6148 degrees of freedom
## Residual deviance: 3141.1 on 6147 degrees of freedom
## AIC: 3145.1
##
## Number of Fisher Scoring iterations: 5
It seems that we can’t reject the null hypothesis for \(\beta_{stress}\).
We can also obtain the 95% Confidence Interval
confint.default(fit_uni_stress)
## 2.5 % 97.5 %
## (Intercept) -2.664221 -2.4678285
## stress -1.341440 0.3066133
We can fit more univariable models to look for the role of possible confounders in the exposure-outcome relationship.
fit_uni_age <- glm(low_grip~age,family = binomial,data=datiSHAREStress)
summary(fit_uni_age)
##
## Call:
## glm(formula = low_grip ~ age, family = binomial, data = datiSHAREStress)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.505544 0.419383 -22.67 <2e-16 ***
## age 0.101366 0.005822 17.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3142.8 on 6148 degrees of freedom
## Residual deviance: 2817.0 on 6147 degrees of freedom
## AIC: 2821
##
## Number of Fisher Scoring iterations: 6
fit_uni_job <- glm(low_grip~no_paid_job,family = binomial,data=datiSHAREStress)
summary(fit_uni_job)
##
## Call:
## glm(formula = low_grip ~ no_paid_job, family = binomial, data = datiSHAREStress)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.61992 0.05134 -51.030 < 2e-16 ***
## no_paid_job 1.13184 0.21544 5.254 1.49e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3142.8 on 6148 degrees of freedom
## Residual deviance: 3120.8 on 6147 degrees of freedom
## AIC: 3124.8
##
## Number of Fisher Scoring iterations: 5
fit_uni_sex <- glm(low_grip~female,family = binomial,data=datiSHAREStress)
summary(fit_uni_sex)
##
## Call:
## glm(formula = low_grip ~ female, family = binomial, data = datiSHAREStress)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.56532 0.07178 -35.738 <2e-16 ***
## female -0.01918 0.09955 -0.193 0.847
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3142.8 on 6148 degrees of freedom
## Residual deviance: 3142.8 on 6147 degrees of freedom
## AIC: 3146.8
##
## Number of Fisher Scoring iterations: 5
fit_uni_move <- glm(low_grip~house_move,family = binomial,data=datiSHAREStress)
summary(fit_uni_move)
##
## Call:
## glm(formula = low_grip ~ house_move, family = binomial, data = datiSHAREStress)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3735 0.1307 -18.161 <2e-16 ***
## house_move -0.2329 0.1413 -1.648 0.0993 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3142.8 on 6148 degrees of freedom
## Residual deviance: 3140.2 on 6147 degrees of freedom
## AIC: 3144.2
##
## Number of Fisher Scoring iterations: 5
fit_uni_ses <- glm(low_grip~ses,family = binomial,data=datiSHAREStress)
summary(fit_uni_ses)
##
## Call:
## glm(formula = low_grip ~ ses, family = binomial, data = datiSHAREStress)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.69592 0.07324 -36.809 < 2e-16 ***
## sesGreat Difficulties 0.70745 0.23408 3.022 0.00251 **
## sesSome Difficulties 0.37973 0.14288 2.658 0.00787 **
## sesFairly Easily 0.11084 0.11422 0.970 0.33182
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3142.8 on 6148 degrees of freedom
## Residual deviance: 3129.9 on 6145 degrees of freedom
## AIC: 3137.9
##
## Number of Fisher Scoring iterations: 5
In this univariable models the variable Sex is not significantly associated with the outcome. Also talking with the experts, they say that sex could not be considered as a confounder in this framework, so we can ignore it in the following analyses.
We can now fit a multivariable logistic model with the variables that were found to be associated with the outcome in the previous analysis and we suspect could act as confounders of the main exposure of interest. We obviously include in the model stress , since it is our exposure of interest (even if at univariable analysis the coefficient was not significant). We will treat the others variables are possible confounders in the model. So here our proposed DAG is :
# Define the DAG
dag_model <- dagify(
# Outcome affected by exposure and all confounders
low_grip ~ stress + no_paid_job + age + house_move + ses,
# Exposure affected by all confounders
stress ~ no_paid_job + age + house_move + ses,
exposure = "stress",
outcome = "low_grip",
labels = c(
"low_grip" = "Low Grip Strength",
"stress" = "Stress",
"no_paid_job" = "No Paid Job",
"age" = "Age",
"house_move" = "House Move",
"ses" = "SES"
)
)
# Plotting the DAG
ggdag_status(dag_model, use_labels = "label", text = FALSE) +
theme_dag() +
# Fixed legend syntax using labs() for simplicity
labs(fill = "Variable Role", color = "Variable Role") +
theme(legend.position = "bottom")
We fit now the model:
fit_multi <- glm(low_grip~stress+no_paid_job+age+house_move+ses,family = binomial,data=datiSHAREStress)
summary(fit_multi)
##
## Call:
## glm(formula = low_grip ~ stress + no_paid_job + age + house_move +
## ses, family = binomial, data = datiSHAREStress)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.486409 0.448437 -21.154 < 2e-16 ***
## stress -0.149441 0.438194 -0.341 0.733075
## no_paid_job 0.408889 0.233225 1.753 0.079568 .
## age 0.101474 0.005924 17.129 < 2e-16 ***
## house_move -0.196669 0.147181 -1.336 0.181472
## sesGreat Difficulties 0.939131 0.249104 3.770 0.000163 ***
## sesSome Difficulties 0.489696 0.149577 3.274 0.001061 **
## sesFairly Easily 0.066057 0.118419 0.558 0.576962
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3142.8 on 6148 degrees of freedom
## Residual deviance: 2790.2 on 6141 degrees of freedom
## AIC: 2806.2
##
## Number of Fisher Scoring iterations: 6
It seems again that our exposure of interest is not even statistically associated with the outcome. Knowledge of the context from which the data comes from can be used to make hypothesis about possible interaction between confounders and the exposure. In this case we want to test for an interaction between no_paid_job and stress. As we know, DAGs are excellent at showing the direction of causality, but they are not conceived to show the possible interactions. But, we can stay with the original DAG in mind, and then fit a model to see if the association between the outcome and the exposure could vary for different levels of no_paid_job. We then fit the model:
fit_multi2 <- glm(low_grip~stress*no_paid_job+house_move+age+ses,family = binomial,data=datiSHAREStress)
summary(fit_multi2)
##
## Call:
## glm(formula = low_grip ~ stress * no_paid_job + house_move +
## age + ses, family = binomial, data = datiSHAREStress)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.544923 0.450526 -21.186 < 2e-16 ***
## stress -0.812176 0.603047 -1.347 0.178049
## no_paid_job 0.267759 0.243706 1.099 0.271901
## house_move -0.181885 0.147844 -1.230 0.218604
## age 0.102246 0.005946 17.197 < 2e-16 ***
## sesGreat Difficulties 0.927537 0.249828 3.713 0.000205 ***
## sesSome Difficulties 0.490455 0.149577 3.279 0.001042 **
## sesFairly Easily 0.060946 0.118563 0.514 0.607225
## stress:no_paid_job 3.189101 1.054703 3.024 0.002497 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3142.8 on 6148 degrees of freedom
## Residual deviance: 2781.6 on 6140 degrees of freedom
## AIC: 2799.6
##
## Number of Fisher Scoring iterations: 6
As a reminder, using * introduces both the main effect and the interaction effect between variables in the model.
Alternatively, we can also fit the same model as follows:
fit_multi2 <- glm(low_grip~stress+no_paid_job+age+ses+stress:no_paid_job+house_move,family = binomial,data=datiSHAREStress)
This second way of adding an interaction term is useful in case we have multiple interactions involving the same variable.
It seems that a significant interaction is present between stress and the no_paid_job confounder. Note that in any case, the main effect of stress is still not significant in the model.
Now that we have fitted these two models, which one should we choose?
We can use a formal statistical test to compare the two nested models:
anova(fit_multi,fit_multi2,test="Chisq")
## Analysis of Deviance Table
##
## Model 1: low_grip ~ stress + no_paid_job + age + house_move + ses
## Model 2: low_grip ~ stress + no_paid_job + age + ses + stress:no_paid_job +
## house_move
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 6141 2790.2
## 2 6140 2781.6 1 8.5938 0.003373 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The null hypothesis of this test is that the simpler model (i.e. the one with the smaller number of regression parameters) is no different from the more complex model. In this case we reject that hypothesis at a 95% confidence level and we would keep the model with the interaction.This is good also because our expert of the field were very sure about the fact that no_paid_job was an important effect modifier!
The next step is interpreting the results obtained from the model which is one of the most important part of the analysis to be discussed with experts. Of course we are keeping things simple here: in reality we would have had to consider many other possible candidate confounders for the outcome in order to properly control for confounding in this observational study, at least for the measured ones. A general premise is that in causal modelling the regression coefficient of the exposure of interest is interpreted as a CATE (conditional average treatment effect). In this specific example however, the regression coefficient of stress is not statistical significant, instead the interaction between stress and no_paid_job is, so that we should interpret the causal effect differently in the subgroups defined by no_paid_job. When an interaction is significant, you can no longer talk about the “average” effect of stress. The question “Does stress affect grip strength?” can now only be answered with: “It depends on whether they have a paid job.”
We should therefore examine the \(\hat{OR}s\) for stress and no_paid_job and since there is an interaction involved, we have to be a bit more careful and we have to consider the two variables together.
So first of all we have to know which is our reference group.
In our case this is the group who have not experienced a stressful period and have done some paid job in their life (i.e. the reference levels for both variables involved in the interaction). Then, we can consider all combinations of the levels of the variables.
If we want the \(\hat{OR}\) for undergoing a stressful period and having had a paid job with respect to not having had a period of stress and having had a paid job we simply exponentiate the main estimated coefficient for stress (keeping at the same values the others covariates in the model):
exp(fit_multi2$coefficients)[2]
## stress
## 0.4438912
and we don’t reject the null hypothesis it is equal to 1 (no significant effect):
exp(confint.default(fit_multi2))[2,]
## 2.5 % 97.5 %
## 0.1361326 1.4474079
On the other hand, if we want the \(\hat{OR}\) for never having had a paid job and not being exposed to stress with respect to having had a paid job and not being exposed to stress we simply exponentiate the coefficient for the main effect of the variable no_paid_job:
exp(fit_multi2$coefficients)[3]
## no_paid_job
## 1.307032
Also this effect is not statistically significant which means that there does not seem to be a difference in the risk of low hand grip with regards to job experience in the non-exposed group (no stress) keeping fixed all others covariates.
exp(confint.default(fit_multi2))[3,]
## 2.5 % 97.5 %
## 0.8106679 2.1073152
Last, we can obtain the OR for subjects who were stressed and had never have a paid job with respect to the reference group (not experienced a stressful period and have done some paid job):
exp(fit_multi2$coefficients[9]+fit_multi2$coefficients[2]+fit_multi2$coefficients[3])
## stress:no_paid_job
## 14.07899
So in this subgroup it seems that the probability of developing low grip increases.
However, we should evaluate also the 95% CI for this \(\hat{OR}\).
One possibility to do it is to re-fit the multivariable logistic model using the rms R package:
library(rms)
dd <- datadist(datiSHAREStress)
options(datadist='dd') #define ranges of the covariates
fit_multi2b <- lrm(low_grip~stress*no_paid_job+age+ses+house_move,data=datiSHAREStress)
The code is quite similar, the main difference is that first we have to run the datadist function which stores the distribution summaries of the variables.
The print function returns a very similar output of the summary for the glm() object.
print(fit_multi2b)
## Logistic Regression Model
##
## lrm(formula = low_grip ~ stress * no_paid_job + age + ses + house_move,
## data = datiSHAREStress)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 6149 LR chi2 361.20 R2 0.143 C 0.752
## 0 5714 d.f. 8 R2(8,6149)0.056 Dxy 0.504
## 1 435 Pr(> chi2) <0.0001 R2(8,1212.7)0.253 gamma 0.505
## max |deriv| 2e-06 Brier 0.061 tau-a 0.066
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -9.5449 0.4505 -21.19 <0.0001
## stress -0.8122 0.6031 -1.35 0.1781
## no_paid_job 0.2678 0.2437 1.10 0.2719
## age 0.1022 0.0059 17.20 <0.0001
## ses=Great Difficulties 0.9275 0.2498 3.71 0.0002
## ses=Some Difficulties 0.4905 0.1496 3.28 0.0010
## ses=Fairly Easily 0.0609 0.1186 0.51 0.6072
## house_move -0.1819 0.1478 -1.23 0.2186
## stress * no_paid_job 3.1891 1.0547 3.02 0.0025
The summary on the other hand gives you the estimates for the coefficients as well as the ones for the OR. Of note by default for the continuous variables here, such as age, the function calculates the OR for the difference between the \(1^{st}\) and the \(3^{rd}\) quartile.
summary(fit_multi2b)
## Effects Response : low_grip
##
## Factor Low High Diff. Effect S.E. Lower 0.95
## stress 0 1.0 1.0 -0.812180 0.603110 -1.99430
## Odds Ratio 0 1.0 1.0 0.443890 NA 0.13612
## no_paid_job 0 1.0 1.0 0.267760 0.243710 -0.20990
## Odds Ratio 0 1.0 1.0 1.307000 NA 0.81067
## age 58 70.8 12.8 1.308700 0.076105 1.15960
## Odds Ratio 58 70.8 12.8 3.701500 NA 3.18860
## house_move 0 1.0 1.0 -0.181880 0.147840 -0.47165
## Odds Ratio 0 1.0 1.0 0.833700 NA 0.62397
## ses - Great Difficulties:Easily 1 2.0 NA 0.927540 0.249830 0.43788
## Odds Ratio 1 2.0 NA 2.528300 NA 1.54940
## ses - Some Difficulties:Easily 1 3.0 NA 0.490460 0.149580 0.19729
## Odds Ratio 1 3.0 NA 1.633100 NA 1.21810
## ses - Fairly Easily:Easily 1 4.0 NA 0.060946 0.118560 -0.17143
## Odds Ratio 1 4.0 NA 1.062800 NA 0.84246
## Upper 0.95
## 0.36990
## 1.44760
## 0.74541
## 2.10730
## 1.45790
## 4.29700
## 0.10788
## 1.11390
## 1.41720
## 4.12550
## 0.78362
## 2.18940
## 0.29333
## 1.34090
##
## Adjusted to: stress=0 no_paid_job=0
So far, the OR for stress and the job experience are the ones for the main effects. We can easily obtain here the OR for the interaction with the built-in contrast function:
c <- contrast(fit_multi2b,
list(stress=1,no_paid_job=1), list(stress=0,no_paid_job=0),type="average")
c
## Contrast S.E. Lower Upper Z Pr(>|z|)
## 11 2.644684 0.8328779 1.012273 4.277095 3.18 0.0015
##
## Confidence intervals are 0.95 individual intervals
#OR and 95% CI
exp(c$Contrast)
## 1
## 14.07899
exp(c$Lower)
## 1
## 2.75185
exp(c$Upper)
## 1
## 72.03085
So we confirm that there is a significant interaction between stress and no paid job on the outcome. Note that the 95% CI is quite large, therefore the precision of our estimate is questionable here…
Remind our initial research question: whether a stressful period could cause the occurrence of muscular weakness in the over 50 population.
Based on our analysis it seems that is not the single event of a stressful period that cause the occurrence of muscular weakness, but it is the joint impact of having a stressful period coupled with no paid job that could be a cause of the occurrence of muscular weakness, adjusting for (or independently from) the specific age, house condition or socio-economic status.
Another possibility to interpret the results is :
# 1. Calculate the slope of stress for each group
slopes(fit_multi2, variables = "stress", by = "no_paid_job")
##
## no_paid_job Estimate Std. Error z Pr(>|z|) S 2.5 % 97.5 %
## 0 -0.0355 0.0186 -1.90 0.05686 4.1 -0.072 0.00104
## 1 0.4241 0.1637 2.59 0.00959 6.7 0.103 0.74490
##
## Term: stress
## Type: response
## Comparison: 1 - 0
# 2. Visualize it (The most intuitive way)
plot_predictions(fit_multi2, condition = c("stress", "no_paid_job")) +
labs(y = "Probability of Low Grip Strength",
title = "Interaction between Stress and Job Status")
A Note on Causal Direction: In SHARE data, be careful with the “Causal” label. Because it is cross-sectional, the interaction could also suggest that people with low grip strength and high stress are more likely to leave the workforce (reverse causality). Your DAG assumes the arrow goes from Job \(\rightarrow\) Stress/Grip, but keep that limitation in mind!