Survey of Health, Ageing and Retirement in Europe : example of an analysis of a real dataset

Load the required libraries

library(haven)
library(tidyverse)
library(magrittr)

Data comes from SHARE, the Survey of Health, Ageing and Retirement in Europe (https://share-eric.eu/data/). It is a multidisciplinary and cross-national panel database of data on health, socio-economic status and social and family networks of about 140000 individuals aged 50 or older (around 380000 interviews). SHARE covers 27 European countries and Israel.

The dataset we use here contains only part of the data the survey produced.

The research question is whether a stressful period could be associated with the occurrence of muscular weakness in the over 50 population: therefore we can consider it from the pint of view of an explanatory (causal) model, not a predictive one.
This question could be of interest since muscular weakness can be considered a proxy of poor physical health. Muscular strength was defined as the hand grip measured with a dynamo-meter and the weakness was defined as having a grip strength below a threshold calculated according to BMI and sex.

The exposure group was defined as subjects who underwent a stressful period between the first and the last time they had been interviewed; all other subject were defined as non-exposed.

Of note, an inclusion criteria was to have a normal hand strength at the start of the study. Once the subjects were divided in the exposed and non-exposed group, the hand grip measurement after two years from the exposure definition was used to define the outcome.

In this example we will consider the following variables:

low_grip: our binary outcome variable (0: No; 1: Yes)
stress: our binary exposure variable (0: No; 1: Yes)
age: age of the subject in years
female: sex (1: females; 0: males)
ses: is household able to make ends meet? (1: Great Difficulties; 2: Some Difficulties; 3: Fairly Good ; 4 Good)
paid_job: has the subject ever done some paid job in their life? (0: No; 1: Yes)
move_house: has the subject ever moved country in their life? (0: No; 1: Yes)

load("Logistic Regression examples.RData")

IDA phase

Categorical Variables

Before starting to fit any regression model we have to properly code categorical variables. This is usually done in the initial data analysis (IDA) and the descriptive statistics phase.

The first step consists in identifying which variables are categorical. This may seem as an easy step but it isn’t always. If fact the choice for some variables (i.e. indexes and scales) is not obvious and it requires either knowledge of the data we are analysing or a bit more effort on the data modelling.

The important message is that we can’t assume that it is correct to model a variable as numerical only because it was of type numerical/integer in the dataset imported in R.

Once we have identified the variables that we think are categorical we can process them. We can have three cases that leads to a slightly different procedure according to the R type of the variable:

Binary variables
Variables of type numerical or integer
Variables of type character

Binary Variables

Binary variables can either be treated as numerical, usually dummy 0/1 variables, or as factors. It does not make any difference in terms of the results obtained from the model but we have to be aware of how they are coded for a correct interpretation of the model.

In general, the model will always have 1 parameter for a binary variable. However, we have to choose which level the parameter should represent by deciding which group to code with 1. The choice should be made according to how we want the model to be interpreted. For example, for the treatment variable, 0 is usually the placebo or the control treatment and 1 is usually the experimental treatment. Alternatively, in the case of the covariates/exposure variables, we could simply use a statistical criteria: for a better stability of the model it is always better to use the most frequent category as a reference.

For example, it makes more sense to consider as the reference level of paid_job subjects who have had a paid job. This will help with the stability of the model but also it will ease the interpretation of the results. Therefore, we create a new variable no_paid_job:

datiSHAREStress$no_paid_job <- ifelse(datiSHAREStress$paid_job==0,1,0)

Categorical variables with numerical labels

In this case we must transform the variable in a factor, assign labels to the factor levels and choose a reference level. The reference level will be the group all the others levels will be compared to. In fact when a factor variable with c levels is added to a model, R internally transforms it in c-1 dummy variables and a model parameter will be estimated for each of them. Again, is important we choose the reference level. In this case we have ses which we have to transform into a factor

datiSHAREStress$ses <- as.factor(datiSHAREStress$ses)

We can now check how R has coded the variable:

str(datiSHAREStress$ses)

##  Factor w/ 4 levels "1","2","3","4": 4 4 4 4 4 4 4 2 4 4 ...

contrasts(datiSHAREStress$ses)

By default, R uses the group 1 as the reference level. With the contrasts() function we can see how the variable will be parametrized in the regression model. In the rows we have the variable levels and on the columns the parameters generated for this variable.

It is also useful to assign more meaningful labels:

datiSHAREStress$ses=factor(datiSHAREStress$ses,
                           labels = c("Great Difficulties",
                           "Some Difficulties",
                            "Fairly Easily",
                             "Easily"))

Last but not least we can set the reference category of our choice. In this case we choose “Easily” which is the most frequent category.

table(datiSHAREStress$ses)

## 
## Great Difficulties  Some Difficulties      Fairly Easily             Easily 
##                191                813               1997               3148

datiSHAREStress$ses=relevel(datiSHAREStress$ses,ref="Easily")

Characters variables

If the variable is of type character the steps are similar to the previous ones with the exception that we won’t need to explicitly set the labels. As a reminder, by default R uses as a reference level the first level in alphabetical order.

General descriptives statistics and plots !

We skip this part here, since we already discussed how to describe a dataset in the various IDA examples. Remind that you should pay particular attention to outliers, missing values, continuous variable’s distribution shapes, rare categories in categorical ones!

Univariable Logistic Regression (univariable filtering)

A common method used in the analysis of health data for variables selection, is fitting many models with one covariate at the time. This is also a way to begin to explore the dataset, even if as discussed it is not the suggested method to select variables in the model. In this case, we can start with the exposure to stress, our variable of interest. This kind of analysis in the medical literature is often referred as “univariable analysis”.

So we start by fitting the model

fit_uni_stress <- glm(low_grip~stress,family = binomial,data=datiSHAREStress)

As a reminder, the glm function automatically uses the logit as link function unless we state otherwise.

The results of the model can be obtained with

summary(fit_uni_stress)

## 
## Call:
## glm(formula = low_grip ~ stress, family = binomial, data = datiSHAREStress)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3848  -0.3848  -0.3848  -0.3848   2.5013  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.5660     0.0501 -51.217   <2e-16 ***
## stress       -0.5174     0.4204  -1.231    0.218    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3142.8  on 6148  degrees of freedom
## Residual deviance: 3141.1  on 6147  degrees of freedom
## AIC: 3145.1
## 
## Number of Fisher Scoring iterations: 5

It seems that we can’t reject the null hypothesis for \(\beta_{stress}\).

We can also obtain the 95% Confidence Interval

confint.default(fit_uni_stress)

##                 2.5 %     97.5 %
## (Intercept) -2.664221 -2.4678285
## stress      -1.341440  0.3066133

We can do more univariable models to look for possible confounders (previoulsy discussed with an expert if possible..) in the exposure-outcome relationship.

fit_uni_age <- glm(low_grip~age,family = binomial,data=datiSHAREStress)
summary(fit_uni_age)

## 
## Call:
## glm(formula = low_grip ~ age, family = binomial, data = datiSHAREStress)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.4362  -0.4048  -0.2897  -0.2115   2.9594  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.505544   0.419383  -22.67   <2e-16 ***
## age          0.101366   0.005822   17.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3142.8  on 6148  degrees of freedom
## Residual deviance: 2817.0  on 6147  degrees of freedom
## AIC: 2821
## 
## Number of Fisher Scoring iterations: 6

fit_uni_job <- glm(low_grip~no_paid_job,family = binomial,data=datiSHAREStress)
summary(fit_uni_job)

## 
## Call:
## glm(formula = low_grip ~ no_paid_job, family = binomial, data = datiSHAREStress)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.6381  -0.3749  -0.3749  -0.3749   2.3196  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.61992    0.05134 -51.030  < 2e-16 ***
## no_paid_job  1.13184    0.21544   5.254 1.49e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3142.8  on 6148  degrees of freedom
## Residual deviance: 3120.8  on 6147  degrees of freedom
## AIC: 3124.8
## 
## Number of Fisher Scoring iterations: 5

fit_uni_sex <- glm(low_grip~female,family = binomial,data=datiSHAREStress)
summary(fit_uni_sex)

## 
## Call:
## glm(formula = low_grip ~ female, family = binomial, data = datiSHAREStress)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.3849  -0.3849  -0.3814  -0.3814   2.3053  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.56532    0.07178 -35.738   <2e-16 ***
## female      -0.01918    0.09955  -0.193    0.847    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3142.8  on 6148  degrees of freedom
## Residual deviance: 3142.8  on 6147  degrees of freedom
## AIC: 3146.8
## 
## Number of Fisher Scoring iterations: 5

fit_uni_move <- glm(low_grip~house_move,family = binomial,data=datiSHAREStress)
summary(fit_uni_move)

## 
## Call:
## glm(formula = low_grip ~ house_move, family = binomial, data = datiSHAREStress)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.4221  -0.3774  -0.3774  -0.3774   2.3141  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.3735     0.1307 -18.161   <2e-16 ***
## house_move   -0.2329     0.1413  -1.648   0.0993 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3142.8  on 6148  degrees of freedom
## Residual deviance: 3140.2  on 6147  degrees of freedom
## AIC: 3144.2
## 
## Number of Fisher Scoring iterations: 5

fit_uni_ses <- glm(low_grip~ses,family = binomial,data=datiSHAREStress)
summary(fit_uni_ses)

## 
## Call:
## glm(formula = low_grip ~ ses, family = binomial, data = datiSHAREStress)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -0.5066  -0.3813  -0.3614  -0.3614   2.3500  
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -2.69592    0.07324 -36.809  < 2e-16 ***
## sesGreat Difficulties  0.70745    0.23408   3.022  0.00251 ** 
## sesSome Difficulties   0.37973    0.14288   2.658  0.00787 ** 
## sesFairly Easily       0.11084    0.11422   0.970  0.33182    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3142.8  on 6148  degrees of freedom
## Residual deviance: 3129.9  on 6145  degrees of freedom
## AIC: 3137.9
## 
## Number of Fisher Scoring iterations: 5

When we have covariates with many levels, such as ses, we can use the global Wald test which takes into account for multiple testing.

drop1(fit_uni_ses,test="Chisq")

## Single term deletions
## 
## Model:
## low_grip ~ ses
##        Df Deviance    AIC    LRT Pr(>Chi)   
## <none>      3129.8 3137.8                   
## ses     3   3142.8 3144.8 12.988 0.004663 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Multivariable Logistic Model

We can now fit a multivariable logistic model with the variables that were found to be associated with the outcome in the previous analysis and we suspect could act as confounders of the main exposure of interest.
Since we have not the possibility do discuss with an expert, we decide to include in the multivariable regression model all covariates with a p-value < 0.1. This criterion could be not the best one, but is widely applied in clinical and epidemiological research.

We obviously include in the model stress , since it is our exposure of interest.

We will treat the others variables are possible confounders in the model, and we will explore if significant associations are present with the exposure of interest. In principle, in the explanatory setting, we should start from a DAG and make the causal assumptions (in order to estimate a causal effect!!) but we do not have here the experts of the matter and we will limit ourselves in the end to a cautious interpretation of the estimated association.

fit_multi <-  glm(low_grip~stress+no_paid_job+age+house_move+ses,family = binomial,data=datiSHAREStress)
summary(fit_multi)

## 
## Call:
## glm(formula = low_grip ~ stress + no_paid_job + age + house_move + 
##     ses, family = binomial, data = datiSHAREStress)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3619  -0.3994  -0.2864  -0.2075   2.9996  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -9.486409   0.448437 -21.154  < 2e-16 ***
## stress                -0.149441   0.438194  -0.341 0.733075    
## no_paid_job            0.408889   0.233225   1.753 0.079568 .  
## age                    0.101474   0.005924  17.129  < 2e-16 ***
## house_move            -0.196669   0.147181  -1.336 0.181472    
## sesGreat Difficulties  0.939131   0.249104   3.770 0.000163 ***
## sesSome Difficulties   0.489696   0.149577   3.274 0.001061 ** 
## sesFairly Easily       0.066057   0.118419   0.558 0.576962    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3142.8  on 6148  degrees of freedom
## Residual deviance: 2790.2  on 6141  degrees of freedom
## AIC: 2806.2
## 
## Number of Fisher Scoring iterations: 6

This model estimates the coefficients of each of the variables independently from all the others. This is key concept of multivariable regression and it is what allows to remove confounding.

At a first sight it seems that our exposure of interest is not associated with the outcome.

Knowledge of the context from which the data comes from can be used to make hypothesis about possible interaction between confounders and the exposure.

In this case we want to test for an interaction between no_paid_job and stress. In other words we want to see if the association between the outcome and the exposure could vary for different levels of no_paid_job.

We then fit the model:

fit_multi2 <-  glm(low_grip~stress*no_paid_job+house_move+age+ses,family = binomial,data=datiSHAREStress)
summary(fit_multi2)

## 
## Call:
## glm(formula = low_grip ~ stress * no_paid_job + house_move + 
##     age + ses, family = binomial, data = datiSHAREStress)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3766  -0.3996  -0.2850  -0.2054   3.0626  
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -9.544923   0.450526 -21.186  < 2e-16 ***
## stress                -0.812176   0.603047  -1.347 0.178049    
## no_paid_job            0.267759   0.243706   1.099 0.271901    
## house_move            -0.181885   0.147844  -1.230 0.218604    
## age                    0.102246   0.005946  17.197  < 2e-16 ***
## sesGreat Difficulties  0.927537   0.249828   3.713 0.000205 ***
## sesSome Difficulties   0.490455   0.149577   3.279 0.001042 ** 
## sesFairly Easily       0.060946   0.118563   0.514 0.607225    
## stress:no_paid_job     3.189101   1.054703   3.024 0.002497 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3142.8  on 6148  degrees of freedom
## Residual deviance: 2781.6  on 6140  degrees of freedom
## AIC: 2799.6
## 
## Number of Fisher Scoring iterations: 6

As a reminder, using * introduces both the main effect and the interaction effect between variables in the model.

Alternatively, we can also fit the same model as follows:

fit_multi2 <-  glm(low_grip~stress+no_paid_job+age+ses+stress:no_paid_job+house_move,family = binomial,data=datiSHAREStress)

This second way of adding an interaction term is useful in case we have multiple interactions involving the same variable.

It seems that a significant interaction is present between stress and the no_paid_job confounder. Note that in any case, the main effect of stress is still not significant in the model.

Now that we have fitted these two models, which one should we choose?

We can use a formal statistical test to compare the two nested models:

anova(fit_multi,fit_multi2,test="Chisq")

## Analysis of Deviance Table
## 
## Model 1: low_grip ~ stress + no_paid_job + age + house_move + ses
## Model 2: low_grip ~ stress + no_paid_job + age + ses + stress:no_paid_job + 
##     house_move
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
## 1      6141     2790.2                        
## 2      6140     2781.6  1   8.5938 0.003373 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The null hypothesis of this test is that the simpler model (i.e. the one with the smaller number of regression parameters) is no different from the more complex model. In this case we reject that hypothesis at a 95% confidence level and we would keep the model with the interaction.

Calibration for the logistic regression model (simplest basic method !!)

Even if we are here estimating an explanatory model, this does not means that we are not at all interested in evaluating if the predicted probabilities are in line with the observed event rates. A basic procedure to evaluate this aspect in logistic regression is the Hosmer-Lemeshow test (see for details: https://onlinelibrary.wiley.com/doi/book/10.1002/0471722146). In brief, the null hypothesis of the test is that the predicted probabilities (splitted in ordinal groups) are in line with the observed rates of the events in each group.

library(generalhoslem)

logitgof(datiSHAREStress$low_grip,fitted(fit_multi2))

## 
##  Hosmer and Lemeshow test (binary model)
## 
## data:  datiSHAREStress$low_grip, fitted(fit_multi2)
## X-squared = 5.6329, df = 8, p-value = 0.6883

The model doesn’t show evidence for poor goodness of fit/calibration. Remind that when we are instead working with models used specifically for prediction, we should use more refined analyses such that the bootstrap overfitting-corrected calibration curves.

Interpretation of the logistic regression model

The next step is interpreting the results obtained from the model which is one of the most important part of the analysis to be discussed with experts. Of course we are keeping things simple here: in reality we would have had to consider many other possible candidate confounders for the outcome in order to properly control for confounding in this observational study, at least for the measured ones.

Continuous covariates

We obtain the OR, which is the measure of association selected when reporting and interpreting the results of a logistic model.

We start with the OR estimated for age:

exp(fit_multi2$coefficients)[4]

##      age 
## 1.107656

In general, OR>1 indicate an increase of the probability of the outcome whereas OR<1 a decrease. But the question is : with respect to what? We know that the OR is a relative measure of association so we must always ask ourselves what comparison we are making when we report an OR we have estimated. The above output does not mean anything by itself.

The first important thing to remember is that age is a continuous variable (measured in years) and in the model we have assumed it has a linear effect on the log-odds of the outcome. When we obtain the \(\hat{OR}\) by exponentiating the coefficient for age we obtain the OR estimates for an increase of one year of age. Being the \(\hat{OR}>1\), the probability of developing low grip increases as the age increases. Specifically, a subject has odds of having low grip 1.11 times greater with respect to a subject who is 1 year younger.

For continuous variables is important to report an OR for a difference that is relevant in the application at hand; a proper choice will depend of the scale the variable is measured and on the magnitude of the estimated coefficient.

For example here 1 year may not be so relevant from an epidemiological point of view, we may want to report an OR for a 5 years difference in age:

exp(fit_multi2$coefficients*5)[4]

##     age 
## 1.66734

Binary Covariates

What if we want to interpret the OR of a binary variable? Let’s consider the variable house_move. If it is coded numerical, then the coefficients refers always to the level 1. In this case 1 stands for having moved country in the past, so if the estimated coefficient was significant we could say that subjects who moved seem to have an odds lower by 17% (\(1-\hat{OR}\)) compared to people that have never moved countries in their life. However, when we look at the 95% CI we observe that it contains 1 so the association does not seem to be statistically significant.

#OR
exp(fit_multi2$coefficients)[8]

## house_move 
##  0.8336973

exp(confint.default(fit_multi2))[8,]

##     2.5 %    97.5 % 
## 0.6239695 1.1139185

Categorical Covariates

The interpretation of the OR for categorical variables is similar: we will have to keep in mind that we are always comparing each of the variable levels to the reference level. Here, it seems (interestingly!) that the odds of health worsening is the same for subject with good or fairly good self-perceived socio-economic status while it increases as the economic situation gets worse.

# OR
exp(fit_multi2$coefficients)[5:7]

## sesGreat Difficulties  sesSome Difficulties      sesFairly Easily 
##              2.528273              1.633059              1.062842

# 95% CI
exp(confint.default(fit_multi2))[5:7,]

##                          2.5 %   97.5 %
## sesGreat Difficulties 1.549423 4.125513
## sesSome Difficulties  1.218097 2.189385
## sesFairly Easily      0.842456 1.340880

Interactions

Finally, we will obtain the \(\hat{OR}s\) for stress and no_paid_job. Since there is an interaction involved, we have to be a bit more careful and we have to consider the two variables together. So first of all we have to know which is our reference group. In our case this is the group who have not experienced a stressful period and have done some paid job in their life (the reference levels for both variables involved in the interaction). Then, we can consider all combinations of the levels of the variables.

If we want the \(\hat{OR}\) for undergoing a stressful period and having had a paid job with respect to not having had a period of stress and having had a paid job we simply exponentiate the main estimated coefficient for stress (keeping at the same values the others covariates in the model):

exp(fit_multi2$coefficients)[2]

##    stress 
## 0.4438912

and we don’t reject the null hypothesis it is equal to 1 (no significant effect):

exp(confint.default(fit_multi2))[2,]

##     2.5 %    97.5 % 
## 0.1361326 1.4474079

On the other hand, if we want the \(\hat{OR}\) for never having had a paid job and not being exposed to stress with respect to having had a paid job and not being exposed to stress we simply exponentiate the coefficient for the main effect of the variable no_paid_job:

exp(fit_multi2$coefficients)[3]

## no_paid_job 
##    1.307032

Also this effect is not statistically significant which means that there does not seem to be a difference in the risk of low hand grip with regards to job experience in the non-exposed group (no stress) keeping fixed all others covariates.

exp(confint.default(fit_multi2))[3,]

##     2.5 %    97.5 % 
## 0.8106679 2.1073152

Last, we can obtain the OR for subjects who were stressed and had never have a paid job with respect to the reference group (not experienced a stressful period and have done some paid job):

exp(fit_multi2$coefficients[9]+fit_multi2$coefficients[2]+fit_multi2$coefficients[3])

## stress:no_paid_job 
##           14.07899

So in this subgroup it seems that the probability of developing low grip increases.

However, we should evaluate also the 95% CI for this \(\hat{OR}\).

One possibility to do it is to re-fit the multivariable logistic model using the rms R package:

library(rms)
dd <- datadist(datiSHAREStress)
options(datadist='dd') #define ranges of the covariates 

fit_multi2b <- lrm(low_grip~stress*no_paid_job+age+ses+house_move,data=datiSHAREStress)

The code is quite similar, the main difference is that first we have to run the datadist function which stores the distribution summaries of the variables.

The print function returns a very similar output of the summary for the glm() object.

print(fit_multi2b)

## Logistic Regression Model
## 
## lrm(formula = low_grip ~ stress * no_paid_job + age + ses + house_move, 
##     data = datiSHAREStress)
## 
##                        Model Likelihood       Discrimination    Rank Discrim.    
##                              Ratio Test              Indexes          Indexes    
## Obs          6149    LR chi2     361.20       R2       0.143    C       0.752    
##  0           5714    d.f.             8      R2(8,6149)0.056    Dxy     0.504    
##  1            435    Pr(> chi2) <0.0001    R2(8,1212.7)0.253    gamma   0.505    
## max |deriv| 2e-06                             Brier    0.061    tau-a   0.066    
## 
##                        Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept              -9.5449 0.4505 -21.19 <0.0001 
## stress                 -0.8122 0.6031  -1.35 0.1781  
## no_paid_job             0.2678 0.2437   1.10 0.2719  
## age                     0.1022 0.0059  17.20 <0.0001 
## ses=Great Difficulties  0.9275 0.2498   3.71 0.0002  
## ses=Some Difficulties   0.4905 0.1496   3.28 0.0010  
## ses=Fairly Easily       0.0609 0.1186   0.51 0.6072  
## house_move             -0.1819 0.1478  -1.23 0.2186  
## stress * no_paid_job    3.1891 1.0547   3.02 0.0025

The summary on the other hand gives you the estimates for the coefficients as well as the ones for the OR. Of note by default for the continuous variables here, such as age, the function calculates the OR for the difference between the \(1^{st}\) and the \(3^{rd}\) quartile.

summary(fit_multi2b)

##              Effects              Response : low_grip 
## 
##  Factor                          Low High Diff. Effect    S.E.     Lower 0.95
##  stress                           0   1.0  1.0  -0.812180 0.603110 -1.99430  
##   Odds Ratio                      0   1.0  1.0   0.443890       NA  0.13612  
##  no_paid_job                      0   1.0  1.0   0.267760 0.243710 -0.20990  
##   Odds Ratio                      0   1.0  1.0   1.307000       NA  0.81067  
##  age                             58  70.8 12.8   1.308700 0.076105  1.15960  
##   Odds Ratio                     58  70.8 12.8   3.701500       NA  3.18860  
##  house_move                       0   1.0  1.0  -0.181880 0.147840 -0.47165  
##   Odds Ratio                      0   1.0  1.0   0.833700       NA  0.62397  
##  ses - Great Difficulties:Easily  1   2.0   NA   0.927540 0.249830  0.43788  
##   Odds Ratio                      1   2.0   NA   2.528300       NA  1.54940  
##  ses - Some Difficulties:Easily   1   3.0   NA   0.490460 0.149580  0.19729  
##   Odds Ratio                      1   3.0   NA   1.633100       NA  1.21810  
##  ses - Fairly Easily:Easily       1   4.0   NA   0.060946 0.118560 -0.17143  
##   Odds Ratio                      1   4.0   NA   1.062800       NA  0.84246  
##  Upper 0.95
##  0.36990   
##  1.44760   
##  0.74541   
##  2.10730   
##  1.45790   
##  4.29700   
##  0.10788   
##  1.11390   
##  1.41720   
##  4.12550   
##  0.78362   
##  2.18940   
##  0.29333   
##  1.34090   
## 
## Adjusted to: stress=0 no_paid_job=0

So far, the OR for stress and the job experience are the ones for the main effects. We can easily obtain here the OR for the interaction with the built-in contrast function:

c <- contrast(fit_multi2b,
         list(stress=1,no_paid_job=1), list(stress=0,no_paid_job=0),type="average")

c

##    Contrast      S.E.    Lower    Upper    Z Pr(>|z|)
## 11 2.644684 0.8328779 1.012273 4.277095 3.18   0.0015
## 
## Confidence intervals are 0.95 individual intervals

#OR and 95% CI
exp(c$Contrast)

##        1 
## 14.07899

exp(c$Lower)

##       1 
## 2.75185

exp(c$Upper)

##        1 
## 72.03085

So it seems that there is a significant interaction between stress and no paid job on the outcome.

Conclusions

Remind our initial research question: whether a stressful period could be associated with the occurrence of muscular weakness in the over 50 population. Based on our analysis it seems that is not the single event of a stessful period that increase the occurrence of muscular weakness, but it is the joint impact of having a stressful period coupled with no paid job that is associated with the occurrence of muscular weakness, adjusting for (or independently from) the specific age, house condition or socio-economic status.

Logistic Regression Models (II)