1 Example of a prediction model: the “Prostate” dataset case study

First of all, install and load the required libraries:

library(rms)
library(Hmisc)
library(gtsummary)
library(magrittr)
library(tidyverse)
library(haven)
library(ggplot2)
library(ggpubr)
library(pROC)
library(here)
library(psych)

This practical contains a case study on developing, describing, and validating a regression prediction model.

The original study dataset also contained the follow-up time to the events, but we do not consider the time-to-event aspect here (just for didactic purposes, we will treat follow-up time as another candidate predictor in the model).

The original dataset was composed of 506 patients with prostate cancer from the Byar and Green paper (see, if interested: D. P. Byar and S. B. Green. The choice of treatment for cancer patients based on covariate information: Application to prostate cancer. Bulletin Cancer, Paris, 67:477-488, 1980).

These data come originally from a randomized controlled trial (RCT) comparing four treatments for stage 3 and 4 prostate cancer, with almost equal numbers of patients on placebo and on each of three doses of estrogen.

In this trial, larger doses of estrogen reduced the effect of prostate cancer, but at the cost of an increased risk of cardiovascular death. (Four patients had missing values on many variables; these patients have been excluded from consideration.)

Our goal here is to predict the probability of cardiovascular–cerebrovascular death in the subset of patients who experienced either this event or prostate cancer death.

We will therefore subset the original dataset to those patients who died from prostate cancer, heart or vascular disease, or cerebrovascular disease.

Here is the description of the variables contained in the dataset:

Some a priori information from clinicians: stage is defined by ap as well as X-ray results. Of the patients in stage 3, 92% have ap ≤ 0.8. Of those in stage 4, 93% have ap > 0.8.

Since stage can be predicted almost certainly from ap, we do not consider stage in the analyses. (You can check this relationship on the data).

1.1 Initial Data Analysis (IDA) on the original dataset

This part is an exercise that you will do on your own: first of all, let’s load the original dataset:

prostate <- read_sav(here("datasets", "prostate.sav"))
prostate <-labelled::to_factor(prostate, labelled_only = TRUE, strict = TRUE, unclass = TRUE)
summary.data.frame(prostate)
##      patno           stage                     rx          dtime      
##  Min.   :  1.0   Min.   :3.000   placebo        :127   Min.   : 0.00  
##  1st Qu.:126.2   1st Qu.:3.000   0.2 mg estrogen:124   1st Qu.:14.25  
##  Median :251.5   Median :3.000   1.0 mg estrogen:126   Median :34.00  
##  Mean   :251.7   Mean   :3.424   5.0 mg estrogen:125   Mean   :36.13  
##  3rd Qu.:376.8   3rd Qu.:4.000                         3rd Qu.:57.75  
##  Max.   :506.0   Max.   :4.000                         Max.   :76.00  
##                                                                       
##                           status         age              wt        
##  alive                       :148   Min.   :48.00   Min.   : 69.00  
##  dead - prostatic ca         :130   1st Qu.:70.00   1st Qu.: 90.00  
##  dead - heart or vascular    : 96   Median :73.00   Median : 98.00  
##  dead - cerebrovascular      : 31   Mean   :71.46   Mean   : 99.03  
##  dead - other specific non-ca: 28   3rd Qu.:76.00   3rd Qu.:107.00  
##  dead - other ca             : 25   Max.   :89.00   Max.   :152.00  
##  (Other)                     : 44   NA's   :1       NA's   :2       
##                     pf            hx              sbp             dbp        
##  normal activity     :450   Min.   :0.0000   Min.   : 8.00   Min.   : 4.000  
##  in bed < 50% daytime: 37   1st Qu.:0.0000   1st Qu.:13.00   1st Qu.: 7.000  
##  in bed > 50% daytime: 13   Median :0.0000   Median :14.00   Median : 8.000  
##  confined to bed     :  2   Mean   :0.4243   Mean   :14.35   Mean   : 8.149  
##                             3rd Qu.:1.0000   3rd Qu.:16.00   3rd Qu.: 9.000  
##                             Max.   :1.0000   Max.   :30.00   Max.   :18.000  
##                                                                              
##                                 ekg            hg               sz       
##  normal                           :168   Min.   : 5.899   Min.   : 0.00  
##  heart strain                     :150   1st Qu.:12.299   1st Qu.: 5.00  
##  old MI                           : 75   Median :13.699   Median :11.00  
##  rhythmic disturb & electrolyte ch: 51   Mean   :13.446   Mean   :14.63  
##  heart block or conduction def    : 26   3rd Qu.:14.699   3rd Qu.:21.00  
##  (Other)                          : 24   Max.   :21.199   Max.   :69.00  
##  NA's                             :  8                    NA's   :5      
##        sg              ap                bm        
##  Min.   : 5.00   Min.   :  0.100   Min.   :0.0000  
##  1st Qu.: 9.00   1st Qu.:  0.500   1st Qu.:0.0000  
##  Median :10.00   Median :  0.700   Median :0.0000  
##  Mean   :10.31   Mean   : 12.176   Mean   :0.1633  
##  3rd Qu.:11.00   3rd Qu.:  2.975   3rd Qu.:0.0000  
##  Max.   :15.00   Max.   :999.875   Max.   :1.0000  
##  NA's   :11
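
As noted above, the near-deterministic relationship between stage and ap can be checked directly on the data. A minimal sketch (the row proportions should be close to the 92% and 93% figures quoted by the clinicians):

# Cross-tabulate stage against the 0.8 cut-off of ap and look at row proportions
with(prostate, round(prop.table(table(stage, ap <= 0.8), margin = 1), 2))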

Here are the suggestions for the initial data analysis:

  1. Create a descriptive summary of all the variables contained in the dataset (except for patient number!).

Do you think that some variables could be re-coded by combining an infrequent category with an adjacent category?

For example, pay attention to the frequencies of ekg and pf.

Hint from a discussion with clinicians: in the ekg variable we could collapse the normal and benign conditions.

For pf, you can combine in bed > 50% daytime with confined to bed in one category.

  2. Then, here is another question for you: based on the study design (RCT), do you expect to find statistically significant differences between the baseline characteristics of the groups defined by the levels of the treatment rx? Build a descriptive table comparing the four groups and evaluate the global p-value across groups. (Hint: for numerical variables you could use the Kruskal-Wallis test and for categorical ones the Chi-square test.)

  3. Take a look at the shape of the distributions of the continuous variables: do you think that some scale transformation (for example, taking the logarithm) could help improve the normality of some of them?

  4. Now subset the original dataset, selecting only the patients who died of cardiovascular–cerebrovascular causes or of prostate cancer. Create the outcome variable cvd, coded “1” for patients who died of cardiovascular–cerebrovascular causes and “0” for patients who died of prostate cancer. (We will then use logistic regression for the prediction model.) A sketch of these steps is given right after this list.
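
A minimal sketch of these four steps, assuming the level labels printed in the summary above (the exact recodings used to build the prostate_rid.sav file loaded in the next section may differ slightly, e.g. in the labels chosen):

# 1. collapse infrequent categories of ekg and pf (level names as printed above)
prostate_ida <- prostate %>%
  mutate(ekg_rec = forcats::fct_collapse(ekg, bngn = c("normal", "benign")),
         pf_rec  = forcats::fct_collapse(pf,
                     bed_over_50_conf = c("in bed > 50% daytime", "confined to bed")))

# 2. descriptive table by treatment arm with a global p-value for each characteristic
prostate_ida %>%
  select(-patno) %>%
  tbl_summary(by = rx) %>%
  add_p(test = list(all_continuous() ~ "kruskal.test",
                    all_categorical() ~ "chisq.test"))

# 3. compare the shape of a skewed variable (e.g. ap) before and after a log transformation
ggplot(prostate_ida, aes(x = ap)) + geom_histogram(bins = 30)
ggplot(prostate_ida, aes(x = log(ap))) + geom_histogram(bins = 30)

# 4. subset to the two causes of death of interest and define the binary outcome cvd
prostate_rid_sketch <- prostate_ida %>%
  filter(status %in% c("dead - prostatic ca",
                       "dead - heart or vascular",
                       "dead - cerebrovascular")) %>%
  mutate(cvd = ifelse(status == "dead - prostatic ca", 0, 1))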

The reduced dataset “prostate_rid” is created. Here we begin with the prediction model development part.

1.2 Let’s begin with the prediction model construction

First of all, we load the reduced dataset with the recoded variables that were suggested in the IDA phase:

prostate_rid <- read_sav(here("datasets","prostate_rid.sav"))
prostate_rid <-labelled::to_factor(prostate_rid, labelled_only = TRUE, strict = TRUE, unclass = TRUE)
summary.data.frame(prostate_rid)
##      patno           stage                     rx         dtime      
##  Min.   :  3.0   Min.   :3.000   placebo        :71   Min.   : 0.00  
##  1st Qu.:131.0   1st Qu.:3.000   0.2 mg estrogen:67   1st Qu.:11.00  
##  Median :265.0   Median :4.000   1.0 mg estrogen:49   Median :24.00  
##  Mean   :260.5   Mean   :3.525   5.0 mg estrogen:70   Mean   :25.26  
##  3rd Qu.:395.0   3rd Qu.:4.000                        3rd Qu.:37.00  
##  Max.   :506.0   Max.   :4.000                        Max.   :71.00  
##                                                                      
##                       status         age              wt        
##  dead - prostatic ca     :130   Min.   :48.00   Min.   : 71.00  
##  dead - heart or vascular: 96   1st Qu.:70.00   1st Qu.: 89.00  
##  dead - cerebrovascular  : 31   Median :73.00   Median : 97.00  
##  alive                   :  0   Mean   :71.55   Mean   : 98.03  
##  dead - pulmonary embolus:  0   3rd Qu.:76.00   3rd Qu.:106.00  
##  dead - other ca         :  0   Max.   :88.00   Max.   :152.00  
##  (Other)                 :  0                                   
##                     pf            hx              sbp             dbp        
##  normal activity     :221   Min.   :0.0000   Min.   : 8.00   Min.   : 4.000  
##  in bed < 50% daytime: 24   1st Qu.:0.0000   1st Qu.:13.00   1st Qu.: 7.000  
##  in bed > 50% daytime: 10   Median :1.0000   Median :14.00   Median : 8.000  
##  confined to bed     :  2   Mean   :0.5097   Mean   :14.53   Mean   : 8.144  
##                             3rd Qu.:1.0000   3rd Qu.:16.00   3rd Qu.: 9.000  
##                             Max.   :1.0000   Max.   :24.00   Max.   :13.000  
##                                                                              
##                                 ekg           hg               sz       
##  heart strain                     :83   Min.   : 5.899   Min.   : 0.00  
##  normal                           :74   1st Qu.:12.000   1st Qu.: 6.00  
##  old MI                           :44   Median :13.600   Median :14.00  
##  rhythmic disturb & electrolyte ch:28   Mean   :13.301   Mean   :16.99  
##  benign                           :13   3rd Qu.:14.600   3rd Qu.:24.50  
##  (Other)                          :10   Max.   :21.199   Max.   :69.00  
##  NA's                             : 5                    NA's   :2      
##        sg              ap               bm              ap80         ekg_rec  
##  Min.   : 5.00   Min.   :  0.10   Min.   :0.0000   Min.   :0.0000   bngn :87  
##  1st Qu.: 9.00   1st Qu.:  0.60   1st Qu.:0.0000   1st Qu.:0.0000   rd&ec:28  
##  Median :11.00   Median :  1.10   Median :0.0000   Median :0.0000   hbocd:10  
##  Mean   :10.75   Mean   : 16.17   Mean   :0.2451   Mean   :0.4553   hrts :83  
##  3rd Qu.:12.00   3rd Qu.:  7.00   3rd Qu.:0.0000   3rd Qu.:1.0000   MI   :44  
##  Max.   :15.00   Max.   :999.88   Max.   :1.0000   Max.   :1.0000   NA's : 5  
##  NA's   :5                                                                    
##               pf_rec                       status_num      subset 
##  normal          :221   dead - prostatic ca     :130   Min.   :1  
##  bed_under_50    : 24   dead - heart or vascular: 96   1st Qu.:1  
##  bed_over_50_conf: 12   dead - cerebrovascular  : 31   Median :1  
##                         alive                   :  0   Mean   :1  
##                         dead - pulmonary embolus:  0   3rd Qu.:1  
##                         dead - other ca         :  0   Max.   :1  
##                         (Other)                 :  0              
##       cvd             log_ap        
##  Min.   :0.0000   Min.   :-2.30268  
##  1st Qu.:0.0000   1st Qu.:-0.51087  
##  Median :0.0000   Median : 0.09518  
##  Mean   :0.4942   Mean   : 0.70723  
##  3rd Qu.:1.0000   3rd Qu.: 1.94591  
##  Max.   :1.0000   Max.   : 6.90763  
## 

Here we define the ranges of the variables contained in the dataset with datadist (this is useful both for data description and for the subsequent interpretation of the regression coefficients):

options(datadist='dd')
dd <-datadist(prostate_rid)
dd
##                     patno stage              rx dtime               status
## Low:effect      131.00000     3            <NA>    11                 <NA>
## Adjust to       265.00000     3         placebo    24  dead - prostatic ca
## High:effect     395.00000     4            <NA>    37                 <NA>
## Low:prediction   24.96109     3         placebo     1                alive
## High:prediction 482.03891     4 5.0 mg estrogen    58 dead - unknown cause
## Low               3.00000     3         placebo     0                alive
## High            506.00000     4 5.0 mg estrogen    71 dead - unknown cause
##                      age       wt              pf hx sbp dbp          ekg
## Low:effect      70.00000  89.0000            <NA>  0  13   7         <NA>
## Adjust to       73.00000  97.0000 normal activity  0  14   8 heart strain
## High:effect     76.00000 106.0000            <NA>  1  16   9         <NA>
## Low:prediction  55.96109  76.0000 normal activity  0  11   6       normal
## High:prediction 80.00000 125.0389 confined to bed  1  19  10    recent MI
## Low             48.00000  71.0000 normal activity  0   8   4       normal
## High            88.00000 152.0000 confined to bed  1  24  13    recent MI
##                        hg       sz sg           ap bm ap80 ekg_rec
## Low:effect      12.000000  6.00000  9   0.59997559  0    0    <NA>
## Adjust to       13.599609 14.00000 11   1.09985352  0    0    bngn
## High:effect     14.599609 24.50000 12   7.00000000  1    1    <NA>
## Low:prediction   9.298828  2.00000  8   0.19998169  0    0    bngn
## High:prediction 16.503800 46.11673 14  54.97592412  1    1      MI
## Low              5.899414  0.00000  5   0.09999084  0    0    bngn
## High            21.199219 69.00000 15 999.87500000  1    1      MI
##                           pf_rec           status_num subset cvd      log_ap
## Low:effect                  <NA>                 <NA>      1   0 -0.51086631
## Adjust to                 normal  dead - prostatic ca      1   0  0.09517701
## High:effect                 <NA>                 <NA>      1   1  1.94591015
## Low:prediction            normal                alive      1   0 -1.60952947
## High:prediction bed_over_50_conf dead - unknown cause      1   1  4.00677071
## Low                       normal                alive      1   0 -2.30267665
## High            bed_over_50_conf dead - unknown cause      1   1  6.90763027
## 
## Values:
## 
## stage : 3 4 
## rx : placebo 0.2 mg estrogen 1.0 mg estrogen 5.0 mg estrogen 
## status :
##  [1] alive                        dead - prostatic ca         
##  [3] dead - heart or vascular     dead - cerebrovascular      
##  [5] dead - pulmonary embolus     dead - other ca             
##  [7] dead - respiratory disease   dead - other specific non-ca
##  [9] dead - unspecified non-ca    dead - unknown cause        
## pf : normal activity in bed < 50% daytime in bed > 50% daytime confined to bed 
## hx : 0 1 
## dbp :  4  5  6  7  8  9 10 11 12 13 
## ekg :
## [1] normal                            benign                           
## [3] rhythmic disturb & electrolyte ch heart block or conduction def    
## [5] heart strain                      old MI                           
## [7] recent MI                        
## bm : 0 1 
## ap80 : 0 1 
## ekg_rec : bngn rd&ec hbocd hrts MI 
## pf_rec : normal bed_under_50 bed_over_50_conf 
## status_num :
##  [1] alive                        dead - prostatic ca         
##  [3] dead - heart or vascular     dead - cerebrovascular      
##  [5] dead - pulmonary embolus     dead - other ca             
##  [7] dead - respiratory disease   dead - other specific non-ca
##  [9] dead - unspecified non-ca    dead - unknown cause        
## subset : 1 
## cvd : 0 1

1.3 Functional forms of continuous variables

Now we visually explore the assumption about the linearity of the effect for continuous candidate predictors by means of nonparametric smoothed regression:

par(mfrow=c(2,2))
plsmo(prostate_rid$age,prostate_rid$cvd)
plsmo(prostate_rid$wt, prostate_rid$cvd)
plsmo(prostate_rid$sbp,prostate_rid$cvd)
plsmo(prostate_rid$dbp,prostate_rid$cvd)

A strange behaviour is observed for wt, i.e. the Weight Index [calculated as wt = wt(kg) - ht(cm) + 200]; the other variables could be roughly approximated by a linear effect.

par(mfrow=c(2,3))
plsmo(prostate_rid$hg, prostate_rid$cvd)
plsmo(prostate_rid$sz, prostate_rid$cvd)
plsmo(prostate_rid$sg, prostate_rid$cvd)
plsmo(prostate_rid$ap, prostate_rid$cvd) 
plsmo(prostate_rid$dtime, prostate_rid$cvd) 

Again, a strange behaviour is observed for dtime (Months of Follow-up, as is obviously to be expected…); the others could be roughly approximated by a linear effect. Remember that this dataset is quite small, so we have to be parsimonious in the number of coefficients to be estimated!

Note that, since this dataset is quite small, we will use the whole dataset for the training phase of the regression model, and we will then apply bootstrap techniques to evaluate the performance of the model. This is obviously not the optimal way to proceed, but when the sample size is low to moderate (as often happens with medical data collected prospectively) there is no possibility of further splitting the sample: with such a small sample size we could not afford an initial split of the study sample into separate training and test subgroups. This is also a great obstacle to applying machine learning techniques to small-to-moderate medical datasets.

1.4 Univariable analysis

Even though univariable analysis is generally discouraged as a way to select candidate predictors (it works only with perfectly uncorrelated variables, which is nearly impossible in biological problems), it is another exploratory step that can be useful, also for discussing with clinicians the indications of association present in the data. So, first of all, we can create a table with univariable results, assuming for the moment a linear effect for all the continuous variables:

prostate_rid %>% select(-patno,-stage,-pf,-ekg,-status,-status_num,-subset) %>% 
  tbl_uvregression(method = glm, # glm function
                 method.args = list(family = binomial),# logistic model
                 exponentiate = T, # report OR
                 y=cvd,# outcome variable
                 conf.level = 0.95,
                 pvalue_fun = function(x) style_pvalue(x, digits = 3))
Characteristic                              N     OR     95% CI       p-value
rx                                         257
    placebo                                       ref.
    0.2 mg estrogen                               0.65   0.33, 1.27    0.210
    1.0 mg estrogen                               1.13   0.55, 2.36    0.736
    5.0 mg estrogen                               1.73   0.89, 3.41    0.107
Months of Follow-up                        257    1.00   0.98, 1.01    0.707
Age in Years                               257    1.07   1.03, 1.11   <0.001
Weight Index = wt(kg)-ht(cm)+200           257    1.00   0.99, 1.02    0.726
History of Cardiovascular Disease          257    3.96   2.37, 6.71   <0.001
Systolic Blood Pressure/10                 257    1.09   0.99, 1.21    0.082
Diastolic Blood Pressure/10                257    1.30   1.09, 1.57    0.005
Serum Hemoglobin (g/100ml)                 257    1.17   1.04, 1.33    0.012
Size of Primary Tumor (cm^2)               255    0.93   0.91, 0.95   <0.001
Combined Index of Stage and Hist. Grade    252    0.56   0.47, 0.66   <0.001
Serum Prostatic Acid Phosphatase           257    0.93   0.89, 0.96   <0.001
Bone Metastases                            257    0.28   0.15, 0.52   <0.001
ap80                                       257    5.74   3.38, 9.96   <0.001
ekg_rec                                    252
    bngn                                          ref.
    rd&ec                                         1.35   0.57, 3.20    0.492
    hbocd                                         1.04   0.25, 3.91    0.955
    hrts                                          2.04   1.11, 3.78    0.023
    MI                                            2.73   1.30, 5.88    0.009
pf_rec                                     257
    normal                                        ref.
    bed_under_50                                  1.36   0.58, 3.28    0.477
    bed_over_50_conf                              0.09   0.00, 0.47    0.021
log_ap                                     257    0.55   0.45, 0.65   <0.001

OR = Odds Ratio, CI = Confidence Interval

Note that here all the odds ratios (OR) express the change in odds for a 1-unit increase on the scale of the independent variable.
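
As a side note, a per-1-unit OR can be converted to a clinically more meaningful increment simply by raising it to the corresponding power; for example, using the univariable OR for age reported above (about 1.07 per year):

# OR for a 10-year increase in age, starting from the per-year OR of ~1.07
1.07^10   # approximately 1.97, i.e. roughly a doubling of the odds per decade of age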

This table should be discussed with the clinicians in order to check whether the observed statistically significant associations are in line with what is expected from the biological perspective.

Now we can come back to the functional forms for the continuous variables: let's try to use a nonlinear effect for wt and for dtime.

fit.splines.wt <-  lrm(cvd ~ rcs(wt,4), data=prostate_rid)
print(fit.splines.wt)
## Logistic Regression Model
## 
## lrm(formula = cvd ~ rcs(wt, 4), data = prostate_rid)
## 
##                       Model Likelihood      Discrimination    Rank Discrim.    
##                             Ratio Test             Indexes          Indexes    
## Obs           257    LR chi2      0.66      R2       0.003    C       0.518    
##  0            130    d.f.            3      R2(3,257)0.000    Dxy     0.036    
##  1            127    Pr(> chi2) 0.8833    R2(3,192.7)0.000    gamma   0.037    
## max |deriv| 2e-12                           Brier    0.249    tau-a   0.018    
## 
##           Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept -2.5286 3.5066 -0.72  0.4708  
## wt         0.0300 0.0418  0.72  0.4731  
## wt'       -0.1046 0.1446 -0.72  0.4693  
## wt''       0.3319 0.4557  0.73  0.4664
summary(fit.splines.wt)
##              Effects              Response : cvd 
## 
##  Factor      Low High Diff. Effect   S.E.    Lower 0.95 Upper 0.95
##  wt          89  106  17    -0.14415 0.32689 -0.78484   0.49654   
##   Odds Ratio 89  106  17     0.86576      NA  0.45619   1.64300

We observe that there is no significant effect for the nonlinear components of the spline.
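
This can also be checked with a formal Wald test of the spline terms (a quick sketch; its output is not reported here):

# The "Nonlinear" row of the Wald ANOVA tests the wt' and wt'' terms jointly
anova(fit.splines.wt)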

fit.splines.dtime <-  lrm(cvd ~ rcs(dtime,4), data=prostate_rid)
print(fit.splines.dtime)
## Logistic Regression Model
## 
## lrm(formula = cvd ~ rcs(dtime, 4), data = prostate_rid)
## 
##                       Model Likelihood      Discrimination    Rank Discrim.    
##                             Ratio Test             Indexes          Indexes    
## Obs           257    LR chi2      8.04      R2       0.041    C       0.591    
##  0            130    d.f.            3      R2(3,257)0.019    Dxy     0.181    
##  1            127    Pr(> chi2) 0.0452    R2(3,192.7)0.026    gamma   0.185    
## max |deriv| 1e-06                           Brier    0.242    tau-a   0.091    
## 
##           Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept  0.6180 0.3867  1.60  0.1100  
## dtime     -0.0991 0.0422 -2.35  0.0188  
## dtime'     0.4881 0.1833  2.66  0.0078  
## dtime''   -1.2217 0.4470 -2.73  0.0063
summary(fit.splines.dtime)
##              Effects              Response : cvd 
## 
##  Factor      Low High Diff. Effect  S.E.    Lower 0.95 Upper 0.95
##  dtime       11  37   26    0.73072 0.35467 0.035589   1.4259    
##   Odds Ratio 11  37   26    2.07660      NA 1.036200   4.1614

Here, instead, there is a significant nonlinear effect for the dtime variable (as expected, since it is the follow-up time…). In this case, the odds ratio is reported as a contrast between two specific values (37 vs 11), i.e. the third and first quartiles of the variable. This is done to help with the interpretation of the nonlinear effect modelled by the spline function.
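
If a different contrast is of interest, the two values being compared can be set explicitly in summary(); for example, with two follow-up times chosen purely for illustration:

# Odds ratio contrasting, e.g., 30 vs 6 months of follow-up (example values only)
summary(fit.splines.dtime, dtime = c(6, 30))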

Let’s visualize the effect of this candidate predictor modeled in a nonlinear way:

ggplot(Predict(fit.splines.dtime))

This is quite similar to the non parametric regression previously used:

plsmo(prostate_rid$dtime, prostate_rid$cvd) 

1.5 Multivariable full model

We decide to start with a full (no selection of predictors) binary logistic model assuming linearity for continuous predictors except for dtime:

fit.multi <- lrm(cvd ~ sz + sg + log(ap) + sbp + dbp + age + wt +hg + ekg_rec + pf_rec + bm + hx + rx +rcs(dtime,4), data =prostate_rid)

Note here that there is usually an upper bound on the number of candidate predictors that can be used in a [logistic/Poisson/survival] regression model, dictated by the number of events occurring in the sample (we will return to this aspect when the sample size issue in multivariable modelling is discussed).
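
As a rough, informal check (not a formal sample size calculation), one can relate the number of events, i.e. the less frequent outcome class, to the number of candidate parameters of the full model:

# Rule of thumb: roughly 10-15 events per candidate parameter are usually recommended
n.events <- min(table(prostate_rid$cvd))  # 127 patients in the smaller outcome class
n.events / 22                             # about 5.8 events per candidate d.f.: quite low

This again suggests being parsimonious and carefully validating the model internally.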

Let’s see the results:

print(fit.multi)
## Frequencies of Missing Values Due to Each Variable
##     cvd      sz      sg      ap     sbp     dbp     age      wt      hg ekg_rec 
##       0       2       5       0       0       0       0       0       0       5 
##  pf_rec      bm      hx      rx   dtime 
##       0       0       0       0       0 
## 
## Logistic Regression Model
## 
## lrm(formula = cvd ~ sz + sg + log(ap) + sbp + dbp + age + wt + 
##     hg + ekg_rec + pf_rec + bm + hx + rx + rcs(dtime, 4), data = prostate_rid)
## 
## 
##                        Model Likelihood       Discrimination    Rank Discrim.    
##                              Ratio Test              Indexes          Indexes    
## Obs           245    LR chi2     150.16       R2       0.611    C       0.906    
##  0            121    d.f.            22      R2(22,245)0.407    Dxy     0.811    
##  1            124    Pr(> chi2) <0.0001    R2(22,183.7)0.502    gamma   0.811    
## max |deriv| 2e-09                             Brier    0.124    tau-a   0.407    
## 
##                         Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept               -4.7151 3.6106 -1.31  0.1916  
## sz                      -0.0635 0.0174 -3.64  0.0003  
## sg                      -0.3281 0.1247 -2.63  0.0085  
## ap                      -0.3808 0.1468 -2.59  0.0095  
## sbp                     -0.0278 0.0931 -0.30  0.7654  
## dbp                      0.4263 0.1727  2.47  0.0136  
## age                      0.1004 0.0317  3.16  0.0016  
## wt                      -0.0172 0.0148 -1.16  0.2442  
## hg                       0.0708 0.1060  0.67  0.5042  
## ekg_rec=rd&ec           -0.5096 0.6760 -0.75  0.4509  
## ekg_rec=hbocd           -1.3951 0.8889 -1.57  0.1166  
## ekg_rec=hrts             0.1518 0.4586  0.33  0.7407  
## ekg_rec=MI               0.0106 0.6210  0.02  0.9864  
## pf_rec=bed_under_50      1.3081 0.8132  1.61  0.1077  
## pf_rec=bed_over_50_conf -2.2541 1.2487 -1.81  0.0710  
## bm                       0.0981 0.5397  0.18  0.8557  
## hx                       1.3719 0.4195  3.27  0.0011  
## rx=0.2 mg estrogen      -0.3661 0.5245 -0.70  0.4852  
## rx=1.0 mg estrogen       0.9277 0.5579  1.66  0.0964  
## rx=5.0 mg estrogen       0.7030 0.5238  1.34  0.1796  
## dtime                   -0.1256 0.0691 -1.82  0.0691  
## dtime'                   0.6755 0.2979  2.27  0.0234  
## dtime''                 -1.7766 0.7258 -2.45  0.0144
summary(fit.multi)
##              Effects              Response : cvd 
## 
##  Factor                           Low      High  Diff.   Effect    S.E.   
##  sz                                6.00000  24.5 18.5000 -1.174100 0.32268
##   Odds Ratio                       6.00000  24.5 18.5000  0.309110      NA
##  sg                                9.00000  12.0  3.0000 -0.984350 0.37423
##   Odds Ratio                       9.00000  12.0  3.0000  0.373680      NA
##  ap                                0.59998   7.0  6.4000 -0.935590 0.36060
##   Odds Ratio                       0.59998   7.0  6.4000  0.392360      NA
##  sbp                              13.00000  16.0  3.0000 -0.083384 0.27943
##   Odds Ratio                      13.00000  16.0  3.0000  0.920000      NA
##  dbp                               7.00000   9.0  2.0000  0.852590 0.34535
##   Odds Ratio                       7.00000   9.0  2.0000  2.345700      NA
##  age                              70.00000  76.0  6.0000  0.602180 0.19035
##   Odds Ratio                      70.00000  76.0  6.0000  1.826100      NA
##  wt                               89.00000 106.0 17.0000 -0.292780 0.25140
##   Odds Ratio                      89.00000 106.0 17.0000  0.746180      NA
##  hg                               12.00000  14.6  2.5996  0.184020 0.27553
##   Odds Ratio                      12.00000  14.6  2.5996  1.202000      NA
##  bm                                0.00000   1.0  1.0000  0.098134 0.53966
##   Odds Ratio                       0.00000   1.0  1.0000  1.103100      NA
##  hx                                0.00000   1.0  1.0000  1.371900 0.41948
##   Odds Ratio                       0.00000   1.0  1.0000  3.943000      NA
##  dtime                            11.00000  37.0 26.0000  1.053600 0.55450
##   Odds Ratio                      11.00000  37.0 26.0000  2.867900      NA
##  ekg_rec - rd&ec:bngn              1.00000   2.0      NA -0.509560 0.67596
##   Odds Ratio                       1.00000   2.0      NA  0.600760      NA
##  ekg_rec - hbocd:bngn              1.00000   3.0      NA -1.395100 0.88895
##   Odds Ratio                       1.00000   3.0      NA  0.247820      NA
##  ekg_rec - hrts:bngn               1.00000   4.0      NA  0.151770 0.45859
##   Odds Ratio                       1.00000   4.0      NA  1.163900      NA
##  ekg_rec - MI:bngn                 1.00000   5.0      NA  0.010576 0.62096
##   Odds Ratio                       1.00000   5.0      NA  1.010600      NA
##  pf_rec - bed_under_50:normal      1.00000   2.0      NA  1.308100 0.81317
##   Odds Ratio                       1.00000   2.0      NA  3.699200      NA
##  pf_rec - bed_over_50_conf:normal  1.00000   3.0      NA -2.254100 1.24870
##   Odds Ratio                       1.00000   3.0      NA  0.104970      NA
##  rx - 0.2 mg estrogen:placebo      1.00000   2.0      NA -0.366070 0.52451
##   Odds Ratio                       1.00000   2.0      NA  0.693460      NA
##  rx - 1.0 mg estrogen:placebo      1.00000   3.0      NA  0.927670 0.55790
##   Odds Ratio                       1.00000   3.0      NA  2.528600      NA
##  rx - 5.0 mg estrogen:placebo      1.00000   4.0      NA  0.702950 0.52383
##   Odds Ratio                       1.00000   4.0      NA  2.019700      NA
##  Lower 0.95 Upper 0.95
##  -1.8065000 -0.54161  
##   0.1642300  0.58181  
##  -1.7178000 -0.25087  
##   0.1794500  0.77813  
##  -1.6424000 -0.22882  
##   0.1935200  0.79547  
##  -0.6310500  0.46429  
##   0.5320300  1.59090  
##   0.1757100  1.52950  
##   1.1921000  4.61570  
##   0.2291100  0.97525  
##   1.2575000  2.65180  
##  -0.7855300  0.19996  
##   0.4558800  1.22140  
##  -0.3560000  0.72405  
##   0.7004700  2.06280  
##  -0.9595800  1.15590  
##   0.3830500  3.17670  
##   0.5497700  2.19410  
##   1.7329000  8.97190  
##  -0.0332040  2.14040  
##   0.9673400  8.50280  
##  -1.8344000  0.81529  
##   0.1597100  2.25980  
##  -3.1374000  0.34723  
##   0.0433970  1.41510  
##  -0.7470500  1.05060  
##   0.4737600  2.85930  
##  -1.2065000  1.22760  
##   0.2992500  3.41320  
##  -0.2856600  2.90190  
##   0.7515200 18.20900  
##  -4.7014000  0.19330  
##   0.0090825  1.21320  
##  -1.3941000  0.66195  
##   0.2480600  1.93860  
##  -0.1657900  2.02110  
##   0.8472200  7.54690  
##  -0.3237300  1.72960  
##   0.7234500  5.63860
an <- anova (fit.multi)
an
##                 Wald Statistics          Response: cvd 
## 
##  Factor     Chi-Square d.f. P     
##  sz         13.24       1   0.0003
##  sg          6.92       1   0.0085
##  ap          6.73       1   0.0095
##  sbp         0.09       1   0.7654
##  dbp         6.09       1   0.0136
##  age        10.01       1   0.0016
##  wt          1.36       1   0.2442
##  hg          0.45       1   0.5042
##  ekg_rec     3.60       4   0.4628
##  pf_rec      6.11       2   0.0471
##  bm          0.03       1   0.8557
##  hx         10.70       1   0.0011
##  rx          6.67       3   0.0831
##  dtime       8.80       3   0.0321
##   Nonlinear  8.02       2   0.0182
##  TOTAL      62.82      22   <.0001
plot(an) 

Rather than interpreting the single regression coefficients and their corresponding statistical significance (even though this aspect is also important in the last phase, i.e. when presenting results to the clinicians!), in prediction modelling we are mostly interested in evaluating the accuracy of the predictions.

Now we want to evaluate the performance of the full model:

s         <- fit.multi$stats
round(s, digits=3)
##          Obs    Max Deriv   Model L.R.         d.f.            P            C 
##      245.000        0.000      150.158       22.000        0.000        0.906 
##          Dxy        Gamma        Tau-a           R2      R2(245)   R2(22,245) 
##        0.811        0.811        0.407        0.611        0.458        0.407 
##    R2(183.7) R2(22,183.7)        Brier            g           gr           gp 
##        0.558        0.502        0.124        3.008       20.245        0.408

C is the AUC (a discrimination measure) and Dxy is related to C (Dxy = 2*(C - 0.5)), so they are both discrimination measures: they measure how well the model performs in the relative ranking of the estimated risks.
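
As a quick sanity check, this relation can be verified on the stored statistics:

# Dxy recomputed from C: should match the value reported above (about 0.81)
2 * (s['C'] - 0.5)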

gamma.hat <-  (s['Model L.R.'] - s['d.f.'])/s['Model L.R.']
gamma.hat
## Model L.R. 
##  0.8534877

Estimating the shrinkage coefficient gamma allows us to quantify the amount of overfitting present, and it also allows us to estimate how reliably the model will predict new observations. The van Houwelingen-Le Cessie heuristic shrinkage estimate is 0.85, indicating that this model will validate on new data about 15% worse than on this dataset. This is also related to the concept of calibration (the agreement between predicted and observed risks).

1.6 Model visualization: a useful interim step to be discussed with clinicians!

Let’s now visualize the predictors effects on the logit scale:

ggplot(Predict(fit.multi))

Now let’s see them on the odds ratio scale:

plot(summary(fit.multi),log=TRUE)

The general function is plot(summary(fit.multi)), but it is less readable; for this reason we use the option log=TRUE.

Estimates of the associations are interquartile-range odds ratios for continuous predictors and odds ratios with respect to a reference level for categorical predictors. The numbers at the left are the upper quartile vs the lower quartile, or the current group vs the reference group. The bars represent confidence limits: the intervals are drawn on the log odds ratio scale and labelled on the odds ratio scale. Ranges are on the original scale. Discussing these results with clinicians is also useful for thinking about possible interaction effects between some of the candidate predictors; in this example we will not examine this issue (a purely illustrative sketch is shown below).
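
For reference only, this is how such an interaction (say, between history of cardiovascular disease and treatment) could be specified; we do not pursue it here, also given the limited sample size:

# Purely illustrative: hx-by-rx interaction added to the full model (hypothetical extension)
fit.inter <- lrm(cvd ~ sz + sg + log(ap) + sbp + dbp + age + wt + hg + ekg_rec +
                   pf_rec + bm + hx * rx + rcs(dtime, 4), data = prostate_rid)
anova(fit.inter)  # the interaction row gives the approximate Wald test of the hx:rx terms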

1.7 Using more splines!

Now, forgetting for a moment our sample size limitation, we want to check whether it is worth expanding all the other continuous variables with splines in a full-model approach:

fit.all.splines <-  lrm(cvd ~ rcs(sz,4) + rcs(sg,4) + rcs(log(ap),4) +rcs(sbp,4) + rcs(dbp,4) + rcs(age,4) 
                        + rcs(wt,4) + rcs(hg,4) + ekg_rec + pf_rec + bm + hx + rx + rcs(dtime,4),data=prostate_rid)
print(fit.all.splines)
## Frequencies of Missing Values Due to Each Variable
##     cvd      sz      sg      ap     sbp     dbp     age      wt      hg ekg_rec 
##       0       2       5       0       0       0       0       0       0       5 
##  pf_rec      bm      hx      rx   dtime 
##       0       0       0       0       0 
## 
## Logistic Regression Model
## 
## lrm(formula = cvd ~ rcs(sz, 4) + rcs(sg, 4) + rcs(log(ap), 4) + 
##     rcs(sbp, 4) + rcs(dbp, 4) + rcs(age, 4) + rcs(wt, 4) + rcs(hg, 
##     4) + ekg_rec + pf_rec + bm + hx + rx + rcs(dtime, 4), data = prostate_rid)
## 
## 
##                        Model Likelihood       Discrimination    Rank Discrim.    
##                              Ratio Test              Indexes          Indexes    
## Obs           245    LR chi2     177.11       R2       0.686    C       0.933    
##  0            121    d.f.            38      R2(38,245)0.433    Dxy     0.865    
##  1            124    Pr(> chi2) <0.0001    R2(38,183.7)0.531    gamma   0.865    
## max |deriv| 4e-11                             Brier    0.104    tau-a   0.434    
## 
##                         Coef     S.E.    Wald Z Pr(>|Z|)
## Intercept               -24.1991 13.5603 -1.78  0.0743  
## sz                       -0.2597  0.1295 -2.00  0.0450  
## sz'                       1.6649  0.9134  1.82  0.0683  
## sz''                     -3.1513  1.6810 -1.87  0.0608  
## sg                        0.5706  0.5774  0.99  0.3230  
## sg'                      -4.7064  2.0646 -2.28  0.0226  
## sg''                     21.8060  8.9783  2.43  0.0152  
## ap                       -0.4486  0.8526 -0.53  0.5988  
## ap'                       3.6227  5.5246  0.66  0.5120  
## ap''                     -7.7643 10.2918 -0.75  0.4506  
## sbp                      -0.0453  0.3717 -0.12  0.9030  
## sbp'                     -0.1906  1.2777 -0.15  0.8814  
## sbp''                     0.5496  7.7526  0.07  0.9435  
## dbp                      -0.1280  0.4527 -0.28  0.7773  
## dbp'                      0.4811  1.0301  0.47  0.6405  
## dbp''                     3.6419  6.7889  0.54  0.5916  
## age                       0.3420  0.1106  3.09  0.0020  
## age'                     -0.3789  0.1770 -2.14  0.0323  
## age''                     3.6426  2.0773  1.75  0.0795  
## wt                        0.0024  0.0893  0.03  0.9788  
## wt'                      -0.0771  0.2949 -0.26  0.7936  
## wt''                      0.2732  0.9056  0.30  0.7629  
## hg                        0.2602  0.3600  0.72  0.4697  
## hg'                      -0.9822  0.8826 -1.11  0.2657  
## hg''                      7.9914  5.4689  1.46  0.1439  
## ekg_rec=rd&ec            -0.6508  0.7960 -0.82  0.4136  
## ekg_rec=hbocd            -2.0159  1.0458 -1.93  0.0539  
## ekg_rec=hrts             -0.0923  0.5329 -0.17  0.8625  
## ekg_rec=MI               -0.1072  0.7061 -0.15  0.8793  
## pf_rec=bed_under_50       1.7917  0.9468  1.89  0.0584  
## pf_rec=bed_over_50_conf  -1.2181  1.3352 -0.91  0.3616  
## bm                       -0.1454  0.6230 -0.23  0.8155  
## hx                        1.8027  0.5203  3.46  0.0005  
## rx=0.2 mg estrogen       -0.4657  0.5854 -0.80  0.4263  
## rx=1.0 mg estrogen        0.6931  0.6265  1.11  0.2686  
## rx=5.0 mg estrogen        0.5258  0.6375  0.82  0.4095  
## dtime                    -0.1925  0.0810 -2.38  0.0175  
## dtime'                    0.9895  0.3536  2.80  0.0051  
## dtime''                  -2.5924  0.8682 -2.99  0.0028
summary(fit.all.splines)
##              Effects              Response : cvd 
## 
##  Factor                           Low      High  Diff.   Effect     S.E.   
##  sz                                6.00000  24.5 18.5000 -0.4753200 0.57195
##   Odds Ratio                       6.00000  24.5 18.5000  0.6216800      NA
##  sg                                9.00000  12.0  3.0000 -3.0788000 0.86958
##   Odds Ratio                       9.00000  12.0  3.0000  0.0460160      NA
##  ap                                0.59998   7.0  6.4000 -0.0015339 0.71565
##   Odds Ratio                       0.59998   7.0  6.4000  0.9984700      NA
##  sbp                              13.00000  16.0  3.0000 -0.4203600 0.64517
##   Odds Ratio                      13.00000  16.0  3.0000  0.6568100      NA
##  dbp                               7.00000   9.0  2.0000  0.7533600 0.70002
##   Odds Ratio                       7.00000   9.0  2.0000  2.1241000      NA
##  age                              70.00000  76.0  6.0000 -0.4358300 0.56104
##   Odds Ratio                      70.00000  76.0  6.0000  0.6467300      NA
##  wt                               89.00000 106.0 17.0000 -0.4036000 0.59476
##   Odds Ratio                      89.00000 106.0 17.0000  0.6679100      NA
##  hg                               12.00000  14.6  2.5996 -0.5632000 0.62210
##   Odds Ratio                      12.00000  14.6  2.5996  0.5693800      NA
##  bm                                0.00000   1.0  1.0000 -0.1453900 0.62300
##   Odds Ratio                       0.00000   1.0  1.0000  0.8646800      NA
##  hx                                0.00000   1.0  1.0000  1.8027000 0.52030
##   Odds Ratio                       0.00000   1.0  1.0000  6.0659000      NA
##  dtime                            11.00000  37.0 26.0000  1.3504000 0.64498
##   Odds Ratio                      11.00000  37.0 26.0000  3.8589000      NA
##  ekg_rec - rd&ec:bngn              1.00000   2.0      NA -0.6508500 0.79604
##   Odds Ratio                       1.00000   2.0      NA  0.5216000      NA
##  ekg_rec - hbocd:bngn              1.00000   3.0      NA -2.0159000 1.04580
##   Odds Ratio                       1.00000   3.0      NA  0.1332000      NA
##  ekg_rec - hrts:bngn               1.00000   4.0      NA -0.0922990 0.53294
##   Odds Ratio                       1.00000   4.0      NA  0.9118300      NA
##  ekg_rec - MI:bngn                 1.00000   5.0      NA -0.1072100 0.70615
##   Odds Ratio                       1.00000   5.0      NA  0.8983400      NA
##  pf_rec - bed_under_50:normal      1.00000   2.0      NA  1.7917000 0.94679
##   Odds Ratio                       1.00000   2.0      NA  5.9995000      NA
##  pf_rec - bed_over_50_conf:normal  1.00000   3.0      NA -1.2181000 1.33520
##   Odds Ratio                       1.00000   3.0      NA  0.2957800      NA
##  rx - 0.2 mg estrogen:placebo      1.00000   2.0      NA -0.4657400 0.58539
##   Odds Ratio                       1.00000   2.0      NA  0.6276700      NA
##  rx - 1.0 mg estrogen:placebo      1.00000   3.0      NA  0.6931000 0.62648
##   Odds Ratio                       1.00000   3.0      NA  1.9999000      NA
##  rx - 5.0 mg estrogen:placebo      1.00000   4.0      NA  0.5257800 0.63747
##   Odds Ratio                       1.00000   4.0      NA  1.6918000      NA
##  Lower 0.95 Upper 0.95
##  -1.5963000  0.64568  
##   0.2026400  1.90730  
##  -4.7831000 -1.37440  
##   0.0083699  0.25299  
##  -1.4042000  1.40110  
##   0.2455700  4.05970  
##  -1.6849000  0.84415  
##   0.1854700  2.32600  
##  -0.6186600  2.12540  
##   0.5386600  8.37610  
##  -1.5354000  0.66378  
##   0.2153600  1.94210  
##  -1.5693000  0.76211  
##   0.2081900  2.14280  
##  -1.7825000  0.65608  
##   0.1682200  1.92720  
##  -1.3664000  1.07570  
##   0.2550100  2.93190  
##   0.7829300  2.82240  
##   2.1879000 16.81800  
##   0.0862480  2.61450  
##   1.0901000 13.66100  
##  -2.2111000  0.90936  
##   0.1095900  2.48270  
##  -4.0657000  0.03389  
##   0.0171510  1.03450  
##  -1.1368000  0.95224  
##   0.3208300  2.59150  
##  -1.4912000  1.27680  
##   0.2251000  3.58520  
##  -0.0640040  3.64740  
##   0.9380000 38.37300  
##  -3.8351000  1.39890  
##   0.0215980  4.05060  
##  -1.6131000  0.68161  
##   0.1992700  1.97710  
##  -0.5347800  1.92100  
##   0.5858000  6.82760  
##  -0.7236500  1.77520  
##   0.4849800  5.90150
an.s <- anova (fit.all.splines)
an.s
##                 Wald Statistics          Response: cvd 
## 
##  Factor          Chi-Square d.f. P     
##  sz              11.55       3   0.0091
##   Nonlinear       3.82       2   0.1481
##  sg              13.31       3   0.0040
##   Nonlinear       6.55       2   0.0379
##  ap               3.32       3   0.3443
##   Nonlinear       1.46       2   0.4827
##  sbp              1.40       3   0.7054
##   Nonlinear       0.14       2   0.9312
##  dbp             12.05       3   0.0072
##   Nonlinear       7.18       2   0.0276
##  age             13.01       3   0.0046
##   Nonlinear       5.02       2   0.0813
##  wt               0.69       3   0.8762
##   Nonlinear       0.16       2   0.9212
##  hg               3.13       3   0.3714
##   Nonlinear       3.03       2   0.2193
##  ekg_rec          4.20       4   0.3798
##  pf_rec           4.71       2   0.0951
##  bm               0.05       1   0.8155
##  hx              12.00       1   0.0005
##  rx               4.23       3   0.2373
##  dtime           12.40       3   0.0061
##   Nonlinear      10.89       2   0.0043
##  TOTAL NONLINEAR 24.81      18   0.1302
##  TOTAL           57.36      38   0.0227
plot(an.s) 

Let’s check what happens to the model’s performance:

s.splines <- fit.all.splines$stats # performance of the estimated model 
round(s.splines, digits=3) # C is the AUC and Dxy is related to C (Dxy=2*(C-0.5)), are both discrimination measures
##          Obs    Max Deriv   Model L.R.         d.f.            P            C 
##      245.000        0.000      177.108       38.000        0.000        0.933 
##          Dxy        Gamma        Tau-a           R2      R2(245)   R2(38,245) 
##        0.865        0.865        0.434        0.686        0.515        0.433 
##    R2(183.7) R2(38,183.7)        Brier            g           gr           gp 
##        0.619        0.531        0.104        4.107       60.753        0.432
gamma.hat.splines <-  (s.splines['Model L.R.'] - s.splines['d.f.'])/s.splines['Model L.R.'] 
gamma.hat.splines
## Model L.R. 
##  0.7854418

The van Houwelingen-Le Cessie heuristic shrinkage estimate is 0.79, indicating that this model will validate on new data about 21% worse than on this dataset (more complex than the previous model, more overfitting, as expected).

Let's compare the performance of the two models in terms of AIC, the Akaike Information Criterion (the smaller, the better):

AIC.models <- c(AIC(fit.multi), AIC(fit.all.splines))
AIC.models
## [1] 235.4473 240.4973

Based on AIC, the simpler model, fitted on the raw data and assuming linearity for all the continuous predictors except dtime, has a lower AIC and a lower optimism, so we would prefer it as a predictive tool.

1.8 Selecting candidate predictors using backwards step-down selection

Now we use fast backward step-down (with total residual AIC as the stopping rule, which is a reasonable way to use stepwise selection) to identify the variables that explain the bulk of the cause of death. This algorithm performs a slightly inefficient but numerically stable version of fast backward elimination on factors; it uses the fitted full model and computes approximate Wald statistics from conditional (restricted) maximum likelihood estimates (assuming multivariate normality of the estimates). The function prints the deletion statistics for each variable in turn, and prints approximate parameter estimates for the model after deleting variables. The approximation is better when the number of factors deleted is not large.

fastbw(fit.multi)
## 
##  Deleted Chi-Sq d.f. P      Residual d.f. P      AIC   
##  ekg_rec 3.60   4    0.4628  3.60     4   0.4628  -4.40
##  bm      0.01   1    0.9242  3.61     5   0.6069  -6.39
##  sbp     0.08   1    0.7830  3.69     6   0.7192  -8.31
##  hg      0.30   1    0.5827  3.99     7   0.7813 -10.01
##  wt      0.92   1    0.3380  4.91     8   0.7677 -11.09
##  rx      7.31   3    0.0628 12.21    11   0.3480  -9.79
##  pf_rec  5.72   2    0.0573 17.93    13   0.1602  -8.07
##  dtime   7.29   3    0.0633 25.22    16   0.0661  -6.78
##  sg      3.36   1    0.0667 28.58    17   0.0386  -5.42
##  dbp     4.98   1    0.0257 33.55    18   0.0143  -2.45
## 
## Approximate Estimates after Deleting Factors
## 
##               Coef    S.E. Wald Z         P
## Intercept -3.68667 2.00132 -1.842 0.0654580
## sz        -0.04316 0.01605 -2.689 0.0071632
## ap        -0.41027 0.11823 -3.470 0.0005202
## age        0.05934 0.02809  2.112 0.0346497
## hx         0.82995 0.36398  2.280 0.0225953
## 
## Factors in Final Model
## 
## [1] sz  ap  age hx

In the final step the following covariates are retained: sz, ap, age and hx.
Let’s estimate the reduced model:

fred <- lrm(cvd ~ sz + log(ap) + age + hx , data =prostate_rid)
print(fred)
## Frequencies of Missing Values Due to Each Variable
## cvd  sz  ap age  hx 
##   0   2   0   0   0 
## 
## Logistic Regression Model
## 
## lrm(formula = cvd ~ sz + log(ap) + age + hx, data = prostate_rid)
## 
## 
##                        Model Likelihood      Discrimination    Rank Discrim.    
##                              Ratio Test             Indexes          Indexes    
## Obs           255    LR chi2     100.81      R2       0.435    C       0.840    
##  0            129    d.f.             4      R2(4,255)0.316    Dxy     0.679    
##  1            126    Pr(> chi2) <0.0001    R2(4,191.2)0.397    gamma   0.679    
## max |deriv| 3e-08                            Brier    0.163    tau-a   0.341    
## 
##           Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept -4.9339 1.6846 -2.93  0.0034  
## sz        -0.0543 0.0136 -3.99  <0.0001 
## ap        -0.5048 0.1044 -4.83  <0.0001 
## age        0.0776 0.0235  3.30  0.0010  
## hx         1.0713 0.3101  3.45  0.0006
summary(fred)
##              Effects              Response : cvd 
## 
##  Factor      Low      High Diff. Effect   S.E.    Lower 0.95 Upper 0.95
##  sz           6.00000 24.5 18.5  -1.00520 0.25199 -1.49910   -0.51135  
##   Odds Ratio  6.00000 24.5 18.5   0.36595      NA  0.22332    0.59968  
##  ap           0.59998  7.0  6.4  -1.24020 0.25656 -1.74310   -0.73737  
##   Odds Ratio  0.59998  7.0  6.4   0.28932      NA  0.17498    0.47837  
##  age         70.00000 76.0  6.0   0.46531 0.14111  0.18874    0.74188  
##   Odds Ratio 70.00000 76.0  6.0   1.59250      NA  1.20770    2.09990  
##  hx           0.00000  1.0  1.0   1.07130 0.31007  0.46355    1.67900  
##   Odds Ratio  0.00000  1.0  1.0   2.91910      NA  1.58970    5.36010

Again, we want to evaluate the performance of the reduced model:

s.red <- fred$stats  
round(s.red, digits=3) 
##         Obs   Max Deriv  Model L.R.        d.f.           P           C 
##     255.000       0.000     100.810       4.000       0.000       0.840 
##         Dxy       Gamma       Tau-a          R2     R2(255)   R2(4,255) 
##       0.679       0.679       0.341       0.435       0.327       0.316 
##   R2(191.2) R2(4,191.2)       Brier           g          gr          gp 
##       0.410       0.397       0.163       1.917       6.800       0.340
gamma.hat.red <-  (s.red['Model L.R.'] - s.red['d.f.'])/s.red['Model L.R.']
gamma.hat.red
## Model L.R. 
##  0.9603214

The van Houwelingen-Le Cessie heuristic shrinkage estimate is now 0.96, indicating that this model will validate on new data about 4% worse than on this dataset (simpler than the full model, less overfitting).

Let us now use a bootstrap approach to further evaluate calibration and discrimination of this reduced model:

fred <- update(fred, x=TRUE , y=TRUE)
vred <- validate (fred, B=200)
vred
##           index.orig training    test optimism index.corrected   n
## Dxy           0.6792   0.6864  0.6694   0.0170          0.6623 200
## R2            0.4354   0.4478  0.4252   0.0226          0.4128 200
## Intercept     0.0000   0.0000 -0.0080   0.0080         -0.0080 200
## Slope         1.0000   1.0000  0.9580   0.0420          0.9580 200
## Emax          0.0000   0.0000  0.0108   0.0108          0.0108 200
## D             0.3914   0.4069  0.3801   0.0268          0.3646 200
## U            -0.0078  -0.0078  0.0020  -0.0098          0.0020 200
## Q             0.3993   0.4148  0.3782   0.0366          0.3626 200
## B             0.1630   0.1593  0.1668  -0.0075          0.1705 200
## g             1.9169   1.9934  1.8723   0.1212          1.7958 200
## gp            0.3396   0.3425  0.3352   0.0074          0.3322 200

The slope shrinkage, as already seen, is around 4%. There is a slight drop-off in all indexes. The estimated likely future predictive discrimination of the model, as measured by Somers' Dxy, changes from 0.68 to 0.66; the latter estimate is the one that should be claimed when describing model performance.
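
The corresponding bias-corrected AUC can be recovered from the corrected Dxy through the same relation used above:

# Bias-corrected C (AUC) from the optimism-corrected Dxy: C = Dxy/2 + 0.5
vred["Dxy", "index.corrected"] / 2 + 0.5   # about 0.83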

A nearly unbiased estimate of future calibration of the stepwise-derived model is given below:

fred <- update(fred, x=TRUE , y=TRUE)
cal  <- calibrate(fred, B=200)
plot(cal)

## 
## n=255   Mean absolute error=0.014   Mean squared error=0.00025
## 0.9 Quantile of absolute error=0.023

The plot shows the bootstrap overfitting-corrected calibration curve estimate for the backwards step-down cause-of-death logistic model, along with a rug plot showing the distribution of predicted risks. The smooth nonparametric calibration estimator (loess) is used. The estimated mean absolute error between estimated probabilities and observed proportions is 0.014: quite good!

For comparison, let us now consider a bootstrap validation of the full model, without any variable selection:

fit.multi <- update(fit.multi, x=TRUE , y=TRUE)
vfull     <- validate (fit.multi, B=200)
vfull
##           index.orig training   test optimism index.corrected   n
## Dxy           0.8113   0.8576 0.7641   0.0935          0.7178 200
## R2            0.6110   0.6780 0.5362   0.1419          0.4691 200
## Intercept     0.0000   0.0000 0.0051  -0.0051          0.0051 200
## Slope         1.0000   1.0000 0.6679   0.3321          0.6679 200
## Emax          0.0000   0.0000 0.0902   0.0902          0.0902 200
## D             0.6088   0.7087 0.5109   0.1978          0.4110 200
## U            -0.0082  -0.0082 0.0657  -0.0739          0.0657 200
## Q             0.6170   0.7169 0.4452   0.2717          0.3453 200
## B             0.1241   0.1045 0.1415  -0.0370          0.1611 200
## g             3.0079   3.9649 2.5829   1.3820          1.6259 200
## gp            0.4077   0.4281 0.3793   0.0489          0.3588 200

Compared to the full model, the step-down model has much less optimism, as expected, even if it has a smaller Dxy (discrimination) due to the loss of information from removing moderately important variables. Let's now explore the calibration of the full model:

cal.full <- calibrate(fit.multi, B=200)
plot(cal.full)

## 
## n=245   Mean absolute error=0.04   Mean squared error=0.00195
## 0.9 Quantile of absolute error=0.062

The estimated mean absolute error between estimated probabilities and observed proportions is 0.04; the bias-corrected performance is worse than that of the reduced model.

We could also perform a test to compare the two models in terms of AUC (i.e. discrimination power). Note that the reduced model has fewer missing values (in any case, the percentage of missing values in this dataset is quite small). Let's extract the probabilities predicted by the two models:

full.pred  <- predict(fit.multi, type="fitted") 
rid.pred   <- predict(fred, type="fitted") 
dati.pp    <- data.frame(prostate_rid, full.pred, rid.pred)

And perform the test:

roc.test(response = dati.pp$cvd, predictor1= dati.pp$full.pred, predictor2 = dati.pp$rid.pred)
## 
##  DeLong's test for two correlated ROC curves
## 
## data:  dati.pp$full.pred and dati.pp$rid.pred by dati.pp$cvd (0, 1)
## Z = 3.4561, p-value = 0.0005481
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
##  0.02628403 0.09515026
## sample estimates:
## AUC of roc1 AUC of roc2 
##   0.9056252   0.8449080

We observe, as expected, significantly better discrimination for the full model, but remember that this model is more prone to overfitting (its bootstrap-corrected calibration is worse…) and that these measures are derived from the whole training dataset, so they are optimistic and should be expected to shrink on new data.

roc.full  <- roc(dati.pp$cvd, dati.pp$full.pred)
roc.red   <- roc(dati.pp$cvd, dati.pp$rid.pred)
plot(roc.full)
plot(roc.red, add=TRUE, col="red")
legend("bottomright",legend=c("Full model", "Reduced model"), lty=c(1,1), col=c("black", "red"), bty="n")
title("Discrimination on the development sample")
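
If needed, confidence intervals for each AUC can also be obtained with pROC (a short sketch; output not shown):

# DeLong confidence intervals for the two AUCs (pROC default method)
ci.auc(roc.full)
ci.auc(roc.red)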

1.9 Another threshold for selecting candidate predictors

We can try to be less conservative in estimating a reduced model, in order to increase discrimination, by using a different significance level for a variable to stay in the model and by using individual approximate Wald tests rather than tests combining all deleted variables:

vrid2 <- validate(fit.multi, bw=TRUE , sls =0.5, type = 'individual' , B=200)
## 
##      Backwards Step-down - Original Model
## 
##  Deleted Chi-Sq d.f. P      Residual d.f. P      AIC   
##  ekg_rec 3.60   4    0.4628 3.60     4    0.4628  -4.40
##  bm      0.01   1    0.9242 3.61     5    0.6069  -6.39
##  sbp     0.08   1    0.7830 3.69     6    0.7192  -8.31
##  hg      0.30   1    0.5827 3.99     7    0.7813 -10.01
##  wt      0.92   1    0.3380 4.91     8    0.7677 -11.09
## 
## Approximate Estimates after Deleting Factors
## 
##                             Coef    S.E.  Wald Z         P
## Intercept               -5.12176 3.12777 -1.6375 0.1015240
## sz                      -0.05646 0.01689 -3.3422 0.0008313
## sg                      -0.33179 0.12027 -2.7587 0.0058026
## ap                      -0.34171 0.13370 -2.5558 0.0105945
## dbp                      0.35870 0.14429  2.4859 0.0129210
## age                      0.09340 0.03045  3.0677 0.0021573
## pf_rec=bed_under_50      1.28840 0.78205  1.6475 0.0994630
## pf_rec=bed_over_50_conf -2.31951 1.21744 -1.9052 0.0567492
## hx                       1.27821 0.39842  3.2082 0.0013358
## rx=0.2 mg estrogen      -0.28720 0.50331 -0.5706 0.5682597
## rx=1.0 mg estrogen       0.94249 0.54264  1.7369 0.0824130
## rx=5.0 mg estrogen       0.85665 0.50700  1.6896 0.0910964
## dtime                   -0.11929 0.06723 -1.7742 0.0760268
## dtime'                   0.67239 0.29006  2.3181 0.0204412
## dtime''                 -1.78597 0.70664 -2.5274 0.0114911
## 
## Factors in Final Model
## 
## [1] sz     sg     ap     dbp    age    pf_rec hx     rx     dtime 
## singular information matrix in lrm.fit (rank= 22 ).  Offending variable(s):
## ekg_rec=hbocd 
## 
## Divergence or singularity in 1 samples

Now let’s estimate a second reduced model:

fred2 <- lrm(cvd ~ sz + sg+ log(ap) + dbp + age + hx +pf_rec+rx +rcs(dtime,4) , data =prostate_rid)
fred2
## Frequencies of Missing Values Due to Each Variable
##    cvd     sz     sg     ap    dbp    age     hx pf_rec     rx  dtime 
##      0      2      5      0      0      0      0      0      0      0 
## 
## Logistic Regression Model
## 
## lrm(formula = cvd ~ sz + sg + log(ap) + dbp + age + hx + pf_rec + 
##     rx + rcs(dtime, 4), data = prostate_rid)
## 
## 
##                        Model Likelihood       Discrimination    Rank Discrim.    
##                              Ratio Test              Indexes          Indexes    
## Obs           250    LR chi2     145.18       R2       0.587    C       0.897    
##  0            125    d.f.            14      R2(14,250)0.408    Dxy     0.794    
##  1            125    Pr(> chi2) <0.0001    R2(14,187.5)0.503    gamma   0.794    
## max |deriv| 7e-10                             Brier    0.131    tau-a   0.398    
## 
##                         Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept               -4.7269 3.0077 -1.57  0.1160  
## sz                      -0.0614 0.0165 -3.71  0.0002  
## sg                      -0.3203 0.1167 -2.74  0.0061  
## ap                      -0.3365 0.1311 -2.57  0.0103  
## dbp                      0.3416 0.1381  2.47  0.0134  
## age                      0.0878 0.0290  3.02  0.0025  
## hx                       1.3681 0.3838  3.56  0.0004  
## pf_rec=bed_under_50      1.2721 0.7466  1.70  0.0884  
## pf_rec=bed_over_50_conf -2.3608 1.2607 -1.87  0.0611  
## rx=0.2 mg estrogen      -0.2781 0.4865 -0.57  0.5676  
## rx=1.0 mg estrogen       0.9088 0.5227  1.74  0.0821  
## rx=5.0 mg estrogen       0.9254 0.4880  1.90  0.0579  
## dtime                   -0.1070 0.0644 -1.66  0.0966  
## dtime'                   0.5910 0.2755  2.15  0.0319  
## dtime''                 -1.5783 0.6699 -2.36  0.0185
summary(fred2)
##              Effects              Response : cvd 
## 
##  Factor                           Low      High Diff. Effect    S.E.   
##  sz                                6.00000 24.5 18.5  -1.135100 0.30581
##   Odds Ratio                       6.00000 24.5 18.5   0.321390      NA
##  sg                                9.00000 12.0  3.0  -0.960990 0.35016
##   Odds Ratio                       9.00000 12.0  3.0   0.382510      NA
##  ap                                0.59998  7.0  6.4  -0.826700 0.32204
##   Odds Ratio                       0.59998  7.0  6.4   0.437490      NA
##  dbp                               7.00000  9.0  2.0   0.683210 0.27626
##   Odds Ratio                       7.00000  9.0  2.0   1.980200      NA
##  age                              70.00000 76.0  6.0   0.526990 0.17427
##   Odds Ratio                      70.00000 76.0  6.0   1.693800      NA
##  hx                                0.00000  1.0  1.0   1.368100 0.38383
##   Odds Ratio                       0.00000  1.0  1.0   3.928100      NA
##  dtime                            11.00000 37.0 26.0   0.924720 0.51670
##   Odds Ratio                      11.00000 37.0 26.0   2.521200      NA
##  pf_rec - bed_under_50:normal      1.00000  2.0   NA   1.272100 0.74664
##   Odds Ratio                       1.00000  2.0   NA   3.568500      NA
##  pf_rec - bed_over_50_conf:normal  1.00000  3.0   NA  -2.360800 1.26070
##   Odds Ratio                       1.00000  3.0   NA   0.094345      NA
##  rx - 0.2 mg estrogen:placebo      1.00000  2.0   NA  -0.278060 0.48652
##   Odds Ratio                       1.00000  2.0   NA   0.757250      NA
##  rx - 1.0 mg estrogen:placebo      1.00000  3.0   NA   0.908770 0.52267
##   Odds Ratio                       1.00000  3.0   NA   2.481300      NA
##  rx - 5.0 mg estrogen:placebo      1.00000  4.0   NA   0.925360 0.48803
##   Odds Ratio                       1.00000  4.0   NA   2.522800      NA
##  Lower 0.95 Upper 0.95
##  -1.7345000 -0.53573  
##   0.1764900  0.58524  
##  -1.6473000 -0.27468  
##   0.1925700  0.75981  
##  -1.4579000 -0.19552  
##   0.2327300  0.82241  
##   0.1417600  1.22470  
##   1.1523000  3.40300  
##   0.1854300  0.86855  
##   1.2037000  2.38350  
##   0.6158600  2.12040  
##   1.8513000  8.33470  
##  -0.0879970  1.93740  
##   0.9157600  6.94090  
##  -0.1912400  2.73550  
##   0.8259400 15.41800  
##  -4.8317000  0.11013  
##   0.0079727  1.11640  
##  -1.2316000  0.67550  
##   0.2918200  1.96500  
##  -0.1156500  1.93320  
##   0.8907900  6.91150  
##  -0.0311690  1.88190  
##   0.9693100  6.56590

Let’s have a look at the performance of this alternative reduced model:

fred2 <- update(fred2, x=TRUE , y=TRUE)
vred2 <- validate (fred2, B=200)
vred2
##           index.orig training   test optimism index.corrected   n
## Dxy           0.7937   0.8220 0.7626   0.0594          0.7342 200
## R2            0.5873   0.6307 0.5393   0.0914          0.4959 200
## Intercept     0.0000   0.0000 0.0129  -0.0129          0.0129 200
## Slope         1.0000   1.0000 0.7782   0.2218          0.7782 200
## Emax          0.0000   0.0000 0.0580   0.0580          0.0580 200
## D             0.5767   0.6386 0.5146   0.1240          0.4527 200
## U            -0.0080  -0.0080 0.0234  -0.0314          0.0234 200
## Q             0.5847   0.6466 0.4912   0.1554          0.4293 200
## B             0.1315   0.1193 0.1425  -0.0232          0.1547 200
## g             2.8350   3.4161 2.6168   0.7993          2.0357 200
## gp            0.3990   0.4121 0.3805   0.0316          0.3674 200

The performance statistics are now midway between the full model and the smaller stepwise model.

cal.red2 <- calibrate(fred2, B=200)
plot(cal.red2)

## 
## n=250   Mean absolute error=0.03   Mean squared error=0.00103
## 0.9 Quantile of absolute error=0.041

The mean absolute error is 0.03.

Let’s compare the AUCs:

rid2.pred    <- predict(fred2, type="fitted") 
dati.pp2     <- data.frame(dati.pp,rid2.pred)
roc.test(response = dati.pp2$cvd, predictor1= dati.pp2$full.pred, predictor2 = dati.pp2$rid2.pred)
## 
##  DeLong's test for two correlated ROC curves
## 
## data:  dati.pp2$full.pred and dati.pp2$rid2.pred by dati.pp2$cvd (0, 1)
## Z = 1.1526, p-value = 0.2491
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
##  -0.004248807  0.016378906
## sample estimates:
## AUC of roc1 AUC of roc2 
##   0.9056252   0.8995601

On the development sample, the discrimination is comparable to that of the full model, but with better calibration.

roc.full  <- roc(dati.pp2$cvd, dati.pp2$full.pred)
roc.red2  <- roc(dati.pp2$cvd, dati.pp2$rid2.pred)
plot(roc.full)
plot(roc.red2, add=TRUE, col="red")
legend("bottomright",legend=c("Full model", "Reduced 2 model"), lty=c(1,1), col=c("black", "red"), bty="n")
title("Discrimination on the development sample")

1.10 Final decision!

Finally we decide to present our estimated prediction model to the clinicians. Let’s prepare a visualization that could be helpful to discuss with them; we rename some levels of the categorical variables:

prostate_rid$hxf              <- as.factor(prostate_rid$hx)
levels(prostate_rid$hxf)      <- c("No", "Yes")
label(prostate_rid$hxf)       <- "History of CV disease"


options(datadist='dd')
dd <-datadist(prostate_rid)
dd
##                     patno stage              rx dtime               status
## Low:effect      131.00000     3            <NA>    11                 <NA>
## Adjust to       265.00000     3         placebo    24  dead - prostatic ca
## High:effect     395.00000     4            <NA>    37                 <NA>
## Low:prediction   24.96109     3         placebo     1                alive
## High:prediction 482.03891     4 5.0 mg estrogen    58 dead - unknown cause
## Low               3.00000     3         placebo     0                alive
## High            506.00000     4 5.0 mg estrogen    71 dead - unknown cause
##                      age       wt              pf hx sbp dbp          ekg
## Low:effect      70.00000  89.0000            <NA>  0  13   7         <NA>
## Adjust to       73.00000  97.0000 normal activity  0  14   8 heart strain
## High:effect     76.00000 106.0000            <NA>  1  16   9         <NA>
## Low:prediction  55.96109  76.0000 normal activity  0  11   6       normal
## High:prediction 80.00000 125.0389 confined to bed  1  19  10    recent MI
## Low             48.00000  71.0000 normal activity  0   8   4       normal
## High            88.00000 152.0000 confined to bed  1  24  13    recent MI
##                        hg       sz sg           ap bm ap80 ekg_rec
## Low:effect      12.000000  6.00000  9   0.59997559  0    0    <NA>
## Adjust to       13.599609 14.00000 11   1.09985352  0    0    bngn
## High:effect     14.599609 24.50000 12   7.00000000  1    1    <NA>
## Low:prediction   9.298828  2.00000  8   0.19998169  0    0    bngn
## High:prediction 16.503800 46.11673 14  54.97592412  1    1      MI
## Low              5.899414  0.00000  5   0.09999084  0    0    bngn
## High            21.199219 69.00000 15 999.87500000  1    1      MI
##                           pf_rec           status_num subset cvd      log_ap
## Low:effect                  <NA>                 <NA>      1   0 -0.51086631
## Adjust to                 normal  dead - prostatic ca      1   0  0.09517701
## High:effect                 <NA>                 <NA>      1   1  1.94591015
## Low:prediction            normal                alive      1   0 -1.60952947
## High:prediction bed_over_50_conf dead - unknown cause      1   1  4.00677071
## Low                       normal                alive      1   0 -2.30267665
## High            bed_over_50_conf dead - unknown cause      1   1  6.90763027
##                  hxf
## Low:effect      <NA>
## Adjust to        Yes
## High:effect     <NA>
## Low:prediction    No
## High:prediction  Yes
## Low               No
## High             Yes
## 
## Values:
## 
## stage : 3 4 
## rx : placebo 0.2 mg estrogen 1.0 mg estrogen 5.0 mg estrogen 
## status :
##  [1] alive                        dead - prostatic ca         
##  [3] dead - heart or vascular     dead - cerebrovascular      
##  [5] dead - pulmonary embolus     dead - other ca             
##  [7] dead - respiratory disease   dead - other specific non-ca
##  [9] dead - unspecified non-ca    dead - unknown cause        
## pf : normal activity in bed < 50% daytime in bed > 50% daytime confined to bed 
## hx : 0 1 
## dbp :  4  5  6  7  8  9 10 11 12 13 
## ekg :
## [1] normal                            benign                           
## [3] rhythmic disturb & electrolyte ch heart block or conduction def    
## [5] heart strain                      old MI                           
## [7] recent MI                        
## bm : 0 1 
## ap80 : 0 1 
## ekg_rec : bngn rd&ec hbocd hrts MI 
## pf_rec : normal bed_under_50 bed_over_50_conf 
## status_num :
##  [1] alive                        dead - prostatic ca         
##  [3] dead - heart or vascular     dead - cerebrovascular      
##  [5] dead - pulmonary embolus     dead - other ca             
##  [7] dead - respiratory disease   dead - other specific non-ca
##  [9] dead - unspecified non-ca    dead - unknown cause        
## subset : 1 
## cvd : 0 1 
## hxf : No Yes

We now fit the selected final model:

fredfin <- lrm(cvd ~ sz + sg+ log(ap) + dbp+ age + hxf +pf_rec+rx+rcs(dtime,4), data =prostate_rid)
fredfin
## Frequencies of Missing Values Due to Each Variable
##    cvd     sz     sg     ap    dbp    age    hxf pf_rec     rx  dtime 
##      0      2      5      0      0      0      0      0      0      0 
## 
## Logistic Regression Model
## 
## lrm(formula = cvd ~ sz + sg + log(ap) + dbp + age + hxf + pf_rec + 
##     rx + rcs(dtime, 4), data = prostate_rid)
## 
## 
##                        Model Likelihood       Discrimination    Rank Discrim.    
##                              Ratio Test              Indexes          Indexes    
## Obs           250    LR chi2     145.18       R2       0.587    C       0.897    
##  0            125    d.f.            14      R2(14,250)0.408    Dxy     0.794    
##  1            125    Pr(> chi2) <0.0001    R2(14,187.5)0.503    gamma   0.794    
## max |deriv| 7e-10                             Brier    0.131    tau-a   0.398    
## 
##                         Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept               -4.7269 3.0077 -1.57  0.1160  
## sz                      -0.0614 0.0165 -3.71  0.0002  
## sg                      -0.3203 0.1167 -2.74  0.0061  
## ap                      -0.3365 0.1311 -2.57  0.0103  
## dbp                      0.3416 0.1381  2.47  0.0134  
## age                      0.0878 0.0290  3.02  0.0025  
## hxf=Yes                  1.3681 0.3838  3.56  0.0004  
## pf_rec=bed_under_50      1.2721 0.7466  1.70  0.0884  
## pf_rec=bed_over_50_conf -2.3608 1.2607 -1.87  0.0611  
## rx=0.2 mg estrogen      -0.2781 0.4865 -0.57  0.5676  
## rx=1.0 mg estrogen       0.9088 0.5227  1.74  0.0821  
## rx=5.0 mg estrogen       0.9254 0.4880  1.90  0.0579  
## dtime                   -0.1070 0.0644 -1.66  0.0966  
## dtime'                   0.5910 0.2755  2.15  0.0319  
## dtime''                 -1.5783 0.6699 -2.36  0.0185
nom <-  nomogram (fredfin , ap=c(.1 , .5 , 1, 5, 10, 20, 30, 40),
                fun=plogis , funlabel ="Probability ",
                lp=TRUE, 
                fun.at =c(.01,.1,.25,.5,.75,.95))
nom
## Points per unit of linear predictor: 21.51285 
## Linear predictor units per point   : 0.04648384 
## 
## 
##  sz Points
##   0 92    
##   5 86    
##  10 79    
##  15 73    
##  20 66    
##  25 59    
##  30 53    
##  35 46    
##  40 40    
##  45 33    
##  50 26    
##  55 20    
##  60 13    
##  65  7    
##  70  0    
## 
## 
##  sg Points
##   5 69    
##   6 62    
##   7 55    
##   8 48    
##   9 41    
##  10 34    
##  11 28    
##  12 21    
##  13 14    
##  14  7    
##  15  0    
## 
## 
##  ap   Points
##   0.1 43    
##   0.5 32    
##   1.0 27    
##   5.0 15    
##  10.0 10    
##  20.0  5    
##  30.0  2    
##  40.0  0    
## 
## 
##  dbp Points
##   4   0    
##   5   7    
##   6  15    
##   7  22    
##   8  29    
##   9  37    
##  10  44    
##  11  51    
##  12  59    
##  13  66    
## 
## 
##  age Points
##  45   0    
##  50   9    
##  55  19    
##  60  28    
##  65  38    
##  70  47    
##  75  57    
##  80  66    
##  85  76    
##  90  85    
## 
## 
##  History of CV disease Points
##  No                     0    
##  Yes                   29    
## 
## 
##  pf_rec           Points
##  normal           51    
##  bed_under_50     78    
##  bed_over_50_conf  0    
## 
## 
##  rx              Points
##  placebo          6    
##  0.2 mg estrogen  0    
##  1.0 mg estrogen 26    
##  5.0 mg estrogen 26    
## 
## 
##  Months of Follow-up Points
##   0                  100   
##  10                   79   
##  20                   79   
##  30                   96   
##  40                   95   
##  50                   77   
##  60                   52   
##  70                   26   
##  80                    0   
## 
## 
##  Total Points Probability 
##           268         0.01
##           319         0.10
##           343         0.25
##           367         0.50
##           390         0.75
##           430         0.95

This is a useful tool to visualize a prediction model and discuss it with clinicians:

plot(nom,cex.axis=.75)
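The same predicted probability that the nomogram provides can also be computed directly from the fitted model. Below is a minimal sketch for a purely hypothetical patient profile (all covariate values are invented here for illustration):

new.pt <- data.frame(sz = 10, sg = 10, ap = 0.5, dbp = 8, age = 70,
                     hxf = "Yes", pf_rec = "normal", rx = "placebo", dtime = 24)
# predicted probability of cardiovascular-cerebrovascular death for this hypothetical profile
predict(fredfin, newdata = new.pt, type = "fitted")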

2 Simple Linear Regression (a useful recap)

First of all, install and load the required libraries: rms, Hmisc and epiDisplay.

library(rms)
library(Hmisc)
library(epiDisplay)

Let’s simulate data from a sample of n=100 points along with the population linear regression line. The conditional distribution of y|x can be thought of as a vertical slice at x. The unconditional distribution of y is shown on the y-axis. To envision the conditional normal distributions assumed for the underlying population, think of a bell-shaped curve coming out of the page, with its base along one of the vertical lines of points. The equal variance assumption (homoscedasticity) dictates that the series of Gaussian curves for all the different x values have equal variances.

n <- 100
set.seed(13)
x <- round(rnorm(n, .5, .25), 1)
y <- x + rnorm(n, 0, .1)
r <- c(-.2, 1.2)

Plot:

plot(x, y, axes=FALSE, xlim=r, ylim=r, xlab=expression(x), ylab=expression(y))
axis(1, at=r, labels=FALSE)     
axis(2, at=r, labels=FALSE)
abline(a=0,b=1)
histSpike(y, side=2, add=TRUE)
abline(v=.6, lty=2)

Simple linear regression is used when we want to describe how the mean of a continuous outcome y changes as a linear function of a single explanatory variable x, under the assumptions illustrated above (independent observations, normally distributed errors with constant variance).

2.1 Interval Estimation: evaluating the uncertainty about predictions

Estimation of the confidence intervals (CI) for predictions depends on what you want to predict: the response for a new individual or the mean response at the population level.

x1 <- c( 1,  3,  5,  6,  7,  9,  11)
y  <- c( 5, 10, 70, 58, 85, 89, 135)
dd <- datadist(x1, n.unique=5); options(datadist='dd')
f  <- ols(y ~ x1)
p1 <- Predict(f, x1=seq(1,11,length=100), conf.type='mean')
p2 <- Predict(f, x1=seq(1,11,length=100), conf.type='individual')
p  <- rbind(Mean=p1, Individual=p2)
ggplot(p, legend.position='none') +    
  geom_point(aes(x1, y), data=data.frame(x1, y, .set.=''))

Example usages:

  • Is a child of age x smaller than predicted for her age? Use the individual level, p2 (wider bands)

  • What is the best estimate of the population mean blood pressure for patients on treatment A? Use the mean population level, p1 (narrower bands)
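For comparison, the same two types of intervals can also be obtained with base R’s predict.lm on the same toy data (a minimal sketch; fb and new are names introduced here just for illustration):

fb  <- lm(y ~ x1)
new <- data.frame(x1 = seq(1, 11, length = 100))
head(predict(fb, newdata = new, interval = "confidence"))  # CI for the mean (narrower bands)
head(predict(fb, newdata = new, interval = "prediction"))  # CI for a new individual (wider bands)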

2.2 Assessing the Goodness of Fit

It is crucial to verify the assumptions underlying a linear regression model:

  • In a scatterplot the spread of y about the fitted line should be constant as x increases, and y vs. x should appear linear

  • Easier to see this with a plot of residuals vs estimated values

  • In this plot there should be no systematic patterns (no trend in central tendency, no change in spread of points with x)

  • Trend in central tendency indicates failure of linearity

  • qqnorm plot of residuals is a useful tool

Here is an example: we fit a linear regression model where x and y should instead have been log-transformed:

n <- 50
set.seed(2)
res <- rnorm(n, sd=.25)
x <- runif(n)
y <- exp(log(x) + res)
f <- ols(y ~ x)
plot(fitted(f), resid(f)) 

This plot depicts non-constant variance of the residuals, which might call for transforming y.

Now, we fit a linear model that should have been quadratic (functional form of X):

x <- runif(n, -1, 1)
y <- x ^ 2 + res
f <- ols(y ~ x)
plot(fitted(f), resid(f))

Finally, we fit a correct model:

y <- x + res
f <- ols(y ~ x)
plot(fitted(f), resid(f))

qqnorm(resid(f)); qqline(resid(f))

These plots show the ideal situation of white noise (no trend, constant variance). The qq plot demonstrates approximate normality of residuals, for a sample of size n = 50.

2.3 Application example of simple linear regression: Hookworm & blood loss

The dataset concerns the relationship between hookworm and blood loss from a study conducted in 1970.

data(Suwit)
summary(Suwit)
##        id            worm            bloss      
##  Min.   : 1.0   Min.   :  32.0   Min.   : 5.03  
##  1st Qu.: 4.5   1st Qu.: 167.0   1st Qu.:14.02  
##  Median : 8.0   Median : 525.0   Median :33.80  
##  Mean   : 8.0   Mean   : 552.4   Mean   :33.45  
##  3rd Qu.:11.5   3rd Qu.: 796.5   3rd Qu.:41.70  
##  Max.   :15.0   Max.   :1929.0   Max.   :86.65
des(Suwit)
## HW and Blood loss SEAJTMH 1970; 
##  No. of observations =  15 
##   Variable      Class           Description   
## 1 id            numeric                       
## 2 worm          numeric         No. of worms  
## 3 bloss         numeric         Blood loss/day
summ(Suwit)
## HW and Blood loss SEAJTMH 1970;
## No. of observations = 15
## 
##   Var. name obs. mean   median  s.d.   min.   max.  
## 1 id        15   8      8       4.47   1      15    
## 2 worm      15   552.4  525     513.9  32     1929  
## 3 bloss     15   33.45  33.8    24.85  5.03   86.65
attach(Suwit)

The file is clean and ready for analysis (this happens here only for didactic purposes: in real life, you will usually spend a couple of hours, at minimum, if not days, cleaning datasets). For example, with this small sample size it is somewhat straightforward to verify that there is no repetition of ‘id’ and no missing values. The records have been sorted in ascending order of ‘worm’ (number of worms), ranging from 32 in the first subject to 1929 in the last one. Blood loss (‘bloss’) is, however, not sorted. The 13th record has the highest blood loss of 86 ml per day, which is very high. The objective of this analysis is to predict blood loss using the number of worms.
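As a minimal sketch, the two statements above (no repeated ‘id’, no missing values) can be checked directly:

anyDuplicated(Suwit$id)   # 0 indicates no duplicated id
sum(is.na(Suwit))         # 0 indicates no missing values anywhere in the data frame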

Now, let’s have a look at the data:

plot(worm, bloss, xlab="No. of worms", ylab="ml. per day",
main = "Blood loss by number of hookworms in the bowel")

A linear model using the above two variables seems reasonable.

lm1 <- lm(bloss ~ worm, data=Suwit)
lm1
## 
## Call:
## lm(formula = bloss ~ worm, data = Suwit)
## 
## Coefficients:
## (Intercept)         worm  
##    10.84733      0.04092

Displaying the model by typing ‘lm1’ gives limited information (essentially, the estimated regression line coefficients). To get more information, one can look at the attributes of this model, its summary and attributes of its summary.

attr(lm1, "names")
##  [1] "coefficients"  "residuals"     "effects"       "rank"         
##  [5] "fitted.values" "assign"        "qr"            "df.residual"  
##  [9] "xlevels"       "call"          "terms"         "model"
summary(lm1)
## 
## Call:
## lm(formula = bloss ~ worm, data = Suwit)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -15.846 -10.812   0.750   4.356  34.390 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 10.847327   5.308569   2.043   0.0618 .  
## worm         0.040922   0.007147   5.725 6.99e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 13.74 on 13 degrees of freedom
## Multiple R-squared:  0.716,  Adjusted R-squared:  0.6942 
## F-statistic: 32.78 on 1 and 13 DF,  p-value: 6.99e-05

The first section of the summary shows the formula that was ‘called’. The second section gives the distribution of the residuals. The pattern is clearly not symmetric: the maximum is too far to the right (34.39) compared to the minimum (-15.85), and the first quartile (-10.81) is further to the left of the median (0.75) than the third quartile (4.36) is to the right. Otherwise, the median is close to zero.

The third section gives the coefficients of the intercept and of the effect of ‘worm’ on blood loss. The intercept is 10.8, meaning that when there are no worms the blood loss is estimated to be 10.8 ml per day. This is, however, not significantly different from zero since the P value is 0.0618. The coefficient of ‘worm’ is 0.04, indicating that each worm is associated with an average increase of 0.04 ml of blood loss per day. Although the value is small, it is highly significantly different from zero; when there are many worms, the blood loss can be very substantial. The multiple R-squared value of 0.716 indicates that 71.6% of the variation in the data is explained by the model. The adjusted value is 0.6942. (The calculation of R-squared is discussed in the analysis of variance section below.)

The last section describes further details of the residuals and the hypothesis test on the effect of ‘worm’ using the F-statistic. The P value from this section (0.0000699) is equal to the one tested by the t-distribution in the coefficient section. This F-test more commonly appears in the analysis of variance table.
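As a small complement to the output above, the 95% confidence intervals for the estimated coefficients can be obtained directly (a one-line sketch):

confint(lm1)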

2.3.1 Analysis of variance table, R-squared and adjusted R-squared

summary(aov(lm1))
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## worm         1   6192    6192   32.78 6.99e-05 ***
## Residuals   13   2455     189                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The above analysis of variance (aov) table breaks down the degrees of freedom, sum of squares and mean square of the outcome (blood loss) by source (in this case there are only two sources: worm and residuals).

The so-called ‘square’ is actually the square of the difference between the value and the mean. The total sum of squares of blood loss is therefore:

SST <- sum((bloss-mean(bloss))^2)
SST
## [1] 8647.044

The sum of squares from residuals is:

SSR <- sum(residuals(lm1)^2)
SSR
## [1] 2455.468

See also the analysis of variance table. The sum of squares of worm or sum of squares of difference between the fitted values and the grand mean is:

SSW <- sum((fitted(lm1)-mean(bloss))^2)
SSW
## [1] 6191.577

The latter two sums add up to the first one. The R-squared is the proportion of sum of squares of the fitted values to the total sum of squares.

SSW/SST
## [1] 0.7160339

Instead of sum of squares, one may consider the mean square as the level of variation. In such a case, the number of worms can reduce the total mean square (or variance) by: (total mean square - residual mean square) / total mean square, or (variance - residual mean square) / variance.

resid.msq <- sum(residuals(lm1)^2)/lm1$df.residual
Radj      <- (var(bloss)- resid.msq)/var(bloss)
Radj
## [1] 0.6941903

This is the adjusted R-squared shown in summary(lm1) in the above section.

2.3.2 F-test

When the mean square of ‘worm’ is divided by the mean square of residuals, the result is:

F <- SSW/resid.msq; F
## [1] 32.78011

Using this F value with the two corresponding degrees of freedom (from ‘worm’ and residuals) the P value for testing the effect of ‘worm’ can be computed.

pf(F, df1=1, df2=13, lower.tail=FALSE)
## [1] 6.990401e-05

The function pf is used to compute a P value from a given F value together with the two values of the degrees of freedom. The last argument ‘lower.tail’ is set to FALSE to obtain the right margin of the area under the curve of the F distribution. In summary, both the regression and analysis of variance give the same conclusion; that number of worms has a significant linear relationship with blood loss. Now the regression line can be drawn.

2.3.3 Regression line, fitted values and residuals

A regression line can be added to the scatter plot with the following command:

plot(worm, bloss, xlab="No. of worms", ylab="ml. per day",
main = "Blood loss by number of hookworms in the bowel", type="n")
abline(lm1)
points(worm, fitted(lm1), pch=18, col="blue")
segments(worm, bloss, worm, fitted(lm1), col="red")

The regression line has an intercept of 10.8 and a slope of 0.04. The expected (fitted) value is the blood loss estimated from the regression line at a specific value of ‘worm’. A residual is the difference between the observed and the expected value. The residuals can be drawn by adding the red segments.

The actual values of the residuals can be checked from the specific attribute of the defined linear model.

residuals(lm1) -> lm1.res
hist(lm1.res)

2.3.4 Checking the normality of residuals

The histogram of the residuals is not entirely convincing about their normal distribution shape. However, with such a small sample size, it is difficult to draw any conclusion. A better way to check normality is to plot the standardized residuals, (residual - mean) / standard deviation, against the expected normal scores; a reasonably straight line would indicate normality. Moreover, a formal test for normality can be calculated.

a <- qqnorm(lm1.res) 

shapiro.qqnorm(lm1.res, type="n")
text(a$x, a$y, labels=as.character(id))

If the residuals were perfectly normally distributed, the text symbols would fall along the straight dotted line. The graph suggests that the largest residual (13th) is too high (positive), whereas the smallest value (7th) is not as negative as expected. However, the P value from the Shapiro-Wilk test is 0.08, suggesting that the hypothesis of normally distributed residuals cannot be rejected.

Finally, the residuals are plotted against the fitted values to see if there is a pattern.

plot(fitted(lm1), lm1.res, xlab="Fitted values")

plot(fitted(lm1), lm1.res, xlab="Fitted values", type="n")
text(fitted(lm1), lm1.res, labels=as.character(id))
abline(h=0, col="blue")

There is no obvious pattern. The residuals are quite independent of the expected values. With this and the above findings from the qqnorm command we may conclude that the residuals are randomly and normally distributed.

The above two diagnostic plots for the model ‘lm1’ can also be obtained from:

par(mfrow=c(1,2))
plot(lm1, which=1:2)

detach(Suwit)

2.3.5 Final conclusion

From the analysis, it is clear that blood loss is associated with number of hookworms. On average, each additional worm is associated with an increase of 0.04 ml of blood loss. The remaining uncertainty of blood loss, apart from hookworm, is explained by random variation or other factors that were not measured.

2.4 Visualizing relationships/functional forms

Remember that linear regression always means linearity in the parameters, irrespective of linearity in the explanatory variables (i.e. of the functional forms selected for the covariates). Y is linearly related to X (in the variables) if the rate of change of Y with respect to X (\(dY/dX\)) is independent of the value of X. A function \(Y = b_{1}x_{1} + b_{2}x_{2}\) is said to be linear in a parameter, say \(b_{1}\), if \(b_{1}\) appears with a power of 1 only and is not multiplied or divided by any other parameter (e.g. \(b_{1} \times b_{2}\), or \(b_{2}/b_{1}\)).

This is a different concept from linearity in the functional form of the covariates, which is instead not required.

Moreover, some regression models may look non-linear in the parameters but are inherently (intrinsically) linear: with a suitable transformation they can be made linear in the parameters.
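As a minimal sketch (simulated data, invented parameter values), consider the model \(y = a\,e^{b x}\,\varepsilon\): it looks non-linear, but taking logs gives \(\log y = \log a + b x + \log \varepsilon\), which is linear in the parameters and can be fitted with lm:

set.seed(42)
x <- runif(100, 0, 2)
y <- 2 * exp(1.5 * x) * exp(rnorm(100, sd = 0.2))   # true a = 2, b = 1.5
f.loglin <- lm(log(y) ~ x)   # linear in the parameters after the log transform
coef(f.loglin)               # intercept estimates log(2), slope estimates 1.5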

Just as an example, imagine that we want to investigate the effect of total CSF polymorph count on blood glucose ratio in patients with either bacterial or viral meningitis.

require(Hmisc)
getHdata(abm)

As a first step in every statistical analysis, we should visualize data of interest.

In this example, we use a nonparametric regression approach to explore the relationship between the two variables: plsmo is a loess nonparametric smoother.

with(ABM, {
  glratio <- gl / bloodgl
  tpolys <- polys * whites / 100
  plsmo(tpolys, glratio, xlab='Total Polymorphs in CSF',
       ylab='CSF/Blood Glucose Ratio',    
       xlim=quantile(tpolys,  c(.05,.95), na.rm=TRUE),
       ylim=quantile(glratio, c(.05,.95), na.rm=TRUE))
 scat1d(tpolys); scat1d(glratio, side=4) })

Moreover, we can also use a super smoother (‘supsmu’) to relate age to the probability of bacterial meningitis, given that a patient has bacterial or viral meningitis (a binary outcome; see later in this block for logistic regression examples), with a rug plot showing the age distribution:

with(ABM, {
  plsmo(age, abm, 'supsmu', bass=7,     
        xlab='Age at Admission, Years',
        ylab='Proportion Bacterial Meningitis')
  scat1d(age) })

We can use these nonparametric approaches to help find a suitable functional form for a candidate predictor (when switching to parametric regression models), which could be very different from a linear effect. For example, we can decide to use splines to model the effect of a continuous predictor on an outcome. For those interested in splines, see: https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-019-0666-3
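As a minimal sketch (simulated data, arbitrary knot choice; ddx is a name introduced here only to avoid overwriting the datadist object used earlier), a restricted cubic spline for a continuous predictor can be specified with rcs() inside an rms model, and the estimated functional form can then be plotted:

set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.3)
ddx <- datadist(x); options(datadist = "ddx")
f.spline <- ols(y ~ rcs(x, 4))   # restricted cubic spline with 4 knots
ggplot(Predict(f.spline, x))     # estimated (non-linear) effect of x with confidence bands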

3 Multiple linear regression (MLR)

Datasets usually contain many variables collected during a study. It is often useful to see the relationship between two variables within the different levels of a third, categorical variable, i.e. to check for the presence of an interaction. (This is just a didactic example; usually more than 3 variables are involved in MLR models.)

3.1 Example: Systolic blood pressure

A small survey on blood pressure was carried out. The objective is to assess the hypertensive effect of adding table salt to meals. The gender of the subjects was also recorded.

data(BP)
attach(BP)
des(BP)
## cross-sectional survey on BP & risk factors 
##  No. of observations =  100 
##   Variable      Class           Description        
## 1 id            integer         id                 
## 2 sex           factor          sex                
## 3 sbp           integer         Systolic BP        
## 4 dbp           integer         Diastolic BP       
## 5 saltadd       factor          Salt added on table
## 6 birthdate     Date

Note that the maximum systolic and diastolic blood pressures are quite high. There are 20 missing values in saltadd. The frequencies of the categorical variables sex and saltadd are now inspected.

describe(data.frame(sex, saltadd))
##          vars   n mean  sd median trimmed mad min max range  skew kurtosis   se
## sex*        1 100 1.55 0.5      2    1.56   0   1   2     1 -0.20    -1.98 0.05
## saltadd*    2  80 1.54 0.5      2    1.55   0   1   2     1 -0.15    -2.00 0.06

The next step is to create a new age variable from birthdate. The calculation is based on 12th March 2001, the date of the survey (date of entry in the study).

age.in.days <- as.Date("2001-03-12") - birthdate

There is a leap year in every four years. Therefore, an average year will have 365.25 days.

class(age.in.days)
## [1] "difftime"
age <- as.numeric(age.in.days)/365.25

The function as.numeric is needed to transform the units of age (difftime); otherwise modelling would not be possible.

describeBy(sbp,saltadd)
## 
##  Descriptive statistics by group 
## group: no
##    vars  n   mean    sd median trimmed   mad min max range skew kurtosis   se
## X1    1 37 137.46 29.62    132  136.39 23.72  80 201   121 0.44    -0.42 4.87
## ------------------------------------------------------------ 
## group: yes
##    vars  n   mean    sd median trimmed   mad min max range  skew kurtosis   se
## X1    1 43 163.02 39.39    171  163.94 45.96  80 224   144 -0.25    -1.13 6.01

3.1.1 Recoding missing values into another category

The missing value group has the highest median and average systolic blood pressure. In order to create a new variable with three levels type:

saltadd1 <- saltadd
levels(saltadd1) <- c("no", "yes", "missing")
saltadd1[is.na(saltadd)] <- "missing"
summary(saltadd1)
##      no     yes missing 
##      37      43      20
summary(aov(age ~ saltadd1))
##             Df Sum Sq Mean Sq F value Pr(>F)
## saltadd1     2    115   57.42   0.448   0.64
## Residuals   97  12422  128.06

Since there is not enough evidence that the missing group is important, and for simplicity, we will assume MCAR (missing completely at random), ignore this group, and continue the analysis with the original saltadd variable consisting of only two levels. Before doing this, however, a simple regression model and regression line are first fitted.

lm1 <- lm(sbp ~ age)
summary(lm1)
## 
## Call:
## lm(formula = sbp ~ age)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -68.231 -21.801  -3.679  15.127 105.395 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  65.1465    14.8942   4.374 3.05e-05 ***
## age           1.8422     0.2997   6.147 1.71e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33.56 on 98 degrees of freedom
## Multiple R-squared:  0.2782, Adjusted R-squared:  0.2709 
## F-statistic: 37.78 on 1 and 98 DF,  p-value: 1.712e-08

Although the R-squared is not very high, the p value is small indicating important influence of age on systolic blood pressure. A scatterplot of age against systolic blood pressure is now shown with the regression line added using the abline function. This function can accept many different argument forms, including a regression object. If this object has a coef method, and it returns a vector of length 1, then the value is taken to be the slope of a line through the origin, otherwise the first two values are taken to be the intercept and slope, as is the case for lm1.

plot(age, sbp, main = "Systolic BP by age", xlab = "Years", ylab = "mm.Hg")
abline(lm1)

Subsequent exploration of the residuals suggests a non-significant deviation from normality and no pattern. The details can be obtained with the techniques already discussed and are omitted here. The next step is to use different plotting symbols for the different groups of salt habits. Note that here the sample size is lower (we decided to omit missing values for saltadd, reducing the analysis to the complete cases):

lm2 <- lm(sbp ~ age + saltadd)
summary(lm2)
## 
## Call:
## lm(formula = sbp ~ age + saltadd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -66.369 -24.972  -0.125  22.456  64.805 
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  63.1291    15.7645   4.005 0.000142 ***
## age           1.5526     0.3118   4.979 3.81e-06 ***
## saltaddyes   22.9094     6.9340   3.304 0.001448 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.83 on 77 degrees of freedom
##   (20 observations deleted due to missingness)
## Multiple R-squared:  0.3331, Adjusted R-squared:  0.3158 
## F-statistic: 19.23 on 2 and 77 DF,  p-value: 1.679e-07

On the average, a one year increment of age is associated with an increase in systolic blood pressure by 1.5 mmHg. Adding table salt increases systolic blood pressure significantly by approximately 23 mmHg.

plot(age, sbp, main="Systolic BP by age", xlab="Years", ylab="mm.Hg", type="n")
points(age[saltadd=="no"], sbp[saltadd=="no"], col="blue")
points(age[saltadd=="yes"], sbp[saltadd=="yes"], col="red",pch = 18)

Note that the red dots corresponding to those who added table salt are higher than the blue circles.

The final task is to draw a separate regression line for each group. The intercept for non-salt users will be the first coefficient, and for salt users the first plus the third. The slope is the same for both groups.

a0 <- coef(lm2)[1]
a1 <- coef(lm2)[1] + coef(lm2)[3]
b <- coef(lm2)[2]
plot(age, sbp, main="Systolic BP by age", xlab="Years", ylab="mm.Hg", type="n")
points(age[saltadd=="no"], sbp[saltadd=="no"], col="blue")
points(age[saltadd=="yes"], sbp[saltadd=="yes"], col="red",pch = 18)
abline(a = a0, b, col = "blue")
abline(a = a1, b, col = "red")

Note that X-axis does not start at zero. Thus the intercepts are out of the plot frame. The red line is for the red points of salt adders and the blue line is for the blue points of non-adders. In this model, age is assumed to have a constant effect on systolic blood pressure independently from added salt.

But look at the distributions of the points of the two colours: the red points are higher than the blue ones but mainly on the right half of the graph. To fit lines with different slopes, a new model with interaction term is created.

Therefore, the next step is to estimate a model with different slopes (i.e. different ‘b’ values for the abline arguments) for the two lines. The model needs an interaction term between saltadd and age.

lm3 <- lm(sbp ~ age * saltadd)
summary(lm3)
## 
## Call:
## lm(formula = sbp ~ age * saltadd)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -69.885 -24.891  -1.142  21.228  66.867 
## 
## Coefficients:
##                Estimate Std. Error t value Pr(>|t|)    
## (Intercept)     78.0066    20.3981   3.824 0.000267 ***
## age              1.2419     0.4128   3.009 0.003558 ** 
## saltaddyes     -12.2540    31.4574  -0.390 0.697965    
## age:saltaddyes   0.7199     0.6282   1.146 0.255441    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 30.77 on 76 degrees of freedom
##   (20 observations deleted due to missingness)
## Multiple R-squared:  0.3445, Adjusted R-squared:  0.3186 
## F-statistic: 13.31 on 3 and 76 DF,  p-value: 4.528e-07

For the intercept of the salt users, the second and the fourth terms are zero (since age is zero), but the third should be kept. This term is negative, so the intercept of the salt users is lower than that of the non-users.

a0 <- coef(lm3)[1]
a1 <- coef(lm3)[1] + coef(lm3)[3]

For the slope of the non-salt users, the second coefficient alone is enough, since the first and the third do not change with age and the fourth term has saltadd equal to 0. The slope for the salt users includes the second and the fourth coefficients, since saltaddyes is 1.

b0 <- coef(lm3)[2]
b1 <- coef(lm3)[2] + coef(lm3)[4]
plot(age, sbp, main="Systolic BP by age", xlab="Years",ylab="mm.Hg", pch=18, col=as.numeric(saltadd))
abline(a = a0, b = b0, col = 1)
abline(a = a1, b = b1, col = 2)
legend("topleft", legend = c("Salt added", "No salt added"),lty=1, col=c("red","black"))

Note that as.numeric(saltadd) converts the factor levels into the integers 1 (black) and 2 (red), representing the non-salt adders and the salt adders, respectively. These colour codes come from the R colour palette.

This model suggests that at young ages the systolic blood pressures of the two groups are not much different, as the two lines are close together on the left of the plot. For example, at the age of 25, the difference is 5.7 mmHg. Increasing age increases the difference between the two groups: at 70 years of age the difference is as great as 38 mmHg. In this respect, age modifies the effect of adding table salt on blood pressure.

On the other hand, the slope of age is 1.24 mmHg per year among those who did not add salt, but becomes 1.24 + 0.72 = 1.96 mmHg per year among the salt adders. Thus, salt adding modifies the effect of age. Note that interaction is a statistical term, whereas effect modification is the equivalent epidemiological term.
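A quick numerical check of these statements from the fitted coefficients (a small sketch; cf is just a temporary name introduced here):

cf <- coef(lm3)
cf["saltaddyes"] + 25 * cf["age:saltaddyes"]   # group difference at age 25 (about 5.7 mmHg)
cf["saltaddyes"] + 70 * cf["age:saltaddyes"]   # group difference at age 70 (about 38 mmHg)
cf["age"]                                      # slope for non-adders (about 1.24 mmHg/year)
cf["age"] + cf["age:saltaddyes"]               # slope for salt adders (about 1.96 mmHg/year)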

Note also that the coefficient of the interaction term age:saltaddyes is not statistically significant!

This means that the two slopes may differ just by chance (or that we are under-powered to detect a significant interaction).

This was in fact just a didactic example to show how to introduce an interaction term in a regression model.

4 Causal inference versus prediction

Conceptually, in prediction we compare outcomes across different combinations of values of the input variables in order to predict the outcome (or its probability). In causal inference, we ask what would happen to an outcome y as a result of a treatment or intervention. Predictive inference relates to comparisons between units (different groups/subjects); causal inference addresses comparisons of different treatments applied to the same unit.

4.1 The FEV dataset

The FEV, which is an acronym for Forced Expiratory Volume, is a measure of how much air a person can exhale (in liters) during a forced breath. In this dataset, the FEV of 606 children, between the ages of 6 and 17, was measured. The dataset also provides additional information on these children: their age, their height, their gender and, most importantly, whether the child is a smoker or a non-smoker (the exposure of interest). This is an observational, cross-sectional study.

The goal of this study was to find out whether or not smoking has an effect on the FEV of children.

Load the required libraries

library(tidyverse)
library(DataExplorer)
## Warning: package 'DataExplorer' was built under R version 4.3.2
library(SmartEDA)
## Warning: package 'SmartEDA' was built under R version 4.3.2
library(ggplot2)
library(ggstatsplot)
## Warning: package 'ggstatsplot' was built under R version 4.3.2

Set the working directory and import data

fev <- read_tsv(here("datasets","fev.txt"))
head(fev)
## # A tibble: 6 × 5
##     age   fev height gender smoking
##   <dbl> <dbl>  <dbl> <chr>    <dbl>
## 1     9  1.71   57   f            0
## 2     8  1.72   67.5 f            0
## 3     7  1.72   54.5 f            0
## 4     9  1.56   53   m            0
## 5     9  1.90   57   m            0
## 6     8  2.34   61   f            0

There are a few things in the formatting of the data that can be improved upon:

  1. Both the gender and smoking can be transformed to factors.
  2. The height variable is written in inches. Inches are hard to interpret. Let’s add a new column, height_cm, with the values converted to centimeter using the mutate function. (For this example we will not use this variable however).
fev <- fev %>%
  mutate(gender = as.factor(gender)) %>%
  mutate(smoking = as.factor(smoking)) %>%
  mutate(height_cm = height*2.54)

head(fev)
## # A tibble: 6 × 6
##     age   fev height gender smoking height_cm
##   <dbl> <dbl>  <dbl> <fct>  <fct>       <dbl>
## 1     9  1.71   57   f      0            145.
## 2     8  1.72   67.5 f      0            171.
## 3     7  1.72   54.5 f      0            138.
## 4     9  1.56   53   m      0            135.
## 5     9  1.90   57   m      0            145.
## 6     8  2.34   61   f      0            155.

4.2 Data Exploration

Now, let’s make a first explorative boxplot, showing only the FEV for both smoking categories.

fev %>%
  ggplot(aes(x=smoking,y=fev,fill=smoking)) +
  scale_fill_manual(values=c("dimgrey","firebrick")) +
  theme_bw() +
  geom_boxplot(outlier.shape=NA) + 
  geom_jitter(width = 0.2, size=0.1) +
  ggtitle("Boxplot of FEV versus smoking") +
  ylab("fev (l)") +
  xlab("smoking status")

Did you expect these results??

It appears that children who smoke have a higher median FEV than children who do not smoke. Should we change legislation worldwide and make smoking obligatory for children??

Maybe there is something else going on in the data. Now, we will generate a similar plot, but we will stratify the data based on age (age as factor).

fev %>%
  ggplot(aes(x=as.factor(age),y=fev,fill=smoking)) +
  geom_boxplot(outlier.shape=NA) +
  geom_point(width = 0.2, size = 0.1, position = position_jitterdodge()) +
  theme_bw() +
  scale_fill_manual(values=c("dimgrey","firebrick")) +
  ggtitle("Boxplot of FEV versus smoking, stratified on age") +
  ylab("fev (l)") +
  xlab("Age (years)")

This plot seems to already give us a more plausible picture. First, it seems that we do not have any smoking children of ages 6, 7 or 8. Second, when looking at the results per age “category”, it seems no longer the case that smokers have a much higher FEV than non-smokers; for the higher ages, the contrary seems true.

This shows that taking confounders into account (age, in this case) is crucial! If we simply analyse the dataset based on the smoking status and FEV values only, our inference might be incorrect:

fit1   <- lm(fev~smoking, data=fev)     # wrong beta0star, beta1star
fit2   <- lm(fev~age+smoking, data=fev) # "true" beta0, beta2, beta1
fitage <- lm(age~smoking, data=fev)     #  gamma0, gamma1
fit1
## 
## Call:
## lm(formula = fev ~ smoking, data = fev)
## 
## Coefficients:
## (Intercept)     smoking1  
##      2.6346       0.6054
fit2
## 
## Call:
## lm(formula = fev ~ age + smoking, data = fev)
## 
## Coefficients:
## (Intercept)          age     smoking1  
##      0.2040       0.2479      -0.2358
fitage
## 
## Call:
## lm(formula = age ~ smoking, data = fev)
## 
## Coefficients:
## (Intercept)     smoking1  
##       9.804        3.393
beta0     <- coef(fit2)[1]
beta2     <- coef(fit2)[2]
beta1     <- coef(fit2)[3]
gamma0    <- coef(fitage)[1]
gamma1    <- coef(fitage)[2]
beta0star <- coef(fit1)[1]
beta1star <- coef(fit1)[2]

Check that beta0star = beta0 + beta2*gamma0:

beta0 + beta2*gamma0
## (Intercept) 
##    2.634628
beta0star 
## (Intercept) 
##    2.634628

And also check that the (wrong) beta1star = beta1 + beta2*gamma1:

beta1 + beta2*gamma1
##  smoking1 
## 0.6054053
beta1star
##  smoking1 
## 0.6054053

Therefore, a beneficial estimated smoking effect is obtained when age is ignored. If the causal inference assumptions hold (see slides) we can consider beta1 as the average smoking effect in the population under study.

4.3 Factors that affect causal inference estimates

Imbalance and lack of complete overlap can make causal inference difficult. Recall: imbalance is when treatment groups differ with respect to an important covariate; lack of complete overlap is when some combination of treatment level and covariate level is missing altogether (no observations, a violation of the positivity assumption).
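As a quick sketch, overlap between the exposure and a key covariate can be inspected with a simple cross-tabulation (here smoking by age in the FEV data):

with(fev, table(smoking, age))   # cells with zero smokers signal lack of overlap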

To explain fev, sex seems to matter, especially among older individuals:

plot(fev~age, col=gender, data=fev)
legend("topleft", pch=1, col=1:2, levels(fev$gender))

Another way to plot the same thing is:

fev %>%
  ggplot(aes(x=as.factor(age),y=fev,fill=smoking)) +
  geom_boxplot(outlier.shape=NA) +
  geom_point(width = 0.2, size = 0.1, position = position_jitterdodge()) +
  theme_bw() +
  scale_fill_manual(values=c("dimgrey","firebrick")) +
  ggtitle("Boxplot of FEV versus smoking, stratified on age and gender") +
  ylab("fev (l)") +
  xlab("Age (years)") + 
  facet_grid(rows = vars(gender))

Especially at higher ages, the median FEV is higher for males than for females. [This could suggest an interaction between gender and age, which could be explored in the regression model, even if the interpretation of such an interaction could be quite tricky (many levels for age!)].

Moreover, there is a slight gender imbalance among age categories:

counts      <- table(fev$gender, as.factor(fev$age))
percentages <- round(prop.table(table(fev$gender, as.factor(fev$age)),2),digits=3)

barplot(percentages, main="Gender distribution",
  xlab="ages", col=c("pink", "darkblue"),
  legend = rownames(counts))

For imbalanced samples, simple comparisons of sample means between groups are not good estimates of treatment/risk factor effects. A model adjustment, where we add the covariate to the model, is of course one way to better estimate a treatment effect. In this case, for example, we can also add gender to the regression model:

fit3   <- lm(fev~age+smoking+gender, data=fev)
summary(fit3)
## 
## Call:
## lm(formula = fev ~ age + smoking + gender, data = fev)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1.39229 -0.35677 -0.03623  0.33644  1.91571 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  0.088007   0.097635   0.901   0.3677    
## age          0.243707   0.009523  25.591  < 2e-16 ***
## smoking1    -0.180489   0.080950  -2.230   0.0261 *  
## genderm      0.295794   0.044683   6.620 7.97e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.5465 on 602 degrees of freedom
## Multiple R-squared:  0.5681, Adjusted R-squared:  0.566 
## F-statistic:   264 on 3 and 602 DF,  p-value: < 2.2e-16

So now the effect of smoking is estimated conditionally on both age and gender and, as expected, it is negative on FEV. One possible problem with these estimates, however, is that for some combinations of age, gender and smoking we do not have any observed data, so the extrapolation made by the regression model might not be very reliable. Also in this case (no interaction), if the causal assumptions hold, we can interpret the effect of smoking as a marginal effect as well.

index           <- fev$smoking==0
counts.noS      <- table(fev[index,]$gender, as.factor(fev[index,]$age))
percentages.noS <- round(prop.table(table(fev[index,]$gender, as.factor(fev[index,]$age)),2),digits=3)

index1           <- fev$smoking==1
counts.S      <- table(fev[index1,]$gender, as.factor(fev[index1,]$age))
percentages.S <- round(prop.table(table(fev[index1,]$gender, as.factor(fev[index1,]$age)),2),digits=3)

par(mfrow=c(1,2))
barplot(percentages.noS, main="Gender distribution among non-smokers",cex.main=0.8, 
  xlab="ages", col=c("pink", "darkblue"),
  legend = rownames(percentages.noS))
barplot(percentages.S, main="Gender distribution among smokers",cex.main=0.8, 
  xlab="ages", col=c("pink", "darkblue"),
  legend = rownames(percentages.S))

Observe that in fact there is also a lack of “smoking” children below age 9: lack of complete overlap is when there are no observations at all for some combination(s) of treatment levels / covariate levels.

With lack of complete overlap, there is no data available for some comparisons, which requires extrapolation using a model to make those comparisons. This is in fact the job that the regression model does, but extrapolation is always a risky business in regions where no data are available. This is an even more serious problem than imbalance.

Matching is a possible strategy in these situations to overcome (avoid) imbalance, even if some data will be discarded (see in the propensity score methods examples).

4.4 Supplementary material: estimating a marginal effect from a regression model with an interaction

Load the required libraries

library(tidyverse) 
library(broom) 
library(gtsummary) 
library(MatchIt) 
library(geepack) 
## Warning: package 'geepack' was built under R version 4.3.2
library(boot)

Let’s simulate the data : first of all set the parameters of the data generating mechanism:

set.seed(0) 
n.obs = 10000 #set sample size 
#---- True parameters in outcome model  

b0 = 60 
b1 = 5 
b2 = -0.3 
b3 = -0.1 
b4 = 8 
b5 = 3 
b6 = 2

#---- True parameters in exposure odds model

g0 = log(0.20/(1-0.20)) 
g1 = log(1.01) 
g2 = log(1.005) 
g3 = log(0.6) 
g4 = log(0.5) 
g5 = log(0.8)

#Function to compute outcome values
## Use the parameters specified above

mean_out <- function(C1, C2, C3, exposure){
b0 + b1*exposure + b2*C1 + b3*I(C1^2) + b4*C2 + b5*C3 + b6*exposure*C2 + rnorm(n = n.obs, mean = 0, sd = 5)
}

# Function to compute exposure probabilities 
## Use the parameters specified above

prob_exp <- function(C1, C2, C3){
exp(g0 + g1*C1 + g2*I(C1^2) + g3*C2 + g4*C3 + g5*C2*C3)/(1 + exp(g0 + g1*C1+ g2*I(C1^2) + g3*C2 + g4*C3 + g5*C2*C3))
}

Now simulate the dataset :

df.sim <- tibble("ID" = seq(from = 1, to = n.obs, by = 1),
"C1" = rnorm(n = n.obs, mean = 0, sd = 5),
"C2" = rbinom(n = n.obs, size = 1, p = 0.4),
"C3" = rbinom(n = n.obs, size = 1, p = 0.3),
"Pexposure" = prob_exp(C1, C2, C3),
"Exposure" = rbinom(n = n.obs, size = 1,prob = Pexposure),
"Outcome" = as.numeric(mean_out(C1,C2,C3, Exposure)))

head(df.sim)
## # A tibble: 6 × 7
##      ID    C1    C2    C3 Pexposure Exposure Outcome
##   <dbl> <dbl> <int> <int>  <I<dbl>>    <int>   <dbl>
## 1     1  6.31     0     0     0.245        0    55.7
## 2     2 -1.63     0     0     0.200        0    62.9
## 3     3  6.65     0     0     0.250        0    47.7
## 4     4  6.36     0     0     0.246        0    45.9
## 5     5  2.07     0     1     0.115        0    61.5
## 6     6 -7.70     0     0     0.237        1    65.9

In these simulated data, Exposure (A) is the exposure of interest, C1 is a continuous covariate, and C2 and C3 are binary covariates.

The true outcome model is specified as follows:

\(Y = b_{0} + b_{1}A + b_{2}C_{1} + b_{3}C_{1}^2 + b_{4}C_{2} + b_{5}C_{3} + b_{6}A C_{2} + \varepsilon\)

where \(\varepsilon\) is an error term, normally distributed with mean 0 and variance \(5^2\).

The true causal effect of exposure A among those with C2 = 0 is \(b_{1} = 5\). The true causal effect of exposure A among those with C2 = 1 is \(b_{1} + b_{6} = 7\).

The marginal effect of A is \(5 \times 0.6 + 7 \times 0.4 = 5.8\), because 40% of the total population has C2 = 1.
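A quick check of this arithmetic, using the true parameters defined above (p.C2 is just a name introduced here for the value of P(C2 = 1) used in the data generating mechanism):

p.C2 <- 0.4
b1 * (1 - p.C2) + (b1 + b6) * p.C2   # = 5.8, the true marginal effect of A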

# Correctly specified model
df.sim %>%
lm(Outcome ~ Exposure*C2 + C1 + I(C1^2) + C3, data = .) %>%
tidy(conf.int = TRUE)
## # A tibble: 7 × 7
##   term        estimate std.error statistic   p.value conf.low conf.high
##   <chr>          <dbl>     <dbl>     <dbl>     <dbl>    <dbl>     <dbl>
## 1 (Intercept)   60.0     0.0875     686.   0           59.8     60.2   
## 2 Exposure       5.02    0.166       30.2  4.61e-192    4.69     5.34  
## 3 C2             8.10    0.112       72.4  0            7.88     8.32  
## 4 C1            -0.305   0.0101     -30.1  2.49e-190   -0.324   -0.285 
## 5 I(C1^2)       -0.100   0.00148    -67.8  0           -0.103   -0.0972
## 6 C3             2.94    0.111       26.5  4.31e-150    2.72     3.16  
## 7 Exposure:C2    1.93    0.294        6.56 5.77e- 11    1.35     2.51

The estimate for Exposure is 5.02. Note that this is an estimate of the conditional effect for C2 = 0, and it is very close to the true value b1 = 5, as expected since the model is correctly specified. The conditional effect for C2 = 1 is estimated to be 5.02 + 1.93 = 6.95, which again is very close to the true value b1 + b6 = 7.
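As a small sketch, the same sum can also be extracted directly from a stored model object (here the correctly specified model is refitted and saved under the hypothetical name fit.correct, since the pipe above did not keep it):

# Refit and store the correctly specified outcome model
fit.correct <- lm(Outcome ~ Exposure*C2 + C1 + I(C1^2) + C3, data = df.sim)
# Conditional effect of Exposure when C2 = 1: main effect plus interaction term
unname(coef(fit.correct)["Exposure"] + coef(fit.correct)["Exposure:C2"])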

One can now standardize the conditional effect estimates from the correctly specified multivariable regression model to get an estimate of a marginal effect:

# Make copies of original data 

df.sim.a1 <- df.sim %>% mutate(Outcome = NA, Exposure = 1) #Assign Exposure = 1 to everyone 
df.sim.a0 <- df.sim %>% mutate(Outcome = NA, Exposure = 0) #Assign Exposure = 0 to everyone 
df.sim.combined <- bind_rows(df.sim.a1,df.sim.a0)
#mean(df.sim.combined$"Exposure")
# Fit an outcome model to the original data 
## Correctly specified model
gcomp.fit <- df.sim %>% lm(Outcome ~ Exposure*C2 + C1 + I(C1^2) + C3, data = .) 
# Predict outcome values using now the copied datasets 
df.sim.combined$pred <- predict(gcomp.fit, newdata = df.sim.combined) 
# ATE Estimate: difference between mean predicted values for rows with A=1 and mean predicted values for rows with A = 0 
df.sim.combined %>% group_by(Exposure) %>% summarise( mean.Y = mean(pred) ) %>% 
  pivot_wider( names_from = Exposure, names_glue = "mean.Y.{Exposure}", values_from = mean.Y ) %>% 
  mutate(ATE = mean.Y.1-mean.Y.0)
## # A tibble: 1 × 3
##   mean.Y.0 mean.Y.1   ATE
##      <dbl>    <dbl> <dbl>
## 1     61.6     67.4  5.78

The resulting estimate of a marginal effect is 5.78 — this is a consistent estimate of the true marginal effect of 5.8.

Confidence intervals for the standardized estimate can be obtained via bootstrapping:

standardization.boot <- function(data, indices){ 
  df <- data[indices,] 
  df.a1 <- df %>%
           mutate(Outcome = NA, Exposure = 1)
  
  df.a0 <- df %>% 
           mutate(Outcome = NA, Exposure = 0) 
  
  df.combined <- bind_rows(df.a1,df.a0)
  
  
  gcomp.fit <- df %>% 
    lm(Outcome ~ Exposure*C2 + C1 + I(C1^2) + C3, data = .) 
  df.combined$pred <- predict(gcomp.fit, newdata = df.combined) 
  output <- df.combined %>% 
    group_by(Exposure) %>% 
    summarise( 
      mean.Y = mean(pred)) %>% 
    pivot_wider( 
      names_from = Exposure, 
      names_glue = "mean.Y.{Exposure}", 
      values_from = mean.Y 
      ) %>% 
    mutate(
      ATE = mean.Y.1-mean.Y.0 
      ) 
  return(output$ATE) 
  } 


# bootstrap

standardization.results <- boot(data=df.sim, statistic=standardization.boot, R=100) 
# 100 bootstrapped samples
  

# generating confidence intervals
empirical.se <- sd(standardization.results$t) # get empirical standard error estimate 
estimate    <- standardization.results$t0 
ll <- estimate - qnorm(0.975)*empirical.se # normal approximation 
ul <- estimate + qnorm(0.975)*empirical.se 

data.frame(cbind(estimate, empirical.se, ll, ul))
##   estimate empirical.se      ll       ul
## 1 5.782279    0.1425789 5.50283 6.061729

Of note: if you are interested in more details about these topics, you can explore the R library lmw at https://cran.r-project.org/web/packages/lmw/index.html and also the R library arm at https://cran.r-project.org/web/packages/arm/index.html.

5 Estimating causal effects from observational studies using the propensity score approach

First of all, we will load the required libraries:

library(twang)
library(magrittr)
library(tidyverse)
library(gtsummary)
library(stddiff)
library(ggplot2)
library(data.table)
library(boot)
library(splines)
library(PSAgraphics)
library(Matching)
library(sandwich)
library(survey)
library(rms)

We will use a dataset coming from an observational study of 996 patients receiving an initial Percutaneous Coronary Intervention (PCI) at Ohio Heart Health, Christ Hospital, Cincinnati in 1997 and followed for at least 6 months by the staff of the Lindner Center.

The patients thought to be more severely diseased were assigned to treatment with abciximab (an expensive, high-molecular-weight IIb/IIIa cascade blocker); in fact, only 298 (29.9 percent) of patients received usual-care-alone with their initial PCI.

Our research question aims at estimating the treatment effect of abciximab+PCI (abcix) vs the standard care on the probability of being deceased at 6 months.

In this practical, we will apply the methods based on the propensity score.

Measured pre-treatment characteristics that could confound the treatment-outcome relationship are: acutemi, ejecfrac, ves1proc, stent, diabetic, female and height (see the data summary below).

5.1 Exploring the data

data("lindner",package="twang")
set.seed(123)
summary(lindner)
##     lifepres       cardbill          abcix            stent       
##  Min.   : 0.0   Min.   :  2216   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:11.6   1st Qu.: 10219   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :11.6   Median : 12458   Median :1.0000   Median :1.0000  
##  Mean   :11.3   Mean   : 15674   Mean   :0.7008   Mean   :0.6687  
##  3rd Qu.:11.6   3rd Qu.: 16660   3rd Qu.:1.0000   3rd Qu.:1.0000  
##  Max.   :11.6   Max.   :178534   Max.   :1.0000   Max.   :1.0000  
##      height          female          diabetic         acutemi      
##  Min.   :108.0   Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  1st Qu.:165.0   1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Median :173.0   Median :0.0000   Median :0.0000   Median :0.0000  
##  Mean   :171.4   Mean   :0.3474   Mean   :0.2239   Mean   :0.1436  
##  3rd Qu.:178.0   3rd Qu.:1.0000   3rd Qu.:0.0000   3rd Qu.:0.0000  
##  Max.   :196.0   Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##     ejecfrac        ves1proc     sixMonthSurvive
##  Min.   : 0.00   Min.   :0.000   Mode :logical  
##  1st Qu.:45.00   1st Qu.:1.000   FALSE:26       
##  Median :55.00   Median :1.000   TRUE :970      
##  Mean   :50.97   Mean   :1.386                  
##  3rd Qu.:56.00   3rd Qu.:2.000                  
##  Max.   :90.00   Max.   :5.000

How is the treatment variable distributed in the population?

# Exposure
lindner %>%
  dplyr::select(abcix) %>%
  tbl_summary()
Characteristic   N = 996¹
abcix            698 (70%)
¹ n (%)
ggplot(lindner)+
  geom_bar(aes(x=abcix,fill=as.factor(abcix)),stat="count")+
  scale_fill_discrete("Treatment",labels = c("PCI", "PCI+abciximab")) +
  theme_classic()

Is the outcome rare?

# Outcome 
lindner$sixMonthDeath <- 1-lindner$sixMonthSurvive

lindner %>%
  dplyr::select(sixMonthDeath) %>% 
  tbl_summary()
Characteristic   N = 996¹
sixMonthDeath    26 (2.6%)
¹ n (%)

What is the crude odds ratio for mortality?

# Crude OR
fit.crude <- glm(sixMonthDeath~abcix,family = binomial,data=lindner)
tbl_regression(fit.crude,exponentiate=T)
Characteristic   OR¹    95% CI¹      p-value
abcix            0.30   0.13, 0.66   0.003
¹ OR = Odds Ratio, CI = Confidence Interval

Let’s now summarize the confounding variables by treatment group and by outcome, to have a general idea about the observed associations:

# Possible confounders
lindner %>% 
  dplyr::select(acutemi,
                  ejecfrac,
                  ves1proc,
                  stent,
                  diabetic,
                  female,
                height) %>%
  tbl_summary()
Characteristic   N = 996¹
acutemi          143 (14%)
ejecfrac         55 (45, 56)
ves1proc
    0            4 (0.4%)
    1            680 (68%)
    2            252 (25%)
    3            45 (4.5%)
    4            14 (1.4%)
    5            1 (0.1%)
stent            666 (67%)
diabetic         223 (22%)
female           346 (35%)
height           173 (165, 178)
¹ n (%); Median (IQR)
# Descriptive statistics of patients'characteristics by treatment group
lindner %>% 
  dplyr::select(acutemi,
                ejecfrac,
                ves1proc,
                stent,
                diabetic,
                female,
                height,
                abcix) %>%
  tbl_summary(by=abcix) %>% 
  add_overall() %>% 
  add_p()
Characteristic   Overall, N = 996¹   0, N = 298¹       1, N = 698¹       p-value²
acutemi          143 (14%)           18 (6.0%)         125 (18%)         <0.001
ejecfrac         55 (45, 56)         55 (50, 60)       55 (45, 56)       0.004
ves1proc                                                                 <0.001
    0            4 (0.4%)            1 (0.3%)          3 (0.4%)
    1            680 (68%)           243 (82%)         437 (63%)
    2            252 (25%)           47 (16%)          205 (29%)
    3            45 (4.5%)           6 (2.0%)          39 (5.6%)
    4            14 (1.4%)           1 (0.3%)          13 (1.9%)
    5            1 (0.1%)            0 (0%)            1 (0.1%)
stent            666 (67%)           174 (58%)         492 (70%)         <0.001
diabetic         223 (22%)           80 (27%)          143 (20%)         0.027
female           346 (35%)           115 (39%)         231 (33%)         0.10
height           173 (165, 178)      173 (165, 178)    173 (165, 180)    0.8
¹ n (%); Median (IQR)
² Pearson’s Chi-squared test; Wilcoxon rank sum test; Fisher’s exact test
# Descriptive statistics of patients'characteristics by outcome
lindner %>% 
  dplyr::select(acutemi,
                ejecfrac,
                ves1proc,
                stent,
                diabetic,
                female,
                height,
                sixMonthDeath) %>%
  tbl_summary(by=sixMonthDeath) %>% 
  add_overall() %>% 
  add_p()
Characteristic   Overall, N = 996¹   0, N = 970¹       1, N = 26¹        p-value²
acutemi          143 (14%)           137 (14%)         6 (23%)           0.2
ejecfrac         55 (45, 56)         55 (45, 56)       50 (33, 54)       0.002
ves1proc                                                                 0.10
    0            4 (0.4%)            4 (0.4%)          0 (0%)
    1            680 (68%)           661 (68%)         19 (73%)
    2            252 (25%)           249 (26%)         3 (12%)
    3            45 (4.5%)           41 (4.2%)         4 (15%)
    4            14 (1.4%)           14 (1.4%)         0 (0%)
    5            1 (0.1%)            1 (0.1%)          0 (0%)
stent            666 (67%)           649 (67%)         17 (65%)          0.9
diabetic         223 (22%)           212 (22%)         11 (42%)          0.014
female           346 (35%)           335 (35%)         11 (42%)          0.4
height           173 (165, 178)      173 (165, 178)    168 (163, 177)    0.10
¹ n (%); Median (IQR)
² Fisher’s exact test; Wilcoxon rank sum test; Pearson’s Chi-squared test

We can now calculate the Standardized Difference, which can be used as a measure of balance between the treatment groups. It is a measure of the difference between groups that does not depend on statistical testing (remember that p-values always depend on sample size!). It is very similar to the definition of effect size that we discussed in Block 2.

It can be defined for a continuous covariate as:

\(SD_{c}=\frac{\overline{x_{1}}-\overline{x_{0}}} {\sqrt{\frac{s_1^2+s_0^2}{2}}}\)

and for a dichotomous covariate as:

\(SD_{d}=\frac{\overline{p_{1}}-\overline{p_{0}}} {\sqrt{\frac{\overline{p_{1}}(1-\overline{p_{1}})+\overline{p_{0}}(1-\overline{p_{0}})}{2}}}\)

The rough interpretation is that imbalance is present if the absolute standardized difference is greater than 0.1 (or, with a more liberal threshold, 0.2).
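As a small sketch, the standardized difference can also be computed by hand from the formulas above for one continuous and one binary covariate; the stddiff functions used below do the same calculation (up to sign conventions and rounding):

# Hand computation of the standardized difference (sketch; compare with stddiff below)
with(lindner, {
  # continuous covariate: ejecfrac
  mean1 <- mean(ejecfrac[abcix == 1]); mean0 <- mean(ejecfrac[abcix == 0])
  sd1   <- sd(ejecfrac[abcix == 1]);   sd0   <- sd(ejecfrac[abcix == 0])
  sd.cont <- (mean1 - mean0) / sqrt((sd1^2 + sd0^2) / 2)
  # binary covariate: stent
  p1 <- mean(stent[abcix == 1]); p0 <- mean(stent[abcix == 0])
  sd.bin <- (p1 - p0) / sqrt((p1 * (1 - p1) + p0 * (1 - p0)) / 2)
  c(ejecfrac = sd.cont, stent = sd.bin)
})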

s1 <- stddiff.numeric(vcol="height",gcol="abcix",data=lindner)
s2 <- stddiff.numeric(vcol="ejecfrac",gcol="abcix",data=lindner)
s3 <- stddiff.numeric(vcol="ves1proc",gcol="abcix",data=lindner)


s4 <- stddiff.binary(vcol="stent",gcol="abcix",data=lindner)
s5 <- stddiff.binary(vcol="female",gcol="abcix",data=lindner)
s6 <- stddiff.binary(vcol="diabetic",gcol="abcix",data=lindner)
s7 <- stddiff.binary(vcol="acutemi",gcol="abcix",data=lindner)


cont.var <- as.data.frame(rbind(s1,s2,s3))
rownames(cont.var) <- c("height", "ejecfrac", "ves1proc")
cont.var
##           mean.c   sd.c  mean.t   sd.t missing.c missing.t stddiff stddiff.l
## height   171.446 10.589 171.443 10.695         0         0   0.000    -0.135
## ejecfrac  52.289 10.297  50.403 10.419         0         0   0.182     0.046
## ves1proc   1.205  0.480   1.463  0.706         0         0   0.427     0.290
##          stddiff.u
## height       0.136
## ejecfrac     0.318
## ves1proc     0.564
bin.var <- as.data.frame(rbind(s4,s5,s6, s7))
rownames(bin.var) <- c("stent", "female", "diabetic", "acutemi")
bin.var
##            p.c   p.t missing.c missing.t stddiff stddiff.l stddiff.u
## stent    0.584 0.705         0         0   0.255     0.119     0.391
## female   0.386 0.331         0         0   0.115    -0.021     0.251
## diabetic 0.268 0.205         0         0   0.150     0.014     0.286
## acutemi  0.060 0.179         0         0   0.372     0.235     0.508

5.2 Estimating the propensity score

Now fit the propensity score model: a logistic regression model with abciximab+PCI (vs. PCI alone) as the dependent variable and the confounders listed above as covariates. We exclude the variable height from the list, since there was no relevant difference between the groups.

# Fit a propensity score model
fit.ps<- glm(abcix~
                      acutemi+
                      ejecfrac+
                      ves1proc+
                      stent+
                      diabetic+
                      female,
                    data=lindner,family = binomial)

summary(fit.ps)
## 
## Call:
## glm(formula = abcix ~ acutemi + ejecfrac + ves1proc + stent + 
##     diabetic + female, family = binomial, data = lindner)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  0.272481   0.440711   0.618 0.536394    
## acutemi      1.178158   0.269479   4.372 1.23e-05 ***
## ejecfrac    -0.015076   0.007391  -2.040 0.041374 *  
## ves1proc     0.757365   0.138279   5.477 4.32e-08 ***
## stent        0.571165   0.150208   3.803 0.000143 ***
## diabetic    -0.409191   0.170371  -2.402 0.016316 *  
## female      -0.133346   0.151377  -0.881 0.378379    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1215.5  on 995  degrees of freedom
## Residual deviance: 1127.0  on 989  degrees of freedom
## AIC: 1141
## 
## Number of Fisher Scoring iterations: 4
# Save the estimated propensity score
lindner$ps <- fitted(fit.ps)
# Plot estimated ps 
ggplot(lindner) +
  geom_boxplot(aes(y = ps,group = as.factor(abcix),col = as.factor(abcix))) +
  scale_y_continuous("Estimated PS") +
  scale_color_discrete("Treatment",labels = c("PCI", "PCI+abciximab")) +
  theme_classic()

ggplot(lindner) +
  geom_histogram(aes(x = ps,group = as.factor(abcix),fill = as.factor(abcix))) +
  scale_y_continuous("Estimated PS") +
  facet_grid(cols=vars(abcix))+
  scale_fill_discrete("Treatment",labels = c("PCI", "PCI+abciximab")) +
  theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Assess whether there are non-overlapping scores (positivity violation) in the two exposure groups:

lindner %>% 
  dplyr::select(ps,abcix) %>% 
  tbl_summary(type = list(ps~"continuous2"),by=abcix,
                                             statistic = all_continuous2() ~c(
                                             "{median} ({p25}, {p75})", 
                                             "{min}, {max}"))
Characteristic      0, N = 298          1, N = 698
ps
    Median (IQR)    0.65 (0.55, 0.70)   0.71 (0.65, 0.82)
    Range           0.25, 0.96          0.28, 0.98

Investigate overlap:

lindner %<>% mutate(overlap=ifelse(ps>=min(ps[abcix==1]) &
                                     ps<=max(ps[abcix==0]),1,0)) 

# non-overlap: treated patients with a ps higher than that of any control, and
# controls with a ps lower than that of any treated patient

with(lindner,table(overlap,abcix))
##        abcix
## overlap   0   1
##       0   1   8
##       1 297 690
with(lindner,prop.table(table(overlap,abcix)))
##        abcix
## overlap           0           1
##       0 0.001004016 0.008032129
##       1 0.298192771 0.692771084

In the following steps, we remove the subjects that do not overlap. This reduces the original sample size, but we need to respect the positivity assumption in order to estimate a reasonable causal effect: it makes no sense to include subjects that have a near-zero probability of receiving the treatment, or that have no possible “match” in the subsequent analyses.

5.3 First option: “adjusting” for the propensity score

We can use the estimated PS as a covariate in a logistic regression model for the outcome:

# Model 1: Linear relationship between ps and outcome 
fit.out<- glm(sixMonthDeath~abcix+ps,
              data=lindner,
              family = binomial,
              subset=overlap==1)

# Model summary
summary(fit.out)
## 
## Call:
## glm(formula = sixMonthDeath ~ abcix + ps, family = binomial, 
##     data = lindner, subset = overlap == 1)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   -4.211      1.165  -3.616 0.000300 ***
## abcix         -1.445      0.439  -3.293 0.000992 ***
## ps             1.947      1.698   1.147 0.251522    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 233.15  on 986  degrees of freedom
## Residual deviance: 222.01  on 984  degrees of freedom
## AIC: 228.01
## 
## Number of Fisher Scoring iterations: 7

The second step is to save the predicted probabilities for the treated and the untreated and estimate the causal effects of interest in the population:

# fitted values (probabilities)
lindner$predY0<-fit.out$family$linkinv(coef(fit.out)[1]+coef(fit.out)[3]*lindner$ps) # PCI subjects 
lindner$predY1<-fit.out$family$linkinv(coef(fit.out)[1]+coef(fit.out)[2]+coef(fit.out)[3]*lindner$ps) # PCI+abciximab subjects 

# ATE effect 
Y1<-mean(lindner$predY1)
Y0<-mean(lindner$predY0)

# ATT effect 
Y1_1<-mean(lindner$predY1[lindner$abcix==1])
Y0_1<-mean(lindner$predY0[lindner$abcix==1])

# ATE effect 
Y1-Y0
## [1] -0.04246731
# ATT effect
Y1_1-Y0_1
## [1] -0.04436082
# Estimate odds ratios related to the "ATE" and the "ATT"
(Y1/(1-Y1))/(Y0/(1-Y0))
## [1] 0.2363187
(Y1_1/(1-Y1_1))/(Y0_1/(1-Y0_1))
## [1] 0.236305

The ATE is quite similar to the ATT, and both indicate a protective effect of PCI+abciximab vs PCI alone.

To obtain the corresponding confidence intervals we can use a bootstrap approach.

We do not outline this procedure here; see the supplementary code at the end of this practical.

This model relies on two additional assumptions: no interaction between the propensity score and the treatment, and a linear relationship (on the logit scale) between the propensity score and the outcome.

Do these assumptions appear reasonable here?

We can try to fit different models, and then compare the AIC:

# Model 2: Non-linear relationship between ps and outcome 
fit.out2<- glm(sixMonthDeath~abcix+ps+I(ps^2),
              data=lindner,
              family = binomial,
              subset=overlap==1)

summary(fit.out2)
## 
## Call:
## glm(formula = sixMonthDeath ~ abcix + ps + I(ps^2), family = binomial, 
##     data = lindner, subset = overlap == 1)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)   2.3787     3.8937   0.611 0.541255    
## abcix        -1.5088     0.4493  -3.358 0.000784 ***
## ps          -17.7559    11.5272  -1.540 0.123475    
## I(ps^2)      14.1968     8.3131   1.708 0.087681 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 233.15  on 986  degrees of freedom
## Residual deviance: 219.46  on 983  degrees of freedom
## AIC: 227.46
## 
## Number of Fisher Scoring iterations: 7
# Model 3: Non-linear relationship between ps and outcome and interaction between ps and treatment
fit.out3<- glm(sixMonthDeath~abcix*ps+I(ps^2),
               data=lindner,
               family = binomial,
               subset=overlap==1)

summary(fit.out3)
## 
## Call:
## glm(formula = sixMonthDeath ~ abcix * ps + I(ps^2), family = binomial, 
##     data = lindner, subset = overlap == 1)
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)  
## (Intercept)   2.361196   3.786408   0.624   0.5329  
## abcix         0.009567   2.075020   0.005   0.9963  
## ps          -18.641792  11.183815  -1.667   0.0955 .
## I(ps^2)      15.513326   8.167099   1.899   0.0575 .
## abcix:ps     -2.130967   2.862505  -0.744   0.4566  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 233.15  on 986  degrees of freedom
## Residual deviance: 218.93  on 982  degrees of freedom
## AIC: 228.93
## 
## Number of Fisher Scoring iterations: 7

It seems that the relationship between the propensity score and the outcome could be partially non-linear, but neither the quadratic term nor the interaction reaches strong statistical significance. So probably the most parsimonious model to keep is Model 1.
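For a compact side-by-side comparison of the three models (the same AIC values appear in the summaries above):

# Compare the AIC of the three outcome models fitted above
AIC(fit.out, fit.out2, fit.out3)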

5.4 Second option: Stratification

Create the propensity score strata: this can be an iterative process, since we should verify that we have enough subjects and events in each stratum.

lindner %<>% mutate(strata=cut(ps,quantile(ps,c(0,0.25,0.5,0.75,1)),include.lowest=T,labels=c(1:4))) 
#Check they have been created correctly
summary(lindner$strata)
##   1   2   3   4 
## 252 246 257 241
tapply(lindner$ps,lindner$strata,summary)
## $`1`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.2499  0.5168  0.5463  0.5321  0.5713  0.6051 
## 
## $`2`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6078  0.6505  0.6674  0.6605  0.6807  0.6846 
## 
## $`3`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.6877  0.7036  0.7307  0.7406  0.7728  0.8106 
## 
## $`4`
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.8106  0.8305  0.8706  0.8759  0.9102  0.9765
#Look at numbers of events and patients in each strata/exposure group
table(lindner$sixMonthDeath,lindner$strata,lindner$abcix)
## , ,  = 0
## 
##    
##       1   2   3   4
##   0 105  89  64  25
##   1   7   1   4   3
## 
## , ,  = 1
## 
##    
##       1   2   3   4
##   0 138 156 185 208
##   1   2   0   4   5

These strata seem quite sparse in terms of number of events. Another possibility is:

#Create propensity score strata
lindner %<>% mutate(strata=cut(ps,quantile(ps,c(0,0.33,0.66,1)),include.lowest=T,labels=c(1:3))) 

#Check they have been created correctly
summary(lindner$strata)
##   1   2   3 
## 356 302 338
#Look at numbers of events and patients in each strata/exposure group
table(lindner$sixMonthDeath,lindner$strata,lindner$abcix)
## , ,  = 0
## 
##    
##       1   2   3
##   0 146  96  41
##   1   7   4   4
## 
## , ,  = 1
## 
##    
##       1   2   3
##   0 201 201 285
##   1   2   1   8

Remember: we should also check the balance of the confounders within each stratum! See the supplementary material for that. For now, let’s just estimate the OR in each stratum:

beta.treat<-numeric(3)
nstrata<-table(lindner$strata)
treated.strata<-table(lindner$strata,lindner$abcix)[,2]

for (i in 1:3){
   ms<-glm(sixMonthDeath~abcix,data=lindner,subset = strata==i,family="binomial")
   beta.treat[i]<-coef(ms)[2]
   print(summary(ms))
}
## 
## Call:
## glm(formula = sixMonthDeath ~ abcix, family = "binomial", data = lindner, 
##     subset = strata == i)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.0377     0.3869  -7.851 4.13e-15 ***
## abcix        -1.5725     0.8091  -1.943    0.052 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 83.969  on 355  degrees of freedom
## Residual deviance: 79.319  on 354  degrees of freedom
## AIC: 83.319
## 
## Number of Fisher Scoring iterations: 7
## 
## 
## Call:
## glm(formula = sixMonthDeath ~ abcix, family = "binomial", data = lindner, 
##     subset = strata == i)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.1781     0.5103  -6.228 4.73e-10 ***
## abcix        -2.1253     1.1249  -1.889   0.0589 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 50.927  on 301  degrees of freedom
## Residual deviance: 46.200  on 300  degrees of freedom
## AIC: 50.2
## 
## Number of Fisher Scoring iterations: 8
## 
## 
## Call:
## glm(formula = sixMonthDeath ~ abcix, family = "binomial", data = lindner, 
##     subset = strata == i)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.3273     0.5238  -4.443 8.88e-06 ***
## abcix        -1.2458     0.6347  -1.963   0.0497 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 103.68  on 337  degrees of freedom
## Residual deviance: 100.39  on 336  degrees of freedom
## AIC: 104.39
## 
## Number of Fisher Scoring iterations: 6

And, finally, let’s estimate the ORs related to the ATE and the ATT by exponentiating a weighted average of the stratum-specific log-ORs (weighted by stratum size for the ATE, and by the number of treated subjects in each stratum for the ATT):

exp(sum(beta.treat*nstrata)/nrow(lindner))
## [1] 0.1960846
exp(sum(beta.treat*treated.strata)/sum(treated.strata))
## [1] 0.2028472

Also here, we should use a bootstrap approach to estimate the corresponding confidence intervals.

5.5 Third option: Matching

Here we create a reduced dataset retaining only the patients within the region of overlap:

lindner.overlap <- lindner %>% filter(overlap==1)

Now we proceed with the matching algorithm: there are plenty of different matching algorithms available in R; here we use one from the Matching package.

library(Matching)
match <-  Match(Y=lindner.overlap$sixMonthDeath, 
                Tr=lindner.overlap$abcix,
                X=lindner.overlap$ps, 
                caliper=0.2,  # matches farther than 0.2 standard deviations of the ps are dropped
                M=1,          # 1:1 matching (one control matched to each treated subject)
                ties=FALSE,
                replace=TRUE  # matching with replacement
)
# Number of pairs
nn   <- length(match$index.treated)

# Create matched dataset
lindnerMatched <- cbind(rbind(lindner.overlap[match$index.treated,],
                              lindner.overlap[match$index.control,]),
                        pair=c(1:nn,1:nn))

table(lindner.overlap$abcix)
## 
##   0   1 
## 297 690
#Check number of treated patients
table(lindnerMatched$abcix)
## 
##   0   1 
## 689 689
#Look at people being used multiple times in the matched sample
summary(as.factor(table(match$index.treated)))
##   1 
## 689
summary(as.factor(table(match$index.control)))
##  1  2  3  4  5  6  7  8  9 10 12 13 15 17 19 
## 69 45 30 20 11  9  9  5  2  2  1  1  2  1  2
#Look at the propensity score distribution in the matched dataset
ggplot(lindnerMatched) +
  geom_boxplot(aes(y = ps,group = as.factor(abcix),col = as.factor(abcix))) +
  scale_y_continuous("Estimated PS") +
  scale_color_discrete("Treatment",labels = c("PCI", "PCI+abciximab")) +
  theme_classic()

ggplot(lindnerMatched) +
  geom_histogram(aes(x = ps,group = as.factor(abcix),fill = as.factor(abcix))) +
  scale_y_continuous("Estimated PS") +
  facet_grid(cols=vars(abcix))+
  scale_fill_discrete("Treatment",labels = c("PCI", "PCI+abciximab")) +
  theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

Let’s check now the balance:

# Balance Diagnostics before and after matching
bal <- MatchBalance(abcix~
                      stent+
                      female+
                      diabetic+
                      acutemi+
                      ejecfrac+
                      ves1proc,
                    data=lindner.overlap,
                    match.out = match)
## 
## ***** (V1) stent *****
##                        Before Matching        After Matching
## mean treatment........    0.70435            0.70537 
## mean control..........    0.58586            0.65602 
## std mean diff.........     25.947             10.817 
## 
## mean raw eQQ diff.....    0.11785           0.049347 
## med  raw eQQ diff.....          0                  0 
## max  raw eQQ diff.....          1                  1 
## 
## mean eCDF diff........   0.059245           0.024673 
## med  eCDF diff........   0.059245           0.024673 
## max  eCDF diff........    0.11849           0.049347 
## 
## var ratio (Tr/Co).....    0.85663            0.92097 
## T-test p-value........  0.0004398          0.0057027 
## 
## 
## ***** (V2) female *****
##                        Before Matching        After Matching
## mean treatment........    0.33333            0.33382 
## mean control..........    0.38384            0.30479 
## std mean diff.........    -10.706              6.151 
## 
## mean raw eQQ diff.....   0.050505           0.029028 
## med  raw eQQ diff.....          0                  0 
## max  raw eQQ diff.....          1                  1 
## 
## mean eCDF diff........   0.025253           0.014514 
## med  eCDF diff........   0.025253           0.014514 
## max  eCDF diff........   0.050505           0.029028 
## 
## var ratio (Tr/Co).....     0.9378             1.0495 
## T-test p-value........    0.13211            0.09767 
## 
## 
## ***** (V3) diabetic *****
##                        Before Matching        After Matching
## mean treatment........    0.20725             0.2061 
## mean control..........    0.26599            0.21771 
## std mean diff.........    -14.483            -2.8684 
## 
## mean raw eQQ diff.....   0.057239           0.011611 
## med  raw eQQ diff.....          0                  0 
## max  raw eQQ diff.....          1                  1 
## 
## mean eCDF diff........   0.029373          0.0058055 
## med  eCDF diff........   0.029373          0.0058055 
## max  eCDF diff........   0.058747           0.011611 
## 
## var ratio (Tr/Co).....    0.83988            0.96072 
## T-test p-value........   0.050489            0.47958 
## 
## 
## ***** (V4) acutemi *****
##                        Before Matching        After Matching
## mean treatment........    0.17101            0.17126 
## mean control..........   0.060606            0.16255 
## std mean diff.........     29.302             2.3098 
## 
## mean raw eQQ diff.....    0.11111          0.0087083 
## med  raw eQQ diff.....          0                  0 
## max  raw eQQ diff.....          1                  1 
## 
## mean eCDF diff........   0.055204          0.0043541 
## med  eCDF diff........   0.055204          0.0043541 
## max  eCDF diff........    0.11041          0.0087083 
## 
## var ratio (Tr/Co).....     2.4853             1.0426 
## T-test p-value........ 4.1797e-08             0.5361 
## 
## 
## ***** (V5) ejecfrac *****
##                        Before Matching        After Matching
## mean treatment........     50.559             50.553 
## mean control..........     52.279             51.209 
## std mean diff.........    -16.899            -6.4414 
## 
## mean raw eQQ diff.....     1.8687            0.80406 
## med  raw eQQ diff.....          0                  0 
## max  raw eQQ diff.....         20                 15 
## 
## mean eCDF diff........   0.033253           0.013763 
## med  eCDF diff........  0.0087396          0.0043541 
## max  eCDF diff........    0.10771           0.046444 
## 
## var ratio (Tr/Co).....    0.97403             1.0808 
## T-test p-value........   0.016162            0.13002 
## KS Bootstrap p-value..      0.002              0.216 
## KS Naive p-value......   0.016165            0.44722 
## KS Statistic..........    0.10771           0.046444 
## 
## 
## ***** (V6) ves1proc *****
##                        Before Matching        After Matching
## mean treatment........     1.4435             1.4456 
## mean control..........     1.2088             1.5109 
## std mean diff.........     34.534            -9.6339 
## 
## mean raw eQQ diff.....    0.24242           0.065312 
## med  raw eQQ diff.....          0                  0 
## max  raw eQQ diff.....          1                  1 
## 
## mean eCDF diff........   0.048684           0.013062 
## med  eCDF diff........   0.014024          0.0043541 
## max  eCDF diff........     0.1805           0.049347 
## 
## var ratio (Tr/Co).....     2.0392            0.93506 
## T-test p-value........ 9.0068e-10          0.0050611 
## KS Bootstrap p-value.. < 2.22e-16              0.062 
## KS Naive p-value...... 2.6627e-06            0.37114 
## KS Statistic..........     0.1805           0.049347 
## 
## 
## Before Matching Minimum p.value: < 2.22e-16 
## Variable Name(s): ves1proc  Number(s): 6 
## 
## After Matching Minimum p.value: 0.0050611 
## Variable Name(s): ves1proc  Number(s): 6
lindnerMatched %>% 
  dplyr::select(stent,female,diabetic,acutemi,ejecfrac,ves1proc,abcix) %>%
  tbl_summary(by=abcix) %>% 
  add_overall() %>% 
  add_p()
Characteristic   Overall, N = 1,378¹   0, N = 689¹     1, N = 689¹     p-value²
stent            938 (68%)             452 (66%)       486 (71%)       0.049
female           440 (32%)             210 (30%)       230 (33%)       0.2
diabetic         292 (21%)             150 (22%)       142 (21%)       0.6
acutemi          230 (17%)             112 (16%)       118 (17%)       0.7
ejecfrac         55 (45, 56)           55 (45, 60)     55 (45, 56)     0.3
ves1proc                                                               0.3
    0            2 (0.1%)              0 (0%)          2 (0.3%)
    1            842 (61%)             405 (59%)       437 (63%)
    2            434 (31%)             231 (34%)       203 (29%)
    3            73 (5.3%)             38 (5.5%)       35 (5.1%)
    4            27 (2.0%)             15 (2.2%)       12 (1.7%)
¹ n (%); Median (IQR)
² Pearson’s Chi-squared test; Wilcoxon rank sum test; Fisher’s exact test

Note that the variable stent has not been well balanced by the matching procedure.

For this reason, we use this covariate in the regression model for the outcome.

Now we estimate the causal effect:

fit.out4<-glm(sixMonthDeath~abcix+stent, data=lindnerMatched,family=binomial)
summary(fit.out4)
## 
## Call:
## glm(formula = sixMonthDeath ~ abcix + stent, family = binomial, 
##     data = lindnerMatched)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -3.4089     0.3429  -9.941  < 2e-16 ***
## abcix        -1.5529     0.3562  -4.359  1.3e-05 ***
## stent         0.9438     0.3727   2.532   0.0113 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 449.29  on 1377  degrees of freedom
## Residual deviance: 418.68  on 1375  degrees of freedom
## AIC: 424.68
## 
## Number of Fisher Scoring iterations: 7

We can see that stent is statistically significant in the model, so it was a good idea to control for it, since it was not well balanced by the matching procedure. As we have already discussed, covariates that are not confounders of the treatment-outcome relationship can also sometimes be included in the final model in order to obtain more precise estimates of the effect.

5.6 Fourth option: IPTW

Definition of the weights:
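For reference, denoting the treatment indicator by \(Z\) and the estimated propensity score by \(e(X)\), the weights constructed below are \(w_{ATE}=\frac{Z}{e(X)}+\frac{1-Z}{1-e(X)}\) for the ATE and \(w_{ATT}=Z+\frac{(1-Z)\,e(X)}{1-e(X)}\) for the ATT.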

# Definition of weights for ATE
lindner.overlap %<>%mutate(w_ATE=case_when(abcix==1~1/ps,
                                   abcix==0~1/(1-ps))) 

#Definition of weights for ATT
lindner.overlap %<>%mutate(w_ATT=case_when(abcix==1~1,
                                   abcix==0~ps/(1-ps))) 

Check the extreme weights: sometimes it is useful to use truncated or stabilized weights in order to reduce the variance of the final estimates. We do not cover this aspect in detail here, but a minimal sketch of truncation is shown below.
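As a minimal sketch of what truncation could look like (not used in the analyses that follow), the ATE weights could be capped at their 1st and 99th percentiles:

# Hypothetical example: truncate the ATE weights at the 1st and 99th percentiles
# to limit the influence of extreme weights (not used in the subsequent analyses)
cut.w <- quantile(lindner.overlap$w_ATE, c(0.01, 0.99))
lindner.overlap$w_ATE_trunc <- pmin(pmax(lindner.overlap$w_ATE, cut.w[1]), cut.w[2])
summary(lindner.overlap$w_ATE_trunc)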

#Check extremes
quantile(lindner.overlap$w_ATE[lindner.overlap$abcix==1],c(0,0.01,0.05,0.95,0.99,1))
##       0%       1%       5%      95%      99%     100% 
## 1.041578 1.052785 1.076734 1.882227 2.339352 3.627077
quantile(lindner.overlap$w_ATE[lindner.overlap$abcix==0],c(0,0.01,0.05,0.95,0.99,1))
##        0%        1%        5%       95%       99%      100% 
##  1.566665  1.658869  1.791224  6.137759 14.832595 25.687628
quantile(lindner.overlap$w_ATT[lindner.overlap$abcix==1],c(0,0.01,0.05,0.95,0.99,1))
##   0%   1%   5%  95%  99% 100% 
##    1    1    1    1    1    1
quantile(lindner.overlap$w_ATT[lindner.overlap$abcix==0],c(0,0.01,0.05,0.95,0.99,1))
##         0%         1%         5%        95%        99%       100% 
##  0.5666647  0.6588687  0.7912238  5.1377591 13.8325951 24.6876276
# Balance diagnostics
#ATE
bal_IPTW_ATE <- dx.wts(x=lindner.overlap$w_ATE,
                       data=lindner.overlap,
                       vars=colnames(lindner.overlap)[4:10],
                       treat.var = colnames(lindner.overlap)[3],
                       estimand = "ATE")

#ATT
bal_IPTW_ATT <- dx.wts(x=lindner.overlap$w_ATT,
                       data=lindner.overlap,
                       x.as.weights = T,
                       vars=colnames(lindner.overlap)[4:10],
                       treat.var = colnames(lindner.overlap)[3],
                       estimand = "ATT")


bal.table(bal_IPTW_ATE) 
## $unw
##            tx.mn  tx.sd   ct.mn  ct.sd std.eff.sz   stat     p    ks ks.pval
## stent      0.704  0.457   0.586  0.493      0.252  3.541 0.000 0.118   0.005
## height   171.419 10.728 171.458 10.605     -0.004 -0.053 0.958 0.024   1.000
## female     0.333  0.472   0.384  0.487     -0.106 -1.509 0.132 0.051   0.641
## diabetic   0.207  0.406   0.266  0.443     -0.141 -1.962 0.050 0.059   0.450
## acutemi    0.171  0.377   0.061  0.239      0.320  5.537 0.000 0.110   0.012
## ejecfrac  50.559 10.179  52.279 10.313     -0.168 -2.415 0.016 0.108   0.015
## ves1proc   1.443  0.680   1.209  0.476      0.370  6.207 0.000 0.181   0.000
## 
## [[2]]
##            tx.mn  tx.sd   ct.mn  ct.sd std.eff.sz   stat     p    ks ks.pval
## stent      0.669  0.471   0.668  0.472      0.000  0.005 0.996 0.000   1.000
## height   171.123 10.734 172.398 10.965     -0.119 -1.351 0.177 0.064   0.532
## female     0.347  0.476   0.330  0.471      0.035  0.468 0.640 0.017   1.000
## diabetic   0.225  0.418   0.242  0.429     -0.040 -0.461 0.645 0.017   1.000
## acutemi    0.137  0.344   0.149  0.357     -0.035 -0.307 0.759 0.012   1.000
## ejecfrac  51.077  9.910  51.093 10.182     -0.002 -0.020 0.984 0.044   0.909
## ves1proc   1.368  0.641   1.427  0.690     -0.093 -0.783 0.434 0.032   0.996
bal.table(bal_IPTW_ATT) 
## $unw
##            tx.mn  tx.sd   ct.mn  ct.sd std.eff.sz   stat     p    ks ks.pval
## stent      0.704  0.457   0.586  0.493      0.259  3.541 0.000 0.118   0.005
## height   171.419 10.728 171.458 10.605     -0.004 -0.053 0.958 0.024   1.000
## female     0.333  0.472   0.384  0.487     -0.107 -1.509 0.132 0.051   0.641
## diabetic   0.207  0.406   0.266  0.443     -0.145 -1.962 0.050 0.059   0.450
## acutemi    0.171  0.377   0.061  0.239      0.293  5.537 0.000 0.110   0.012
## ejecfrac  50.559 10.179  52.279 10.313     -0.169 -2.415 0.016 0.108   0.015
## ves1proc   1.443  0.680   1.209  0.476      0.345  6.207 0.000 0.181   0.000
## 
## [[2]]
##            tx.mn  tx.sd   ct.mn  ct.sd std.eff.sz   stat     p    ks ks.pval
## stent      0.704  0.457   0.703  0.458      0.003  0.034 0.973 0.001   1.000
## height   171.419 10.728 172.793 11.090     -0.128 -1.264 0.206 0.068   0.597
## female     0.333  0.472   0.307  0.462      0.055  0.683 0.495 0.026   1.000
## diabetic   0.207  0.406   0.232  0.423     -0.061 -0.598 0.550 0.025   1.000
## acutemi    0.171  0.377   0.186  0.390     -0.041 -0.313 0.754 0.015   1.000
## ejecfrac  50.559 10.179  50.594 10.085     -0.003 -0.041 0.968 0.035   0.997
## ves1proc   1.443  0.680   1.519  0.743     -0.111 -0.827 0.409 0.043   0.966

Finally, let’s estimate the ATE causal effect on the weighted dataset:

# Estimate ATE 
design.lindnerATE <- svydesign(ids=~1,
                               weights = ~w_ATE,
                               data=lindner.overlap)

fit_itpw_ATE <- svyglm(sixMonthDeath~abcix,
                       family=binomial,
                       design=design.lindnerATE)

tbl_regression(fit_itpw_ATE,exponentiate = T)
Characteristic   OR¹    95% CI¹      p-value
abcix            0.17   0.06, 0.49   <0.001
¹ OR = Odds Ratio, CI = Confidence Interval

And the ATT:

design.lindnerATT <- svydesign(ids=~1,
                               weights = ~w_ATT,
                               data=lindner.overlap)

fit_iptw_ATT <- svyglm(sixMonthDeath~abcix,
                       family=binomial,
                       design=design.lindnerATT)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
tbl_regression(fit_iptw_ATT,exponentiate = T)
Characteristic   OR¹    95% CI¹      p-value
abcix            0.15   0.05, 0.46   <0.001
¹ OR = Odds Ratio, CI = Confidence Interval

Also here it is possible to estimate the 95% CIs using bootstrap methods (which are in general more robust).

5.7 Bootstrap confidence intervals for METHOD 1: covariate adjustment

results <- data.frame(ATE=rep(NA,4),ATT=rep(NA,4))
f_PSadj <- function(data, indices,outcome.formula) {
  
  d <- data[indices,] # allows boot to select sample
  # estimation of ps
  m1<-glm(abcix~
            acutemi+
            ejecfrac+
            ves1proc+
            stent+
            diabetic+
            female,
          data=d,
          family = binomial)
  
  d$ps<-fitted.values(m1)
  # overlap
  d %<>% mutate(overlap=ifelse(ps>=min(ps[abcix==1]) &
                                 ps<=max(ps[abcix==0]),1,0)) 


  
  # outcome model
  m2<-glm(outcome.formula,data=d,family="binomial",subset = overlap==1)
  if(!m2$converged) print("Model did not converge")
  d$predY0<-m2$family$linkinv(coef(m2)[1]+coef(m2)[3]*d$ps)
  d$predY1<-m2$family$linkinv(coef(m2)[1]+coef(m2)[2]+coef(m2)[3]*d$ps)
  
  Y1<-mean(d$predY1)
  Y0<-mean(d$predY0)
  Y1_1<-mean(d$predY1[d$abcix==1])
  Y0_1<-mean(d$predY0[d$abcix==1])
  
  ATE_PSadj<-(Y1/(1-Y1))/(Y0/(1-Y0))
  ATT_PSadj<-(Y1_1/(1-Y1_1))/(Y0_1/(1-Y0_1))
  return(c(ATE_PSadj,ATT_PSadj))
}

res_boot <- function(obj.boot,type="percent",digits=3){
  suppressWarnings({
    orig <- round(obj.boot$t0,digits)
    ciATE <- paste(round(boot.ci(obj.boot,index=1)[[type]][4:5],digits),collapse = "-")
    ciATT <- paste(round(boot.ci(obj.boot,index=2)[[type]][4:5],digits),collapse = "-")
  })
  res <- paste(orig,rbind(ciATE,ciATT),sep="(")
  res.u <- paste(res,rep(")",2),sep = "")
  return(res.u)
  
}
boot.out <- boot(data=lindner, statistic=f_PSadj,R=1000,outcome.formula=fit.out$formula)

print(boot.out)
## 
## ORDINARY NONPARAMETRIC BOOTSTRAP
## 
## 
## Call:
## boot(data = lindner, statistic = f_PSadj, R = 1000, outcome.formula = fit.out$formula)
## 
## 
## Bootstrap Statistics :
##      original     bias    std. error
## t1* 0.2363187 0.02393070   0.1318934
## t2* 0.2363050 0.02394347   0.1318932
# Get 95% confidence interval
results$ATE[1]<-res_boot(boot.out,digits=4)[1]
results$ATT[1]<-res_boot(boot.out,digits=4)[2]

results
##                     ATE                   ATT
## 1 0.2363(0.0929-0.5899) 0.2363(0.0929-0.5899)
## 2                  <NA>                  <NA>
## 3                  <NA>                  <NA>
## 4                  <NA>                  <NA>

5.8 Bootstrap confidence intervals for METHOD 2: stratification

With the stratification method there are very often problems of convergence of the regression algorithm, since within the strata we have very few events!

f_PSstrat <- function(data, indices) {
  d <- data[indices,] # allows boot to select sample
  
  m1<-glm(abcix~
            acutemi+
            ejecfrac+
            ves1proc+
            stent+
            diabetic+
            female,
          data=d,family = binomial)
  
  d$ps<-fitted.values(m1)
  
  quart_PS<-quantile(d$ps,c(0,0.33,0.66,1))
  
  d$strata<-cut(d$ps, quart_PS, include.lowest=TRUE, labels=c(1:3))
  
  for (i in 1:3){
    ms<-glm(sixMonthDeath~abcix,data=d,subset = strata==i,family="binomial")
    beta.treat[i]<-coef(ms)[2]
  }
  
  
  ATE <- exp(sum(beta.treat*nstrata)/nrow(lindner))
  ATT <- exp(sum(beta.treat*treated.strata)/sum(treated.strata))
  
  return(c(ATE,ATT))
  
}
boot.out4 <- boot(data=lindner, statistic=f_PSstrat,R=1000)

# Get 95% confidence interval
results$ATE[2]<-res_boot(boot.out4,digits=3)[1]
results$ATT[2]<-res_boot(boot.out4,digits=3)[2]

5.9 Robust confidence intervals for METHOD 3: matching

We now estimate the robust standard errors related to the estimate on the matched dataset:

cov <- vcovHC(fit.out4, type = "HC0")
std.err <- sqrt(diag(cov))

q.val <- qnorm(0.975)
r <- cbind(
  Estimate = coef(fit.out4)
  , "Robust SE" = std.err
  , z = (coef(fit.out4)/std.err)
  , "Pr(>|z|) "= 2 * pnorm(abs(coef(fit.out4)/std.err), lower.tail = FALSE)
  , LL = coef(fit.out4) - q.val  * std.err
  , UL = coef(fit.out4) + q.val  * std.err
)


#Exponential to get the OR
results$ATT[3]<- paste0(round(exp(r[2,1]),4),"(",round(exp(r[2,5]),4),"-",round(exp(r[2,6]),4),")")

5.10 Bootstrap confidence intervals for METHOD 4: IPTW

f_IPTW <- function(data, indices) {
  
  d <- data[indices,] # allows boot to select sample
  
  # estimation of ps
  m1<-glm(abcix~
            acutemi+
            ejecfrac+
            ves1proc+
            stent+
            diabetic+
            female,
          data=d,
          family = binomial)
  
  d$ps<-fitted.values(m1)
  
  # overlap
  d %<>% mutate(overlap=ifelse(ps>=min(ps[abcix==1]) &
                                 ps<=max(ps[abcix==0]),1,0)) %>% 
    filter(overlap==1)
  
  # Definition of weights for ATE
  d %<>%mutate(w_ATE=case_when(abcix==1~1/ps,
                                     abcix==0~1/(1-ps))) 
  
  #Definition of weights for ATT
  d %<>%mutate(w_ATT=case_when(abcix==1~1,
                                     abcix==0~ps/(1-ps)))
  
  
  
  # Estimate ATE 
  design.lindnerATE <- svydesign(ids=~1,
                                 weights = ~w_ATE,
                                 data=d)
  
  fit_itpw_ATE <- svyglm(sixMonthDeath~abcix,
                         family=binomial,
                         design=design.lindnerATE)
  
  
  
  # Estimate ATT
  design.lindnerATT <- svydesign(ids=~1,
                                 weights = ~w_ATT,
                                 data=d)
  
  fit_iptw_ATT <- svyglm(sixMonthDeath~abcix,
                         family=binomial,
                         design=design.lindnerATT)
  
  ATE <- exp(fit_itpw_ATE$coefficients[2])
  ATT <- exp(fit_iptw_ATT$coefficients[2])
  
  return(c(ATE,ATT))
}
boot.out5 <- boot(data=lindner, statistic=f_IPTW,R=1000)

results$ATE[4]<-res_boot(boot.out5,digits=4)[1]
results$ATT[4]<-res_boot(boot.out5,digits=4)[2]

5.11 Final comparison across methods!

# recall the crude OR from the original dataset
tbl_regression(fit.crude,exponentiate=T)
Characteristic   OR¹    95% CI¹      p-value
abcix            0.30   0.13, 0.66   0.003
¹ OR = Odds Ratio, CI = Confidence Interval
# PS results
row.names(results) <- c("PS Adj Linear","Stratification","Matching","IPTW")
results
##                                  ATE                   ATT
## PS Adj Linear  0.2363(0.0929-0.5899) 0.2363(0.0929-0.5899)
## Stratification       0.196(0-43.913)   0.202(0.001-35.486)
## Matching                        <NA> 0.2116(0.1046-0.4283)
## IPTW           0.1749(0.0592-0.6322) 0.1525(0.0474-0.6406)

We can observe that, using the propensity score based methods, we obtain a stronger estimated treatment effect compared with the crude odds ratio. We can also observe that stratification produces very unstable results, due to the low number of events in the different strata. In conclusion, adjusting for confounders was important in this context!

5.12 Appendix 1: METHOD 2, “visually” checking balance across strata

# Function for graphical diagnostic to check balance 
balance.Strat <- function(x){
  
  if(length(unique(lindner[,x]))<10){
    cat.psa(categorical = lindner[,x],
            treatment   = lindner$abcix,
            strata      = lindner$strata,
            ylab="Relative Frequency")
    title(x)
  }
  else{
    box.psa(continuous =  lindner[,x],
            treatment   = lindner$abcix,
            strata      = lindner$strata)
    title(x)
  }
}
var <- lindner %>%dplyr::select(stent,
                         female,
                         diabetic,
                         acutemi,
                         ejecfrac,
                         ves1proc) %>% colnames() 
balance.Strat(var[1])

balance.Strat(var[2])

balance.Strat(var[3])

balance.Strat(var[4])

balance.Strat(var[5])
## Warning in xy.coords(x, y): NAs introduced by coercion

## Warning in xy.coords(x, y): NAs introduced by coercion

balance.Strat(var[6])

5.13 Appendix 2: Methods to estimate weights

There are many R libraries that can be used to estimate weights; see for example the PSweight package: https://cran.r-project.org/web/packages/PSweight/index.html. This library enables the estimation of, and inference about, average causal effects with binary and multiple treatments using overlap weights (ATO), inverse probability of treatment weights (ATE), average treatment effect among the treated weights (ATT), matching weights (ATM) and entropy weights (ATEN), with and without propensity score trimming. Reference: https://arxiv.org/pdf/2010.08893

6 Logistic regression

6.1 Introduction

In epidemiological data, outcomes are often binary or dichotomous. For example, in the investigation of the cause of a disease, the outcome is the disease status: diseased vs non-diseased. In a mortality study, the outcome is usually died or survived.

For a continuous variable such as weight or height, the single representative number for the population or sample is the mean or median. For dichotomous data, the representative number is the proportion or percentage with one type of outcome. For example, as you should remember from Block 1, prevalence is the proportion of the population with the disease of interest, and case-fatality is the proportion of deaths among the people with the disease.

Another related term is probability. Proportion is a simple, straightforward empirical quantity, while probability denotes likeliness and is more theoretical (it assumes a random mechanism that generates the observed data). In the case of a dichotomous variable, the proportion is used as the estimated probability, for example assuming a binomial random variable behaviour. For computation, the outcome of interest is coded 1 and 0 otherwise; the prevalence is then simply the mean of these 0/1 values in the study sample.

Assuming the binomial mechanism, the standard regression model is logistic regression, which uses the log(odds), i.e. the logit, as the scale of the linear predictor. If P is the probability of having the disease, 1-P is the probability of not having it, and the odds is therefore P/(1-P). As a reminder, the relationship between probability and odds, and especially log(odds), can be plotted as follows:

p <- seq(from=0, to=1, by=.01)
odds <- p/(1-p)
plot(log(odds), p, type="l", col="blue", ylab="Probability",
main="Relationship between log odds and probability", las=1)
abline(h=.5)
abline(v=0)

Being on an unbounded and symmetric scale, the logit is a more appropriate scale for a regression model with a binary outcome than the probability itself.

Similarly to the conversion between probabilities and odds, we can devise functions that convert both ways between probability and log-odds:

logit <- function(p) log(p/(1-p))
tigol <- function(t) 1/(1+exp(-t))

(tigol is just the letters of logit reversed).
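For example, a probability of 0.5 corresponds to a log-odds of 0, and the two functions are inverses of each other:

logit(0.5)          # 0
tigol(0)            # 0.5
tigol(logit(0.9))   # recovers 0.9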

Modelling logit(Y|X) = \(\beta X\) is the general form of logistic regression: we are modelling the logit of Y given X (i.e. conditioning on X), where X denotes one or more independent variables.

Suppose there are two independent or exposure variables, X1 and X2. Then \(\beta X\) would be \(\beta_{0} + \beta_{1}X_{1} + \beta_{2}X_{2}\), where \(\beta_{0}\) is the intercept.

The X can be age, sex, and other prognostic variables. In the explanatory/causal setting, among these X variables, one specific X is usually the focus of the study. Others are potential confounders and covariates.

Under the logistic regression model, the probability of observing the outcome given X, Pr(Y|X), is equal to \(\exp(\beta X)/(1 + \exp(\beta X))\). Hence, logistic regression is often used to compute the probability of an outcome under a given set of exposures. For example, prediction of the probability of getting a disease for a given combination of age, sex, behaviour group, etc.

6.2 Example: Tooth decay

The dataset Decay is a simple dataset containing two variables: ‘decay’, which is binary and ‘strep’, which is a continuous variable.

data(Decay)
summary(Decay)
##      decay            strep       
##  Min.   :0.0000   Min.   :  0.50  
##  1st Qu.:0.0000   1st Qu.: 43.00  
##  Median :1.0000   Median :105.00  
##  Mean   :0.6307   Mean   : 95.25  
##  3rd Qu.:1.0000   3rd Qu.:150.00  
##  Max.   :1.0000   Max.   :152.50

The outcome variable is ‘decay’, which indicates whether a person has at least one decayed tooth (1) or not (0). The exposure variable is ‘strep’, the number of colony forming units (CFU) of streptococci, a group of bacteria suspected to cause tooth decay. The prevalence of having decayed teeth is equal to the mean of the ‘decay’ variable, i.e. 0.63. To look at the ‘strep’ variable type:

summary(Decay$strep)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    0.50   43.00  105.00   95.25  150.00  152.50
hist(Decay$strep)

The histogram shows that the vast majority of values are close to 150. Since the natural distribution of bacterial counts is logarithmic, a log-transformed variable is created and used as the independent variable. Let’s now estimate a univariable logistic regression model:

Decay$log10.strep <- log10(Decay$strep)
glm0 <- glm(decay~log10.strep, family=binomial, data=Decay)

Ask for the summary of the estimated model:

summary(glm0)
## 
## Call:
## glm(formula = decay ~ log10.strep, family = binomial, data = Decay)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.5539     0.5184  -4.927 8.36e-07 ***
## log10.strep   1.6808     0.2764   6.082 1.19e-09 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 574.27  on 435  degrees of freedom
## Residual deviance: 531.83  on 434  degrees of freedom
## AIC: 535.83
## 
## Number of Fisher Scoring iterations: 4

Both the intercept and the coefficient of ‘log10.strep’ are statistically significant. Pr(>|z|) for ‘log10.strep’ is the P value from Wald’s test, which tests whether the coefficient, 1.681, is significantly different from 0; in this case it is. The estimated intercept is -2.554. This means that when log10.strep is 0 (i.e. strep equals 1 CFU), the logit of having at least one decayed tooth is -2.55. We can then calculate the related baseline odds and probability.

exp(-2.554) -> baseline.odds
baseline.odds
## [1] 0.07776996
baseline.odds/(1+baseline.odds) -> baseline.prob
baseline.prob
## [1] 0.07215822

There is an odds of 0.078, or a probability of 7.2%, of having at least one decayed tooth when the mutans streptococci count is 1 CFU.

The coefficient of log10.strep is 1.681: for every unit increase of log10(strep), i.e. a 10-fold increase in the number of CFU, the logit increases by 1.681.

This increment in the logit is constant, but the corresponding increment in probability is not, because the probability scale is not linear in the predictor.

The linear predictor (on the logit scale) at any CFU value is computed by plugging both estimated coefficients into the linear equation. For example, at 100 CFU:

prob.100 <- coef(glm0)[1] + log10(100)*coef(glm0)[2]
prob.100
## (Intercept) 
##   0.8078015
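The value above is on the logit scale; applying the inverse-logit transformation defined earlier gives the corresponding probability (approximately 0.69):

# prob.100 holds the logit at 100 CFU; convert it to a probability
tigol(prob.100)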

To see the relationship for the whole dataset:

plot(Decay$log10.strep, fitted(glm0), ylim=c(0,0.80))

The logistic nature of the curve is partly apparent. To make it clearer, the ranges of the X and Y axes are both expanded to allow a more extensive curve to be plotted.

plot(Decay$log10.strep, fitted(glm0), xlim = c(-2,4),
ylim=c(0,1))

A new vector with the same name, ‘log10.strep’, is created inside a data frame, to be used for plotting a fitted line on the same graph.

newdata        <- data.frame(log10.strep=seq(from=-2, to=4, by=.01))
predicted.line <- predict.glm(glm0,newdata,type="response")

The values for the predicted line in the above command are on the same scale as the ‘response’ variable. Since the response is either 0 or 1, the predicted line lies in between: it is the predicted probability for each value of log10(strep).

plot(Decay$log10.strep, fitted(glm0), xlim = c(-2,4),
ylim=c(0,1), xlab=" ", ylab=" ", xaxt="n", las=1)
lines(newdata$log10.strep, predicted.line, col="blue")
axis(side=1, at=-2:4, labels=as.character(10^(-2:4)))
title(main="Relationship between mutan streptococci \n and probability of tooth decay", xlab="CFU",
ylab="Probability of having decayed teeth")

6.3 Another example of logistic regression

The above example of caries data has a continuous variable ‘log10.strep’ as the key independent variable.

In most epidemiological datasets, the independent variables are often categorical.

We will examine an example coming from an outbreak investigation.

On 25 August 1990, the local health officer in Supan Buri Province of Thailand reported the occurrence of an outbreak of acute gastrointestinal illness on a national handicapped sports day. Epidemiologists went to investigate. The dataset is called Outbreak. Most variable names are self explanatory. Variables are coded as 0 = no, 1 = yes and 9 = missing/unknown for three food items consumed by participants: ‘beefcurry’ (beef curry), ‘saltegg’ (salted eggs) and ‘water’.

Also on the menu were eclairs, a finger-shaped iced cake of choux pastry filled with cream. This variable records the number of pieces eaten by each participant. Missing values were coded as follows: 80 = “ate but do not remember how much”, while code 90 represents totally missing information.

Some participants experienced gastrointestinal symptoms, such as nausea, vomiting, abdominal pain and diarrhea. The age of each participant is recorded in years, with 99 representing a missing value.

The variables ‘exptime’ and ‘onset’ are the exposure and onset times, which are in character format.

Let’s look at the data.

data(Outbreak)
summary(Outbreak)
##        id              sex             age          exptime         
##  Min.   :   1.0   Min.   :0.000   Min.   : 1.00   Length:1094       
##  1st Qu.: 274.2   1st Qu.:0.000   1st Qu.:14.00   Class :AsIs       
##  Median : 547.5   Median :1.000   Median :18.00   Mode  :character  
##  Mean   : 547.5   Mean   :0.659   Mean   :23.69                     
##  3rd Qu.: 820.8   3rd Qu.:1.000   3rd Qu.:24.00                     
##  Max.   :1094.0   Max.   :1.000   Max.   :99.00                     
##    beefcurry         saltegg          eclair          water      
##  Min.   :0.0000   Min.   :0.000   Min.   : 0.00   Min.   :0.000  
##  1st Qu.:1.0000   1st Qu.:1.000   1st Qu.: 0.00   1st Qu.:1.000  
##  Median :1.0000   Median :1.000   Median : 2.00   Median :1.000  
##  Mean   :0.9534   Mean   :0.957   Mean   :11.48   Mean   :1.021  
##  3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.: 2.00   3rd Qu.:1.000  
##  Max.   :9.0000   Max.   :9.000   Max.   :90.00   Max.   :9.000  
##     onset               nausea          vomiting         abdpain      
##  Length:1094        Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  Class :AsIs        1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000   Median :0.0000   Median :0.0000  
##                     Mean   :0.3985   Mean   :0.3793   Mean   :0.3501  
##                     3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                     Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##     diarrhea     
##  Min.   :0.0000  
##  1st Qu.:0.0000  
##  Median :0.0000  
##  Mean   :0.2148  
##  3rd Qu.:0.0000  
##  Max.   :1.0000

We will first of all define the cases. It was agreed among the investigators that a case should be defined as a person who had any of the four symptoms: ‘nausea’, ‘vomiting’, ‘abdpain’ or ‘diarrhea’. A case can then be computed as follows:

Outbreak$case <- (Outbreak$nausea==1)|(Outbreak$vomiting==1)|(Outbreak$abdpain==1)|(Outbreak$diarrhea==1)

The variable ‘case’ is now incorporated into the data.

summary(Outbreak)
##        id              sex             age          exptime         
##  Min.   :   1.0   Min.   :0.000   Min.   : 1.00   Length:1094       
##  1st Qu.: 274.2   1st Qu.:0.000   1st Qu.:14.00   Class :AsIs       
##  Median : 547.5   Median :1.000   Median :18.00   Mode  :character  
##  Mean   : 547.5   Mean   :0.659   Mean   :23.69                     
##  3rd Qu.: 820.8   3rd Qu.:1.000   3rd Qu.:24.00                     
##  Max.   :1094.0   Max.   :1.000   Max.   :99.00                     
##    beefcurry         saltegg          eclair          water      
##  Min.   :0.0000   Min.   :0.000   Min.   : 0.00   Min.   :0.000  
##  1st Qu.:1.0000   1st Qu.:1.000   1st Qu.: 0.00   1st Qu.:1.000  
##  Median :1.0000   Median :1.000   Median : 2.00   Median :1.000  
##  Mean   :0.9534   Mean   :0.957   Mean   :11.48   Mean   :1.021  
##  3rd Qu.:1.0000   3rd Qu.:1.000   3rd Qu.: 2.00   3rd Qu.:1.000  
##  Max.   :9.0000   Max.   :9.000   Max.   :90.00   Max.   :9.000  
##     onset               nausea          vomiting         abdpain      
##  Length:1094        Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  Class :AsIs        1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000   Median :0.0000   Median :0.0000  
##                     Mean   :0.3985   Mean   :0.3793   Mean   :0.3501  
##                     3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                     Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##     diarrhea         case        
##  Min.   :0.0000   Mode :logical  
##  1st Qu.:0.0000   FALSE:625      
##  Median :0.0000   TRUE :469      
##  Mean   :0.2148                  
##  3rd Qu.:0.0000                  
##  Max.   :1.0000

6.3.1 Recoding missing values

We now recode the missing values. We do not impute them, so subjects with missing values will be removed from the analyses.

Outbreak = Outbreak %>%
  mutate(
    age=na_if(age, 99),
    beefcurry=na_if(beefcurry, 9),
    saltegg=na_if(saltegg, 9),
    water=na_if(water, 9),
    eclair=na_if(eclair, 90))
summary(Outbreak)
##        id              sex             age          exptime         
##  Min.   :   1.0   Min.   :0.000   Min.   : 1.00   Length:1094       
##  1st Qu.: 274.2   1st Qu.:0.000   1st Qu.:14.00   Class :AsIs       
##  Median : 547.5   Median :1.000   Median :17.00   Mode  :character  
##  Mean   : 547.5   Mean   :0.659   Mean   :19.32                     
##  3rd Qu.: 820.8   3rd Qu.:1.000   3rd Qu.:22.00                     
##  Max.   :1094.0   Max.   :1.000   Max.   :58.00                     
##                                   NA's   :60                        
##    beefcurry         saltegg           eclair           water      
##  Min.   :0.0000   Min.   :0.0000   Min.   : 0.000   Min.   :0.000  
##  1st Qu.:1.0000   1st Qu.:1.0000   1st Qu.: 0.000   1st Qu.:1.000  
##  Median :1.0000   Median :1.0000   Median : 2.000   Median :1.000  
##  Mean   :0.9164   Mean   :0.9201   Mean   : 2.072   Mean   :0.977  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.: 2.000   3rd Qu.:1.000  
##  Max.   :1.0000   Max.   :1.0000   Max.   :80.000   Max.   :1.000  
##  NA's   :5        NA's   :5        NA's   :117      NA's   :6      
##     onset               nausea          vomiting         abdpain      
##  Length:1094        Min.   :0.0000   Min.   :0.0000   Min.   :0.0000  
##  Class :AsIs        1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000  
##  Mode  :character   Median :0.0000   Median :0.0000   Median :0.0000  
##                     Mean   :0.3985   Mean   :0.3793   Mean   :0.3501  
##                     3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:1.0000  
##                     Max.   :1.0000   Max.   :1.0000   Max.   :1.0000  
##                                                                       
##     diarrhea         case        
##  Min.   :0.0000   Mode :logical  
##  1st Qu.:0.0000   FALSE:625      
##  Median :0.0000   TRUE :469      
##  Mean   :0.2148                  
##  3rd Qu.:0.0000                  
##  Max.   :1.0000                  
## 

The three variables can also be changed to factors:

Outbreak = Outbreak %>%
  mutate(
    beefcurry = as.factor(beefcurry),
    saltegg = as.factor(saltegg),
    water = as.factor(water))
summary(Outbreak)
##        id              sex             age          exptime          beefcurry 
##  Min.   :   1.0   Min.   :0.000   Min.   : 1.00   Length:1094        0   : 91  
##  1st Qu.: 274.2   1st Qu.:0.000   1st Qu.:14.00   Class :AsIs        1   :998  
##  Median : 547.5   Median :1.000   Median :17.00   Mode  :character   NA's:  5  
##  Mean   : 547.5   Mean   :0.659   Mean   :19.32                                
##  3rd Qu.: 820.8   3rd Qu.:1.000   3rd Qu.:22.00                                
##  Max.   :1094.0   Max.   :1.000   Max.   :58.00                                
##                                   NA's   :60                                   
##  saltegg         eclair        water         onset               nausea      
##  0   :  87   Min.   : 0.000   0   :  25   Length:1094        Min.   :0.0000  
##  1   :1002   1st Qu.: 0.000   1   :1063   Class :AsIs        1st Qu.:0.0000  
##  NA's:   5   Median : 2.000   NA's:   6   Mode  :character   Median :0.0000  
##              Mean   : 2.072                                  Mean   :0.3985  
##              3rd Qu.: 2.000                                  3rd Qu.:1.0000  
##              Max.   :80.000                                  Max.   :1.0000  
##              NA's   :117                                                     
##     vomiting         abdpain          diarrhea         case        
##  Min.   :0.0000   Min.   :0.0000   Min.   :0.0000   Mode :logical  
##  1st Qu.:0.0000   1st Qu.:0.0000   1st Qu.:0.0000   FALSE:625      
##  Median :0.0000   Median :0.0000   Median :0.0000   TRUE :469      
##  Mean   :0.3793   Mean   :0.3501   Mean   :0.2148                  
##  3rd Qu.:1.0000   3rd Qu.:1.0000   3rd Qu.:0.0000                  
##  Max.   :1.0000   Max.   :1.0000   Max.   :1.0000                  
## 

All variables now look fine except ‘eclair’, which still contains the value 80 representing “ate but do not remember how much”. We will analyse its relationship with ‘case’ by considering it as an ordered categorical variable.

tabpct(Outbreak$eclair, Outbreak$case)
## 
## Original table 
##                Outbreak$case
## Outbreak$eclair  FALSE  TRUE  Total
##           0        279    15    294
##           0.5        5    10     15
##           1         49    41     90
##           2        203   243    446
##           3          9    22     31
##           4         14    33     47
##           5          2     5      7
##           6          8    20     28
##           8          1     5      6
##           10         1     3      4
##           12         1     1      2
##           19         1     0      1
##           20         1     0      1
##           80         5     0      5
##           Total    579   398    977
## 
## Row percent 
##                Outbreak$case
## Outbreak$eclair   FALSE    TRUE  Total
##             0       279      15    294
##                  (94.9)   (5.1)  (100)
##             0.5       5      10     15
##                  (33.3)  (66.7)  (100)
##             1        49      41     90
##                  (54.4)  (45.6)  (100)
##             2       203     243    446
##                  (45.5)  (54.5)  (100)
##             3         9      22     31
##                    (29)    (71)  (100)
##             4        14      33     47
##                  (29.8)  (70.2)  (100)
##             5         2       5      7
##                  (28.6)  (71.4)  (100)
##             6         8      20     28
##                  (28.6)  (71.4)  (100)
##             8         1       5      6
##                  (16.7)  (83.3)  (100)
##             10        1       3      4
##                    (25)    (75)  (100)
##             12        1       1      2
##                    (50)    (50)  (100)
##             19        1       0      1
##                   (100)     (0)  (100)
##             20        1       0      1
##                   (100)     (0)  (100)
##             80        5       0      5
##                   (100)     (0)  (100)
## 
## Column percent 
##                Outbreak$case
## Outbreak$eclair  FALSE       %  TRUE       %
##           0        279  (48.2)    15   (3.8)
##           0.5        5   (0.9)    10   (2.5)
##           1         49   (8.5)    41  (10.3)
##           2        203  (35.1)   243  (61.1)
##           3          9   (1.6)    22   (5.5)
##           4         14   (2.4)    33   (8.3)
##           5          2   (0.3)     5   (1.3)
##           6          8   (1.4)    20   (5.0)
##           8          1   (0.2)     5   (1.3)
##           10         1   (0.2)     3   (0.8)
##           12         1   (0.2)     1   (0.3)
##           19         1   (0.2)     0   (0.0)
##           20         1   (0.2)     0   (0.0)
##           80         5   (0.9)     0   (0.0)
##           Total    579   (100)   398   (100)

The width of the columns of the mosaic graph denotes the relative frequency of that category. The highest frequency is 2 pieces followed by 0 and 1 piece. The other numbers have relatively low frequencies; particularly the 5 records where ‘eclair’ was coded as 80.

There is a tendency of increasing red area (attack rate) from left to right, indicating that the risk increased as more pieces of eclair were consumed.

We will use the distribution of these proportions to guide our grouping of eclair consumption.

The first column of zero consumption has a very low attack rate, therefore it should be a separate category. Only a few took half a piece and this could be combined with those who took only one piece.

Persons consuming 2 pieces should be kept as one category as their frequency is very high. Others who ate more than two pieces should be grouped into another category. Finally, those coded as ‘80’ will be dropped due to the unknown amount of consumption as well as its low frequency.

Remember that collapsing categories with low counts helps to increase the stability and precision of the estimated regression coefficients.

Outbreak$eclairgr <- cut(Outbreak$eclair, breaks = c(0, 0.4, 1, 2, 79),
include.lowest = TRUE, labels=c("0","1","2",">2"))

The argument ‘include.lowest=TRUE’ indicates that 0 eclair must be included in the lowest category.
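
As a quick sanity check (a sketch), we can cross-tabulate the original counts against the new groups; the value 80 falls outside the breaks and therefore becomes NA, as intended:

table(Outbreak$eclair, Outbreak$eclairgr, useNA = "ifany")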

tabpct(Outbreak$eclairgr, Outbreak$case)
## 
## Original table 
##                  Outbreak$case
## Outbreak$eclairgr  FALSE  TRUE  Total
##             0        279    15    294
##             1         54    51    105
##             2        203   243    446
##             >2        38    89    127
##             Total    574   398    972
## 
## Row percent 
##                  Outbreak$case
## Outbreak$eclairgr   FALSE    TRUE  Total
##                0      279      15    294
##                    (94.9)   (5.1)  (100)
##                1       54      51    105
##                    (51.4)  (48.6)  (100)
##                2      203     243    446
##                    (45.5)  (54.5)  (100)
##                >2      38      89    127
##                    (29.9)  (70.1)  (100)
## 
## Column percent 
##                  Outbreak$case
## Outbreak$eclairgr  FALSE       %  TRUE       %
##             0        279  (48.6)    15   (3.8)
##             1         54   (9.4)    51  (12.8)
##             2        203  (35.4)   243  (61.1)
##             >2        38   (6.6)    89  (22.4)
##             Total    574   (100)   398   (100)

The attack rate, or percentage of diseased in each exposure category (shown in brackets in the TRUE column of the row percentages), increases from 5.1% among those who did not eat any eclairs to 70.1% among the heavy eaters of eclairs.

The graph output is similar to the preceding one except that the groups are now fewer and easier to read. We now have a continuous variable, ‘eclair’, and a categorical variable, ‘eclairgr’. The next step is to create a binary exposure variable for eclair.

Outbreak$eclair.eat <- Outbreak$eclair > 0

This binary exposure variable is now similar to the others, i.e. ‘beefcurry’, ‘saltegg’ and ‘water’.

We now model ‘case’ as the binary outcome variable and take ‘eclair.eat’ as the only explanatory variable:

glm0 <- glm(case ~ eclair.eat, family=binomial, data=Outbreak)
summary(glm0)
## 
## Call:
## glm(formula = case ~ eclair.eat, family = binomial, data = Outbreak)
## 
## Coefficients:
##                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -2.923      0.265  -11.03   <2e-16 ***
## eclair.eatTRUE    3.167      0.276   11.47   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1320.7  on 976  degrees of freedom
## Residual deviance: 1055.2  on 975  degrees of freedom
##   (117 observations deleted due to missingness)
## AIC: 1059.2
## 
## Number of Fisher Scoring iterations: 5
logistic.display(glm0)
## 
## Logistic regression predicting case 
##  
##            OR(95%CI)            P(Wald's test) P(LR-test)
## eclair.eat 23.75 (13.82,40.79)  < 0.001        < 0.001   
##                                                          
## Log-likelihood = -527.6075
## No. of observations = 977
## AIC value = 1059.2149

The odds ratio from the logistic regression is derived from exponentiation of the estimate, i.e. 23.75 is obtained from:

exp(coef(summary(glm0))[2,1])
## [1] 23.746

The 95% confidence interval of the odds ratio is obtained from:

exp(coef(summary(glm0))[2,1] + c(-1,1) * 1.96 *coef(summary(glm0))[2,2])
## [1] 13.82402 40.78933

These values are close to those obtained from a simple calculation on the 2-by-2 table.
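
For comparison, a sketch of that simple calculation: the cross-product odds ratio computed directly from the 2-by-2 table of ‘eclair.eat’ by ‘case’ (missing values are dropped by table() by default).

tab <- table(Outbreak$eclair.eat, Outbreak$case)
(tab["TRUE","TRUE"] * tab["FALSE","FALSE"]) / (tab["TRUE","FALSE"] * tab["FALSE","TRUE"])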

The output from logistic.display also contains the ‘LR-test’ (likelihood ratio test) result, which checks whether the likelihood of the given model, ‘glm0’, is significantly different from that of the model without ‘eclair.eat’, which in this case is the “null” (intercept-only) model. For an independent variable with two levels, the LR-test does not add further important information because Wald’s test has already tested the hypothesis. When the independent variable has more than two levels, the LR-test is more important than Wald’s test, as the following example demonstrates.
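
The same likelihood ratio test can also be obtained in base R (a sketch): for a model with a single covariate, the sequential analysis of deviance compares it with the intercept-only model.

anova(glm0, test = "Chisq")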

glm1 <- glm(case ~ eclairgr, family=binomial, data=Outbreak)
logistic.display(glm1)
## 
## Logistic regression predicting case 
##  
##                  OR(95%CI)            P(Wald's test) P(LR-test)
## eclairgr: ref.=0                                     < 0.001   
##    1             17.57 (9.21,33.49)   < 0.001                  
##    2             22.27 (12.82,38.66)  < 0.001                  
##    >2            43.56 (22.89,82.91)  < 0.001                  
##                                                                
## Log-likelihood = -516.8236
## No. of observations = 972
## AIC value = 1041.6471

Interpreting Wald’s test alone, one would conclude that all levels of eclair eaten are significantly associated with the outcome. However, this depends on the reference level. By default, R assumes that the first level of an independent factor is the reference level. If we relevel the reference level to be 2 pieces of eclair, Wald’s test gives a different impression.

Outbreak$eclairgr <- relevel(Outbreak$eclairgr, ref="2")
glm2 <- glm(case ~ eclairgr, family=binomial, data=Outbreak)
logistic.display(glm2)
## 
## Logistic regression predicting case 
##  
##                  OR(95%CI)         P(Wald's test) P(LR-test)
## eclairgr: ref.=2                                  < 0.001   
##    0             0.04 (0.03,0.08)  < 0.001                  
##    1             0.79 (0.52,1.21)  0.275                    
##    >2            1.96 (1.28,2.99)  0.002                    
##                                                             
## Log-likelihood = -516.8236
## No. of observations = 972
## AIC value = 1041.6471

The results show that eating only one piece of eclair does not reduce the risk significantly compared to eating two pieces. While results from Wald’s test depend on the reference level of the explanatory variable, the LR-test is concerned only with the contribution of the variable as a whole and ignores the reference level.
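
A base-R sketch of this point: drop1() gives the likelihood ratio test for ‘eclairgr’ as a whole, and its result does not depend on the chosen reference level.

drop1(glm2, test = "Chisq")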

6.3.2 Try to evaluate the relationship between variables (when discussion with an expert is not possible)

Imagine that we want to get an idea of the possible confounders of the effect of eclair and we do not have the possibility to talk with an expert, so we investigate the relationships by observing the data. In this case one possibility is to examine univariable effects and then build multivariable models, paying attention to possible changes in the estimated effect of the variable of interest when the other variables are introduced. From a causal inference point of view this procedure is not considered as valid as defining a DAG a priori with the help of an expert, but it gives at least some indications.

Therefore, try ‘saltegg’ now as the only explanatory variable:

glm3 <- glm(case ~ saltegg, family = binomial, data=Outbreak)
logistic.display(glm3)
## 
## Logistic regression predicting case 
##  
##                 OR(95%CI)         P(Wald's test) P(LR-test)
## saltegg: 1 vs 0 2.54 (1.53,4.22)  < 0.001        < 0.001   
##                                                            
## Log-likelihood = -736.998
## No. of observations = 1089
## AIC value = 1477.996

The odds ratio for ‘saltegg’ is statistically significant. The number of valid records is also higher than in the model containing ‘eclairgr’.

Note: One should always be very careful when analysing data that contain missing values. Advanced methods to handle missing values are beyond the scope of this course and for reasons of simplicity are ignored here. Data scientists should be advised to deal with missing values properly prior to conducting their analysis.

To check whether the odds ratio of ‘saltegg’ is confounded by ‘eclairgr’, the two explanatory variables are put together in the next model:

glm4 <- glm(case ~ eclairgr + saltegg, family=binomial, data=Outbreak)
logistic.display(glm4, crude.p.value=TRUE)
## 
## Logistic regression predicting case 
##  
##                  crude OR(95%CI)   crude P value adj. OR(95%CI)   
## eclairgr: ref.=2                                                  
##    0             0.04 (0.03,0.08)  < 0.001       0.04 (0.03,0.08) 
##    1             0.79 (0.52,1.21)  0.275         0.79 (0.51,1.21) 
##    >2            1.96 (1.28,2.99)  0.002         1.96 (1.28,2.99) 
##                                                                   
## saltegg: 1 vs 0  2.37 (1.41,3.99)  0.001         1.01 (0.53,1.93) 
##                                                                   
##                  P(Wald's test) P(LR-test)
## eclairgr: ref.=2                < 0.001   
##    0             < 0.001                  
##    1             0.279                    
##    >2            0.002                    
##                                           
## saltegg: 1 vs 0  0.975          0.975     
##                                           
## Log-likelihood = -516.8231
## No. of observations = 972
## AIC value = 1043.6461

The odds ratios of the explanatory variables in glm4 are adjusted for each other. The crude odds ratios are exactly the same as those from the previous single-variable models.

The adjusted odds ratios of ‘eclairgr’ do not change suggesting that it is not confounded by ‘saltegg’, whereas the odds ratio of ‘saltegg’ is clearly changed towards unity, and now has a very large P value.

The difference between the adjusted odds ratio and the crude odds ratio is an indication that ‘saltegg’ is confounded by ‘eclairgr’, which could be considered as an independent risk factor.
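
A rough sketch of the change-in-estimate reasoning used here: compare the crude (glm3) and adjusted (glm4) log odds ratios for ‘saltegg’ and look at their relative difference.

crude    <- coef(glm3)[grep("saltegg", names(coef(glm3)))]
adjusted <- coef(glm4)[grep("saltegg", names(coef(glm4)))]
(crude - adjusted)/crude   # relative change in the log odds ratio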

Now that we have a model containing two explanatory variables, we can compare models ‘glm4’ and ‘glm2’ using the lrtest command.

lrtest(glm4, glm2)
## Likelihood ratio test for MLE method 
## Chi-squared 1 d.f. =  0.0009809013 , P value =  0.9750149

The P value of 0.975 is the same as that from ‘P(LR-test)’ of ‘saltegg’ obtained from the preceding command. The test determines whether removing ‘saltegg’ from the model makes a significant difference compared with keeping it.

When there is more than one explanatory variable, ‘P(LR-test)’ from logistic.display is actually obtained from the lrtest command, which compares the current model against one in which the particular variable is removed, while keeping all remaining variables.
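
A base-R sketch of the same idea: drop1() removes one term at a time, keeping all the remaining variables, and reports the corresponding likelihood ratio test for each term.

drop1(glm4, test = "Chisq")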

Let’s further add covariates in the model:

glm5 <- glm(case~eclairgr+saltegg+sex, family=binomial, data=Outbreak)
logistic.display(glm5)
## 
## Logistic regression predicting case 
##  
##                  crude OR(95%CI)   adj. OR(95%CI)    P(Wald's test) P(LR-test)
## eclairgr: ref.=2                                                    < 0.001   
##    0             0.04 (0.03,0.08)  0.04 (0.02,0.07)  < 0.001                  
##    1             0.79 (0.52,1.21)  0.75 (0.49,1.16)  0.2                      
##    >2            1.96 (1.28,2.99)  1.82 (1.19,2.8)   0.006                    
##                                                                               
## saltegg: 1 vs 0  2.37 (1.41,3.99)  0.92 (0.48,1.76)  0.807          0.808     
##                                                                               
## sex: 1 vs 0      1.58 (1.19,2.08)  1.85 (1.35,2.53)  < 0.001        < 0.001   
##                                                                               
## Log-likelihood = -509.5181
## No. of observations = 972
## AIC value = 1031.0361

The third explanatory variable ‘sex’ is another independent risk factor. Since females are the reference level, males have an increased risk compared to females. This variable is not considered a confounder of either of the preceding variables because it has not substantially changed their odds ratios (compared with ‘glm4’). The reason it cannot confound them is its lack of association with either of the preceding explanatory variables: in other words, males and females were not different in terms of eating eclairs and salted eggs.

7 Survey of Health, Ageing and Retirement in Europe: example of an analysis of a real dataset

Load the required libraries

library(haven)
library(tidyverse)
library(magrittr)

Data comes from SHARE, the Survey of Health, Ageing and Retirement in Europe (https://share-eric.eu/data/). It is a multidisciplinary and cross-national panel database of data on health, socio-economic status and social and family networks of about 140000 individuals aged 50 or older (around 380000 interviews). SHARE covers 27 European countries and Israel.

The dataset we use here contains only part of the data the survey produced.

The research question is whether a stressful period could be associated with the occurrence of muscular weakness in the over-50 population: therefore we consider it from the point of view of an explanatory (causal) model, not a predictive one.
This question could be of interest since muscular weakness can be considered a proxy of poor physical health. Muscular strength was defined as the hand grip measured with a dynamometer, and weakness was defined as having a grip strength below a threshold calculated according to BMI and sex.

The exposure group was defined as subjects who underwent a stressful period between the first and the last time they had been interviewed; all other subjects were defined as non-exposed.

Of note, an inclusion criterion was to have normal hand strength at the start of the study. Once the subjects were divided into the exposed and non-exposed groups, the hand grip measurement taken two years after the exposure definition was used to define the outcome.

In this example we will consider the following variables:

load("Logistic Regression examples.RData")

7.1 IDA phase

7.1.1 Categorical Variables

Before starting to fit any regression model we have to properly code categorical variables. This is usually done in the initial data analysis (IDA) and the descriptive statistics phase.

The first step consists in identifying which variables are categorical. This may seem an easy step but it isn’t always. In fact, the choice for some variables (e.g. indexes and scales) is not obvious and requires either knowledge of the data we are analysing or a bit more effort in the data modelling.

The important message is that we can’t assume that it is correct to model a variable as numerical only because it was of type numerical/integer in the dataset imported in R.

Once we have identified the variables that we think are categorical, we can process them. There are three cases, each leading to a slightly different procedure according to the R type of the variable:

  • Binary variables
  • Variables of type numerical or integer
  • Variables of type character

7.1.2 Binary Variables

Binary variables can either be treated as numerical, usually as 0/1 dummy variables, or as factors. It does not make any difference in terms of the results obtained from the model, but we have to be aware of how they are coded for a correct interpretation of the model.

In general, the model will always have one parameter for a binary variable. However, we have to choose which level the parameter should represent by deciding which group to code as 1. The choice should be made according to how we want the model to be interpreted. For example, for a treatment variable, 0 is usually the placebo or control treatment and 1 the experimental treatment. Alternatively, for covariates/exposure variables, we could simply use a statistical criterion: for better stability of the model it is preferable to use the most frequent category as the reference.

For example, it makes more sense to use subjects who have had a paid job as the reference level of paid_job. This helps with the stability of the model and also eases the interpretation of the results. Therefore, we create a new variable no_paid_job:

datiSHAREStress$no_paid_job <- ifelse(datiSHAREStress$paid_job==0,1,0)

7.1.3 Categorical variables with numerical labels

In this case we must transform the variable into a factor, assign labels to the factor levels and choose a reference level. The reference level is the group that all the other levels will be compared to. In fact, when a factor variable with c levels is added to a model, R internally transforms it into c-1 dummy variables and a model parameter is estimated for each of them. Again, it is important to choose the reference level. In this case we have ses, which we have to transform into a factor:

datiSHAREStress$ses <- as.factor(datiSHAREStress$ses)

We can now check how R has coded the variable:

str(datiSHAREStress$ses)
##  Factor w/ 4 levels "1","2","3","4": 4 4 4 4 4 4 4 2 4 4 ...
contrasts(datiSHAREStress$ses)
##   2 3 4
## 1 0 0 0
## 2 1 0 0
## 3 0 1 0
## 4 0 0 1

By default, R uses the group 1 as the reference level. With the contrasts() function we can see how the variable will be parametrized in the regression model. In the rows we have the variable levels and on the columns the parameters generated for this variable.

It is also useful to assign more meaningful labels:

datiSHAREStress$ses=factor(datiSHAREStress$ses,
                           labels = c("Great Difficulties",
                           "Some Difficulties",
                            "Fairly Easily",
                             "Easily"))

Last but not least we can set the reference category of our choice. In this case we choose “Easily” which is the most frequent category.

table(datiSHAREStress$ses)
## 
## Great Difficulties  Some Difficulties      Fairly Easily             Easily 
##                191                813               1997               3148
datiSHAREStress$ses=relevel(datiSHAREStress$ses,ref="Easily")

7.1.4 Character variables

If the variable is of type character, the steps are similar to the previous ones except that we won’t need to explicitly set the labels. As a reminder, by default R uses the first level in alphabetical order as the reference level.
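
For illustration only (a toy character vector, not part of the dataset): factor() orders the levels alphabetically by default, and relevel() then sets the reference category of our choice.

x <- c("Italy", "France", "Spain", "Italy")
x <- factor(x)                  # levels: "France" "Italy" "Spain"
x <- relevel(x, ref = "Italy")  # "Italy" becomes the reference level
levels(x)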

7.2 General descriptive statistics and plots

We skip this part here, since we already discussed how to describe a dataset in the various IDA examples. Remember that you should pay particular attention to outliers, missing values, the shapes of the distributions of continuous variables, and rare categories in categorical ones!

7.3 Univariable Logistic Regression (univariable filtering)

A common method used in the analysis of health data for variable selection is fitting many models with one covariate at a time. This is also a way to begin exploring the dataset, even if, as discussed, it is not the suggested method to select variables for the model. In this case, we can start with the exposure to stress, our variable of interest. This kind of analysis is often referred to in the medical literature as “univariable analysis”.

So we start by fitting the model

fit_uni_stress <- glm(low_grip~stress,family = binomial,data=datiSHAREStress)

As a reminder, the glm function automatically uses the logit as link function unless we state otherwise.
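
For clarity, the same model with the link function made explicit (a sketch, equivalent to the default):

fit_uni_stress <- glm(low_grip ~ stress, family = binomial(link = "logit"),
                      data = datiSHAREStress)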

The results of the model can be obtained with

summary(fit_uni_stress)
## 
## Call:
## glm(formula = low_grip ~ stress, family = binomial, data = datiSHAREStress)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.5660     0.0501 -51.217   <2e-16 ***
## stress       -0.5174     0.4204  -1.231    0.218    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3142.8  on 6148  degrees of freedom
## Residual deviance: 3141.1  on 6147  degrees of freedom
## AIC: 3145.1
## 
## Number of Fisher Scoring iterations: 5

It seems that we can’t reject the null hypothesis for \(\beta_{stress}\).

We can also obtain the 95% Confidence Interval

confint.default(fit_uni_stress)
##                 2.5 %     97.5 %
## (Intercept) -2.664221 -2.4678285
## stress      -1.341440  0.3066133
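
The same quantities can be reported on the odds ratio scale by exponentiating the coefficient and its confidence limits (a sketch):

exp(coef(fit_uni_stress)["stress"])
exp(confint.default(fit_uni_stress)["stress", ])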

We can fit more univariable models to look for possible confounders (previously discussed with an expert, if possible) of the exposure-outcome relationship.

fit_uni_age <- glm(low_grip~age,family = binomial,data=datiSHAREStress)
summary(fit_uni_age)
## 
## Call:
## glm(formula = low_grip ~ age, family = binomial, data = datiSHAREStress)
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -9.505544   0.419383  -22.67   <2e-16 ***
## age          0.101366   0.005822   17.41   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3142.8  on 6148  degrees of freedom
## Residual deviance: 2817.0  on 6147  degrees of freedom
## AIC: 2821
## 
## Number of Fisher Scoring iterations: 6
fit_uni_job <- glm(low_grip~no_paid_job,family = binomial,data=datiSHAREStress)
summary(fit_uni_job)
## 
## Call:
## glm(formula = low_grip ~ no_paid_job, family = binomial, data = datiSHAREStress)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.61992    0.05134 -51.030  < 2e-16 ***
## no_paid_job  1.13184    0.21544   5.254 1.49e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3142.8  on 6148  degrees of freedom
## Residual deviance: 3120.8  on 6147  degrees of freedom
## AIC: 3124.8
## 
## Number of Fisher Scoring iterations: 5
fit_uni_sex <- glm(low_grip~female,family = binomial,data=datiSHAREStress)
summary(fit_uni_sex)
## 
## Call:
## glm(formula = low_grip ~ female, family = binomial, data = datiSHAREStress)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -2.56532    0.07178 -35.738   <2e-16 ***
## female      -0.01918    0.09955  -0.193    0.847    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3142.8  on 6148  degrees of freedom
## Residual deviance: 3142.8  on 6147  degrees of freedom
## AIC: 3146.8
## 
## Number of Fisher Scoring iterations: 5
fit_uni_move <- glm(low_grip~house_move,family = binomial,data=datiSHAREStress)
summary(fit_uni_move)
## 
## Call:
## glm(formula = low_grip ~ house_move, family = binomial, data = datiSHAREStress)
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)    
## (Intercept)  -2.3735     0.1307 -18.161   <2e-16 ***
## house_move   -0.2329     0.1413  -1.648   0.0993 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3142.8  on 6148  degrees of freedom
## Residual deviance: 3140.2  on 6147  degrees of freedom
## AIC: 3144.2
## 
## Number of Fisher Scoring iterations: 5
fit_uni_ses <- glm(low_grip~ses,family = binomial,data=datiSHAREStress)
summary(fit_uni_ses)
## 
## Call:
## glm(formula = low_grip ~ ses, family = binomial, data = datiSHAREStress)
## 
## Coefficients:
##                       Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -2.69592    0.07324 -36.809  < 2e-16 ***
## sesGreat Difficulties  0.70745    0.23408   3.022  0.00251 ** 
## sesSome Difficulties   0.37973    0.14288   2.658  0.00787 ** 
## sesFairly Easily       0.11084    0.11422   0.970  0.33182    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3142.8  on 6148  degrees of freedom
## Residual deviance: 3129.9  on 6145  degrees of freedom
## AIC: 3137.9
## 
## Number of Fisher Scoring iterations: 5

When we have covariates with many levels, such as ses, we should also look at a global test for the variable as a whole (here a likelihood ratio test via drop1), rather than relying only on the level-specific Wald tests.

drop1(fit_uni_ses,test="Chisq")
## Single term deletions
## 
## Model:
## low_grip ~ ses
##        Df Deviance    AIC    LRT Pr(>Chi)   
## <none>      3129.8 3137.8                   
## ses     3   3142.8 3144.8 12.988 0.004663 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

7.4 Multivariable Logistic Model

We can now fit a multivariable logistic model with the variables that were found to be associated with the outcome in the previous analysis and that we suspect could act as confounders of the main exposure of interest.
Since we do not have the possibility to discuss with an expert, we decide to include in the multivariable regression model all covariates with a p-value < 0.1. This criterion may not be the best one, but it is widely applied in clinical and epidemiological research.

We obviously include stress in the model, since it is our exposure of interest.

We will treat the other variables as possible confounders in the model, and we will explore whether significant associations are present with the exposure of interest. In principle, in the explanatory setting, we should start from a DAG and make the causal assumptions explicit (in order to estimate a causal effect!), but we do not have subject-matter experts here, so in the end we will limit ourselves to a cautious interpretation of the estimated association.

fit_multi <-  glm(low_grip~stress+no_paid_job+age+house_move+ses,family = binomial,data=datiSHAREStress)
summary(fit_multi)
## 
## Call:
## glm(formula = low_grip ~ stress + no_paid_job + age + house_move + 
##     ses, family = binomial, data = datiSHAREStress)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -9.486409   0.448437 -21.154  < 2e-16 ***
## stress                -0.149441   0.438194  -0.341 0.733075    
## no_paid_job            0.408889   0.233225   1.753 0.079568 .  
## age                    0.101474   0.005924  17.129  < 2e-16 ***
## house_move            -0.196669   0.147181  -1.336 0.181472    
## sesGreat Difficulties  0.939131   0.249104   3.770 0.000163 ***
## sesSome Difficulties   0.489696   0.149577   3.274 0.001061 ** 
## sesFairly Easily       0.066057   0.118419   0.558 0.576962    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3142.8  on 6148  degrees of freedom
## Residual deviance: 2790.2  on 6141  degrees of freedom
## AIC: 2806.2
## 
## Number of Fisher Scoring iterations: 6

This model estimates the coefficient of each variable adjusting for all the others. This is a key concept of multivariable regression and it is what allows us to control for (measured) confounding.

At first sight it seems that our exposure of interest is not associated with the outcome.

Knowledge of the context from which the data come can be used to make hypotheses about possible interactions between confounders and the exposure.

In this case we want to test for an interaction between no_paid_job and stress. In other words we want to see if the association between the outcome and the exposure could vary for different levels of no_paid_job.

We then fit the model:

fit_multi2 <-  glm(low_grip~stress*no_paid_job+house_move+age+ses,family = binomial,data=datiSHAREStress)
summary(fit_multi2)
## 
## Call:
## glm(formula = low_grip ~ stress * no_paid_job + house_move + 
##     age + ses, family = binomial, data = datiSHAREStress)
## 
## Coefficients:
##                        Estimate Std. Error z value Pr(>|z|)    
## (Intercept)           -9.544923   0.450526 -21.186  < 2e-16 ***
## stress                -0.812176   0.603047  -1.347 0.178049    
## no_paid_job            0.267759   0.243706   1.099 0.271901    
## house_move            -0.181885   0.147844  -1.230 0.218604    
## age                    0.102246   0.005946  17.197  < 2e-16 ***
## sesGreat Difficulties  0.927537   0.249828   3.713 0.000205 ***
## sesSome Difficulties   0.490455   0.149577   3.279 0.001042 ** 
## sesFairly Easily       0.060946   0.118563   0.514 0.607225    
## stress:no_paid_job     3.189101   1.054703   3.024 0.002497 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 3142.8  on 6148  degrees of freedom
## Residual deviance: 2781.6  on 6140  degrees of freedom
## AIC: 2799.6
## 
## Number of Fisher Scoring iterations: 6

As a reminder, using * introduces both the main effects and the interaction effect of the variables into the model.

Alternatively, we can also fit the same model as follows:

fit_multi2 <-  glm(low_grip~stress+no_paid_job+age+ses+stress:no_paid_job+house_move,family = binomial,data=datiSHAREStress)

This second way of adding an interaction term is useful in case we have multiple interactions involving the same variable.

It seems that a significant interaction is present between stress and the no_paid_job confounder. Note that in any case, the main effect of stress is still not significant in the model.

Now that we have fitted these two models, which one should we choose?

We can use a formal statistical test to compare the two nested models:

anova(fit_multi,fit_multi2,test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: low_grip ~ stress + no_paid_job + age + house_move + ses
## Model 2: low_grip ~ stress + no_paid_job + age + ses + stress:no_paid_job + 
##     house_move
##   Resid. Df Resid. Dev Df Deviance Pr(>Chi)   
## 1      6141     2790.2                        
## 2      6140     2781.6  1   8.5938 0.003373 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The null hypothesis of this test is that the simpler model (i.e. the one with the smaller number of regression parameters) is no different from the more complex model. In this case we reject that hypothesis at the 5% significance level, so we keep the model with the interaction.
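
An information-criterion view of the same comparison (a sketch): the model with the smaller AIC is preferred, which here is consistent with the LR test above.

AIC(fit_multi, fit_multi2)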

7.5 Calibration for the logistic regression model (simplest basic method)

Even though we are estimating an explanatory model here, this does not mean that we are not interested at all in evaluating whether the predicted probabilities are in line with the observed event rates. A basic procedure to evaluate this aspect in logistic regression is the Hosmer-Lemeshow test (see for details: https://onlinelibrary.wiley.com/doi/book/10.1002/0471722146). In brief, the null hypothesis of the test is that the predicted probabilities (split into ordered groups) are in line with the observed event rates in each group.

library(generalhoslem)
## Loading required package: reshape
## 
## Attaching package: 'reshape'
## The following object is masked from 'package:Matrix':
## 
##     expand
## The following object is masked from 'package:data.table':
## 
##     melt
## The following object is masked from 'package:lubridate':
## 
##     stamp
## The following object is masked from 'package:dplyr':
## 
##     rename
## The following objects are masked from 'package:tidyr':
## 
##     expand, smiths
logitgof(datiSHAREStress$low_grip,fitted(fit_multi2))
## 
##  Hosmer and Lemeshow test (binary model)
## 
## data:  datiSHAREStress$low_grip, fitted(fit_multi2)
## X-squared = 5.6329, df = 8, p-value = 0.6883

The model doesn’t show evidence of poor goodness of fit/calibration. Remember that when we are instead working with models developed specifically for prediction, we should use more refined analyses such as bootstrap overfitting-corrected calibration curves.
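
As a reminder of what that would look like for a prediction model (a sketch only; fit_lrm is a hypothetical refit, and calibrate() from the rms package requires x = TRUE and y = TRUE so that it can resample):

library(rms)
fit_lrm <- lrm(low_grip ~ stress * no_paid_job + age + ses + house_move,
               data = datiSHAREStress, x = TRUE, y = TRUE)
plot(calibrate(fit_lrm, B = 200))  # bootstrap overfitting-corrected calibration curve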

7.6 Interpretation of the logistic regression model

The next step is interpreting the results obtained from the model, which is one of the most important parts of the analysis to be discussed with experts. Of course we are keeping things simple here: in reality we would have had to consider many other candidate confounders for the outcome in order to properly control for confounding in this observational study, at least for the measured ones.

7.6.1 Continuous covariates

We obtain the OR, which is the measure of association selected when reporting and interpreting the results of a logistic model.

We start with the OR estimated for age:

exp(fit_multi2$coefficients)[4]
##      age 
## 1.107656

In general, an OR > 1 indicates an increase in the probability of the outcome whereas an OR < 1 indicates a decrease. But the question is: with respect to what? We know that the OR is a relative measure of association, so we must always ask ourselves what comparison we are making when we report an OR we have estimated. The above output does not mean anything by itself.

The first important thing to remember is that age is a continuous variable (measured in years) and in the model we have assumed it has a linear effect on the log-odds of the outcome. When we obtain the \(\hat{OR}\) by exponentiating the coefficient for age, we obtain the OR estimate for an increase of one year of age. Since \(\hat{OR}>1\), the probability of developing low grip increases as age increases. Specifically, a subject has odds of having low grip 1.11 times greater than a subject who is 1 year younger.

For continuous variables it is important to report an OR for a difference that is relevant in the application at hand; a proper choice will depend on the scale on which the variable is measured and on the magnitude of the estimated coefficient.

For example, here 1 year may not be very relevant from an epidemiological point of view, so we may want to report an OR for a 5-year difference in age:

exp(fit_multi2$coefficients*5)[4]
##     age 
## 1.66734

7.6.2 Binary Covariates

What if we want to interpret the OR of a binary variable? Let’s consider the variable house_move. If it is coded as a 0/1 numerical variable, the coefficient always refers to level 1. In this case 1 stands for having moved country in the past, so if the estimated coefficient were significant we could say that subjects who moved seem to have odds lower by about 17% (\(1-\hat{OR}\)) compared to people who have never moved countries in their life. However, when we look at the 95% CI we observe that it contains 1, so the association does not seem to be statistically significant.

#OR
exp(fit_multi2$coefficients)[8]
## house_move 
##  0.8336973
exp(confint.default(fit_multi2))[8,]
##     2.5 %    97.5 % 
## 0.6239695 1.1139185

7.6.3 Categorical Covariates

The interpretation of the OR for categorical variables is similar: we have to keep in mind that we are always comparing each of the variable levels to the reference level. Here, it seems (interestingly!) that the odds of low grip is similar for subjects with a good or fairly good self-perceived socio-economic status, while it increases as the economic situation gets worse.

# OR
exp(fit_multi2$coefficients)[5:7]
## sesGreat Difficulties  sesSome Difficulties      sesFairly Easily 
##              2.528273              1.633059              1.062842
# 95% CI
exp(confint.default(fit_multi2))[5:7,]
##                          2.5 %   97.5 %
## sesGreat Difficulties 1.549423 4.125513
## sesSome Difficulties  1.218097 2.189385
## sesFairly Easily      0.842456 1.340880

7.6.4 Interactions

Finally, we will obtain the \(\hat{OR}s\) for stress and no_paid_job. Since there is an interaction involved, we have to be a bit more careful and we have to consider the two variables together. So first of all we have to know which is our reference group. In our case this is the group who have not experienced a stressful period and have done some paid job in their life (the reference levels for both variables involved in the interaction). Then, we can consider all combinations of the levels of the variables.

If we want the \(\hat{OR}\) for undergoing a stressful period and having had a paid job with respect to not having had a period of stress and having had a paid job, we simply exponentiate the estimated main-effect coefficient for stress (keeping the other covariates in the model at the same values):

exp(fit_multi2$coefficients)[2]
##    stress 
## 0.4438912

and we do not reject the null hypothesis that it is equal to 1 (no significant effect):

exp(confint.default(fit_multi2))[2,]
##     2.5 %    97.5 % 
## 0.1361326 1.4474079

On the other hand, if we want the \(\hat{OR}\) for never having had a paid job and not being exposed to stress, with respect to having had a paid job and not being exposed to stress, we simply exponentiate the coefficient for the main effect of the variable no_paid_job:

exp(fit_multi2$coefficients)[3]
## no_paid_job 
##    1.307032

This effect is also not statistically significant, which means that there does not seem to be a difference in the risk of low hand grip with regard to job experience in the non-exposed (no stress) group, keeping all other covariates fixed.

exp(confint.default(fit_multi2))[3,]
##     2.5 %    97.5 % 
## 0.8106679 2.1073152

Last, we can obtain the OR for subjects who were stressed and had never had a paid job with respect to the reference group (no stressful period and some paid job):

exp(fit_multi2$coefficients[9]+fit_multi2$coefficients[2]+fit_multi2$coefficients[3])
## stress:no_paid_job 
##           14.07899

So in this subgroup it seems that the probability of developing low grip increases.

However, we should also evaluate the 95% CI for this \(\hat{OR}\).
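
One way to obtain it directly from the fitted glm is to combine the relevant coefficients with the model’s variance-covariance matrix (a sketch, assuming the coefficient names shown in the summary above):

b   <- coef(fit_multi2)
V   <- vcov(fit_multi2)
idx <- c("stress", "no_paid_job", "stress:no_paid_job")
se  <- sqrt(sum(V[idx, idx]))              # variances plus twice the covariances
exp(sum(b[idx]) + c(-1, 1) * 1.96 * se)    # approximate 95% CI for the OR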

Alternatively, we can re-fit the multivariable logistic model using the rms R package:

library(rms)
dd <- datadist(datiSHAREStress)
## Warning in datadist(datiSHAREStress): intervallo is constant
options(datadist='dd') #define ranges of the covariates 

fit_multi2b <- lrm(low_grip~stress*no_paid_job+age+ses+house_move,data=datiSHAREStress)

The code is quite similar; the main difference is that first we have to run the datadist function, which stores the distribution summaries of the variables.

The print function returns output very similar to that of summary() for the glm object.

print(fit_multi2b)
## Logistic Regression Model
## 
## lrm(formula = low_grip ~ stress * no_paid_job + age + ses + house_move, 
##     data = datiSHAREStress)
## 
##                        Model Likelihood       Discrimination    Rank Discrim.    
##                              Ratio Test              Indexes          Indexes    
## Obs          6149    LR chi2     361.20       R2       0.143    C       0.752    
##  0           5714    d.f.             8      R2(8,6149)0.056    Dxy     0.504    
##  1            435    Pr(> chi2) <0.0001    R2(8,1212.7)0.253    gamma   0.505    
## max |deriv| 2e-06                             Brier    0.061    tau-a   0.066    
## 
##                        Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept              -9.5449 0.4505 -21.19 <0.0001 
## stress                 -0.8122 0.6031  -1.35 0.1781  
## no_paid_job             0.2678 0.2437   1.10 0.2719  
## age                     0.1022 0.0059  17.20 <0.0001 
## ses=Great Difficulties  0.9275 0.2498   3.71 0.0002  
## ses=Some Difficulties   0.4905 0.1496   3.28 0.0010  
## ses=Fairly Easily       0.0609 0.1186   0.51 0.6072  
## house_move             -0.1819 0.1478  -1.23 0.2186  
## stress * no_paid_job    3.1891 1.0547   3.02 0.0025

The summary, on the other hand, gives the estimates for the coefficients as well as for the ORs. Of note, by default for continuous variables such as age the function calculates the OR for the difference between the \(1^{st}\) and the \(3^{rd}\) quartile.
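
If a different contrast is preferred, summary() accepts the low and high values of a covariate directly (a sketch; ages 65 and 70 are chosen only for illustration):

summary(fit_multi2b, age = c(65, 70))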

summary(fit_multi2b)
##              Effects              Response : low_grip 
## 
##  Factor                          Low High Diff. Effect    S.E.     Lower 0.95
##  stress                           0   1.0  1.0  -0.812180 0.603110 -1.99430  
##   Odds Ratio                      0   1.0  1.0   0.443890       NA  0.13612  
##  no_paid_job                      0   1.0  1.0   0.267760 0.243710 -0.20990  
##   Odds Ratio                      0   1.0  1.0   1.307000       NA  0.81067  
##  age                             58  70.8 12.8   1.308700 0.076105  1.15960  
##   Odds Ratio                     58  70.8 12.8   3.701500       NA  3.18860  
##  house_move                       0   1.0  1.0  -0.181880 0.147840 -0.47165  
##   Odds Ratio                      0   1.0  1.0   0.833700       NA  0.62397  
##  ses - Great Difficulties:Easily  1   2.0   NA   0.927540 0.249830  0.43788  
##   Odds Ratio                      1   2.0   NA   2.528300       NA  1.54940  
##  ses - Some Difficulties:Easily   1   3.0   NA   0.490460 0.149580  0.19729  
##   Odds Ratio                      1   3.0   NA   1.633100       NA  1.21810  
##  ses - Fairly Easily:Easily       1   4.0   NA   0.060946 0.118560 -0.17143  
##   Odds Ratio                      1   4.0   NA   1.062800       NA  0.84246  
##  Upper 0.95
##  0.36990   
##  1.44760   
##  0.74541   
##  2.10730   
##  1.45790   
##  4.29700   
##  0.10788   
##  1.11390   
##  1.41720   
##  4.12550   
##  0.78362   
##  2.18940   
##  0.29333   
##  1.34090   
## 
## Adjusted to: stress=0 no_paid_job=0

So far, the ORs shown for stress and job experience are those for the main effects. We can obtain the OR for the combined exposure (stressed and never had a paid job versus the reference group) with the built-in contrast function:

c <- contrast(fit_multi2b,
         list(stress=1,no_paid_job=1), list(stress=0,no_paid_job=0),type="average")

c
##    Contrast      S.E.    Lower    Upper    Z Pr(>|z|)
## 11 2.644684 0.8328779 1.012273 4.277095 3.18   0.0015
## 
## Confidence intervals are 0.95 individual intervals
#OR and 95% CI
exp(c$Contrast)
##        1 
## 14.07899
exp(c$Lower)
##       1 
## 2.75185
exp(c$Upper)
##        1 
## 72.03085

So the combined exposure of stress and never having had a paid job is significantly associated with the outcome, consistent with the significant interaction term.

7.7 Conclusions

Recall our initial research question: whether a stressful period could be associated with the occurrence of muscular weakness in the over-50 population. Based on our analysis, it seems that it is not the single event of a stressful period that increases the occurrence of muscular weakness, but rather the joint impact of having had a stressful period coupled with never having had a paid job, adjusting for (or independently of) age, house moves and socio-economic status.

8 Regression on count data: Poisson regression

8.1 Introduction

In nature, an event usually takes place in a very small amount of time. At any given point in time, the probability of encountering such an event is small. Instead of the probability of the single event, we now focus on the frequency of the events as a density, which means the incidence or ‘count’ of events over a period of time. (While time is one dimension, the same concept applies to the density of counts of small objects in a two-dimensional area or three-dimensional space.) Moreover, we can assume that one event is independent of another and that the densities in different units of time vary with a variance equal to the average density. We can approximate this kind of random process using the Poisson random variable.

When the probability of having an event is affected by some factors, a model is needed to explain and predict the density. Variation among different strata of a population can be explained by various combinations of factors. Within each stratum (defined by a combination of covariates), the distribution of the events is assumed random. Poisson regression deals with outcome variables that are counts in nature (whole numbers or integers). Independent covariates have the same role as those encountered in linear and logistic regression.

In epidemiology, Poisson regression is very often used for analysing grouped population-based or cohort data, looking at incidence density among the person-time contributed by subjects who share similar characteristics of interest. Poisson regression is one of three common regression models used in epidemiological studies. The other two, which are more commonly used, are linear regression and logistic regression, and have already been covered. The last family comprises survival methods, which we will explain in block 4.

There are two main assumptions for Poisson regression:

  1. risk is homogeneous among person-times contributed by different subjects who share the same characteristics of interest (e.g. sex, age-group) and the same period.

  2. asymptotically, i.e. as the sample size becomes large, the mean of the counts is equal to the variance (see the short simulation below).
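
A quick simulation can illustrate assumption 2: for Poisson counts the sample mean and the sample variance should be approximately equal (this is only a didactic check with simulated data).

set.seed(123)
counts <- rpois(n = 10000, lambda = 4)   # simulated counts with an average density of 4 per unit time
mean(counts)   # close to 4
var(counts)    # also close to 4: mean and variance coincide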

Straightforward linear regression methods (assuming constant variance and normal errors) are not appropriate for count data for four main reasons:

  1. the linear model might lead to the prediction of negative counts

  2. the variance of the response may increase with the mean

  3. the errors will not be normally distributed

  4. zero counts are difficult to handle in transformations

Moreover, in studies that involve the time dimension, different subjects may have different person-times of exposure; analysing risk factors while ignoring differences in person-time is wrong.

Poisson regression overcomes these limitations.
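
As a minimal sketch of how person-time enters the model (simulated data, illustrative variable names only): the covariates act on the log of the incidence rate and the person-time is passed as an offset.

set.seed(1)
d <- data.frame(exposed    = rbinom(200, 1, 0.5),
                persontime = runif(200, 1, 10))
# true rates: exp(-2) per unit person-time if unexposed, exp(-2 + 0.7) if exposed
d$events <- rpois(200, lambda = d$persontime * exp(-2 + 0.7 * d$exposed))
fit <- glm(events ~ exposed, offset = log(persontime),
           family = poisson, data = d)
exp(coef(fit)["exposed"])   # estimated incidence rate ratio, roughly exp(0.7)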

Note that in survival analysis using, for example, Cox regression (see block 4), a hazard ratio is estimated for each covariate in the model, not the incidence density in each subgroup. The Cox model focuses on “how long until an event occurs” (time to event), whereas the Poisson regression model focuses on “how many events occur in a given interval”.

8.2 Example of Poisson model: the Montana smelter study

The dataset Montana was extracted from an occupational cohort study conducted to test the association between respiratory deaths (outcome) and exposure to arsenic in the industry, after adjusting for various other risk factors/confounders. The main outcome variable is respdeath. This is the count of the number of deaths among personyrs or personyears of subjects in each category. The other variables are independent covariates including age group agegr, period of employment period, starting time of employment start and the level of exposure to arsenic during the study period arsenic (the exposure of interest). Read in the data first and examine the variables.

data(Montana)
summary(Montana)
##    respdeath        personyrs           agegr           period     
##  Min.   : 0.000   Min.   :    4.2   Min.   :1.000   Min.   :1.000  
##  1st Qu.: 0.000   1st Qu.:  127.5   1st Qu.:2.000   1st Qu.:1.000  
##  Median : 1.000   Median :  335.1   Median :3.000   Median :2.000  
##  Mean   : 2.421   Mean   : 1096.4   Mean   :2.605   Mean   :2.404  
##  3rd Qu.: 3.000   3rd Qu.:  925.7   3rd Qu.:4.000   3rd Qu.:3.000  
##  Max.   :19.000   Max.   :12451.3   Max.   :4.000   Max.   :4.000  
##     starting        arsenic     
##  Min.   :1.000   Min.   :1.000  
##  1st Qu.:1.000   1st Qu.:1.250  
##  Median :1.000   Median :2.000  
##  Mean   :1.456   Mean   :2.474  
##  3rd Qu.:2.000   3rd Qu.:3.000  
##  Max.   :2.000   Max.   :4.000

The last four variables are classed as integers. We need to tell R to interpret them as categorical variables, or factors, and attach labels to each of the levels. This can be done using the factor command with a ‘labels’ argument included.

Montana$agegr    <- factor(Montana$agegr, labels=c("40-49","50-59","60-69","70-79"))
Montana$period   <- factor(Montana$period, labels=c("1938-1949", "1950-1959","1960-1969", "1970-1977"))
Montana$start    <- factor(Montana$starting, labels=c("pre-1925", "1925 & after"))
Montana$arsenic1 <- factor(Montana$arsenic, labels=c("<1 year", "1-4years","5-14 years", "15+ years"))
summary(Montana)
##    respdeath        personyrs         agegr          period      starting    
##  Min.   : 0.000   Min.   :    4.2   40-49:24   1938-1949:30   Min.   :1.000  
##  1st Qu.: 0.000   1st Qu.:  127.5   50-59:28   1950-1959:32   1st Qu.:1.000  
##  Median : 1.000   Median :  335.1   60-69:31   1960-1969:28   Median :1.000  
##  Mean   : 2.421   Mean   : 1096.4   70-79:31   1970-1977:24   Mean   :1.456  
##  3rd Qu.: 3.000   3rd Qu.:  925.7                             3rd Qu.:2.000  
##  Max.   :19.000   Max.   :12451.3                             Max.   :2.000  
##     arsenic               start          arsenic1 
##  Min.   :1.000   pre-1925    :62   <1 year   :29  
##  1st Qu.:1.250   1925 & after:52   1-4years  :29  
##  Median :2.000                     5-14 years:29  
##  Mean   :2.474                     15+ years :27  
##  3rd Qu.:3.000                                    
##  Max.   :4.000

We keep the original arsenic variable unchanged for use later on.

8.2.1 Descriptive analyses : breakdown of incidence by age and period

Let us explore the person-years breakdown by age and period. Firstly, create a table for total person-years:

tapply(Montana$personyrs, list(Montana$period, Montana$agegr), sum) -> table.pyears

Carry out the same procedure for number of deaths, and compute the table of incidence per 10,000 person years for each cell.

tapply(Montana$respdeath, list(Montana$period, Montana$agegr), sum) -> table.deaths
table.inc10000 <- table.deaths/table.pyears*10000
table.inc10000
##              40-49    50-59    60-69    70-79
## 1938-1949 5.424700 17.13102 34.95107 26.53928
## 1950-1959 3.344638 23.47556 49.01961 64.82632
## 1960-1969 4.341516 20.49375 58.23803 55.06608
## 1970-1977 4.408685 14.77747 44.09949 80.81413

Now, create a time-series plot of the incidence:

plot.ts(table.inc10000, plot.type="single", xlab=" ",ylab="#/10,000 person-years", xaxt="n", col=c("black",
"blue","red","green"), lty=c(2,1,1,2), las=1)
points(rep(1:4,4), table.inc10000, pch=22, cex=table.pyears/sum(table.pyears) * 20)
title(main = "Incidence by age and period")
axis(side = 1, at = 1:4, labels = levels(Montana$period))
legend(3.2,40, legend=levels(Montana$agegr)[4:1], col=c("green","red", "blue", "black"), bg = "white", lty=c(2,1,1,2))

The above graph shows that the older age group is generally associated with a higher risk. On the other hand, the sample size (reflected by the size of the squares at each point) decreases with age. The possibility of a confounding effect of age on the exposure of interest can better be examined by using Poisson regression.

8.2.2 Modelling with Poisson regression

Let’s estimate a Poisson regression model taking into account only period as a covariate:

model1 <- glm(respdeath ~ period, offset = log(personyrs), family = poisson, data = Montana)
summary(model1)
## 
## Call:
## glm(formula = respdeath ~ period, family = poisson, data = Montana, 
##     offset = log(personyrs))
## 
## Coefficients:
##                 Estimate Std. Error z value Pr(>|z|)    
## (Intercept)      -6.4331     0.1715 -37.511   <2e-16 ***
## period1950-1959   0.2365     0.2117   1.117   0.2638    
## period1960-1969   0.3781     0.2001   1.889   0.0588 .  
## period1970-1977   0.4830     0.2036   2.372   0.0177 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 376.02  on 113  degrees of freedom
## Residual deviance: 369.27  on 110  degrees of freedom
## AIC: 596.05
## 
## Number of Fisher Scoring iterations: 6

The option offset = log(personyrs) makes personyrs act as the denominator (the person-time) for the counts of respdeath. The logarithmic transformation is needed because the default link function for the Poisson family is the natural log, so the offset must enter the linear predictor on the same (log) scale.

Reminder: an important criterion in the choice of a link function for the various families of distributions is that the fitted values stay within reasonable bounds. Specifying a log link (the default for Poisson) ensures that the fitted counts are all greater than or equal to zero. For more details on the default links of the families available for generalized linear modelling, see help(family) in R.

The first model above, with period as the only independent variable, suggests that the death rate increased over time. The model can be tested for goodness of fit, checking whether the Poisson assumptions mentioned earlier have been violated.

8.2.3 Goodness of fit test

To test the goodness of fit of the Poisson model, type:

poisgof(model1)
## $results
## [1] "Goodness-of-fit test for Poisson assumption"
## 
## $chisq
## [1] 369.2654
## 
## $df
## [1] 110
## 
## $p.value
## [1] 9.578443e-30

The component ‘$chisq’ is actually the model (residual) deviance, a quantity reflecting the level of error. A chi-squared value that is large relative to its degrees of freedom indicates a significant violation of the Poisson assumption (p < 0.05). If only the P value is wanted, the command can be shortened.

poisgof(model1)$p.value
## [1] 9.578443e-30

The P value is very small indicating a poor fit.
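
The same p-value can be obtained by hand, since the poisgof statistic is simply the residual deviance compared with a chi-squared distribution on the residual degrees of freedom:

pchisq(deviance(model1), df.residual(model1), lower.tail = FALSE)   # same value as poisgof(model1)$p.value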

Note: this method works under the assumption of a large sample size. An alternative is to fit a negative binomial regression model (not covered in the slides; see the optional material at the end of this section).

We now add the second independent variable ‘agegr’ to the model:

model2 <- glm(respdeath ~ agegr + period, offset = log(personyrs), family = poisson, data = Montana)
AIC(model2)
## [1] 396.6407

The AIC (Akaike Information Criterion) has decreased remarkably from ‘model1’ (596.05) to ‘model2’ (396.64), indicating that the first model fits considerably worse.

poisgof(model2)$p.value
## [1] 0.0003381328

But ‘model2’ still violates the Poisson assumption.

model3 <- glm(respdeath ~ agegr, offset = log(personyrs), family = poisson, data = Montana)
AIC(model3)
## [1] 394.4699
poisgof(model3)$p.value
## [1] 0.0003295142

Removal of ‘period’ further reduces the AIC, although the model still violates the Poisson assumption to roughly the same extent as the previous one (a formal likelihood ratio test of ‘period’ is sketched below). The next step is to add the exposure of interest, ‘arsenic1’.
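
As mentioned, whether ‘period’ can be dropped can also be checked with a likelihood ratio test between the two nested models (suggested check, output not shown):

anova(model3, model2, test = "Chisq")   # LR test for the effect of period, adjusted for age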

model4 <- glm(respdeath ~ agegr + arsenic1, offset = log(personyrs), family = poisson, data = Montana)
summary(model4)
## 
## Call:
## glm(formula = respdeath ~ agegr + arsenic1, family = poisson, 
##     data = Montana, offset = log(personyrs))
## 
## Coefficients:
##                    Estimate Std. Error z value Pr(>|z|)    
## (Intercept)         -7.9947     0.2237 -35.739  < 2e-16 ***
## agegr50-59           1.4624     0.2454   5.960 2.53e-09 ***
## agegr60-69           2.3502     0.2380   9.873  < 2e-16 ***
## agegr70-79           2.5987     0.2564  10.136  < 2e-16 ***
## arsenic11-4years     0.8041     0.1577   5.099 3.42e-07 ***
## arsenic15-14 years   0.5957     0.2059   2.893  0.00381 ** 
## arsenic115+ years    0.9976     0.1758   5.673 1.40e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 376.02  on 113  degrees of freedom
## Residual deviance: 122.25  on 107  degrees of freedom
## AIC: 355.04
## 
## Number of Fisher Scoring iterations: 5
poisgof(model4)$p.value
## [1] 0.1486902

Fortunately, ‘model4’ has a much lower AIC than ‘model3’ and no longer violates the Poisson assumption.

If we change the reference level for arsenic to ‘1-4 years’ and compare the other levels against it:

Montana$arsenic.b <- relevel(Montana$arsenic1,ref="1-4years")
model5 <- glm(respdeath ~ agegr + arsenic.b, offset = log(personyrs), family = poisson, data = Montana)
summary(model5)
## 
## Call:
## glm(formula = respdeath ~ agegr + arsenic.b, family = poisson, 
##     data = Montana, offset = log(personyrs))
## 
## Coefficients:
##                     Estimate Std. Error z value Pr(>|z|)    
## (Intercept)          -7.1905     0.2466 -29.156  < 2e-16 ***
## agegr50-59            1.4624     0.2454   5.960 2.53e-09 ***
## agegr60-69            2.3502     0.2380   9.873  < 2e-16 ***
## agegr70-79            2.5987     0.2564  10.136  < 2e-16 ***
## arsenic.b<1 year     -0.8041     0.1577  -5.099 3.42e-07 ***
## arsenic.b5-14 years  -0.2085     0.2326  -0.896     0.37    
## arsenic.b15+ years    0.1935     0.2071   0.935     0.35    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 376.02  on 113  degrees of freedom
## Residual deviance: 122.25  on 107  degrees of freedom
## AIC: 355.04
## 
## Number of Fisher Scoring iterations: 5

There does not appear to be any further increase in the risk of death beyond 4 years of exposure to arsenic, so it may be worth collapsing the exposure into just two levels:

Montana$arsenic2 <- Montana$arsenic1
levels(Montana$arsenic2) <- c("<1 year", rep("1+ years", 3))
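
It can be worth double-checking the recoding before refitting: a simple cross-tabulation of the old and new variables should show 29 subjects in "<1 year" and 29 + 29 + 27 = 85 in "1+ years" (counts taken from the earlier summary).

table(Montana$arsenic1, Montana$arsenic2)   # old levels (rows) vs collapsed levels (columns)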

model6 <- glm(respdeath ~ agegr + arsenic2,offset=log(personyrs), family=poisson, data=Montana)
summary(model6)
## 
## Call:
## glm(formula = respdeath ~ agegr + arsenic2, family = poisson, 
##     data = Montana, offset = log(personyrs))
## 
## Coefficients:
##                  Estimate Std. Error z value Pr(>|z|)    
## (Intercept)       -8.0086     0.2233 -35.859  < 2e-16 ***
## agegr50-59         1.4702     0.2453   5.994 2.04e-09 ***
## agegr60-69         2.3661     0.2372   9.976  < 2e-16 ***
## agegr70-79         2.6238     0.2548  10.297  < 2e-16 ***
## arsenic21+ years   0.8109     0.1210   6.699 2.09e-11 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for poisson family taken to be 1)
## 
##     Null deviance: 376.02  on 113  degrees of freedom
## Residual deviance: 125.02  on 109  degrees of freedom
## AIC: 353.8
## 
## Number of Fisher Scoring iterations: 5

At this stage we would accept ‘model6’ as the final model, since it has the smallest AIC among all the models we have tried. We conclude that exposure to arsenic for at least one year is associated with a statistically significant increase in the risk of respiratory death by a factor of exp(0.8109), i.e. about 2.25 times, independently of age.

8.3 Another usage of the model: prediction of the incidence rate

In the Poisson model the outcome is a count. In a generalized linear model, the relationship between the outcome (as measured in the data and predicted by the model through the fitted values) and the linear predictor is determined by the link function, which relates the mean of the outcome to the linear predictor. By default, the link function for the Poisson family is the natural logarithm. With log(person-time) as the offset, the linear predictor therefore models the log of the incidence rate.

The matrix ‘table.inc10000’ (created previously) gives the crude incidence rate by age group and period. Each of the Poisson regression models above can be used to compute the predicted incidence rate when the variables in the model are given. For example, to compute the incidence rate from a population of 100,000 people aged between 40-49 years who were exposed to arsenic for less than one year using ‘model6’, type:

newdata <- as.data.frame(list(agegr="40-49",arsenic2="<1 year", personyrs=100000))
predict(model6, newdata, type="response")
##        1 
## 33.25736

This population would have an estimated incidence rate of 33.26 per 100,000 person-years.
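
The same number can be reproduced by hand: for the reference categories of both covariates the linear predictor reduces to the intercept, so the predicted count is exp(intercept) multiplied by the person-years.

exp(coef(model6)["(Intercept)"]) * 100000   # ~33.26, matching predict() above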

8.4 Interpretation of the regression coefficients : the incidence rate ratio (IRR)

In this example all subjects pool their follow-up times, and this total is called ‘person-time’; it is used as the denominator for the events, resulting in an ‘incidence rate’. Comparing the incidence rates of two groups of subjects defined by their exposure status is fairer than comparing crude risks based on the population at baseline. The ratio between the incidence rates of two groups is called the incidence rate ratio (IRR), an improved form of the relative risk.

In ‘model6’, for example, we want to compute the incidence rate ratio between the subjects exposed to arsenic for one or more years against those exposed for less than one year.

The shortest way to obtain this IRR is to exponentiate the coefficient of the variable ‘arsenic2’, which is the fifth coefficient in the model.

coef(model6)
##      (Intercept)       agegr50-59       agegr60-69       agegr70-79 
##       -8.0086495        1.4701505        2.3661111        2.6237532 
## arsenic21+ years 
##        0.8108698
exp(coef(model6)[5])
## arsenic21+ years 
##         2.249864
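
As a sanity check, the same IRR can be obtained as the ratio of two predicted rates that differ only in the arsenic exposure (person-years is set to 1 so that each prediction is a rate per person-year):

nd <- data.frame(agegr = "40-49",
                 arsenic2 = c("<1 year", "1+ years"),
                 personyrs = 1)
rates <- predict(model6, nd, type = "response")
rates[2] / rates[1]   # equals exp(coef(model6)[5]) = 2.25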

The following steps explain how the 95% confidence interval for all variables can be obtained.

coeff <- coef(model6)
coeff.95ci <- cbind(coeff, confint(model6))
coeff.95ci
##                       coeff      2.5 %    97.5 %
## (Intercept)      -8.0086495 -8.4780952 -7.598323
## agegr50-59        1.4701505  1.0094838  1.976026
## agegr60-69        2.3661111  1.9239329  2.858522
## agegr70-79        2.6237532  2.1411250  3.145495
## arsenic21+ years  0.8108698  0.5724818  1.047494

Note that confint(model6) provides 95% confidence intervals for the model coefficients.

IRR.95ci <- round(exp(coeff.95ci), 1)[-1,]

The required values are obtained from exponentiating the last matrix with the first row or intercept removed. The display is rounded to 1 decimal place for better viewing. Then the matrix column is labelled and the 95% CI is displayed.

colnames(IRR.95ci) <- c("IRR", "lower95ci", "upper95ci")
IRR.95ci
##                   IRR lower95ci upper95ci
## agegr50-59        4.3       2.7       7.2
## agegr60-69       10.7       6.8      17.4
## agegr70-79       13.8       8.5      23.2
## arsenic21+ years  2.2       1.8       2.9

A simpler way is to use the command idr.display in epiDisplay (note that it uses the term IDR, incidence density ratio).

idr.display(model6, decimal=1)
## 
## Poisson regression predicting respdeath with offset = log(personyrs) 
##  
##                               crude IDR(95%CI) adj. IDR(95%CI)  P(Wald's test)
## agegr: ref.=40-49                                                             
##    50-59                      4.5 (2.8,7.3)    4.3 (2.7,7)      < 0.001       
##    60-69                      11.3 (7.1,17.9)  10.7 (6.7,17)    < 0.001       
##    70-79                      14.5 (8.8,23.8)  13.8 (8.4,22.7)  < 0.001       
##                                                                               
## arsenic2: 1+ years vs <1 year 2.5 (2,3.1)      2.2 (1.8,2.9)    < 0.001       
##                                                                               
##                               P(LR-test)
## agegr: ref.=40-49             < 0.001   
##    50-59                                
##    60-69                                
##    70-79                                
##                                         
## arsenic2: 1+ years vs <1 year < 0.001   
##                                         
## Log-likelihood = -171.9
## No. of observations = 114
## AIC value = 353.8

The command idr.display gives results to 3 decimal places by default. This can easily be changed by the user.

8.5 Optional material: negative binomial regression

Recall that one of the assumptions for a valid Poisson model is that the mean and variance of the count variable are equal. The negative binomial distribution is a more general distribution for ‘count’ response data, allowing for greater dispersion (variance) of the counts. In practice it is quite common for the variance of the outcome to be larger than the mean; this is called overdispersion. If a count variable is overdispersed, Poisson regression underestimates the standard errors of the predictor variables. When overdispersion is evident, one solution is to specify that the errors follow a negative binomial distribution. Negative binomial regression gives coefficients similar to those from Poisson regression (here they are identical) but larger standard errors, and the interpretation of the results is the same as for Poisson regression.

Take as an example the counts of water containers infested with mosquito larvae in a field survey; the data are contained in the dataset DHF99.
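
Before fitting a negative binomial model, a quick way to gauge overdispersion in any fitted Poisson model is to compare the Pearson chi-squared statistic with the residual degrees of freedom (a ratio well above 1 suggests the variance exceeds the mean). For instance, for the Montana models fitted above:

dispersion <- function(fit) sum(residuals(fit, type = "pearson")^2) / df.residual(fit)
dispersion(model1)   # period-only model: clearly above 1
dispersion(model6)   # final Montana model: much closer to 1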

library(MASS)
data(DHF99)
summary(DHF99)
##     houseid          village             education     containers     
##  Min.   :  1.00   Min.   :  1.00   Primary    :168   Min.   : 0.0000  
##  1st Qu.: 77.75   1st Qu.: 23.00   Secondary  : 36   1st Qu.: 0.0000  
##  Median :154.50   Median : 51.00   High school: 34   Median : 0.0000  
##  Mean   :174.27   Mean   : 48.56   Bachelor   : 25   Mean   : 0.3512  
##  3rd Qu.:269.25   3rd Qu.: 75.00   Other      : 37   3rd Qu.: 0.0000  
##  Max.   :385.00   Max.   :105.00                     Max.   :11.0000  
##                                                      NA's   :1        
##   viltype   
##  rural:180  
##  urban: 72  
##  slum : 48  
##             
##             
##             
## 
describeBy(DHF99$containers, group=DHF99$viltype)
## 
##  Descriptive statistics by group 
## group: rural
##    vars   n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 179 0.49 1.25      0     0.2   0   0  11    11 4.58     29.1 0.09
## ------------------------------------------------------------ 
## group: urban
##    vars  n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 72 0.07 0.26      0       0   0   0   1     1 3.32     9.13 0.03
## ------------------------------------------------------------ 
## group: slum
##    vars  n mean   sd median trimmed mad min max range skew kurtosis   se
## X1    1 48 0.25 0.53      0    0.15   0   0   2     2 1.93     2.84 0.08

The function for performing a negative binomial glm is glm.nb. This function is located in the MASS library. In addition, a very helpful function for selecting the best model based on the AIC value is the step function, which is located in the stats library (a default library loaded on start-up).

model.poisson <- step(glm(containers ~ education + viltype, family=poisson, data=DHF99))
## Start:  AIC=513.3
## containers ~ education + viltype
## 
##             Df Deviance    AIC
## - education  4   363.67 505.92
## <none>           363.05 513.30
## - viltype    2   397.54 543.79
## 
## Step:  AIC=505.92
## containers ~ viltype
## 
##           Df Deviance    AIC
## <none>         363.67 505.92
## - viltype  2   398.52 536.76
model.nb      <- step(glm.nb(containers ~ education + viltype, data=DHF99))
## Start:  AIC=432.05
## containers ~ education + viltype
## 
##             Df Deviance    AIC
## - education  4   156.64 424.23
## <none>           156.46 432.05
## - viltype    2   174.78 446.37
## 
## Step:  AIC=424.23
## containers ~ viltype
## 
##           Df Deviance    AIC
## <none>         156.37 424.23
## - viltype  2   174.95 438.80
coef(model.poisson)
##  (Intercept) viltypeurban  viltypeslum 
##   -0.7100490   -1.9571792   -0.6762454
coef(model.nb)
##  (Intercept) viltypeurban  viltypeslum 
##   -0.7100490   -1.9571792   -0.6762454

Both models end up with only ‘viltype’ being selected, and the estimated coefficients are essentially identical. The Poisson model, however, shows significant lack of fit (overdispersion), whereas the negative binomial model does not.

poisgof(model.poisson)$p.value
## [1] 0.004387831
poisgof(model.nb)$p.value
## [1] 1

The AIC of the negative binomial model is also better (smaller) than that of the Poisson model.

model.poisson$aic
## [1] 505.9154
model.nb$aic
## [1] 426.2283

Finally, the main differences to be examined are their standard errors, the 95% confidence intervals and P values.

summary(model.poisson)$coefficients
##                Estimate Std. Error   z value     Pr(>|z|)
## (Intercept)  -0.7100490  0.1066000 -6.660873 2.722059e-11
## viltypeurban -1.9571792  0.4597429 -4.257117 2.070800e-05
## viltypeslum  -0.6762454  0.3077286 -2.197538 2.798202e-02
summary(model.nb)$coefficients
##                Estimate Std. Error   z value     Pr(>|z|)
## (Intercept)  -0.7100490  0.1731160 -4.101578 4.103414e-05
## viltypeurban -1.9571792  0.5255707 -3.723912 1.961591e-04
## viltypeslum  -0.6762454  0.4274174 -1.582166 1.136116e-01
idr.display(model.poisson)
## 
## Poisson regression predicting # infested vessels 
##  
##                          IDR(95%CI)        P(Wald's test) P(LR-test)
## Village type: ref.=rural                                  < 0.001   
##    urban                 0.14 (0.06,0.35)  < 0.001                  
##    slum                  0.51 (0.28,0.93)  0.028                    
##                                                                     
## Log-likelihood = -249.9577
## No. of observations = 299
## AIC value = 505.9154
attach(DHF99)
idr.display(model.nb)
## 
## Negative binomial regression predicting # infested vessels 
##  
##                          IDR(95%CI)        P(Wald's test) P(LR-test)
## Village type: ref.=rural                                  < 0.001   
##    urban                 0.14 (0.05,0.4)   < 0.001                  
##    slum                  0.51 (0.22,1.18)  0.114                    
##                                                                     
## Log-likelihood = -209.1141
## No. of observations = 299
## AIC value = 426.2283

The standard errors from the negative binomial model are larger than those from the Poisson model, resulting in wider 95% confidence intervals and larger P values. From the Poisson regression, both the urban community and the slum area had a significantly lower risk of infestation than the rural reference (rate ratios of about 0.14 and 0.51, i.e. reductions of roughly 86% and 50%, respectively). However, from the negative binomial regression, only the urban community had a significantly lower risk.