First of all, install and load the required libraries:
library(rms)
library(Hmisc)
library(gtsummary)
library(magrittr)
library(tidyverse)
library(haven)
library(ggplot2)
library(ggpubr)
library(pROC)
library(here)
library(psych)
This practical contains a case study on developing, describing, and validating a regression prediction model.
In the original data set of the study there was also a follow up time to the events, but we did not consider here this aspect (just for didactic purposes we will consider it as another candidate predictor in the model).
The original dataset was composed by 506 patients with prostate cancer from the Byar and Green paper (see if interested: D. P. Byar and S. B. Green. The choice of treatment for cancer patients based on covariate information: Application to prostate cancer. Bulletin Cancer, Paris, 67:477-488, 1980).
These data come originally from a randomized trial (RCT) comparing four treatments for stage 3 and 4 prostate cancer, with almost equal numbers of patients on placebo and each of three doses of estrogen.
In this trial, larger doses of estrogen reduced the effect of prostate cancer but at the cost of increased risk of cardiovascular death. (Four patients had missing values on many variables. These patients have been excluded from consideration).
Our goal here is to predict the probability of incurring in cardiovascular-cerebrovascular death in the subset of patients that experienced this event or the prostate cancer death.
We will subset the original dataset of those patients dying from prostate cancer, heart or vascular disease, or cerebrovascular disease.
Here are the description of the variables contained in the dataset:
Some a priori information from clinicians: stage is defined by ap as well as X-ray results. Of the patients in stage 3, 92% have ap ≤ 0.8. Of those in stage 4, 93% have ap > 0.8.
Since stage can be predicted almost certainly from ap, we do not consider stage in the analyses. (You can check this relationship on the data).
This part is an exercise that you will do on your own: first of all, let’s load the original dataset:
prostate <- read_sav(here("datasets", "prostate.sav"))
prostate <-labelled::to_factor(prostate, labelled_only = TRUE, strict = TRUE, unclass = TRUE)
summary.data.frame(prostate)
## patno stage rx dtime
## Min. : 1.0 Min. :3.000 placebo :127 Min. : 0.00
## 1st Qu.:126.2 1st Qu.:3.000 0.2 mg estrogen:124 1st Qu.:14.25
## Median :251.5 Median :3.000 1.0 mg estrogen:126 Median :34.00
## Mean :251.7 Mean :3.424 5.0 mg estrogen:125 Mean :36.13
## 3rd Qu.:376.8 3rd Qu.:4.000 3rd Qu.:57.75
## Max. :506.0 Max. :4.000 Max. :76.00
##
## status age wt
## alive :148 Min. :48.00 Min. : 69.00
## dead - prostatic ca :130 1st Qu.:70.00 1st Qu.: 90.00
## dead - heart or vascular : 96 Median :73.00 Median : 98.00
## dead - cerebrovascular : 31 Mean :71.46 Mean : 99.03
## dead - other specific non-ca: 28 3rd Qu.:76.00 3rd Qu.:107.00
## dead - other ca : 25 Max. :89.00 Max. :152.00
## (Other) : 44 NA's :1 NA's :2
## pf hx sbp dbp
## normal activity :450 Min. :0.0000 Min. : 8.00 Min. : 4.000
## in bed < 50% daytime: 37 1st Qu.:0.0000 1st Qu.:13.00 1st Qu.: 7.000
## in bed > 50% daytime: 13 Median :0.0000 Median :14.00 Median : 8.000
## confined to bed : 2 Mean :0.4243 Mean :14.35 Mean : 8.149
## 3rd Qu.:1.0000 3rd Qu.:16.00 3rd Qu.: 9.000
## Max. :1.0000 Max. :30.00 Max. :18.000
##
## ekg hg sz
## normal :168 Min. : 5.899 Min. : 0.00
## heart strain :150 1st Qu.:12.299 1st Qu.: 5.00
## old MI : 75 Median :13.699 Median :11.00
## rhythmic disturb & electrolyte ch: 51 Mean :13.446 Mean :14.63
## heart block or conduction def : 26 3rd Qu.:14.699 3rd Qu.:21.00
## (Other) : 24 Max. :21.199 Max. :69.00
## NA's : 8 NA's :5
## sg ap bm
## Min. : 5.00 Min. : 0.100 Min. :0.0000
## 1st Qu.: 9.00 1st Qu.: 0.500 1st Qu.:0.0000
## Median :10.00 Median : 0.700 Median :0.0000
## Mean :10.31 Mean : 12.176 Mean :0.1633
## 3rd Qu.:11.00 3rd Qu.: 2.975 3rd Qu.:0.0000
## Max. :15.00 Max. :999.875 Max. :1.0000
## NA's :11
Here are the suggestions for the initial data analysis:
Do you think that some variables could be re-coded combining an infrequent category with the next category?
For example, pay attention to the frequencies of ekg and pf.
Hint from a discussion with clinicians: in the ekg variable we could collapse the normal and benign conditions.
For pf, you can combine in bed > 50% daytime with confined to bed in one category.
Then, here there is another question for you: based on the study design (RCT), do you expect to find significant statistical differences between baseline characteristics of groups defined by the levels of the treatment rx ? Build a descriptive table comparing the four groups and evaluating the global p value across groups. (Hint: for numerical variables you could use the Kruskal-Wallis test and for categorical ones the Chi-square test).
Give a look to the distribution’s shape of the continuous variables : do you think that some scale transformation (for example taking the logarithm) could help in improving the normality of some of them ?
Now: subset the original dataset selecting only the patients died for cardiovascular–cerebrovascular death or prostate cancer death. Create the outcome variable cvd identifying as “1” patients died for cardiovascular–cerebrovascular death and as “0” patients died for prostate cancer. (We will use then logistic regression for the prediction model).
The reduced dataset “prostate_rid” is created. Here we begin with the prediction model development part.
First of all, we upload the reduced dataset with the recoded variables that have been suggested in the IDA phase:
prostate_rid <- read_sav(here("datasets","prostate_rid.sav"))
prostate_rid <-labelled::to_factor(prostate_rid, labelled_only = TRUE, strict = TRUE, unclass = TRUE)
summary.data.frame(prostate_rid)
## patno stage rx dtime
## Min. : 3.0 Min. :3.000 placebo :71 Min. : 0.00
## 1st Qu.:131.0 1st Qu.:3.000 0.2 mg estrogen:67 1st Qu.:11.00
## Median :265.0 Median :4.000 1.0 mg estrogen:49 Median :24.00
## Mean :260.5 Mean :3.525 5.0 mg estrogen:70 Mean :25.26
## 3rd Qu.:395.0 3rd Qu.:4.000 3rd Qu.:37.00
## Max. :506.0 Max. :4.000 Max. :71.00
##
## status age wt
## dead - prostatic ca :130 Min. :48.00 Min. : 71.00
## dead - heart or vascular: 96 1st Qu.:70.00 1st Qu.: 89.00
## dead - cerebrovascular : 31 Median :73.00 Median : 97.00
## alive : 0 Mean :71.55 Mean : 98.03
## dead - pulmonary embolus: 0 3rd Qu.:76.00 3rd Qu.:106.00
## dead - other ca : 0 Max. :88.00 Max. :152.00
## (Other) : 0
## pf hx sbp dbp
## normal activity :221 Min. :0.0000 Min. : 8.00 Min. : 4.000
## in bed < 50% daytime: 24 1st Qu.:0.0000 1st Qu.:13.00 1st Qu.: 7.000
## in bed > 50% daytime: 10 Median :1.0000 Median :14.00 Median : 8.000
## confined to bed : 2 Mean :0.5097 Mean :14.53 Mean : 8.144
## 3rd Qu.:1.0000 3rd Qu.:16.00 3rd Qu.: 9.000
## Max. :1.0000 Max. :24.00 Max. :13.000
##
## ekg hg sz
## heart strain :83 Min. : 5.899 Min. : 0.00
## normal :74 1st Qu.:12.000 1st Qu.: 6.00
## old MI :44 Median :13.600 Median :14.00
## rhythmic disturb & electrolyte ch:28 Mean :13.301 Mean :16.99
## benign :13 3rd Qu.:14.600 3rd Qu.:24.50
## (Other) :10 Max. :21.199 Max. :69.00
## NA's : 5 NA's :2
## sg ap bm ap80 ekg_rec
## Min. : 5.00 Min. : 0.10 Min. :0.0000 Min. :0.0000 bngn :87
## 1st Qu.: 9.00 1st Qu.: 0.60 1st Qu.:0.0000 1st Qu.:0.0000 rd&ec:28
## Median :11.00 Median : 1.10 Median :0.0000 Median :0.0000 hbocd:10
## Mean :10.75 Mean : 16.17 Mean :0.2451 Mean :0.4553 hrts :83
## 3rd Qu.:12.00 3rd Qu.: 7.00 3rd Qu.:0.0000 3rd Qu.:1.0000 MI :44
## Max. :15.00 Max. :999.88 Max. :1.0000 Max. :1.0000 NA's : 5
## NA's :5
## pf_rec status_num subset
## normal :221 dead - prostatic ca :130 Min. :1
## bed_under_50 : 24 dead - heart or vascular: 96 1st Qu.:1
## bed_over_50_conf: 12 dead - cerebrovascular : 31 Median :1
## alive : 0 Mean :1
## dead - pulmonary embolus: 0 3rd Qu.:1
## dead - other ca : 0 Max. :1
## (Other) : 0
## cvd log_ap
## Min. :0.0000 Min. :-2.30268
## 1st Qu.:0.0000 1st Qu.:-0.51087
## Median :0.0000 Median : 0.09518
## Mean :0.4942 Mean : 0.70723
## 3rd Qu.:1.0000 3rd Qu.: 1.94591
## Max. :1.0000 Max. : 6.90763
##
Here we define ranges of the variables contained in the dataset (this is a useful function both for data description and for the subsequent interpretation of the regression coefficients):
options(datadist='dd')
dd <-datadist(prostate_rid)
dd
## patno stage rx dtime status
## Low:effect 131.00000 3 <NA> 11 <NA>
## Adjust to 265.00000 3 placebo 24 dead - prostatic ca
## High:effect 395.00000 4 <NA> 37 <NA>
## Low:prediction 24.96109 3 placebo 1 alive
## High:prediction 482.03891 4 5.0 mg estrogen 58 dead - unknown cause
## Low 3.00000 3 placebo 0 alive
## High 506.00000 4 5.0 mg estrogen 71 dead - unknown cause
## age wt pf hx sbp dbp ekg
## Low:effect 70.00000 89.0000 <NA> 0 13 7 <NA>
## Adjust to 73.00000 97.0000 normal activity 0 14 8 heart strain
## High:effect 76.00000 106.0000 <NA> 1 16 9 <NA>
## Low:prediction 55.96109 76.0000 normal activity 0 11 6 normal
## High:prediction 80.00000 125.0389 confined to bed 1 19 10 recent MI
## Low 48.00000 71.0000 normal activity 0 8 4 normal
## High 88.00000 152.0000 confined to bed 1 24 13 recent MI
## hg sz sg ap bm ap80 ekg_rec
## Low:effect 12.000000 6.00000 9 0.59997559 0 0 <NA>
## Adjust to 13.599609 14.00000 11 1.09985352 0 0 bngn
## High:effect 14.599609 24.50000 12 7.00000000 1 1 <NA>
## Low:prediction 9.298828 2.00000 8 0.19998169 0 0 bngn
## High:prediction 16.503800 46.11673 14 54.97592412 1 1 MI
## Low 5.899414 0.00000 5 0.09999084 0 0 bngn
## High 21.199219 69.00000 15 999.87500000 1 1 MI
## pf_rec status_num subset cvd log_ap
## Low:effect <NA> <NA> 1 0 -0.51086631
## Adjust to normal dead - prostatic ca 1 0 0.09517701
## High:effect <NA> <NA> 1 1 1.94591015
## Low:prediction normal alive 1 0 -1.60952947
## High:prediction bed_over_50_conf dead - unknown cause 1 1 4.00677071
## Low normal alive 1 0 -2.30267665
## High bed_over_50_conf dead - unknown cause 1 1 6.90763027
##
## Values:
##
## stage : 3 4
## rx : placebo 0.2 mg estrogen 1.0 mg estrogen 5.0 mg estrogen
## status :
## [1] alive dead - prostatic ca
## [3] dead - heart or vascular dead - cerebrovascular
## [5] dead - pulmonary embolus dead - other ca
## [7] dead - respiratory disease dead - other specific non-ca
## [9] dead - unspecified non-ca dead - unknown cause
## pf : normal activity in bed < 50% daytime in bed > 50% daytime confined to bed
## hx : 0 1
## dbp : 4 5 6 7 8 9 10 11 12 13
## ekg :
## [1] normal benign
## [3] rhythmic disturb & electrolyte ch heart block or conduction def
## [5] heart strain old MI
## [7] recent MI
## bm : 0 1
## ap80 : 0 1
## ekg_rec : bngn rd&ec hbocd hrts MI
## pf_rec : normal bed_under_50 bed_over_50_conf
## status_num :
## [1] alive dead - prostatic ca
## [3] dead - heart or vascular dead - cerebrovascular
## [5] dead - pulmonary embolus dead - other ca
## [7] dead - respiratory disease dead - other specific non-ca
## [9] dead - unspecified non-ca dead - unknown cause
## subset : 1
## cvd : 0 1
Now we visually explore the assumption about the linearity of the effect for continuous candidate predictors by means of nonparametric smoothed regression:
par(mfrow=c(2,2))
plsmo(prostate_rid$age,prostate_rid$cvd)
plsmo(prostate_rid$wt, prostate_rid$cvd)
plsmo(prostate_rid$sbp,prostate_rid$cvd)
plsmo(prostate_rid$dbp,prostate_rid$cvd)
A strange behaviour is observed for wt i.e. the Weight Index [calculated as wt= wt(kg)-ht(cm)+200]; the others variables could be roughly approximated by a linear effect.
par(mfrow=c(2,3))
plsmo(prostate_rid$hg, prostate_rid$cvd)
plsmo(prostate_rid$sz, prostate_rid$cvd)
plsmo(prostate_rid$sg, prostate_rid$cvd)
plsmo(prostate_rid$ap, prostate_rid$cvd)
plsmo(prostate_rid$dtime, prostate_rid$cvd)
Again, a strange behaviour is observed for dtime (Months of Follow-up, as it is obvioulsy expected…), the others could be roughly approximated by a linear effect. Remind that this dataset is quite small, so we have to be parsimonious in the number of coefficients to be estimated !
Note that, being this dataset quite small, we will use all the dataset for the training phase of the regression model, but after for evaluating the performance of the model we will apply boostrap techniques. This is obviously not the optimal way to proceed, but when there is a constraint of a low-moderate sample size (as often happen in medical data collected prospectively) there is no possibility of further splitting the sample. With such a small sample size we could not afford an initial split of the study sample in separate training and test subgroups. This is also a great obstacle in applying machine learning techniques to small-moderate medical datasets.
Even if univariable analysis is generally discouraged to select candidate predictors (it works only with perfectly uncorrelated variables, that is nearly impossible in biological problems), it is another exploratory step that could be useful also to discuss then with clinicians about the indications of associations that are present in the data. So first of all we can create a table with univariable results, assuming for the moment a linear effect for all the continuous variables :
prostate_rid %>% select(-patno,-stage,-pf,-ekg,-status,-status_num,-subset) %>%
tbl_uvregression(method = glm, # glm function
method.args = list(family = binomial),# logistic model
exponentiate = T, # report OR
y=cvd,# outcome variable
conf.level = 0.95,
pvalue_fun = function(x) style_pvalue(x, digits = 3))
Characteristic | N | OR1 | 95% CI1 | p-value |
---|---|---|---|---|
rx | 257 | |||
placebo | — | — | ||
0.2 mg estrogen | 0.65 | 0.33, 1.27 | 0.210 | |
1.0 mg estrogen | 1.13 | 0.55, 2.36 | 0.736 | |
5.0 mg estrogen | 1.73 | 0.89, 3.41 | 0.107 | |
Months of Follow-up | 257 | 1.00 | 0.98, 1.01 | 0.707 |
Age in Years | 257 | 1.07 | 1.03, 1.11 | <0.001 |
Weight Index = wt(kg)-ht(cm)+200 | 257 | 1.00 | 0.99, 1.02 | 0.726 |
History of Cardiovascular Disease | 257 | 3.96 | 2.37, 6.71 | <0.001 |
Systolic Blood Pressure/10 | 257 | 1.09 | 0.99, 1.21 | 0.082 |
Diastolic Blood Pressure/10 | 257 | 1.30 | 1.09, 1.57 | 0.005 |
Serum Hemoglobin (g/100ml) | 257 | 1.17 | 1.04, 1.33 | 0.012 |
Size of Primary Tumor (cm^2) | 255 | 0.93 | 0.91, 0.95 | <0.001 |
Combined Index of Stage and Hist. Grade | 252 | 0.56 | 0.47, 0.66 | <0.001 |
Serum Prostatic Acid Phosphatase | 257 | 0.93 | 0.89, 0.96 | <0.001 |
Bone Metastases | 257 | 0.28 | 0.15, 0.52 | <0.001 |
ap80 | 257 | 5.74 | 3.38, 9.96 | <0.001 |
ekg_rec | 252 | |||
bngn | — | — | ||
rd&ec | 1.35 | 0.57, 3.20 | 0.492 | |
hbocd | 1.04 | 0.25, 3.91 | 0.955 | |
hrts | 2.04 | 1.11, 3.78 | 0.023 | |
MI | 2.73 | 1.30, 5.88 | 0.009 | |
pf_rec | 257 | |||
normal | — | — | ||
bed_under_50 | 1.36 | 0.58, 3.28 | 0.477 | |
bed_over_50_conf | 0.09 | 0.00, 0.47 | 0.021 | |
log_ap | 257 | 0.55 | 0.45, 0.65 | <0.001 |
1 OR = Odds Ratio, CI = Confidence Interval |
Note that here all the odds ratios (OR) are expressing variations by 1-unit on the scale of the independent variable.
This table should be commented with the clinicians in order to discuss if the observed statistical significant associations are in line with what it is expected from the biological perspective.
Now, we can come back to the functional forms for the continuous variables: try now to use a nonlinear effect for wt and for dtime
fit.splines.wt <- lrm(cvd ~ rcs(wt,4), data=prostate_rid)
print(fit.splines.wt)
## Logistic Regression Model
##
## lrm(formula = cvd ~ rcs(wt, 4), data = prostate_rid)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 257 LR chi2 0.66 R2 0.003 C 0.518
## 0 130 d.f. 3 R2(3,257)0.000 Dxy 0.036
## 1 127 Pr(> chi2) 0.8833 R2(3,192.7)0.000 gamma 0.037
## max |deriv| 2e-12 Brier 0.249 tau-a 0.018
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -2.5286 3.5066 -0.72 0.4708
## wt 0.0300 0.0418 0.72 0.4731
## wt' -0.1046 0.1446 -0.72 0.4693
## wt'' 0.3319 0.4557 0.73 0.4664
summary(fit.splines.wt)
## Effects Response : cvd
##
## Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
## wt 89 106 17 -0.14415 0.32689 -0.78484 0.49654
## Odds Ratio 89 106 17 0.86576 NA 0.45619 1.64300
We observe that there is no significant effect for the nonlinear components of the spline.
fit.splines.dtime <- lrm(cvd ~ rcs(dtime,4), data=prostate_rid)
print(fit.splines.dtime)
## Logistic Regression Model
##
## lrm(formula = cvd ~ rcs(dtime, 4), data = prostate_rid)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 257 LR chi2 8.04 R2 0.041 C 0.591
## 0 130 d.f. 3 R2(3,257)0.019 Dxy 0.181
## 1 127 Pr(> chi2) 0.0452 R2(3,192.7)0.026 gamma 0.185
## max |deriv| 1e-06 Brier 0.242 tau-a 0.091
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept 0.6180 0.3867 1.60 0.1100
## dtime -0.0991 0.0422 -2.35 0.0188
## dtime' 0.4881 0.1833 2.66 0.0078
## dtime'' -1.2217 0.4470 -2.73 0.0063
summary(fit.splines.dtime)
## Effects Response : cvd
##
## Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
## dtime 11 37 26 0.73072 0.35467 0.035589 1.4259
## Odds Ratio 11 37 26 2.07660 NA 1.036200 4.1614
Here instead there is a significant nonlinear effect for the dtime variable (as expected being the follow up…) In this case, the interpretation of the odds ratio is reported as a contrast between two specific values (37 vs 11), i.e. the third and first quartile of the variable. This is done in order to help in the interpretation of the non-linear effect modelled by the spline function.
Let’s visualize the effect of this candidate predictor modeled in a nonlinear way:
ggplot(Predict(fit.splines.dtime))
This is quite similar to the non parametric regression previously used:
plsmo(prostate_rid$dtime, prostate_rid$cvd)
We decide to start with a full (no selection of predictors) binary logistic model assuming linearity for continuous predictors except for dtime:
fit.multi <- lrm(cvd ~ sz + sg + log(ap) + sbp + dbp + age + wt +hg + ekg_rec + pf_rec + bm + hx + rx +rcs(dtime,4), data =prostate_rid)
Note here that there is usually an upper-bound on the number of candidate predictors that can be used in a [logistic/poisson/survival] regression model dictated by the number of events occurring in the population (we will take into account this aspect when the sample size issue in multivariable modelling will be discussed).
Let’s see the results:
print(fit.multi)
## Frequencies of Missing Values Due to Each Variable
## cvd sz sg ap sbp dbp age wt hg ekg_rec
## 0 2 5 0 0 0 0 0 0 5
## pf_rec bm hx rx dtime
## 0 0 0 0 0
##
## Logistic Regression Model
##
## lrm(formula = cvd ~ sz + sg + log(ap) + sbp + dbp + age + wt +
## hg + ekg_rec + pf_rec + bm + hx + rx + rcs(dtime, 4), data = prostate_rid)
##
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 245 LR chi2 150.16 R2 0.611 C 0.906
## 0 121 d.f. 22 R2(22,245)0.407 Dxy 0.811
## 1 124 Pr(> chi2) <0.0001 R2(22,183.7)0.502 gamma 0.811
## max |deriv| 2e-09 Brier 0.124 tau-a 0.407
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -4.7151 3.6106 -1.31 0.1916
## sz -0.0635 0.0174 -3.64 0.0003
## sg -0.3281 0.1247 -2.63 0.0085
## ap -0.3808 0.1468 -2.59 0.0095
## sbp -0.0278 0.0931 -0.30 0.7654
## dbp 0.4263 0.1727 2.47 0.0136
## age 0.1004 0.0317 3.16 0.0016
## wt -0.0172 0.0148 -1.16 0.2442
## hg 0.0708 0.1060 0.67 0.5042
## ekg_rec=rd&ec -0.5096 0.6760 -0.75 0.4509
## ekg_rec=hbocd -1.3951 0.8889 -1.57 0.1166
## ekg_rec=hrts 0.1518 0.4586 0.33 0.7407
## ekg_rec=MI 0.0106 0.6210 0.02 0.9864
## pf_rec=bed_under_50 1.3081 0.8132 1.61 0.1077
## pf_rec=bed_over_50_conf -2.2541 1.2487 -1.81 0.0710
## bm 0.0981 0.5397 0.18 0.8557
## hx 1.3719 0.4195 3.27 0.0011
## rx=0.2 mg estrogen -0.3661 0.5245 -0.70 0.4852
## rx=1.0 mg estrogen 0.9277 0.5579 1.66 0.0964
## rx=5.0 mg estrogen 0.7030 0.5238 1.34 0.1796
## dtime -0.1256 0.0691 -1.82 0.0691
## dtime' 0.6755 0.2979 2.27 0.0234
## dtime'' -1.7766 0.7258 -2.45 0.0144
summary(fit.multi)
## Effects Response : cvd
##
## Factor Low High Diff. Effect S.E.
## sz 6.00000 24.5 18.5000 -1.174100 0.32268
## Odds Ratio 6.00000 24.5 18.5000 0.309110 NA
## sg 9.00000 12.0 3.0000 -0.984350 0.37423
## Odds Ratio 9.00000 12.0 3.0000 0.373680 NA
## ap 0.59998 7.0 6.4000 -0.935590 0.36060
## Odds Ratio 0.59998 7.0 6.4000 0.392360 NA
## sbp 13.00000 16.0 3.0000 -0.083384 0.27943
## Odds Ratio 13.00000 16.0 3.0000 0.920000 NA
## dbp 7.00000 9.0 2.0000 0.852590 0.34535
## Odds Ratio 7.00000 9.0 2.0000 2.345700 NA
## age 70.00000 76.0 6.0000 0.602180 0.19035
## Odds Ratio 70.00000 76.0 6.0000 1.826100 NA
## wt 89.00000 106.0 17.0000 -0.292780 0.25140
## Odds Ratio 89.00000 106.0 17.0000 0.746180 NA
## hg 12.00000 14.6 2.5996 0.184020 0.27553
## Odds Ratio 12.00000 14.6 2.5996 1.202000 NA
## bm 0.00000 1.0 1.0000 0.098134 0.53966
## Odds Ratio 0.00000 1.0 1.0000 1.103100 NA
## hx 0.00000 1.0 1.0000 1.371900 0.41948
## Odds Ratio 0.00000 1.0 1.0000 3.943000 NA
## dtime 11.00000 37.0 26.0000 1.053600 0.55450
## Odds Ratio 11.00000 37.0 26.0000 2.867900 NA
## ekg_rec - rd&ec:bngn 1.00000 2.0 NA -0.509560 0.67596
## Odds Ratio 1.00000 2.0 NA 0.600760 NA
## ekg_rec - hbocd:bngn 1.00000 3.0 NA -1.395100 0.88895
## Odds Ratio 1.00000 3.0 NA 0.247820 NA
## ekg_rec - hrts:bngn 1.00000 4.0 NA 0.151770 0.45859
## Odds Ratio 1.00000 4.0 NA 1.163900 NA
## ekg_rec - MI:bngn 1.00000 5.0 NA 0.010576 0.62096
## Odds Ratio 1.00000 5.0 NA 1.010600 NA
## pf_rec - bed_under_50:normal 1.00000 2.0 NA 1.308100 0.81317
## Odds Ratio 1.00000 2.0 NA 3.699200 NA
## pf_rec - bed_over_50_conf:normal 1.00000 3.0 NA -2.254100 1.24870
## Odds Ratio 1.00000 3.0 NA 0.104970 NA
## rx - 0.2 mg estrogen:placebo 1.00000 2.0 NA -0.366070 0.52451
## Odds Ratio 1.00000 2.0 NA 0.693460 NA
## rx - 1.0 mg estrogen:placebo 1.00000 3.0 NA 0.927670 0.55790
## Odds Ratio 1.00000 3.0 NA 2.528600 NA
## rx - 5.0 mg estrogen:placebo 1.00000 4.0 NA 0.702950 0.52383
## Odds Ratio 1.00000 4.0 NA 2.019700 NA
## Lower 0.95 Upper 0.95
## -1.8065000 -0.54161
## 0.1642300 0.58181
## -1.7178000 -0.25087
## 0.1794500 0.77813
## -1.6424000 -0.22882
## 0.1935200 0.79547
## -0.6310500 0.46429
## 0.5320300 1.59090
## 0.1757100 1.52950
## 1.1921000 4.61570
## 0.2291100 0.97525
## 1.2575000 2.65180
## -0.7855300 0.19996
## 0.4558800 1.22140
## -0.3560000 0.72405
## 0.7004700 2.06280
## -0.9595800 1.15590
## 0.3830500 3.17670
## 0.5497700 2.19410
## 1.7329000 8.97190
## -0.0332040 2.14040
## 0.9673400 8.50280
## -1.8344000 0.81529
## 0.1597100 2.25980
## -3.1374000 0.34723
## 0.0433970 1.41510
## -0.7470500 1.05060
## 0.4737600 2.85930
## -1.2065000 1.22760
## 0.2992500 3.41320
## -0.2856600 2.90190
## 0.7515200 18.20900
## -4.7014000 0.19330
## 0.0090825 1.21320
## -1.3941000 0.66195
## 0.2480600 1.93860
## -0.1657900 2.02110
## 0.8472200 7.54690
## -0.3237300 1.72960
## 0.7234500 5.63860
an <- anova (fit.multi)
an
## Wald Statistics Response: cvd
##
## Factor Chi-Square d.f. P
## sz 13.24 1 0.0003
## sg 6.92 1 0.0085
## ap 6.73 1 0.0095
## sbp 0.09 1 0.7654
## dbp 6.09 1 0.0136
## age 10.01 1 0.0016
## wt 1.36 1 0.2442
## hg 0.45 1 0.5042
## ekg_rec 3.60 4 0.4628
## pf_rec 6.11 2 0.0471
## bm 0.03 1 0.8557
## hx 10.70 1 0.0011
## rx 6.67 3 0.0831
## dtime 8.80 3 0.0321
## Nonlinear 8.02 2 0.0182
## TOTAL 62.82 22 <.0001
plot(an)
More than interpreting the single regression coefficients and their corresponding statistical significance (even if in the last phase, i.e. when presenting results to the clinicians also this aspect is important!!), in prediction modelling we are mostly interested in evaluating the accuracy in predictions.
Now we want to evaluate the performance of the full model:
s <- fit.multi$stats
round(s, digits=3)
## Obs Max Deriv Model L.R. d.f. P C
## 245.000 0.000 150.158 22.000 0.000 0.906
## Dxy Gamma Tau-a R2 R2(245) R2(22,245)
## 0.811 0.811 0.407 0.611 0.458 0.407
## R2(183.7) R2(22,183.7) Brier g gr gp
## 0.558 0.502 0.124 3.008 20.245 0.408
C is the AUC (discrimination measure) and Dxy is related to C (Dxy=2*(C-0.5)), so they are both discrimination measures. They measure how the model perform in the relative ranking of the estimated risk.
gamma.hat <- (s['Model L.R.'] - s['d.f.'])/s['Model L.R.']
gamma.hat
## Model L.R.
## 0.8534877
Estimation of the shrinkage coefficient gamma allows the quantification of the amount of overfitting present, and it also allows one to estimate the likelihood that the model will reliably predict new observations. The van Houwelingen-Le Cessie heuristic shrinkage estimate is 0.85, indicating that this model will validate on new data about 15% worse than on this dataset. This is also related to the calibration concept (the absolute ordering of the predicted risks).
Let’s now visualize the predictors effects on the logit scale:
ggplot(Predict(fit.multi))
Now let’s see them on the odds ratio scale:
plot(summary(fit.multi),log=TRUE)
The general function is plot(summary(fit.multi)) but it less readable, for this reason there is the option log=TRUE.
Estimates of the associations are related to the interquartile-range odds ratios for continuous predictors and odds ratios with respect to a reference level for categorical predictors. Numbers at left are upper quartile vs lower quartile or current group vs reference group. The bars represent confidence limits. The intervals are drawn on the log odds ratio scale and labeled on the odds ratio scale. Ranges are on the original scale. Discussing this data with clinicians is also useful for thinking about possible interaction effects between some of the candidate predictors; in this example we will not examine this issue.
Now, forget for a moment our sample size limitation, we want to check if it is worth to expand all the others continuous variables in a full model approach using splines:
fit.all.splines <- lrm(cvd ~ rcs(sz,4) + rcs(sg,4) + rcs(log(ap),4) +rcs(sbp,4) + rcs(dbp,4) + rcs(age,4)
+ rcs(wt,4) + rcs(hg,4) + ekg_rec + pf_rec + bm + hx + rx + rcs(dtime,4),data=prostate_rid)
print(fit.all.splines)
## Frequencies of Missing Values Due to Each Variable
## cvd sz sg ap sbp dbp age wt hg ekg_rec
## 0 2 5 0 0 0 0 0 0 5
## pf_rec bm hx rx dtime
## 0 0 0 0 0
##
## Logistic Regression Model
##
## lrm(formula = cvd ~ rcs(sz, 4) + rcs(sg, 4) + rcs(log(ap), 4) +
## rcs(sbp, 4) + rcs(dbp, 4) + rcs(age, 4) + rcs(wt, 4) + rcs(hg,
## 4) + ekg_rec + pf_rec + bm + hx + rx + rcs(dtime, 4), data = prostate_rid)
##
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 245 LR chi2 177.11 R2 0.686 C 0.933
## 0 121 d.f. 38 R2(38,245)0.433 Dxy 0.865
## 1 124 Pr(> chi2) <0.0001 R2(38,183.7)0.531 gamma 0.865
## max |deriv| 4e-11 Brier 0.104 tau-a 0.434
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -24.1991 13.5603 -1.78 0.0743
## sz -0.2597 0.1295 -2.00 0.0450
## sz' 1.6649 0.9134 1.82 0.0683
## sz'' -3.1513 1.6810 -1.87 0.0608
## sg 0.5706 0.5774 0.99 0.3230
## sg' -4.7064 2.0646 -2.28 0.0226
## sg'' 21.8060 8.9783 2.43 0.0152
## ap -0.4486 0.8526 -0.53 0.5988
## ap' 3.6227 5.5246 0.66 0.5120
## ap'' -7.7643 10.2918 -0.75 0.4506
## sbp -0.0453 0.3717 -0.12 0.9030
## sbp' -0.1906 1.2777 -0.15 0.8814
## sbp'' 0.5496 7.7526 0.07 0.9435
## dbp -0.1280 0.4527 -0.28 0.7773
## dbp' 0.4811 1.0301 0.47 0.6405
## dbp'' 3.6419 6.7889 0.54 0.5916
## age 0.3420 0.1106 3.09 0.0020
## age' -0.3789 0.1770 -2.14 0.0323
## age'' 3.6426 2.0773 1.75 0.0795
## wt 0.0024 0.0893 0.03 0.9788
## wt' -0.0771 0.2949 -0.26 0.7936
## wt'' 0.2732 0.9056 0.30 0.7629
## hg 0.2602 0.3600 0.72 0.4697
## hg' -0.9822 0.8826 -1.11 0.2657
## hg'' 7.9914 5.4689 1.46 0.1439
## ekg_rec=rd&ec -0.6508 0.7960 -0.82 0.4136
## ekg_rec=hbocd -2.0159 1.0458 -1.93 0.0539
## ekg_rec=hrts -0.0923 0.5329 -0.17 0.8625
## ekg_rec=MI -0.1072 0.7061 -0.15 0.8793
## pf_rec=bed_under_50 1.7917 0.9468 1.89 0.0584
## pf_rec=bed_over_50_conf -1.2181 1.3352 -0.91 0.3616
## bm -0.1454 0.6230 -0.23 0.8155
## hx 1.8027 0.5203 3.46 0.0005
## rx=0.2 mg estrogen -0.4657 0.5854 -0.80 0.4263
## rx=1.0 mg estrogen 0.6931 0.6265 1.11 0.2686
## rx=5.0 mg estrogen 0.5258 0.6375 0.82 0.4095
## dtime -0.1925 0.0810 -2.38 0.0175
## dtime' 0.9895 0.3536 2.80 0.0051
## dtime'' -2.5924 0.8682 -2.99 0.0028
summary(fit.all.splines)
## Effects Response : cvd
##
## Factor Low High Diff. Effect S.E.
## sz 6.00000 24.5 18.5000 -0.4753200 0.57195
## Odds Ratio 6.00000 24.5 18.5000 0.6216800 NA
## sg 9.00000 12.0 3.0000 -3.0788000 0.86958
## Odds Ratio 9.00000 12.0 3.0000 0.0460160 NA
## ap 0.59998 7.0 6.4000 -0.0015339 0.71565
## Odds Ratio 0.59998 7.0 6.4000 0.9984700 NA
## sbp 13.00000 16.0 3.0000 -0.4203600 0.64517
## Odds Ratio 13.00000 16.0 3.0000 0.6568100 NA
## dbp 7.00000 9.0 2.0000 0.7533600 0.70002
## Odds Ratio 7.00000 9.0 2.0000 2.1241000 NA
## age 70.00000 76.0 6.0000 -0.4358300 0.56104
## Odds Ratio 70.00000 76.0 6.0000 0.6467300 NA
## wt 89.00000 106.0 17.0000 -0.4036000 0.59476
## Odds Ratio 89.00000 106.0 17.0000 0.6679100 NA
## hg 12.00000 14.6 2.5996 -0.5632000 0.62210
## Odds Ratio 12.00000 14.6 2.5996 0.5693800 NA
## bm 0.00000 1.0 1.0000 -0.1453900 0.62300
## Odds Ratio 0.00000 1.0 1.0000 0.8646800 NA
## hx 0.00000 1.0 1.0000 1.8027000 0.52030
## Odds Ratio 0.00000 1.0 1.0000 6.0659000 NA
## dtime 11.00000 37.0 26.0000 1.3504000 0.64498
## Odds Ratio 11.00000 37.0 26.0000 3.8589000 NA
## ekg_rec - rd&ec:bngn 1.00000 2.0 NA -0.6508500 0.79604
## Odds Ratio 1.00000 2.0 NA 0.5216000 NA
## ekg_rec - hbocd:bngn 1.00000 3.0 NA -2.0159000 1.04580
## Odds Ratio 1.00000 3.0 NA 0.1332000 NA
## ekg_rec - hrts:bngn 1.00000 4.0 NA -0.0922990 0.53294
## Odds Ratio 1.00000 4.0 NA 0.9118300 NA
## ekg_rec - MI:bngn 1.00000 5.0 NA -0.1072100 0.70615
## Odds Ratio 1.00000 5.0 NA 0.8983400 NA
## pf_rec - bed_under_50:normal 1.00000 2.0 NA 1.7917000 0.94679
## Odds Ratio 1.00000 2.0 NA 5.9995000 NA
## pf_rec - bed_over_50_conf:normal 1.00000 3.0 NA -1.2181000 1.33520
## Odds Ratio 1.00000 3.0 NA 0.2957800 NA
## rx - 0.2 mg estrogen:placebo 1.00000 2.0 NA -0.4657400 0.58539
## Odds Ratio 1.00000 2.0 NA 0.6276700 NA
## rx - 1.0 mg estrogen:placebo 1.00000 3.0 NA 0.6931000 0.62648
## Odds Ratio 1.00000 3.0 NA 1.9999000 NA
## rx - 5.0 mg estrogen:placebo 1.00000 4.0 NA 0.5257800 0.63747
## Odds Ratio 1.00000 4.0 NA 1.6918000 NA
## Lower 0.95 Upper 0.95
## -1.5963000 0.64568
## 0.2026400 1.90730
## -4.7831000 -1.37440
## 0.0083699 0.25299
## -1.4042000 1.40110
## 0.2455700 4.05970
## -1.6849000 0.84415
## 0.1854700 2.32600
## -0.6186600 2.12540
## 0.5386600 8.37610
## -1.5354000 0.66378
## 0.2153600 1.94210
## -1.5693000 0.76211
## 0.2081900 2.14280
## -1.7825000 0.65608
## 0.1682200 1.92720
## -1.3664000 1.07570
## 0.2550100 2.93190
## 0.7829300 2.82240
## 2.1879000 16.81800
## 0.0862480 2.61450
## 1.0901000 13.66100
## -2.2111000 0.90936
## 0.1095900 2.48270
## -4.0657000 0.03389
## 0.0171510 1.03450
## -1.1368000 0.95224
## 0.3208300 2.59150
## -1.4912000 1.27680
## 0.2251000 3.58520
## -0.0640040 3.64740
## 0.9380000 38.37300
## -3.8351000 1.39890
## 0.0215980 4.05060
## -1.6131000 0.68161
## 0.1992700 1.97710
## -0.5347800 1.92100
## 0.5858000 6.82760
## -0.7236500 1.77520
## 0.4849800 5.90150
an.s <- anova (fit.all.splines)
an.s
## Wald Statistics Response: cvd
##
## Factor Chi-Square d.f. P
## sz 11.55 3 0.0091
## Nonlinear 3.82 2 0.1481
## sg 13.31 3 0.0040
## Nonlinear 6.55 2 0.0379
## ap 3.32 3 0.3443
## Nonlinear 1.46 2 0.4827
## sbp 1.40 3 0.7054
## Nonlinear 0.14 2 0.9312
## dbp 12.05 3 0.0072
## Nonlinear 7.18 2 0.0276
## age 13.01 3 0.0046
## Nonlinear 5.02 2 0.0813
## wt 0.69 3 0.8762
## Nonlinear 0.16 2 0.9212
## hg 3.13 3 0.3714
## Nonlinear 3.03 2 0.2193
## ekg_rec 4.20 4 0.3798
## pf_rec 4.71 2 0.0951
## bm 0.05 1 0.8155
## hx 12.00 1 0.0005
## rx 4.23 3 0.2373
## dtime 12.40 3 0.0061
## Nonlinear 10.89 2 0.0043
## TOTAL NONLINEAR 24.81 18 0.1302
## TOTAL 57.36 38 0.0227
plot(an.s)
Let’s check what happens to the model’s performance:
s.splines <- fit.all.splines$stats # performance of the estimated model
round(s.splines, digits=3) # C is the AUC and Dxy is related to C (Dxy=2*(C-0.5)), are both discrimination measures
## Obs Max Deriv Model L.R. d.f. P C
## 245.000 0.000 177.108 38.000 0.000 0.933
## Dxy Gamma Tau-a R2 R2(245) R2(38,245)
## 0.865 0.865 0.434 0.686 0.515 0.433
## R2(183.7) R2(38,183.7) Brier g gr gp
## 0.619 0.531 0.104 4.107 60.753 0.432
gamma.hat.splines <- (s.splines['Model L.R.'] - s.splines['d.f.'])/s.splines['Model L.R.']
gamma.hat.splines
## Model L.R.
## 0.7854418
The van Houwelingen-Le Cessie heuristic shrinkage estimate is 0.79, indicating that this model will validate on new data about 21% worse than on this dataset (more complex than the previous model, more overfitting, as expected).
Let’s compare the performance of the two models in terms of AIC: the Akaike Information Criterion (the smaller is the better is):
AIC.models <- c(AIC(fit.multi), AIC(fit.all.splines))
AIC.models
## [1] 235.4473 240.4973
Based on AIC, the simpler model fitted to the raw data and assuming linearity for all the continuous predictors except dtime has a lower AIC and a lower optimism, so we would prefer that one as a predictive tool.
Now we use fast backward step-down (with total residual AIC as the stopping rule, that is a reasonable way to use stepwise selection) to identify the variables that explain the bulk of the cause of death. This algorithm performs a slightly inefficient but numerically stable version of fast backward elimination on factors;it uses the fitted full model and computes approximate Wald statistics by computing conditional (restricted) maximum likelihood estimates (assuming multivariate normality of estimates..). The function prints the deletion statistics for each variable in turn, and prints approximate parameter estimates for the model after deleting variables. The approximation is better when the number of factors deleted is not large.
fastbw(fit.multi)
##
## Deleted Chi-Sq d.f. P Residual d.f. P AIC
## ekg_rec 3.60 4 0.4628 3.60 4 0.4628 -4.40
## bm 0.01 1 0.9242 3.61 5 0.6069 -6.39
## sbp 0.08 1 0.7830 3.69 6 0.7192 -8.31
## hg 0.30 1 0.5827 3.99 7 0.7813 -10.01
## wt 0.92 1 0.3380 4.91 8 0.7677 -11.09
## rx 7.31 3 0.0628 12.21 11 0.3480 -9.79
## pf_rec 5.72 2 0.0573 17.93 13 0.1602 -8.07
## dtime 7.29 3 0.0633 25.22 16 0.0661 -6.78
## sg 3.36 1 0.0667 28.58 17 0.0386 -5.42
## dbp 4.98 1 0.0257 33.55 18 0.0143 -2.45
##
## Approximate Estimates after Deleting Factors
##
## Coef S.E. Wald Z P
## Intercept -3.68667 2.00132 -1.842 0.0654580
## sz -0.04316 0.01605 -2.689 0.0071632
## ap -0.41027 0.11823 -3.470 0.0005202
## age 0.05934 0.02809 2.112 0.0346497
## hx 0.82995 0.36398 2.280 0.0225953
##
## Factors in Final Model
##
## [1] sz ap age hx
In the final step the following covariates are retained: sz, ap, age
and hx.
Let’s estimate the reduced model:
fred <- lrm(cvd ~ sz + log(ap) + age + hx , data =prostate_rid)
print(fred)
## Frequencies of Missing Values Due to Each Variable
## cvd sz ap age hx
## 0 2 0 0 0
##
## Logistic Regression Model
##
## lrm(formula = cvd ~ sz + log(ap) + age + hx, data = prostate_rid)
##
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 255 LR chi2 100.81 R2 0.435 C 0.840
## 0 129 d.f. 4 R2(4,255)0.316 Dxy 0.679
## 1 126 Pr(> chi2) <0.0001 R2(4,191.2)0.397 gamma 0.679
## max |deriv| 3e-08 Brier 0.163 tau-a 0.341
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -4.9339 1.6846 -2.93 0.0034
## sz -0.0543 0.0136 -3.99 <0.0001
## ap -0.5048 0.1044 -4.83 <0.0001
## age 0.0776 0.0235 3.30 0.0010
## hx 1.0713 0.3101 3.45 0.0006
summary(fred)
## Effects Response : cvd
##
## Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
## sz 6.00000 24.5 18.5 -1.00520 0.25199 -1.49910 -0.51135
## Odds Ratio 6.00000 24.5 18.5 0.36595 NA 0.22332 0.59968
## ap 0.59998 7.0 6.4 -1.24020 0.25656 -1.74310 -0.73737
## Odds Ratio 0.59998 7.0 6.4 0.28932 NA 0.17498 0.47837
## age 70.00000 76.0 6.0 0.46531 0.14111 0.18874 0.74188
## Odds Ratio 70.00000 76.0 6.0 1.59250 NA 1.20770 2.09990
## hx 0.00000 1.0 1.0 1.07130 0.31007 0.46355 1.67900
## Odds Ratio 0.00000 1.0 1.0 2.91910 NA 1.58970 5.36010
Again, we want to evaluate the performance of the reduced model:
s.red <- fred$stats
round(s.red, digits=3)
## Obs Max Deriv Model L.R. d.f. P C
## 255.000 0.000 100.810 4.000 0.000 0.840
## Dxy Gamma Tau-a R2 R2(255) R2(4,255)
## 0.679 0.679 0.341 0.435 0.327 0.316
## R2(191.2) R2(4,191.2) Brier g gr gp
## 0.410 0.397 0.163 1.917 6.800 0.340
gamma.hat.red <- (s.red['Model L.R.'] - s.red['d.f.'])/s.red['Model L.R.']
gamma.hat.red
## Model L.R.
## 0.9603214
The van Houwelingen-Le Cessie heuristic shrinkage estimate is now 0.96, indicating that this model will validate on new data about 4% worse than on this dataset (simpler than the full model, less overfitting).
Let us now use a bootstrap approach to further evaluate calibration and discrimination of this reduced model:
fred <- update(fred, x=TRUE , y=TRUE)
vred <- validate (fred, B=200)
vred
## index.orig training test optimism index.corrected n
## Dxy 0.6792 0.6864 0.6694 0.0170 0.6623 200
## R2 0.4354 0.4478 0.4252 0.0226 0.4128 200
## Intercept 0.0000 0.0000 -0.0080 0.0080 -0.0080 200
## Slope 1.0000 1.0000 0.9580 0.0420 0.9580 200
## Emax 0.0000 0.0000 0.0108 0.0108 0.0108 200
## D 0.3914 0.4069 0.3801 0.0268 0.3646 200
## U -0.0078 -0.0078 0.0020 -0.0098 0.0020 200
## Q 0.3993 0.4148 0.3782 0.0366 0.3626 200
## B 0.1630 0.1593 0.1668 -0.0075 0.1705 200
## g 1.9169 1.9934 1.8723 0.1212 1.7958 200
## gp 0.3396 0.3425 0.3352 0.0074 0.3322 200
The slope shrinkage as already seen is around 4%. There is a sligth drop-off in all indexes. The estimated likely future predictive discrimination of the model as measured by Somers’ Dxy change from 0.68 to 0.66. The latter estimate is the one that should be claimed when describing model performance.
A nearly unbiased estimate of future calibration of the stepwise-derived model is given below:
fred <- update(fred, x=TRUE , y=TRUE)
cal <- calibrate(fred, B=200)
plot(cal)
##
## n=255 Mean absolute error=0.014 Mean squared error=0.00025
## 0.9 Quantile of absolute error=0.023
Bootstrap overfitting-corrected calibration curve estimate for the backwards step-down cause of death logistic model, along with a rug plot showing the distribution of predicted risks. The smooth nonparametric calibration estimator (loess) is used.The estimated mean absolute error is 0.013 between estimated probabilities and observed events: quite good!
For comparison, let consider now a bootstrap validation of the full model without using any variable selection:
fit.multi <- update(fit.multi, x=TRUE , y=TRUE)
vfull <- validate (fit.multi, B=200)
vfull
## index.orig training test optimism index.corrected n
## Dxy 0.8113 0.8576 0.7641 0.0935 0.7178 200
## R2 0.6110 0.6780 0.5362 0.1419 0.4691 200
## Intercept 0.0000 0.0000 0.0051 -0.0051 0.0051 200
## Slope 1.0000 1.0000 0.6679 0.3321 0.6679 200
## Emax 0.0000 0.0000 0.0902 0.0902 0.0902 200
## D 0.6088 0.7087 0.5109 0.1978 0.4110 200
## U -0.0082 -0.0082 0.0657 -0.0739 0.0657 200
## Q 0.6170 0.7169 0.4452 0.2717 0.3453 200
## B 0.1241 0.1045 0.1415 -0.0370 0.1611 200
## g 3.0079 3.9649 2.5829 1.3820 1.6259 200
## gp 0.4077 0.4281 0.3793 0.0489 0.3588 200
Compared to the validation of the full model, the step-down model has much less optimism, as expected, even if it has a smaller Dxy (discrimination) due to loss of information from removing moderately important variables. Let’s now explore calibration of the full model:
cal.full <- calibrate(fit.multi, B=200)
plot(cal.full)
##
## n=245 Mean absolute error=0.04 Mean squared error=0.00195
## 0.9 Quantile of absolute error=0.062
The estimated mean absolute error is 0.04 between estimated probabilities and observed events, the bias-corrected performance is worse than the reduced model.
We could also make a test to compare the two models in terms of AUC (i.e the discrimination power): note that in the reduced model we have less missing values (in any case the % of missing values in this dataset is quite irrelevant). Let’s extract the probabilities predicted from the models:
full.pred <- predict(fit.multi, type="fitted")
rid.pred <- predict(fred, type="fitted")
dati.pp <- data.frame(prostate_rid, full.pred, rid.pred)
And perform the test:
roc.test(response = dati.pp$cvd, predictor1= dati.pp$full.pred, predictor2 = dati.pp$rid.pred)
##
## DeLong's test for two correlated ROC curves
##
## data: dati.pp$full.pred and dati.pp$rid.pred by dati.pp$cvd (0, 1)
## Z = 3.4561, p-value = 0.0005481
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
## 0.02628403 0.09515026
## sample estimates:
## AUC of roc1 AUC of roc2
## 0.9056252 0.8449080
We observe as expected significant better discrimination of the full
model, but remember also that
this model is more prone to overfitting (the bootstrap-corrected
calibration is worse…) and that these measures are derived from the
“training-all” dataset so they need also to be rescaled in new
data.
roc.full <- roc(dati.pp$cvd, dati.pp$full.pred)
roc.red <- roc(dati.pp$cvd, dati.pp$rid.pred)
plot(roc.full)
plot(roc.red, add=TRUE, col="red")
legend("bottomright",legend=c("Full model", "Reduced model"), lty=c(1,1), col=c("black", "red"), bty="n")
title("Discrimination on the development sample")
We can try to be less conservative in estimating a reduced model, in order to increase the discrimination, using a different significance level for a variable to stay in the model and using individual approximate Wald tests rather than tests combining all deleted variables:
vrid2 <- validate(fit.multi, bw=TRUE , sls =0.5, type = 'individual' , B=200)
##
## Backwards Step-down - Original Model
##
## Deleted Chi-Sq d.f. P Residual d.f. P AIC
## ekg_rec 3.60 4 0.4628 3.60 4 0.4628 -4.40
## bm 0.01 1 0.9242 3.61 5 0.6069 -6.39
## sbp 0.08 1 0.7830 3.69 6 0.7192 -8.31
## hg 0.30 1 0.5827 3.99 7 0.7813 -10.01
## wt 0.92 1 0.3380 4.91 8 0.7677 -11.09
##
## Approximate Estimates after Deleting Factors
##
## Coef S.E. Wald Z P
## Intercept -5.12176 3.12777 -1.6375 0.1015240
## sz -0.05646 0.01689 -3.3422 0.0008313
## sg -0.33179 0.12027 -2.7587 0.0058026
## ap -0.34171 0.13370 -2.5558 0.0105945
## dbp 0.35870 0.14429 2.4859 0.0129210
## age 0.09340 0.03045 3.0677 0.0021573
## pf_rec=bed_under_50 1.28840 0.78205 1.6475 0.0994630
## pf_rec=bed_over_50_conf -2.31951 1.21744 -1.9052 0.0567492
## hx 1.27821 0.39842 3.2082 0.0013358
## rx=0.2 mg estrogen -0.28720 0.50331 -0.5706 0.5682597
## rx=1.0 mg estrogen 0.94249 0.54264 1.7369 0.0824130
## rx=5.0 mg estrogen 0.85665 0.50700 1.6896 0.0910964
## dtime -0.11929 0.06723 -1.7742 0.0760268
## dtime' 0.67239 0.29006 2.3181 0.0204412
## dtime'' -1.78597 0.70664 -2.5274 0.0114911
##
## Factors in Final Model
##
## [1] sz sg ap dbp age pf_rec hx rx dtime
## singular information matrix in lrm.fit (rank= 22 ). Offending variable(s):
## ekg_rec=hbocd
##
## Divergence or singularity in 1 samples
Now let’s estimate a second reduced model:
fred2 <- lrm(cvd ~ sz + sg+ log(ap) + dbp + age + hx +pf_rec+rx +rcs(dtime,4) , data =prostate_rid)
fred2
## Frequencies of Missing Values Due to Each Variable
## cvd sz sg ap dbp age hx pf_rec rx dtime
## 0 2 5 0 0 0 0 0 0 0
##
## Logistic Regression Model
##
## lrm(formula = cvd ~ sz + sg + log(ap) + dbp + age + hx + pf_rec +
## rx + rcs(dtime, 4), data = prostate_rid)
##
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 250 LR chi2 145.18 R2 0.587 C 0.897
## 0 125 d.f. 14 R2(14,250)0.408 Dxy 0.794
## 1 125 Pr(> chi2) <0.0001 R2(14,187.5)0.503 gamma 0.794
## max |deriv| 7e-10 Brier 0.131 tau-a 0.398
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -4.7269 3.0077 -1.57 0.1160
## sz -0.0614 0.0165 -3.71 0.0002
## sg -0.3203 0.1167 -2.74 0.0061
## ap -0.3365 0.1311 -2.57 0.0103
## dbp 0.3416 0.1381 2.47 0.0134
## age 0.0878 0.0290 3.02 0.0025
## hx 1.3681 0.3838 3.56 0.0004
## pf_rec=bed_under_50 1.2721 0.7466 1.70 0.0884
## pf_rec=bed_over_50_conf -2.3608 1.2607 -1.87 0.0611
## rx=0.2 mg estrogen -0.2781 0.4865 -0.57 0.5676
## rx=1.0 mg estrogen 0.9088 0.5227 1.74 0.0821
## rx=5.0 mg estrogen 0.9254 0.4880 1.90 0.0579
## dtime -0.1070 0.0644 -1.66 0.0966
## dtime' 0.5910 0.2755 2.15 0.0319
## dtime'' -1.5783 0.6699 -2.36 0.0185
print(fred2)
## Frequencies of Missing Values Due to Each Variable
## cvd sz sg ap dbp age hx pf_rec rx dtime
## 0 2 5 0 0 0 0 0 0 0
##
## Logistic Regression Model
##
## lrm(formula = cvd ~ sz + sg + log(ap) + dbp + age + hx + pf_rec +
## rx + rcs(dtime, 4), data = prostate_rid)
##
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 250 LR chi2 145.18 R2 0.587 C 0.897
## 0 125 d.f. 14 R2(14,250)0.408 Dxy 0.794
## 1 125 Pr(> chi2) <0.0001 R2(14,187.5)0.503 gamma 0.794
## max |deriv| 7e-10 Brier 0.131 tau-a 0.398
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -4.7269 3.0077 -1.57 0.1160
## sz -0.0614 0.0165 -3.71 0.0002
## sg -0.3203 0.1167 -2.74 0.0061
## ap -0.3365 0.1311 -2.57 0.0103
## dbp 0.3416 0.1381 2.47 0.0134
## age 0.0878 0.0290 3.02 0.0025
## hx 1.3681 0.3838 3.56 0.0004
## pf_rec=bed_under_50 1.2721 0.7466 1.70 0.0884
## pf_rec=bed_over_50_conf -2.3608 1.2607 -1.87 0.0611
## rx=0.2 mg estrogen -0.2781 0.4865 -0.57 0.5676
## rx=1.0 mg estrogen 0.9088 0.5227 1.74 0.0821
## rx=5.0 mg estrogen 0.9254 0.4880 1.90 0.0579
## dtime -0.1070 0.0644 -1.66 0.0966
## dtime' 0.5910 0.2755 2.15 0.0319
## dtime'' -1.5783 0.6699 -2.36 0.0185
summary(fred2)
## Effects Response : cvd
##
## Factor Low High Diff. Effect S.E.
## sz 6.00000 24.5 18.5 -1.135100 0.30581
## Odds Ratio 6.00000 24.5 18.5 0.321390 NA
## sg 9.00000 12.0 3.0 -0.960990 0.35016
## Odds Ratio 9.00000 12.0 3.0 0.382510 NA
## ap 0.59998 7.0 6.4 -0.826700 0.32204
## Odds Ratio 0.59998 7.0 6.4 0.437490 NA
## dbp 7.00000 9.0 2.0 0.683210 0.27626
## Odds Ratio 7.00000 9.0 2.0 1.980200 NA
## age 70.00000 76.0 6.0 0.526990 0.17427
## Odds Ratio 70.00000 76.0 6.0 1.693800 NA
## hx 0.00000 1.0 1.0 1.368100 0.38383
## Odds Ratio 0.00000 1.0 1.0 3.928100 NA
## dtime 11.00000 37.0 26.0 0.924720 0.51670
## Odds Ratio 11.00000 37.0 26.0 2.521200 NA
## pf_rec - bed_under_50:normal 1.00000 2.0 NA 1.272100 0.74664
## Odds Ratio 1.00000 2.0 NA 3.568500 NA
## pf_rec - bed_over_50_conf:normal 1.00000 3.0 NA -2.360800 1.26070
## Odds Ratio 1.00000 3.0 NA 0.094345 NA
## rx - 0.2 mg estrogen:placebo 1.00000 2.0 NA -0.278060 0.48652
## Odds Ratio 1.00000 2.0 NA 0.757250 NA
## rx - 1.0 mg estrogen:placebo 1.00000 3.0 NA 0.908770 0.52267
## Odds Ratio 1.00000 3.0 NA 2.481300 NA
## rx - 5.0 mg estrogen:placebo 1.00000 4.0 NA 0.925360 0.48803
## Odds Ratio 1.00000 4.0 NA 2.522800 NA
## Lower 0.95 Upper 0.95
## -1.7345000 -0.53573
## 0.1764900 0.58524
## -1.6473000 -0.27468
## 0.1925700 0.75981
## -1.4579000 -0.19552
## 0.2327300 0.82241
## 0.1417600 1.22470
## 1.1523000 3.40300
## 0.1854300 0.86855
## 1.2037000 2.38350
## 0.6158600 2.12040
## 1.8513000 8.33470
## -0.0879970 1.93740
## 0.9157600 6.94090
## -0.1912400 2.73550
## 0.8259400 15.41800
## -4.8317000 0.11013
## 0.0079727 1.11640
## -1.2316000 0.67550
## 0.2918200 1.96500
## -0.1156500 1.93320
## 0.8907900 6.91150
## -0.0311690 1.88190
## 0.9693100 6.56590
Give a look to the performance of this alternative reduced model:
fred2 <- update(fred2, x=TRUE , y=TRUE)
vred2 <- validate (fred2, B=200)
vred2
## index.orig training test optimism index.corrected n
## Dxy 0.7937 0.8220 0.7626 0.0594 0.7342 200
## R2 0.5873 0.6307 0.5393 0.0914 0.4959 200
## Intercept 0.0000 0.0000 0.0129 -0.0129 0.0129 200
## Slope 1.0000 1.0000 0.7782 0.2218 0.7782 200
## Emax 0.0000 0.0000 0.0580 0.0580 0.0580 200
## D 0.5767 0.6386 0.5146 0.1240 0.4527 200
## U -0.0080 -0.0080 0.0234 -0.0314 0.0234 200
## Q 0.5847 0.6466 0.4912 0.1554 0.4293 200
## B 0.1315 0.1193 0.1425 -0.0232 0.1547 200
## g 2.8350 3.4161 2.6168 0.7993 2.0357 200
## gp 0.3990 0.4121 0.3805 0.0316 0.3674 200
The performance statistics are now midway between the full model and the smaller stepwise model.
cal.red2 <- calibrate(fred2, B=200)
plot(cal.red2)
##
## n=250 Mean absolute error=0.03 Mean squared error=0.00103
## 0.9 Quantile of absolute error=0.041
The mean absolute error is 0.03.
Let’s compare the AUCs :
rid2.pred <- predict(fred2, type="fitted")
dati.pp2 <- data.frame(dati.pp,rid2.pred)
roc.test(response = dati.pp2$cvd, predictor1= dati.pp2$full.pred, predictor2 = dati.pp2$rid2.pred)
##
## DeLong's test for two correlated ROC curves
##
## data: dati.pp2$full.pred and dati.pp2$rid2.pred by dati.pp2$cvd (0, 1)
## Z = 1.1526, p-value = 0.2491
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
## -0.004248807 0.016378906
## sample estimates:
## AUC of roc1 AUC of roc2
## 0.9056252 0.8995601
We have a comparable discrimination power on the development sample with respect to the full model, but with a better calibration.
roc.full <- roc(dati.pp2$cvd, dati.pp2$full.pred)
roc.red2 <- roc(dati.pp2$cvd, dati.pp2$rid2.pred)
plot(roc.full)
plot(roc.red2, add=TRUE, col="red")
legend("bottomright",legend=c("Full model", "Reduced 2 model"), lty=c(1,1), col=c("black", "red"), bty="n")
title("Discrimination on the development sample")
Finally we decide to present our estimated prediction model to the clinicians. Let’s prepare a visualization that could be helpful to discuss with them; we rename some levels of the categorical variables:
prostate_rid$hxf <- as.factor(prostate_rid$hx)
levels(prostate_rid$hxf) <- c("No", "Yes")
label(prostate_rid$hxf) <- "History of CV disease"
options(datadist='dd')
dd <-datadist(prostate_rid)
dd
## patno stage rx dtime status
## Low:effect 131.00000 3 <NA> 11 <NA>
## Adjust to 265.00000 3 placebo 24 dead - prostatic ca
## High:effect 395.00000 4 <NA> 37 <NA>
## Low:prediction 24.96109 3 placebo 1 alive
## High:prediction 482.03891 4 5.0 mg estrogen 58 dead - unknown cause
## Low 3.00000 3 placebo 0 alive
## High 506.00000 4 5.0 mg estrogen 71 dead - unknown cause
## age wt pf hx sbp dbp ekg
## Low:effect 70.00000 89.0000 <NA> 0 13 7 <NA>
## Adjust to 73.00000 97.0000 normal activity 0 14 8 heart strain
## High:effect 76.00000 106.0000 <NA> 1 16 9 <NA>
## Low:prediction 55.96109 76.0000 normal activity 0 11 6 normal
## High:prediction 80.00000 125.0389 confined to bed 1 19 10 recent MI
## Low 48.00000 71.0000 normal activity 0 8 4 normal
## High 88.00000 152.0000 confined to bed 1 24 13 recent MI
## hg sz sg ap bm ap80 ekg_rec
## Low:effect 12.000000 6.00000 9 0.59997559 0 0 <NA>
## Adjust to 13.599609 14.00000 11 1.09985352 0 0 bngn
## High:effect 14.599609 24.50000 12 7.00000000 1 1 <NA>
## Low:prediction 9.298828 2.00000 8 0.19998169 0 0 bngn
## High:prediction 16.503800 46.11673 14 54.97592412 1 1 MI
## Low 5.899414 0.00000 5 0.09999084 0 0 bngn
## High 21.199219 69.00000 15 999.87500000 1 1 MI
## pf_rec status_num subset cvd log_ap
## Low:effect <NA> <NA> 1 0 -0.51086631
## Adjust to normal dead - prostatic ca 1 0 0.09517701
## High:effect <NA> <NA> 1 1 1.94591015
## Low:prediction normal alive 1 0 -1.60952947
## High:prediction bed_over_50_conf dead - unknown cause 1 1 4.00677071
## Low normal alive 1 0 -2.30267665
## High bed_over_50_conf dead - unknown cause 1 1 6.90763027
## hxf
## Low:effect <NA>
## Adjust to Yes
## High:effect <NA>
## Low:prediction No
## High:prediction Yes
## Low No
## High Yes
##
## Values:
##
## stage : 3 4
## rx : placebo 0.2 mg estrogen 1.0 mg estrogen 5.0 mg estrogen
## status :
## [1] alive dead - prostatic ca
## [3] dead - heart or vascular dead - cerebrovascular
## [5] dead - pulmonary embolus dead - other ca
## [7] dead - respiratory disease dead - other specific non-ca
## [9] dead - unspecified non-ca dead - unknown cause
## pf : normal activity in bed < 50% daytime in bed > 50% daytime confined to bed
## hx : 0 1
## dbp : 4 5 6 7 8 9 10 11 12 13
## ekg :
## [1] normal benign
## [3] rhythmic disturb & electrolyte ch heart block or conduction def
## [5] heart strain old MI
## [7] recent MI
## bm : 0 1
## ap80 : 0 1
## ekg_rec : bngn rd&ec hbocd hrts MI
## pf_rec : normal bed_under_50 bed_over_50_conf
## status_num :
## [1] alive dead - prostatic ca
## [3] dead - heart or vascular dead - cerebrovascular
## [5] dead - pulmonary embolus dead - other ca
## [7] dead - respiratory disease dead - other specific non-ca
## [9] dead - unspecified non-ca dead - unknown cause
## subset : 1
## cvd : 0 1
## hxf : No Yes
We estimate the final model selected:
fredfin <- lrm(cvd ~ sz + sg+ log(ap) + dbp+ age + hxf +pf_rec+rx+rcs(dtime,4), data =prostate_rid)
fredfin
## Frequencies of Missing Values Due to Each Variable
## cvd sz sg ap dbp age hxf pf_rec rx dtime
## 0 2 5 0 0 0 0 0 0 0
##
## Logistic Regression Model
##
## lrm(formula = cvd ~ sz + sg + log(ap) + dbp + age + hxf + pf_rec +
## rx + rcs(dtime, 4), data = prostate_rid)
##
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 250 LR chi2 145.18 R2 0.587 C 0.897
## 0 125 d.f. 14 R2(14,250)0.408 Dxy 0.794
## 1 125 Pr(> chi2) <0.0001 R2(14,187.5)0.503 gamma 0.794
## max |deriv| 7e-10 Brier 0.131 tau-a 0.398
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -4.7269 3.0077 -1.57 0.1160
## sz -0.0614 0.0165 -3.71 0.0002
## sg -0.3203 0.1167 -2.74 0.0061
## ap -0.3365 0.1311 -2.57 0.0103
## dbp 0.3416 0.1381 2.47 0.0134
## age 0.0878 0.0290 3.02 0.0025
## hxf=Yes 1.3681 0.3838 3.56 0.0004
## pf_rec=bed_under_50 1.2721 0.7466 1.70 0.0884
## pf_rec=bed_over_50_conf -2.3608 1.2607 -1.87 0.0611
## rx=0.2 mg estrogen -0.2781 0.4865 -0.57 0.5676
## rx=1.0 mg estrogen 0.9088 0.5227 1.74 0.0821
## rx=5.0 mg estrogen 0.9254 0.4880 1.90 0.0579
## dtime -0.1070 0.0644 -1.66 0.0966
## dtime' 0.5910 0.2755 2.15 0.0319
## dtime'' -1.5783 0.6699 -2.36 0.0185
nom <- nomogram (fredfin , ap=c(.1 , .5 , 1, 5, 10, 20, 30, 40),
fun=plogis , funlabel ="Probability ",
lp=TRUE,
fun.at =c(.01,.1,.25,.5,.75,.95))
nom
## Points per unit of linear predictor: 21.51285
## Linear predictor units per point : 0.04648384
##
##
## sz Points
## 0 92
## 5 86
## 10 79
## 15 73
## 20 66
## 25 59
## 30 53
## 35 46
## 40 40
## 45 33
## 50 26
## 55 20
## 60 13
## 65 7
## 70 0
##
##
## sg Points
## 5 69
## 6 62
## 7 55
## 8 48
## 9 41
## 10 34
## 11 28
## 12 21
## 13 14
## 14 7
## 15 0
##
##
## ap Points
## 0.1 43
## 0.5 32
## 1.0 27
## 5.0 15
## 10.0 10
## 20.0 5
## 30.0 2
## 40.0 0
##
##
## dbp Points
## 4 0
## 5 7
## 6 15
## 7 22
## 8 29
## 9 37
## 10 44
## 11 51
## 12 59
## 13 66
##
##
## age Points
## 45 0
## 50 9
## 55 19
## 60 28
## 65 38
## 70 47
## 75 57
## 80 66
## 85 76
## 90 85
##
##
## History of CV disease Points
## No 0
## Yes 29
##
##
## pf_rec Points
## normal 51
## bed_under_50 78
## bed_over_50_conf 0
##
##
## rx Points
## placebo 6
## 0.2 mg estrogen 0
## 1.0 mg estrogen 26
## 5.0 mg estrogen 26
##
##
## Months of Follow-up Points
## 0 100
## 10 79
## 20 79
## 30 96
## 40 95
## 50 77
## 60 52
## 70 26
## 80 0
##
##
## Total Points Probability
## 268 0.01
## 319 0.10
## 343 0.25
## 367 0.50
## 390 0.75
## 430 0.95
This is a useful tool to visualize a prediction model and discuss it with clinicians:
plot(nom,cex.axis=.75)
First of all, install and load the required libraries : rms, Hmisc and epiDisplay.
library(rms)
library(Hmisc)
library(epiDisplay)
Let’s simulate data from a sample of n=100 points along with population linear regression line. The conditional distribution of y|x can be thought of as a vertical slice at x. The unconditional distribution of y is shown on the y-axis. To envision the conditional normal distributions assumed for the underlying population, think of a bell-shaped curve coming out of the page, with its base along one of the vertical lines of points. The equal variance assumption (homoscedasticity) dictates that the series of Gaussian curves for all the different x values have equal variances.
n <- 100
set.seed(13)
x <- round(rnorm(n, .5, .25), 1)
y <- x + rnorm(n, 0, .1)
r <- c(-.2, 1.2)
Plot:
plot(x, y, axes=FALSE, xlim=r, ylim=r, xlab=expression(x), ylab=expression(y))
axis(1, at=r, labels=FALSE)
axis(2, at=r, labels=FALSE)
abline(a=0,b=1)
histSpike(y, side=2, add=TRUE)
abline(v=.6, lty=2)
Simple linear regression is used when:
Estimation of the confidence intervals (CI) for predictions depend on what you want to predict, if at individual level or at mean population level.
x1 <- c( 1, 3, 5, 6, 7, 9, 11)
y <- c( 5, 10, 70, 58, 85, 89, 135)
dd <- datadist(x1, n.unique=5); options(datadist='dd')
f <- ols(y ~ x1)
p1 <- Predict(f, x1=seq(1,11,length=100), conf.type='mean')
p2 <- Predict(f, x1=seq(1,11,length=100), conf.type='individual')
p <- rbind(Mean=p1, Individual=p2)
ggplot(p, legend.position='none') +
geom_point(aes(x1, y), data=data.frame(x1, y, .set.=''))
Example usages:
Is a child of age x smaller than predicted for her age? Use the individual level, p2 (wider bands)
What is the best estimate of the population mean blood pressure for patients on treatment A? Use the mean population level, p1 (narrower bands)
It is crucial to verify the the assumptions underlying a linear regression model:
In a scatterplot the spread of y about the fitted line should be constant as x increases, and y vs. x should appear linear
Easier to see this with a plot of residuals vs estimated values
In this plot there should be no systematic patterns (no trend in central tendency, no change in spread of points with x)
Trend in central tendency indicates failure of linearity
qqnorm plot of residuals is a useful tool
Here an example: we fit a linear regression model where x and y should instead have been log transformed:
n <- 50
set.seed(2)
res <- rnorm(n, sd=.25)
x <- runif(n)
y <- exp(log(x) + res)
f <- ols(y ~ x)
plot(fitted(f), resid(f))
This plot depicts non-constant variance of the residuals, which might call for transforming y.
Now, we fit a linear model that should have been quadratic (functional form of X):
x <- runif(n, -1, 1)
y <- x ^ 2 + res
f <- ols(y ~ x)
plot(fitted(f), resid(f))
Finally, we fit a correct model:
y <- x + res
f <- ols(y ~ x)
plot(fitted(f), resid(f))
qqnorm(resid(f)); qqline(resid(f))
These plots shows the ideal situation of white noise (no trend, constant variance). The qq plot demonstrates approximate normality of residuals, for a sample of size n = 50.
The dataset concerns the relationship between hookworm and blood loss from a study conducted in 1970.
data(Suwit)
summary(Suwit)
## id worm bloss
## Min. : 1.0 Min. : 32.0 Min. : 5.03
## 1st Qu.: 4.5 1st Qu.: 167.0 1st Qu.:14.02
## Median : 8.0 Median : 525.0 Median :33.80
## Mean : 8.0 Mean : 552.4 Mean :33.45
## 3rd Qu.:11.5 3rd Qu.: 796.5 3rd Qu.:41.70
## Max. :15.0 Max. :1929.0 Max. :86.65
des(Suwit)
## HW and Blood loss SEAJTMH 1970;
## No. of observations = 15
## Variable Class Description
## 1 id numeric
## 2 worm numeric No. of worms
## 3 bloss numeric Blood loss/day
summ(Suwit)
## HW and Blood loss SEAJTMH 1970;
## No. of observations = 15
##
## Var. name obs. mean median s.d. min. max.
## 1 id 15 8 8 4.47 1 15
## 2 worm 15 552.4 525 513.9 32 1929
## 3 bloss 15 33.45 33.8 24.85 5.03 86.65
attach(Suwit)
The file is clean and ready for analysis (this happens here only for didactic purposes: in real life, you will usually spend a couple of hours - at minimum, if not days..- to clean datasets). For example, with this small sample size it is somewhat straightforward to verify that there is no repetition of ‘id’ and no missing values. The records have been sorted in ascending order of ‘worm’ (number of worms) ranging from 32 in the first subject to 1929 in the last one. Blood loss (‘bloss’) is however, not sorted. The 13th record has the highest blood loss of 86 ml per day, which is very high. The objective of this analysis is to predict blood loss using worm.
First of all, give a look to the data:
plot(worm, bloss, xlab="No. of worms", ylab="ml. per day",
main = "Blood loss by number of hookworms in the bowel")
A linear model using the above two variables seems reasonable.
lm1 <- lm(bloss ~ worm, data=Suwit)
lm1
##
## Call:
## lm(formula = bloss ~ worm, data = Suwit)
##
## Coefficients:
## (Intercept) worm
## 10.84733 0.04092
Displaying the model by typing ‘lm1’ gives limited information (essentially, the estimated regression line coefficients). To get more information, one can look at the attributes of this model, its summary and attributes of its summary.
attr(lm1, "names")
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
summary(lm1)
##
## Call:
## lm(formula = bloss ~ worm, data = Suwit)
##
## Residuals:
## Min 1Q Median 3Q Max
## -15.846 -10.812 0.750 4.356 34.390
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 10.847327 5.308569 2.043 0.0618 .
## worm 0.040922 0.007147 5.725 6.99e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 13.74 on 13 degrees of freedom
## Multiple R-squared: 0.716, Adjusted R-squared: 0.6942
## F-statistic: 32.78 on 1 and 13 DF, p-value: 6.99e-05
The first section of summary shows the formula that was ‘called’. The second section gives the distribution of residuals. The pattern is clearly not symmetric. The maximum is too far on the right (34.38) compared to the minimum (-15.84) and the first quartile is further left (-10.81) of the median (0.75) than the third quartile (4.35) is. Otherwise, the median is close to zero. The third section gives coefficients of the intercept and the effect of ‘worm’ on blood loss. The intercept is 10.8 meaning that when there are no worms, the blood loss is estimated to be 10.8 ml per day. This is however, not significantly different from zero as the P value is 0.0618. The coefficient of ‘worm’ is 0.04 indicating that each worm is associated with an average increase of 0.04 ml of blood loss per day. Although the value is small, it is highly significantly different from zero. When there are many worms, the level of blood loss can be very substantial. The multiple R-squared value of 0.716 indicates that 71.6% of the variation in the data is explained by the model. The adjusted value is 0.6942. (The calculation of Rsquared is discussed in the analysis of variance section below). The last section describes more details of the residuals and hypothesis testing on the effect of ‘worm’ using the F-statistic. The P value from this section (0.0000699) is equal to that tested by the t-distribution in the coefficient section. This F-test more commonly appears in the analysis of variance table.
summary(aov(lm1))
## Df Sum Sq Mean Sq F value Pr(>F)
## worm 1 6192 6192 32.78 6.99e-05 ***
## Residuals 13 2455 189
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The above analysis of variance (aov) table breaks down the degrees of freedom, sum of squares and mean square of the outcome (blood loss) by sources (in this case there only two: worm + residuals).
The so-called ‘square’ is actually the square of difference between the value and the mean. The total sum of squares of blood loss is therefore:
SST <- sum((bloss-mean(bloss))^2)
SST
## [1] 8647.044
The sum of squares from residuals is:
SSR <- sum(residuals(lm1)^2)
SSR
## [1] 2455.468
See also the analysis of variance table. The sum of squares of worm or sum of squares of difference between the fitted values and the grand mean is:
SSW <- sum((fitted(lm1)-mean(bloss))^2)
SSW
## [1] 6191.577
The latter two sums add up to the first one. The R-squared is the proportion of sum of squares of the fitted values to the total sum of squares.
SSW/SST
## [1] 0.7160339
Instead of sum of squares, one may consider the mean square as the level of variation. In such a case, the number of worms can reduce the total mean square (or variance) by: (total mean square - residual mean square) / total mean square, or (variance - residual mean square) / variance.
resid.msq <- sum(residuals(lm1)^2)/lm1$df.residual
Radj <- (var(bloss)- resid.msq)/var(bloss)
Radj
## [1] 0.6941903
This is the adjusted R-squared shown in summary(lm1) in the above section.
When the mean square of ‘worm’ is divided by the mean square of residuals, the result is:
F <- SSW/resid.msq; F
## [1] 32.78011
Using this F value with the two corresponding degrees of freedom (from ‘worm’ and residuals) the P value for testing the effect of ‘worm’ can be computed.
pf(F, df1=1, df2=13, lower.tail=FALSE)
## [1] 6.990401e-05
The function pf is used to compute a P value from a given F value together with the two values of the degrees of freedom. The last argument ‘lower.tail’ is set to FALSE to obtain the right margin of the area under the curve of the F distribution. In summary, both the regression and analysis of variance give the same conclusion; that number of worms has a significant linear relationship with blood loss. Now the regression line can be drawn.
A regression line can be added to the scatter plot with the following command:
plot(worm, bloss, xlab="No. of worms", ylab="ml. per day",
main = "Blood loss by number of hookworms in the bowel", type="n")
abline(lm1)
points(worm, fitted(lm1), pch=18, col="blue")
segments(worm, bloss, worm, fitted(lm1), col="red")
The regression line has an intercept of 10.8 and a slope of 0.04. The expected value is the value of blood loss estimated from the regression line with a specific value of ‘worm’.A residual is the difference between the observed and expected value. The residuals can be drawn by adding the red segments.
The actual values of the residuals can be checked from the specific attribute of the defined linear model.
residuals(lm1) -> lm1.res
hist(lm1.res)
Histogram of the residuals is not somehow convincing about their normal distribution shape. However, with such a small sample size, it is difficult to draw any conclusion. A better way to check normality is to plot the residuals against the expected normal score or (residual-mean) / standard deviation. A reasonably straight line would indicate normality. Moreover, a test for normality could be calculated.
a <- qqnorm(lm1.res)
shapiro.qqnorm(lm1.res, type="n")
text(a$x, a$y, labels=as.character(id))
If the residuals were perfectly normally distributed, the text symbols would have formed along the straight dotted line. The graph suggests that the largest residual (13th) is too high (positive) whereas the smallest value (7th) is not large enough (negative). However, the P value from the Shapiro-Wilk test is 0.08 suggesting that the possibility of residuals being normally distributed cannot be rejected.
Finally, the residuals are plotted against the fitted values to see if there is a pattern.
plot(fitted(lm1), lm1.res, xlab="Fitted values")
plot(fitted(lm1), lm1.res, xlab="Fitted values", type="n")
text(fitted(lm1), lm1.res, labels=as.character(id))
abline(h=0, col="blue")
There is no obvious pattern. The residuals are quite independent of the expected values. With this and the above findings from the qqnorm command we may conclude that the residuals are randomly and normally distributed.
The above two diagnostic plots for the model ‘lm1’ can also be obtained from:
par(mfrow=c(1,2))
plot(lm1, which=1:2)
detach(Suwit)
From the analysis, it is clear that blood loss is associated with number of hookworms. On average, each additional worm is associated with an increase of 0.04 ml of blood loss. The remaining uncertainty of blood loss, apart from hookworm, is explained by random variation or other factors that were not measured.
Remember that linear regression always means linearity in the parameters, irrespective of linearity in explanatory variables (i.e. the functional forms selected for the covariates). Y is linearly related to X (in the parameters) if the rate of change of Y with respect to X (dY/dX) is independent of the value of X. A function Y=BX=\(b_{1}\)\(x_{1}\) + \(b_{2}\)\(x_{2}\) is said to be linear (in the parameters), say, \(b_{1}\), if \(b_{1}\) appears with a power of 1 only and is not multiplied or divided by any other parameter (for eg \(b_{1}\) x \(b_{2}\) , or \(b_{2}\) / \(b_{1}\)).
This is a different concept with respect to the linearity in the functional form of the covariates, that is not instead required.
Moreover, some regression models may look non linear in the parameters but are inherently or intrinsically linear.
This is because with suitable transformations they can be made linear in parameters.
Just as an example, imagine that we want to investigate the effect of total CSF polymorph count on blood glucose ratio in patients with either bacterial or viral meningitis.
require(Hmisc)
getHdata(abm)
As a first step in every statistical analysis, we should visualize data of interest.
In this example, we use a nonparametric regression approach to explore the relationship between the two variables: plsmo is a loess nonparametric smoother.
with(ABM, {
glratio <- gl / bloodgl
tpolys <- polys * whites / 100
plsmo(tpolys, glratio, xlab='Total Polymorphs in CSF',
ylab='CSF/Blood Glucose Ratio',
xlim=quantile(tpolys, c(.05,.95), na.rm=TRUE),
ylim=quantile(glratio, c(.05,.95), na.rm=TRUE))
scat1d(tpolys); scat1d(glratio, side=4) })
Moreover, we can use also a Super smoother relating age to the probability of bacterial meningitis given a patient has bacterial or viral meningitis (a binary outcome, see later in this block for logistic regression examples), with a rug plot showing the age distribution:
with(ABM, {
plsmo(age, abm, 'supsmu', bass=7,
xlab='Age at Admission, Years',
ylab='Proportion Bacterial Meningitis')
scat1d(age) })
We can use these nonparametric approaches to help in finding a suitable functional form for the candidate predictor (switching to parametric regression models), that could be very different from the linear effect. For example, we can decide to use splines to model the effect of a continuous predictor on an outcome. For whom interested in splines see: https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-019-0666-3
Datasets usually contain many variables collected during a study. It is often useful to see the relationship between two variables within the different levels of another third, categorical variable, i.e. to verify for the presence of an interaction. (This is just a didactic example, usually more than 3 variables are involved in MLR models).
A small survey on blood pressure was carried out. The objective is to see the hypertensive effect of subjects putting additional table salt on their meal.Gender of subjects is also measured.
data(BP)
attach(BP)
des(BP)
## cross-sectional survey on BP & risk factors
## No. of observations = 100
## Variable Class Description
## 1 id integer id
## 2 sex factor sex
## 3 sbp integer Systolic BP
## 4 dbp integer Diastolic BP
## 5 saltadd factor Salt added on table
## 6 birthdate Date
Note that the maximum systolic and diastolic blood pressures are quite high. There are 20 missing values in saltadd. The frequencies of the categorical variables sex and saltadd are now inspected.
describe(data.frame(sex, saltadd))
## vars n mean sd median trimmed mad min max range skew kurtosis se
## sex* 1 100 1.55 0.5 2 1.56 0 1 2 1 -0.20 -1.98 0.05
## saltadd* 2 80 1.54 0.5 2 1.55 0 1 2 1 -0.15 -2.00 0.06
The next step is to create a new age variable from birthdate. The calculation is based on 12th March 2001, the date of the survey (date of entry in the study).
age.in.days <- as.Date("2001-03-12") - birthdate
There is a leap year in every four years. Therefore, an average year will have 365.25 days.
class(age.in.days)
## [1] "difftime"
age <- as.numeric(age.in.days)/365.25
The function as.numeric is needed to transform the units of age (difftime); otherwise modelling would not be possible.
describeBy(sbp,saltadd)
##
## Descriptive statistics by group
## group: no
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 37 137.46 29.62 132 136.39 23.72 80 201 121 0.44 -0.42 4.87
## ------------------------------------------------------------
## group: yes
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 43 163.02 39.39 171 163.94 45.96 80 224 144 -0.25 -1.13 6.01
The missing value group has the highest median and average systolic blood pressure. In order to create a new variable with three levels type:
saltadd1 <- saltadd
levels(saltadd1) <- c("no", "yes", "missing")
saltadd1[is.na(saltadd)] <- "missing"
summary(saltadd1)
## no yes missing
## 37 43 20
summary(aov(age ~ saltadd1))
## Df Sum Sq Mean Sq F value Pr(>F)
## saltadd1 2 115 57.42 0.448 0.64
## Residuals 97 12422 128.06
Since there is not enough evidence that the missing group is important and for additional reasons of simplicity, we will assume MCAR (missing completely at random) and we will ignore this group and continue the analysis with the original saltadd variable consisting of only two levels. Before doing this however, a simple regression model and regression line are first fitted.
lm1 <- lm(sbp ~ age)
summary(lm1)
##
## Call:
## lm(formula = sbp ~ age)
##
## Residuals:
## Min 1Q Median 3Q Max
## -68.231 -21.801 -3.679 15.127 105.395
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 65.1465 14.8942 4.374 3.05e-05 ***
## age 1.8422 0.2997 6.147 1.71e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 33.56 on 98 degrees of freedom
## Multiple R-squared: 0.2782, Adjusted R-squared: 0.2709
## F-statistic: 37.78 on 1 and 98 DF, p-value: 1.712e-08
Although the R-squared is not very high, the p value is small indicating important influence of age on systolic blood pressure. A scatterplot of age against systolic blood pressure is now shown with the regression line added using the abline function. This function can accept many different argument forms, including a regression object. If this object has a coef method, and it returns a vector of length 1, then the value is taken to be the slope of a line through the origin, otherwise the first two values are taken to be the intercept and slope, as is the case for lm1.
plot(age, sbp, main = "Systolic BP by age", xlab = "Years", ylab = "mm.Hg")
abline(lm1)
Subsequent exploration of residuals suggests a non-significant deviation from normality and no pattern. Details of this can be adopted from the techniques already discussed and are omitted here. The next step is to provide different plot patterns for different groups of salt habits. Note that here the sample size is lower (we decided to omit missing values for saltadd, reducing the analysis to the complete cases):
lm2 <- lm(sbp ~ age + saltadd)
summary(lm2)
##
## Call:
## lm(formula = sbp ~ age + saltadd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -66.369 -24.972 -0.125 22.456 64.805
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 63.1291 15.7645 4.005 0.000142 ***
## age 1.5526 0.3118 4.979 3.81e-06 ***
## saltaddyes 22.9094 6.9340 3.304 0.001448 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.83 on 77 degrees of freedom
## (20 observations deleted due to missingness)
## Multiple R-squared: 0.3331, Adjusted R-squared: 0.3158
## F-statistic: 19.23 on 2 and 77 DF, p-value: 1.679e-07
On the average, a one year increment of age is associated with an increase in systolic blood pressure by 1.5 mmHg. Adding table salt increases systolic blood pressure significantly by approximately 23 mmHg.
plot(age, sbp, main="Systolic BP by age", xlab="Years", ylab="mm.Hg", type="n")
points(age[saltadd=="no"], sbp[saltadd=="no"], col="blue")
points(age[saltadd=="yes"], sbp[saltadd=="yes"], col="red",pch = 18)
Note that the red dots corresponding to those who added table salt are higher than the blue circles.
The final task is to draw two separate regression lines for each group. We now have two regression lines to draw, one for each group. The intercept for non-salt users will be the first coefficient and for salt users will be the first plus the third. The slope for both groups is the same.
a0 <- coef(lm2)[1]
a1 <- coef(lm2)[1] + coef(lm2)[3]
b <- coef(lm2)[2]
plot(age, sbp, main="Systolic BP by age", xlab="Years", ylab="mm.Hg", type="n")
points(age[saltadd=="no"], sbp[saltadd=="no"], col="blue")
points(age[saltadd=="yes"], sbp[saltadd=="yes"], col="red",pch = 18)
abline(a = a0, b, col = "blue")
abline(a = a1, b, col = "red")
Note that X-axis does not start at zero. Thus the intercepts are out of the plot frame. The red line is for the red points of salt adders and the blue line is for the blue points of non-adders. In this model, age is assumed to have a constant effect on systolic blood pressure independently from added salt.
But look at the distributions of the points of the two colours: the red points are higher than the blue ones but mainly on the right half of the graph. To fit lines with different slopes, a new model with interaction term is created.
Therefore the next step is to estimate a model with different slopes (or different ‘b’ for the abline arguments) for the different lines. The model needs an interaction term between addsalt and age.
lm3 <- lm(sbp ~ age * saltadd)
summary(lm3)
##
## Call:
## lm(formula = sbp ~ age * saltadd)
##
## Residuals:
## Min 1Q Median 3Q Max
## -69.885 -24.891 -1.142 21.228 66.867
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 78.0066 20.3981 3.824 0.000267 ***
## age 1.2419 0.4128 3.009 0.003558 **
## saltaddyes -12.2540 31.4574 -0.390 0.697965
## age:saltaddyes 0.7199 0.6282 1.146 0.255441
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 30.77 on 76 degrees of freedom
## (20 observations deleted due to missingness)
## Multiple R-squared: 0.3445, Adjusted R-squared: 0.3186
## F-statistic: 13.31 on 3 and 76 DF, p-value: 4.528e-07
For the intercept of the salt users, the second term and the fourth are all zero (since age is zero) but the third should be kept as such. This term is negative. The intercept of salt users is therefore lower than that of the non-users.
a0 <- coef(lm3)[1]
a1 <- coef(lm3)[1] + coef(lm3)[3]
For the slope of the non-salt users, the second coefficient alone is enough since the first and the third are not involved with each unit of increment of age and the fourth term has saltadd being 0. The slope for the salt users group includes the second and the fourth coefficients since saltaddyes is 1.
b0 <- coef(lm3)[2]
b1 <- coef(lm3)[2] + coef(lm3)[4]
plot(age, sbp, main="Systolic BP by age", xlab="Years",ylab="mm.Hg", pch=18, col=as.numeric(saltadd))
abline(a = a0, b = b0, col = 1)
abline(a = a1, b = b1, col = 2)
legend("topleft", legend = c("Salt added", "No salt added"),lty=1, col=c("red","black"))
Note that as.numeric(saltadd) converts the factor levels into the integers 1 (black) and 2 (red), representing the non-salt adders and the salt adders, respectively. These colour codes come from the R colour palette.
This model suggests that at the young age, the systolic blood pressure of two groups are not much different as the two lines are close together on the left of the plot. For example, at the age of 25, the difference is 5.7mmHg. Increasing age increases the difference between the two groups. At 70 years of age, the difference is as great as 38mmHg. In this aspect, age modifies the effect of adding table salt on blood pressure.
On the other hand the slope of age is 1.24mmHg per year among those who did not add salt but becomes 1.24+0.72 = 1.96mmHg among the salt adders. Thus, salt adding modifies the effect of age. Note that interaction is a statistical term whereas effect modification is the equivalent epidemiological term.
Note also that the coefficient of the interaction term age:saltaddyes is not statistically significant !!!
This means that the two slopes just differ by chance (or that we are low-powered to detect a significant interaction..)
This was in fact just a didactic example to show how to introduce an interaction term in a regression model.
Conceptually, in prediction, in a certain sense we make comparisons between outcomes across different combinations of values of input variables to predict the probability of an outcome. In causal inference, we ask what would happen to an outcome y as a result of a treatment or intervention. Predictive inference relates to comparisons between units (different groups/subjects). Causal inference addresses comparisons of different treatments when applied to the same unit.
The FEV, which is an acronym for Forced Expiratory Volume, is a measure of how much air a person can exhale (in liters) during a forced breath. In this dataset, the FEV of 606 children, between the ages of 6 and 17, were measured. The dataset also provides additional information on these children: their age, their height, their gender and, most importantly, whether the child is a smoker or a non-smoker (the exposure of interest). This is an observational cross-sectional study.
The goal of this study was to find out whether or not smoking has an effect on the FEV of children.
Load the required libraries
library(tidyverse)
library(DataExplorer)
## Warning: package 'DataExplorer' was built under R version 4.3.2
library(SmartEDA)
## Warning: package 'SmartEDA' was built under R version 4.3.2
library(ggplot2)
library(ggstatsplot)
## Warning: package 'ggstatsplot' was built under R version 4.3.2
Set the working directory and import data
fev <- read_tsv(here("datasets","fev.txt"))
head(fev)
## # A tibble: 6 × 5
## age fev height gender smoking
## <dbl> <dbl> <dbl> <chr> <dbl>
## 1 9 1.71 57 f 0
## 2 8 1.72 67.5 f 0
## 3 7 1.72 54.5 f 0
## 4 9 1.56 53 m 0
## 5 9 1.90 57 m 0
## 6 8 2.34 61 f 0
There are a few things in the formatting of the data that can be improved upon:
gender
and smoking
can be
transformed to factors.height
variable is written in inches. Inches are
hard to interpret. Let’s add a new column, height_cm
, with
the values converted to centimeter using the mutate
function. (For this example we will not use this variable however).fev <- fev %>%
mutate(gender = as.factor(gender)) %>%
mutate(smoking = as.factor(smoking)) %>%
mutate(height_cm = height*2.54)
head(fev)
## # A tibble: 6 × 6
## age fev height gender smoking height_cm
## <dbl> <dbl> <dbl> <fct> <fct> <dbl>
## 1 9 1.71 57 f 0 145.
## 2 8 1.72 67.5 f 0 171.
## 3 7 1.72 54.5 f 0 138.
## 4 9 1.56 53 m 0 135.
## 5 9 1.90 57 m 0 145.
## 6 8 2.34 61 f 0 155.
Now, let’s make a first explorative boxplot, showing only the FEV for both smoking categories.
fev %>%
ggplot(aes(x=smoking,y=fev,fill=smoking)) +
scale_fill_manual(values=c("dimgrey","firebrick")) +
theme_bw() +
geom_boxplot(outlier.shape=NA) +
geom_jitter(width = 0.2, size=0.1) +
ggtitle("Boxplot of FEV versus smoking") +
ylab("fev (l)") +
xlab("smoking status")
Did you expect these results??
It appears that children that smoke have a higher median FEV than children that do not smoke. Should we change legislations worldwide and make smoking obligatory for children??
Maybe there is something else going on in the data. Now, we will generate a similar plot, but we will stratify the data based on age (age as factor).
fev %>%
ggplot(aes(x=as.factor(age),y=fev,fill=smoking)) +
geom_boxplot(outlier.shape=NA) +
geom_point(width = 0.2, size = 0.1, position = position_jitterdodge()) +
theme_bw() +
scale_fill_manual(values=c("dimgrey","firebrick")) +
ggtitle("Boxplot of FEV versus smoking, stratified on age") +
ylab("fev (l)") +
xlab("Age (years)")
This plot seems to already give us a more plausible picture. First, it seems that we do not have any smoking children of ages 6, 7 or 8. Second, when looking at the results per age “category”, it seems no longer the case that smokers have a much higher FEV than non-smokers; for the higher ages, the contrary seems true.
This shows that taking into account confounders (in this case) is crucial! If we simply analyse the dataset based on the smoking status and FEV values only, our inference might be incorrect:
fit1 <- lm(fev~smoking, data=fev) # wrong beta0star, beta1star
fit2 <- lm(fev~age+smoking, data=fev) # "true" beta0, beta2, beta1
fitage <- lm(age~smoking, data=fev) # gamma0, gamma1
fit1
##
## Call:
## lm(formula = fev ~ smoking, data = fev)
##
## Coefficients:
## (Intercept) smoking1
## 2.6346 0.6054
fit2
##
## Call:
## lm(formula = fev ~ age + smoking, data = fev)
##
## Coefficients:
## (Intercept) age smoking1
## 0.2040 0.2479 -0.2358
fitage
##
## Call:
## lm(formula = age ~ smoking, data = fev)
##
## Coefficients:
## (Intercept) smoking1
## 9.804 3.393
beta0 <- coef(fit2)[1]
beta2 <- coef(fit2)[2]
beta1 <- coef(fit2)[3]
gamma0 <- coef(fitage)[1]
gamma1 <- coef(fitage)[2]
beta0star <- coef(fit1)[1]
beta1star <- coef(fit1)[2]
check that: beta0star = beta0 + beta2*gamma0 :
beta0 + beta2*gamma0
## (Intercept)
## 2.634628
beta0star
## (Intercept)
## 2.634628
and also check that (wrong) beta1star = beta1 + beta2*gamma1:
beta1 + beta2*gamma1
## smoking1
## 0.6054053
beta1star
## smoking1
## 0.6054053
Therefore, a beneficial estimated smoking effect is obtained when age is ignored. If the causal inference assumptions hold (see slides) we can consider beta1 as the average smoking effect in the population under study.
Imbalance and lack of complete overlap can make causal inference difficult. Remind : imbalance is when treatment groups differ with respect to an important covariate. Lack of complete overlap: when some combination of treatment level and covariate level is lacking (no observations, violation of the positivity assumption).
To explain fev, sex seems to matter, especially among older individuals:
plot(fev~age, col=gender, data=fev)
legend("topleft", pch=1, col=1:2, levels(fev$gender))
Another way to plot the same thing is:
fev %>%
ggplot(aes(x=as.factor(age),y=fev,fill=smoking)) +
geom_boxplot(outlier.shape=NA) +
geom_point(width = 0.2, size = 0.1, position = position_jitterdodge()) +
theme_bw() +
scale_fill_manual(values=c("dimgrey","firebrick")) +
ggtitle("Boxplot of FEV versus smoking, stratified on age and gender") +
ylab("fev (l)") +
xlab("smoking status") +
facet_grid(rows = vars(gender))
Especially for higher ages, the median FEV is higher for males as compared to females. [This could suggest a kind of interaction between gender and age, that could be explored in the regression model even if the interpretation of such interaction could be quite tricky (many levels for age!)].
Moreover, there is a slight gender imbalance among age categories:
counts <- table(fev$gender, as.factor(fev$age))
percentages <- round(prop.table(table(fev$gender, as.factor(fev$age)),2),digits=3)
barplot(percentages, main="Gender distribution",
xlab="ages", col=c("pink", "darkblue"),
legend = rownames(counts))
For imbalanced samples, simple comparisons of sample means between groups are not good estimates of treatment/risk factors effects. A model adjustment is of course one way to better estimate a treatment effect, where we add the covariate to the model. In this case for example we can add also gender in the regression model:
fit3 <- lm(fev~age+smoking+gender, data=fev)
summary(fit3)
##
## Call:
## lm(formula = fev ~ age + smoking + gender, data = fev)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1.39229 -0.35677 -0.03623 0.33644 1.91571
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.088007 0.097635 0.901 0.3677
## age 0.243707 0.009523 25.591 < 2e-16 ***
## smoking1 -0.180489 0.080950 -2.230 0.0261 *
## genderm 0.295794 0.044683 6.620 7.97e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.5465 on 602 degrees of freedom
## Multiple R-squared: 0.5681, Adjusted R-squared: 0.566
## F-statistic: 264 on 3 and 602 DF, p-value: < 2.2e-16
So now the effect of smoking is estimated conditional on both age and gender, and as expected is a negative effect on FEV; one possible problem however with these estimates could be that for some combinations of age-gender-smoking we do not have any observed data, so that the extrapolation of the regression model could be not so reliable. Also in this case (no interaction) if causal assumptions hold we can interpret the effect of smoking also as a marginal effect.
index <- fev$smoking==0
counts.noS <- table(fev[index,]$gender, as.factor(fev[index,]$age))
percentages.noS <- round(prop.table(table(fev[index,]$gender, as.factor(fev[index,]$age)),2),digits=3)
index1 <- fev$smoking==1
counts.S <- table(fev[index1,]$gender, as.factor(fev[index1,]$age))
percentages.S <- round(prop.table(table(fev[index1,]$gender, as.factor(fev[index1,]$age)),2),digits=3)
par(mfrow=c(1,2))
barplot(percentages.noS, main="Gender distribution among non-smokers",cex.main=0.8,
xlab="ages", col=c("pink", "darkblue"),
legend = rownames(percentages.noS))
barplot(percentages.S, main="Gender distribution among smokers",cex.main=0.8,
xlab="ages", col=c("pink", "darkblue"),
legend = rownames(percentages.S))
Observe that in fact there is also a lack of “smoking” children below age 9: lack of complete overlap is when there are no observations at all for some combination(s) of treatment levels / covariate levels.
For lack of complete overlap, there is no data available for some comparisons. This requires extrapolation using a model to make comparisons. This is in fact the job that the regression model does, but again extrapolation is always a risky business for the model in regions where no data are available. This is an even more serious problem than imbalance.
Matching is a possible strategy in these situations to overcome (avoid) imbalance, even if some data will be discarded (see in the propensity score methods examples).
Load the required libraries
library(tidyverse)
library(broom)
library(gtsummary)
library(MatchIt)
library(geepack)
## Warning: package 'geepack' was built under R version 4.3.2
library(boot)
Let’s simulate the data : first of all set the parameters of the data generating mechanism:
set.seed(0)
n.obs = 10000 #set sample size
#---- True parameters in outcome model
b0 = 60
b1 = 5
b2 = -0.3
b3 = -0.1
b4 = 8
b5 = 3
b6 = 2
#---- True parameters in exposure odds model
g0 = log(0.20/(1-0.20))
g1 = log(1.01)
g2 = log(1.005)
g3 = log(0.6)
g4 = log(0.5)
g5 = log(0.8)
#Function to compute outcome values
## Use the parameters specified above
mean_out <- function(C1, C2, C3, exposure){
b0 + b1*exposure + b2*C1 + b3*I(C1^2) + b4*C2 + b5*C3 + b6*exposure*C2 + rnorm(n = n.obs, mean = 0, sd = 5)
}
# Function to compute exposure probabilities
## Use the parameters specified above
prob_exp <- function(C1, C2, C3){
exp(g0 + g1*C1 + g2*I(C1^2) + g3*C2 + g4*C3 + g5*C2*C3)/(1 + exp(g0 + g1*C1+ g2*I(C1^2) + g3*C2 + g4*C3 + g5*C2*C3))
}
Now simulate the dataset :
df.sim <- tibble("ID" = seq(from = 1, to = n.obs, by = 1),
"C1" = rnorm(n = n.obs, mean = 0, sd = 5),
"C2" = rbinom(n = n.obs, size = 1, p = 0.4),
"C3" = rbinom(n = n.obs, size = 1, p = 0.3),
"Pexposure" = prob_exp(C1, C2, C3),
"Exposure" = rbinom(n = n.obs, size = 1,prob = Pexposure),
"Outcome" = as.numeric(mean_out(C1,C2,C3, Exposure)))
head(df.sim)
## # A tibble: 6 × 7
## ID C1 C2 C3 Pexposure Exposure Outcome
## <dbl> <dbl> <int> <int> <I<dbl>> <int> <dbl>
## 1 1 6.31 0 0 0.245 0 55.7
## 2 2 -1.63 0 0 0.200 0 62.9
## 3 3 6.65 0 0 0.250 0 47.7
## 4 4 6.36 0 0 0.246 0 45.9
## 5 5 2.07 0 1 0.115 0 61.5
## 6 6 -7.70 0 0 0.237 1 65.9
In this simulated data, A is exposure, C1 is a continuous covariate, and C2 and C3 are binary covariates.
The true outcome model is specified as follows:
Y=b0+b1xA+b2xC1+b3x(C1)^2+b4xC2+b5xC3+b6xAxC2+eps
where eps is the error term normally distributed with variance set at 5^2.
The true causal effect of exposure A among those with C2 = 0 is b1=5. The true causal effect of exposure A among those with C2 = 1 is b1+b6=7.
The marginal effect of A is: 5x0.6+7x0.4=5.8 because 40% of the total population has C2 = 1.
# Correctly specified model
df.sim %>%
lm(Outcome ~ Exposure*C2 + C1 + I(C1^2) + C3, data = .) %>%
tidy(conf.int = TRUE)
## # A tibble: 7 × 7
## term estimate std.error statistic p.value conf.low conf.high
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 (Intercept) 60.0 0.0875 686. 0 59.8 60.2
## 2 Exposure 5.02 0.166 30.2 4.61e-192 4.69 5.34
## 3 C2 8.10 0.112 72.4 0 7.88 8.32
## 4 C1 -0.305 0.0101 -30.1 2.49e-190 -0.324 -0.285
## 5 I(C1^2) -0.100 0.00148 -67.8 0 -0.103 -0.0972
## 6 C3 2.94 0.111 26.5 4.31e-150 2.72 3.16
## 7 Exposure:C2 1.93 0.294 6.56 5.77e- 11 1.35 2.51
The estimate for Exposure is 5.02. Note that this is an estimate of the conditional effect for C2=0 and it is identical to the true value because the model is correctly specified. The conditional effect for C2 = 1 is estimated to be 5.02 + 1.93 = 6.95, which again is nearly identical to the true parameter b1+b6=7.
One can now standardize the conditional effect estimates from the correctly specified multivariable regression model to get an estimate of a marginal effect:
# Make copies of original data
df.sim.a1 <- df.sim %>% mutate(Outcome = NA, Exposure = 1) #Assign Exposure = 1 to everyone
df.sim.a0 <- df.sim %>% mutate(Outcome = NA, Exposure = 0) #Assign Exposure = 0 to everyone
df.sim.combined <- bind_rows(df.sim.a1,df.sim.a0)
#mean(df.sim.combined$"Exposure")
# Fit an outcome model to the original data
## Correctly specified model
gcomp.fit <- df.sim %>% lm(Outcome ~ Exposure*C2 + C1 + I(C1^2) + C3, data = .)
# Predict outcome values using now the copied datasets
df.sim.combined$pred <- predict(gcomp.fit, newdata = df.sim.combined)
# ATE Estimate: difference between mean predicted values for rows with A=1 and mean predicted values for rows with A = 0
df.sim.combined %>% group_by(Exposure) %>% summarise( mean.Y = mean(pred) ) %>%
pivot_wider( names_from = Exposure, names_glue = "mean.Y.{Exposure}", values_from = mean.Y ) %>%
mutate(ATE = mean.Y.1-mean.Y.0)
## # A tibble: 1 × 3
## mean.Y.0 mean.Y.1 ATE
## <dbl> <dbl> <dbl>
## 1 61.6 67.4 5.78
The resulting estimate of a marginal effect is 5.78 — this is a consistent estimate of the true marginal effect of 5.8.
Confidence intervals for the standardized estimate can be obtained via bootstrapping:
standardization.boot <- function(data, indices){
df <- data[indices,]
df.a1 <- df %>%
mutate(Outcome = NA, Exposure = 1)
df.a0 <- df %>%
mutate(Outcome = NA, Exposure = 0)
df.combined <- bind_rows(df.a1,df.a0)
gcomp.fit <- df %>%
lm(Outcome ~ Exposure*C2 + C1 + I(C1^2) + C3, data = .)
df.combined$pred <- predict(gcomp.fit, newdata = df.combined)
output <- df.combined %>%
group_by(Exposure) %>%
summarise(
mean.Y = mean(pred)) %>%
pivot_wider(
names_from = Exposure,
names_glue = "mean.Y.{Exposure}",
values_from = mean.Y
) %>%
mutate(
ATE = mean.Y.1-mean.Y.0
)
return(output$ATE)
}
# bootstrap
standardization.results <- boot(data=df.sim, statistic=standardization.boot, R=100)
# 100 bootstrapped samples
# generating confidence intervals
empirical.se <- sd(standardization.results$t) # get empirical standard error estimate
estimate <- standardization.results$t0
ll <- estimate - qnorm(0.975)*empirical.se # normal approximation
ul <- estimate + qnorm(0.975)*empirical.se
data.frame(cbind(estimate, empirical.se, ll, ul))
## estimate empirical.se ll ul
## 1 5.782279 0.1425789 5.50283 6.061729
Of note: if you are interested in more details about these topics, you can explore the R library lmw at https://cran.r-project.org/web/packages/lmw/index.html and also the R library arm at https://cran.r-project.org/web/packages/arm/index.html.
Fist of all, we will upload the required libraries:
library(twang)
library(magrittr)
library(tidyverse)
library(gtsummary)
library(stddiff)
library(ggplot2)
library(data.table)
library(boot)
library(splines)
library(PSAgraphics)
library(Matching)
library(sandwich)
library(survey)
library(rms)
We will use a dataset coming from an observational study of 996 patients receiving an initial Percutaneous Coronary Intervention (PCI) at Ohio Heart Health, Christ Hospital, Cincinnati in 1997 and followed for at least 6 months by the staff of the Lindner Center.
The patients thought to be more severely diseased were assigned to treatment with abciximab (an expensive, high-molecular-weight IIb/IIIa cascade blocker); in fact, only 298 (29.9 percent) of patients received usual-care-alone with their initial PCI.
Our research question aims at estimating the treatment effect of abciximab+PCI (abcix) vs the standard care on the probability of of being deceased at 6 months.
In this practical, we will apply the methods based on the propensity score.
Measured pre-treatment characteristics that could confound the treatment-outcome relationship are:
data("lindner",package="twang")
set.seed(123)
summary(lindner)
## lifepres cardbill abcix stent
## Min. : 0.0 Min. : 2216 Min. :0.0000 Min. :0.0000
## 1st Qu.:11.6 1st Qu.: 10219 1st Qu.:0.0000 1st Qu.:0.0000
## Median :11.6 Median : 12458 Median :1.0000 Median :1.0000
## Mean :11.3 Mean : 15674 Mean :0.7008 Mean :0.6687
## 3rd Qu.:11.6 3rd Qu.: 16660 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :11.6 Max. :178534 Max. :1.0000 Max. :1.0000
## height female diabetic acutemi
## Min. :108.0 Min. :0.0000 Min. :0.0000 Min. :0.0000
## 1st Qu.:165.0 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Median :173.0 Median :0.0000 Median :0.0000 Median :0.0000
## Mean :171.4 Mean :0.3474 Mean :0.2239 Mean :0.1436
## 3rd Qu.:178.0 3rd Qu.:1.0000 3rd Qu.:0.0000 3rd Qu.:0.0000
## Max. :196.0 Max. :1.0000 Max. :1.0000 Max. :1.0000
## ejecfrac ves1proc sixMonthSurvive
## Min. : 0.00 Min. :0.000 Mode :logical
## 1st Qu.:45.00 1st Qu.:1.000 FALSE:26
## Median :55.00 Median :1.000 TRUE :970
## Mean :50.97 Mean :1.386
## 3rd Qu.:56.00 3rd Qu.:2.000
## Max. :90.00 Max. :5.000
How is the treatment variable distributed in the population?
# Exposure
lindner %>%
dplyr::select(abcix) %>%
tbl_summary()
Characteristic | N = 9961 |
---|---|
abcix | 698 (70%) |
1 n (%) |
ggplot(lindner)+
geom_bar(aes(x=abcix,fill=as.factor(abcix)),stat="count")+
scale_fill_discrete("Treatment",labels = c("PCI", "PCI+abciximab")) +
theme_classic()
Is the outcome rare?
# Outcome
lindner$sixMonthDeath <- 1-lindner$sixMonthSurvive
lindner %>%
dplyr::select(sixMonthDeath) %>%
tbl_summary()
Characteristic | N = 9961 |
---|---|
sixMonthDeath | 26 (2.6%) |
1 n (%) |
What is the crude odds ratio for mortality?
# Crude OR
fit.crude <- glm(sixMonthDeath~abcix,family = binomial,data=lindner)
tbl_regression(fit.crude,exponentiate=T)
Characteristic | OR1 | 95% CI1 | p-value |
---|---|---|---|
abcix | 0.30 | 0.13, 0.66 | 0.003 |
1 OR = Odds Ratio, CI = Confidence Interval |
Let’s now summarize the confounding variables by treatment group and by outcome, to have a general idea about the observed associations:
# Possible confounders
lindner %>%
dplyr::select(acutemi,
ejecfrac,
ves1proc,
stent,
diabetic,
female,
height) %>%
tbl_summary()
Characteristic | N = 9961 |
---|---|
acutemi | 143 (14%) |
ejecfrac | 55 (45, 56) |
ves1proc | |
0 | 4 (0.4%) |
1 | 680 (68%) |
2 | 252 (25%) |
3 | 45 (4.5%) |
4 | 14 (1.4%) |
5 | 1 (0.1%) |
stent | 666 (67%) |
diabetic | 223 (22%) |
female | 346 (35%) |
height | 173 (165, 178) |
1 n (%); Median (IQR) |
# Descriptive statistics of patients'characteristics by treatment group
lindner %>%
dplyr::select(acutemi,
ejecfrac,
ves1proc,
stent,
diabetic,
female,
height,
abcix) %>%
tbl_summary(by=abcix) %>%
add_overall() %>%
add_p()
Characteristic | Overall, N = 9961 | 0, N = 2981 | 1, N = 6981 | p-value2 |
---|---|---|---|---|
acutemi | 143 (14%) | 18 (6.0%) | 125 (18%) | <0.001 |
ejecfrac | 55 (45, 56) | 55 (50, 60) | 55 (45, 56) | 0.004 |
ves1proc | <0.001 | |||
0 | 4 (0.4%) | 1 (0.3%) | 3 (0.4%) | |
1 | 680 (68%) | 243 (82%) | 437 (63%) | |
2 | 252 (25%) | 47 (16%) | 205 (29%) | |
3 | 45 (4.5%) | 6 (2.0%) | 39 (5.6%) | |
4 | 14 (1.4%) | 1 (0.3%) | 13 (1.9%) | |
5 | 1 (0.1%) | 0 (0%) | 1 (0.1%) | |
stent | 666 (67%) | 174 (58%) | 492 (70%) | <0.001 |
diabetic | 223 (22%) | 80 (27%) | 143 (20%) | 0.027 |
female | 346 (35%) | 115 (39%) | 231 (33%) | 0.10 |
height | 173 (165, 178) | 173 (165, 178) | 173 (165, 180) | 0.8 |
1 n (%); Median (IQR) | ||||
2 Pearson’s Chi-squared test; Wilcoxon rank sum test; Fisher’s exact test |
# Descriptive statistics of patients'characteristics by outcome
lindner %>%
dplyr::select(acutemi,
ejecfrac,
ves1proc,
stent,
diabetic,
female,
height,
sixMonthDeath) %>%
tbl_summary(by=sixMonthDeath) %>%
add_overall() %>%
add_p()
Characteristic | Overall, N = 9961 | 0, N = 9701 | 1, N = 261 | p-value2 |
---|---|---|---|---|
acutemi | 143 (14%) | 137 (14%) | 6 (23%) | 0.2 |
ejecfrac | 55 (45, 56) | 55 (45, 56) | 50 (33, 54) | 0.002 |
ves1proc | 0.10 | |||
0 | 4 (0.4%) | 4 (0.4%) | 0 (0%) | |
1 | 680 (68%) | 661 (68%) | 19 (73%) | |
2 | 252 (25%) | 249 (26%) | 3 (12%) | |
3 | 45 (4.5%) | 41 (4.2%) | 4 (15%) | |
4 | 14 (1.4%) | 14 (1.4%) | 0 (0%) | |
5 | 1 (0.1%) | 1 (0.1%) | 0 (0%) | |
stent | 666 (67%) | 649 (67%) | 17 (65%) | 0.9 |
diabetic | 223 (22%) | 212 (22%) | 11 (42%) | 0.014 |
female | 346 (35%) | 335 (35%) | 11 (42%) | 0.4 |
height | 173 (165, 178) | 173 (165, 178) | 168 (163, 177) | 0.10 |
1 n (%); Median (IQR) | ||||
2 Fisher’s exact test; Wilcoxon rank sum test; Pearson’s Chi-squared test |
We can now calculate the Standardized Difference,which can be use as a measure of balance in the treatment groups. It is a measure of difference between groups that is independent from statistical testing (remember that p values always depend on sample size !!). It is very similar to the definition of effect size that we discussed in Block 2.
It can be defined for a continuous covariate as:
\(SD_{c}=\frac{\overline{x_{1}}-\overline{x_{0}}} {\sqrt{\frac{s_1^2+s_0^2}{2}}}\)
and for a dichotomous covariate as:
\(SD_{d}=\frac{\overline{p_{1}}-\overline{p_{0}}} {\sqrt{\frac{\overline{p_{1}}(1-\overline{p_{1})}+\overline{p_{0}}(1-\overline{p_{0}})}{2}}}\)
The rough interpretation is that imbalance is present if the standardized difference is greater than 0.1 or 0.2.
s1 <- stddiff.numeric(vcol="height",gcol="abcix",data=lindner)
s2 <- stddiff.numeric(vcol="ejecfrac",gcol="abcix",data=lindner)
s3 <- stddiff.numeric(vcol="ves1proc",gcol="abcix",data=lindner)
s4 <- stddiff.binary(vcol="stent",gcol="abcix",data=lindner)
s5 <- stddiff.binary(vcol="female",gcol="abcix",data=lindner)
s6 <- stddiff.binary(vcol="diabetic",gcol="abcix",data=lindner)
s7 <- stddiff.binary(vcol="acutemi",gcol="abcix",data=lindner)
cont.var <- as.data.frame(rbind(s1,s2,s3))
rownames(cont.var) <- c("height", "ejecfrac", "ves1proc")
cont.var
## mean.c sd.c mean.t sd.t missing.c missing.t stddiff stddiff.l
## height 171.446 10.589 171.443 10.695 0 0 0.000 -0.135
## ejecfrac 52.289 10.297 50.403 10.419 0 0 0.182 0.046
## ves1proc 1.205 0.480 1.463 0.706 0 0 0.427 0.290
## stddiff.u
## height 0.136
## ejecfrac 0.318
## ves1proc 0.564
bin.var <- as.data.frame(rbind(s4,s5,s6, s7))
rownames(bin.var) <- c("stent", "female", "diabetic", "acutemi")
bin.var
## p.c p.t missing.c missing.t stddiff stddiff.l stddiff.u
## stent 0.584 0.705 0 0 0.255 0.119 0.391
## female 0.386 0.331 0 0 0.115 -0.021 0.251
## diabetic 0.268 0.205 0 0 0.150 0.014 0.286
## acutemi 0.060 0.179 0 0 0.372 0.235 0.508
Fit now a propensity score model — a logistic regression model with abciximab+PCI (vs. PCI) as the outcome, and the confounders listed in the table above included as covariates. We exclude from the list the variable height, since there was not a relevant difference between the groups.
# Fit a propensity score model
fit.ps<- glm(abcix~
acutemi+
ejecfrac+
ves1proc+
stent+
diabetic+
female,
data=lindner,family = binomial)
summary(fit.ps)
##
## Call:
## glm(formula = abcix ~ acutemi + ejecfrac + ves1proc + stent +
## diabetic + female, family = binomial, data = lindner)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.272481 0.440711 0.618 0.536394
## acutemi 1.178158 0.269479 4.372 1.23e-05 ***
## ejecfrac -0.015076 0.007391 -2.040 0.041374 *
## ves1proc 0.757365 0.138279 5.477 4.32e-08 ***
## stent 0.571165 0.150208 3.803 0.000143 ***
## diabetic -0.409191 0.170371 -2.402 0.016316 *
## female -0.133346 0.151377 -0.881 0.378379
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1215.5 on 995 degrees of freedom
## Residual deviance: 1127.0 on 989 degrees of freedom
## AIC: 1141
##
## Number of Fisher Scoring iterations: 4
# Save the estimated propensity score
lindner$ps <- fitted(fit.ps)
# Plot estimated ps
ggplot(lindner) +
geom_boxplot(aes(y = ps,group = as.factor(abcix),col = as.factor(abcix))) +
scale_y_continuous("Estimated PS") +
scale_color_discrete("Treatment",labels = c("PCI", "PCI+abciximab")) +
theme_classic()
ggplot(lindner) +
geom_histogram(aes(x = ps,group = as.factor(abcix),fill = as.factor(abcix))) +
scale_y_continuous("Estimated PS") +
facet_grid(cols=vars(abcix))+
scale_fill_discrete("Treatment",labels = c("PCI", "PCI+abciximab")) +
theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Assess whether there are non-overlapping scores (positivity violation) in the two exposure groups:
lindner %>%
dplyr::select(ps,abcix) %>%
tbl_summary(type = list(ps~"continuous2"),by=abcix,
statistic = all_continuous2() ~c(
"{median} ({p25}, {p75})",
"{min}, {max}"))
Characteristic | 0, N = 298 | 1, N = 698 |
---|---|---|
ps | ||
Median (IQR) | 0.65 (0.55, 0.70) | 0.71 (0.65, 0.82) |
Range | 0.25, 0.96 | 0.28, 0.98 |
Investigate overlap:
lindner %<>% mutate(overlap=ifelse(ps>=min(ps[abcix==1]) &
ps<=max(ps[abcix==0]),1,0))
# non-overlap: treatment group have higher ps than any non-abciximab user and
# control group have smaller ps than any abciximab user
with(lindner,table(overlap,abcix))
## abcix
## overlap 0 1
## 0 1 8
## 1 297 690
with(lindner,prop.table(table(overlap,abcix)),2)
## abcix
## overlap 0 1
## 0 0.001004016 0.008032129
## 1 0.298192771 0.692771084
In the successive steps, we remove subjects that does not overlap. This step reduce the original sample size, but we should respect the assumption of positivity in order to estimate a reasonable causal effect. It makes no sense including subjects that have “near-zero” probability to receive the treatment or to have a “match” in the successive analyses.
We can use the estimated PS as a covariate in a logistic regression model for the outcome:
# Model 1: Linear relationship between ps and outcome
fit.out<- glm(sixMonthDeath~abcix+ps,
data=lindner,
family = binomial,
subset=overlap==1)
# Model summary
summary(fit.out)
##
## Call:
## glm(formula = sixMonthDeath ~ abcix + ps, family = binomial,
## data = lindner, subset = overlap == 1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -4.211 1.165 -3.616 0.000300 ***
## abcix -1.445 0.439 -3.293 0.000992 ***
## ps 1.947 1.698 1.147 0.251522
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 233.15 on 986 degrees of freedom
## Residual deviance: 222.01 on 984 degrees of freedom
## AIC: 228.01
##
## Number of Fisher Scoring iterations: 7
The second step is to save the predicted probabilities for the treated and the untreated and estimate the causal effects of interest in the population:
# fitted values (probabilities)
lindner$predY0<-fit.out$family$linkinv(coef(fit.out)[1]+coef(fit.out)[3]*lindner$ps) # PCI subjects
lindner$predY1<-fit.out$family$linkinv(coef(fit.out)[1]+coef(fit.out)[2]+coef(fit.out)[3]*lindner$ps) # PCI+abciximab subjects
# ATE effect
Y1<-mean(lindner$predY1)
Y0<-mean(lindner$predY0)
# ATT effect
Y1_1<-mean(lindner$predY1[lindner$abcix==1])
Y0_1<-mean(lindner$predY0[lindner$abcix==1])
# ATE effect
Y1-Y0
## [1] -0.04246731
# ATT effect
Y1_1-Y0_1
## [1] -0.04436082
# Estimate odds ratios related to the "ATE" and the "ATT"
(Y1/(1-Y1))/(Y0/(1-Y0))
## [1] 0.2363187
(Y1_1/(1-Y1_1))/(Y0_1/(1-Y0_1))
## [1] 0.236305
The ATE effect is quite similar to the ATT effect, indicating that there is a protective effect of the PCI+abciximab vs PCI alone.
To obtain the corresponding confidence intervals we can use the bootstrap approach.
We do not here outline this procedure, see at the end of this practical the supplementary code.
This model relies on two additional assumptions: no interaction between propensity score and treatment, and a linear relationship between the propensity score and treatment.
Do these assumptions appear reasonable here?
We can try to fit different models, and then compare the AIC:
# Model 2: Non-linear relationship between ps and outcome
fit.out2<- glm(sixMonthDeath~abcix+ps+I(ps^2),
data=lindner,
family = binomial,
subset=overlap==1)
summary(fit.out2)
##
## Call:
## glm(formula = sixMonthDeath ~ abcix + ps + I(ps^2), family = binomial,
## data = lindner, subset = overlap == 1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.3787 3.8937 0.611 0.541255
## abcix -1.5088 0.4493 -3.358 0.000784 ***
## ps -17.7559 11.5272 -1.540 0.123475
## I(ps^2) 14.1968 8.3131 1.708 0.087681 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 233.15 on 986 degrees of freedom
## Residual deviance: 219.46 on 983 degrees of freedom
## AIC: 227.46
##
## Number of Fisher Scoring iterations: 7
# Model 3: Non-linear relationship between ps and outcome and interaction between ps and treatment
fit.out3<- glm(sixMonthDeath~abcix*ps+I(ps^2),
data=lindner,
family = binomial,
subset=overlap==1)
summary(fit.out3)
##
## Call:
## glm(formula = sixMonthDeath ~ abcix * ps + I(ps^2), family = binomial,
## data = lindner, subset = overlap == 1)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) 2.361196 3.786408 0.624 0.5329
## abcix 0.009567 2.075020 0.005 0.9963
## ps -18.641792 11.183815 -1.667 0.0955 .
## I(ps^2) 15.513326 8.167099 1.899 0.0575 .
## abcix:ps -2.130967 2.862505 -0.744 0.4566
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 233.15 on 986 degrees of freedom
## Residual deviance: 218.93 on 982 degrees of freedom
## AIC: 228.93
##
## Number of Fisher Scoring iterations: 7
It seems that the relationship could be partially non-linear, but there is no a statistical significance very strong, as well as for the interaction. So probably the best parsimonious model to keep is Model 1.
Create propensity score strata: this could be an iterative process, since we should verify if we have enough subjects/event in each stratum.
lindner %<>% mutate(strata=cut(ps,quantile(ps,c(0,0.25,0.5,0.75,1)),include.lowest=T,labels=c(1:4)))
#Check they have been created correctly
summary(lindner$strata)
## 1 2 3 4
## 252 246 257 241
tapply(lindner$ps,lindner$strata,summary)
## $`1`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.2499 0.5168 0.5463 0.5321 0.5713 0.6051
##
## $`2`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6078 0.6505 0.6674 0.6605 0.6807 0.6846
##
## $`3`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.6877 0.7036 0.7307 0.7406 0.7728 0.8106
##
## $`4`
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.8106 0.8305 0.8706 0.8759 0.9102 0.9765
#Look at numbers of events and patients in each strata/exposure group
table(lindner$sixMonthDeath,lindner$strata,lindner$abcix)
## , , = 0
##
##
## 1 2 3 4
## 0 105 89 64 25
## 1 7 1 4 3
##
## , , = 1
##
##
## 1 2 3 4
## 0 138 156 185 208
## 1 2 0 4 5
These strata seem quite “sparse” as number of events. Another possibility is :
#Create propensity score strata
lindner %<>% mutate(strata=cut(ps,quantile(ps,c(0,0.33,0.66,1)),include.lowest=T,labels=c(1:3)))
#Check they have been created correctly
summary(lindner$strata)
## 1 2 3
## 356 302 338
#Look at numbers of events and patients in each strata/exposure group
table(lindner$sixMonthDeath,lindner$strata,lindner$abcix)
## , , = 0
##
##
## 1 2 3
## 0 146 96 41
## 1 7 4 4
##
## , , = 1
##
##
## 1 2 3
## 0 201 201 285
## 1 2 1 8
Remind : we should also check for the balance of the confounders in each strata ! See the supplementary material for that. For now, let’s just estimate the OR in each stratum:
beta.treat<-numeric(3)
nstrata<-table(lindner$strata)
treated.strata<-table(lindner$strata,lindner$abcix)[,2]
for (i in 1:3){
ms<-glm(sixMonthDeath~abcix,data=lindner,subset = strata==i,family="binomial")
beta.treat[i]<-coef(ms)[2]
print(summary(ms))
}
##
## Call:
## glm(formula = sixMonthDeath ~ abcix, family = "binomial", data = lindner,
## subset = strata == i)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.0377 0.3869 -7.851 4.13e-15 ***
## abcix -1.5725 0.8091 -1.943 0.052 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 83.969 on 355 degrees of freedom
## Residual deviance: 79.319 on 354 degrees of freedom
## AIC: 83.319
##
## Number of Fisher Scoring iterations: 7
##
##
## Call:
## glm(formula = sixMonthDeath ~ abcix, family = "binomial", data = lindner,
## subset = strata == i)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.1781 0.5103 -6.228 4.73e-10 ***
## abcix -2.1253 1.1249 -1.889 0.0589 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 50.927 on 301 degrees of freedom
## Residual deviance: 46.200 on 300 degrees of freedom
## AIC: 50.2
##
## Number of Fisher Scoring iterations: 8
##
##
## Call:
## glm(formula = sixMonthDeath ~ abcix, family = "binomial", data = lindner,
## subset = strata == i)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3273 0.5238 -4.443 8.88e-06 ***
## abcix -1.2458 0.6347 -1.963 0.0497 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 103.68 on 337 degrees of freedom
## Residual deviance: 100.39 on 336 degrees of freedom
## AIC: 104.39
##
## Number of Fisher Scoring iterations: 6
And, finally, let’s estimate the weighted OR related to the ATE and the ATT as a weighted average of the ORs in the various strata:
exp(sum(beta.treat*nstrata)/nrow(lindner))
## [1] 0.1960846
exp(sum(beta.treat*treated.strata)/sum(treated.strata))
## [1] 0.2028472
Also here, we should use a bootstrap approach to estimate the corresponding confidence intervals.
Here we should create a reduced dataset retaining only patients with the overlap:
lindner.overlap <- lindner %>% filter(overlap==1)
Now we proceed with the matching algorithm: there is plenty of different algorithms in R that produce matching, here we use one from the library(Matching).
library(Matching)
match <- Match(Y=lindner.overlap$sixMonthDeath,
Tr=lindner.overlap$abcix,
X=lindner.overlap$ps,
caliper=0.2,# all matches not equal to or within 0.2 standard deviations of ps are dropped
M=1,
ties=FALSE,
replace=TRUE # 1:1
)
# Number of pairs
nn <- length(match$index.treated)
# Create matched dataset
lindnerMatched <- cbind(rbind(lindner.overlap[match$index.treated,],
lindner.overlap[match$index.control,]),
pair=c(1:nn,1:nn))
table(lindner.overlap$abcix)
##
## 0 1
## 297 690
#Check number of treated patients
table(lindnerMatched$abcix)
##
## 0 1
## 689 689
#Look at people being used multiple times in the matched sample
summary(as.factor(table(match$index.treated)))
## 1
## 689
summary(as.factor(table(match$index.control)))
## 1 2 3 4 5 6 7 8 9 10 12 13 15 17 19
## 69 45 30 20 11 9 9 5 2 2 1 1 2 1 2
#Look at the propensity score distribution in the matched dataset
ggplot(lindnerMatched) +
geom_boxplot(aes(y = ps,group = as.factor(abcix),col = as.factor(abcix))) +
scale_y_continuous("Estimated PS") +
scale_color_discrete("Treatment",labels = c("PCI", "PCI+abciximab")) +
theme_classic()
ggplot(lindnerMatched) +
geom_histogram(aes(x = ps,group = as.factor(abcix),fill = as.factor(abcix))) +
scale_y_continuous("Estimated PS") +
facet_grid(cols=vars(abcix))+
scale_fill_discrete("Treatment",labels = c("PCI", "PCI+abciximab")) +
theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Let’s check now the balance:
# Balance Diagnostics before and after matching
bal <- MatchBalance(abcix~
stent+
female+
diabetic+
acutemi+
ejecfrac+
ves1proc,
data=lindner.overlap,
match.out = match)
##
## ***** (V1) stent *****
## Before Matching After Matching
## mean treatment........ 0.70435 0.70537
## mean control.......... 0.58586 0.65602
## std mean diff......... 25.947 10.817
##
## mean raw eQQ diff..... 0.11785 0.049347
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.059245 0.024673
## med eCDF diff........ 0.059245 0.024673
## max eCDF diff........ 0.11849 0.049347
##
## var ratio (Tr/Co)..... 0.85663 0.92097
## T-test p-value........ 0.0004398 0.0057027
##
##
## ***** (V2) female *****
## Before Matching After Matching
## mean treatment........ 0.33333 0.33382
## mean control.......... 0.38384 0.30479
## std mean diff......... -10.706 6.151
##
## mean raw eQQ diff..... 0.050505 0.029028
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.025253 0.014514
## med eCDF diff........ 0.025253 0.014514
## max eCDF diff........ 0.050505 0.029028
##
## var ratio (Tr/Co)..... 0.9378 1.0495
## T-test p-value........ 0.13211 0.09767
##
##
## ***** (V3) diabetic *****
## Before Matching After Matching
## mean treatment........ 0.20725 0.2061
## mean control.......... 0.26599 0.21771
## std mean diff......... -14.483 -2.8684
##
## mean raw eQQ diff..... 0.057239 0.011611
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.029373 0.0058055
## med eCDF diff........ 0.029373 0.0058055
## max eCDF diff........ 0.058747 0.011611
##
## var ratio (Tr/Co)..... 0.83988 0.96072
## T-test p-value........ 0.050489 0.47958
##
##
## ***** (V4) acutemi *****
## Before Matching After Matching
## mean treatment........ 0.17101 0.17126
## mean control.......... 0.060606 0.16255
## std mean diff......... 29.302 2.3098
##
## mean raw eQQ diff..... 0.11111 0.0087083
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.055204 0.0043541
## med eCDF diff........ 0.055204 0.0043541
## max eCDF diff........ 0.11041 0.0087083
##
## var ratio (Tr/Co)..... 2.4853 1.0426
## T-test p-value........ 4.1797e-08 0.5361
##
##
## ***** (V5) ejecfrac *****
## Before Matching After Matching
## mean treatment........ 50.559 50.553
## mean control.......... 52.279 51.209
## std mean diff......... -16.899 -6.4414
##
## mean raw eQQ diff..... 1.8687 0.80406
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 20 15
##
## mean eCDF diff........ 0.033253 0.013763
## med eCDF diff........ 0.0087396 0.0043541
## max eCDF diff........ 0.10771 0.046444
##
## var ratio (Tr/Co)..... 0.97403 1.0808
## T-test p-value........ 0.016162 0.13002
## KS Bootstrap p-value.. 0.002 0.216
## KS Naive p-value...... 0.016165 0.44722
## KS Statistic.......... 0.10771 0.046444
##
##
## ***** (V6) ves1proc *****
## Before Matching After Matching
## mean treatment........ 1.4435 1.4456
## mean control.......... 1.2088 1.5109
## std mean diff......... 34.534 -9.6339
##
## mean raw eQQ diff..... 0.24242 0.065312
## med raw eQQ diff..... 0 0
## max raw eQQ diff..... 1 1
##
## mean eCDF diff........ 0.048684 0.013062
## med eCDF diff........ 0.014024 0.0043541
## max eCDF diff........ 0.1805 0.049347
##
## var ratio (Tr/Co)..... 2.0392 0.93506
## T-test p-value........ 9.0068e-10 0.0050611
## KS Bootstrap p-value.. < 2.22e-16 0.062
## KS Naive p-value...... 2.6627e-06 0.37114
## KS Statistic.......... 0.1805 0.049347
##
##
## Before Matching Minimum p.value: < 2.22e-16
## Variable Name(s): ves1proc Number(s): 6
##
## After Matching Minimum p.value: 0.0050611
## Variable Name(s): ves1proc Number(s): 6
lindnerMatched %>%
dplyr::select(stent,female,diabetic,acutemi,ejecfrac,ves1proc,abcix) %>%
tbl_summary(by=abcix) %>%
add_overall() %>%
add_p()
Characteristic | Overall, N = 1,3781 | 0, N = 6891 | 1, N = 6891 | p-value2 |
---|---|---|---|---|
stent | 938 (68%) | 452 (66%) | 486 (71%) | 0.049 |
female | 440 (32%) | 210 (30%) | 230 (33%) | 0.2 |
diabetic | 292 (21%) | 150 (22%) | 142 (21%) | 0.6 |
acutemi | 230 (17%) | 112 (16%) | 118 (17%) | 0.7 |
ejecfrac | 55 (45, 56) | 55 (45, 60) | 55 (45, 56) | 0.3 |
ves1proc | 0.3 | |||
0 | 2 (0.1%) | 0 (0%) | 2 (0.3%) | |
1 | 842 (61%) | 405 (59%) | 437 (63%) | |
2 | 434 (31%) | 231 (34%) | 203 (29%) | |
3 | 73 (5.3%) | 38 (5.5%) | 35 (5.1%) | |
4 | 27 (2.0%) | 15 (2.2%) | 12 (1.7%) | |
1 n (%); Median (IQR) | ||||
2 Pearson’s Chi-squared test; Wilcoxon rank sum test; Fisher’s exact test |
Note that the number of stent has not been well balanced after the matching procedure.
For this reason, we use this covariate in the regression model for the outcome.
Now we estimate the causal effect:
fit.out4<-glm(sixMonthDeath~abcix+stent, data=lindnerMatched,family=binomial)
summary(fit.out4)
##
## Call:
## glm(formula = sixMonthDeath ~ abcix + stent, family = binomial,
## data = lindnerMatched)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.4089 0.3429 -9.941 < 2e-16 ***
## abcix -1.5529 0.3562 -4.359 1.3e-05 ***
## stent 0.9438 0.3727 2.532 0.0113 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 449.29 on 1377 degrees of freedom
## Residual deviance: 418.68 on 1375 degrees of freedom
## AIC: 424.68
##
## Number of Fisher Scoring iterations: 7
We can see that the number of stent is statistically significant in the model,so it has been a good idea to control for it, since it was not well balanced in the matching procedure. As we already have discussed, sometimes also covariates that are not confounders for the effect of the treatment on the outcome could be included in the final model in order to obtain more accurate estimates of the effect.
Definition of the weights:
# Definition of weights for ATE
lindner.overlap %<>%mutate(w_ATE=case_when(abcix==1~1/ps,
abcix==0~1/(1-ps)))
#Definition of weights for ATT
lindner.overlap %<>%mutate(w_ATT=case_when(abcix==1~1,
abcix==0~ps/(1-ps)))
Check the extreme weights: sometimes it is useful to use truncated or stabilized weights, in order to reduce the variance of the final estimates, but we do not cover here this aspect.
#Check extremes
quantile(lindner.overlap$w_ATE[lindner.overlap$abcix==1],c(0,0.01,0.05,0.95,0.99,1))
## 0% 1% 5% 95% 99% 100%
## 1.041578 1.052785 1.076734 1.882227 2.339352 3.627077
quantile(lindner.overlap$w_ATE[lindner.overlap$abcix==0],c(0,0.01,0.05,0.95,0.99,1))
## 0% 1% 5% 95% 99% 100%
## 1.566665 1.658869 1.791224 6.137759 14.832595 25.687628
quantile(lindner.overlap$w_ATT[lindner.overlap$abcix==1],c(0,0.01,0.05,0.95,0.99,1))
## 0% 1% 5% 95% 99% 100%
## 1 1 1 1 1 1
quantile(lindner.overlap$w_ATT[lindner.overlap$abcix==0],c(0,0.01,0.05,0.95,0.99,1))
## 0% 1% 5% 95% 99% 100%
## 0.5666647 0.6588687 0.7912238 5.1377591 13.8325951 24.6876276
# Balance diagnostics
#ATE
bal_IPTW_ATE <- dx.wts(x=lindner.overlap$w_ATE,
data=lindner.overlap,
vars=colnames(lindner.overlap)[4:10],
treat.var = colnames(lindner.overlap)[3],
estimand = "ATE")
#ATT
bal_IPTW_ATT <- dx.wts(x=lindner.overlap$w_ATT,
data=lindner.overlap,
x.as.weights = T,
vars=colnames(lindner.overlap)[4:10],
treat.var = colnames(lindner.overlap)[3],
estimand = "ATT")
bal.table(bal_IPTW_ATE)
## $unw
## tx.mn tx.sd ct.mn ct.sd std.eff.sz stat p ks ks.pval
## stent 0.704 0.457 0.586 0.493 0.252 3.541 0.000 0.118 0.005
## height 171.419 10.728 171.458 10.605 -0.004 -0.053 0.958 0.024 1.000
## female 0.333 0.472 0.384 0.487 -0.106 -1.509 0.132 0.051 0.641
## diabetic 0.207 0.406 0.266 0.443 -0.141 -1.962 0.050 0.059 0.450
## acutemi 0.171 0.377 0.061 0.239 0.320 5.537 0.000 0.110 0.012
## ejecfrac 50.559 10.179 52.279 10.313 -0.168 -2.415 0.016 0.108 0.015
## ves1proc 1.443 0.680 1.209 0.476 0.370 6.207 0.000 0.181 0.000
##
## [[2]]
## tx.mn tx.sd ct.mn ct.sd std.eff.sz stat p ks ks.pval
## stent 0.669 0.471 0.668 0.472 0.000 0.005 0.996 0.000 1.000
## height 171.123 10.734 172.398 10.965 -0.119 -1.351 0.177 0.064 0.532
## female 0.347 0.476 0.330 0.471 0.035 0.468 0.640 0.017 1.000
## diabetic 0.225 0.418 0.242 0.429 -0.040 -0.461 0.645 0.017 1.000
## acutemi 0.137 0.344 0.149 0.357 -0.035 -0.307 0.759 0.012 1.000
## ejecfrac 51.077 9.910 51.093 10.182 -0.002 -0.020 0.984 0.044 0.909
## ves1proc 1.368 0.641 1.427 0.690 -0.093 -0.783 0.434 0.032 0.996
bal.table(bal_IPTW_ATT)
## $unw
## tx.mn tx.sd ct.mn ct.sd std.eff.sz stat p ks ks.pval
## stent 0.704 0.457 0.586 0.493 0.259 3.541 0.000 0.118 0.005
## height 171.419 10.728 171.458 10.605 -0.004 -0.053 0.958 0.024 1.000
## female 0.333 0.472 0.384 0.487 -0.107 -1.509 0.132 0.051 0.641
## diabetic 0.207 0.406 0.266 0.443 -0.145 -1.962 0.050 0.059 0.450
## acutemi 0.171 0.377 0.061 0.239 0.293 5.537 0.000 0.110 0.012
## ejecfrac 50.559 10.179 52.279 10.313 -0.169 -2.415 0.016 0.108 0.015
## ves1proc 1.443 0.680 1.209 0.476 0.345 6.207 0.000 0.181 0.000
##
## [[2]]
## tx.mn tx.sd ct.mn ct.sd std.eff.sz stat p ks ks.pval
## stent 0.704 0.457 0.703 0.458 0.003 0.034 0.973 0.001 1.000
## height 171.419 10.728 172.793 11.090 -0.128 -1.264 0.206 0.068 0.597
## female 0.333 0.472 0.307 0.462 0.055 0.683 0.495 0.026 1.000
## diabetic 0.207 0.406 0.232 0.423 -0.061 -0.598 0.550 0.025 1.000
## acutemi 0.171 0.377 0.186 0.390 -0.041 -0.313 0.754 0.015 1.000
## ejecfrac 50.559 10.179 50.594 10.085 -0.003 -0.041 0.968 0.035 0.997
## ves1proc 1.443 0.680 1.519 0.743 -0.111 -0.827 0.409 0.043 0.966
Finally, let’s estimate the ATE causal effect on the weighted dataset:
# Estimate ATE
design.lindnerATE <- svydesign(ids=~1,
weights = ~w_ATE,
data=lindner.overlap)
fit_itpw_ATE <- svyglm(sixMonthDeath~abcix,
family=binomial,
design=design.lindnerATE)
tbl_regression(fit_itpw_ATE,exponentiate = T)
Characteristic | OR1 | 95% CI1 | p-value |
---|---|---|---|
abcix | 0.17 | 0.06, 0.49 | <0.001 |
1 OR = Odds Ratio, CI = Confidence Interval |
And the ATT:
design.lindnerATT <- svydesign(ids=~1,
weights = ~w_ATT,
data=lindner.overlap)
fit_iptw_ATT <- svyglm(sixMonthDeath~abcix,
family=binomial,
design=design.lindnerATT)
## Warning in eval(family$initialize): non-integer #successes in a binomial glm!
tbl_regression(fit_iptw_ATT,exponentiate = T)
Characteristic | OR1 | 95% CI1 | p-value |
---|---|---|---|
abcix | 0.15 | 0.05, 0.46 | <0.001 |
1 OR = Odds Ratio, CI = Confidence Interval |
Also here it is possible to estimate the 95% CI using boostrap methods (that are in general more robust).
results <- data.frame(ATE=rep(NA,4),ATT=rep(NA,4))
f_PSadj <- function(data, indices,outcome.formula) {
d <- data[indices,] # allows boot to select sample
# estimation of ps
m1<-glm(abcix~
acutemi+
ejecfrac+
ves1proc+
stent+
diabetic+
female,
data=d,
family = binomial)
d$ps<-fitted.values(m1)
# overlap
d %<>% mutate(overlap=ifelse(ps>=min(ps[abcix==1]) &
ps<=max(ps[abcix==0]),1,0))
# outcome model
m2<-glm(outcome.formula,data=d,family="binomial",subset = overlap==1)
if(!m2$converged) print("Model did not converged")
d$predY0<-m2$family$linkinv(coef(m2)[1]+coef(m2)[3]*d$ps)
d$predY1<-m2$family$linkinv(coef(m2)[1]+coef(m2)[2]+coef(m2)[3]*d$ps)
Y1<-mean(d$predY1)
Y0<-mean(d$predY0)
Y1_1<-mean(d$predY1[lindner$abcix==1])
Y0_1<-mean(d$predY0[lindner$abcix==1])
ATE_PSadj<-(Y1/(1-Y1))/(Y0/(1-Y0))
ATT_PSadj<-(Y1_1/(1-Y1_1))/(Y0_1/(1-Y0_1))
return(c(ATE_PSadj,ATT_PSadj))
}
res_boot <- function(obj.boot,type="percent",digits=3){
suppressWarnings({
orig <- round(obj.boot$t0,digits)
ciATE <- paste(round(boot.ci(obj.boot,index=1)[[type]][4:5],digits),collapse = "-")
ciATT <- paste(round(boot.ci(obj.boot,index=2)[[type]][4:5],digits),collapse = "-")
})
res <- paste(orig,rbind(ciATE,ciATT),sep="(")
res.u <- paste(res,rep(")",2),sep = "")
return(res.u)
}
boot.out <- boot(data=lindner, statistic=f_PSadj,R=1000,outcome.formula=fit.out$formula)
print(boot.out)
##
## ORDINARY NONPARAMETRIC BOOTSTRAP
##
##
## Call:
## boot(data = lindner, statistic = f_PSadj, R = 1000, outcome.formula = fit.out$formula)
##
##
## Bootstrap Statistics :
## original bias std. error
## t1* 0.2363187 0.02393070 0.1318934
## t2* 0.2363050 0.02394347 0.1318932
# Get 95% confidence interval
results$ATE[1]<-res_boot(boot.out,digits=4)[1]
results$ATT[1]<-res_boot(boot.out,digits=4)[2]
results
## ATE ATT
## 1 0.2363(0.0929-0.5899) 0.2363(0.0929-0.5899)
## 2 <NA> <NA>
## 3 <NA> <NA>
## 4 <NA> <NA>
Very often with the stratification method there are many problems of convergence of the regression algorithm, since in the strata we have very few events !
f_PSstrat <- function(data, indices) {
d <- data[indices,] # allows boot to select sample
m1<-glm(abcix~
acutemi+
ejecfrac+
ves1proc+
stent+
diabetic+
female,
data=d,family = binomial)
d$ps<-fitted.values(m1)
quart_PS<-quantile(d$ps,c(0,0.33,0.66,1))
d$strata<-cut(d$ps, quart_PS, labels=c(1:3))
for (i in 1:3){
ms<-glm(sixMonthDeath~abcix,data=d,subset = strata==i,family="binomial")
beta.treat[i]<-coef(ms)[2]
}
ATE <- exp(sum(beta.treat*nstrata)/nrow(lindner))
ATT <- exp(sum(beta.treat*treated.strata)/sum(treated.strata))
return(c(ATE,ATT))
}
boot.out4 <- boot(data=lindner, statistic=f_PSstrat,R=1000)
# Get 95% confidence interval
results$ATE[2]<-res_boot(boot.out4,digits=3)[1]
results$ATT[2]<-res_boot(boot.out4,digits=3)[2]
We now estimate the robust standard errors related to the estimate on the matched dataset:
cov <- vcovHC(fit.out4, type = "HC0")
std.err <- sqrt(diag(cov))
q.val <- qnorm(0.975)
r <- cbind(
Estimate = coef(fit.out4)
, "Robust SE" = std.err
, z = (coef(fit.out4)/std.err)
, "Pr(>|z|) "= 2 * pnorm(abs(coef(fit.out4)/std.err), lower.tail = FALSE)
, LL = coef(fit.out4) - q.val * std.err
, UL = coef(fit.out4) + q.val * std.err
)
#Exponential to get the OR
results$ATT[3]<- paste0(round(exp(r[2,1]),4),"(",round(exp(r[2,5]),4),"-",round(exp(r[2,6]),4),")")
f_IPTW <- function(data, indices) {
d <- data[indices,] # allows boot to select sample
# estimation of ps
m1<-glm(abcix~
acutemi+
ejecfrac+
ves1proc+
stent+
diabetic+
female,
data=d,
family = binomial)
d$ps<-fitted.values(m1)
# overlap
d %<>% mutate(overlap=ifelse(ps>=min(ps[abcix==1]) &
ps<=max(ps[abcix==0]),1,0)) %>%
filter(overlap==1)
# Definition of weights for ATE
d %<>%mutate(w_ATE=case_when(abcix==1~1/ps,
abcix==0~1/(1-ps)))
#Definition of weights for ATT
d %<>%mutate(w_ATT=case_when(abcix==1~1,
abcix==0~ps/(1-ps)))
# Estimate ATE
design.lindnerATE <- svydesign(ids=~1,
weights = ~w_ATE,
data=d)
fit_itpw_ATE <- svyglm(sixMonthDeath~abcix,
family=binomial,
design=design.lindnerATE)
# Estimate ATT
design.lindnerATT <- svydesign(ids=~1,
weights = ~w_ATT,
data=d)
fit_iptw_ATT <- svyglm(sixMonthDeath~abcix,
family=binomial,
design=design.lindnerATT)
ATE <- exp(fit_itpw_ATE$coefficients[2])
ATT <- exp(fit_iptw_ATT$coefficients[2])
return(c(ATE,ATT))
}
boot.out5 <- boot(data=lindner, statistic=f_IPTW,R=1000)
results$ATE[4]<-res_boot(boot.out5,digits=4)[1]
results$ATT[4]<-res_boot(boot.out5,digits=4)[2]
# recall the crude OR from the original dataset
tbl_regression(fit.crude,exponentiate=T)
Characteristic | OR1 | 95% CI1 | p-value |
---|---|---|---|
abcix | 0.30 | 0.13, 0.66 | 0.003 |
1 OR = Odds Ratio, CI = Confidence Interval |
# PS results
row.names(results) <- c("PS Adj Linear","Stratification","Matching","IPTW")
results
## ATE ATT
## PS Adj Linear 0.2363(0.0929-0.5899) 0.2363(0.0929-0.5899)
## Stratification 0.196(0-43.913) 0.202(0.001-35.486)
## Matching <NA> 0.2116(0.1046-0.4283)
## IPTW 0.1749(0.0592-0.6322) 0.1525(0.0474-0.6406)
We can observe that using the propensity score based methods we obtain a stronger estimate of the treatment effect with respect to the crude odds ratio. We can also observe that stratification produce very unstable results due to the low sample size in the different strata. In conclusion, adjusting for confounders was important in this context !
# Function for graphical diagnostic to check balance
balance.Strat <- function(x){
if(length(unique(lindner[,x]))<10){
cat.psa(categorical = lindner[,x],
treatment = lindner$abcix,
strata = lindner$strata,
ylab="Relative Frequency")
title(x)
}
else{
box.psa(continuous = lindner[,x],
treatment = lindner$abcix,
strata = lindner$strata)
title(x)
}
}
var <- lindner %>%dplyr::select(stent,
female,
diabetic,
acutemi,
ejecfrac,
ves1proc) %>% colnames()
balance.Strat(var[1])
balance.Strat(var[2])
balance.Strat(var[3])
balance.Strat(var[4])
balance.Strat(var[5])
## Warning in xy.coords(x, y): NAs introduced by coercion
## Warning in xy.coords(x, y): NAs introduced by coercion
balance.Strat(var[6])
There are many R libraries that can be used to estimate weights see for example: https://cran.r-project.org/web/packages/PSweight/index.html This library enables the estimation and inference of average causal effects with binary and multiple treatments using overlap weights (ATO), inverse probability of treatment weights (ATE), average treatment effect among the treated weights (ATT), matching weights (ATM) and entropy weights (ATEN), with and without propensity score trimming. Reference: https://arxiv.org/pdf/2010.08893
In epidemiological data, most of the outcomes are often binary or dichotomous. For example, in the investigation of the cause of a disease, the status of the outcome, the disease, is diseased vs non-diseased. For a mortality study, the outcome is usually died or survived. For a continuous variable such as weight or height, we can think that the single representative number for the population or sample is the mean or median. For dichotomous data, the representative number is the proportion or percentage of one type of the outcome. For example, as you should remember from Block 1, prevalence is the proportion of the population with the disease of interest. Case-fatality is the proportion of deaths among the people with the disease. The other related term is probability. Proportion is a simple straightforward term. Probability denotes the likeliness, which is more theoretical (assuming a random mechanism that generate the observed data). In the case of a dichotomous variable, the proportion is used as the estimated probability, for example assuming a binomial random variable behaviour. For computation, the outcome is often represented with 1 and 0 otherwise. The prevalence is then the mean of diseased values among the study sample. Assuming the binomial mechanism, the standard regression model is the logistic regression, using the log(odds) as a dependent variable in the model. If P is the probability of having a disease, 1-P is probability of not having the disease. The odds is thus P/(1-P). As you remind, the relationship between probability and odds, mainly log(odds) can be plotted as follows:
p <- seq(from=0, to=1, by=.01)
odds <- p/(1-p)
plot(log(odds), p, type="l", col="blue", ylab="Probability",
main="Relationship between log odds and probability", las=1)
abline(h=.5)
abline(v=0)
Being on a linear and well balanced scale, the logit is a more appropriate scale for regression model for a binary outcome than the probability itself.
Similarly to the conversion between probabilities and odds, we can devise functions that convert both ways between probability and log-odds:
logit <- function(p) log(p/(1-p))
tigol <- function(t) 1/(1+exp(-t))
(tigol is just the letters of logit reversed).
Modelling logit(Y|X) ~ \(\beta\)X is the general form of logistic regression. It means that we are modelling the logit of Y given X (or conditioning on X), where X denotes one or more independent variables.
Suppose there are independent or exposure variables: X1 and X2. \(\beta\)X would be \(\beta_{0}\)+ \(\beta_{1}\)X1 + \(\beta_{2}\)X2, where \(\beta_{0}\) is the intercept.
The X can be age, sex, and other prognostic variables. In the explanatory/causal setting, among these X variables, one specific X is usually the focus of the study. Others are potential confounders and covariates.
Using logistic regression it turns out that the probability of observing Y given X Pr(Y|X) is equal to exp\(\beta\)X/(1 + exp\(\beta\)X). Hence, logistic regression is often used to compute the probability of an outcome under a given set of exposures. For example, prediction of probability of getting a disease under a given set of age, sex, and behaviour groups, etc.
The dataset Decay is a simple dataset containing two variables: ‘decay’, which is binary and ‘strep’, which is a continuous variable.
data(Decay)
summary(Decay)
## decay strep
## Min. :0.0000 Min. : 0.50
## 1st Qu.:0.0000 1st Qu.: 43.00
## Median :1.0000 Median :105.00
## Mean :0.6307 Mean : 95.25
## 3rd Qu.:1.0000 3rd Qu.:150.00
## Max. :1.0000 Max. :152.50
The outcome variable is ‘decay’, which indicates whether a person has at least one decayed tooth (1) or not (0). The exposure variable is ‘strep’, the number of colony forming units (CFU) of streptococci, a group of bacteria suspected to cause tooth decay. The prevalence of having decayed teeth is equal to the mean of the ‘decay’ variable, i.e. 0.63. To look at the ‘strep’ variable type:
summary(Decay$strep)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.50 43.00 105.00 95.25 150.00 152.50
hist(Decay$strep)
The plot shows that the vast majority have the value at about 150. Since the natural distribution of bacteria is logarithmic, a transformed variable is created and used as the independent variable. Let’s now estimate an univariable logistic regression model:
Decay$log10.strep <- log10(Decay$strep)
glm0 <- glm(decay~log10.strep, family=binomial, data=Decay)
Ask for the summary of the estimated model:
summary(glm0)
##
## Call:
## glm(formula = decay ~ log10.strep, family = binomial, data = Decay)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.5539 0.5184 -4.927 8.36e-07 ***
## log10.strep 1.6808 0.2764 6.082 1.19e-09 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 574.27 on 435 degrees of freedom
## Residual deviance: 531.83 on 434 degrees of freedom
## AIC: 535.83
##
## Number of Fisher Scoring iterations: 4
Both the coefficients of the intercept and ‘log10.strep’ are statistically significant. Pr(>|z|) for ‘log10.strep’ is the P value from Wald’s test. This tests whether the coefficient, 1.681, is significantly different from 0. In this case it is. The estimated intercept is -2.554. This means that when log10.strep is 0 (or strep equals 1 CFU), the logit of having at least a decayed tooth is -2.55. We can then calculate the related baseline odds and probability.
exp(-2.554) -> baseline.odds
baseline.odds
## [1] 0.07776996
baseline.odds/(1+baseline.odds) -> baseline.prob
baseline.prob
## [1] 0.07215822
There is an odds of 0.077 or a probability of 7.2% of having at least one decayed tooth if the number of CFU of the mutan strep is at 1 CFU.
The coefficient of log10.strep is 1.681. For every unit increment of log10(strep), or an increment of 10 CFU, the logit will increase by 1.681.
This increment of logit is constant but not the increment of probability because the latter is not on a linear scale.
The probability at each point of CFU is computed by replacing both coefficients obtained from the model. For example, at 100 CFU, the probability is:
prob.100 <- coef(glm0)[1] + log10(100)*coef(glm0)[2]
prob.100
## (Intercept)
## 0.8078015
To see the relationship for the whole dataset:
plot(Decay$log10.strep, fitted(glm0), ylim=c(0,0.80))
A logistic nature of the curve is partly demonstrated. To make it clearer, the ranges of X and Y axes are both expanded to allow a more extensive curve fitting.
plot(Decay$log10.strep, fitted(glm0), xlim = c(-2,4),
ylim=c(0,1))
Another vector of the same name ‘log10.strep’ is created in the form of a data frame for plotting a fitted line on the same graph.
newdata <- data.frame(log10.strep=seq(from=-2, to=4, by=.01))
predicted.line <- predict.glm(glm0,newdata,type="response")
The values for predicted line on the above command must be on the same scale as the ‘response’ variable. Since the response is either 0 or 1, the predicted line would be in between, ie. the predicted probability for each value of log10(strep).
plot(Decay$log10.strep, fitted(glm0), xlim = c(-2,4),
ylim=c(0,1), xlab=" ", ylab=" ", xaxt="n", las=1)
lines(newdata$log10.strep, predicted.line, col="blue")
axis(side=1, at=-2:4, labels=as.character(10^(-2:4)))
title(main="Relationship between mutan streptococci \n and probability of tooth decay", xlab="CFU",
ylab="Probability of having decayed teeth")
The above example of caries data has a continuous variable ‘log10.strep’ as the key independent variable.
In most epidemiological datasets, the independent variables are often categorical.
We will examine an example coming from an outbreak investigation.
On 25 August 1990, the local health officer in Supan Buri Province of Thailand reported the occurrence of an outbreak of acute gastrointestinal illness on a national handicapped sports day. Epidemiologists went to investigate. The dataset is called Outbreak. Most variable names are self explanatory. Variables are coded as 0 = no, 1 = yes and 9 = missing/unknown for three food items consumed by participants: ‘beefcurry’ (beef curry), ‘saltegg’ (salted eggs) and ‘water’.
Also on the menu were eclairs, a finger-shaped iced cake of choux pastry filled with cream. This variable records the number of pieces eaten by each participant. Missing values were coded as follows: 88 = “ate but do not remember how much”, while code 90 represents totally missing information.
Some participants experienced gastrointestinal symptoms, such as: nausea, vomiting, abdominal pain and diarrhea. The ages of each participant are recorded in years with 99 representing a missing value.
The variables ‘exptime’ and ‘onset’ are the exposure and onset times, which are in character format.
Let’s look at the data.
data(Outbreak)
summary(Outbreak)
## id sex age exptime
## Min. : 1.0 Min. :0.000 Min. : 1.00 Length:1094
## 1st Qu.: 274.2 1st Qu.:0.000 1st Qu.:14.00 Class :AsIs
## Median : 547.5 Median :1.000 Median :18.00 Mode :character
## Mean : 547.5 Mean :0.659 Mean :23.69
## 3rd Qu.: 820.8 3rd Qu.:1.000 3rd Qu.:24.00
## Max. :1094.0 Max. :1.000 Max. :99.00
## beefcurry saltegg eclair water
## Min. :0.0000 Min. :0.000 Min. : 0.00 Min. :0.000
## 1st Qu.:1.0000 1st Qu.:1.000 1st Qu.: 0.00 1st Qu.:1.000
## Median :1.0000 Median :1.000 Median : 2.00 Median :1.000
## Mean :0.9534 Mean :0.957 Mean :11.48 Mean :1.021
## 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.: 2.00 3rd Qu.:1.000
## Max. :9.0000 Max. :9.000 Max. :90.00 Max. :9.000
## onset nausea vomiting abdpain
## Length:1094 Min. :0.0000 Min. :0.0000 Min. :0.0000
## Class :AsIs 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Mode :character Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.3985 Mean :0.3793 Mean :0.3501
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## diarrhea
## Min. :0.0000
## 1st Qu.:0.0000
## Median :0.0000
## Mean :0.2148
## 3rd Qu.:0.0000
## Max. :1.0000
We will first of all define the cases. It was agreed among the investigators that a case should be defined as a person who had any of the four symptoms: ‘nausea’, ‘vomiting’, ‘abdpain’ or ‘diarrhea’. A case can then by computed as follows:
Outbreak$case <- (Outbreak$nausea==1)|(Outbreak$vomiting==1)|(Outbreak$abdpain==1)|(Outbreak$diarrhea==1)
The variable ‘case’ is now incorporated into the data.
summary(Outbreak)
## id sex age exptime
## Min. : 1.0 Min. :0.000 Min. : 1.00 Length:1094
## 1st Qu.: 274.2 1st Qu.:0.000 1st Qu.:14.00 Class :AsIs
## Median : 547.5 Median :1.000 Median :18.00 Mode :character
## Mean : 547.5 Mean :0.659 Mean :23.69
## 3rd Qu.: 820.8 3rd Qu.:1.000 3rd Qu.:24.00
## Max. :1094.0 Max. :1.000 Max. :99.00
## beefcurry saltegg eclair water
## Min. :0.0000 Min. :0.000 Min. : 0.00 Min. :0.000
## 1st Qu.:1.0000 1st Qu.:1.000 1st Qu.: 0.00 1st Qu.:1.000
## Median :1.0000 Median :1.000 Median : 2.00 Median :1.000
## Mean :0.9534 Mean :0.957 Mean :11.48 Mean :1.021
## 3rd Qu.:1.0000 3rd Qu.:1.000 3rd Qu.: 2.00 3rd Qu.:1.000
## Max. :9.0000 Max. :9.000 Max. :90.00 Max. :9.000
## onset nausea vomiting abdpain
## Length:1094 Min. :0.0000 Min. :0.0000 Min. :0.0000
## Class :AsIs 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Mode :character Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.3985 Mean :0.3793 Mean :0.3501
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
## diarrhea case
## Min. :0.0000 Mode :logical
## 1st Qu.:0.0000 FALSE:625
## Median :0.0000 TRUE :469
## Mean :0.2148
## 3rd Qu.:0.0000
## Max. :1.0000
We now recode missing values.We do not impute them, so subjects with missing values will be removed from the analyses.
Outbreak = Outbreak %>%
mutate(
age=na_if(age, 99),
beefcurry=na_if(beefcurry, 9),
saltegg=na_if(saltegg, 9),
water=na_if(water, 9),
eclair=na_if(eclair, 90))
summary(Outbreak)
## id sex age exptime
## Min. : 1.0 Min. :0.000 Min. : 1.00 Length:1094
## 1st Qu.: 274.2 1st Qu.:0.000 1st Qu.:14.00 Class :AsIs
## Median : 547.5 Median :1.000 Median :17.00 Mode :character
## Mean : 547.5 Mean :0.659 Mean :19.32
## 3rd Qu.: 820.8 3rd Qu.:1.000 3rd Qu.:22.00
## Max. :1094.0 Max. :1.000 Max. :58.00
## NA's :60
## beefcurry saltegg eclair water
## Min. :0.0000 Min. :0.0000 Min. : 0.000 Min. :0.000
## 1st Qu.:1.0000 1st Qu.:1.0000 1st Qu.: 0.000 1st Qu.:1.000
## Median :1.0000 Median :1.0000 Median : 2.000 Median :1.000
## Mean :0.9164 Mean :0.9201 Mean : 2.072 Mean :0.977
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.: 2.000 3rd Qu.:1.000
## Max. :1.0000 Max. :1.0000 Max. :80.000 Max. :1.000
## NA's :5 NA's :5 NA's :117 NA's :6
## onset nausea vomiting abdpain
## Length:1094 Min. :0.0000 Min. :0.0000 Min. :0.0000
## Class :AsIs 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000
## Mode :character Median :0.0000 Median :0.0000 Median :0.0000
## Mean :0.3985 Mean :0.3793 Mean :0.3501
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:1.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
##
## diarrhea case
## Min. :0.0000 Mode :logical
## 1st Qu.:0.0000 FALSE:625
## Median :0.0000 TRUE :469
## Mean :0.2148
## 3rd Qu.:0.0000
## Max. :1.0000
##
The three variables can also be changed to factors:
Outbreak= Outbreak %>%
mutate(
beefcurry=as.factor(beefcurry),
saltegg=as.factor(saltegg),
water=as.factor(water))
summary(Outbreak)
## id sex age exptime beefcurry
## Min. : 1.0 Min. :0.000 Min. : 1.00 Length:1094 0 : 91
## 1st Qu.: 274.2 1st Qu.:0.000 1st Qu.:14.00 Class :AsIs 1 :998
## Median : 547.5 Median :1.000 Median :17.00 Mode :character NA's: 5
## Mean : 547.5 Mean :0.659 Mean :19.32
## 3rd Qu.: 820.8 3rd Qu.:1.000 3rd Qu.:22.00
## Max. :1094.0 Max. :1.000 Max. :58.00
## NA's :60
## saltegg eclair water onset nausea
## 0 : 87 Min. : 0.000 0 : 25 Length:1094 Min. :0.0000
## 1 :1002 1st Qu.: 0.000 1 :1063 Class :AsIs 1st Qu.:0.0000
## NA's: 5 Median : 2.000 NA's: 6 Mode :character Median :0.0000
## Mean : 2.072 Mean :0.3985
## 3rd Qu.: 2.000 3rd Qu.:1.0000
## Max. :80.000 Max. :1.0000
## NA's :117
## vomiting abdpain diarrhea case
## Min. :0.0000 Min. :0.0000 Min. :0.0000 Mode :logical
## 1st Qu.:0.0000 1st Qu.:0.0000 1st Qu.:0.0000 FALSE:625
## Median :0.0000 Median :0.0000 Median :0.0000 TRUE :469
## Mean :0.3793 Mean :0.3501 Mean :0.2148
## 3rd Qu.:1.0000 3rd Qu.:1.0000 3rd Qu.:0.0000
## Max. :1.0000 Max. :1.0000 Max. :1.0000
##
All variables now look fine except ‘eclair’ which still contains the value 80 representing “ate but not remember how much”. We will analyse its relationship with ‘case’ by considering it as an ordered categorical variable.
tabpct(Outbreak$eclair, Outbreak$case)
##
## Original table
## Outbreak$case
## Outbreak$eclair FALSE TRUE Total
## 0 279 15 294
## 0.5 5 10 15
## 1 49 41 90
## 2 203 243 446
## 3 9 22 31
## 4 14 33 47
## 5 2 5 7
## 6 8 20 28
## 8 1 5 6
## 10 1 3 4
## 12 1 1 2
## 19 1 0 1
## 20 1 0 1
## 80 5 0 5
## Total 579 398 977
##
## Row percent
## Outbreak$case
## Outbreak$eclair FALSE TRUE Total
## 0 279 15 294
## (94.9) (5.1) (100)
## 0.5 5 10 15
## (33.3) (66.7) (100)
## 1 49 41 90
## (54.4) (45.6) (100)
## 2 203 243 446
## (45.5) (54.5) (100)
## 3 9 22 31
## (29) (71) (100)
## 4 14 33 47
## (29.8) (70.2) (100)
## 5 2 5 7
## (28.6) (71.4) (100)
## 6 8 20 28
## (28.6) (71.4) (100)
## 8 1 5 6
## (16.7) (83.3) (100)
## 10 1 3 4
## (25) (75) (100)
## 12 1 1 2
## (50) (50) (100)
## 19 1 0 1
## (100) (0) (100)
## 20 1 0 1
## (100) (0) (100)
## 80 5 0 5
## (100) (0) (100)
##
## Column percent
## Outbreak$case
## Outbreak$eclair FALSE % TRUE %
## 0 279 (48.2) 15 (3.8)
## 0.5 5 (0.9) 10 (2.5)
## 1 49 (8.5) 41 (10.3)
## 2 203 (35.1) 243 (61.1)
## 3 9 (1.6) 22 (5.5)
## 4 14 (2.4) 33 (8.3)
## 5 2 (0.3) 5 (1.3)
## 6 8 (1.4) 20 (5.0)
## 8 1 (0.2) 5 (1.3)
## 10 1 (0.2) 3 (0.8)
## 12 1 (0.2) 1 (0.3)
## 19 1 (0.2) 0 (0.0)
## 20 1 (0.2) 0 (0.0)
## 80 5 (0.9) 0 (0.0)
## Total 579 (100) 398 (100)
The width of the columns of the mosaic graph denotes the relative frequency of that category. The highest frequency is 2 pieces followed by 0 and 1 piece. The other numbers have relatively low frequencies; particularly the 5 records where ‘eclair’ was coded as 80.
There is a tendency of increasing red area or attack rate from left to right indicating that the risk was increased when more pieces of eclair were consumed.
We will use the distribution of these proportions to guide our grouping of eclair consumption.
The first column of zero consumption has a very low attack rate, therefore it should be a separate category. Only a few took half a piece and this could be combined with those who took only one piece.
Persons consuming 2 pieces should be kept as one category as their frequency is very high. Others who ate more than two pieces should be grouped into another category. Finally, those coded as ‘80’ will be dropped due to the unknown amount of consumption as well as its low frequency.
Remind that collapsing categories with low numbers help in increasing the power of the statistical estimate of the regression coefficients.
Outbreak$eclairgr <- cut(Outbreak$eclair, breaks = c(0, 0.4, 1, 2, 79),
include.lowest = TRUE, labels=c("0","1","2",">2"))
The argument ‘include.lowest=TRUE’ indicates that 0 eclair must be included in the lowest category.
tabpct(Outbreak$eclairgr, Outbreak$case)
##
## Original table
## Outbreak$case
## Outbreak$eclairgr FALSE TRUE Total
## 0 279 15 294
## 1 54 51 105
## 2 203 243 446
## >2 38 89 127
## Total 574 398 972
##
## Row percent
## Outbreak$case
## Outbreak$eclairgr FALSE TRUE Total
## 0 279 15 294
## (94.9) (5.1) (100)
## 1 54 51 105
## (51.4) (48.6) (100)
## 2 203 243 446
## (45.5) (54.5) (100)
## >2 38 89 127
## (29.9) (70.1) (100)
##
## Column percent
## Outbreak$case
## Outbreak$eclairgr FALSE % TRUE %
## 0 279 (48.6) 15 (3.8)
## 1 54 (9.4) 51 (12.8)
## 2 203 (35.4) 243 (61.1)
## >2 38 (6.6) 89 (22.4)
## Total 574 (100) 398 (100)
The attack rate or percentage of diseased in each category of exposure, as shown in the bracket of the column TRUE, increases from 5.1% among those who did not eat any eclairs to 70.1% among those heavy eaters of eclair.
The graph output is similar to the preceding one except that the groups now are more concise. We now have a continuous variable of ‘eclair’ and a categorical variable of ‘eclairgr’. The next step is to create a binary exposure for eclair.
Outbreak$eclair.eat <- Outbreak$eclair > 0
This binary exposure variable is now similar to the others, i.e. ‘beefcurry’, ‘saltegg’ and ‘water’.
We now model ‘case’ as the binary outcome variable and take ‘eclair.eat’ as the only explanatory variable:
glm0 <- glm(case ~ eclair.eat, family=binomial, data=Outbreak)
summary(glm0)
##
## Call:
## glm(formula = case ~ eclair.eat, family = binomial, data = Outbreak)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.923 0.265 -11.03 <2e-16 ***
## eclair.eatTRUE 3.167 0.276 11.47 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 1320.7 on 976 degrees of freedom
## Residual deviance: 1055.2 on 975 degrees of freedom
## (117 observations deleted due to missingness)
## AIC: 1059.2
##
## Number of Fisher Scoring iterations: 5
logistic.display(glm0)
##
## Logistic regression predicting case
##
## OR(95%CI) P(Wald's test) P(LR-test)
## eclair.eat 23.75 (13.82,40.79) < 0.001 < 0.001
##
## Log-likelihood = -527.6075
## No. of observations = 977
## AIC value = 1059.2149
The odds ratio from the logistic regression is derived from exponentiation of the estimate, i.e. 23.75 is obtained from:
exp(coef(summary(glm0))[2,1])
## [1] 23.746
The 95% confidence interval of the odds ratio is obtained from:
exp(coef(summary(glm0))[2,1] + c(-1,1) * 1.96 *coef(summary(glm0))[2,2])
## [1] 13.82402 40.78933
These values are close to simple calculation of the 2-by-2 table.
The output from logistic.display also contains the ‘LR-test’ (likelihood ratio test) result, which checks whether the likelihood of the given model, ‘glm0’, would be significantly different from the model without ‘eclair.eat’, which in this case would be the “null” model. For an independent variable with two levels, the LR-test does not add further important information because Wald’s test has already tested the hypothesis. When the independent variable has more than two levels, the LR-test is more important than Wald’s test as the following example demonstrates.
glm1 <- glm(case ~ eclairgr, family=binomial, data=Outbreak)
logistic.display(glm1)
##
## Logistic regression predicting case
##
## OR(95%CI) P(Wald's test) P(LR-test)
## eclairgr: ref.=0 < 0.001
## 1 17.57 (9.21,33.49) < 0.001
## 2 22.27 (12.82,38.66) < 0.001
## >2 43.56 (22.89,82.91) < 0.001
##
## Log-likelihood = -516.8236
## No. of observations = 972
## AIC value = 1041.6471
Interpreting Wald’s test alone, one would conclude that all levels of eclair eaten would be significant. However, this depends on the reference level. By default, R assumes that the first level of an independent factor is the reference level. If we relevel the reference level to be 2 pieces of eclair, Wald’s test gives a different impression.
Outbreak$eclairgr <- relevel(Outbreak$eclairgr, ref="2")
glm2 <- glm(case ~ eclairgr, family=binomial, data=Outbreak)
logistic.display(glm2)
##
## Logistic regression predicting case
##
## OR(95%CI) P(Wald's test) P(LR-test)
## eclairgr: ref.=2 < 0.001
## 0 0.04 (0.03,0.08) < 0.001
## 1 0.79 (0.52,1.21) 0.275
## >2 1.96 (1.28,2.99) 0.002
##
## Log-likelihood = -516.8236
## No. of observations = 972
## AIC value = 1041.6471
The results show that eating only one piece of eclair does not reduce the risk significantly compared to eating two pieces. While results from Wald’s test depend on the reference level of the explanatory variable, the LR-test is concerned only with the contribution of the variable as a whole and ignores the reference level.
Imagine that we want to have an idea of the presence of possible confounders with respect to the effect of eclair and we have not the possibility to talk with an expert, so that we can investigate relationship observing the data. In this case a possibility is to examine univariable effects and then building multivariable models paying attention to the possible change in the effects estimation for the variable of interest when introducing the other. This procedure is not considered as valid as defining a DAG a priori with the help of an expert, from a causal inference point of view,but it give some indications at least.
Therefore, try ‘saltegg’ now as the only explanatory variable:
glm3 <- glm(case ~ saltegg, family = binomial, data=Outbreak)
logistic.display(glm3)
##
## Logistic regression predicting case
##
## OR(95%CI) P(Wald's test) P(LR-test)
## saltegg: 1 vs 0 2.54 (1.53,4.22) < 0.001 < 0.001
##
## Log-likelihood = -736.998
## No. of observations = 1089
## AIC value = 1477.996
The odds ratio for ‘saltegg’ is statistically significant. The number of valid records is also higher than the model containing ‘eclairgr’.
Note: One should always be very careful when analysing data that contain missing values. Advanced methods to handle missing values are beyond the scope of this course and for reasons of simplicity are ignored here. Data scientists should be advised to deal with missing values properly prior to conducting their analysis.
To check whether the odds ratio of ‘saltegg’ is confounded by ‘eclairgr’, the two explanatory variables are put together in the next model:
glm4 <- glm(case ~ eclairgr + saltegg, family=binomial, data=Outbreak)
logistic.display(glm4, crude.p.value=TRUE)
##
## Logistic regression predicting case
##
## crude OR(95%CI) crude P value adj. OR(95%CI)
## eclairgr: ref.=2
## 0 0.04 (0.03,0.08) < 0.001 0.04 (0.03,0.08)
## 1 0.79 (0.52,1.21) 0.275 0.79 (0.51,1.21)
## >2 1.96 (1.28,2.99) 0.002 1.96 (1.28,2.99)
##
## saltegg: 1 vs 0 2.37 (1.41,3.99) 0.001 1.01 (0.53,1.93)
##
## P(Wald's test) P(LR-test)
## eclairgr: ref.=2 < 0.001
## 0 < 0.001
## 1 0.279
## >2 0.002
##
## saltegg: 1 vs 0 0.975 0.975
##
## Log-likelihood = -516.8231
## No. of observations = 972
## AIC value = 1043.6461
The odds ratios of the explanatory variables in glm4 are adjusted for each other. The crude odds ratios are exactly the same as from the previous models with only single variable.
The adjusted odds ratios of ‘eclairgr’ do not change suggesting that it is not confounded by ‘saltegg’, whereas the odds ratio of ‘saltegg’ is clearly changed towards unity, and now has a very large P value.
The difference between the adjusted odds ratio and the crude odds ratio is an indication that ‘saltegg’ is confounded by ‘eclairgr’, which could be considered as an independent risk factor.
Now that we have a model containing two explanatory variables, we can compare models ‘glm4’ and ‘glm2’ using the lrtest command.
lrtest(glm4, glm2)
## Likelihood ratio test for MLE method
## Chi-squared 1 d.f. = 0.0009809013 , P value = 0.9750149
The P value of 0.975 is the same as that from ‘P(LR-test)’ of ‘saltegg’ obtained from the preceding command. The test determines whether removal of ‘saltegg’ in a model would make a significant difference than if it were kept.
When there is more than one explanatory variable, ‘P(LR-test)’ from logistic.display is actually obtained from the lrtest command, which compares the current model against one in which the particular variable is removed, while keeping all remaining variables.
Let’s further add covariates in the model:
glm5 <- glm(case~eclairgr+saltegg+sex, family=binomial, data=Outbreak)
logistic.display(glm5)
##
## Logistic regression predicting case
##
## crude OR(95%CI) adj. OR(95%CI) P(Wald's test) P(LR-test)
## eclairgr: ref.=2 < 0.001
## 0 0.04 (0.03,0.08) 0.04 (0.02,0.07) < 0.001
## 1 0.79 (0.52,1.21) 0.75 (0.49,1.16) 0.2
## >2 1.96 (1.28,2.99) 1.82 (1.19,2.8) 0.006
##
## saltegg: 1 vs 0 2.37 (1.41,3.99) 0.92 (0.48,1.76) 0.807 0.808
##
## sex: 1 vs 0 1.58 (1.19,2.08) 1.85 (1.35,2.53) < 0.001 < 0.001
##
## Log-likelihood = -509.5181
## No. of observations = 972
## AIC value = 1031.0361
The third explanatory variable ‘sex’ is another independent risk factor. Since females are the reference level, males have an increased risk compared to females. This variable is not considered a confounder to either of the preceding variables because it has not substantially changed the odds ratios of any of them (from ‘glm4’). The reason for not being able to confound is its lack of association with either of the preceding explanatory variables. In other words, males and females were not different in terms of eating eclairs and salted eggs.
Load the required libraries
library(haven)
library(tidyverse)
library(magrittr)
Data comes from SHARE, the Survey of Health, Ageing and Retirement in Europe (https://share-eric.eu/data/). It is a multidisciplinary and cross-national panel database of data on health, socio-economic status and social and family networks of about 140000 individuals aged 50 or older (around 380000 interviews). SHARE covers 27 European countries and Israel.
The dataset we use here contains only part of the data the survey produced.
The research question is whether a stressful period could be
associated with the occurrence of muscular weakness in the over 50
population: therefore we can consider it from the pint of view of an
explanatory (causal) model, not a predictive one.
This question could be of interest since muscular weakness can be
considered a proxy of poor physical health. Muscular strength was
defined as the hand grip measured with a dynamo-meter and the weakness
was defined as having a grip strength below a threshold calculated
according to BMI and sex.
The exposure group was defined as subjects who underwent a stressful period between the first and the last time they had been interviewed; all other subject were defined as non-exposed.
Of note, an inclusion criteria was to have a normal hand strength at the start of the study. Once the subjects were divided in the exposed and non-exposed group, the hand grip measurement after two years from the exposure definition was used to define the outcome.
In this example we will consider the following variables:
load("Logistic Regression examples.RData")
Before starting to fit any regression model we have to properly code categorical variables. This is usually done in the initial data analysis (IDA) and the descriptive statistics phase.
The first step consists in identifying which variables are categorical. This may seem as an easy step but it isn’t always. If fact the choice for some variables (i.e. indexes and scales) is not obvious and it requires either knowledge of the data we are analysing or a bit more effort on the data modelling.
The important message is that we can’t assume that it is correct to model a variable as numerical only because it was of type numerical/integer in the dataset imported in R.
Once we have identified the variables that we think are categorical we can process them. We can have three cases that leads to a slightly different procedure according to the R type of the variable:
Binary variables can either be treated as numerical, usually dummy 0/1 variables, or as factors. It does not make any difference in terms of the results obtained from the model but we have to be aware of how they are coded for a correct interpretation of the model.
In general, the model will always have 1 parameter for a binary variable. However, we have to choose which level the parameter should represent by deciding which group to code with 1. The choice should be made according to how we want the model to be interpreted. For example, for the treatment variable, 0 is usually the placebo or the control treatment and 1 is usually the experimental treatment. Alternatively, in the case of the covariates/exposure variables, we could simply use a statistical criteria: for a better stability of the model it is always better to use the most frequent category as a reference.
For example, it makes more sense to consider as the reference level of paid_job subjects who have had a paid job. This will help with the stability of the model but also it will ease the interpretation of the results. Therefore, we create a new variable no_paid_job:
datiSHAREStress$no_paid_job <- ifelse(datiSHAREStress$paid_job==0,1,0)
In this case we must transform the variable in a factor, assign labels to the factor levels and choose a reference level. The reference level will be the group all the others levels will be compared to. In fact when a factor variable with c levels is added to a model, R internally transforms it in c-1 dummy variables and a model parameter will be estimated for each of them. Again, is important we choose the reference level. In this case we have ses which we have to transform into a factor
datiSHAREStress$ses <- as.factor(datiSHAREStress$ses)
We can now check how R has coded the variable:
str(datiSHAREStress$ses)
## Factor w/ 4 levels "1","2","3","4": 4 4 4 4 4 4 4 2 4 4 ...
contrasts(datiSHAREStress$ses)
## 2 3 4
## 1 0 0 0
## 2 1 0 0
## 3 0 1 0
## 4 0 0 1
By default, R uses the group 1 as the reference level. With the contrasts() function we can see how the variable will be parametrized in the regression model. In the rows we have the variable levels and on the columns the parameters generated for this variable.
It is also useful to assign more meaningful labels:
datiSHAREStress$ses=factor(datiSHAREStress$ses,
labels = c("Great Difficulties",
"Some Difficulties",
"Fairly Easily",
"Easily"))
Last but not least we can set the reference category of our choice. In this case we choose “Easily” which is the most frequent category.
table(datiSHAREStress$ses)
##
## Great Difficulties Some Difficulties Fairly Easily Easily
## 191 813 1997 3148
datiSHAREStress$ses=relevel(datiSHAREStress$ses,ref="Easily")
If the variable is of type character the steps are similar to the previous ones with the exception that we won’t need to explicitly set the labels. As a reminder, by default R uses as a reference level the first level in alphabetical order.
We skip this part here, since we already discussed how to describe a dataset in the various IDA examples. Remind that you should pay particular attention to outliers, missing values, continuous variable’s distribution shapes, rare categories in categorical ones!
A common method used in the analysis of health data for variables selection, is fitting many models with one covariate at the time. This is also a way to begin to explore the dataset, even if as discussed it is not the suggested method to select variables in the model. In this case, we can start with the exposure to stress, our variable of interest. This kind of analysis in the medical literature is often referred as “univariable analysis”.
So we start by fitting the model
fit_uni_stress <- glm(low_grip~stress,family = binomial,data=datiSHAREStress)
As a reminder, the glm function automatically uses the logit as link function unless we state otherwise.
The results of the model can be obtained with
summary(fit_uni_stress)
##
## Call:
## glm(formula = low_grip ~ stress, family = binomial, data = datiSHAREStress)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.5660 0.0501 -51.217 <2e-16 ***
## stress -0.5174 0.4204 -1.231 0.218
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3142.8 on 6148 degrees of freedom
## Residual deviance: 3141.1 on 6147 degrees of freedom
## AIC: 3145.1
##
## Number of Fisher Scoring iterations: 5
It seems that we can’t reject the null hypothesis for \(\beta_{stress}\).
We can also obtain the 95% Confidence Interval
confint.default(fit_uni_stress)
## 2.5 % 97.5 %
## (Intercept) -2.664221 -2.4678285
## stress -1.341440 0.3066133
We can do more univariable models to look for possible confounders (previoulsy discussed with an expert if possible..) in the exposure-outcome relationship.
fit_uni_age <- glm(low_grip~age,family = binomial,data=datiSHAREStress)
summary(fit_uni_age)
##
## Call:
## glm(formula = low_grip ~ age, family = binomial, data = datiSHAREStress)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.505544 0.419383 -22.67 <2e-16 ***
## age 0.101366 0.005822 17.41 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3142.8 on 6148 degrees of freedom
## Residual deviance: 2817.0 on 6147 degrees of freedom
## AIC: 2821
##
## Number of Fisher Scoring iterations: 6
fit_uni_job <- glm(low_grip~no_paid_job,family = binomial,data=datiSHAREStress)
summary(fit_uni_job)
##
## Call:
## glm(formula = low_grip ~ no_paid_job, family = binomial, data = datiSHAREStress)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.61992 0.05134 -51.030 < 2e-16 ***
## no_paid_job 1.13184 0.21544 5.254 1.49e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3142.8 on 6148 degrees of freedom
## Residual deviance: 3120.8 on 6147 degrees of freedom
## AIC: 3124.8
##
## Number of Fisher Scoring iterations: 5
fit_uni_sex <- glm(low_grip~female,family = binomial,data=datiSHAREStress)
summary(fit_uni_sex)
##
## Call:
## glm(formula = low_grip ~ female, family = binomial, data = datiSHAREStress)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.56532 0.07178 -35.738 <2e-16 ***
## female -0.01918 0.09955 -0.193 0.847
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3142.8 on 6148 degrees of freedom
## Residual deviance: 3142.8 on 6147 degrees of freedom
## AIC: 3146.8
##
## Number of Fisher Scoring iterations: 5
fit_uni_move <- glm(low_grip~house_move,family = binomial,data=datiSHAREStress)
summary(fit_uni_move)
##
## Call:
## glm(formula = low_grip ~ house_move, family = binomial, data = datiSHAREStress)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.3735 0.1307 -18.161 <2e-16 ***
## house_move -0.2329 0.1413 -1.648 0.0993 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3142.8 on 6148 degrees of freedom
## Residual deviance: 3140.2 on 6147 degrees of freedom
## AIC: 3144.2
##
## Number of Fisher Scoring iterations: 5
fit_uni_ses <- glm(low_grip~ses,family = binomial,data=datiSHAREStress)
summary(fit_uni_ses)
##
## Call:
## glm(formula = low_grip ~ ses, family = binomial, data = datiSHAREStress)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -2.69592 0.07324 -36.809 < 2e-16 ***
## sesGreat Difficulties 0.70745 0.23408 3.022 0.00251 **
## sesSome Difficulties 0.37973 0.14288 2.658 0.00787 **
## sesFairly Easily 0.11084 0.11422 0.970 0.33182
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3142.8 on 6148 degrees of freedom
## Residual deviance: 3129.9 on 6145 degrees of freedom
## AIC: 3137.9
##
## Number of Fisher Scoring iterations: 5
When we have covariates with many levels, such as ses, we can use the global Wald test which takes into account for multiple testing.
drop1(fit_uni_ses,test="Chisq")
## Single term deletions
##
## Model:
## low_grip ~ ses
## Df Deviance AIC LRT Pr(>Chi)
## <none> 3129.8 3137.8
## ses 3 3142.8 3144.8 12.988 0.004663 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
We can now fit a multivariable logistic model with the variables that
were found to be associated with the outcome in the previous analysis
and we suspect could act as confounders of the main exposure of
interest.
Since we have not the possibility do discuss with an expert, we decide
to include in the multivariable regression model all covariates with a
p-value < 0.1. This criterion could be not the best one, but is
widely applied in clinical and epidemiological research.
We obviously include in the model stress , since it is our exposure of interest.
We will treat the others variables are possible confounders in the model, and we will explore if significant associations are present with the exposure of interest. In principle, in the explanatory setting, we should start from a DAG and make the causal assumptions (in order to estimate a causal effect!!) but we do not have here the experts of the matter and we will limit ourselves in the end to a cautious interpretation of the estimated association.
fit_multi <- glm(low_grip~stress+no_paid_job+age+house_move+ses,family = binomial,data=datiSHAREStress)
summary(fit_multi)
##
## Call:
## glm(formula = low_grip ~ stress + no_paid_job + age + house_move +
## ses, family = binomial, data = datiSHAREStress)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.486409 0.448437 -21.154 < 2e-16 ***
## stress -0.149441 0.438194 -0.341 0.733075
## no_paid_job 0.408889 0.233225 1.753 0.079568 .
## age 0.101474 0.005924 17.129 < 2e-16 ***
## house_move -0.196669 0.147181 -1.336 0.181472
## sesGreat Difficulties 0.939131 0.249104 3.770 0.000163 ***
## sesSome Difficulties 0.489696 0.149577 3.274 0.001061 **
## sesFairly Easily 0.066057 0.118419 0.558 0.576962
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3142.8 on 6148 degrees of freedom
## Residual deviance: 2790.2 on 6141 degrees of freedom
## AIC: 2806.2
##
## Number of Fisher Scoring iterations: 6
This model estimates the coefficients of each of the variables independently from all the others. This is key concept of multivariable regression and it is what allows to remove confounding.
At a first sight it seems that our exposure of interest is not associated with the outcome.
Knowledge of the context from which the data comes from can be used to make hypothesis about possible interaction between confounders and the exposure.
In this case we want to test for an interaction between no_paid_job and stress. In other words we want to see if the association between the outcome and the exposure could vary for different levels of no_paid_job.
We then fit the model:
fit_multi2 <- glm(low_grip~stress*no_paid_job+house_move+age+ses,family = binomial,data=datiSHAREStress)
summary(fit_multi2)
##
## Call:
## glm(formula = low_grip ~ stress * no_paid_job + house_move +
## age + ses, family = binomial, data = datiSHAREStress)
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -9.544923 0.450526 -21.186 < 2e-16 ***
## stress -0.812176 0.603047 -1.347 0.178049
## no_paid_job 0.267759 0.243706 1.099 0.271901
## house_move -0.181885 0.147844 -1.230 0.218604
## age 0.102246 0.005946 17.197 < 2e-16 ***
## sesGreat Difficulties 0.927537 0.249828 3.713 0.000205 ***
## sesSome Difficulties 0.490455 0.149577 3.279 0.001042 **
## sesFairly Easily 0.060946 0.118563 0.514 0.607225
## stress:no_paid_job 3.189101 1.054703 3.024 0.002497 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 3142.8 on 6148 degrees of freedom
## Residual deviance: 2781.6 on 6140 degrees of freedom
## AIC: 2799.6
##
## Number of Fisher Scoring iterations: 6
As a reminder, using * introduces both the main effect and the interaction effect between variables in the model.
Alternatively, we can also fit the same model as follows:
fit_multi2 <- glm(low_grip~stress+no_paid_job+age+ses+stress:no_paid_job+house_move,family = binomial,data=datiSHAREStress)
This second way of adding an interaction term is useful in case we have multiple interactions involving the same variable.
It seems that a significant interaction is present between stress and the no_paid_job confounder. Note that in any case, the main effect of stress is still not significant in the model.
Now that we have fitted these two models, which one should we choose?
We can use a formal statistical test to compare the two nested models:
anova(fit_multi,fit_multi2,test="Chisq")
## Analysis of Deviance Table
##
## Model 1: low_grip ~ stress + no_paid_job + age + house_move + ses
## Model 2: low_grip ~ stress + no_paid_job + age + ses + stress:no_paid_job +
## house_move
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 6141 2790.2
## 2 6140 2781.6 1 8.5938 0.003373 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
The null hypothesis of this test is that the simpler model (i.e. the one with the smaller number of regression parameters) is no different from the more complex model. In this case we reject that hypothesis at a 95% confidence level and we would keep the model with the interaction.
Even if we are here estimating an explanatory model, this does not means that we are not at all interested in evaluating if the predicted probabilities are in line with the observed event rates. A basic procedure to evaluate this aspect in logistic regression is the Hosmer-Lemeshow test (see for details: https://onlinelibrary.wiley.com/doi/book/10.1002/0471722146). In brief, the null hypothesis of the test is that the predicted probabilities (splitted in ordinal groups) are in line with the observed rates of the events in each group.
library(generalhoslem)
## Loading required package: reshape
##
## Attaching package: 'reshape'
## The following object is masked from 'package:Matrix':
##
## expand
## The following object is masked from 'package:data.table':
##
## melt
## The following object is masked from 'package:lubridate':
##
## stamp
## The following object is masked from 'package:dplyr':
##
## rename
## The following objects are masked from 'package:tidyr':
##
## expand, smiths
logitgof(datiSHAREStress$low_grip,fitted(fit_multi2))
##
## Hosmer and Lemeshow test (binary model)
##
## data: datiSHAREStress$low_grip, fitted(fit_multi2)
## X-squared = 5.6329, df = 8, p-value = 0.6883
The model doesn’t show evidence for poor goodness of fit/calibration. Remind that when we are instead working with models used specifically for prediction, we should use more refined analyses such that the bootstrap overfitting-corrected calibration curves.
The next step is interpreting the results obtained from the model which is one of the most important part of the analysis to be discussed with experts. Of course we are keeping things simple here: in reality we would have had to consider many other possible candidate confounders for the outcome in order to properly control for confounding in this observational study, at least for the measured ones.
We obtain the OR, which is the measure of association selected when reporting and interpreting the results of a logistic model.
We start with the OR estimated for age:
exp(fit_multi2$coefficients)[4]
## age
## 1.107656
In general, OR>1 indicate an increase of the probability of the outcome whereas OR<1 a decrease. But the question is : with respect to what? We know that the OR is a relative measure of association so we must always ask ourselves what comparison we are making when we report an OR we have estimated. The above output does not mean anything by itself.
The first important thing to remember is that age is a continuous variable (measured in years) and in the model we have assumed it has a linear effect on the log-odds of the outcome. When we obtain the \(\hat{OR}\) by exponentiating the coefficient for age we obtain the OR estimates for an increase of one year of age. Being the \(\hat{OR}>1\), the probability of developing low grip increases as the age increases. Specifically, a subject has odds of having low grip 1.11 times greater with respect to a subject who is 1 year younger.
For continuous variables is important to report an OR for a difference that is relevant in the application at hand; a proper choice will depend of the scale the variable is measured and on the magnitude of the estimated coefficient.
For example here 1 year may not be so relevant from an epidemiological point of view, we may want to report an OR for a 5 years difference in age:
exp(fit_multi2$coefficients*5)[4]
## age
## 1.66734
What if we want to interpret the OR of a binary variable? Let’s consider the variable house_move. If it is coded numerical, then the coefficients refers always to the level 1. In this case 1 stands for having moved country in the past, so if the estimated coefficient was significant we could say that subjects who moved seem to have an odds lower by 17% (\(1-\hat{OR}\)) compared to people that have never moved countries in their life. However, when we look at the 95% CI we observe that it contains 1 so the association does not seem to be statistically significant.
#OR
exp(fit_multi2$coefficients)[8]
## house_move
## 0.8336973
exp(confint.default(fit_multi2))[8,]
## 2.5 % 97.5 %
## 0.6239695 1.1139185
The interpretation of the OR for categorical variables is similar: we will have to keep in mind that we are always comparing each of the variable levels to the reference level. Here, it seems (interestingly!) that the odds of health worsening is the same for subject with good or fairly good self-perceived socio-economic status while it increases as the economic situation gets worse.
# OR
exp(fit_multi2$coefficients)[5:7]
## sesGreat Difficulties sesSome Difficulties sesFairly Easily
## 2.528273 1.633059 1.062842
# 95% CI
exp(confint.default(fit_multi2))[5:7,]
## 2.5 % 97.5 %
## sesGreat Difficulties 1.549423 4.125513
## sesSome Difficulties 1.218097 2.189385
## sesFairly Easily 0.842456 1.340880
Finally, we will obtain the \(\hat{OR}s\) for stress and no_paid_job. Since there is an interaction involved, we have to be a bit more careful and we have to consider the two variables together. So first of all we have to know which is our reference group. In our case this is the group who have not experienced a stressful period and have done some paid job in their life (the reference levels for both variables involved in the interaction). Then, we can consider all combinations of the levels of the variables.
If we want the \(\hat{OR}\) for undergoing a stressful period and having had a paid job with respect to not having had a period of stress and having had a paid job we simply exponentiate the main estimated coefficient for stress (keeping at the same values the others covariates in the model):
exp(fit_multi2$coefficients)[2]
## stress
## 0.4438912
and we don’t reject the null hypothesis it is equal to 1 (no significant effect):
exp(confint.default(fit_multi2))[2,]
## 2.5 % 97.5 %
## 0.1361326 1.4474079
On the other hand, if we want the \(\hat{OR}\) for never having had a paid job and not being exposed to stress with respect to having had a paid job and not being exposed to stress we simply exponentiate the coefficient for the main effect of the variable no_paid_job:
exp(fit_multi2$coefficients)[3]
## no_paid_job
## 1.307032
Also this effect is not statistically significant which means that there does not seem to be a difference in the risk of low hand grip with regards to job experience in the non-exposed group (no stress) keeping fixed all others covariates.
exp(confint.default(fit_multi2))[3,]
## 2.5 % 97.5 %
## 0.8106679 2.1073152
Last, we can obtain the OR for subjects who were stressed and had never have a paid job with respect to the reference group (not experienced a stressful period and have done some paid job):
exp(fit_multi2$coefficients[9]+fit_multi2$coefficients[2]+fit_multi2$coefficients[3])
## stress:no_paid_job
## 14.07899
So in this subgroup it seems that the probability of developing low grip increases.
However, we should evaluate also the 95% CI for this \(\hat{OR}\).
One possibility to do it is to re-fit the multivariable logistic model using the rms R package:
library(rms)
dd <- datadist(datiSHAREStress)
## Warning in datadist(datiSHAREStress): intervallo is constant
options(datadist='dd') #define ranges of the covariates
fit_multi2b <- lrm(low_grip~stress*no_paid_job+age+ses+house_move,data=datiSHAREStress)
The code is quite similar, the main difference is that first we have to run the datadist function which stores the distribution summaries of the variables.
The print function returns a very similar output of the summary for the glm() object.
print(fit_multi2b)
## Logistic Regression Model
##
## lrm(formula = low_grip ~ stress * no_paid_job + age + ses + house_move,
## data = datiSHAREStress)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 6149 LR chi2 361.20 R2 0.143 C 0.752
## 0 5714 d.f. 8 R2(8,6149)0.056 Dxy 0.504
## 1 435 Pr(> chi2) <0.0001 R2(8,1212.7)0.253 gamma 0.505
## max |deriv| 2e-06 Brier 0.061 tau-a 0.066
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -9.5449 0.4505 -21.19 <0.0001
## stress -0.8122 0.6031 -1.35 0.1781
## no_paid_job 0.2678 0.2437 1.10 0.2719
## age 0.1022 0.0059 17.20 <0.0001
## ses=Great Difficulties 0.9275 0.2498 3.71 0.0002
## ses=Some Difficulties 0.4905 0.1496 3.28 0.0010
## ses=Fairly Easily 0.0609 0.1186 0.51 0.6072
## house_move -0.1819 0.1478 -1.23 0.2186
## stress * no_paid_job 3.1891 1.0547 3.02 0.0025
The summary on the other hand gives you the estimates for the coefficients as well as the ones for the OR. Of note by default for the continuous variables here, such as age, the function calculates the OR for the difference between the \(1^{st}\) and the \(3^{rd}\) quartile.
summary(fit_multi2b)
## Effects Response : low_grip
##
## Factor Low High Diff. Effect S.E. Lower 0.95
## stress 0 1.0 1.0 -0.812180 0.603110 -1.99430
## Odds Ratio 0 1.0 1.0 0.443890 NA 0.13612
## no_paid_job 0 1.0 1.0 0.267760 0.243710 -0.20990
## Odds Ratio 0 1.0 1.0 1.307000 NA 0.81067
## age 58 70.8 12.8 1.308700 0.076105 1.15960
## Odds Ratio 58 70.8 12.8 3.701500 NA 3.18860
## house_move 0 1.0 1.0 -0.181880 0.147840 -0.47165
## Odds Ratio 0 1.0 1.0 0.833700 NA 0.62397
## ses - Great Difficulties:Easily 1 2.0 NA 0.927540 0.249830 0.43788
## Odds Ratio 1 2.0 NA 2.528300 NA 1.54940
## ses - Some Difficulties:Easily 1 3.0 NA 0.490460 0.149580 0.19729
## Odds Ratio 1 3.0 NA 1.633100 NA 1.21810
## ses - Fairly Easily:Easily 1 4.0 NA 0.060946 0.118560 -0.17143
## Odds Ratio 1 4.0 NA 1.062800 NA 0.84246
## Upper 0.95
## 0.36990
## 1.44760
## 0.74541
## 2.10730
## 1.45790
## 4.29700
## 0.10788
## 1.11390
## 1.41720
## 4.12550
## 0.78362
## 2.18940
## 0.29333
## 1.34090
##
## Adjusted to: stress=0 no_paid_job=0
So far, the OR for stress and the job experience are the ones for the main effects. We can easily obtain here the OR for the interaction with the built-in contrast function:
c <- contrast(fit_multi2b,
list(stress=1,no_paid_job=1), list(stress=0,no_paid_job=0),type="average")
c
## Contrast S.E. Lower Upper Z Pr(>|z|)
## 11 2.644684 0.8328779 1.012273 4.277095 3.18 0.0015
##
## Confidence intervals are 0.95 individual intervals
#OR and 95% CI
exp(c$Contrast)
## 1
## 14.07899
exp(c$Lower)
## 1
## 2.75185
exp(c$Upper)
## 1
## 72.03085
So it seems that there is a significant interaction between stress and no paid job on the outcome.
Remind our initial research question: whether a stressful period could be associated with the occurrence of muscular weakness in the over 50 population. Based on our analysis it seems that is not the single event of a stessful period that increase the occurrence of muscular weakness, but it is the joint impact of having a stressful period coupled with no paid job that is associated with the occurrence of muscular weakness, adjusting for (or independently from) the specific age, house condition or socio-economic status.
In nature, an event usually takes place in a very small amount of time. At any given point of time, the probability of encountering such an event is small. Instead of the probability of the single event, now we focus on the frequency of the events as a density, which means incidence or ‘count’ of events over a period of time. (While time is one dimension, the same concept applies to the density of counts of small objects in a two-dimensional area or three-dimensional space). Moreover we can assume that one event is independent from another and that the densities in different units of time vary with a variance equal to the average density. We can approximate this kind of random process using the Poisson random variable. When the probability of having an event is affected by some factors, a model is needed to explain and predict the density. Variation among different strata of a population could be explained by the various combination of factors. Within each stratum (defined by covariates combination), the distribution of the events is assumed random. Poisson regression deals with outcome variables that are counts in nature (whole numbers or integers). Independent covariates have the same role as those encountered in linear and logistic regression. In epidemiology, Poisson regression is very often used for analysing grouped population based or cohort data, looking at incidence density among person-time contributed by subjects that share similar characteristics of interest. Poisson regression is one of 3 common regression models used in epidemiological studies. The other two that are more commonly used are linear regression and logistic regression, which have been already covered. The last family is survival methods, that we will explain in the last block 4.
There are two main assumptions for Poisson regression:
risk is homogeneous among person-times contributed by different subjects who share the same characteristics of interest (e.g. sex, age-group) and the same period.
asymptotically, or as the sample size becomes larger and larger, the mean of the counts is equal to the variance.
Straightforward linear regression methods (assuming constant variance and normal errors) are not appropriate for count data for four main reasons:
the linear model might lead to the prediction of negative counts
the variance of the response may increase with the mean
the errors will not be normally distributed
zero counts are difficult to handle in transformations
Moreover, in studies that invole the time dimension different subjects may have different person-times of exposure. Analysing risk factors while ignoring differences in person-times is wrong.
Poisson regression overcomes these limitations.
Note that in survival analysis using for example Cox regression (see block 4), the hazard ratio will be estimated for each covariate in the model, not the incidence density in each subgroup; in the Cox model the interest will be focused on the “how long until an event occurs - time to event -”, instead in the Poisson regression model the focus is on “how many events occur in given interval”.
The dataset Montana was extracted from an occupational cohort study conducted to test the association between respiratory deaths (outcome) and exposure to arsenic in the industry, after adjusting for various other risk factors/confounders. The main outcome variable is respdeath. This is the count of the number of deaths among personyrs or personyears of subjects in each category. The other variables are independent covariates including age group agegr, period of employment period, starting time of employment start and the level of exposure to arsenic during the study period arsenic (the exposure of interest). Read in the data first and examine the variables.
data(Montana)
summary(Montana)
## respdeath personyrs agegr period
## Min. : 0.000 Min. : 4.2 Min. :1.000 Min. :1.000
## 1st Qu.: 0.000 1st Qu.: 127.5 1st Qu.:2.000 1st Qu.:1.000
## Median : 1.000 Median : 335.1 Median :3.000 Median :2.000
## Mean : 2.421 Mean : 1096.4 Mean :2.605 Mean :2.404
## 3rd Qu.: 3.000 3rd Qu.: 925.7 3rd Qu.:4.000 3rd Qu.:3.000
## Max. :19.000 Max. :12451.3 Max. :4.000 Max. :4.000
## starting arsenic
## Min. :1.000 Min. :1.000
## 1st Qu.:1.000 1st Qu.:1.250
## Median :1.000 Median :2.000
## Mean :1.456 Mean :2.474
## 3rd Qu.:2.000 3rd Qu.:3.000
## Max. :2.000 Max. :4.000
The last four variables are classed as integers. We need to tell R to interpret them as categorical variables, or factors, and attach labels to each of the levels. This can be done using the factor command with a ‘labels’ argument included.
Montana$agegr <- factor(Montana$agegr, labels=c("40-49","50-59","60-69","70-79"))
Montana$period <- factor(Montana$period, labels=c("1938-1949", "1950-1959","1960-1969", "1970-1977"))
Montana$start <- factor(Montana$start, labels=c("pre-1925", "1925 & after"))
Montana$arsenic1 <- factor(Montana$arsenic, labels=c("<1 year", "1-4years","5-14 years", "15+ years"))
summary(Montana)
## respdeath personyrs agegr period starting
## Min. : 0.000 Min. : 4.2 40-49:24 1938-1949:30 Min. :1.000
## 1st Qu.: 0.000 1st Qu.: 127.5 50-59:28 1950-1959:32 1st Qu.:1.000
## Median : 1.000 Median : 335.1 60-69:31 1960-1969:28 Median :1.000
## Mean : 2.421 Mean : 1096.4 70-79:31 1970-1977:24 Mean :1.456
## 3rd Qu.: 3.000 3rd Qu.: 925.7 3rd Qu.:2.000
## Max. :19.000 Max. :12451.3 Max. :2.000
## arsenic start arsenic1
## Min. :1.000 pre-1925 :62 <1 year :29
## 1st Qu.:1.250 1925 & after:52 1-4years :29
## Median :2.000 5-14 years:29
## Mean :2.474 15+ years :27
## 3rd Qu.:3.000
## Max. :4.000
We keep the original arsenic variable unchanged for use later on.
Let us explore the person-years breakdown by age and period. Firstly, create a table for total person-years:
tapply(Montana$personyrs, list(Montana$period, Montana$agegr), sum) -> table.pyears
Carry out the same procedure for number of deaths, and compute the table of incidence per 10,000 person years for each cell.
tapply(Montana$respdeath, list(Montana$period, Montana$agegr), sum) -> table.deaths
table.inc10000 <- table.deaths/table.pyears*10000
table.inc10000
## 40-49 50-59 60-69 70-79
## 1938-1949 5.424700 17.13102 34.95107 26.53928
## 1950-1959 3.344638 23.47556 49.01961 64.82632
## 1960-1969 4.341516 20.49375 58.23803 55.06608
## 1970-1977 4.408685 14.77747 44.09949 80.81413
Now, create a time-series plot of the incidence:
plot.ts(table.inc10000, plot.type="single", xlab=" ",ylab="#/10,000 person-years", xaxt="n", col=c("black",
"blue","red","green"), lty=c(2,1,1,2), las=1)
points(rep(1:4,4), table.inc10000, pch=22, cex=table.pyears/sum(table.pyears) * 20)
title(main = "Incidence by age and period")
axis(side = 1, at = 1:4, labels = levels(Montana$period))
legend(3.2,40, legend=levels(Montana$agegr)[4:1], col=c("green","red", "blue", "black"), bg = "white", lty=c(2,1,1,2))
The above graph shows that the older age group is generally associated with a higher risk. On the other hand, the sample size (reflected by the size of the squares at each point) decreases with age. The possibility of a confounding effect of age on the exposure of interest can better be examined by using Poisson regression.
Let’s estimate a Poisson regression model taking into account only period as a covariate:
mode11 <- glm(respdeath ~ period, offset = log(personyrs),family = poisson, data=Montana)
summary(mode11)
##
## Call:
## glm(formula = respdeath ~ period, family = poisson, data = Montana,
## offset = log(personyrs))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -6.4331 0.1715 -37.511 <2e-16 ***
## period1950-1959 0.2365 0.2117 1.117 0.2638
## period1960-1969 0.3781 0.2001 1.889 0.0588 .
## period1970-1977 0.4830 0.2036 2.372 0.0177 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 376.02 on 113 degrees of freedom
## Residual deviance: 369.27 on 110 degrees of freedom
## AIC: 596.05
##
## Number of Fisher Scoring iterations: 6
The option offset = log(personyrs) allows the variable personyrs to be the denominator for the counts of respdeath. A logarithmic transformation is needed since, for a Poisson generalized linear model, the link function is the natural log, and the default link for the Poisson family is the log link.
Remind : an important criterion in the choice of a link function for various families of distributions is to ensure that the fitted values from the modelling stay within reasonable bounds. Specifying a log link (default for Poisson) ensures that the fitted counts are all greater than or equal to zero. For more details on default links for various families of distributions related to generalized linear modelling, see the help in R under help(family).
The first model above with period as the only independent variable suggests that the death rate increased with time. The model can be tested for goodness of fit and the checked whether the Poisson assumptions mentioned earlier have been violated.
To test the goodness of fit of the Poisson model, type:
poisgof(mode11)
## $results
## [1] "Goodness-of-fit test for Poisson assumption"
##
## $chisq
## [1] 369.2654
##
## $df
## [1] 110
##
## $p.value
## [1] 9.578443e-30
The component ‘$chisq’ is actually computed from the model deviance, a parameter reflecting the level of errors. A large chi-squared value with small degrees of freedom results in a significant violation of the Poisson assumption (p < 0.05). If only the P value is wanted, the command can be shortened.
poisgof(mode11)$p.value
## [1] 9.578443e-30
The P value is very small indicating a poor fit.
Note:It should be noted that this method works under the assumption of a large sample size. An alternative method is to a fit negative binomial regression model (but not covered in the slides!).
We now add the second independent variable ‘agegr’ to the model:
mode12 <- glm(respdeath~agegr+period, offset=log(personyrs), family = poisson, data=Montana)
AIC(mode12)
## [1] 396.6407
The AIC (Akaike Information Criterion) has decreased remarkably from ‘model1’ to ‘model2’ indicating a poor fit of the first model.
poisgof(mode12)$p.value
## [1] 0.0003381328
But ‘model2’ still violates the Poisson assumption.
mode13 <- glm(respdeath ~ agegr, offset = log(personyrs), family = poisson, data=Montana)
AIC(mode13)
## [1] 394.4699
poisgof(mode13)$p.value
## [1] 0.0003295142
Removal of ‘period’ further reduces the AIC but still violates the Poisson assumption to the same extent as the previous model. The next step is to add the exposure of interest: ‘arsenic1’.
mode14 <- glm(respdeath ~ agegr + arsenic1, offset=log(personyrs), family = poisson, data=Montana)
summary(mode14)
##
## Call:
## glm(formula = respdeath ~ agegr + arsenic1, family = poisson,
## data = Montana, offset = log(personyrs))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.9947 0.2237 -35.739 < 2e-16 ***
## agegr50-59 1.4624 0.2454 5.960 2.53e-09 ***
## agegr60-69 2.3502 0.2380 9.873 < 2e-16 ***
## agegr70-79 2.5987 0.2564 10.136 < 2e-16 ***
## arsenic11-4years 0.8041 0.1577 5.099 3.42e-07 ***
## arsenic15-14 years 0.5957 0.2059 2.893 0.00381 **
## arsenic115+ years 0.9976 0.1758 5.673 1.40e-08 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 376.02 on 113 degrees of freedom
## Residual deviance: 122.25 on 107 degrees of freedom
## AIC: 355.04
##
## Number of Fisher Scoring iterations: 5
poisgof(mode14)$p.value
## [1] 0.1486902
Fortunately, ‘model4’ has a much lower AIC than model3 and it now does not violate the assumption.
If we change the reference level for arsenic and we use 1-4 years vs others:
Montana$arsenic.b <- relevel(Montana$arsenic1,ref="1-4years")
mode15 <- glm(respdeath ~ agegr + arsenic.b, offset=log(personyrs), family = poisson, data=Montana)
summary(mode15)
##
## Call:
## glm(formula = respdeath ~ agegr + arsenic.b, family = poisson,
## data = Montana, offset = log(personyrs))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -7.1905 0.2466 -29.156 < 2e-16 ***
## agegr50-59 1.4624 0.2454 5.960 2.53e-09 ***
## agegr60-69 2.3502 0.2380 9.873 < 2e-16 ***
## agegr70-79 2.5987 0.2564 10.136 < 2e-16 ***
## arsenic.b<1 year -0.8041 0.1577 -5.099 3.42e-07 ***
## arsenic.b5-14 years -0.2085 0.2326 -0.896 0.37
## arsenic.b15+ years 0.1935 0.2071 0.935 0.35
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 376.02 on 113 degrees of freedom
## Residual deviance: 122.25 on 107 degrees of freedom
## AIC: 355.04
##
## Number of Fisher Scoring iterations: 5
It does not appear to be any increase in the risk of death from more than 4 years of exposure to arsenic so it may be worth combining it into just two levels:
Montana$arsenic2 <- Montana$arsenic1
levels(Montana$arsenic2) <- c("<1 year", rep("1+ years", 3))
model6 <- glm(respdeath ~ agegr + arsenic2,offset=log(personyrs), family=poisson, data=Montana)
summary(model6)
##
## Call:
## glm(formula = respdeath ~ agegr + arsenic2, family = poisson,
## data = Montana, offset = log(personyrs))
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -8.0086 0.2233 -35.859 < 2e-16 ***
## agegr50-59 1.4702 0.2453 5.994 2.04e-09 ***
## agegr60-69 2.3661 0.2372 9.976 < 2e-16 ***
## agegr70-79 2.6238 0.2548 10.297 < 2e-16 ***
## arsenic21+ years 0.8109 0.1210 6.699 2.09e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
## Null deviance: 376.02 on 113 degrees of freedom
## Residual deviance: 125.02 on 109 degrees of freedom
## AIC: 353.8
##
## Number of Fisher Scoring iterations: 5
At this stage, we would accept ‘model6’ as the final model, since it has the smallest AIC among all the models that we have tried. We conclude that exposure to arsenic for at least one year is associated with an increased risk for the disease by exp(0.8109) or 2.25 times with statistical significance, independently from age.
In the Poisson model, the outcome is a count. In the general linear model, the relationship between the values of the outcome (as measured in the data and predicted by the model in the fitted values) and the linear predictor is determined by the link function. This link function relates the mean value of the outcome to its linear predictor. By default, the link function for the Poisson distribution is the natural logarithm. With the offset being log(person-time), the value of the outcome becomes the log(incidence rate).
The matrix ‘table.inc10000’ (created previously) gives the crude incidence rate by age group and period. Each of the Poisson regression models above can be used to compute the predicted incidence rate when the variables in the model are given. For example, to compute the incidence rate from a population of 100,000 people aged between 40-49 years who were exposed to arsenic for less than one year using ‘model6’, type:
newdata <- as.data.frame(list(agegr="40-49",arsenic2="<1 year", personyrs=100000))
predict(model6, newdata, type="response")
## 1
## 33.25736
This population would have an estimated incidence rate of 33.26 per 100,000 person-years.
In this example, all subjects pool their follow-up times and this number is called ‘person time’, which is then used as the denominator for the event, resulting in ‘incidence rate’. Comparing the incidence rate among two groups of subjects by their exposure status is fairer than comparing the crude risks (considering population at baseline). The ratio between the incidence rates of two groups is called the incidence rate ratio (IRR), which is an improved form of the relative risk.
In ‘model6’, for example, we want to compute the incidence rate ratio between the subjects exposed to arsenic for one or more years against those exposed for less than one year.
The shorter way to obtain this IRR is to exponentiate the coefficient of the specific variable ‘arsenic’, which is the fifth coefficient in the model.
coef(model6)
## (Intercept) agegr50-59 agegr60-69 agegr70-79
## -8.0086495 1.4701505 2.3661111 2.6237532
## arsenic21+ years
## 0.8108698
exp(coef(model6)[5])
## arsenic21+ years
## 2.249864
The following steps explain how the 95% confidence interval for all variables can be obtained.
coeff <- coef(model6)
coeff.95ci <- cbind(coeff, confint(model6))
coeff.95ci
## coeff 2.5 % 97.5 %
## (Intercept) -8.0086495 -8.4780952 -7.598323
## agegr50-59 1.4701505 1.0094838 1.976026
## agegr60-69 2.3661111 1.9239329 2.858522
## agegr70-79 2.6237532 2.1411250 3.145495
## arsenic21+ years 0.8108698 0.5724818 1.047494
Note that confint(glm6) provides a 95% confidence interval for the model coefficients.
IRR.95ci <- round(exp(coeff.95ci), 1)[-1,]
The required values are obtained from exponentiating the last matrix with the first row or intercept removed. The display is rounded to 1 decimal place for better viewing. Then the matrix column is labelled and the 95% CI is displayed.
colnames(IRR.95ci) <- c("IRR", "lower95ci", "upper95ci")
IRR.95ci
## IRR lower95ci upper95ci
## agegr50-59 4.3 2.7 7.2
## agegr60-69 10.7 6.8 17.4
## agegr70-79 13.8 8.5 23.2
## arsenic21+ years 2.2 1.8 2.9
A simpler way is to use the command idr.display in epiDisplay (but they use the term IDR: Incidence Density Rate).
idr.display(model6, decimal=1)
##
## Poisson regression predicting respdeath with offset = log(personyrs)
##
## crude IDR(95%CI) adj. IDR(95%CI) P(Wald's test)
## agegr: ref.=40-49
## 50-59 4.5 (2.8,7.3) 4.3 (2.7,7) < 0.001
## 60-69 11.3 (7.1,17.9) 10.7 (6.7,17) < 0.001
## 70-79 14.5 (8.8,23.8) 13.8 (8.4,22.7) < 0.001
##
## arsenic2: 1+ years vs <1 year 2.5 (2,3.1) 2.2 (1.8,2.9) < 0.001
##
## P(LR-test)
## agegr: ref.=40-49 < 0.001
## 50-59
## 60-69
## 70-79
##
## arsenic2: 1+ years vs <1 year < 0.001
##
## Log-likelihood = -171.9
## No. of observations = 114
## AIC value = 353.8
The command idr.display gives results to 3 decimal places by default. This can easily be changed by the user.
Recall that for Poisson regression one of the assumptions for a valid model is that the mean and variance of the count variable are equal. The negative binomial distribution is a more generalized form of distribution used for ‘count’ response data, allowing for greater dispersion or variance of counts. In practice, it is quite common for the variance of the outcome to be larger than the mean. This is called overdispersion. If a count variable is overdispersed, Poisson regression underestimates the standard errors of the predictor variables. When overdispersion is evident, one solution is to specify that the errors have a negative binomial distribution. Negative binomial regression gives the same coefficients as those from Poisson regression but give larger standard errors. The interpretation of the results is the same as that from Poisson regression. Take an example of counts of water containers infested with mosquito larvae in a field survey. The data is contained in the dataset DHF99.
library(MASS)
data(DHF99)
summary(DHF99)
## houseid village education containers
## Min. : 1.00 Min. : 1.00 Primary :168 Min. : 0.0000
## 1st Qu.: 77.75 1st Qu.: 23.00 Secondary : 36 1st Qu.: 0.0000
## Median :154.50 Median : 51.00 High school: 34 Median : 0.0000
## Mean :174.27 Mean : 48.56 Bachelor : 25 Mean : 0.3512
## 3rd Qu.:269.25 3rd Qu.: 75.00 Other : 37 3rd Qu.: 0.0000
## Max. :385.00 Max. :105.00 Max. :11.0000
## NA's :1
## viltype
## rural:180
## urban: 72
## slum : 48
##
##
##
##
describeBy(DHF99$containers, group=DHF99$viltype)
##
## Descriptive statistics by group
## group: rural
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 179 0.49 1.25 0 0.2 0 0 11 11 4.58 29.1 0.09
## ------------------------------------------------------------
## group: urban
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 72 0.07 0.26 0 0 0 0 1 1 3.32 9.13 0.03
## ------------------------------------------------------------
## group: slum
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 48 0.25 0.53 0 0.15 0 0 2 2 1.93 2.84 0.08
The function for performing a negative binomial glm is glm.nb. This function is located in the MASS library. In addition, a very helpful function for selecting the best model based on the AIC value is the step function, which is located in the stats library (a default library loaded on start-up).
model.poisson <- step(glm(containers ~ education + viltype, family=poisson, data=DHF99))
## Start: AIC=513.3
## containers ~ education + viltype
##
## Df Deviance AIC
## - education 4 363.67 505.92
## <none> 363.05 513.30
## - viltype 2 397.54 543.79
##
## Step: AIC=505.92
## containers ~ viltype
##
## Df Deviance AIC
## <none> 363.67 505.92
## - viltype 2 398.52 536.76
model.nb <- step(glm.nb(containers ~ education + viltype, data=DHF99))
## Start: AIC=432.05
## containers ~ education + viltype
##
## Df Deviance AIC
## - education 4 156.64 424.23
## <none> 156.46 432.05
## - viltype 2 174.78 446.37
##
## Step: AIC=424.23
## containers ~ viltype
##
## Df Deviance AIC
## <none> 156.37 424.23
## - viltype 2 174.95 438.80
coef(model.poisson)
## (Intercept) viltypeurban viltypeslum
## -0.7100490 -1.9571792 -0.6762454
coef(model.nb)
## (Intercept) viltypeurban viltypeslum
## -0.7100490 -1.9571792 -0.6762454
Both models end up with only ‘viltype’ being selected. The coefficients are very similar. The Poisson model has significant overdispersion but not the negative binomial model.
poisgof(model.poisson)$p.value
## [1] 0.004387831
poisgof(model.nb)$p.value
## [1] 1
The AIC of the negative binomial model is also better (smaller) than that of the Poisson model.
model.poisson$aic
## [1] 505.9154
model.nb$aic
## [1] 426.2283
Finally, the main differences to be examined are their standard errors, the 95% confidence intervals and P values.
summary(model.poisson)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7100490 0.1066000 -6.660873 2.722059e-11
## viltypeurban -1.9571792 0.4597429 -4.257117 2.070800e-05
## viltypeslum -0.6762454 0.3077286 -2.197538 2.798202e-02
summary(model.nb)$coefficients
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -0.7100490 0.1731160 -4.101578 4.103414e-05
## viltypeurban -1.9571792 0.5255707 -3.723912 1.961591e-04
## viltypeslum -0.6762454 0.4274174 -1.582166 1.136116e-01
idr.display(model.poisson)
##
## Poisson regression predicting # infested vessels
##
## IDR(95%CI) P(Wald's test) P(LR-test)
## Village type: ref.=rural < 0.001
## urban 0.14 (0.06,0.35) < 0.001
## slum 0.51 (0.28,0.93) 0.028
##
## Log-likelihood = -249.9577
## No. of observations = 299
## AIC value = 505.9154
attach(DHF99)
idr.display(model.nb)
##
## Negative binomial regression predicting # infested vessels
##
## IDR(95%CI) P(Wald's test) P(LR-test)
## Village type: ref.=rural < 0.001
## urban 0.14 (0.05,0.4) < 0.001
## slum 0.51 (0.22,1.18) 0.114
##
## Log-likelihood = -209.1141
## No. of observations = 299
## AIC value = 426.2283
The standard errors from the negative binomial model are slightly larger than those from the Poisson model resulting in wider 95% confidence intervals and larger P values. From the Poisson regression, both urban community and slum area had a significantly lower risk (around 14% and a half reduction, respectively) for infestation. However, from the negative binomial regression, only the urban community had a significantly lower risk.