First of all, install and load the required libraries:
library(rms)
library(Hmisc)
library(gtsummary)
library(magrittr)
library(tidyverse)
library(haven)
library(ggplot2)
library(ggpubr)
library(pROC)
setwd("C:\\Users\\Giulia Barbati\\OneDrive - Università degli Studi di Trieste\\Stat_Learning_Epi\\Block 3\\R codes")
This practical contains a case study on developing, describing, and validating a regression prediction model.
In the original data set of the study there was also a follow up time to the events, but we did not consider here this aspect (just for didactic purposes we will consider it as another candidate predictor in the model).
The original dataset was composed by 506 patients with prostate cancer from the Byar and Green paper (see if interested: D. P. Byar and S. B. Green. The choice of treatment for cancer patients based on covariate information: Application to prostate cancer. Bulletin Cancer, Paris, 67:477-488, 1980).
These data come originally from a randomized trial (RCT) comparing four treatments for stage 3 and 4 prostate cancer, with almost equal numbers of patients on placebo and each of three doses of estrogen.
In this trial, larger doses of estrogen reduced the effect of prostate cancer but at the cost of increased risk of cardiovascular death. (Four patients had missing values on many variables. These patients have been excluded from consideration).
Our goal here is to predict the probability of incurring in cardiovascular-cerebrovascular death in the subset of patients that experienced this event or the prostate cancer death.
We will subset the original dataset of those patients dying from prostate cancer, heart or vascular disease, or cerebrovascular disease.
Here are the description of the variables contained in the dataset:
Some a priori information from clinicians: stage is defined by ap as well as X-ray results. Of the patients in stage 3, 92% have ap ≤ 0.8. Of those in stage 4, 93% have ap > 0.8.
Since stage can be predicted almost certainly from ap, we do not consider stage in the analyses. (You can check this relationship on the data).
This part is an exercise that you will do on your own: first of all, let’s load the original dataset:
prostate <- read_sav("prostate.sav")
prostate <-labelled::to_factor(prostate, labelled_only = TRUE, strict = TRUE, unclass = TRUE)
summary.data.frame(prostate)
## patno stage rx dtime
## Min. : 1.0 Min. :3.000 placebo :127 Min. : 0.00
## 1st Qu.:126.2 1st Qu.:3.000 0.2 mg estrogen:124 1st Qu.:14.25
## Median :251.5 Median :3.000 1.0 mg estrogen:126 Median :34.00
## Mean :251.7 Mean :3.424 5.0 mg estrogen:125 Mean :36.13
## 3rd Qu.:376.8 3rd Qu.:4.000 3rd Qu.:57.75
## Max. :506.0 Max. :4.000 Max. :76.00
##
## status age wt
## alive :148 Min. :48.00 Min. : 69.00
## dead - prostatic ca :130 1st Qu.:70.00 1st Qu.: 90.00
## dead - heart or vascular : 96 Median :73.00 Median : 98.00
## dead - cerebrovascular : 31 Mean :71.46 Mean : 99.03
## dead - other specific non-ca: 28 3rd Qu.:76.00 3rd Qu.:107.00
## dead - other ca : 25 Max. :89.00 Max. :152.00
## (Other) : 44 NA's :1 NA's :2
## pf hx sbp dbp
## normal activity :450 Min. :0.0000 Min. : 8.00 Min. : 4.000
## in bed < 50% daytime: 37 1st Qu.:0.0000 1st Qu.:13.00 1st Qu.: 7.000
## in bed > 50% daytime: 13 Median :0.0000 Median :14.00 Median : 8.000
## confined to bed : 2 Mean :0.4243 Mean :14.35 Mean : 8.149
## 3rd Qu.:1.0000 3rd Qu.:16.00 3rd Qu.: 9.000
## Max. :1.0000 Max. :30.00 Max. :18.000
##
## ekg hg sz
## normal :168 Min. : 5.899 Min. : 0.00
## heart strain :150 1st Qu.:12.299 1st Qu.: 5.00
## old MI : 75 Median :13.699 Median :11.00
## rhythmic disturb & electrolyte ch: 51 Mean :13.446 Mean :14.63
## heart block or conduction def : 26 3rd Qu.:14.699 3rd Qu.:21.00
## (Other) : 24 Max. :21.199 Max. :69.00
## NA's : 8 NA's :5
## sg ap bm
## Min. : 5.00 Min. : 0.100 Min. :0.0000
## 1st Qu.: 9.00 1st Qu.: 0.500 1st Qu.:0.0000
## Median :10.00 Median : 0.700 Median :0.0000
## Mean :10.31 Mean : 12.176 Mean :0.1633
## 3rd Qu.:11.00 3rd Qu.: 2.975 3rd Qu.:0.0000
## Max. :15.00 Max. :999.875 Max. :1.0000
## NA's :11
Here are the suggestions for the initial data analysis:
Do you think that some variables could be re-coded combining an infrequent category with the next category?
For example, pay attention to the frequencies of ekg and pf.
Hint from a discussion with clinicians: in the ekg variable we could collapse the normal and benign conditions.
For pf, you can combine in bed > 50% daytime with confined to bed in one category.
Then, here there is another question for you: based on the study design (RCT), do you expect to find significant statistical differences between baseline characteristics of groups defined by the levels of the treatment rx ? Build a descriptive table comparing the four groups and evaluating the global p value across groups. (Hint: for numerical variables you could use the Kruskal-Wallis test and for categorical ones the Chi-square test).
Give a look to the distribution’s shape of the continuous variables : do you think that some scale transformation (for example taking the logarithm) could help in improving the normality of some of them ?
Now: subset the original dataset selecting only the patients died for cardiovascular–cerebrovascular death or prostate cancer death. Create the outcome variable cvd identifying as “1” patients died for cardiovascular–cerebrovascular death and as “0” patients died for prostate cancer. (We will use then logistic regression for the prediction model).
The reduced dataset “prostate_rid” is created. Here we begin with the prediction model development part.
First of all, we upload the reduced dataset with the recoded variables that have been suggested in the IDA phase:
prostate_rid <- read_sav("prostate_rid.sav")
prostate_rid <-labelled::to_factor(prostate_rid, labelled_only = TRUE, strict = TRUE, unclass = TRUE)
summary.data.frame(prostate_rid)
## patno stage rx dtime
## Min. : 3.0 Min. :3.000 placebo :71 Min. : 0.00
## 1st Qu.:131.0 1st Qu.:3.000 0.2 mg estrogen:67 1st Qu.:11.00
## Median :265.0 Median :4.000 1.0 mg estrogen:49 Median :24.00
## Mean :260.5 Mean :3.525 5.0 mg estrogen:70 Mean :25.26
## 3rd Qu.:395.0 3rd Qu.:4.000 3rd Qu.:37.00
## Max. :506.0 Max. :4.000 Max. :71.00
##
## status age wt
## dead - prostatic ca :130 Min. :48.00 Min. : 71.00
## dead - heart or vascular: 96 1st Qu.:70.00 1st Qu.: 89.00
## dead - cerebrovascular : 31 Median :73.00 Median : 97.00
## alive : 0 Mean :71.55 Mean : 98.03
## dead - pulmonary embolus: 0 3rd Qu.:76.00 3rd Qu.:106.00
## dead - other ca : 0 Max. :88.00 Max. :152.00
## (Other) : 0
## pf hx sbp dbp
## normal activity :221 Min. :0.0000 Min. : 8.00 Min. : 4.000
## in bed < 50% daytime: 24 1st Qu.:0.0000 1st Qu.:13.00 1st Qu.: 7.000
## in bed > 50% daytime: 10 Median :1.0000 Median :14.00 Median : 8.000
## confined to bed : 2 Mean :0.5097 Mean :14.53 Mean : 8.144
## 3rd Qu.:1.0000 3rd Qu.:16.00 3rd Qu.: 9.000
## Max. :1.0000 Max. :24.00 Max. :13.000
##
## ekg hg sz
## heart strain :83 Min. : 5.899 Min. : 0.00
## normal :74 1st Qu.:12.000 1st Qu.: 6.00
## old MI :44 Median :13.600 Median :14.00
## rhythmic disturb & electrolyte ch:28 Mean :13.301 Mean :16.99
## benign :13 3rd Qu.:14.600 3rd Qu.:24.50
## (Other) :10 Max. :21.199 Max. :69.00
## NA's : 5 NA's :2
## sg ap bm ap80 ekg_rec
## Min. : 5.00 Min. : 0.10 Min. :0.0000 Min. :0.0000 bngn :87
## 1st Qu.: 9.00 1st Qu.: 0.60 1st Qu.:0.0000 1st Qu.:0.0000 rd&ec:28
## Median :11.00 Median : 1.10 Median :0.0000 Median :0.0000 hbocd:10
## Mean :10.75 Mean : 16.17 Mean :0.2451 Mean :0.4553 hrts :83
## 3rd Qu.:12.00 3rd Qu.: 7.00 3rd Qu.:0.0000 3rd Qu.:1.0000 MI :44
## Max. :15.00 Max. :999.88 Max. :1.0000 Max. :1.0000 NA's : 5
## NA's :5
## pf_rec status_num subset
## normal :221 dead - prostatic ca :130 Min. :1
## bed_under_50 : 24 dead - heart or vascular: 96 1st Qu.:1
## bed_over_50_conf: 12 dead - cerebrovascular : 31 Median :1
## alive : 0 Mean :1
## dead - pulmonary embolus: 0 3rd Qu.:1
## dead - other ca : 0 Max. :1
## (Other) : 0
## cvd log_ap
## Min. :0.0000 Min. :-2.30268
## 1st Qu.:0.0000 1st Qu.:-0.51087
## Median :0.0000 Median : 0.09518
## Mean :0.4942 Mean : 0.70723
## 3rd Qu.:1.0000 3rd Qu.: 1.94591
## Max. :1.0000 Max. : 6.90763
##
Here we define ranges of the variables contained in the dataset (this is a useful function both for data description and for the subsequent interpretation of the regression coefficients):
options(datadist='dd')
dd <-datadist(prostate_rid)
dd
## patno stage rx dtime status
## Low:effect 131.00000 3 <NA> 11 <NA>
## Adjust to 265.00000 3 placebo 24 dead - prostatic ca
## High:effect 395.00000 4 <NA> 37 <NA>
## Low:prediction 24.96109 3 placebo 1 alive
## High:prediction 482.03891 4 5.0 mg estrogen 58 dead - unknown cause
## Low 3.00000 3 placebo 0 alive
## High 506.00000 4 5.0 mg estrogen 71 dead - unknown cause
## age wt pf hx sbp dbp ekg
## Low:effect 70.00000 89.0000 <NA> 0 13 7 <NA>
## Adjust to 73.00000 97.0000 normal activity 0 14 8 heart strain
## High:effect 76.00000 106.0000 <NA> 1 16 9 <NA>
## Low:prediction 55.96109 76.0000 normal activity 0 11 6 normal
## High:prediction 80.00000 125.0389 confined to bed 1 19 10 recent MI
## Low 48.00000 71.0000 normal activity 0 8 4 normal
## High 88.00000 152.0000 confined to bed 1 24 13 recent MI
## hg sz sg ap bm ap80 ekg_rec
## Low:effect 12.000000 6.00000 9 0.59997559 0 0 <NA>
## Adjust to 13.599609 14.00000 11 1.09985352 0 0 bngn
## High:effect 14.599609 24.50000 12 7.00000000 1 1 <NA>
## Low:prediction 9.298828 2.00000 8 0.19998169 0 0 bngn
## High:prediction 16.503800 46.11673 14 54.97592412 1 1 MI
## Low 5.899414 0.00000 5 0.09999084 0 0 bngn
## High 21.199219 69.00000 15 999.87500000 1 1 MI
## pf_rec status_num subset cvd log_ap
## Low:effect <NA> <NA> 1 0 -0.51086631
## Adjust to normal dead - prostatic ca 1 0 0.09517701
## High:effect <NA> <NA> 1 1 1.94591015
## Low:prediction normal alive 1 0 -1.60952947
## High:prediction bed_over_50_conf dead - unknown cause 1 1 4.00677071
## Low normal alive 1 0 -2.30267665
## High bed_over_50_conf dead - unknown cause 1 1 6.90763027
##
## Values:
##
## stage : 3 4
## rx : placebo 0.2 mg estrogen 1.0 mg estrogen 5.0 mg estrogen
## status :
## [1] alive dead - prostatic ca
## [3] dead - heart or vascular dead - cerebrovascular
## [5] dead - pulmonary embolus dead - other ca
## [7] dead - respiratory disease dead - other specific non-ca
## [9] dead - unspecified non-ca dead - unknown cause
## pf : normal activity in bed < 50% daytime in bed > 50% daytime confined to bed
## hx : 0 1
## dbp : 4 5 6 7 8 9 10 11 12 13
## ekg :
## [1] normal benign
## [3] rhythmic disturb & electrolyte ch heart block or conduction def
## [5] heart strain old MI
## [7] recent MI
## bm : 0 1
## ap80 : 0 1
## ekg_rec : bngn rd&ec hbocd hrts MI
## pf_rec : normal bed_under_50 bed_over_50_conf
## status_num :
## [1] alive dead - prostatic ca
## [3] dead - heart or vascular dead - cerebrovascular
## [5] dead - pulmonary embolus dead - other ca
## [7] dead - respiratory disease dead - other specific non-ca
## [9] dead - unspecified non-ca dead - unknown cause
## subset : 1
## cvd : 0 1
Now we visually explore the assumption about the linearity of the effect for continuous candidate predictors by means of nonparametric smoothed regression:
par(mfrow=c(2,2))
plsmo(prostate_rid$age,prostate_rid$cvd)
plsmo(prostate_rid$wt, prostate_rid$cvd)
plsmo(prostate_rid$sbp,prostate_rid$cvd)
plsmo(prostate_rid$dbp,prostate_rid$cvd)
A strange behaviour is observed for wt i.e. the Weight Index [calculated as wt= wt(kg)-ht(cm)+200]; the others variables could be roughly approximated by a linear effect.
par(mfrow=c(2,3))
plsmo(prostate_rid$hg, prostate_rid$cvd)
plsmo(prostate_rid$sz, prostate_rid$cvd)
plsmo(prostate_rid$sg, prostate_rid$cvd)
plsmo(prostate_rid$ap, prostate_rid$cvd)
plsmo(prostate_rid$dtime, prostate_rid$cvd)
Again, a strange behaviour is observed for dtime (Months of Follow-up, as it is obvioulsy expected…), the others could be roughly approximated by a linear effect. Remind that this dataset is quite small, so we have to be parsimonious in the number of coefficients to be estimated !
Note that, being this dataset quite small, we will use all the dataset for the training phase of the regression model, but after for evaluating the performance of the model we will apply boostrap techniques. This is obviously not the optimal way to proceed, but when there is a constraint of a low-moderate sample size (as often happen in medical data collected prospectively) there is no possibility of further splitting the sample. With such a small sample size we could not afford an initial split of the study sample in separate training and test subgroups. This is also a great obstacle in applying machine learning techniques to small-moderate medical datasets.
Even if univariable analysis is generally discouraged to select candidate predictors (it works only with perfectly uncorrelated variables, that is nearly impossible in biological problems), it is another exploratory step that could be useful also to discuss then with clinicians about the indications of associations that are present in the data. So first of all we can create a table with univariable results, assuming for the moment a linear effect for all the continuous variables :
prostate_rid %>% select(-patno,-stage,-pf,-ekg,-status,-status_num,-subset) %>%
tbl_uvregression(method = glm, # glm function
method.args = list(family = binomial),# logistic model
exponentiate = T, # report OR
y=cvd,# outcome variable
conf.level = 0.95,
pvalue_fun = function(x) style_pvalue(x, digits = 3))
Characteristic | N | OR1 | 95% CI1 | p-value |
---|---|---|---|---|
rx | 257 | |||
placebo | — | — | ||
0.2 mg estrogen | 0.65 | 0.33, 1.27 | 0.210 | |
1.0 mg estrogen | 1.13 | 0.55, 2.36 | 0.736 | |
5.0 mg estrogen | 1.73 | 0.89, 3.41 | 0.107 | |
Months of Follow-up | 257 | 1.00 | 0.98, 1.01 | 0.707 |
Age in Years | 257 | 1.07 | 1.03, 1.11 | <0.001 |
Weight Index = wt(kg)-ht(cm)+200 | 257 | 1.00 | 0.99, 1.02 | 0.726 |
History of Cardiovascular Disease | 257 | 3.96 | 2.37, 6.71 | <0.001 |
Systolic Blood Pressure/10 | 257 | 1.09 | 0.99, 1.21 | 0.082 |
Diastolic Blood Pressure/10 | 257 | 1.30 | 1.09, 1.57 | 0.005 |
Serum Hemoglobin (g/100ml) | 257 | 1.17 | 1.04, 1.33 | 0.012 |
Size of Primary Tumor (cm^2) | 255 | 0.93 | 0.91, 0.95 | <0.001 |
Combined Index of Stage and Hist. Grade | 252 | 0.56 | 0.47, 0.66 | <0.001 |
Serum Prostatic Acid Phosphatase | 257 | 0.93 | 0.89, 0.96 | <0.001 |
Bone Metastases | 257 | 0.28 | 0.15, 0.52 | <0.001 |
ap80 | 257 | 5.74 | 3.38, 9.96 | <0.001 |
ekg_rec | 252 | |||
bngn | — | — | ||
rd&ec | 1.35 | 0.57, 3.20 | 0.492 | |
hbocd | 1.04 | 0.25, 3.91 | 0.955 | |
hrts | 2.04 | 1.11, 3.78 | 0.023 | |
MI | 2.73 | 1.30, 5.88 | 0.009 | |
pf_rec | 257 | |||
normal | — | — | ||
bed_under_50 | 1.36 | 0.58, 3.28 | 0.477 | |
bed_over_50_conf | 0.09 | 0.00, 0.47 | 0.021 | |
log_ap | 257 | 0.55 | 0.45, 0.65 | <0.001 |
1 OR = Odds Ratio, CI = Confidence Interval |
Note that here all the odds ratios (OR) are expressing variations by 1-unit on the scale of the independent variable.
This table should be commented with the clinicians in order to discuss if the observed statistical significant associations are in line with what it is expected from the biological perspective.
Now, we can come back to the functional forms for the continuous variables: try now to use a nonlinear effect for wt and for dtime
fit.splines.wt <- lrm(cvd ~ rcs(wt,4), data=prostate_rid)
print(fit.splines.wt)
## Logistic Regression Model
##
## lrm(formula = cvd ~ rcs(wt, 4), data = prostate_rid)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 257 LR chi2 0.66 R2 0.003 C 0.518
## 0 130 d.f. 3 R2(3,257)0.000 Dxy 0.036
## 1 127 Pr(> chi2) 0.8833 R2(3,192.7)0.000 gamma 0.037
## max |deriv| 1e-12 Brier 0.249 tau-a 0.018
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -2.5286 3.5066 -0.72 0.4708
## wt 0.0300 0.0418 0.72 0.4731
## wt' -0.1046 0.1446 -0.72 0.4693
## wt'' 0.3319 0.4557 0.73 0.4664
summary(fit.splines.wt)
## Effects Response : cvd
##
## Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
## wt 89 106 17 -0.14415 0.32689 -0.78484 0.49654
## Odds Ratio 89 106 17 0.86576 NA 0.45619 1.64300
We observe that there is no significant effect for the nonlinear components of the spline.
fit.splines.dtime <- lrm(cvd ~ rcs(dtime,4), data=prostate_rid)
print(fit.splines.dtime)
## Logistic Regression Model
##
## lrm(formula = cvd ~ rcs(dtime, 4), data = prostate_rid)
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 257 LR chi2 8.04 R2 0.041 C 0.591
## 0 130 d.f. 3 R2(3,257)0.019 Dxy 0.181
## 1 127 Pr(> chi2) 0.0452 R2(3,192.7)0.026 gamma 0.185
## max |deriv| 1e-06 Brier 0.242 tau-a 0.091
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept 0.6180 0.3867 1.60 0.1100
## dtime -0.0991 0.0422 -2.35 0.0188
## dtime' 0.4881 0.1833 2.66 0.0078
## dtime'' -1.2217 0.4470 -2.73 0.0063
summary(fit.splines.dtime)
## Effects Response : cvd
##
## Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
## dtime 11 37 26 0.73072 0.35467 0.035589 1.4259
## Odds Ratio 11 37 26 2.07660 NA 1.036200 4.1614
Here instead there is a significant nonlinear effect for the dtime variable (as expected being the follow up…) In this case, the interpretation of the odds ratio is reported as a contrast between two specific values (37 vs 11), i.e. the third and first quartile of the variable. This is done in order to help in the interpretation of the non-linear effect modelled by the spline function.
Let’s visualize the effect of this candidate predictor modeled in a nonlinear way:
ggplot(Predict(fit.splines.dtime))
This is quite similar to the non parametric regression previously used:
plsmo(prostate_rid$dtime, prostate_rid$cvd)
We decide to start with a full (no selection of predictors) binary logistic model assuming linearity for continuous predictors except for dtime:
fit.multi <- lrm(cvd ~ sz + sg + log(ap) + sbp + dbp + age + wt +hg + ekg_rec + pf_rec + bm + hx + rx +rcs(dtime,4), data =prostate_rid)
Note here that there is usually an upper-bound on the number of candidate predictors that can be used in a [logistic/poisson/survival] regression model dictated by the number of events occurring in the population (we will take into account this aspect when the sample size issue in multivariable modelling will be discussed).
Let’s see the results:
print(fit.multi)
## Frequencies of Missing Values Due to Each Variable
## cvd sz sg ap sbp dbp age wt hg ekg_rec
## 0 2 5 0 0 0 0 0 0 5
## pf_rec bm hx rx dtime
## 0 0 0 0 0
##
## Logistic Regression Model
##
## lrm(formula = cvd ~ sz + sg + log(ap) + sbp + dbp + age + wt +
## hg + ekg_rec + pf_rec + bm + hx + rx + rcs(dtime, 4), data = prostate_rid)
##
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 245 LR chi2 150.16 R2 0.611 C 0.906
## 0 121 d.f. 22 R2(22,245)0.407 Dxy 0.811
## 1 124 Pr(> chi2) <0.0001 R2(22,183.7)0.502 gamma 0.811
## max |deriv| 2e-09 Brier 0.124 tau-a 0.407
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -4.7151 3.6106 -1.31 0.1916
## sz -0.0635 0.0174 -3.64 0.0003
## sg -0.3281 0.1247 -2.63 0.0085
## ap -0.3808 0.1468 -2.59 0.0095
## sbp -0.0278 0.0931 -0.30 0.7654
## dbp 0.4263 0.1727 2.47 0.0136
## age 0.1004 0.0317 3.16 0.0016
## wt -0.0172 0.0148 -1.16 0.2442
## hg 0.0708 0.1060 0.67 0.5042
## ekg_rec=rd&ec -0.5096 0.6760 -0.75 0.4509
## ekg_rec=hbocd -1.3951 0.8889 -1.57 0.1166
## ekg_rec=hrts 0.1518 0.4586 0.33 0.7407
## ekg_rec=MI 0.0106 0.6210 0.02 0.9864
## pf_rec=bed_under_50 1.3081 0.8132 1.61 0.1077
## pf_rec=bed_over_50_conf -2.2541 1.2487 -1.81 0.0710
## bm 0.0981 0.5397 0.18 0.8557
## hx 1.3719 0.4195 3.27 0.0011
## rx=0.2 mg estrogen -0.3661 0.5245 -0.70 0.4852
## rx=1.0 mg estrogen 0.9277 0.5579 1.66 0.0964
## rx=5.0 mg estrogen 0.7030 0.5238 1.34 0.1796
## dtime -0.1256 0.0691 -1.82 0.0691
## dtime' 0.6755 0.2979 2.27 0.0234
## dtime'' -1.7766 0.7258 -2.45 0.0144
summary(fit.multi)
## Effects Response : cvd
##
## Factor Low High Diff. Effect S.E.
## sz 6.00000 24.5 18.5000 -1.174100 0.32268
## Odds Ratio 6.00000 24.5 18.5000 0.309110 NA
## sg 9.00000 12.0 3.0000 -0.984350 0.37423
## Odds Ratio 9.00000 12.0 3.0000 0.373680 NA
## ap 0.59998 7.0 6.4000 -0.935590 0.36060
## Odds Ratio 0.59998 7.0 6.4000 0.392360 NA
## sbp 13.00000 16.0 3.0000 -0.083384 0.27943
## Odds Ratio 13.00000 16.0 3.0000 0.920000 NA
## dbp 7.00000 9.0 2.0000 0.852590 0.34535
## Odds Ratio 7.00000 9.0 2.0000 2.345700 NA
## age 70.00000 76.0 6.0000 0.602180 0.19035
## Odds Ratio 70.00000 76.0 6.0000 1.826100 NA
## wt 89.00000 106.0 17.0000 -0.292780 0.25140
## Odds Ratio 89.00000 106.0 17.0000 0.746180 NA
## hg 12.00000 14.6 2.5996 0.184020 0.27553
## Odds Ratio 12.00000 14.6 2.5996 1.202000 NA
## bm 0.00000 1.0 1.0000 0.098134 0.53966
## Odds Ratio 0.00000 1.0 1.0000 1.103100 NA
## hx 0.00000 1.0 1.0000 1.371900 0.41948
## Odds Ratio 0.00000 1.0 1.0000 3.943000 NA
## dtime 11.00000 37.0 26.0000 1.053600 0.55450
## Odds Ratio 11.00000 37.0 26.0000 2.867900 NA
## ekg_rec - rd&ec:bngn 1.00000 2.0 NA -0.509560 0.67596
## Odds Ratio 1.00000 2.0 NA 0.600760 NA
## ekg_rec - hbocd:bngn 1.00000 3.0 NA -1.395100 0.88895
## Odds Ratio 1.00000 3.0 NA 0.247820 NA
## ekg_rec - hrts:bngn 1.00000 4.0 NA 0.151770 0.45859
## Odds Ratio 1.00000 4.0 NA 1.163900 NA
## ekg_rec - MI:bngn 1.00000 5.0 NA 0.010576 0.62096
## Odds Ratio 1.00000 5.0 NA 1.010600 NA
## pf_rec - bed_under_50:normal 1.00000 2.0 NA 1.308100 0.81317
## Odds Ratio 1.00000 2.0 NA 3.699200 NA
## pf_rec - bed_over_50_conf:normal 1.00000 3.0 NA -2.254100 1.24870
## Odds Ratio 1.00000 3.0 NA 0.104970 NA
## rx - 0.2 mg estrogen:placebo 1.00000 2.0 NA -0.366070 0.52451
## Odds Ratio 1.00000 2.0 NA 0.693460 NA
## rx - 1.0 mg estrogen:placebo 1.00000 3.0 NA 0.927670 0.55790
## Odds Ratio 1.00000 3.0 NA 2.528600 NA
## rx - 5.0 mg estrogen:placebo 1.00000 4.0 NA 0.702950 0.52383
## Odds Ratio 1.00000 4.0 NA 2.019700 NA
## Lower 0.95 Upper 0.95
## -1.8065000 -0.54161
## 0.1642300 0.58181
## -1.7178000 -0.25087
## 0.1794500 0.77813
## -1.6424000 -0.22882
## 0.1935200 0.79547
## -0.6310500 0.46429
## 0.5320300 1.59090
## 0.1757100 1.52950
## 1.1921000 4.61570
## 0.2291100 0.97525
## 1.2575000 2.65180
## -0.7855300 0.19996
## 0.4558800 1.22140
## -0.3560000 0.72405
## 0.7004700 2.06280
## -0.9595800 1.15590
## 0.3830500 3.17670
## 0.5497700 2.19410
## 1.7329000 8.97190
## -0.0332040 2.14040
## 0.9673400 8.50280
## -1.8344000 0.81529
## 0.1597100 2.25980
## -3.1374000 0.34723
## 0.0433970 1.41510
## -0.7470500 1.05060
## 0.4737600 2.85930
## -1.2065000 1.22760
## 0.2992500 3.41320
## -0.2856600 2.90190
## 0.7515200 18.20900
## -4.7014000 0.19330
## 0.0090825 1.21320
## -1.3941000 0.66195
## 0.2480600 1.93860
## -0.1657900 2.02110
## 0.8472200 7.54690
## -0.3237300 1.72960
## 0.7234500 5.63860
an <- anova (fit.multi)
an
## Wald Statistics Response: cvd
##
## Factor Chi-Square d.f. P
## sz 13.24 1 0.0003
## sg 6.92 1 0.0085
## ap 6.73 1 0.0095
## sbp 0.09 1 0.7654
## dbp 6.09 1 0.0136
## age 10.01 1 0.0016
## wt 1.36 1 0.2442
## hg 0.45 1 0.5042
## ekg_rec 3.60 4 0.4628
## pf_rec 6.11 2 0.0471
## bm 0.03 1 0.8557
## hx 10.70 1 0.0011
## rx 6.67 3 0.0831
## dtime 8.80 3 0.0321
## Nonlinear 8.02 2 0.0182
## TOTAL 62.82 22 <.0001
plot(an)
More than interpreting the single regression coefficients and their corresponding statistical significance (even if in the last phase, i.e. when presenting results to the clinicians also this aspect is important!!), in prediction modelling we are mostly interested in evaluating the accuracy in predictions.
Now we want to evaluate the performance of the full model:
s <- fit.multi$stats
round(s, digits=3)
## Obs Max Deriv Model L.R. d.f. P C
## 245.000 0.000 150.158 22.000 0.000 0.906
## Dxy Gamma Tau-a R2 R2(245) R2(22,245)
## 0.811 0.811 0.407 0.611 0.458 0.407
## R2(183.7) R2(22,183.7) Brier g gr gp
## 0.558 0.502 0.124 3.008 20.245 0.408
C is the AUC (discrimination measure) and Dxy is related to C (Dxy=2*(C-0.5)), so they are both discrimination measures. They measure how the model perform in the relative ranking of the estimated risk.
gamma.hat <- (s['Model L.R.'] - s['d.f.'])/s['Model L.R.']
gamma.hat
## Model L.R.
## 0.8534877
Estimation of the shrinkage coefficient gamma allows the quantification of the amount of overfitting present, and it also allows one to estimate the likelihood that the model will reliably predict new observations. The van Houwelingen-Le Cessie heuristic shrinkage estimate is 0.85, indicating that this model will validate on new data about 15% worse than on this dataset. This is also related to the calibration concept (the absolute ordering of the predicted risks).
Let’s now visualize the predictors effects on the logit scale:
ggplot(Predict(fit.multi))
Now let’s see them on the odds ratio scale:
plot(summary(fit.multi),log=TRUE)
The general function is plot(summary(fit.multi)) but it less readable, for this reason there is the option log=TRUE.
Estimates of the associations are related to the interquartile-range odds ratios for continuous predictors and odds ratios with respect to a reference level for categorical predictors. Numbers at left are upper quartile vs lower quartile or current group vs reference group. The bars represent confidence limits. The intervals are drawn on the log odds ratio scale and labeled on the odds ratio scale. Ranges are on the original scale. Discussing this data with clinicians is also useful for thinking about possible interaction effects between some of the candidate predictors; in this example we will not examine this issue.
Now, forget for a moment our sample size limitation, we want to check if it is worth to expand all the others continuous variables in a full model approach using splines:
fit.all.splines <- lrm(cvd ~ rcs(sz,4) + rcs(sg,4) + rcs(log(ap),4) +rcs(sbp,4) + rcs(dbp,4) + rcs(age,4)
+ rcs(wt,4) + rcs(hg,4) + ekg_rec + pf_rec + bm + hx + rx + rcs(dtime,4),data=prostate_rid)
print(fit.all.splines)
## Frequencies of Missing Values Due to Each Variable
## cvd sz sg ap sbp dbp age wt hg ekg_rec
## 0 2 5 0 0 0 0 0 0 5
## pf_rec bm hx rx dtime
## 0 0 0 0 0
##
## Logistic Regression Model
##
## lrm(formula = cvd ~ rcs(sz, 4) + rcs(sg, 4) + rcs(log(ap), 4) +
## rcs(sbp, 4) + rcs(dbp, 4) + rcs(age, 4) + rcs(wt, 4) + rcs(hg,
## 4) + ekg_rec + pf_rec + bm + hx + rx + rcs(dtime, 4), data = prostate_rid)
##
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 245 LR chi2 177.11 R2 0.686 C 0.933
## 0 121 d.f. 38 R2(38,245)0.433 Dxy 0.865
## 1 124 Pr(> chi2) <0.0001 R2(38,183.7)0.531 gamma 0.865
## max |deriv| 4e-11 Brier 0.104 tau-a 0.434
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -24.1991 13.5603 -1.78 0.0743
## sz -0.2597 0.1295 -2.00 0.0450
## sz' 1.6649 0.9134 1.82 0.0683
## sz'' -3.1513 1.6810 -1.87 0.0608
## sg 0.5706 0.5774 0.99 0.3230
## sg' -4.7064 2.0646 -2.28 0.0226
## sg'' 21.8060 8.9783 2.43 0.0152
## ap -0.4486 0.8526 -0.53 0.5988
## ap' 3.6227 5.5246 0.66 0.5120
## ap'' -7.7643 10.2918 -0.75 0.4506
## sbp -0.0453 0.3717 -0.12 0.9030
## sbp' -0.1906 1.2777 -0.15 0.8814
## sbp'' 0.5496 7.7526 0.07 0.9435
## dbp -0.1280 0.4527 -0.28 0.7773
## dbp' 0.4811 1.0301 0.47 0.6405
## dbp'' 3.6419 6.7889 0.54 0.5916
## age 0.3420 0.1106 3.09 0.0020
## age' -0.3789 0.1770 -2.14 0.0323
## age'' 3.6426 2.0773 1.75 0.0795
## wt 0.0024 0.0893 0.03 0.9788
## wt' -0.0771 0.2949 -0.26 0.7936
## wt'' 0.2732 0.9056 0.30 0.7629
## hg 0.2602 0.3600 0.72 0.4697
## hg' -0.9822 0.8826 -1.11 0.2657
## hg'' 7.9914 5.4689 1.46 0.1439
## ekg_rec=rd&ec -0.6508 0.7960 -0.82 0.4136
## ekg_rec=hbocd -2.0159 1.0458 -1.93 0.0539
## ekg_rec=hrts -0.0923 0.5329 -0.17 0.8625
## ekg_rec=MI -0.1072 0.7061 -0.15 0.8793
## pf_rec=bed_under_50 1.7917 0.9468 1.89 0.0584
## pf_rec=bed_over_50_conf -1.2181 1.3352 -0.91 0.3616
## bm -0.1454 0.6230 -0.23 0.8155
## hx 1.8027 0.5203 3.46 0.0005
## rx=0.2 mg estrogen -0.4657 0.5854 -0.80 0.4263
## rx=1.0 mg estrogen 0.6931 0.6265 1.11 0.2686
## rx=5.0 mg estrogen 0.5258 0.6375 0.82 0.4095
## dtime -0.1925 0.0810 -2.38 0.0175
## dtime' 0.9895 0.3536 2.80 0.0051
## dtime'' -2.5924 0.8682 -2.99 0.0028
summary(fit.all.splines)
## Effects Response : cvd
##
## Factor Low High Diff. Effect S.E.
## sz 6.00000 24.5 18.5000 -0.4753200 0.57195
## Odds Ratio 6.00000 24.5 18.5000 0.6216800 NA
## sg 9.00000 12.0 3.0000 -3.0788000 0.86958
## Odds Ratio 9.00000 12.0 3.0000 0.0460160 NA
## ap 0.59998 7.0 6.4000 -0.0015339 0.71565
## Odds Ratio 0.59998 7.0 6.4000 0.9984700 NA
## sbp 13.00000 16.0 3.0000 -0.4203600 0.64517
## Odds Ratio 13.00000 16.0 3.0000 0.6568100 NA
## dbp 7.00000 9.0 2.0000 0.7533600 0.70002
## Odds Ratio 7.00000 9.0 2.0000 2.1241000 NA
## age 70.00000 76.0 6.0000 -0.4358300 0.56104
## Odds Ratio 70.00000 76.0 6.0000 0.6467300 NA
## wt 89.00000 106.0 17.0000 -0.4036000 0.59476
## Odds Ratio 89.00000 106.0 17.0000 0.6679100 NA
## hg 12.00000 14.6 2.5996 -0.5632000 0.62210
## Odds Ratio 12.00000 14.6 2.5996 0.5693800 NA
## bm 0.00000 1.0 1.0000 -0.1453900 0.62300
## Odds Ratio 0.00000 1.0 1.0000 0.8646800 NA
## hx 0.00000 1.0 1.0000 1.8027000 0.52030
## Odds Ratio 0.00000 1.0 1.0000 6.0659000 NA
## dtime 11.00000 37.0 26.0000 1.3504000 0.64498
## Odds Ratio 11.00000 37.0 26.0000 3.8589000 NA
## ekg_rec - rd&ec:bngn 1.00000 2.0 NA -0.6508500 0.79604
## Odds Ratio 1.00000 2.0 NA 0.5216000 NA
## ekg_rec - hbocd:bngn 1.00000 3.0 NA -2.0159000 1.04580
## Odds Ratio 1.00000 3.0 NA 0.1332000 NA
## ekg_rec - hrts:bngn 1.00000 4.0 NA -0.0922990 0.53294
## Odds Ratio 1.00000 4.0 NA 0.9118300 NA
## ekg_rec - MI:bngn 1.00000 5.0 NA -0.1072100 0.70615
## Odds Ratio 1.00000 5.0 NA 0.8983400 NA
## pf_rec - bed_under_50:normal 1.00000 2.0 NA 1.7917000 0.94679
## Odds Ratio 1.00000 2.0 NA 5.9995000 NA
## pf_rec - bed_over_50_conf:normal 1.00000 3.0 NA -1.2181000 1.33520
## Odds Ratio 1.00000 3.0 NA 0.2957800 NA
## rx - 0.2 mg estrogen:placebo 1.00000 2.0 NA -0.4657400 0.58539
## Odds Ratio 1.00000 2.0 NA 0.6276700 NA
## rx - 1.0 mg estrogen:placebo 1.00000 3.0 NA 0.6931000 0.62648
## Odds Ratio 1.00000 3.0 NA 1.9999000 NA
## rx - 5.0 mg estrogen:placebo 1.00000 4.0 NA 0.5257800 0.63747
## Odds Ratio 1.00000 4.0 NA 1.6918000 NA
## Lower 0.95 Upper 0.95
## -1.5963000 0.64568
## 0.2026400 1.90730
## -4.7831000 -1.37440
## 0.0083699 0.25299
## -1.4042000 1.40110
## 0.2455700 4.05970
## -1.6849000 0.84415
## 0.1854700 2.32600
## -0.6186600 2.12540
## 0.5386600 8.37610
## -1.5354000 0.66378
## 0.2153600 1.94210
## -1.5693000 0.76211
## 0.2081900 2.14280
## -1.7825000 0.65608
## 0.1682200 1.92720
## -1.3664000 1.07570
## 0.2550100 2.93190
## 0.7829300 2.82240
## 2.1879000 16.81800
## 0.0862480 2.61450
## 1.0901000 13.66100
## -2.2111000 0.90936
## 0.1095900 2.48270
## -4.0657000 0.03389
## 0.0171510 1.03450
## -1.1368000 0.95224
## 0.3208300 2.59150
## -1.4912000 1.27680
## 0.2251000 3.58520
## -0.0640040 3.64740
## 0.9380000 38.37300
## -3.8351000 1.39890
## 0.0215980 4.05060
## -1.6131000 0.68161
## 0.1992700 1.97710
## -0.5347800 1.92100
## 0.5858000 6.82760
## -0.7236500 1.77520
## 0.4849800 5.90150
an.s <- anova (fit.all.splines)
an.s
## Wald Statistics Response: cvd
##
## Factor Chi-Square d.f. P
## sz 11.55 3 0.0091
## Nonlinear 3.82 2 0.1481
## sg 13.31 3 0.0040
## Nonlinear 6.55 2 0.0379
## ap 3.32 3 0.3443
## Nonlinear 1.46 2 0.4827
## sbp 1.40 3 0.7054
## Nonlinear 0.14 2 0.9312
## dbp 12.05 3 0.0072
## Nonlinear 7.18 2 0.0276
## age 13.01 3 0.0046
## Nonlinear 5.02 2 0.0813
## wt 0.69 3 0.8762
## Nonlinear 0.16 2 0.9212
## hg 3.13 3 0.3714
## Nonlinear 3.03 2 0.2193
## ekg_rec 4.20 4 0.3798
## pf_rec 4.71 2 0.0951
## bm 0.05 1 0.8155
## hx 12.00 1 0.0005
## rx 4.23 3 0.2373
## dtime 12.40 3 0.0061
## Nonlinear 10.89 2 0.0043
## TOTAL NONLINEAR 24.81 18 0.1302
## TOTAL 57.36 38 0.0227
plot(an.s)
Let’s check what happens to the model’s performance:
s.splines <- fit.all.splines$stats # performance of the estimated model
round(s.splines, digits=3) # C is the AUC and Dxy is related to C (Dxy=2*(C-0.5)), are both discrimination measures
## Obs Max Deriv Model L.R. d.f. P C
## 245.000 0.000 177.108 38.000 0.000 0.933
## Dxy Gamma Tau-a R2 R2(245) R2(38,245)
## 0.865 0.865 0.434 0.686 0.515 0.433
## R2(183.7) R2(38,183.7) Brier g gr gp
## 0.619 0.531 0.104 4.107 60.753 0.432
gamma.hat.splines <- (s.splines['Model L.R.'] - s.splines['d.f.'])/s.splines['Model L.R.']
gamma.hat.splines
## Model L.R.
## 0.7854418
The van Houwelingen-Le Cessie heuristic shrinkage estimate is 0.79, indicating that this model will validate on new data about 21% worse than on this dataset (more complex than the previous model, more overfitting, as expected).
Let’s compare the performance of the two models in terms of AIC: the Akaike Information Criterion (the smaller is the better is):
AIC.models <- c(AIC(fit.multi), AIC(fit.all.splines))
AIC.models
## [1] 235.4473 240.4973
Based on AIC, the simpler model fitted to the raw data and assuming linearity for all the continuous predictors except dtime has a lower AIC and a lower optimism, so we would prefer that one as a predictive tool.
Now we use fast backward step-down (with total residual AIC as the stopping rule, that is a reasonable way to use stepwise selection) to identify the variables that explain the bulk of the cause of death. This algorithm performs a slightly inefficient but numerically stable version of fast backward elimination on factors;it uses the fitted full model and computes approximate Wald statistics by computing conditional (restricted) maximum likelihood estimates (assuming multivariate normality of estimates..). The function prints the deletion statistics for each variable in turn, and prints approximate parameter estimates for the model after deleting variables. The approximation is better when the number of factors deleted is not large.
fastbw(fit.multi)
##
## Deleted Chi-Sq d.f. P Residual d.f. P AIC
## ekg_rec 3.60 4 0.4628 3.60 4 0.4628 -4.40
## bm 0.01 1 0.9242 3.61 5 0.6069 -6.39
## sbp 0.08 1 0.7830 3.69 6 0.7192 -8.31
## hg 0.30 1 0.5827 3.99 7 0.7813 -10.01
## wt 0.92 1 0.3380 4.91 8 0.7677 -11.09
## rx 7.31 3 0.0628 12.21 11 0.3480 -9.79
## pf_rec 5.72 2 0.0573 17.93 13 0.1602 -8.07
## dtime 7.29 3 0.0633 25.22 16 0.0661 -6.78
## sg 3.36 1 0.0667 28.58 17 0.0386 -5.42
## dbp 4.98 1 0.0257 33.55 18 0.0143 -2.45
##
## Approximate Estimates after Deleting Factors
##
## Coef S.E. Wald Z P
## Intercept -3.68667 2.00132 -1.842 0.0654580
## sz -0.04316 0.01605 -2.689 0.0071632
## ap -0.41027 0.11823 -3.470 0.0005202
## age 0.05934 0.02809 2.112 0.0346497
## hx 0.82995 0.36398 2.280 0.0225953
##
## Factors in Final Model
##
## [1] sz ap age hx
In the final step the following covariates are retained: sz, ap, age
and hx.
Let’s estimate the reduced model:
fred <- lrm(cvd ~ sz + log(ap) + age + hx , data =prostate_rid)
print(fred)
## Frequencies of Missing Values Due to Each Variable
## cvd sz ap age hx
## 0 2 0 0 0
##
## Logistic Regression Model
##
## lrm(formula = cvd ~ sz + log(ap) + age + hx, data = prostate_rid)
##
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 255 LR chi2 100.81 R2 0.435 C 0.840
## 0 129 d.f. 4 R2(4,255)0.316 Dxy 0.679
## 1 126 Pr(> chi2) <0.0001 R2(4,191.2)0.397 gamma 0.679
## max |deriv| 3e-08 Brier 0.163 tau-a 0.341
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -4.9339 1.6846 -2.93 0.0034
## sz -0.0543 0.0136 -3.99 <0.0001
## ap -0.5048 0.1044 -4.83 <0.0001
## age 0.0776 0.0235 3.30 0.0010
## hx 1.0713 0.3101 3.45 0.0006
summary(fred)
## Effects Response : cvd
##
## Factor Low High Diff. Effect S.E. Lower 0.95 Upper 0.95
## sz 6.00000 24.5 18.5 -1.00520 0.25199 -1.49910 -0.51135
## Odds Ratio 6.00000 24.5 18.5 0.36595 NA 0.22332 0.59968
## ap 0.59998 7.0 6.4 -1.24020 0.25656 -1.74310 -0.73737
## Odds Ratio 0.59998 7.0 6.4 0.28932 NA 0.17498 0.47837
## age 70.00000 76.0 6.0 0.46531 0.14111 0.18874 0.74188
## Odds Ratio 70.00000 76.0 6.0 1.59250 NA 1.20770 2.09990
## hx 0.00000 1.0 1.0 1.07130 0.31007 0.46355 1.67900
## Odds Ratio 0.00000 1.0 1.0 2.91910 NA 1.58970 5.36010
Again, we want to evaluate the performance of the reduced model:
s.red <- fred$stats
round(s.red, digits=3)
## Obs Max Deriv Model L.R. d.f. P C
## 255.000 0.000 100.810 4.000 0.000 0.840
## Dxy Gamma Tau-a R2 R2(255) R2(4,255)
## 0.679 0.679 0.341 0.435 0.327 0.316
## R2(191.2) R2(4,191.2) Brier g gr gp
## 0.410 0.397 0.163 1.917 6.800 0.340
gamma.hat.red <- (s.red['Model L.R.'] - s.red['d.f.'])/s.red['Model L.R.']
gamma.hat.red
## Model L.R.
## 0.9603214
The van Houwelingen-Le Cessie heuristic shrinkage estimate is now 0.96, indicating that this model will validate on new data about 4% worse than on this dataset (simpler than the full model, less overfitting).
Let us now use a bootstrap approach to further evaluate calibration and discrimination of this reduced model:
fred <- update(fred, x=TRUE , y=TRUE)
vred <- validate (fred, B=200)
vred
## index.orig training test optimism index.corrected n
## Dxy 0.6792 0.6854 0.6695 0.0159 0.6633 200
## R2 0.4354 0.4468 0.4254 0.0214 0.4140 200
## Intercept 0.0000 0.0000 0.0031 -0.0031 0.0031 200
## Slope 1.0000 1.0000 0.9560 0.0440 0.9560 200
## Emax 0.0000 0.0000 0.0105 0.0105 0.0105 200
## D 0.3914 0.4055 0.3804 0.0251 0.3663 200
## U -0.0078 -0.0078 0.0014 -0.0093 0.0014 200
## Q 0.3993 0.4133 0.3789 0.0344 0.3649 200
## B 0.1630 0.1596 0.1667 -0.0072 0.1702 200
## g 1.9169 1.9832 1.8735 0.1097 1.8073 200
## gp 0.3396 0.3420 0.3352 0.0068 0.3328 200
The slope shrinkage as already seen is around 4%. There is a sligth drop-off in all indexes. The estimated likely future predictive discrimination of the model as measured by Somers’ Dxy change from 0.68 to 0.66. The latter estimate is the one that should be claimed when describing model performance.
A nearly unbiased estimate of future calibration of the stepwise-derived model is given below:
fred <- update(fred, x=TRUE , y=TRUE)
cal <- calibrate(fred, B=200)
plot(cal)
##
## n=255 Mean absolute error=0.012 Mean squared error=0.00019
## 0.9 Quantile of absolute error=0.022
Bootstrap overfitting-corrected calibration curve estimate for the backwards step-down cause of death logistic model, along with a rug plot showing the distribution of predicted risks. The smooth nonparametric calibration estimator (loess) is used.The estimated mean absolute error is 0.013 between estimated probabilities and observed events: quite good!
For comparison, let consider now a bootstrap validation of the full model without using any variable selection:
fit.multi <- update(fit.multi, x=TRUE , y=TRUE)
vfull <- validate (fit.multi, B=200)
vfull
## index.orig training test optimism index.corrected n
## Dxy 0.8113 0.8594 0.7646 0.0949 0.7164 200
## R2 0.6110 0.6812 0.5372 0.1440 0.4670 200
## Intercept 0.0000 0.0000 0.0098 -0.0098 0.0098 200
## Slope 1.0000 1.0000 0.6633 0.3367 0.6633 200
## Emax 0.0000 0.0000 0.0925 0.0925 0.0925 200
## D 0.6088 0.7128 0.5122 0.2006 0.4082 200
## U -0.0082 -0.0082 0.0649 -0.0731 0.0649 200
## Q 0.6170 0.7210 0.4473 0.2737 0.3433 200
## B 0.1241 0.1043 0.1416 -0.0374 0.1615 200
## g 3.0079 3.9797 2.5893 1.3904 1.6175 200
## gp 0.4077 0.4293 0.3796 0.0498 0.3579 200
Compared to the validation of the full model, the step-down model has much less optimism, as expected, even if it has a smaller Dxy (discrimination) due to loss of information from removing moderately important variables. Let’s now explore calibration of the full model:
cal.full <- calibrate(fit.multi, B=200)
plot(cal.full)
##
## n=245 Mean absolute error=0.043 Mean squared error=0.00223
## 0.9 Quantile of absolute error=0.067
The estimated mean absolute error is 0.04 between estimated probabilities and observed events, the bias-corrected performance is worse than the reduced model.
We could also make a test to compare the two models in terms of AUC (i.e the discrimination power): note that in the reduced model we have less missing values (in any case the % of missing values in this dataset is quite irrelevant). Let’s extract the probabilities predicted from the models:
full.pred <- predict(fit.multi, type="fitted")
rid.pred <- predict(fred, type="fitted")
dati.pp <- data.frame(prostate_rid, full.pred, rid.pred)
And perform the test:
roc.test(response = dati.pp$cvd, predictor1= dati.pp$full.pred, predictor2 = dati.pp$rid.pred)
##
## DeLong's test for two correlated ROC curves
##
## data: dati.pp$full.pred and dati.pp$rid.pred by dati.pp$cvd (0, 1)
## Z = 3.4561, p-value = 0.0005481
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
## 0.02628403 0.09515026
## sample estimates:
## AUC of roc1 AUC of roc2
## 0.9056252 0.8449080
We observe as expected significant better discrimination of the full
model, but remember also that
this model is more prone to overfitting (the bootstrap-corrected
calibration is worse…) and that these measures are derived from the
“training-all” dataset so they need also to be rescaled in new
data.
roc.full <- roc(dati.pp$cvd, dati.pp$full.pred)
roc.red <- roc(dati.pp$cvd, dati.pp$rid.pred)
plot(roc.full)
plot(roc.red, add=TRUE, col="red")
legend("bottomright",legend=c("Full model", "Reduced model"), lty=c(1,1), col=c("black", "red"), bty="n")
title("Discrimination on the development sample")
We can try to be less conservative in estimating a reduced model, in order to increase the discrimination, using a different significance level for a variable to stay in the model and using individual approximate Wald tests rather than tests combining all deleted variables:
vrid2 <- validate(fit.multi, bw=TRUE , sls =0.5, type = 'individual' , B=200)
##
## Backwards Step-down - Original Model
##
## Deleted Chi-Sq d.f. P Residual d.f. P AIC
## ekg_rec 3.60 4 0.4628 3.60 4 0.4628 -4.40
## bm 0.01 1 0.9242 3.61 5 0.6069 -6.39
## sbp 0.08 1 0.7830 3.69 6 0.7192 -8.31
## hg 0.30 1 0.5827 3.99 7 0.7813 -10.01
## wt 0.92 1 0.3380 4.91 8 0.7677 -11.09
##
## Approximate Estimates after Deleting Factors
##
## Coef S.E. Wald Z P
## Intercept -5.12176 3.12777 -1.6375 0.1015240
## sz -0.05646 0.01689 -3.3422 0.0008313
## sg -0.33179 0.12027 -2.7587 0.0058026
## ap -0.34171 0.13370 -2.5558 0.0105945
## dbp 0.35870 0.14429 2.4859 0.0129210
## age 0.09340 0.03045 3.0677 0.0021573
## pf_rec=bed_under_50 1.28840 0.78205 1.6475 0.0994630
## pf_rec=bed_over_50_conf -2.31951 1.21744 -1.9052 0.0567492
## hx 1.27821 0.39842 3.2082 0.0013358
## rx=0.2 mg estrogen -0.28720 0.50331 -0.5706 0.5682597
## rx=1.0 mg estrogen 0.94249 0.54264 1.7369 0.0824130
## rx=5.0 mg estrogen 0.85665 0.50700 1.6896 0.0910964
## dtime -0.11929 0.06723 -1.7742 0.0760268
## dtime' 0.67239 0.29006 2.3181 0.0204412
## dtime'' -1.78597 0.70664 -2.5274 0.0114911
##
## Factors in Final Model
##
## [1] sz sg ap dbp age pf_rec hx rx dtime
Now let’s estimate a second reduced model:
fred2 <- lrm(cvd ~ sz + sg+ log(ap) + dbp + age + hx +pf_rec+rx +rcs(dtime,4) , data =prostate_rid)
fred2
## Frequencies of Missing Values Due to Each Variable
## cvd sz sg ap dbp age hx pf_rec rx dtime
## 0 2 5 0 0 0 0 0 0 0
##
## Logistic Regression Model
##
## lrm(formula = cvd ~ sz + sg + log(ap) + dbp + age + hx + pf_rec +
## rx + rcs(dtime, 4), data = prostate_rid)
##
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 250 LR chi2 145.18 R2 0.587 C 0.897
## 0 125 d.f. 14 R2(14,250)0.408 Dxy 0.794
## 1 125 Pr(> chi2) <0.0001 R2(14,187.5)0.503 gamma 0.794
## max |deriv| 7e-10 Brier 0.131 tau-a 0.398
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -4.7269 3.0077 -1.57 0.1160
## sz -0.0614 0.0165 -3.71 0.0002
## sg -0.3203 0.1167 -2.74 0.0061
## ap -0.3365 0.1311 -2.57 0.0103
## dbp 0.3416 0.1381 2.47 0.0134
## age 0.0878 0.0290 3.02 0.0025
## hx 1.3681 0.3838 3.56 0.0004
## pf_rec=bed_under_50 1.2721 0.7466 1.70 0.0884
## pf_rec=bed_over_50_conf -2.3608 1.2607 -1.87 0.0611
## rx=0.2 mg estrogen -0.2781 0.4865 -0.57 0.5676
## rx=1.0 mg estrogen 0.9088 0.5227 1.74 0.0821
## rx=5.0 mg estrogen 0.9254 0.4880 1.90 0.0579
## dtime -0.1070 0.0644 -1.66 0.0966
## dtime' 0.5910 0.2755 2.15 0.0319
## dtime'' -1.5783 0.6699 -2.36 0.0185
print(fred2)
## Frequencies of Missing Values Due to Each Variable
## cvd sz sg ap dbp age hx pf_rec rx dtime
## 0 2 5 0 0 0 0 0 0 0
##
## Logistic Regression Model
##
## lrm(formula = cvd ~ sz + sg + log(ap) + dbp + age + hx + pf_rec +
## rx + rcs(dtime, 4), data = prostate_rid)
##
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 250 LR chi2 145.18 R2 0.587 C 0.897
## 0 125 d.f. 14 R2(14,250)0.408 Dxy 0.794
## 1 125 Pr(> chi2) <0.0001 R2(14,187.5)0.503 gamma 0.794
## max |deriv| 7e-10 Brier 0.131 tau-a 0.398
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -4.7269 3.0077 -1.57 0.1160
## sz -0.0614 0.0165 -3.71 0.0002
## sg -0.3203 0.1167 -2.74 0.0061
## ap -0.3365 0.1311 -2.57 0.0103
## dbp 0.3416 0.1381 2.47 0.0134
## age 0.0878 0.0290 3.02 0.0025
## hx 1.3681 0.3838 3.56 0.0004
## pf_rec=bed_under_50 1.2721 0.7466 1.70 0.0884
## pf_rec=bed_over_50_conf -2.3608 1.2607 -1.87 0.0611
## rx=0.2 mg estrogen -0.2781 0.4865 -0.57 0.5676
## rx=1.0 mg estrogen 0.9088 0.5227 1.74 0.0821
## rx=5.0 mg estrogen 0.9254 0.4880 1.90 0.0579
## dtime -0.1070 0.0644 -1.66 0.0966
## dtime' 0.5910 0.2755 2.15 0.0319
## dtime'' -1.5783 0.6699 -2.36 0.0185
summary(fred2)
## Effects Response : cvd
##
## Factor Low High Diff. Effect S.E.
## sz 6.00000 24.5 18.5 -1.135100 0.30581
## Odds Ratio 6.00000 24.5 18.5 0.321390 NA
## sg 9.00000 12.0 3.0 -0.960990 0.35016
## Odds Ratio 9.00000 12.0 3.0 0.382510 NA
## ap 0.59998 7.0 6.4 -0.826700 0.32204
## Odds Ratio 0.59998 7.0 6.4 0.437490 NA
## dbp 7.00000 9.0 2.0 0.683210 0.27626
## Odds Ratio 7.00000 9.0 2.0 1.980200 NA
## age 70.00000 76.0 6.0 0.526990 0.17427
## Odds Ratio 70.00000 76.0 6.0 1.693800 NA
## hx 0.00000 1.0 1.0 1.368100 0.38383
## Odds Ratio 0.00000 1.0 1.0 3.928100 NA
## dtime 11.00000 37.0 26.0 0.924720 0.51670
## Odds Ratio 11.00000 37.0 26.0 2.521200 NA
## pf_rec - bed_under_50:normal 1.00000 2.0 NA 1.272100 0.74664
## Odds Ratio 1.00000 2.0 NA 3.568500 NA
## pf_rec - bed_over_50_conf:normal 1.00000 3.0 NA -2.360800 1.26070
## Odds Ratio 1.00000 3.0 NA 0.094345 NA
## rx - 0.2 mg estrogen:placebo 1.00000 2.0 NA -0.278060 0.48652
## Odds Ratio 1.00000 2.0 NA 0.757250 NA
## rx - 1.0 mg estrogen:placebo 1.00000 3.0 NA 0.908770 0.52267
## Odds Ratio 1.00000 3.0 NA 2.481300 NA
## rx - 5.0 mg estrogen:placebo 1.00000 4.0 NA 0.925360 0.48803
## Odds Ratio 1.00000 4.0 NA 2.522800 NA
## Lower 0.95 Upper 0.95
## -1.7345000 -0.53573
## 0.1764900 0.58524
## -1.6473000 -0.27468
## 0.1925700 0.75981
## -1.4579000 -0.19552
## 0.2327300 0.82241
## 0.1417600 1.22470
## 1.1523000 3.40300
## 0.1854300 0.86855
## 1.2037000 2.38350
## 0.6158600 2.12040
## 1.8513000 8.33470
## -0.0879970 1.93740
## 0.9157600 6.94090
## -0.1912400 2.73550
## 0.8259400 15.41800
## -4.8317000 0.11013
## 0.0079727 1.11640
## -1.2316000 0.67550
## 0.2918200 1.96500
## -0.1156500 1.93320
## 0.8907900 6.91150
## -0.0311690 1.88190
## 0.9693100 6.56590
Give a look to the performance of this alternative reduced model:
fred2 <- update(fred2, x=TRUE , y=TRUE)
vred2 <- validate (fred2, B=200)
vred2
## index.orig training test optimism index.corrected n
## Dxy 0.7937 0.8189 0.7621 0.0568 0.7369 200
## R2 0.5873 0.6268 0.5381 0.0887 0.4986 200
## Intercept 0.0000 0.0000 0.0064 -0.0064 0.0064 200
## Slope 1.0000 1.0000 0.7792 0.2208 0.7792 200
## Emax 0.0000 0.0000 0.0566 0.0566 0.0566 200
## D 0.5767 0.6324 0.5131 0.1193 0.4574 200
## U -0.0080 -0.0080 0.0224 -0.0304 0.0224 200
## Q 0.5847 0.6404 0.4906 0.1497 0.4350 200
## B 0.1315 0.1206 0.1426 -0.0220 0.1535 200
## g 2.8350 3.3964 2.6249 0.7715 2.0635 200
## gp 0.3990 0.4106 0.3800 0.0305 0.3685 200
The performance statistics are now midway between the full model and the smaller stepwise model.
cal.red2 <- calibrate(fred2, B=200)
plot(cal.red2)
##
## n=250 Mean absolute error=0.029 Mean squared error=0.00094
## 0.9 Quantile of absolute error=0.038
The mean absolute error is 0.03.
Let’s compare the AUCs :
rid2.pred <- predict(fred2, type="fitted")
dati.pp2 <- data.frame(dati.pp,rid2.pred)
roc.test(response = dati.pp2$cvd, predictor1= dati.pp2$full.pred, predictor2 = dati.pp2$rid2.pred)
##
## DeLong's test for two correlated ROC curves
##
## data: dati.pp2$full.pred and dati.pp2$rid2.pred by dati.pp2$cvd (0, 1)
## Z = 1.1526, p-value = 0.2491
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
## -0.004248807 0.016378906
## sample estimates:
## AUC of roc1 AUC of roc2
## 0.9056252 0.8995601
We have a comparable discrimination power on the development sample with respect to the full model, but with a better calibration.
roc.full <- roc(dati.pp2$cvd, dati.pp2$full.pred)
roc.red2 <- roc(dati.pp2$cvd, dati.pp2$rid2.pred)
plot(roc.full)
plot(roc.red2, add=TRUE, col="red")
legend("bottomright",legend=c("Full model", "Reduced 2 model"), lty=c(1,1), col=c("black", "red"), bty="n")
title("Discrimination on the development sample")
Finally we decide to present our estimated prediction model to the clinicians. Let’s prepare a visualization that could be helpful to discuss with them; we rename some levels of the categorical variables:
prostate_rid$hxf <- as.factor(prostate_rid$hx)
levels(prostate_rid$hxf) <- c("No", "Yes")
label(prostate_rid$hxf) <- "History of CV disease"
options(datadist='dd')
dd <-datadist(prostate_rid)
dd
## patno stage rx dtime status
## Low:effect 131.00000 3 <NA> 11 <NA>
## Adjust to 265.00000 3 placebo 24 dead - prostatic ca
## High:effect 395.00000 4 <NA> 37 <NA>
## Low:prediction 24.96109 3 placebo 1 alive
## High:prediction 482.03891 4 5.0 mg estrogen 58 dead - unknown cause
## Low 3.00000 3 placebo 0 alive
## High 506.00000 4 5.0 mg estrogen 71 dead - unknown cause
## age wt pf hx sbp dbp ekg
## Low:effect 70.00000 89.0000 <NA> 0 13 7 <NA>
## Adjust to 73.00000 97.0000 normal activity 0 14 8 heart strain
## High:effect 76.00000 106.0000 <NA> 1 16 9 <NA>
## Low:prediction 55.96109 76.0000 normal activity 0 11 6 normal
## High:prediction 80.00000 125.0389 confined to bed 1 19 10 recent MI
## Low 48.00000 71.0000 normal activity 0 8 4 normal
## High 88.00000 152.0000 confined to bed 1 24 13 recent MI
## hg sz sg ap bm ap80 ekg_rec
## Low:effect 12.000000 6.00000 9 0.59997559 0 0 <NA>
## Adjust to 13.599609 14.00000 11 1.09985352 0 0 bngn
## High:effect 14.599609 24.50000 12 7.00000000 1 1 <NA>
## Low:prediction 9.298828 2.00000 8 0.19998169 0 0 bngn
## High:prediction 16.503800 46.11673 14 54.97592412 1 1 MI
## Low 5.899414 0.00000 5 0.09999084 0 0 bngn
## High 21.199219 69.00000 15 999.87500000 1 1 MI
## pf_rec status_num subset cvd log_ap
## Low:effect <NA> <NA> 1 0 -0.51086631
## Adjust to normal dead - prostatic ca 1 0 0.09517701
## High:effect <NA> <NA> 1 1 1.94591015
## Low:prediction normal alive 1 0 -1.60952947
## High:prediction bed_over_50_conf dead - unknown cause 1 1 4.00677071
## Low normal alive 1 0 -2.30267665
## High bed_over_50_conf dead - unknown cause 1 1 6.90763027
## hxf
## Low:effect <NA>
## Adjust to Yes
## High:effect <NA>
## Low:prediction No
## High:prediction Yes
## Low No
## High Yes
##
## Values:
##
## stage : 3 4
## rx : placebo 0.2 mg estrogen 1.0 mg estrogen 5.0 mg estrogen
## status :
## [1] alive dead - prostatic ca
## [3] dead - heart or vascular dead - cerebrovascular
## [5] dead - pulmonary embolus dead - other ca
## [7] dead - respiratory disease dead - other specific non-ca
## [9] dead - unspecified non-ca dead - unknown cause
## pf : normal activity in bed < 50% daytime in bed > 50% daytime confined to bed
## hx : 0 1
## dbp : 4 5 6 7 8 9 10 11 12 13
## ekg :
## [1] normal benign
## [3] rhythmic disturb & electrolyte ch heart block or conduction def
## [5] heart strain old MI
## [7] recent MI
## bm : 0 1
## ap80 : 0 1
## ekg_rec : bngn rd&ec hbocd hrts MI
## pf_rec : normal bed_under_50 bed_over_50_conf
## status_num :
## [1] alive dead - prostatic ca
## [3] dead - heart or vascular dead - cerebrovascular
## [5] dead - pulmonary embolus dead - other ca
## [7] dead - respiratory disease dead - other specific non-ca
## [9] dead - unspecified non-ca dead - unknown cause
## subset : 1
## cvd : 0 1
## hxf : No Yes
We estimate the final model selected:
fredfin <- lrm(cvd ~ sz + sg+ log(ap) + dbp+ age + hxf +pf_rec+rx+rcs(dtime,4), data =prostate_rid)
fredfin
## Frequencies of Missing Values Due to Each Variable
## cvd sz sg ap dbp age hxf pf_rec rx dtime
## 0 2 5 0 0 0 0 0 0 0
##
## Logistic Regression Model
##
## lrm(formula = cvd ~ sz + sg + log(ap) + dbp + age + hxf + pf_rec +
## rx + rcs(dtime, 4), data = prostate_rid)
##
##
## Model Likelihood Discrimination Rank Discrim.
## Ratio Test Indexes Indexes
## Obs 250 LR chi2 145.18 R2 0.587 C 0.897
## 0 125 d.f. 14 R2(14,250)0.408 Dxy 0.794
## 1 125 Pr(> chi2) <0.0001 R2(14,187.5)0.503 gamma 0.794
## max |deriv| 7e-10 Brier 0.131 tau-a 0.398
##
## Coef S.E. Wald Z Pr(>|Z|)
## Intercept -4.7269 3.0077 -1.57 0.1160
## sz -0.0614 0.0165 -3.71 0.0002
## sg -0.3203 0.1167 -2.74 0.0061
## ap -0.3365 0.1311 -2.57 0.0103
## dbp 0.3416 0.1381 2.47 0.0134
## age 0.0878 0.0290 3.02 0.0025
## hxf=Yes 1.3681 0.3838 3.56 0.0004
## pf_rec=bed_under_50 1.2721 0.7466 1.70 0.0884
## pf_rec=bed_over_50_conf -2.3608 1.2607 -1.87 0.0611
## rx=0.2 mg estrogen -0.2781 0.4865 -0.57 0.5676
## rx=1.0 mg estrogen 0.9088 0.5227 1.74 0.0821
## rx=5.0 mg estrogen 0.9254 0.4880 1.90 0.0579
## dtime -0.1070 0.0644 -1.66 0.0966
## dtime' 0.5910 0.2755 2.15 0.0319
## dtime'' -1.5783 0.6699 -2.36 0.0185
nom <- nomogram (fredfin , ap=c(.1 , .5 , 1, 5, 10, 20, 30, 40),
fun=plogis , funlabel ="Probability ",
lp=TRUE,
fun.at =c(.01,.1,.25,.5,.75,.95))
nom
## Points per unit of linear predictor: 21.51285
## Linear predictor units per point : 0.04648384
##
##
## sz Points
## 0 92
## 5 86
## 10 79
## 15 73
## 20 66
## 25 59
## 30 53
## 35 46
## 40 40
## 45 33
## 50 26
## 55 20
## 60 13
## 65 7
## 70 0
##
##
## sg Points
## 5 69
## 6 62
## 7 55
## 8 48
## 9 41
## 10 34
## 11 28
## 12 21
## 13 14
## 14 7
## 15 0
##
##
## ap Points
## 0.1 43
## 0.5 32
## 1.0 27
## 5.0 15
## 10.0 10
## 20.0 5
## 30.0 2
## 40.0 0
##
##
## dbp Points
## 4 0
## 5 7
## 6 15
## 7 22
## 8 29
## 9 37
## 10 44
## 11 51
## 12 59
## 13 66
##
##
## age Points
## 45 0
## 50 9
## 55 19
## 60 28
## 65 38
## 70 47
## 75 57
## 80 66
## 85 76
## 90 85
##
##
## History of CV disease Points
## No 0
## Yes 29
##
##
## pf_rec Points
## normal 51
## bed_under_50 78
## bed_over_50_conf 0
##
##
## rx Points
## placebo 6
## 0.2 mg estrogen 0
## 1.0 mg estrogen 26
## 5.0 mg estrogen 26
##
##
## Months of Follow-up Points
## 0 100
## 10 79
## 20 79
## 30 96
## 40 95
## 50 77
## 60 52
## 70 26
## 80 0
##
##
## Total Points Probability
## 268 0.01
## 319 0.10
## 343 0.25
## 367 0.50
## 390 0.75
## 430 0.95
This is a useful tool to visualize a prediction model and discuss it with clinicians:
plot(nom,cex.axis=.75)