First of all, install and load the required libraries:

library(rms)
library(Hmisc)
library(gtsummary)
library(magrittr)
library(tidyverse)
library(haven)
library(ggplot2)
library(ggpubr)
library(pROC)
setwd("C:\\Users\\Giulia Barbati\\OneDrive - Università degli Studi di Trieste\\Stat_Learning_Epi\\Block 3\\R codes")

Data set “Prostate” Case Study

This practical contains a case study on developing, describing, and validating a regression prediction model.

In the original data set of the study there was also a follow up time to the events, but we did not consider here this aspect (just for didactic purposes we will consider it as another candidate predictor in the model).

The original dataset was composed by 506 patients with prostate cancer from the Byar and Green paper (see if interested: D. P. Byar and S. B. Green. The choice of treatment for cancer patients based on covariate information: Application to prostate cancer. Bulletin Cancer, Paris, 67:477-488, 1980).

These data come originally from a randomized trial (RCT) comparing four treatments for stage 3 and 4 prostate cancer, with almost equal numbers of patients on placebo and each of three doses of estrogen.

In this trial, larger doses of estrogen reduced the effect of prostate cancer but at the cost of increased risk of cardiovascular death. (Four patients had missing values on many variables. These patients have been excluded from consideration).

Our goal here is to predict the probability of incurring in cardiovascular-cerebrovascular death in the subset of patients that experienced this event or the prostate cancer death.

We will subset the original dataset of those patients dying from prostate cancer, heart or vascular disease, or cerebrovascular disease.

Here are the description of the variables contained in the dataset:

Some a priori information from clinicians: stage is defined by ap as well as X-ray results. Of the patients in stage 3, 92% have ap ≤ 0.8. Of those in stage 4, 93% have ap > 0.8.

Since stage can be predicted almost certainly from ap, we do not consider stage in the analyses. (You can check this relationship on the data).

Initial Data Analysis (IDA) on the original dataset

This part is an exercise that you will do on your own: first of all, let’s load the original dataset:

prostate <- read_sav("prostate.sav")
prostate <-labelled::to_factor(prostate, labelled_only = TRUE, strict = TRUE, unclass = TRUE)
summary.data.frame(prostate)
##      patno           stage                     rx          dtime      
##  Min.   :  1.0   Min.   :3.000   placebo        :127   Min.   : 0.00  
##  1st Qu.:126.2   1st Qu.:3.000   0.2 mg estrogen:124   1st Qu.:14.25  
##  Median :251.5   Median :3.000   1.0 mg estrogen:126   Median :34.00  
##  Mean   :251.7   Mean   :3.424   5.0 mg estrogen:125   Mean   :36.13  
##  3rd Qu.:376.8   3rd Qu.:4.000                         3rd Qu.:57.75  
##  Max.   :506.0   Max.   :4.000                         Max.   :76.00  
##                                                                       
##                           status         age              wt        
##  alive                       :148   Min.   :48.00   Min.   : 69.00  
##  dead - prostatic ca         :130   1st Qu.:70.00   1st Qu.: 90.00  
##  dead - heart or vascular    : 96   Median :73.00   Median : 98.00  
##  dead - cerebrovascular      : 31   Mean   :71.46   Mean   : 99.03  
##  dead - other specific non-ca: 28   3rd Qu.:76.00   3rd Qu.:107.00  
##  dead - other ca             : 25   Max.   :89.00   Max.   :152.00  
##  (Other)                     : 44   NA's   :1       NA's   :2       
##                     pf            hx              sbp             dbp        
##  normal activity     :450   Min.   :0.0000   Min.   : 8.00   Min.   : 4.000  
##  in bed < 50% daytime: 37   1st Qu.:0.0000   1st Qu.:13.00   1st Qu.: 7.000  
##  in bed > 50% daytime: 13   Median :0.0000   Median :14.00   Median : 8.000  
##  confined to bed     :  2   Mean   :0.4243   Mean   :14.35   Mean   : 8.149  
##                             3rd Qu.:1.0000   3rd Qu.:16.00   3rd Qu.: 9.000  
##                             Max.   :1.0000   Max.   :30.00   Max.   :18.000  
##                                                                              
##                                 ekg            hg               sz       
##  normal                           :168   Min.   : 5.899   Min.   : 0.00  
##  heart strain                     :150   1st Qu.:12.299   1st Qu.: 5.00  
##  old MI                           : 75   Median :13.699   Median :11.00  
##  rhythmic disturb & electrolyte ch: 51   Mean   :13.446   Mean   :14.63  
##  heart block or conduction def    : 26   3rd Qu.:14.699   3rd Qu.:21.00  
##  (Other)                          : 24   Max.   :21.199   Max.   :69.00  
##  NA's                             :  8                    NA's   :5      
##        sg              ap                bm        
##  Min.   : 5.00   Min.   :  0.100   Min.   :0.0000  
##  1st Qu.: 9.00   1st Qu.:  0.500   1st Qu.:0.0000  
##  Median :10.00   Median :  0.700   Median :0.0000  
##  Mean   :10.31   Mean   : 12.176   Mean   :0.1633  
##  3rd Qu.:11.00   3rd Qu.:  2.975   3rd Qu.:0.0000  
##  Max.   :15.00   Max.   :999.875   Max.   :1.0000  
##  NA's   :11

Here are the suggestions for the initial data analysis:

  1. Create a descriptive summary of all the variables contained in the dataset (except for patient number!).

Do you think that some variables could be re-coded combining an infrequent category with the next category?

For example, pay attention to the frequencies of ekg and pf.

Hint from a discussion with clinicians: in the ekg variable we could collapse the normal and benign conditions.

For pf, you can combine in bed > 50% daytime with confined to bed in one category.

  1. Then, here there is another question for you: based on the study design (RCT), do you expect to find significant statistical differences between baseline characteristics of groups defined by the levels of the treatment rx ? Build a descriptive table comparing the four groups and evaluating the global p value across groups. (Hint: for numerical variables you could use the Kruskal-Wallis test and for categorical ones the Chi-square test).

  2. Give a look to the distribution’s shape of the continuous variables : do you think that some scale transformation (for example taking the logarithm) could help in improving the normality of some of them ?

  3. Now: subset the original dataset selecting only the patients died for cardiovascular–cerebrovascular death or prostate cancer death. Create the outcome variable cvd identifying as “1” patients died for cardiovascular–cerebrovascular death and as “0” patients died for prostate cancer. (We will use then logistic regression for the prediction model).

The reduced dataset “prostate_rid” is created. Here we begin with the prediction model development part.

Let’s begin with the prediction model construction

First of all, we upload the reduced dataset with the recoded variables that have been suggested in the IDA phase:

prostate_rid <- read_sav("prostate_rid.sav")
prostate_rid <-labelled::to_factor(prostate_rid, labelled_only = TRUE, strict = TRUE, unclass = TRUE)
summary.data.frame(prostate_rid)
##      patno           stage                     rx         dtime      
##  Min.   :  3.0   Min.   :3.000   placebo        :71   Min.   : 0.00  
##  1st Qu.:131.0   1st Qu.:3.000   0.2 mg estrogen:67   1st Qu.:11.00  
##  Median :265.0   Median :4.000   1.0 mg estrogen:49   Median :24.00  
##  Mean   :260.5   Mean   :3.525   5.0 mg estrogen:70   Mean   :25.26  
##  3rd Qu.:395.0   3rd Qu.:4.000                        3rd Qu.:37.00  
##  Max.   :506.0   Max.   :4.000                        Max.   :71.00  
##                                                                      
##                       status         age              wt        
##  dead - prostatic ca     :130   Min.   :48.00   Min.   : 71.00  
##  dead - heart or vascular: 96   1st Qu.:70.00   1st Qu.: 89.00  
##  dead - cerebrovascular  : 31   Median :73.00   Median : 97.00  
##  alive                   :  0   Mean   :71.55   Mean   : 98.03  
##  dead - pulmonary embolus:  0   3rd Qu.:76.00   3rd Qu.:106.00  
##  dead - other ca         :  0   Max.   :88.00   Max.   :152.00  
##  (Other)                 :  0                                   
##                     pf            hx              sbp             dbp        
##  normal activity     :221   Min.   :0.0000   Min.   : 8.00   Min.   : 4.000  
##  in bed < 50% daytime: 24   1st Qu.:0.0000   1st Qu.:13.00   1st Qu.: 7.000  
##  in bed > 50% daytime: 10   Median :1.0000   Median :14.00   Median : 8.000  
##  confined to bed     :  2   Mean   :0.5097   Mean   :14.53   Mean   : 8.144  
##                             3rd Qu.:1.0000   3rd Qu.:16.00   3rd Qu.: 9.000  
##                             Max.   :1.0000   Max.   :24.00   Max.   :13.000  
##                                                                              
##                                 ekg           hg               sz       
##  heart strain                     :83   Min.   : 5.899   Min.   : 0.00  
##  normal                           :74   1st Qu.:12.000   1st Qu.: 6.00  
##  old MI                           :44   Median :13.600   Median :14.00  
##  rhythmic disturb & electrolyte ch:28   Mean   :13.301   Mean   :16.99  
##  benign                           :13   3rd Qu.:14.600   3rd Qu.:24.50  
##  (Other)                          :10   Max.   :21.199   Max.   :69.00  
##  NA's                             : 5                    NA's   :2      
##        sg              ap               bm              ap80         ekg_rec  
##  Min.   : 5.00   Min.   :  0.10   Min.   :0.0000   Min.   :0.0000   bngn :87  
##  1st Qu.: 9.00   1st Qu.:  0.60   1st Qu.:0.0000   1st Qu.:0.0000   rd&ec:28  
##  Median :11.00   Median :  1.10   Median :0.0000   Median :0.0000   hbocd:10  
##  Mean   :10.75   Mean   : 16.17   Mean   :0.2451   Mean   :0.4553   hrts :83  
##  3rd Qu.:12.00   3rd Qu.:  7.00   3rd Qu.:0.0000   3rd Qu.:1.0000   MI   :44  
##  Max.   :15.00   Max.   :999.88   Max.   :1.0000   Max.   :1.0000   NA's : 5  
##  NA's   :5                                                                    
##               pf_rec                       status_num      subset 
##  normal          :221   dead - prostatic ca     :130   Min.   :1  
##  bed_under_50    : 24   dead - heart or vascular: 96   1st Qu.:1  
##  bed_over_50_conf: 12   dead - cerebrovascular  : 31   Median :1  
##                         alive                   :  0   Mean   :1  
##                         dead - pulmonary embolus:  0   3rd Qu.:1  
##                         dead - other ca         :  0   Max.   :1  
##                         (Other)                 :  0              
##       cvd             log_ap        
##  Min.   :0.0000   Min.   :-2.30268  
##  1st Qu.:0.0000   1st Qu.:-0.51087  
##  Median :0.0000   Median : 0.09518  
##  Mean   :0.4942   Mean   : 0.70723  
##  3rd Qu.:1.0000   3rd Qu.: 1.94591  
##  Max.   :1.0000   Max.   : 6.90763  
## 

Here we define ranges of the variables contained in the dataset (this is a useful function both for data description and for the subsequent interpretation of the regression coefficients):

options(datadist='dd')
dd <-datadist(prostate_rid)
dd
##                     patno stage              rx dtime               status
## Low:effect      131.00000     3            <NA>    11                 <NA>
## Adjust to       265.00000     3         placebo    24  dead - prostatic ca
## High:effect     395.00000     4            <NA>    37                 <NA>
## Low:prediction   24.96109     3         placebo     1                alive
## High:prediction 482.03891     4 5.0 mg estrogen    58 dead - unknown cause
## Low               3.00000     3         placebo     0                alive
## High            506.00000     4 5.0 mg estrogen    71 dead - unknown cause
##                      age       wt              pf hx sbp dbp          ekg
## Low:effect      70.00000  89.0000            <NA>  0  13   7         <NA>
## Adjust to       73.00000  97.0000 normal activity  0  14   8 heart strain
## High:effect     76.00000 106.0000            <NA>  1  16   9         <NA>
## Low:prediction  55.96109  76.0000 normal activity  0  11   6       normal
## High:prediction 80.00000 125.0389 confined to bed  1  19  10    recent MI
## Low             48.00000  71.0000 normal activity  0   8   4       normal
## High            88.00000 152.0000 confined to bed  1  24  13    recent MI
##                        hg       sz sg           ap bm ap80 ekg_rec
## Low:effect      12.000000  6.00000  9   0.59997559  0    0    <NA>
## Adjust to       13.599609 14.00000 11   1.09985352  0    0    bngn
## High:effect     14.599609 24.50000 12   7.00000000  1    1    <NA>
## Low:prediction   9.298828  2.00000  8   0.19998169  0    0    bngn
## High:prediction 16.503800 46.11673 14  54.97592412  1    1      MI
## Low              5.899414  0.00000  5   0.09999084  0    0    bngn
## High            21.199219 69.00000 15 999.87500000  1    1      MI
##                           pf_rec           status_num subset cvd      log_ap
## Low:effect                  <NA>                 <NA>      1   0 -0.51086631
## Adjust to                 normal  dead - prostatic ca      1   0  0.09517701
## High:effect                 <NA>                 <NA>      1   1  1.94591015
## Low:prediction            normal                alive      1   0 -1.60952947
## High:prediction bed_over_50_conf dead - unknown cause      1   1  4.00677071
## Low                       normal                alive      1   0 -2.30267665
## High            bed_over_50_conf dead - unknown cause      1   1  6.90763027
## 
## Values:
## 
## stage : 3 4 
## rx : placebo 0.2 mg estrogen 1.0 mg estrogen 5.0 mg estrogen 
## status :
##  [1] alive                        dead - prostatic ca         
##  [3] dead - heart or vascular     dead - cerebrovascular      
##  [5] dead - pulmonary embolus     dead - other ca             
##  [7] dead - respiratory disease   dead - other specific non-ca
##  [9] dead - unspecified non-ca    dead - unknown cause        
## pf : normal activity in bed < 50% daytime in bed > 50% daytime confined to bed 
## hx : 0 1 
## dbp :  4  5  6  7  8  9 10 11 12 13 
## ekg :
## [1] normal                            benign                           
## [3] rhythmic disturb & electrolyte ch heart block or conduction def    
## [5] heart strain                      old MI                           
## [7] recent MI                        
## bm : 0 1 
## ap80 : 0 1 
## ekg_rec : bngn rd&ec hbocd hrts MI 
## pf_rec : normal bed_under_50 bed_over_50_conf 
## status_num :
##  [1] alive                        dead - prostatic ca         
##  [3] dead - heart or vascular     dead - cerebrovascular      
##  [5] dead - pulmonary embolus     dead - other ca             
##  [7] dead - respiratory disease   dead - other specific non-ca
##  [9] dead - unspecified non-ca    dead - unknown cause        
## subset : 1 
## cvd : 0 1

Functional forms of continous variables

Now we visually explore the assumption about the linearity of the effect for continuous candidate predictors by means of nonparametric smoothed regression:

par(mfrow=c(2,2))
plsmo(prostate_rid$age,prostate_rid$cvd)
plsmo(prostate_rid$wt, prostate_rid$cvd)
plsmo(prostate_rid$sbp,prostate_rid$cvd)
plsmo(prostate_rid$dbp,prostate_rid$cvd)

A strange behaviour is observed for wt i.e. the Weight Index [calculated as wt= wt(kg)-ht(cm)+200]; the others variables could be roughly approximated by a linear effect.

par(mfrow=c(2,3))
plsmo(prostate_rid$hg, prostate_rid$cvd)
plsmo(prostate_rid$sz, prostate_rid$cvd)
plsmo(prostate_rid$sg, prostate_rid$cvd)
plsmo(prostate_rid$ap, prostate_rid$cvd) 
plsmo(prostate_rid$dtime, prostate_rid$cvd) 

Again, a strange behaviour is observed for dtime (Months of Follow-up, as it is obvioulsy expected…), the others could be roughly approximated by a linear effect. Remind that this dataset is quite small, so we have to be parsimonious in the number of coefficients to be estimated !

Note that, being this dataset quite small, we will use all the dataset for the training phase of the regression model, but after for evaluating the performance of the model we will apply boostrap techniques. This is obviously not the optimal way to proceed, but when there is a constraint of a low-moderate sample size (as often happen in medical data collected prospectively) there is no possibility of further splitting the sample. With such a small sample size we could not afford an initial split of the study sample in separate training and test subgroups. This is also a great obstacle in applying machine learning techniques to small-moderate medical datasets.

Univariable analysis

Even if univariable analysis is generally discouraged to select candidate predictors (it works only with perfectly uncorrelated variables, that is nearly impossible in biological problems), it is another exploratory step that could be useful also to discuss then with clinicians about the indications of associations that are present in the data. So first of all we can create a table with univariable results, assuming for the moment a linear effect for all the continuous variables :

prostate_rid %>% select(-patno,-stage,-pf,-ekg,-status,-status_num,-subset) %>% 
  tbl_uvregression(method = glm, # glm function
                 method.args = list(family = binomial),# logistic model
                 exponentiate = T, # report OR
                 y=cvd,# outcome variable
                 conf.level = 0.95,
                 pvalue_fun = function(x) style_pvalue(x, digits = 3))
Characteristic N OR1 95% CI1 p-value
rx 257


    placebo

    0.2 mg estrogen
0.65 0.33, 1.27 0.210
    1.0 mg estrogen
1.13 0.55, 2.36 0.736
    5.0 mg estrogen
1.73 0.89, 3.41 0.107
Months of Follow-up 257 1.00 0.98, 1.01 0.707
Age in Years 257 1.07 1.03, 1.11 <0.001
Weight Index = wt(kg)-ht(cm)+200 257 1.00 0.99, 1.02 0.726
History of Cardiovascular Disease 257 3.96 2.37, 6.71 <0.001
Systolic Blood Pressure/10 257 1.09 0.99, 1.21 0.082
Diastolic Blood Pressure/10 257 1.30 1.09, 1.57 0.005
Serum Hemoglobin (g/100ml) 257 1.17 1.04, 1.33 0.012
Size of Primary Tumor (cm^2) 255 0.93 0.91, 0.95 <0.001
Combined Index of Stage and Hist. Grade 252 0.56 0.47, 0.66 <0.001
Serum Prostatic Acid Phosphatase 257 0.93 0.89, 0.96 <0.001
Bone Metastases 257 0.28 0.15, 0.52 <0.001
ap80 257 5.74 3.38, 9.96 <0.001
ekg_rec 252


    bngn

    rd&ec
1.35 0.57, 3.20 0.492
    hbocd
1.04 0.25, 3.91 0.955
    hrts
2.04 1.11, 3.78 0.023
    MI
2.73 1.30, 5.88 0.009
pf_rec 257


    normal

    bed_under_50
1.36 0.58, 3.28 0.477
    bed_over_50_conf
0.09 0.00, 0.47 0.021
log_ap 257 0.55 0.45, 0.65 <0.001
1 OR = Odds Ratio, CI = Confidence Interval

Note that here all the odds ratios (OR) are expressing variations by 1-unit on the scale of the independent variable.

This table should be commented with the clinicians in order to discuss if the observed statistical significant associations are in line with what it is expected from the biological perspective.

Now, we can come back to the functional forms for the continuous variables: try now to use a nonlinear effect for wt and for dtime

fit.splines.wt <-  lrm(cvd ~ rcs(wt,4), data=prostate_rid)
print(fit.splines.wt)
## Logistic Regression Model
## 
## lrm(formula = cvd ~ rcs(wt, 4), data = prostate_rid)
## 
##                       Model Likelihood      Discrimination    Rank Discrim.    
##                             Ratio Test             Indexes          Indexes    
## Obs           257    LR chi2      0.66      R2       0.003    C       0.518    
##  0            130    d.f.            3      R2(3,257)0.000    Dxy     0.036    
##  1            127    Pr(> chi2) 0.8833    R2(3,192.7)0.000    gamma   0.037    
## max |deriv| 1e-12                           Brier    0.249    tau-a   0.018    
## 
##           Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept -2.5286 3.5066 -0.72  0.4708  
## wt         0.0300 0.0418  0.72  0.4731  
## wt'       -0.1046 0.1446 -0.72  0.4693  
## wt''       0.3319 0.4557  0.73  0.4664
summary(fit.splines.wt)
##              Effects              Response : cvd 
## 
##  Factor      Low High Diff. Effect   S.E.    Lower 0.95 Upper 0.95
##  wt          89  106  17    -0.14415 0.32689 -0.78484   0.49654   
##   Odds Ratio 89  106  17     0.86576      NA  0.45619   1.64300

We observe that there is no significant effect for the nonlinear components of the spline.

fit.splines.dtime <-  lrm(cvd ~ rcs(dtime,4), data=prostate_rid)
print(fit.splines.dtime)
## Logistic Regression Model
## 
## lrm(formula = cvd ~ rcs(dtime, 4), data = prostate_rid)
## 
##                       Model Likelihood      Discrimination    Rank Discrim.    
##                             Ratio Test             Indexes          Indexes    
## Obs           257    LR chi2      8.04      R2       0.041    C       0.591    
##  0            130    d.f.            3      R2(3,257)0.019    Dxy     0.181    
##  1            127    Pr(> chi2) 0.0452    R2(3,192.7)0.026    gamma   0.185    
## max |deriv| 1e-06                           Brier    0.242    tau-a   0.091    
## 
##           Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept  0.6180 0.3867  1.60  0.1100  
## dtime     -0.0991 0.0422 -2.35  0.0188  
## dtime'     0.4881 0.1833  2.66  0.0078  
## dtime''   -1.2217 0.4470 -2.73  0.0063
summary(fit.splines.dtime)
##              Effects              Response : cvd 
## 
##  Factor      Low High Diff. Effect  S.E.    Lower 0.95 Upper 0.95
##  dtime       11  37   26    0.73072 0.35467 0.035589   1.4259    
##   Odds Ratio 11  37   26    2.07660      NA 1.036200   4.1614

Here instead there is a significant nonlinear effect for the dtime variable (as expected being the follow up…) In this case, the interpretation of the odds ratio is reported as a contrast between two specific values (37 vs 11), i.e. the third and first quartile of the variable. This is done in order to help in the interpretation of the non-linear effect modelled by the spline function.

Let’s visualize the effect of this candidate predictor modeled in a nonlinear way:

ggplot(Predict(fit.splines.dtime))

This is quite similar to the non parametric regression previously used:

plsmo(prostate_rid$dtime, prostate_rid$cvd) 

Multivariable full model

We decide to start with a full (no selection of predictors) binary logistic model assuming linearity for continuous predictors except for dtime:

fit.multi <- lrm(cvd ~ sz + sg + log(ap) + sbp + dbp + age + wt +hg + ekg_rec + pf_rec + bm + hx + rx +rcs(dtime,4), data =prostate_rid)

Note here that there is usually an upper-bound on the number of candidate predictors that can be used in a [logistic/poisson/survival] regression model dictated by the number of events occurring in the population (we will take into account this aspect when the sample size issue in multivariable modelling will be discussed).

Let’s see the results:

print(fit.multi)
## Frequencies of Missing Values Due to Each Variable
##     cvd      sz      sg      ap     sbp     dbp     age      wt      hg ekg_rec 
##       0       2       5       0       0       0       0       0       0       5 
##  pf_rec      bm      hx      rx   dtime 
##       0       0       0       0       0 
## 
## Logistic Regression Model
## 
## lrm(formula = cvd ~ sz + sg + log(ap) + sbp + dbp + age + wt + 
##     hg + ekg_rec + pf_rec + bm + hx + rx + rcs(dtime, 4), data = prostate_rid)
## 
## 
##                        Model Likelihood       Discrimination    Rank Discrim.    
##                              Ratio Test              Indexes          Indexes    
## Obs           245    LR chi2     150.16       R2       0.611    C       0.906    
##  0            121    d.f.            22      R2(22,245)0.407    Dxy     0.811    
##  1            124    Pr(> chi2) <0.0001    R2(22,183.7)0.502    gamma   0.811    
## max |deriv| 2e-09                             Brier    0.124    tau-a   0.407    
## 
##                         Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept               -4.7151 3.6106 -1.31  0.1916  
## sz                      -0.0635 0.0174 -3.64  0.0003  
## sg                      -0.3281 0.1247 -2.63  0.0085  
## ap                      -0.3808 0.1468 -2.59  0.0095  
## sbp                     -0.0278 0.0931 -0.30  0.7654  
## dbp                      0.4263 0.1727  2.47  0.0136  
## age                      0.1004 0.0317  3.16  0.0016  
## wt                      -0.0172 0.0148 -1.16  0.2442  
## hg                       0.0708 0.1060  0.67  0.5042  
## ekg_rec=rd&ec           -0.5096 0.6760 -0.75  0.4509  
## ekg_rec=hbocd           -1.3951 0.8889 -1.57  0.1166  
## ekg_rec=hrts             0.1518 0.4586  0.33  0.7407  
## ekg_rec=MI               0.0106 0.6210  0.02  0.9864  
## pf_rec=bed_under_50      1.3081 0.8132  1.61  0.1077  
## pf_rec=bed_over_50_conf -2.2541 1.2487 -1.81  0.0710  
## bm                       0.0981 0.5397  0.18  0.8557  
## hx                       1.3719 0.4195  3.27  0.0011  
## rx=0.2 mg estrogen      -0.3661 0.5245 -0.70  0.4852  
## rx=1.0 mg estrogen       0.9277 0.5579  1.66  0.0964  
## rx=5.0 mg estrogen       0.7030 0.5238  1.34  0.1796  
## dtime                   -0.1256 0.0691 -1.82  0.0691  
## dtime'                   0.6755 0.2979  2.27  0.0234  
## dtime''                 -1.7766 0.7258 -2.45  0.0144
summary(fit.multi)
##              Effects              Response : cvd 
## 
##  Factor                           Low      High  Diff.   Effect    S.E.   
##  sz                                6.00000  24.5 18.5000 -1.174100 0.32268
##   Odds Ratio                       6.00000  24.5 18.5000  0.309110      NA
##  sg                                9.00000  12.0  3.0000 -0.984350 0.37423
##   Odds Ratio                       9.00000  12.0  3.0000  0.373680      NA
##  ap                                0.59998   7.0  6.4000 -0.935590 0.36060
##   Odds Ratio                       0.59998   7.0  6.4000  0.392360      NA
##  sbp                              13.00000  16.0  3.0000 -0.083384 0.27943
##   Odds Ratio                      13.00000  16.0  3.0000  0.920000      NA
##  dbp                               7.00000   9.0  2.0000  0.852590 0.34535
##   Odds Ratio                       7.00000   9.0  2.0000  2.345700      NA
##  age                              70.00000  76.0  6.0000  0.602180 0.19035
##   Odds Ratio                      70.00000  76.0  6.0000  1.826100      NA
##  wt                               89.00000 106.0 17.0000 -0.292780 0.25140
##   Odds Ratio                      89.00000 106.0 17.0000  0.746180      NA
##  hg                               12.00000  14.6  2.5996  0.184020 0.27553
##   Odds Ratio                      12.00000  14.6  2.5996  1.202000      NA
##  bm                                0.00000   1.0  1.0000  0.098134 0.53966
##   Odds Ratio                       0.00000   1.0  1.0000  1.103100      NA
##  hx                                0.00000   1.0  1.0000  1.371900 0.41948
##   Odds Ratio                       0.00000   1.0  1.0000  3.943000      NA
##  dtime                            11.00000  37.0 26.0000  1.053600 0.55450
##   Odds Ratio                      11.00000  37.0 26.0000  2.867900      NA
##  ekg_rec - rd&ec:bngn              1.00000   2.0      NA -0.509560 0.67596
##   Odds Ratio                       1.00000   2.0      NA  0.600760      NA
##  ekg_rec - hbocd:bngn              1.00000   3.0      NA -1.395100 0.88895
##   Odds Ratio                       1.00000   3.0      NA  0.247820      NA
##  ekg_rec - hrts:bngn               1.00000   4.0      NA  0.151770 0.45859
##   Odds Ratio                       1.00000   4.0      NA  1.163900      NA
##  ekg_rec - MI:bngn                 1.00000   5.0      NA  0.010576 0.62096
##   Odds Ratio                       1.00000   5.0      NA  1.010600      NA
##  pf_rec - bed_under_50:normal      1.00000   2.0      NA  1.308100 0.81317
##   Odds Ratio                       1.00000   2.0      NA  3.699200      NA
##  pf_rec - bed_over_50_conf:normal  1.00000   3.0      NA -2.254100 1.24870
##   Odds Ratio                       1.00000   3.0      NA  0.104970      NA
##  rx - 0.2 mg estrogen:placebo      1.00000   2.0      NA -0.366070 0.52451
##   Odds Ratio                       1.00000   2.0      NA  0.693460      NA
##  rx - 1.0 mg estrogen:placebo      1.00000   3.0      NA  0.927670 0.55790
##   Odds Ratio                       1.00000   3.0      NA  2.528600      NA
##  rx - 5.0 mg estrogen:placebo      1.00000   4.0      NA  0.702950 0.52383
##   Odds Ratio                       1.00000   4.0      NA  2.019700      NA
##  Lower 0.95 Upper 0.95
##  -1.8065000 -0.54161  
##   0.1642300  0.58181  
##  -1.7178000 -0.25087  
##   0.1794500  0.77813  
##  -1.6424000 -0.22882  
##   0.1935200  0.79547  
##  -0.6310500  0.46429  
##   0.5320300  1.59090  
##   0.1757100  1.52950  
##   1.1921000  4.61570  
##   0.2291100  0.97525  
##   1.2575000  2.65180  
##  -0.7855300  0.19996  
##   0.4558800  1.22140  
##  -0.3560000  0.72405  
##   0.7004700  2.06280  
##  -0.9595800  1.15590  
##   0.3830500  3.17670  
##   0.5497700  2.19410  
##   1.7329000  8.97190  
##  -0.0332040  2.14040  
##   0.9673400  8.50280  
##  -1.8344000  0.81529  
##   0.1597100  2.25980  
##  -3.1374000  0.34723  
##   0.0433970  1.41510  
##  -0.7470500  1.05060  
##   0.4737600  2.85930  
##  -1.2065000  1.22760  
##   0.2992500  3.41320  
##  -0.2856600  2.90190  
##   0.7515200 18.20900  
##  -4.7014000  0.19330  
##   0.0090825  1.21320  
##  -1.3941000  0.66195  
##   0.2480600  1.93860  
##  -0.1657900  2.02110  
##   0.8472200  7.54690  
##  -0.3237300  1.72960  
##   0.7234500  5.63860
an <- anova (fit.multi)
an
##                 Wald Statistics          Response: cvd 
## 
##  Factor     Chi-Square d.f. P     
##  sz         13.24       1   0.0003
##  sg          6.92       1   0.0085
##  ap          6.73       1   0.0095
##  sbp         0.09       1   0.7654
##  dbp         6.09       1   0.0136
##  age        10.01       1   0.0016
##  wt          1.36       1   0.2442
##  hg          0.45       1   0.5042
##  ekg_rec     3.60       4   0.4628
##  pf_rec      6.11       2   0.0471
##  bm          0.03       1   0.8557
##  hx         10.70       1   0.0011
##  rx          6.67       3   0.0831
##  dtime       8.80       3   0.0321
##   Nonlinear  8.02       2   0.0182
##  TOTAL      62.82      22   <.0001
plot(an) 

More than interpreting the single regression coefficients and their corresponding statistical significance (even if in the last phase, i.e. when presenting results to the clinicians also this aspect is important!!), in prediction modelling we are mostly interested in evaluating the accuracy in predictions.

Now we want to evaluate the performance of the full model:

s         <- fit.multi$stats
round(s, digits=3)
##          Obs    Max Deriv   Model L.R.         d.f.            P            C 
##      245.000        0.000      150.158       22.000        0.000        0.906 
##          Dxy        Gamma        Tau-a           R2      R2(245)   R2(22,245) 
##        0.811        0.811        0.407        0.611        0.458        0.407 
##    R2(183.7) R2(22,183.7)        Brier            g           gr           gp 
##        0.558        0.502        0.124        3.008       20.245        0.408

C is the AUC (discrimination measure) and Dxy is related to C (Dxy=2*(C-0.5)), so they are both discrimination measures. They measure how the model perform in the relative ranking of the estimated risk.

gamma.hat <-  (s['Model L.R.'] - s['d.f.'])/s['Model L.R.']
gamma.hat
## Model L.R. 
##  0.8534877

Estimation of the shrinkage coefficient gamma allows the quantification of the amount of overfitting present, and it also allows one to estimate the likelihood that the model will reliably predict new observations. The van Houwelingen-Le Cessie heuristic shrinkage estimate is 0.85, indicating that this model will validate on new data about 15% worse than on this dataset. This is also related to the calibration concept (the absolute ordering of the predicted risks).

Model visualization: a useful interim step to be discussed with clinicians !

Let’s now visualize the predictors effects on the logit scale:

ggplot(Predict(fit.multi))

Now let’s see them on the odds ratio scale:

plot(summary(fit.multi),log=TRUE)

The general function is plot(summary(fit.multi)) but it less readable, for this reason there is the option log=TRUE.

Estimates of the associations are related to the interquartile-range odds ratios for continuous predictors and odds ratios with respect to a reference level for categorical predictors. Numbers at left are upper quartile vs lower quartile or current group vs reference group. The bars represent confidence limits. The intervals are drawn on the log odds ratio scale and labeled on the odds ratio scale. Ranges are on the original scale. Discussing this data with clinicians is also useful for thinking about possible interaction effects between some of the candidate predictors; in this example we will not examine this issue.

Using more splines!

Now, forget for a moment our sample size limitation, we want to check if it is worth to expand all the others continuous variables in a full model approach using splines:

fit.all.splines <-  lrm(cvd ~ rcs(sz,4) + rcs(sg,4) + rcs(log(ap),4) +rcs(sbp,4) + rcs(dbp,4) + rcs(age,4) 
                        + rcs(wt,4) + rcs(hg,4) + ekg_rec + pf_rec + bm + hx + rx + rcs(dtime,4),data=prostate_rid)
print(fit.all.splines)
## Frequencies of Missing Values Due to Each Variable
##     cvd      sz      sg      ap     sbp     dbp     age      wt      hg ekg_rec 
##       0       2       5       0       0       0       0       0       0       5 
##  pf_rec      bm      hx      rx   dtime 
##       0       0       0       0       0 
## 
## Logistic Regression Model
## 
## lrm(formula = cvd ~ rcs(sz, 4) + rcs(sg, 4) + rcs(log(ap), 4) + 
##     rcs(sbp, 4) + rcs(dbp, 4) + rcs(age, 4) + rcs(wt, 4) + rcs(hg, 
##     4) + ekg_rec + pf_rec + bm + hx + rx + rcs(dtime, 4), data = prostate_rid)
## 
## 
##                        Model Likelihood       Discrimination    Rank Discrim.    
##                              Ratio Test              Indexes          Indexes    
## Obs           245    LR chi2     177.11       R2       0.686    C       0.933    
##  0            121    d.f.            38      R2(38,245)0.433    Dxy     0.865    
##  1            124    Pr(> chi2) <0.0001    R2(38,183.7)0.531    gamma   0.865    
## max |deriv| 4e-11                             Brier    0.104    tau-a   0.434    
## 
##                         Coef     S.E.    Wald Z Pr(>|Z|)
## Intercept               -24.1991 13.5603 -1.78  0.0743  
## sz                       -0.2597  0.1295 -2.00  0.0450  
## sz'                       1.6649  0.9134  1.82  0.0683  
## sz''                     -3.1513  1.6810 -1.87  0.0608  
## sg                        0.5706  0.5774  0.99  0.3230  
## sg'                      -4.7064  2.0646 -2.28  0.0226  
## sg''                     21.8060  8.9783  2.43  0.0152  
## ap                       -0.4486  0.8526 -0.53  0.5988  
## ap'                       3.6227  5.5246  0.66  0.5120  
## ap''                     -7.7643 10.2918 -0.75  0.4506  
## sbp                      -0.0453  0.3717 -0.12  0.9030  
## sbp'                     -0.1906  1.2777 -0.15  0.8814  
## sbp''                     0.5496  7.7526  0.07  0.9435  
## dbp                      -0.1280  0.4527 -0.28  0.7773  
## dbp'                      0.4811  1.0301  0.47  0.6405  
## dbp''                     3.6419  6.7889  0.54  0.5916  
## age                       0.3420  0.1106  3.09  0.0020  
## age'                     -0.3789  0.1770 -2.14  0.0323  
## age''                     3.6426  2.0773  1.75  0.0795  
## wt                        0.0024  0.0893  0.03  0.9788  
## wt'                      -0.0771  0.2949 -0.26  0.7936  
## wt''                      0.2732  0.9056  0.30  0.7629  
## hg                        0.2602  0.3600  0.72  0.4697  
## hg'                      -0.9822  0.8826 -1.11  0.2657  
## hg''                      7.9914  5.4689  1.46  0.1439  
## ekg_rec=rd&ec            -0.6508  0.7960 -0.82  0.4136  
## ekg_rec=hbocd            -2.0159  1.0458 -1.93  0.0539  
## ekg_rec=hrts             -0.0923  0.5329 -0.17  0.8625  
## ekg_rec=MI               -0.1072  0.7061 -0.15  0.8793  
## pf_rec=bed_under_50       1.7917  0.9468  1.89  0.0584  
## pf_rec=bed_over_50_conf  -1.2181  1.3352 -0.91  0.3616  
## bm                       -0.1454  0.6230 -0.23  0.8155  
## hx                        1.8027  0.5203  3.46  0.0005  
## rx=0.2 mg estrogen       -0.4657  0.5854 -0.80  0.4263  
## rx=1.0 mg estrogen        0.6931  0.6265  1.11  0.2686  
## rx=5.0 mg estrogen        0.5258  0.6375  0.82  0.4095  
## dtime                    -0.1925  0.0810 -2.38  0.0175  
## dtime'                    0.9895  0.3536  2.80  0.0051  
## dtime''                  -2.5924  0.8682 -2.99  0.0028
summary(fit.all.splines)
##              Effects              Response : cvd 
## 
##  Factor                           Low      High  Diff.   Effect     S.E.   
##  sz                                6.00000  24.5 18.5000 -0.4753200 0.57195
##   Odds Ratio                       6.00000  24.5 18.5000  0.6216800      NA
##  sg                                9.00000  12.0  3.0000 -3.0788000 0.86958
##   Odds Ratio                       9.00000  12.0  3.0000  0.0460160      NA
##  ap                                0.59998   7.0  6.4000 -0.0015339 0.71565
##   Odds Ratio                       0.59998   7.0  6.4000  0.9984700      NA
##  sbp                              13.00000  16.0  3.0000 -0.4203600 0.64517
##   Odds Ratio                      13.00000  16.0  3.0000  0.6568100      NA
##  dbp                               7.00000   9.0  2.0000  0.7533600 0.70002
##   Odds Ratio                       7.00000   9.0  2.0000  2.1241000      NA
##  age                              70.00000  76.0  6.0000 -0.4358300 0.56104
##   Odds Ratio                      70.00000  76.0  6.0000  0.6467300      NA
##  wt                               89.00000 106.0 17.0000 -0.4036000 0.59476
##   Odds Ratio                      89.00000 106.0 17.0000  0.6679100      NA
##  hg                               12.00000  14.6  2.5996 -0.5632000 0.62210
##   Odds Ratio                      12.00000  14.6  2.5996  0.5693800      NA
##  bm                                0.00000   1.0  1.0000 -0.1453900 0.62300
##   Odds Ratio                       0.00000   1.0  1.0000  0.8646800      NA
##  hx                                0.00000   1.0  1.0000  1.8027000 0.52030
##   Odds Ratio                       0.00000   1.0  1.0000  6.0659000      NA
##  dtime                            11.00000  37.0 26.0000  1.3504000 0.64498
##   Odds Ratio                      11.00000  37.0 26.0000  3.8589000      NA
##  ekg_rec - rd&ec:bngn              1.00000   2.0      NA -0.6508500 0.79604
##   Odds Ratio                       1.00000   2.0      NA  0.5216000      NA
##  ekg_rec - hbocd:bngn              1.00000   3.0      NA -2.0159000 1.04580
##   Odds Ratio                       1.00000   3.0      NA  0.1332000      NA
##  ekg_rec - hrts:bngn               1.00000   4.0      NA -0.0922990 0.53294
##   Odds Ratio                       1.00000   4.0      NA  0.9118300      NA
##  ekg_rec - MI:bngn                 1.00000   5.0      NA -0.1072100 0.70615
##   Odds Ratio                       1.00000   5.0      NA  0.8983400      NA
##  pf_rec - bed_under_50:normal      1.00000   2.0      NA  1.7917000 0.94679
##   Odds Ratio                       1.00000   2.0      NA  5.9995000      NA
##  pf_rec - bed_over_50_conf:normal  1.00000   3.0      NA -1.2181000 1.33520
##   Odds Ratio                       1.00000   3.0      NA  0.2957800      NA
##  rx - 0.2 mg estrogen:placebo      1.00000   2.0      NA -0.4657400 0.58539
##   Odds Ratio                       1.00000   2.0      NA  0.6276700      NA
##  rx - 1.0 mg estrogen:placebo      1.00000   3.0      NA  0.6931000 0.62648
##   Odds Ratio                       1.00000   3.0      NA  1.9999000      NA
##  rx - 5.0 mg estrogen:placebo      1.00000   4.0      NA  0.5257800 0.63747
##   Odds Ratio                       1.00000   4.0      NA  1.6918000      NA
##  Lower 0.95 Upper 0.95
##  -1.5963000  0.64568  
##   0.2026400  1.90730  
##  -4.7831000 -1.37440  
##   0.0083699  0.25299  
##  -1.4042000  1.40110  
##   0.2455700  4.05970  
##  -1.6849000  0.84415  
##   0.1854700  2.32600  
##  -0.6186600  2.12540  
##   0.5386600  8.37610  
##  -1.5354000  0.66378  
##   0.2153600  1.94210  
##  -1.5693000  0.76211  
##   0.2081900  2.14280  
##  -1.7825000  0.65608  
##   0.1682200  1.92720  
##  -1.3664000  1.07570  
##   0.2550100  2.93190  
##   0.7829300  2.82240  
##   2.1879000 16.81800  
##   0.0862480  2.61450  
##   1.0901000 13.66100  
##  -2.2111000  0.90936  
##   0.1095900  2.48270  
##  -4.0657000  0.03389  
##   0.0171510  1.03450  
##  -1.1368000  0.95224  
##   0.3208300  2.59150  
##  -1.4912000  1.27680  
##   0.2251000  3.58520  
##  -0.0640040  3.64740  
##   0.9380000 38.37300  
##  -3.8351000  1.39890  
##   0.0215980  4.05060  
##  -1.6131000  0.68161  
##   0.1992700  1.97710  
##  -0.5347800  1.92100  
##   0.5858000  6.82760  
##  -0.7236500  1.77520  
##   0.4849800  5.90150
an.s <- anova (fit.all.splines)
an.s
##                 Wald Statistics          Response: cvd 
## 
##  Factor          Chi-Square d.f. P     
##  sz              11.55       3   0.0091
##   Nonlinear       3.82       2   0.1481
##  sg              13.31       3   0.0040
##   Nonlinear       6.55       2   0.0379
##  ap               3.32       3   0.3443
##   Nonlinear       1.46       2   0.4827
##  sbp              1.40       3   0.7054
##   Nonlinear       0.14       2   0.9312
##  dbp             12.05       3   0.0072
##   Nonlinear       7.18       2   0.0276
##  age             13.01       3   0.0046
##   Nonlinear       5.02       2   0.0813
##  wt               0.69       3   0.8762
##   Nonlinear       0.16       2   0.9212
##  hg               3.13       3   0.3714
##   Nonlinear       3.03       2   0.2193
##  ekg_rec          4.20       4   0.3798
##  pf_rec           4.71       2   0.0951
##  bm               0.05       1   0.8155
##  hx              12.00       1   0.0005
##  rx               4.23       3   0.2373
##  dtime           12.40       3   0.0061
##   Nonlinear      10.89       2   0.0043
##  TOTAL NONLINEAR 24.81      18   0.1302
##  TOTAL           57.36      38   0.0227
plot(an.s) 

Let’s check what happens to the model’s performance:

s.splines <- fit.all.splines$stats # performance of the estimated model 
round(s.splines, digits=3) # C is the AUC and Dxy is related to C (Dxy=2*(C-0.5)), are both discrimination measures
##          Obs    Max Deriv   Model L.R.         d.f.            P            C 
##      245.000        0.000      177.108       38.000        0.000        0.933 
##          Dxy        Gamma        Tau-a           R2      R2(245)   R2(38,245) 
##        0.865        0.865        0.434        0.686        0.515        0.433 
##    R2(183.7) R2(38,183.7)        Brier            g           gr           gp 
##        0.619        0.531        0.104        4.107       60.753        0.432
gamma.hat.splines <-  (s.splines['Model L.R.'] - s.splines['d.f.'])/s.splines['Model L.R.'] 
gamma.hat.splines
## Model L.R. 
##  0.7854418

The van Houwelingen-Le Cessie heuristic shrinkage estimate is 0.79, indicating that this model will validate on new data about 21% worse than on this dataset (more complex than the previous model, more overfitting, as expected).

Let’s compare the performance of the two models in terms of AIC: the Akaike Information Criterion (the smaller is the better is):

AIC.models <- c(AIC(fit.multi), AIC(fit.all.splines))
AIC.models
## [1] 235.4473 240.4973

Based on AIC, the simpler model fitted to the raw data and assuming linearity for all the continuous predictors except dtime has a lower AIC and a lower optimism, so we would prefer that one as a predictive tool.

Selecting candidate predictors using backwards step-down selection

Now we use fast backward step-down (with total residual AIC as the stopping rule, that is a reasonable way to use stepwise selection) to identify the variables that explain the bulk of the cause of death. This algorithm performs a slightly inefficient but numerically stable version of fast backward elimination on factors;it uses the fitted full model and computes approximate Wald statistics by computing conditional (restricted) maximum likelihood estimates (assuming multivariate normality of estimates..). The function prints the deletion statistics for each variable in turn, and prints approximate parameter estimates for the model after deleting variables. The approximation is better when the number of factors deleted is not large.

fastbw(fit.multi)
## 
##  Deleted Chi-Sq d.f. P      Residual d.f. P      AIC   
##  ekg_rec 3.60   4    0.4628  3.60     4   0.4628  -4.40
##  bm      0.01   1    0.9242  3.61     5   0.6069  -6.39
##  sbp     0.08   1    0.7830  3.69     6   0.7192  -8.31
##  hg      0.30   1    0.5827  3.99     7   0.7813 -10.01
##  wt      0.92   1    0.3380  4.91     8   0.7677 -11.09
##  rx      7.31   3    0.0628 12.21    11   0.3480  -9.79
##  pf_rec  5.72   2    0.0573 17.93    13   0.1602  -8.07
##  dtime   7.29   3    0.0633 25.22    16   0.0661  -6.78
##  sg      3.36   1    0.0667 28.58    17   0.0386  -5.42
##  dbp     4.98   1    0.0257 33.55    18   0.0143  -2.45
## 
## Approximate Estimates after Deleting Factors
## 
##               Coef    S.E. Wald Z         P
## Intercept -3.68667 2.00132 -1.842 0.0654580
## sz        -0.04316 0.01605 -2.689 0.0071632
## ap        -0.41027 0.11823 -3.470 0.0005202
## age        0.05934 0.02809  2.112 0.0346497
## hx         0.82995 0.36398  2.280 0.0225953
## 
## Factors in Final Model
## 
## [1] sz  ap  age hx

In the final step the following covariates are retained: sz, ap, age and hx.
Let’s estimate the reduced model:

fred <- lrm(cvd ~ sz + log(ap) + age + hx , data =prostate_rid)
print(fred)
## Frequencies of Missing Values Due to Each Variable
## cvd  sz  ap age  hx 
##   0   2   0   0   0 
## 
## Logistic Regression Model
## 
## lrm(formula = cvd ~ sz + log(ap) + age + hx, data = prostate_rid)
## 
## 
##                        Model Likelihood      Discrimination    Rank Discrim.    
##                              Ratio Test             Indexes          Indexes    
## Obs           255    LR chi2     100.81      R2       0.435    C       0.840    
##  0            129    d.f.             4      R2(4,255)0.316    Dxy     0.679    
##  1            126    Pr(> chi2) <0.0001    R2(4,191.2)0.397    gamma   0.679    
## max |deriv| 3e-08                            Brier    0.163    tau-a   0.341    
## 
##           Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept -4.9339 1.6846 -2.93  0.0034  
## sz        -0.0543 0.0136 -3.99  <0.0001 
## ap        -0.5048 0.1044 -4.83  <0.0001 
## age        0.0776 0.0235  3.30  0.0010  
## hx         1.0713 0.3101  3.45  0.0006
summary(fred)
##              Effects              Response : cvd 
## 
##  Factor      Low      High Diff. Effect   S.E.    Lower 0.95 Upper 0.95
##  sz           6.00000 24.5 18.5  -1.00520 0.25199 -1.49910   -0.51135  
##   Odds Ratio  6.00000 24.5 18.5   0.36595      NA  0.22332    0.59968  
##  ap           0.59998  7.0  6.4  -1.24020 0.25656 -1.74310   -0.73737  
##   Odds Ratio  0.59998  7.0  6.4   0.28932      NA  0.17498    0.47837  
##  age         70.00000 76.0  6.0   0.46531 0.14111  0.18874    0.74188  
##   Odds Ratio 70.00000 76.0  6.0   1.59250      NA  1.20770    2.09990  
##  hx           0.00000  1.0  1.0   1.07130 0.31007  0.46355    1.67900  
##   Odds Ratio  0.00000  1.0  1.0   2.91910      NA  1.58970    5.36010

Again, we want to evaluate the performance of the reduced model:

s.red <- fred$stats  
round(s.red, digits=3) 
##         Obs   Max Deriv  Model L.R.        d.f.           P           C 
##     255.000       0.000     100.810       4.000       0.000       0.840 
##         Dxy       Gamma       Tau-a          R2     R2(255)   R2(4,255) 
##       0.679       0.679       0.341       0.435       0.327       0.316 
##   R2(191.2) R2(4,191.2)       Brier           g          gr          gp 
##       0.410       0.397       0.163       1.917       6.800       0.340
gamma.hat.red <-  (s.red['Model L.R.'] - s.red['d.f.'])/s.red['Model L.R.']
gamma.hat.red
## Model L.R. 
##  0.9603214

The van Houwelingen-Le Cessie heuristic shrinkage estimate is now 0.96, indicating that this model will validate on new data about 4% worse than on this dataset (simpler than the full model, less overfitting).

Let us now use a bootstrap approach to further evaluate calibration and discrimination of this reduced model:

fred <- update(fred, x=TRUE , y=TRUE)
vred <- validate (fred, B=200)
vred
##           index.orig training   test optimism index.corrected   n
## Dxy           0.6792   0.6854 0.6695   0.0159          0.6633 200
## R2            0.4354   0.4468 0.4254   0.0214          0.4140 200
## Intercept     0.0000   0.0000 0.0031  -0.0031          0.0031 200
## Slope         1.0000   1.0000 0.9560   0.0440          0.9560 200
## Emax          0.0000   0.0000 0.0105   0.0105          0.0105 200
## D             0.3914   0.4055 0.3804   0.0251          0.3663 200
## U            -0.0078  -0.0078 0.0014  -0.0093          0.0014 200
## Q             0.3993   0.4133 0.3789   0.0344          0.3649 200
## B             0.1630   0.1596 0.1667  -0.0072          0.1702 200
## g             1.9169   1.9832 1.8735   0.1097          1.8073 200
## gp            0.3396   0.3420 0.3352   0.0068          0.3328 200

The slope shrinkage as already seen is around 4%. There is a sligth drop-off in all indexes. The estimated likely future predictive discrimination of the model as measured by Somers’ Dxy change from 0.68 to 0.66. The latter estimate is the one that should be claimed when describing model performance.

A nearly unbiased estimate of future calibration of the stepwise-derived model is given below:

fred <- update(fred, x=TRUE , y=TRUE)
cal  <- calibrate(fred, B=200)
plot(cal)

## 
## n=255   Mean absolute error=0.012   Mean squared error=0.00019
## 0.9 Quantile of absolute error=0.022

Bootstrap overfitting-corrected calibration curve estimate for the backwards step-down cause of death logistic model, along with a rug plot showing the distribution of predicted risks. The smooth nonparametric calibration estimator (loess) is used.The estimated mean absolute error is 0.013 between estimated probabilities and observed events: quite good!

For comparison, let consider now a bootstrap validation of the full model without using any variable selection:

fit.multi <- update(fit.multi, x=TRUE , y=TRUE)
vfull     <- validate (fit.multi, B=200)
vfull
##           index.orig training   test optimism index.corrected   n
## Dxy           0.8113   0.8594 0.7646   0.0949          0.7164 200
## R2            0.6110   0.6812 0.5372   0.1440          0.4670 200
## Intercept     0.0000   0.0000 0.0098  -0.0098          0.0098 200
## Slope         1.0000   1.0000 0.6633   0.3367          0.6633 200
## Emax          0.0000   0.0000 0.0925   0.0925          0.0925 200
## D             0.6088   0.7128 0.5122   0.2006          0.4082 200
## U            -0.0082  -0.0082 0.0649  -0.0731          0.0649 200
## Q             0.6170   0.7210 0.4473   0.2737          0.3433 200
## B             0.1241   0.1043 0.1416  -0.0374          0.1615 200
## g             3.0079   3.9797 2.5893   1.3904          1.6175 200
## gp            0.4077   0.4293 0.3796   0.0498          0.3579 200

Compared to the validation of the full model, the step-down model has much less optimism, as expected, even if it has a smaller Dxy (discrimination) due to loss of information from removing moderately important variables. Let’s now explore calibration of the full model:

cal.full <- calibrate(fit.multi, B=200)
plot(cal.full)

## 
## n=245   Mean absolute error=0.043   Mean squared error=0.00223
## 0.9 Quantile of absolute error=0.067

The estimated mean absolute error is 0.04 between estimated probabilities and observed events, the bias-corrected performance is worse than the reduced model.

We could also make a test to compare the two models in terms of AUC (i.e the discrimination power): note that in the reduced model we have less missing values (in any case the % of missing values in this dataset is quite irrelevant). Let’s extract the probabilities predicted from the models:

full.pred  <- predict(fit.multi, type="fitted") 
rid.pred   <- predict(fred, type="fitted") 
dati.pp    <- data.frame(prostate_rid, full.pred, rid.pred)

And perform the test:

roc.test(response = dati.pp$cvd, predictor1= dati.pp$full.pred, predictor2 = dati.pp$rid.pred)
## 
##  DeLong's test for two correlated ROC curves
## 
## data:  dati.pp$full.pred and dati.pp$rid.pred by dati.pp$cvd (0, 1)
## Z = 3.4561, p-value = 0.0005481
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
##  0.02628403 0.09515026
## sample estimates:
## AUC of roc1 AUC of roc2 
##   0.9056252   0.8449080

We observe as expected significant better discrimination of the full model, but remember also that
this model is more prone to overfitting (the bootstrap-corrected calibration is worse…) and that these measures are derived from the “training-all” dataset so they need also to be rescaled in new data.

roc.full  <- roc(dati.pp$cvd, dati.pp$full.pred)
roc.red   <- roc(dati.pp$cvd, dati.pp$rid.pred)
plot(roc.full)
plot(roc.red, add=TRUE, col="red")
legend("bottomright",legend=c("Full model", "Reduced model"), lty=c(1,1), col=c("black", "red"), bty="n")
title("Discrimination on the development sample")

Another threshold to selecting candidate predictors

We can try to be less conservative in estimating a reduced model, in order to increase the discrimination, using a different significance level for a variable to stay in the model and using individual approximate Wald tests rather than tests combining all deleted variables:

vrid2 <- validate(fit.multi, bw=TRUE , sls =0.5, type = 'individual' , B=200)
## 
##      Backwards Step-down - Original Model
## 
##  Deleted Chi-Sq d.f. P      Residual d.f. P      AIC   
##  ekg_rec 3.60   4    0.4628 3.60     4    0.4628  -4.40
##  bm      0.01   1    0.9242 3.61     5    0.6069  -6.39
##  sbp     0.08   1    0.7830 3.69     6    0.7192  -8.31
##  hg      0.30   1    0.5827 3.99     7    0.7813 -10.01
##  wt      0.92   1    0.3380 4.91     8    0.7677 -11.09
## 
## Approximate Estimates after Deleting Factors
## 
##                             Coef    S.E.  Wald Z         P
## Intercept               -5.12176 3.12777 -1.6375 0.1015240
## sz                      -0.05646 0.01689 -3.3422 0.0008313
## sg                      -0.33179 0.12027 -2.7587 0.0058026
## ap                      -0.34171 0.13370 -2.5558 0.0105945
## dbp                      0.35870 0.14429  2.4859 0.0129210
## age                      0.09340 0.03045  3.0677 0.0021573
## pf_rec=bed_under_50      1.28840 0.78205  1.6475 0.0994630
## pf_rec=bed_over_50_conf -2.31951 1.21744 -1.9052 0.0567492
## hx                       1.27821 0.39842  3.2082 0.0013358
## rx=0.2 mg estrogen      -0.28720 0.50331 -0.5706 0.5682597
## rx=1.0 mg estrogen       0.94249 0.54264  1.7369 0.0824130
## rx=5.0 mg estrogen       0.85665 0.50700  1.6896 0.0910964
## dtime                   -0.11929 0.06723 -1.7742 0.0760268
## dtime'                   0.67239 0.29006  2.3181 0.0204412
## dtime''                 -1.78597 0.70664 -2.5274 0.0114911
## 
## Factors in Final Model
## 
## [1] sz     sg     ap     dbp    age    pf_rec hx     rx     dtime

Now let’s estimate a second reduced model:

fred2 <- lrm(cvd ~ sz + sg+ log(ap) + dbp + age + hx +pf_rec+rx +rcs(dtime,4) , data =prostate_rid)
fred2
## Frequencies of Missing Values Due to Each Variable
##    cvd     sz     sg     ap    dbp    age     hx pf_rec     rx  dtime 
##      0      2      5      0      0      0      0      0      0      0 
## 
## Logistic Regression Model
## 
## lrm(formula = cvd ~ sz + sg + log(ap) + dbp + age + hx + pf_rec + 
##     rx + rcs(dtime, 4), data = prostate_rid)
## 
## 
##                        Model Likelihood       Discrimination    Rank Discrim.    
##                              Ratio Test              Indexes          Indexes    
## Obs           250    LR chi2     145.18       R2       0.587    C       0.897    
##  0            125    d.f.            14      R2(14,250)0.408    Dxy     0.794    
##  1            125    Pr(> chi2) <0.0001    R2(14,187.5)0.503    gamma   0.794    
## max |deriv| 7e-10                             Brier    0.131    tau-a   0.398    
## 
##                         Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept               -4.7269 3.0077 -1.57  0.1160  
## sz                      -0.0614 0.0165 -3.71  0.0002  
## sg                      -0.3203 0.1167 -2.74  0.0061  
## ap                      -0.3365 0.1311 -2.57  0.0103  
## dbp                      0.3416 0.1381  2.47  0.0134  
## age                      0.0878 0.0290  3.02  0.0025  
## hx                       1.3681 0.3838  3.56  0.0004  
## pf_rec=bed_under_50      1.2721 0.7466  1.70  0.0884  
## pf_rec=bed_over_50_conf -2.3608 1.2607 -1.87  0.0611  
## rx=0.2 mg estrogen      -0.2781 0.4865 -0.57  0.5676  
## rx=1.0 mg estrogen       0.9088 0.5227  1.74  0.0821  
## rx=5.0 mg estrogen       0.9254 0.4880  1.90  0.0579  
## dtime                   -0.1070 0.0644 -1.66  0.0966  
## dtime'                   0.5910 0.2755  2.15  0.0319  
## dtime''                 -1.5783 0.6699 -2.36  0.0185
print(fred2)
## Frequencies of Missing Values Due to Each Variable
##    cvd     sz     sg     ap    dbp    age     hx pf_rec     rx  dtime 
##      0      2      5      0      0      0      0      0      0      0 
## 
## Logistic Regression Model
## 
## lrm(formula = cvd ~ sz + sg + log(ap) + dbp + age + hx + pf_rec + 
##     rx + rcs(dtime, 4), data = prostate_rid)
## 
## 
##                        Model Likelihood       Discrimination    Rank Discrim.    
##                              Ratio Test              Indexes          Indexes    
## Obs           250    LR chi2     145.18       R2       0.587    C       0.897    
##  0            125    d.f.            14      R2(14,250)0.408    Dxy     0.794    
##  1            125    Pr(> chi2) <0.0001    R2(14,187.5)0.503    gamma   0.794    
## max |deriv| 7e-10                             Brier    0.131    tau-a   0.398    
## 
##                         Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept               -4.7269 3.0077 -1.57  0.1160  
## sz                      -0.0614 0.0165 -3.71  0.0002  
## sg                      -0.3203 0.1167 -2.74  0.0061  
## ap                      -0.3365 0.1311 -2.57  0.0103  
## dbp                      0.3416 0.1381  2.47  0.0134  
## age                      0.0878 0.0290  3.02  0.0025  
## hx                       1.3681 0.3838  3.56  0.0004  
## pf_rec=bed_under_50      1.2721 0.7466  1.70  0.0884  
## pf_rec=bed_over_50_conf -2.3608 1.2607 -1.87  0.0611  
## rx=0.2 mg estrogen      -0.2781 0.4865 -0.57  0.5676  
## rx=1.0 mg estrogen       0.9088 0.5227  1.74  0.0821  
## rx=5.0 mg estrogen       0.9254 0.4880  1.90  0.0579  
## dtime                   -0.1070 0.0644 -1.66  0.0966  
## dtime'                   0.5910 0.2755  2.15  0.0319  
## dtime''                 -1.5783 0.6699 -2.36  0.0185
summary(fred2)
##              Effects              Response : cvd 
## 
##  Factor                           Low      High Diff. Effect    S.E.   
##  sz                                6.00000 24.5 18.5  -1.135100 0.30581
##   Odds Ratio                       6.00000 24.5 18.5   0.321390      NA
##  sg                                9.00000 12.0  3.0  -0.960990 0.35016
##   Odds Ratio                       9.00000 12.0  3.0   0.382510      NA
##  ap                                0.59998  7.0  6.4  -0.826700 0.32204
##   Odds Ratio                       0.59998  7.0  6.4   0.437490      NA
##  dbp                               7.00000  9.0  2.0   0.683210 0.27626
##   Odds Ratio                       7.00000  9.0  2.0   1.980200      NA
##  age                              70.00000 76.0  6.0   0.526990 0.17427
##   Odds Ratio                      70.00000 76.0  6.0   1.693800      NA
##  hx                                0.00000  1.0  1.0   1.368100 0.38383
##   Odds Ratio                       0.00000  1.0  1.0   3.928100      NA
##  dtime                            11.00000 37.0 26.0   0.924720 0.51670
##   Odds Ratio                      11.00000 37.0 26.0   2.521200      NA
##  pf_rec - bed_under_50:normal      1.00000  2.0   NA   1.272100 0.74664
##   Odds Ratio                       1.00000  2.0   NA   3.568500      NA
##  pf_rec - bed_over_50_conf:normal  1.00000  3.0   NA  -2.360800 1.26070
##   Odds Ratio                       1.00000  3.0   NA   0.094345      NA
##  rx - 0.2 mg estrogen:placebo      1.00000  2.0   NA  -0.278060 0.48652
##   Odds Ratio                       1.00000  2.0   NA   0.757250      NA
##  rx - 1.0 mg estrogen:placebo      1.00000  3.0   NA   0.908770 0.52267
##   Odds Ratio                       1.00000  3.0   NA   2.481300      NA
##  rx - 5.0 mg estrogen:placebo      1.00000  4.0   NA   0.925360 0.48803
##   Odds Ratio                       1.00000  4.0   NA   2.522800      NA
##  Lower 0.95 Upper 0.95
##  -1.7345000 -0.53573  
##   0.1764900  0.58524  
##  -1.6473000 -0.27468  
##   0.1925700  0.75981  
##  -1.4579000 -0.19552  
##   0.2327300  0.82241  
##   0.1417600  1.22470  
##   1.1523000  3.40300  
##   0.1854300  0.86855  
##   1.2037000  2.38350  
##   0.6158600  2.12040  
##   1.8513000  8.33470  
##  -0.0879970  1.93740  
##   0.9157600  6.94090  
##  -0.1912400  2.73550  
##   0.8259400 15.41800  
##  -4.8317000  0.11013  
##   0.0079727  1.11640  
##  -1.2316000  0.67550  
##   0.2918200  1.96500  
##  -0.1156500  1.93320  
##   0.8907900  6.91150  
##  -0.0311690  1.88190  
##   0.9693100  6.56590

Give a look to the performance of this alternative reduced model:

fred2 <- update(fred2, x=TRUE , y=TRUE)
vred2 <- validate (fred2, B=200)
vred2
##           index.orig training   test optimism index.corrected   n
## Dxy           0.7937   0.8189 0.7621   0.0568          0.7369 200
## R2            0.5873   0.6268 0.5381   0.0887          0.4986 200
## Intercept     0.0000   0.0000 0.0064  -0.0064          0.0064 200
## Slope         1.0000   1.0000 0.7792   0.2208          0.7792 200
## Emax          0.0000   0.0000 0.0566   0.0566          0.0566 200
## D             0.5767   0.6324 0.5131   0.1193          0.4574 200
## U            -0.0080  -0.0080 0.0224  -0.0304          0.0224 200
## Q             0.5847   0.6404 0.4906   0.1497          0.4350 200
## B             0.1315   0.1206 0.1426  -0.0220          0.1535 200
## g             2.8350   3.3964 2.6249   0.7715          2.0635 200
## gp            0.3990   0.4106 0.3800   0.0305          0.3685 200

The performance statistics are now midway between the full model and the smaller stepwise model.

cal.red2 <- calibrate(fred2, B=200)
plot(cal.red2)

## 
## n=250   Mean absolute error=0.029   Mean squared error=0.00094
## 0.9 Quantile of absolute error=0.038

The mean absolute error is 0.03.

Let’s compare the AUCs :

rid2.pred    <- predict(fred2, type="fitted") 
dati.pp2     <- data.frame(dati.pp,rid2.pred)
roc.test(response = dati.pp2$cvd, predictor1= dati.pp2$full.pred, predictor2 = dati.pp2$rid2.pred)
## 
##  DeLong's test for two correlated ROC curves
## 
## data:  dati.pp2$full.pred and dati.pp2$rid2.pred by dati.pp2$cvd (0, 1)
## Z = 1.1526, p-value = 0.2491
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
##  -0.004248807  0.016378906
## sample estimates:
## AUC of roc1 AUC of roc2 
##   0.9056252   0.8995601

We have a comparable discrimination power on the development sample with respect to the full model, but with a better calibration.

roc.full  <- roc(dati.pp2$cvd, dati.pp2$full.pred)
roc.red2  <- roc(dati.pp2$cvd, dati.pp2$rid2.pred)
plot(roc.full)
plot(roc.red2, add=TRUE, col="red")
legend("bottomright",legend=c("Full model", "Reduced 2 model"), lty=c(1,1), col=c("black", "red"), bty="n")
title("Discrimination on the development sample")

Final decision !

Finally we decide to present our estimated prediction model to the clinicians. Let’s prepare a visualization that could be helpful to discuss with them; we rename some levels of the categorical variables:

prostate_rid$hxf              <- as.factor(prostate_rid$hx)
levels(prostate_rid$hxf)      <- c("No", "Yes")
label(prostate_rid$hxf)       <- "History of CV disease"


options(datadist='dd')
dd <-datadist(prostate_rid)
dd
##                     patno stage              rx dtime               status
## Low:effect      131.00000     3            <NA>    11                 <NA>
## Adjust to       265.00000     3         placebo    24  dead - prostatic ca
## High:effect     395.00000     4            <NA>    37                 <NA>
## Low:prediction   24.96109     3         placebo     1                alive
## High:prediction 482.03891     4 5.0 mg estrogen    58 dead - unknown cause
## Low               3.00000     3         placebo     0                alive
## High            506.00000     4 5.0 mg estrogen    71 dead - unknown cause
##                      age       wt              pf hx sbp dbp          ekg
## Low:effect      70.00000  89.0000            <NA>  0  13   7         <NA>
## Adjust to       73.00000  97.0000 normal activity  0  14   8 heart strain
## High:effect     76.00000 106.0000            <NA>  1  16   9         <NA>
## Low:prediction  55.96109  76.0000 normal activity  0  11   6       normal
## High:prediction 80.00000 125.0389 confined to bed  1  19  10    recent MI
## Low             48.00000  71.0000 normal activity  0   8   4       normal
## High            88.00000 152.0000 confined to bed  1  24  13    recent MI
##                        hg       sz sg           ap bm ap80 ekg_rec
## Low:effect      12.000000  6.00000  9   0.59997559  0    0    <NA>
## Adjust to       13.599609 14.00000 11   1.09985352  0    0    bngn
## High:effect     14.599609 24.50000 12   7.00000000  1    1    <NA>
## Low:prediction   9.298828  2.00000  8   0.19998169  0    0    bngn
## High:prediction 16.503800 46.11673 14  54.97592412  1    1      MI
## Low              5.899414  0.00000  5   0.09999084  0    0    bngn
## High            21.199219 69.00000 15 999.87500000  1    1      MI
##                           pf_rec           status_num subset cvd      log_ap
## Low:effect                  <NA>                 <NA>      1   0 -0.51086631
## Adjust to                 normal  dead - prostatic ca      1   0  0.09517701
## High:effect                 <NA>                 <NA>      1   1  1.94591015
## Low:prediction            normal                alive      1   0 -1.60952947
## High:prediction bed_over_50_conf dead - unknown cause      1   1  4.00677071
## Low                       normal                alive      1   0 -2.30267665
## High            bed_over_50_conf dead - unknown cause      1   1  6.90763027
##                  hxf
## Low:effect      <NA>
## Adjust to        Yes
## High:effect     <NA>
## Low:prediction    No
## High:prediction  Yes
## Low               No
## High             Yes
## 
## Values:
## 
## stage : 3 4 
## rx : placebo 0.2 mg estrogen 1.0 mg estrogen 5.0 mg estrogen 
## status :
##  [1] alive                        dead - prostatic ca         
##  [3] dead - heart or vascular     dead - cerebrovascular      
##  [5] dead - pulmonary embolus     dead - other ca             
##  [7] dead - respiratory disease   dead - other specific non-ca
##  [9] dead - unspecified non-ca    dead - unknown cause        
## pf : normal activity in bed < 50% daytime in bed > 50% daytime confined to bed 
## hx : 0 1 
## dbp :  4  5  6  7  8  9 10 11 12 13 
## ekg :
## [1] normal                            benign                           
## [3] rhythmic disturb & electrolyte ch heart block or conduction def    
## [5] heart strain                      old MI                           
## [7] recent MI                        
## bm : 0 1 
## ap80 : 0 1 
## ekg_rec : bngn rd&ec hbocd hrts MI 
## pf_rec : normal bed_under_50 bed_over_50_conf 
## status_num :
##  [1] alive                        dead - prostatic ca         
##  [3] dead - heart or vascular     dead - cerebrovascular      
##  [5] dead - pulmonary embolus     dead - other ca             
##  [7] dead - respiratory disease   dead - other specific non-ca
##  [9] dead - unspecified non-ca    dead - unknown cause        
## subset : 1 
## cvd : 0 1 
## hxf : No Yes

We estimate the final model selected:

fredfin <- lrm(cvd ~ sz + sg+ log(ap) + dbp+ age + hxf +pf_rec+rx+rcs(dtime,4), data =prostate_rid)
fredfin
## Frequencies of Missing Values Due to Each Variable
##    cvd     sz     sg     ap    dbp    age    hxf pf_rec     rx  dtime 
##      0      2      5      0      0      0      0      0      0      0 
## 
## Logistic Regression Model
## 
## lrm(formula = cvd ~ sz + sg + log(ap) + dbp + age + hxf + pf_rec + 
##     rx + rcs(dtime, 4), data = prostate_rid)
## 
## 
##                        Model Likelihood       Discrimination    Rank Discrim.    
##                              Ratio Test              Indexes          Indexes    
## Obs           250    LR chi2     145.18       R2       0.587    C       0.897    
##  0            125    d.f.            14      R2(14,250)0.408    Dxy     0.794    
##  1            125    Pr(> chi2) <0.0001    R2(14,187.5)0.503    gamma   0.794    
## max |deriv| 7e-10                             Brier    0.131    tau-a   0.398    
## 
##                         Coef    S.E.   Wald Z Pr(>|Z|)
## Intercept               -4.7269 3.0077 -1.57  0.1160  
## sz                      -0.0614 0.0165 -3.71  0.0002  
## sg                      -0.3203 0.1167 -2.74  0.0061  
## ap                      -0.3365 0.1311 -2.57  0.0103  
## dbp                      0.3416 0.1381  2.47  0.0134  
## age                      0.0878 0.0290  3.02  0.0025  
## hxf=Yes                  1.3681 0.3838  3.56  0.0004  
## pf_rec=bed_under_50      1.2721 0.7466  1.70  0.0884  
## pf_rec=bed_over_50_conf -2.3608 1.2607 -1.87  0.0611  
## rx=0.2 mg estrogen      -0.2781 0.4865 -0.57  0.5676  
## rx=1.0 mg estrogen       0.9088 0.5227  1.74  0.0821  
## rx=5.0 mg estrogen       0.9254 0.4880  1.90  0.0579  
## dtime                   -0.1070 0.0644 -1.66  0.0966  
## dtime'                   0.5910 0.2755  2.15  0.0319  
## dtime''                 -1.5783 0.6699 -2.36  0.0185
nom <-  nomogram (fredfin , ap=c(.1 , .5 , 1, 5, 10, 20, 30, 40),
                fun=plogis , funlabel ="Probability ",
                lp=TRUE, 
                fun.at =c(.01,.1,.25,.5,.75,.95))
nom
## Points per unit of linear predictor: 21.51285 
## Linear predictor units per point   : 0.04648384 
## 
## 
##  sz Points
##   0 92    
##   5 86    
##  10 79    
##  15 73    
##  20 66    
##  25 59    
##  30 53    
##  35 46    
##  40 40    
##  45 33    
##  50 26    
##  55 20    
##  60 13    
##  65  7    
##  70  0    
## 
## 
##  sg Points
##   5 69    
##   6 62    
##   7 55    
##   8 48    
##   9 41    
##  10 34    
##  11 28    
##  12 21    
##  13 14    
##  14  7    
##  15  0    
## 
## 
##  ap   Points
##   0.1 43    
##   0.5 32    
##   1.0 27    
##   5.0 15    
##  10.0 10    
##  20.0  5    
##  30.0  2    
##  40.0  0    
## 
## 
##  dbp Points
##   4   0    
##   5   7    
##   6  15    
##   7  22    
##   8  29    
##   9  37    
##  10  44    
##  11  51    
##  12  59    
##  13  66    
## 
## 
##  age Points
##  45   0    
##  50   9    
##  55  19    
##  60  28    
##  65  38    
##  70  47    
##  75  57    
##  80  66    
##  85  76    
##  90  85    
## 
## 
##  History of CV disease Points
##  No                     0    
##  Yes                   29    
## 
## 
##  pf_rec           Points
##  normal           51    
##  bed_under_50     78    
##  bed_over_50_conf  0    
## 
## 
##  rx              Points
##  placebo          6    
##  0.2 mg estrogen  0    
##  1.0 mg estrogen 26    
##  5.0 mg estrogen 26    
## 
## 
##  Months of Follow-up Points
##   0                  100   
##  10                   79   
##  20                   79   
##  30                   96   
##  40                   95   
##  50                   77   
##  60                   52   
##  70                   26   
##  80                    0   
## 
## 
##  Total Points Probability 
##           268         0.01
##           319         0.10
##           343         0.25
##           367         0.50
##           390         0.75
##           430         0.95

This is a useful tool to visualize a prediction model and discuss it with clinicians:

plot(nom,cex.axis=.75)