Load the required libraries:

library(knitr)
library(tidyverse)
library(lubridate)
library(survival)
library(survminer)
library(gtsummary)
library(ISwR)
library(KMsurv)
library(pmsampsize)

The Cox regression model

We may want to quantify an effect size for a single variable, or include more than one variable into a regression model to account for the effects of multiple variables on the hazard function.

The Cox regression model is a semi-parametric model that can be used to fit univariable and multivariable regression models that have survival outcomes.

\[h(t|X_i) = h_0(t) \exp(\beta_1 X_{i1} + \cdots + \beta_p X_{ip})\]

\(h(t)\): hazard, or the instantaneous rate at which events occur \(h_0(t)\): underlying baseline hazard

Some key assumptions of the model:

Note: parametric regression models for survival outcomes are also available, but they won’t be addressed in this course.

We can fit regression models for survival data using the coxph function, which takes a Surv object on the left hand side and has standard syntax for regression formulas in R on the right hand side.

The lung dataset

The lung dataset contain subjects with advanced lung cancer from the North Central Cancer Treatment Group. Some variables we will use to demonstrate methods include:

data("cancer", package="survival")
str(lung)
## 'data.frame':    228 obs. of  10 variables:
##  $ inst     : num  3 3 3 5 1 12 7 11 1 7 ...
##  $ time     : num  306 455 1010 210 883 ...
##  $ status   : num  2 2 1 2 2 1 2 2 2 2 ...
##  $ age      : num  74 68 56 57 60 74 68 71 53 61 ...
##  $ sex      : num  1 1 1 1 1 1 2 2 1 1 ...
##  $ ph.ecog  : num  1 0 0 1 0 1 2 2 1 2 ...
##  $ ph.karno : num  90 90 90 90 100 50 70 60 70 70 ...
##  $ pat.karno: num  100 90 90 60 90 80 60 80 80 70 ...
##  $ meal.cal : num  1175 1225 NA 1150 NA ...
##  $ wt.loss  : num  NA 15 15 11 0 0 10 1 16 34 ...
head(lung)
##   inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1    3  306      2  74   1       1       90       100     1175      NA
## 2    3  455      2  68   1       0       90        90     1225      15
## 3    3 1010      1  56   1       0       90        90       NA      15
## 4    5  210      2  57   1       1       90        60     1150      11
## 5    1  883      2  60   1       0      100        90       NA       0
## 6   12 1022      1  74   1       1       50        80      513       0
coxph(Surv(time, status) ~ sex, data = lung)
## Call:
## coxph(formula = Surv(time, status) ~ sex, data = lung)
## 
##        coef exp(coef) se(coef)      z       p
## sex -0.5310    0.5880   0.1672 -3.176 0.00149
## 
## Likelihood ratio test=10.63  on 1 df, p=0.001111
## n= 228, number of events= 165

Formatting Cox regression results

We can see a tidy version of the output using the tidy function from the broom package:

broom::tidy(
  coxph(Surv(time, status) ~ sex, data = lung), 
  exp = TRUE
  ) %>% 
  kable()
term estimate std.error statistic p.value
sex 0.5880028 0.1671786 -3.176385 0.0014912

Or use tbl_regression from the gtsummary package

coxph(Surv(time, status) ~ sex, data = lung) %>% 
  gtsummary::tbl_regression(exp = TRUE) 
Characteristic HR1 95% CI1 p-value
sex 0.59 0.42, 0.82 0.001
1 HR = Hazard Ratio, CI = Confidence Interval

Hazard ratios

Here is a plot of the cumulative hazard functions by sex:

Assessing proportional hazards (I)

One assumption of the Cox proportional hazards regression model is that the hazards are proportional at each point in time throughout follow-up. How can we check to see if our data meet this assumption?

Use the cox.zph function from the survival package. It results in two main things:

  1. A hypothesis test of whether the effect of each covariate differs according to time, and a global test of all covariates at once.
    • This is done by testing for an interaction effect between the covariate and log(time)
    • A significant p-value indicates that the proportional hazards assumption is violated
  2. Plots of the Schoenfeld residuals
    • Deviation from a zero-slope line is evidence that the proportional hazards assumption is violated
mv_fit <- coxph(Surv(time, status) ~ sex + age, data = lung)
cz <- cox.zph(mv_fit)
print(cz)
##        chisq df    p
## sex    2.608  1 0.11
## age    0.209  1 0.65
## GLOBAL 2.771  2 0.25
plot(cz)

In this example, the PH assumption could be considered valid for both covariates.

Another example of Cox PH Regression

Now, on the melanoma dataset, consider a Cox model with the single regressor sex:

Loading and viewing dataset melanom:

data("melanom", package="ISwR")
str(melanom)
## 'data.frame':    205 obs. of  6 variables:
##  $ no    : int  789 13 97 16 21 469 685 7 932 944 ...
##  $ status: int  3 3 2 3 1 1 1 1 3 1 ...
##  $ days  : int  10 30 35 99 185 204 210 232 232 279 ...
##  $ ulc   : int  1 2 2 2 1 1 1 1 1 1 ...
##  $ thick : int  676 65 134 290 1208 484 516 1288 322 741 ...
##  $ sex   : int  2 2 2 1 2 2 2 2 1 1 ...
head(melanom)
##    no status days ulc thick sex
## 1 789      3   10   1   676   2
## 2  13      3   30   2    65   2
## 3  97      2   35   2   134   2
## 4  16      3   99   2   290   1
## 5  21      1  185   1  1208   2
## 6 469      1  204   1   484   2
attach(melanom)

Create the survival object:

mod.sex <- coxph(Surv(days,status==1)~sex)
summary(mod.sex)
## Call:
## coxph(formula = Surv(days, status == 1) ~ sex)
## 
##   n= 205, number of events= 57 
## 
##       coef exp(coef) se(coef)     z Pr(>|z|)  
## sex 0.6622    1.9390   0.2651 2.498   0.0125 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##     exp(coef) exp(-coef) lower .95 upper .95
## sex     1.939     0.5157     1.153      3.26
## 
## Concordance= 0.59  (se = 0.034 )
## Likelihood ratio test= 6.15  on 1 df,   p=0.01
## Wald test            = 6.24  on 1 df,   p=0.01
## Score (logrank) test = 6.47  on 1 df,   p=0.01

Males (=2) have an hazard of death for melanoma that is nearly twice than women (=1).

A more elaborate example, involving also a continuous covariate:

mod.cov <- coxph(Surv(days,status==1)~sex+log(thick))
summary(mod.cov)
## Call:
## coxph(formula = Surv(days, status == 1) ~ sex + log(thick))
## 
##   n= 205, number of events= 57 
## 
##              coef exp(coef) se(coef)     z Pr(>|z|)    
## sex        0.4580    1.5809   0.2687 1.705   0.0883 .  
## log(thick) 0.7809    2.1834   0.1573 4.963 6.94e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##            exp(coef) exp(-coef) lower .95 upper .95
## sex            1.581     0.6326    0.9337     2.677
## log(thick)     2.183     0.4580    1.6040     2.972
## 
## Concordance= 0.749  (se = 0.033 )
## Likelihood ratio test= 33.45  on 2 df,   p=5e-08
## Wald test            = 31  on 2 df,   p=2e-07
## Score (logrank) test = 32.52  on 2 df,   p=9e-08

Remember that ‘thick’ is the tumor thickness; we use logarithm since the distribution is asymmetric. Each 1 point change in log(thick) is associated with a 2.2-fold increase in a patient’s risk.

Verifying the PH Assumption (II)

Question: adjusting for log(thick), does the effect of gender follow a proportional hazards model? If the PH assumption holds, the log cumulative hazards for the two groups, adjusting for log(thick), should be roughly parallel:

fit1 <- coxph( Surv(days,status==1) ~ log(thick)+ strata(sex),data=melanom)
plot(survfit(fit1), lty=c(1,1), lwd=c(2,2), col=c("pink", "blue"), fun="cloglog",xlab="Days",ylab="Log-Cumulative Hazard Function", xlim=c(100, 5000) )
legend(100, -1, lty=c(1,1), lwd=c(2,2), col=c("pink", "blue"),legend=c("Females", "Males"),bty="n")

Conclusion: not strong evidence of non-proportional hazards. This is a good look at gross departures, but it is far from a formal test.

As before, let’s do a formal test to check PH assumption:

check.ph <- cox.zph(mod.cov, transform="km", global=TRUE)
check.ph
##            chisq df     p
## sex         1.33  1 0.248
## log(thick)  6.23  1 0.013
## GLOBAL      6.70  2 0.035

Visualizing the residual plot: if the PH assumption for log(thick) holds, then the plot of b(t) vs time should be on a horizontal line.In this case we note that for log(thick) the PH assumption is not valid, so we should apply some more advanced options (as for example introducing an interaction between the covariate and a function of time). We do not introduce here this option.

plot(check.ph[2], xlab="Days")
abline(h=0, lty=3,lwd=2)
abline(h=coef(mod.cov)[2], lty=3,col="red",lwd=2)

We can observe that instead the effect of log(thick) is gradually decreasing with time; red dashed line indicates Cox model’s estimate for the overall log thick effect.

A more elaborate example: binary factor + continuous covariate + stratification variable

mod.cov.strat <- coxph(Surv(days,status==1)~sex+log(thick)+strata(ulc))
summary(mod.cov.strat)
## Call:
## coxph(formula = Surv(days, status == 1) ~ sex + log(thick) + 
##     strata(ulc))
## 
##   n= 205, number of events= 57 
## 
##              coef exp(coef) se(coef)     z Pr(>|z|)   
## sex        0.3600    1.4333   0.2702 1.332   0.1828   
## log(thick) 0.5599    1.7505   0.1784 3.139   0.0017 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##            exp(coef) exp(-coef) lower .95 upper .95
## sex            1.433     0.6977     0.844     2.434
## log(thick)     1.750     0.5713     1.234     2.483
## 
## Concordance= 0.673  (se = 0.039 )
## Likelihood ratio test= 13.3  on 2 df,   p=0.001
## Wald test            = 12.88  on 2 df,   p=0.002
## Score (logrank) test = 12.98  on 2 df,   p=0.002

Plotting estimated survival curves in the strata:

plot(survfit(mod.cov.strat), col=c("red","black"),ylim=c(0.5,1), xlab="Days", ylab="Survival")
legend(3000,1, c("Ulcerated Tumor", "Not Ulcerated Tumor"), col=c("red","black"),bty="n", lty=c(1,1)) 

The default for survfit is to generate curves for a pseudoindividual for which the covariates are at their mean values. In the present case, that would correspond to a tumor thickness of 1.86 mm and a gender of 1.39 (!). We have been sloppy in not defining sex as a factor variable, but that would not actually give a different result : coxph subtracts the means of the regressors before fitting, so a 1/2 coding is the same as 0/1, which is what a factor with treatment contrasts usually gives:

sex.f <- as.factor(sex)
sex.f
##   [1] 2 2 2 1 2 2 2 2 1 1 1 1 1 2 1 2 2 2 2 2 1 2 2 2 2 1 1 1 1 1 1 2 2 1 2 1 2
##  [38] 2 1 2 1 1 1 2 2 2 2 2 1 1 2 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 1 2 1 2
##  [75] 1 1 1 2 2 1 1 1 2 1 1 2 1 2 2 1 1 1 2 2 2 1 1 1 1 1 1 2 1 2 1 1 2 1 1 2 2
## [112] 1 2 1 2 2 1 1 1 1 1 2 1 1 2 1 1 1 2 1 2 1 1 1 1 1 1 1 2 1 2 1 2 1 1 1 1 1
## [149] 2 2 2 1 1 1 1 2 2 2 1 2 1 2 1 1 1 1 2 1 2 2 1 1 1 1 2 2 1 1 1 2 1 1 1 1 1
## [186] 2 1 2 1 1 2 1 1 1 2 1 2 2 1 1 2 1 1 1 1
## Levels: 1 2
mod.cov.strat.f <- coxph(Surv(days,status==1)~sex.f+log(thick)+strata(ulc))
summary(mod.cov.strat.f)
## Call:
## coxph(formula = Surv(days, status == 1) ~ sex.f + log(thick) + 
##     strata(ulc))
## 
##   n= 205, number of events= 57 
## 
##              coef exp(coef) se(coef)     z Pr(>|z|)   
## sex.f2     0.3600    1.4333   0.2702 1.332   0.1828   
## log(thick) 0.5599    1.7505   0.1784 3.139   0.0017 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##            exp(coef) exp(-coef) lower .95 upper .95
## sex.f2         1.433     0.6977     0.844     2.434
## log(thick)     1.750     0.5713     1.234     2.483
## 
## Concordance= 0.673  (se = 0.039 )
## Likelihood ratio test= 13.3  on 2 df,   p=0.001
## Wald test            = 12.88  on 2 df,   p=0.002
## Score (logrank) test = 12.98  on 2 df,   p=0.002

In order to estimate survival curves for subjects with certain values of the covariates, we could use the option newdata in survfit:

par(mfrow=c(1,2))
plot(survfit(mod.cov.strat, newdata=data.frame(sex=2,thick=194)), xlab = "Days", ylab="Survival",col=c("red","black"),mark.time=F)
title("Male, tumor thickness=194")
legend(1000, .2, c("Ulcerated Tumor", "Not Ulcerated Tumor"), col=c("red","black"),bty="n", lty=c(1,1)) 
plot(survfit(mod.cov.strat, newdata=data.frame(sex=1,thick=194)), xlab = "Days", ylab="Survival",col=c("red","black"),mark.time=F)
title("Female, tumor thickness=194")
legend(1000, .2, c("Ulcerated Tumor", "Not Ulcerated Tumor"), col=c("red","black"),bty="n", lty=c(1,1)) 

par(mfrow=c(1,2))
plot(survfit(mod.cov.strat, newdata=data.frame(sex=2,thick=709)), xlab = "Days", ylab="Survival",col=c("red","black"),mark.time=F) 
title("Male, tumor thickness=709")
legend(1000, .2, c("Ulcerated Tumor", "Not Ulcerated Tumor"), col=c("red","black"),bty="n", lty=c(1,1)) 
plot(survfit(mod.cov.strat, newdata=data.frame(sex=1,thick=709)), xlab = "Days", ylab="Survival",col=c("red","black"),mark.time=F) 
title("Female, tumor thickness=709")
legend(1000, .2, c("Ulcerated Tumor", "Not Ulcerated Tumor"), col=c("red","black"),bty="n", lty=c(1,1)) 

Sample size for Survival outcomes

Use pmsampsize to calculate the minimum sample size required for developing a multivariable Cox prediction model with a survival outcome using 30 candidate predictors. We know an existing prediction model in the same field has an R-squared adjusted of 0.051. Further, in the previous study the mean follow-up was 2.07 years, and overall event rate was 0.065. We select a timepoint of interest for prediction using the newly developed model of 2 years.

pmsampsize(type = "s", rsquared = 0.051, parameters = 30, rate = 0.065,timepoint = 2, meanfup = 2.07)
## NB: Assuming 0.05 acceptable difference in apparent & adjusted R-squared 
## NB: Assuming 0.05 margin of error in estimation of overall risk at time point = 2  
## NB: Events per Predictor Parameter (EPP) assumes overall event rate = 0.065  
##  
##              Samp_size Shrinkage Parameter CS_Rsq Max_Rsq Nag_Rsq   EPP
## Criteria 1        5143     0.900        30  0.051   0.555   0.092 23.07
## Criteria 2        1039     0.648        30  0.051   0.555   0.092  4.66
## Criteria 3 *      5143     0.900        30  0.051   0.555   0.092 23.07
## Final SS          5143     0.900        30  0.051   0.555   0.092 23.07
##  
##  Minimum sample size required for new model development based on user inputs = 5143, 
##  corresponding to 10646 person-time** of follow-up, with 692 outcome events 
##  assuming an overall event rate = 0.065 and therefore an EPP = 23.07  
##  
##  * 95% CI for overall risk = (0.113, 0.13), for true value of 0.122 and sample size n = 5143 
##  **where time is in the units mean follow-up time was specified in

We need 5143 subjects in our study, according to the assumptions declared above.