Load the required libraries:
library(knitr)
library(tidyverse)
library(lubridate)
library(survival)
library(survminer)
library(gtsummary)
library(ISwR)
library(KMsurv)
library(pmsampsize)
We may want to quantify an effect size for a single variable, or include more than one variable into a regression model to account for the effects of multiple variables on the hazard function.
The Cox regression model is a semi-parametric model that can be used to fit univariable and multivariable regression models that have survival outcomes.
\[h(t|X_i) = h_0(t) \exp(\beta_1 X_{i1} + \cdots + \beta_p X_{ip})\]
\(h(t)\): hazard, or the instantaneous rate at which events occur \(h_0(t)\): underlying baseline hazard
Some key assumptions of the model:
Note: parametric regression models for survival outcomes are also available, but they won’t be addressed in this course.
We can fit regression models for survival data using the
coxph
function, which takes a Surv
object on
the left hand side and has standard syntax for regression formulas in
R
on the right hand side.
The lung
dataset contain subjects with advanced lung
cancer from the North Central Cancer Treatment Group. Some variables we
will use to demonstrate methods include:
data("cancer", package="survival")
str(lung)
## 'data.frame': 228 obs. of 10 variables:
## $ inst : num 3 3 3 5 1 12 7 11 1 7 ...
## $ time : num 306 455 1010 210 883 ...
## $ status : num 2 2 1 2 2 1 2 2 2 2 ...
## $ age : num 74 68 56 57 60 74 68 71 53 61 ...
## $ sex : num 1 1 1 1 1 1 2 2 1 1 ...
## $ ph.ecog : num 1 0 0 1 0 1 2 2 1 2 ...
## $ ph.karno : num 90 90 90 90 100 50 70 60 70 70 ...
## $ pat.karno: num 100 90 90 60 90 80 60 80 80 70 ...
## $ meal.cal : num 1175 1225 NA 1150 NA ...
## $ wt.loss : num NA 15 15 11 0 0 10 1 16 34 ...
head(lung)
## inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1 3 306 2 74 1 1 90 100 1175 NA
## 2 3 455 2 68 1 0 90 90 1225 15
## 3 3 1010 1 56 1 0 90 90 NA 15
## 4 5 210 2 57 1 1 90 60 1150 11
## 5 1 883 2 60 1 0 100 90 NA 0
## 6 12 1022 1 74 1 1 50 80 513 0
coxph(Surv(time, status) ~ sex, data = lung)
## Call:
## coxph(formula = Surv(time, status) ~ sex, data = lung)
##
## coef exp(coef) se(coef) z p
## sex -0.5310 0.5880 0.1672 -3.176 0.00149
##
## Likelihood ratio test=10.63 on 1 df, p=0.001111
## n= 228, number of events= 165
We can see a tidy version of the output using the tidy
function from the broom
package:
broom::tidy(
coxph(Surv(time, status) ~ sex, data = lung),
exp = TRUE
) %>%
kable()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
sex | 0.5880028 | 0.1671786 | -3.176385 | 0.0014912 |
Or use tbl_regression
from the gtsummary
package
coxph(Surv(time, status) ~ sex, data = lung) %>%
gtsummary::tbl_regression(exp = TRUE)
Characteristic | HR1 | 95% CI1 | p-value |
---|---|---|---|
sex | 0.59 | 0.42, 0.82 | 0.001 |
1 HR = Hazard Ratio, CI = Confidence Interval |
estimate
in our coxph
) then HR = \(\exp(\beta)\).Here is a plot of the cumulative hazard functions by sex:
One assumption of the Cox proportional hazards regression model is that the hazards are proportional at each point in time throughout follow-up. How can we check to see if our data meet this assumption?
Use the cox.zph
function from the survival package. It
results in two main things:
mv_fit <- coxph(Surv(time, status) ~ sex + age, data = lung)
cz <- cox.zph(mv_fit)
print(cz)
## chisq df p
## sex 2.608 1 0.11
## age 0.209 1 0.65
## GLOBAL 2.771 2 0.25
plot(cz)
In this example, the PH assumption could be considered valid for both covariates.
Now, on the melanoma dataset, consider a Cox model with the single regressor sex:
Loading and viewing dataset melanom:
data("melanom", package="ISwR")
str(melanom)
## 'data.frame': 205 obs. of 6 variables:
## $ no : int 789 13 97 16 21 469 685 7 932 944 ...
## $ status: int 3 3 2 3 1 1 1 1 3 1 ...
## $ days : int 10 30 35 99 185 204 210 232 232 279 ...
## $ ulc : int 1 2 2 2 1 1 1 1 1 1 ...
## $ thick : int 676 65 134 290 1208 484 516 1288 322 741 ...
## $ sex : int 2 2 2 1 2 2 2 2 1 1 ...
head(melanom)
## no status days ulc thick sex
## 1 789 3 10 1 676 2
## 2 13 3 30 2 65 2
## 3 97 2 35 2 134 2
## 4 16 3 99 2 290 1
## 5 21 1 185 1 1208 2
## 6 469 1 204 1 484 2
attach(melanom)
Create the survival object:
mod.sex <- coxph(Surv(days,status==1)~sex)
summary(mod.sex)
## Call:
## coxph(formula = Surv(days, status == 1) ~ sex)
##
## n= 205, number of events= 57
##
## coef exp(coef) se(coef) z Pr(>|z|)
## sex 0.6622 1.9390 0.2651 2.498 0.0125 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## sex 1.939 0.5157 1.153 3.26
##
## Concordance= 0.59 (se = 0.034 )
## Likelihood ratio test= 6.15 on 1 df, p=0.01
## Wald test = 6.24 on 1 df, p=0.01
## Score (logrank) test = 6.47 on 1 df, p=0.01
Males (=2) have an hazard of death for melanoma that is nearly twice than women (=1).
A more elaborate example, involving also a continuous covariate:
mod.cov <- coxph(Surv(days,status==1)~sex+log(thick))
summary(mod.cov)
## Call:
## coxph(formula = Surv(days, status == 1) ~ sex + log(thick))
##
## n= 205, number of events= 57
##
## coef exp(coef) se(coef) z Pr(>|z|)
## sex 0.4580 1.5809 0.2687 1.705 0.0883 .
## log(thick) 0.7809 2.1834 0.1573 4.963 6.94e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## sex 1.581 0.6326 0.9337 2.677
## log(thick) 2.183 0.4580 1.6040 2.972
##
## Concordance= 0.749 (se = 0.033 )
## Likelihood ratio test= 33.45 on 2 df, p=5e-08
## Wald test = 31 on 2 df, p=2e-07
## Score (logrank) test = 32.52 on 2 df, p=9e-08
Remember that ‘thick’ is the tumor thickness; we use logarithm since the distribution is asymmetric. Each 1 point change in log(thick) is associated with a 2.2-fold increase in a patient’s risk.
Question: adjusting for log(thick), does the effect of gender follow a proportional hazards model? If the PH assumption holds, the log cumulative hazards for the two groups, adjusting for log(thick), should be roughly parallel:
fit1 <- coxph( Surv(days,status==1) ~ log(thick)+ strata(sex),data=melanom)
plot(survfit(fit1), lty=c(1,1), lwd=c(2,2), col=c("pink", "blue"), fun="cloglog",xlab="Days",ylab="Log-Cumulative Hazard Function", xlim=c(100, 5000) )
legend(100, -1, lty=c(1,1), lwd=c(2,2), col=c("pink", "blue"),legend=c("Females", "Males"),bty="n")
Conclusion: not strong evidence of non-proportional hazards. This is a good look at gross departures, but it is far from a formal test.
As before, let’s do a formal test to check PH assumption:
check.ph <- cox.zph(mod.cov, transform="km", global=TRUE)
check.ph
## chisq df p
## sex 1.33 1 0.248
## log(thick) 6.23 1 0.013
## GLOBAL 6.70 2 0.035
Visualizing the residual plot: if the PH assumption for log(thick) holds, then the plot of b(t) vs time should be on a horizontal line.In this case we note that for log(thick) the PH assumption is not valid, so we should apply some more advanced options (as for example introducing an interaction between the covariate and a function of time). We do not introduce here this option.
plot(check.ph[2], xlab="Days")
abline(h=0, lty=3,lwd=2)
abline(h=coef(mod.cov)[2], lty=3,col="red",lwd=2)
We can observe that instead the effect of log(thick) is gradually decreasing with time; red dashed line indicates Cox model’s estimate for the overall log thick effect.
A more elaborate example: binary factor + continuous covariate + stratification variable
mod.cov.strat <- coxph(Surv(days,status==1)~sex+log(thick)+strata(ulc))
summary(mod.cov.strat)
## Call:
## coxph(formula = Surv(days, status == 1) ~ sex + log(thick) +
## strata(ulc))
##
## n= 205, number of events= 57
##
## coef exp(coef) se(coef) z Pr(>|z|)
## sex 0.3600 1.4333 0.2702 1.332 0.1828
## log(thick) 0.5599 1.7505 0.1784 3.139 0.0017 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## sex 1.433 0.6977 0.844 2.434
## log(thick) 1.750 0.5713 1.234 2.483
##
## Concordance= 0.673 (se = 0.039 )
## Likelihood ratio test= 13.3 on 2 df, p=0.001
## Wald test = 12.88 on 2 df, p=0.002
## Score (logrank) test = 12.98 on 2 df, p=0.002
Plotting estimated survival curves in the strata:
plot(survfit(mod.cov.strat), col=c("red","black"),ylim=c(0.5,1), xlab="Days", ylab="Survival")
legend(3000,1, c("Ulcerated Tumor", "Not Ulcerated Tumor"), col=c("red","black"),bty="n", lty=c(1,1))
The default for survfit is to generate curves for a pseudoindividual for which the covariates are at their mean values. In the present case, that would correspond to a tumor thickness of 1.86 mm and a gender of 1.39 (!). We have been sloppy in not defining sex as a factor variable, but that would not actually give a different result : coxph subtracts the means of the regressors before fitting, so a 1/2 coding is the same as 0/1, which is what a factor with treatment contrasts usually gives:
sex.f <- as.factor(sex)
sex.f
## [1] 2 2 2 1 2 2 2 2 1 1 1 1 1 2 1 2 2 2 2 2 1 2 2 2 2 1 1 1 1 1 1 2 2 1 2 1 2
## [38] 2 1 2 1 1 1 2 2 2 2 2 1 1 2 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 1 2 1 2
## [75] 1 1 1 2 2 1 1 1 2 1 1 2 1 2 2 1 1 1 2 2 2 1 1 1 1 1 1 2 1 2 1 1 2 1 1 2 2
## [112] 1 2 1 2 2 1 1 1 1 1 2 1 1 2 1 1 1 2 1 2 1 1 1 1 1 1 1 2 1 2 1 2 1 1 1 1 1
## [149] 2 2 2 1 1 1 1 2 2 2 1 2 1 2 1 1 1 1 2 1 2 2 1 1 1 1 2 2 1 1 1 2 1 1 1 1 1
## [186] 2 1 2 1 1 2 1 1 1 2 1 2 2 1 1 2 1 1 1 1
## Levels: 1 2
mod.cov.strat.f <- coxph(Surv(days,status==1)~sex.f+log(thick)+strata(ulc))
summary(mod.cov.strat.f)
## Call:
## coxph(formula = Surv(days, status == 1) ~ sex.f + log(thick) +
## strata(ulc))
##
## n= 205, number of events= 57
##
## coef exp(coef) se(coef) z Pr(>|z|)
## sex.f2 0.3600 1.4333 0.2702 1.332 0.1828
## log(thick) 0.5599 1.7505 0.1784 3.139 0.0017 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## sex.f2 1.433 0.6977 0.844 2.434
## log(thick) 1.750 0.5713 1.234 2.483
##
## Concordance= 0.673 (se = 0.039 )
## Likelihood ratio test= 13.3 on 2 df, p=0.001
## Wald test = 12.88 on 2 df, p=0.002
## Score (logrank) test = 12.98 on 2 df, p=0.002
In order to estimate survival curves for subjects with certain values of the covariates, we could use the option newdata in survfit:
par(mfrow=c(1,2))
plot(survfit(mod.cov.strat, newdata=data.frame(sex=2,thick=194)), xlab = "Days", ylab="Survival",col=c("red","black"),mark.time=F)
title("Male, tumor thickness=194")
legend(1000, .2, c("Ulcerated Tumor", "Not Ulcerated Tumor"), col=c("red","black"),bty="n", lty=c(1,1))
plot(survfit(mod.cov.strat, newdata=data.frame(sex=1,thick=194)), xlab = "Days", ylab="Survival",col=c("red","black"),mark.time=F)
title("Female, tumor thickness=194")
legend(1000, .2, c("Ulcerated Tumor", "Not Ulcerated Tumor"), col=c("red","black"),bty="n", lty=c(1,1))
par(mfrow=c(1,2))
plot(survfit(mod.cov.strat, newdata=data.frame(sex=2,thick=709)), xlab = "Days", ylab="Survival",col=c("red","black"),mark.time=F)
title("Male, tumor thickness=709")
legend(1000, .2, c("Ulcerated Tumor", "Not Ulcerated Tumor"), col=c("red","black"),bty="n", lty=c(1,1))
plot(survfit(mod.cov.strat, newdata=data.frame(sex=1,thick=709)), xlab = "Days", ylab="Survival",col=c("red","black"),mark.time=F)
title("Female, tumor thickness=709")
legend(1000, .2, c("Ulcerated Tumor", "Not Ulcerated Tumor"), col=c("red","black"),bty="n", lty=c(1,1))
Use pmsampsize to calculate the minimum sample size required for developing a multivariable Cox prediction model with a survival outcome using 30 candidate predictors. We know an existing prediction model in the same field has an R-squared adjusted of 0.051. Further, in the previous study the mean follow-up was 2.07 years, and overall event rate was 0.065. We select a timepoint of interest for prediction using the newly developed model of 2 years.
pmsampsize(type = "s", rsquared = 0.051, parameters = 30, rate = 0.065,timepoint = 2, meanfup = 2.07)
## NB: Assuming 0.05 acceptable difference in apparent & adjusted R-squared
## NB: Assuming 0.05 margin of error in estimation of overall risk at time point = 2
## NB: Events per Predictor Parameter (EPP) assumes overall event rate = 0.065
##
## Samp_size Shrinkage Parameter CS_Rsq Max_Rsq Nag_Rsq EPP
## Criteria 1 5143 0.900 30 0.051 0.555 0.092 23.07
## Criteria 2 1039 0.648 30 0.051 0.555 0.092 4.66
## Criteria 3 * 5143 0.900 30 0.051 0.555 0.092 23.07
## Final SS 5143 0.900 30 0.051 0.555 0.092 23.07
##
## Minimum sample size required for new model development based on user inputs = 5143,
## corresponding to 10646 person-time** of follow-up, with 692 outcome events
## assuming an overall event rate = 0.065 and therefore an EPP = 23.07
##
## * 95% CI for overall risk = (0.113, 0.13), for true value of 0.122 and sample size n = 5143
## **where time is in the units mean follow-up time was specified in
We need 5143 subjects in our study, according to the assumptions declared above.