This practical provides an introduction to survival analysis, and to conducting a survival analysis in R.
We will cover some basics of survival analysis; the following series of tutorial papers can be helpful for additional reading:
Clark, T., Bradburn, M., Love, S., & Altman, D. (2003). Survival analysis part I: Basic concepts and first analyses. 232-238. ISSN 0007-0920.
M J Bradburn, T G Clark, S B Love, & D G Altman. (2003). Survival Analysis Part II: Multivariate data analysis - an introduction to concepts and methods. British Journal of Cancer, 89(3), 431-436.
Bradburn, M., Clark, T., Love, S., & Altman, D. (2003). Survival analysis Part III: Multivariate data analysis – choosing a model and assessing its adequacy and fit. 89(4), 605-11.
Clark, T., Bradburn, M., Love, S., & Altman, D. (2003). Survival analysis part IV: Further concepts and methods in survival analysis. 781-786. ISSN 0007-0920.
Some packages we’ll be using include:
lubridate
survival
survminer
library(survival)
library(survminer)
library(lubridate)
Moreover, it will be necessary install a library from a github:
# devtools::install_github("zabore/ezfun")
library(ezfun)
Time-to-event data that consist of a distinct start time and end time.
Examples:
Because survival analysis is common in many other fields, it also goes by other names
The lung
dataset is available from the
survival
package in R
. The data contain
subjects with advanced lung cancer from the North Central Cancer
Treatment Group. Some variables we will use to demonstrate methods
include:
A subject may be censored due to:
Specifically these are examples of right censoring.
Left censoring and interval censoring are also possible, and methods exist to analyze this type of data, but in this course we will be limited to right censoring.
In this example, how would we compute the proportion of subjects who are event-free at 10 years?
Subjects 6 and 7 were event-free at 10 years. Subjects 2,9 and 10 had the event before 10 years. Subjects 1, 3, 4, 5 and 8 were censored before 10 years, so we don’t know whether they had the event or not by 10 years - how do we incorporate these subjects into our estimate?
ggplot(lung, aes(x = time, fill = factor(status))) +
geom_histogram(bins = 25, alpha = 0.6, position = "identity") +
scale_fill_manual(values = c("blue", "red"), labels = c("Censored", "Dead")) +
ezfun::theme_ezbasic() +
labs(x = "Days",
y = "Count")
For subject \(i\):
Event time \(T_i\)
Censoring time \(C_i\)
Event indicator \(\delta_i\):
Observed time \(Y_i = \min(T_i, C_i)\)
The observed times and an event indicator are provided in the
lung
data
kable(head(lung))
inst | time | status | age | sex | ph.ecog | ph.karno | pat.karno | meal.cal | wt.loss |
---|---|---|---|---|---|---|---|---|---|
3 | 306 | 2 | 74 | 1 | 1 | 90 | 100 | 1175 | NA |
3 | 455 | 2 | 68 | 1 | 0 | 90 | 90 | 1225 | 15 |
3 | 1010 | 1 | 56 | 1 | 0 | 90 | 90 | NA | 15 |
5 | 210 | 2 | 57 | 1 | 1 | 90 | 60 | 1150 | 11 |
1 | 883 | 2 | 60 | 1 | 0 | 100 | 90 | NA | 0 |
12 | 1022 | 1 | 74 | 1 | 1 | 50 | 80 | 513 | 0 |
Data will often come with start and end dates rather than pre-calculated survival times. The first step is to make sure these are formatted as dates in R.
Let’s create a small example dataset with variables
sx_date
for surgery date and last_fup_date
for
the last follow-up date.
date_ex <-
tibble(
sx_date = c("2007-06-22", "2004-02-13", "2010-10-27"),
last_fup_date = c("2017-04-15", "2018-07-04", "2016-10-31")
)
date_ex
## # A tibble: 3 × 2
## sx_date last_fup_date
## <chr> <chr>
## 1 2007-06-22 2017-04-15
## 2 2004-02-13 2018-07-04
## 3 2010-10-27 2016-10-31
We see these are both character variables, which will often be the case, but we need them to be formatted as dates.
date_ex %>%
mutate(
sx_date = as.Date(sx_date, format = "%Y-%m-%d"),
last_fup_date = as.Date(last_fup_date, format = "%Y-%m-%d")
)
## # A tibble: 3 × 2
## sx_date last_fup_date
## <date> <date>
## 1 2007-06-22 2017-04-15
## 2 2004-02-13 2018-07-04
## 3 2010-10-27 2016-10-31
R
the format must include the
separator as well as the symbol. e.g. if your date is in format m/d/Y
then you would need format = "%m/%d/%Y"
We can also use the lubridate
package to format dates.
In this case, use the ymd
function
date_ex %>%
mutate(
sx_date = ymd(sx_date),
last_fup_date = ymd(last_fup_date)
)
## # A tibble: 3 × 2
## sx_date last_fup_date
## <date> <date>
## 1 2007-06-22 2017-04-15
## 2 2004-02-13 2018-07-04
## 3 2010-10-27 2016-10-31
R
option, the separators do
not need to be specified?dmy
will show all format
options.Now that the dates formatted, we need to calculate the difference
between start and end time in some units, usually months or years. In
base R
, use difftime
to calculate the number
of days between our two dates and convert it to a numeric value using
as.numeric
. Then convert to years by dividing by
365.25
, the average number of days in a year.
date_ex %>%
mutate(
os_yrs =
as.numeric(
difftime(last_fup_date,
sx_date,
units = "days")) / 365.25
)
## # A tibble: 3 × 3
## sx_date last_fup_date os_yrs
## <date> <date> <dbl>
## 1 2007-06-22 2017-04-15 9.82
## 2 2004-02-13 2018-07-04 14.4
## 3 2010-10-27 2016-10-31 6.01
Using the lubridate
package, the operator
%--%
designates a time interval, which is then converted to
the number of elapsed seconds using as.duration
and finally
converted to years by dividing by dyears(1)
, which gives
the number of seconds in a year.
date_ex %>%
mutate(
os_yrs =
as.duration(sx_date %--% last_fup_date) / dyears(1)
)
## # A tibble: 3 × 3
## sx_date last_fup_date os_yrs
## <date> <date> <dbl>
## 1 2007-06-22 2017-04-15 9.82
## 2 2004-02-13 2018-07-04 14.4
## 3 2010-10-27 2016-10-31 6.01
lubridate
package
using a call to library
in order to be able to access the
special operators (similar to situation with pipes)For the components of survival data I mentioned the event indicator:
Event indicator \(\delta_i\):
However, in R
the Surv
function will also
accept TRUE/FALSE (TRUE = event) or 1/2 (2 = event).
In the lung
data, we have:
The probability that a subject will survive beyond any given specified time
\[S(t) = Pr(T>t) = 1 - F(t)\]
\(S(t)\): survival function \(F(t) = Pr(T \leq t)\): cumulative distribution function
In theory the survival function is smooth; in practice we observe events on a discrete time scale.
The Kaplan-Meier method is the most common way to estimate survival times and probabilities. It is a non-parametric approach that results in a step function, where there is a step down each time an event occurs.
Surv
function from the survival
package creates a survival object for use as the response in a model
formula. There will be one entry for each subject that is the survival
time, which is followed by a +
if the subject was censored.
Let’s look at the first 10 observations:Surv(lung$time, lung$status)[1:10]
## [1] 306 455 1010+ 210 883 1022+ 310 361 218 166
survfit
function creates survival curves based on a
formula. Let’s generate the overall survival curve for the entire
cohort, assign it to object f1
, and look at the
names
of that object:f1 <- survfit(Surv(time, status) ~ 1, data = lung)
names(f1)
## [1] "n" "time" "n.risk" "n.event" "n.censor" "surv"
## [7] "std.err" "cumhaz" "std.chaz" "type" "logse" "conf.int"
## [13] "conf.type" "lower" "upper" "t0" "call"
Some key components of this survfit
object that will be
used to create survival curves include:
time
, which contains the start and endpoints of each
time intervalsurv
, which contains the survival probability
corresponding to each time
Now we plot the survfit
object in base R
to
get the Kaplan-Meier plot.
plot(survfit(Surv(time, status) ~ 1, data = lung),
xlab = "Days",
ylab = "Overall survival probability", mark.time = T)
R
shows the step function
(solid line) with associated confidence intervals (dotted lines)mark.time = TRUE
)Alternatively, the ggsurvplot
function from the
survminer
package is built on ggplot2
, and can
be used to create Kaplan-Meier plots. Checkout the cheatsheet
for the survminer
package.
ggsurvplot(
fit = survfit(Surv(time, status) ~ 1, data = lung),
xlab = "Days",
ylab = "Overall survival probability")
ggsurvplot
shows the step
function (solid line) with associated confidence bands (shaded
area).censor = FALSE
One quantity often of interest in a survival analysis is the probability of surviving beyond a certain number (\(x\)) of years.
For example, to estimate the probability of surviving to \(1\) year, use summary
with the
times
argument (Note the time
variable in the lung
data is actually in days, so we need
to use times = 365.25
)
summary(survfit(Surv(time, status) ~ 1, data = lung), times = 365.25)
## Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 365 65 121 0.409 0.0358 0.345 0.486
We find that the \(1\)-year probability of survival in this study is 41%.
The associated lower and upper bounds of the 95% confidence interval are also displayed.
The \(1\)-year survival probability is the point on the y-axis that corresponds to \(1\) year on the x-axis for the survival curve.
plot_main <-
ggsurvplot(
data = lung,
fit = f1,
xlab = "Months",
legend = "none",
xscale = 30.4,
break.x.by = 182.4,
risk.table = TRUE,
risk.table.y.text = FALSE)
plot1 <- plot_main
plot1$plot <- plot1$plot +
geom_segment(x = 365.25, xend = 365.25, y = -0.05, yend = 0.4092416,
size = 1.5) +
geom_segment(x = 365.25, xend = -40, y = 0.4092416, yend = 0.4092416,
size = 1.5,
arrow = arrow(length = unit(0.2, "inches")))
plot1
What happens if you use a “naive” estimate?
121 of the 228 patients died by \(1\) year so:
\[\Big(1 - \frac{121}{228}\Big) \times 100 = 47\%\]
You get an incorrect estimate of the \(1\)-year probability of survival when you ignore the fact that 42 patients were censored before \(1\) year.
Recall the correct estimate of the \(1\)-year probability of survival was 41%.
fakedata2 <- lung %>%
mutate(time = ifelse(status == 2, time, 1022),
group = "No censoring") %>%
full_join(mutate(lung, group = "With censoring"))
fit3 <- survfit(Surv(time, status) ~ group, data = fakedata2)
ggsurvplot(
data = fakedata2,
fit = fit3,
xlab = "Months",
legend = "bottom",
legend.title = "",
legend.labs = c("No censoring", "With censoring"),
xscale = 30.4,
break.x.by = 182.4,
risk.table = TRUE,
risk.table.y.text = FALSE)
Another quantity often of interest in a survival analysis is the average survival time, which we quantify using the median. Survival times are not expected to be normally distributed so the mean is not an appropriate summary.
We can obtain this directly from our survfit
object
survfit(Surv(time, status) ~ 1, data = lung)
## Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
##
## n events median 0.95LCL 0.95UCL
## [1,] 228 165 310 285 363
We see the median survival time is 310 days The lower and upper bounds of the 95% confidence interval are also displayed.
Median survival is the time corresponding to a survival probability of \(0.5\):
plot2 <- plot_main
plot2$plot <- plot2$plot +
geom_segment(x = -45, xend = 310, y = 0.5, yend = 0.5, size = 1.5) +
geom_segment(x = 310, xend = 310, y = 0.5, yend = -0.03, size = 1.5,
arrow = arrow(length = unit(0.2, "inches")))
plot2
What happens if you use a “naive” estimate?
Summarize the median survival time among the 165 patients who died
lung %>%
filter(status == 2) %>%
summarize(median_surv = median(time))
## median_surv
## 1 226
lung
data is shown in
blue for comparisonfakedata <- lung %>%
filter(status == 2) %>%
mutate(group = "Ignoring censoring") %>%
full_join(mutate(lung, group = "With censoring"))
fit2 <- survfit(Surv(time, status) ~ group, data = fakedata)
ggsurvplot(
data = fakedata,
fit = fit2,
xlab = "Months",
legend = "bottom",
legend.title = "",
legend.labs = c("Ignoring censoring", "With censoring"),
xscale = 30.4,
break.x.by = 182.4,
risk.table = TRUE,
risk.table.y.text = FALSE)
We will use the “Duration of nursing home stay” data.
The National Center for Health Services Research studied 36 for-profit nursing homes to assess the effects of different financial incentives on length of stay. Treated nursing homes received higher per diems for Medicaid patients, and bonuses for improving a patient’s health and sending them home.
Study included 1601 patients admitted between May 1, 1981 and April 30, 1982.
Variables include:
Say we want to obtain the Fleming-Harrington estimate of the survival function for married females, in the healthiest initial subgroup, who are randomized to the untreated group of the nursing home study.
nurshome = read.table("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/NursingHome.dat",
head=F, skip=14, col.names = c("los", "age", "rx", "gender", "married", "health", "censor"))
nurs2 = subset(nurshome, gender==0 & rx==0 & health==2 & married==1)
fit.fh = survfit(Surv(nurs2$los, 1-nurs2$censor) ~ 1,type="fleming-harrington", conf.type="log-log")
summary(fit.fh)
## Call: survfit(formula = Surv(nurs2$los, 1 - nurs2$censor) ~ 1, type = "fleming-harrington",
## conf.type = "log-log")
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 14 12 1 0.9200 0.0767 0.55345 0.988
## 24 11 1 0.8401 0.1036 0.49777 0.957
## 25 10 1 0.7601 0.1207 0.42614 0.916
## 38 9 1 0.6802 0.1318 0.35609 0.866
## 64 8 1 0.6003 0.1384 0.29015 0.810
## 89 7 1 0.5204 0.1412 0.22897 0.749
## 113 6 1 0.4405 0.1402 0.17290 0.682
## 123 5 1 0.3606 0.1356 0.12238 0.609
## 149 4 1 0.2809 0.1268 0.07814 0.531
## 168 3 1 0.2012 0.1129 0.04142 0.446
## 185 2 1 0.1221 0.0917 0.01445 0.352
## 234 1 1 0.0449 0.0562 0.00107 0.245
Compare with KM:
fit.km = survfit(Surv(nurs2$los, 1-nurs2$censor) ~ 1,type="kaplan-meier", conf.type="log-log")
summary(fit.km)
## Call: survfit(formula = Surv(nurs2$los, 1 - nurs2$censor) ~ 1, type = "kaplan-meier",
## conf.type = "log-log")
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 14 12 1 0.9167 0.0798 0.53898 0.988
## 24 11 1 0.8333 0.1076 0.48171 0.956
## 25 10 1 0.7500 0.1250 0.40842 0.912
## 38 9 1 0.6667 0.1361 0.33702 0.860
## 64 8 1 0.5833 0.1423 0.27014 0.801
## 89 7 1 0.5000 0.1443 0.20848 0.736
## 113 6 1 0.4167 0.1423 0.15247 0.665
## 123 5 1 0.3333 0.1361 0.10270 0.588
## 149 4 1 0.2500 0.1250 0.06014 0.505
## 168 3 1 0.1667 0.1076 0.02651 0.413
## 185 2 1 0.0833 0.0798 0.00505 0.311
## 234 1 1 0.0000 NaN NA NA
Compare plots:
plot(fit.fh$time, fit.fh$surv,type="n", xlab="Follow up", ylab="Survival curve")
points(fit.fh$time, fit.fh$surv, col="blue", pch=17)
points(fit.km$time,fit.km$surv, col="red", pch=14)
lines(fit.fh$time,fit.fh$surv, col="blue", lwd=2)
lines(fit.km$time, fit.km$surv,col="red", lwd=2)
legend("topright", legend=c("FH", "KM"), col=c("blue", "red"), pch=c(17,14), lty=c(1,1), lwd=c(2,2), bty="n")
Generally, the Fleming Harrington estimator is slightly higher than the KM at every time point, but with larger datasets the two will typically be much closer.
Load libraries:
library(survival)
library(MASS)
library(rms)
library(ISwR)
library(survMisc)
library(survminer)
library(KMsurv)
Loading and viewing dataset melanom:
dati <- melanom
head(dati)
## no status days ulc thick sex
## 1 789 3 10 1 676 2
## 2 13 3 30 2 65 2
## 3 97 2 35 2 134 2
## 4 16 3 99 2 290 1
## 5 21 1 185 1 1208 2
## 6 469 1 204 1 484 2
attach(dati)
Create the survival object:
msurv <- with(dati, Surv(days, status == 1))
msurv
## [1] 10+ 30+ 35+ 99+ 185 204 210 232 232+ 279 295 355+
## [13] 386 426 469 493+ 529 621 629 659 667 718 752 779
## [25] 793 817 826+ 833 858 869 872 967 977 982 1041 1055
## [37] 1062 1075 1156 1228 1252 1271 1312 1427+ 1435 1499+ 1506 1508+
## [49] 1510+ 1512+ 1516 1525+ 1542+ 1548 1557+ 1560 1563+ 1584 1605+ 1621
## [61] 1627+ 1634+ 1641+ 1641+ 1648+ 1652+ 1654+ 1654+ 1667 1678+ 1685+ 1690
## [73] 1710+ 1710+ 1726 1745+ 1762+ 1779+ 1787+ 1787+ 1793+ 1804+ 1812+ 1836+
## [85] 1839+ 1839+ 1854+ 1856+ 1860+ 1864+ 1899+ 1914+ 1919+ 1920+ 1927+ 1933
## [97] 1942+ 1955+ 1956+ 1958+ 1963+ 1970+ 2005+ 2007+ 2011+ 2024+ 2028+ 2038+
## [109] 2056+ 2059+ 2061 2062 2075+ 2085+ 2102+ 2103 2104+ 2108 2112+ 2150+
## [121] 2156+ 2165+ 2209+ 2227+ 2227+ 2256 2264+ 2339+ 2361+ 2387+ 2388 2403+
## [133] 2426+ 2426+ 2431+ 2460+ 2467 2492+ 2493+ 2521+ 2542+ 2559+ 2565 2570+
## [145] 2660+ 2666+ 2676+ 2738+ 2782 2787+ 2984+ 3032+ 3040+ 3042 3067+ 3079+
## [157] 3101+ 3144+ 3152+ 3154+ 3180+ 3182+ 3185+ 3199+ 3228+ 3229+ 3278+ 3297+
## [169] 3328+ 3330+ 3338 3383+ 3384+ 3385+ 3388+ 3402+ 3441+ 3458+ 3459+ 3459+
## [181] 3476+ 3523+ 3667+ 3695+ 3695+ 3776+ 3776+ 3830+ 3856+ 3872+ 3909+ 3968+
## [193] 4001+ 4103+ 4119+ 4124+ 4207+ 4310+ 4390+ 4479+ 4492+ 4668+ 4688+ 4926+
## [205] 5565+
Histogram of the follow up (days) and descriptive statistics:
hist(days)
summary(days)
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 10 1525 2005 2153 3042 5565
Estimate the KM curve:
mfit <- survfit(Surv(days, status==1)~1, data = dati)
mfit
## Call: survfit(formula = Surv(days, status == 1) ~ 1, data = dati)
##
## n events median 0.95LCL 0.95UCL
## [1,] 205 57 NA NA NA
Output of the KM estimate at different time points:
summary(mfit, times=c(1000,3000,4000))
## Call: survfit(formula = Surv(days, status == 1) ~ 1, data = dati)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 1000 171 26 0.869 0.0240 0.823 0.917
## 3000 54 29 0.677 0.0385 0.605 0.757
## 4000 13 2 0.645 0.0431 0.566 0.735
Graphical representation of the curve:
plot(mfit)
abline(h=0.50, col="red",lty=2)
In this case the median survival time is not even interesting because the estimate is infinite.
The survival curve does not cross the 50% mark before all patients are censored.
If you want to see also censored times in the plot:
plot(mfit,mark.time=T)
abline(h=0.50, col="red",lty=2)
if you want to see also the number at risk at different time points:
ggsurvplot(mfit, data=dati, risk.table = TRUE)
Remember that estimate of the curve is reliable until the risk set
has a sufficient sample size.
As a rule of thumb: the curve can become unreliable when the number of
patients at risk is less than 15. The width of confidence intervals will
help to decide.
surv.bysex <- survfit(Surv(days,status==1)~sex)
Plot the curves:
plot(surv.bysex, conf.int=T,col=c("red","black"))
Other types of graphical representations:
ggsurvplot(surv.bysex, data=dati, risk.table = TRUE)
ggsurvplot(surv.bysex, data=dati, risk.table ="abs_pct")
Log-rank test to see if there is a significant difference between males and females: the survdiff function implements different tests specified by a parameter called rho, allowing a non-proportional hazards alternative to the null hypothesis, but the default value of rho gives the log-rank test.
survdiff(Surv(days,status==1)~sex)
## Call:
## survdiff(formula = Surv(days, status == 1) ~ sex)
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=1 126 28 37.1 2.25 6.47
## sex=2 79 29 19.9 4.21 6.47
##
## Chisq= 6.5 on 1 degrees of freedom, p= 0.01
We conclude that there is a significant difference in survival between males and females.
surv.bysex.ulc <- survdiff(Surv(days,status==1)~sex+strata(ulc))
surv.bysex.ulc
## Call:
## survdiff(formula = Surv(days, status == 1) ~ sex + strata(ulc))
##
## N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=1 126 28 34.7 1.28 3.31
## sex=2 79 29 22.3 1.99 3.31
##
## Chisq= 3.3 on 1 degrees of freedom, p= 0.07
To make separate graphs:
index.ulc1 <- ulc==1
index.ulc2 <- ulc==2
surv.bysex1 <- survfit(Surv(days,status==1)~sex, data=dati[index.ulc1,])
surv.bysex2 <- survfit(Surv(days,status==1)~sex, data=dati[index.ulc2,])
par(pty=("s"), mfrow=c(1,2))
plot(surv.bysex1, conf.int=T,col=c("red","black"))
title("Presence of ulcer")
plot(surv.bysex2, conf.int=T,col=c("red","black"))
title("Absence of ulcer")
Stratification makes the effect of sex less significant. A possible explanation might be that males seek treatment when the disease is in a more advanced state than women do, so that the gender difference is reduced when adjusted for a measure of disease progression.
Example of log rank with more than 2 groups:
data(myeloma)
fit <- survfit(Surv(time, event) ~ molecular_group, data = myeloma)
ggsurvplot(fit, legend.labs = levels(myeloma$molecular_group),
pval = TRUE)
This plot shows the global p value across groups. Let’s see the differences between pairs (adjusting for multiple comparisons):
res <- pairwise_survdiff(Surv(time, event) ~ molecular_group,
data = myeloma)
res
##
## Pairwise comparisons using Log-Rank test
##
## data: myeloma and molecular_group
##
## Cyclin D-1 Cyclin D-2 Hyperdiploid Low bone disease MAF
## Cyclin D-2 0.723 - - - -
## Hyperdiploid 0.943 0.723 - - -
## Low bone disease 0.723 0.988 0.644 - -
## MAF 0.644 0.447 0.523 0.485 -
## MMSET 0.328 0.103 0.103 0.103 0.723
## Proliferation 0.103 0.038 0.038 0.062 0.485
## MMSET
## Cyclin D-2 -
## Hyperdiploid -
## Low bone disease -
## MAF -
## MMSET -
## Proliferation 0.527
##
## P value adjustment method: BH
Proliferation vs Cyclin D-2 and vs Hyperdiploid show significant differences in survival curves.No others significant differences are detected between others groups.
Load the required libraries:
library(knitr)
library(tidyverse)
library(lubridate)
library(survival)
library(survminer)
library(gtsummary)
library(ISwR)
library(KMsurv)
library(pmsampsize)
We may want to quantify an effect size for a single variable, or include more than one variable into a regression model to account for the effects of multiple variables on the hazard function.
The Cox regression model is a semi-parametric model that can be used to fit univariable and multivariable regression models that have survival outcomes.
\[h(t|X_i) = h_0(t) \exp(\beta_1 X_{i1} + \cdots + \beta_p X_{ip})\]
\(h(t)\): hazard, or the instantaneous rate at which events occur \(h_0(t)\): underlying baseline hazard
Some key assumptions of the model:
Note: parametric regression models for survival outcomes are also available, but they won’t be addressed in this course.
We can fit regression models for survival data using the
coxph
function, which takes a Surv
object on
the left hand side and has standard syntax for regression formulas in
R
on the right hand side.
The lung
dataset contain subjects with advanced lung
cancer from the North Central Cancer Treatment Group. Some variables we
will use to demonstrate methods include:
data("cancer", package="survival")
str(lung)
## 'data.frame': 228 obs. of 10 variables:
## $ inst : num 3 3 3 5 1 12 7 11 1 7 ...
## $ time : num 306 455 1010 210 883 ...
## $ status : num 2 2 1 2 2 1 2 2 2 2 ...
## $ age : num 74 68 56 57 60 74 68 71 53 61 ...
## $ sex : num 1 1 1 1 1 1 2 2 1 1 ...
## $ ph.ecog : num 1 0 0 1 0 1 2 2 1 2 ...
## $ ph.karno : num 90 90 90 90 100 50 70 60 70 70 ...
## $ pat.karno: num 100 90 90 60 90 80 60 80 80 70 ...
## $ meal.cal : num 1175 1225 NA 1150 NA ...
## $ wt.loss : num NA 15 15 11 0 0 10 1 16 34 ...
head(lung)
## inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1 3 306 2 74 1 1 90 100 1175 NA
## 2 3 455 2 68 1 0 90 90 1225 15
## 3 3 1010 1 56 1 0 90 90 NA 15
## 4 5 210 2 57 1 1 90 60 1150 11
## 5 1 883 2 60 1 0 100 90 NA 0
## 6 12 1022 1 74 1 1 50 80 513 0
coxph(Surv(time, status) ~ sex, data = lung)
## Call:
## coxph(formula = Surv(time, status) ~ sex, data = lung)
##
## coef exp(coef) se(coef) z p
## sex -0.5310 0.5880 0.1672 -3.176 0.00149
##
## Likelihood ratio test=10.63 on 1 df, p=0.001111
## n= 228, number of events= 165
We can see a tidy version of the output using the tidy
function from the broom
package:
broom::tidy(
coxph(Surv(time, status) ~ sex, data = lung),
exp = TRUE
) %>%
kable()
term | estimate | std.error | statistic | p.value |
---|---|---|---|---|
sex | 0.5880028 | 0.1671786 | -3.176385 | 0.0014912 |
Or use tbl_regression
from the gtsummary
package
coxph(Surv(time, status) ~ sex, data = lung) %>%
gtsummary::tbl_regression(exp = TRUE)
Characteristic | HR | 95% CI | p-value |
---|---|---|---|
sex | 0.59 | 0.42, 0.82 | 0.001 |
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio |
estimate
in our coxph
) then HR = \(\exp(\beta)\).Here is a plot of the cumulative hazard functions by sex:
One assumption of the Cox proportional hazards regression model is that the hazards are proportional at each point in time throughout follow-up. How can we check to see if our data meet this assumption?
Use the cox.zph
function from the survival package. It
results in two main things:
mv_fit <- coxph(Surv(time, status) ~ sex + age, data = lung)
cz <- cox.zph(mv_fit)
print(cz)
## chisq df p
## sex 2.608 1 0.11
## age 0.209 1 0.65
## GLOBAL 2.771 2 0.25
plot(cz)
In this example, the PH assumption could be considered valid for both covariates.
Now, on the melanoma dataset, consider a Cox model with the single regressor sex:
Loading and viewing dataset melanom:
data("melanom", package="ISwR")
str(melanom)
## 'data.frame': 205 obs. of 6 variables:
## $ no : int 789 13 97 16 21 469 685 7 932 944 ...
## $ status: int 3 3 2 3 1 1 1 1 3 1 ...
## $ days : int 10 30 35 99 185 204 210 232 232 279 ...
## $ ulc : int 1 2 2 2 1 1 1 1 1 1 ...
## $ thick : int 676 65 134 290 1208 484 516 1288 322 741 ...
## $ sex : int 2 2 2 1 2 2 2 2 1 1 ...
head(melanom)
## no status days ulc thick sex
## 1 789 3 10 1 676 2
## 2 13 3 30 2 65 2
## 3 97 2 35 2 134 2
## 4 16 3 99 2 290 1
## 5 21 1 185 1 1208 2
## 6 469 1 204 1 484 2
attach(melanom)
## The following objects are masked from dati:
##
## days, no, sex, status, thick, ulc
Create the survival object:
mod.sex <- coxph(Surv(days,status==1)~sex)
summary(mod.sex)
## Call:
## coxph(formula = Surv(days, status == 1) ~ sex)
##
## n= 205, number of events= 57
##
## coef exp(coef) se(coef) z Pr(>|z|)
## sex 0.6622 1.9390 0.2651 2.498 0.0125 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## sex 1.939 0.5157 1.153 3.26
##
## Concordance= 0.59 (se = 0.034 )
## Likelihood ratio test= 6.15 on 1 df, p=0.01
## Wald test = 6.24 on 1 df, p=0.01
## Score (logrank) test = 6.47 on 1 df, p=0.01
Males (=2) have an hazard of death for melanoma that is nearly twice than women (=1).
A more elaborate example, involving also a continuous covariate:
mod.cov <- coxph(Surv(days,status==1)~sex+log(thick))
summary(mod.cov)
## Call:
## coxph(formula = Surv(days, status == 1) ~ sex + log(thick))
##
## n= 205, number of events= 57
##
## coef exp(coef) se(coef) z Pr(>|z|)
## sex 0.4580 1.5809 0.2687 1.705 0.0883 .
## log(thick) 0.7809 2.1834 0.1573 4.963 6.94e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## sex 1.581 0.6326 0.9337 2.677
## log(thick) 2.183 0.4580 1.6040 2.972
##
## Concordance= 0.749 (se = 0.033 )
## Likelihood ratio test= 33.45 on 2 df, p=5e-08
## Wald test = 31 on 2 df, p=2e-07
## Score (logrank) test = 32.52 on 2 df, p=9e-08
Remember that ‘thick’ is the tumor thickness; we use logarithm since the distribution is asymmetric. Each 1 point change in log(thick) is associated with a 2.2-fold increase in a patient’s risk.
Question: adjusting for log(thick), does the effect of gender follow a proportional hazards model? If the PH assumption holds, the log cumulative hazards for the two groups, adjusting for log(thick), should be roughly parallel:
fit1 <- coxph( Surv(days,status==1) ~ log(thick)+ strata(sex),data=melanom)
plot(survfit(fit1), lty=c(1,1), lwd=c(2,2), col=c("pink", "blue"), fun="cloglog",xlab="Days",ylab="Log-Cumulative Hazard Function", xlim=c(100, 5000) )
legend(100, -1, lty=c(1,1), lwd=c(2,2), col=c("pink", "blue"),legend=c("Females", "Males"),bty="n")
Conclusion: not strong evidence of non-proportional hazards. This is a good look at gross departures, but it is far from a formal test.
As before, let’s do a formal test to check PH assumption:
check.ph <- cox.zph(mod.cov, transform="km", global=TRUE)
check.ph
## chisq df p
## sex 1.33 1 0.248
## log(thick) 6.23 1 0.013
## GLOBAL 6.70 2 0.035
Visualizing the residual plot: if the PH assumption for log(thick) holds, then the plot of b(t) vs time should be on a horizontal line.In this case we note that for log(thick) the PH assumption is not valid, so we should apply some more advanced options (as for example introducing an interaction between the covariate and a function of time). We do not introduce here this option.
plot(check.ph[2], xlab="Days")
abline(h=0, lty=3,lwd=2)
abline(h=coef(mod.cov)[2], lty=3,col="red",lwd=2)
We can observe that instead the effect of log(thick) is gradually decreasing with time; red dashed line indicates Cox model’s estimate for the overall log thick effect.
A more elaborate example: binary factor + continuous covariate + stratification variable
mod.cov.strat <- coxph(Surv(days,status==1)~sex+log(thick)+strata(ulc))
summary(mod.cov.strat)
## Call:
## coxph(formula = Surv(days, status == 1) ~ sex + log(thick) +
## strata(ulc))
##
## n= 205, number of events= 57
##
## coef exp(coef) se(coef) z Pr(>|z|)
## sex 0.3600 1.4333 0.2702 1.332 0.1828
## log(thick) 0.5599 1.7505 0.1784 3.139 0.0017 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## sex 1.433 0.6977 0.844 2.434
## log(thick) 1.750 0.5713 1.234 2.483
##
## Concordance= 0.673 (se = 0.039 )
## Likelihood ratio test= 13.3 on 2 df, p=0.001
## Wald test = 12.88 on 2 df, p=0.002
## Score (logrank) test = 12.98 on 2 df, p=0.002
Plotting estimated survival curves in the strata:
plot(survfit(mod.cov.strat), col=c("red","black"),ylim=c(0.5,1), xlab="Days", ylab="Survival")
legend(3000,1, c("Ulcerated Tumor", "Not Ulcerated Tumor"), col=c("red","black"),bty="n", lty=c(1,1))
The default for survfit is to generate curves for a pseudoindividual for which the covariates are at their mean values. In the present case, that would correspond to a tumor thickness of 1.86 mm and a gender of 1.39 (!). We have been sloppy in not defining sex as a factor variable, but that would not actually give a different result : coxph subtracts the means of the regressors before fitting, so a 1/2 coding is the same as 0/1, which is what a factor with treatment contrasts usually gives:
sex.f <- as.factor(sex)
sex.f
## [1] 2 2 2 1 2 2 2 2 1 1 1 1 1 2 1 2 2 2 2 2 1 2 2 2 2 1 1 1 1 1 1 2 2 1 2 1 2
## [38] 2 1 2 1 1 1 2 2 2 2 2 1 1 2 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 1 2 1 2
## [75] 1 1 1 2 2 1 1 1 2 1 1 2 1 2 2 1 1 1 2 2 2 1 1 1 1 1 1 2 1 2 1 1 2 1 1 2 2
## [112] 1 2 1 2 2 1 1 1 1 1 2 1 1 2 1 1 1 2 1 2 1 1 1 1 1 1 1 2 1 2 1 2 1 1 1 1 1
## [149] 2 2 2 1 1 1 1 2 2 2 1 2 1 2 1 1 1 1 2 1 2 2 1 1 1 1 2 2 1 1 1 2 1 1 1 1 1
## [186] 2 1 2 1 1 2 1 1 1 2 1 2 2 1 1 2 1 1 1 1
## Levels: 1 2
mod.cov.strat.f <- coxph(Surv(days,status==1)~sex.f+log(thick)+strata(ulc))
summary(mod.cov.strat.f)
## Call:
## coxph(formula = Surv(days, status == 1) ~ sex.f + log(thick) +
## strata(ulc))
##
## n= 205, number of events= 57
##
## coef exp(coef) se(coef) z Pr(>|z|)
## sex.f2 0.3600 1.4333 0.2702 1.332 0.1828
## log(thick) 0.5599 1.7505 0.1784 3.139 0.0017 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## sex.f2 1.433 0.6977 0.844 2.434
## log(thick) 1.750 0.5713 1.234 2.483
##
## Concordance= 0.673 (se = 0.039 )
## Likelihood ratio test= 13.3 on 2 df, p=0.001
## Wald test = 12.88 on 2 df, p=0.002
## Score (logrank) test = 12.98 on 2 df, p=0.002
In order to estimate survival curves for subjects with certain values of the covariates, we could use the option newdata in survfit:
par(mfrow=c(1,2))
plot(survfit(mod.cov.strat, newdata=data.frame(sex=2,thick=194)), xlab = "Days", ylab="Survival",col=c("red","black"),mark.time=F)
title("Male, tumor thickness=194")
legend(1000, .2, c("Ulcerated Tumor", "Not Ulcerated Tumor"), col=c("red","black"),bty="n", lty=c(1,1))
plot(survfit(mod.cov.strat, newdata=data.frame(sex=1,thick=194)), xlab = "Days", ylab="Survival",col=c("red","black"),mark.time=F)
title("Female, tumor thickness=194")
legend(1000, .2, c("Ulcerated Tumor", "Not Ulcerated Tumor"), col=c("red","black"),bty="n", lty=c(1,1))
par(mfrow=c(1,2))
plot(survfit(mod.cov.strat, newdata=data.frame(sex=2,thick=709)), xlab = "Days", ylab="Survival",col=c("red","black"),mark.time=F)
title("Male, tumor thickness=709")
legend(1000, .2, c("Ulcerated Tumor", "Not Ulcerated Tumor"), col=c("red","black"),bty="n", lty=c(1,1))
plot(survfit(mod.cov.strat, newdata=data.frame(sex=1,thick=709)), xlab = "Days", ylab="Survival",col=c("red","black"),mark.time=F)
title("Female, tumor thickness=709")
legend(1000, .2, c("Ulcerated Tumor", "Not Ulcerated Tumor"), col=c("red","black"),bty="n", lty=c(1,1))
Use pmsampsize to calculate the minimum sample size required for developing a multivariable Cox prediction model with a survival outcome using 30 candidate predictors. We know an existing prediction model in the same field has an R-squared adjusted of 0.051. Further, in the previous study the mean follow-up was 2.07 years, and overall event rate was 0.065. We select a timepoint of interest for prediction using the newly developed model of 2 years.
# this code is not evaluated - pmsampsize does not work on mac
pmsampsize(type = "s", rsquared = 0.051, parameters = 30, rate = 0.065,timepoint = 2, meanfup = 2.07)
We need 5143 subjects in our study, according to the assumptions declared above.
Load the required libraries:
# load packages
library(knitr)
library(tidyverse)
library(lubridate)
library(survival)
library(survminer)
library(gtsummary)
library(ISwR)
library(KMsurv)
library(timeROC)
library(rms)
Load the dataset melanom:
data("melanom", package="ISwR")
str(melanom)
## 'data.frame': 205 obs. of 6 variables:
## $ no : int 789 13 97 16 21 469 685 7 932 944 ...
## $ status: int 3 3 2 3 1 1 1 1 3 1 ...
## $ days : int 10 30 35 99 185 204 210 232 232 279 ...
## $ ulc : int 1 2 2 2 1 1 1 1 1 1 ...
## $ thick : int 676 65 134 290 1208 484 516 1288 322 741 ...
## $ sex : int 2 2 2 1 2 2 2 2 1 1 ...
head(melanom)
## no status days ulc thick sex
## 1 789 3 10 1 676 2
## 2 13 3 30 2 65 2
## 3 97 2 35 2 134 2
## 4 16 3 99 2 290 1
## 5 21 1 185 1 1208 2
## 6 469 1 204 1 484 2
First of all we evaluate the discrimination accuracy of the tumor thickness as a prognostic biomarker for death from malignant melanoma:
ROC.thick<-timeROC(T=melanom$days,delta=melanom$status,marker=melanom$thick,cause=1, times=c(1800,2000,2200),iid=TRUE)
ROC.thick
## Time-dependent-Roc curve estimated using IPCW (n=205, with competing risks).
## Cases Survivors Other events Censored AUC_1 (%) se_1 AUC_2 (%) se_2
## t=1800 45 124 36 0 75.39 4.10 76.51 3.85
## t=2000 46 103 56 0 75.60 4.14 76.11 3.82
## t=2200 50 83 72 0 72.55 4.40 74.33 3.81
##
## Method used for estimating IPCW:marginal
##
## Total computation time : 0.44 secs.
Note that since we are here in a competing risk setting (two causes of death, but we are interested in cause=1), we obtain two AUCs estimated each one for a risk. We are now interested in evaluating AUC_1. We can also evaluate the confidence intervals for the estimated AUCs:
confint(ROC.thick)
## $CI_AUC_1
## 2.5% 97.5%
## t=1800 67.36 83.41
## t=2000 67.49 83.72
## t=2200 63.92 81.19
##
## $CB_AUC_1
## 2.5% 97.5%
## t=1800 66.66 84.11
## t=2000 66.78 84.42
## t=2200 63.18 81.93
##
## $C.alpha.1
## 95%
## 2.129302
##
## $CI_AUC_2
## 2.5% 97.5%
## t=1800 68.96 84.05
## t=2000 68.63 83.60
## t=2200 66.86 81.79
##
## $CB_AUC_2
## 2.5% 97.5%
## t=1800 68.55 84.46
## t=2000 68.23 84.00
## t=2200 66.46 82.20
##
## $C.alpha.2
## 95%
## 2.065624
Then, we estimate a Cox model by adding sex to the tumor thickness:
mod.cov <- coxph(Surv(days,status==1)~sex+log(thick), data=melanom)
summary(mod.cov)
## Call:
## coxph(formula = Surv(days, status == 1) ~ sex + log(thick), data = melanom)
##
## n= 205, number of events= 57
##
## coef exp(coef) se(coef) z Pr(>|z|)
## sex 0.4580 1.5809 0.2687 1.705 0.0883 .
## log(thick) 0.7809 2.1834 0.1573 4.963 6.94e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## sex 1.581 0.6326 0.9337 2.677
## log(thick) 2.183 0.4580 1.6040 2.972
##
## Concordance= 0.749 (se = 0.033 )
## Likelihood ratio test= 33.45 on 2 df, p=5e-08
## Wald test = 31 on 2 df, p=2e-07
## Score (logrank) test = 32.52 on 2 df, p=9e-08
We now extract the linear predictor from the Cox model (X*beta): note that the linear predictor of this cause-specific Cox model is evaluated at baseline and it is independent of time.
lin.pred <- predict(mod.cov)
And we now compute the time-dependent discrimination from this model:
ROC.cox <- timeROC(T=melanom$days,delta=melanom$status,marker=lin.pred,cause=1, times=c(1800,2000,2200),iid=TRUE)
ROC.cox
## Time-dependent-Roc curve estimated using IPCW (n=205, with competing risks).
## Cases Survivors Other events Censored AUC_1 (%) se_1 AUC_2 (%) se_2
## t=1800 45 124 36 0 75.76 4.09 76.94 3.85
## t=2000 46 103 56 0 76.17 4.15 76.39 3.83
## t=2200 50 83 72 0 74.33 4.34 75.13 3.75
##
## Method used for estimating IPCW:marginal
##
## Total computation time : 0.2 secs.
along with the estimated confidence intervals:
confint(ROC.cox)
## $CI_AUC_1
## 2.5% 97.5%
## t=1800 67.75 83.77
## t=2000 68.04 84.30
## t=2200 65.82 82.83
##
## $CB_AUC_1
## 2.5% 97.5%
## t=1800 66.99 84.53
## t=2000 67.26 85.08
## t=2200 65.02 83.64
##
## $C.alpha.1
## 95%
## 2.146461
##
## $CI_AUC_2
## 2.5% 97.5%
## t=1800 69.40 84.49
## t=2000 68.87 83.90
## t=2200 67.77 82.48
##
## $CB_AUC_2
## 2.5% 97.5%
## t=1800 68.98 84.91
## t=2000 68.46 84.31
## t=2200 67.37 82.88
##
## $C.alpha.2
## 95%
## 2.06678
Now we can compare the AUCs: is there a significant improvement in discrimination by adding sex?
pp1 <- compare(ROC.thick,ROC.cox)
round(pp1$p_values_AUC_1,digits=4)
## t=1800 t=2000 t=2200
## 0.7713 0.6594 0.1936
No significant increment in discrimination is observed from this test.
Now we add also ulcer in the Cox model but this time as a covariate (not as a strata variable):
mod.cov2 <- coxph(Surv(days,status==1)~sex+log(thick)+ulc, data=melanom)
summary(mod.cov2)
## Call:
## coxph(formula = Surv(days, status == 1) ~ sex + log(thick) +
## ulc, data = melanom)
##
## n= 205, number of events= 57
##
## coef exp(coef) se(coef) z Pr(>|z|)
## sex 0.3813 1.4641 0.2706 1.409 0.15879
## log(thick) 0.5756 1.7782 0.1794 3.209 0.00133 **
## ulc -0.9389 0.3911 0.3243 -2.895 0.00379 **
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## exp(coef) exp(-coef) lower .95 upper .95
## sex 1.4641 0.6830 0.8615 2.4883
## log(thick) 1.7782 0.5624 1.2511 2.5273
## ulc 0.3911 2.5571 0.2071 0.7384
##
## Concordance= 0.768 (se = 0.032 )
## Likelihood ratio test= 42.66 on 3 df, p=3e-09
## Wald test = 36.99 on 3 df, p=5e-08
## Score (logrank) test = 42.81 on 3 df, p=3e-09
We now extract also from this model the linear predictor (X*beta):
lin.pred2 <- predict(mod.cov2)
And now we compare the performance of this model with the tumor thickness:
ROC.cox2 <- timeROC(T=melanom$days,delta=melanom$status,marker=lin.pred2,cause=1, times=c(1800,2000,2200),iid=TRUE)
ROC.cox2
## Time-dependent-Roc curve estimated using IPCW (n=205, with competing risks).
## Cases Survivors Other events Censored AUC_1 (%) se_1 AUC_2 (%) se_2
## t=1800 45 124 36 0 78.74 3.95 79.36 3.71
## t=2000 46 103 56 0 78.59 4.06 78.58 3.73
## t=2200 50 83 72 0 75.88 4.27 76.66 3.69
##
## Method used for estimating IPCW:marginal
##
## Total computation time : 0.24 secs.
confint(ROC.cox2)
## $CI_AUC_1
## 2.5% 97.5%
## t=1800 71.00 86.47
## t=2000 70.64 86.54
## t=2200 67.51 84.25
##
## $CB_AUC_1
## 2.5% 97.5%
## t=1800 69.98 87.49
## t=2000 69.59 87.59
## t=2200 66.40 85.35
##
## $C.alpha.1
## 95%
## 2.218481
##
## $CI_AUC_2
## 2.5% 97.5%
## t=1800 72.10 86.62
## t=2000 71.27 85.88
## t=2200 69.43 83.89
##
## $CB_AUC_2
## 2.5% 97.5%
## t=1800 71.55 87.17
## t=2000 70.72 86.43
## t=2200 68.88 84.43
##
## $C.alpha.2
## 95%
## 2.108273
pp1 <- compare(ROC.thick,ROC.cox2)
round(pp1$p_values_AUC_1,digits=4)
## t=1800 t=2000 t=2200
## 0.1473 0.2036 0.1781
Again, no significant difference in discrimination is observed.
Finally, we can plot the estimated time-dependent AUCs:
plotAUCcurve(ROC.thick, FP = 2, add = FALSE,conf.int = FALSE,conf.band = FALSE, col = "black")
plotAUCcurve(ROC.cox, FP = 2, add =TRUE, conf.int = FALSE,conf.band = FALSE, col = "blue")
plotAUCcurve(ROC.cox2, FP = 2, add =TRUE, conf.int = FALSE,conf.band = FALSE, col = "red")
title(main="Time-dependent AUCs")
legend("topleft", legend=c("Tumor thickness","Thick + sex", "Thick + sex+ulc"), lty=c(1,1,1),lwd=c(2,2,2),
col=c("black","blue", "red"), bty="n")
We can also fix a specific time-horizon :
plot(ROC.thick,time=2200,add = FALSE, col = "black",title=FALSE, lwd=2)
plot(ROC.cox,time=2200,add = TRUE, col = "blue",title=FALSE, lwd=2)
plot(ROC.cox2,time=2200,add = TRUE, col = "red",title=FALSE, lwd=2)
legend("bottomright", legend=c("Tumor thickness","Thick + sex", "Thick + sex + ulc"), lty=c(1,1,1),lwd=c(2,2,2),
col=c("black","blue", "red"), bty="n")
title(main="Time-dependent ROC curves at 2200 days")
Note that discrimination in this context is focused exclusively on the performance of the linear predictor extracted from the Cox model. We are not considering here the survival estimates.
Just for didactic reasons, we are in this example IGNORING the competing risk present in these data, and act as we have an unique event. Therefore the method is illustrated is valid only in the standard survival setting, not in the competing risk setting (more advanced approaches are needed in that case, as for example estimating cumulative incidence curves using the multi-state approach). We now evaluate calibration for the Cox model: first of all, we determine summaries of variables for effect and plotting ranges, values to adjust to, and overall ranges for the dataset Melanom, since this is required in order to use the R functions developed in the rms library.
options(datadist='dd')
dd <-datadist(melanom)
dd
## no status days ulc thick sex
## Low:effect 222.0000 1 1525.0000 1 97.0000 1
## Adjust to 469.0000 2 2005.0000 1 194.0000 1
## High:effect 731.0000 3 3042.0000 2 356.0000 2
## Low:prediction 11.0000 1 294.2195 1 32.0000 1
## High:prediction 943.0488 3 4119.2439 2 838.7805 2
## Low 2.0000 1 10.0000 1 10.0000 1
## High 992.0000 3 5565.0000 2 1742.0000 2
##
## Values:
##
## status : 1 2 3
## ulc : 1 2
## sex : 1 2
Then, we re-estimate the Cox model by adding some options useful to fix the time horizon to evaluate calibration, in this case fixed at 2000 days:
mod.cov2 <- cph(Surv(days,status==1)~sex+log(thick)+ulc, data=melanom,x=TRUE , y=TRUE,surv=TRUE,time.inc=2000)
Finally, we use the function calibrate by comparing predicted probabilities from the Cox model with the observed ones estimated by the Kaplan-Meier:
calkm <- calibrate(mod.cov2, u=2000, m=40, cmethod= 'KM', B=200)
## Using Cox survival estimates at 2000 Days
calkm
## calibrate.cph(fit = mod.cov2, cmethod = "KM", u = 2000, m = 40,
## B = 200)
##
##
## n=205 B=200 u=2000 Day
##
## index.orig training test mean.optimism mean.corrected n
## [1,] 0.01505666 0.015399968 0.028326832 -0.012926864 0.02798352 200
## [2,] 0.01466150 0.006498119 -0.001349244 0.007847363 0.00681414 200
## [3,] 0.01195721 0.006277704 0.004442661 0.001835043 0.01012216 200
## [4,] -0.01622951 0.003399131 -0.013930637 0.017329768 -0.03355928 200
## [5,] 0.01160306 0.004361092 0.002688064 0.001673028 0.00993003 200
## mean.predicted KM KM.corrected std.err
## [1,] 0.4394888 0.4545455 0.4674723 0.17797217
## [2,] 0.6603385 0.6750000 0.6671526 0.10971343
## [3,] 0.8233362 0.8352934 0.8334584 0.07384917
## [4,] 0.8978469 0.8816174 0.8642876 0.06477229
## [5,] 0.9396165 0.9512195 0.9495465 0.03536639
plot(calkm, cex.lab=0.7)
Another more general approach to estimating calibration curves for
survival models is to use the flexible adaptive hazard regression
approach (not covered here). For survival models, “predicted” means
predicted survival probability at a single time point, and “observed”
refers to the corresponding Kaplan-Meier survival estimate. The number
of intervals of predicted survival probabilities in which subjects are
ranked depend on the parameter m that correspond to the average
number of subjects in each group evaluated at u-time.
As an internal validation tool, the bootstrap is used to de-bias the
estimates to correct for overfitting, allowing estimation of the
likely future calibration performance of the fitted model.
These examples are taken from the paper: “Analysis of time-to-event for observational studies: Guidance to the use of intensity models”, available at : https://pubmed.ncbi.nlm.nih.gov/33043497/.
load(here("datasets", "pad.rda"))
library(splines)
library(timereg)
library(knitr)
library(survival)
# Advanced ovarian data from dynpred package
data(ova, package="dynpred")
Peripheral arterial disease (PAD) is a common circulatory problem in
which narrowed arteries reduce blood flow to peripheral limbs, often the
legs. It is also likely to be a sign of a more widespread
atherosclerosis, and subjects manifesting the disease carry an increased
risk for all atherothrombotic events including myocardial infarction,
stroke, and cardiac death. The pad
data set contains the
results of a Slovene study. Briefly, the study was conducted by 74
primary care physicians-researchers, each of whom was asked to recruit
10 subjects with PAD along with 10 age and sex matched controls without
PAD; actual recruitment ranged from 1 to 31 pairs. All participating
subjects consented to a yearly examination and interview for 5 years.
Important endpoints are death, either due to cardiovascular disease (CVD
death) or other causes, non-fatal CVD endpoints of infarction and
stroke, and patient interventions attributed to the disease such as
revascularization procedures.
The final study includes 742 PAD and 713 controls, with baseline data for each subject, measurements at each visit, and endpoints. This analysis will focus on cardiac endpoints and on overall mortality.
The starting R data has 3 dataframes:
pad1
has one observation per subject containing
baseline information,pad2
has one observation per subject visit containing
measurements at that visit, andpad3
has an observation for each observed
endpoint.The dataset pad2 contains all the repeated examinations over the 5 years of the study. Note that in this dataset the ID of the subject is repeated for multiple lines (data are in the long format). The same happens for pad3 that reports all the events of interest. In the IDA phase (Initial Data Analysis) we look at the overall follow-up for subjects, and decide on a time scale and an endpoint. Per the study design subjects should have a visit at years 1–5. We see in Figure that visit 1 dates are stretched out over a full year and by visit 5 over almost 2 years, a pattern that is unavoidable in observational studies of human subjects. Based on the counts shown in the figure there is no evidence of differential drop out between the PAD and control subjects.
vcount <- with(pad2, table(pad, visit)) # visit count
vtime <- pad2$day/365.25
plot(vtime ~ jitter(visit + .3*(pad=="PAD") - .15), pad2, subset=(visit>0),
col=c(1,2)[as.numeric(pad)], ylim=c(0, max(vtime)),
xlab= "Visit number", ylab="Years from initial visit")
#abline(h=1:6, lty=2, col=4)
segments(rep(0,6), 1:6, 1:6+.5, 1:6, lty=2, col='gray')
legend(4,2, c("PAD", "Control"), col=c(2,1), bty='n', pch=19)
text(1:5 - .15, 0:4, vcount[1,-1])
text(1:5 + .15, 0:4, vcount[2,-1])
Time to each visit for the PAD study.
pfig2 <- survfit(Surv(futime/365.25, status=='censor') ~ pad, pad1)
plot(pfig2, fun='event', col=1:2,
xlab="Years post enrollment", ylab="Fraction censored")
legend(2, .6, c("Control", "PAD", "Ideal"), col=c(1,2,1), lty=c(1,1,2),
bty='n')
lines(c(0, 4.8, 6.8), c(0,0,1), lty=3)
Observed and ideal follow-up in the PAD study
The censoring pattern for subjects in the two arms, showed in the
above figure, are computed using a reverse Kaplan-Meier
(i.e. the status is to be censored, and the censoring is having the
event).
There is no evidence of differential follow-up.
Given the visit slippage of the first figure, an ideal censoring pattern would be a final followup that is approximately uniformly distributed between 4.8 and 6.8 years, this is shown as a dotted line on the figure.
Examining the counts of subjects for each visit the total number of visits for the subjects is not what was planned — about 25% do not have a fifth visit — but patients have adhered to the 5 year time commitment quite closely. As stated in the primary paper, a decision was made to truncate all patient data at 5 years for the purpose of analysis.
The initial survival curves can be created using a copy of the baseline data (pad1), with follow-up truncated at 5 years.
pdata <- subset(pad1, select=c(id, pad, sex, age, futime, status))
pdata$death <- ifelse(pdata$status == 'censor', 0, 1) # any cause of death
# truncate follow=up at 5 years
over5 <- pdata$futime > 5*(365.25)
pdata$futime <- pmin(1826, pdata$futime) # 1826 days = 5 years
pdata$death[over5] <- 0
pdata$status[over5] <- "censor"
table(pdata$status)
##
## censor CV death nonCV death
## 1296 68 91
# Convert to follow-up in years, and add an age scale
pdata$years <- pdata$futime/365.25
pdata$age1 <- as.numeric(pad1$d.first - pad1$d.birth)/365.25
pdata$age2 <- pdata$age1 + pdata$years
oldpar <- par(mfrow=c(1,2))
psurv1 <- survfit(Surv(years, death) ~ pad + sex, data=pdata)
psurv2 <- survfit(Surv(age1, age2, death) ~ pad+ sex, pdata,
start.time=55)
plot(psurv1, col=c(1,1,2,2), lty=1:2, fun='event', lwd=2,
xlab="Years since enrollment", ylab="Death")
legend(.1, .18, c("PAD, male","PAD, female", "Control, male","Control, female"),
col=c(2,2,1,1), lty=c(2,1,2,1), lwd=2, bty='n', cex=.8)
plot(psurv2, col=c(1,1,2,2), lty=1:2, fun='event', lwd=2,
xlab='Age', ylab='Death, given alive at age 55')
Kaplan-Meier curves of overall survival.
Our first analysis will use time since enrollment as the primary time scale. For the PAD patients the time since diagnosis is a natural scale since it represents both the progression of disease and of treatments for the disease. Survival curves for the control subjects serve as a comparison outcome of similarly aged subjects without the disease, but do not have a natural stand-alone interpretation.
Death rates are higher for males than for females, which is no surprise given a mean age at entry of 65 years, and is higher for PAD subjects than for the age matched controls. The right hand panel shows the curves on age scale. Note that when using the age scale, we include two time points in the Survival function: age at entry in the study and age at the event or censoring. These data are both right-censored and left-truncated (we have not discussed this point in the lessons, but it is of interest if you want to deepen the survival framework).In two words, left-truncation is when the subjects to enter under your observation should be survived until a certain time-point (i.e. age at entry).
Survival curves can equivalently display either the fraction who have died (curves start at 0: 1-KM) or the fraction who have not died (curves start at 1); the choice between the two is often based on tradition or habit.
The analysis will focus on the impact on survival of the covariates PAD, age, sex, and serum values for high density and low density lipoprotein (HDL, LDL). Figure shows density estimates for each of the three continuous covariates at baseline, by PAD group. By study design there should be no difference in the age distribution. HDL and LDL are slightly lower for the PAD subjects.
oldpar <- par(mfrow=c(1,3), mar=c(5.1, 4.1, 5.1, 1))
dfun <- function(data, x, ...) {
d0 <- density((data[[x]])[data$pad=="Control"])
d1 <- density((data[[x]])[data$pad=="PAD"])
matplot(cbind(d0$x, d1$x), cbind(d0$y, d1$y), type='l', lty=1, col=1:2,
...)
}
lunit <- 38.67 # convert from mmol/L to mg/dl
xx <- c(0, 50, 100, 150, 200, 250, 300)
dfun(pdata, "age", xlab="Age", ylab="Density", lwd=2)
legend(51, .02, c("PAD", "Control"), lty=1, lwd=2, col=2:1, bty='n')
dfun(na.omit(subset(pad2, visit==0, c(hdl, pad))),
"hdl", xlab="HDL (mmol/L)", ylab="Density", lwd=2)
axis(3, xx/lunit, xx)
dfun(na.omit(subset(pad2, visit==0, c(ldl, pad))),
"ldl", xlab="LDL (mmol/L)", ylab="Density", lwd=2)
axis(3, xx/lunit, xx)
Distributions of three continuous covariates, by group. Blood value scales are in mmol/L (lower) and mg/dl (above).
The Cox model fits will use age, sex, baseline values of high density
and low density lipoproteins (hdl0, ldl0
),
time-dependent lipid values (hdl, ldl
), and a
time-dependent value of “years since enrollment”
(etime
). Time since enrollment only plays a role in age
scale models: in a time-since-entry model every one in a given risk set
will have exactly the same value of the covariate, and hence the
estimated coefficient for the variable would be 0.
For time-dependent analysis we use a counting process form. Mathematically each subject is represented by 3 processes: \(Y_i(t)=1\) if the subject is currently at risk, \(X_i(t)\) is the time dependent covariate process, and \(N(t)\) the cumulative number of events. In the R code this translates into multiple observations for each subject of the form start-stop:
id, time1, time2, event, x1, x2, ...
where (time1, time2)
describe an interval over which the
subject is at risk, event
corresponds to \(dN(t)\) and will be 1 of the subject had an
event at time2
and 0 otherwise, x1, x2, ...
are the value of the covariates over the interval, and id
is a subject identifier. Subject 9, for instance, is a male with blood
values measured on days 0, 583, 945, and 1309, who died on day 1349.
Their data has the form
id tstart tstop death sex ldl hdl
9 0 583 0 male 2.66 1.08
9 583 945 0 male 3.20 1.53
9 945 1309 0 male 2.81 1.32
9 1309 1349 1 male 2.51 1.30
The lipid values and time intervals encode the most recent
measurement of the value. Creation and checking of the data often
represents a major share of the work in an analysis. There is often
concern that because the data set has multiple rows for the same
subject, a valid variance estimate for the model will need to account
for correlation.
This is not the case, however.
Internally, there is only one copy of each subject; the multiple rows
are merely a way to communicate to the program which covariate
value is applicable at any given time point.
When there are competing risks, such as CVD and non-CVD death, the 0/1 event event variable is replaced by a factor with levels of “none”, “CVD death”, “non-CVD death”, which shows both when an event occurred and its type.
Below we create separate time-dependent data sets for analysis on the
time-from-enrollment scale (pdata2
) and on the age scale
(pdata3
).
The two data sets have separate event indicators for death, the competing risk of CVD/non-CVD death, and for any CVD event/non-CVD death. Creating (time1, time2) data sets on age-in-years scale can lead to anomalies due to round off error; a detailed discussion and example is found in the “Round off error and tied times” vignette of the survival package. For this reason the time-dependent data sets all use time in days. Use birth date and entry date to create agedays = age at enrollment, in days
This data set will have two time scales (age and time since enrollment), time dependent covariates (hdl, ldl, etime), and multiple endpoints (all cause death, CVD vs non-CVD death, major CVD event).
A. etime = time since enrollment as a time dependent variable 1. This is a continuous time-dependent covariate – technically it changes minute by minute. 2. We will be using it in age scale fits. If we create it on age scale then the cuts for survSplit can be limited to “the ages at which deaths occur”
B. The tmerge routine can keep track of multiple endpoints on a single time scale, the survSplit routine a single endpoint and single time scale.
C. The tmerge operations needs to be sequential: no step that would change the number of rows can be inserted in the midst.
Taken all together, it is simplest to create the data sets, pdata2 on the time since enrollment scale, pdata3 on the age scale. Many of the steps will need to be repeated, e.g., adding time dependent hdl and ldl.
pdata2 <- tmerge(pdata[, c("id", "pad", "age", "sex")], pdata,
id=id, status = event(futime, status))
pdata2 <- tmerge(pdata2, pad2, id=id, ldl= tdc(day, ldl),
hdl = tdc(day, hdl))
# add baseline hdl and ldl as covariates
indx <- match(pdata2$id, pdata2$id)
pdata2$ldl0 <- pdata2$ldl[indx]
pdata2$hdl0 <- pdata2$hdl[indx]
pdata2$age10 <- pdata2$age/10 # age in decades
# create "any cvd event" as a competing risk, it will be used later
temp <- subset(pad3, event %in% c("cvd death", "infarction", "stroke"))
temp <- temp[!duplicated(temp$id),] # first of these
temp <- subset(temp, day< 5*365.25) # must be <= 5 years
pdata2 <- tmerge(pdata2, temp, id=id, cvstat=event(day),
priorcv = tdc(day))
pdata2$death <- ifelse(pdata2$status=="censor", 0, 1) # all cause death
tstat <- ifelse(pdata2$cvstat>0, pdata2$cvstat, 2*pdata2$death)
pdata2$cvstat <- factor(tstat, 0:2,
c("censor", "CVD event", "death"))
pdata2$death <- ifelse(pdata2$status=="censor", 0, 1) # all cause death
# Reprise all the steps, on age scale. We need to add "age" to the
# merged-in data sets before adding them.
ageday <- as.numeric(pad1$d.first - pad1$d.birth) # age in days at enrollmemt
pdata3 <- tmerge(pdata[, c("id", "pad", "sex")], pdata,
id=id, status = event(futime + ageday, status),
tstart=ageday, tstop=ageday+futime,
options=list(tstart="age1", tstop="age2"))
temp2 <- pad2
temp2$age <- temp2$day + ageday[match(temp2$id, pad1$id)]
pdata3 <- tmerge(pdata3, temp2, id=id, ldl= tdc(age, ldl),
hdl = tdc(age, hdl))
# add baseline hdl and ldl as covariates
indx <- match(pdata3$id, pdata3$id) #points to the first row for each subject
pdata3$ldl0 <- pdata3$ldl[indx]
pdata3$hdl0 <- pdata3$hdl[indx]
# Add time since enrollment as a time-dependent covariate.
# Use a temporary data set + survSplit to create a 'variable' that looks
# like yet another lab test, then merge it in on the age scale.
# Time since enrollment will be in weeks, which will create 52*5= 260 obs
# per subject. (Time since enrollment in days would be overkill.)
pdata$ageday <- ageday
temp3 <- survSplit(Surv(futime, status) ~ ageday + id, data= pdata,
cut= seq(7, max(pdata$futime), by=7))
pdata3 <- tmerge(pdata3, temp3, id, etime = tdc(ageday + tstart, tstart))
pdata3$etime <- pdata3$etime/365.25 # recode to time in years
# add 'any cv' as an event
temp$age <- temp$day + ageday[match(temp$id, pad1$id)]
pdata3 <- tmerge(pdata3, temp, id=id, cvstat=event(age),
priorcv = tdc(age))
# recode event variables, after all survSplit or tmerge steps are done
pdata3$death <- ifelse(pdata3$status=="censor", 0, 1) # all cause death
tstat <- ifelse(pdata3$cvstat>0, pdata3$cvstat, 2*pdata3$death)
pdata3$cvstat <- factor(tstat, 0:2, c("censor", "CVD event", "nonCVD-death"))
# Creation of multi-row data sets has pitfalls. It is always wise to
# sanity check the results.
check1 <- survcheck(Surv(tstart, tstop, death) ~ 1, id=id, data= pdata2)
check1
## Call:
## survcheck(formula = Surv(tstart, tstop, death) ~ 1, data = pdata2,
## id = id)
##
## Unique identifiers Observations Transitions
## 1455 6731 159
##
## Transitions table:
## to
## from 1 (censored)
## (s0) 159 1296
## 1 0 0
##
## Number of subjects with 0, 1, ... transitions to each state:
## count
## state 0 1
## 1 1296 159
## (any) 1296 159
check2 <- survcheck(Surv(age1, age2, death) ~1, id=id, data= pdata3)
check2
## Call:
## survcheck(formula = Surv(age1, age2, death) ~ 1, data = pdata3,
## id = id)
##
## Unique identifiers Observations Transitions
## 1455 362876 159
##
## Transitions table:
## to
## from 1 (censored)
## (s0) 159 1296
## 1 0 0
##
## Number of subjects with 0, 1, ... transitions to each state:
## count
## state 0 1
## 1 1296 159
## (any) 1296 159
The first pair of fits shows that the estimated overall effects of PAD and sex hardly differ between analyses done on the enrollment or age scales. A second pair fits looks at the coefficient effects within the PAD and control groups. (Inclusion of the covariate by strata interactions gives results that are equivalent to separate fits for the the PAD and control groups; bundling them as a single model allows a formal test of whether the effects differ between the two strata.) None of sex, enrollment age, or time since enrollment show meaningful differences between the PAD and control groups. That is, on the log-hazard scale, PAD appears to act as an additive risk.
# Fit the first set of models: pad, age, sex on the time since enrollment
# scale (cfit1a) and on the age scale (cfit1b)
cfit1a <- coxph(Surv(tstart, tstop, death) ~ pad + sex + age10, pdata2)
cfit1b <- coxph(Surv(age1, age2, death) ~ pad + sex + etime, pdata3)
# Add interactions
cfit2a <- coxph(Surv(tstart, tstop, death) ~ (sex + age10)*strata(pad), pdata2)
cfit2b <- coxph(Surv(age1, age2, death) ~ (sex + etime)*strata(pad),
pdata3)
cat("Overall model on time since enrollment scale\n")
## Overall model on time since enrollment scale
print(cfit1a)
## Call:
## coxph(formula = Surv(tstart, tstop, death) ~ pad + sex + age10,
## data = pdata2)
##
## coef exp(coef) se(coef) z p
## padPAD 0.8755 2.4000 0.1739 5.036 4.76e-07
## sexmale 0.6924 1.9985 0.1822 3.800 0.000145
## age10 0.6583 1.9315 0.1053 6.250 4.11e-10
##
## Likelihood ratio test=82.63 on 3 df, p=< 2.2e-16
## n= 6731, number of events= 159
cat("\nOverall model on age scale\n")
##
## Overall model on age scale
print(cfit1b)
## Call:
## coxph(formula = Surv(age1, age2, death) ~ pad + sex + etime,
## data = pdata3)
##
## coef exp(coef) se(coef) z p
## padPAD 0.87354 2.39538 0.17400 5.020 5.16e-07
## sexmale 0.70540 2.02466 0.18270 3.861 0.000113
## etime 0.16593 1.18049 0.05917 2.804 0.005042
##
## Likelihood ratio test=51.15 on 3 df, p=4.538e-11
## n= 362876, number of events= 159
cat("\nInteractions between covariates and treatment, time since enrollment\n")
##
## Interactions between covariates and treatment, time since enrollment
print(cfit2a)
## Call:
## coxph(formula = Surv(tstart, tstop, death) ~ (sex + age10) *
## strata(pad), data = pdata2)
##
## coef exp(coef) se(coef) z p
## sexmale 0.67576 1.96552 0.33494 2.018 0.043638
## age10 0.68341 1.98062 0.19354 3.531 0.000414
## sexmale:strata(pad)PAD 0.02208 1.02233 0.39925 0.055 0.955897
## age10:strata(pad)PAD -0.03498 0.96563 0.23081 -0.152 0.879552
##
## Likelihood ratio test=55.21 on 4 df, p=2.938e-11
## n= 6731, number of events= 159
cat("\nInteractions between covariates and treatment, age scale\n")
##
## Interactions between covariates and treatment, age scale
print(cfit2b)
## Call:
## coxph(formula = Surv(age1, age2, death) ~ (sex + etime) * strata(pad),
## data = pdata3)
##
## coef exp(coef) se(coef) z p
## sexmale 0.6980525 2.0098347 0.3357736 2.079 0.0376
## etime 0.1176904 1.1248958 0.1097373 1.072 0.2835
## sexmale:strata(pad)PAD 0.0001539 1.0001539 0.4003644 0.000 0.9997
## etime:strata(pad)PAD 0.0665382 1.0688018 0.1304441 0.510 0.6100
##
## Likelihood ratio test=24.44 on 4 df, p=6.516e-05
## n= 362876, number of events= 159
Therefore, we can see that in the multivariable Cox model on the follow up scale, being PAD and increasing age decrease the estimated survival, male sex is at higher risk. On the age scale, these results are confirmed. There is no a significant interaction between PAD and gender or age.
Since PAD is expected to exert its primary effect via cardiovascular
disease (CVD) endpoints, a more interesting analysis is to focus on CVD
rather than overall mortality. Since in this case non-CVD death will be
a competing risk this needs to be accounted for in our estimate
of the probability in state; in the R survival package this is
accomplished by using a multi-level categorical variable as the
“status”, rather than 0/1. The survfit
function then
computes Aalen-Johansen (AJ) estimates of a multi state model.
oldpar <- par(mfrow=c(1,2))
psurv4 <- survfit(Surv(tstart, tstop, status) ~ pad, id = id, pdata2)
plot(psurv4, col=1:2, lty=c(1,1,2,2), ylim=c(0,.15), xscale=365.25,
xlab="Years since enrollment", ylab="Absolute risk")
legend("topleft", c("PAD CVD death", "Control CVD death",
"PAD nCVD death","Control nCVD"),
col=c(2,1,2,1), lty=c(1,1,2,2), lwd=2, bty='n')
psurv5 <- survfit(Surv(tstart, tstop, cvstat) ~ pad, pdata2, id=id,
subset=(priorcv==0))
plot(psurv5, col=1:2, lty=c(1,1,2,2), xscale=365.25, ylim=c(0, .15),
xlab="Years since enrollment", ylab="Absolute risk")
legend("topleft", c("Major CVD, PAD", "Major CVD, Control",
"death w/o CVD, PAD","death w/o CVD, Control"),
col=c(2,1,2,1), lty=c(1,1,2,2), lwd=2, bty='n', cex=.85)
Aalen-Johansen curves for CVD vs. non-CVD death outcomes (left) and for any CVD event vs. death without CVD (right).
Figure shows the Aalen-Johansen estimates of probability in state for CVD and non-CVD death in the left panel (which could also be labeled as “absolute risk”, “competing risk”, or “cumulative incidence”) and the AJ curves for first major CVD event (infarction, stroke, CVD death) versus non-CVD death in the right panel. For controls, CVD deaths are about 1/2 of the non-CVD, approximately 2% versus 4% at 5 years; both outcomes are near 8% for the PAD subjects. Nearly 14% of the PAD subjects experience a major CVD event versus 6% for controls.
The absolute risk curves require that all transitions in the model be considered jointly: incorrect curves will result from a simple Kaplan-Meier of time to CVD, censoring at non-CVD death, or a KM of time to non-CVD death, censoring at CVD.
Interestingly, this is not true for the cumulative hazard estimate: the naive estimate for one outcome, censoring the others, will be identical to the joint estimate.
The Cox model is a model for the hazards and has the same identity:
the first coxph
call below, which models the entire
process, will have the same results for the CVD death portion as the
second coxph
call, which ignores non-CVD deaths by
censoring them; this is verified by the all.equal
test.
However, if one desires to predict absolute risk curves from the fitted Cox model then the first form (using the “complete” multi-state approach) is essential.
pcox1 <- coxph(Surv(tstart, tstop, status) ~ pad + sex + age10 + hdl0 + ldl0,
pdata2, id=id)
cat("transitions table\n")
## transitions table
pcox1$transitions # a check of the data
## to
## from CV death nonCV death (censored)
## (s0) 67 91 1275
## CV death 0 0 0
## nonCV death 0 0 0
pcox1
## Call:
## coxph(formula = Surv(tstart, tstop, status) ~ pad + sex + age10 +
## hdl0 + ldl0, data = pdata2, id = id)
##
##
## 1:2 coef exp(coef) se(coef) robust se z p
## padPAD 1.05440 2.87025 0.28291 0.28029 3.762 0.000169
## sexmale 0.51298 1.67026 0.27813 0.27881 1.840 0.065786
## age10 0.65698 1.92895 0.16378 0.17894 3.672 0.000241
## hdl0 -0.30026 0.74063 0.32993 0.44220 -0.679 0.497130
## ldl0 -0.08128 0.92194 0.12687 0.12754 -0.637 0.523966
##
##
## 1:3 coef exp(coef) se(coef) robust se z p
## padPAD 0.71390 2.04194 0.22494 0.22789 3.133 0.00173
## sexmale 0.75348 2.12438 0.25070 0.25505 2.954 0.00313
## age10 0.65666 1.92835 0.14006 0.14379 4.567 4.95e-06
## hdl0 -0.20347 0.81589 0.28098 0.32688 -0.622 0.53363
## ldl0 0.02213 1.02237 0.10630 0.10857 0.204 0.83850
##
## States: 1= (s0), 2= CV death, 3= nonCV death
##
## Likelihood ratio test=83.96 on 10 df, p=8.386e-14
## n= 6626, unique id= 1433, number of events= 158
## (105 observations deleted due to missingness)
# cause-specific Cox model censoring non CV death
pcox1b <- coxph(Surv(tstart, tstop, status=="CV death") ~ pad +sex + age10
+ hdl0 + ldl0, data=pdata2) #censor non-CVD death
pcox1b
## Call:
## coxph(formula = Surv(tstart, tstop, status == "CV death") ~ pad +
## sex + age10 + hdl0 + ldl0, data = pdata2)
##
## coef exp(coef) se(coef) z p
## padPAD 1.05445 2.87039 0.28292 3.727 0.000194
## sexmale 0.51298 1.67027 0.27813 1.844 0.065122
## age10 0.65703 1.92905 0.16379 4.011 6.03e-05
## hdl0 -0.30029 0.74060 0.32993 -0.910 0.362744
## ldl0 -0.08124 0.92197 0.12687 -0.640 0.521959
##
## Likelihood ratio test=39.51 on 5 df, p=1.874e-07
## n= 6626, number of events= 67
## (105 observations deleted due to missingness)
pcox2 <- coxph(Surv(tstart, tstop, status) ~ pad + sex + age10+
+ hdl + ldl, data=pdata2, id=id) # time dependent hdl and ldl
pcox2
## Call:
## coxph(formula = Surv(tstart, tstop, status) ~ pad + sex + age10 +
## +hdl + ldl, data = pdata2, id = id)
##
##
## 1:2 coef exp(coef) se(coef) robust se z p
## padPAD 0.8751 2.3991 0.2857 0.2745 3.188 0.001433
## sexmale 0.3094 1.3626 0.2775 0.2707 1.143 0.252954
## age10 0.6963 2.0064 0.1652 0.1813 3.840 0.000123
## hdl -1.5482 0.2126 0.4109 0.4657 -3.325 0.000885
## ldl -0.2797 0.7560 0.1479 0.1496 -1.869 0.061609
##
##
## 1:3 coef exp(coef) se(coef) robust se z p
## padPAD 0.7545 2.1265 0.2277 0.2256 3.345 0.000824
## sexmale 0.7847 2.1918 0.2518 0.2461 3.188 0.001432
## age10 0.6761 1.9662 0.1401 0.1409 4.797 1.61e-06
## hdl -0.1489 0.8617 0.2910 0.3901 -0.382 0.702680
## ldl 0.1338 1.1432 0.1125 0.1070 1.250 0.211251
##
## States: 1= (s0), 2= CV death, 3= nonCV death
##
## Likelihood ratio test=105.6 on 10 df, p=< 2.2e-16
## n= 6702, unique id= 1453, number of events= 158
## (29 observations deleted due to missingness)
There were 67 CVD and 91 non-CVD deaths in 5 years of follow-up, which are the transitions from the initial state (s0) to the CVD death and non-CVD death states, respectively, in the transitions table. No one transitioned from death to another state, which would indicate an error in the data set. Table displays the coefficients for the fits with baseline and time-dependent lipids, for both of the competing outcomes. The joint Cox model shows that after adjustment for age, sex, and baseline lipids, PAD is associated with nearly three fold increase in the rate of CVD deaths and double fold increase in non-CVD death. Age and sex are important predictors as well, as expected, with male sex having a somewhat larger effect on the non-CVD endpoint. The effect of baseline lipid values was not significant, however, time-dependent HDL shows an effect on the CVD death endpoint, somewhat reducing the estimated effect of PAD on CVD death.
Predicted \(P(state)\) curves can be created from the multi-state fits, as well. As with any Cox model prediction, the first task is to choose a set of covariate values for which prediction is desired. We create the predictions for four hypothetical male subjects with ages of 58 and 72 (quartiles of age), and PAD vs. control. For HDL and LDL use values of 1.3 and 3, which are near the medians. (These curves are for the model of CVD death vs other death, time-fixed hdl and ldl.)
dummy <- expand.grid(pad=c("PAD", "Control"), age10=c(5.8, 7.2), sex="male",
hdl0 =1.3, ldl0 =3)
dummy
## pad age10 sex hdl0 ldl0
## 1 PAD 5.8 male 1.3 3
## 2 Control 5.8 male 1.3 3
## 3 PAD 7.2 male 1.3 3
## 4 Control 7.2 male 1.3 3
coxpred1 <- survfit(pcox1, newdata= dummy)
dim(coxpred1)
## data states
## 4 3
coxpred1$states
## [1] "(s0)" "CV death" "nonCV death"
The resulting set of absolute risk curves can be viewed as an array
of estimates by stratum, target covariates, and state. The
pcox1
fit had no strata so the first dimension is 1. The
resulting curves are shown in Figure. Because a multi-state model can
generate a very large number of curves it is often helpful to break them
into multiple panels.
par(mfrow=c(1, 2))
plot(coxpred1[1, 1:2 ,], col=2:1, lty=c(1,1,2,2), lwd=2, xscale=365.25,
ylim=c(0, .15),
xlab="Years since enrollment", ylab="predicted P(state), age = 58")
legend("topleft", c("PAD, CVD death", "Control, CVD death",
"PAD, nonCVD death", "Control, nonCVD death"),
col=c(2,1,2,1), lty=c(1,1,2,2), lwd=2, bty='n', cex=.8)
plot(coxpred1[1, 3:4,], col=2:1, lty=c(1,1,2,2), lwd=2, xscale=365.25,
ylim=c(0, .15),
xlab="Years since enrollment", ylab="predicted P(state), age=72")
legend("topleft", c("PAD, CVD death", "Control, CVD death",
"PAD, nonCVD death", "Control, nonCVD death"),
col=c(2,1,2,1), lty=c(1,1,2,2), lwd=2, bty='n', cex=.8)
Predicted absolute risk curves from the multistate Cox model: dying of CVD (solid) and from other causes (dashed) by PAD status, for a male enrolled at age 58 (left panel) or 72 (right). The values of baseline HDL and LDL are 1.3 and 3, respectively.