1 An introduction to survival analysis in R
2 Survival Analysis in R: comparing survival curves
3 Competing risks in survival analysis (an introduction)
4 An example of time-dependent covariate : the extended KM estimator
5 Cox regression in R
6 Evaluating performance for the Cox model
- 6.1 Discrimination for the survival data : time-dependent ROC curve
- 6.2 Calibration for the Cox model
7 Examples on more advanced survival topics

1 An introduction to survival analysis in R

This practical provides an introduction to survival analysis, and to conducting a survival analysis in R.

We will cover some basics of survival analysis; the following series of tutorial papers can be helpful for additional reading:

Clark, T., Bradburn, M., Love, S., & Altman, D. (2003). Survival analysis part I: Basic concepts and first analyses. 232-238. ISSN 0007-0920.
M J Bradburn, T G Clark, S B Love, & D G Altman. (2003). Survival Analysis Part II: Multivariate data analysis - an introduction to concepts and methods. British Journal of Cancer, 89(3), 431-436.
Bradburn, M., Clark, T., Love, S., & Altman, D. (2003). Survival analysis Part III: Multivariate data analysis – choosing a model and assessing its adequacy and fit. 89(4), 605-11.
Clark, T., Bradburn, M., Love, S., & Altman, D. (2003). Survival analysis part IV: Further concepts and methods in survival analysis. 781-786. ISSN 0007-0920.

1.0.1 Packages

Some packages we’ll be using include:

lubridate
survival
survminer

library(survival)
library(survminer)
library(lubridate)

Moreover, it will be necessary install a library from a github:

# devtools::install_github("zabore/ezfun")
library(ezfun)

1.0.2 What is survival data?

Time-to-event data that consist of a distinct start time and end time.

Examples:

Time from surgery to death
Time from start of treatment to cancer progression
Time from response to cancer recurrence
Time from HIV infection to development of AIDS
Time to heart attack
Time to onset of substance abuse
Time to initiation of sexual activity
Time to machine malfunction …..

1.0.3 Aliases for survival analysis

Because survival analysis is common in many other fields, it also goes by other names

Reliability analysis
Duration analysis
Event history analysis
Time-to-event analysis

1.0.4 The lung dataset

The lung dataset is available from the survival package in R. The data contain subjects with advanced lung cancer from the North Central Cancer Treatment Group. Some variables we will use to demonstrate methods include:

time: Survival time in days
status: censoring status 1=censored, 2=dead
sex: Male=1 Female=2

1.0.5 What is censoring?

1.0.6 Types of censoring

A subject may be censored due to:

Loss to follow-up
Withdrawal from study
No event by end of fixed study period (administrative censoring)

Specifically these are examples of right censoring.

Left censoring and interval censoring are also possible, and methods exist to analyze this type of data, but in this course we will be limited to right censoring.

1.0.7 Censored survival data

In this example, how would we compute the proportion of subjects who are event-free at 10 years?

Subjects 6 and 7 were event-free at 10 years. Subjects 2,9 and 10 had the event before 10 years. Subjects 1, 3, 4, 5 and 8 were censored before 10 years, so we don’t know whether they had the event or not by 10 years - how do we incorporate these subjects into our estimate?

1.0.8 Distribution of follow-up time

Censored subjects still provide information so must be appropriately included in the analysis
Distribution of follow-up times is skewed, and may differ between censored patients and those with events
Follow-up times are always positive

1.0.9 Components of survival data

For subject \(i\):

Event time \(T_i\)
Censoring time \(C_i\)
Event indicator \(\delta_i\):
- 1 if event observed (i.e. \(T_i \leq C_i\))
- 0 if censored (i.e. \(T_i > C_i\))
Observed time \(Y_i = \min(T_i, C_i)\)

The observed times and an event indicator are provided in the lung data

time: Survival time in days (\(Y_i\))
status: censoring status 1=censored, 2=dead (\(\delta_i\))

inst	time	status	age	sex	ph.ecog	ph.karno	pat.karno	meal.cal	wt.loss
3	306	2	74	1	1	90	100	1175	NA
3	455	2	68	1	0	90	90	1225	15
3	1010	1	56	1	0	90	90	NA	15
5	210	2	57	1	1	90	60	1150	11
1	883	2	60	1	0	100	90	NA	0
12	1022	1	74	1	1	50	80	513	0

1.0.10 Dealing with dates in R

Data will often come with start and end dates rather than pre-calculated survival times. The first step is to make sure these are formatted as dates in R.

Let’s create a small example dataset with variables sx_date for surgery date and last_fup_date for the last follow-up date.

date_ex <- 
  tibble(
    sx_date = c("2007-06-22", "2004-02-13", "2010-10-27"), 
    last_fup_date = c("2017-04-15", "2018-07-04", "2016-10-31")
    )
date_ex

## # A tibble: 3 × 2
##   sx_date    last_fup_date
##   <chr>      <chr>        
## 1 2007-06-22 2017-04-15   
## 2 2004-02-13 2018-07-04   
## 3 2010-10-27 2016-10-31

We see these are both character variables, which will often be the case, but we need them to be formatted as dates.

1.0.11 Formatting dates - base R

date_ex %>% 
  mutate(
    sx_date = as.Date(sx_date, format = "%Y-%m-%d"), 
    last_fup_date = as.Date(last_fup_date, format = "%Y-%m-%d") 
    )

## # A tibble: 3 × 2
##   sx_date    last_fup_date
##   <date>     <date>       
## 1 2007-06-22 2017-04-15   
## 2 2004-02-13 2018-07-04   
## 3 2010-10-27 2016-10-31

Note that in base R the format must include the separator as well as the symbol. e.g. if your date is in format m/d/Y then you would need format = "%m/%d/%Y"
See a full list of date format symbols at https://www.statmethods.net/input/dates.html

1.0.12 Formatting dates - lubridate package

We can also use the lubridate package to format dates. In this case, use the ymd function

date_ex %>% 
  mutate(
    sx_date = ymd(sx_date), 
    last_fup_date = ymd(last_fup_date)
    )

## # A tibble: 3 × 2
##   sx_date    last_fup_date
##   <date>     <date>       
## 1 2007-06-22 2017-04-15   
## 2 2004-02-13 2018-07-04   
## 3 2010-10-27 2016-10-31

Note that unlike the base R option, the separators do not need to be specified
The help page for ?dmy will show all format options.

1.0.13 Calculating survival times - base R

Now that the dates formatted, we need to calculate the difference between start and end time in some units, usually months or years. In base R, use difftime to calculate the number of days between our two dates and convert it to a numeric value using as.numeric. Then convert to years by dividing by 365.25, the average number of days in a year.

date_ex %>% 
  mutate(
    os_yrs = 
      as.numeric(
        difftime(last_fup_date, 
                 sx_date, 
                 units = "days")) / 365.25
    )

## # A tibble: 3 × 3
##   sx_date    last_fup_date os_yrs
##   <date>     <date>         <dbl>
## 1 2007-06-22 2017-04-15      9.82
## 2 2004-02-13 2018-07-04     14.4 
## 3 2010-10-27 2016-10-31      6.01

1.0.14 Calculating survival times - lubridate

Using the lubridate package, the operator %--% designates a time interval, which is then converted to the number of elapsed seconds using as.duration and finally converted to years by dividing by dyears(1), which gives the number of seconds in a year.

date_ex %>% 
  mutate(
    os_yrs = 
      as.duration(sx_date %--% last_fup_date) / dyears(1)
    )

## # A tibble: 3 × 3
##   sx_date    last_fup_date os_yrs
##   <date>     <date>         <dbl>
## 1 2007-06-22 2017-04-15      9.82
## 2 2004-02-13 2018-07-04     14.4 
## 3 2010-10-27 2016-10-31      6.01

Note: we need to load the lubridate package using a call to library in order to be able to access the special operators (similar to situation with pipes)

1.0.15 Event indicator

For the components of survival data I mentioned the event indicator:

Event indicator \(\delta_i\):

1 if event observed (i.e. \(T_i \leq C_i\))
0 if censored (i.e. \(T_i > C_i\))

However, in R the Surv function will also accept TRUE/FALSE (TRUE = event) or 1/2 (2 = event).

In the lung data, we have:

status: censoring status 1=censored, 2=dead

1.0.16 Survival function

The probability that a subject will survive beyond any given specified time

\[S(t) = Pr(T>t) = 1 - F(t)\]

\(S(t)\): survival function \(F(t) = Pr(T \leq t)\): cumulative distribution function

In theory the survival function is smooth; in practice we observe events on a discrete time scale.

1.0.17 Survival probability

Survival probability at a certain time, \(S(t)\), is a conditional probability of surviving beyond that time, given that an individual has survived just prior to that time.
Can be estimated as the number of patients who are alive without loss to follow-up at that time, divided by the number of patients who were alive just prior to that time
The Kaplan-Meier estimate of survival probability is the product of these conditional probabilities up until that time
At time 0, the survival probability is 1, i.e. \(S(t_0) = 1\)

1.0.18 Creating survival objects

The Kaplan-Meier method is the most common way to estimate survival times and probabilities. It is a non-parametric approach that results in a step function, where there is a step down each time an event occurs.

The Surv function from the survival package creates a survival object for use as the response in a model formula. There will be one entry for each subject that is the survival time, which is followed by a + if the subject was censored. Let’s look at the first 10 observations:

Surv(lung$time, lung$status)[1:10]

##  [1]  306   455  1010+  210   883  1022+  310   361   218   166

1.0.19 Estimating survival curves with the Kaplan-Meier method

The survfit function creates survival curves based on a formula. Let’s generate the overall survival curve for the entire cohort, assign it to object f1, and look at the names of that object:

f1 <- survfit(Surv(time, status) ~ 1, data = lung)
names(f1)

##  [1] "n"         "time"      "n.risk"    "n.event"   "n.censor"  "surv"     
##  [7] "std.err"   "cumhaz"    "std.chaz"  "type"      "logse"     "conf.int" 
## [13] "conf.type" "lower"     "upper"     "t0"        "call"

Some key components of this survfit object that will be used to create survival curves include:

time, which contains the start and endpoints of each time interval
surv, which contains the survival probability corresponding to each time

1.0.20 Kaplan-Meier plot - base R

Now we plot the survfit object in base R to get the Kaplan-Meier plot.

plot(survfit(Surv(time, status) ~ 1, data = lung), 
     xlab = "Days", 
     ylab = "Overall survival probability")

The default plot in base R shows the step function (solid line) with associated confidence intervals (dotted lines)
Horizontal lines represent survival duration for the interval
An interval is terminated by an event
The height of vertical lines show the change in cumulative probability
Censored observations, indicated by tick marks, reduce the cumulative survival between intervals. (Note the tick marks for censored patients are not shown by default, but could be added using the option mark.time = TRUE)

1.0.21 Kaplan-Meier plot - ggsurvplot

Alternatively, the ggsurvplot function from the survminer package is built on ggplot2, and can be used to create Kaplan-Meier plots. Checkout the cheatsheet for the survminer package.

ggsurvplot(
    fit = survfit(Surv(time, status) ~ 1, data = lung), 
    xlab = "Days", 
    ylab = "Overall survival probability")

The default plot using ggsurvplot shows the step function (solid line) with associated confidence bands (shaded area).
The tick marks for censored patients are shown by default, somewhat obscuring the line itself in this example, and could be supressed using the option censor = FALSE

1.0.22 Estimating \(x\)-year survival

One quantity often of interest in a survival analysis is the probability of surviving beyond a certain number (\(x\)) of years.

For example, to estimate the probability of surviving to \(1\) year, use summary with the times argument (Note the time variable in the lung data is actually in days, so we need to use times = 365.25)

summary(survfit(Surv(time, status) ~ 1, data = lung), times = 365.25)

## Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##   365     65     121    0.409  0.0358        0.345        0.486

We find that the \(1\)-year probability of survival in this study is 41%.

The associated lower and upper bounds of the 95% confidence interval are also displayed.

1.0.23 \(x\)-year survival and the survival curve

The \(1\)-year survival probability is the point on the y-axis that corresponds to \(1\) year on the x-axis for the survival curve.

plot_main <- 
  ggsurvplot(
    data = lung, 
    fit = f1,
    xlab = "Months",
    legend = "none",
    xscale = 30.4,
    break.x.by = 182.4, 
    risk.table = TRUE,
    risk.table.y.text = FALSE)
plot1 <- plot_main
plot1$plot <- plot1$plot + 
  geom_segment(x = 365.25, xend = 365.25, y = -0.05, yend = 0.4092416, 
               size = 1.5) +
  geom_segment(x = 365.25, xend = -40, y = 0.4092416, yend = 0.4092416,
               size = 1.5, 
               arrow = arrow(length = unit(0.2, "inches"))) 
plot1

1.0.24 \(x\)-year survival is often estimated incorrectly

What happens if you use a “naive” estimate?

121 of the 228 patients died by \(1\) year so:

\[\Big(1 - \frac{121}{228}\Big) \times 100 = 47\%\]

You get an incorrect estimate of the \(1\)-year probability of survival when you ignore the fact that 42 patients were censored before \(1\) year.
Recall the correct estimate of the \(1\)-year probability of survival was 41%.

1.0.25 Impact on \(x\)-year survival in absence of censoring

Imagine two studies, each with 228 subjects. There are 165 deaths in each study. No censoring in one (orange line), 63 patients censored in the other (blue line)
In absence of censoring the overall survival probability is higher; instead when censoring is present, censored subjects only contribute information for part of the follow-up time, and then fall out of the risk set, thus pulling down the cumulative probability of survival.

fakedata2 <- lung %>% 
  mutate(time = ifelse(status == 2, time, 1022), 
         group = "No censoring") %>% 
  full_join(mutate(lung, group = "With censoring"))
fit3 <- survfit(Surv(time, status) ~ group, data = fakedata2)
ggsurvplot(
  data = fakedata2, 
  fit = fit3,
  xlab = "Months",
  legend = "bottom",
  legend.title = "",
  legend.labs = c("No censoring", "With censoring"),
  xscale = 30.4,
  break.x.by = 182.4, 
  risk.table = TRUE,
  risk.table.y.text = FALSE)

1.0.26 Estimating median survival time

Another quantity often of interest in a survival analysis is the average survival time, which we quantify using the median. Survival times are not expected to be normally distributed so the mean is not an appropriate summary.

We can obtain this directly from our survfit object

survfit(Surv(time, status) ~ 1, data = lung)

## Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
## 
##        n events median 0.95LCL 0.95UCL
## [1,] 228    165    310     285     363

We see the median survival time is 310 days The lower and upper bounds of the 95% confidence interval are also displayed.

1.0.27 Median survival time and the survival curve

Median survival is the time corresponding to a survival probability of \(0.5\):

plot2 <- plot_main
plot2$plot <- plot2$plot + 
  geom_segment(x = -45, xend = 310, y = 0.5, yend = 0.5,  size = 1.5) +
  geom_segment(x = 310, xend = 310, y = 0.5, yend = -0.03, size = 1.5, 
               arrow = arrow(length = unit(0.2, "inches")))
plot2

1.0.28 Median survival is often estimated incorrectly

What happens if you use a “naive” estimate?

Summarize the median survival time among the 165 patients who died

lung %>% 
  filter(status == 2) %>% 
  summarize(median_surv = median(time))

##   median_surv
## 1         226

You get an incorrect estimate of median survival time of 226 days when you ignore the fact that censored patients also contribute follow-up time.
Recall the correct estimate of median survival time is 310 days.

1.0.29 Impact on survival of ignoring (removing) censoring

Ignoring censoring when instead is present, i.e. removing pts that are censored from the dataset, creates an artificially lowered survival curve because the follow-up time that censored patients contribute is excluded (orange line)
The true survival curve for the lung data is shown in blue for comparison

fakedata <- lung %>% 
  filter(status == 2) %>% 
  mutate(group = "Ignoring censoring") %>% 
  full_join(mutate(lung, group = "With censoring"))
fit2 <- survfit(Surv(time, status) ~ group, data = fakedata)
ggsurvplot(
  data = fakedata, 
  fit = fit2,
  xlab = "Months",
  legend = "bottom",
  legend.title = "",
  legend.labs = c("Ignoring censoring", "With censoring"),
  xscale = 30.4,
  break.x.by = 182.4, 
  risk.table = TRUE, 
  risk.table.y.text = FALSE)

1.0.30 Comparison between the KM estimator and the FH estimator

We will use the “Duration of nursing home stay” data.

The National Center for Health Services Research studied 36 for-profit nursing homes to assess the effects of different financial incentives on length of stay. Treated nursing homes received higher per diems for Medicaid patients, and bonuses for improving a patient’s health and sending them home.

Study included 1601 patients admitted between May 1, 1981 and April 30, 1982.

Variables include:

los - Length of stay of a resident (in days)
age - Age of a resident
rx - Nursing home assignment (1:bonuses, 0:no bonuses)
gender - Gender (1:male, 0:female)
married - (1:married, 0:not married)
health - Health status (2:second best, 5:worst)
censor - Censoring indicator (1:censored, 0:discharged)

Say we want to obtain the Fleming-Harrington estimate of the survival function for married females, in the healthiest initial subgroup, who are randomized to the untreated group of the nursing home study.

nurshome = read.table("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/NursingHome.dat",
head=F, skip=14, col.names = c("los", "age", "rx", "gender", "married", "health", "censor"))

nurs2 = subset(nurshome, gender==0 & rx==0 & health==2 & married==1)

fit.fh = survfit(Surv(nurs2$los, 1-nurs2$censor) ~ 1,type="fleming-harrington", conf.type="log-log")
summary(fit.fh)

## Call: survfit(formula = Surv(nurs2$los, 1 - nurs2$censor) ~ 1, type = "fleming-harrington", 
##     conf.type = "log-log")
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##    14     12       1   0.9200  0.0767      0.55345        0.988
##    24     11       1   0.8401  0.1036      0.49777        0.957
##    25     10       1   0.7601  0.1207      0.42614        0.916
##    38      9       1   0.6802  0.1318      0.35609        0.866
##    64      8       1   0.6003  0.1384      0.29015        0.810
##    89      7       1   0.5204  0.1412      0.22897        0.749
##   113      6       1   0.4405  0.1402      0.17290        0.682
##   123      5       1   0.3606  0.1356      0.12238        0.609
##   149      4       1   0.2809  0.1268      0.07814        0.531
##   168      3       1   0.2012  0.1129      0.04142        0.446
##   185      2       1   0.1221  0.0917      0.01445        0.352
##   234      1       1   0.0449  0.0562      0.00107        0.245

Compare with KM:

fit.km = survfit(Surv(nurs2$los, 1-nurs2$censor) ~ 1,type="kaplan-meier", conf.type="log-log")
summary(fit.km)

## Call: survfit(formula = Surv(nurs2$los, 1 - nurs2$censor) ~ 1, type = "kaplan-meier", 
##     conf.type = "log-log")
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##    14     12       1   0.9167  0.0798      0.53898        0.988
##    24     11       1   0.8333  0.1076      0.48171        0.956
##    25     10       1   0.7500  0.1250      0.40842        0.912
##    38      9       1   0.6667  0.1361      0.33702        0.860
##    64      8       1   0.5833  0.1423      0.27014        0.801
##    89      7       1   0.5000  0.1443      0.20848        0.736
##   113      6       1   0.4167  0.1423      0.15247        0.665
##   123      5       1   0.3333  0.1361      0.10270        0.588
##   149      4       1   0.2500  0.1250      0.06014        0.505
##   168      3       1   0.1667  0.1076      0.02651        0.413
##   185      2       1   0.0833  0.0798      0.00505        0.311
##   234      1       1   0.0000     NaN           NA           NA

Compare plots:

plot(fit.fh$time, fit.fh$surv,type="n", xlab="Follow up", ylab="Survival curve")
points(fit.fh$time, fit.fh$surv, col="blue", pch=17)
points(fit.km$time,fit.km$surv, col="red", pch=14)
lines(fit.fh$time,fit.fh$surv, col="blue", lwd=2)
lines(fit.km$time, fit.km$surv,col="red", lwd=2)
legend("topright", legend=c("FH", "KM"), col=c("blue", "red"), pch=c(17,14), lty=c(1,1), lwd=c(2,2), bty="n")

Generally, the Fleming Harrington estimator is slightly higher than the KM at every time point, but with larger datasets the two will typically be much closer.

2 Survival Analysis in R: comparing survival curves

Load libraries:

library(survival)
library(MASS)
library(rms)
library(ISwR)
library(survMisc)
library(survminer)
library(KMsurv)

2.1 Recap of the basic steps

Loading and viewing dataset melanom:

dati <- melanom
head(dati)

##    no status days ulc thick sex
## 1 789      3   10   1   676   2
## 2  13      3   30   2    65   2
## 3  97      2   35   2   134   2
## 4  16      3   99   2   290   1
## 5  21      1  185   1  1208   2
## 6 469      1  204   1   484   2

attach(dati)

Create the survival object:

msurv <- with(dati, Surv(days, status == 1))
msurv

##   [1]   10+   30+   35+   99+  185   204   210   232   232+  279   295   355+
##  [13]  386   426   469   493+  529   621   629   659   667   718   752   779 
##  [25]  793   817   826+  833   858   869   872   967   977   982  1041  1055 
##  [37] 1062  1075  1156  1228  1252  1271  1312  1427+ 1435  1499+ 1506  1508+
##  [49] 1510+ 1512+ 1516  1525+ 1542+ 1548  1557+ 1560  1563+ 1584  1605+ 1621 
##  [61] 1627+ 1634+ 1641+ 1641+ 1648+ 1652+ 1654+ 1654+ 1667  1678+ 1685+ 1690 
##  [73] 1710+ 1710+ 1726  1745+ 1762+ 1779+ 1787+ 1787+ 1793+ 1804+ 1812+ 1836+
##  [85] 1839+ 1839+ 1854+ 1856+ 1860+ 1864+ 1899+ 1914+ 1919+ 1920+ 1927+ 1933 
##  [97] 1942+ 1955+ 1956+ 1958+ 1963+ 1970+ 2005+ 2007+ 2011+ 2024+ 2028+ 2038+
## [109] 2056+ 2059+ 2061  2062  2075+ 2085+ 2102+ 2103  2104+ 2108  2112+ 2150+
## [121] 2156+ 2165+ 2209+ 2227+ 2227+ 2256  2264+ 2339+ 2361+ 2387+ 2388  2403+
## [133] 2426+ 2426+ 2431+ 2460+ 2467  2492+ 2493+ 2521+ 2542+ 2559+ 2565  2570+
## [145] 2660+ 2666+ 2676+ 2738+ 2782  2787+ 2984+ 3032+ 3040+ 3042  3067+ 3079+
## [157] 3101+ 3144+ 3152+ 3154+ 3180+ 3182+ 3185+ 3199+ 3228+ 3229+ 3278+ 3297+
## [169] 3328+ 3330+ 3338  3383+ 3384+ 3385+ 3388+ 3402+ 3441+ 3458+ 3459+ 3459+
## [181] 3476+ 3523+ 3667+ 3695+ 3695+ 3776+ 3776+ 3830+ 3856+ 3872+ 3909+ 3968+
## [193] 4001+ 4103+ 4119+ 4124+ 4207+ 4310+ 4390+ 4479+ 4492+ 4668+ 4688+ 4926+
## [205] 5565+

Histogram of the follow up (days) and descriptive statistics:

hist(days)

summary(days)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      10    1525    2005    2153    3042    5565

Estimate the KM curve:

mfit <- survfit(Surv(days, status==1)~1, data = dati)
mfit

## Call: survfit(formula = Surv(days, status == 1) ~ 1, data = dati)
## 
##        n events median 0.95LCL 0.95UCL
## [1,] 205     57     NA      NA      NA

Output of the KM estimate at different time points:

summary(mfit, times=c(1000,3000,4000))

## Call: survfit(formula = Surv(days, status == 1) ~ 1, data = dati)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##  1000    171      26    0.869  0.0240        0.823        0.917
##  3000     54      29    0.677  0.0385        0.605        0.757
##  4000     13       2    0.645  0.0431        0.566        0.735

Graphical representation of the curve:

plot(mfit)
abline(h=0.50, col="red",lty=2)

In this case the median survival time is not even interesting because the estimate is infinite.

The survival curve does not cross the 50% mark before all patients are censored.

If you want to see also censored times in the plot:

plot(mfit,mark.time=T)
abline(h=0.50, col="red",lty=2)

if you want to see also the number at risk at different time points:

ggsurvplot(mfit, data=dati, risk.table = TRUE)

Remember that estimate of the curve is reliable until the risk set has a sufficient sample size.
As a rule of thumb: the curve can become unreliable when the number of patients at risk is less than 15. The width of confidence intervals will help to decide.

2.2 Estimate survival curves by gender:

surv.bysex <- survfit(Surv(days,status==1)~sex)

Plot the curves:

plot(surv.bysex, conf.int=T,col=c("red","black"))

Other types of graphical representations:

ggsurvplot(surv.bysex, data=dati, risk.table = TRUE)

ggsurvplot(surv.bysex, data=dati, risk.table ="abs_pct")

2.3 Comparing survival curves : the log-rank test

Log-rank test to see if there is a significant difference between males and females: the survdiff function implements different tests specified by a parameter called rho, allowing a non-proportional hazards alternative to the null hypothesis, but the default value of rho gives the log-rank test.

survdiff(Surv(days,status==1)~sex)

## Call:
## survdiff(formula = Surv(days, status == 1) ~ sex)
## 
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=1 126       28     37.1      2.25      6.47
## sex=2  79       29     19.9      4.21      6.47
## 
##  Chisq= 6.5  on 1 degrees of freedom, p= 0.01

We conclude that there is a significant difference in survival between males and females.

2.4 Stratified log-rank test using presence of ulcer as strata:

surv.bysex.ulc <- survdiff(Surv(days,status==1)~sex+strata(ulc))
surv.bysex.ulc

## Call:
## survdiff(formula = Surv(days, status == 1) ~ sex + strata(ulc))
## 
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=1 126       28     34.7      1.28      3.31
## sex=2  79       29     22.3      1.99      3.31
## 
##  Chisq= 3.3  on 1 degrees of freedom, p= 0.07

To make separate graphs:

index.ulc1 <- ulc==1
index.ulc2 <- ulc==2

surv.bysex1 <- survfit(Surv(days,status==1)~sex, data=dati[index.ulc1,])
surv.bysex2 <- survfit(Surv(days,status==1)~sex, data=dati[index.ulc2,])


par(pty=("s"), mfrow=c(1,2))
plot(surv.bysex1, conf.int=T,col=c("red","black"))
title("Presence of ulcer")
plot(surv.bysex2, conf.int=T,col=c("red","black"))
title("Absence of ulcer")

Stratification makes the effect of sex less significant. A possible explanation might be that males seek treatment when the disease is in a more advanced state than women do, so that the gender difference is reduced when adjusted for a measure of disease progression.

2.5 Extension for more than two groups

Example of log rank with more than 2 groups:

data(myeloma)
fit <- survfit(Surv(time, event) ~ molecular_group, data = myeloma)
ggsurvplot(fit, legend.labs = levels(myeloma$molecular_group),
           pval = TRUE)

This plot shows the global p value across groups. Let’s see the differences between pairs (adjusting for multiple comparisons):

res <- pairwise_survdiff(Surv(time, event) ~ molecular_group,
                         data = myeloma)
res

## 
##  Pairwise comparisons using Log-Rank test 
## 
## data:  myeloma and molecular_group 
## 
##                  Cyclin D-1 Cyclin D-2 Hyperdiploid Low bone disease MAF  
## Cyclin D-2       0.723      -          -            -                -    
## Hyperdiploid     0.943      0.723      -            -                -    
## Low bone disease 0.723      0.988      0.644        -                -    
## MAF              0.644      0.447      0.523        0.485            -    
## MMSET            0.328      0.103      0.103        0.103            0.723
## Proliferation    0.103      0.038      0.038        0.062            0.485
##                  MMSET
## Cyclin D-2       -    
## Hyperdiploid     -    
## Low bone disease -    
## MAF              -    
## MMSET            -    
## Proliferation    0.527
## 
## P value adjustment method: BH

Proliferation vs Cyclin D-2 and vs Hyperdiploid show significant differences in survival curves.No others significant differences are detected between others groups.

3 Competing risks in survival analysis (an introduction)

Pneumonia is a very common nosocomial infection in intensive care units (ICUs). Many studies have investigated risk factors for the development of infection and its consequences. We now explore a dataset from a prospective cohort study, with 747 patients admitted in hospital, and 97 had pneumonia at admission.
They were followed until they die in the ICU or they were discharged alive. 657 were discharged alive, 76 died in ICU and 14 were censored (i.e. we do not know if they die in hospital or they were discharged alive: imagine for example that these patients have been transferred to another hospital). The variable time in the dataset is the length of hospital stay in days. The variable status is coded as 0 for censoring, 1 for discharged alive, 2 for death. The variable pneu indicates if the patient had pneumonia at admission.

Load the required libraries:

library(mvna)
library(survival)

Load the data:

data(sir.adm)
head(sir.adm)

##     id pneu status time      age sex
## 1   41    0      1    4 75.34153   F
## 2  395    0      1   24 19.17380   M
## 3  710    1      1   37 61.56568   M
## 4 3138    0      1    8 57.88038   F
## 5 3154    0      1    3 39.00639   M
## 6 3178    0      1   24 70.27762   M

table(sir.adm$pneu)

## 
##   0   1 
## 650  97

table(sir.adm$status)

## 
##   0   1   2 
##  14 657  76

If the status is 1 or 2 we have an uncensored observation in this example; if it is 0 is censored.

The first scientific question is : does pneumonia increase the risk of in-hospital mortality?

Let’see the raw data:

table(sir.adm$pneu,sir.adm$status)

##    
##       0   1   2
##   0   6 589  55
##   1   8  68  21

my.table <- table(sir.adm$pneu,sir.adm$status)
round(prop.table(my.table,1),3)

##    
##         0     1     2
##   0 0.009 0.906 0.085
##   1 0.082 0.701 0.216

It seems that pneumonia patients have more chances to die in hospital.

3.1 Exploring length of stay stratified by pneumonia (as if no competing risks exist)

The second scientific question is : does pneumonia increase the risk of staying in the hospital (i.e. increase the length of hospital stay) ?

Now, first of all let’s explore the survival functions (in terms of length of stay) for a composite end-point, i.e. putting together in-hospital death or to be discharged alive (this composite end-point has no really a clinical meaning, it is just analysed here for didactic purposes):

plot(survfit(Surv(time, status != 0) ~ pneu, data = sir.adm), col = 1:2)
legend(100, 1, legend = c("no pneu", "pneu"), col = 1:2, lty = 1)

Pneumonia group shows longer hospital stays.

We can also obtain the corresponding cumulative hazard functions by using the Nelson-Aalen estimator:

plot(survfit(Surv(time, status != 0) ~ pneu, data = sir.adm), fun = "cumhaz", col = 1:2)
legend(0, 6.5, legend = c("no pneu", "pneu"), col = 1:2, lty = 1)

Let’s now quantify the effect of pneumonia on length of stay by means of a Cox model on the composite end-point:

cox <- coxph(Surv(time, status != 0) ~ pneu, data = sir.adm)
summary(cox)

## Call:
## coxph(formula = Surv(time, status != 0) ~ pneu, data = sir.adm)
## 
##   n= 747, number of events= 733 
## 
##         coef exp(coef) se(coef)      z Pr(>|z|)    
## pneu -0.9481    0.3875   0.1152 -8.227   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      exp(coef) exp(-coef) lower .95 upper .95
## pneu    0.3875      2.581    0.3092    0.4857
## 
## Concordance= 0.575  (se = 0.008 )
## Likelihood ratio test= 83.89  on 1 df,   p=<2e-16
## Wald test            = 67.69  on 1 df,   p=<2e-16
## Score (logrank) test = 72.39  on 1 df,   p=<2e-16

Pneumonia multiplies the hazard by 0.39, i.e. it is a significant factor in increasing the stay in the hospital (by decreasing the instantaneous risk to leave the hospital).

Now, we split the dataset in two subgroups according to the presence of pneumonia and we re-obtain an estimate of the cumulative hazard, this time based on the Cox model estimation (this is called the Breslow estimator of the cumulative hazard: remind that the Cox model is composed by two parts, the effect of the covariates and an estimate of the baseline hazard).

To do so, we have to extract the estimated cumulative hazard from the Cox model on the data of two patient, one with pneumonia and the other without (that are the values of cumulative hazard estimated in the Cox model for the two corresponding groups):

survfit(cox, newdata = sir.adm[1,])  -> cox.nop     
survfit(cox, newdata = sir.adm[3,])  -> cox.p

We can now compare the non-parametric Nelson-Aalen estimator previously obtained with this estimator obtained from the Cox model:

plot(survfit(Surv(time, status != 0) ~ pneu, data = sir.adm), fun = "cumhaz", col = 1:2)
legend(0, 6.5, legend = c("no pneu", "pneu"), col = 1:2, lty = 1) 
lines(cox.nop$time, cox.nop$cumhaz, type = "s", lty = 2)   
lines(cox.p$time, cox.p$cumhaz, type = "s", lty = 2, col = 2)

They are quite close, and this is also a sort of goodness of fit plot for the cox model.Remind that here there are not competing risk, we are using the composite end-point.

3.2 Competing risks : the cumulative incidence function

For the 76 patients died in-hospital we will never observe their time to be discharged alive.

Conversely, for the 657 subjects discharged alive, we will not observe in this study their time to death.

From the raw data we saw that for the pneumonia group the chances to be discharged alive are smaller.

We also have seen that patients with pneumonia has more chance to stay longer in the hospital.

But, now the question is : patients with pneumonia have a different cause-specific hazard of death ?

Let’s now first of all estimate the cumulative incidence functions for these two distinct type of events in the two groups, taking into account the competing risks (Aalen-Johansen estimator):

library(cmprsk)

my.sir.cif <- cuminc(sir.adm$time, sir.adm$status, group = sir.adm$pneu, cencode =0)
my.sir.cif

## Tests:
##       stat           pv df
## 1 42.07480 8.784784e-11  1
## 2 16.16525 5.804939e-05  1
## Estimates and Variances:
## $est
##             50        100        150
## 0 1 0.88998816 0.91100110 0.91261748
## 1 1 0.64633198 0.73293123         NA
## 0 2 0.07930062 0.08576614 0.08576614
## 1 2 0.18393349 0.22550113         NA
## 
## $var
##               50          100          150
## 0 1 0.0001517978 0.0001266104 0.0001250158
## 1 1 0.0025299050 0.0022581560           NA
## 0 2 0.0001125301 0.0001222882 0.0001222882
## 1 2 0.0016495598 0.0020723886           NA

We can also visualize the estimated CIFs:

plot(my.sir.cif, main = "CIFs", xlab = "Days", curvlab=c("No pneumo alive", "Pneumo alive", "No pneumo Death", "Pneumo Death"), ylim=c(0,1.4))

And we can also calculate a test statistic to compare the CIFs:

my.sir.cif$Tests

##       stat           pv df
## 1 42.07480 8.784784e-11  1
## 2 16.16525 5.804939e-05  1

From the test results, a significant difference is observed in both CIFs with respect to the presence/absence of pneumonia for the two types of events.

Note that these functions are quite different from the corresponding 1-KM curves: 1-KM over-estimate the incidence of the event of interest.

sir.adm$status_1_KM <- ifelse(sir.adm$status==1, 1, 0)
sir.adm$status_2_KM <- ifelse(sir.adm$status==2, 1, 0)
table(sir.adm$status_1_KM)

## 
##   0   1 
##  90 657

table(sir.adm$status_2_KM)

## 
##   0   1 
## 671  76

KM_1 <- survfit(Surv(time, status_1_KM) ~ pneu, data = sir.adm)
KM_2 <- survfit(Surv(time, status_2_KM) ~ pneu, data = sir.adm)



fup01 <- my.sir.cif$"0 1"$time
est01 <- my.sir.cif$"0 1"$est


fup11 <- my.sir.cif$"1 1"$time
est11 <- my.sir.cif$"1 1"$est


fup02 <- my.sir.cif$"0 2"$time
est02 <- my.sir.cif$"0 2"$est


fup12 <- my.sir.cif$"1 2"$time
est12 <- my.sir.cif$"1 2"$est


plot(KM_1, fun = "event", col = 1:2)
lines(fup01,est01, col="black", lty=2, lwd=2)
lines(fup11,est11, col="red", lty=2, lwd=2)

plot(KM_2, fun = "event", col = 1:2)
lines(fup02,est02, col="black", lty=2, lwd=2)
lines(fup12,est12, col="red", lty=2, lwd=2)

3.3 The cause-specific hazard rate regression approach when there are competing risks

Let’s now estimate two cause-specific Cox models in the general population:

summary(coxph(Surv(time,status==1)~pneu,data=sir.adm))

## Call:
## coxph(formula = Surv(time, status == 1) ~ pneu, data = sir.adm)
## 
##   n= 747, number of events= 657 
## 
##         coef exp(coef) se(coef)      z Pr(>|z|)    
## pneu -1.0901    0.3362   0.1299 -8.391   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##      exp(coef) exp(-coef) lower .95 upper .95
## pneu    0.3362      2.974    0.2606    0.4337
## 
## Concordance= 0.577  (se = 0.008 )
## Likelihood ratio test= 91.7  on 1 df,   p=<2e-16
## Wald test            = 70.4  on 1 df,   p=<2e-16
## Score (logrank) test = 77.09  on 1 df,   p=<2e-16

Pneumonia multiply the hazard to be discharged alive by a factor of 0.34. It is significant.

summary(coxph(Surv(time,status==2)~pneu,data=sir.adm))

## Call:
## coxph(formula = Surv(time, status == 2) ~ pneu, data = sir.adm)
## 
##   n= 747, number of events= 76 
## 
##         coef exp(coef) se(coef)      z Pr(>|z|)
## pneu -0.1622    0.8503   0.2678 -0.606    0.545
## 
##      exp(coef) exp(-coef) lower .95 upper .95
## pneu    0.8503      1.176     0.503     1.437
## 
## Concordance= 0.531  (se = 0.023 )
## Likelihood ratio test= 0.37  on 1 df,   p=0.5
## Wald test            = 0.37  on 1 df,   p=0.5
## Score (logrank) test = 0.37  on 1 df,   p=0.5

Pneumonia multiply the hazard to die in the hospital by a factor of 0.85, but with a 95% CI from 0.50 to 1.43. It is not significant.Therefore having pneumonia does not significantly increase or decrease the instantaneous risk of death.

Note that if we want now to obtain the estimated cumulative incidence functions, related to the Cox models, we should switch to a multi-state approach:

sir.adm$eventMM <- factor(sir.adm$status, 0:2, labels=c("censor", "discharged alive", "death"))
table(sir.adm$eventMM)

## 
##           censor discharged alive            death 
##               14              657               76

fitMM <- coxph(Surv(time, eventMM) ~ pneu, sir.adm, id=id)
print(fitMM, digits=2)

## Call:
## coxph(formula = Surv(time, eventMM) ~ pneu, data = sir.adm, id = id)
## 
##       
## 1:2     coef exp(coef) se(coef) robust se   z      p
##   pneu -1.06      0.35     0.13      0.11 -10 <2e-16
## 
##       
## 1:3     coef exp(coef) se(coef) robust se    z   p
##   pneu -0.16      0.85     0.27      0.25 -0.6 0.5
## 
##  States:  1= (s0), 2= discharged alive, 3= death 
## 
## Likelihood ratio test=87  on 2 df, p=<2e-16
## n= 747, unique id= 747, number of events= 733

dummy <- data.frame(pneu=c(0,1))
csurv <- survfit(fitMM, newdata=dummy)

plot(csurv[,'discharged alive'], ylab="CIF (from Cox)", col=1:2, lty=c(1,1), lwd=2)
lines(csurv[,'death'], col=1:2, lty=c(2,2), lwd=2)

3.4 Supplementary material: the subdistribution hazard regression approach when there are competing risks

If instead we use the subdistribution hazards regression approach (Fine and Gray, just mentioned in the slides):

crr(ftime=sir.adm$time,fstatus=sir.adm$status,cov1=sir.adm$pneu,failcode=1,cencode=0) -> res.phsd1
res.phsd1

## convergence:  TRUE 
## coefficients:
## sir.adm$pneu1 
##       -0.9069 
## standard errors:
## [1] 0.1033
## two-sided p-values:
## sir.adm$pneu1 
##             0

exp(res.phsd1$coef)

## sir.adm$pneu1 
##     0.4037637

Pneumonia multiply the subdistribution hazard to be discharged alive by a factor of 0.40. It is significant.

crr(ftime=sir.adm$time,fstatus=sir.adm$status,cov1=sir.adm$pneu,failcode=2,cencode=0) -> res.phsd2
res.phsd2

## convergence:  TRUE 
## coefficients:
## sir.adm$pneu1 
##        0.9749 
## standard errors:
## [1] 0.2496
## two-sided p-values:
## sir.adm$pneu1 
##       9.4e-05

exp(res.phsd2$coef)

## sir.adm$pneu1 
##      2.650951

Pneumonia multiply the subdistribution hazard to die in the hospital by a factor of 2.65. It is significant.

What does this means ? Why the two regression models give us a different picture in terms of hazards ?

This difference indicate that since pneumonia patients stay longer in the hospital, if we take into account the cumulative hazard scale, their risk of death is increased with respect to non-pneumonia patients. But, their instantaneous cause-specific risk of death at each time t, is not different from non-pneumonia patients. We observe them for longer time in hospital and this carries out an increased proportions of deaths in-hospital.

4 An example of time-dependent covariate : the extended KM estimator

The following dataset come from our paper: https://pubmed.ncbi.nlm.nih.gov/30131230/ discussed during the lesson.

Load the required library:

library(survival)

Load the data:

load(here("datasets","dati_iniziali.Rdata"))

We are interested here in evaluating the impact of HF progression on mortality. Remember that HF progression is a time-dependent variable measured along the follow up (variable: quando_progr).

Try to adopt the standard KM approach:

sfitkm <- survfit(Surv(fup_decesso_cens, morto)~progressione,data=dati_iniziali)

plot(sfitkm,lty=c(2,1),mark.time=FALSE, col="black", fun="event", ylab="Cumulative Incidence of Death", 
     xlab="Follow up (months)", xlim=c(0,48), xaxp=c(0,48,4),ylim=c(0, 0.6))
legend(0,0.55, legend=c("HF progression", "Others"), lty=c(1,2), bty="n")
title("Standard 1-KM")

It seems that patients that experience HF progression are at a lower risk of mortality… Immortal time bias !!!

Let’s now put the dataset in the counting format (we do not covered this in the slides, see for details : https://rpubs.com/SteadyStoffel/surv_counting_process1)

attach(dati_iniziali)
data_progr_0 <-dati_iniziali[which(progressione==0),]
rec          <-rep(0,nrow(data_progr_0))
inizio       <-rep(0,nrow(data_progr_0))
fine         <-data_progr_0$fup_decesso_cens
deaths       <-data_progr_0$morto
data_progr_1   <-dati_iniziali[which(progressione==1),]
rec            <-c(rec,rep(0,nrow(data_progr_1)))
inizio         <-c(inizio,rep(0,nrow(data_progr_1)))
fine           <-c(fine,data_progr_1$quando_progr)
deaths         <-c(deaths,rep(0,nrow(data_progr_1)))
rec         <-c(rec,rep(1,nrow(data_progr_1)))
inizio      <-c(inizio,data_progr_1$quando_progr)
fine        <-c(fine,data_progr_1$fup_decesso_cens)
deaths      <-c(deaths,data_progr_1$morto)

id         <-c(data_progr_0$KEY_ANAGRAFE,data_progr_1$KEY_ANAGRAFE,data_progr_1$KEY_ANAGRAFE)
lvef35fin  <-c(data_progr_0$LVEF_35,data_progr_1$LVEF_35,data_progr_1$LVEF_35)
dataset_analisi   <-data.frame(id,lvef35fin, rec,inizio,fine,deaths)
detach(dati_iniziali)
table(rec,deaths)

##    deaths
## rec    0    1
##   0 1755  773
##   1  305  208

table(dataset_analisi$rec)

## 
##    0    1 
## 2528  513

In this format, we can apply the extended KM estimator:

sfit <- survfit(Surv(inizio, fine, deaths)~rec,data=dataset_analisi)
sfit

## Call: survfit(formula = Surv(inizio, fine, deaths) ~ rec, data = dataset_analisi)
## 
##       records n.max n.start events median 0.95LCL 0.95UCL
## rec=0    2528  2528    2528    773     81      78      NA
## rec=1     513   224      37    208     45      37      54

plot(sfit,lty=c(2,1),mark.time=FALSE, col="black", fun="event", ylab="Cumulative Incidence of Death", 
     xlab="Follow up (months)", xlim=c(0,48), xaxp=c(0,48,4),ylim=c(0, 0.6))
legend(0,0.55, legend=c("HF progression", "Others"), lty=c(1,2), bty="n")
title("Extended 1-KM")

Here we can not apply the log-rank test to compare the curves, but we still can estimate the hazard ratio of the time-dependent HF progression variable:

coxprogr <- coxph(Surv(inizio, fine, deaths)~rec,data=dataset_analisi)
summary(coxprogr)

## Call:
## coxph(formula = Surv(inizio, fine, deaths) ~ rec, data = dataset_analisi)
## 
##   n= 3041, number of events= 981 
## 
##        coef exp(coef) se(coef)     z Pr(>|z|)    
## rec 0.66255   1.93973  0.08176 8.104 5.32e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##     exp(coef) exp(-coef) lower .95 upper .95
## rec      1.94     0.5155     1.653     2.277
## 
## Concordance= 0.537  (se = 0.006 )
## Likelihood ratio test= 58.74  on 1 df,   p=2e-14
## Wald test            = 65.67  on 1 df,   p=5e-16
## Score (logrank) test = 67.87  on 1 df,   p=<2e-16

5 Cox regression in R

Load the required libraries:

library(knitr)
library(tidyverse)
library(lubridate)
library(survival)
library(survminer)
library(gtsummary)
library(ISwR)
library(KMsurv)
library(pmsampsize)

5.1 The Cox regression model

We may want to quantify an effect size for a single variable, or include more than one variable into a regression model to account for the effects of multiple variables on the hazard function.

The Cox regression model is a semi-parametric model that can be used to fit univariable and multivariable regression models that have survival outcomes.

\[h(t|X_i) = h_0(t) \exp(\beta_1 X_{i1} + \cdots + \beta_p X_{ip})\]

\(h(t)\): hazard, or the instantaneous rate at which events occur \(h_0(t)\): underlying baseline hazard

Some key assumptions of the model:

non-informative censoring
proportional hazards

Note: parametric regression models for survival outcomes are also available, but they won’t be addressed in this course.

We can fit regression models for survival data using the coxph function, which takes a Surv object on the left hand side and has standard syntax for regression formulas in R on the right hand side.

5.2 The lung dataset

The lung dataset contain subjects with advanced lung cancer from the North Central Cancer Treatment Group. Some variables we will use to demonstrate methods include:

time: Survival time in days
status: censoring status 1=censored, 2=dead
sex: Male=1 Female=2

data("cancer", package="survival")
str(lung)

## 'data.frame':    228 obs. of  10 variables:
##  $ inst     : num  3 3 3 5 1 12 7 11 1 7 ...
##  $ time     : num  306 455 1010 210 883 ...
##  $ status   : num  2 2 1 2 2 1 2 2 2 2 ...
##  $ age      : num  74 68 56 57 60 74 68 71 53 61 ...
##  $ sex      : num  1 1 1 1 1 1 2 2 1 1 ...
##  $ ph.ecog  : num  1 0 0 1 0 1 2 2 1 2 ...
##  $ ph.karno : num  90 90 90 90 100 50 70 60 70 70 ...
##  $ pat.karno: num  100 90 90 60 90 80 60 80 80 70 ...
##  $ meal.cal : num  1175 1225 NA 1150 NA ...
##  $ wt.loss  : num  NA 15 15 11 0 0 10 1 16 34 ...

head(lung)

##   inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1    3  306      2  74   1       1       90       100     1175      NA
## 2    3  455      2  68   1       0       90        90     1225      15
## 3    3 1010      1  56   1       0       90        90       NA      15
## 4    5  210      2  57   1       1       90        60     1150      11
## 5    1  883      2  60   1       0      100        90       NA       0
## 6   12 1022      1  74   1       1       50        80      513       0

coxph(Surv(time, status) ~ sex, data = lung)

## Call:
## coxph(formula = Surv(time, status) ~ sex, data = lung)
## 
##        coef exp(coef) se(coef)      z       p
## sex -0.5310    0.5880   0.1672 -3.176 0.00149
## 
## Likelihood ratio test=10.63  on 1 df, p=0.001111
## n= 228, number of events= 165

5.3 Formatting Cox regression results

We can see a tidy version of the output using the tidy function from the broom package:

broom::tidy(
  coxph(Surv(time, status) ~ sex, data = lung), 
  exp = TRUE
  ) %>% 
  kable()

term	estimate	std.error	statistic	p.value
sex	0.5880028	0.1671786	-3.176385	0.0014912

Or use tbl_regression from the gtsummary package

coxph(Surv(time, status) ~ sex, data = lung) %>% 
  gtsummary::tbl_regression(exp = TRUE)

Characteristic	HR	95% CI	p-value
sex	0.59	0.42, 0.82	0.001
Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

5.4 Hazard ratios

The quantity of interest from a Cox regression model is a hazard ratio (HR). The HR represents the ratio of hazards between two groups at any given point in time.
The HR is interpreted as the instantaneous rate of occurrence of the event of interest in those who are still at risk for the event. It is not a risk, though it is commonly interpreted as such.
If you have a regression parameter \(\beta\) (from column estimate in our coxph) then HR = \(\exp(\beta)\).
A HR < 1 indicates reduced hazard of death whereas a HR > 1 indicates an increased hazard of death.
So our HR = 0.59 implies that around 0.6 times as many females are dying as males, at any given time.

Here is a plot of the cumulative hazard functions by sex:

5.5 Assessing proportional hazards (I)

One assumption of the Cox proportional hazards regression model is that the hazards are proportional at each point in time throughout follow-up. How can we check to see if our data meet this assumption?

Use the cox.zph function from the survival package. It results in two main things:

A hypothesis test of whether the effect of each covariate differs according to time, and a global test of all covariates at once.
- This is done by testing for an interaction effect between the covariate and log(time)
- A significant p-value indicates that the proportional hazards assumption is violated
Plots of the Schoenfeld residuals
- Deviation from a zero-slope line is evidence that the proportional hazards assumption is violated

mv_fit <- coxph(Surv(time, status) ~ sex + age, data = lung)
cz <- cox.zph(mv_fit)
print(cz)

##        chisq df    p
## sex    2.608  1 0.11
## age    0.209  1 0.65
## GLOBAL 2.771  2 0.25

plot(cz)

In this example, the PH assumption could be considered valid for both covariates.

5.6 Another example of Cox PH Regression

Now, on the melanoma dataset, consider a Cox model with the single regressor sex:

Loading and viewing dataset melanom:

data("melanom", package="ISwR")
str(melanom)

## 'data.frame':    205 obs. of  6 variables:
##  $ no    : int  789 13 97 16 21 469 685 7 932 944 ...
##  $ status: int  3 3 2 3 1 1 1 1 3 1 ...
##  $ days  : int  10 30 35 99 185 204 210 232 232 279 ...
##  $ ulc   : int  1 2 2 2 1 1 1 1 1 1 ...
##  $ thick : int  676 65 134 290 1208 484 516 1288 322 741 ...
##  $ sex   : int  2 2 2 1 2 2 2 2 1 1 ...

head(melanom)

##    no status days ulc thick sex
## 1 789      3   10   1   676   2
## 2  13      3   30   2    65   2
## 3  97      2   35   2   134   2
## 4  16      3   99   2   290   1
## 5  21      1  185   1  1208   2
## 6 469      1  204   1   484   2

attach(melanom)

## The following objects are masked from dati:
## 
##     days, no, sex, status, thick, ulc

Create the survival object:

mod.sex <- coxph(Surv(days,status==1)~sex)
summary(mod.sex)

## Call:
## coxph(formula = Surv(days, status == 1) ~ sex)
## 
##   n= 205, number of events= 57 
## 
##       coef exp(coef) se(coef)     z Pr(>|z|)  
## sex 0.6622    1.9390   0.2651 2.498   0.0125 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##     exp(coef) exp(-coef) lower .95 upper .95
## sex     1.939     0.5157     1.153      3.26
## 
## Concordance= 0.59  (se = 0.034 )
## Likelihood ratio test= 6.15  on 1 df,   p=0.01
## Wald test            = 6.24  on 1 df,   p=0.01
## Score (logrank) test = 6.47  on 1 df,   p=0.01

Males (=2) have an hazard of death for melanoma that is nearly twice than women (=1).

A more elaborate example, involving also a continuous covariate:

mod.cov <- coxph(Surv(days,status==1)~sex+log(thick))
summary(mod.cov)

## Call:
## coxph(formula = Surv(days, status == 1) ~ sex + log(thick))
## 
##   n= 205, number of events= 57 
## 
##              coef exp(coef) se(coef)     z Pr(>|z|)    
## sex        0.4580    1.5809   0.2687 1.705   0.0883 .  
## log(thick) 0.7809    2.1834   0.1573 4.963 6.94e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##            exp(coef) exp(-coef) lower .95 upper .95
## sex            1.581     0.6326    0.9337     2.677
## log(thick)     2.183     0.4580    1.6040     2.972
## 
## Concordance= 0.749  (se = 0.033 )
## Likelihood ratio test= 33.45  on 2 df,   p=5e-08
## Wald test            = 31  on 2 df,   p=2e-07
## Score (logrank) test = 32.52  on 2 df,   p=9e-08

Remember that ‘thick’ is the tumor thickness; we use logarithm since the distribution is asymmetric. Each 1 point change in log(thick) is associated with a 2.2-fold increase in a patient’s risk.

5.7 Verifying the PH Assumption (II)

Question: adjusting for log(thick), does the effect of gender follow a proportional hazards model? If the PH assumption holds, the log cumulative hazards for the two groups, adjusting for log(thick), should be roughly parallel:

fit1 <- coxph( Surv(days,status==1) ~ log(thick)+ strata(sex),data=melanom)
plot(survfit(fit1), lty=c(1,1), lwd=c(2,2), col=c("pink", "blue"), fun="cloglog",xlab="Days",ylab="Log-Cumulative Hazard Function", xlim=c(100, 5000) )
legend(100, -1, lty=c(1,1), lwd=c(2,2), col=c("pink", "blue"),legend=c("Females", "Males"),bty="n")

Conclusion: not strong evidence of non-proportional hazards. This is a good look at gross departures, but it is far from a formal test.

As before, let’s do a formal test to check PH assumption:

check.ph <- cox.zph(mod.cov, transform="km", global=TRUE)
check.ph

##            chisq df     p
## sex         1.33  1 0.248
## log(thick)  6.23  1 0.013
## GLOBAL      6.70  2 0.035

Visualizing the residual plot: if the PH assumption for log(thick) holds, then the plot of b(t) vs time should be on a horizontal line.In this case we note that for log(thick) the PH assumption is not valid, so we should apply some more advanced options (as for example introducing an interaction between the covariate and a function of time). We do not introduce here this option.

plot(check.ph[2], xlab="Days")
abline(h=0, lty=3,lwd=2)
abline(h=coef(mod.cov)[2], lty=3,col="red",lwd=2)

We can observe that instead the effect of log(thick) is gradually decreasing with time; red dashed line indicates Cox model’s estimate for the overall log thick effect.

A more elaborate example: binary factor + continuous covariate + stratification variable

mod.cov.strat <- coxph(Surv(days,status==1)~sex+log(thick)+strata(ulc))
summary(mod.cov.strat)

## Call:
## coxph(formula = Surv(days, status == 1) ~ sex + log(thick) + 
##     strata(ulc))
## 
##   n= 205, number of events= 57 
## 
##              coef exp(coef) se(coef)     z Pr(>|z|)   
## sex        0.3600    1.4333   0.2702 1.332   0.1828   
## log(thick) 0.5599    1.7505   0.1784 3.139   0.0017 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##            exp(coef) exp(-coef) lower .95 upper .95
## sex            1.433     0.6977     0.844     2.434
## log(thick)     1.750     0.5713     1.234     2.483
## 
## Concordance= 0.673  (se = 0.039 )
## Likelihood ratio test= 13.3  on 2 df,   p=0.001
## Wald test            = 12.88  on 2 df,   p=0.002
## Score (logrank) test = 12.98  on 2 df,   p=0.002

Plotting estimated survival curves in the strata:

plot(survfit(mod.cov.strat), col=c("red","black"),ylim=c(0.5,1), xlab="Days", ylab="Survival")
legend(3000,1, c("Ulcerated Tumor", "Not Ulcerated Tumor"), col=c("red","black"),bty="n", lty=c(1,1))

The default for survfit is to generate curves for a pseudoindividual for which the covariates are at their mean values. In the present case, that would correspond to a tumor thickness of 1.86 mm and a gender of 1.39 (!). We have been sloppy in not defining sex as a factor variable, but that would not actually give a different result : coxph subtracts the means of the regressors before fitting, so a 1/2 coding is the same as 0/1, which is what a factor with treatment contrasts usually gives:

sex.f <- as.factor(sex)
sex.f

##   [1] 2 2 2 1 2 2 2 2 1 1 1 1 1 2 1 2 2 2 2 2 1 2 2 2 2 1 1 1 1 1 1 2 2 1 2 1 2
##  [38] 2 1 2 1 1 1 2 2 2 2 2 1 1 2 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 1 2 1 2
##  [75] 1 1 1 2 2 1 1 1 2 1 1 2 1 2 2 1 1 1 2 2 2 1 1 1 1 1 1 2 1 2 1 1 2 1 1 2 2
## [112] 1 2 1 2 2 1 1 1 1 1 2 1 1 2 1 1 1 2 1 2 1 1 1 1 1 1 1 2 1 2 1 2 1 1 1 1 1
## [149] 2 2 2 1 1 1 1 2 2 2 1 2 1 2 1 1 1 1 2 1 2 2 1 1 1 1 2 2 1 1 1 2 1 1 1 1 1
## [186] 2 1 2 1 1 2 1 1 1 2 1 2 2 1 1 2 1 1 1 1
## Levels: 1 2

mod.cov.strat.f <- coxph(Surv(days,status==1)~sex.f+log(thick)+strata(ulc))
summary(mod.cov.strat.f)

## Call:
## coxph(formula = Surv(days, status == 1) ~ sex.f + log(thick) + 
##     strata(ulc))
## 
##   n= 205, number of events= 57 
## 
##              coef exp(coef) se(coef)     z Pr(>|z|)   
## sex.f2     0.3600    1.4333   0.2702 1.332   0.1828   
## log(thick) 0.5599    1.7505   0.1784 3.139   0.0017 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##            exp(coef) exp(-coef) lower .95 upper .95
## sex.f2         1.433     0.6977     0.844     2.434
## log(thick)     1.750     0.5713     1.234     2.483
## 
## Concordance= 0.673  (se = 0.039 )
## Likelihood ratio test= 13.3  on 2 df,   p=0.001
## Wald test            = 12.88  on 2 df,   p=0.002
## Score (logrank) test = 12.98  on 2 df,   p=0.002

In order to estimate survival curves for subjects with certain values of the covariates, we could use the option newdata in survfit:

par(mfrow=c(1,2))
plot(survfit(mod.cov.strat, newdata=data.frame(sex=2,thick=194)), xlab = "Days", ylab="Survival",col=c("red","black"),mark.time=F)
title("Male, tumor thickness=194")
legend(1000, .2, c("Ulcerated Tumor", "Not Ulcerated Tumor"), col=c("red","black"),bty="n", lty=c(1,1)) 
plot(survfit(mod.cov.strat, newdata=data.frame(sex=1,thick=194)), xlab = "Days", ylab="Survival",col=c("red","black"),mark.time=F)
title("Female, tumor thickness=194")
legend(1000, .2, c("Ulcerated Tumor", "Not Ulcerated Tumor"), col=c("red","black"),bty="n", lty=c(1,1))

par(mfrow=c(1,2))
plot(survfit(mod.cov.strat, newdata=data.frame(sex=2,thick=709)), xlab = "Days", ylab="Survival",col=c("red","black"),mark.time=F) 
title("Male, tumor thickness=709")
legend(1000, .2, c("Ulcerated Tumor", "Not Ulcerated Tumor"), col=c("red","black"),bty="n", lty=c(1,1)) 
plot(survfit(mod.cov.strat, newdata=data.frame(sex=1,thick=709)), xlab = "Days", ylab="Survival",col=c("red","black"),mark.time=F) 
title("Female, tumor thickness=709")
legend(1000, .2, c("Ulcerated Tumor", "Not Ulcerated Tumor"), col=c("red","black"),bty="n", lty=c(1,1))

5.8 Sample size for Survival outcomes

Use pmsampsize to calculate the minimum sample size required for developing a multivariable Cox prediction model with a survival outcome using 30 candidate predictors. We know an existing prediction model in the same field has an R-squared adjusted of 0.051. Further, in the previous study the mean follow-up was 2.07 years, and overall event rate was 0.065. We select a timepoint of interest for prediction using the newly developed model of 2 years.

# this code is not evaluated - pmsampsize does not work on mac
pmsampsize(type = "s", rsquared = 0.051, parameters = 30, rate = 0.065,timepoint = 2, meanfup = 2.07)

We need 5143 subjects in our study, according to the assumptions declared above.

6 Evaluating performance for the Cox model

Load the required libraries:

# load packages
library(knitr)
library(tidyverse)
library(lubridate)
library(survival)
library(survminer)
library(gtsummary)
library(ISwR)
library(KMsurv)
library(timeROC)
library(rms)

Load the dataset melanom:

data("melanom", package="ISwR")
str(melanom)

## 'data.frame':    205 obs. of  6 variables:
##  $ no    : int  789 13 97 16 21 469 685 7 932 944 ...
##  $ status: int  3 3 2 3 1 1 1 1 3 1 ...
##  $ days  : int  10 30 35 99 185 204 210 232 232 279 ...
##  $ ulc   : int  1 2 2 2 1 1 1 1 1 1 ...
##  $ thick : int  676 65 134 290 1208 484 516 1288 322 741 ...
##  $ sex   : int  2 2 2 1 2 2 2 2 1 1 ...

head(melanom)

##    no status days ulc thick sex
## 1 789      3   10   1   676   2
## 2  13      3   30   2    65   2
## 3  97      2   35   2   134   2
## 4  16      3   99   2   290   1
## 5  21      1  185   1  1208   2
## 6 469      1  204   1   484   2

6.1 Discrimination for the survival data : time-dependent ROC curve

First of all we evaluate the discrimination accuracy of the tumor thickness as a prognostic biomarker for death from malignant melanoma:

ROC.thick<-timeROC(T=melanom$days,delta=melanom$status,marker=melanom$thick,cause=1, times=c(1800,2000,2200),iid=TRUE)
ROC.thick

## Time-dependent-Roc curve estimated using IPCW  (n=205, with competing risks). 
##        Cases Survivors Other events Censored AUC_1 (%) se_1 AUC_2 (%) se_2
## t=1800    45       124           36        0     75.39 4.10     76.51 3.85
## t=2000    46       103           56        0     75.60 4.14     76.11 3.82
## t=2200    50        83           72        0     72.55 4.40     74.33 3.81
## 
## Method used for estimating IPCW:marginal 
## 
## Total computation time : 0.48  secs.

Note that since we are here in a competing risk setting (two causes of death, but we are interested in cause=1), we obtain two AUCs estimated each one for a risk. We are now interested in evaluating AUC_1. We can also evaluate the confidence intervals for the estimated AUCs:

confint(ROC.thick)

## $CI_AUC_1
##         2.5% 97.5%
## t=1800 67.36 83.41
## t=2000 67.49 83.72
## t=2200 63.92 81.19
## 
## $CB_AUC_1
##         2.5% 97.5%
## t=1800 66.66 84.11
## t=2000 66.78 84.42
## t=2200 63.18 81.93
## 
## $C.alpha.1
##      95% 
## 2.129302 
## 
## $CI_AUC_2
##         2.5% 97.5%
## t=1800 68.96 84.05
## t=2000 68.63 83.60
## t=2200 66.86 81.79
## 
## $CB_AUC_2
##         2.5% 97.5%
## t=1800 68.55 84.46
## t=2000 68.23 84.00
## t=2200 66.46 82.20
## 
## $C.alpha.2
##      95% 
## 2.065624

Then, we estimate a Cox model by adding sex to the tumor thickness:

mod.cov <- coxph(Surv(days,status==1)~sex+log(thick), data=melanom)
summary(mod.cov)

## Call:
## coxph(formula = Surv(days, status == 1) ~ sex + log(thick), data = melanom)
## 
##   n= 205, number of events= 57 
## 
##              coef exp(coef) se(coef)     z Pr(>|z|)    
## sex        0.4580    1.5809   0.2687 1.705   0.0883 .  
## log(thick) 0.7809    2.1834   0.1573 4.963 6.94e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##            exp(coef) exp(-coef) lower .95 upper .95
## sex            1.581     0.6326    0.9337     2.677
## log(thick)     2.183     0.4580    1.6040     2.972
## 
## Concordance= 0.749  (se = 0.033 )
## Likelihood ratio test= 33.45  on 2 df,   p=5e-08
## Wald test            = 31  on 2 df,   p=2e-07
## Score (logrank) test = 32.52  on 2 df,   p=9e-08

We now extract the linear predictor from the Cox model (X*beta): note that the linear predictor of this cause-specific Cox model is evaluated at baseline and it is independent of time.

lin.pred <- predict(mod.cov)

And we now compute the time-dependent discrimination from this model:

ROC.cox <- timeROC(T=melanom$days,delta=melanom$status,marker=lin.pred,cause=1, times=c(1800,2000,2200),iid=TRUE)
ROC.cox

## Time-dependent-Roc curve estimated using IPCW  (n=205, with competing risks). 
##        Cases Survivors Other events Censored AUC_1 (%) se_1 AUC_2 (%) se_2
## t=1800    45       124           36        0     75.76 4.09     76.94 3.85
## t=2000    46       103           56        0     76.17 4.15     76.39 3.83
## t=2200    50        83           72        0     74.33 4.34     75.13 3.75
## 
## Method used for estimating IPCW:marginal 
## 
## Total computation time : 0.17  secs.

along with the estimated confidence intervals:

confint(ROC.cox)

## $CI_AUC_1
##         2.5% 97.5%
## t=1800 67.75 83.77
## t=2000 68.04 84.30
## t=2200 65.82 82.83
## 
## $CB_AUC_1
##         2.5% 97.5%
## t=1800 66.99 84.53
## t=2000 67.26 85.08
## t=2200 65.02 83.64
## 
## $C.alpha.1
##      95% 
## 2.146461 
## 
## $CI_AUC_2
##         2.5% 97.5%
## t=1800 69.40 84.49
## t=2000 68.87 83.90
## t=2200 67.77 82.48
## 
## $CB_AUC_2
##         2.5% 97.5%
## t=1800 68.98 84.91
## t=2000 68.46 84.31
## t=2200 67.37 82.88
## 
## $C.alpha.2
##     95% 
## 2.06678

Now we can compare the AUCs: is there a significant improvement in discrimination by adding sex?

pp1 <- compare(ROC.thick,ROC.cox)
round(pp1$p_values_AUC_1,digits=4)

## t=1800 t=2000 t=2200 
## 0.7713 0.6594 0.1936

No significant increment in discrimination is observed from this test.

Now we add also ulcer in the Cox model but this time as a covariate (not as a strata variable):

mod.cov2 <- coxph(Surv(days,status==1)~sex+log(thick)+ulc, data=melanom)
summary(mod.cov2)

## Call:
## coxph(formula = Surv(days, status == 1) ~ sex + log(thick) + 
##     ulc, data = melanom)
## 
##   n= 205, number of events= 57 
## 
##               coef exp(coef) se(coef)      z Pr(>|z|)   
## sex         0.3813    1.4641   0.2706  1.409  0.15879   
## log(thick)  0.5756    1.7782   0.1794  3.209  0.00133 **
## ulc        -0.9389    0.3911   0.3243 -2.895  0.00379 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##            exp(coef) exp(-coef) lower .95 upper .95
## sex           1.4641     0.6830    0.8615    2.4883
## log(thick)    1.7782     0.5624    1.2511    2.5273
## ulc           0.3911     2.5571    0.2071    0.7384
## 
## Concordance= 0.768  (se = 0.032 )
## Likelihood ratio test= 42.66  on 3 df,   p=3e-09
## Wald test            = 36.99  on 3 df,   p=5e-08
## Score (logrank) test = 42.81  on 3 df,   p=3e-09

We now extract also from this model the linear predictor (X*beta):

lin.pred2 <- predict(mod.cov2)

And now we compare the performance of this model with the tumor thickness:

ROC.cox2 <- timeROC(T=melanom$days,delta=melanom$status,marker=lin.pred2,cause=1, times=c(1800,2000,2200),iid=TRUE)
ROC.cox2

## Time-dependent-Roc curve estimated using IPCW  (n=205, with competing risks). 
##        Cases Survivors Other events Censored AUC_1 (%) se_1 AUC_2 (%) se_2
## t=1800    45       124           36        0     78.74 3.95     79.36 3.71
## t=2000    46       103           56        0     78.59 4.06     78.58 3.73
## t=2200    50        83           72        0     75.88 4.27     76.66 3.69
## 
## Method used for estimating IPCW:marginal 
## 
## Total computation time : 0.22  secs.

confint(ROC.cox2)

## $CI_AUC_1
##         2.5% 97.5%
## t=1800 71.00 86.47
## t=2000 70.64 86.54
## t=2200 67.51 84.25
## 
## $CB_AUC_1
##         2.5% 97.5%
## t=1800 69.98 87.49
## t=2000 69.59 87.59
## t=2200 66.40 85.35
## 
## $C.alpha.1
##      95% 
## 2.218481 
## 
## $CI_AUC_2
##         2.5% 97.5%
## t=1800 72.10 86.62
## t=2000 71.27 85.88
## t=2200 69.43 83.89
## 
## $CB_AUC_2
##         2.5% 97.5%
## t=1800 71.55 87.17
## t=2000 70.72 86.43
## t=2200 68.88 84.43
## 
## $C.alpha.2
##      95% 
## 2.108273

pp1 <- compare(ROC.thick,ROC.cox2)
round(pp1$p_values_AUC_1,digits=4)

## t=1800 t=2000 t=2200 
## 0.1473 0.2036 0.1781

Again, no significant difference in discrimination is observed.

Finally, we can plot the estimated time-dependent AUCs:

plotAUCcurve(ROC.thick, FP = 2, add = FALSE,conf.int = FALSE,conf.band = FALSE, col = "black")
plotAUCcurve(ROC.cox, FP = 2, add =TRUE, conf.int = FALSE,conf.band = FALSE, col = "blue")
plotAUCcurve(ROC.cox2, FP = 2, add =TRUE, conf.int = FALSE,conf.band = FALSE, col = "red")
title(main="Time-dependent AUCs")
legend("topleft", legend=c("Tumor thickness","Thick + sex", "Thick + sex+ulc"), lty=c(1,1,1),lwd=c(2,2,2), 
       col=c("black","blue", "red"), bty="n")

We can also fix a specific time-horizon :

plot(ROC.thick,time=2200,add = FALSE, col = "black",title=FALSE, lwd=2)
plot(ROC.cox,time=2200,add = TRUE, col = "blue",title=FALSE, lwd=2)
plot(ROC.cox2,time=2200,add = TRUE, col = "red",title=FALSE, lwd=2)
legend("bottomright", legend=c("Tumor thickness","Thick + sex", "Thick + sex + ulc"), lty=c(1,1,1),lwd=c(2,2,2), 
       col=c("black","blue", "red"), bty="n")
title(main="Time-dependent ROC curves at 2200 days")

Note that discrimination in this context is focused exclusively on the performance of the linear predictor extracted from the Cox model. We are not considering here the survival estimates.

6.2 Calibration for the Cox model

Just for didactic reasons, we are in this example IGNORING the competing risk present in these data, and act as we have an unique event. Therefore the method is illustrated is valid only in the standard survival setting, not in the competing risk setting (more advanced approaches are needed in that case, as for example estimating cumulative incidence curves using the multi-state approach). We now evaluate calibration for the Cox model: first of all, we determine summaries of variables for effect and plotting ranges, values to adjust to, and overall ranges for the dataset Melanom, since this is required in order to use the R functions developed in the rms library.

options(datadist='dd')
dd <-datadist(melanom)
dd

##                       no status      days ulc     thick sex
## Low:effect      222.0000      1 1525.0000   1   97.0000   1
## Adjust to       469.0000      2 2005.0000   1  194.0000   1
## High:effect     731.0000      3 3042.0000   2  356.0000   2
## Low:prediction   11.0000      1  294.2195   1   32.0000   1
## High:prediction 943.0488      3 4119.2439   2  838.7805   2
## Low               2.0000      1   10.0000   1   10.0000   1
## High            992.0000      3 5565.0000   2 1742.0000   2
## 
## Values:
## 
## status : 1 2 3 
## ulc : 1 2 
## sex : 1 2

Then, we re-estimate the Cox model by adding some options useful to fix the time horizon to evaluate calibration, in this case fixed at 2000 days:

mod.cov2 <- cph(Surv(days,status==1)~sex+log(thick)+ulc, data=melanom,x=TRUE , y=TRUE,surv=TRUE,time.inc=2000)

Finally, we use the function calibrate by comparing predicted probabilities from the Cox model with the observed ones estimated by the Kaplan-Meier:

calkm <- calibrate(mod.cov2, u=2000, m=40, cmethod= 'KM', B=200)

## Using Cox survival estimates at 2000 Days

calkm

## calibrate.cph(fit = mod.cov2, cmethod = "KM", u = 2000, m = 40, 
##     B = 200)
## 
## 
## n=205  B=200  u=2000 Day
## 
##       index.orig    training         test mean.optimism mean.corrected   n
## [1,]  0.01505666 0.015399968  0.028326832  -0.012926864     0.02798352 200
## [2,]  0.01466150 0.006498119 -0.001349244   0.007847363     0.00681414 200
## [3,]  0.01195721 0.006277704  0.004442661   0.001835043     0.01012216 200
## [4,] -0.01622951 0.003399131 -0.013930637   0.017329768    -0.03355928 200
## [5,]  0.01160306 0.004361092  0.002688064   0.001673028     0.00993003 200
##      mean.predicted        KM KM.corrected    std.err
## [1,]      0.4394888 0.4545455    0.4674723 0.17797217
## [2,]      0.6603385 0.6750000    0.6671526 0.10971343
## [3,]      0.8233362 0.8352934    0.8334584 0.07384917
## [4,]      0.8978469 0.8816174    0.8642876 0.06477229
## [5,]      0.9396165 0.9512195    0.9495465 0.03536639

plot(calkm, cex.lab=0.7)

Another more general approach to estimating calibration curves for survival models is to use the flexible adaptive hazard regression approach (not covered here). For survival models, “predicted” means predicted survival probability at a single time point, and “observed” refers to the corresponding Kaplan-Meier survival estimate. The number of intervals of predicted survival probabilities in which subjects are ranked depend on the parameter m that correspond to the average number of subjects in each group evaluated at u-time.
As an internal validation tool, the bootstrap is used to de-bias the estimates to correct for overfitting, allowing estimation of the likely future calibration performance of the fitted model.

7 Examples on more advanced survival topics

These examples are taken from the paper: “Analysis of time-to-event for observational studies: Guidance to the use of intensity models”, available at : https://pubmed.ncbi.nlm.nih.gov/33043497/.

load(here("datasets", "pad.rda"))
library(splines)
library(timereg) 
library(knitr)
library(survival)
#  Advanced ovarian data from dynpred package
data(ova, package="dynpred")

7.1 Introduction

7.1.1 Peripheral Arterial Disease :Background

Peripheral arterial disease (PAD) is a common circulatory problem in which narrowed arteries reduce blood flow to peripheral limbs, often the legs. It is also likely to be a sign of a more widespread atherosclerosis, and subjects manifesting the disease carry an increased risk for all atherothrombotic events including myocardial infarction, stroke, and cardiac death. The pad data set contains the results of a Slovene study. Briefly, the study was conducted by 74 primary care physicians-researchers, each of whom was asked to recruit 10 subjects with PAD along with 10 age and sex matched controls without PAD; actual recruitment ranged from 1 to 31 pairs. All participating subjects consented to a yearly examination and interview for 5 years. Important endpoints are death, either due to cardiovascular disease (CVD death) or other causes, non-fatal CVD endpoints of infarction and stroke, and patient interventions attributed to the disease such as revascularization procedures.

The final study includes 742 PAD and 713 controls, with baseline data for each subject, measurements at each visit, and endpoints. This analysis will focus on cardiac endpoints and on overall mortality.

7.1.2 Endpoint and follow-up

The starting R data has 3 dataframes:

pad1 has one observation per subject containing baseline information,
pad2 has one observation per subject visit containing measurements at that visit, and
pad3 has an observation for each observed endpoint.

The dataset pad2 contains all the repeated examinations over the 5 years of the study. Note that in this dataset the ID of the subject is repeated for multiple lines (data are in the long format). The same happens for pad3 that reports all the events of interest. In the IDA phase (Initial Data Analysis) we look at the overall follow-up for subjects, and decide on a time scale and an endpoint. Per the study design subjects should have a visit at years 1–5. We see in Figure that visit 1 dates are stretched out over a full year and by visit 5 over almost 2 years, a pattern that is unavoidable in observational studies of human subjects. Based on the counts shown in the figure there is no evidence of differential drop out between the PAD and control subjects.

vcount <- with(pad2, table(pad, visit))  # visit count
vtime <-  pad2$day/365.25
plot(vtime ~ jitter(visit + .3*(pad=="PAD") - .15), pad2, subset=(visit>0),
     col=c(1,2)[as.numeric(pad)], ylim=c(0, max(vtime)), 
     xlab= "Visit number", ylab="Years from initial visit")
#abline(h=1:6, lty=2, col=4)
segments(rep(0,6), 1:6, 1:6+.5, 1:6, lty=2, col='gray')
legend(4,2, c("PAD", "Control"), col=c(2,1), bty='n', pch=19)
text(1:5 - .15, 0:4, vcount[1,-1])
text(1:5 + .15, 0:4, vcount[2,-1])

Time to each visit for the PAD study.

pfig2 <- survfit(Surv(futime/365.25, status=='censor') ~ pad, pad1)
plot(pfig2, fun='event', col=1:2,
     xlab="Years post enrollment", ylab="Fraction censored")
legend(2, .6, c("Control", "PAD", "Ideal"), col=c(1,2,1), lty=c(1,1,2),
       bty='n')
lines(c(0, 4.8, 6.8), c(0,0,1), lty=3)

Observed and ideal follow-up in the PAD study

The censoring pattern for subjects in the two arms, showed in the above figure, are computed using a reverse Kaplan-Meier (i.e. the status is to be censored, and the censoring is having the event).
There is no evidence of differential follow-up.

Given the visit slippage of the first figure, an ideal censoring pattern would be a final followup that is approximately uniformly distributed between 4.8 and 6.8 years, this is shown as a dotted line on the figure.

Examining the counts of subjects for each visit the total number of visits for the subjects is not what was planned — about 25% do not have a fifth visit — but patients have adhered to the 5 year time commitment quite closely. As stated in the primary paper, a decision was made to truncate all patient data at 5 years for the purpose of analysis.

7.1.3 The time scale of the survival analysis

The initial survival curves can be created using a copy of the baseline data (pad1), with follow-up truncated at 5 years.

pdata <- subset(pad1, select=c(id, pad, sex, age, futime, status))
pdata$death <- ifelse(pdata$status == 'censor', 0, 1) # any cause of death

# truncate follow=up at 5 years
over5 <- pdata$futime > 5*(365.25)
pdata$futime <- pmin(1826, pdata$futime)  # 1826 days = 5 years
pdata$death[over5] <- 0
pdata$status[over5] <- "censor"
table(pdata$status)

     censor    CV death nonCV death 
       1296          68          91


# Convert to follow-up in years, and add an age scale
pdata$years <- pdata$futime/365.25
pdata$age1 <- as.numeric(pad1$d.first - pad1$d.birth)/365.25
pdata$age2 <- pdata$age1 + pdata$years

oldpar <- par(mfrow=c(1,2))
psurv1 <- survfit(Surv(years, death) ~ pad + sex, data=pdata)
psurv2 <- survfit(Surv(age1, age2, death) ~ pad+ sex, pdata,
               start.time=55)
plot(psurv1, col=c(1,1,2,2), lty=1:2, fun='event', lwd=2,
     xlab="Years since enrollment", ylab="Death")
legend(.1, .18, c("PAD, male","PAD, female", "Control, male","Control, female"),
       col=c(2,2,1,1), lty=c(2,1,2,1), lwd=2, bty='n', cex=.8)
plot(psurv2,   col=c(1,1,2,2), lty=1:2, fun='event', lwd=2,
      xlab='Age', ylab='Death, given alive at age 55')

Kaplan-Meier curves of overall survival.

par(oldpar)

Our first analysis will use time since enrollment as the primary time scale. For the PAD patients the time since diagnosis is a natural scale since it represents both the progression of disease and of treatments for the disease. Survival curves for the control subjects serve as a comparison outcome of similarly aged subjects without the disease, but do not have a natural stand-alone interpretation.

Death rates are higher for males than for females, which is no surprise given a mean age at entry of 65 years, and is higher for PAD subjects than for the age matched controls. The right hand panel shows the curves on age scale. Note that when using the age scale, we include two time points in the Survival function: age at entry in the study and age at the event or censoring. These data are both right-censored and left-truncated (we have not discussed this point in the lessons, but it is of interest if you want to deepen the survival framework).In two words, left-truncation is when the subjects to enter under your observation should be survived until a certain time-point (i.e. age at entry).

Survival curves can equivalently display either the fraction who have died (curves start at 0: 1-KM) or the fraction who have not died (curves start at 1); the choice between the two is often based on tradition or habit.

7.2 Covariates and Cox model

The analysis will focus on the impact on survival of the covariates PAD, age, sex, and serum values for high density and low density lipoprotein (HDL, LDL). Figure shows density estimates for each of the three continuous covariates at baseline, by PAD group. By study design there should be no difference in the age distribution. HDL and LDL are slightly lower for the PAD subjects.


oldpar <- par(mfrow=c(1,3), mar=c(5.1, 4.1, 5.1, 1))
dfun <- function(data, x, ...) {
    d0 <- density((data[[x]])[data$pad=="Control"])
    d1 <- density((data[[x]])[data$pad=="PAD"])
    matplot(cbind(d0$x, d1$x), cbind(d0$y, d1$y), type='l', lty=1, col=1:2,
            ...)
}
lunit <- 38.67  # convert from mmol/L to mg/dl
xx <- c(0, 50, 100, 150, 200, 250, 300)

dfun(pdata, "age", xlab="Age", ylab="Density", lwd=2)
legend(51, .02, c("PAD", "Control"), lty=1, lwd=2, col=2:1, bty='n')

dfun(na.omit(subset(pad2, visit==0, c(hdl, pad))),
             "hdl", xlab="HDL (mmol/L)", ylab="Density", lwd=2)
axis(3, xx/lunit, xx)

dfun(na.omit(subset(pad2, visit==0, c(ldl, pad))),
     "ldl", xlab="LDL (mmol/L)", ylab="Density", lwd=2)
axis(3, xx/lunit, xx)

Distributions of three continuous covariates, by group. Blood value scales are in mmol/L (lower) and mg/dl (above).

par(oldpar)

7.3 A more advanced topic : time-dependent covariates (multiple measurements) and competing risks !

The Cox model fits will use age, sex, baseline values of high density and low density lipoproteins (hdl0, ldl0), time-dependent lipid values (hdl, ldl), and a time-dependent value of “years since enrollment” (etime). Time since enrollment only plays a role in age scale models: in a time-since-entry model every one in a given risk set will have exactly the same value of the covariate, and hence the estimated coefficient for the variable would be 0.

For time-dependent analysis we use a counting process form. Mathematically each subject is represented by 3 processes: \(Y_i(t)=1\) if the subject is currently at risk, \(X_i(t)\) is the time dependent covariate process, and \(N(t)\) the cumulative number of events. In the R code this translates into multiple observations for each subject of the form start-stop:

 id, time1, time2, event, x1, x2, ...

where (time1, time2) describe an interval over which the subject is at risk, event corresponds to \(dN(t)\) and will be 1 of the subject had an event at time2 and 0 otherwise, x1, x2, ... are the value of the covariates over the interval, and id is a subject identifier. Subject 9, for instance, is a male with blood values measured on days 0, 583, 945, and 1309, who died on day 1349. Their data has the form

id tstart tstop death  sex  ldl  hdl
 9      0   583     0 male 2.66 1.08
 9    583   945     0 male 3.20 1.53
 9    945  1309     0 male 2.81 1.32
 9   1309  1349     1 male 2.51 1.30

The lipid values and time intervals encode the most recent measurement of the value. Creation and checking of the data often represents a major share of the work in an analysis. There is often concern that because the data set has multiple rows for the same subject, a valid variance estimate for the model will need to account for correlation.
This is not the case, however.
Internally, there is only one copy of each subject; the multiple rows are merely a way to communicate to the program which covariate value is applicable at any given time point.

When there are competing risks, such as CVD and non-CVD death, the 0/1 event event variable is replaced by a factor with levels of “none”, “CVD death”, “non-CVD death”, which shows both when an event occurred and it’s type.

Below we create separate time-dependent data sets for analysis on the time-from-enrollment scale (pdata2) and on the age scale (pdata3).

The two data sets have separate event indicators for death, the competing risk of CVD/non-CVD death, and for any CVD event/non-CVD death. Creating (time1, time2) data sets on age-in-years scale can lead to anomalies due to round off error; a detailed discussion and example is found in the “Round off error and tied times” vignette of the survival package. For this reason the time-dependent data sets all use time in days. Use birth date and entry date to create agedays = age at enrollment, in days

This data set will have two time scales (age and time since enrollment), time dependent covariates (hdl, ldl, etime), and multiple endpoints (all cause death, CVD vs non-CVD death, major CVD event).

A. etime = time since enrollment as a time dependent variable 1. This is a continuous time-dependent covariate – technically it changes minute by minute. 2. We will be using it in age scale fits. If we create it on age scale then the cuts for survSplit can be limited to “the ages at which deaths occur”

B. The tmerge routine can keep track of multiple endpoints on a single time scale, the survSplit routine a single endpoint and single time scale.

C. The tmerge operations needs to be sequential: no step that would change the number of rows can be inserted in the midst.

Taken all together, it is simplest to create the data sets, pdata2 on the time since enrollment scale, pdata3 on the age scale. Many of the steps will need to be repeated, e.g., adding time dependent hdl and ldl.

pdata2 <- tmerge(pdata[, c("id", "pad", "age", "sex")], pdata, 
                 id=id, status = event(futime, status))
pdata2 <- tmerge(pdata2, pad2, id=id, ldl= tdc(day, ldl),
                 hdl = tdc(day, hdl))

# add baseline hdl and ldl as covariates
indx <- match(pdata2$id, pdata2$id)
pdata2$ldl0 <- pdata2$ldl[indx]
pdata2$hdl0 <- pdata2$hdl[indx]
pdata2$age10 <- pdata2$age/10   # age in decades

# create "any cvd event" as a competing risk, it will be used later
temp <- subset(pad3, event %in% c("cvd death", "infarction", "stroke"))
temp <- temp[!duplicated(temp$id),]   # first of these
temp <- subset(temp, day< 5*365.25)   # must be <= 5 years
pdata2 <- tmerge(pdata2, temp, id=id, cvstat=event(day),
                 priorcv = tdc(day))

pdata2$death <- ifelse(pdata2$status=="censor", 0, 1)  # all cause death
tstat <- ifelse(pdata2$cvstat>0, pdata2$cvstat, 2*pdata2$death)
pdata2$cvstat <- factor(tstat, 0:2,
                         c("censor", "CVD event", "death"))
pdata2$death <- ifelse(pdata2$status=="censor", 0, 1)  # all cause death

# Reprise all the steps, on age scale.  We need to add "age" to the
#  merged-in data sets before adding them.
ageday <- as.numeric(pad1$d.first - pad1$d.birth) # age in days at enrollmemt
pdata3 <- tmerge(pdata[, c("id", "pad", "sex")], pdata, 
                 id=id, status = event(futime + ageday, status),
                 tstart=ageday, tstop=ageday+futime,
                 options=list(tstart="age1", tstop="age2"))
temp2 <- pad2
temp2$age <- temp2$day + ageday[match(temp2$id, pad1$id)]
pdata3 <- tmerge(pdata3, temp2, id=id, ldl= tdc(age, ldl),
                 hdl = tdc(age, hdl))

# add baseline hdl and ldl as covariates
indx <- match(pdata3$id, pdata3$id) #points to the first row for each subject
pdata3$ldl0 <- pdata3$ldl[indx]
pdata3$hdl0 <- pdata3$hdl[indx]

# Add time since enrollment as a time-dependent covariate.
#  Use a temporary data set + survSplit to create a 'variable' that looks
#  like yet another lab test, then merge it in on the age scale.
#  Time since enrollment will be in weeks, which will create 52*5= 260 obs
#  per subject.  (Time since enrollment in days would be overkill.)
pdata$ageday <- ageday
temp3 <- survSplit(Surv(futime, status) ~ ageday + id, data= pdata,
      cut= seq(7, max(pdata$futime), by=7))
pdata3 <- tmerge(pdata3, temp3, id, etime = tdc(ageday + tstart, tstart))
pdata3$etime <- pdata3$etime/365.25   # recode to time in years

# add 'any cv' as an event
temp$age <- temp$day + ageday[match(temp$id, pad1$id)]
pdata3 <- tmerge(pdata3, temp, id=id, cvstat=event(age),
                 priorcv = tdc(age))

# recode event variables, after all survSplit or tmerge steps are done
pdata3$death <- ifelse(pdata3$status=="censor", 0, 1)  # all cause death
tstat <- ifelse(pdata3$cvstat>0, pdata3$cvstat, 2*pdata3$death)
pdata3$cvstat <- factor(tstat, 0:2, c("censor", "CVD event", "nonCVD-death"))

# Creation of multi-row data sets has pitfalls.  It is always wise to
#  sanity check the results.  
check1 <- survcheck(Surv(tstart, tstop, death) ~ 1, id=id, data= pdata2)
check1  
Call:
survcheck(formula = Surv(tstart, tstop, death) ~ 1, data = pdata2, 
    id = id)

Unique identifiers       Observations        Transitions 
              1455               6731                159 

Transitions table:
      to
from      1 (censored)
  (s0)  159       1296
  1       0          0

Number of subjects with 0, 1, ... transitions to each state:
       count
state      0   1
  1     1296 159
  (any) 1296 159

check2 <- survcheck(Surv(age1, age2, death) ~1, id=id, data= pdata3)
check2
Call:
survcheck(formula = Surv(age1, age2, death) ~ 1, data = pdata3, 
    id = id)

Unique identifiers       Observations        Transitions 
              1455             362876                159 

Transitions table:
      to
from      1 (censored)
  (s0)  159       1296
  1       0          0

Number of subjects with 0, 1, ... transitions to each state:
       count
state      0   1
  1     1296 159
  (any) 1296 159

7.3.1 Overall mortality

The first pair of fits shows that the estimated overall effects of PAD and sex hardly differ between analyses done on the enrollment or age scales. A second pair fits looks at the coefficient effects within the PAD and control groups. (Inclusion of the covariate by strata interactions gives results that are equivalent to separate fits for the the PAD and control groups; bundling them as a single model allows a formal test of whether the effects differ between the two strata.) None of sex, enrollment age, or time since enrollment show meaningful differences between the PAD and control groups. That is, on the log-hazard scale, PAD appears to act as an additive risk.

# Fit the first set of models: pad, age, sex on the time since enrollment
#  scale (cfit1a) and on the age scale (cfit1b)
cfit1a <- coxph(Surv(tstart, tstop, death) ~ pad + sex + age10, pdata2)
cfit1b <- coxph(Surv(age1, age2, death) ~ pad + sex + etime, pdata3)
# Add interactions
cfit2a <- coxph(Surv(tstart, tstop, death) ~ (sex + age10)*strata(pad), pdata2)
cfit2b <- coxph(Surv(age1, age2, death) ~ (sex + etime)*strata(pad),
                pdata3)

cat("Overall model on time since enrollment scale\n")
Overall model on time since enrollment scale

print(cfit1a)
Call:
coxph(formula = Surv(tstart, tstop, death) ~ pad + sex + age10, 
    data = pdata2)

          coef exp(coef) se(coef)     z        p
padPAD  0.8755    2.4000   0.1739 5.036 4.76e-07
sexmale 0.6924    1.9985   0.1822 3.800 0.000145
age10   0.6583    1.9315   0.1053 6.250 4.11e-10

Likelihood ratio test=82.63  on 3 df, p=< 2.2e-16
n= 6731, number of events= 159

cat("\nOverall model on age scale\n")

Overall model on age scale

print(cfit1b)
Call:
coxph(formula = Surv(age1, age2, death) ~ pad + sex + etime, 
    data = pdata3)

           coef exp(coef) se(coef)     z        p
padPAD  0.87354   2.39538  0.17400 5.020 5.16e-07
sexmale 0.70540   2.02466  0.18270 3.861 0.000113
etime   0.16593   1.18049  0.05917 2.804 0.005042

Likelihood ratio test=51.15  on 3 df, p=4.538e-11
n= 362876, number of events= 159


cat("\nInteractions between covariates and treatment, time since enrollment\n") 

Interactions between covariates and treatment, time since enrollment

print(cfit2a)
Call:
coxph(formula = Surv(tstart, tstop, death) ~ (sex + age10) * 
    strata(pad), data = pdata2)

                           coef exp(coef) se(coef)      z        p
sexmale                 0.67576   1.96552  0.33494  2.018 0.043638
age10                   0.68341   1.98062  0.19354  3.531 0.000414
sexmale:strata(pad)PAD  0.02208   1.02233  0.39925  0.055 0.955897
age10:strata(pad)PAD   -0.03498   0.96563  0.23081 -0.152 0.879552

Likelihood ratio test=55.21  on 4 df, p=2.938e-11
n= 6731, number of events= 159

cat("\nInteractions between covariates and treatment, age scale\n") 

Interactions between covariates and treatment, age scale

print(cfit2b)
Call:
coxph(formula = Surv(age1, age2, death) ~ (sex + etime) * strata(pad), 
    data = pdata3)

                            coef exp(coef)  se(coef)     z      p
sexmale                0.6980525 2.0098347 0.3357736 2.079 0.0376
etime                  0.1176904 1.1248958 0.1097373 1.072 0.2835
sexmale:strata(pad)PAD 0.0001539 1.0001539 0.4003644 0.000 0.9997
etime:strata(pad)PAD   0.0665382 1.0688018 0.1304441 0.510 0.6100

Likelihood ratio test=24.44  on 4 df, p=6.516e-05
n= 362876, number of events= 159

Therefore, we can see that in the multivariable Cox model on the follow up scale, being PAD and increasing age decrease the estimated survival, male sex is at higher risk. On the age scale, these results are confirmed. There is no a significant interaction between PAD and gender or age.

7.3.2 Cardiovascular death

Since PAD is expected to exert its primary effect via cardiovascular disease (CVD) endpoints, a more interesting analysis is to focus on CVD rather than overall mortality. Since in this case non-CVD death will be a competing risk this needs to be accounted for in our estimate of the probability in state; in the R survival package this is accomplished by using a multi-level categorical variable as the “status”, rather than 0/1. The survfit function then computes Aalen-Johansen (AJ) estimates of a multi state model.


oldpar <- par(mfrow=c(1,2))
psurv4 <- survfit(Surv(tstart, tstop, status) ~ pad, id = id, pdata2)
plot(psurv4, col=1:2, lty=c(1,1,2,2), ylim=c(0,.15), xscale=365.25, 
     xlab="Years since enrollment", ylab="Absolute risk")
legend("topleft", c("PAD CVD death", "Control CVD death", 
                 "PAD nCVD death","Control nCVD"),
       col=c(2,1,2,1), lty=c(1,1,2,2), lwd=2, bty='n')

psurv5 <- survfit(Surv(tstart, tstop, cvstat) ~ pad, pdata2, id=id,
                  subset=(priorcv==0))
plot(psurv5, col=1:2, lty=c(1,1,2,2), xscale=365.25, ylim=c(0, .15),
         xlab="Years since enrollment", ylab="Absolute risk")         

legend("topleft", c("Major CVD, PAD", "Major CVD, Control", 
                  "death w/o CVD, PAD","death w/o CVD, Control"),
       col=c(2,1,2,1), lty=c(1,1,2,2), lwd=2, bty='n', cex=.85)

Aalen-Johansen curves for CVD vs. non-CVD death outcomes (left) and for any CVD event vs. death without CVD (right).

par(oldpar)

Figure shows the Aalen-Johansen estimates of probability in state for CVD and non-CVD death in the left panel (which could also be labeled as “absolute risk”, “competing risk”, or “cumulative incidence”) and the AJ curves for first major CVD event (infarction, stroke, CVD death) versus non-CVD death in the right panel. For controls, CVD deaths are about 1/2 of the non-CVD, approximately 2% versus 4% at 5 years; both outcomes are near 8% for the PAD subjects. Nearly 14% of the PAD subjects experience a major CVD event versus 6% for controls.

The absolute risk curves require that all transitions in the model be considered jointly: incorrect curves will result from a simple Kaplan-Meier of time to CVD, censoring at non-CVD death, or a KM of time to non-CVD death, censoring at CVD.

Interestingly, this is not true for the cumulative hazard estimate: the naive estimate for one outcome, censoring the others, will be identical to the joint estimate.

The Cox model is a model for the hazards and has the same identity: the first coxph call below, which models the entire process, will have the same results for the CVD death portion as the second coxph call, which ignores non-CVD deaths by censoring them; this is verified by the all.equal test.

However, if one desires to predict absolute risk curves from the fitted Cox model then the first form (using the “complete” multi-state approach) is essential.

pcox1 <- coxph(Surv(tstart, tstop, status) ~ pad + sex + age10 + hdl0 + ldl0, 
               pdata2, id=id)
cat("transitions table\n")
transitions table

pcox1$transitions  # a check of the data
             to
from          CV death nonCV death (censored)
  (s0)              67          91       1275
  CV death           0           0          0
  nonCV death        0           0          0

pcox1
Call:
coxph(formula = Surv(tstart, tstop, status) ~ pad + sex + age10 + 
    hdl0 + ldl0, data = pdata2, id = id)

         
1:2           coef exp(coef) se(coef) robust se      z        p
  padPAD   1.05440   2.87025  0.28291   0.28029  3.762 0.000169
  sexmale  0.51298   1.67026  0.27813   0.27881  1.840 0.065786
  age10    0.65698   1.92895  0.16378   0.17894  3.672 0.000241
  hdl0    -0.30026   0.74063  0.32993   0.44220 -0.679 0.497130
  ldl0    -0.08128   0.92194  0.12687   0.12754 -0.637 0.523966

         
1:3           coef exp(coef) se(coef) robust se      z        p
  padPAD   0.71390   2.04194  0.22494   0.22789  3.133  0.00173
  sexmale  0.75348   2.12438  0.25070   0.25505  2.954  0.00313
  age10    0.65666   1.92835  0.14006   0.14379  4.567 4.95e-06
  hdl0    -0.20347   0.81589  0.28098   0.32688 -0.622  0.53363
  ldl0     0.02213   1.02237  0.10630   0.10857  0.204  0.83850

 States:  1= (s0), 2= CV death, 3= nonCV death 

Likelihood ratio test=83.96  on 10 df, p=8.386e-14
n= 6626, unique id= 1433, number of events= 158 
   (105 observations deleted due to missingness)


# cause-specific Cox model censoring non CV death 
pcox1b <- coxph(Surv(tstart, tstop, status=="CV death") ~ pad +sex + age10
              + hdl0 + ldl0, data=pdata2)  #censor non-CVD death
pcox1b
Call:
coxph(formula = Surv(tstart, tstop, status == "CV death") ~ pad + 
    sex + age10 + hdl0 + ldl0, data = pdata2)

            coef exp(coef) se(coef)      z        p
padPAD   1.05445   2.87039  0.28292  3.727 0.000194
sexmale  0.51298   1.67027  0.27813  1.844 0.065122
age10    0.65703   1.92905  0.16379  4.011 6.03e-05
hdl0    -0.30029   0.74060  0.32993 -0.910 0.362744
ldl0    -0.08124   0.92197  0.12687 -0.640 0.521959

Likelihood ratio test=39.51  on 5 df, p=1.874e-07
n= 6626, number of events= 67 
   (105 observations deleted due to missingness)


pcox2 <- coxph(Surv(tstart, tstop, status) ~ pad + sex + age10+
              + hdl + ldl, data=pdata2, id=id)  # time dependent hdl and ldl
pcox2
Call:
coxph(formula = Surv(tstart, tstop, status) ~ pad + sex + age10 + 
    +hdl + ldl, data = pdata2, id = id)

         
1:2          coef exp(coef) se(coef) robust se      z        p
  padPAD   0.8751    2.3991   0.2857    0.2745  3.188 0.001433
  sexmale  0.3094    1.3626   0.2775    0.2707  1.143 0.252954
  age10    0.6963    2.0064   0.1652    0.1813  3.840 0.000123
  hdl     -1.5482    0.2126   0.4109    0.4657 -3.325 0.000885
  ldl     -0.2797    0.7560   0.1479    0.1496 -1.869 0.061609

         
1:3          coef exp(coef) se(coef) robust se      z        p
  padPAD   0.7545    2.1265   0.2277    0.2256  3.345 0.000824
  sexmale  0.7847    2.1918   0.2518    0.2461  3.188 0.001432
  age10    0.6761    1.9662   0.1401    0.1409  4.797 1.61e-06
  hdl     -0.1489    0.8617   0.2910    0.3901 -0.382 0.702680
  ldl      0.1338    1.1432   0.1125    0.1070  1.250 0.211251

 States:  1= (s0), 2= CV death, 3= nonCV death 

Likelihood ratio test=105.6  on 10 df, p=< 2.2e-16
n= 6702, unique id= 1453, number of events= 158 
   (29 observations deleted due to missingness)

There were 67 CVD and 91 non-CVD deaths in 5 years of follow-up, which are the transitions from the initial state (s0) to the CVD death and non-CVD death states, respectively, in the transitions table. No one transitioned from death to another state, which would indicate an error in the data set. Table displays the coefficients for the fits with baseline and time-dependent lipids, for both of the competing outcomes. The joint Cox model shows that after adjustment for age, sex, and baseline lipids, PAD is associated with nearly three fold increase in the rate of CVD deaths and double fold increase in non-CVD death. Age and sex are important predictors as well, as expected, with male sex having a somewhat larger effect on the non-CVD endpoint. The effect of baseline lipid values was not significant, however, time-dependent HDL shows an effect on the CVD death endpoint, somewhat reducing the estimated effect of PAD on CVD death.

7.3.3 Predicted probability-in-state (survival) curves

Predicted \(P(state)\) curves can be created from the multi-state fits, as well. As with any Cox model prediction, the first task is to choose a set of covariate values for which prediction is desired. We create the predictions for four hypothetical male subjects with ages of 58 and 72 (quartiles of age), and PAD vs. control. For HDL and LDL use values of 1.3 and 3, which are near the medians. (These curves are for the model of CVD death vs other death, time-fixed hdl and ldl.)

dummy <- expand.grid(pad=c("PAD", "Control"), age10=c(5.8, 7.2), sex="male",
                     hdl0 =1.3, ldl0 =3)
dummy
      pad age10  sex hdl0 ldl0
1     PAD   5.8 male  1.3    3
2 Control   5.8 male  1.3    3
3     PAD   7.2 male  1.3    3
4 Control   7.2 male  1.3    3

coxpred1 <- survfit(pcox1, newdata= dummy)
dim(coxpred1)
  data states 
     4      3

coxpred1$states
[1] "(s0)"        "CV death"    "nonCV death"

The resulting set of absolute risk curves can be viewed as an array of estimates by stratum, target covariates, and state. The pcox1 fit had no strata so the first dimension is 1. The resulting curves are shown in Figure. Because a multi-state model can generate a very large number of curves it is often helpful to break them into multiple panels.

par(mfrow=c(1, 2))
plot(coxpred1[1, 1:2 ,], col=2:1, lty=c(1,1,2,2), lwd=2, xscale=365.25,
     ylim=c(0, .15),
     xlab="Years since enrollment", ylab="predicted P(state), age = 58")
legend("topleft", c("PAD, CVD death", "Control, CVD death",
                   "PAD, nonCVD death", "Control, nonCVD death"),
       col=c(2,1,2,1), lty=c(1,1,2,2), lwd=2, bty='n', cex=.8)
plot(coxpred1[1, 3:4,], col=2:1, lty=c(1,1,2,2), lwd=2, xscale=365.25,
     ylim=c(0, .15),
     xlab="Years since enrollment", ylab="predicted P(state), age=72")

legend("topleft", c("PAD, CVD death", "Control, CVD death",
                   "PAD, nonCVD death", "Control, nonCVD death"),
       col=c(2,1,2,1), lty=c(1,1,2,2), lwd=2, bty='n', cex=.8)

Predicted absolute risk curves from the multistate Cox model: dying of CVD (solid) and from other causes (dashed) by PAD status, for a male enrolled at age 58 (left panel) or 72 (right). The values of baseline HDL and LDL are 1.3 and 3, respectively.

R applications & Exercises - Block 4

STATISTICAL LEARNING IN EPIDEMIOLOGY 2025

Prof. Giulia Barbati & Paolo Dalena

27/05/2025

1 An introduction to survival analysis in R

1.0.1 Packages

1.0.2 What is survival data?

1.0.3 Aliases for survival analysis

1.0.4 The lung dataset

1.0.5 What is censoring?

1.0.6 Types of censoring

1.0.7 Censored survival data

1.0.8 Distribution of follow-up time

1.0.9 Components of survival data

1.0.10 Dealing with dates in R

1.0.11 Formatting dates - base R

1.0.12 Formatting dates - lubridate package

1.0.13 Calculating survival times - base R

1.0.14 Calculating survival times - lubridate

1.0.15 Event indicator

1.0.16 Survival function

1.0.17 Survival probability

1.0.18 Creating survival objects

1.0.19 Estimating survival curves with the Kaplan-Meier method

1.0.20 Kaplan-Meier plot - base R

1.0.21 Kaplan-Meier plot - ggsurvplot

1.0.22 Estimating \(x\)-year survival

1.0.23 \(x\)-year survival and the survival curve

1.0.24 \(x\)-year survival is often estimated incorrectly

1.0.25 Impact on \(x\)-year survival in absence of censoring

1.0.26 Estimating median survival time

1.0.27 Median survival time and the survival curve

1.0.28 Median survival is often estimated incorrectly

1.0.29 Impact on survival of ignoring (removing) censoring

1.0.30 Comparison between the KM estimator and the FH estimator

2 Survival Analysis in R: comparing survival curves

2.1 Recap of the basic steps

2.2 Estimate survival curves by gender:

2.3 Comparing survival curves : the log-rank test

2.4 Stratified log-rank test using presence of ulcer as strata:

2.5 Extension for more than two groups

3 Competing risks in survival analysis (an introduction)

3.1 Exploring length of stay stratified by pneumonia (as if no competing risks exist)

3.2 Competing risks : the cumulative incidence function

3.3 The cause-specific hazard rate regression approach when there are competing risks

3.4 Supplementary material: the subdistribution hazard regression approach when there are competing risks

4 An example of time-dependent covariate : the extended KM estimator

5 Cox regression in R

5.1 The Cox regression model

5.2 The lung dataset

5.3 Formatting Cox regression results

5.4 Hazard ratios

5.5 Assessing proportional hazards (I)

5.6 Another example of Cox PH Regression

5.7 Verifying the PH Assumption (II)

5.8 Sample size for Survival outcomes

6 Evaluating performance for the Cox model

6.1 Discrimination for the survival data : time-dependent ROC curve

6.2 Calibration for the Cox model

7 Examples on more advanced survival topics

7.1 Introduction

7.1.1 Peripheral Arterial Disease :Background

7.1.2 Endpoint and follow-up

7.1.3 The time scale of the survival analysis

7.2 Covariates and Cox model

7.3 A more advanced topic : time-dependent covariates (multiple measurements) and competing risks !

7.3.1 Overall mortality

7.3.2 Cardiovascular death

7.3.3 Predicted probability-in-state (survival) curves