1 An introduction to survival analysis in R

This practical provides an introduction to survival analysis, and to conducting a survival analysis in R.

We will cover some basics of survival analysis; the following series of tutorial papers can be helpful for additional reading:

  • Clark, T., Bradburn, M., Love, S., & Altman, D. (2003). Survival analysis part I: Basic concepts and first analyses. British Journal of Cancer, 89(2), 232-238.

  • Bradburn, M., Clark, T., Love, S., & Altman, D. (2003). Survival analysis part II: Multivariate data analysis - an introduction to concepts and methods. British Journal of Cancer, 89(3), 431-436.

  • Bradburn, M., Clark, T., Love, S., & Altman, D. (2003). Survival analysis part III: Multivariate data analysis - choosing a model and assessing its adequacy and fit. British Journal of Cancer, 89(4), 605-611.

  • Clark, T., Bradburn, M., Love, S., & Altman, D. (2003). Survival analysis part IV: Further concepts and methods in survival analysis. British Journal of Cancer, 89(5), 781-786.

1.1 Packages

Some packages we’ll be using include:

  • lubridate
  • survival
  • survminer

library(survival)
library(survminer)
library(lubridate)

Moreover, it will be necessary to install a package from GitHub:

# devtools::install_github("zabore/ezfun")
library(ezfun)

1.2 What is survival data?

Survival data are time-to-event data that consist of a distinct start time and end time.

Examples:

  • Time from surgery to death
  • Time from start of treatment to cancer progression
  • Time from response to cancer recurrence
  • Time from HIV infection to development of AIDS
  • Time to heart attack
  • Time to onset of substance abuse
  • Time to initiation of sexual activity
  • Time to machine malfunction, and so on

1.3 Aliases for survival analysis

Because survival analysis is common in many other fields, it also goes by several other names:

  • Reliability analysis
  • Duration analysis
  • Event history analysis
  • Time-to-event analysis

1.4 The lung dataset

The lung dataset is available from the survival package in R. The data contain subjects with advanced lung cancer from the North Central Cancer Treatment Group. Some variables we will use to demonstrate methods include:

  • time: Survival time in days
  • status: censoring status 1=censored, 2=dead
  • sex: Male=1 Female=2

1.5 What is censoring?

Censoring occurs when a subject's exact event time is not observed, so that we only have partial information about it (for example, we may know only that the event had not yet occurred when follow-up ended).

1.6 Types of censoring

A subject may be censored due to:

  • Loss to follow-up
  • Withdrawal from study
  • No event by end of fixed study period (administrative censoring)

Specifically, these are all examples of right censoring.

Left censoring and interval censoring are also possible, and methods exist to analyze these types of data, but in this course we will limit ourselves to right censoring.

1.7 Censored survival data

In this example, how would we compute the proportion of subjects who are event-free at 10 years?

Subjects 6 and 7 were event-free at 10 years. Subjects 2,9 and 10 had the event before 10 years. Subjects 1, 3, 4, 5 and 8 were censored before 10 years, so we don’t know whether they had the event or not by 10 years - how do we incorporate these subjects into our estimate?

1.8 Distribution of follow-up time

  • Censored subjects still provide information so must be appropriately included in the analysis
  • Distribution of follow-up times is skewed, and may differ between censored patients and those with events
  • Follow-up times are always positive

# requires ggplot2, e.g. via library(ggplot2) or library(tidyverse)
ggplot(lung, aes(x = time, fill = factor(status))) +
  geom_histogram(bins = 25, alpha = 0.6, position = "identity") +
  scale_fill_manual(values = c("blue", "red"), labels = c("Censored", "Dead")) +
  ezfun::theme_ezbasic() +
  labs(x = "Days",
       y = "Count")

1.9 Components of survival data

For subject \(i\):

  1. Event time \(T_i\)

  2. Censoring time \(C_i\)

  3. Event indicator \(\delta_i\):

    • 1 if event observed (i.e. \(T_i \leq C_i\))
    • 0 if censored (i.e. \(T_i > C_i\))

  4. Observed time \(Y_i = \min(T_i, C_i)\)
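
As a small illustration of these components (a simulated sketch, not data from the lung study), the observed time and event indicator can be constructed from latent event and censoring times as follows:

# simulated sketch: construct Y_i = min(T_i, C_i) and delta_i = 1(T_i <= C_i)
set.seed(1)
T_event <- rexp(5, rate = 1/500)            # hypothetical true event times (days)
C_cens  <- rexp(5, rate = 1/800)            # hypothetical censoring times (days)
Y_obs   <- pmin(T_event, C_cens)            # observed time
delta   <- as.numeric(T_event <= C_cens)    # 1 = event observed, 0 = censored
data.frame(T_event, C_cens, Y_obs, delta)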

The observed times and the event indicator are provided in the lung data:

  • time: Survival time in days (\(Y_i\))
  • status: censoring status 1=censored, 2=dead (\(\delta_i\))
kable(head(lung))

 inst  time  status  age  sex  ph.ecog  ph.karno  pat.karno  meal.cal  wt.loss
    3   306       2   74    1        1        90        100      1175       NA
    3   455       2   68    1        0        90         90      1225       15
    3  1010       1   56    1        0        90         90        NA       15
    5   210       2   57    1        1        90         60      1150       11
    1   883       2   60    1        0       100         90        NA        0
   12  1022       1   74    1        1        50         80       513        0

1.10 Dealing with dates in R

Data will often come with start and end dates rather than pre-calculated survival times. The first step is to make sure these are formatted as dates in R.

Let’s create a small example dataset with variables sx_date for surgery date and last_fup_date for the last follow-up date.

date_ex <- 
  tibble(
    sx_date = c("2007-06-22", "2004-02-13", "2010-10-27"), 
    last_fup_date = c("2017-04-15", "2018-07-04", "2016-10-31")
    )
date_ex
## # A tibble: 3 × 2
##   sx_date    last_fup_date
##   <chr>      <chr>        
## 1 2007-06-22 2017-04-15   
## 2 2004-02-13 2018-07-04   
## 3 2010-10-27 2016-10-31

We see these are both character variables, which will often be the case, but we need them to be formatted as dates.

1.11 Formatting dates - base R

date_ex %>% 
  mutate(
    sx_date = as.Date(sx_date, format = "%Y-%m-%d"), 
    last_fup_date = as.Date(last_fup_date, format = "%Y-%m-%d") 
    )
## # A tibble: 3 × 2
##   sx_date    last_fup_date
##   <date>     <date>       
## 1 2007-06-22 2017-04-15   
## 2 2004-02-13 2018-07-04   
## 3 2010-10-27 2016-10-31
  • Note that in base R the format must include the separator as well as the symbol; for example, if your date is in the format m/d/Y then you would need format = "%m/%d/%Y"
  • See a full list of date format symbols at https://www.statmethods.net/input/dates.html

1.12 Formatting dates - lubridate package

We can also use the lubridate package to format dates. In this case, we use the ymd function:

date_ex %>% 
  mutate(
    sx_date = ymd(sx_date), 
    last_fup_date = ymd(last_fup_date)
    )
## # A tibble: 3 × 2
##   sx_date    last_fup_date
##   <date>     <date>       
## 1 2007-06-22 2017-04-15   
## 2 2004-02-13 2018-07-04   
## 3 2010-10-27 2016-10-31
  • Note that unlike the base R option, the separators do not need to be specified
  • The help page for ?dmy will show all format options.

1.13 Calculating survival times - base R

Now that the dates are formatted, we need to calculate the difference between the start and end time in some units, usually months or years. In base R, use difftime to calculate the number of days between our two dates and convert it to a numeric value using as.numeric. Then convert to years by dividing by 365.25, the average number of days in a year.

date_ex %>% 
  mutate(
    os_yrs = 
      as.numeric(
        difftime(last_fup_date, 
                 sx_date, 
                 units = "days")) / 365.25
    )
## # A tibble: 3 × 3
##   sx_date    last_fup_date os_yrs
##   <date>     <date>         <dbl>
## 1 2007-06-22 2017-04-15      9.82
## 2 2004-02-13 2018-07-04     14.4 
## 3 2010-10-27 2016-10-31      6.01

1.14 Calculating survival times - lubridate

Using the lubridate package, the operator %--% designates a time interval, which is then converted to the number of elapsed seconds using as.duration and finally converted to years by dividing by dyears(1), which gives the number of seconds in a year.

date_ex %>% 
  mutate(
    os_yrs = 
      as.duration(sx_date %--% last_fup_date) / dyears(1)
    )
## # A tibble: 3 × 3
##   sx_date    last_fup_date os_yrs
##   <date>     <date>         <dbl>
## 1 2007-06-22 2017-04-15      9.82
## 2 2004-02-13 2018-07-04     14.4 
## 3 2010-10-27 2016-10-31      6.01
  • Note: we need to load the lubridate package with a call to library in order to access the special operators (similar to the situation with pipes)

1.15 Event indicator

For the components of survival data I mentioned the event indicator:

Event indicator \(\delta_i\):

  • 1 if event observed (i.e. \(T_i \leq C_i\))
  • 0 if censored (i.e. \(T_i > C_i\))

However, in R the Surv function will also accept TRUE/FALSE (TRUE = event) or 1/2 (2 = event).

In the lung data, we have:

  • status: censoring status 1=censored, 2=dead

1.16 Survival function

The probability that a subject will survive beyond any given specified time

\[S(t) = Pr(T>t) = 1 - F(t)\]

\(S(t)\): survival function

\(F(t) = Pr(T \leq t)\): cumulative distribution function

In theory the survival function is smooth; in practice we observe events on a discrete time scale.

1.17 Survival probability

  • Survival probability at a certain time, \(S(t)\), is a conditional probability of surviving beyond that time, given that an individual has survived just prior to that time.
  • Can be estimated as the number of patients who are alive without loss to follow-up at that time, divided by the number of patients who were alive just prior to that time
  • The Kaplan-Meier estimate of survival probability is the product of these conditional probabilities up until that time
  • At time 0, the survival probability is 1, i.e. \(S(t_0) = 1\)
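
Written out, if \(d_i\) subjects have the event and \(n_i\) subjects are at risk at each observed event time \(t_i\), the Kaplan-Meier estimate of survival beyond time \(t\) is the product of these conditional probabilities:

\[\hat{S}(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)\]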

1.18 Creating survival objects

The Kaplan-Meier method is the most common way to estimate survival times and probabilities. It is a non-parametric approach that results in a step function, where there is a step down each time an event occurs.

  • The Surv function from the survival package creates a survival object for use as the response in a model formula. There will be one entry for each subject that is the survival time, which is followed by a + if the subject was censored. Let’s look at the first 10 observations:
Surv(lung$time, lung$status)[1:10]
##  [1]  306   455  1010+  210   883  1022+  310   361   218   166

1.19 Estimating survival curves with the Kaplan-Meier method

  • The survfit function creates survival curves based on a formula. Let’s generate the overall survival curve for the entire cohort, assign it to object f1, and look at the names of that object:
f1 <- survfit(Surv(time, status) ~ 1, data = lung)
names(f1)
##  [1] "n"         "time"      "n.risk"    "n.event"   "n.censor"  "surv"     
##  [7] "std.err"   "cumhaz"    "std.chaz"  "type"      "logse"     "conf.int" 
## [13] "conf.type" "lower"     "upper"     "t0"        "call"

Some key components of this survfit object that will be used to create survival curves include:

  • time, which contains the start and endpoints of each time interval
  • surv, which contains the survival probability corresponding to each time

1.20 Kaplan-Meier plot - base R

Now we plot the survfit object in base R to get the Kaplan-Meier plot.

plot(survfit(Surv(time, status) ~ 1, data = lung), 
     xlab = "Days", 
     ylab = "Overall survival probability", mark.time = T)

  • The default plot in base R shows the step function (solid line) with associated confidence intervals (dotted lines)
  • Horizontal lines represent survival duration for the interval
  • An interval is terminated by an event
  • The height of the vertical lines shows the change in cumulative survival probability at each event time
  • Censored observations are indicated by tick marks; they do not produce a step in the curve, but they reduce the number at risk and so affect the size of later steps. (The tick marks are not shown by default; here they were added using the option mark.time = TRUE)

1.21 Kaplan-Meier plot - ggsurvplot

Alternatively, the ggsurvplot function from the survminer package is built on ggplot2 and can be used to create Kaplan-Meier plots. Check out the cheat sheet for the survminer package.

ggsurvplot(
    fit = survfit(Surv(time, status) ~ 1, data = lung), 
    xlab = "Days", 
    ylab = "Overall survival probability")

  • The default plot using ggsurvplot shows the step function (solid line) with associated confidence bands (shaded area).
  • The tick marks for censored patients are shown by default, somewhat obscuring the line itself in this example, and could be suppressed using the option censor = FALSE

1.22 Estimating \(x\)-year survival

One quantity often of interest in a survival analysis is the probability of surviving beyond a certain number (\(x\)) of years.

For example, to estimate the probability of surviving to \(1\) year, use summary with the times argument (Note the time variable in the lung data is actually in days, so we need to use times = 365.25)

summary(survfit(Surv(time, status) ~ 1, data = lung), times = 365.25)
## Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##   365     65     121    0.409  0.0358        0.345        0.486

We find that the \(1\)-year probability of survival in this study is 41%.

The associated lower and upper bounds of the 95% confidence interval are also displayed.
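
The same call works for several time points at once; for example, to look at 6-month, 1-year and 2-year survival (a minimal sketch, output not shown):

summary(survfit(Surv(time, status) ~ 1, data = lung), times = c(0.5, 1, 2) * 365.25)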

1.23 \(x\)-year survival and the survival curve

The \(1\)-year survival probability is the point on the y-axis that corresponds to \(1\) year on the x-axis for the survival curve.

plot_main <- 
  ggsurvplot(
    data = lung, 
    fit = f1,
    xlab = "Months",
    legend = "none",
    xscale = 30.4,
    break.x.by = 182.4, 
    risk.table = TRUE,
    risk.table.y.text = FALSE)
plot1 <- plot_main
plot1$plot <- plot1$plot + 
  geom_segment(x = 365.25, xend = 365.25, y = -0.05, yend = 0.4092416, 
               size = 1.5) +
  geom_segment(x = 365.25, xend = -40, y = 0.4092416, yend = 0.4092416,
               size = 1.5, 
               arrow = arrow(length = unit(0.2, "inches"))) 
plot1

1.24 \(x\)-year survival is often estimated incorrectly

What happens if you use a “naive” estimate?

121 of the 228 patients died by \(1\) year so:

\[\Big(1 - \frac{121}{228}\Big) \times 100 = 47\%\]

  • You get an incorrect estimate of the \(1\)-year probability of survival when you ignore the fact that 42 patients were censored before \(1\) year.

  • Recall the correct estimate of the \(1\)-year probability of survival was 41%.
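
As a sketch, the naive calculation and the Kaplan-Meier estimate can be contrasted directly in R (assuming, as in the lung data, that status == 2 codes a death):

# naive: proportion not known to have died by 1 year,
# ignoring the 42 patients censored before 1 year
1 - with(lung, sum(status == 2 & time <= 365.25)) / nrow(lung)

# Kaplan-Meier estimate, which accounts for censoring
summary(survfit(Surv(time, status) ~ 1, data = lung), times = 365.25)$surv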

1.25 Impact on \(x\)-year survival in absence of censoring

  • Imagine two studies, each with 228 subjects. There are 165 deaths in each study. No censoring in one (orange line), 63 patients censored in the other (blue line)
  • In the absence of censoring the overall survival probability appears higher; when censoring is present, censored subjects contribute information only for part of the follow-up time and then leave the risk set, pulling down the cumulative probability of survival.
fakedata2 <- lung %>% 
  mutate(time = ifelse(status == 2, time, 1022), 
         group = "No censoring") %>% 
  full_join(mutate(lung, group = "With censoring"))
fit3 <- survfit(Surv(time, status) ~ group, data = fakedata2)
ggsurvplot(
  data = fakedata2, 
  fit = fit3,
  xlab = "Months",
  legend = "bottom",
  legend.title = "",
  legend.labs = c("No censoring", "With censoring"),
  xscale = 30.4,
  break.x.by = 182.4, 
  risk.table = TRUE,
  risk.table.y.text = FALSE)

1.26 Estimating median survival time

Another quantity often of interest in a survival analysis is the average survival time, which we quantify using the median. Survival times are not expected to be normally distributed so the mean is not an appropriate summary.

We can obtain this directly from our survfit object

survfit(Surv(time, status) ~ 1, data = lung)
## Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
## 
##        n events median 0.95LCL 0.95UCL
## [1,] 228    165    310     285     363

We see that the median survival time is 310 days. The lower and upper bounds of the 95% confidence interval are also displayed.
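
Survival quantiles (with confidence intervals) can also be extracted directly using the quantile method for survfit objects; a minimal sketch, output not shown:

quantile(survfit(Surv(time, status) ~ 1, data = lung), probs = c(0.25, 0.5, 0.75))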

1.27 Median survival time and the survival curve

Median survival is the time corresponding to a survival probability of \(0.5\):

plot2 <- plot_main
plot2$plot <- plot2$plot + 
  geom_segment(x = -45, xend = 310, y = 0.5, yend = 0.5,  size = 1.5) +
  geom_segment(x = 310, xend = 310, y = 0.5, yend = -0.03, size = 1.5, 
               arrow = arrow(length = unit(0.2, "inches")))
plot2

1.28 Median survival is often estimated incorrectly

What happens if you use a “naive” estimate?

Summarize the median survival time among the 165 patients who died:

lung %>% 
  filter(status == 2) %>% 
  summarize(median_surv = median(time))
##   median_surv
## 1         226
  • You get an incorrect estimate of median survival time of 226 days when you ignore the fact that censored patients also contribute follow-up time.
  • Recall the correct estimate of median survival time is 310 days.

1.29 Impact on survival of ignoring (removing) censoring

  • Ignoring censoring, i.e. removing patients who are censored from the dataset, creates an artificially lowered survival curve because the follow-up time that censored patients contribute is excluded (orange line)
  • The true survival curve for the lung data is shown in blue for comparison
fakedata <- lung %>% 
  filter(status == 2) %>% 
  mutate(group = "Ignoring censoring") %>% 
  full_join(mutate(lung, group = "With censoring"))
fit2 <- survfit(Surv(time, status) ~ group, data = fakedata)
ggsurvplot(
  data = fakedata, 
  fit = fit2,
  xlab = "Months",
  legend = "bottom",
  legend.title = "",
  legend.labs = c("Ignoring censoring", "With censoring"),
  xscale = 30.4,
  break.x.by = 182.4, 
  risk.table = TRUE, 
  risk.table.y.text = FALSE)

1.30 Comparison between the Kaplan-Meier (KM) estimator and the Fleming-Harrington (FH) estimator

We will use the “Duration of nursing home stay” data.

The National Center for Health Services Research studied 36 for-profit nursing homes to assess the effects of different financial incentives on length of stay. Treated nursing homes received higher per diems for Medicaid patients, and bonuses for improving a patient’s health and sending them home.

Study included 1601 patients admitted between May 1, 1981 and April 30, 1982.

Variables include:

  • los - Length of stay of a resident (in days)
  • age - Age of a resident
  • rx - Nursing home assignment (1:bonuses, 0:no bonuses)
  • gender - Gender (1:male, 0:female)
  • married - (1:married, 0:not married)
  • health - Health status (2:second best, 5:worst)
  • censor - Censoring indicator (1:censored, 0:discharged)

Say we want to obtain the Fleming-Harrington estimate of the survival function for married females, in the healthiest initial subgroup, who are randomized to the untreated group of the nursing home study.

nurshome = read.table("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/NursingHome.dat",
                      header = FALSE, skip = 14,
                      col.names = c("los", "age", "rx", "gender", "married", "health", "censor"))

nurs2 = subset(nurshome, gender==0 & rx==0 & health==2 & married==1)
fit.fh = survfit(Surv(nurs2$los, 1-nurs2$censor) ~ 1,type="fleming-harrington", conf.type="log-log")
summary(fit.fh)
## Call: survfit(formula = Surv(nurs2$los, 1 - nurs2$censor) ~ 1, type = "fleming-harrington", 
##     conf.type = "log-log")
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##    14     12       1   0.9200  0.0767      0.55345        0.988
##    24     11       1   0.8401  0.1036      0.49777        0.957
##    25     10       1   0.7601  0.1207      0.42614        0.916
##    38      9       1   0.6802  0.1318      0.35609        0.866
##    64      8       1   0.6003  0.1384      0.29015        0.810
##    89      7       1   0.5204  0.1412      0.22897        0.749
##   113      6       1   0.4405  0.1402      0.17290        0.682
##   123      5       1   0.3606  0.1356      0.12238        0.609
##   149      4       1   0.2809  0.1268      0.07814        0.531
##   168      3       1   0.2012  0.1129      0.04142        0.446
##   185      2       1   0.1221  0.0917      0.01445        0.352
##   234      1       1   0.0449  0.0562      0.00107        0.245

Compare with KM:

fit.km = survfit(Surv(nurs2$los, 1-nurs2$censor) ~ 1,type="kaplan-meier", conf.type="log-log")
summary(fit.km)
## Call: survfit(formula = Surv(nurs2$los, 1 - nurs2$censor) ~ 1, type = "kaplan-meier", 
##     conf.type = "log-log")
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##    14     12       1   0.9167  0.0798      0.53898        0.988
##    24     11       1   0.8333  0.1076      0.48171        0.956
##    25     10       1   0.7500  0.1250      0.40842        0.912
##    38      9       1   0.6667  0.1361      0.33702        0.860
##    64      8       1   0.5833  0.1423      0.27014        0.801
##    89      7       1   0.5000  0.1443      0.20848        0.736
##   113      6       1   0.4167  0.1423      0.15247        0.665
##   123      5       1   0.3333  0.1361      0.10270        0.588
##   149      4       1   0.2500  0.1250      0.06014        0.505
##   168      3       1   0.1667  0.1076      0.02651        0.413
##   185      2       1   0.0833  0.0798      0.00505        0.311
##   234      1       1   0.0000     NaN           NA           NA

Compare plots:

plot(fit.fh$time, fit.fh$surv,type="n", xlab="Follow up", ylab="Survival curve")
points(fit.fh$time, fit.fh$surv, col="blue", pch=17)
points(fit.km$time,fit.km$surv, col="red", pch=14)
lines(fit.fh$time,fit.fh$surv, col="blue", lwd=2)
lines(fit.km$time, fit.km$surv,col="red", lwd=2)
legend("topright", legend=c("FH", "KM"), col=c("blue", "red"), pch=c(17,14), lty=c(1,1), lwd=c(2,2), bty="n")

Generally, the Fleming-Harrington estimator is slightly higher than the Kaplan-Meier estimator at every time point, but with larger datasets the two will typically be much closer.
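
The reason is easy to see from the two formulas. With \(d_i\) events and \(n_i\) subjects at risk at event time \(t_i\), the Kaplan-Meier estimator multiplies factors \(1 - d_i/n_i\), while the Fleming-Harrington estimator exponentiates minus the Nelson-Aalen cumulative hazard:

\[\hat{S}_{KM}(t) = \prod_{t_i \leq t}\left(1 - \frac{d_i}{n_i}\right), \qquad \hat{S}_{FH}(t) = \exp\left(-\sum_{t_i \leq t}\frac{d_i}{n_i}\right) = \prod_{t_i \leq t}\exp\left(-\frac{d_i}{n_i}\right)\]

Since \(\exp(-x) \geq 1 - x\), each Fleming-Harrington factor is at least as large as the corresponding Kaplan-Meier factor, and the difference becomes negligible when \(d_i/n_i\) is small, i.e. when the risk sets are large.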

2 Survival Analysis in R: comparing survival curves

Load libraries:

library(survival)
library(MASS)
library(rms)
library(ISwR)
library(survMisc)
library(survminer)
library(KMsurv)

2.1 Recap of the basic steps

Loading and viewing dataset melanom:

dati <- melanom
head(dati)
##    no status days ulc thick sex
## 1 789      3   10   1   676   2
## 2  13      3   30   2    65   2
## 3  97      2   35   2   134   2
## 4  16      3   99   2   290   1
## 5  21      1  185   1  1208   2
## 6 469      1  204   1   484   2
attach(dati)

Create the survival object:

msurv <- with(dati, Surv(days, status == 1))
msurv
##   [1]   10+   30+   35+   99+  185   204   210   232   232+  279   295   355+
##  [13]  386   426   469   493+  529   621   629   659   667   718   752   779 
##  [25]  793   817   826+  833   858   869   872   967   977   982  1041  1055 
##  [37] 1062  1075  1156  1228  1252  1271  1312  1427+ 1435  1499+ 1506  1508+
##  [49] 1510+ 1512+ 1516  1525+ 1542+ 1548  1557+ 1560  1563+ 1584  1605+ 1621 
##  [61] 1627+ 1634+ 1641+ 1641+ 1648+ 1652+ 1654+ 1654+ 1667  1678+ 1685+ 1690 
##  [73] 1710+ 1710+ 1726  1745+ 1762+ 1779+ 1787+ 1787+ 1793+ 1804+ 1812+ 1836+
##  [85] 1839+ 1839+ 1854+ 1856+ 1860+ 1864+ 1899+ 1914+ 1919+ 1920+ 1927+ 1933 
##  [97] 1942+ 1955+ 1956+ 1958+ 1963+ 1970+ 2005+ 2007+ 2011+ 2024+ 2028+ 2038+
## [109] 2056+ 2059+ 2061  2062  2075+ 2085+ 2102+ 2103  2104+ 2108  2112+ 2150+
## [121] 2156+ 2165+ 2209+ 2227+ 2227+ 2256  2264+ 2339+ 2361+ 2387+ 2388  2403+
## [133] 2426+ 2426+ 2431+ 2460+ 2467  2492+ 2493+ 2521+ 2542+ 2559+ 2565  2570+
## [145] 2660+ 2666+ 2676+ 2738+ 2782  2787+ 2984+ 3032+ 3040+ 3042  3067+ 3079+
## [157] 3101+ 3144+ 3152+ 3154+ 3180+ 3182+ 3185+ 3199+ 3228+ 3229+ 3278+ 3297+
## [169] 3328+ 3330+ 3338  3383+ 3384+ 3385+ 3388+ 3402+ 3441+ 3458+ 3459+ 3459+
## [181] 3476+ 3523+ 3667+ 3695+ 3695+ 3776+ 3776+ 3830+ 3856+ 3872+ 3909+ 3968+
## [193] 4001+ 4103+ 4119+ 4124+ 4207+ 4310+ 4390+ 4479+ 4492+ 4668+ 4688+ 4926+
## [205] 5565+

Histogram of the follow-up time (days) and descriptive statistics:

hist(days)

summary(days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      10    1525    2005    2153    3042    5565

Estimate the KM curve:

mfit <- survfit(Surv(days, status==1)~1, data = dati)
mfit
## Call: survfit(formula = Surv(days, status == 1) ~ 1, data = dati)
## 
##        n events median 0.95LCL 0.95UCL
## [1,] 205     57     NA      NA      NA

Output of the KM estimate at different time points:

summary(mfit, times=c(1000,3000,4000))
## Call: survfit(formula = Surv(days, status == 1) ~ 1, data = dati)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##  1000    171      26    0.869  0.0240        0.823        0.917
##  3000     54      29    0.677  0.0385        0.605        0.757
##  4000     13       2    0.645  0.0431        0.566        0.735

Graphical representation of the curve:

plot(mfit)
abline(h=0.50, col="red",lty=2)

In this case the median survival time cannot be estimated (the output reports NA): the survival curve never crosses the 50% mark before the last subjects are censored.

If you also want to see the censored times in the plot:

plot(mfit,mark.time=T)
abline(h=0.50, col="red",lty=2)

If you also want to see the number at risk at different time points:

ggsurvplot(mfit, data=dati, risk.table = TRUE)

Remember that the estimate of the curve is reliable only as long as the risk set has a sufficient sample size.
As a rule of thumb, the curve can become unreliable when the number of patients at risk drops below 15; the width of the confidence intervals will help you decide.

2.2 Estimate survival curves by gender:

surv.bysex <- survfit(Surv(days,status==1)~sex)

Plot the curves:

plot(surv.bysex, conf.int=T,col=c("red","black"))

Other types of graphical representations:

ggsurvplot(surv.bysex, data=dati, risk.table = TRUE)

ggsurvplot(surv.bysex, data=dati, risk.table ="abs_pct")

2.3 Comparing survival curves : the log-rank test

We use the log-rank test to see whether there is a significant difference in survival between males and females. The survdiff function implements a family of weighted tests controlled by a parameter called rho, which allows alternatives that emphasize earlier or later differences between the curves; the default value of rho gives the log-rank test.

survdiff(Surv(days,status==1)~sex)
## Call:
## survdiff(formula = Surv(days, status == 1) ~ sex)
## 
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=1 126       28     37.1      2.25      6.47
## sex=2  79       29     19.9      4.21      6.47
## 
##  Chisq= 6.5  on 1 degrees of freedom, p= 0.01

We conclude that there is a significant difference in survival between males and females.
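
As noted above, other members of this family of weighted tests can be requested through the rho argument; for example, rho = 1 gives the Peto & Peto modification of the Gehan-Wilcoxon test (a minimal sketch, output not shown):

survdiff(Surv(days, status == 1) ~ sex, data = dati, rho = 1)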

2.4 Stratified log-rank test using presence of ulcer as strata:

surv.bysex.ulc <- survdiff(Surv(days,status==1)~sex+strata(ulc))
surv.bysex.ulc
## Call:
## survdiff(formula = Surv(days, status == 1) ~ sex + strata(ulc))
## 
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=1 126       28     34.7      1.28      3.31
## sex=2  79       29     22.3      1.99      3.31
## 
##  Chisq= 3.3  on 1 degrees of freedom, p= 0.07

To make separate graphs:

index.ulc1 <- ulc==1
index.ulc2 <- ulc==2

surv.bysex1 <- survfit(Surv(days,status==1)~sex, data=dati[index.ulc1,])
surv.bysex2 <- survfit(Surv(days,status==1)~sex, data=dati[index.ulc2,])


par(pty=("s"), mfrow=c(1,2))
plot(surv.bysex1, conf.int=T,col=c("red","black"))
title("Presence of ulcer")
plot(surv.bysex2, conf.int=T,col=c("red","black"))
title("Absence of ulcer")

Stratification makes the effect of sex less significant. A possible explanation might be that males seek treatment when the disease is in a more advanced state than women do, so that the gender difference is reduced when adjusted for a measure of disease progression.

2.5 Extension for more than two groups

Example of log rank with more than 2 groups:

data(myeloma)
fit <- survfit(Surv(time, event) ~ molecular_group, data = myeloma)
ggsurvplot(fit, legend.labs = levels(myeloma$molecular_group),
           pval = TRUE)

This plot shows the global p value across groups. Let’s see the differences between pairs (adjusting for multiple comparisons):

res <- pairwise_survdiff(Surv(time, event) ~ molecular_group,
                         data = myeloma)
res
## 
##  Pairwise comparisons using Log-Rank test 
## 
## data:  myeloma and molecular_group 
## 
##                  Cyclin D-1 Cyclin D-2 Hyperdiploid Low bone disease MAF  
## Cyclin D-2       0.723      -          -            -                -    
## Hyperdiploid     0.943      0.723      -            -                -    
## Low bone disease 0.723      0.988      0.644        -                -    
## MAF              0.644      0.447      0.523        0.485            -    
## MMSET            0.328      0.103      0.103        0.103            0.723
## Proliferation    0.103      0.038      0.038        0.062            0.485
##                  MMSET
## Cyclin D-2       -    
## Hyperdiploid     -    
## Low bone disease -    
## MAF              -    
## MMSET            -    
## Proliferation    0.527
## 
## P value adjustment method: BH

Proliferation vs Cyclin D-2 and Proliferation vs Hyperdiploid show significant differences in survival curves. No other significant differences are detected between the remaining pairs of groups.
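
By default pairwise_survdiff adjusts the p-values with the Benjamini-Hochberg (BH) method, as reported in the output above; a different correction can be requested through the p.adjust.method argument (a minimal sketch, output not shown):

pairwise_survdiff(Surv(time, event) ~ molecular_group, data = myeloma,
                  p.adjust.method = "bonferroni")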

3 Cox regression in R

Load the required libraries:

library(knitr)
library(tidyverse)
library(lubridate)
library(survival)
library(survminer)
library(gtsummary)
library(ISwR)
library(KMsurv)
library(pmsampsize)

3.1 The Cox regression model

We may want to quantify an effect size for a single variable, or include more than one variable into a regression model to account for the effects of multiple variables on the hazard function.

The Cox regression model is a semi-parametric model that can be used to fit univariable and multivariable regression models that have survival outcomes.

\[h(t|X_i) = h_0(t) \exp(\beta_1 X_{i1} + \cdots + \beta_p X_{ip})\]

\(h(t)\): hazard, or the instantaneous rate at which events occur

\(h_0(t)\): underlying baseline hazard

Some key assumptions of the model:

  • non-informative censoring
  • proportional hazards

Note: parametric regression models for survival outcomes are also available, but they won’t be addressed in this course.

We can fit regression models for survival data using the coxph function, which takes a Surv object on the left hand side and has standard syntax for regression formulas in R on the right hand side.

3.2 The lung dataset

The lung dataset contain subjects with advanced lung cancer from the North Central Cancer Treatment Group. Some variables we will use to demonstrate methods include:

  • time: Survival time in days
  • status: censoring status 1=censored, 2=dead
  • sex: Male=1 Female=2
data("cancer", package="survival")
str(lung)
## 'data.frame':    228 obs. of  10 variables:
##  $ inst     : num  3 3 3 5 1 12 7 11 1 7 ...
##  $ time     : num  306 455 1010 210 883 ...
##  $ status   : num  2 2 1 2 2 1 2 2 2 2 ...
##  $ age      : num  74 68 56 57 60 74 68 71 53 61 ...
##  $ sex      : num  1 1 1 1 1 1 2 2 1 1 ...
##  $ ph.ecog  : num  1 0 0 1 0 1 2 2 1 2 ...
##  $ ph.karno : num  90 90 90 90 100 50 70 60 70 70 ...
##  $ pat.karno: num  100 90 90 60 90 80 60 80 80 70 ...
##  $ meal.cal : num  1175 1225 NA 1150 NA ...
##  $ wt.loss  : num  NA 15 15 11 0 0 10 1 16 34 ...
head(lung)
##   inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
## 1    3  306      2  74   1       1       90       100     1175      NA
## 2    3  455      2  68   1       0       90        90     1225      15
## 3    3 1010      1  56   1       0       90        90       NA      15
## 4    5  210      2  57   1       1       90        60     1150      11
## 5    1  883      2  60   1       0      100        90       NA       0
## 6   12 1022      1  74   1       1       50        80      513       0
coxph(Surv(time, status) ~ sex, data = lung)
## Call:
## coxph(formula = Surv(time, status) ~ sex, data = lung)
## 
##        coef exp(coef) se(coef)      z       p
## sex -0.5310    0.5880   0.1672 -3.176 0.00149
## 
## Likelihood ratio test=10.63  on 1 df, p=0.001111
## n= 228, number of events= 165

3.3 Formatting Cox regression results

We can see a tidy version of the output using the tidy function from the broom package:

broom::tidy(
  coxph(Surv(time, status) ~ sex, data = lung), 
  exp = TRUE
  ) %>% 
  kable()
term  estimate   std.error  statistic  p.value
sex   0.5880028  0.1671786  -3.176385  0.0014912

Or use tbl_regression from the gtsummary package

coxph(Surv(time, status) ~ sex, data = lung) %>% 
  gtsummary::tbl_regression(exp = TRUE) 
Characteristic   HR     95% CI       p-value
sex              0.59   0.42, 0.82   0.001

Abbreviations: CI = Confidence Interval, HR = Hazard Ratio

3.4 Hazard ratios

  • The quantity of interest from a Cox regression model is a hazard ratio (HR). The HR represents the ratio of hazards between two groups at any given point in time.
  • The hazard is the instantaneous rate of occurrence of the event of interest among those who are still at risk for the event; the HR compares this rate between two groups. It is a ratio of rates, not a ratio of risks, though it is commonly interpreted as such.
  • If you have a regression parameter \(\beta\) (from column estimate in our coxph) then HR = \(\exp(\beta)\).
  • A HR < 1 indicates reduced hazard of death whereas a HR > 1 indicates an increased hazard of death.
  • So our HR = 0.59 implies that around 0.6 times as many females are dying as males, at any given time.

The cumulative hazard functions by sex can also be plotted, as sketched below.
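
A minimal sketch using the fun = "cumhaz" transform of plot.survfit (the colors and legend labels below are our own choices, assuming the lung coding sex = 1 for males and sex = 2 for females):

cumhaz_fit <- survfit(Surv(time, status) ~ sex, data = lung)
plot(cumhaz_fit, fun = "cumhaz", col = c("blue", "red"),
     xlab = "Days", ylab = "Cumulative hazard")
legend("topleft", legend = c("Male (sex = 1)", "Female (sex = 2)"),
       col = c("blue", "red"), lty = 1, bty = "n")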

3.5 Assessing proportional hazards (I)

One assumption of the Cox proportional hazards regression model is that the hazards are proportional at each point in time throughout follow-up. How can we check to see if our data meet this assumption?

Use the cox.zph function from the survival package. It results in two main things:

  1. A hypothesis test of whether the effect of each covariate differs according to time, and a global test of all covariates at once.
    • This is done by testing for an interaction effect between the covariate and log(time)
    • A significant p-value indicates that the proportional hazards assumption is violated
  2. Plots of the Schoenfeld residuals
    • Deviation from a zero-slope line is evidence that the proportional hazards assumption is violated
mv_fit <- coxph(Surv(time, status) ~ sex + age, data = lung)
cz <- cox.zph(mv_fit)
print(cz)
##        chisq df    p
## sex    2.608  1 0.11
## age    0.209  1 0.65
## GLOBAL 2.771  2 0.25
plot(cz)

In this example, the PH assumption could be considered valid for both covariates.

3.6 Another example of Cox PH Regression

Now, on the melanoma dataset, consider a Cox model with the single regressor sex:

Loading and viewing dataset melanom:

data("melanom", package="ISwR")
str(melanom)
## 'data.frame':    205 obs. of  6 variables:
##  $ no    : int  789 13 97 16 21 469 685 7 932 944 ...
##  $ status: int  3 3 2 3 1 1 1 1 3 1 ...
##  $ days  : int  10 30 35 99 185 204 210 232 232 279 ...
##  $ ulc   : int  1 2 2 2 1 1 1 1 1 1 ...
##  $ thick : int  676 65 134 290 1208 484 516 1288 322 741 ...
##  $ sex   : int  2 2 2 1 2 2 2 2 1 1 ...
head(melanom)
##    no status days ulc thick sex
## 1 789      3   10   1   676   2
## 2  13      3   30   2    65   2
## 3  97      2   35   2   134   2
## 4  16      3   99   2   290   1
## 5  21      1  185   1  1208   2
## 6 469      1  204   1   484   2
attach(melanom)
## The following objects are masked from dati:
## 
##     days, no, sex, status, thick, ulc

Create the survival object:

mod.sex <- coxph(Surv(days,status==1)~sex)
summary(mod.sex)
## Call:
## coxph(formula = Surv(days, status == 1) ~ sex)
## 
##   n= 205, number of events= 57 
## 
##       coef exp(coef) se(coef)     z Pr(>|z|)  
## sex 0.6622    1.9390   0.2651 2.498   0.0125 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##     exp(coef) exp(-coef) lower .95 upper .95
## sex     1.939     0.5157     1.153      3.26
## 
## Concordance= 0.59  (se = 0.034 )
## Likelihood ratio test= 6.15  on 1 df,   p=0.01
## Wald test            = 6.24  on 1 df,   p=0.01
## Score (logrank) test = 6.47  on 1 df,   p=0.01

Males (sex = 2) have a hazard of death from melanoma that is nearly twice that of females (sex = 1).

A more elaborate example, involving also a continuous covariate:

mod.cov <- coxph(Surv(days,status==1)~sex+log(thick))
summary(mod.cov)
## Call:
## coxph(formula = Surv(days, status == 1) ~ sex + log(thick))
## 
##   n= 205, number of events= 57 
## 
##              coef exp(coef) se(coef)     z Pr(>|z|)    
## sex        0.4580    1.5809   0.2687 1.705   0.0883 .  
## log(thick) 0.7809    2.1834   0.1573 4.963 6.94e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##            exp(coef) exp(-coef) lower .95 upper .95
## sex            1.581     0.6326    0.9337     2.677
## log(thick)     2.183     0.4580    1.6040     2.972
## 
## Concordance= 0.749  (se = 0.033 )
## Likelihood ratio test= 33.45  on 2 df,   p=5e-08
## Wald test            = 31  on 2 df,   p=2e-07
## Score (logrank) test = 32.52  on 2 df,   p=9e-08

Remember that thick is the tumor thickness; we use the logarithm since its distribution is asymmetric. Each one-unit increase in log(thick) is associated with a 2.2-fold increase in a patient's hazard.
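
This hazard ratio is just the exponentiated regression coefficient from the output above:

\[\exp(0.7809) \approx 2.18\]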

3.7 Verifying the PH Assumption (II)

Question: adjusting for log(thick), does the effect of gender follow a proportional hazards model? If the PH assumption holds, the log cumulative hazards for the two groups, adjusting for log(thick), should be roughly parallel:

fit1 <- coxph( Surv(days,status==1) ~ log(thick)+ strata(sex),data=melanom)
plot(survfit(fit1), lty=c(1,1), lwd=c(2,2), col=c("pink", "blue"), fun="cloglog",xlab="Days",ylab="Log-Cumulative Hazard Function", xlim=c(100, 5000) )
legend(100, -1, lty=c(1,1), lwd=c(2,2), col=c("pink", "blue"),legend=c("Females", "Males"),bty="n")

Conclusion: there is no strong evidence of non-proportional hazards. This plot gives a good look at gross departures, but it is far from a formal test.

As before, let’s do a formal test to check PH assumption:

check.ph <- cox.zph(mod.cov, transform="km", global=TRUE)
check.ph
##            chisq df     p
## sex         1.33  1 0.248
## log(thick)  6.23  1 0.013
## GLOBAL      6.70  2 0.035

Visualizing the residual plot: if the PH assumption for log(thick) holds, then the plot of the time-varying coefficient b(t) versus time should lie on a horizontal line. In this case we note that for log(thick) the PH assumption is not valid, so some more advanced option should be applied (for example, introducing an interaction between the covariate and a function of time). We do not pursue this option here.

plot(check.ph[2], xlab="Days")
abline(h=0, lty=3,lwd=2)
abline(h=coef(mod.cov)[2], lty=3,col="red",lwd=2)

We can observe that the effect of log(thick) gradually decreases with time; the red dashed line indicates the Cox model's estimate of the overall log(thick) effect.

A more elaborate example: binary factor + continuous covariate + stratification variable

mod.cov.strat <- coxph(Surv(days,status==1)~sex+log(thick)+strata(ulc))
summary(mod.cov.strat)
## Call:
## coxph(formula = Surv(days, status == 1) ~ sex + log(thick) + 
##     strata(ulc))
## 
##   n= 205, number of events= 57 
## 
##              coef exp(coef) se(coef)     z Pr(>|z|)   
## sex        0.3600    1.4333   0.2702 1.332   0.1828   
## log(thick) 0.5599    1.7505   0.1784 3.139   0.0017 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##            exp(coef) exp(-coef) lower .95 upper .95
## sex            1.433     0.6977     0.844     2.434
## log(thick)     1.750     0.5713     1.234     2.483
## 
## Concordance= 0.673  (se = 0.039 )
## Likelihood ratio test= 13.3  on 2 df,   p=0.001
## Wald test            = 12.88  on 2 df,   p=0.002
## Score (logrank) test = 12.98  on 2 df,   p=0.002

Plotting estimated survival curves in the strata:

plot(survfit(mod.cov.strat), col=c("red","black"),ylim=c(0.5,1), xlab="Days", ylab="Survival")
legend(3000,1, c("Ulcerated Tumor", "Not Ulcerated Tumor"), col=c("red","black"),bty="n", lty=c(1,1)) 

The default for survfit is to generate curves for a pseudo-individual whose covariates are at their mean values. In the present case, that would correspond to a tumor thickness of 1.86 mm and a gender of 1.39 (!). We have been sloppy in not defining sex as a factor variable, but that would not actually give a different result: coxph subtracts the means of the regressors before fitting, so a 1/2 coding is the same as 0/1, which is what a factor with treatment contrasts usually gives:

sex.f <- as.factor(sex)
sex.f
##   [1] 2 2 2 1 2 2 2 2 1 1 1 1 1 2 1 2 2 2 2 2 1 2 2 2 2 1 1 1 1 1 1 2 2 1 2 1 2
##  [38] 2 1 2 1 1 1 2 2 2 2 2 1 1 2 1 1 1 1 1 1 2 1 1 1 1 2 1 1 1 2 1 1 1 1 2 1 2
##  [75] 1 1 1 2 2 1 1 1 2 1 1 2 1 2 2 1 1 1 2 2 2 1 1 1 1 1 1 2 1 2 1 1 2 1 1 2 2
## [112] 1 2 1 2 2 1 1 1 1 1 2 1 1 2 1 1 1 2 1 2 1 1 1 1 1 1 1 2 1 2 1 2 1 1 1 1 1
## [149] 2 2 2 1 1 1 1 2 2 2 1 2 1 2 1 1 1 1 2 1 2 2 1 1 1 1 2 2 1 1 1 2 1 1 1 1 1
## [186] 2 1 2 1 1 2 1 1 1 2 1 2 2 1 1 2 1 1 1 1
## Levels: 1 2
mod.cov.strat.f <- coxph(Surv(days,status==1)~sex.f+log(thick)+strata(ulc))
summary(mod.cov.strat.f)
## Call:
## coxph(formula = Surv(days, status == 1) ~ sex.f + log(thick) + 
##     strata(ulc))
## 
##   n= 205, number of events= 57 
## 
##              coef exp(coef) se(coef)     z Pr(>|z|)   
## sex.f2     0.3600    1.4333   0.2702 1.332   0.1828   
## log(thick) 0.5599    1.7505   0.1784 3.139   0.0017 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##            exp(coef) exp(-coef) lower .95 upper .95
## sex.f2         1.433     0.6977     0.844     2.434
## log(thick)     1.750     0.5713     1.234     2.483
## 
## Concordance= 0.673  (se = 0.039 )
## Likelihood ratio test= 13.3  on 2 df,   p=0.001
## Wald test            = 12.88  on 2 df,   p=0.002
## Score (logrank) test = 12.98  on 2 df,   p=0.002

In order to estimate survival curves for subjects with certain values of the covariates, we could use the option newdata in survfit:

par(mfrow=c(1,2))
plot(survfit(mod.cov.strat, newdata=data.frame(sex=2,thick=194)), xlab = "Days", ylab="Survival",col=c("red","black"),mark.time=F)
title("Male, tumor thickness=194")
legend(1000, .2, c("Ulcerated Tumor", "Not Ulcerated Tumor"), col=c("red","black"),bty="n", lty=c(1,1)) 
plot(survfit(mod.cov.strat, newdata=data.frame(sex=1,thick=194)), xlab = "Days", ylab="Survival",col=c("red","black"),mark.time=F)
title("Female, tumor thickness=194")
legend(1000, .2, c("Ulcerated Tumor", "Not Ulcerated Tumor"), col=c("red","black"),bty="n", lty=c(1,1)) 

par(mfrow=c(1,2))
plot(survfit(mod.cov.strat, newdata=data.frame(sex=2,thick=709)), xlab = "Days", ylab="Survival",col=c("red","black"),mark.time=F) 
title("Male, tumor thickness=709")
legend(1000, .2, c("Ulcerated Tumor", "Not Ulcerated Tumor"), col=c("red","black"),bty="n", lty=c(1,1)) 
plot(survfit(mod.cov.strat, newdata=data.frame(sex=1,thick=709)), xlab = "Days", ylab="Survival",col=c("red","black"),mark.time=F) 
title("Female, tumor thickness=709")
legend(1000, .2, c("Ulcerated Tumor", "Not Ulcerated Tumor"), col=c("red","black"),bty="n", lty=c(1,1)) 

3.8 Sample size for Survival outcomes

Use pmsampsize to calculate the minimum sample size required for developing a multivariable Cox prediction model with a survival outcome using 30 candidate predictors. We know an existing prediction model in the same field has an R-squared adjusted of 0.051. Further, in the previous study the mean follow-up was 2.07 years, and overall event rate was 0.065. We select a timepoint of interest for prediction using the newly developed model of 2 years.

# this code is not evaluated - pmsampsize does not work on mac
pmsampsize(type = "s", rsquared = 0.051, parameters = 30, rate = 0.065,timepoint = 2, meanfup = 2.07)

We need 5143 subjects in our study, according to the assumptions declared above.

4 Evaluating performance for the Cox model

Load the required libraries:

# load packages
library(knitr)
library(tidyverse)
library(lubridate)
library(survival)
library(survminer)
library(gtsummary)
library(ISwR)
library(KMsurv)
library(timeROC)
library(rms)

Load the dataset melanom:

data("melanom", package="ISwR")
str(melanom)
## 'data.frame':    205 obs. of  6 variables:
##  $ no    : int  789 13 97 16 21 469 685 7 932 944 ...
##  $ status: int  3 3 2 3 1 1 1 1 3 1 ...
##  $ days  : int  10 30 35 99 185 204 210 232 232 279 ...
##  $ ulc   : int  1 2 2 2 1 1 1 1 1 1 ...
##  $ thick : int  676 65 134 290 1208 484 516 1288 322 741 ...
##  $ sex   : int  2 2 2 1 2 2 2 2 1 1 ...
head(melanom)
##    no status days ulc thick sex
## 1 789      3   10   1   676   2
## 2  13      3   30   2    65   2
## 3  97      2   35   2   134   2
## 4  16      3   99   2   290   1
## 5  21      1  185   1  1208   2
## 6 469      1  204   1   484   2

4.1 Discrimination for the survival data : time-dependent ROC curve

First of all we evaluate the discrimination accuracy of the tumor thickness as a prognostic biomarker for death from malignant melanoma:

ROC.thick<-timeROC(T=melanom$days,delta=melanom$status,marker=melanom$thick,cause=1, times=c(1800,2000,2200),iid=TRUE)
ROC.thick
## Time-dependent-Roc curve estimated using IPCW  (n=205, with competing risks). 
##        Cases Survivors Other events Censored AUC_1 (%) se_1 AUC_2 (%) se_2
## t=1800    45       124           36        0     75.39 4.10     76.51 3.85
## t=2000    46       103           56        0     75.60 4.14     76.11 3.82
## t=2200    50        83           72        0     72.55 4.40     74.33 3.81
## 
## Method used for estimating IPCW:marginal 
## 
## Total computation time : 0.44  secs.

Note that since we are in a competing risks setting here (two causes of death, and we are interested in cause = 1), timeROC returns two AUC estimates, corresponding to the two definitions of controls it uses in the presence of competing risks; we focus on AUC_1. We can also obtain the confidence intervals for the estimated AUCs:

confint(ROC.thick)
## $CI_AUC_1
##         2.5% 97.5%
## t=1800 67.36 83.41
## t=2000 67.49 83.72
## t=2200 63.92 81.19
## 
## $CB_AUC_1
##         2.5% 97.5%
## t=1800 66.66 84.11
## t=2000 66.78 84.42
## t=2200 63.18 81.93
## 
## $C.alpha.1
##      95% 
## 2.129302 
## 
## $CI_AUC_2
##         2.5% 97.5%
## t=1800 68.96 84.05
## t=2000 68.63 83.60
## t=2200 66.86 81.79
## 
## $CB_AUC_2
##         2.5% 97.5%
## t=1800 68.55 84.46
## t=2000 68.23 84.00
## t=2200 66.46 82.20
## 
## $C.alpha.2
##      95% 
## 2.065624

Then, we estimate a Cox model by adding sex to the tumor thickness:

mod.cov <- coxph(Surv(days,status==1)~sex+log(thick), data=melanom)
summary(mod.cov)
## Call:
## coxph(formula = Surv(days, status == 1) ~ sex + log(thick), data = melanom)
## 
##   n= 205, number of events= 57 
## 
##              coef exp(coef) se(coef)     z Pr(>|z|)    
## sex        0.4580    1.5809   0.2687 1.705   0.0883 .  
## log(thick) 0.7809    2.1834   0.1573 4.963 6.94e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##            exp(coef) exp(-coef) lower .95 upper .95
## sex            1.581     0.6326    0.9337     2.677
## log(thick)     2.183     0.4580    1.6040     2.972
## 
## Concordance= 0.749  (se = 0.033 )
## Likelihood ratio test= 33.45  on 2 df,   p=5e-08
## Wald test            = 31  on 2 df,   p=2e-07
## Score (logrank) test = 32.52  on 2 df,   p=9e-08

We now extract the linear predictor from the Cox model (X*beta): note that the linear predictor of this cause-specific Cox model is evaluated at baseline and it is independent of time.

lin.pred <- predict(mod.cov)

And we now compute the time-dependent discrimination from this model:

ROC.cox <- timeROC(T=melanom$days,delta=melanom$status,marker=lin.pred,cause=1, times=c(1800,2000,2200),iid=TRUE)
ROC.cox
## Time-dependent-Roc curve estimated using IPCW  (n=205, with competing risks). 
##        Cases Survivors Other events Censored AUC_1 (%) se_1 AUC_2 (%) se_2
## t=1800    45       124           36        0     75.76 4.09     76.94 3.85
## t=2000    46       103           56        0     76.17 4.15     76.39 3.83
## t=2200    50        83           72        0     74.33 4.34     75.13 3.75
## 
## Method used for estimating IPCW:marginal 
## 
## Total computation time : 0.2  secs.

along with the estimated confidence intervals:

confint(ROC.cox)
## $CI_AUC_1
##         2.5% 97.5%
## t=1800 67.75 83.77
## t=2000 68.04 84.30
## t=2200 65.82 82.83
## 
## $CB_AUC_1
##         2.5% 97.5%
## t=1800 66.99 84.53
## t=2000 67.26 85.08
## t=2200 65.02 83.64
## 
## $C.alpha.1
##      95% 
## 2.146461 
## 
## $CI_AUC_2
##         2.5% 97.5%
## t=1800 69.40 84.49
## t=2000 68.87 83.90
## t=2200 67.77 82.48
## 
## $CB_AUC_2
##         2.5% 97.5%
## t=1800 68.98 84.91
## t=2000 68.46 84.31
## t=2200 67.37 82.88
## 
## $C.alpha.2
##     95% 
## 2.06678

Now we can compare the AUCs: is there a significant improvement in discrimination by adding sex?

pp1 <- compare(ROC.thick,ROC.cox)
round(pp1$p_values_AUC_1,digits=4)
## t=1800 t=2000 t=2200 
## 0.7713 0.6594 0.1936

No significant increment in discrimination is observed from this test.

We now also add ulcer to the Cox model, this time as a covariate (not as a stratification variable):

mod.cov2 <- coxph(Surv(days,status==1)~sex+log(thick)+ulc, data=melanom)
summary(mod.cov2)
## Call:
## coxph(formula = Surv(days, status == 1) ~ sex + log(thick) + 
##     ulc, data = melanom)
## 
##   n= 205, number of events= 57 
## 
##               coef exp(coef) se(coef)      z Pr(>|z|)   
## sex         0.3813    1.4641   0.2706  1.409  0.15879   
## log(thick)  0.5756    1.7782   0.1794  3.209  0.00133 **
## ulc        -0.9389    0.3911   0.3243 -2.895  0.00379 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
##            exp(coef) exp(-coef) lower .95 upper .95
## sex           1.4641     0.6830    0.8615    2.4883
## log(thick)    1.7782     0.5624    1.2511    2.5273
## ulc           0.3911     2.5571    0.2071    0.7384
## 
## Concordance= 0.768  (se = 0.032 )
## Likelihood ratio test= 42.66  on 3 df,   p=3e-09
## Wald test            = 36.99  on 3 df,   p=5e-08
## Score (logrank) test = 42.81  on 3 df,   p=3e-09

We also extract the linear predictor (X*beta) from this model:

lin.pred2 <- predict(mod.cov2)

And now we compare the performance of this model with that of tumor thickness alone:

ROC.cox2 <- timeROC(T=melanom$days,delta=melanom$status,marker=lin.pred2,cause=1, times=c(1800,2000,2200),iid=TRUE)
ROC.cox2
## Time-dependent-Roc curve estimated using IPCW  (n=205, with competing risks). 
##        Cases Survivors Other events Censored AUC_1 (%) se_1 AUC_2 (%) se_2
## t=1800    45       124           36        0     78.74 3.95     79.36 3.71
## t=2000    46       103           56        0     78.59 4.06     78.58 3.73
## t=2200    50        83           72        0     75.88 4.27     76.66 3.69
## 
## Method used for estimating IPCW:marginal 
## 
## Total computation time : 0.24  secs.
confint(ROC.cox2)
## $CI_AUC_1
##         2.5% 97.5%
## t=1800 71.00 86.47
## t=2000 70.64 86.54
## t=2200 67.51 84.25
## 
## $CB_AUC_1
##         2.5% 97.5%
## t=1800 69.98 87.49
## t=2000 69.59 87.59
## t=2200 66.40 85.35
## 
## $C.alpha.1
##      95% 
## 2.218481 
## 
## $CI_AUC_2
##         2.5% 97.5%
## t=1800 72.10 86.62
## t=2000 71.27 85.88
## t=2200 69.43 83.89
## 
## $CB_AUC_2
##         2.5% 97.5%
## t=1800 71.55 87.17
## t=2000 70.72 86.43
## t=2200 68.88 84.43
## 
## $C.alpha.2
##      95% 
## 2.108273
pp1 <- compare(ROC.thick,ROC.cox2)
round(pp1$p_values_AUC_1,digits=4)
## t=1800 t=2000 t=2200 
## 0.1473 0.2036 0.1781

Again, no significant difference in discrimination is observed.

Finally, we can plot the estimated time-dependent AUCs:

plotAUCcurve(ROC.thick, FP = 2, add = FALSE,conf.int = FALSE,conf.band = FALSE, col = "black")
plotAUCcurve(ROC.cox, FP = 2, add =TRUE, conf.int = FALSE,conf.band = FALSE, col = "blue")
plotAUCcurve(ROC.cox2, FP = 2, add =TRUE, conf.int = FALSE,conf.band = FALSE, col = "red")
title(main="Time-dependent AUCs")
legend("topleft", legend=c("Tumor thickness","Thick + sex", "Thick + sex+ulc"), lty=c(1,1,1),lwd=c(2,2,2), 
       col=c("black","blue", "red"), bty="n")

We can also fix a specific time horizon:

plot(ROC.thick,time=2200,add = FALSE, col = "black",title=FALSE, lwd=2)
plot(ROC.cox,time=2200,add = TRUE, col = "blue",title=FALSE, lwd=2)
plot(ROC.cox2,time=2200,add = TRUE, col = "red",title=FALSE, lwd=2)
legend("bottomright", legend=c("Tumor thickness","Thick + sex", "Thick + sex + ulc"), lty=c(1,1,1),lwd=c(2,2,2), 
       col=c("black","blue", "red"), bty="n")
title(main="Time-dependent ROC curves at 2200 days")

Note that discrimination in this context is focused exclusively on the performance of the linear predictor extracted from the Cox model; we are not considering the survival estimates here.

4.2 Calibration for the Cox model

Just for didactic reasons, in this example we are IGNORING the competing risks present in these data and acting as if there were a single type of event. The method illustrated is therefore valid only in the standard survival setting, not in the competing risks setting (more advanced approaches are needed in that case, for example estimating cumulative incidence curves using a multi-state approach). We now evaluate calibration for the Cox model. First of all, we determine summaries of the variables (effect and plotting ranges, values to adjust to, and overall ranges) for the melanom dataset, since this is required in order to use the functions in the rms package.

options(datadist='dd')
dd <-datadist(melanom)
dd
##                       no status      days ulc     thick sex
## Low:effect      222.0000      1 1525.0000   1   97.0000   1
## Adjust to       469.0000      2 2005.0000   1  194.0000   1
## High:effect     731.0000      3 3042.0000   2  356.0000   2
## Low:prediction   11.0000      1  294.2195   1   32.0000   1
## High:prediction 943.0488      3 4119.2439   2  838.7805   2
## Low               2.0000      1   10.0000   1   10.0000   1
## High            992.0000      3 5565.0000   2 1742.0000   2
## 
## Values:
## 
## status : 1 2 3 
## ulc : 1 2 
## sex : 1 2

Then we re-estimate the Cox model with cph, adding options that fix the time horizon at which calibration will be evaluated, in this case 2000 days:

mod.cov2 <- cph(Surv(days,status==1)~sex+log(thick)+ulc, data=melanom,x=TRUE , y=TRUE,surv=TRUE,time.inc=2000)

Finally, we use the calibrate function to compare the predicted probabilities from the Cox model with the observed ones estimated by Kaplan-Meier:

calkm <- calibrate(mod.cov2, u=2000, m=40, cmethod= 'KM', B=200)
## Using Cox survival estimates at 2000 Days
calkm
## calibrate.cph(fit = mod.cov2, cmethod = "KM", u = 2000, m = 40, 
##     B = 200)
## 
## 
## n=205  B=200  u=2000 Day
## 
##       index.orig    training         test mean.optimism mean.corrected   n
## [1,]  0.01505666 0.015399968  0.028326832  -0.012926864     0.02798352 200
## [2,]  0.01466150 0.006498119 -0.001349244   0.007847363     0.00681414 200
## [3,]  0.01195721 0.006277704  0.004442661   0.001835043     0.01012216 200
## [4,] -0.01622951 0.003399131 -0.013930637   0.017329768    -0.03355928 200
## [5,]  0.01160306 0.004361092  0.002688064   0.001673028     0.00993003 200
##      mean.predicted        KM KM.corrected    std.err
## [1,]      0.4394888 0.4545455    0.4674723 0.17797217
## [2,]      0.6603385 0.6750000    0.6671526 0.10971343
## [3,]      0.8233362 0.8352934    0.8334584 0.07384917
## [4,]      0.8978469 0.8816174    0.8642876 0.06477229
## [5,]      0.9396165 0.9512195    0.9495465 0.03536639
plot(calkm, cex.lab=0.7)

Another, more general approach to estimating calibration curves for survival models is to use flexible adaptive hazard regression (not covered here). For survival models, "predicted" means the predicted survival probability at a single time point, and "observed" refers to the corresponding Kaplan-Meier survival estimate. The number of intervals of predicted survival probability into which subjects are grouped depends on the parameter m, which corresponds to the average number of subjects per group, evaluated at time u.
As an internal validation tool, the bootstrap is used to de-bias the estimates and correct for overfitting, allowing estimation of the likely future calibration performance of the fitted model.

5 Examples on more advanced survival topics

These examples are taken from the paper: “Analysis of time-to-event for observational studies: Guidance to the use of intensity models”, available at : https://pubmed.ncbi.nlm.nih.gov/33043497/.

library(here)       # needed for here() below
library(splines)
library(timereg)
library(knitr)
library(survival)

load(here("datasets", "pad.rda"))

#  Advanced ovarian data from dynpred package
data(ova, package="dynpred")

5.1 Introduction

5.1.1 Peripheral Arterial Disease :Background

Peripheral arterial disease (PAD) is a common circulatory problem in which narrowed arteries reduce blood flow to peripheral limbs, often the legs. It is also likely to be a sign of a more widespread atherosclerosis, and subjects manifesting the disease carry an increased risk for all atherothrombotic events including myocardial infarction, stroke, and cardiac death. The pad data set contains the results of a Slovene study. Briefly, the study was conducted by 74 primary care physicians-researchers, each of whom was asked to recruit 10 subjects with PAD along with 10 age and sex matched controls without PAD; actual recruitment ranged from 1 to 31 pairs. All participating subjects consented to a yearly examination and interview for 5 years. Important endpoints are death, either due to cardiovascular disease (CVD death) or other causes, non-fatal CVD endpoints of infarction and stroke, and patient interventions attributed to the disease such as revascularization procedures.

The final study includes 742 PAD and 713 controls, with baseline data for each subject, measurements at each visit, and endpoints. This analysis will focus on cardiac endpoints and on overall mortality.

5.1.2 Endpoint and follow-up

The starting R data has 3 dataframes:

  • pad1 has one observation per subject containing baseline information,
  • pad2 has one observation per subject visit containing measurements at that visit, and
  • pad3 has an observation for each observed endpoint.

The dataset pad2 contains all the repeated examinations over the 5 years of the study. Note that in this dataset the ID of the subject is repeated across multiple lines (the data are in long format). The same holds for pad3, which reports all the events of interest. In the IDA (Initial Data Analysis) phase we look at the overall follow-up for subjects and decide on a time scale and an endpoint. Per the study design, subjects should have a visit at years 1–5. We see in the figure below that visit 1 dates are stretched out over a full year and, by visit 5, over almost 2 years, a pattern that is unavoidable in observational studies of human subjects. Based on the counts shown in the figure there is no evidence of differential drop-out between the PAD and control subjects.

vcount <- with(pad2, table(pad, visit))  # visit count
vtime <-  pad2$day/365.25
plot(vtime ~ jitter(visit + .3*(pad=="PAD") - .15), pad2, subset=(visit>0),
     col=c(1,2)[as.numeric(pad)], ylim=c(0, max(vtime)), 
     xlab= "Visit number", ylab="Years from initial visit")
#abline(h=1:6, lty=2, col=4)
segments(rep(0,6), 1:6, 1:6+.5, 1:6, lty=2, col='gray')
legend(4,2, c("PAD", "Control"), col=c(2,1), bty='n', pch=19)
text(1:5 - .15, 0:4, vcount[1,-1])
text(1:5 + .15, 0:4, vcount[2,-1])
Time to each visit for the PAD study.

pfig2 <- survfit(Surv(futime/365.25, status=='censor') ~ pad, pad1)
plot(pfig2, fun='event', col=1:2,
     xlab="Years post enrollment", ylab="Fraction censored")
legend(2, .6, c("Control", "PAD", "Ideal"), col=c(1,2,1), lty=c(1,1,2),
       bty='n')
lines(c(0, 4.8, 6.8), c(0,0,1), lty=3)
Observed and ideal follow-up in the PAD study.

The censoring patterns for subjects in the two arms, shown in the figure above, are computed using a reverse Kaplan-Meier (i.e., the roles are reversed: being censored is treated as the event, and having the event is treated as censoring).
There is no evidence of differential follow-up.
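
A useful by-product of the reverse Kaplan-Meier is the median potential follow-up; a minimal sketch, reusing the same fit as in the figure above:

# Median potential follow-up: "censored" is the event, so the KM median is
# the median follow-up time (in years), by group
rkm <- survfit(Surv(futime/365.25, status == 'censor') ~ pad, data = pad1)
print(rkm)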

Given the visit slippage seen in the first figure, an ideal censoring pattern would be a final follow-up that is approximately uniformly distributed between 4.8 and 6.8 years; this is shown as a dotted line in the figure.

Examining the counts of subjects at each visit, the total number of visits is not what was planned (about 25% of subjects do not have a fifth visit), but patients have adhered to the 5-year time commitment quite closely. As stated in the primary paper, a decision was made to truncate all patient data at 5 years for the purpose of analysis.

5.1.3 The time scale of the survival analysis

The initial survival curves can be created using a copy of the baseline data (pad1), with follow-up truncated at 5 years.

pdata <- subset(pad1, select=c(id, pad, sex, age, futime, status))
pdata$death <- ifelse(pdata$status == 'censor', 0, 1) # any cause of death

# truncate follow-up at 5 years
over5 <- pdata$futime > 5*(365.25)
pdata$futime <- pmin(1826, pdata$futime)  # 1826 days = 5 years
pdata$death[over5] <- 0
pdata$status[over5] <- "censor"
table(pdata$status)
## 
##      censor    CV death nonCV death 
##        1296          68          91
# Convert to follow-up in years, and add an age scale
pdata$years <- pdata$futime/365.25
pdata$age1 <- as.numeric(pad1$d.first - pad1$d.birth)/365.25
pdata$age2 <- pdata$age1 + pdata$years
oldpar <- par(mfrow=c(1,2))
psurv1 <- survfit(Surv(years, death) ~ pad + sex, data=pdata)
psurv2 <- survfit(Surv(age1, age2, death) ~ pad+ sex, pdata,
               start.time=55)
plot(psurv1, col=c(1,1,2,2), lty=1:2, fun='event', lwd=2,
     xlab="Years since enrollment", ylab="Death")
legend(.1, .18, c("PAD, male","PAD, female", "Control, male","Control, female"),
       col=c(2,2,1,1), lty=c(2,1,2,1), lwd=2, bty='n', cex=.8)
plot(psurv2,   col=c(1,1,2,2), lty=1:2, fun='event', lwd=2,
      xlab='Age', ylab='Death, given alive at age 55')
Kaplan-Meier curves of overall survival.

Our first analysis will use time since enrollment as the primary time scale. For the PAD patients the time since diagnosis is a natural scale since it represents both the progression of disease and of treatments for the disease. Survival curves for the control subjects serve as a comparison outcome of similarly aged subjects without the disease, but do not have a natural stand-alone interpretation.

Death rates are higher for males than for females, which is no surprise given a mean age at entry of 65 years, and higher for PAD subjects than for the age-matched controls. The right-hand panel shows the curves on the age scale. Note that when using the age scale, we supply two time points to the Surv function: age at entry into the study and age at the event or censoring. These data are both right-censored and left-truncated (a point not discussed in the lessons, but worth knowing if you want to go deeper into the survival framework). In short, left-truncation means that subjects enter observation only once they have survived up to a certain time point (here, their age at study entry).
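
One quick way to see the effect of delayed entry is to inspect the risk set on the age scale: the number at risk at, say, age 65 counts only subjects who have already entered the study and are still under observation at that age. A small sketch using the psurv2 fit above:

# Risk sets under left truncation: subjects contribute only after their entry age
summary(psurv2, times = 65)$n.risk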

Survival curves can equivalently display either the fraction who have died (curves start at 0: 1-KM) or the fraction who have not died (curves start at 1); the choice between the two is often based on tradition or habit.
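
Both displays come from the same fitted object; only the fun argument of the plot changes. An illustrative sketch with the psurv1 fit:

plot(psurv1, col=c(1,1,2,2), lty=1:2, lwd=2,
     xlab="Years since enrollment", ylab="Survival")      # curves start at 1
plot(psurv1, col=c(1,1,2,2), lty=1:2, lwd=2, fun='event',
     xlab="Years since enrollment", ylab="Death")         # 1 - KM, curves start at 0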

5.2 Covariates and Cox model

The analysis will focus on the impact on survival of the covariates PAD, age, sex, and serum values for high density and low density lipoprotein (HDL, LDL). The figure below shows density estimates for each of the three continuous covariates at baseline, by PAD group. By study design there should be no difference in the age distribution. HDL and LDL are slightly lower for the PAD subjects.

oldpar <- par(mfrow=c(1,3), mar=c(5.1, 4.1, 5.1, 1))
dfun <- function(data, x, ...) {
    d0 <- density((data[[x]])[data$pad=="Control"])
    d1 <- density((data[[x]])[data$pad=="PAD"])
    matplot(cbind(d0$x, d1$x), cbind(d0$y, d1$y), type='l', lty=1, col=1:2,
            ...)
}
lunit <- 38.67  # convert from mmol/L to mg/dl
xx <- c(0, 50, 100, 150, 200, 250, 300)

dfun(pdata, "age", xlab="Age", ylab="Density", lwd=2)
legend(51, .02, c("PAD", "Control"), lty=1, lwd=2, col=2:1, bty='n')

dfun(na.omit(subset(pad2, visit==0, c(hdl, pad))),
             "hdl", xlab="HDL (mmol/L)", ylab="Density", lwd=2)
axis(3, xx/lunit, xx)

dfun(na.omit(subset(pad2, visit==0, c(ldl, pad))),
     "ldl", xlab="LDL (mmol/L)", ylab="Density", lwd=2)
axis(3, xx/lunit, xx)
Distributions of three continuous covariates, by group. Blood value scales are in mmol/L (lower) and mg/dl (above).

5.3 A more advanced topic: time-dependent covariates (multiple measurements) and competing risks!

The Cox model fits will use age, sex, baseline values of high density and low density lipoproteins (hdl0, ldl0), time-dependent lipid values (hdl, ldl), and a time-dependent value of “years since enrollment” (etime). Time since enrollment only plays a role in the age-scale models: in a time-since-entry model everyone in a given risk set has exactly the same value of this covariate, and hence its estimated coefficient would be 0.

For the time-dependent analysis we use a counting process form. Mathematically each subject is represented by 3 processes: \(Y_i(t)=1\) if the subject is currently at risk, \(X_i(t)\) is the time-dependent covariate process, and \(N_i(t)\) is the cumulative number of events. In the R code this translates into multiple observations per subject of the start-stop form:

 id, time1, time2, event, x1, x2, ...

where (time1, time2) describe an interval over which the subject is at risk, event corresponds to \(dN(t)\) and will be 1 if the subject had an event at time2 and 0 otherwise, x1, x2, ... are the values of the covariates over the interval, and id is a subject identifier. Subject 9, for instance, is a male with blood values measured on days 0, 583, 945, and 1309, who died on day 1349. Their data have the form

id tstart tstop death  sex  ldl  hdl
 9      0   583     0 male 2.66 1.08
 9    583   945     0 male 3.20 1.53
 9    945  1309     0 male 2.81 1.32
 9   1309  1349     1 male 2.51 1.30
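
As a small aside (not part of the analysis pipeline), these four rows can be typed in by hand and passed to Surv() to see how the (start, stop] counting-process form is interpreted:

# Counting-process intervals for subject 9, entered manually for illustration
sub9 <- data.frame(id = 9,
                   tstart = c(0, 583, 945, 1309),
                   tstop  = c(583, 945, 1309, 1349),
                   death  = c(0, 0, 0, 1),
                   sex = "male",
                   ldl = c(2.66, 3.20, 2.81, 2.51),
                   hdl = c(1.08, 1.53, 1.32, 1.30))
with(sub9, Surv(tstart, tstop, death))   # three censored intervals, one ending in the event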

The lipid values attached to each time interval are the most recent measurements of those variables. Creation and checking of the data often represent a major share of the work in an analysis. There is often concern that, because the data set has multiple rows for the same subject, a valid variance estimate for the model will need to account for correlation.
This is not the case, however.
Internally, there is only one copy of each subject; the multiple rows are merely a way to communicate to the program which covariate value applies at any given time point.

When there are competing risks, such as CVD and non-CVD death, the 0/1 event variable is replaced by a factor with levels “none”, “CVD death”, and “non-CVD death”, which records both when an event occurred and its type.
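
A minimal illustration of this coding with made-up times (the actual recoding for pdata2 and pdata3 appears below): the first level of the factor is interpreted as “no event”, and Surv() then records both the time and the event type.

# Toy example: a multi-state status is a factor whose first level means censoring
etype <- factor(c(0, 1, 2, 0), levels = 0:2,
                labels = c("none", "CVD death", "non-CVD death"))
Surv(time = c(10, 3, 8, 10), event = etype)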

Below we create separate time-dependent data sets for analysis on the time-from-enrollment scale (pdata2) and on the age scale (pdata3).

The two data sets have separate event indicators for death, for the competing risk of CVD/non-CVD death, and for any CVD event/non-CVD death. Creating (time1, time2) data sets on an age-in-years scale can lead to anomalies due to round-off error; a detailed discussion and example can be found in the “Round off error and tied times” vignette of the survival package. For this reason the time-dependent data sets all use time in days. Birth date and entry date are used to create ageday, the age at enrollment in days.

This data set will have two time scales (age and time since enrollment), time dependent covariates (hdl, ldl, etime), and multiple endpoints (all cause death, CVD vs non-CVD death, major CVD event).

A. etime = time since enrollment as a time-dependent variable.
  1. This is a continuous time-dependent covariate; technically it changes minute by minute.
  2. We will be using it in the age-scale fits. If we create it on the age scale, the cuts for survSplit can be limited to the ages at which deaths occur.

B. The tmerge routine can keep track of multiple endpoints on a single time scale; the survSplit routine handles a single endpoint and a single time scale.

C. The tmerge operations need to be sequential: no step that changes the number of rows can be inserted in their midst.

Taking all this together, it is simplest to create two data sets: pdata2 on the time-since-enrollment scale and pdata3 on the age scale. Many of the steps will need to be repeated, e.g., adding time-dependent hdl and ldl.

pdata2 <- tmerge(pdata[, c("id", "pad", "age", "sex")], pdata, 
                 id=id, status = event(futime, status))
pdata2 <- tmerge(pdata2, pad2, id=id, ldl= tdc(day, ldl),
                 hdl = tdc(day, hdl))

# add baseline hdl and ldl as covariates
indx <- match(pdata2$id, pdata2$id)  # points to the first row for each subject
pdata2$ldl0 <- pdata2$ldl[indx]
pdata2$hdl0 <- pdata2$hdl[indx]
pdata2$age10 <- pdata2$age/10   # age in decades

# create "any cvd event" as a competing risk, it will be used later
temp <- subset(pad3, event %in% c("cvd death", "infarction", "stroke"))
temp <- temp[!duplicated(temp$id),]   # first of these
temp <- subset(temp, day< 5*365.25)   # must be <= 5 years
pdata2 <- tmerge(pdata2, temp, id=id, cvstat=event(day),
                 priorcv = tdc(day))

pdata2$death <- ifelse(pdata2$status=="censor", 0, 1)  # all cause death
tstat <- ifelse(pdata2$cvstat>0, pdata2$cvstat, 2*pdata2$death)
pdata2$cvstat <- factor(tstat, 0:2,
                         c("censor", "CVD event", "death"))

# Reprise all the steps, on age scale.  We need to add "age" to the
#  merged-in data sets before adding them.
ageday <- as.numeric(pad1$d.first - pad1$d.birth) # age in days at enrollment
pdata3 <- tmerge(pdata[, c("id", "pad", "sex")], pdata, 
                 id=id, status = event(futime + ageday, status),
                 tstart=ageday, tstop=ageday+futime,
                 options=list(tstart="age1", tstop="age2"))
temp2 <- pad2
temp2$age <- temp2$day + ageday[match(temp2$id, pad1$id)]
pdata3 <- tmerge(pdata3, temp2, id=id, ldl= tdc(age, ldl),
                 hdl = tdc(age, hdl))

# add baseline hdl and ldl as covariates
indx <- match(pdata3$id, pdata3$id) #points to the first row for each subject
pdata3$ldl0 <- pdata3$ldl[indx]
pdata3$hdl0 <- pdata3$hdl[indx]

# Add time since enrollment as a time-dependent covariate.
#  Use a temporary data set + survSplit to create a 'variable' that looks
#  like yet another lab test, then merge it in on the age scale.
#  Time since enrollment will be in weeks, which will create 52*5= 260 obs
#  per subject.  (Time since enrollment in days would be overkill.)
pdata$ageday <- ageday
temp3 <- survSplit(Surv(futime, status) ~ ageday + id, data= pdata,
      cut= seq(7, max(pdata$futime), by=7))
pdata3 <- tmerge(pdata3, temp3, id, etime = tdc(ageday + tstart, tstart))
pdata3$etime <- pdata3$etime/365.25   # recode to time in years

# add 'any cv' as an event
temp$age <- temp$day + ageday[match(temp$id, pad1$id)]
pdata3 <- tmerge(pdata3, temp, id=id, cvstat=event(age),
                 priorcv = tdc(age))

# recode event variables, after all survSplit or tmerge steps are done
pdata3$death <- ifelse(pdata3$status=="censor", 0, 1)  # all cause death
tstat <- ifelse(pdata3$cvstat>0, pdata3$cvstat, 2*pdata3$death)
pdata3$cvstat <- factor(tstat, 0:2, c("censor", "CVD event", "nonCVD-death"))

# Creation of multi-row data sets has pitfalls.  It is always wise to
#  sanity check the results.  
check1 <- survcheck(Surv(tstart, tstop, death) ~ 1, id=id, data= pdata2)
check1  
## Call:
## survcheck(formula = Surv(tstart, tstop, death) ~ 1, data = pdata2, 
##     id = id)
## 
## Unique identifiers       Observations        Transitions 
##               1455               6731                159 
## 
## Transitions table:
##       to
## from      1 (censored)
##   (s0)  159       1296
##   1       0          0
## 
## Number of subjects with 0, 1, ... transitions to each state:
##        count
## state      0   1
##   1     1296 159
##   (any) 1296 159
check2 <- survcheck(Surv(age1, age2, death) ~1, id=id, data= pdata3)
check2
## Call:
## survcheck(formula = Surv(age1, age2, death) ~ 1, data = pdata3, 
##     id = id)
## 
## Unique identifiers       Observations        Transitions 
##               1455             362876                159 
## 
## Transitions table:
##       to
## from      1 (censored)
##   (s0)  159       1296
##   1       0          0
## 
## Number of subjects with 0, 1, ... transitions to each state:
##        count
## state      0   1
##   1     1296 159
##   (any) 1296 159

5.3.1 Overall mortality

The first pair of fits shows that the estimated overall effects of PAD and sex hardly differ between analyses done on the enrollment or age scales. A second pair of fits looks at the coefficient effects within the PAD and control groups. (Including covariate-by-strata interactions gives results equivalent to separate fits for the PAD and control groups; bundling them in a single model allows a formal test of whether the effects differ between the two strata; a short sketch of such a test is given after the model output below.) None of sex, enrollment age, or time since enrollment shows meaningful differences between the PAD and control groups. That is, on the log-hazard scale, PAD appears to act as an additive risk.

# Fit the first set of models: pad, age, sex on the time since enrollment
#  scale (cfit1a) and on the age scale (cfit1b)
cfit1a <- coxph(Surv(tstart, tstop, death) ~ pad + sex + age10, pdata2)
cfit1b <- coxph(Surv(age1, age2, death) ~ pad + sex + etime, pdata3)
# Add interactions
cfit2a <- coxph(Surv(tstart, tstop, death) ~ (sex + age10)*strata(pad), pdata2)
cfit2b <- coxph(Surv(age1, age2, death) ~ (sex + etime)*strata(pad),
                pdata3)
cat("Overall model on time since enrollment scale\n")
## Overall model on time since enrollment scale
print(cfit1a)
## Call:
## coxph(formula = Surv(tstart, tstop, death) ~ pad + sex + age10, 
##     data = pdata2)
## 
##           coef exp(coef) se(coef)     z        p
## padPAD  0.8755    2.4000   0.1739 5.036 4.76e-07
## sexmale 0.6924    1.9985   0.1822 3.800 0.000145
## age10   0.6583    1.9315   0.1053 6.250 4.11e-10
## 
## Likelihood ratio test=82.63  on 3 df, p=< 2.2e-16
## n= 6731, number of events= 159
cat("\nOverall model on age scale\n")
## 
## Overall model on age scale
print(cfit1b)
## Call:
## coxph(formula = Surv(age1, age2, death) ~ pad + sex + etime, 
##     data = pdata3)
## 
##            coef exp(coef) se(coef)     z        p
## padPAD  0.87354   2.39538  0.17400 5.020 5.16e-07
## sexmale 0.70540   2.02466  0.18270 3.861 0.000113
## etime   0.16593   1.18049  0.05917 2.804 0.005042
## 
## Likelihood ratio test=51.15  on 3 df, p=4.538e-11
## n= 362876, number of events= 159
cat("\nInteractions between covariates and treatment, time since enrollment\n") 
## 
## Interactions between covariates and treatment, time since enrollment
print(cfit2a)
## Call:
## coxph(formula = Surv(tstart, tstop, death) ~ (sex + age10) * 
##     strata(pad), data = pdata2)
## 
##                            coef exp(coef) se(coef)      z        p
## sexmale                 0.67576   1.96552  0.33494  2.018 0.043638
## age10                   0.68341   1.98062  0.19354  3.531 0.000414
## sexmale:strata(pad)PAD  0.02208   1.02233  0.39925  0.055 0.955897
## age10:strata(pad)PAD   -0.03498   0.96563  0.23081 -0.152 0.879552
## 
## Likelihood ratio test=55.21  on 4 df, p=2.938e-11
## n= 6731, number of events= 159
cat("\nInteractions between covariates and treatment, age scale\n") 
## 
## Interactions between covariates and treatment, age scale
print(cfit2b)
## Call:
## coxph(formula = Surv(age1, age2, death) ~ (sex + etime) * strata(pad), 
##     data = pdata3)
## 
##                             coef exp(coef)  se(coef)     z      p
## sexmale                0.6980525 2.0098347 0.3357736 2.079 0.0376
## etime                  0.1176904 1.1248958 0.1097373 1.072 0.2835
## sexmale:strata(pad)PAD 0.0001539 1.0001539 0.4003644 0.000 0.9997
## etime:strata(pad)PAD   0.0665382 1.0688018 0.1304441 0.510 0.6100
## 
## Likelihood ratio test=24.44  on 4 df, p=6.516e-05
## n= 362876, number of events= 159

Therefore, in the multivariable Cox model on the follow-up time scale, having PAD and increasing age are associated with lower estimated survival, and males are at higher risk than females. On the age scale these results are confirmed. There is no significant interaction between PAD and sex or age.
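
The formal test mentioned above can be carried out by comparing the stratified model without interactions to the model that includes them; this is a sketch, and cfit0a is an additional fit of our own, not part of the original analysis.

# Likelihood ratio test of whether the sex and age effects differ by PAD status
cfit0a <- coxph(Surv(tstart, tstop, death) ~ sex + age10 + strata(pad), data = pdata2)
anova(cfit0a, cfit2a)   # compares the two nested Cox models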

5.3.2 Cardiovascular death

Since PAD is expected to exert its primary effect via cardiovascular disease (CVD) endpoints, a more interesting analysis focuses on CVD rather than overall mortality. Since non-CVD death is then a competing risk, it must be accounted for in our estimate of the probability in state; in the R survival package this is accomplished by using a multi-level categorical variable as the “status”, rather than 0/1. The survfit function then computes Aalen-Johansen (AJ) estimates of a multi-state model.

oldpar <- par(mfrow=c(1,2))
psurv4 <- survfit(Surv(tstart, tstop, status) ~ pad, id = id, pdata2)
plot(psurv4, col=1:2, lty=c(1,1,2,2), ylim=c(0,.15), xscale=365.25, 
     xlab="Years since enrollment", ylab="Absolute risk")
legend("topleft", c("PAD CVD death", "Control CVD death", 
                 "PAD nCVD death","Control nCVD"),
       col=c(2,1,2,1), lty=c(1,1,2,2), lwd=2, bty='n')

psurv5 <- survfit(Surv(tstart, tstop, cvstat) ~ pad, pdata2, id=id,
                  subset=(priorcv==0))
plot(psurv5, col=1:2, lty=c(1,1,2,2), xscale=365.25, ylim=c(0, .15),
         xlab="Years since enrollment", ylab="Absolute risk")         

legend("topleft", c("Major CVD, PAD", "Major CVD, Control", 
                  "death w/o CVD, PAD","death w/o CVD, Control"),
       col=c(2,1,2,1), lty=c(1,1,2,2), lwd=2, bty='n', cex=.85)
Aalen-Johansen curves for CVD vs. non-CVD death outcomes (left) and for any CVD event vs. death without CVD (right).

The figure shows the Aalen-Johansen estimates of the probability in state for CVD and non-CVD death in the left panel (which could also be labeled “absolute risk”, “competing risk”, or “cumulative incidence”) and the AJ curves for first major CVD event (infarction, stroke, CVD death) versus non-CVD death in the right panel. For controls, CVD deaths are about half of the non-CVD deaths, approximately 2% versus 4% at 5 years; both outcomes are near 8% for the PAD subjects. Nearly 14% of the PAD subjects experience a major CVD event versus 6% for controls.
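
The 5-year figures quoted above can be read directly from the fitted objects; a short sketch (follow-up was truncated at 1826 days, i.e. 5 years):

summary(psurv4, times = 1826)$pstate   # P(state) at 5 years: CVD vs non-CVD death, by group
summary(psurv5, times = 1826)$pstate   # P(state) at 5 years: major CVD event vs death w/o CVD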

The absolute risk curves require that all transitions in the model be considered jointly: incorrect curves will result from a simple Kaplan-Meier of time to CVD, censoring at non-CVD death, or a KM of time to non-CVD death, censoring at CVD.

Interestingly, this is not true for the cumulative hazard estimate: the naive estimate for one outcome, censoring the others, will be identical to the joint estimate.

The Cox model is a model for the hazards and has the same identity: the first coxph call below, which models the entire process, will have the same results for the CVD death portion as the second coxph call, which ignores non-CVD deaths by censoring them; this is verified by the all.equal test.

However, if one desires to predict absolute risk curves from the fitted Cox model then the first form (using the “complete” multi-state approach) is essential.

pcox1 <- coxph(Surv(tstart, tstop, status) ~ pad + sex + age10 + hdl0 + ldl0, 
               pdata2, id=id)
cat("transitions table\n")
## transitions table
pcox1$transitions  # a check of the data
##              to
## from          CV death nonCV death (censored)
##   (s0)              67          91       1275
##   CV death           0           0          0
##   nonCV death        0           0          0
pcox1
## Call:
## coxph(formula = Surv(tstart, tstop, status) ~ pad + sex + age10 + 
##     hdl0 + ldl0, data = pdata2, id = id)
## 
##          
## 1:2           coef exp(coef) se(coef) robust se      z        p
##   padPAD   1.05440   2.87025  0.28291   0.28029  3.762 0.000169
##   sexmale  0.51298   1.67026  0.27813   0.27881  1.840 0.065786
##   age10    0.65698   1.92895  0.16378   0.17894  3.672 0.000241
##   hdl0    -0.30026   0.74063  0.32993   0.44220 -0.679 0.497130
##   ldl0    -0.08128   0.92194  0.12687   0.12754 -0.637 0.523966
## 
##          
## 1:3           coef exp(coef) se(coef) robust se      z        p
##   padPAD   0.71390   2.04194  0.22494   0.22789  3.133  0.00173
##   sexmale  0.75348   2.12438  0.25070   0.25505  2.954  0.00313
##   age10    0.65666   1.92835  0.14006   0.14379  4.567 4.95e-06
##   hdl0    -0.20347   0.81589  0.28098   0.32688 -0.622  0.53363
##   ldl0     0.02213   1.02237  0.10630   0.10857  0.204  0.83850
## 
##  States:  1= (s0), 2= CV death, 3= nonCV death 
## 
## Likelihood ratio test=83.96  on 10 df, p=8.386e-14
## n= 6626, unique id= 1433, number of events= 158 
##    (105 observations deleted due to missingness)
# cause-specific Cox model censoring non CV death 
pcox1b <- coxph(Surv(tstart, tstop, status=="CV death") ~ pad +sex + age10
              + hdl0 + ldl0, data=pdata2)  #censor non-CVD death
pcox1b
## Call:
## coxph(formula = Surv(tstart, tstop, status == "CV death") ~ pad + 
##     sex + age10 + hdl0 + ldl0, data = pdata2)
## 
##             coef exp(coef) se(coef)      z        p
## padPAD   1.05445   2.87039  0.28292  3.727 0.000194
## sexmale  0.51298   1.67027  0.27813  1.844 0.065122
## age10    0.65703   1.92905  0.16379  4.011 6.03e-05
## hdl0    -0.30029   0.74060  0.32993 -0.910 0.362744
## ldl0    -0.08124   0.92197  0.12687 -0.640 0.521959
## 
## Likelihood ratio test=39.51  on 5 df, p=1.874e-07
## n= 6626, number of events= 67 
##    (105 observations deleted due to missingness)
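
The all.equal check referred to above can be made explicit. Assuming the coefficients of the multi-state fit are stored in the order shown in its printout (the 1:2 transition first), the CVD-death coefficients of pcox1 should agree with those of the cause-specific fit pcox1b up to the iteration tolerance; this is a sketch.

# CVD-death coefficients from the joint fit vs the fit that censors non-CVD deaths
all.equal(as.numeric(coef(pcox1)[1:5]), as.numeric(coef(pcox1b)),
          tolerance = 1e-3)
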
pcox2 <- coxph(Surv(tstart, tstop, status) ~ pad + sex + age10+
              + hdl + ldl, data=pdata2, id=id)  # time dependent hdl and ldl
pcox2
## Call:
## coxph(formula = Surv(tstart, tstop, status) ~ pad + sex + age10 + 
##     +hdl + ldl, data = pdata2, id = id)
## 
##          
## 1:2          coef exp(coef) se(coef) robust se      z        p
##   padPAD   0.8751    2.3991   0.2857    0.2745  3.188 0.001433
##   sexmale  0.3094    1.3626   0.2775    0.2707  1.143 0.252954
##   age10    0.6963    2.0064   0.1652    0.1813  3.840 0.000123
##   hdl     -1.5482    0.2126   0.4109    0.4657 -3.325 0.000885
##   ldl     -0.2797    0.7560   0.1479    0.1496 -1.869 0.061609
## 
##          
## 1:3          coef exp(coef) se(coef) robust se      z        p
##   padPAD   0.7545    2.1265   0.2277    0.2256  3.345 0.000824
##   sexmale  0.7847    2.1918   0.2518    0.2461  3.188 0.001432
##   age10    0.6761    1.9662   0.1401    0.1409  4.797 1.61e-06
##   hdl     -0.1489    0.8617   0.2910    0.3901 -0.382 0.702680
##   ldl      0.1338    1.1432   0.1125    0.1070  1.250 0.211251
## 
##  States:  1= (s0), 2= CV death, 3= nonCV death 
## 
## Likelihood ratio test=105.6  on 10 df, p=< 2.2e-16
## n= 6702, unique id= 1453, number of events= 158 
##    (29 observations deleted due to missingness)

There were 67 CVD and 91 non-CVD deaths in 5 years of follow-up; these are the transitions from the initial state (s0) to the CVD death and non-CVD death states, respectively, in the transitions table. No one transitioned from death to another state, which would have indicated an error in the data set. The output above displays the coefficients for the fits with baseline and with time-dependent lipids, for both of the competing outcomes. The joint Cox model shows that, after adjustment for age, sex, and baseline lipids, PAD is associated with a nearly three-fold increase in the rate of CVD death and roughly a two-fold increase in non-CVD death. Age and sex are important predictors as well, as expected, with male sex having a somewhat larger effect on the non-CVD endpoint. The effect of the baseline lipid values was not significant; time-dependent HDL, however, shows an effect on the CVD death endpoint, somewhat reducing the estimated effect of PAD on CVD death.
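
For reporting, hazard ratios with confidence intervals can be tabulated directly from the fitted objects; a brief sketch for the time-dependent lipid model:

# Hazard ratios (exp(coef)) and 95% CIs for both transitions of pcox2
round(exp(cbind(HR = coef(pcox2), confint(pcox2))), 2)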

5.3.3 Predicted probability-in-state (survival) curves

Predicted \(P(state)\) curves can be created from the multi-state fits as well. As with any Cox model prediction, the first task is to choose a set of covariate values for which prediction is desired. We create the predictions for four hypothetical male subjects with ages of 58 and 72 (quartiles of age), and PAD vs. control. For HDL and LDL we use values of 1.3 and 3, which are near the medians. (These curves are for the model of CVD death vs other death, with time-fixed hdl and ldl.)

dummy <- expand.grid(pad=c("PAD", "Control"), age10=c(5.8, 7.2), sex="male",
                     hdl0 =1.3, ldl0 =3)
dummy
##       pad age10  sex hdl0 ldl0
## 1     PAD   5.8 male  1.3    3
## 2 Control   5.8 male  1.3    3
## 3     PAD   7.2 male  1.3    3
## 4 Control   7.2 male  1.3    3
coxpred1 <- survfit(pcox1, newdata= dummy)
dim(coxpred1)
##   data states 
##      4      3
coxpred1$states
## [1] "(s0)"        "CV death"    "nonCV death"

The resulting set of absolute risk curves can be viewed as an array of estimates by stratum, target covariates, and state. The pcox1 fit had no strata, so the first dimension is 1. The resulting curves are shown in the figure below. Because a multi-state model can generate a very large number of curves it is often helpful to break them into multiple panels.

par(mfrow=c(1, 2))
plot(coxpred1[1, 1:2 ,], col=2:1, lty=c(1,1,2,2), lwd=2, xscale=365.25,
     ylim=c(0, .15),
     xlab="Years since enrollment", ylab="predicted P(state), age = 58")
legend("topleft", c("PAD, CVD death", "Control, CVD death",
                   "PAD, nonCVD death", "Control, nonCVD death"),
       col=c(2,1,2,1), lty=c(1,1,2,2), lwd=2, bty='n', cex=.8)
plot(coxpred1[1, 3:4,], col=2:1, lty=c(1,1,2,2), lwd=2, xscale=365.25,
     ylim=c(0, .15),
     xlab="Years since enrollment", ylab="predicted P(state), age=72")

legend("topleft", c("PAD, CVD death", "Control, CVD death",
                   "PAD, nonCVD death", "Control, nonCVD death"),
       col=c(2,1,2,1), lty=c(1,1,2,2), lwd=2, bty='n', cex=.8)
Predicted absolute risk curves from the multistate Cox model: dying of CVD (solid) and from other causes (dashed) by PAD status, for a male enrolled at age 58 (left panel) or 72 (right). The values of baseline HDL and LDL are 1.3 and 3, respectively.