This practical provides an introduction to survival analysis, and to conducting a survival analysis in R.

Introduction to Survival Analysis

We will cover some basics of survival analysis; the following series of tutorial papers can be helpful for additional reading:

Packages

Some packages we’ll be using include:

library(survival)
library(survminer)
library(lubridate)

Moreover, it will be necessary install a library from a github:

devtools::install_github("zabore/ezfun")
library(ezfun)

What is survival data?

Time-to-event data that consist of a distinct start time and end time.

Examples:

Aliases for survival analysis

Because survival analysis is common in many other fields, it also goes by other names

The lung dataset

The lung dataset is available from the survival package in R. The data contain subjects with advanced lung cancer from the North Central Cancer Treatment Group. Some variables we will use to demonstrate methods include:

What is censoring?

Types of censoring

A subject may be censored due to:

Specifically these are examples of right censoring.

Left censoring and interval censoring are also possible, and methods exist to analyze this type of data, but in this course we will be limited to right censoring.

Censored survival data

In this example, how would we compute the proportion of subjects who are event-free at 10 years?

Subjects 6 and 7 were event-free at 10 years. Subjects 2,9 and 10 had the event before 10 years. Subjects 1, 3, 4, 5 and 8 were censored before 10 years, so we don’t know whether they had the event or not by 10 years - how do we incorporate these subjects into our estimate?

Distribution of follow-up time

Components of survival data

For subject \(i\):

  1. Event time \(T_i\)

  2. Censoring time \(C_i\)

  3. Event indicator \(\delta_i\):

    • 1 if event observed (i.e. \(T_i \leq C_i\))
    • 0 if censored (i.e. \(T_i > C_i\))
  4. Observed time \(Y_i = \min(T_i, C_i)\)

The observed times and an event indicator are provided in the lung data

inst time status age sex ph.ecog ph.karno pat.karno meal.cal wt.loss
3 306 2 74 1 1 90 100 1175 NA
3 455 2 68 1 0 90 90 1225 15
3 1010 1 56 1 0 90 90 NA 15
5 210 2 57 1 1 90 60 1150 11
1 883 2 60 1 0 100 90 NA 0
12 1022 1 74 1 1 50 80 513 0

Dealing with dates in R

Data will often come with start and end dates rather than pre-calculated survival times. The first step is to make sure these are formatted as dates in R.

Let’s create a small example dataset with variables sx_date for surgery date and last_fup_date for the last follow-up date.

date_ex <- 
  tibble(
    sx_date = c("2007-06-22", "2004-02-13", "2010-10-27"), 
    last_fup_date = c("2017-04-15", "2018-07-04", "2016-10-31")
    )
date_ex
## # A tibble: 3 × 2
##   sx_date    last_fup_date
##   <chr>      <chr>        
## 1 2007-06-22 2017-04-15   
## 2 2004-02-13 2018-07-04   
## 3 2010-10-27 2016-10-31

We see these are both character variables, which will often be the case, but we need them to be formatted as dates.

Formatting dates - base R

date_ex %>% 
  mutate(
    sx_date = as.Date(sx_date, format = "%Y-%m-%d"), 
    last_fup_date = as.Date(last_fup_date, format = "%Y-%m-%d") 
    )
## # A tibble: 3 × 2
##   sx_date    last_fup_date
##   <date>     <date>       
## 1 2007-06-22 2017-04-15   
## 2 2004-02-13 2018-07-04   
## 3 2010-10-27 2016-10-31

Formatting dates - lubridate package

We can also use the lubridate package to format dates. In this case, use the ymd function

date_ex %>% 
  mutate(
    sx_date = ymd(sx_date), 
    last_fup_date = ymd(last_fup_date)
    )
## # A tibble: 3 × 2
##   sx_date    last_fup_date
##   <date>     <date>       
## 1 2007-06-22 2017-04-15   
## 2 2004-02-13 2018-07-04   
## 3 2010-10-27 2016-10-31

Calculating survival times - base R

Now that the dates formatted, we need to calculate the difference between start and end time in some units, usually months or years. In base R, use difftime to calculate the number of days between our two dates and convert it to a numeric value using as.numeric. Then convert to years by dividing by 365.25, the average number of days in a year.

date_ex %>% 
  mutate(
    os_yrs = 
      as.numeric(
        difftime(last_fup_date, 
                 sx_date, 
                 units = "days")) / 365.25
    )
## # A tibble: 3 × 3
##   sx_date    last_fup_date os_yrs
##   <date>     <date>         <dbl>
## 1 2007-06-22 2017-04-15      9.82
## 2 2004-02-13 2018-07-04     14.4 
## 3 2010-10-27 2016-10-31      6.01

Calculating survival times - lubridate

Using the lubridate package, the operator %--% designates a time interval, which is then converted to the number of elapsed seconds using as.duration and finally converted to years by dividing by dyears(1), which gives the number of seconds in a year.

date_ex %>% 
  mutate(
    os_yrs = 
      as.duration(sx_date %--% last_fup_date) / dyears(1)
    )
## # A tibble: 3 × 3
##   sx_date    last_fup_date os_yrs
##   <date>     <date>         <dbl>
## 1 2007-06-22 2017-04-15      9.82
## 2 2004-02-13 2018-07-04     14.4 
## 3 2010-10-27 2016-10-31      6.01

Event indicator

For the components of survival data I mentioned the event indicator:

Event indicator \(\delta_i\):

However, in R the Surv function will also accept TRUE/FALSE (TRUE = event) or 1/2 (2 = event).

In the lung data, we have:

Survival function

The probability that a subject will survive beyond any given specified time

\[S(t) = Pr(T>t) = 1 - F(t)\]

\(S(t)\): survival function \(F(t) = Pr(T \leq t)\): cumulative distribution function

In theory the survival function is smooth; in practice we observe events on a discrete time scale.

Survival probability

Creating survival objects

The Kaplan-Meier method is the most common way to estimate survival times and probabilities. It is a non-parametric approach that results in a step function, where there is a step down each time an event occurs.

Surv(lung$time, lung$status)[1:10]
##  [1]  306   455  1010+  210   883  1022+  310   361   218   166

Estimating survival curves with the Kaplan-Meier method

f1 <- survfit(Surv(time, status) ~ 1, data = lung)
names(f1)
##  [1] "n"         "time"      "n.risk"    "n.event"   "n.censor"  "surv"     
##  [7] "std.err"   "cumhaz"    "std.chaz"  "type"      "logse"     "conf.int" 
## [13] "conf.type" "lower"     "upper"     "call"

Some key components of this survfit object that will be used to create survival curves include:

Kaplan-Meier plot - base R

Now we plot the survfit object in base R to get the Kaplan-Meier plot.

plot(survfit(Surv(time, status) ~ 1, data = lung), 
     xlab = "Days", 
     ylab = "Overall survival probability")

Kaplan-Meier plot - ggsurvplot

Alternatively, the ggsurvplot function from the survminer package is built on ggplot2, and can be used to create Kaplan-Meier plots. Checkout the cheatsheet for the survminer package.

ggsurvplot(
    fit = survfit(Surv(time, status) ~ 1, data = lung), 
    xlab = "Days", 
    ylab = "Overall survival probability")

Estimating \(x\)-year survival

One quantity often of interest in a survival analysis is the probability of surviving beyond a certain number (\(x\)) of years.

For example, to estimate the probability of surviving to \(1\) year, use summary with the times argument (Note the time variable in the lung data is actually in days, so we need to use times = 365.25)

summary(survfit(Surv(time, status) ~ 1, data = lung), times = 365.25)
## Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##   365     65     121    0.409  0.0358        0.345        0.486

We find that the \(1\)-year probability of survival in this study is 41%.

The associated lower and upper bounds of the 95% confidence interval are also displayed.

\(x\)-year survival and the survival curve

The \(1\)-year survival probability is the point on the y-axis that corresponds to \(1\) year on the x-axis for the survival curve.

plot_main <- 
  ggsurvplot(
    data = lung, 
    fit = f1,
    xlab = "Months",
    legend = "none",
    xscale = 30.4,
    break.x.by = 182.4, 
    risk.table = TRUE,
    risk.table.y.text = FALSE)
plot1 <- plot_main
plot1$plot <- plot1$plot + 
  geom_segment(x = 365.25, xend = 365.25, y = -0.05, yend = 0.4092416, 
               size = 1.5) +
  geom_segment(x = 365.25, xend = -40, y = 0.4092416, yend = 0.4092416,
               size = 1.5, 
               arrow = arrow(length = unit(0.2, "inches"))) 
plot1

\(x\)-year survival is often estimated incorrectly

What happens if you use a “naive” estimate?

121 of the 228 patients died by \(1\) year so:

\[\Big(1 - \frac{121}{228}\Big) \times 100 = 47\%\]

Impact on \(x\)-year survival in absence of censoring

fakedata2 <- lung %>% 
  mutate(time = ifelse(status == 2, time, 1022), 
         group = "No censoring") %>% 
  full_join(mutate(lung, group = "With censoring"))
fit3 <- survfit(Surv(time, status) ~ group, data = fakedata2)
ggsurvplot(
  data = fakedata2, 
  fit = fit3,
  xlab = "Months",
  legend = "bottom",
  legend.title = "",
  legend.labs = c("No censoring", "With censoring"),
  xscale = 30.4,
  break.x.by = 182.4, 
  risk.table = TRUE,
  risk.table.y.text = FALSE)

Estimating median survival time

Another quantity often of interest in a survival analysis is the average survival time, which we quantify using the median. Survival times are not expected to be normally distributed so the mean is not an appropriate summary.

We can obtain this directly from our survfit object

survfit(Surv(time, status) ~ 1, data = lung)
## Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
## 
##        n events median 0.95LCL 0.95UCL
## [1,] 228    165    310     285     363

We see the median survival time is 310 days The lower and upper bounds of the 95% confidence interval are also displayed.

Median survival time and the survival curve

Median survival is the time corresponding to a survival probability of \(0.5\):

plot2 <- plot_main
plot2$plot <- plot2$plot + 
  geom_segment(x = -45, xend = 310, y = 0.5, yend = 0.5,  size = 1.5) +
  geom_segment(x = 310, xend = 310, y = 0.5, yend = -0.03, size = 1.5, 
               arrow = arrow(length = unit(0.2, "inches")))
plot2

Median survival is often estimated incorrectly

What happens if you use a “naive” estimate?

Summarize the median survival time among the 165 patients who died

lung %>% 
  filter(status == 2) %>% 
  summarize(median_surv = median(time))
##   median_surv
## 1         226

Impact on survival of ignoring (removing) censoring

fakedata <- lung %>% 
  filter(status == 2) %>% 
  mutate(group = "Ignoring censoring") %>% 
  full_join(mutate(lung, group = "With censoring"))
fit2 <- survfit(Surv(time, status) ~ group, data = fakedata)
ggsurvplot(
  data = fakedata, 
  fit = fit2,
  xlab = "Months",
  legend = "bottom",
  legend.title = "",
  legend.labs = c("Ignoring censoring", "With censoring"),
  xscale = 30.4,
  break.x.by = 182.4, 
  risk.table = TRUE, 
  risk.table.y.text = FALSE)

Comparison between the KM estimator and the FH estimator

We will use the “Duration of nursing home stay” data.

The National Center for Health Services Research studied 36 for-profit nursing homes to assess the effects of different financial incentives on length of stay. Treated nursing homes received higher per diems for Medicaid patients, and bonuses for improving a patient’s health and sending them home.

Study included 1601 patients admitted between May 1, 1981 and April 30, 1982.

Variables include:

  • los - Length of stay of a resident (in days)
  • age - Age of a resident
  • rx - Nursing home assignment (1:bonuses, 0:no bonuses)
  • gender - Gender (1:male, 0:female)
  • married - (1:married, 0:not married)
  • health - Health status (2:second best, 5:worst)
  • censor - Censoring indicator (1:censored, 0:discharged)

Say we want to obtain the Fleming-Harrington estimate of the survival function for married females, in the healthiest initial subgroup, who are randomized to the untreated group of the nursing home study.

nurshome = read.table("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/NursingHome.dat",
head=F, skip=14, col.names = c("los", "age", "rx", "gender", "married", "health", "censor"))

nurs2 = subset(nurshome, gender==0 & rx==0 & health==2 & married==1)
fit.fh = survfit(Surv(nurs2$los, 1-nurs2$censor) ~ 1,type="fleming-harrington", conf.type="log-log")
summary(fit.fh)
## Call: survfit(formula = Surv(nurs2$los, 1 - nurs2$censor) ~ 1, type = "fleming-harrington", 
##     conf.type = "log-log")
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##    14     12       1   0.9200  0.0767      0.55345        0.988
##    24     11       1   0.8401  0.1036      0.49777        0.957
##    25     10       1   0.7601  0.1207      0.42614        0.916
##    38      9       1   0.6802  0.1318      0.35609        0.866
##    64      8       1   0.6003  0.1384      0.29015        0.810
##    89      7       1   0.5204  0.1412      0.22897        0.749
##   113      6       1   0.4405  0.1402      0.17290        0.682
##   123      5       1   0.3606  0.1356      0.12238        0.609
##   149      4       1   0.2809  0.1268      0.07814        0.531
##   168      3       1   0.2012  0.1129      0.04142        0.446
##   185      2       1   0.1221  0.0917      0.01445        0.352
##   234      1       1   0.0449  0.0562      0.00107        0.245

Compare with KM:

fit.km = survfit(Surv(nurs2$los, 1-nurs2$censor) ~ 1,type="kaplan-meier", conf.type="log-log")
summary(fit.km)
## Call: survfit(formula = Surv(nurs2$los, 1 - nurs2$censor) ~ 1, type = "kaplan-meier", 
##     conf.type = "log-log")
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##    14     12       1   0.9167  0.0798      0.53898        0.988
##    24     11       1   0.8333  0.1076      0.48171        0.956
##    25     10       1   0.7500  0.1250      0.40842        0.912
##    38      9       1   0.6667  0.1361      0.33702        0.860
##    64      8       1   0.5833  0.1423      0.27014        0.801
##    89      7       1   0.5000  0.1443      0.20848        0.736
##   113      6       1   0.4167  0.1423      0.15247        0.665
##   123      5       1   0.3333  0.1361      0.10270        0.588
##   149      4       1   0.2500  0.1250      0.06014        0.505
##   168      3       1   0.1667  0.1076      0.02651        0.413
##   185      2       1   0.0833  0.0798      0.00505        0.311
##   234      1       1   0.0000     NaN           NA           NA

Compare plots:

plot(fit.fh$time, fit.fh$surv,type="n", xlab="Follow up", ylab="Survival curve")
points(fit.fh$time, fit.fh$surv, col="blue", pch=17)
points(fit.km$time,fit.km$surv, col="red", pch=14)
lines(fit.fh$time,fit.fh$surv, col="blue", lwd=2)
lines(fit.km$time, fit.km$surv,col="red", lwd=2)
legend("topright", legend=c("FH", "KM"), col=c("blue", "red"), pch=c(17,14), lty=c(1,1), lwd=c(2,2), bty="n")

Generally, the Fleming Harrington estimator is slightly higher than the KM at every time point, but with larger datasets the two will typically be much closer.