This practical provides an introduction to survival analysis, and to conducting a survival analysis in R.
We will cover some basics of survival analysis; the following series of tutorial papers can be helpful for additional reading:
Clark, T., Bradburn, M., Love, S., & Altman, D. (2003). Survival analysis part I: Basic concepts and first analyses. 232-238. ISSN 0007-0920.
M J Bradburn, T G Clark, S B Love, & D G Altman. (2003). Survival Analysis Part II: Multivariate data analysis - an introduction to concepts and methods. British Journal of Cancer, 89(3), 431-436.
Bradburn, M., Clark, T., Love, S., & Altman, D. (2003). Survival analysis Part III: Multivariate data analysis – choosing a model and assessing its adequacy and fit. 89(4), 605-11.
Clark, T., Bradburn, M., Love, S., & Altman, D. (2003). Survival analysis part IV: Further concepts and methods in survival analysis. 781-786. ISSN 0007-0920.
Some packages we’ll be using include:
lubridate
survival
survminer
library(survival)
library(survminer)
library(lubridate)
Moreover, it will be necessary install a library from a github:
devtools::install_github("zabore/ezfun")
library(ezfun)
Time-to-event data that consist of a distinct start time and end time.
Examples:
Because survival analysis is common in many other fields, it also goes by other names
The lung
dataset is available from the
survival
package in R
. The data contain
subjects with advanced lung cancer from the North Central Cancer
Treatment Group. Some variables we will use to demonstrate methods
include:
A subject may be censored due to:
Specifically these are examples of right censoring.
Left censoring and interval censoring are also possible, and methods exist to analyze this type of data, but in this course we will be limited to right censoring.
In this example, how would we compute the proportion of subjects who are event-free at 10 years?
Subjects 6 and 7 were event-free at 10 years. Subjects 2,9 and 10 had the event before 10 years. Subjects 1, 3, 4, 5 and 8 were censored before 10 years, so we don’t know whether they had the event or not by 10 years - how do we incorporate these subjects into our estimate?
For subject \(i\):
Event time \(T_i\)
Censoring time \(C_i\)
Event indicator \(\delta_i\):
Observed time \(Y_i = \min(T_i, C_i)\)
The observed times and an event indicator are provided in the
lung
data
inst | time | status | age | sex | ph.ecog | ph.karno | pat.karno | meal.cal | wt.loss |
---|---|---|---|---|---|---|---|---|---|
3 | 306 | 2 | 74 | 1 | 1 | 90 | 100 | 1175 | NA |
3 | 455 | 2 | 68 | 1 | 0 | 90 | 90 | 1225 | 15 |
3 | 1010 | 1 | 56 | 1 | 0 | 90 | 90 | NA | 15 |
5 | 210 | 2 | 57 | 1 | 1 | 90 | 60 | 1150 | 11 |
1 | 883 | 2 | 60 | 1 | 0 | 100 | 90 | NA | 0 |
12 | 1022 | 1 | 74 | 1 | 1 | 50 | 80 | 513 | 0 |
Data will often come with start and end dates rather than pre-calculated survival times. The first step is to make sure these are formatted as dates in R.
Let’s create a small example dataset with variables
sx_date
for surgery date and last_fup_date
for
the last follow-up date.
date_ex <-
tibble(
sx_date = c("2007-06-22", "2004-02-13", "2010-10-27"),
last_fup_date = c("2017-04-15", "2018-07-04", "2016-10-31")
)
date_ex
## # A tibble: 3 × 2
## sx_date last_fup_date
## <chr> <chr>
## 1 2007-06-22 2017-04-15
## 2 2004-02-13 2018-07-04
## 3 2010-10-27 2016-10-31
We see these are both character variables, which will often be the case, but we need them to be formatted as dates.
date_ex %>%
mutate(
sx_date = as.Date(sx_date, format = "%Y-%m-%d"),
last_fup_date = as.Date(last_fup_date, format = "%Y-%m-%d")
)
## # A tibble: 3 × 2
## sx_date last_fup_date
## <date> <date>
## 1 2007-06-22 2017-04-15
## 2 2004-02-13 2018-07-04
## 3 2010-10-27 2016-10-31
R
the format must include the
separator as well as the symbol. e.g. if your date is in format m/d/Y
then you would need format = "%m/%d/%Y"
We can also use the lubridate
package to format dates.
In this case, use the ymd
function
date_ex %>%
mutate(
sx_date = ymd(sx_date),
last_fup_date = ymd(last_fup_date)
)
## # A tibble: 3 × 2
## sx_date last_fup_date
## <date> <date>
## 1 2007-06-22 2017-04-15
## 2 2004-02-13 2018-07-04
## 3 2010-10-27 2016-10-31
R
option, the separators do
not need to be specified?dmy
will show all format
options.Now that the dates formatted, we need to calculate the difference
between start and end time in some units, usually months or years. In
base R
, use difftime
to calculate the number
of days between our two dates and convert it to a numeric value using
as.numeric
. Then convert to years by dividing by
365.25
, the average number of days in a year.
date_ex %>%
mutate(
os_yrs =
as.numeric(
difftime(last_fup_date,
sx_date,
units = "days")) / 365.25
)
## # A tibble: 3 × 3
## sx_date last_fup_date os_yrs
## <date> <date> <dbl>
## 1 2007-06-22 2017-04-15 9.82
## 2 2004-02-13 2018-07-04 14.4
## 3 2010-10-27 2016-10-31 6.01
Using the lubridate
package, the operator
%--%
designates a time interval, which is then converted to
the number of elapsed seconds using as.duration
and finally
converted to years by dividing by dyears(1)
, which gives
the number of seconds in a year.
date_ex %>%
mutate(
os_yrs =
as.duration(sx_date %--% last_fup_date) / dyears(1)
)
## # A tibble: 3 × 3
## sx_date last_fup_date os_yrs
## <date> <date> <dbl>
## 1 2007-06-22 2017-04-15 9.82
## 2 2004-02-13 2018-07-04 14.4
## 3 2010-10-27 2016-10-31 6.01
lubridate
package
using a call to library
in order to be able to access the
special operators (similar to situation with pipes)For the components of survival data I mentioned the event indicator:
Event indicator \(\delta_i\):
However, in R
the Surv
function will also
accept TRUE/FALSE (TRUE = event) or 1/2 (2 = event).
In the lung
data, we have:
The probability that a subject will survive beyond any given specified time
\[S(t) = Pr(T>t) = 1 - F(t)\]
\(S(t)\): survival function \(F(t) = Pr(T \leq t)\): cumulative distribution function
In theory the survival function is smooth; in practice we observe events on a discrete time scale.
The Kaplan-Meier method is the most common way to estimate survival times and probabilities. It is a non-parametric approach that results in a step function, where there is a step down each time an event occurs.
Surv
function from the survival
package creates a survival object for use as the response in a model
formula. There will be one entry for each subject that is the survival
time, which is followed by a +
if the subject was censored.
Let’s look at the first 10 observations:Surv(lung$time, lung$status)[1:10]
## [1] 306 455 1010+ 210 883 1022+ 310 361 218 166
survfit
function creates survival curves based on a
formula. Let’s generate the overall survival curve for the entire
cohort, assign it to object f1
, and look at the
names
of that object:f1 <- survfit(Surv(time, status) ~ 1, data = lung)
names(f1)
## [1] "n" "time" "n.risk" "n.event" "n.censor" "surv"
## [7] "std.err" "cumhaz" "std.chaz" "type" "logse" "conf.int"
## [13] "conf.type" "lower" "upper" "call"
Some key components of this survfit
object that will be
used to create survival curves include:
time
, which contains the start and endpoints of each
time intervalsurv
, which contains the survival probability
corresponding to each time
Now we plot the survfit
object in base R
to
get the Kaplan-Meier plot.
plot(survfit(Surv(time, status) ~ 1, data = lung),
xlab = "Days",
ylab = "Overall survival probability")
R
shows the step function
(solid line) with associated confidence intervals (dotted lines)mark.time = TRUE
)Alternatively, the ggsurvplot
function from the
survminer
package is built on ggplot2
, and can
be used to create Kaplan-Meier plots. Checkout the cheatsheet
for the survminer
package.
ggsurvplot(
fit = survfit(Surv(time, status) ~ 1, data = lung),
xlab = "Days",
ylab = "Overall survival probability")
ggsurvplot
shows the step
function (solid line) with associated confidence bands (shaded
area).censor = FALSE
One quantity often of interest in a survival analysis is the probability of surviving beyond a certain number (\(x\)) of years.
For example, to estimate the probability of surviving to \(1\) year, use summary
with the
times
argument (Note the time
variable in the lung
data is actually in days, so we need
to use times = 365.25
)
summary(survfit(Surv(time, status) ~ 1, data = lung), times = 365.25)
## Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 365 65 121 0.409 0.0358 0.345 0.486
We find that the \(1\)-year probability of survival in this study is 41%.
The associated lower and upper bounds of the 95% confidence interval are also displayed.
The \(1\)-year survival probability is the point on the y-axis that corresponds to \(1\) year on the x-axis for the survival curve.
plot_main <-
ggsurvplot(
data = lung,
fit = f1,
xlab = "Months",
legend = "none",
xscale = 30.4,
break.x.by = 182.4,
risk.table = TRUE,
risk.table.y.text = FALSE)
plot1 <- plot_main
plot1$plot <- plot1$plot +
geom_segment(x = 365.25, xend = 365.25, y = -0.05, yend = 0.4092416,
size = 1.5) +
geom_segment(x = 365.25, xend = -40, y = 0.4092416, yend = 0.4092416,
size = 1.5,
arrow = arrow(length = unit(0.2, "inches")))
plot1
What happens if you use a “naive” estimate?
121 of the 228 patients died by \(1\) year so:
\[\Big(1 - \frac{121}{228}\Big) \times 100 = 47\%\]
You get an incorrect estimate of the \(1\)-year probability of survival when you ignore the fact that 42 patients were censored before \(1\) year.
Recall the correct estimate of the \(1\)-year probability of survival was 41%.
fakedata2 <- lung %>%
mutate(time = ifelse(status == 2, time, 1022),
group = "No censoring") %>%
full_join(mutate(lung, group = "With censoring"))
fit3 <- survfit(Surv(time, status) ~ group, data = fakedata2)
ggsurvplot(
data = fakedata2,
fit = fit3,
xlab = "Months",
legend = "bottom",
legend.title = "",
legend.labs = c("No censoring", "With censoring"),
xscale = 30.4,
break.x.by = 182.4,
risk.table = TRUE,
risk.table.y.text = FALSE)
Another quantity often of interest in a survival analysis is the average survival time, which we quantify using the median. Survival times are not expected to be normally distributed so the mean is not an appropriate summary.
We can obtain this directly from our survfit
object
survfit(Surv(time, status) ~ 1, data = lung)
## Call: survfit(formula = Surv(time, status) ~ 1, data = lung)
##
## n events median 0.95LCL 0.95UCL
## [1,] 228 165 310 285 363
We see the median survival time is 310 days The lower and upper bounds of the 95% confidence interval are also displayed.
Median survival is the time corresponding to a survival probability of \(0.5\):
plot2 <- plot_main
plot2$plot <- plot2$plot +
geom_segment(x = -45, xend = 310, y = 0.5, yend = 0.5, size = 1.5) +
geom_segment(x = 310, xend = 310, y = 0.5, yend = -0.03, size = 1.5,
arrow = arrow(length = unit(0.2, "inches")))
plot2
What happens if you use a “naive” estimate?
Summarize the median survival time among the 165 patients who died
lung %>%
filter(status == 2) %>%
summarize(median_surv = median(time))
## median_surv
## 1 226
lung
data is shown in
blue for comparisonfakedata <- lung %>%
filter(status == 2) %>%
mutate(group = "Ignoring censoring") %>%
full_join(mutate(lung, group = "With censoring"))
fit2 <- survfit(Surv(time, status) ~ group, data = fakedata)
ggsurvplot(
data = fakedata,
fit = fit2,
xlab = "Months",
legend = "bottom",
legend.title = "",
legend.labs = c("Ignoring censoring", "With censoring"),
xscale = 30.4,
break.x.by = 182.4,
risk.table = TRUE,
risk.table.y.text = FALSE)
We will use the “Duration of nursing home stay” data.
The National Center for Health Services Research studied 36 for-profit nursing homes to assess the effects of different financial incentives on length of stay. Treated nursing homes received higher per diems for Medicaid patients, and bonuses for improving a patient’s health and sending them home.
Study included 1601 patients admitted between May 1, 1981 and April 30, 1982.
Variables include:
Say we want to obtain the Fleming-Harrington estimate of the survival function for married females, in the healthiest initial subgroup, who are randomized to the untreated group of the nursing home study.
nurshome = read.table("http://www.karlin.mff.cuni.cz/~pesta/prednasky/NMFM404/Data/NursingHome.dat",
head=F, skip=14, col.names = c("los", "age", "rx", "gender", "married", "health", "censor"))
nurs2 = subset(nurshome, gender==0 & rx==0 & health==2 & married==1)
fit.fh = survfit(Surv(nurs2$los, 1-nurs2$censor) ~ 1,type="fleming-harrington", conf.type="log-log")
summary(fit.fh)
## Call: survfit(formula = Surv(nurs2$los, 1 - nurs2$censor) ~ 1, type = "fleming-harrington",
## conf.type = "log-log")
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 14 12 1 0.9200 0.0767 0.55345 0.988
## 24 11 1 0.8401 0.1036 0.49777 0.957
## 25 10 1 0.7601 0.1207 0.42614 0.916
## 38 9 1 0.6802 0.1318 0.35609 0.866
## 64 8 1 0.6003 0.1384 0.29015 0.810
## 89 7 1 0.5204 0.1412 0.22897 0.749
## 113 6 1 0.4405 0.1402 0.17290 0.682
## 123 5 1 0.3606 0.1356 0.12238 0.609
## 149 4 1 0.2809 0.1268 0.07814 0.531
## 168 3 1 0.2012 0.1129 0.04142 0.446
## 185 2 1 0.1221 0.0917 0.01445 0.352
## 234 1 1 0.0449 0.0562 0.00107 0.245
Compare with KM:
fit.km = survfit(Surv(nurs2$los, 1-nurs2$censor) ~ 1,type="kaplan-meier", conf.type="log-log")
summary(fit.km)
## Call: survfit(formula = Surv(nurs2$los, 1 - nurs2$censor) ~ 1, type = "kaplan-meier",
## conf.type = "log-log")
##
## time n.risk n.event survival std.err lower 95% CI upper 95% CI
## 14 12 1 0.9167 0.0798 0.53898 0.988
## 24 11 1 0.8333 0.1076 0.48171 0.956
## 25 10 1 0.7500 0.1250 0.40842 0.912
## 38 9 1 0.6667 0.1361 0.33702 0.860
## 64 8 1 0.5833 0.1423 0.27014 0.801
## 89 7 1 0.5000 0.1443 0.20848 0.736
## 113 6 1 0.4167 0.1423 0.15247 0.665
## 123 5 1 0.3333 0.1361 0.10270 0.588
## 149 4 1 0.2500 0.1250 0.06014 0.505
## 168 3 1 0.1667 0.1076 0.02651 0.413
## 185 2 1 0.0833 0.0798 0.00505 0.311
## 234 1 1 0.0000 NaN NA NA
Compare plots:
plot(fit.fh$time, fit.fh$surv,type="n", xlab="Follow up", ylab="Survival curve")
points(fit.fh$time, fit.fh$surv, col="blue", pch=17)
points(fit.km$time,fit.km$surv, col="red", pch=14)
lines(fit.fh$time,fit.fh$surv, col="blue", lwd=2)
lines(fit.km$time, fit.km$surv,col="red", lwd=2)
legend("topright", legend=c("FH", "KM"), col=c("blue", "red"), pch=c(17,14), lty=c(1,1), lwd=c(2,2), bty="n")
Generally, the Fleming Harrington estimator is slightly higher than the KM at every time point, but with larger datasets the two will typically be much closer.