The following examples are taken from the paper: “Analysis of time-to-event for observational studies: Guidance to the use of intensity models”, available at : https://pubmed.ncbi.nlm.nih.gov/33043497/ and from the vignette “Multi-state models and competing risks” at https://cran.r-project.org/web/packages/survival/vignettes/compete.pdf
## Load packages
library(survival)
library(splines)
library(timereg)
library(knitr)
library(ggplot2)
## IMPORT DATASET
# pad data
load("C:\\Users\\Giulia Barbati\\OneDrive - Università degli Studi di Trieste\\Stat_Learning_Epi\\SLE_AA 25_26\\Block 4\\R codes\\pad.rda")
Peripheral arterial disease (PAD) is a common circulatory problem in
which narrowed arteries reduce blood flow to peripheral limbs, often the
legs. It is also likely to be a sign of a more widespread
atherosclerosis, and subjects manifesting the disease carry an increased
risk for all atherothrombotic events including myocardial infarction,
stroke, and cardiac death. The pad data set contains the
results of a Slovene study. Briefly, the study was conducted by 74
primary care physicians-researchers, each of whom was asked to recruit
10 subjects with PAD along with 10 age and sex matched controls without
PAD; actual recruitment ranged from 1 to 31 pairs. All participating
subjects consented to a yearly examination and interview for 5 years.
Important endpoints are death, either due to cardiovascular disease (CVD
death) or other causes, non-fatal CVD endpoints of infarction and
stroke, and patient interventions attributed to the disease such as
revascularization procedures.
The final study includes 742 PAD and 713 controls, with baseline data for each subject, measurements at each visit, and endpoints. This analysis will focus on cardiac endpoints and on overall mortality.
The starting R data has 3 dataframes:
pad1 has one observation per subject containing
baseline information,pad2 has one observation per subject visit containing
measurements at that visit, andpad3 has an observation for each observed
endpoint.The dataset pad2 contains all the repeated examinations over the 5 years of the study. Note that in this dataset the ID of the subject is repeated for multiple lines (data are in the long format). The same happens for pad3 that reports all the events of interest. In the IDA phase (Initial Data Analysis) we look at the overall follow-up for subjects, and decide on a time scale and an endpoint. Per the study design subjects should have a visit at years 1–5. We see in Figure that visit 1 dates are stretched out over a full year and by visit 5 over almost 2 years, a pattern that is unavoidable in observational studies of human subjects. Based on the counts shown in the figure there is no evidence of differential drop out between the PAD and control subjects.
vcount <- with(pad2, table(pad, visit)) # visit count
vtime <- pad2$day/365.25
plot(vtime ~ jitter(visit + .3*(pad=="PAD") - .15), pad2, subset=(visit>0),
col=c(1,2)[as.numeric(pad)], ylim=c(0, max(vtime)),
xlab= "Visit number", ylab="Years from initial visit")
#abline(h=1:6, lty=2, col=4)
segments(rep(0,6), 1:6, 1:6+.5, 1:6, lty=2, col='gray')
legend(4,2, c("PAD", "Control"), col=c(2,1), bty='n', pch=19)
text(1:5 - .15, 0:4, vcount[1,-1])
text(1:5 + .15, 0:4, vcount[2,-1])
pfig2 <- survfit(Surv(futime/365.25, status=='censor') ~ pad, pad1)
plot(pfig2, fun='event', col=1:2,
xlab="Years post enrollment", ylab="Fraction censored")
legend(2, .6, c("Control", "PAD", "Ideal"), col=c(1,2,1), lty=c(1,1,2),
bty='n')
lines(c(0, 4.8, 6.8), c(0,0,1), lty=3)
The censoring pattern for subjects in the two arms, showed in the
above figure, are computed using a reverse Kaplan-Meier
(i.e. the status is to be censored, and the censoring is having the
event).
There is no evidence of differential follow-up.
Given the visit slippage of the first figure, an ideal censoring pattern would be a final followup that is approximately uniformly distributed between 4.8 and 6.8 years, this is shown as a dotted line on the figure.
Examining the counts of subjects for each visit the total number of visits for the subjects is not what was planned — about 25% do not have a fifth visit — but patients have adhered to the 5 year time commitment quite closely. As stated in the primary paper, a decision was made to truncate all patient data at 5 years for the purpose of analysis.
The initial survival curves can be created using a copy of the baseline data (pad1), with follow-up truncated at 5 years.
pdata <- subset(pad1, select=c(id, pad, sex, age, futime, status))
pdata$death <- ifelse(pdata$status == 'censor', 0, 1) # any cause of death
# truncate follow=up at 5 years
over5 <- pdata$futime > 5*(365.25)
pdata$futime <- pmin(1826, pdata$futime) # 1826 days = 5 years
pdata$death[over5] <- 0
pdata$status[over5] <- "censor"
table(pdata$status)
##
## censor CV death nonCV death
## 1296 68 91
# Convert to follow-up in years, and add an age scale
pdata$years <- pdata$futime/365.25
pdata$age1 <- as.numeric(pad1$d.first - pad1$d.birth)/365.25
pdata$age2 <- pdata$age1 + pdata$years
# comparison age at entry between groups
aggregate(age1 ~ sex + pad, data = pdata,
FUN = function(x) c(mean = round(mean(x), 2), sd = round(sd(x), 2)))
## sex pad age1.mean age1.sd
## 1 female Control 65.22 9.49
## 2 male Control 64.36 9.04
## 3 female PAD 66.37 8.92
## 4 male PAD 64.30 8.60
ggplot(pdata, aes(x = pad, y = age1, fill = sex)) +
geom_boxplot() +
labs(title = "Age at Entry",
x = "Pad Group",
y = "Age at Entry (Years)") +
theme_minimal()
oldpar <- par(mfrow=c(1,2))
psurv1 <- survfit(Surv(years, death) ~ pad + sex, data=pdata)
psurv2 <- survfit(Surv(age1, age2, death) ~ pad+ sex, pdata,
start.time=55)
plot(psurv1, col=c(1,1,2,2), lty=1:2, fun='event', lwd=2,
xlab="Years since enrollment", ylab="Death")
legend(.1, .18, c("PAD, male","PAD, female", "Control, male","Control, female"),
col=c(2,2,1,1), lty=c(2,1,2,1), lwd=2, bty='n', cex=.8)
plot(psurv2, col=c(1,1,2,2), lty=1:2, fun='event', lwd=2,
xlab='Age', ylab='Death, given alive at age 55')
Our first analysis will use time since enrollment as the primary time scale. For the PAD patients the time since diagnosis is a natural scale since it represents both the progression of disease and of treatments for the disease. Survival curves for the control subjects serve as a comparison outcome of similarly aged subjects without the disease, but do not have a natural stand-alone interpretation.
Death rates are higher for males than for females, and is higher for PAD subjects than for the age matched controls. The right hand panel shows the curves on age scale. Note that when using the age scale, we include two time points in the Survival function: age at entry in the study and age at the event or censoring. These data are both right-censored but also left-truncated (we have not discussed this point, but it is of interest if you want to deepen the survival framework). In two words, left-truncation is when the subjects to enter under your observation should be survived until a certain time-point (i.e. age at entry).
Survival curves can equivalently display either the fraction who have died (curves start at 0: 1-KM) or the fraction who have not died (curves start at 1); the choice between the two is often based on tradition or habit.
The analysis will focus on the impact on survival of the covariates PAD, age, sex, and serum values for high density and low density lipoprotein (HDL, LDL). We are therefore in a framework of a descriptive model, since there isn’t a formal causal model focused on one specific risk factor but instead the simultaneous associations of multiple factors to survival is of interest. Figure shows density estimates for each of the three continuous covariates at baseline, by PAD group. By study design there should be no difference in the age distribution. HDL and LDL are slightly lower for the PAD subjects.
oldpar <- par(mfrow=c(1,3), mar=c(5.1, 4.1, 5.1, 1))
dfun <- function(data, x, ...) {
d0 <- density((data[[x]])[data$pad=="Control"])
d1 <- density((data[[x]])[data$pad=="PAD"])
matplot(cbind(d0$x, d1$x), cbind(d0$y, d1$y), type='l', lty=1, col=1:2,
...)
}
lunit <- 38.67 # convert from mmol/L to mg/dl
xx <- c(0, 50, 100, 150, 200, 250, 300)
dfun(pdata, "age", xlab="Age", ylab="Density", lwd=2)
legend(51, .02, c("PAD", "Control"), lty=1, lwd=2, col=2:1, bty='n')
dfun(na.omit(subset(pad2, visit==0, c(hdl, pad))),
"hdl", xlab="HDL (mmol/L)", ylab="Density", lwd=2)
axis(3, xx/lunit, xx)
dfun(na.omit(subset(pad2, visit==0, c(ldl, pad))),
"ldl", xlab="LDL (mmol/L)", ylab="Density", lwd=2)
axis(3, xx/lunit, xx)
The Cox model fits will use age, sex, baseline values of high density
and low density lipoproteins (hdl0, ldl0),
time-dependent lipid values (hdl, ldl), and a
time-dependent value of “years since enrollment”
(etime). Note that time since enrollment only plays a role
in age scale models; in a classical time-since-entry model
every one in a given risk set will have exactly the same value of the
covariate, and hence the estimated coefficient for the variable would be
0.
For time-dependent analysis we use a counting process form.This framework is suitable for all kind of generalizations, and it is therefore to be preferred when switching to more advanced methods, like in the multi-state approaches. Mathematically each subject is represented by 3 processes: \(Y_i(t)=1\) if the subject is currently at risk, \(X_i(t)\) is the time dependent covariate process, and \(N(t)\) the cumulative number of events. In the R code this translates into multiple observations for each subject of the form start-stop:
id, time1, time2, event, x1, x2, ...
where (time1, time2) describe an interval over which the
subject is at risk, event corresponds to \(dN(t)\) and will be 1 of the subject had an
event at time2 and 0 otherwise, x1, x2, ...
are the value of the covariates over the interval, and id
is a subject identifier. Note that in this format the event happens at
time2, instead the value of the covariate in each line is
defined at time1. Subject 9, for instance, is a male with blood
values measured on days 0, 583, 945, and 1309, who died on day 1349.
Their data has the form
id tstart tstop death sex ldl hdl
9 0 583 0 male 2.66 1.08
9 583 945 0 male 3.20 1.53
9 945 1309 0 male 2.81 1.32
9 1309 1349 1 male 2.51 1.30
The lipid values and time intervals encode the most recent
measurement of the value. Creation and checking of the data often
represents a major share of the work in an analysis. There is often
concern that because the data set has multiple rows for the same
subject, a valid variance estimate for the model will need to account
for correlation.
This is not the case, however.
Internally, there is only one copy of each subject; the multiple rows
are merely a way to communicate to the program which covariate
value is applicable at any given time point.
When there are competing risks, such as CVD and non-CVD death, the 0/1 event event variable is replaced by a factor with levels of “none”, “CVD death”, “non-CVD death”, which shows both when an event occurred and it’s type.
Below we create separate time-dependent data sets for analysis on the
time-from-enrollment scale (pdata2) and on the age scale
(pdata3).
The two data sets have separate event indicators for death, the competing risk of CVD/non-CVD death, and for any CVD event/non-CVD death. Creating (time1, time2) data sets on age-in-years scale can lead to anomalies due to round off errors; for this reason the time-dependent data sets all use time in days. Use birth date and entry date to create agedays = age at enrollment, in days.
This data set will therefore have two time scales (age and time since enrollment), time dependent covariates (hdl, ldl, etime), and multiple endpoints (all cause death, CVD vs non-CVD death, major CVD event).
A. etime = time since enrollment as a time dependent variable
1. This is a continuous time-dependent covariate -- technically
it changes minute by minute.
2. We will be using it in age scale fits. If we create it on age scale
then the cuts for survSplit can be limited to "the ages at which deaths occur"
B. The tmerge routine can keep track of multiple endpoints on a single time scale, the survSplit routine a single endpoint and single time scale.
C. The tmerge operations needs to be sequential: no step that would change the number of rows can be inserted in the midst.
Taken all together, it is simplest to create the separate data sets, pdata2 on the time since enrollment scale, pdata3 on the age scale. Many of the steps will need to be repeated, e.g., adding time dependent hdl and ldl.
pdata2 <- tmerge(pdata[, c("id", "pad", "age", "sex")], pdata,
id=id, status = event(futime, status))
pdata2 <- tmerge(pdata2, pad2, id=id, ldl= tdc(day, ldl),
hdl = tdc(day, hdl))
head(pdata2)
## id pad age sex tstart tstop status ldl hdl
## 1 1 Control 69.5 male 0 487 censor 2.39 1.65
## 2 1 Control 69.5 male 487 876 censor 2.82 1.78
## 3 1 Control 69.5 male 876 1243 censor 2.59 2.27
## 4 1 Control 69.5 male 1243 1597 censor 2.80 2.00
## 5 1 Control 69.5 male 1597 1826 censor 2.33 2.26
## 6 2 PAD 63.9 female 0 459 censor 3.00 0.97
# add baseline hdl and ldl as covariates
indx <- match(pdata2$id, pdata2$id)
pdata2$ldl0 <- pdata2$ldl[indx]
pdata2$hdl0 <- pdata2$hdl[indx]
pdata2$age10 <- pdata2$age/10 # age in decades
head(pdata2)
## id pad age sex tstart tstop status ldl hdl ldl0 hdl0 age10
## 1 1 Control 69.5 male 0 487 censor 2.39 1.65 2.39 1.65 6.95
## 2 1 Control 69.5 male 487 876 censor 2.82 1.78 2.39 1.65 6.95
## 3 1 Control 69.5 male 876 1243 censor 2.59 2.27 2.39 1.65 6.95
## 4 1 Control 69.5 male 1243 1597 censor 2.80 2.00 2.39 1.65 6.95
## 5 1 Control 69.5 male 1597 1826 censor 2.33 2.26 2.39 1.65 6.95
## 6 2 PAD 63.9 female 0 459 censor 3.00 0.97 3.00 0.97 6.39
# create "any cvd event" as a competing risk, it will be used later
temp <- subset(pad3, event %in% c("cvd death", "infarction", "stroke"))
temp <- temp[!duplicated(temp$id),] # first of these
temp <- subset(temp, day< 5*365.25) # must be <= 5 years
pdata2 <- tmerge(pdata2, temp, id=id, cvstat=event(day),
priorcv = tdc(day))
pdata2$death <- ifelse(pdata2$status=="censor", 0, 1) # all cause death
tstat <- ifelse(pdata2$cvstat>0, pdata2$cvstat, 2*pdata2$death)
pdata2$cvstat <- factor(tstat, 0:2,
c("censor", "CVD event", "nonCVD-death"))
pdata2$death <- ifelse(pdata2$status=="censor", 0, 1) # all cause death
table(pdata2$cvstat)
##
## censor CVD event nonCVD-death
## 6488 143 100
table(pdata2$death)
##
## 0 1
## 6572 159
table(pdata2$status)
##
## censor CV death nonCV death
## 6572 68 91
Note that when considering competing risks (cvstat) we have only 100 deaths that are “death as a first event”. Note also that since we are duplicating subjects we have a lot more censored observations.
# Reprise now all the steps, but on age scale. We need to add "age" to the
# merged-in data sets before adding them.
ageday <- as.numeric(pad1$d.first - pad1$d.birth) # age in days at enrollmemt
pdata3 <- tmerge(pdata[, c("id", "pad", "sex")], pdata,
id=id, status = event(futime + ageday, status),
tstart=ageday, tstop=ageday+futime,
options=list(tstart="age1", tstop="age2"))
temp2 <- pad2
temp2$age <- temp2$day + ageday[match(temp2$id, pad1$id)]
pdata3 <- tmerge(pdata3, temp2, id=id, ldl= tdc(age, ldl),
hdl = tdc(age, hdl))
head(pdata3)
## id pad sex age1 age2 status ldl hdl
## 1 1 Control male 25395 25882 censor 2.39 1.65
## 2 1 Control male 25882 26271 censor 2.82 1.78
## 3 1 Control male 26271 26638 censor 2.59 2.27
## 4 1 Control male 26638 26992 censor 2.80 2.00
## 5 1 Control male 26992 27221 censor 2.33 2.26
## 6 2 PAD female 23342 23801 censor 3.00 0.97
# add baseline hdl and ldl as covariates
indx <- match(pdata3$id, pdata3$id) #points to the first row for each subject
pdata3$ldl0 <- pdata3$ldl[indx]
pdata3$hdl0 <- pdata3$hdl[indx]
# Add time since enrollment as a time-dependent covariate.
# Use a temporary data set + survSplit to create a 'variable' that looks
# like yet another lab test, then merge it in on the age scale.
# Time since enrollment will be in weeks, which will create 52*5= 260 obs
# per subject. (Time since enrollment in days would be overkill.)
pdata$ageday <- ageday
temp3 <- survSplit(Surv(futime, status) ~ ageday + id, data= pdata,
cut= seq(7, max(pdata$futime), by=7))
pdata3 <- tmerge(pdata3, temp3, id, etime = tdc(ageday + tstart, tstart))
pdata3$etime <- pdata3$etime/365.25 # recode to time in years
head(pdata3)
## id pad sex age1 age2 status ldl hdl ldl0 hdl0 etime
## 1 1 Control male 25395 25402 censor 2.39 1.65 2.39 1.65 0.00000000
## 2 1 Control male 25402 25409 censor 2.39 1.65 2.39 1.65 0.01916496
## 3 1 Control male 25409 25416 censor 2.39 1.65 2.39 1.65 0.03832991
## 4 1 Control male 25416 25423 censor 2.39 1.65 2.39 1.65 0.05749487
## 5 1 Control male 25423 25430 censor 2.39 1.65 2.39 1.65 0.07665982
## 6 1 Control male 25430 25437 censor 2.39 1.65 2.39 1.65 0.09582478
# add 'any cv' as an event
temp$age <- temp$day + ageday[match(temp$id, pad1$id)]
pdata3 <- tmerge(pdata3, temp, id=id, cvstat=event(age),
priorcv = tdc(age))
# we are adding variable (priorcv) that tracks if they’ve had a "prior" CV event.
# recode now the event variables
pdata3$death <- ifelse(pdata3$status=="censor", 0, 1) # all cause death
tstat <- ifelse(pdata3$cvstat>0, pdata3$cvstat, 2*pdata3$death)
pdata3$cvstat <- factor(tstat, 0:2, c("censor", "CVD event", "nonCVD-death"))
# pdata3[1:280,]
table(pdata3$death)
##
## 0 1
## 362717 159
table(pdata3$cvstat)
##
## censor CVD event nonCVD-death
## 362633 143 100
Now the number of censored observations is much more higher since we have duplicated rows to create the TD variable time since enrollment.
# Creation of multi-row data sets has pitfalls. It is always wise to
# sanity check the results.
check1 <- survcheck(Surv(tstart, tstop, death) ~ 1, id=id, data= pdata2)
check1
## Call:
## survcheck(formula = Surv(tstart, tstop, death) ~ 1, data = pdata2,
## id = id)
##
## Unique identifiers Observations Transitions
## 1455 6731 159
##
## Transitions table:
## to
## from event (censored)
## (s0) 159 1296
## event 0 0
##
## Number of subjects with 0, 1, ... transitions to each state:
## 0 1
## event 1296 159
check2 <- survcheck(Surv(age1, age2, death) ~1, id=id, data= pdata3)
check2
## Call:
## survcheck(formula = Surv(age1, age2, death) ~ 1, data = pdata3,
## id = id)
##
## Unique identifiers Observations Transitions
## 1455 362876 159
##
## Transitions table:
## to
## from event (censored)
## (s0) 159 1296
## event 0 0
##
## Number of subjects with 0, 1, ... transitions to each state:
## 0 1
## event 1296 159
The first pair of fits shows that the estimated overall effects of PAD and sex hardly differ between analyses done on the enrollment or age scales. A second pair fits looks at the coefficient effects within the PAD and control groups. (Inclusion of the covariate by strata interactions gives results that are equivalent to separate fits for the the PAD and control groups; bundling them as a single model allows a formal test of whether the effects differ between the two strata.) None of sex, enrollment age, or time since enrollment show meaningful differences between the PAD and control groups. That is, on the log-hazard scale, PAD appears to act as an additive risk.
# Fit the first set of models: pad, age, sex on the time since enrollment
# scale (cfit1a) and on the age scale (cfit1b)
cfit1a <- coxph(Surv(tstart, tstop, death) ~ pad + sex + age10, pdata2)
cfit1b <- coxph(Surv(age1, age2, death) ~ pad + sex + etime, pdata3)
# Add interactions
cfit2a <- coxph(Surv(tstart, tstop, death) ~ (sex + age10)*strata(pad), pdata2)
cfit2b <- coxph(Surv(age1, age2, death) ~ (sex + etime)*strata(pad),
pdata3)
cat("Overall model on time since enrollment scale\n")
## Overall model on time since enrollment scale
print(cfit1a)
## Call:
## coxph(formula = Surv(tstart, tstop, death) ~ pad + sex + age10,
## data = pdata2)
##
## coef exp(coef) se(coef) z p
## padPAD 0.8755 2.4000 0.1739 5.036 4.76e-07
## sexmale 0.6924 1.9985 0.1822 3.800 0.000145
## age10 0.6583 1.9315 0.1053 6.250 4.11e-10
##
## Likelihood ratio test=82.63 on 3 df, p=< 2.2e-16
## n= 6731, number of events= 159
cat("\nOverall model on age scale\n")
##
## Overall model on age scale
print(cfit1b)
## Call:
## coxph(formula = Surv(age1, age2, death) ~ pad + sex + etime,
## data = pdata3)
##
## coef exp(coef) se(coef) z p
## padPAD 0.87354 2.39538 0.17400 5.020 5.16e-07
## sexmale 0.70540 2.02466 0.18270 3.861 0.000113
## etime 0.16593 1.18049 0.05917 2.804 0.005042
##
## Likelihood ratio test=51.15 on 3 df, p=4.538e-11
## n= 362876, number of events= 159
cat("\nInteractions between covariates and treatment, time since enrollment\n")
##
## Interactions between covariates and treatment, time since enrollment
print(cfit2a)
## Call:
## coxph(formula = Surv(tstart, tstop, death) ~ (sex + age10) *
## strata(pad), data = pdata2)
##
## coef exp(coef) se(coef) z p
## sexmale 0.67576 1.96552 0.33494 2.018 0.043638
## age10 0.68341 1.98062 0.19354 3.531 0.000414
## sexmale:strata(pad)PAD 0.02208 1.02233 0.39925 0.055 0.955897
## age10:strata(pad)PAD -0.03498 0.96563 0.23081 -0.152 0.879552
##
## Likelihood ratio test=55.21 on 4 df, p=2.938e-11
## n= 6731, number of events= 159
cat("\nInteractions between covariates and treatment, age scale\n")
##
## Interactions between covariates and treatment, age scale
print(cfit2b)
## Call:
## coxph(formula = Surv(age1, age2, death) ~ (sex + etime) * strata(pad),
## data = pdata3)
##
## coef exp(coef) se(coef) z p
## sexmale 0.6980525 2.0098347 0.3357736 2.079 0.0376
## etime 0.1176904 1.1248958 0.1097373 1.072 0.2835
## sexmale:strata(pad)PAD 0.0001539 1.0001539 0.4003644 0.000 0.9997
## etime:strata(pad)PAD 0.0665382 1.0688018 0.1304441 0.510 0.6100
##
## Likelihood ratio test=24.44 on 4 df, p=6.516e-05
## n= 362876, number of events= 159
Therefore, we can see that in the multivariable Cox model on the follow up scale, being PAD and increasing age decrease the estimated survival, male sex is at higher risk. On the age scale, these results are confirmed. There is no a significant interaction between PAD and gender or age.
Since PAD is expected to exert its primary effect via cardiovascular
disease (CVD) endpoints, a more interesting analysis is to focus on CVD
rather than overall mortality. Since in this case non-CVD death will be
a competing risk this needs to be accounted for in our estimate
of the probability in state; in the R survival package this is
accomplished by using a multi-level categorical variable as the
“status”, rather than 0/1. The survfit function then
computes Aalen-Johansen (AJ) estimate.
psurv4 <- survfit(Surv(tstart, tstop, status) ~ pad, id = id, pdata2)
psurv4
## Call: survfit(formula = Surv(tstart, tstop, status) ~ pad, data = pdata2,
## id = id)
##
## n nevent rmean se(rmean)*
## pad=Control, (s0) 3349 0 1774.19419 8.648882
## pad=PAD, (s0) 3382 0 1711.26894 12.097682
## pad=Control, CV death 3349 17 16.99381 5.010891
## pad=PAD, CV death 3382 51 57.63161 9.122496
## pad=Control, nonCV death 3349 30 34.81200 7.166591
## pad=PAD, nonCV death 3382 61 57.09945 8.486423
## *restricted mean time in state (max time = 1826 )
plot(psurv4, col=1:2, lty=c(1,1,2,2), ylim=c(0,.15), xscale=365.25,
xlab="Years since enrollment", ylab="Absolute risk")
legend("topleft", c("PAD CVD death", "Control CVD death",
"PAD nCVD death","Control nCVD"),
col=c(2,1,2,1), lty=c(1,1,2,2), lwd=2, bty='n')
Figure shows the Aalen-Johansen estimates of probability in state for CVD and non-CVD death (which could also be labeled here as “absolute risk” or “cumulative incidence”). For controls, CVD deaths are about 1/2 of the non-CVD, approximately 2% versus 4% at 5 years; near 8% both outcomes for the PAD subjects.
The absolute risk curves require that all transitions in the model be considered jointly: incorrect curves will result from a simple Kaplan-Meier of time to CVD, censoring at non-CVD death, or a KM of time to non-CVD death, censoring at CVD.
(Interestingly, this is not true for the cumulative hazard estimate: the naive estimate for one outcome, censoring the others, will be identical to the joint estimate).
We could repeat the same procedure but focusing now on major CVD events with death w/o CVD as a competing risk:
psurv5 <- survfit(Surv(tstart, tstop, cvstat) ~ pad, pdata2, id=id,
subset=(priorcv==0))
plot(psurv5, col=1:2, lty=c(1,1,2,2), xscale=365.25, ylim=c(0, .15),
xlab="Years since enrollment", ylab="Absolute risk")
legend("topleft", c("Major CVD, PAD", "Major CVD, Control",
"death w/o CVD, PAD","death w/o CVD, Control"),
col=c(2,1,2,1), lty=c(1,1,2,2), lwd=2, bty='n', cex=.85)
The plot represent the AJ curves for first major CVD event (infarction, stroke, CVD death) versus non-CVD death w/o any CVD non-fatal event as a competing risk. Nearly 14% of the PAD subjects experienced a major CVD event versus 6% for controls.
The Cox model is a model for the hazards and has the same identity:
the first coxph call below, which models the entire
process, will have the same results for the CVD death portion as the
second coxph call, which ignores non-CVD deaths by
censoring them; this is verified by the all.equal test.
However, if one desires to predict absolute risk curves from the fitted Cox model then the first form (using the “complete” multi-state approach) is essential !!!
pcox1 <- coxph(Surv(tstart, tstop, status) ~ pad + sex + age10 + hdl0 + ldl0,
pdata2, id=id)
cat("transitions table\n")
## transitions table
pcox1$transitions # a check of the data
## to
## from CV death nonCV death (censored)
## (s0) 67 91 1275
## CV death 0 0 0
## nonCV death 0 0 0
pcox1
## Call:
## coxph(formula = Surv(tstart, tstop, status) ~ pad + sex + age10 +
## hdl0 + ldl0, data = pdata2, id = id)
##
##
## 1:2 coef exp(coef) robust se z p
## padPAD 1.05440 2.87025 0.28029 3.762 0.000169
## sexmale 0.51298 1.67026 0.27881 1.840 0.065786
## age10 0.65698 1.92895 0.17894 3.672 0.000241
## hdl0 -0.30026 0.74063 0.44220 -0.679 0.497130
## ldl0 -0.08128 0.92194 0.12754 -0.637 0.523966
##
##
## 1:3 coef exp(coef) robust se z p
## padPAD 0.71390 2.04194 0.22789 3.133 0.00173
## sexmale 0.75348 2.12438 0.25505 2.954 0.00313
## age10 0.65666 1.92835 0.14379 4.567 4.95e-06
## hdl0 -0.20347 0.81589 0.32688 -0.622 0.53363
## ldl0 0.02213 1.02237 0.10857 0.204 0.83850
##
## States: 1= (s0), 2= CV death, 3= nonCV death
##
## Likelihood ratio test=83.96 on 10 df, p=8.386e-14
## n= 6626, unique id= 1433, number of events= 158
## (105 osservazioni eliminate a causa di valori mancanti)
Note that we have lost 1 subject for CV death and 21 censored subjects because there are some missing data in the lipid covariates.
# cause-specific Cox model (censoring non CV death)
pcox1b <- coxph(Surv(tstart, tstop, status=="CV death") ~ pad +sex + age10
+ hdl0 + ldl0, data=pdata2) #censor non-CVD death
pcox1b
## Call:
## coxph(formula = Surv(tstart, tstop, status == "CV death") ~ pad +
## sex + age10 + hdl0 + ldl0, data = pdata2)
##
## coef exp(coef) se(coef) z p
## padPAD 1.05445 2.87039 0.28292 3.727 0.000194
## sexmale 0.51298 1.67027 0.27813 1.844 0.065122
## age10 0.65703 1.92905 0.16379 4.011 6.03e-05
## hdl0 -0.30029 0.74060 0.32993 -0.910 0.362744
## ldl0 -0.08124 0.92197 0.12687 -0.640 0.521959
##
## Likelihood ratio test=39.51 on 5 df, p=1.874e-07
## n= 6626, number of events= 67
## (105 osservazioni eliminate a causa di valori mancanti)
# adding also TD covariates
pcox2 <- coxph(Surv(tstart, tstop, status) ~ pad + sex + age10+
+ hdl + ldl, data=pdata2, id=id) # time dependent hdl and ldl
pcox2
## Call:
## coxph(formula = Surv(tstart, tstop, status) ~ pad + sex + age10 +
## +hdl + ldl, data = pdata2, id = id)
##
##
## 1:2 coef exp(coef) robust se z p
## padPAD 0.8751 2.3991 0.2745 3.188 0.001433
## sexmale 0.3094 1.3626 0.2707 1.143 0.252954
## age10 0.6963 2.0064 0.1813 3.840 0.000123
## hdl -1.5482 0.2126 0.4657 -3.325 0.000885
## ldl -0.2797 0.7560 0.1496 -1.869 0.061609
##
##
## 1:3 coef exp(coef) robust se z p
## padPAD 0.7545 2.1265 0.2256 3.345 0.000824
## sexmale 0.7847 2.1918 0.2461 3.188 0.001432
## age10 0.6761 1.9662 0.1409 4.797 1.61e-06
## hdl -0.1489 0.8617 0.3901 -0.382 0.702680
## ldl 0.1338 1.1432 0.1070 1.250 0.211251
##
## States: 1= (s0), 2= CV death, 3= nonCV death
##
## Likelihood ratio test=105.6 on 10 df, p=< 2.2e-16
## n= 6702, unique id= 1453, number of events= 158
## (29 osservazioni eliminate a causa di valori mancanti)
There were 67 CVD and 91 non-CVD deaths in 5 years of follow-up, which are the transitions from the initial state (s0) to the CVD death and non-CVD death states, respectively, in the transitions table.
(No one transitioned from death to another state, which would indicate an error in the data set !!!)
The joint Cox model shows that after adjustment for age, sex, and baseline lipids, PAD is associated with nearly three fold increase in the rate of CVD deaths and double fold increase in non-CVD death.
Age and sex are important predictors as well, as expected, with male sex having a somewhat larger effect on the non-CVD endpoint.
The effect of baseline lipid values was not significant, however, time-dependent HDL shows an effect on the CVD death endpoint, somewhat reducing the estimated effect of PAD on CVD death.
Below we reprise the CVD portion of the fit, and investigate linearity for the serum lipids, whose skewness may raise concerns. We can explore nonlinearity in several ways, in R fitting a spline is a simple one.
The likelihood ratio test provided by the anova command
shows that the gain in goodness of fit from an HDL and/or LDL
transformation is modest. The plot implies there is less “high HDL”
benefit after reaching a value of 1.5, which agrees with common
guidelines; together this suggests that a modified covariate
hdl2 = min(HDL, 1.5) might be the better predictor. A final
check for proportional hazards did not reveal any concerns.
pcox5a <- coxph(Surv(tstart, tstop, cvstat=="CVD event") ~ pad + age10+ sex
+ hdl + ldl, data=pdata2, subset=(priorcv==0))
pcox5b <- coxph(Surv(tstart, tstop, cvstat=="CVD event") ~ pad + age10+ sex
+ ns(hdl,3) + ns(ldl,3), data=pdata2, subset= (priorcv==0))
anova(pcox5a ,pcox5b)
## Analysis of Deviance Table
## Cox model: response is Surv(tstart, tstop, cvstat == "CVD event")
## Model 1: ~ pad + age10 + sex + hdl + ldl
## Model 2: ~ pad + age10 + sex + ns(hdl, 3) + ns(ldl, 3)
## loglik Chisq Df Pr(>|Chi|)
## 1 -985.67
## 2 -980.93 9.4697 4 0.05037 .
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
cox.zph(pcox5b)
## chisq df p
## pad 0.413 1 0.520
## age10 3.249 1 0.071
## sex 2.361 1 0.124
## ns(hdl, 3) 6.453 3 0.092
## ns(ldl, 3) 2.343 3 0.504
## GLOBAL 12.787 9 0.172
termplot(pcox5b, term=4, se=TRUE, ylim=c(-4,4), col.term=1, col.se=1,
xlab="HDL", ylab="Estimated hazard ratio")
Predicted \(P(state)\) curves can be created from the multi-state fits, as well. As with any Cox model prediction, the first task is to choose a set of covariate values for which prediction is desired. We create the predictions for four hypothetical male subjects with ages of 58 and 72 (quartiles of age), and PAD vs. control. For HDL and LDL use values of 1.3 and 3, which are near the medians. (These curves are for the model of CVD death vs other death, time-fixed hdl and ldl).
dummy <- expand.grid(pad=c("PAD", "Control"), age10=c(5.8, 7.2), sex="male",
hdl0 =1.3, ldl0 =3)
dummy
## pad age10 sex hdl0 ldl0
## 1 PAD 5.8 male 1.3 3
## 2 Control 5.8 male 1.3 3
## 3 PAD 7.2 male 1.3 3
## 4 Control 7.2 male 1.3 3
coxpred1 <- survfit(pcox1, newdata= dummy)
dim(coxpred1)
## data states
## 4 3
coxpred1$states
## [1] "(s0)" "CV death" "nonCV death"
The resulting set of absolute risk curves can be viewed as an array
of estimates by stratum, target covariates, and state. The
pcox1 fit had no strata so the first dimension is 1. The
resulting curves are shown in the Figure. Because a multi-state model
can generate a very large number of curves it is often helpful to break
them into multiple panels.
par(mfrow=c(1, 2))
plot(coxpred1[1, 1:2 ,], col=2:1, lty=c(1,1,2,2), lwd=2, xscale=365.25,
ylim=c(0, .15),
xlab="Years since enrollment", ylab="predicted P(state), age = 58")
legend("topleft", c("PAD, CVD death", "Control, CVD death",
"PAD, nonCVD death", "Control, nonCVD death"),
col=c(2,1,2,1), lty=c(1,1,2,2), lwd=2, bty='n', cex=.8)
plot(coxpred1[1, 3:4,], col=2:1, lty=c(1,1,2,2), lwd=2, xscale=365.25,
ylim=c(0, .15),
xlab="Years since enrollment", ylab="predicted P(state), age=72")
legend("topleft", c("PAD, CVD death", "Control, CVD death",
"PAD, nonCVD death", "Control, nonCVD death"),
col=c(2,1,2,1), lty=c(1,1,2,2), lwd=2, bty='n', cex=.8)
The mgus2 data set contains the time to plasma cell malignancy (PCM) and/or death for 1384 subjects diagnosed with monoclonal gammopathy of undetermined significance (MGUS). Survival and progression time are in months. The code below creates an ordinary Kaplan-Meier curve of post-diagnosis survival for these subjects, along with a histogram of age at diagnosis. The mean age at diagnosis is just over 70 years.
mfit1 <- survfit(Surv(futime, death) ~ sex, data=mgus2)
mfit1
## Call: survfit(formula = Surv(futime, death) ~ sex, data = mgus2)
##
## n events median 0.95LCL 0.95UCL
## sex=F 631 423 108 100 121
## sex=M 753 540 88 79 97
plot(mfit1, col=c(1,2), xscale=12, mark.time=FALSE, lwd=2,
xlab="Years post diagnosis", ylab="Survival")
legend("topright", c("female", "male"), col=1:2, lwd=2, bty='n')
As a second model for these subjects we will use competing risks (CR), where PCM and death without malignancy are the two terminal states. In a CR model we are only interested in the first event for each subject. Formally we are treating progression to a PCM as an absorbing state, i.e., one that subjects never exit. We create a variable etime containing the time to the first of progression, death, or last follow-up along with an event variable that contains the outcome. The starting data set mgus2 has two pairs of variables (ptime, pstat) that contain the time to progression and (futime, status) that contain the time to death or last known alive. The code below creates the necessary etime and event variables, then computes and plots the competing risks estimate.
mgus2$etime <- with(mgus2, ifelse(pstat==0, futime, ptime))
event <- with(mgus2, ifelse(pstat==0, 2*death, 1))
mgus2$event <- factor(event, 0:2, labels=c("censor", "pcm", "death"))
table(mgus2$event)
##
## censor pcm death
## 409 115 860
mfit2 <- survfit(Surv(etime, event) ~ sex, data=mgus2)
print(mfit2, rmean=240, scale=12)
## Call: survfit(formula = Surv(etime, event) ~ sex, data = mgus2)
##
## n nevent rmean se(rmean)*
## sex=F, (s0) 631 0 9.853608 0.2909061
## sex=M, (s0) 753 0 8.675012 0.2627185
## sex=F, pcm 631 59 1.323284 0.1742312
## sex=M, pcm 753 56 1.064693 0.1417536
## sex=F, death 631 370 8.823108 0.3053757
## sex=M, death 753 490 10.260294 0.2825688
## *restricted mean time in state (max time = 20 )
mfit2$transitions
## to
## from pcm death (censored)
## (s0) 115 860 409
## pcm 0 0 0
## death 0 0 0
Note that each subject’s initial state can be specified by the istate argument. When this is omitted all subjects are assumed to start from an entry state named ’(s0)’. The printout shows that a male subject will spend, on average, 8.7 of his first 20 years post diagnosis in the entry state, 1.1 years in the PCM state and 10.3 of those 20 in the death state. If a cutoff time is not given the default is to use the maximum observed time for all curves, which is 424 months in this case.
plot(mfit2, col=c(1,2,1,2), lty=c(2,2,1,1),
mark.time=FALSE, lwd=2, xscale=12,
xlab="Years post diagnosis", ylab="Probability in State")
legend(240, .6, c("death:female", "death:male", "pcm:female", "pcm:male"),
col=c(1,2,1,2), lty=c(1,1,2,2), lwd=2, bty='n')
Now we switch to a multi-state model: in this version a subject can have multiple transitions and thus multiple rows in the data set (we consider also death after PCM). In this case it is necessary to identify which data rows go with which subject via the id argument of survfit. Note that we need to decide what to do with the 9 subjects who have PCM and death declared at the same month. (Some of these were cancer cases discovered at autopsy.) For the multi-state model we need to be explicit in how this is coded since a sojourn time of 0 within a state is not allowed. Below we push the progression time back by .1 month when there is a tie, but that amount is entirely arbitrary. Since there are now 3 possible transitions we will call the data set data3.
ptemp <- with(mgus2, ifelse(ptime==futime & pstat==1, ptime-.1, ptime))
data3 <- tmerge(mgus2, mgus2, id=id, death=event(futime, death),
pcm = event(ptemp, pstat))
data3 <- tmerge(data3, data3, id, enum=cumtdc(tstart))
with(data3, table(death, pcm))
## pcm
## death 0 1
## 0 421 115
## 1 963 0
The table above shows that there are no observations in data3 that have both a PCM and death, i.e., the ties have been resolved. The last tmerge line above creates a variable enum which simply counts rows for each person, which is often useful.
temp <- with(data3, ifelse(death==1, 2, pcm))
data3$event <- factor(temp, 0:2, labels=c("censor", "pcm", "death"))
mfit3 <- survfit(Surv(tstart, tstop, event) ~ sex, data=data3, id=id)
print(mfit3, rmean=240, digits=2)
## Call: survfit(formula = Surv(tstart, tstop, event) ~ sex, data = data3,
## id = id)
##
## n nevent rmean se(rmean)*
## sex=F, (s0) 690 0 118.2 3.49
## sex=M, (s0) 809 0 104.1 3.15
## sex=F, pcm 690 59 3.2 0.59
## sex=M, pcm 809 56 2.7 0.62
## sex=F, death 690 423 118.6 3.48
## sex=M, death 809 540 133.2 3.19
## *restricted mean time in state (max time = 240 )
mfit3$transitions
## to
## from pcm death (censored)
## (s0) 115 860 409
## pcm 0 103 12
## death 0 0 0
It is worthwhile to check the transitions table in the output, simply as a data check, since an error in creating the input data can lead to surprising counts and an even more surprising curve. In this case it shows subjects going from the entry state to PCM and death along with transitions from PCM to death. This is as expected. In the following plot we combine the results from competing risk model used above along with a second fit that treats death after PCM as a separate state from death before progression. In this plot the fraction of subjects currently in the PCM state is shown by the distance between the two curves. (Only males are shown in the plot to minimize overlap).
# Death after PCM will correspond to data rows with
# enum = 2 and event = death
d2 <- with(data3, ifelse(enum==2 & event=='death', 4, as.numeric(event)))
e2 <- factor(d2, labels=c("censor", "pcm", "death w/o pcm",
"death after pcm"))
mfit4 <- survfit(Surv(tstart, tstop, e2) ~ sex, data=data3, id=id)
plot(mfit2[2,], lty=c(1,2),
xscale=12, mark.time=FALSE, lwd=2,
xlab="Years post diagnosis", ylab="Probability in State")
lines(mfit4[2,4], mark.time=FALSE, col=2, lty=1, lwd=2,
conf.int=FALSE)
legend(200, .5, c("Death w/o PCM", "ever PCM", "Death after PCM"), col=c(1,1,2), lty=c(2,1,1),
lwd=2, bty='n', cex=.82)