The following examples are taken from the paper: “Analysis of time-to-event for observational studies: Guidance to the use of intensity models”, available at : https://pubmed.ncbi.nlm.nih.gov/33043497/ and from the vignette “Multi-state models and competing risks” at https://cran.r-project.org/web/packages/survival/vignettes/compete.pdf

Peripheral Arterial Disease

Background

## Load packages
library(survival)
library(splines)
library(timereg) 
library(knitr)
library(ggplot2)


## IMPORT DATASET
# pad data
load("C:\\Users\\Giulia Barbati\\OneDrive - Università degli Studi di Trieste\\Stat_Learning_Epi\\SLE_AA 25_26\\Block 4\\R codes\\pad.rda")

Peripheral arterial disease (PAD) is a common circulatory problem in which narrowed arteries reduce blood flow to peripheral limbs, often the legs. It is also likely to be a sign of a more widespread atherosclerosis, and subjects manifesting the disease carry an increased risk for all atherothrombotic events including myocardial infarction, stroke, and cardiac death. The pad data set contains the results of a Slovene study. Briefly, the study was conducted by 74 primary care physicians-researchers, each of whom was asked to recruit 10 subjects with PAD along with 10 age and sex matched controls without PAD; actual recruitment ranged from 1 to 31 pairs. All participating subjects consented to a yearly examination and interview for 5 years. Important endpoints are death, either due to cardiovascular disease (CVD death) or other causes, non-fatal CVD endpoints of infarction and stroke, and patient interventions attributed to the disease such as revascularization procedures.

The final study includes 742 PAD and 713 controls, with baseline data for each subject, measurements at each visit, and endpoints. This analysis will focus on cardiac endpoints and on overall mortality.

Endpoint and follow-up

The starting R data has 3 dataframes:

pad1 has one observation per subject containing baseline information,
pad2 has one observation per subject visit containing measurements at that visit, and
pad3 has an observation for each observed endpoint.

The dataset pad2 contains all the repeated examinations over the 5 years of the study. Note that in this dataset the ID of the subject is repeated for multiple lines (data are in the long format). The same happens for pad3 that reports all the events of interest. In the IDA phase (Initial Data Analysis) we look at the overall follow-up for subjects, and decide on a time scale and an endpoint. Per the study design subjects should have a visit at years 1–5. We see in Figure that visit 1 dates are stretched out over a full year and by visit 5 over almost 2 years, a pattern that is unavoidable in observational studies of human subjects. Based on the counts shown in the figure there is no evidence of differential drop out between the PAD and control subjects.

vcount <- with(pad2, table(pad, visit))  # visit count
vtime <-  pad2$day/365.25
plot(vtime ~ jitter(visit + .3*(pad=="PAD") - .15), pad2, subset=(visit>0),
     col=c(1,2)[as.numeric(pad)], ylim=c(0, max(vtime)), 
     xlab= "Visit number", ylab="Years from initial visit")
#abline(h=1:6, lty=2, col=4)
segments(rep(0,6), 1:6, 1:6+.5, 1:6, lty=2, col='gray')
legend(4,2, c("PAD", "Control"), col=c(2,1), bty='n', pch=19)
text(1:5 - .15, 0:4, vcount[1,-1])
text(1:5 + .15, 0:4, vcount[2,-1])

pfig2 <- survfit(Surv(futime/365.25, status=='censor') ~ pad, pad1)
plot(pfig2, fun='event', col=1:2,
     xlab="Years post enrollment", ylab="Fraction censored")
legend(2, .6, c("Control", "PAD", "Ideal"), col=c(1,2,1), lty=c(1,1,2),
       bty='n')
lines(c(0, 4.8, 6.8), c(0,0,1), lty=3)

The censoring pattern for subjects in the two arms, showed in the above figure, are computed using a reverse Kaplan-Meier (i.e. the status is to be censored, and the censoring is having the event).
There is no evidence of differential follow-up.

Given the visit slippage of the first figure, an ideal censoring pattern would be a final followup that is approximately uniformly distributed between 4.8 and 6.8 years, this is shown as a dotted line on the figure.

Examining the counts of subjects for each visit the total number of visits for the subjects is not what was planned — about 25% do not have a fifth visit — but patients have adhered to the 5 year time commitment quite closely. As stated in the primary paper, a decision was made to truncate all patient data at 5 years for the purpose of analysis.

The time scale of the survival analysis

The initial survival curves can be created using a copy of the baseline data (pad1), with follow-up truncated at 5 years.

pdata <- subset(pad1, select=c(id, pad, sex, age, futime, status))
pdata$death <- ifelse(pdata$status == 'censor', 0, 1) # any cause of death

# truncate follow=up at 5 years
over5 <- pdata$futime > 5*(365.25)
pdata$futime <- pmin(1826, pdata$futime)  # 1826 days = 5 years
pdata$death[over5] <- 0
pdata$status[over5] <- "censor"
table(pdata$status)

## 
##      censor    CV death nonCV death 
##        1296          68          91

# Convert to follow-up in years, and add an age scale
pdata$years <- pdata$futime/365.25
pdata$age1 <- as.numeric(pad1$d.first - pad1$d.birth)/365.25
pdata$age2 <- pdata$age1 + pdata$years

# comparison age at entry between groups
aggregate(age1 ~ sex + pad, data = pdata, 
          FUN = function(x) c(mean = round(mean(x), 2), sd = round(sd(x), 2)))

##      sex     pad age1.mean age1.sd
## 1 female Control     65.22    9.49
## 2   male Control     64.36    9.04
## 3 female     PAD     66.37    8.92
## 4   male     PAD     64.30    8.60

ggplot(pdata, aes(x = pad, y = age1, fill = sex)) +
  geom_boxplot() +
  labs(title = "Age at Entry",
       x = "Pad Group",
       y = "Age at Entry (Years)") +
  theme_minimal()

oldpar <- par(mfrow=c(1,2))
psurv1 <- survfit(Surv(years, death) ~ pad + sex, data=pdata)
psurv2 <- survfit(Surv(age1, age2, death) ~ pad+ sex, pdata,
               start.time=55)
plot(psurv1, col=c(1,1,2,2), lty=1:2, fun='event', lwd=2,
     xlab="Years since enrollment", ylab="Death")
legend(.1, .18, c("PAD, male","PAD, female", "Control, male","Control, female"),
       col=c(2,2,1,1), lty=c(2,1,2,1), lwd=2, bty='n', cex=.8)
plot(psurv2,   col=c(1,1,2,2), lty=1:2, fun='event', lwd=2,
      xlab='Age', ylab='Death, given alive at age 55')

Our first analysis will use time since enrollment as the primary time scale. For the PAD patients the time since diagnosis is a natural scale since it represents both the progression of disease and of treatments for the disease. Survival curves for the control subjects serve as a comparison outcome of similarly aged subjects without the disease, but do not have a natural stand-alone interpretation.

Death rates are higher for males than for females, and is higher for PAD subjects than for the age matched controls. The right hand panel shows the curves on age scale. Note that when using the age scale, we include two time points in the Survival function: age at entry in the study and age at the event or censoring. These data are both right-censored but also left-truncated (we have not discussed this point, but it is of interest if you want to deepen the survival framework). In two words, left-truncation is when the subjects to enter under your observation should be survived until a certain time-point (i.e. age at entry).

Survival curves can equivalently display either the fraction who have died (curves start at 0: 1-KM) or the fraction who have not died (curves start at 1); the choice between the two is often based on tradition or habit.

Risk factors

The analysis will focus on the impact on survival of the covariates PAD, age, sex, and serum values for high density and low density lipoprotein (HDL, LDL). We are therefore in a framework of a descriptive model, since there isn’t a formal causal model focused on one specific risk factor but instead the simultaneous associations of multiple factors to survival is of interest. Figure shows density estimates for each of the three continuous covariates at baseline, by PAD group. By study design there should be no difference in the age distribution. HDL and LDL are slightly lower for the PAD subjects.

oldpar <- par(mfrow=c(1,3), mar=c(5.1, 4.1, 5.1, 1))
dfun <- function(data, x, ...) {
    d0 <- density((data[[x]])[data$pad=="Control"])
    d1 <- density((data[[x]])[data$pad=="PAD"])
    matplot(cbind(d0$x, d1$x), cbind(d0$y, d1$y), type='l', lty=1, col=1:2,
            ...)
}
lunit <- 38.67  # convert from mmol/L to mg/dl
xx <- c(0, 50, 100, 150, 200, 250, 300)

dfun(pdata, "age", xlab="Age", ylab="Density", lwd=2)
legend(51, .02, c("PAD", "Control"), lty=1, lwd=2, col=2:1, bty='n')

dfun(na.omit(subset(pad2, visit==0, c(hdl, pad))),
             "hdl", xlab="HDL (mmol/L)", ylab="Density", lwd=2)
axis(3, xx/lunit, xx)

dfun(na.omit(subset(pad2, visit==0, c(ldl, pad))),
     "ldl", xlab="LDL (mmol/L)", ylab="Density", lwd=2)
axis(3, xx/lunit, xx)

Time-dependent covariates (multiple measurements) & competing risks

The Cox model fits will use age, sex, baseline values of high density and low density lipoproteins (hdl0, ldl0), time-dependent lipid values (hdl, ldl), and a time-dependent value of “years since enrollment” (etime). Note that time since enrollment only plays a role in age scale models; in a classical time-since-entry model every one in a given risk set will have exactly the same value of the covariate, and hence the estimated coefficient for the variable would be 0.

For time-dependent analysis we use a counting process form.This framework is suitable for all kind of generalizations, and it is therefore to be preferred when switching to more advanced methods, like in the multi-state approaches. Mathematically each subject is represented by 3 processes: \(Y_i(t)=1\) if the subject is currently at risk, \(X_i(t)\) is the time dependent covariate process, and \(N(t)\) the cumulative number of events. In the R code this translates into multiple observations for each subject of the form start-stop:

 id, time1, time2, event, x1, x2, ...

where (time1, time2) describe an interval over which the subject is at risk, event corresponds to \(dN(t)\) and will be 1 of the subject had an event at time2 and 0 otherwise, x1, x2, ... are the value of the covariates over the interval, and id is a subject identifier. Note that in this format the event happens at time2, instead the value of the covariate in each line is defined at time1. Subject 9, for instance, is a male with blood values measured on days 0, 583, 945, and 1309, who died on day 1349. Their data has the form

id tstart tstop death  sex  ldl  hdl
 9      0   583     0 male 2.66 1.08
 9    583   945     0 male 3.20 1.53
 9    945  1309     0 male 2.81 1.32
 9   1309  1349     1 male 2.51 1.30

The lipid values and time intervals encode the most recent measurement of the value. Creation and checking of the data often represents a major share of the work in an analysis. There is often concern that because the data set has multiple rows for the same subject, a valid variance estimate for the model will need to account for correlation.
This is not the case, however.
Internally, there is only one copy of each subject; the multiple rows are merely a way to communicate to the program which covariate value is applicable at any given time point.

When there are competing risks, such as CVD and non-CVD death, the 0/1 event event variable is replaced by a factor with levels of “none”, “CVD death”, “non-CVD death”, which shows both when an event occurred and it’s type.

Below we create separate time-dependent data sets for analysis on the time-from-enrollment scale (pdata2) and on the age scale (pdata3).

The two data sets have separate event indicators for death, the competing risk of CVD/non-CVD death, and for any CVD event/non-CVD death. Creating (time1, time2) data sets on age-in-years scale can lead to anomalies due to round off errors; for this reason the time-dependent data sets all use time in days. Use birth date and entry date to create agedays = age at enrollment, in days.

This data set will therefore have two time scales (age and time since enrollment), time dependent covariates (hdl, ldl, etime), and multiple endpoints (all cause death, CVD vs non-CVD death, major CVD event).

A. etime = time since enrollment as a time dependent variable

 1. This is a continuous time-dependent covariate -- technically
 it changes minute by minute.
 
 2. We will be using it in age scale fits.  If we create it on age scale
then the cuts for survSplit can be limited to "the ages at which deaths occur"

B. The tmerge routine can keep track of multiple endpoints on a single time scale, the survSplit routine a single endpoint and single time scale.

C. The tmerge operations needs to be sequential: no step that would change the number of rows can be inserted in the midst.

Taken all together, it is simplest to create the separate data sets, pdata2 on the time since enrollment scale, pdata3 on the age scale. Many of the steps will need to be repeated, e.g., adding time dependent hdl and ldl.

pdata2 <- tmerge(pdata[, c("id", "pad", "age", "sex")], pdata, 
                 id=id, status = event(futime, status))
pdata2 <- tmerge(pdata2, pad2, id=id, ldl= tdc(day, ldl),
                 hdl = tdc(day, hdl))

head(pdata2)

##   id     pad  age    sex tstart tstop status  ldl  hdl
## 1  1 Control 69.5   male      0   487 censor 2.39 1.65
## 2  1 Control 69.5   male    487   876 censor 2.82 1.78
## 3  1 Control 69.5   male    876  1243 censor 2.59 2.27
## 4  1 Control 69.5   male   1243  1597 censor 2.80 2.00
## 5  1 Control 69.5   male   1597  1826 censor 2.33 2.26
## 6  2     PAD 63.9 female      0   459 censor 3.00 0.97

# add baseline hdl and ldl as covariates
indx <- match(pdata2$id, pdata2$id)
pdata2$ldl0 <- pdata2$ldl[indx]
pdata2$hdl0 <- pdata2$hdl[indx]
pdata2$age10 <- pdata2$age/10   # age in decades


head(pdata2)

##   id     pad  age    sex tstart tstop status  ldl  hdl ldl0 hdl0 age10
## 1  1 Control 69.5   male      0   487 censor 2.39 1.65 2.39 1.65  6.95
## 2  1 Control 69.5   male    487   876 censor 2.82 1.78 2.39 1.65  6.95
## 3  1 Control 69.5   male    876  1243 censor 2.59 2.27 2.39 1.65  6.95
## 4  1 Control 69.5   male   1243  1597 censor 2.80 2.00 2.39 1.65  6.95
## 5  1 Control 69.5   male   1597  1826 censor 2.33 2.26 2.39 1.65  6.95
## 6  2     PAD 63.9 female      0   459 censor 3.00 0.97 3.00 0.97  6.39

# create "any cvd event" as a competing risk, it will be used later
temp <- subset(pad3, event %in% c("cvd death", "infarction", "stroke"))
temp <- temp[!duplicated(temp$id),]   # first of these
temp <- subset(temp, day< 5*365.25)   # must be <= 5 years
pdata2 <- tmerge(pdata2, temp, id=id, cvstat=event(day),
                 priorcv = tdc(day))

pdata2$death <- ifelse(pdata2$status=="censor", 0, 1)  # all cause death
tstat <- ifelse(pdata2$cvstat>0, pdata2$cvstat, 2*pdata2$death)
pdata2$cvstat <- factor(tstat, 0:2,
                         c("censor", "CVD event", "nonCVD-death"))
pdata2$death <- ifelse(pdata2$status=="censor", 0, 1)  # all cause death


table(pdata2$cvstat)

## 
##       censor    CVD event nonCVD-death 
##         6488          143          100

table(pdata2$death)

## 
##    0    1 
## 6572  159

table(pdata2$status)

## 
##      censor    CV death nonCV death 
##        6572          68          91

Note that when considering competing risks (cvstat) we have only 100 deaths that are “death as a first event”. Note also that since we are duplicating subjects we have a lot more censored observations.

# Reprise now all the steps, but on age scale.  We need to add "age" to the
#  merged-in data sets before adding them.
ageday <- as.numeric(pad1$d.first - pad1$d.birth) # age in days at enrollmemt

pdata3 <- tmerge(pdata[, c("id", "pad", "sex")], pdata, 
                 id=id, status = event(futime + ageday, status),
                 tstart=ageday, tstop=ageday+futime,
                 options=list(tstart="age1", tstop="age2"))
temp2 <- pad2
temp2$age <- temp2$day + ageday[match(temp2$id, pad1$id)]
pdata3 <- tmerge(pdata3, temp2, id=id, ldl= tdc(age, ldl),
                 hdl = tdc(age, hdl))
head(pdata3)

##   id     pad    sex  age1  age2 status  ldl  hdl
## 1  1 Control   male 25395 25882 censor 2.39 1.65
## 2  1 Control   male 25882 26271 censor 2.82 1.78
## 3  1 Control   male 26271 26638 censor 2.59 2.27
## 4  1 Control   male 26638 26992 censor 2.80 2.00
## 5  1 Control   male 26992 27221 censor 2.33 2.26
## 6  2     PAD female 23342 23801 censor 3.00 0.97

# add baseline hdl and ldl as covariates
indx <- match(pdata3$id, pdata3$id) #points to the first row for each subject
pdata3$ldl0 <- pdata3$ldl[indx]
pdata3$hdl0 <- pdata3$hdl[indx]

# Add time since enrollment as a time-dependent covariate.
#  Use a temporary data set + survSplit to create a 'variable' that looks
#  like yet another lab test, then merge it in on the age scale.
#  Time since enrollment will be in weeks, which will create 52*5= 260 obs
#  per subject.  (Time since enrollment in days would be overkill.)
pdata$ageday <- ageday
temp3 <- survSplit(Surv(futime, status) ~ ageday + id, data= pdata,
      cut= seq(7, max(pdata$futime), by=7))
pdata3 <- tmerge(pdata3, temp3, id, etime = tdc(ageday + tstart, tstart))
pdata3$etime <- pdata3$etime/365.25   # recode to time in years

head(pdata3)

##   id     pad  sex  age1  age2 status  ldl  hdl ldl0 hdl0      etime
## 1  1 Control male 25395 25402 censor 2.39 1.65 2.39 1.65 0.00000000
## 2  1 Control male 25402 25409 censor 2.39 1.65 2.39 1.65 0.01916496
## 3  1 Control male 25409 25416 censor 2.39 1.65 2.39 1.65 0.03832991
## 4  1 Control male 25416 25423 censor 2.39 1.65 2.39 1.65 0.05749487
## 5  1 Control male 25423 25430 censor 2.39 1.65 2.39 1.65 0.07665982
## 6  1 Control male 25430 25437 censor 2.39 1.65 2.39 1.65 0.09582478

# add 'any cv' as an event
temp$age <- temp$day + ageday[match(temp$id, pad1$id)]
pdata3 <- tmerge(pdata3, temp, id=id, cvstat=event(age),
                 priorcv = tdc(age))

# we are adding variable (priorcv) that tracks if they’ve had a "prior" CV event.

# recode now the event variables
pdata3$death <- ifelse(pdata3$status=="censor", 0, 1)  # all cause death
tstat <- ifelse(pdata3$cvstat>0, pdata3$cvstat, 2*pdata3$death)
pdata3$cvstat <- factor(tstat, 0:2, c("censor", "CVD event", "nonCVD-death"))


# pdata3[1:280,]

table(pdata3$death)

## 
##      0      1 
## 362717    159

table(pdata3$cvstat)

## 
##       censor    CVD event nonCVD-death 
##       362633          143          100

Now the number of censored observations is much more higher since we have duplicated rows to create the TD variable time since enrollment.

# Creation of multi-row data sets has pitfalls.  It is always wise to
#  sanity check the results.  
check1 <- survcheck(Surv(tstart, tstop, death) ~ 1, id=id, data= pdata2)
check1

## Call:
## survcheck(formula = Surv(tstart, tstop, death) ~ 1, data = pdata2, 
##     id = id)
## 
## Unique identifiers       Observations        Transitions 
##               1455               6731                159 
## 
## Transitions table:
##        to
## from    event (censored)
##   (s0)    159       1296
##   event     0          0
## 
## Number of subjects with 0, 1, ... transitions to each state:
##          0   1
## event 1296 159

check2 <- survcheck(Surv(age1, age2, death) ~1, id=id, data= pdata3)
check2

## Call:
## survcheck(formula = Surv(age1, age2, death) ~ 1, data = pdata3, 
##     id = id)
## 
## Unique identifiers       Observations        Transitions 
##               1455             362876                159 
## 
## Transitions table:
##        to
## from    event (censored)
##   (s0)    159       1296
##   event     0          0
## 
## Number of subjects with 0, 1, ... transitions to each state:
##          0   1
## event 1296 159

Overall mortality

The first pair of fits shows that the estimated overall effects of PAD and sex hardly differ between analyses done on the enrollment or age scales. A second pair fits looks at the coefficient effects within the PAD and control groups. (Inclusion of the covariate by strata interactions gives results that are equivalent to separate fits for the the PAD and control groups; bundling them as a single model allows a formal test of whether the effects differ between the two strata.) None of sex, enrollment age, or time since enrollment show meaningful differences between the PAD and control groups. That is, on the log-hazard scale, PAD appears to act as an additive risk.

# Fit the first set of models: pad, age, sex on the time since enrollment
#  scale (cfit1a) and on the age scale (cfit1b)
cfit1a <- coxph(Surv(tstart, tstop, death) ~ pad + sex + age10, pdata2)
cfit1b <- coxph(Surv(age1, age2, death) ~ pad + sex + etime, pdata3)
# Add interactions
cfit2a <- coxph(Surv(tstart, tstop, death) ~ (sex + age10)*strata(pad), pdata2)
cfit2b <- coxph(Surv(age1, age2, death) ~ (sex + etime)*strata(pad),
                pdata3)

cat("Overall model on time since enrollment scale\n")

## Overall model on time since enrollment scale

print(cfit1a)

## Call:
## coxph(formula = Surv(tstart, tstop, death) ~ pad + sex + age10, 
##     data = pdata2)
## 
##           coef exp(coef) se(coef)     z        p
## padPAD  0.8755    2.4000   0.1739 5.036 4.76e-07
## sexmale 0.6924    1.9985   0.1822 3.800 0.000145
## age10   0.6583    1.9315   0.1053 6.250 4.11e-10
## 
## Likelihood ratio test=82.63  on 3 df, p=< 2.2e-16
## n= 6731, number of events= 159

cat("\nOverall model on age scale\n")

## 
## Overall model on age scale

print(cfit1b)

## Call:
## coxph(formula = Surv(age1, age2, death) ~ pad + sex + etime, 
##     data = pdata3)
## 
##            coef exp(coef) se(coef)     z        p
## padPAD  0.87354   2.39538  0.17400 5.020 5.16e-07
## sexmale 0.70540   2.02466  0.18270 3.861 0.000113
## etime   0.16593   1.18049  0.05917 2.804 0.005042
## 
## Likelihood ratio test=51.15  on 3 df, p=4.538e-11
## n= 362876, number of events= 159

cat("\nInteractions between covariates and treatment, time since enrollment\n")

## 
## Interactions between covariates and treatment, time since enrollment

print(cfit2a)

## Call:
## coxph(formula = Surv(tstart, tstop, death) ~ (sex + age10) * 
##     strata(pad), data = pdata2)
## 
##                            coef exp(coef) se(coef)      z        p
## sexmale                 0.67576   1.96552  0.33494  2.018 0.043638
## age10                   0.68341   1.98062  0.19354  3.531 0.000414
## sexmale:strata(pad)PAD  0.02208   1.02233  0.39925  0.055 0.955897
## age10:strata(pad)PAD   -0.03498   0.96563  0.23081 -0.152 0.879552
## 
## Likelihood ratio test=55.21  on 4 df, p=2.938e-11
## n= 6731, number of events= 159

cat("\nInteractions between covariates and treatment, age scale\n")

## 
## Interactions between covariates and treatment, age scale

print(cfit2b)

## Call:
## coxph(formula = Surv(age1, age2, death) ~ (sex + etime) * strata(pad), 
##     data = pdata3)
## 
##                             coef exp(coef)  se(coef)     z      p
## sexmale                0.6980525 2.0098347 0.3357736 2.079 0.0376
## etime                  0.1176904 1.1248958 0.1097373 1.072 0.2835
## sexmale:strata(pad)PAD 0.0001539 1.0001539 0.4003644 0.000 0.9997
## etime:strata(pad)PAD   0.0665382 1.0688018 0.1304441 0.510 0.6100
## 
## Likelihood ratio test=24.44  on 4 df, p=6.516e-05
## n= 362876, number of events= 159

Therefore, we can see that in the multivariable Cox model on the follow up scale, being PAD and increasing age decrease the estimated survival, male sex is at higher risk. On the age scale, these results are confirmed. There is no a significant interaction between PAD and gender or age.

Cardiovascular death

Since PAD is expected to exert its primary effect via cardiovascular disease (CVD) endpoints, a more interesting analysis is to focus on CVD rather than overall mortality. Since in this case non-CVD death will be a competing risk this needs to be accounted for in our estimate of the probability in state; in the R survival package this is accomplished by using a multi-level categorical variable as the “status”, rather than 0/1. The survfit function then computes Aalen-Johansen (AJ) estimate.

psurv4 <- survfit(Surv(tstart, tstop, status) ~ pad, id = id, pdata2)
psurv4

## Call: survfit(formula = Surv(tstart, tstop, status) ~ pad, data = pdata2, 
##     id = id)
## 
##                             n nevent      rmean se(rmean)*
## pad=Control, (s0)        3349      0 1774.19419   8.648882
## pad=PAD, (s0)            3382      0 1711.26894  12.097682
## pad=Control, CV death    3349     17   16.99381   5.010891
## pad=PAD, CV death        3382     51   57.63161   9.122496
## pad=Control, nonCV death 3349     30   34.81200   7.166591
## pad=PAD, nonCV death     3382     61   57.09945   8.486423
##    *restricted mean time in state (max time = 1826 )

plot(psurv4, col=1:2, lty=c(1,1,2,2), ylim=c(0,.15), xscale=365.25, 
     xlab="Years since enrollment", ylab="Absolute risk")
legend("topleft", c("PAD CVD death", "Control CVD death", 
                 "PAD nCVD death","Control nCVD"),
       col=c(2,1,2,1), lty=c(1,1,2,2), lwd=2, bty='n')

Figure shows the Aalen-Johansen estimates of probability in state for CVD and non-CVD death (which could also be labeled here as “absolute risk” or “cumulative incidence”). For controls, CVD deaths are about 1/2 of the non-CVD, approximately 2% versus 4% at 5 years; near 8% both outcomes for the PAD subjects.

The absolute risk curves require that all transitions in the model be considered jointly: incorrect curves will result from a simple Kaplan-Meier of time to CVD, censoring at non-CVD death, or a KM of time to non-CVD death, censoring at CVD.

(Interestingly, this is not true for the cumulative hazard estimate: the naive estimate for one outcome, censoring the others, will be identical to the joint estimate).

We could repeat the same procedure but focusing now on major CVD events with death w/o CVD as a competing risk:

psurv5 <- survfit(Surv(tstart, tstop, cvstat) ~ pad, pdata2, id=id,
                  subset=(priorcv==0))
plot(psurv5, col=1:2, lty=c(1,1,2,2), xscale=365.25, ylim=c(0, .15),
         xlab="Years since enrollment", ylab="Absolute risk")         
legend("topleft", c("Major CVD, PAD", "Major CVD, Control", 
                  "death w/o CVD, PAD","death w/o CVD, Control"),
       col=c(2,1,2,1), lty=c(1,1,2,2), lwd=2, bty='n', cex=.85)

The plot represent the AJ curves for first major CVD event (infarction, stroke, CVD death) versus non-CVD death w/o any CVD non-fatal event as a competing risk. Nearly 14% of the PAD subjects experienced a major CVD event versus 6% for controls.

Cox model for the transitions

The Cox model is a model for the hazards and has the same identity: the first coxph call below, which models the entire process, will have the same results for the CVD death portion as the second coxph call, which ignores non-CVD deaths by censoring them; this is verified by the all.equal test.

However, if one desires to predict absolute risk curves from the fitted Cox model then the first form (using the “complete” multi-state approach) is essential !!!

pcox1 <- coxph(Surv(tstart, tstop, status) ~ pad + sex + age10 + hdl0 + ldl0, 
               pdata2, id=id)
cat("transitions table\n")

## transitions table

pcox1$transitions  # a check of the data

##              to
## from          CV death nonCV death (censored)
##   (s0)              67          91       1275
##   CV death           0           0          0
##   nonCV death        0           0          0

pcox1

## Call:
## coxph(formula = Surv(tstart, tstop, status) ~ pad + sex + age10 + 
##     hdl0 + ldl0, data = pdata2, id = id)
## 
##          
## 1:2           coef exp(coef) robust se      z        p
##   padPAD   1.05440   2.87025   0.28029  3.762 0.000169
##   sexmale  0.51298   1.67026   0.27881  1.840 0.065786
##   age10    0.65698   1.92895   0.17894  3.672 0.000241
##   hdl0    -0.30026   0.74063   0.44220 -0.679 0.497130
##   ldl0    -0.08128   0.92194   0.12754 -0.637 0.523966
## 
##          
## 1:3           coef exp(coef) robust se      z        p
##   padPAD   0.71390   2.04194   0.22789  3.133  0.00173
##   sexmale  0.75348   2.12438   0.25505  2.954  0.00313
##   age10    0.65666   1.92835   0.14379  4.567 4.95e-06
##   hdl0    -0.20347   0.81589   0.32688 -0.622  0.53363
##   ldl0     0.02213   1.02237   0.10857  0.204  0.83850
## 
##  States:  1= (s0), 2= CV death, 3= nonCV death 
## 
## Likelihood ratio test=83.96  on 10 df, p=8.386e-14
## n= 6626, unique id= 1433, number of events= 158 
##    (105 osservazioni eliminate a causa di valori mancanti)

Note that we have lost 1 subject for CV death and 21 censored subjects because there are some missing data in the lipid covariates.

# cause-specific Cox model (censoring non CV death) 
pcox1b <- coxph(Surv(tstart, tstop, status=="CV death") ~ pad +sex + age10
              + hdl0 + ldl0, data=pdata2)  #censor non-CVD death
pcox1b

## Call:
## coxph(formula = Surv(tstart, tstop, status == "CV death") ~ pad + 
##     sex + age10 + hdl0 + ldl0, data = pdata2)
## 
##             coef exp(coef) se(coef)      z        p
## padPAD   1.05445   2.87039  0.28292  3.727 0.000194
## sexmale  0.51298   1.67027  0.27813  1.844 0.065122
## age10    0.65703   1.92905  0.16379  4.011 6.03e-05
## hdl0    -0.30029   0.74060  0.32993 -0.910 0.362744
## ldl0    -0.08124   0.92197  0.12687 -0.640 0.521959
## 
## Likelihood ratio test=39.51  on 5 df, p=1.874e-07
## n= 6626, number of events= 67 
##    (105 osservazioni eliminate a causa di valori mancanti)

# adding also TD covariates
pcox2 <- coxph(Surv(tstart, tstop, status) ~ pad + sex + age10+
              + hdl + ldl, data=pdata2, id=id)  # time dependent hdl and ldl
pcox2

## Call:
## coxph(formula = Surv(tstart, tstop, status) ~ pad + sex + age10 + 
##     +hdl + ldl, data = pdata2, id = id)
## 
##          
## 1:2          coef exp(coef) robust se      z        p
##   padPAD   0.8751    2.3991    0.2745  3.188 0.001433
##   sexmale  0.3094    1.3626    0.2707  1.143 0.252954
##   age10    0.6963    2.0064    0.1813  3.840 0.000123
##   hdl     -1.5482    0.2126    0.4657 -3.325 0.000885
##   ldl     -0.2797    0.7560    0.1496 -1.869 0.061609
## 
##          
## 1:3          coef exp(coef) robust se      z        p
##   padPAD   0.7545    2.1265    0.2256  3.345 0.000824
##   sexmale  0.7847    2.1918    0.2461  3.188 0.001432
##   age10    0.6761    1.9662    0.1409  4.797 1.61e-06
##   hdl     -0.1489    0.8617    0.3901 -0.382 0.702680
##   ldl      0.1338    1.1432    0.1070  1.250 0.211251
## 
##  States:  1= (s0), 2= CV death, 3= nonCV death 
## 
## Likelihood ratio test=105.6  on 10 df, p=< 2.2e-16
## n= 6702, unique id= 1453, number of events= 158 
##    (29 osservazioni eliminate a causa di valori mancanti)

There were 67 CVD and 91 non-CVD deaths in 5 years of follow-up, which are the transitions from the initial state (s0) to the CVD death and non-CVD death states, respectively, in the transitions table.

(No one transitioned from death to another state, which would indicate an error in the data set !!!)

The joint Cox model shows that after adjustment for age, sex, and baseline lipids, PAD is associated with nearly three fold increase in the rate of CVD deaths and double fold increase in non-CVD death.

Age and sex are important predictors as well, as expected, with male sex having a somewhat larger effect on the non-CVD endpoint.

The effect of baseline lipid values was not significant, however, time-dependent HDL shows an effect on the CVD death endpoint, somewhat reducing the estimated effect of PAD on CVD death.

A note on evaluating linearity

Below we reprise the CVD portion of the fit, and investigate linearity for the serum lipids, whose skewness may raise concerns. We can explore nonlinearity in several ways, in R fitting a spline is a simple one.

The likelihood ratio test provided by the anova command shows that the gain in goodness of fit from an HDL and/or LDL transformation is modest. The plot implies there is less “high HDL” benefit after reaching a value of 1.5, which agrees with common guidelines; together this suggests that a modified covariate hdl2 = min(HDL, 1.5) might be the better predictor. A final check for proportional hazards did not reveal any concerns.

pcox5a <- coxph(Surv(tstart, tstop, cvstat=="CVD event") ~ pad + age10+ sex
              + hdl + ldl, data=pdata2, subset=(priorcv==0))
pcox5b <- coxph(Surv(tstart, tstop, cvstat=="CVD event") ~ pad + age10+ sex
              + ns(hdl,3) + ns(ldl,3), data=pdata2, subset= (priorcv==0))
anova(pcox5a ,pcox5b)

## Analysis of Deviance Table
##  Cox model: response is  Surv(tstart, tstop, cvstat == "CVD event")
##  Model 1: ~ pad + age10 + sex + hdl + ldl
##  Model 2: ~ pad + age10 + sex + ns(hdl, 3) + ns(ldl, 3)
##    loglik  Chisq Df Pr(>|Chi|)  
## 1 -985.67                       
## 2 -980.93 9.4697  4    0.05037 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

cox.zph(pcox5b)

##             chisq df     p
## pad         0.413  1 0.520
## age10       3.249  1 0.071
## sex         2.361  1 0.124
## ns(hdl, 3)  6.453  3 0.092
## ns(ldl, 3)  2.343  3 0.504
## GLOBAL     12.787  9 0.172

termplot(pcox5b, term=4, se=TRUE, ylim=c(-4,4), col.term=1, col.se=1,
         xlab="HDL", ylab="Estimated hazard ratio")

Predicted probability-in-state curves

Predicted \(P(state)\) curves can be created from the multi-state fits, as well. As with any Cox model prediction, the first task is to choose a set of covariate values for which prediction is desired. We create the predictions for four hypothetical male subjects with ages of 58 and 72 (quartiles of age), and PAD vs. control. For HDL and LDL use values of 1.3 and 3, which are near the medians. (These curves are for the model of CVD death vs other death, time-fixed hdl and ldl).

dummy <- expand.grid(pad=c("PAD", "Control"), age10=c(5.8, 7.2), sex="male",
                     hdl0 =1.3, ldl0 =3)
dummy

##       pad age10  sex hdl0 ldl0
## 1     PAD   5.8 male  1.3    3
## 2 Control   5.8 male  1.3    3
## 3     PAD   7.2 male  1.3    3
## 4 Control   7.2 male  1.3    3

coxpred1 <- survfit(pcox1, newdata= dummy)
dim(coxpred1)

##   data states 
##      4      3

coxpred1$states

## [1] "(s0)"        "CV death"    "nonCV death"

The resulting set of absolute risk curves can be viewed as an array of estimates by stratum, target covariates, and state. The pcox1 fit had no strata so the first dimension is 1. The resulting curves are shown in the Figure. Because a multi-state model can generate a very large number of curves it is often helpful to break them into multiple panels.

par(mfrow=c(1, 2))
plot(coxpred1[1, 1:2 ,], col=2:1, lty=c(1,1,2,2), lwd=2, xscale=365.25,
     ylim=c(0, .15),
     xlab="Years since enrollment", ylab="predicted P(state), age = 58")
legend("topleft", c("PAD, CVD death", "Control, CVD death",
                   "PAD, nonCVD death", "Control, nonCVD death"),
       col=c(2,1,2,1), lty=c(1,1,2,2), lwd=2, bty='n', cex=.8)
plot(coxpred1[1, 3:4,], col=2:1, lty=c(1,1,2,2), lwd=2, xscale=365.25,
     ylim=c(0, .15),
     xlab="Years since enrollment", ylab="predicted P(state), age=72")

legend("topleft", c("PAD, CVD death", "Control, CVD death",
                   "PAD, nonCVD death", "Control, nonCVD death"),
       col=c(2,1,2,1), lty=c(1,1,2,2), lwd=2, bty='n', cex=.8)

From competing risks to multistate: an example

The mgus2 data set contains the time to plasma cell malignancy (PCM) and/or death for 1384 subjects diagnosed with monoclonal gammopathy of undetermined significance (MGUS). Survival and progression time are in months. The code below creates an ordinary Kaplan-Meier curve of post-diagnosis survival for these subjects, along with a histogram of age at diagnosis. The mean age at diagnosis is just over 70 years.

mfit1 <- survfit(Surv(futime, death) ~ sex, data=mgus2)
mfit1

## Call: survfit(formula = Surv(futime, death) ~ sex, data = mgus2)
## 
##         n events median 0.95LCL 0.95UCL
## sex=F 631    423    108     100     121
## sex=M 753    540     88      79      97

plot(mfit1, col=c(1,2), xscale=12, mark.time=FALSE, lwd=2,
xlab="Years post diagnosis", ylab="Survival")
legend("topright", c("female", "male"), col=1:2, lwd=2, bty='n')

As a second model for these subjects we will use competing risks (CR), where PCM and death without malignancy are the two terminal states. In a CR model we are only interested in the first event for each subject. Formally we are treating progression to a PCM as an absorbing state, i.e., one that subjects never exit. We create a variable etime containing the time to the first of progression, death, or last follow-up along with an event variable that contains the outcome. The starting data set mgus2 has two pairs of variables (ptime, pstat) that contain the time to progression and (futime, status) that contain the time to death or last known alive. The code below creates the necessary etime and event variables, then computes and plots the competing risks estimate.

mgus2$etime <- with(mgus2, ifelse(pstat==0, futime, ptime))
event <- with(mgus2, ifelse(pstat==0, 2*death, 1))
mgus2$event <- factor(event, 0:2, labels=c("censor", "pcm", "death"))
table(mgus2$event)

## 
## censor    pcm  death 
##    409    115    860

mfit2 <- survfit(Surv(etime, event) ~ sex, data=mgus2)
print(mfit2, rmean=240, scale=12)

## Call: survfit(formula = Surv(etime, event) ~ sex, data = mgus2)
## 
##                n nevent     rmean se(rmean)*
## sex=F, (s0)  631      0  9.853608  0.2909061
## sex=M, (s0)  753      0  8.675012  0.2627185
## sex=F, pcm   631     59  1.323284  0.1742312
## sex=M, pcm   753     56  1.064693  0.1417536
## sex=F, death 631    370  8.823108  0.3053757
## sex=M, death 753    490 10.260294  0.2825688
##    *restricted mean time in state (max time = 20 )

mfit2$transitions

##        to
## from    pcm death (censored)
##   (s0)  115   860        409
##   pcm     0     0          0
##   death   0     0          0

Note that each subject’s initial state can be specified by the istate argument. When this is omitted all subjects are assumed to start from an entry state named ’(s0)’. The printout shows that a male subject will spend, on average, 8.7 of his first 20 years post diagnosis in the entry state, 1.1 years in the PCM state and 10.3 of those 20 in the death state. If a cutoff time is not given the default is to use the maximum observed time for all curves, which is 424 months in this case.

plot(mfit2, col=c(1,2,1,2), lty=c(2,2,1,1),
mark.time=FALSE, lwd=2, xscale=12,
xlab="Years post diagnosis", ylab="Probability in State")
legend(240, .6, c("death:female", "death:male", "pcm:female", "pcm:male"),
col=c(1,2,1,2), lty=c(1,1,2,2), lwd=2, bty='n')

Now we switch to a multi-state model: in this version a subject can have multiple transitions and thus multiple rows in the data set (we consider also death after PCM). In this case it is necessary to identify which data rows go with which subject via the id argument of survfit. Note that we need to decide what to do with the 9 subjects who have PCM and death declared at the same month. (Some of these were cancer cases discovered at autopsy.) For the multi-state model we need to be explicit in how this is coded since a sojourn time of 0 within a state is not allowed. Below we push the progression time back by .1 month when there is a tie, but that amount is entirely arbitrary. Since there are now 3 possible transitions we will call the data set data3.

ptemp <- with(mgus2, ifelse(ptime==futime & pstat==1, ptime-.1, ptime))
data3 <- tmerge(mgus2, mgus2, id=id, death=event(futime, death),
pcm = event(ptemp, pstat))
data3 <- tmerge(data3, data3, id, enum=cumtdc(tstart))
with(data3, table(death, pcm))

##      pcm
## death   0   1
##     0 421 115
##     1 963   0

The table above shows that there are no observations in data3 that have both a PCM and death, i.e., the ties have been resolved. The last tmerge line above creates a variable enum which simply counts rows for each person, which is often useful.

temp <- with(data3, ifelse(death==1, 2, pcm))
data3$event <- factor(temp, 0:2, labels=c("censor", "pcm", "death"))
mfit3 <- survfit(Surv(tstart, tstop, event) ~ sex, data=data3, id=id)
print(mfit3, rmean=240, digits=2)

## Call: survfit(formula = Surv(tstart, tstop, event) ~ sex, data = data3, 
##     id = id)
## 
##                n nevent rmean se(rmean)*
## sex=F, (s0)  690      0 118.2       3.49
## sex=M, (s0)  809      0 104.1       3.15
## sex=F, pcm   690     59   3.2       0.59
## sex=M, pcm   809     56   2.7       0.62
## sex=F, death 690    423 118.6       3.48
## sex=M, death 809    540 133.2       3.19
##    *restricted mean time in state (max time = 240 )

mfit3$transitions

##        to
## from    pcm death (censored)
##   (s0)  115   860        409
##   pcm     0   103         12
##   death   0     0          0

It is worthwhile to check the transitions table in the output, simply as a data check, since an error in creating the input data can lead to surprising counts and an even more surprising curve. In this case it shows subjects going from the entry state to PCM and death along with transitions from PCM to death. This is as expected. In the following plot we combine the results from competing risk model used above along with a second fit that treats death after PCM as a separate state from death before progression. In this plot the fraction of subjects currently in the PCM state is shown by the distance between the two curves. (Only males are shown in the plot to minimize overlap).

# Death after PCM will correspond to data rows with
# enum = 2 and event = death
d2 <- with(data3, ifelse(enum==2 & event=='death', 4, as.numeric(event)))
e2 <- factor(d2, labels=c("censor", "pcm", "death w/o pcm",
"death after pcm"))
mfit4 <- survfit(Surv(tstart, tstop, e2) ~ sex, data=data3, id=id)
plot(mfit2[2,], lty=c(1,2),
xscale=12, mark.time=FALSE, lwd=2,
xlab="Years post diagnosis", ylab="Probability in State")
lines(mfit4[2,4], mark.time=FALSE, col=2, lty=1, lwd=2,
conf.int=FALSE)
legend(200, .5, c("Death w/o PCM", "ever PCM", "Death after PCM"), col=c(1,1,2), lty=c(2,1,1),
lwd=2, bty='n', cex=.82)

Supplementary materials : more advanced survival topics