These examples are taken from the paper: “Analysis of time-to-event for observational studies: Guidance to the use of intensity models”, available at : https://pubmed.ncbi.nlm.nih.gov/33043497/.

Introduction

Peripheral Arterial Disease

Background

Peripheral arterial disease (PAD) is a common circulatory problem in which narrowed arteries reduce blood flow to peripheral limbs, often the legs. It is also likely to be a sign of a more widespread atherosclerosis, and subjects manifesting the disease carry an increased risk for all atherothrombotic events including myocardial infarction, stroke, and cardiac death. The pad data set contains the results of a Slovene study. Briefly, the study was conducted by 74 primary care physicians-researchers, each of whom was asked to recruit 10 subjects with PAD along with 10 age and sex matched controls without PAD; actual recruitment ranged from 1 to 31 pairs. All participating subjects consented to a yearly examination and interview for 5 years. Important endpoints are death, either due to cardiovascular disease (CVD death) or other causes, non-fatal CVD endpoints of infarction and stroke, and patient interventions attributed to the disease such as revascularization procedures.

The final study includes 742 PAD and 713 controls, with baseline data for each subject, measurements at each visit, and endpoints. This analysis will focus on cardiac endpoints and on overall mortality.

Endpoint and follow-up

The starting R data has 3 dataframes:

  • pad1 has one observation per subject containing baseline information,
  • pad2 has one observation per subject visit containing measurements at that visit, and
  • pad3 has an observation for each observed endpoint.

The dataset pad2 contains all the repeated examinations over the 5 years of the study. Note that in this dataset the ID of the subject is repeated for multiple lines (data are in the long format). The same happens for pad3 that reports all the events of interest. In the IDA phase (Initial Data Analysis) we look at the overall follow-up for subjects, and decide on a time scale and an endpoint. Per the study design subjects should have a visit at years 1–5. We see in Figure that visit 1 dates are stretched out over a full year and by visit 5 over almost 2 years, a pattern that is unavoidable in observational studies of human subjects. Based on the counts shown in the figure there is no evidence of differential drop out between the PAD and control subjects.

Time to each visit for the PAD study.

Time to each visit for the PAD study.

Observed and ideal follow-up in the PAD study

Observed and ideal follow-up in the PAD study

The censoring pattern for subjects in the two arms, showed in the above figure, are computed using a reverse Kaplan-Meier (i.e. the status is to be censored, and the censoring is having the event).
There is no evidence of differential follow-up.

Given the visit slippage of the first figure, an ideal censoring pattern would be a final followup that is approximately uniformly distributed between 4.8 and 6.8 years, this is shown as a dotted line on the figure.

Examining the counts of subjects for each visit the total number of visits for the subjects is not what was planned — about 25% do not have a fifth visit — but patients have adhered to the 5 year time commitment quite closely. As stated in the primary paper, a decision was made to truncate all patient data at 5 years for the purpose of analysis.

The time scale of the survival analysis

The initial survival curves can be created using a copy of the baseline data (pad1), with follow-up truncated at 5 years.


     censor    CV death nonCV death 
       1296          68          91 
Kaplan-Meier curves of overall survival.

Kaplan-Meier curves of overall survival.

Our first analysis will use time since enrollment as the primary time scale. For the PAD patients the time since diagnosis is a natural scale since it represents both the progression of disease and of treatments for the disease. Survival curves for the control subjects serve as a comparison outcome of similarly aged subjects without the disease, but do not have a natural stand-alone interpretation.

Death rates are higher for males than for females, which is no surprise given a mean age at entry of 65 years, and is higher for PAD subjects than for the age matched controls. The right hand panel shows the curves on age scale. Note that when using the age scale, we include two time points in the Survival function: age at entry in the study and age at the event or censoring. These data are both right-censored and left-truncated (we have not discussed this point in the lessons, but it is of interest if you want to deepen the survival framework).In two words, left-truncation is when the subjects to enter under your observation should be survived until a certain time-point (i.e. age at entry).

Survival curves can equivalently display either the fraction who have died (curves start at 0: 1-KM) or the fraction who have not died (curves start at 1); the choice between the two is often based on tradition or habit.

Covariates and Cox model

The analysis will focus on the impact on survival of the covariates PAD, age, sex, and serum values for high density and low density lipoprotein (HDL, LDL). Figure shows density estimates for each of the three continuous covariates at baseline, by PAD group. By study design there should be no difference in the age distribution. HDL and LDL are slightly lower for the PAD subjects.

Distributions of three continuous covariates, by group.  Blood value scales are in mmol/L (lower) and mg/dl (above).

Distributions of three continuous covariates, by group. Blood value scales are in mmol/L (lower) and mg/dl (above).

A more advanced topic : time-dependent covariates (multiple measurements) and competing risks !

The Cox model fits will use age, sex, baseline values of high density and low density lipoproteins (hdl0, ldl0), time-dependent lipid values (hdl, ldl), and a time-dependent value of “years since enrollment” (etime). Time since enrollment only plays a role in age scale models: in a time-since-entry model every one in a given risk set will have exactly the same value of the covariate, and hence the estimated coefficient for the variable would be 0.

For time-dependent analysis we use a counting process form. Mathematically each subject is represented by 3 processes: \(Y_i(t)=1\) if the subject is currently at risk, \(X_i(t)\) is the time dependent covariate process, and \(N(t)\) the cumulative number of events. In the R code this translates into multiple observations for each subject of the form start-stop:

 id, time1, time2, event, x1, x2, ...

where (time1, time2) describe an interval over which the subject is at risk, event corresponds to \(dN(t)\) and will be 1 of the subject had an event at time2 and 0 otherwise, x1, x2, ... are the value of the covariates over the interval, and id is a subject identifier. Subject 9, for instance, is a male with blood values measured on days 0, 583, 945, and 1309, who died on day 1349. Their data has the form

id tstart tstop death  sex  ldl  hdl
 9      0   583     0 male 2.66 1.08
 9    583   945     0 male 3.20 1.53
 9    945  1309     0 male 2.81 1.32
 9   1309  1349     1 male 2.51 1.30

The lipid values and time intervals encode the most recent measurement of the value. Creation and checking of the data often represents a major share of the work in an analysis. There is often concern that because the data set has multiple rows for the same subject, a valid variance estimate for the model will need to account for correlation.
This is not the case, however.
Internally, there is only one copy of each subject; the multiple rows are merely a way to communicate to the program which covariate value is applicable at any given time point.

When there are competing risks, such as CVD and non-CVD death, the 0/1 event event variable is replaced by a factor with levels of “none”, “CVD death”, “non-CVD death”, which shows both when an event occurred and it’s type.

Below we create separate time-dependent data sets for analysis on the time-from-enrollment scale (pdata2) and on the age scale (pdata3).

The two data sets have separate event indicators for death, the competing risk of CVD/non-CVD death, and for any CVD event/non-CVD death. Creating (time1, time2) data sets on age-in-years scale can lead to anomalies due to round off error; a detailed discussion and example is found in the “Round off error and tied times” vignette of the survival package. For this reason the time-dependent data sets all use time in days. Use birth date and entry date to create agedays = age at enrollment, in days

This data set will have two time scales (age and time since enrollment), time dependent covariates (hdl, ldl, etime), and multiple endpoints (all cause death, CVD vs non-CVD death, major CVD event).

A. etime = time since enrollment as a time dependent variable 1. This is a continuous time-dependent covariate – technically it changes minute by minute. 2. We will be using it in age scale fits. If we create it on age scale then the cuts for survSplit can be limited to “the ages at which deaths occur”

B. The tmerge routine can keep track of multiple endpoints on a single time scale, the survSplit routine a single endpoint and single time scale.

C. The tmerge operations needs to be sequential: no step that would change the number of rows can be inserted in the midst.

Taken all together, it is simplest to create the data sets, pdata2 on the time since enrollment scale, pdata3 on the age scale. Many of the steps will need to be repeated, e.g., adding time dependent hdl and ldl.

pdata2 <- tmerge(pdata[, c("id", "pad", "age", "sex")], pdata, 
                 id=id, status = event(futime, status))
pdata2 <- tmerge(pdata2, pad2, id=id, ldl= tdc(day, ldl),
                 hdl = tdc(day, hdl))

# add baseline hdl and ldl as covariates
indx <- match(pdata2$id, pdata2$id)
pdata2$ldl0 <- pdata2$ldl[indx]
pdata2$hdl0 <- pdata2$hdl[indx]
pdata2$age10 <- pdata2$age/10   # age in decades

# create "any cvd event" as a competing risk, it will be used later
temp <- subset(pad3, event %in% c("cvd death", "infarction", "stroke"))
temp <- temp[!duplicated(temp$id),]   # first of these
temp <- subset(temp, day< 5*365.25)   # must be <= 5 years
pdata2 <- tmerge(pdata2, temp, id=id, cvstat=event(day),
                 priorcv = tdc(day))

pdata2$death <- ifelse(pdata2$status=="censor", 0, 1)  # all cause death
tstat <- ifelse(pdata2$cvstat>0, pdata2$cvstat, 2*pdata2$death)
pdata2$cvstat <- factor(tstat, 0:2,
                         c("censor", "CVD event", "death"))
pdata2$death <- ifelse(pdata2$status=="censor", 0, 1)  # all cause death

# Reprise all the steps, on age scale.  We need to add "age" to the
#  merged-in data sets before adding them.
ageday <- as.numeric(pad1$d.first - pad1$d.birth) # age in days at enrollmemt
pdata3 <- tmerge(pdata[, c("id", "pad", "sex")], pdata, 
                 id=id, status = event(futime + ageday, status),
                 tstart=ageday, tstop=ageday+futime,
                 options=list(tstart="age1", tstop="age2"))
temp2 <- pad2
temp2$age <- temp2$day + ageday[match(temp2$id, pad1$id)]
pdata3 <- tmerge(pdata3, temp2, id=id, ldl= tdc(age, ldl),
                 hdl = tdc(age, hdl))

# add baseline hdl and ldl as covariates
indx <- match(pdata3$id, pdata3$id) #points to the first row for each subject
pdata3$ldl0 <- pdata3$ldl[indx]
pdata3$hdl0 <- pdata3$hdl[indx]

# Add time since enrollment as a time-dependent covariate.
#  Use a temporary data set + survSplit to create a 'variable' that looks
#  like yet another lab test, then merge it in on the age scale.
#  Time since enrollment will be in weeks, which will create 52*5= 260 obs
#  per subject.  (Time since enrollment in days would be overkill.)
pdata$ageday <- ageday
temp3 <- survSplit(Surv(futime, status) ~ ageday + id, data= pdata,
      cut= seq(7, max(pdata$futime), by=7))
pdata3 <- tmerge(pdata3, temp3, id, etime = tdc(ageday + tstart, tstart))
pdata3$etime <- pdata3$etime/365.25   # recode to time in years

# add 'any cv' as an event
temp$age <- temp$day + ageday[match(temp$id, pad1$id)]
pdata3 <- tmerge(pdata3, temp, id=id, cvstat=event(age),
                 priorcv = tdc(age))

# recode event variables, after all survSplit or tmerge steps are done
pdata3$death <- ifelse(pdata3$status=="censor", 0, 1)  # all cause death
tstat <- ifelse(pdata3$cvstat>0, pdata3$cvstat, 2*pdata3$death)
pdata3$cvstat <- factor(tstat, 0:2, c("censor", "CVD event", "nonCVD-death"))

# Creation of multi-row data sets has pitfalls.  It is always wise to
#  sanity check the results.  
check1 <- survcheck(Surv(tstart, tstop, death) ~ 1, id=id, data= pdata2)
check1  
Call:
survcheck(formula = Surv(tstart, tstop, death) ~ 1, data = pdata2, 
    id = id)

Unique identifiers       Observations        Transitions 
              1455               6731                159 

Transitions table:
      to
from      1 (censored)
  (s0)  159       1296
  1       0          0

Number of subjects with 0, 1, ... transitions to each state:
       count
state      0   1
  1     1296 159
  (any) 1296 159
check2 <- survcheck(Surv(age1, age2, death) ~1, id=id, data= pdata3)
check2
Call:
survcheck(formula = Surv(age1, age2, death) ~ 1, data = pdata3, 
    id = id)

Unique identifiers       Observations        Transitions 
              1455             362876                159 

Transitions table:
      to
from      1 (censored)
  (s0)  159       1296
  1       0          0

Number of subjects with 0, 1, ... transitions to each state:
       count
state      0   1
  1     1296 159
  (any) 1296 159

Overall mortality

The first pair of fits shows that the estimated overall effects of PAD and sex hardly differ between analyses done on the enrollment or age scales. A second pair fits looks at the coefficient effects within the PAD and control groups. (Inclusion of the covariate by strata interactions gives results that are equivalent to separate fits for the the PAD and control groups; bundling them as a single model allows a formal test of whether the effects differ between the two strata.) None of sex, enrollment age, or time since enrollment show meaningful differences between the PAD and control groups. That is, on the log-hazard scale, PAD appears to act as an additive risk.

Overall model on time since enrollment scale
Call:
coxph(formula = Surv(tstart, tstop, death) ~ pad + sex + age10, 
    data = pdata2)

          coef exp(coef) se(coef)     z        p
padPAD  0.8755    2.4000   0.1739 5.036 4.76e-07
sexmale 0.6924    1.9985   0.1822 3.800 0.000145
age10   0.6583    1.9315   0.1053 6.250 4.11e-10

Likelihood ratio test=82.63  on 3 df, p=< 2.2e-16
n= 6731, number of events= 159 

Overall model on age scale
Call:
coxph(formula = Surv(age1, age2, death) ~ pad + sex + etime, 
    data = pdata3)

           coef exp(coef) se(coef)     z        p
padPAD  0.87354   2.39538  0.17400 5.020 5.16e-07
sexmale 0.70540   2.02466  0.18270 3.861 0.000113
etime   0.16593   1.18049  0.05917 2.804 0.005042

Likelihood ratio test=51.15  on 3 df, p=4.538e-11
n= 362876, number of events= 159 

Interactions between covariates and treatment, time since enrollment
Call:
coxph(formula = Surv(tstart, tstop, death) ~ (sex + age10) * 
    strata(pad), data = pdata2)

                           coef exp(coef) se(coef)      z        p
sexmale                 0.67576   1.96552  0.33494  2.018 0.043638
age10                   0.68341   1.98062  0.19354  3.531 0.000414
sexmale:strata(pad)PAD  0.02208   1.02233  0.39925  0.055 0.955897
age10:strata(pad)PAD   -0.03498   0.96563  0.23081 -0.152 0.879552

Likelihood ratio test=55.21  on 4 df, p=2.938e-11
n= 6731, number of events= 159 

Interactions between covariates and treatment, age scale
Call:
coxph(formula = Surv(age1, age2, death) ~ (sex + etime) * strata(pad), 
    data = pdata3)

                            coef exp(coef)  se(coef)     z      p
sexmale                0.6980525 2.0098347 0.3357736 2.079 0.0376
etime                  0.1176904 1.1248958 0.1097373 1.072 0.2835
sexmale:strata(pad)PAD 0.0001539 1.0001539 0.4003644 0.000 0.9997
etime:strata(pad)PAD   0.0665382 1.0688018 0.1304441 0.510 0.6100

Likelihood ratio test=24.44  on 4 df, p=6.516e-05
n= 362876, number of events= 159 

Therefore, we can see that in the multivariable Cox model on the follow up scale, being PAD and increasing age decrease the estimated survival, male sex is at higher risk. On the age scale, these results are confirmed. There is no a significant interaction between PAD and gender or age.

Cardiovascular death

Since PAD is expected to exert its primary effect via cardiovascular disease (CVD) endpoints, a more interesting analysis is to focus on CVD rather than overall mortality. Since in this case non-CVD death will be a competing risk this needs to be accounted for in our estimate of the probability in state; in the R survival package this is accomplished by using a multi-level categorical variable as the “status”, rather than 0/1. The survfit function then computes Aalen-Johansen (AJ) estimates of a multi state model.

Aalen-Johansen curves for CVD vs. non-CVD death outcomes (left) and for any CVD event vs. death without CVD (right).

Aalen-Johansen curves for CVD vs. non-CVD death outcomes (left) and for any CVD event vs. death without CVD (right).

Figure shows the Aalen-Johansen estimates of probability in state for CVD and non-CVD death in the left panel (which could also be labeled as “absolute risk”, “competing risk”, or “cumulative incidence”) and the AJ curves for first major CVD event (infarction, stroke, CVD death) versus non-CVD death in the right panel. For controls, CVD deaths are about 1/2 of the non-CVD, approximately 2% versus 4% at 5 years; both outcomes are near 8% for the PAD subjects. Nearly 14% of the PAD subjects experience a major CVD event versus 6% for controls.

The absolute risk curves require that all transitions in the model be considered jointly: incorrect curves will result from a simple Kaplan-Meier of time to CVD, censoring at non-CVD death, or a KM of time to non-CVD death, censoring at CVD.

Interestingly, this is not true for the cumulative hazard estimate: the naive estimate for one outcome, censoring the others, will be identical to the joint estimate.

The Cox model is a model for the hazards and has the same identity: the first coxph call below, which models the entire process, will have the same results for the CVD death portion as the second coxph call, which ignores non-CVD deaths by censoring them; this is verified by the all.equal test.

However, if one desires to predict absolute risk curves from the fitted Cox model then the first form (using the “complete” multi-state approach) is essential.

transitions table
             to
from          CV death nonCV death (censored)
  (s0)              67          91       1275
  CV death           0           0          0
  nonCV death        0           0          0
Call:
coxph(formula = Surv(tstart, tstop, status) ~ pad + sex + age10 + 
    hdl0 + ldl0, data = pdata2, id = id)

         
1:2           coef exp(coef) se(coef) robust se      z        p
  padPAD   1.05445   2.87039  0.28292   0.28029  3.762 0.000169
  sexmale  0.51298   1.67027  0.27813   0.27883  1.840 0.065802
  age10    0.65703   1.92905  0.16379   0.17895  3.672 0.000241
  hdl0    -0.30029   0.74060  0.32993   0.44220 -0.679 0.497084
  ldl0    -0.08124   0.92197  0.12687   0.12755 -0.637 0.524164

         
1:3           coef exp(coef) se(coef) robust se      z        p
  padPAD   0.71390   2.04194  0.22494   0.22789  3.133  0.00173
  sexmale  0.75348   2.12438  0.25070   0.25505  2.954  0.00313
  age10    0.65666   1.92835  0.14006   0.14379  4.567 4.95e-06
  hdl0    -0.20347   0.81589  0.28098   0.32688 -0.622  0.53363
  ldl0     0.02213   1.02237  0.10630   0.10857  0.204  0.83850

 States:  1= (s0), 2= CV death, 3= nonCV death 

Likelihood ratio test=83.96  on 10 df, p=8.375e-14
n= 6626, number of events= 158 
   (105 osservazioni eliminate a causa di valori mancanti)
Call:
coxph(formula = Surv(tstart, tstop, status == "CV death") ~ pad + 
    sex + age10 + hdl0 + ldl0, data = pdata2)

            coef exp(coef) se(coef)      z        p
padPAD   1.05445   2.87039  0.28292  3.727 0.000194
sexmale  0.51298   1.67027  0.27813  1.844 0.065122
age10    0.65703   1.92905  0.16379  4.011 6.03e-05
hdl0    -0.30029   0.74060  0.32993 -0.910 0.362744
ldl0    -0.08124   0.92197  0.12687 -0.640 0.521959

Likelihood ratio test=39.51  on 5 df, p=1.874e-07
n= 6626, number of events= 67 
   (105 osservazioni eliminate a causa di valori mancanti)
Call:
coxph(formula = Surv(tstart, tstop, status) ~ pad + sex + age10 + 
    +hdl + ldl, data = pdata2, id = id)

         
1:2          coef exp(coef) se(coef) robust se      z        p
  padPAD   0.8751    2.3992   0.2857    0.2745  3.188 0.001433
  sexmale  0.3093    1.3625   0.2775    0.2707  1.143 0.253167
  age10    0.6964    2.0065   0.1652    0.1814  3.840 0.000123
  hdl     -1.5483    0.2126   0.4109    0.4657 -3.324 0.000886
  ldl     -0.2797    0.7560   0.1479    0.1496 -1.869 0.061593

         
1:3          coef exp(coef) se(coef) robust se      z        p
  padPAD   0.7545    2.1265   0.2277    0.2256  3.345 0.000824
  sexmale  0.7847    2.1918   0.2518    0.2461  3.188 0.001432
  age10    0.6761    1.9662   0.1401    0.1409  4.797 1.61e-06
  hdl     -0.1489    0.8617   0.2910    0.3901 -0.382 0.702680
  ldl      0.1338    1.1432   0.1125    0.1070  1.250 0.211251

 States:  1= (s0), 2= CV death, 3= nonCV death 

Likelihood ratio test=105.6  on 10 df, p=< 2.2e-16
n= 6702, number of events= 158 
   (29 osservazioni eliminate a causa di valori mancanti)

There were 67 CVD and 91 non-CVD deaths in 5 years of follow-up, which are the transitions from the initial state (s0) to the CVD death and non-CVD death states, respectively, in the transitions table. No one transitioned from death to another state, which would indicate an error in the data set. Table displays the coefficients for the fits with baseline and time-dependent lipids, for both of the competing outcomes. The joint Cox model shows that after adjustment for age, sex, and baseline lipids, PAD is associated with nearly three fold increase in the rate of CVD deaths and double fold increase in non-CVD death. Age and sex are important predictors as well, as expected, with male sex having a somewhat larger effect on the non-CVD endpoint. The effect of baseline lipid values was not significant, however, time-dependent HDL shows an effect on the CVD death endpoint, somewhat reducing the estimated effect of PAD on CVD death.

Predicted probability-in-state (survival) curves

Predicted \(P(state)\) curves can be created from the multi-state fits, as well. As with any Cox model prediction, the first task is to choose a set of covariate values for which prediction is desired. We create the predictions for four hypothetical male subjects with ages of 58 and 72 (quartiles of age), and PAD vs. control. For HDL and LDL use values of 1.3 and 3, which are near the medians. (These curves are for the model of CVD death vs other death, time-fixed hdl and ldl.)

dummy <- expand.grid(pad=c("PAD", "Control"), age10=c(5.8, 7.2), sex="male",
                     hdl0 =1.3, ldl0 =3)
dummy
      pad age10  sex hdl0 ldl0
1     PAD   5.8 male  1.3    3
2 Control   5.8 male  1.3    3
3     PAD   7.2 male  1.3    3
4 Control   7.2 male  1.3    3
coxpred1 <- survfit(pcox1, newdata= dummy)
dim(coxpred1)
  data states 
     4      3 
coxpred1$states
[1] "(s0)"        "CV death"    "nonCV death"

The resulting set of absolute risk curves can be viewed as an array of estimates by stratum, target covariates, and state. The pcox1 fit had no strata so the first dimension is 1. The resulting curves are shown in Figure. Because a multi-state model can generate a very large number of curves it is often helpful to break them into multiple panels.

par(mfrow=c(1, 2))
plot(coxpred1[1, 1:2 ,], col=2:1, lty=c(1,1,2,2), lwd=2, xscale=365.25,
     ylim=c(0, .15),
     xlab="Years since enrollment", ylab="predicted P(state), age = 58")
legend("topleft", c("PAD, CVD death", "Control, CVD death",
                   "PAD, nonCVD death", "Control, nonCVD death"),
       col=c(2,1,2,1), lty=c(1,1,2,2), lwd=2, bty='n', cex=.8)
plot(coxpred1[1, 3:4,], col=2:1, lty=c(1,1,2,2), lwd=2, xscale=365.25,
     ylim=c(0, .15),
     xlab="Years since enrollment", ylab="predicted P(state), age=72")

legend("topleft", c("PAD, CVD death", "Control, CVD death",
                   "PAD, nonCVD death", "Control, nonCVD death"),
       col=c(2,1,2,1), lty=c(1,1,2,2), lwd=2, bty='n', cex=.8)
Predicted absolute risk curves from the multistate Cox model:  dying of CVD (solid) and from other causes (dashed) by PAD status, for a male enrolled at age 58 (left panel) or 72 (right).  The values of baseline HDL and LDL are 1.3 and 3, respectively.

Predicted absolute risk curves from the multistate Cox model: dying of CVD (solid) and from other causes (dashed) by PAD status, for a male enrolled at age 58 (left panel) or 72 (right). The values of baseline HDL and LDL are 1.3 and 3, respectively.