Load libraries:

library(survival)
library(MASS)
library(rms)
library(ISwR)
library(survMisc)
library(survminer)
library(KMsurv)

Recap of the basic steps !

The melanom data frame has 205 rows and 6 columns. It contains data relating to the survival of patients after an operation for malignant melanoma, collected at Odense University Hospital. Each patient had their tumour removed by surgery during the period 1962 to 1977. The surgery consisted of complete removal of the tumour together with about 2.5cm of the surrounding skin. Among the measurements taken were the thickness of the tumour and whether it was ulcerated or not. These are thought to be important prognostic variables in that patients with a thick and/or ulcerated tumour have an increased chance of death from melanoma. Patients were followed until the end of 1977.

Loading and viewing dataset melanom:

dati <- melanom
head(dati)
##    no status days ulc thick sex
## 1 789      3   10   1   676   2
## 2  13      3   30   2    65   2
## 3  97      2   35   2   134   2
## 4  16      3   99   2   290   1
## 5  21      1  185   1  1208   2
## 6 469      1  204   1   484   2
attach(dati)

Create the survival object:

msurv <- with(dati, Surv(days, status == 1))
msurv
##   [1]   10+   30+   35+   99+  185   204   210   232   232+  279   295   355+
##  [13]  386   426   469   493+  529   621   629   659   667   718   752   779 
##  [25]  793   817   826+  833   858   869   872   967   977   982  1041  1055 
##  [37] 1062  1075  1156  1228  1252  1271  1312  1427+ 1435  1499+ 1506  1508+
##  [49] 1510+ 1512+ 1516  1525+ 1542+ 1548  1557+ 1560  1563+ 1584  1605+ 1621 
##  [61] 1627+ 1634+ 1641+ 1641+ 1648+ 1652+ 1654+ 1654+ 1667  1678+ 1685+ 1690 
##  [73] 1710+ 1710+ 1726  1745+ 1762+ 1779+ 1787+ 1787+ 1793+ 1804+ 1812+ 1836+
##  [85] 1839+ 1839+ 1854+ 1856+ 1860+ 1864+ 1899+ 1914+ 1919+ 1920+ 1927+ 1933 
##  [97] 1942+ 1955+ 1956+ 1958+ 1963+ 1970+ 2005+ 2007+ 2011+ 2024+ 2028+ 2038+
## [109] 2056+ 2059+ 2061  2062  2075+ 2085+ 2102+ 2103  2104+ 2108  2112+ 2150+
## [121] 2156+ 2165+ 2209+ 2227+ 2227+ 2256  2264+ 2339+ 2361+ 2387+ 2388  2403+
## [133] 2426+ 2426+ 2431+ 2460+ 2467  2492+ 2493+ 2521+ 2542+ 2559+ 2565  2570+
## [145] 2660+ 2666+ 2676+ 2738+ 2782  2787+ 2984+ 3032+ 3040+ 3042  3067+ 3079+
## [157] 3101+ 3144+ 3152+ 3154+ 3180+ 3182+ 3185+ 3199+ 3228+ 3229+ 3278+ 3297+
## [169] 3328+ 3330+ 3338  3383+ 3384+ 3385+ 3388+ 3402+ 3441+ 3458+ 3459+ 3459+
## [181] 3476+ 3523+ 3667+ 3695+ 3695+ 3776+ 3776+ 3830+ 3856+ 3872+ 3909+ 3968+
## [193] 4001+ 4103+ 4119+ 4124+ 4207+ 4310+ 4390+ 4479+ 4492+ 4668+ 4688+ 4926+
## [205] 5565+

Histogram of the follow up (days) and descriptive statistics:

hist(days)

summary(days)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##      10    1525    2005    2153    3042    5565

Estimate the KM curve: we focus just on “status=1” in this example (we are ignoring other possible risks, just for didactic purposes, we will see later an example in which competing risks are accounted for)

mfit <- survfit(Surv(days, status==1)~1, data = dati)
mfit
## Call: survfit(formula = Surv(days, status == 1) ~ 1, data = dati)
## 
##        n events median 0.95LCL 0.95UCL
## [1,] 205     57     NA      NA      NA

Output of the KM estimate at different time points:

summary(mfit, times=c(1000,3000,4000))
## Call: survfit(formula = Surv(days, status == 1) ~ 1, data = dati)
## 
##  time n.risk n.event survival std.err lower 95% CI upper 95% CI
##  1000    171      26    0.869  0.0240        0.823        0.917
##  3000     54      29    0.677  0.0385        0.605        0.757
##  4000     13       2    0.645  0.0431        0.566        0.735

Graphical representation of the curve:

plot(mfit)
abline(h=0.50, col="red",lty=2)

In this case the median survival time is not even interesting because the estimate is infinite.

The survival curve does not cross the 50% mark before all patients are censored.

If you want to see also censored times in the plot:

plot(mfit,mark.time=T)
abline(h=0.50, col="red",lty=2)

if you want to see also the number at risk at different time points:

ggsurvplot(mfit, data=dati, risk.table = TRUE)
## Ignoring unknown labels:
## • colour : "Strata"

Remember that estimate of the curve is reliable until the risk set has a sufficient sample size.
As a rule of thumb: the curve can become unreliable when the number of patients at risk is less than 15. The width of confidence intervals will help to decide.

Estimate survival curves by gender:

surv.bysex <- survfit(Surv(days,status==1)~sex)

Plot the curves:

plot(surv.bysex, conf.int=T,col=c("red","black"))

Other types of graphical representations:

ggsurvplot(surv.bysex, data=dati, risk.table = TRUE)

ggsurvplot(surv.bysex, data=dati, risk.table ="abs_pct")

Comparing survival curves : the log-rank test

We apply now the log-rank test to see if there is a significant difference between males and females: the survdiff function implements different tests specified by a parameter called rho, allowing a non-proportional hazards alternative to the null hypothesis, but the default value of rho gives the log-rank test.

survdiff(Surv(days,status==1)~sex)
## Call:
## survdiff(formula = Surv(days, status == 1) ~ sex)
## 
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=1 126       28     37.1      2.25      6.47
## sex=2  79       29     19.9      4.21      6.47
## 
##  Chisq= 6.5  on 1 degrees of freedom, p= 0.01

We conclude that there is a significant difference in survival between males and females.

Stratified log-rank test using presence of ulcer as strata:

There is also the option to conduct the test stratifying with respect to another categorical variable, such as “ulc”: interpreting the \(p\)-value of a stratified log-rank test is very similar to a standard log-rank test, but with one crucial distinction: you are accounting for “nuisance” variables (strata, ulc in this case) that might otherwise blur the true effect of your primary treatment or exposure or risk factor (sex).

surv.bysex.ulc <- survdiff(Surv(days,status==1)~sex+strata(ulc))
surv.bysex.ulc
## Call:
## survdiff(formula = Surv(days, status == 1) ~ sex + strata(ulc))
## 
##         N Observed Expected (O-E)^2/E (O-E)^2/V
## sex=1 126       28     34.7      1.28      3.31
## sex=2  79       29     22.3      1.99      3.31
## 
##  Chisq= 3.3  on 1 degrees of freedom, p= 0.07

The \(p\)-value represents the probability of observing a difference in survival distributions as large as (or larger than) the one seen in your data, assuming the null hypothesis (\(H_0\)) is true.The Null Hypothesis (\(H_0\)): There is no difference in survival between the groups after accounting for the strata.The Alternative Hypothesis (\(H_A\)): There is a significant difference in survival between the groups in at least one stratum. (In a standard log-rank test, you compare groups globally. In a stratified version, you calculate the observed and expected number of events within each stratum, e.g., age groups, clinical sites, or disease stages… and then aggregate them.)

To visualize in separate graphs:

index.ulc1 <- ulc==1
index.ulc2 <- ulc==2

surv.bysex1 <- survfit(Surv(days,status==1)~sex, data=dati[index.ulc1,])
surv.bysex2 <- survfit(Surv(days,status==1)~sex, data=dati[index.ulc2,])


par(pty=("s"), mfrow=c(1,2))
plot(surv.bysex1, conf.int=T,col=c("red","black"))
title("Presence of ulcer")
plot(surv.bysex2, conf.int=T,col=c("red","black"))
title("Absence of ulcer")

Stratification makes the effect of sex less significant. A possible explanation might be that males seek treatment when the disease is in a more advanced state than women do, so that the gender difference is reduced when adjusted for a measure of disease progression.

Extension for more than two groups

Example of log rank with more than 2 groups: survival times of subjects with multiple myeloma, data extracted from publicly available gene expression data.

data(myeloma)
fit <- survfit(Surv(time, event) ~ molecular_group, data = myeloma)
ggsurvplot(fit, legend.labs = levels(myeloma$molecular_group),
           pval = TRUE)

This plot shows the global p value across groups. Let’s see the differences between pairs (adjusting for multiple comparisons):

res <- pairwise_survdiff(Surv(time, event) ~ molecular_group,
                         data = myeloma)
res
## 
##  Pairwise comparisons using Log-Rank test 
## 
## data:  myeloma and molecular_group 
## 
##                  Cyclin D-1 Cyclin D-2 Hyperdiploid Low bone disease MAF  
## Cyclin D-2       0.723      -          -            -                -    
## Hyperdiploid     0.943      0.723      -            -                -    
## Low bone disease 0.723      0.988      0.644        -                -    
## MAF              0.644      0.447      0.523        0.485            -    
## MMSET            0.328      0.103      0.103        0.103            0.723
## Proliferation    0.103      0.038      0.038        0.062            0.485
##                  MMSET
## Cyclin D-2       -    
## Hyperdiploid     -    
## Low bone disease -    
## MAF              -    
## MMSET            -    
## Proliferation    0.527
## 
## P value adjustment method: BH

Proliferation vs Cyclin D-2 and vs Hyperdiploid show significant differences in survival curves.No others significant differences are detected between others groups.