Comparing two binary diagnostic tests

The comparison of the performance of two binary diagnostic tests is an important topic in clinical research. The most frequent study design for this comparison is the paired design, in which both diagnostic tests are applied to all of the individuals in a random sample and the disease status of each individual is known through the application of a gold standard. For this example we will use compbdt, an R program written by Roldán-Nofuentes (2020). You can download it from: https://bmcmedresmethodol.biomedcentral.com/articles/10.1186/s12874-020-00988-y#MOESM1

The objective is to estimate the sensitivity, the specificity, the likelihood ratios and the predictive values of each diagnostic test, applying the confidence intervals with the best asymptotic performance. The program compares the sensitivities and specificities of the two diagnostic tests simultaneously, as well as the likelihood ratios and the predictive values, applying the global hypothesis tests with the best performance in terms of type I error and power. When a global hypothesis test is significant, the causes of the significance are investigated by solving the individual hypothesis tests with Holm's multiple comparison method. Optimal confidence intervals are also calculated for the difference or ratio between the respective parameters. Based on the data observed in the sample, the program also estimates the probability of making a type II error if the null hypothesis is not rejected, or estimates the power if the alternative hypothesis is accepted. We will review some basic ideas about statistical testing procedures, related to sample size determination, in Block 2.
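As a point of reference, the Holm step-down adjustment that the program applies is available in base R through p.adjust; a minimal sketch with purely illustrative p-values (not taken from the example below):

# Holm adjustment of the two individual p-values obtained after a
# significant global test: with two tests, the smaller p-value is
# doubled and the larger one is left unchanged (illustrative numbers)
p.raw <- c(0.0001, 0.991)
p.adjust(p.raw, method = "holm")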

Type of errors           | \(H_{0}\) TRUE             | \(H_{0}\) FALSE
Reject \(H_{0}\)         | Type I error (\(\alpha\))  | OK
Do not reject \(H_{0}\)  | OK                         | Type II error (\(\beta\))

We define the power of a statistical test as the probability of not making a type II error, i.e. \(1-\beta\). Estimating the probability of a type II error allows the researcher to judge how much confidence to place in the null hypothesis when it is not rejected.
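As a quick illustration of these concepts in base R (a minimal sketch, separate from the compbdt program), power.prop.test computes the power of a two-sample comparison of proportions for a given sample size and postulated effect; the type II error \(\beta\) is then 1 minus the reported power:

# Power of the two-sample test of H0: p1 = p2 at alpha = 0.05,
# when in truth p1 = 0.80 and p2 = 0.90, with n = 300 subjects per group.
# beta is 1 minus the power reported in the output.
power.prop.test(n = 300, p1 = 0.80, p2 = 0.90, sig.level = 0.05)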

We will apply compbdt to a real example on the diagnosis of coronary artery disease.

The diagnosis of coronary artery disease (CAD) was investigated using the exercise test (Test 1) and the clinical history of chest pain (Test 2) as diagnostic tests, with coronary angiography as the gold standard. The following table shows the frequencies obtained by cross-classifying the results of the two tests and the gold standard in a sample of 871 individuals.

                              | Test 2 positive | Test 2 negative
Diseased, Test 1 positive     |       473       |        29
Diseased, Test 1 negative     |        81       |        25
Non-diseased, Test 1 positive |        22       |        46
Non-diseased, Test 1 negative |        44       |       151

We source the R program that we will use to do the calculations (if you have downloaded the program, change the path in the following syntax accordingly):

source('C:\\Users\\13474\\OneDrive - Università degli Studi di Trieste\\Stat_Learning_Epi\\AA 24_25\\Block 1\\R codes\\compbdt.R')

and we run the program. The inputs are the frequencies of the diseased and then the non-diseased patients, cross-classified by the positive/negative results of Test 1 and Test 2 as in the table above.

# Diseased:     (T1+,T2+) = 473, (T1+,T2-) = 29, (T1-,T2+) = 81, (T1-,T2-) = 25
# Non-diseased: (T1+,T2+) = 22,  (T1+,T2-) = 46, (T1-,T2+) = 44, (T1-,T2-) = 151
pp <- compbdt(473, 29, 81, 25, 22, 46, 44, 151)
## 
##           PREVALENCE OF THE DISEASE 
## 
## Estimated prevalence of the disease is  69.805 % and its standard error is 0.016 
## 
## 95 % confidence interval for the prevalence of the disease is ( 66.681 % ;  72.768 %) 
## 
## 
##           COMPARISON OF THE ACCURACIES (SENSITIVITIES AND SPECIFICITIES) 
## 
## Estimated sensitivity of Test 1 is  82.566 % and its standard error is 0.015 
## 
## 95 % confidence interval for the sensitivity of Test 1 is ( 79.363 % ;  85.389 %) 
## 
## Estimated sensitivity of Test 2 is  91.118 % and its standard error is 0.012 
## 
## 95 % confidence interval for the sensitivity of Test 2 is ( 88.61 % ;  93.148 %) 
## 
## Estimated specificity of Test 1 is  74.144 % and its standard error is 0.027 
## 
## 95 % confidence interval for the specificity of Test 1 is ( 68.557 % ;  79.087 %) 
## 
## Estimated specificity of Test 2 is  74.905 % and its standard error is 0.027 
## 
## 95 % confidence interval for the specificity of Test 2 is ( 69.358 % ;  79.787 %) 
## 
## 
## Wald test statistic for the global hypothesis test H0: (Se1 = Se2 and Sp1 = Sp2) is  25.662  
## 
##   Global p-value is  0  
## 
##   Applying the global Wald test (to an alpha error of 5 %), we reject the hypothesis H0: (Se1 = Se2 and Sp1 = Sp2) 
## 
##   Estimated power (to an alpha error of 5 %) is 99.8 %  
## 
##   Investigation of the causes of significance: 
## 
##    McNemar test statistic (with cc) for H0: Se1 = Se2 is  23.645  and the two-sided p-value is  0  
## 
##    McNemar test statistic (with cc) for H0: Sp1 = Sp2 is  0.011  and the two-sided p-value is  0.991  
## 
##    Applying the Holm method (to an alpha error of 5 %), we reject the hypothesis H0: Se1 = Se2 and we do not reject the hypothesis H0: Sp1 = Sp2 
## 
##    Sensitivity of Test 2 is significantly greater than sensitivity of Test 1 
## 
##     95 % confidence interval for the difference Se2 - Se1 is ( 5.192 % ;  11.857 %) 
## 
## 
## 
##           COMPARISON OF THE LIKELIHOOD RATIOS 
## 
## Estimated positive LR of Test 1 is  3.193  and its standard error is 0.339 
## 
## 95 % confidence interval for the positive LR of Test 1 is ( 2.61  ;  3.952 ) 
## 
## Estimated positive LR of Test 2 is  3.631  and its standard error is 0.39 
## 
## 95 % confidence interval for the positive LR of Test 2 is ( 2.962  ;  4.505 ) 
## 
## Estimated negative LR of Test 1 is  0.235  and its standard error is 0.022 
## 
## 95 % confidence interval for the negative LR of Test 1 is ( 0.195  ;  0.283 ) 
## 
## Estimated negative LR of Test 2 is  0.119  and its standard error is 0.016 
## 
## 95 % confidence interval for the negative LR of Test 2 is ( 0.09  ;  0.153 ) 
## 
## 
## Test statistic for the global hypothesis test H0: (PLR1 = PLR2 and NLR1 = NLR2) is  23.438  
## 
##   Global p-value is  0  
## 
##   Applying the global hypothesis test (to an alpha error of 5 %), we reject the hypothesis H0: (PLR1 = PLR2 and NLR1 = NLR2) 
## 
##   Estimated power (to an alpha error of 5 %) is 99.78 %  
## 
##   Investigation of the causes of significance: 
## 
##    Test statistic for H0: PLR1 = PLR2 is  0.898  and the two-sided p-value is  0.369  
## 
##    Test statistic for H0: NLR1 = NLR2 is  4.663  and the two-sided p-value is  0  
## 
##    Applying the Holm method (to an alpha error of 5 %), we do not reject the hypothesis H0: PLR1 = PLR2 and we reject the hypothesis H0: NLR1 = NLR2 
## 
##    Negative likelihood ratio of Test 1 is significantly greater than negative likelihood ratio of Test 2 
## 
##     95 % confidence interval for the ratio NLR1 / NLR2 is ( 1.412  ;  2.554 ) 
## 
## 
## 
##           COMPARISON OF THE PREDICTIVE VALUES 
## 
## Estimated positive PV of Test 1 is  88.07 % and its standard error is 0.014 
## 
## 95 % confidence interval for the positive PV of Test 1 is ( 85.17 % ;  90.498 %) 
## 
## Estimated positive PV of Test 2 is  89.355 % and its standard error is 0.012 
## 
## 95 % confidence interval for the positive PV of Test 2 is ( 86.698 % ;  91.562 %) 
## 
## Estimated negative PV of Test 1 is  64.784 % and its standard error is 0.028 
## 
## 95 % confidence interval for the negative PV of Test 1 is ( 59.246 % ;  69.976 %) 
## 
## Estimated negative PV of Test 2 is  78.486 % and its standard error is 0.026 
## 
## 95 % confidence interval for the negative PV of Test 2 is ( 73.024 % ;  83.151 %) 
## 
## 
## Wald test statistic for the global hypothesis test H0: (PPV1 = PPV2 and NPV1 = NPV2) is  25.944  
## 
##   Global p-value is  0  
## 
##   Applying the global hypothesis test (to an alpha error of 5 %), we reject the hypothesis H0: (PPV1 = PPV2 and NPV1 = NPV2) 
## 
##   Estimated power (to an alpha error of 5 %) is 99.26 %  
## 
##   Investigation of the causes of significance: 
## 
##    Weighted generalized score statistic for H0: PPV1 = PPV2 is  0.807  and the two-sided p-value is  0.369  
## 
##    Weighted generalized score statistic for H0: NPV1 = NPV2 is  22.502  and the two-sided p-value is  0  
## 
##    Applying the Holm method (to an alpha error of 5 %), we do not reject the hypothesis H0: PPV1 = PPV2 and we reject the hypothesis H0: NPV1 = NPV2 
## 
##    Negative PV of Test 2 is significantly greater than negative PV of Test 1 
## 
##     95 % confidence interval for the difference NPV2 - NPV1 is ( 8.041 % ;  19.363 %)

You will also find three text files in the folder that you are currently using: the results of comparing the sensitivities and specificities are recorded in the file “Results_Comparison_Accuracies.txt”, those of comparing the LRs in the file “Results_Comparison_LRs.txt”, and those of comparing the PVs in the file “Results_Comparison_PVs.txt”. In R, an alternative to “compbdt” is the DTComPair package. DTComPair estimates the same parameters as “compbdt” but compares them individually, i.e. it solves each hypothesis test at a given α error.
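As a hedged sketch of that alternative (assuming the interface documented in the DTComPair manual: tab.paired, acc.paired, sesp.mcnemar, pv.gs and dlr.regtest), the paired frequencies above can be expanded to one record per patient and passed to the package:

# Expand the eight paired frequencies into individual-level vectors
n  <- c(473, 29, 81, 25, 22, 46, 44, 151)
d  <- rep(c(1, 1, 1, 1, 0, 0, 0, 0), n)  # gold standard: 1 = diseased
y1 <- rep(c(1, 1, 0, 0, 1, 1, 0, 0), n)  # Test 1: 1 = positive
y2 <- rep(c(1, 0, 1, 0, 1, 0, 1, 0), n)  # Test 2: 1 = positive

library(DTComPair)
tab <- tab.paired(d = d, y1 = y1, y2 = y2)
acc.paired(tab)    # Se, Sp, PVs and LRs of each test, with CIs
sesp.mcnemar(tab)  # McNemar tests for Se1 = Se2 and Sp1 = Sp2
pv.gs(tab)         # generalized score test for the predictive values
dlr.regtest(tab)   # regression-based test for the likelihood ratios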

ROC curves: evaluation of the diagnostic accuracy of a continuous marker

For this example we will use the R package pROC. It builds a ROC curve and returns a roc object, a list of class “roc”. This object can be printed, plotted, or passed to the functions auc, ci, smooth.roc and coords. Additionally, two roc objects can be compared with the function roc.test. Data can be provided as response, predictor, where the predictor is the numeric (or ordered) level of the evaluated risk factor, and the response encodes the observation class (control or case). The levels argument specifies which response value must be taken as controls (first value of levels) and which as cases (second value). It can safely be ignored when the response is encoded as 0 and 1, but it will frequently fail otherwise. By default, the first two values of levels(as.factor(response)) are taken, and the remaining levels are ignored. This means that if your response is coded “control” and “case”, the levels will be inverted.
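A one-line check makes the pitfall concrete: factor levels are ordered alphabetically by default, so with a “control”/“case” coding the value “case” comes first and would be treated as the control level.

# Alphabetical ordering of the default levels: "case" comes before "control"
levels(as.factor(c("control", "case")))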

library(pROC)

Subarachnoid hemorrhage data

This dataset summarizes several clinical variables and one laboratory variable for 113 patients with an aneurysmal subarachnoid hemorrhage (aSAH). The purpose of the case presented here is to identify patients at risk of poor post-aSAH outcome, as they require specific healthcare management; therefore the clinical test must be highly specific. Detailed results of the study are reported in [1]. We only outline the features relevant to the ROC analysis.

data(aSAH)
head(aSAH)
##    gos6 outcome gender age wfns s100b  ndka
## 29    5    Good Female  42    1  0.13  3.01
## 30    5    Good Female  37    1  0.14  8.54
## 31    5    Good Female  42    1  0.10  8.09
## 32    5    Good Female  27    1  0.04 10.42
## 33    1    Poor Female  42    3  0.13 17.40
## 34    1    Poor   Male  48    2  0.10 12.75
with(aSAH, table(gender, outcome))
##         outcome
## gender   Good Poor
##   Male     22   20
##   Female   50   21

Let’s estimate the ROC curve for s100b, a biomarker:

roc(aSAH$outcome, aSAH$s100b, levels = c("Good", "Poor"))
## 
## Call:
## roc.default(response = aSAH$outcome, predictor = aSAH$s100b,     levels = c("Good", "Poor"))
## 
## Data: aSAH$s100b in 72 controls (aSAH$outcome Good) < 41 cases (aSAH$outcome Poor).
## Area under the curve: 0.7314

Inspect the ROC plot and report the 95% CI:

roc(aSAH$outcome, aSAH$s100b, percent = TRUE, plot = TRUE, ci = TRUE)

## 
## Call:
## roc.default(response = aSAH$outcome, predictor = aSAH$s100b,     percent = TRUE, ci = TRUE, plot = TRUE)
## 
## Data: aSAH$s100b in 72 controls (aSAH$outcome Good) < 41 cases (aSAH$outcome Poor).
## Area under the curve: 73.14%
## 95% CI: 63.01%-83.26% (DeLong)

We can also inspect the confidence intervals around our estimates, point by point:

rocobj <- plot.roc(aSAH$outcome, aSAH$s100b)
# Bootstrap CI of the specificities over a grid of sensitivities (100 resamples)
ci.sp.obj <- ci.sp(rocobj, sensitivities = seq(0, 1, .01), boot.n = 100)
plot(rocobj)
plot(ci.sp.obj, type = "shape", col = "blue")
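An analogous band can be drawn around the sensitivities with ci.se, the counterpart of ci.sp in pROC (also bootstrap based); a minimal sketch reusing rocobj:

# Bootstrap CI band for the sensitivities over a grid of specificities
ci.se.obj <- ci.se(rocobj, specificities = seq(0, 1, .01), boot.n = 100)
plot(rocobj)
plot(ci.se.obj, type = "shape", col = "lightblue")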

We can also compare the performance of two different biomarkers:

roc.test(response = aSAH$outcome, predictor1 = aSAH$wfns, predictor2 = aSAH$s100b)
## 
##  DeLong's test for two correlated ROC curves
## 
## data:  aSAH$wfns and aSAH$s100b by aSAH$outcome (Good, Poor)
## Z = 2.209, p-value = 0.02718
## alternative hypothesis: true difference in AUC is not equal to 0
## 95 percent confidence interval:
##  0.01040618 0.17421442
## sample estimates:
## AUC of roc1 AUC of roc2 
##   0.8236789   0.7313686

And plot the two ROC curves in one graph:

roc.s100b <- roc(aSAH$outcome, aSAH$s100b)
roc.wfns <- roc(aSAH$outcome, aSAH$wfns)
plot(roc.s100b)
plot(roc.wfns, add=TRUE, col="red")
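With the two roc objects at hand we can label the curves and, as an alternative to the DeLong procedure above, rerun the comparison by passing the objects directly to roc.test (here with the paired bootstrap method; boot.n is the number of resamples):

# Add a legend for the two curves plotted above
legend("bottomright", legend = c("s100b", "WFNS"),
       col = c("black", "red"), lwd = 2)

# Paired bootstrap version of the AUC comparison
roc.test(roc.s100b, roc.wfns, method = "bootstrap", boot.n = 1000)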

Determining an “optimal” cut-off

Load the required library:

library(OptimalCutpoints)

The OptimalCutpoints package calculates optimal cutpoints in diagnostic tests. Several methods or criteria for selecting optimal cut-offs have been implemented, including methods based on cost-benefit analysis and on diagnostic test accuracy measures (sensitivity/specificity, predictive values and diagnostic likelihood ratios) or prevalence.

As a basic example, we can use the Youden index approach, i.e. maximizing \(Se + Sp - 1\), which is equivalent to maximizing the sum of sensitivity and specificity:

perf.s100b <- optimal.cutpoints(X = "s100b", status = "outcome", tag.healthy = "Good",
                                methods = "Youden", data = aSAH, control = control.cutpoints(),
                                ci.fit = TRUE, conf.level = 0.95, trace = FALSE)
perf.s100b
## 
## Call:
## optimal.cutpoints.default(X = "s100b", status = "outcome", tag.healthy = "Good", 
##     methods = "Youden", data = aSAH, control = control.cutpoints(), 
##     ci.fit = TRUE, conf.level = 0.95, trace = FALSE)
## 
## Optimal cutoffs:
##   Youden
## 1 0.2200
## 
## Area under the ROC curve (AUC):  0.731 (0.63, 0.833)

Therefore, 0.22 is the selected cut-off for the biomarker: the value that maximizes the sum of sensitivity and specificity. But always pay attention to the shape of the functional relationship between the biomarker and the risk of the outcome… more on this in Block 3.
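To inspect the accuracy measures attached to that cut-off, the fitted object can be summarized and plotted with the standard OptimalCutpoints methods:

# Se, Sp, predictive values and their CIs at the selected cutpoint
summary(perf.s100b)
# ROC curve with the optimal cutpoint marked
plot(perf.s100b)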