Binary diagnostic tests are among the most commonly used tests in medicine and are used to rule a certain condition in or out. Commonly, this condition is disease status, but tests may also detect, for example, the presence of a bacterium or a virus, independent of any clinical manifestations.
Test metrics such as diagnostic accuracies, predictive values and likelihood ratios are useful tools for evaluating the efficacy of such tests in comparison to a gold standard. However, these statistics only describe the quality of a test. Performing statistical inference to evaluate whether one test is better than another while simultaneously referencing the gold standard is more complicated. Several authors have invested significant effort into developing statistical methods to perform such inference.
Understanding and implementing methods described in the statistical literature is often far outside the comfort zone of clinicians, particularly those who are not routinely involved in academic research.
Here we demonstrate the implementation of the testCompareR package.
The package comes with a data set derived from the Coronary Artery Surgery Study (cass). This data set looks at exercise stress testing and history of chest pain as two tests for coronary artery disease as determined by coronary angiography (the gold standard).
The testCompareR package is elegant in its simplicity. You can pass your data to the compareR() function as the only argument and the function outputs a list object containing the results of descriptive and inferential statistical tests.
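# load the package and the bundled cass data set
library(testCompareR)
data(cass)
dat <- cass
# run all descriptive and inferential tests with the default settings
results <- compareR(dat)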
#> Warning in validatR(df = df, test1 = test1, test2 = test2, gold = gold): Using default columns. Check test 1 is first column, test 2 is second
#> column and gold standard is third column.
results
#> test metric estimate lower_ci upper_ci p
#> 1 Test 1 Sensitivity 82.6 79.4 85.4
#> 2 Test 2 Sensitivity 91.1 88.6 93.1 ***
#> 3 Test 1 Specificity 74.1 68.6 79.1
#> 4 Test 2 Specificity 74.9 69.4 79.8
#> 5 Test 1 PPV 88.1 85.2 90.5
#> 6 Test 2 PPV 89.4 86.7 91.6
#> 7 Test 1 NPV 64.8 59.2 70.0
#> 8 Test 2 NPV 78.5 73.0 83.2 ***
#> 9 Test 1 PLR 3.2 2.6 4.0
#> 10 Test 2 PLR 3.6 3.0 4.5
#> 11 Test 1 NLR 0.2 0.2 0.3 ***
#> 12 Test 2 NLR 0.1 0.1 0.2
#>
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 ' ' 1
Individual results can be accessed via standard indexing.
results$acc$accuracies # returns summary tables for diagnostic accuracies
#> $`Test 1`
#> Estimate SE Lower CI Upper CI
#> Sensitivity 82.6 1.5 79.4 85.4
#> Specificity 74.1 2.7 68.6 79.1
#>
#> $`Test 2`
#> Estimate SE Lower CI Upper CI
#> Sensitivity 91.1 1.2 88.6 93.1
#> Specificity 74.9 2.7 69.4 79.8
The list output of the compareR() function is useful should you need to manipulate any of the individual outputs in subsequent calculations. However, should you wish to see an interpretation of your results, you can pass the compareR() output to the interpretR() function. This provides the same results in a more human-readable format.
interpretR(results)
#>
#> --------------------------------------------------------------------------------
#> CONTINGENCY TABLES
#> --------------------------------------------------------------------------------
#>
#> True Status - POSITIVE
#> Test 2
#> Test 1 Positive Negative
#> Positive 473 29
#> Negative 81 25
#>
#> True Status - NEGATIVE
#> Test 2
#> Test 1 Positive Negative
#> Positive 22 46
#> Negative 44 151
#>
#> Gold standard vs. Test 1
#> Test 1
#> Gold standard Positive Negative
#> Positive 502 106
#> Negative 68 195
#>
#> Gold standard vs. Test 2
#> Test 2
#> Gold standard Positive Negative
#> Positive 554 54
#> Negative 66 197
#>
#> --------------------------------------------------------------------------------
#> PREVALENCE (%)
#> --------------------------------------------------------------------------------
#>
#> Estimate SE Lower CI Upper CI
#> Prevalence 69.8 1.6 66.7 72.8
#>
#> --------------------------------------------------------------------------------
#> DIAGNOSTIC ACCURACIES
#> --------------------------------------------------------------------------------
#>
#> Test 1 (%)
#> Estimate SE Lower CI Upper CI
#> Sensitivity 82.6 1.5 79.4 85.4
#> Specificity 74.1 2.7 68.6 79.1
#>
#> Test 2 (%)
#> Estimate SE Lower CI Upper CI
#> Sensitivity 91.1 1.2 88.6 93.1
#> Specificity 74.9 2.7 69.4 79.8
#>
#> Global Null Hypothesis: Se1 = Se2 & Sp1 = Sp2
#> Test statistic: 25.662 Adjusted p value: 6.971825e-06 ***SIGNIFICANT***
#>
#> Investigating individual differences
#>
#> Null Hypothesis 1: Se1 = Se2
#> Test statistic: 23.64545 Adjusted p value: 6.949149e-06 ***SIGNIFICANT***
#>
#> Null Hypothesis 2: Sp1 = Sp2
#> Test statistic: 0.01111111 Adjusted p value: 1
#>
#> --------------------------------------------------------------------------------
#> PREDICTIVE VALUES
#> --------------------------------------------------------------------------------
#>
#> Test 1 (%)
#> Estimate SE Lower CI Upper CI
#> PPV 88.1 1.4 85.2 90.5
#> NPV 64.8 2.8 59.2 70.0
#>
#> Test 2 (%)
#> Estimate SE Lower CI Upper CI
#> PPV 89.4 1.2 86.7 91.6
#> NPV 78.5 2.6 73.0 83.2
#>
#> Global Null Hypothesis: PPV1 = PPV2 & NPV1 = NPV2
#> Test statistic: 25.94449 Adjusted p value: 6.971825e-06 ***SIGNIFICANT***
#>
#> Investigating individual differences
#>
#> Null Hypothesis 1: PPV1 = PPV2
#> Test statistic: 0.8070579 Adjusted p value: 1
#>
#> Null Hypothesis 2: NPV1 = NPV2
#> Test statistic: 22.50225 Adjusted p value: 1.049486e-05 ***SIGNIFICANT***
#>
#> --------------------------------------------------------------------------------
#> LIKELIHOOD RATIOS
#> --------------------------------------------------------------------------------
#>
#> Test 1 (%)
#> Estimate SE Lower CI Upper CI
#> PLR 3.2 0.3 2.6 4.0
#> NLR 0.2 0.0 0.2 0.3
#>
#> Test 2 (%)
#> Estimate SE Lower CI Upper CI
#> PLR 3.6 0.4 3.0 4.5
#> NLR 0.1 0.0 0.1 0.2
#>
#> Global Null Hypothesis: PLR1 = PLR2 & NLR1 = NLR2
#> Test statistic: 23.43805 Adjusted p value: 8.137524e-06 ***SIGNIFICANT***
#>
#> Investigating individual differences
#>
#> Null Hypothesis 1: PLR1 = PLR2
#> Test statistic: 0.8980246 Adjusted p value: 1
#>
#> Null Hypothesis 2: NLR1 = NLR2
#> Test statistic: 4.662817 Adjusted p value: 1.247637e-05 ***SIGNIFICANT***
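The point estimates in the readout above can be checked by hand. The following is a minimal sketch in base R, using only the four counts from the "Gold standard vs. Test 1" contingency table and the textbook definitions of each metric; the variable names are illustrative and it makes no claim about how the package computes its standard errors or confidence intervals.

```r
# Counts copied from the "Gold standard vs. Test 1" table in the readout above
tp <- 502  # gold standard positive, Test 1 positive
fn <- 106  # gold standard positive, Test 1 negative
fp <- 68   # gold standard negative, Test 1 positive
tn <- 195  # gold standard negative, Test 1 negative

sens <- tp / (tp + fn)      # sensitivity
spec <- tn / (tn + fp)      # specificity
ppv  <- tp / (tp + fp)      # positive predictive value
npv  <- tn / (tn + fn)      # negative predictive value
plr  <- sens / (1 - spec)   # positive likelihood ratio
nlr  <- (1 - sens) / spec   # negative likelihood ratio

# Percentages and ratios rounded to one decimal place, as in the summary tables
round(100 * c(Sensitivity = sens, Specificity = spec, PPV = ppv, NPV = npv), 1)
round(c(PLR = plr, NLR = nlr), 1)
```

The rounded values agree with the Test 1 estimates reported by compareR() and interpretR().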
And really, that’s it! That is all you need to know to get your answers with testCompareR.
There is some additional functionality that might be useful to know about, though.
The compareR() function will accept data as a data frame or matrix and there is a range of coding options for positive and negative results, detailed below. If you have been working across multiple sites and find that researchers have used different coding systems, no problem! As long as positive results are coded using something from the positive list and negative results with something in the negative list, compareR() will handle that for you. No more manually re-coding your data!
“What about those pesky trailing spaces?” I hear you ask. Of course, compareR() can handle that, too. “Case-sensitivity?” Taken care of.
POSITIVE: positive, pos, p, yes, y, true, t, +, 1
NEGATIVE: negative, neg, no, n, false, f, -, 0, 2
# create data frame with varied coding
df <- data.frame(
test1 = c(" positive ", "POS ", " n ", "N ", " 1 ", "+"),
test2 = c(" NEG ", " yes ", " negative", " Y ", "-", " 0 "),
gold = c(0, 1, 0, 1, 2, 1)
)
# recode the dataframe
recoded <- testCompareR:::recoder(df)
recoded
#> test1 test2 gold
#> 1 1 0 0
#> 2 1 1 1
#> 3 0 0 0
#> 4 0 1 1
#> 5 1 0 0
#> 6 1 0 1
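For readers curious about what this kind of recoding involves, here is a minimal base R sketch of the idea: trim whitespace, lower the case, then look each value up in the two lists given above. The helper `recode_binary()` is hypothetical and written for illustration only; it is not the package's internal recoder and may differ from it in edge cases.

```r
# Accepted codes, copied from the POSITIVE and NEGATIVE lists above
positive_codes <- c("positive", "pos", "p", "yes", "y", "true", "t", "+", "1")
negative_codes <- c("negative", "neg", "no", "n", "false", "f", "-", "0", "2")

# Hypothetical helper: returns 1 for positive codes, 0 for negative codes
# and NA for anything that is in neither list
recode_binary <- function(x) {
  cleaned <- tolower(trimws(as.character(x)))
  out <- rep(NA_integer_, length(cleaned))
  out[cleaned %in% positive_codes] <- 1L
  out[cleaned %in% negative_codes] <- 0L
  out
}

# Same messy input as the example above
messy <- data.frame(
  test1 = c(" positive ", "POS ", " n ", "N ", " 1 ", "+"),
  test2 = c(" NEG ", " yes ", " negative", " Y ", "-", " 0 "),
  gold = c(0, 1, 0, 1, 2, 1)
)

# Apply the helper column by column, keeping the data frame structure
tidy <- as.data.frame(lapply(messy, recode_binary))
tidy
```

Returning NA for unrecognised codes is a design choice of this sketch: it makes unexpected values visible instead of silently assigning them to one class.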
There are two things that compareR() cannot handle.
Firstly, it is imperative that the data structure provided has three columns and that those columns follow the pattern Test 1, Test 2, gold standard. If you place the gold standard at any index other than your_data[,3] then compareR() may return a result that looks sensible but does not answer the question you wanted to ask.
Secondly, compareR() cannot handle missing data. Removing missing data is an option, but consider why the data are missing and don’t omit this from any write-up of the results. Alternatively, if the data are missing at random, you could consider the use of imputation methods to replace missing data. If the data are not missing at random then imputation becomes vastly more complex and you should probably seek expert advice.
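Before calling compareR() it can help to check how much data is missing and where. The sketch below uses base R only; the data frame `df_na` is invented for illustration, and dropping incomplete rows is shown as one option, not as a recommendation.

```r
# Hypothetical data with two incomplete rows
df_na <- data.frame(
  test1 = c(1, 0, NA, 1, 0),
  test2 = c(1, 1, 0, NA, 0),
  gold  = c(1, 0, 0, 1, 0)
)

# Is anything missing, and in which columns?
has_missing <- anyNA(df_na)
missing_per_column <- colSums(is.na(df_na))

# Keep only rows with a result for both tests and the gold standard
complete <- df_na[complete.cases(df_na), ]

# Record how many rows were removed so it can be reported in the write-up
n_dropped <- nrow(df_na) - nrow(complete)
n_dropped
```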
The alpha parameter sets the significance threshold used by compareR(). If you’re using testCompareR because you’re not statistically savvy and you want a nice function to do it all for you then you should probably just leave alpha alone. If you have a good reason, or you’re just messing around, feel free to change it to whatever you’d like, though.
# simulate data
test1 <- c(rep(1, 300), rep(0, 100), rep(1, 65), rep(0, 135))
test2 <- c(rep(1, 280), rep(0, 120), rep(1, 55), rep(0, 145))
gold <- c(rep(1, 400), rep(0, 200))
df <- data.frame(test1, test2, gold)
# test with alpha = 0.5
result <- compareR(df, alpha = 0.5)
#> Warning in validatR(df = df, test1 = test1, test2 = test2, gold = gold): Using default columns. Check test 1 is first column, test 2 is second
#> column and gold standard is third column.
# all results are significant
interpretR(result)
#>
#> WARNING:
#> Zeros exist in contingency table. Tests may return NA/NaN.
#>
#> --------------------------------------------------------------------------------
#> CONTINGENCY TABLES
#> --------------------------------------------------------------------------------
#>
#> True Status - POSITIVE
#> Test 2
#> Test 1 Positive Negative
#> Positive 280 20
#> Negative 0 100
#>
#> True Status - NEGATIVE
#> Test 2
#> Test 1 Positive Negative
#> Positive 55 10
#> Negative 0 135
#>
#> Gold standard vs. Test 1
#> Test 1
#> Gold standard Positive Negative
#> Positive 300 100
#> Negative 65 135
#>
#> Gold standard vs. Test 2
#> Test 2
#> Gold standard Positive Negative
#> Positive 280 120
#> Negative 55 145
#>
#> --------------------------------------------------------------------------------
#> PREVALENCE (%)
#> --------------------------------------------------------------------------------
#>
#> Estimate SE Lower CI Upper CI
#> Prevalence 66.7 1.9 65.4 68
#>
#> --------------------------------------------------------------------------------
#> DIAGNOSTIC ACCURACIES
#> --------------------------------------------------------------------------------
#>
#> Test 1 (%)
#> Estimate SE Lower CI Upper CI
#> Sensitivity 75.0 2.2 73.5 76.4
#> Specificity 67.5 3.3 65.2 69.7
#>
#> Test 2 (%)
#> Estimate SE Lower CI Upper CI
#> Sensitivity 70.0 2.3 68.4 71.5
#> Specificity 72.5 3.2 70.3 74.6
#>
#> Global Null Hypothesis: Se1 = Se2 & Sp1 = Sp2
#> Test statistic: 31.57895 Adjusted p value: 4.167158e-07 ***SIGNIFICANT***
#>
#> Investigating individual differences
#>
#> Null Hypothesis 1: Se1 = Se2
#> Test statistic: 18.05 Adjusted p value: 0.0001291072 ***SIGNIFICANT***
#>
#> Null Hypothesis 2: Sp1 = Sp2
#> Test statistic: 8.1 Adjusted p value: 0.02213263 ***SIGNIFICANT***
#>
#> --------------------------------------------------------------------------------
#> PREDICTIVE VALUES
#> --------------------------------------------------------------------------------
#>
#> Test 1 (%)
#> Estimate SE Lower CI Upper CI
#> PPV 82.2 2.0 80.8 83.5
#> NPV 57.4 3.2 55.3 59.6
#>
#> Test 2 (%)
#> Estimate SE Lower CI Upper CI
#> PPV 83.6 2.0 82.2 84.9
#> NPV 54.7 3.1 52.6 56.8
#>
#> Global Null Hypothesis: PPV1 = PPV2 & NPV1 = NPV2
#> Test statistic: 26.92232 Adjusted p value: 2.850504e-06 ***SIGNIFICANT***
#>
#> Investigating individual differences
#>
#> Null Hypothesis 1: PPV1 = PPV2
#> Test statistic: 3.171214 Adjusted p value: 0.1498935 ***SIGNIFICANT***
#>
#> Null Hypothesis 2: NPV1 = NPV2
#> Test statistic: 5.653882 Adjusted p value: 0.06966709 ***SIGNIFICANT***
#>
#> --------------------------------------------------------------------------------
#> LIKELIHOOD RATIOS
#> --------------------------------------------------------------------------------
#>
#> Test 1 (%)
#> Estimate SE Lower CI Upper CI
#> PLR 2.3 0.2 2.1 2.5
#> NLR 0.4 0.0 0.3 0.4
#>
#> Test 2 (%)
#> Estimate SE Lower CI Upper CI
#> PLR 2.5 0.3 2.3 2.7
#> NLR 0.4 0.0 0.4 0.4
#>
#> Global Null Hypothesis: PLR1 = PLR2 & NLR1 = NLR2
#> Test statistic: 23.37068 Adjusted p value: 8.416292e-06 ***SIGNIFICANT***
#>
#> Investigating individual differences
#>
#> Null Hypothesis 1: PLR1 = PLR2
#> Test statistic: 1.779904 Adjusted p value: 0.1498935 ***SIGNIFICANT***
#>
#> Null Hypothesis 2: NLR1 = NLR2
#> Test statistic: 2.375766 Adjusted p value: 0.06966709 ***SIGNIFICANT***
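To see why every comparison is flagged in this readout, compare the adjusted p values against two thresholds. This sketch assumes that a result is flagged whenever its adjusted p value is below alpha, which is consistent with the output above but is not confirmed against the package source; the 0.05 threshold is used here only as a conventional point of comparison.

```r
# Adjusted p values copied from the individual tests in the readout above
p_adj <- c(Se = 0.0001291072, Sp = 0.02213263,
           PPV = 0.1498935, NPV = 0.06966709)

# At a conventional threshold only the diagnostic accuracies would be flagged
flag_05 <- p_adj < 0.05

# At alpha = 0.5 every comparison falls below the threshold
flag_50 <- p_adj < 0.5

rbind(alpha_0.05 = flag_05, alpha_0.5 = flag_50)
```

The confidence intervals in this readout are also noticeably narrower than in the earlier examples, which suggests that alpha controls the confidence level as well.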
Contingency tables are included in the readout from compareR() or interpretR(). Some people like to see the summed totals for columns and rows. If you’re one of those people, then set margins = TRUE.
# simulate data
test1 <- c(rep(1, 300), rep(0, 100), rep(1, 65), rep(0, 135))
test2 <- c(rep(1, 280), rep(0, 120), rep(1, 55), rep(0, 145))
gold <- c(rep(1, 400), rep(0, 200))
df <- data.frame(test1, test2, gold)
# test with margins = TRUE
result <- compareR(df, margins = TRUE)
#> Warning in validatR(df = df, test1 = test1, test2 = test2, gold = gold): Using default columns. Check test 1 is first column, test 2 is second
#> column and gold standard is third column.
# contingency tables have margins
result$cont
#> $`Gold standard vs. Test 1`
#> Test 1
#> Gold standard Positive Negative Sum
#> Positive 300 100 400
#> Negative 65 135 200
#> Sum 365 235 600
#>
#> $`Gold standard vs. Test 2`
#> Test 2
#> Gold standard Positive Negative Sum
#> Positive 280 120 400
#> Negative 55 145 200
#> Sum 335 265 600
#>
#> $`True Status: POS`
#> Test 2
#> Test 1 Positive Negative Sum
#> Positive 280 20 300
#> Negative 0 100 100
#> Sum 280 120 400
#>
#> $`True Status: NEG`
#> Test 2
#> Test 1 Positive Negative Sum
#> Positive 55 10 65
#> Negative 0 135 135
#> Sum 55 145 200
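The same kind of totals can be produced for any contingency table with the base R function addmargins(). The sketch below rebuilds the "Gold standard vs. Test 1" table from the counts above; whether compareR() uses addmargins() internally is not stated here, so treat this as an equivalent illustration only.

```r
# Rebuild the "Gold standard vs. Test 1" table (matrix() fills column by column)
tab <- as.table(matrix(
  c(300, 65, 100, 135),
  nrow = 2,
  dimnames = list(
    `Gold standard` = c("Positive", "Negative"),
    `Test 1` = c("Positive", "Negative")
  )
))

# Append row and column sums
with_margins <- addmargins(tab)
with_margins
```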
By default compareR() runs a minimum of three hypothesis tests and it can perform up to nine. This is accounted for using adjusted p-values according to the Holm method. If you’d prefer to use a different method, that’s no problem. Just set the multi_corr parameter to any of the methods which are handled by the base R function p.adjust().
# display p.adjust.methods
p.adjust.methods
#> [1] "holm" "hochberg" "hommel" "bonferroni" "BH"
#> [6] "BY" "fdr" "none"
# simulate data
test1 <- c(rep(1, 300), rep(0, 100), rep(1, 65), rep(0, 135))
test2 <- c(rep(1, 280), rep(0, 120), rep(1, 55), rep(0, 145))
gold <- c(rep(1, 400), rep(0, 200))
df <- data.frame(test1, test2, gold)
# test with different multiple comparison methods
result1 <- compareR(df, multi_corr = "holm")
#> Warning in validatR(df = df, test1 = test1, test2 = test2, gold = gold): Using default columns. Check test 1 is first column, test 2 is second
#> column and gold standard is third column.
result2 <- compareR(df, multi_corr = "bonf")
#> Warning in validatR(df = df, test1 = test1, test2 = test2, gold = gold): Using default columns. Check test 1 is first column, test 2 is second
#> column and gold standard is third column.
# the more restrictive Bonferroni method returns higher adjusted p values
result1$pv$glob.p.adj < result2$pv$glob.p.adj
#> [1] TRUE
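The difference between the two corrections is easy to see by calling p.adjust() directly. The raw p values below are made up for illustration and are not taken from any compareR() output.

```r
# Hypothetical unadjusted p values from three tests
p_raw <- c(0.001, 0.01, 0.03)

# Holm multiplies the smallest p value by 3, the next by 2 and the largest by 1
holm <- p.adjust(p_raw, method = "holm")

# Bonferroni multiplies every p value by the number of tests
bonf <- p.adjust(p_raw, method = "bonferroni")

data.frame(p_raw, holm, bonf)
```

Holm-adjusted p values are never larger than the Bonferroni-adjusted ones, which is why Holm is the less conservative of the two while still controlling the family-wise error rate.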
In certain circumstances compareR() uses McNemar’s test for testing differences in diagnostic accuracies. This test is routinely performed with continuity correction. If you wish to perform it without continuity correction then set cc = FALSE. If you aren’t sure whether to run the test with or without continuity correction, then stick to the default parameters.
# simulate data
test1 <- c(rep(1, 6), rep(0, 2), rep(1, 14), rep(0, 76))
test2 <- c(rep(1, 1), rep(0, 7), rep(1, 2), rep(0, 88))
gold <- c(rep(1, 8), rep(0, 90))
df <- data.frame(test1, test2, gold)
# run compareR without continuity correction
result <- compareR(df, cc = FALSE)
#> Warning in validatR(df = df, test1 = test1, test2 = test2, gold = gold): Using default columns. Check test 1 is first column, test 2 is second
#> column and gold standard is third column.
result$acc
#> $accuracies
#> $accuracies$`Test 1`
#> Estimate SE Lower CI Upper CI
#> Sensitivity 75.0 15.3 41.5 93.4
#> Specificity 84.4 3.8 75.7 90.6
#>
#> $accuracies$`Test 2`
#> Estimate SE Lower CI Upper CI
#> Sensitivity 12.5 11.7 1.4 46.2
#> Specificity 97.8 1.6 92.4 99.5
#>
#>
#> $glob.test.stat
#> [1] "n < 100 and prevalence <= 10% - global test not used"
#>
#> $glob.p.value
#> [1] NA
#>
#> $glob.p.adj
#> [1] NA
#>
#> $sens.test.stat
#> [1] 13.33333
#>
#> $sens.p.value
#> [1] NA
#>
#> $sens.p.adj
#> [1] NA
#>
#> $spec.test.stat
#> [1] 13.84615
#>
#> $spec.p.value
#> [1] NA
#>
#> $spec.p.adj
#> [1] NA
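To get a feel for what the continuity correction changes, base R's mcnemar.test() can be applied to the "True Status - POSITIVE" table from the cass readout earlier (473, 29, 81 and 25). This is a sketch of the standard test, not of the package internals; the corrected statistic matches the value reported for the sensitivity comparison in that readout, which suggests, but does not prove, that the same calculation is used there.

```r
# Test 1 (rows) against Test 2 (columns) among patients with disease
paired <- matrix(
  c(473, 81, 29, 25),
  nrow = 2,
  dimnames = list(
    `Test 1` = c("Positive", "Negative"),
    `Test 2` = c("Positive", "Negative")
  )
)

# With continuity correction: (|29 - 81| - 1)^2 / (29 + 81)
with_cc <- mcnemar.test(paired, correct = TRUE)

# Without continuity correction: (29 - 81)^2 / (29 + 81)
without_cc <- mcnemar.test(paired, correct = FALSE)

c(with_cc = unname(with_cc$statistic),
  without_cc = unname(without_cc$statistic))
```

Only the discordant pairs (29 and 81) enter the statistic, and the correction always makes it slightly smaller, so the corrected test is the more conservative choice.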
You can change the number of decimal places displayed in the summary tables which are output by both the compareR() and interpretR() functions with the dp parameter. This parameter does not affect the number of decimal places displayed for p values or test statistics.
# simulate data
test1 <- c(rep(1, 317), rep(0, 83), rep(1, 68), rep(0, 132))
test2 <- c(rep(1, 281), rep(0, 119), rep(1, 51), rep(0, 149))
gold <- c(rep(1, 390), rep(0, 210))
df <- data.frame(test1, test2, gold)
# display summary tables to three decimal places
result <- compareR(df, dp = 3)
#> Warning in validatR(df = df, test1 = test1, test2 = test2, gold = gold): Using default columns. Check test 1 is first column, test 2 is second
#> column and gold standard is third column.
# the values in the summary tables are displayed to 3 decimal places
result$acc$accuracies
#> $`Test 1`
#> Estimate SE Lower CI Upper CI
#> Sensitivity 81.282 1.975 77.135 84.863
#> Specificity 67.619 3.229 61.046 73.605
#>
#> $`Test 2`
#> Estimate SE Lower CI Upper CI
#> Sensitivity 72.051 2.272 67.415 76.289
#> Specificity 75.714 2.959 69.520 81.052
Another important aspect of the testCompareR package is the ability to control your study design. For example, if you have made the a priori decision that you are only interested in the predictive values, is it really necessary to control for multiple tests for diagnostic accuracies and likelihood ratios? Of course not!
You can ask compareR() not to display the results by setting the parameters for any pairs of tests which aren’t of interest to you to FALSE. The parameters are as follows: sesp for diagnostic accuracies; ppvnpv for predictive values; plrnlr for likelihood ratios.
# simulate data
test1 <- c(rep(1, 317), rep(0, 83), rep(1, 68), rep(0, 132))
test2 <- c(rep(1, 281), rep(0, 119), rep(1, 51), rep(0, 149))
gold <- c(rep(1, 390), rep(0, 210))
df <- data.frame(test1, test2, gold)
# only display results for predictive values
result <- compareR(df, test1 = "test1", test2 = "test2", gold = "gold",
sesp = FALSE, plrnlr = FALSE)
result
#> test metric estimate lower_ci upper_ci p
#> 1 Test 1 PPV 82.3 78.2 85.8
#> 2 Test 2 PPV 84.6 80.4 88.1 *
#> 3 Test 1 NPV 66.0 59.5 72.1 ***
#> 4 Test 2 NPV 59.3 53.4 65.0
#>
#> Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 ' ' 1
If you want specific test names to be included in the output for compareR() then you can set the test.names parameter. This parameter accepts a character vector of length 2.
# simulate data
test1 <- c(rep(1, 317), rep(0, 83), rep(1, 68), rep(0, 132))
test2 <- c(rep(1, 281), rep(0, 119), rep(1, 51), rep(0, 149))
gold <- c(rep(1, 390), rep(0, 210))
df <- data.frame(test1, test2, gold)
# label the two tests with custom names
result <- compareR(df, test.names = c("POCT", "Lab Blood"))
#> Warning in validatR(df = df, test1 = test1, test2 = test2, gold = gold): Using default columns. Check test 1 is first column, test 2 is second
#> column and gold standard is third column.
result$acc$accuracies
#> $POCT
#> Estimate SE Lower CI Upper CI
#> Sensitivity 81.3 2.0 77.1 84.9
#> Specificity 67.6 3.2 61.0 73.6
#>
#> $`Lab Blood`
#> Estimate SE Lower CI Upper CI
#> Sensitivity 72.1 2.3 67.4 76.3
#> Specificity 75.7 3.0 69.5 81.1
You made it to the end! That pretty well summarises everything you need to know about the package. Hopefully it will save you a lot of time when comparing two binary diagnostic tests.
Please get in touch with any refinements, comments or bugs. The source code is available on GitHub if you think you can improve it yourself!