ROC curve (August 19, 1999)

Dear Professor Mean, I was at a meeting in Belgium and the buzz statistic was ROC analysis. I think it stands for Receiver Operating Characteristic curve. It seems to be used for predictive values. I seemed to be a lone ranger in not understanding, as several presentations claimed that "by this curve you can see this is good or bad" when the curves didn't look very different to me. Do you have a simple explanation of ROC curves?

To understand an ROC curve, you first have to accept the fact that MDs like to ruin a nice continuous outcome measure by turning it into a dichotomy. For example, doctors have measured the S100 protein in serum and found that higher values tend to be associated with Creutzfeldt-Jakob disease. The median value is 395 pg/ml for the 108 patients with the disease and only 109 pg/ml for the 74 patients without the disease. The doctors set a cut off of 213 pg/ml, even though they realized that 22.2% of the diseased patients had values below the cut off and 18.9% of the disease-free patients had values above the cut off.

The two percentages listed above are the false negative and false positive rates, respectively. If we lowered the cut off value, we would decrease the false negative rate, but we would also increase the false positive rate. Similarly, if we raised the cut off, we would decrease the false positive rate, but we would increase the false negative rate.
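To see this trade-off numerically, here is a minimal Python sketch; the serum values below are invented for illustration and are not the S100 data from the study above.

    # Hypothetical serum values (pg/ml); higher values suggest disease.
    diseased = [150, 240, 310, 395, 520, 610]
    healthy = [60, 95, 109, 180, 230, 260]

    def error_rates(cutoff):
        # False negatives: diseased patients who fall below the cut off.
        fn_rate = sum(x < cutoff for x in diseased) / len(diseased)
        # False positives: healthy patients at or above the cut off.
        fp_rate = sum(x >= cutoff for x in healthy) / len(healthy)
        return fn_rate, fp_rate

    for cutoff in (150, 213, 300):
        fn, fp = error_rates(cutoff)
        print(f"cut off {cutoff}: false negative rate {fn:.2f}, false positive rate {fp:.2f}")

Raising the cut off from 150 to 300 drives the false positive rate from 0.50 down to 0.00, but the false negative rate climbs from 0.00 to 0.33.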

Short explanation

An ROC curve is a graphical representation of the trade-off between the false negative and false positive rates for every possible cut off. Equivalently, the ROC curve shows the trade-off between sensitivity (Sn) and specificity (Sp).

By tradition, the plot shows the false positive rate on the X axis and 1 - the false negative rate on the Y axis. You could also describe this as a plot with 1-Sp on the X axis and Sn on the Y axis.
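In code, the curve is just the set of (1-Sp, Sn) pairs you get by sweeping the cut off across every observed value. Here is a sketch, again with made-up scores:

    def roc_points(diseased, healthy):
        # One (false positive rate, true positive rate) pair per cut off,
        # plus the two extremes where every result is called one way.
        cutoffs = sorted(set(diseased) | set(healthy))
        points = {(0.0, 0.0), (1.0, 1.0)}
        for c in cutoffs:
            tpr = sum(x >= c for x in diseased) / len(diseased)  # Sn
            fpr = sum(x >= c for x in healthy) / len(healthy)    # 1 - Sp
            points.add((fpr, tpr))
        return sorted(points)

    print(roc_points([150, 240, 310, 395], [60, 95, 109, 180]))

Plotting these pairs, with the false positive rate on the X axis and the true positive rate on the Y axis, gives the ROC curve.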

So how can you tell a good ROC curve from a bad one?

All ROC curves are good; it is the diagnostic test that can be good or bad. A good diagnostic test is one that has small false positive and false negative rates across a reasonable range of cut off values. A bad diagnostic test is one where the only cut offs that make the false positive rate low have a high false negative rate, and vice versa.

We are usually happy when the ROC curve climbs rapidly towards the upper left hand corner of the graph. This means that the true positive rate (1 - the false negative rate) is high and the false positive rate is low. We are less happy when the ROC curve follows a diagonal path from the lower left hand corner to the upper right hand corner. This means that every improvement in the false positive rate is matched by a corresponding worsening of the false negative rate.

You can quantify how quickly the ROC curve rises to the upper left hand corner by measuring the area under the curve. The larger the area, the better the diagnostic test. If the area is 1.0, you have an ideal test, because some cut off achieves both 100% sensitivity and 100% specificity. If the area is 0.5, the test is no better than flipping a coin: at every cut off, each gain in sensitivity is offset by an equal loss in specificity. In practice, a diagnostic test will have an area somewhere between these two extremes. The closer the area is to 1.0, the better the test; the closer it is to 0.5, the worse.
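Because the curve is a sequence of points, the area is easy to compute with the trapezoidal rule. A minimal sketch, using (1-Sp, Sn) points from a hypothetical test:

    def auc(points):
        # Trapezoidal area under the ROC curve; points must be sorted
        # by the false positive rate (the X coordinate).
        return sum((x1 - x0) * (y0 + y1) / 2
                   for (x0, y0), (x1, y1) in zip(points, points[1:]))

    pts = [(0.0, 0.0), (0.0, 0.75), (0.25, 0.75), (0.25, 1.0), (1.0, 1.0)]
    print(auc(pts))  # 0.9375 for this hypothetical curve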

The area under the curve also has a direct interpretation. If you take a random healthy patient and get a score of X, and a random diseased patient and get a score of Y, then the area under the curve estimates P[Y > X] (assuming that large values of the test are indicative of disease).
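You can check this interpretation by brute force: count, over all diseased-healthy pairs of patients, how often the diseased score is higher, with ties counting one half (the usual Wilcoxon-Mann-Whitney convention). A sketch, reusing the made-up scores from above:

    def concordance(diseased, healthy):
        # Estimates P[Y > X] for a random diseased score Y
        # and a random healthy score X; ties count one half.
        wins = sum((y > x) + 0.5 * (y == x)
                   for y in diseased for x in healthy)
        return wins / (len(diseased) * len(healthy))

    # Prints 0.9375, the same value as the trapezoidal area above.
    print(concordance([150, 240, 310, 395], [60, 95, 109, 180]))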

Show me an example of an ROC curve.

Consider a diagnostic test that can only take on five values, A, B, C, D, and E. We classify diseased (D+) and healthy (D-) patients by this test and get the following results:

[Table: counts of D+ and D- patients at each test value]

It's a bit easier if we convert this table to cumulative percentages.

[Table: cumulative percentages of D+ and D- patients at each test value]

We add a row (*) to represent a cumulative percentage of 0%, which corresponds to a diagnostic test where all the results are considered positive regardless of the test value. The last row represents the other extreme, where all the results are considered negative regardless of the test value.

The complementary percentages shown above represent the true positive rate (or Sn) and the false positive rate (or 1-Sp).

This table includes two extreme cases for the sake of completeness. If you always classify a test as positive, then you will have a 100% true positive rate among those with the disease (Sn=1), but also a 100% false positive rate among those who are healthy (Sp=0). Conversely, if you always classify a test as negative, you will have a 0% true positive rate among those with the disease (Sn=0), but you will also have a 0% false positive rate among those who are healthy (Sp=1). Neither extreme would be used in practice; always classifying a test as positive (or negative) means ignoring the test results entirely.
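The bookkeeping behind such a table can be sketched in a few lines of Python; the counts are invented for illustration and are not the ones from the example above.

    # Hypothetical counts per test category, ordered from the value
    # most suggestive of disease (A) to the least (E).
    # Dicts keep insertion order in Python 3.7+.
    counts = {"A": (30, 2), "B": (10, 4), "C": (5, 8), "D": (3, 20), "E": (2, 40)}
    n_pos = sum(dp for dp, dn in counts.values())  # all D+ patients
    n_neg = sum(dn for dp, dn in counts.values())  # all D- patients

    cum_dp = cum_dn = 0
    print("call nothing positive:    Sn 0.00, 1-Sp 0.00")
    for cat, (dp, dn) in counts.items():
        cum_dp += dp
        cum_dn += dn
        sn = cum_dp / n_pos    # true positive rate
        fpr = cum_dn / n_neg   # false positive rate (1 - Sp)
        print(f"positive if {cat} or worse:  Sn {sn:.2f}, 1-Sp {fpr:.2f}")

The first printed line is one extreme (everything called negative) and the last line, where Sn and 1-Sp both reach 1.00, is the other (everything called positive).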

Here is what the graph of the ROC curve would look like:

[Figure: ROC curve for the five-value test]

The area under the curve for this example is 0.91, which is quite good: it is close to the ideal value of 1.0 and much larger than the worst-case value of 0.5.

Here are the actual values used to draw the ROC curve (I selected the "Coordinate points of the ROC Curve" button in SPSS):

[Table: coordinate points of the ROC curve]

Here is the same ROC curve with annotations added:

[Figure: annotated ROC curve]

Shown below is an artificial ROC curve with an area equal to 0.5. Notice that each gain in sensitivity is balanced by an identical loss in specificity, and vice versa. Also notice that this curve passes through the point where both sensitivity and specificity equal 50%, a level of diagnostic ability you could achieve by flipping a coin. Clearly, this curve represents a worst case scenario.

[Figure: artificial ROC curve with area 0.5]
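One way to see why the diagonal corresponds to a useless test: if diseased and healthy scores come from the same distribution, the concordance probability P[Y > X] is 1/2. A quick simulation sketch (the data are random, so the result only approximates 0.5):

    import random

    random.seed(1)
    # Both groups drawn from the same distribution: the test score
    # carries no information about disease status.
    diseased = [random.gauss(0, 1) for _ in range(1000)]
    healthy = [random.gauss(0, 1) for _ in range(1000)]

    wins = sum(y > x for y in diseased for x in healthy)
    print(wins / (len(diseased) * len(healthy)))  # close to 0.5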

What's a good value for the area under the curve?

Deciding what a good value is for the area under the curve is tricky, and it depends a lot on the context of your individual problem. One way to approach the problem is to examine what the likelihood ratios would be for various areas. A good test should have an LR+ of at least 2.0 and an LR- of 0.5 or less; this corresponds to an area of roughly 0.75. A better test would have likelihood ratios of 5 and 0.2, respectively, which corresponds to an area of around 0.92. Even better would be likelihood ratios of 10 and 0.1, which corresponds roughly to an area of 0.97. So here is one interpretation of the areas:

  an area of roughly 0.75: a good test (comparable to LR+ = 2.0, LR- = 0.5)
  an area of roughly 0.92: a better test (comparable to LR+ = 5, LR- = 0.2)
  an area of roughly 0.97: an even better test (comparable to LR+ = 10, LR- = 0.1)

These are very rough guidelines; further work on refining these would be appreciated.
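For reference, the likelihood ratios at any single cut off follow directly from sensitivity and specificity; here is a minimal sketch (the Sn and Sp values are arbitrary):

    def likelihood_ratios(sn, sp):
        # LR+ : how much a positive result multiplies the odds of disease.
        # LR- : how much a negative result shrinks those odds.
        return sn / (1 - sp), (1 - sn) / sp

    lr_pos, lr_neg = likelihood_ratios(sn=0.8, sp=0.6)
    print(lr_pos, lr_neg)  # about 2.0 and 0.33: a good test by the guideline above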

Summary

The ROC curve plots the false positive rate on the X axis and 1 - the false negative rate on the Y axis. It shows the trade-off between the two rates. If the area under the ROC curve is close to 1, you have a very good test. If the area is close to 0.5, you have a lousy test.

Further reading

  1. Quantifying the information value of clinical assessments with signal detection theory. R. M. McFall, T. A. Treat. Annu Rev Psychol 1999: 50; 215-41. [Abstract]
  2. The magnificent ROC (Receiver Operating Characteristic curve). Jo van Schalkwyk. Accessed on 2003-09-08. www.anaesthetist.com/mnm/stats/roc/
  3. Receiver-operating characteristic (ROC) plots: a fundamental evaluation tool in clinical medicine. M. H. Zweig, G. Campbell. Clin Chem 1993: 39(4); 561-77. [Medline]
  4. Accuracy of clinical diagnosis of cirrhosis among alcohol-abusing men. K. J. Hamberg, B. Carstensen, T. I. Sorensen, K. Eghoje. J Clin Epidemiol 1996: 49(11); 1295-301. [Medline] [Abstract]
  5. Comparing diagnostic tests: a simple graphic using likelihood ratios. B. J. Biggerstaff. Statistics in Medicine 2000: 19(5); 649-63. [Medline] [Abstract]
  6. Slopes of a receiver operating characteristic curve and likelihood ratios for a diagnostic test. B. C. K. Choi. Am J Epidemiol 1998: 148(11); 1127-32. [Medline]
  7. Comparing the areas under two or more correlated receiver operating characteristic curves: a nonparametric approach. E. R. DeLong, D. M. DeLong, D. L. Clarke-Pearson. Biometrics 1988: 44(3); 837-45. [Medline]
  8. Analysis of correlated ROC areas in diagnostic testing. H. H. Song. Biometrics 1997: 53(1); 370-82. [Medline]
  9. Receiver Operating Characteristic (ROC) Literature Research. Kelly H. Zou, Harvard Medical School. Accessed on 2003-09-08. splweb.bwh.harvard.edu:8000/pages/ppl/zou/roc.html
Published examples of ROC curves

  1. The influence of prostate volume on the ratio of free to total prostate specific antigen in serum of patients with prostate carcinoma and benign prostate hyperplasia. C. Stephan, M. Lein, K. Jung, D. Schnorr, S. A. Loening. Cancer 1997: 79(1); 104-9. [Medline] [Abstract]
  2. Diagnostic accuracy of four assays of prostatic acid phosphatase: comparison using receiver operating characteristic curve analysis. J. L. Carson, J. M. Eisenberg, L. M. Shaw, et al. JAMA 1985: 253; 665-669. [Medline]
  3. The ratio of free to total serum prostate specific antigen and its use in differential diagnosis of prostate carcinoma in Japan. S. Egawa, S. Soh, M. Ohori, T. Uchida, K. Gohji, A. Fujii, S. Kuwao, K. Koshiba. Cancer 1997: 79(1); 90-8. [Medline] [Abstract]
  4. Using the Hospital Anxiety and Depression Scale to screen for psychiatric disorders in people presenting with deliberate self-harm. D. Hamer, D. Sanjeev, E. Butterworth, P. Barczak. Br J Psychiatry 1991: 158; 782-4. [Medline]
  5. Screening for anxiety, depressive and somatoform disorders in rehabilitation--validity of HADS and GHQ-12 in patients with musculoskeletal disease. M. Harter, K. Reuter, K. Gross-Hardt, J. Bengel. Disabil Rehabil 2001: 23(16); 737-44. [Medline]
  6. Diagnostic markers of infection: comparison of procalcitonin with C reactive protein and leucocyte count. M. Hatherill, S. M. Tibby, K. Sykes, C. Turner, I. A. Murdoch. Arch Dis Child 1999: 81(5); 417-21. [Medline] [Abstract] [Full text] [PDF]
  7. Using fasting plasma glucose concentrations to screen for gestational diabetes mellitus: prospective population based study. D. Perucchini, U. Fischer, G. A. Spinas, R. Huch, A. Huch, R. Lehmann. BMJ 1999: 319(7213); 812-815. [Medline] [Abstract] [Full text] [PDF]
  8. Sensitivity and specificity of observer and self-report questionnaires in major and minor depression following myocardial infarction. J. J. Strik, A. Honig, R. Lousberg, J. Denollet. Psychosomatics 2001: 42(5); 423-8. [Medline] [Abstract]
  9. Diagnosis of Creutzfeldt-Jakob disease by measurement of S100 protein in serum: prospective case-control study. M. Otto, J. Wiltfang, E. Schutz, I. Zerr, A. Otto, A. Pfahlberg, O. Gefeller, M. Uhr, A. Giese, T. Weber, H. A. Kretzschmar, S. Poser. BMJ 1998: 316(7131); 577-82. [Medline] [Abstract] [Full text] [PDF]