P.Mean >> Category >> Diagnostic testing (created 2003-09-08).  

Evaluation of diagnostic tests involves some subtle but important statistical issues. These webpages show some interesting examples of diagnostic tests, offer pointers for critical evaluation of studies of diagnostic tests, and present practical applications of diagnostic tests in your day-to-day medical practice. Also see Category: Bayesian statistics.


37. The Monthly Mean: Positive and negative predictive values (May/June 2010)

36. The Monthly Mean: Specific advice about a sensitive topic (February/March 2010)


35. What is an ROC curve? (December 2009)

34. P.Mean: Data layout for an ROC curve (created 2009-10-16). Back in 1999, I wrote a brief description of the ROC curve and showed what it would look like in SPSS. That page can be found at www.childrensmercy.org/stats/ask/roc.asp. I didn't show, however, what the data would look like when entered into SPSS or what the dialog boxes would look like.
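As a sketch of what such a layout might look like outside of SPSS, the essentials are one row per subject with a gold-standard disease indicator and a test score; the variable names and values below are hypothetical, not taken from the original example.

```python
# Hypothetical data layout for an ROC analysis: one row per subject,
# with a gold-standard disease indicator (1 = diseased) and a test score.
subjects = [
    (1, 9), (1, 7), (1, 8), (1, 5),   # diseased subjects
    (0, 3), (0, 4), (0, 6), (0, 2),   # non-diseased subjects
]

def roc_point(data, cutoff):
    """Return (1 - specificity, sensitivity) when 'positive' means score >= cutoff."""
    tp = sum(1 for d, s in data if d == 1 and s >= cutoff)
    fn = sum(1 for d, s in data if d == 1 and s < cutoff)
    fp = sum(1 for d, s in data if d == 0 and s >= cutoff)
    tn = sum(1 for d, s in data if d == 0 and s < cutoff)
    return fp / (fp + tn), tp / (tp + fn)

# Sweeping the cutoff across the observed score range traces out the ROC curve.
for c in range(2, 11):
    print(c, roc_point(subjects, c))
```

Each printed pair is one point on the curve; a statistics package does exactly this sweep internally.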

33. P.Mean: The problem with being too sensitive or too specific (created 2009-09-16). Somebody asked my opinion about cost effectiveness research. My bottom line is that I like it, but I understand why it is controversial. Here's the logic that I presented to draw that conclusion.

32. P.Mean: Getting a good cut-off when sensitivity is more important than specificity (created 2009-09-14). "I am working on a prediction model to help with diagnosis. In this particular area I need a model that has the highest possible sensitivity (low specificity is not a problem)." One obvious comment is that you can achieve a sensitivity of 100% if you don't mind a specificity of 0%. So when you say "low specificity is not a problem" that statement is only partially true. What you mean to say is that false negatives are far more serious than false positives. How much more serious, though? Five times? Ten times? Once you've decided the relative costs of false negatives and false positives, the rest is easy.
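The "rest is easy" step can be sketched in a few lines of Python. The scores below are made up, and the cost ratio of 10 is a hypothetical choice meaning a false negative is ten times as costly as a false positive.

```python
# Cost-weighted cutoff selection: pick the cutoff minimizing the total
# misclassification cost once false negatives are weighted more heavily.
diseased = [5, 6, 7, 8, 9]   # test scores for subjects with the disease
healthy = [2, 3, 4, 5, 6]    # test scores for subjects without it

def expected_cost(cutoff, cost_ratio):
    fn = sum(1 for s in diseased if s < cutoff)   # missed cases
    fp = sum(1 for s in healthy if s >= cutoff)   # false alarms
    return cost_ratio * fn + fp

# With cost_ratio = 10, the search favors low cutoffs that miss few cases.
best = min(range(2, 11), key=lambda c: expected_cost(c, cost_ratio=10))
print(best)
```

Changing the cost ratio shifts the chosen cutoff, which is exactly the point of deciding the relative costs first.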

31. P.Mean: Locating individual points on an ROC curve (created 2009-03-05). In a project examining a diagnostic test, I was asked to develop an ROC curve. That is fairly easy to do. Six months later, though, I was asked to designate a particular point on the curve corresponding to a cutpoint of 7. This is a bit ambiguous, but in re-reading the paper, it was obvious from the context that this meant locating the point on the curve where a positive test result of 7 or less (alternatively a negative test result of 8 or more) occurred. It takes a while to get oriented properly on an ROC curve. Here's what I did.
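A small sketch of that orientation step, using hypothetical scores (not the project's data) where low values indicate disease, so "positive" means a result of 7 or less:

```python
# Locating the ROC point corresponding to the rule "positive if score <= 7".
diseased = [3, 5, 6, 7, 9]     # scores for diseased subjects (low = diseased)
healthy = [6, 8, 9, 10, 11]    # scores for non-diseased subjects

cut = 7
sens = sum(1 for s in diseased if s <= cut) / len(diseased)  # true positive rate
fpr = sum(1 for s in healthy if s <= cut) / len(healthy)     # 1 - specificity
point = (fpr, sens)   # the (x, y) coordinates of the cutpoint on the ROC curve
print(point)
```

The point plots at x = 1 − specificity and y = sensitivity; getting the direction of "positive" right is the part that takes a while.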


30. P.Mean: Controversies with a test for ovarian cancer (created 2008-08-27). A recent article in the New York Times raises some interesting questions about diagnostic testing.

Outside resources:

Osamu Komori, Shinto Eguchi. A boosting method for maximizing the partial area under the ROC curve. BMC Bioinformatics. 2010;11(1):314. Abstract: "BACKGROUND: The receiver operating characteristic (ROC) curve is a fundamental tool to assess the discriminant performance for not only a single marker but also a score function combining multiple markers. The area under the ROC curve (AUC) for a score function measures the intrinsic ability for the score function to discriminate between the controls and cases. Recently, the partial AUC (pAUC) has been paid more attention than the AUC, because a suitable range of the false positive rate can be focused according to various clinical situations. However, existing pAUC-based methods only handle a few markers and do not take nonlinear combination of markers into consideration. RESULTS: We have developed a new statistical method that focuses on the pAUC based on a boosting technique. The markers are combined componentially for maximizing the pAUC in the boosting algorithm using natural cubic splines or decision stumps (single-level decision trees), according to the values of markers (continuous or discrete). We show that the resulting score plots are useful for understanding how each marker is associated with the outcome variable. We compare the performance of the proposed boosting method with those of other existing methods, and demonstrate the utility using real data sets. As a result, we have much better discrimination performances in the sense of the pAUC in both simulation studies and real data analysis. CONCLUSIONS: The proposed method addresses how to combine the markers after a pAUC-based filtering procedure in high dimensional setting. Hence, it provides a consistent way of analyzing data based on the pAUC from marker selection to marker combination for discrimination problems.
The method can capture not only linear but also nonlinear association between the outcome variable and the markers, about which the nonlinearity is known to be necessary in general for the maximization of the pAUC. The method also puts importance on the accuracy of classification performance as well as interpretability of the association, by offering simple and smooth resultant score plots for each marker." [Accessed June 14, 2010]. Available at: http://www.biomedcentral.com/1471-2105/11/314.

Jens Klotsche, Dietmar Ferger, Lars Pieper, Jurgen Rehm, Hans-Ulrich Wittchen. A novel nonparametric approach for estimating cut-offs in continuous risk indicators with application to diabetes epidemiology. BMC Medical Research Methodology. 2009;9(1):63. Abstract: "BACKGROUND: Epidemiological and clinical studies, often including anthropometric measures, have established obesity as a major risk factor for the development of type 2 diabetes. Appropriate cut-off values for anthropometric parameters are necessary for prediction or decision purposes. The cut-off corresponding to the Youden-Index is often applied in epidemiology and biomedical literature for dichotomizing a continuous risk indicator. METHODS: Using data from a representative large multistage longitudinal epidemiological study in a primary care setting in Germany, this paper explores a novel approach for estimating optimal cut-offs of anthropomorphic parameters for predicting type 2 diabetes based on a discontinuity of a regression function in a nonparametric regression framework. RESULTS: The resulting cut-off corresponded to values obtained by the Youden Index (maximum of the sum of sensitivity and specificity, minus one), often considered the optimal cut-off in epidemiological and biomedical research. The nonparametric regression based estimator was compared to results obtained by the established methods of the Receiver Operating Characteristic plot in various simulation scenarios and based on bias and root mean square error, yielded excellent finite sample properties. CONCLUSION: It is thus recommended that this nonparametric regression approach be considered as valuable alternative when a continuous indicator has to be dichotomized at the Youden Index for prediction or decision purposes." [Accessed October 11, 2009]. Available at: http://www.biomedcentral.com/1471-2288/9/63.
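For reference, the Youden index criterion discussed in this paper is straightforward to compute directly; this sketch (with illustrative data, not the study's) scans candidate cutoffs for the maximum of sensitivity + specificity − 1.

```python
# Youden index: find the cutoff maximizing sensitivity + specificity - 1,
# where "positive" means score >= cutoff.
diseased = [5, 6, 7, 8, 9]   # illustrative scores for diseased subjects
healthy = [1, 2, 3, 4, 6]    # illustrative scores for non-diseased subjects

def youden(cutoff):
    sens = sum(1 for s in diseased if s >= cutoff) / len(diseased)
    spec = sum(1 for s in healthy if s < cutoff) / len(healthy)
    return sens + spec - 1

best = max(range(1, 10), key=youden)
print(best, youden(best))
```

The nonparametric regression estimator in the paper targets the same cutoff by a different route, which is why the two were compared.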

K J Hamberg, B Carstensen, T I Sørensen, K Eghøje. Accuracy of clinical diagnosis of cirrhosis among alcohol-abusing men. J Clin Epidemiol. 1996;49(11):1295-1301. Abstract: "There is a considerable variation among specialists in the use of liver biopsy for the diagnosis of alcoholic cirrhosis, which is often based solely on clinical findings, sometimes supplemented with blood tests. To assess the diagnostic accuracy that may be achieved by this approach, we related items of the history, symptoms and signs, and routine blood tests to the presence/absence of cirrhosis in a unique, previously established, consecutive series of 303 alcohol-abusing men, in whom liver biopsy was performed irrespective of the clinical and biochemical findings. Using logistic regression analyses, we created a clinical, a combined clinical and biochemical, and a pure biochemical diagnostic model. The probability of cirrhosis in patients with the specified characteristics was estimated, the diagnostic accuracy was assessed as functions of diagnostic thresholds for cirrhosis defined by the probability of cirrhosis varying between 0 and 1, and confidence intervals were estimated by bootstrap sampling. The clinical model, including facial teleangiectasia, vascular spiders, white nails, abdominal veins, fatness, and peripheral edema, could be used with high diagnostic accuracy and it was clearly superior to the biochemical model. Adding biochemical findings to the clinical model improved the accuracy of the clinical model only slightly. We conclude that cirrhosis may be diagnosed in alcohol-abusing men with a high accuracy using selected, properly weighted clinical observations only." [Accessed December 4, 2009]. Available at: http://www.ncbi.nlm.nih.gov/pubmed/8892498.

Jerome Groopman. The Best Medicine. Excerpt: "It turns out that about 15 percent of all complaints are misdiagnosed. Many people assume that such diagnostic mistakes are related to technical factors, like mixing up tubes of blood in the laboratory so that the results given to the physician are for the wrong patient. Such technical errors are, in fact, rare. The vast majority of misdiagnoses are related to cognitive biases, thinking traps that occur more often under time pressure and uncertainty. Many of these biases were identified by the cognitive scientists Daniel Kahneman and Amos Tversky." [Accessed March 23, 2011]. Available at: http://incharacter.org/archives/wisdom/the-best-medicine-2/.

I Hozo, B Djulbegovic. Calculating confidence intervals for threshold and post-test probabilities. MD Comput. 1998;15(2):110-115. Abstract: "We describe a method and a computer program, written in JavaScript, for calculating confidence intervals. The method uses Taylor's series to approximate the standard errors of a post-test probability and threshold probabilities and, from them, to obtain the associated confidence intervals. This method is valid if the variables of interest are stochastically independent." [Accessed December 4, 2009]. Available at: http://www.ncbi.nlm.nih.gov/pubmed/9540324.

Anne W S Rutjes, Johannes B Reitsma, Jan P Vandenbroucke, Afina S Glas, Patrick M M Bossuyt. Case-control and two-gate designs in diagnostic accuracy studies. Clin. Chem. 2005;51(8):1335-1341. Abstract: "BACKGROUND: In some diagnostic accuracy studies, the test results of a series of patients with an established diagnosis are compared with those of a control group. Such case-control designs are intuitively appealing, but they have also been criticized for leading to inflated estimates of accuracy. METHODS: We discuss similarities and differences between diagnostic and etiologic case-control studies, as well as the mechanisms that can lead to variation in estimates of diagnostic accuracy in studies with separate sampling schemes ("gates") for diseased (cases) and nondiseased individuals (controls). RESULTS: Diagnostic accuracy studies are cross-sectional and descriptive in nature. Etiologic case-control studies aim to quantify the effect of potential causal exposures on disease occurrence, which inherently involves a time window between exposure and disease occurrence. Researchers and readers should be aware of spectrum effects in diagnostic case-control studies as a result of the restricted sampling of cases and/or controls, which can lead to changes in estimates of diagnostic accuracy. These spectrum effects may be advantageous in the early investigation of a new diagnostic test, but for an overall evaluation of the clinical performance of a test, case-control studies should closely mimic cross-sectional diagnostic studies. CONCLUSIONS: As the accuracy of a test is likely to vary across subgroups of patients, researchers and clinicians might carefully consider the potential for spectrum effects in all designs and analyses, particularly in diagnostic accuracy studies with differential sampling schemes for diseased (cases) and nondiseased individuals (controls)." [Accessed September 20, 2011]. Available at: http://www.clinchem.org/cgi/content/full/51/8/1335.

Nathaniel D. Mercaldo, Kit F. Lau, Xiao H. Zhou. Confidence intervals for predictive values with an emphasis to case-control studies. Statistics in Medicine. 2007;26(10):2170-2183. Abstract: "The accuracy of a binary-scale diagnostic test can be represented by sensitivity (Se), specificity (Sp) and positive and negative predictive values (PPV and NPV). Although Se and Sp measure the intrinsic accuracy of a diagnostic test that does not depend on the prevalence rate, they do not provide information on the diagnostic accuracy of a particular patient. To obtain this information we need to use PPV and NPV. Since PPV and NPV are functions of both the accuracy of the test and the prevalence of the disease, constructing their confidence intervals for a particular patient is not straightforward. In this paper, a novel method for the estimation of PPV and NPV, as well as their confidence intervals, is developed. For both predictive values, standard, adjusted and their logit transformed-based confidence intervals are compared using coverage probabilities and interval lengths in a simulation study. These methods are then applied to two case-control studies: a diagnostic test assessing the ability of the e4 allele of the apolipoprotein E gene (ApoE.e4) on distinguishing patients with late-onset Alzheimer's disease (AD) and a prognostic test assessing the predictive ability of a 70-gene signature on breast cancer metastasis. Copyright © 2006 John Wiley & Sons, Ltd." [Accessed December 10, 2009]. Available at: http://dx.doi.org/10.1002/sim.2677.
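The prevalence adjustment at the heart of this problem follows from Bayes' theorem; the point estimates can be sketched as below (with arbitrary illustrative values), leaving the confidence-interval machinery of the paper aside. In a case-control study the sample itself cannot estimate prevalence, so it must be supplied externally.

```python
# PPV and NPV from sensitivity, specificity, and an assumed prevalence,
# via Bayes' theorem.
def predictive_values(sens, spec, prev):
    ppv = sens * prev / (sens * prev + (1 - spec) * (1 - prev))
    npv = spec * (1 - prev) / (spec * (1 - prev) + (1 - sens) * prev)
    return ppv, npv

# Example: a test with 90% sensitivity and 80% specificity at 10% prevalence.
ppv, npv = predictive_values(0.9, 0.8, 0.1)
print(round(ppv, 3), round(npv, 3))
```

Notice how modest the PPV is at low prevalence even for a reasonably accurate test; this is the rare-disease problem that makes PPV and NPV so sensitive to the assumed prevalence.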

Judith L Bowen. Educational strategies to promote clinical diagnostic reasoning. N. Engl. J. Med. 2006;355(21):2217-2225. Excerpt: "Clinical teachers differ from clinicians in a fundamental way. They must simultaneously foster high-quality patient care and assess the clinical skills and reasoning of learners in order to promote their progress toward independence in the clinical setting. Clinical teachers must diagnose both the patient's clinical problem and the learner's ability and skill. To assess a learner's diagnostic reasoning strategies effectively, the teacher needs to consider how doctors learn to reason in the clinical environment." [Accessed December 4, 2009]. Available at: http://www.ncbi.nlm.nih.gov/pubmed/17124019.

David J. Hand. Evaluating diagnostic tests: The area under the ROC curve and the balance of errors. Statistics in Medicine. 2010;29(14):1502-1510. Abstract: "Because accurate diagnosis lies at the heart of medicine, it is important to be able to evaluate the effectiveness of diagnostic tests. A variety of accuracy measures are used. One particularly widely used measure is the AUC, the area under the receiver operating characteristic (ROC) curve. This measure has a well-understood weakness when comparing ROC curves which cross. However, it also has the more fundamental weakness of failing to balance different kinds of misdiagnoses effectively. This is not merely an aspect of the inevitable arbitrariness in choosing a performance measure, but is a core property of the way the AUC is defined. This property is explored, and an alternative, the H measure, is described. Copyright © 2010 John Wiley & Sons, Ltd." [Accessed June 16, 2010]. Available at: http://dx.doi.org/10.1002/sim.3859.
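As a reminder of what the AUC measures, it equals the probability that a randomly chosen diseased subject outscores a randomly chosen non-diseased one, with ties counted as one half; a minimal sketch with made-up scores:

```python
# AUC computed as the Mann-Whitney probability of correct pairwise ranking.
diseased = [5, 6, 7, 8]   # illustrative scores for diseased subjects
healthy = [3, 4, 5, 6]    # illustrative scores for non-diseased subjects

# Compare every diseased/healthy pair; ties count as one half.
pairs = [(d, h) for d in diseased for h in healthy]
auc = sum(1.0 if d > h else 0.5 if d == h else 0.0 for d, h in pairs) / len(pairs)
print(auc)
```

This pairwise-ranking view makes Hand's criticism concrete: the AUC averages over all possible cutoffs, implicitly weighting false positives and false negatives in a way the analyst never chose.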

Thomas Perneger, Delphine Courvoisier. Interpretation of evidence in data by untrained medical students: a scenario-based study. BMC Medical Research Methodology. 2010;10(1):78. Abstract: "BACKGROUND: To determine which approach to assessment of evidence in data - statistical tests or likelihood ratios - comes closest to the interpretation of evidence by untrained medical students. METHODS: Empirical study of medical students (N = 842), untrained in statistical inference or in the interpretation of diagnostic tests. They were asked to interpret a hypothetical diagnostic test, presented in four versions that differed in the distributions of test scores in diseased and non-diseased populations. Each student received only one version. The intuitive application of the statistical test approach would lead to rejecting the null hypothesis of no disease in version A, and to accepting the null in version B. Application of the likelihood ratio approach led to opposite conclusions - against the disease in A, and in favour of disease in B. Version C tested the importance of the p-value (A: 0.04 versus C: 0.08) and version D the importance of the likelihood ratio (C: 1/4 versus D: 1/8). RESULTS: In version A, 7.5% concluded that the result was in favour of disease (compatible with p value), 43.6% ruled against the disease (compatible with likelihood ratio), and 48.9% were undecided. In version B, 69.0% were in favour of disease (compatible with likelihood ratio), 4.5% against (compatible with p value), and 26.5% undecided. Increasing the p value from 0.04 to 0.08 did not change the results. The change in the likelihood ratio from 1/4 to 1/8 increased the proportion of non-committed responses. CONCLUSIONS: Most untrained medical students appear to interpret evidence from data in a manner that is compatible with the use of likelihood ratios." [Accessed October 25, 2010]. Available at: http://www.biomedcentral.com/1471-2288/10/78.

Tracey Sach, David Whynes. Men and women: beliefs about cancer and about screening. BMC Public Health. 2009;9(1):431. Abstract: "BACKGROUND: Cancer screening programmes in England are publicly-funded. Professionals' beliefs in the public health benefits of screening can conflict with individuals' entitlements to exercise informed judgement over whether or not to participate. The recognition of the importance of individual autonomy in decision making requires greater understanding of the knowledge, attitudes and beliefs upon which people's screening choices are founded. Until recently, the technology available required that cancer screening be confined to women. This study aimed to discover whether male and female perceptions of cancer and of screening differ. METHODS: Data on the public's cancer beliefs were collected by means of a postal survey (anonymous questionnaire). Two general practices based in Nottingham and in Mansfield, in east-central England, sent questionnaires to registered patients aged 30 to 70 years. 1,808 completed questionnaires were returned for analysis, 56.5 per cent from women. RESULTS: Women were less likely to underestimate overall cancer incidence, although each sex was more likely to cite a sex-specific cancer as being amongst the most common cancer site. In terms of risk factors, men were most uncertain about the role of stress and sexually-transmitted diseases, whereas women were more likely to rate excessive alcohol and family history as major risk factors. The majority of respondents believed the public health care system should provide cancer screening, but significantly more women than men reported having benefited from the nationally-provided screening services. Those who were older, in better health or had longer periods of formal education were less worried about cancer than those who had illness experiences, lower incomes, or who were smokers.
Actual or potential participation in bowel screening was higher amongst those who believed bowel cancer to be common and amongst men, despite women having more substantial worries about cancer than men. CONCLUSIONS: Our results suggest that men's and women's differential knowledge of cancer correlates with women's closer involvement with screening. Even so, men were neither less positive about screening nor less likely to express a willingness to participate in relevant screening in the future. It is important to understand gender-related differences in knowledge and perceptions of cancer, if health promotion resources are to be allocated efficiently." [Accessed November 30, 2009]. Available at: http://www.biomedcentral.com/1471-2458/9/431.

Eta S. Berner, Randolph A. Miller, Mark L. Graber. Missed and Delayed Diagnoses in the Ambulatory Setting. Annals of Internal Medicine. 2007;146(6):470. Excerpt: "We applaud Gandhi and colleagues for highlighting the problem of outpatient diagnostic errors. However, malpractice claims are a biased data source. Primary identification of diagnostic errors in ambulatory settings remains problematic." [Accessed December 4, 2009]. Available at: http://www.annals.org/content/146/6/470.1.extract.

E Berner, M Graber. Overconfidence as a Cause of Diagnostic Error in Medicine. The American Journal of Medicine. 2008;121(5):S2-S23. Abstract: "The great majority of medical diagnoses are made using automatic, efficient cognitive processes, and these diagnoses are correct most of the time. This analytic review concerns the exceptions: the times when these cognitive processes fail and the final diagnosis is missed or wrong. We argue that physicians in general underappreciate the likelihood that their diagnoses are wrong and that this tendency to overconfidence is related to both intrinsic and systemically reinforced factors. We present a comprehensive review of the available literature and current thinking related to these issues. The review covers the incidence and impact of diagnostic error, data on physician overconfidence as a contributing cause of errors, strategies to improve the accuracy of diagnostic decision making, and recommendations for future research." [Accessed December 4, 2009]. Available at: http://www.amjmed.com/article/S0002-9343(08)00040-5/fulltext.

H. Gilbert Welch, William C. Black. Overdiagnosis in Cancer. J. Natl. Cancer Inst. 2010:djq099. Abstract: "This article summarizes the phenomenon of cancer overdiagnosis--the diagnosis of a "cancer" that would otherwise not go on to cause symptoms or death. We describe the two prerequisites for cancer overdiagnosis to occur: the existence of a silent disease reservoir and activities leading to its detection (particularly cancer screening). We estimated the magnitude of overdiagnosis from randomized trials: about 25% of mammographically detected breast cancers, 50% of chest x-ray and/or sputum-detected lung cancers, and 60% of prostate-specific antigen-detected prostate cancers. We also review data from observational studies and population-based cancer statistics suggesting overdiagnosis in computed tomography-detected lung cancer, neuroblastoma, thyroid cancer, melanoma, and kidney cancer. To address the problem, patients must be adequately informed of the nature and the magnitude of the trade-off involved with early cancer detection. Equally important, researchers need to work to develop better estimates of the magnitude of overdiagnosis and develop clinical strategies to help minimize it." [Accessed April 28, 2010]. Available at: http://jnci.oxfordjournals.org/cgi/content/abstract/djq099v1.

Donald A. Redelmeier. The Cognitive Psychology of Missed Diagnoses. Ann Intern Med. 2005;142(2):115-120. Abstract: "Cognitive psychology is the science that examines how people reason, formulate judgments, and make decisions. This case involves a patient given a diagnosis of pharyngitis, whose ultimate diagnosis of osteomyelitis was missed through a series of cognitive shortcuts. These errors include the availability heuristic (in which people judge likelihood by how easily examples spring to mind), the anchoring heuristic (in which people stick with initial impressions), framing effects (in which people make different decisions depending on how information is presented), blind obedience (in which people stop thinking when confronted with authority), and premature closure (in which several alternatives are not pursued). Rather than trying to completely eliminate cognitive shortcuts (which often serve clinicians well), becoming aware of common errors might lead to sustained improvement in patient care." [Accessed July 8, 2009]. Available at: http://www.annals.org/cgi/content/abstract/142/2/115.

Eve A. Kerr, Brian J. Zikmund-Fisher, Mandi L. Klamerus, et al. The Role of Clinical Uncertainty in Treatment Decisions for Diabetic Patients with Uncontrolled Blood Pressure. Annals of Internal Medicine. 2008;148(10):717-727. Abstract: "Factors underlying failure to intensify therapy in response to elevated blood pressure have not been systematically studied. To examine the process of care for diabetic patients with elevated triage blood pressure (≥140/90 mm Hg) during routine primary care visits to assess whether a treatment change occurred and to what degree specific patient and provider factors correlated with the likelihood of treatment change. Prospective cohort study. 9 Veterans Affairs facilities in 3 midwestern states. 1169 diabetic patients with scheduled visits to 92 primary care providers from February 2005 to March 2006. Proportion of patients who had a change in a blood pressure treatment (medication intensification or planned follow-up within 4 weeks). Predicted probability of treatment change was calculated from a multilevel logistic model that included variables assessing clinical uncertainty, competing demands and prioritization, and medication-related factors (controlling for blood pressure). Overall, 573 (49%) patients had a blood pressure treatment change at the visit. The following factors made treatment change less likely: repeated blood pressure by provider recorded as less than 140/90 mm Hg versus 140/90 mm Hg or greater or no recorded repeated blood pressure (13% vs. 61%; P < 0.001); home blood pressure reported by patients as less than 140/90 mm Hg versus 140/90 mm Hg or greater or no recorded home blood pressure (18% vs. 52%; P < 0.001); provider systolic blood pressure goal greater than 130 mm Hg versus 130 mm Hg or less (33% vs. 52%; P = 0.002); discussion of conditions unrelated to hypertension and diabetes versus no discussion (44% vs. 55%; P = 0.008); and discussion of medication issues versus no discussion (23% vs. 52%; P < 0.001).
Providers knew that the study pertained to diabetes and hypertension, and treatment change was assessed for 1 visit per patient. Approximately 50% of diabetic patients presenting with a substantially elevated triage blood pressure received treatment change at the visit. Clinical uncertainty about the true blood pressure value was a prominent reason that providers did not intensify therapy." [Accessed December 4, 2009]. Available at: http://www.annals.org/content/148/10/717.abstract.

Margaret Sullivan Pepe. The Statistical Evaluation of Medical Tests for Classification and Prediction (Oxford Statistical Science Series). Excerpt from the back cover: "The use of clinical and laboratory information to detect conditions and predict patient outcomes is a mainstay of medical practice. This book describes the statistical concepts and techniques for evaluating the accuracy of medical tests. Main topics include: estimation and comparison of measures of accuracy, including receiver operating characteristic curves; regression frameworks for assessing factors that influence test accuracy and for comparing tests while adjusting for such factors; and sample size calculations and other issues pertinent to study design. Problems relating to missing and imperfect reference data are discussed in detail. Additional topics include: meta-analysis for summarizing the results of multiple studies of a test; the evaluation of markers for predicting event time data; and procedures for combining the results of multiple tests to improve classification. A variety of worked examples are provided. [This book] will be of interest to quantitative researchers and to practicing statisticians. The book also covers the theoretical foundations for statistical inference and will be of interest to academic statisticians."

Karin Velthove, Hubert Leufkens, Patrick Souverein, Rene Schweizer, Wouter van Solinge. Testing bias in clinical databases: methodological considerations. Emerging Themes in Epidemiology. 2010;7(1):2. Abstract: "BACKGROUND: Laboratory testing in clinical practice is never a random process. In this study we evaluated testing bias for neutrophil counts in clinical practice by using results from requested and non-requested hematological blood tests. METHODS: This study was conducted using data from the Utrecht Patient Oriented Database, a unique clinical database as it contains physician requested data, but also data that are not requested by the physician, but measured as result of requesting other hematological parameters. We identified adult patients, hospitalized in 2005 with at least two blood tests during admission, where requests for general blood profiles and specifically for neutrophil counts were contrasted in scenario analyses. Possible effect modifiers were diagnosis and glucocorticoid use. RESULTS: A total of 567 patients with requested neutrophil counts and 1,439 patients with non-requested neutrophil counts were analyzed. The absolute neutrophil count at admission differed with a mean of 7.4 × 10^9/l for requested counts and 8.3 × 10^9/l for non-requested counts (p-value <0.001). This difference could be explained for 83.2% by the occurrence of cardiovascular disease as underlying disease and for 4.5% by glucocorticoid use. CONCLUSION: Requests for neutrophil counts in clinical databases are associated with underlying disease and with cardiovascular disease in particular. The results from our study show the importance of evaluating testing bias in epidemiological studies obtaining data from clinical databases." [Accessed June 14, 2010]. Available at: http://www.ete-online.com/content/7/1/2.

University of Cambridge. Understanding Uncertainty. Excerpt: "This site is produced by the Winton programme for the public understanding of risk based in the Statistical Laboratory in the University of Cambridge. The aim is to help improve the way that uncertainty and risk are discussed in society, and show how probability and statistics can be both useful and entertaining! However we also acknowledge that uncertainty is not just a matter of working out numerical chances, and aim for an appropriate balance between qualitative and quantitative insights. The current team comprises David Spiegelhalter, Mike Pearson, Owen Smith, Arciris Garay-Arevalo and Ian Short, with contributions from Hauke Riesch, Owen Walker, Madeleine Cule and Hayley Jones. However we are always looking for people who would like to contribute material to this site, and you will get proper acknowledgement." [Accessed March 23, 2011]. Available at: http://understandinguncertainty.org/.

Lynne Gaffikin, John McGrath, Marc Arbyn, Paul Blumenthal. Visual inspection with acetic acid as a cervical cancer test: accuracy validated using latent class analysis. BMC Medical Research Methodology. 2007;7(1):36. Abstract: "BACKGROUND: The purpose of this study was to validate the accuracy of an alternative cervical cancer test - visual inspection with acetic acid (VIA) - by addressing possible imperfections in the gold standard through latent class analysis (LCA). The data were originally collected at peri-urban health clinics in Zimbabwe. METHODS: Conventional accuracy (sensitivity/specificity) estimates for VIA and two other screening tests using colposcopy/biopsy as the reference standard were compared to LCA estimates based on results from all four tests. For conventional analysis, negative colposcopy was accepted as a negative outcome when biopsy was not available as the reference standard. With LCA, local dependencies between tests were handled through adding direct effect parameters or additional latent classes to the model. RESULTS: Two models yielded good fit to the data, a 2-class model with two adjustments and a 3-class model with one adjustment. The definition of latent disease associated with the latter was more stringent, backed by three of the four tests. Under that model, sensitivity for VIA (abnormal+) was 0.74 compared to 0.78 with conventional analyses. Specificity was 0.639 versus 0.568, respectively. By contrast, the LCA-derived sensitivity for colposcopy/biopsy was 0.63. CONCLUSION: VIA sensitivity and specificity with the 3-class LCA model were within the range of published data and relatively consistent with conventional analyses, thus validating the original assessment of test accuracy. LCA probably yielded more likely estimates of the true accuracy than did conventional analysis with in-country colposcopy/biopsy as the reference standard. 
Colposcopy with biopsy can be problematic as a study reference standard and LCA offers the possibility of obtaining estimates adjusted for referent imperfections." [Accessed December 4, 2009]. Available at: http://www.biomedcentral.com/1471-2288/7/36.

All of the material above this paragraph is licensed under a Creative Commons Attribution 3.0 United States License. This page was written by Steve Simon and was last modified on 2017-06-15. The material below this paragraph links to my old website, StATS. Although I wrote all of the material listed below, my ex-employer, Children's Mercy Hospital, has claimed copyright ownership of this material. The brief excerpts shown here are included under the fair use provisions of U.S. copyright law.


29. Stats: ROC curve for an imperfect gold standard (March 12, 2008). Someone asked me about how to use an ROC curve if you have more than two categories. Apparently the gold standard that the researchers were using was known to be imperfect, so they wanted an intermediate category (possible disease).

28. Stats: Does prevalence affect sensitivity? (January 31, 2008). Dear Professor Mean, Does lowering the prevalence of a disease have an effect on sensitivity?


27. Stats: Postlude to my Dallas talk (November 11, 2007). I gave a talk this morning to the American College of Allergy, Asthma & Immunology. I documented my preparations for this talk on my webpages and wanted to share some thoughts I had during and after the talk.

26. Stats: Handout for diagnostic testing (November 6, 2007). I have been busy preparing a handout describing the basics of diagnostic testing (e.g., sensitivity and specificity), the medical issues associated with these tests (e.g., the difficulty in testing for a rare disease, the need to balance the costs of false positives and false negatives), and applications of the likelihood ratio. I also show how to use the likelihood ratio slide rule.

25. Stats: Continuing education questions for a talk on diagnostic tests (July 24, 2007). As part of my talk to the American College of Allergy, Asthma & Immunology, I have been asked to present two questions related to my topic (Use of Diagnostic Tests for Making Clinical Decisions). These questions would consist of a brief clinical stem followed by four choices on how to manage the situation. These will be presented prior to my talk and then afterwards to see how effective the training is.

24. Stats: Classic calculations for a diagnostic test (July 20, 2007). I created a table that illustrates many of the classic calculations for a diagnostic test.
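The table itself lives on the linked page, but a rough sketch of the sort of calculations it covers looks like this (the 2x2 counts below are made up purely for illustration):

```python
# Hypothetical 2x2 table for a diagnostic test (counts are made up):
#                 Disease+   Disease-
# Test positive       90         30
# Test negative       10        170
tp, fp, fn, tn = 90, 30, 10, 170

sensitivity = tp / (tp + fn)               # P(test+ | disease+)
specificity = tn / (tn + fp)               # P(test- | disease-)
ppv = tp / (tp + fp)                       # P(disease+ | test+)
npv = tn / (tn + fn)                       # P(disease- | test-)
lr_pos = sensitivity / (1 - specificity)   # likelihood ratio, positive test
lr_neg = (1 - sensitivity) / specificity   # likelihood ratio, negative test

print(f"Sens={sensitivity:.2f} Spec={specificity:.2f} "
      f"PPV={ppv:.2f} NPV={npv:.2f} LR+={lr_pos:.1f} LR-={lr_neg:.2f}")
```

Note that sensitivity and specificity condition on disease status (the columns) while the predictive values condition on the test result (the rows), which is why the predictive values, unlike sensitivity and specificity, shift with prevalence.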

23. Stats: Code for drawing new likelihood ratio slide rule (July 12, 2007). I have made some minor changes to my likelihood ratio slide rule. The original code was lost somewhere, so I wrote some new code and added documentation. I also changed the orientation of the slide rule so it can be held horizontally, and shaded the regions that need to be cut out or cut away.

22. Stats: Recommendations from Sackett et al for evaluating a diagnostic test (July 2, 2007). There is a lot of controversy about diagnostic testing, and I have mentioned some of these controversies in other weblog entries. I wanted to review what the experts say about diagnostic testing. The definitive resource for evaluating any medical controversy is Evidence-based Medicine: How to Practice and Teach EBM. David L. Sackett, W. Scott Richardson, William Rosenberg, R. Brian Haynes (1998). Edinburgh: Churchill Livingstone.

21. Stats: Use of diagnostic tests for making clinical decisions (June 15, 2007). I'm giving a talk for the American College of Allergy, Asthma, and Immunology with the title "Use of diagnostic tests for making clinical decisions." Here's an abstract of this talk.

20. Stats: Applying likelihood ratios in your head (June 1, 2007). Someone sent me a nice email complimenting my likelihood ratio slide rule. He/she also pointed out a simple way to apply likelihood ratios in your head.
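The mental shortcut builds on the odds form of Bayes theorem: convert the pre-test probability to odds, multiply by the likelihood ratio, and convert back. A minimal sketch of that arithmetic (the 20% pre-test probability and LR of 6 are made-up numbers):

```python
def post_test_probability(pre_test_prob, likelihood_ratio):
    """Apply a likelihood ratio using the odds form of Bayes theorem."""
    pre_odds = pre_test_prob / (1 - pre_test_prob)
    post_odds = pre_odds * likelihood_ratio
    return post_odds / (1 + post_odds)

# A 20% pre-test probability with a positive test (LR+ = 6):
# odds 0.25 -> odds 1.5 -> probability 0.6
print(post_test_probability(0.20, 6))
```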

19. Stats: Quantifying the ability of dreams to predict the future (April 10, 2007). Someone wrote to me about a diary they had kept for the past eight years about their dreams. About every other month or so, a dream of theirs came true. I was asked if I could quantify the likelihood of successful predictions. Assessing psychic phenomena is outside my area of expertise, but I offered a few general suggestions, partly because I thought that an analogy to diagnostic testing was interesting.

18. Stats: What makes a good diagnostic test? (April 6, 2007). I've been invited to give a talk at the annual meeting of the American College of Allergy, Asthma & Immunology. The tentative title of the talk is "What makes a good diagnostic test?" It will be part of a plenary session and I'll be followed by two speakers debating the merits of two particular diagnostic tests. I don't have a lot of details at this time, but as I develop my talk, I'll put details here on this weblog.


17. Stats: Incorporating risk factors into diagnostic test calculations (November 9, 2006). A contributor to the Evidence-Based Health email discussion group (PK) raised an interesting question about how to incorporate information about risk factors when applying the results of a diagnostic test. When you are estimating a pre-test probability for a diagnostic test, you need to take three steps: (1) find an estimate of the prevalence of the disease in the general population, (2) modify this estimate based on characteristics of your particular practice, and (3) further modify this estimate based on characteristics of the individual patient who is currently sitting in front of you.

16. Stats: Mathematical derivation of the odds form of Bayes theorem (October 16, 2006). I had included some rather technical details on my web page about likelihood ratios, but I thought it would be best to move it to a separate page.
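The derivation is on the linked page; in brief, dividing Bayes theorem for disease by the same expression for no disease cancels the P(T+) denominator, leaving the odds form:

```latex
% Odds form of Bayes theorem: post-test odds equal the
% likelihood ratio times the pre-test odds.
\[
\underbrace{\frac{P(D+ \mid T+)}{P(D- \mid T+)}}_{\text{post-test odds}}
= \underbrace{\frac{P(T+ \mid D+)}{P(T+ \mid D-)}}_{\text{likelihood ratio}}
\times
\underbrace{\frac{P(D+)}{P(D-)}}_{\text{pre-test odds}}
\]
```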

15. Stats: Calculations involving diagnostic tests using open source abstracts (October 5, 2006). I spent a few hours reviewing 200+ abstracts published in BiomedCentral that had the words "sensitivity" and "specificity" in the title. There were four which had enough information in the abstract to be used as teaching examples on how to calculate sensitivity, specificity, positive predictive value, and/or negative predictive value.

14. Stats: A novel diagnostic test (January 26, 2006). A recently published article on diagnosing cancer got a lot of press. The article, Diagnostic Accuracy of Canine Scent Detection in Early- and Late-Stage Lung and Breast Cancers. McCulloch M, Jezierski T, Broffman M, Hubbard A, Turner K, Janecki T. Integrative Cancer Therapies 2006; 5(1): 1-10, noted that canines have an unusually sensitive sense of smell and might be able to diagnose cancer by sniffing breath samples from human patients. This is rather intriguing, since dogs have already been trained to locate explosives, cadavers, drugs, and so forth.


13. Stats: An error slips through the peer review process (September 19, 2005). A group of residents wanted me to look at an article because they were confused about the calculation of the likelihood ratio. The numbers that they got were quite different from those in the publication. It turns out that they were calculating things correctly, and did not realize that the paper had several serious errors in some of the more fundamental calculations of sensitivity and specificity.

12. Stats: Likelihood ratio--extra information (August 3, 2005). In a meta-analysis of studies of diagnosing anemia (Guyatt 1992, JGIM 7(2): 145-53), serum ferritin was found to be the most effective test. The weblog entry shows the results of this test.

11. Stats: The costs of a false positive test (March 1, 2005). The New York Times had an excellent article on newborn screening tests, Panel to Advise Testing Babies for 29 Diseases. Kolata G. The New York Times, February 21, 2005. Unfortunately, this article is no longer available online, but it discussed a recent push to standardize and expand the screening tests for newborns to include 29 different diseases.

10. Stats: Spectrum Bias (January 4, 2005). I tried to start a page on diagnostic tests a while back, but have not had the time to fully develop it. One of the important issues for diagnostic tests is spectrum bias. The sensitivity and specificity of a diagnostic test can depend on who exactly is being tested. Think of disease as a range of possibilities from slight to moderate to extreme. If only a portion of the disease range is included, you may get an incorrect impression of how well a diagnostic test works. This is known as spectrum bias.


9. Stats: Unnecessary diagnostic tests (October 25, 2004). You would think that you can never have enough information about your health. Barring financial considerations, the more testing the better. That actually is not true. In some situations, too many diagnostic tests are being run, and it hurts rather than helps the patient. American Medical News has an article about this, Lab tests go under a critical microscope: Experts point out that good tests used badly can lead to bad medicine. Victoria Stagg Elliott. Nov. 1, 2004. www.ama-assn.org/amednews/2004/11/01/hlsd1101.htm. They offer several good examples.

8. Stats: Full-Body Computed Tomography Screening (September 6, 2004). Full body scans represent a good example of the conflicting considerations when you need to evaluate a screening test. A full body scan uses a CT (Computerized Tomography) scan to examine the inside of your body. These full body scans are heavily advertised as a way to detect physiologic abnormalities that might provide an early warning of cancer, heart disease, or other illnesses. Many organizations, including the U.S. Food and Drug Administration, strongly discourage the use of full body scans in healthy adults with no obvious symptoms of disease.

7. Stats: Unbalanced sample sizes for evaluating a diagnostic test (August 5, 2004). I get a lot of questions about unbalanced sample sizes. Quite often the mechanics of the research protocol make it easier to find a lot of patients in one group and only a few in another group. For example, someone is evaluating a diagnostic test and notes that only 16% of the patients in the study will actually have the disease being tested for. Will this cause any bias, he wonders? Any loss in precision? You will lose some precision, but there is no bias of any kind.

6. Stats: Evaluating the AUC for an ROC curve (July 27, 2004). Someone asked me where I got the following guidance for Area Under the Curve (AUC) for a Receiver Operating Characteristic (ROC) curve: 0.50 to 0.75 = fair, 0.75 to 0.92 = good, 0.92 to 0.97 = very good, 0.97 to 1.00 = excellent. I cannot find where I got these numbers. It must be a sign of senility on my part.
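As a reminder of what the AUC measures: it equals the probability that a randomly chosen diseased subject scores higher than a randomly chosen healthy subject (the Mann-Whitney interpretation), with ties counted as half. A small sketch with made-up scores:

```python
def auc(diseased_scores, healthy_scores):
    """AUC as the probability that a randomly chosen diseased subject
    scores higher than a randomly chosen healthy one (ties count 1/2)."""
    pairs = [(d, h) for d in diseased_scores for h in healthy_scores]
    wins = sum(1.0 if d > h else 0.5 if d == h else 0.0 for d, h in pairs)
    return wins / len(pairs)

# Made-up test scores: the diseased subjects win 13 of the 16 pairings.
print(auc([3, 5, 7, 9], [1, 2, 4, 6]))  # -> 0.8125
```

Under the guidance quoted above, an AUC of 0.81 would land in the "good" range.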

5. Stats: Pap smears for women without a cervix (June 24, 2004). In the most recent issue of JAMA is an article by Sirovich and Welch, Cervical Cancer Screening Among Women Without a Cervix, that estimates almost 10 million women in the United States have received a pap smear unnecessarily because they have had a full hysterectomy and no longer have a cervix. For women who have had only a partial hysterectomy or where the hysterectomy was done for cervical neoplasia, regular pap smears are recommended. For the other women, though, this is an unnecessary test, because the pap smear is trying to detect cancer in an organ that the woman no longer has.

4. Stats: Prostate Specific Antigen testing (May 31, 2004). A recent report in the New England Journal of Medicine highlights the continuing controversy over Prostate-Specific Antigen (PSA) testing. This controversy is interesting to me because it highlights the uncertain nature of medical research. Keep in mind that I am not a doctor (read my disclaimer) and if you are confronting this issue with regard to your own health, please discuss it with your doctor. PSA is a test commonly used to detect prostate cancer, and any value larger than 4.0 ng per milliliter is considered by some as cause for additional testing. The article examines the prevalence of prostate cancer among men in the control arm of a large randomized prevention trial. Of the 9,459 men in the trial, 2,950 had measured PSA that never exceeded 4.0, and yet 15% of these men had prostate cancer confirmed by biopsy.


3. Stats: Likelihood ratio slide rule (October 24, 2002). The use of likelihood ratios requires a bit of tedious calculations. I have developed a simple slide rule that will do likelihood ratio calculations for you.


2. Stats: Sample size for a diagnostic study (September 3, 1999). Dear Professor Mean, How big should a study of a diagnostic test be? I want to estimate a sample size for the sensitivity and specificity of a test. I guess confidence intervals would address this, but is there a calculation analogous to a power analysis that would apply to figure out the size of the groups beforehand? -- Jovial John
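The full answer is on the linked page, but the usual approach does work through confidence interval widths. As a rough sketch of one common calculation (a simple Wald interval for sensitivity, inflated for prevalence; the 90% sensitivity, plus-or-minus 5% precision, and 10% prevalence below are made-up inputs):

```python
import math

def n_for_sensitivity(expected_sens, half_width, prevalence, z=1.96):
    """Crude total sample size to estimate sensitivity to within a given
    confidence interval half-width (Wald approximation, 95% default)."""
    # number of diseased subjects needed for the CI on sensitivity
    n_diseased = (z**2 * expected_sens * (1 - expected_sens)) / half_width**2
    # inflate because only a fraction of enrolled subjects are diseased
    return math.ceil(n_diseased / prevalence)

# e.g. sensitivity around 0.90 estimated to within +/-0.05, 10% prevalence
print(n_for_sensitivity(0.90, 0.05, 0.10))  # -> 1383
```

The prevalence inflation is the painful part: a rare disease can push the total enrollment into the thousands even when only a hundred or so diseased subjects are needed.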

1. Stats: ROC curve (August 18, 1999). Dear Professor Mean: I was at a meeting in Belgium and the buzz statistic was ROC analysis. I think it stands for Receiver Operating Characteristic curve. It seems to be used for predictive values. I seemed to be a lone ranger in not understanding, as several presentations claimed "by this curve you can see this is good or bad" and the curves didn't look very different to me. Do you have a simple explanation of ROC curves?


