[The Monthly Mean] August 2010--What statistical details belong in the methods section of your research paper?

The Monthly Mean is a newsletter with articles about Statistics with occasional forays into research ethics and evidence based medicine. I try to keep the articles non-technical, as far as that is possible in Statistics. The newsletter also includes links to interesting articles and websites. There is a very bad joke in every newsletter as well as a bit of personal news about me and my family. If you are not yet subscribed to this newsletter, you can sign on at www.pmean.com/news.

Welcome to the Monthly Mean newsletter for August 2010. If you are having trouble reading this newsletter in your email system, please go to www.pmean.com/news/201008.html. If you no longer wish to receive this newsletter, there is a link to unsubscribe at the bottom of this email. Here's a list of topics.

1. What statistical details belong in the methods section of your research paper?

This is an early draft of a chapter for a book I am hoping to write. You can find details about my book proposal at

If you are writing the methods section of your research paper, you need to present enough information so that a reasonably intelligent and skilled person can accurately reproduce your work. If there were any protocol deviations, describe what you had planned to do and explain how and why you changed your plan.

There is a lot of stuff that goes into the methods section that I can't comment on, such as the settings used on your flow cytometer. There are, however, details of a statistical nature that I can offer some advice about.

Sampling approach: You should discuss how patients were selected for your research study. If your patients represent a random sample, describe your sampling frame (the list of people that you drew names from for your random sample) and explain how you selected from your sampling frame. Include details of how you created your random numbers for the sampling scheme.

If your sample was not random, but rather a convenience sample, explain the process by which you approached your research participants.

If treatments/placebos were assigned randomly, you need to explain this process. In particular, briefly describe the mechanism that you used to create the random assignments (e.g., "Random assignments were created using the RAND() function in Excel 2003.").
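To make this concrete, here is a minimal sketch of what such a mechanism might look like. Python is used purely for illustration (the newsletter mentions Excel's RAND() function; any software works), and the function name, arm labels, and seed are my own inventions:

```python
import random

def make_assignments(n_patients, arms=("treatment", "placebo"), seed=20100801):
    # A fixed, reported seed makes the randomization list reproducible
    # if the study is ever audited.
    rng = random.Random(seed)
    # Simple (unrestricted) randomization: each patient is assigned
    # independently. Block randomization would be used instead if you
    # need the arms to stay balanced throughout recruitment.
    return [rng.choice(arms) for _ in range(n_patients)]

assignments = make_assignments(8)
print(assignments)
```

Reporting the software, version, and seed is what allows a reader to reproduce the exact list.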

Was there blinding? Blinding is not always possible; when it is not, please state this and explain why. If a study was completely blinded, explain how you achieved this. If the study was partially blinded, explain who was blinded and how.

Even in randomized studies that can't be blinded, you can use concealed allocation. This is the process of hiding the randomization list from the patient and the doctor prior to assessing inclusion/exclusion criteria and offering informed consent. If the randomized assignment is known ahead of time, doctors might deliberately steer patients of a particular type into one arm of the study. Even if they don't do this, they may subconsciously apply the inclusion/exclusion criteria differently. The randomization list can be concealed through the use of sealed envelopes or (for a large multi-center trial) through the use of a special toll-free number that offers randomization information.

While the inclusion and exclusion criteria are not, strictly speaking, part of the statistical sampling approach, I mention them because the inclusion and exclusion criteria are so critical for gauging the ability to extrapolate your results to other patient populations. You need to describe your inclusion criteria: what characteristics did your patients have to have in order to qualify for your study? Alternatively, or in addition, describe your exclusion criteria: what characteristics would disqualify a patient from being in your study?

Some examples of characteristics that should be defined among the inclusion or exclusion criteria include

* demography (e.g., race, sex, age)
* geography
* time frame
* occupation
* care requirements
* diagnosis

There may be some implicit inclusion and exclusion criteria in your research. Your research may require someone to have a permanent home address, which excludes the homeless and migrant workers. These can be acknowledged here, but more commonly they are presented in the discussion section (under limitations of the research). The actual statistics on variables like demography (average age, proportion female) should normally be included in the results section and not here.

Sample size justification: If you used a formal power calculation or other similar method to justify your sample size, present it here. Having performed such a calculation prior to data collection greatly increases the credibility of your research.

Also document any changes to the sample size, either by terminating the study early or by adding extra patients beyond those planned for originally. Generally, any change to your sample size should be based on rules established prior to data collection, with the possible exception of early stopping because of unanticipated safety issues. Be honest here, especially about informal rules for early stopping. Changes to the planned sample size have broad implications for the credibility of your research, and failure to mention them will make it look like you are trying to hide an important limitation.

If you failed to meet your planned sample size because of problems in recruiting patients, discuss that here also.
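For readers who have not seen one, a common starting point for such a justification is the normal-approximation formula for comparing two means: n per group = 2(z_alpha + z_beta)^2 / d^2, where d is the standardized effect size. A minimal sketch (Python purely for illustration; the function name is my own, and the default z values correspond to two-sided alpha = 0.05 and 80% power):

```python
import math

def n_per_group(effect_size, z_alpha=1.959964, z_beta=0.841621):
    # Normal-approximation sample size per group for comparing two means.
    # Defaults: two-sided alpha = 0.05 (z = 1.96), 80% power (z = 0.84).
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size ** 2)

# A standardized effect size of 0.5 needs about 63 patients per group
# under this approximation (a t-based calculation gives a similar answer).
print(n_per_group(0.5))  # -> 63
```

Whatever formula you use, report the effect size, alpha, and power you assumed, so a reader can check the arithmetic.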

Statistical methods: You don't have to explain all of the analyses you conducted, but you do want to offer a general overview.

Some things you don't have to mention are supplementary analyses you did primarily as a quality check on the primary analyses. Here are some of the things that you should mention routinely:

* What were your primary endpoints? secondary endpoints?
* Did you transform any of your data?
* How did you handle dropouts and noncompliant patients?
* What statistical program (including version number and operating system) did you use?

Sometimes these details change from what was proposed in the original protocol. Changes in your protocol are a factor that can alter the credibility of your research. Be honest here. Much research has to be adapted to address unanticipated issues. The problem is that if you fail to document those changes, you may find yourself accused of fraud.

You don't need a reference for methods that are commonly used and are well understood by most readers. This would include t-tests, regression, ANOVA, and survival data analysis. If there are more complicated procedures, cite a reference that will help a reasonably competent statistician to follow what you did with your data. If you reference a book, please include a chapter number or page numbers so that your readers will not have to read through the entire book to find the details they are interested in.

2. Is Evidence-Based Medicine too rigid?

Someone was asking about criticisms of Evidence-Based Medicine (EBM) that the reliance on grading schemes and the hierarchy of evidence was too rigid. Their view was that EBM was providing some heuristics that could be adapted as needed. This is hard to respond to, but it is an important question. I view checklists and hierarchies as a necessary evil, and that sometimes they are applied too rigidly.

As an example, a recent article in the Skeptical Inquirer criticized case-control studies (Park 2010) and said they were analogous to election polls which SOMETIMES agree with the results of the election. Case-control studies are indeed a weak form of evidence, but when they produce an effect of strong magnitude and are associated with a plausible mechanism, they can provide convincing evidence. Case-control studies, for example, correctly identified the link between aspirin use and Reye's syndrome, a critical step in the prevention of this disease (Monto 1999).

This is an important point which we sometimes neglect to teach. No one study should be examined in isolation. It needs to be thought of in the whole context of knowledge of the problem. So replication is important, biological mechanisms are important, a dose response relationship is important, and so forth. When these things are present, a case-control study can and should move higher on the hierarchy. When they are absent, a randomized trial should drop lower on the hierarchy.

Do practitioners of EBM look at the whole picture or do they rigidly stick to a hierarchy? That's something that could be studied, but it would be difficult to identify when someone was too rigid in applying a hierarchy versus appropriately discounting weak evidence. The best example of this was the fuss over eight randomized trials of mammography. When the best two trials were pooled, mammography did not look so good. When the (slightly?) flawed remaining studies were included, mammography looked much better. So was using only the two best trials being too rigid, or was it appropriate? I don't think there is a truly objective answer to that question. A nice summary of the controversy appears in Jackson 2002.

On my website, I include the references noted above and describe two big advantages of EBM, the transparency of EBM and its self-critical nature.
* www.pmean.com/10/EbmTooRigid.html

3. Quality checks for data entry

When you enter data into a computer, there is always a chance for introducing bad data because of typographical errors. You can proofread your data, of course, but do you really know how successful your proofreading was?

There are a couple of simple ways of estimating the total number of typographical errors during data entry. Both require a bit of extra help. The first approach is to have two people proofread your data independently. Count the number of errors caught by the first person, by the second person, and by both people. If there is little overlap between the errors caught by the two people, that is an indication that a lot more errors are out there to be discovered. Think of it this way. If each person found errors that the other person missed, then a third person is likely to find errors that were missed by the first two people.

You can quantify this with the following formula. If N1 is the number of errors found by the first person, and N2 is the number of errors found by the second person, and N3 is the number of errors found by both people, then the estimated total number of errors in the data set (detected and undetected) is

N1 * N2 / N3.

Suppose that the first proofreader found 15 errors and the second proofreader found 12 errors and there were only 3 errors in common. This is bad news. The estimated total number of errors would be

15 * 12 / 3 = 60.

So there are a huge number of undetected errors: only 15 + 12 - 3 = 24 of the estimated 60 have been found.

If instead there were 9 errors in common, the estimated number of total errors would be

15 * 12 / 9 = 20,

which means there are probably about 2 more errors lurking undetected (you've already detected 15 + 12 - 9 = 18 errors). That still might be a problem but it is clearly less serious than the first setting.

This method assumes that the proofreaders work independently, which is reasonable, but also assumes that all errors have the same probability of being detected, which is often an unreasonable assumption. Still, this approach provides a ballpark estimate, and it is much better than having no estimate of the number of undetected errors.
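This formula is the capture-recapture (Lincoln-Petersen) estimator applied to proofreading. A short sketch, reproducing the two scenarios above (Python purely for illustration; the function name is my own):

```python
def estimated_total_errors(n1, n2, n_both):
    # Capture-recapture estimate of all errors, found and unfound:
    # n1 and n2 are each proofreader's counts, n_both is the overlap.
    return n1 * n2 / n_both

# The two scenarios from the text: 3 errors in common vs. 9 in common.
print(estimated_total_errors(15, 12, 3))  # -> 60.0 (24 found, ~36 undetected)
print(estimated_total_errors(15, 12, 9))  # -> 20.0 (18 found, ~2 undetected)
```

Note how sensitive the estimate is to the overlap: the counts for each proofreader are identical in both scenarios, yet the totals differ threefold.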

Another approach is to have someone deliberately introduce some errors into your data set after you've typed it all in. Then you proofread your data and find out how many errors you detected, both among the deliberately introduced errors and the errors that you were responsible for. If you notice most of the deliberately introduced errors, then you probably have also caught most of the errors that you made yourself during data entry.

Again, you can quantify this. If N1 is the number of errors introduced, N2 is the number of errors that you noticed among the introduced errors and N3 is the number of errors that you found outside of the deliberately introduced errors, then the estimated total number of errors (both detected and undetected, excluding the deliberately introduced errors) would be

(N1 / N2) * N3

So if your colleague introduced 10 errors and you only caught 5 of them as well as 20 of your own, then the estimated number of errors would be

(10 / 5) * 20 =  40.

This is quite intuitive. If you only caught half of the deliberate errors, you probably only caught about half of your own typographical errors.

Suppose instead that you caught 8 of the deliberately introduced errors, then the total number of errors in the data set would be

(10 / 8) * 20 = 25.

That's still 5 undetected errors, but much better than the 20 undetected errors in the first scenario.

Again, there are some implicit assumptions here, the key one being that the deliberately introduced errors are about as difficult to spot as your own errors. This is never going to be perfectly satisfied but as long as the person introducing the errors doesn't make blatantly obvious errors like changing 15 to 3,415,238, you should be okay.
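The second estimator can be sketched the same way, reproducing the two examples above (Python purely for illustration; the function name is my own):

```python
def estimated_own_errors(n_seeded, n_seeded_found, n_own_found):
    # Scale up your own error count by the inverse of the fraction of
    # deliberately seeded errors that you managed to catch.
    return (n_seeded / n_seeded_found) * n_own_found

print(estimated_own_errors(10, 5, 20))  # -> 40.0 (caught half the seeds)
print(estimated_own_errors(10, 8, 20))  # -> 25.0 (caught 8 of 10 seeds)
```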

4. Monthly Mean Article (peer reviewed): Peers nip misconduct in the bud

Gerald P. Koocher, Patricia Keith-Spiegel. Peers nip misconduct in the bud. Nature. 2010;466(7305):438-440. Excerpt: "What do researchers do when they suspect a colleague of cutting corners, not declaring a conflict of interest, neglecting proper oversight of research assistants or 'cooking' data? In one study, almost all said that they would personally intervene if they viewed an act as unethical, especially if it seemed minor and the offender had no history of infractions." [Accessed August 14, 2010]. Available at: http://www.ethicsresearch.com/images/Nature_Opinion_-_Koocher_Keith-Spiegel.pdf.

5. Monthly Mean Article (popular press): Factory Efficiency Comes to the Hospital

Julie Weed. Factory Efficiency Comes to the Hospital. The New York Times. 2010. Excerpt: "The program, called 'continuous performance improvement,' or C.P.I., examines every aspect of patients' stays at the hospital, from the time they arrive in the parking lot until they are discharged, to see what could work better for them and their families. Last year, amid rising health care expenses nationally, C.P.I. helped cut Seattle Children's costs per patient by 3.7 percent, for a total savings of $23 million, Mr. Hagan says. And as patient demand has grown in the last six years, he estimates that the hospital avoided spending $180 million on capital projects by using its facilities more efficiently. It served 38,000 patients last year, up from 27,000 in 2004, without expansion or adding beds." [Accessed July 13, 2010]. Available at: http://www.nytimes.com/2010/07/11/business/11seattle.html.

6. Monthly Mean Book: True Enough: Learning to Live in a Post-Fact Society

True Enough: Learning to Live in a Post-Fact Society by Farhad Manjoo, ISBN: 978-0-470-05010-1. Here's the description on the publisher's website: "Why has punditry lately overtaken news? Why do lies seem to linger so long in the cultural subconscious even after they've been thoroughly discredited? And why, when more people than ever before are documenting the truth with laptops and digital cameras, does fact-free spin and propaganda seem to work so well? True Enough explores leading controversies of national politics, foreign affairs, science, and business, explaining how Americans have begun to organize themselves into echo chambers that harbor diametrically different facts, not merely opinions, from those of the larger culture." www.wiley.com/WileyCDA/WileyTitle/productCd-0470050101.html.

But this book is more than just an exploration of politics and the news. It covers research that shows how all of us seek out information to confirm what we already believe and how we selectively interpret information we are presented to discredit any information that makes us uncomfortable. This is especially important for those who seek to use critical appraisal tools to answer key questions about science and medicine.

7. Monthly Mean Definition: What are Type I and Type II errors?

In many research settings, you wish to choose between two competing hypotheses. By tradition, the first hypothesis, the null hypothesis, represents the hypothesis of no effect or no change. So a null hypothesis involving drug testing might be that the efficacy of a new drug is equal to the efficacy of a placebo.

A Type I error is rejecting the null hypothesis when the null hypothesis is true. Since the null hypothesis is traditionally a negative hypothesis and the alternative hypothesis is a positive hypothesis, a Type I error can be thought of as a false positive finding.

A Type II error is accepting the null hypothesis when the alternative hypothesis is true. A Type II error can be thought of as a false negative finding.

Consider the drug testing example. A Type I error would be allowing an ineffective drug onto the market. A Type II error would be keeping an effective drug off the market.

Statisticians try to plan a research study so that the probability of a Type I error is small and the probability of a Type II error is also small. You should give some consideration, though, to whether a Type I error (false positive) or a Type II error (false negative) is more serious.
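A small simulation can make the two error rates concrete. The sketch below (Python purely for illustration; the sample size, effect size, and number of replications are arbitrary choices of mine) uses a z-test with known variance so that no statistics library is needed:

```python
import math
import random

def two_sample_z(xs, ys):
    # z statistic for a difference in means, with known variance 1 in
    # each group: var(mean difference) = 1/n + 1/n = 2/n.
    n = len(xs)
    return (sum(xs) / n - sum(ys) / n) / math.sqrt(2.0 / n)

rng = random.Random(1)
n, reps, crit = 30, 2000, 1.96
type1 = type2 = 0
for _ in range(reps):
    placebo = [rng.gauss(0, 1) for _ in range(n)]
    null_drug = [rng.gauss(0, 1) for _ in range(n)]    # H0 true: no effect
    real_drug = [rng.gauss(0.8, 1) for _ in range(n)]  # alternative is true
    if abs(two_sample_z(null_drug, placebo)) > crit:
        type1 += 1  # false positive: rejected a true null
    if abs(two_sample_z(real_drug, placebo)) <= crit:
        type2 += 1  # false negative: missed a real effect
print(type1 / reps)  # estimated Type I error rate, near alpha = 0.05
print(type2 / reps)  # estimated Type II error rate for this design
```

The Type I rate hovers near the 0.05 cutoff by construction; the Type II rate depends on the sample size and the size of the real effect, which is exactly why the power calculations discussed earlier matter.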

8. Monthly Mean Quote: If you're a politician, admitting you're wrong is a weakness...

"If you're a politician, admitting you're wrong is a weakness, but if you're an engineer, you essentially want to be wrong half the time. If you do experiments and you're always right, then you aren't getting enough information out of those experiments. You want your experiment to be like the flip of a coin: You have no idea if it is going to come up heads or tails. You want to not know what the results are going to be." Peter Norvig, Director of Research at Google, as quoted at http://www.slate.com/blogs/blogs/thewrongstuff/archive/2010/08/03/error-message-google-research-director-peter-norvig-on-being-wrong.aspx.

9. Monthly Mean Unsung Hero Award: The R Foundation

The development of the R programming language has revolutionized the practice of Statistics. Since R is free, it is used in a lot of settings where the cost of a commercial package would be prohibitive. Since R is open source, anyone with the appropriate expertise can look at the code to see if it is well constructed. Open source also makes it easy for people to customize and improve R for their own needs. Finally, since it is so easy to add libraries of specialized routines in R, researchers can easily disseminate new statistical methods to those who need them.

R could not have been developed without the support of the R Foundation. Here is the description that the R Foundation provides of itself: The R Foundation is a not for profit organization working in the public interest. It has been founded by the members of the R Development Core Team in order to
* Provide support for the R project and other innovations in statistical computing. We believe that R has become a mature and valuable tool and we would like to ensure its continued development and the development of future innovations in software for statistical and computational research.
* Provide a reference point for individuals, institutions or commercial enterprises that want to support or interact with the R development community.

10. Monthly Mean Website: Stopping stepwise: Why stepwise and similar selection methods are bad, and what you should use

Flom PL, Cassell DL. Stopping stepwise: Why stepwise and similar selection methods are bad, and what you should use. Excerpt: "A common problem in regression analysis is that of variable selection. Often, you have a large number of potential independent variables, and wish to select among them, perhaps to create a �best� model. One common method of dealing with this problem is some form of automated procedure, such as forward, backward, or stepwise selection." Available at: http://www.nesug.org/proceedings/nesug07/sa/sa07.pdf.

11. Nick News: Nicholas goes whale watching

Cathy, Nicholas, and I went on an Alaska cruise in late June. In Juneau, we went on a whale watching boat ride. Here's the picture of a humpback whale diving.

We also saw sea lions and bald eagles. For some more pictures, go to

12. Very bad joke: There are 10 types of programmers...

There are 10 types of programmers in the world: those who understand binary numbers and those who do not understand binary numbers. As quoted at http://stackoverflow.com/questions/17512/computer-language-puns-and-jokes.

13. Tell me what you think.

How did you like this newsletter? I have three short open ended questions at

* https://app.icontact.com/icp/sub/survey/start?sid=6378&cid=338122

You can also provide feedback by responding to this email. My three questions are:

1. What was the most important thing that you learned in this newsletter?
2. What was the one thing that you found confusing or difficult to follow?
3. What other topics would you like to see covered in a future newsletter?

Three people provided feedback to the last newsletter. Two people liked the material about Structural Equations Modeling. One person challenged my approach to analyzing data with a t-test because I ignored the problems that non-normality could cause. There's not a clear consensus on how best to handle non-normality, but my major beef was with a different test, the test that checks the assumption of equality of variances. I'll try to explain in greater detail why I dislike this test in a later newsletter. One person suggested that I explain the special issues associated with equivalence and non-inferiority trials. This is an excellent suggestion. Another asked for examples of R scripts to solve certain problems. That might be a bit advanced for many of the readers of the newsletter, but I could always link to my website for people who would like to see this.

14. Upcoming statistics webinars

I offer regular webinars (web seminars) for free as a service to the research community and to build up a bit of good will for my independent consulting business.

Data entry and data management issues with examples in IBM SPSS. Tuesday, August 24, 11am CDT. Abstract: This training class will give you a general introduction to data management using IBM SPSS software. This class is useful for anyone who needs to enter or analyze research data. There are three steps that will help you get started with data entry for a research project. First, arrange your data in a rectangular format (one and only one number in each intersection of every row and column). Second, create a name for each column of data and provide documentation on this column such as units of measurement. Third, create codes for categorical data and for missing values. This class will show examples of data entry including the tricky issues associated with data entry of a two by two table and entry of dates.

What is a p-value? Tuesday, September 21, 11am-noon CDT. Abstract: The p-value is the fundamental tool used in most inferential data analyses. It is possibly the most commonly reported statistic in the medical literature. Unfortunately, p-values are subject to frequent misinterpretations. In this presentation, you will learn the proper interpretation of p-values, and the common abuses and misconceptions about this statistic.

What is a confidence interval? Wednesday, September 22, 11am-noon CDT. Abstract: A confidence interval provides information about the uncertainty associated with a statistical estimate. It is a vital piece of information for assessing whether a clinically important change has occurred and if the sample size in the study was sufficiently large. In this presentation, you will learn how to interpret confidence intervals and identify common abuses and misconceptions.

A gentle introduction to Bayesian inference. Thursday, September 23, 11am-noon CDT. Abstract: Bayesian methods provide an alternative to traditional statistical approaches that offers simpler interpretations and greater flexibility. In this presentation, you will see a simple application of Bayesian statistics through the specification of a prior distribution, the computation of likelihoods for the collected data, and the combination of these two factors to produce a posterior distribution.

Note that the three webinars in September are all closely related, but you do not need to attend all three. Each one stands on its own and does not require the others as a prerequisite.

To sign up for any of these, send me an email with the date of the webinar in the title line (e.g., "August 24 webinar"). For further information, go to

* www.pmean.com/webinars

and there is a fan page for The Monthly Mean

I usually put technical stuff on the Monthly Mean fan page and personal stuff on my page, but there's a bit of overlap.