Controversy over stopping a study early (2004-11-24)

A while back, the IRB asked me to look into a randomized study where the interim report indicated a huge disparity between the two treatment arms. One arm of the study had almost all good outcomes and the other arm had almost all bad outcomes or at best no improvement. The sample size, though, was only 20 patients, and the protocol had no formal rule for stopping the study early. Even without such a rule, a careful analysis of the data revealed that there was little justification for continuing randomization when one arm of the study was clearly inferior.

The principal investigator and the IRB both agreed, so we stopped the study, wrote up the results, and published them. The PI wrote back to me a few months later with the following comments (loosely paraphrased to simplify the discussion and to preserve confidentiality).

I have now presented this paper to two separate groups of docs here at CMH and once nationally and I have heard a consistent theme of questioning regarding the discontinuation of the study. The most notable commentary came from a doctor who is also chair of a Data Safety and Monitoring Board – he was surprised that we didn’t enroll at least a third of the targeted 144 patients and questioned why we even looked once reaching the initial 20 patients.

He had a hard time arguing the statistics but asked how we knew that the next 20 patients wouldn’t have shown the opposite results. I think most people are having trouble accepting the findings based on the final enrollment of only 20 patients, the fact that we didn’t control for an important confounding variable, and the fact that the [inferior arm] in this study is a pretty deeply rooted therapy modality in many medical settings. It’s also unfortunate that [an important covariate] was worse for the [inferior arm] than for the [superior arm], which may have contributed to us not seeing much improvement in that group.

The paper has now been accepted for an oral presentation at another major medical meeting. I would like to talk over some of the above noted concerns as I try to figure out the next step for this project and as I make sure that I have the best answers in hand for rebuttal of the questions that are being raised.

Here's a summary of what I replied by email (again with some paraphrasing).

People will always be skeptical, so there's only so much you can do. Here's how I would argue this.

The comment "how do we know that the next 20 people wouldn't show the opposite result?" is more than a bit silly. If you flipped a coin 20 times and it came up heads 20 times, you'd suspect that the coin was loaded. But what they are saying is "why don't you flip it 20 more times, because you don't know, maybe it will come up tails the next 20 times." We have to assume that our universe shows some level of order and consistency. If we jump off a cliff 20 times and each time we break a leg, how do we know that we won't land softly and safely the next 20 times? Furthermore, that comment is one that can apply to any sample size. If we get a certain result with 200 patients, how do we know that the next 200 patients won't show the opposite results? If we get a certain result with 2,000 patients, how do we know that the next 2,000 patients won't show the opposite results?

Why stop at 20 rather than at a third or half of the patients? You can argue that this was driven by the IRB concerns, or that you thought a yearly review was appropriate. A big weakness of this is that you did not specify the criteria for stopping early in the protocol itself. That would have been nice, but you were seeing an all-or-nothing phenomenon where the worst patient in one arm is still better off than any patient in the other arm. You were just uncomfortable continuing the study in the face of such an extreme finding.
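
To put a rough number on how extreme an all-or-nothing split is, here is a minimal sketch in Python. It assumes, purely for illustration, that the 20 patients were split 10 and 10 (the actual allocation is not given here) and that there are no tied outcomes. The function name is my own. It asks how often random assignment alone would place every patient in one arm ahead of every patient in the other if the two treatments were truly equivalent.

```python
from math import comb


def complete_separation_probability(n_a: int, n_b: int) -> float:
    """Chance of complete separation between two arms under the null.

    If the treatments are equivalent, every way of choosing which n_a of
    the n_a + n_b patients land in arm A is equally likely. Exactly two of
    those choices give complete separation: arm A holds all the best
    outcomes, or arm A holds all the worst. Assumes no tied outcomes.
    """
    if n_a <= 0 or n_b <= 0:
        raise ValueError("both arms need at least one patient")
    return 2 / comb(n_a + n_b, n_a)


# Hypothetical 10/10 split; the real arm sizes are an assumption here.
p = complete_separation_probability(10, 10)
print(f"two-sided probability: {p:.2e} (about 1 in {round(1 / p):,})")
```

With a 10/10 split there are 184,756 possible allocations, so complete separation by chance would happen about once in 92,000 trials. This is the same logic as a permutation test, and it says nothing about the covariate imbalance, which needs a separate argument.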

An all-or-nothing finding starts to become convincing when the total sample size is 10 or so. Go back to the coin analogy. Who among even the most skeptical of your colleagues would believe that a coin was fair after seeing 10 straight heads come up?
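
The coin analogy is easy to check with a few lines of Python. This is only a sketch of the arithmetic for a fair coin, with a function name of my own choosing. It is not an analysis of the study data.

```python
from fractions import Fraction


def prob_all_heads(n_flips: int, p_heads: Fraction = Fraction(1, 2)) -> Fraction:
    """Exact probability that n_flips independent flips all come up heads."""
    if n_flips < 0:
        raise ValueError("n_flips must be non-negative")
    return p_heads ** n_flips


for n in (10, 20):
    p = prob_all_heads(n)
    print(f"{n} straight heads: 1 in {p.denominator:,}")
# prints:
# 10 straight heads: 1 in 1,024
# 20 straight heads: 1 in 1,048,576
```

Ten straight heads from a fair coin is already a roughly 1 in 1,000 event, which is why an all-or-nothing result becomes hard to dismiss at quite small sample sizes.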

Lack of control for and imbalance in [an important covariate] is indeed a problem. I think the magnitude of the difference seen here is so extreme that it is unlikely to be caused by [this covariate]. But that has to be a qualitative argument, because we don't have the proper data to test this formally.

When the first research on smoking and cancer came out, it was based on imperfect data, but the magnitude of the effect was so large that only a fool (or a tobacco company lawyer) would argue that this was caused by the imperfections in the data.

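The fact that the [inferior arm] is pretty deeply rooted is not a serious argument. Hormone replacement therapy for post-menopausal women was also a deeply rooted practice, and that did not stop a large randomized trial from finding that its risks outweighed its benefits.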

It would also help for you to review a good article that has argued that failure to stop some of these trials early has led to serious ethical lapses.

Safeguarding patients in clinical trials with high mortality rates. Bradley D. Freeman, Robert L. Danner, Steven M. Banks, Charles Natanson. Am J Respir Crit Care Med 2001: 164(2); 190-192.
[Full text] http://ajrccm.atsjournals.org/cgi/content/full/164/2/190
[PDF] http://ajrccm.atsjournals.org/cgi/reprint/164/2/190.pdf

This is not to say that the people raising these concerns are wrong. There is always some level of ambiguity in research and different people will draw differing conclusions from the same data. If people are really upset, they always have the option of replicating this research at their site.