The futility of small sample sizes for evaluating a binary outcome (created 2010-06-16).

I'm helping out with a project that involves a non-randomized comparison of two groups of patients. One group gets a particular anesthetic drug and the other group does not. The researcher wants to compare rates of hypotension, respiratory depression, apnea, and hypoxia. I suggested using continuous outcomes like O2 saturation levels rather than discrete events like hypoxia, but for a variety of reasons, they cannot use continuous outcomes. Their original goal was to collect data on about 20 patients in each group.

I warned them about the rule of 50, which says that if you want to have reasonable power, you need to observe approximately 25 to 50 events in each group. So how often do you see side effects like these? Well, some of them occur in about 1 out of every 20 patients and some in about 1 out of every 100 patients. If you do the math, with 20 patients per group you are likely to see one event or fewer in each group.
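
As a quick sanity check on that arithmetic, here is a short Python sketch. The 1-in-20 and 1-in-100 rates are just the rough figures mentioned above, not data from the project.

```python
from scipy.stats import binom

n = 20  # planned patients per group

# Rough event rates mentioned above: some side effects occur in about
# 1 of every 20 patients, others in about 1 of every 100.
for p in (1 / 20, 1 / 100):
    expected = n * p
    # Probability of observing one event or fewer in a group of 20
    prob_at_most_one = binom.cdf(1, n, p)
    print(f"rate {p:.2%}: expect {expected:.1f} events, "
          f"P(0 or 1 events) = {prob_at_most_one:.2f}")
```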

Now, the rule of 50 is just a rule of thumb, so what would an actual power calculation tell us? I did the work, and it is not too encouraging. First, I upped the sample size to 50 per group. When I tried to use the Java power calculators developed by Russ Lenth, they refused to let me enter a proportion of 5%. That seemed odd at first, but it actually makes sense. The power calculations appear to be based on the normal approximation to the binomial distribution. If one of the groups has a proportion of 5%, then the expected number of events in that group would be 50 times 0.05, or 2.5, well below the usual cutoff of 5. That cutoff is a classic limit for applying the normal approximation to the binomial.
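
Here is a minimal Python check of that rule of thumb. The cutoff of 5 expected events is the conventional figure; the sample size and rate are the ones from the scenario above.

```python
# Conventional rule of thumb: the normal approximation to the binomial is
# shaky when the expected number of events, n * p, falls below about 5.
def normal_approx_ok(n, p, cutoff=5):
    """Return True if both expected events and expected non-events reach the cutoff."""
    return n * p >= cutoff and n * (1 - p) >= cutoff

n = 50    # patients per group
p = 0.05  # control event rate
print(n * p)                   # 2.5 expected events
print(normal_approx_ok(n, p))  # False: the approximation is not trustworthy here
```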

Now you could use a different test, like Fisher's Exact test, and calculate power based on that rather than on the normal approximation to the binomial, but my heart wasn't in it. If you are trying to get approval for your research, you are already dealing with a potentially inadequate sample size from a power calculation, and now you are bumping up against another well-known limit (the rule that the expected count in any cell should never be less than 5), you are really just asking for trouble. Don't give your reviewers extra ammunition to shoot down your proposal.
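
If someone does want to look at Fisher's Exact test, its power is easy enough to approximate by simulation. Here is a rough sketch in Python; the 50 per group and the 5% versus 17.5% rates are illustrative placeholders, not numbers from the project, so plug in whatever scenario you care about.

```python
import numpy as np
from scipy.stats import fisher_exact

rng = np.random.default_rng(20100616)

def fisher_power(n, p_control, p_exposed, alpha=0.05, n_sims=10_000):
    """Estimate the power of Fisher's Exact test by simulating 2x2 tables."""
    rejections = 0
    for _ in range(n_sims):
        a = rng.binomial(n, p_control)  # events in the control group
        c = rng.binomial(n, p_exposed)  # events in the exposed group
        table = [[a, n - a], [c, n - c]]
        _, p_value = fisher_exact(table, alternative="two-sided")
        rejections += p_value < alpha
    return rejections / n_sims

# Illustrative rates only; the printed value estimates the power for this scenario.
print(fisher_power(n=50, p_control=0.05, p_exposed=0.175))
```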

So let's up the sample size to 100 so we can consider a control event rate of 5%. It turns out that a sample size of 100 per group will produce a test with 80% power if the rate in the exposed group were 17.5%. That's more than a tripling of the risk, and it's for a sample size that is already five times bigger than the researcher was hoping for. We're assuming the standard settings: an alpha level of 5% and a two-sided hypothesis.
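
For anyone who wants to reproduce this sort of calculation without the Java applets, here is a rough equivalent in Python using statsmodels. It relies on the arcsine-transformed effect size, so it won't match the Lenth calculator digit for digit, but it lands in the same neighborhood.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Scenario from the text: 100 patients per group, 5% control rate,
# 17.5% exposed rate, two-sided test at alpha = 0.05.
effect = proportion_effectsize(0.175, 0.05)  # Cohen's h
power = NormalIndPower().power(effect_size=effect, nobs1=100,
                               alpha=0.05, alternative="two-sided")
print(power)  # roughly 0.8 (the exact figure depends on the approximation used)
```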

Suppose that we did have a higher event rate in the control group, say 10%, which would allow us to use the normal approximation even with only 50 patients per group. Even then, the power calculation is equally bleak.

A sample size of 50 per group would provide 81% power if the control and exposed groups had rates of 10% and 33%, respectively. Again, this is more than a threefold increase in risk, and it is still for a sample size much bigger than the researcher had hoped for.
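
The same statsmodels sketch covers this scenario as well; again, the arcsine-based approximation will differ a little from the calculator I used, but it tells the same story.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# 50 patients per group, 10% control rate versus 33% exposed rate.
effect = proportion_effectsize(0.33, 0.10)
print(NormalIndPower().power(effect_size=effect, nobs1=50,
                             alpha=0.05, alternative="two-sided"))
```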

Now if the researcher really wants to look at Fisher's Exact test, I'll be glad to run those numbers. You can find a nice web calculator at

I'd be even happier if the researcher found a way to treat hypotension, respiratory depression, and so on as continuous outcomes, since those might achieve decent power even with sample sizes as small as 20 per group.
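
To give a sense of what a continuous outcome buys you, here is a quick statsmodels calculation of the smallest standardized difference (Cohen's d) that 20 patients per group can detect with 80% power. It is a generic two-sample t-test sketch, not a calculation tailored to O2 saturation or any other specific measure.

```python
from statsmodels.stats.power import TTestIndPower

# Smallest standardized mean difference detectable with 80% power,
# 20 patients per group, two-sided alpha of 0.05.
d = TTestIndPower().solve_power(effect_size=None, nobs1=20,
                                alpha=0.05, power=0.80,
                                alternative="two-sided")
print(d)  # roughly 0.9 standard deviations
```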

If there were any justice in the world, the reviewers would look at the setting and say, "Okay, I realize that logistics don't let you get a very big sample size. How about if we let you test your hypothesis just this once at an alpha level of 15% instead of 5%?" But that would never happen in a million years. Better to let the power hover around 20% than to let the alpha level move a millimeter away from 5%.
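
Just to put numbers on that trade-off, here is a hypothetical comparison for the original 20-per-group design, reusing the illustrative 5% versus 17.5% rates from above. The exact figures depend on the rates you assume, but the pattern is the point: dismal power at an alpha of 0.05 and better, though still modest, power at 0.15.

```python
from statsmodels.stats.power import NormalIndPower
from statsmodels.stats.proportion import proportion_effectsize

# Original plan: 20 patients per group; illustrative rates of 5% versus 17.5%.
effect = proportion_effectsize(0.175, 0.05)
for alpha in (0.05, 0.15):
    power = NormalIndPower().power(effect_size=effect, nobs1=20,
                                   alpha=alpha, alternative="two-sided")
    print(f"alpha = {alpha:.2f}: power = {power:.2f}")
```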