P.Mean: Defending Bonferroni (created 2008-10-18).

I had someone argue with some advice that I gave, which is a good thing. I had recommended the use of a Bonferroni comparison, and he argued that Bonferroni should not be used when making "independent" comparisons.

It is like comparing the change in gas prices in different states: I compare the change in gas prices in state B versus state A, state B versus state C, and state B versus state D; there is no way in my mind that the comparisons can have an effect on each other, since these gas prices are independent variables (they are different phenomena). I only want to show that gas prices in state B have not changed compared to the price changes in the other three, taken one on one, not versus all three together. If I prove that B is different from A, C, and D, it is self-evident it is different from all of them. I would humbly beg for your exact specific opinion on this idea.

By the way, such deferential language (humbly begging my opinion) is always a good strategy. So this is what I wrote back.

Here's my opinion. If there are 50 states and you compare each state with each other state, then first of all the comparisons are not independent, since the comparison of CA to OR is not independent of the comparison of CA to NV.

But even if the tests were independent, there are more than a thousand possible ways you can compare two states from among 50 (1,275 if you want to be precise).

Let's assume that there is no difference among the prices in the 50 states, except for a bit of sampling error. What's the probability that you would find at least one false positive among the 1,275 comparisons? It turns out to be greater than 99.99%. The expected number of false positives would be 64.

Now you and I know that there are some real differences among the 50 states. Suppose that ten percent of the possible comparisons reflect real differences and the remaining comparisons represent differences that could easily be accounted for by sampling error. The probability of one or more false positives is still greater than 99.99%, and the expected number of false positives would be 57. So you would have 127 true positives and 57 false positives and no way of determining which was which.
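The second scenario works the same way. This sketch again assumes a 0.05 per-test level, and it also assumes every real difference is detected (perfect power), which is how the count of 127 true positives comes about:

```python
m = 1275                     # all pairwise comparisons among 50 states
real = 127                   # roughly ten percent assumed to be real differences
null_count = m - real        # 1,148 comparisons where only sampling error is at work
alpha = 0.05                 # per-test significance level (assumption)

expected_fp = alpha * null_count          # about 57 expected false positives
p_any_fp = 1 - (1 - alpha) ** null_count  # still greater than 99.99%

print(expected_fp)  # 57.4, about 57
print(p_any_fp)     # greater than 99.99%
```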

Now the math is nowhere near as troublesome for three or four groups, but the principle is the same. Failure to use Bonferroni leads to an increase in the probability and the number of false positives. That's true whether the comparisons are independent or not.
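For contrast, here is what the same arithmetic looks like after a Bonferroni correction, which divides the 0.05 family-wise level across all 1,275 tests. The expected number of false positives drops from about 64 to 0.05, and the chance of even one false positive stays below 0.05:

```python
m = 1275                                  # all pairwise comparisons among 50 states
alpha_family = 0.05                       # desired family-wise error rate

alpha_per_test = alpha_family / m         # Bonferroni: test each comparison at 0.05/1275
expected_fp = m * alpha_per_test          # expected false positives across the whole family
p_any_fp = 1 - (1 - alpha_per_test) ** m  # P(at least one false positive), now under 0.05

print(alpha_per_test)  # about 0.0000392
print(expected_fp)     # 0.05
print(p_any_fp)        # just under 0.05
```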

Now if your comparisons are conceptually unrelated to one another (note the difference between this wording and calling them independent), don't use Bonferroni. If, for example, one comparison is of racial differences to identify discrimination patterns and a second comparison is of geographic regions to identify areas where additional funding for an intervention is needed, then these are conceptually unrelated.

So that's my opinion. For what it's worth, people smarter than me have a different opinion, and I won't be offended if you follow their advice instead. (Notice that I can be deferential as well.)

There are legitimate criticisms of Bonferroni. In particular, Bonferroni provides very strict control of Type I errors but does this at the expense of Type II errors. In some situations, the increase in Type II errors is too big a price to pay.