P.Mean: What to say when any data analysis is pointless (created 2010-03-25)

Someone on the MEDSTATS email discussion group asked for help. They were trying to establish a normal range or reference interval for a set of observations involving gastric emptying. The sample size, 14, was much too small to produce reliable results, but it got worse than that. For one of the outcomes, the result was fourteen zeros. What can you do with such a data set? What can you say? That's a difficult question, and here is how I would approach such a problem.

First, let's quickly review why obtaining a reference interval from this data set is impossible. A reference interval (or a normal range) represents two percentiles from the distribution, typically the 2.5th and 97.5th percentiles. With 14 observations, the smallest observation could be thought of, possibly, as representing the 1/15 or 6.67th percentile. So any effort to estimate the 2.5th percentile would require an extrapolation beyond the range of the data. You MIGHT be able to do such an extrapolation if you add an assumption, such as that the data is normally distributed. Then some sort of interval like

$$\bar{x} \pm 1.96\,s$$

might be okay. Except that you can't get a good estimate of variation with 14 zeros, and the data itself stands in stark defiance of any attempt to characterize its distribution using any sort of bell-shaped curve.
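Here is a minimal Python sketch of the arithmetic in that argument. It assumes the i/(n+1) convention for the percentile rank of the i-th smallest observation, and it takes the normal-theory interval to be the sample mean plus or minus 1.96 sample standard deviations, which is my assumption about the usual form of such an interval. The variable names are just for illustration.

```python
import statistics

n = 14
observations = [0.0] * n  # the fourteen zeros described above

# Percentile rank of the i-th smallest observation, using the i/(n+1) convention.
lowest_rank = 1 / (n + 1)    # about 0.0667, well above the 0.025 we need
highest_rank = n / (n + 1)   # about 0.9333, well below the 0.975 we need

# Normal-theory interval (assumed form): mean +/- 1.96 standard deviations.
mean = statistics.mean(observations)
sd = statistics.stdev(observations)  # zero, because every value is identical
lower = mean - 1.96 * sd
upper = mean + 1.96 * sd

print(f"ranks covered by the data: {lowest_rank:.4f} to {highest_rank:.4f}")
print(f"normal-theory interval: ({lower}, {upper})")
```

The data only cover roughly the 6.67th through 93.33rd percentiles, and the interval collapses to a single point because the estimated standard deviation is zero.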

Several people said, quite accurately, that it is pointless to attempt to establish a normal range for a small data set that is all zeros. But I couldn't bring myself to say such a thing. I did make a joke (oh, you're looking for the "blood from a turnip" test), but I still tried to offer some help.

Maybe it's something in my genes, but when someone asks sincerely for help, I try to provide some sort of answer other than to tell them that their effort is pointless. I know that this can sometimes lead to problems, but I just have to try to be helpful if I possibly can.

So here's what I told them. If you can establish that the distribution places more than 97.5% of its mass at zero, then you have established zero as the single point representing the normal range. There is an informal rule of thumb (often called the rule of three) that states: if you observe zero events in a sample of size n, then 3/n represents an approximate upper 95% confidence bound on the probability that an event is observed.
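As a rough check on that rule of thumb, here is a small Python sketch comparing 3/n with the exact one-sided 95% upper bound for zero events in n independent trials, which is the largest p satisfying (1 - p)^n >= 0.05. The function names are mine and purely illustrative.

```python
def rule_of_three_upper(n):
    """Approximate one-sided 95% upper bound on the event probability
    after observing 0 events in n trials."""
    return 3 / n


def exact_upper(n, alpha=0.05):
    """Exact one-sided upper bound: the largest p with (1 - p)**n >= alpha."""
    return 1 - alpha ** (1 / n)


n = 14
print(f"rule of three: {rule_of_three_upper(n):.3f}")
print(f"exact bound:   {exact_upper(n):.3f}")
```

For n = 14 the two bounds come out at roughly 0.21 and 0.19, so the rule of thumb is slightly conservative but close enough for a back-of-the-envelope argument.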

Set 3/n equal to 0.025 and solve for n. It would take about 120 observations, all of them equal to zero, to establish with 95% confidence that there is at least a 97.5% probability that you will observe zero for this particular outcome. Another person answered the problem from a similar perspective by showing that the confidence interval for the probability of observing a non-zero response in a data set with 14 zeros was hopelessly wide.
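The sample size calculation can be sketched in a few lines of Python. The first part solves 3/n = 0.025 and, for comparison, finds the smallest n for which the exact one-sided bound drops to 0.025. The second part computes the upper limit of an exact (Clopper-Pearson) two-sided 95% interval for 14 zeros. I don't know which interval the other person actually used, so treat the Clopper-Pearson choice as my assumption.

```python
import math

target = 0.025  # allow at most a 2.5% chance of a non-zero result
alpha = 0.05    # one-sided 95% confidence

# Rule of three: solve 3/n = target for n.
n_rule_of_three = 3 / target

# Exact version: smallest n with (1 - target)**n <= alpha.
n_exact = math.ceil(math.log(alpha) / math.log(1 - target))


def two_sided_upper_zero_events(n, conf=0.95):
    """Upper limit of the exact (Clopper-Pearson) two-sided interval
    when 0 of n observations are events. The lower limit is 0."""
    return 1 - ((1 - conf) / 2) ** (1 / n)


upper_14 = two_sided_upper_zero_events(14)

print(f"rule of three n: {n_rule_of_three:.0f}")
print(f"exact n:         {n_exact}")
print(f"95% interval for 14 zeros: 0 to {upper_14:.3f}")
```

The exact calculation gives 119 rather than 120, which agrees with "about 120." With only 14 zeros, the upper limit is about 0.23, nowhere near the 0.025 you would need.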

Both answers were questioned by someone as begging the question. The objection was that you are offering a confidence interval for a proportion, and that doesn't really produce a normal range in any way that is truly helpful.

I argued back that I had offered the same comment as everyone else, but with a bit of sugarcoating. My answer helped them to understand WHY their current situation is untenable: their sample size is too small by an order of magnitude. From a practical standpoint, they are not going to come back next week with 120 observations, so if my method isn't perfectly kosher, there is no serious chance that I might have to implement it when they produce the larger sample size. But it also places a memory somewhere in the back of their head that establishing a normal range sometimes takes over 100 data points. And that's a good thing.

Now if several people who are all smarter than me say that any effort spent on a data set with 14 zeros is pointless, I can't disagree with them. But I know from experience that the effort to get just these 14 data points was probably quite considerable. It belittles their effort to offer such a blunt (but certainly honest) assessment. So that's why I didn't join in with a similar comment.