P.Mean: Jackknife applied to entropy calculations (created 2008-09-15).


I have been working with entropy for a couple of different projects, and one important question to ask is "How much does the entropy change when a single observation is removed from the data set?" This process of removing one item from a data set and recalculating a statistic based on the remaining (n-1) observations is called jackknifing. It is a very simple but still very useful technique in a variety of statistical settings.

Assume that a variable has k possible values, and that the observed probabilities of these values based on all n observations are p1,n, p2,n, ..., pk,n. Denote the observed probabilities based on the remaining (n-1) observations as p1,n-1, p2,n-1, ..., pk,n-1. Removing a single observation decreases the count in one category (call it I) and leaves the counts for the remaining categories unchanged. But the denominator also declines, which effectively increases all of the probabilities except the probability for category I, which has to decline.

It's a bit tedious to calculate the actual size of the (k-1) increases and the one decrease, but the formula is

$$p_{j,n-1} = p_{j,n} + \frac{p_{j,n}}{n-1} \;\; (j \neq I), \qquad p_{I,n-1} = p_{I,n} - \frac{1 - p_{I,n}}{n-1}.$$
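As a concrete sketch, this update can be computed directly from the category counts (a minimal Python illustration; the function name is my own, not from the original page):

```python
from fractions import Fraction

def jackknife_probs(counts, removed):
    """Category probabilities after one observation is removed
    from category `removed`; `counts` holds the counts for all n."""
    n = sum(counts.values())
    new_counts = dict(counts)
    new_counts[removed] -= 1               # category I loses one count
    return {c: Fraction(m, n - 1) for c, m in new_counts.items()}

# Counts for the A/B/C/D example used later on this page (n = 10).
probs9 = jackknife_probs({"A": 1, "B": 2, "C": 3, "D": 4}, "D")
# D's probability drops (4/10 -> 3/9); every other probability rises.
```

Using exact fractions makes it easy to see that the new probabilities still sum to one.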

The entropy based on all n observations is

$$H_n = -\sum_{i=1}^{k} p_{i,n} \log_2 p_{i,n}$$

and the entropy based on (n-1) observations is

$$H_{n-1} = -\sum_{i=1}^{k} p_{i,n-1} \log_2 p_{i,n-1}$$
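Both quantities are easy to compute directly; here is a minimal Python sketch (the `entropy` helper is my own name), using the probabilities 0.1, 0.2, 0.3, 0.4 from the worked example later on this page:

```python
import math

def entropy(probs):
    """Shannon entropy in bits: -sum of p * log2(p)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

H_n  = entropy([0.1, 0.2, 0.3, 0.4])    # all n = 10 observations
H_n1 = entropy([1/9, 2/9, 3/9, 3/9])    # the 9 left after removing one D
```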

In order to calculate the entropy based on (n-1) observations, you need to remember the Taylor series approximation. If you let

$$\delta_i = p_{i,n-1} - p_{i,n}$$

then we can compute

$$\log_2(p_{i,n} + \delta_i) \approx \log_2 p_{i,n} + \log_2(e)\,\frac{\delta_i}{p_{i,n}}$$
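A quick numeric check of this first-order Taylor step (the specific p and delta below are arbitrary, chosen only for illustration):

```python
import math

p, delta = 0.4, -0.01
exact  = math.log2(p + delta)
approx = math.log2(p) + math.log2(math.e) * delta / p
# For a small delta the two agree to roughly three decimal places.
```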

This allows you to compute

$$H_{n-1} \approx H_n - \sum_{i=1}^{k} \delta_i \log_2 p_{i,n} - \log_2(e) \sum_{i=1}^{k} \delta_i$$

The constant term (log2(e)) drops out because the sum of the delta terms has to be zero. After a bit more tedious mathematics, you get

$$H_{n-1} \approx H_n + \frac{H_n + \log_2 p_{I,n}}{n-1}$$

which can be rearranged as

$$H_{n-1} - H_n \approx \frac{H_n - s_I}{n-1},$$

where sI = -log2 pI,n is the surprisal of category I.

This formula shows that the change in entropy is proportional to the difference between the entropy and the surprisal value of the category associated with the observation being removed. If you are removing an observation whose category has a high surprisal level, that leads to a reduction in entropy. If you are removing an observation whose category has a low surprisal level, that leads to an increase in entropy. If you remove an observation whose category has a surprisal level equal to the entropy in the variable, that leads to no change.
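That rule of thumb can be sketched as a small function (a minimal illustration; `approx_change` and `entropy` are my own names):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def approx_change(probs, i, n):
    """First-order approximation to H_(n-1) - H_n when one observation
    is removed from category i (a 0-based index into probs)."""
    surprisal = -math.log2(probs[i])
    return (entropy(probs) - surprisal) / (n - 1)

probs = [0.1, 0.2, 0.3, 0.4]
# Removing a rare (high-surprisal) A lowers the entropy ...
assert approx_change(probs, 0, 10) < 0
# ... while removing a common (low-surprisal) D raises it.
assert approx_change(probs, 3, 10) > 0
```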

Here's an example. A variable has the values

 A, B, B, C, C, C, D, D, D, D

which produces four probabilities

0.1, 0.2, 0.3, and 0.4

and four surprisals:

3.3219, 2.3219, 1.7370, 1.3219.

The total entropy for these 10 observations is 1.8464. If we deleted the last observation (a D), the entropy would be 1.8911, an increase of 0.0446. The difference between the entropy and the last surprisal is 1.8464 - 1.3219 = 0.5245. Divide this value by 9 (that is, by n-1) to get 0.0583. Now that's not a very good approximation.

Now suppose you have 100 values (10 A's, 20 B's, 30 C's, and 40 D's). The entropy of these 100 observations and the four surprisal values stay the same. But when you compute the entropy after deleting the last observation, you get 1.8516. The true difference, about 0.0051, is very close to what the approximation says (0.5245 / 99, or 0.0053).
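Both worked examples can be verified numerically; here is a short sketch (the helper name is mine):

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

H = entropy([0.1, 0.2, 0.3, 0.4])     # 1.8464 at both n = 10 and n = 100
s_D = -math.log2(0.4)                 # surprisal of category D, 1.3219

exact10   = entropy([1/9, 2/9, 3/9, 3/9]) - H            # delete a D, n = 10
exact100  = entropy([10/99, 20/99, 30/99, 39/99]) - H    # delete a D, n = 100
approx10  = (H - s_D) / 9
approx100 = (H - s_D) / 99
# approx10 (0.0583) misses exact10 (0.0446) by a fair margin,
# but approx100 (0.0053) is very close to exact100 (about 0.0051).
```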

So the relationship between the change in entropy after deleting an observation and the difference between the individual surprisal and the entropy is a very rough approximation for 10 observations, but quite good for 100 observations.

If I were really clever, I could compute a second order Taylor series expansion to get a better approximation and/or a bound on the error. But today, I'm just barely able to understand the first order Taylor series.

The calculations shown here are closely related to the concept of conditional entropy and joint information. I will elaborate in a future webpage.

This work is licensed under a Creative Commons Attribution 3.0 United States License. This page was written by Steve Simon and was last modified on 2010-04-01.