StATS: What is entropy?

Most of the material on this page has been moved to a new page on information theory models.

The terms entropy, uncertainty, and information are used more or less interchangeably, but from the perspective of probability, uncertainty is perhaps the more accurate term. The formula for entropy/uncertainty is

  $$H = -\sum_i p_i \log_2 p_i$$

or in an equivalent form

  $$H = \sum_i p_i \log_2 \left( \frac{1}{p_i} \right)$$
If any of the probabilities is zero, the corresponding term is mathematically ambiguous (zero times infinity). By convention, we set that term to zero, which effectively drops it from the uncertainty calculation. Those of you who are familiar with calculus could use limits to establish that zero is a reasonable value for this product, since $p \log_2(1/p) \to 0$ as $p \to 0$.
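As a concrete illustration, here is a minimal Python sketch of the calculation (the function name and the sample probabilities are my own choices for this example, not part of the original page):

  import math

  def entropy(probs):
      # Entropy/uncertainty in bits: the sum of p * log2(1/p) over all
      # categories. Terms with p == 0 are skipped, following the
      # convention described above.
      return sum(p * math.log2(1 / p) for p in probs if p > 0)

  # The zero-probability category contributes nothing to the total.
  print(entropy([0.5, 0.5, 0.0]))  # 1.0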

Myron Tribus calls the term

  $$\log_2 \left( \frac{1}{p_i} \right)$$

the surprisal. Surprisal is the degree to which you are surprised to see a result. When the probability is 1, there is zero surprise at seeing the result. As the probability gets smaller and smaller, the surprisal grows without bound, with positive infinity as the limiting value.

The uncertainty can therefore be thought of as a weighted average surprisal, with each result's surprisal weighted by its probability of occurring. If the data consist mostly of a few events, each with relatively high probability, then you are unlikely to be surprised very often. But if the data contain a large number of rare events, then you will encounter surprises all over the place.
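To make the connection concrete (again a sketch, with an illustrative three-category distribution of my own choosing), entropy is just the surprisals averaged with the probabilities as weights:

  import math

  def surprisal(p):
      # Surprisal in bits: log2(1/p). Zero when p == 1, large when p is small.
      return math.log2(1 / p)

  probs = [0.5, 0.25, 0.25]

  # Weight each result's surprisal by its probability of occurring.
  print(sum(p * surprisal(p) for p in probs))  # 1.5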

Simple examples

Here are a few examples of entropy calculations. Suppose we have a classification scheme with categories A, B, C, D, and E. Let's suppose that a group of people all classify the same object. If everyone agreed on the classification:

  A: 1.00   B: 0.00   C: 0.00   D: 0.00   E: 0.00

then the entropy would be calculated as

  $$H = -(1.00 \log_2 1.00) = 0$$

implying no uncertainty. Recall that all of the terms with zero probability are set equal to zero.

If instead, the raters split evenly between two categories:

  A: 0.50   B: 0.50   C: 0.00   D: 0.00   E: 0.00

then the entropy would be calculated as

  $$H = -(0.50 \log_2 0.50 + 0.50 \log_2 0.50) = 1$$

which implies one "bit" of uncertainty. Let's suppose now that the raters split evenly among four categories:

  A: 0.25   B: 0.25   C: 0.25   D: 0.25   E: 0.00

then the entropy would be

  $$H = -(0.25 \log_2 0.25 + 0.25 \log_2 0.25 + 0.25 \log_2 0.25 + 0.25 \log_2 0.25) = 2$$

which implies two bits of uncertainty, a greater level of uncertainty than when the raters split evenly between only two categories.
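These three results are easy to verify in Python (a quick check of the arithmetic above):

  import math

  # The three distributions: complete agreement, an even two-way split,
  # and an even four-way split.
  for probs in ([1.00, 0, 0, 0, 0],
                [0.50, 0.50, 0, 0, 0],
                [0.25, 0.25, 0.25, 0.25, 0]):
      h = sum(p * math.log2(1 / p) for p in probs if p > 0)
      print(h)  # prints 0.0, then 1.0, then 2.0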

It's not just the number of categories that influences entropy, though. If there is a wide range of categories selected, but one category is favored by most raters, the entropy will be lower than if the raters split evenly among all the categories. Consider this distribution of ratings:

  A: 0.50   B: 0.125   C: 0.125   D: 0.125   E: 0.125

The entropy calculation would be

  $$H = -(0.50 \log_2 0.50) - 4 \times (0.125 \log_2 0.125) = 0.5 + 1.5 = 2$$

which shows a level of uncertainty comparable to our previous case.
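A final check (a sketch using the distribution above) confirms that favoring one category lowers the entropy relative to an even split across all five categories:

  import math

  skewed = [0.50, 0.125, 0.125, 0.125, 0.125]  # one favored category
  even   = [0.20, 0.20, 0.20, 0.20, 0.20]      # even five-way split

  for probs in (skewed, even):
      h = sum(p * math.log2(1 / p) for p in probs)
      print(round(h, 3))  # 2.0 bits, then 2.322 bits (log2 of 5)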

Further reading

  1. Tribus M. (1961) Thermostatics and Thermodynamics. D. Van Nostrand Company, Inc., Princeton NJ.

This page was written by Steve Simon while working at Children's Mercy Hospital. Although I do not hold the copyright for this material, I am reproducing it here as a service, as it is no longer available on the Children's Mercy Hospital website.