#### A simple example of joint and conditional entropy (created 2009-01-07)

In a project involving sperm morphology classification, I have found the concept of entropy very useful for analyzing the data and describing certain patterns. I want to extend the work to include joint and conditional entropy. To start with a simple data set, I downloaded a file from the Data and Story Library website.

There is an interesting file "High Fiber Diet Plan" that provides a useful way to explore joint and conditional entropy.

A manufacturer was considering marketing crackers high in a certain kind of edible fiber as a dieting aid. Dieters would consume some crackers before a meal, filling their stomachs so that they would feel less hungry and eat less. A laboratory studied whether people would in fact eat less in this way. Overweight female subjects ate crackers with different types of fiber (bran fiber, gum fiber, both, and a control cracker) and were then allowed to eat as much as they wished from a prepared menu. The amount of food they consumed and their weight were monitored, along with any side effects they reported. Unfortunately, some subjects developed uncomfortable bloating and gastric upset from some of the fiber crackers. A contingency table of "Cracker" versus "Bloat" shows the relationship between the four different types of cracker and the four levels of severity of bloating as reported by the subjects.

Conditional entropy for Bloat given Diet. If you look at the probabilities for bloat across all subjects and all diets, you get the following:

```0=no 35% 1=lo 31% 2=md 19% 3=hi 15%```

The entropy for these four probabilities is 1.91 bits (all entropies here use base-2 logarithms). If you look at the individual diets, however, the entropies are lower.
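As a quick check, the entropy figure can be reproduced in a few lines of Python. This is only a minimal sketch: the function name `entropy` is my own choice, and it assumes base-2 logarithms (bits), which is consistent with the value of 2.00 for four equally likely diets. Because the percentages quoted above are rounded, the result is close to, but not exactly, 1.91.

```python
import math

def entropy(probs):
    """Shannon entropy in bits; zero-probability cells contribute nothing."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Rounded bloat percentages quoted in the text (no, low, medium, high).
bloat = [0.35, 0.31, 0.19, 0.15]
print(round(entropy(bloat), 2))
```

Skipping the cells with zero probability follows the usual convention that 0 * log(0) is treated as 0.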

Control (H = 1.46)

```0=no 50% 1=lo 33% 2=md 17% 3=hi  0%```

Gum (H = 1.89)

```0=no 17% 1=lo 17% 2=md 25% 3=hi 42%```

Combo (H = 1.89)

```0=no 17% 1=lo 42% 2=md 25% 3=hi 17%```

Bran (H = 1.28)

```0=no 58% 1=lo 33% 2=md  8% 3=hi  0%```

Notice that Control and Bran tend to be clustered at the lower values of bloat, unlike the overall probabilities, which are more evenly distributed across the four bloat categories. Because each diet accounts for exactly a quarter of the observations, the simple average of these four entropies (1.63) is the conditional entropy of Bloat given Diet. In mathematical notation, we would write H(Bloat | Diet) = 1.63.
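The per-diet entropies and their average can be sketched the same way. The dictionary name `bloat_by_diet` is hypothetical, and the inputs are the rounded percentages listed above, so each row is rescaled to sum to 1 before the entropy is taken. Expect agreement with the quoted values only to within rounding.

```python
import math

def entropy(probs):
    """Shannon entropy in bits, after rescaling probs to sum to 1."""
    total = sum(probs)
    return -sum((p / total) * math.log2(p / total) for p in probs if p > 0)

# Rounded percentages of bloat (no, low, medium, high) within each diet.
bloat_by_diet = {
    "control": [50, 33, 17, 0],
    "gum":     [17, 17, 25, 42],
    "combo":   [17, 42, 25, 17],
    "bran":    [58, 33, 8, 0],
}

per_diet = {diet: entropy(p) for diet, p in bloat_by_diet.items()}
# Each diet has probability 1/4, so the weighted average is a plain mean.
h_bloat_given_diet = sum(per_diet.values()) / len(per_diet)

for diet, h in per_diet.items():
    print(f"{diet:8s} H = {h:.2f}")
print(f"H(Bloat | Diet) = {h_bloat_given_diet:.2f}")
```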

In this example, the entropies for Bloat at each individual value of diet (1.46, 1.89, 1.89, and 1.28) were all less than the overall entropy for Bloat (1.91). It's easy to find examples where some of the individual entropy values are larger, but these will always be counterbalanced by other, smaller individual entropy values. With a bit of algebra, you can show that the conditional entropy is always less than or equal to the unconditional entropy.

H(X|Y) <= H(X)

with equality occurring only if the two variables are statistically independent of one another.
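The algebra can be sketched as follows. The notation is my own addition: p(x, y) stands for the joint probabilities and p(x), p(y) for the marginals. The final step relies on the standard result that a Kullback-Leibler divergence is never negative and is zero only when p(x, y) = p(x) p(y) for every cell.

```latex
\begin{align*}
H(X) - H(X \mid Y)
  &= -\sum_{x,y} p(x,y) \log_2 p(x) + \sum_{x,y} p(x,y) \log_2 p(x \mid y) \\
  &= \sum_{x,y} p(x,y) \log_2 \frac{p(x \mid y)}{p(x)} \\
  &= \sum_{x,y} p(x,y) \log_2 \frac{p(x,y)}{p(x)\,p(y)} \;\ge\; 0 .
\end{align*}
```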

Conditional entropy for Diet given Bloat. You could reverse this problem and note that each diet occurs (by design) a quarter of the time.

```1=con 25% 2=gum 25% 3=com 25% 4=brn 25%```

The entropy for these four probabilities is 2.00. The distribution of diets, however, is different among those experiencing different levels of bloat.

No bloating (H = 1.78)

```1=con 35% 2=gum 12% 3=com 12% 4=brn 41%```

Low bloating (H = 1.93)

```1=con 27% 2=gum 13% 3=com 33% 4=brn 27%```

Medium bloating (H = 1.89)

```1=con 22% 2=gum 33% 3=com 33% 4=brn 11%```

High bloating (H = 0.86)

```1=con  0% 2=gum 71% 3=com 29% 4=brn  0%```

You shouldn't just compute a simple average here to get the conditional entropy. No bloating and low bloating occur more frequently than medium and high bloating. A better statistic would weight the entropies by the probabilities of the respective bloating levels. This gives you a weighted average of 1.72 (≈ 0.35*1.78 + 0.31*1.93 + 0.19*1.89 + 0.15*0.86; the rounded figures shown here multiply out to 1.71, and the small difference is due to rounding). This is the conditional entropy of diet given bloating. In mathematical notation, this would be written H(Diet | Bloat) = 1.72.
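To see how much the weighting matters, here is a small sketch that computes both the simple and the weighted average. The variable names are hypothetical, and the inputs are the rounded percentages quoted above (rescaled to sum to 1 within each bloating level), so the results match the text only to within rounding.

```python
import math

def entropy(probs):
    """Shannon entropy in bits, after rescaling probs to sum to 1."""
    total = sum(probs)
    return -sum((p / total) * math.log2(p / total) for p in probs if p > 0)

# P(bloat level) and the rounded diet percentages (con, gum, com, brn)
# within each bloat level, both as quoted in the text.
levels = [
    ("none",   0.35, [35, 12, 12, 41]),
    ("low",    0.31, [27, 13, 33, 27]),
    ("medium", 0.19, [22, 33, 33, 11]),
    ("high",   0.15, [0, 71, 29, 0]),
]

simple_mean = sum(entropy(d) for _, _, d in levels) / len(levels)
weighted_mean = sum(w * entropy(d) for _, w, d in levels)

print(f"simple mean   = {simple_mean:.2f}")
print(f"weighted mean = {weighted_mean:.2f}")  # H(Diet | Bloat)
```

The simple mean comes out lower because it gives the high-bloating group, which has by far the smallest entropy, the same weight as the much more common no-bloating and low-bloating groups.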

Notice that H(Diet | Bloat) is not equal to H(Bloat | Diet).

Joint entropy. You can calculate a joint probability distribution for diet and bloat. The table looks like this:

```
      Diet
Bloat 1=con 2=gum 3=com 4=brn
0=no    12%    4%    4%   15%
1=lo     8%    4%   10%    8%
2=md     4%    6%    6%    2%
3=hi     0%   10%    4%    0%
```

Notice that these 16 probabilities should add up to 100% but actually add up to 97% because of rounding. If you calculate entropy from these 16 probabilities, you get 3.63. In mathematical notation, this would be H(Bloat, Diet) = 3.63. You could also define the joint entropy using the term H(Diet, Bloat), since swapping the rows and columns of the above table does not affect the entropy calculation.
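The joint entropy is just the ordinary entropy formula applied to all 16 cells at once. The sketch below avoids the rounding problem by working from counts, but note the assumption: the counts are not given anywhere in this write-up. I have inferred them from the rounded percentages, which are consistent with 12 observations per diet (48 in total).

```python
import math

# ASSUMPTION: 12 observations per diet (48 in total). These counts are
# inferred from the rounded percentages; they are not given in the text.
# Rows: bloat none, low, medium, high. Columns: control, gum, combo, bran.
counts = [
    [6, 2, 2, 7],
    [4, 2, 5, 4],
    [2, 3, 3, 1],
    [0, 5, 2, 0],
]

n = sum(sum(row) for row in counts)
joint = [c / n for row in counts for c in row]   # 16 joint probabilities

# Joint entropy in bits; empty cells contribute nothing.
h_joint = -sum(p * math.log2(p) for p in joint if p > 0)
print(f"H(Bloat, Diet) = {h_joint:.2f}")

# Transposing the table only reorders the 16 terms of the sum.
transposed = [c / n for col in zip(*counts) for c in col]
h_joint_t = -sum(p * math.log2(p) for p in transposed if p > 0)
```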

Mutual information. The joint entropy is always less than or equal to the sum of the two individual entropies. In this example, the two individual entropies are 1.91 and 2.00. The only time that the joint entropy is exactly equal to the sum of the two individual entropies is when the two variables are statistically independent. The degree to which the joint entropy is less than the sum of the two individual entropies is a measure of the degree of dependence between the two variables. This discrepancy is called the mutual information and is denoted by the mathematical symbol I. In mathematical notation, this would be

I(X, Y) = H(X) + H(Y) - H(X,Y).

You can visualize this formula graphically as well.

[Figure: diagram of H(X) + H(Y) - H(X,Y)]

In the current example, I(Diet, Bloat) = 1.91 + 2.00 - 3.63 = 0.28.

With a bit of mathematical manipulation, you can show that

I(X,Y) = H(X) - H(X|Y)

which has this graphic visualization

[Figure: diagram of H(X) - H(X|Y)]

and also that

I(X, Y) = H(Y) - H(Y|X)

which has this graphic visualization

[Figure: diagram of H(Y) - H(Y|X)]
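The manipulation rests on the chain rule for entropy, H(X, Y) = H(Y) + H(X|Y), a standard identity that I have not stated above. A short sketch of how both forms follow from it:

```latex
\begin{align*}
H(X, Y) &= H(Y) + H(X \mid Y) = H(X) + H(Y \mid X) \\
I(X, Y) &= H(X) + H(Y) - H(X, Y) \\
        &= H(X) + H(Y) - \bigl[ H(Y) + H(X \mid Y) \bigr] = H(X) - H(X \mid Y) \\
        &= H(X) + H(Y) - \bigl[ H(X) + H(Y \mid X) \bigr] = H(Y) - H(Y \mid X)
\end{align*}
```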

You can confirm these two formulas with the data set on diet and bloating. Note that

H(Bloat) - H(Bloat | Diet) = 1.91 - 1.63 = 0.28 and

H(Diet) - H(Diet | Bloat) = 2.00 - 1.72 = 0.28.
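All three expressions for the mutual information can also be checked numerically in one pass. As before, this is a sketch under a stated assumption: the count table is inferred from the rounded percentages (12 observations per diet) and is not given in the text, and the variable names are my own.

```python
import math

def entropy(weights):
    """Shannon entropy in bits of counts or probabilities (rescaled to sum to 1)."""
    total = sum(weights)
    return -sum((w / total) * math.log2(w / total) for w in weights if w > 0)

# ASSUMPTION: counts inferred from the rounded percentages (12 per diet).
# Rows: bloat none, low, medium, high. Columns: control, gum, combo, bran.
counts = [
    [6, 2, 2, 7],
    [4, 2, 5, 4],
    [2, 3, 3, 1],
    [0, 5, 2, 0],
]
n = sum(map(sum, counts))
cols = list(zip(*counts))

h_bloat = entropy([sum(r) for r in counts])
h_diet = entropy([sum(c) for c in cols])
h_joint = entropy([c for r in counts for c in r])
# Conditional entropies: per-group entropies weighted by group probability.
h_bloat_given_diet = sum(sum(c) / n * entropy(c) for c in cols)
h_diet_given_bloat = sum(sum(r) / n * entropy(r) for r in counts)

mi_joint = h_bloat + h_diet - h_joint
mi_bloat = h_bloat - h_bloat_given_diet
mi_diet = h_diet - h_diet_given_bloat
print(f"{mi_joint:.2f} {mi_bloat:.2f} {mi_diet:.2f}")
```

The three values agree up to floating-point error, which is what the identities require regardless of the particular table used.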