The surprisal matrix and applications to exploration of very large discrete data sets (created 2009-03-04, updated 2009-05-22)

This page is moving to a new website.

I am working on a talk for the KUMC Statistics journal club and for possible publication. The talk (in early April) went well, and I am hoping to explore publication sometime soon. Right now, I don't have the material on sperm morphology. Also, some of the graphs have poor resolution, making them difficult to read. I'll work on these as I have time.

Abstract: The surprisal, defined as the negative of the base 2 logarithm of a probability, is a fundamental component used in the calculation of entropy. In this talk, I will define a surprisal matrix for a data set consisting of multiple discrete variables, possibly with different supports. The surprisal matrix is useful in identifying areas of high heterogeneity in such a data set, which often corresponds to interesting and unusual patterns among the observations or among the variables. I will illustrate two applications of the surprisal matrix: monitoring data quality in a large stream of fixed format text data, and examining consensus in the evaluation of sperm morphology.

Entropy has both a physical and a mathematical definition. In thermodynamics, the entropy of a physical system can be thought of as the amount of energy in the system that is not available for work. Entropy is often informally tied to the concept of disorder. Sydney Harris has a cartoon (which I can't reproduce here because of copyright) that shows a picture of the "Department of Entropy". The door is off its hinges, the walls are cracked, and the whole building is in a state of disrepair. One needs to be careful with this analogy, as it tends to oversimplify the concept of entropy.

A mathematical definition relies on the concept of surprisals. Suppose we have a probability, P, that is strictly greater than zero. The surprisal, S(P), is

S(P) = -log2(P)

The surprisal value increases as the probability decreases. A surprisal of 5, corresponding to a probability of 1/32, represents an event that is as surprising as getting five consecutive heads in the flip of a coin.
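As a quick check of the coin-flip example, here is a short sketch in Python (the function name is mine, not from the talk):

```python
import math

def surprisal(p):
    """Surprisal in bits: -log2(p), defined for p > 0."""
    return -math.log2(p)

# A probability of 1/32 is as surprising as five consecutive heads.
print(surprisal(1/32))  # 5.0
print(surprisal(1/2))   # 1.0
```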

If a discrete random variable, X, with k possible values has probabilities p1, p2, ..., pk, then the entropy of this random variable is defined as

H(X) = -(p1 log2(p1) + p2 log2(p2) + ... + pk log2(pk))

For pi=0, define pi log2(pi) to be zero.

The entropy is simply the weighted average of the surprisal values. Suppose a random variable has five possible levels, A-E, and the probabilities are 1/2, 1/8, 1/8, 1/8, and 1/8, respectively. Then there are five surprisals, 1, 3, 3, 3, and 3. The average surprisal, weighted by the probabilities, is

(1/2)(1) + (1/8)(3) + (1/8)(3) + (1/8)(3) + (1/8)(3) = 2

For a discrete random variable with k levels, the maximum entropy equals log2(k) and occurs when each value has the same probability of 1/k.
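A short Python sketch (the function name is mine) confirms both the worked example and the maximum-entropy bound:

```python
import math

def entropy(probs):
    """Entropy in bits: the probability-weighted average of the surprisals."""
    return sum(-p * math.log2(p) for p in probs if p > 0)

# The five-level example above: surprisals 1, 3, 3, 3, 3 average to 2 bits.
print(entropy([1/2, 1/8, 1/8, 1/8, 1/8]))  # 2.0

# A uniform distribution over k = 5 levels attains the maximum, log2(5).
print(entropy([1/5] * 5), math.log2(5))
```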

For a binary random variable, the entropy is zero at the two extreme probabilities (p1=0, p2=1 or p1=1, p2=0). It increases to a maximum of 1 for p1=0.5, p2=0.5. For a binary classification, an entropy is less than 0.3 only if the smaller of the two probabilities is less than about 5%. An entropy of 0.6 occurs when the smaller of the two probabilities is about 15%. An entropy of 0.9 occurs if the smaller of the two probabilities is about 32%. The largest entropy for a binary variable, 1, occurs when both probabilities are 0.5.
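The checkpoints in this paragraph are easy to verify numerically; a minimal sketch:

```python
import math

def binary_entropy(p):
    """Entropy in bits of a binary variable with probabilities p and 1 - p."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Smaller probabilities of about 5%, 15%, and 32% give entropies near
# 0.3, 0.6, and 0.9; p = 0.5 attains the maximum of 1.
for p in (0.05, 0.15, 0.32, 0.5):
    print(p, round(binary_entropy(p), 2))
```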

For a discrete random variable with three levels, a trilinear graph shows how entropy behaves for various combinations of probabilities. The minimum entropy occurs at the extreme vertices of the triangle (p1=0, p2=0, p3=1, for example) and the maximum of log2(3) occurs at p1=1/3, p2=1/3, p3=1/3. The graph below shows contour lines where entropy equals 0.3, 0.6, 0.9, 1.2, and 1.5, with key points at or very close to these contour boundaries.

Suppose we have a random sample of n observations from an m-dimensional vector of discrete random variables. We will not assume that the individual components of the vector have the same distribution. One component, for example, could be a discrete distribution defined on the digits 0-9, and another component could be a discrete distribution defined on the set of uppercase letters A-Z.

The marginal entropy is defined as the sum of the entropies for each individual component of the vector. More specifically, if Xj, the jth component of the vector, has estimated probabilities p1j, p2j, ..., pkj, then the marginal entropy is defined as

H = H(X1) + H(X2) + ... + H(Xm) = -(sum over components j)(sum over values i) pij log2(pij)

Note that the probabilities in the above equation sum to 1 when added across the first subscript, that is, within each component. In many applications, this type of probability will be called the column probability. An alternative definition of entropy, the joint entropy, is defined similarly, but with the probabilities summing to one only when summed across both subscripts. There are some intriguing situations involving joint entropy, but I will not discuss them here.

You can get an equivalent definition of the marginal entropy by replacing the sample values by their surprisals and then averaging. Here is a simple example. The table below shows a list of 16 four letter words. The first, second, third, and fourth letter positions each define a separate discrete distribution.

Word      Probabilities        Surprisals         Total
-------   -----------------    ----------------   -----
T H I S   1/2  1   1/2  1/4    1    0    1    2      4
T H A T   1/2  1   1/4  1/4    1    0    2    2      5
C H I T   1/4  1   1/2  1/4    2    0    1    2      5
C H A T   1/4  1   1/4  1/4    2    0    2    2      6
T H I S   1/2  1   1/2  1/4    1    0    1    2      4
T H A W   1/2  1   1/4  1/16   1    0    2    4      7
C H A T   1/4  1   1/4  1/4    2    0    2    2      6
C H I P   1/4  1   1/2  1/16   2    0    1    4      7
T H I S   1/2  1   1/2  1/4    1    0    1    2      4
T H U G   1/2  1   1/16 1/8    1    0    4    3      8
W H I G   1/8  1   1/2  1/8    3    0    1    3      7
W H E Y   1/8  1   1/8  1/8    3    0    3    3      9
A H O Y   1/16 1   1/16 1/8    4    0    4    3     11
R H E A   1/16 1   1/8  1/16   4    0    3    4     11
T H I S   1/2  1   1/2  1/4    1    0    1    2      4
T H I N   1/2  1   1/2  1/16   1    0    1    4      6
                              --   --   --   --    ---
Total                         30    0   30   44    104
Average                        1.9  0.0  1.9  2.8    1.6

We will treat this list of four letter words as observations on four discrete variables corresponding to the first, second, third, and fourth letters. Next to each word is a matrix of probabilities, giving the frequency of each letter within the column it belongs to. Next to this matrix of probabilities is the matrix of surprisals.

Looking at the list of words, you can see the two words with the largest surprisal totals, AHOY and RHEA, with WHEY and WHIG not far behind. Note that AHOY and RHEA are the only two words in the list that end in vowels, which is part of the reason they have such large surprisals.
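The surprisal matrix in the table can be recomputed mechanically from the word list. A sketch in Python (the variable names are mine, not from the talk):

```python
import math
from collections import Counter

words = ["THIS", "THAT", "CHIT", "CHAT", "THIS", "THAW", "CHAT", "CHIP",
         "THIS", "THUG", "WHIG", "WHEY", "AHOY", "RHEA", "THIS", "THIN"]
n = len(words)

# Letter frequencies within each of the four positions (the "columns").
counts = [Counter(w[j] for w in words) for j in range(4)]

# The surprisal matrix: one row per word, one column per letter position.
surprisal_matrix = [[-math.log2(counts[j][w[j]] / n) for j in range(4)]
                    for w in words]

totals = {w: sum(row) for w, row in zip(words, surprisal_matrix)}
print(max(totals, key=totals.get), max(totals.values()))  # AHOY 11.0
```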

What's the worst case scenario for a series of four letter words? That would occur if each of the 26 letters was equally likely. This would produce a surprisal of 4.7 (= -log2(1/26)).

The difference between the entropy under this null model and the entropy actually observed is defined as the information. It is helpful to plot the information rather than the entropy, because this emphasizes regions of stability and consistency. This is a histogram of the information at each letter position. It faces left to maintain consistency with some of the plots shown later.

The second position is the most stable, of course, because it is always H. The other three positions, however, show a good level of stability compared to a uniform distribution across all 26 letters.
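The per-position entropy and information can be computed the same way; a sketch, using the word list from the table above (names mine):

```python
import math
from collections import Counter

words = ["THIS", "THAT", "CHIT", "CHAT", "THIS", "THAW", "CHAT", "CHIP",
         "THIS", "THUG", "WHIG", "WHEY", "AHOY", "RHEA", "THIS", "THIN"]
n = len(words)
null_entropy = math.log2(26)  # about 4.7 bits: all 26 letters equally likely

entropies = []
for j in range(4):
    counts = Counter(w[j] for w in words)
    h = sum(-(c / n) * math.log2(c / n) for c in counts.values())
    entropies.append(h)
    print(f"position {j + 1}: entropy {h:.2f}, information {null_entropy - h:.2f}")
```

The entropies match the column averages in the table (1.9, 0.0, 1.9, 2.8 after rounding), and position 2 carries the full 4.7 bits of information.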

To help identify how each individual letter contributes to entropy and information, I list the letters in decreasing order of frequency on the right hand side of the bar chart and I divide the bars according to the probabilities.

So, in the first position, T occurs most frequently (50% of the time). C is the next most frequent (25%). The least frequent letters in the first position are A and R.

Let's apply this concept to a large data stream. A typical data stream might include the digits 0-9, the uppercase letters A-Z, the lower case letters a-z, blanks, or symbols (!@#$ etc.). I will color code the numeric digits in black, uppercase letters in blue, lowercase letters in green, blanks in orange, and symbols in red.
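The color coding amounts to a five-way classification of each character; a minimal sketch (the function name and labels are mine):

```python
def char_class(c):
    """Classify a single character for color coding the data stream display."""
    if c.isdigit():
        return "digit"   # black
    if c.isupper():
        return "upper"   # blue
    if c.islower():
        return "lower"   # green
    if c == " ":
        return "blank"   # orange
    return "symbol"      # red

print([char_class(c) for c in "Ab3 !"])
# ['upper', 'lower', 'digit', 'blank', 'symbol']
```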

This data set, from the AAUP, provides information on over 1,000 colleges and universities.

The same approach can show the data from another file, the 1973 National Ambulatory Medical Care Survey.

Note that both data sets have informal signposts or road markers, regions of the data where the columns are consistently a single value or sometimes just two or three values. There are other columns where anything goes--there are a lot of possible values and no one value dominates.

Graphic images like the ones above are not substitutes for the actual coding format of these files. They are, however, a quick way to get an intuitive feel for the arrangement of data in a file. Notice that the first file makes liberal use of blanks. You could almost use a delimited format for inputting this data (a stray parenthesis in column 37 is the only obstacle). Certain parts of the file are dominated by letters and a few symbols; other regions are dominated by numeric digits only (no minus signs or decimal points!).

The second data set has almost no blanks (except, curiously, for the last four columns). The sea of black indicates that almost everything is a numeric digit. There are no symbols, but an uppercase Y creeps into a few columns of data.

You can compute a surprisal matrix for these files, and it helps identify rows that do not fit the general pattern of the rest of the data. I like to think of the surprisal matrix as helping to answer that Sesame Street song, "Which of these things is not like the others?"

A row with unusually high surprisal values might represent data with a legitimate but unusual pattern. It might also represent a data quality problem. It is the regions of low entropy (high information) that help us diagnose data quality problems. If, for example, some of the columns in a particular row are misaligned, then the "signposts" and "road markers" in the data set will be off, leading to surprising characters appearing in places where they clearly don't belong. Columns with high entropy (low information) are not much help. You're already used to seeing a lot of surprises in a column with high entropy, so a stray character will not seem all that unusual.
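To make this concrete, here is a sketch of the computation on a tiny made-up file; the data and names are hypothetical and come from neither the AAUP nor the NAMCS files:

```python
import math
from collections import Counter

def row_surprisals(lines):
    """Per-character surprisal matrix for fixed-format text.

    Each column position is treated as its own discrete variable, with
    character probabilities estimated from frequencies within that column.
    """
    n = len(lines)
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    col_counts = [Counter(line[j] for line in padded) for j in range(width)]
    return [[-math.log2(col_counts[j][line[j]] / n) for j in range(width)]
            for line in padded]

# Hypothetical miniature file: the last row is misaligned, so rare
# characters land in otherwise stable "signpost" columns.
lines = ["001 ALPHA", "002 BRAVO", "003 DELTA", "04  ECHO "]
totals = [sum(row) for row in row_surprisals(lines)]
worst = max(range(len(lines)), key=lambda i: totals[i])
print(lines[worst])  # the misaligned row has the largest total surprisal
```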

In the AAUP data set, the two rows with the highest surprisals are rows 658 (Rutgers State Univ.-N.Brswck) and 389 (Massachusetts Inst. of Tech.). The figure below shows how the surprisals got so big for Rutgers.

This is a similar graph for MIT.

Although both Rutgers and MIT appear to have no data entry errors, there is some value in highlighting the regions where the surprisals are unusually high. The name of the College/University appears in columns 7-37, but many of these names are short. Blanks start appearing regularly by column 12, account for about half of the characters by column 27, and account for most of the characters by column 32. In contrast, Rutgers and MIT have names that stretch out across the entire length of the university/college name field. Notice also that the lead digit in many of the entries for Rutgers and MIT has a high surprisal value. These lead digits seem to indicate unusually large numbers. The first numeric field after the university/college name, for example, is 846 for Rutgers. The 8 is unusual, but not the 4 or 6.

There is more material on the application of the surprisal matrix to a sperm morphology classification system, but I will have to add it later.