Using entropy and the surprisal value to measure the degree of agreement with the consensus finding (created 2010-04-11).

One of the research problems that I am working on involves evaluation of a subjective rating system. I have been using information theory to try to identify objects where the evaluators agree well and objects where the evaluators do not agree well. I am also working on identifying objects that an individual rater rates poorly. The method is to measure when the surprisal of the category that a rater selected is much higher than the entropy (the average surprisal across all raters).

Here are some definitions. If you have a probability, p, then the surprisal is defined as

$$\text{surprisal} = -\log_2 p$$

Logarithms relate numbers to powers, and base two logarithms relate numbers to powers of 2 (4, 8, 16, 32, 64, etc.) as well as powers of 1/2 (1/4, 1/8, etc.). For probabilities, of course, only the powers of 1/2 are relevant, since a probability cannot be larger than 1.

If p=1/32, then the surprisal is 5, which corresponds to an event that is as surprising as getting five consecutive heads on the flip of a coin. If an event has zero probability, the surprisal is infinity. We need to tread with a bit of care around this fact.

If an event has probability 1, then the surprisal is zero. This is a quantification of the fact that an event that always occurs offers no surprise. As a probability approaches zero, the surprisal increases without bound. A probability of zero corresponds to an infinite surprisal.
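The boundary cases above are easy to check in code. Here is a minimal Python sketch, taking the surprisal to be -log2(p), which matches the p=1/32 example. The function name `surprisal` is my own choice for illustration, and returning infinity for p=0 is just one way to handle that case.

```python
import math

def surprisal(p: float) -> float:
    """Surprisal in bits, -log2(p), for a probability p between 0 and 1."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be a probability between 0 and 1")
    if p == 0.0:
        # math.log2 cannot handle this case, so return infinity explicitly.
        return math.inf
    # log2(1/p) equals -log2(p) and avoids printing -0.0 when p is 1.
    return math.log2(1 / p)

print(surprisal(1 / 32))  # 5.0, as surprising as five heads in a row
print(surprisal(1.0))     # 0.0, a certain event offers no surprise
print(surprisal(0.0))     # inf
```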

Now assume that n raters are asked to provide a categorical rating for an object. The possible values of the rating are C1, C2, ..., Ck. Let n1, n2, ..., nk be the number of raters who selected each category, and define

$$p_i = \frac{n_i}{n}, \quad i = 1, 2, \ldots, k$$

Then the entropy is defined as

$$H = -\sum_{i=1}^{k} p_i \log_2 p_i$$

There are some technical adjustments to the formula if one or more of the probabilities are zero, but they are not worth mentioning here.

There is an alternative way to compute entropy that will produce the same number. If you replace each rater's classification with the probability for that classification across all raters, and then transform those probabilities into surprisals, the entropy is the average surprisal across all raters.
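The equivalence of the two calculations can be sketched in Python. The function names and the counts below are hypothetical, chosen only for illustration. Skipping empty categories (treating 0 times log 0 as 0) is a common convention that I am assuming here, not necessarily the only possible adjustment.

```python
import math

def entropy_from_counts(counts):
    """Entropy in bits from category counts, skipping empty categories."""
    n = sum(counts)
    # A category with zero raters contributes nothing to the sum.
    return sum((c / n) * math.log2(n / c) for c in counts if c > 0)

def entropy_as_mean_surprisal(counts):
    """The same quantity, computed as the average surprisal over raters."""
    n = sum(counts)
    per_rater = []
    for c in counts:
        if c == 0:
            continue  # no rater chose this category
        # Each of the c raters in this category has surprisal log2(n / c).
        per_rater.extend([math.log2(n / c)] * c)
    return sum(per_rater) / n

counts = [5, 3, 0]  # hypothetical: 8 raters, third category unused
print(math.isclose(entropy_from_counts(counts),
                   entropy_as_mean_surprisal(counts)))  # True
```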

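Here's a simple example. A rating system has four possible choices, A, B, C, and D. Eight raters evaluate an object, and four classify it as A, two as B, and one each as C and D. The four probabilities are

$$p_A = \frac{4}{8} = \frac{1}{2}, \quad p_B = \frac{2}{8} = \frac{1}{4}, \quad p_C = \frac{1}{8}, \quad p_D = \frac{1}{8}$$

so the entropy would be calculated as

$$-\left(\frac{1}{2}\log_2\frac{1}{2} + \frac{1}{4}\log_2\frac{1}{4} + \frac{1}{8}\log_2\frac{1}{8} + \frac{1}{8}\log_2\frac{1}{8}\right) = \frac{1}{2}(1) + \frac{1}{4}(2) + \frac{1}{8}(3) + \frac{1}{8}(3) = 1.75$$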

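If you look at the individual ratings, replace them with probabilities, and then calculate the surprisals, you get

Rater  Rating  Probability  Surprisal
1      A       1/2          1
2      A       1/2          1
3      A       1/2          1
4      A       1/2          1
5      B       1/4          2
6      B       1/4          2
7      C       1/8          3
8      D       1/8          3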

The average of the last column is 1.75.

Now notice in this example that there is no perfect consensus choice. A is the most popular choice, though, and the surprisal of the four raters who chose A (1) is lower than the average surprisal of 1.75. The two raters who selected B didn't do quite as well, and their surprisal (2) is slightly higher than the average surprisal. The raters who selected C or D performed the worst, and their surprisals (3) are 1.25 units higher than the average surprisal. Now "worst" is a word that should be used carefully here. Perhaps the seventh rater was the only one who recognized the correct rating. But in many settings, there is no way to know what the correct rating is, and then you should consider not the correctness of the rating, but rather how close a rater comes to the consensus finding.
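The comparison in this example can be reproduced with a few lines of Python. This is only a sketch. The order of the raters in the list is an assumption (four A's, two B's, then C and D), and the variable names are mine.

```python
import math
from collections import Counter

# Assumed ordering of the eight ratings in the example.
ratings = ["A", "A", "A", "A", "B", "B", "C", "D"]
n = len(ratings)
counts = Counter(ratings)

# Surprisal of each rater's choice: log2(n / count) = -log2(probability).
surprisals = [math.log2(n / counts[r]) for r in ratings]
entropy = sum(surprisals) / n  # average surprisal across raters
deviations = [s - entropy for s in surprisals]

print(f"entropy = {entropy}")  # entropy = 1.75
for rater, (r, s, d) in enumerate(zip(ratings, surprisals, deviations), start=1):
    print(f"rater {rater}: {r}  surprisal {s:.2f}  deviation {d:+.2f}")
```

Raters who chose A come out at -0.75, those who chose B at +0.25, and those who chose C or D at +1.25.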

Comparing the individual surprisal to the average surprisal is also very important. For some objects, the raters are able to come close to a perfect consensus. In these settings, a deviation from the most popular choice should be considered very extreme. On the other hand, for some objects raters may not be able to produce a strong consensus. Then a deviation from the most popular choice should be treated with some deference and understanding.

Consider an object where the first 7 of the 8 raters agreed on a single classification and the last rater chose a different one. The surprisal for the first seven raters is -log2(7/8) = 0.19 and the surprisal for the last rater is -log2(1/8) = 3. The average surprisal is 0.54, so the last rater's deviation from the average surprisal is 2.46.

Contrast this with the above example where only half of the raters agreed on the most popular category. The surprisal of the last rater is only 1.25 units higher than the overall average surprisal.
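To see the two cases side by side, here is a small Python sketch. The function name `dissent_summary` and the category labels X and Y for the near-consensus object are hypothetical, since the example does not name the categories.

```python
import math

def dissent_summary(counts, chosen):
    """Return (surprisal of chosen category, entropy, deviation) in bits."""
    n = sum(counts.values())
    entropy = sum(c / n * math.log2(n / c) for c in counts.values() if c > 0)
    s = math.log2(n / counts[chosen])
    return s, entropy, s - entropy

# Near consensus: 7 of 8 raters agree, one dissents.
strong = dissent_summary({"X": 7, "Y": 1}, "Y")
# Weak consensus: the 4, 2, 1, 1 split from the earlier example.
weak = dissent_summary({"A": 4, "B": 2, "C": 1, "D": 1}, "D")

print([round(v, 2) for v in strong])  # [3.0, 0.54, 2.46]
print([round(v, 2) for v in weak])    # [3.0, 1.75, 1.25]
```

The lone dissenter has the same surprisal of 3 in both cases. Only the comparison with the entropy separates a departure from a strong consensus from a departure from a weak one.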

For the above examples, the efforts to quantify the degree of agreement with the consensus may seem a bit silly. After all, you can just look at the data and offer an opinion about how well someone agreed with the consensus. But in applications where a large number of raters evaluate and classify a large number of objects, the ability to quantify becomes more important. It allows you to select those objects which a particular rater viewed quite differently than the consensus across all raters.
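With many raters and many objects, the screening might look something like the following Python sketch. Everything here is illustrative: the data are made up, the function names are mine, the cutoff of 2.0 is an arbitrary assumption rather than a recommended value, and counting a missing rating as its own category is just one way to handle it.

```python
import math
from collections import Counter

def deviations_by_object(ratings):
    """ratings[obj] is a list of categories, one per rater (None = no rating).

    Returns, for each object, each rater's surprisal minus the entropy.
    """
    result = {}
    for obj, row in ratings.items():
        # Assumption: a missing rating is counted as its own category.
        labels = ["missing" if r is None else r for r in row]
        n = len(labels)
        counts = Counter(labels)
        surprisals = [math.log2(n / counts[r]) for r in labels]
        entropy = sum(surprisals) / n
        result[obj] = [s - entropy for s in surprisals]
    return result

def flag_objects(ratings, rater, threshold=2.0):
    """Objects where the given rater (0-based index) exceeds the cutoff."""
    devs = deviations_by_object(ratings)
    return [obj for obj, d in devs.items() if d[rater] > threshold]

# Hypothetical data: three objects, eight raters each.
ratings = {
    "obj1": ["A"] * 7 + ["C"],
    "obj2": ["A", "A", "B", "B", "C", "C", "A", "B"],
    "obj3": ["C"] * 8,
}
print(flag_objects(ratings, rater=7))  # ['obj1']
```

The last rater is flagged only on obj1, where they departed from a near-perfect consensus. On obj2 no choice is far from the average surprisal, and on obj3 everyone agrees.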

In a sperm morphology assessment workshop, 39 raters were asked to classify 80 sperm images as either normal, borderline, or abnormal. For some images one or two raters did not provide a classification, which I treated as a fourth category.

Some images were relatively uncontroversial, with almost all raters providing the same classification. Others were more controversial and individual raters could not agree among themselves which classification was appropriate.

This is obviously a good situation, with rater #30 agreeing with everyone else that the sperm image was abnormal.

Here's another good situation, with rater #30 agreeing with more than 80% of the other raters that this image was normal.

Here, the rater did not choose the most popular classification (borderline), but enough raters chose the abnormal classification for this to count as acceptable agreement.

Here's another example of acceptable agreement.

Here's a sperm cell image where pretty much any choice should be considered acceptable because each choice garnered only 30-40% of the raters.

Here's an example of poor agreement. Rater #30 chose a classification that was chosen by only 20% of the other raters.

Here's another example of poor agreement.

and another.

Here's an example of very poor agreement: selecting a classification that only 10% of the other raters selected.

Here's another example of very poor agreement.

Agreement with consensus is not necessarily the best choice. If there is an expert who can provide an accurate classification, you should compare your results to the expert.