P.Mean: Cumulative entropy as a measure of data quality (created 2008-08-11).

I was talking to someone about some of my work with control charts, and they asked a question out of the blue. A lot of data sources that might be candidates for my control chart software have potential problems with data quality. Did I have any thoughts about ways to screen for poor data quality?

There are certainly many things you can do to screen for poor data quality. Here are some examples:

1. There are three possible values for gender (male, female, and unknown). If you have a variable for gender with four values, you know that you are on the wrong planet.
2. Certain dates have to occur in certain orders. If you have a birthdate of July 1, 2008 and an examination date of August 3, 2005, you either have an error in your data or an extremely far-sighted prenatal care program.
3. Certain medical measurements are incompatible with life, such as an oral temperature above 45 degrees Celsius.
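Checks like these are easy to automate once you know the context. A minimal sketch in Python (the field names and record layout here are invented for illustration, not taken from any particular data set):

```python
from datetime import date

# Hypothetical record; the field names are assumptions for illustration.
record = {"gender": "F",
          "birth_date": date(2008, 7, 1),
          "exam_date": date(2005, 8, 3),
          "oral_temp_c": 37.2}

def check_record(rec):
    """Return a list of context-specific data quality problems."""
    problems = []
    if rec["gender"] not in {"M", "F", "U"}:      # male, female, unknown
        problems.append("unexpected gender code")
    if rec["exam_date"] < rec["birth_date"]:      # dates must occur in order
        problems.append("examination precedes birth")
    if rec["oral_temp_c"] > 45:                   # incompatible with life
        problems.append("oral temperature too high")
    return problems

print(check_record(record))  # ['examination precedes birth']
```

The point is not the particular rules but that each one had to be written by someone who understood the variables.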

The problem with these examples is that they are context specific. They require an understanding of the context of the variables, and acquiring such understanding is labor intensive. Is there a method that could be applied to large data sets, where it is difficult to get this type of information?

Automated screening for poor data quality is always going to be imperfect, because there are many examples of data values that look perfectly normal if you know nothing of the context. Furthermore, any automated procedure is likely to flag as a problem certain data values that are perfectly legitimate. Still, there can be substantial value in starting a check of data quality with an automated procedure and then supplementing it with additional context specific quality checks.

I would like to explore the use of cumulative entropy as a measure of data quality. It may allow for a simple method that can screen for certain types of data quality problems: misaligned data and introduction of new undocumented codes.

Although entropy can be defined for a continuous random variable, I will limit discussion to the definition of entropy that applies for discrete random variables.

Suppose that a categorical variable has possible values 1, 2, ..., k and that the probability that it equals a particular value i is p_i.

You can then define entropy as

H = -(p_1 log2 p_1 + p_2 log2 p_2 + ... + p_k log2 p_k)

where the logarithms are base 2 and any term with p_i = 0 is treated as zero.

This formula can be easily generalized to categories that are non-consecutive integers or to non-numeric categories.

The formula for entropy seems a bit unusual, but it works well in a variety of different circumstances. For a given number of categories, entropy is large when each category has roughly the same proportion. Entropy is small when the probability is concentrated in a few specific categories.

Suppose that a random variable has values 1 through 5 with 50% probability on the first value and 12.5% probability on the remaining four values. The entropy would be

H = -(0.5 log2 0.5 + 4 × 0.125 log2 0.125)

which equals

0.5 + 1.5

or 2.
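The calculation above can be checked with a short function. This is a minimal sketch in Python (the post's own output appears to come from R):

```python
import math

def entropy(probs):
    """Shannon entropy, in bits, of a discrete probability distribution."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# 50% on the first value, 12.5% on each of the remaining four
print(entropy([0.5, 0.125, 0.125, 0.125, 0.125]))  # 2.0

# A uniform distribution over five values gives the maximum possible
# entropy for five categories, log2(5), about 2.32 bits.
print(entropy([0.2] * 5))
```

Notice that concentrating probability on one value (here, the first) lowers the entropy below the uniform-distribution maximum, which is the behavior described above.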

I am proposing that the cumulative entropy of a column of data is a measure that can sometimes highlight data quality problems. It will not identify typing errors that stray randomly into a data set, but rather will highlight a sudden shift in uncertainty that may indicate problems with misaligned data or introduction of new undocumented codes.

Suppose you have a column of categorical data of length n. The cumulative entropy at observation j ≤ n is simply the entropy of the first j data points. A plot of the cumulative entropy will indicate a sudden shift in the uncertainty of a process, which could be caused by the introduction of new undocumented codes or by a misalignment of data. Suppose you have a column of data representing the gender of a patient. The values for the column of data are

```
  [1] "F" "F" "M" "F" "F" "M" "M" "M" "M" "F" "M" "M" "F" "M" "F" "F" "F" "F" "M" "M"
 [21] "F" "M" "M" "F" "M" "M" "F" "M" "M" "F" "M" "F" "M" "F" "M" "M" "F" "M" "M" "F"
 [41] "M" "M" "F" "M" "M" "M" "F" "F" "M" "M" "F" "M" "M" "F" "F" "M" "F" "F" "M" "M"
 [61] "M" "F" "F" "F" "M" "M" "F" "M" "F" "F" "M" "M" "M" "F" "F" "M" "M" "F" "M" "M"
 [81] "F" "M" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "F" "M" "F" "F" "F" "M" "M" "M"
```

The cumulative entropy of this data would be

```
  [1] 0.00 0.00 0.92 0.81 0.72 0.92 0.99 1.00 0.99 1.00 0.99 0.98 1.00 0.99 1.00 1.00 1.00 0.99 1.00 1.00
 [21] 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 1.00 0.99 1.00 0.99 1.00 0.99 0.99 1.00 0.99 0.99 0.99
 [41] 0.99 0.99 0.99 0.99 0.98 0.98 0.98 0.99 0.99 0.98 0.99 0.98 0.98 0.98 0.99 0.99 0.99 0.99 0.99 0.99
 [61] 0.98 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99 0.99
 [81] 0.99 0.99 0.99 0.99 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00
```
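The cumulative entropy values can be computed by updating the category counts one observation at a time and recomputing the entropy of the empirical proportions. A minimal sketch in Python (the output above appears to come from R), run here on the first eight gender codes:

```python
import math
from collections import Counter

def cumulative_entropy(values):
    """Entropy of the first j observations, for j = 1..n, using
    the empirical proportions observed so far."""
    counts = Counter()
    result = []
    for j, v in enumerate(values, start=1):
        counts[v] += 1
        result.append(-sum((c / j) * math.log2(c / j)
                           for c in counts.values()))
    return result

# First eight values of the gender column shown above
genders = ["F", "F", "M", "F", "F", "M", "M", "M"]
print([round(h, 2) for h in cumulative_entropy(genders)])
# [0.0, 0.0, 0.92, 0.81, 0.72, 0.92, 0.99, 1.0]
```

These rounded values match the start of the cumulative entropy sequence shown above.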

A plot of the cumulative entropy

After the first few values, the cumulative entropy settles down on a value consistent with a roughly 50/50 split between males and females. Suppose, however, that partway through the data set, the columns became misaligned, so that a different column of data, marital status, was placed in the data. Marital status would have four values (S=single, M=married, D=divorced, W=widowed). The data set would now look like

```
  [1] "F" "F" "M" "F" "F" "M" "M" "M" "M" "F" "M" "M" "F" "M" "F" "F" "F" "F" "M" "M"
 [21] "F" "M" "M" "F" "M" "M" "F" "M" "M" "F" "M" "F" "M" "F" "M" "M" "F" "M" "M" "F"
 [41] "M" "M" "F" "M" "M" "M" "F" "F" "M" "M" "F" "M" "M" "F" "F" "M" "F" "F" "M" "M"
 [61] "D" "M" "D" "D" "W" "D" "D" "M" "M" "S" "M" "M" "W" "M" "M" "M" "M" "S" "S" "S"
 [81] "M" "M" "S" "M" "S" "S" "M" "D" "M" "S" "S" "M" "M" "M" "M" "M" "S" "M" "D" "D"
```

and the cumulative entropy would now be

```
  [1] 0.00 0.00 0.92 0.81 0.72 0.92 0.99 1.00 0.99 1.00 0.99 0.98 1.00 0.99 1.00 1.00 1.00 0.99 1.00 1.00
 [21] 1.00 1.00 1.00 1.00 1.00 1.00 1.00 1.00 0.99 1.00 0.99 1.00 0.99 1.00 0.99 0.99 1.00 0.99 0.99 0.99
 [41] 0.99 0.99 0.99 0.99 0.98 0.98 0.98 0.99 0.99 0.98 0.99 0.98 0.98 0.98 0.99 0.99 0.99 0.99 0.99 0.99
 [61] 1.09 1.09 1.16 1.21 1.31 1.35 1.39 1.38 1.38 1.46 1.46 1.45 1.51 1.50 1.49 1.48 1.47 1.53 1.57 1.61
 [81] 1.60 1.59 1.62 1.61 1.64 1.66 1.66 1.68 1.67 1.69 1.71 1.70 1.70 1.69 1.68 1.67 1.69 1.68 1.70 1.72
```

Notice that a plot of the cumulative entropy shows a sudden rise that occurs after the aberrant data slips in.

This approach does not make sense, of course, if the data set is sorted by gender. The sort order has to represent a variable that is conceptually unrelated to the key research variables. The order in which the data was entered or the time at which subjects entered a trial are two logical sort orders.

This approach is also only useful for detecting certain problems with data quality and will totally miss other problems. If there are a large number of keypunch errors, but those errors are distributed uniformly throughout the file, the cumulative entropy will be larger than it should be, but it will not show a sudden shift. On the flip side, certain factors, such as a change in inclusion criteria, could produce a false alarm.

There are some simple extensions to different situations. First, certain descriptive measures on a column of data, such as the number of digits recorded after the decimal point or the length of a string after removing trailing blanks, can be analyzed for sudden shifts in entropy. Second, if there are multiple people doing data entry, each person's entropy could be compared to the overall entropy.
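The second extension can be sketched briefly: compute the entropy of each data entry person's column separately and compare it to the entropy of the pooled data. A person whose entropy is well above the overall value may be introducing stray codes. A minimal sketch in Python; the clerk names and their data are invented for illustration:

```python
import math
from collections import Counter

def entropy(values):
    """Shannon entropy, in bits, of the empirical distribution of a column."""
    n = len(values)
    return -sum((c / n) * math.log2(c / n) for c in Counter(values).values())

# Hypothetical gender columns keyed by data entry person.
# The second clerk has let some stray codes slip in.
by_person = {
    "entry_clerk_1": ["F", "M", "F", "M", "F", "M", "M", "F"],
    "entry_clerk_2": ["F", "M", "D", "S", "W", "M", "F", "S"],
}
pooled = [v for col in by_person.values() for v in col]
for person, col in by_person.items():
    print(person, round(entropy(col), 2),
          "vs. overall", round(entropy(pooled), 2))
```

Here the first clerk's entropy sits at 1 bit (a 50/50 gender split) while the second clerk's is well above it, flagging that column for a closer look.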

I will try to add more to this as I have time.