P.Mean: Applying the sequence logo concept to data quality (created 2008-09-04).

I am trying to adapt the logo graph used in genetics to an examination of data quality. I am just starting this, so the graphs are a bit crude. I took the 1973 NAMCS data set and calculated entropy for each column of data. This is a massive data set with 29,210 rows and 85 columns.

I displayed a bar chart at each position, where the height of each bar equals log2(95) minus the calculated entropy, that is, the base 2 logarithm of 95 (about 6.57 bits) less the entropy observed in that column. The value of 95 reflects the fact that standard 7 bit ASCII has 95 printable characters (including the blank). The maximum entropy in 7 bit ASCII occurs when each of the 95 characters occurs equally often. The difference between the maximum entropy and the observed entropy is sometimes referred to as the information content. A large value for information means (among other things) that the sequence is easily compressible. In our setting, a high information value represents a column whose values are more regular, consistent, and predictable. This is a column where unusual values are easy to recognize.
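The calculation described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used to draw the graphs: the function names `column_entropy` and `information_content` are hypothetical, the file is assumed to be fixed width with one character per column position, short lines are assumed to be padded with blanks, and the toy rows are made up rather than taken from NAMCS.

```python
import math
from collections import Counter

# Maximum entropy (in bits) when all 95 printable ASCII characters are equally likely.
MAX_BITS = math.log2(95)

def column_entropy(chars):
    """Shannon entropy, in bits, of the characters observed in one column."""
    counts = Counter(chars)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

def information_content(lines, width):
    """Return log2(95) minus the entropy for each of the first `width` columns."""
    info = []
    for pos in range(width):
        # Assumption: pad short lines with blanks so every row
        # contributes exactly one character to every column.
        chars = [line.ljust(width)[pos] for line in lines]
        info.append(MAX_BITS - column_entropy(chars))
    return info

# Hypothetical three-row, four-column file (not the NAMCS data).
toy_lines = ["0173", "0274", "1173"]
toy_info = information_content(toy_lines, 4)
```

In the toy file the third column holds a 7 in every row, so its entropy is zero and its bar reaches the full height of log2(95). The first column mixes 0 and 1, so its bar is shorter.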

Inside each bar, I pasted a bitmap of the particular ASCII characters that occur in the column, with the height of each character equal to its proportion among the rows. Note that the higher bars have fewer distinct values and the highest bar represents a column with 0 entropy, which can only occur when a single ASCII value occurs in that column for each and every row in the data set.
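As a rough sketch of how the characters might be stacked inside one bar, the following Python computes each character's share of a column and scales it by the bar height. The names `character_proportions` and `stack_heights` are hypothetical, and scaling each share by the bar height is an assumption about how "equal to their proportion" translates into drawing units.

```python
from collections import Counter

def character_proportions(chars):
    """Return (character, proportion) pairs for one column, most common first."""
    counts = Counter(chars)
    total = sum(counts.values())
    # most_common() orders the characters by descending count.
    return [(ch, n / total) for ch, n in counts.most_common()]

def stack_heights(chars, bar_height):
    """Split a bar of the given height among the characters by proportion."""
    return [(ch, p * bar_height) for ch, p in character_proportions(chars)]

# Hypothetical columns: one constant, one with two values.
constant_column = character_proportions("7777")
mixed_column = character_proportions("3343")
```

A constant column yields a single character that fills the whole bar, which is the zero entropy case described above.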

```
         1         2         3         4         5         6         7         8
1234567890123456789012345678901234567890123456789012345678901234567890123456789012345
```

Notice that column 1 has only two possible values, 0 and 1, with 0 the more common value. The second column has multiple numeric values, with 1 and 2 slightly more probable than the other values.

The third and fourth columns show an interesting pattern. Column 3 has a 7 in each and every row, while column 4 has either a 3 or a 4 (with 3 more probable).

In columns 14 to 22, zeros are very common but other numeric values occur infrequently. Columns 25 through 38 show a binary response with mostly 0 but an occasional 1. We see another dominance of zeros in columns 42 through 75 with a few notable exceptions (columns 63 and 70).

This logo gives a very nice indication of the structure of a data file, and it is entirely independent of any context-specific information. In subsequent webpages, I want to show how this logo can be used to identify unusual rows that may be indicative of data quality problems.

Update: 2008-09-06. Here is an alternate version of this graph.

It has better axis labels, repeats the ASCII characters rather than stretching them, and uses color codes for numbers (black), upper case letters (blue), lower case letters (green), symbols (red), and blanks (yellow). This image is probably too wide for most browsers. Please accept my apologies. I will work on this some more.
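The color coding above amounts to a simple lookup by character class. Here is a minimal Python sketch, assuming plain 7 bit ASCII input. The function name `character_color` is hypothetical, and the color names are just the labels listed above rather than any particular plotting palette.

```python
import string

def character_color(ch):
    """Map one ASCII character to the color class described in the text."""
    if ch == " ":
        return "yellow"   # blanks
    if ch in string.digits:
        return "black"    # numbers
    if ch in string.ascii_uppercase:
        return "blue"     # upper case letters
    if ch in string.ascii_lowercase:
        return "green"    # lower case letters
    return "red"          # everything else is treated as a symbol

# Colors for a short, made-up string of characters.
example_colors = [character_color(ch) for ch in "7Aa# "]
```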

(Update: 2008-09-09) Here is a logo for a different data set. I've moved the ASCII symbols below the graph.

(Update: 2008-09-11) I realized last night that turning the graph sideways would allow me to fit things better. Here is an example. I also darkened some of the colors (yellow to orange, green to dark green).

The white lines after every tenth bar are an artifact of my splitting the file into smaller pieces and then re-assembling them, but I think I like the artifact.