Using information theory to identify discrepancies within and between text files (created 2010-09-02).

I have been experimenting with the use of information theory to identify patterns in text data files. This work is somewhat preliminary, but it has some exciting possibilities. If certain patterns occur frequently at a given column of a text data file (e.g., always the letters "A" or "B"), then those columns are good places to look for aberrant data that might be caused by a typographical error, a misalignment of the row of data, or a deviation from the code book. I want to show some preliminary graphs that illustrate what these patterns look like for some files I am working with. Warning: this is a very large webpage with graphics that extend across dozens of pages!

The files come from the National Hospital Ambulatory Medical Care Survey, a yearly survey conducted by the Centers for Disease Control and Prevention. I want to look at Emergency Department (ED) files in 2005, 2006, and 2007.

The graph shown below lists the patterns found in the 2005 ED data set. The length of the bar is a measure of the consistency of the information in a column of text, and the maximum length occurs when one character appears 100% of the time in that column. A bar of zero length represents the case of minimal consistency, where every one of the 96 possible characters occurs at the same rate. I'm assuming here that there are no special characters, such as accented letters or letters from other alphabets.
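Here is a minimal sketch of one measure that fits this description: one minus the Shannon entropy of the column, divided by the largest possible entropy for 96 characters. The formula is an assumption chosen to match the two endpoints described above (full length for a constant column, zero length for a uniform one), and the function name column_consistency is made up for illustration.

```python
import math
from collections import Counter

ALPHABET_SIZE = 96  # the number of possible characters assumed in the text


def column_consistency(lines, alphabet_size=ALPHABET_SIZE):
    """Return one score per column: 1.0 = one character always, 0.0 = uniform.

    Assumption: consistency = 1 - H / log2(alphabet_size), where H is the
    Shannon entropy (in bits) of the characters observed in that column.
    Rows shorter than the widest row are padded with blanks.
    """
    width = max((len(line) for line in lines), default=0)
    scores = []
    for col in range(width):
        counts = Counter(line[col] if col < len(line) else " " for line in lines)
        total = sum(counts.values())
        entropy = sum((n / total) * math.log2(total / n) for n in counts.values())
        scores.append(1 - entropy / math.log2(alphabet_size))
    return scores


# Column 0 is always "A"; column 1 is an even split between "X" and "Y".
print(column_consistency(["AX", "AY"]))
```

With this definition, an even split between two characters costs exactly one bit of entropy, so the second column scores a bit below the first.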

I've color coded the data with

• orange representing a blank,
• black representing the digits 0-9,
• blue representing uppercase letters,
• dark green representing lowercase letters, and
• dark red representing symbols.

I don't use information about the colors in the analysis, though this is a possible development worth considering. A column that is a mix of 6 numbers does, in an intuitive sense, appear to have more consistency than a column that is a mix of 3 numbers, 2 uppercase letters, and 1 symbol.
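To make that intuition concrete, here is a small sketch that computes entropy twice for each of those two hypothetical columns: once over the characters themselves and once over the five color classes listed above. This is only an illustration of the possible development, not something used in the graphs, and the names char_class and entropy are invented for the example.

```python
import math
from collections import Counter


def char_class(ch):
    """Map a character to one of the five color classes (ASCII only)."""
    if ch == " ":
        return "blank"    # orange
    if "0" <= ch <= "9":
        return "digit"    # black
    if "A" <= ch <= "Z":
        return "upper"    # blue
    if "a" <= ch <= "z":
        return "lower"    # dark green
    return "symbol"       # dark red


def entropy(items):
    """Shannon entropy in bits of the observed items."""
    counts = Counter(items)
    total = sum(counts.values())
    return sum((n / total) * math.log2(total / n) for n in counts.values())


six_digits = "123456"  # a mix of 6 numbers
mixed = "123AB#"       # 3 numbers, 2 uppercase letters, 1 symbol

for column in (six_digits, mixed):
    by_char = entropy(column)
    by_class = entropy(char_class(ch) for ch in column)
    print(column, round(by_char, 3), round(by_class, 3))
```

At the character level the two columns are indistinguishable (six equally likely characters each), but at the class level the all-digit column has zero entropy while the mixed column does not.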

The bars are split into boxes by white lines, with characters sorted by their probabilities (highest probability on the left, lowest probability on the right). Note that the very low probability characters are all squished together. That's okay because the main focus should be on the characters with high probability. Also, I'm hoping to create a flexible graph that can expand wider if someone is interested in those low probability cases.
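The layout of a single bar can be sketched as follows, assuming each box gets a share of the bar proportional to its character's probability. The function name bar_segments and the bar_length parameter are hypothetical; the actual plotting code is not shown here.

```python
from collections import Counter


def bar_segments(column_chars, bar_length=1.0):
    """Return (character, box width) pairs, most frequent character first.

    Assumption: each box is bar_length times the character's observed
    proportion, so the box widths add up to bar_length.
    """
    counts = Counter(column_chars)
    total = sum(counts.values())
    return [(ch, bar_length * n / total) for ch, n in counts.most_common()]


# A column that is "A" three times out of four.
print(bar_segments("AABA"))
```

Rare characters end up with very narrow boxes at the right end of the bar, which is why they look squished together.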

Here's a side by side comparison of 2005 and 2006.

Notice that the files line up nicely until column 292, where the 2006 data inserts a new variable, POPCT. I want to look at ways to use information theory to realign two data sets when extra variables are inserted in one of the files.
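One way this realignment might work is sketched below. It is an assumption, not a method I have settled on: build a character distribution for each column of each file, scan for the first column where the two distributions differ sharply (measured here by total variation distance), and then try small shifts to see which one brings the following columns back into agreement. The function names, the threshold, and the window size are all invented for illustration.

```python
from collections import Counter


def column_profiles(lines):
    """Return one {character: proportion} dict per column."""
    width = max((len(line) for line in lines), default=0)
    profiles = []
    for col in range(width):
        counts = Counter(line[col] for line in lines if col < len(line))
        total = sum(counts.values())
        profiles.append({ch: n / total for ch, n in counts.items()})
    return profiles


def total_variation(p, q):
    """Distance between two distributions: 0 = identical, 1 = disjoint."""
    return 0.5 * sum(abs(p.get(ch, 0.0) - q.get(ch, 0.0)) for ch in set(p) | set(q))


def find_insertion(old, new, threshold=0.5, max_shift=20, window=5):
    """Return (column, shift) for the first mismatch, or None if none is found.

    Columns are 0-indexed. shift is the number of columns inserted in `new`
    that best restores agreement over the next `window` columns.
    """
    for col in range(min(len(old), len(new))):
        if total_variation(old[col], new[col]) <= threshold:
            continue
        best = None
        for shift in range(1, max_shift + 1):
            pairs = [(old[col + k], new[col + k + shift])
                     for k in range(window)
                     if col + k < len(old) and col + k + shift < len(new)]
            if not pairs:
                break
            cost = sum(total_variation(p, q) for p, q in pairs) / len(pairs)
            if best is None or cost < best[0]:
                best = (cost, shift)
        return col, (best[1] if best else None)
    return None


# Toy example: the second file has one extra column ("X") inserted at column 1.
old_file = ["ABCD", "ABCD", "ABCD"]
new_file = ["AXBCD", "AXBCD", "AXBCD"]
print(find_insertion(column_profiles(old_file), column_profiles(new_file)))
```

Real survey files are messier than this toy example, since a variable can also change its coding from year to year without any shift, so the threshold and window would need tuning.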

Here's the graph comparing 2006 and 2007.

There are quite a few discrepancies between 2006 and 2007, too many to summarize here.