Finding duplicate records in a 19 million record database (created 2010-03-02).

I was asked to help find duplicate records in a large database (19 million records). The number of duplicates was suspected to be small, possibly around 90. My colleague's approach was to run PROC FREQ in SAS on the "unique" id and then look for ids with a frequency greater than 1. That did not work--it took too long, or it overloaded the system, or both. So I wanted to look at alternatives that would identify the duplicate records more efficiently.

When you're dealing with a large database, you want to avoid activities that require multiple passes through the data or that store very large temporary files. Computing frequencies would produce a table with almost 19 million rows, which would take three to four hundred thousand pages to display.

I also wanted to be cautious about sorting the data needlessly. A sort requires multiple passes through the data and is time consuming for very large data sets. I was told that this data set was already sorted by the id variable, but I adopt a Ronald Reagan strategy for data management: "Trust, but verify."

There's a procedure in SPSS that looks for duplicate records (from the SPSS menu, click on DATA | IDENTIFY DUPLICATE CASES), but when I ran it, it started sorting the data set. Time to click abort before my machine was tied up for hours. In retrospect, I didn't need to be quite so paranoid (see below).
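For what it's worth, that dialog pastes SPSS syntax roughly along the lines of the sketch below (PrimaryLast is just a placeholder name for the indicator variable the dialog creates). The very first command it generates is the SORT CASES step I was trying to avoid.

  * Rough sketch of what the Identify Duplicate Cases dialog generates.
  SORT CASES BY UNIQUE (A).
  MATCH FILES
    /FILE=*
    /BY UNIQUE
    /LAST=PrimaryLast.
  EXECUTE.
  * PrimaryLast equals 1 for the last record within each value of UNIQUE, 0 otherwise.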

My next thought was to use the LAG function. The id variable in this data set was labelled (mislabelled?) UNIQUE. If an id is duplicated and the data is sorted, then the second, third, etc. duplicate records can be identified by the logical expression UNIQUE=LAG(UNIQUE). So I created a new variable called DUP which was equal to one if UNIQUE equaled LAG(UNIQUE) and zero otherwise. Then select cases where DUP=1 and delete the rest.
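In SPSS syntax this looks roughly like the sketch below (UNIQUE is the id variable in this data set; note that LAG returns a missing value for the very first case, so the first record can never be flagged).

  * Flag a record as a duplicate when its id matches the id on the previous record.
  COMPUTE DUP = (UNIQUE = LAG(UNIQUE)).
  EXECUTE.
  * Keep only the flagged duplicates and delete the rest.
  SELECT IF (DUP = 1).
  EXECUTE.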

This produced 98 ids that were duplicates. The whole thing took about a minute of computing time.

I wanted to double-check this calculation (I apply the "Trust, but verify" philosophy to my own work as well), so I ran the AGGREGATE procedure.

Unlike the search for duplicate records in SPSS, the aggregate dialog box allows you to inform SPSS that the data is already sorted. It also allows you to place the aggregated statistics back in the original file. So I created a new variable, N_BREAK, that equaled the number of records with a particular id. Then I selected cases where N_BREAK was greater than 1, deleting the rest. This file had 196 records, two for every one of the duplicated ids. This took about five minutes.
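Here is a sketch of that step in syntax, again assuming the id variable is named UNIQUE. PRESORTED tells AGGREGATE to skip the sort, and MODE=ADDVARIABLES writes the count back onto every record in the original file.

  * Count the records within each id and attach that count (N_BREAK) to each record.
  AGGREGATE
    /OUTFILE=* MODE=ADDVARIABLES
    /PRESORTED
    /BREAK=UNIQUE
    /N_BREAK=N.
  * Keep only the ids that appear more than once.
  SELECT IF (N_BREAK > 1).
  EXECUTE.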

At this point, I broke down and decided to see how long it would take to sort the entire file. A full sort of the data took about ten minutes. That's an order of magnitude slower than using the LAG function. For this database, it would probably be okay to be a bit inefficient. For a much larger data set, say 190 million records rather than 19 million, you would definitely want to avoid any unnecessary sorts. With a data set that size, you are looking at 100 minutes to sort the data, or possibly even worse, since sorting time grows a bit faster than linearly with the number of records.
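For the record, the full sort I timed was nothing fancier than a single SORT CASES command on the id variable.

  SORT CASES BY UNIQUE (A).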

But then I got to wondering: how could I use the LAG function on a data set that I thought was sorted, but that I was not 100% sure about? It turns out there is a clever trick. Make the computation a bit more complex by adding +100*(UNIQUE < LAG(UNIQUE)) to the end of the expression.

This would create a new variable that equals one when duplicate consecutive records are found but equals 100 for any record that is not in ascending order. If all the non-zero values of DUP are equal to 1 rather than 100, I can state with authority that the data is indeed sorted and that all the duplicate cases have been found.
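A sketch of the combined check in syntax (same assumed variable names as before); since DUP takes on only a handful of distinct values, even a frequency table of it is cheap to compute.

  * DUP = 1 flags a consecutive duplicate; DUP = 100 flags a record that is out of order.
  COMPUTE DUP = (UNIQUE = LAG(UNIQUE)) + 100*(UNIQUE < LAG(UNIQUE)).
  EXECUTE.
  * A frequency table of DUP verifies both conditions at once.
  FREQUENCIES VARIABLES=DUP.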

Just for kicks, I tried to run FREQUENCIES in SPSS to compare to the disastrous PROC FREQ that my colleague had run in SAS. It froze the computer.