Lessons learned the hard way: don't throw good money after bad (created 2010-05-14).

I am helping out with data management for a project involving 19 million records from an insurance database. The file is too big to be read into R in one piece, so I decided to read in successive segments of 100,000 records and then write them out again as separate files. This was a big mistake and showed me the importance of the saying: "Don't throw good money after bad."

The syntax for reading a group of lines in the middle of a comma separated value (CSV) file is

zzz <- paste(rep(c("000","00","0",""),c(10,90,900,9000)),0:9999,sep="")  # zero-padded labels "0000".."9999"
for (i in 0:189) {  # 19 million records / 100,000 per segment = 190 segments
 dy <- read.csv("g4.csv",header=FALSE,as.is=TRUE,nrows=10^5,skip=1+i*10^5)
 write.csv(dy,paste("g4_",zzz[i+1],".csv",sep=""),row.names=FALSE)  # output file name is illustrative
}

Don't worry too much if you don't know R. The clever trick (which was not so clever) was to repeatedly read the same file, skipping a different number of lines each time. That keeps you from reading too much data at once and overloading the system. The problem is that R is not efficient at skipping lines: the skip argument still reads and discards every skipped line, so each pass re-scans the file from the top. At first you don't notice this, because the amount being skipped is fairly trivial. But once you start skipping over 1 million, 2 million, 3 million lines, the program slows down, and the total work grows roughly quadratically with the number of segments. There's probably a way to read a certain amount of data, pause to process that data, then read a bit more, but I couldn't figure it out at the time.
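For what it's worth, one way to get that read-pause-read-more behavior in R is to open the file as a connection once: read.csv then continues from the connection's current position on each call, so no lines are ever skipped twice. Here is a minimal sketch on a small demo file (the file names and chunk sizes are illustrative, not from the original project):

```r
# Build a small demo CSV: 250 rows, no header (stand-in for the big file).
writeLines(paste(1:250, "x", sep = ","), "demo.csv")

con <- file("demo.csv", open = "r")  # open the connection once
i <- 0
repeat {
  # Reading from an open connection resumes at the current position;
  # at end of file read.csv raises an error, which we turn into NULL.
  dy <- tryCatch(read.csv(con, header = FALSE, nrows = 100),
                 error = function(e) NULL)
  if (is.null(dy) || nrow(dy) == 0) break
  i <- i + 1
  write.csv(dy, sprintf("piece_%03d.csv", i), row.names = FALSE)
}
close(con)
i  # number of pieces written: 3 (chunks of 100, 100, and 50 rows)
```

Each chunk is read in constant time regardless of how deep into the file you are, which is exactly what the skip-based loop fails to do.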

Anyway, there's nothing wrong with inefficiency in most settings, as the only thing wasted is a few electrons. But when the data set gets as big as this one, you start to see some serious time issues.

You could watch the progress of the program as the files were being written. The first several files took two minutes apiece, but then the time crept up to three minutes per file, then four. I figured this was okay and hated to interrupt the process, since I had already sunk so much time into the program. By the time 50 of the files were processed, the program had pretty much eaten up my whole day. But since I was working at home, I could just let it keep chugging. As the saying goes, in for a penny, in for a pound.

The next morning, though, the program still hadn't finished, and now it was taking almost twenty minutes for each new file. But I was so close, with about 150 of the 190 files processed. Couldn't I just wait it out? Then I did the math: 20+ minutes per file times 40 remaining files comes to more than 13 hours. Finally, I pulled the plug and tried a different approach. I used a third-party utility that you can download from the web for free. This program split the file into 191 pieces in three minutes. THREE MINUTES?!? My original program was still working on the second piece at the three-minute mark.

Once the program had taken an excessive amount of time (say 30 minutes), I should have interrupted it, even at the risk of losing those 30 minutes. But it was difficult, psychologically, to make that break. Have you ever been in a very long line that is moving slowly forward? Once you're in, you hate to leave, especially when you see all the other people lining up behind you. It is ironic, but the longer you wait, the more you hate to give up.

I need to have a bit more discipline. If I've already wasted an hour and only have a little to show for it, it might make sense to back up and try an alternate approach. The saying is "don't throw good money after bad" and I would adapt it to this setting as "don't throw good time after bad."

Some of you might take this as a sign that R is a bad program. Actually, it is not, if you know how to use it well. In many ways it saved me time over what I could have done in other programs like SAS or SPSS. It worked like a dream once I got the files split. But splitting those files was not something I should have done in R, at least not with the code listed above.

Large data sets require a bit of finesse. With small data sets, you can wrestle them easily to the ground and get them to submit to your will. Brute force techniques are okay for these files because there is little difference between 3 milliseconds and 300 milliseconds. But for large files, 3 minutes and 300 minutes are quite different. You need to think a bit before you compute.