Data management (created 2007-06-20).

Data management is the foundation of every good data analysis. You need to consider issues like how your data are entered, documented, and stored. Careful attention to these issues now will help save you time and frustration during your data analysis. Articles are arranged by date with the most recent entries at the top. You can find outside resources at the bottom of this page. Other entries about data management can be found in the data management page at the StATS website.

2011

  1. P.Mean: Discrepancies in the chisquare test (created 2011-12-16). I was working with two researchers on a project and they got different results for their chisquare tests. See if you can find out what went wrong.
  2. P.Mean: Using a binary coding trick illustrated by a Car Talk puzzler (created 2011-05-21). I often need to see how often certain variables and combinations of those variables appear in a data set. If the variable is binary, there is a trick for doing this that is illustrated by a Car Talk puzzler.
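
The entry above does not spell out the trick, so what follows is only one plausible reading of it: weight each binary variable by a distinct power of two and add the results, so that every combination of values maps to a unique integer that can be tabulated in a single pass. The variable names and records are hypothetical.

```python
from collections import Counter

# Hypothetical records: each variable is coded 0 (absent) or 1 (present).
records = [
    {"smoker": 1, "diabetic": 0, "hypertensive": 1},
    {"smoker": 0, "diabetic": 0, "hypertensive": 1},
    {"smoker": 1, "diabetic": 0, "hypertensive": 1},
    {"smoker": 0, "diabetic": 1, "hypertensive": 0},
]
variables = ["smoker", "diabetic", "hypertensive"]


def pattern_code(record, variables):
    """Weight the i-th variable by 2**i so each combination gets a unique integer."""
    return sum(record[name] << i for i, name in enumerate(variables))


def decode(code, variables):
    """Recover the names of the variables that were 1 for a given code."""
    return [name for i, name in enumerate(variables) if (code >> i) & 1]


# One frequency table covers every variable and every combination at once.
counts = Counter(pattern_code(r, variables) for r in records)
for code, n in sorted(counts.items()):
    print(code, decode(code, variables), n)
```

Because the weights are powers of two, no two combinations can share a code, and the code can always be decoded back into the variables that were present.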

2010

  3. P.Mean: Dealing with a large text file that crashes your computer (created 2010-04-02). At a meeting, a colleague was describing a text file that he had received that had crashed his system. No way, I thought, could a simple text file crash your system. I offered to investigate and he was right. The text file crashed my system too, and repeatedly. Here's what I did to figure out how a simple text file could crash your computer.
  4. P.Mean: Finding duplicate records in a 19 million record database (created 2010-03-02). I was asked to help find duplicate records in a large database (19 million records). The number of duplicates was suspected to be small, possibly around 90. My colleague's approach was to run PROC FREQ in SAS on the "unique" id and then look for ids with a frequency greater than 1. That did not work: it took too long, overloaded the system, or both. So I wanted to look at alternatives that would identify duplicate records more efficiently.
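
As a rough illustration of the kind of alternative the last entry has in mind (not necessarily the method the article settles on), a single pass that remembers which ids have already been seen reports repeats without building a full frequency table. The function name and sample ids are hypothetical, and the sketch assumes the set of distinct ids fits in memory.

```python
def find_duplicate_ids(ids):
    """Return the ids that occur more than once, in order of their first repeat."""
    seen = set()        # every id encountered so far
    reported = set()    # ids already added to the result
    duplicates = []
    for record_id in ids:
        if record_id in seen and record_id not in reported:
            duplicates.append(record_id)
            reported.add(record_id)
        seen.add(record_id)
    return duplicates


# Hypothetical ids; in practice these would be streamed from the file or database.
sample_ids = ["A001", "A002", "A003", "A002", "A004", "A001", "A002"]
print(find_duplicate_ids(sample_ids))
```

If the ids do not fit in memory, sorting the file on the id and comparing each record with the one before it is a common fallback, since that needs only one record in memory at a time.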

2008

  5. P.Mean: A false sense of frugality (created 2008-12-17). A while back I received a data set that was very well documented, but there was one thing that I wish that the data entry person had not done. The demographic data was listed as 45f, 52m, 22m, 21f, etc. This was obvious shorthand for a 45 year old female, 52 year old male, and so forth.
  6. P.Mean: Naming conventions for variables (created 2008-07-30). For almost all statistical software programs, you can and should provide variable names for your data. Variable names are a short descriptive explanation of what resides in each column of data. You should choose a variable name that is short, concise, and descriptive.
  7. P.Mean: Undeclared missing code leads to bad results (created 2008-07-15). I found this ticket in a computer store many years ago and am just now getting around to showing it. It demonstrates how failure to declare a missing value code can lead to laughably incorrect results.
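
Two of the 2008 entries describe problems that are easy to show in a few lines. The sketch below first splits a combined value such as 45f into separate age and sex fields, and then shows how an undeclared missing value code distorts a mean. The code 999, the sample ages, and the function name are assumptions made for illustration; they do not come from the articles.

```python
import re
from statistics import mean


def split_age_sex(value):
    """Split a value like '45f' into (45, 'f'); return None if it does not match."""
    match = re.fullmatch(r"\s*(\d+)\s*([fmFM])\s*", value)
    if match is None:
        return None
    return int(match.group(1)), match.group(2).lower()


# The shorthand from entry 5, separated into two analyzable columns.
print([split_age_sex(v) for v in ["45f", "52m", "22m", "21f"]])

# Hypothetical missing value code and data for entry 7.
MISSING_CODE = 999
ages = [45, 52, 999, 22, 21]

naive_mean = mean(ages)  # treats 999 as a real age
declared = [a for a in ages if a != MISSING_CODE]
clean_mean = mean(declared)  # the missing code is excluded first
print(naive_mean, clean_mean)
```

Entering age and sex in separate columns from the start avoids the parsing step entirely, which is the point of the "false sense of frugality" entry.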

Newsletter articles:

  1. March/April 2011: Should I use a spreadsheet or a database to enter my data? I often get asked whether you should use a spreadsheet (like Microsoft Excel) to enter your data or a database (like Microsoft Access). The short answer is that for most projects it does not matter all that much. But here are some considerations that you should think about before making this choice. Databases easily allow you to implement quality checks. They also allow you to easily integrate data from multiple sources. Finally, they are more effective in handling very large data sets. On the other hand, spreadsheets are faster to set up and allow easier copying and duplication for data with repetitive patterns.
  2. March/April 2011: A binary coding trick that you can learn from Car Talk. I often need to see how often certain variables and combinations of those variables appear in a data set. If the variable is binary, there is a trick for doing this that is illustrated by a Car Talk puzzler.
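
To make the point about quality checks concrete, here is a minimal sketch using SQLite through Python's standard sqlite3 module. SQLite is chosen only because it needs no setup; the newsletter article itself mentions Microsoft Access. The table, column names, and allowed ranges are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# The schema itself carries the quality checks: a unique id,
# a plausible age range, and a fixed set of codes for sex.
conn.execute(
    """
    CREATE TABLE patients (
        id  INTEGER PRIMARY KEY,
        age INTEGER NOT NULL CHECK (age BETWEEN 0 AND 120),
        sex TEXT    NOT NULL CHECK (sex IN ('f', 'm'))
    )
    """
)

# Hypothetical rows: the third has an impossible age, the fourth reuses an id.
rows = [(1, 45, "f"), (2, 52, "m"), (3, 450, "f"), (2, 22, "m")]
rejected = []
for row in rows:
    try:
        conn.execute("INSERT INTO patients (id, age, sex) VALUES (?, ?, ?)", row)
    except sqlite3.IntegrityError as err:
        rejected.append((row, str(err)))

accepted = conn.execute("SELECT id, age, sex FROM patients ORDER BY id").fetchall()
print(accepted)
print(rejected)
```

A spreadsheet would accept all four rows without complaint; here the bad rows are refused at entry time and the reason for each refusal is recorded.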

Outside resources:

Interesting quote: Not even the most subtle and skilled analysis can overcome completely the unreliability of basic data. - R. G. D. Allen.

Morris Rivera, Jason Donnelly, Blair Parry, et al. Prospective, randomized evaluation of a personal digital assistant-based research tool in the emergency department. BMC Medical Informatics and Decision Making. 2008;8(1):3. Abstract: "BACKGROUND: Personal digital assistants (PDA) offer putative advantages over paper for collecting research data. However, there are no data prospectively comparing PDA and paper in the emergency department. The aim of this study was to prospectively compare the performance of PDA and paper enrollment instruments with respect to time required and errors generated. METHODS: We randomized consecutive patients enrolled in an ongoing prospective study to having their data recorded either on a PDA or a paper data collection instrument. For each method, we recorded the total time required for enrollment, and the time required for manual transcription (paper) onto a computer database. We compared data error rates by examining missing data, nonsensical data, and errors made during the transcription of paper forms. Statistical comparisons were performed by Kruskal-Wallis and Poisson regression analyses for time and errors, respectively. RESULTS: We enrolled 68 patients (37 PDA, 31 paper). Two of 31 paper forms were not available for analysis. Total data gathering times, inclusive of transcription, were significantly less for PDA (6:13 min per patient) compared to paper (9:12 min per patient; p < 0.001). There were a total of 0.9 missing and nonsense errors per paper form compared to 0.2 errors per PDA form (p < 0.001). An additional 0.7 errors per paper form were generated during transcription. In total, there were 1.6 errors per paper form and 0.2 errors per PDA form (p < 0.001). CONCLUSION: Using a PDA-based data collection instrument for clinical research reduces the time required for data gathering and significantly improves data integrity." [Accessed February 22, 2011]. Available at: http://www.biomedcentral.com/1472-6947/8/3.

David Pogue. Should You Worry About Data Rot? The New York Times. 2009. Excerpt: "Data rot refers mainly to problems with the medium on which information is stored. Over time, things like temperature, humidity, exposure to light, being stored not-very-good locations like moldy basements, make this information very difficult to read. The second aspect of data rot is actually finding the machines to read them. And that is a real problem. If you think of the 8-track tape player, for example, basically the only way you can find 8-track cartridges is in a flea market or a garage sale." [Accessed March 30, 2009]. Available at: http://www.nytimes.com/2009/03/26/technology/personaltech/26pogue-email.html.

Circle Systems. Stat/Transfer Data Conversion Software Utility - Excel, SAS, Databases & Statistical Packages. Excerpt: "Stat/Transfer has provided fast, reliable, and convenient data transfer between popular software packages for thousands of users, worldwide. Stat/Transfer knows about statistical data --- it handles missing data, value and variable labels and all of the other details that are necessary to move as much information as is possible from one file format to another." [Accessed February 22, 2011]. Available at: http://www.stattransfer.com/

Creative Commons License This work is licensed under a Creative Commons Attribution 3.0 United States License. This page was written by Steve Simon and was last modified on 2010-04-11.