P.Mean >> Category >> Data management (created 2007-06-20). 

Data management is the foundation of every good data analysis. You need to consider issues like how your data are entered, documented, and stored. Careful attention to these issues now will help save you time and frustration during your data analysis. Articles are arranged by date with the most recent entries at the top. You can find outside resources at the bottom of this page. Other entries about data management can be found in the data management page at the StATS website.

2010

  1. P.Mean: Dealing with a large text file that crashes your computer (created 2010-04-02). At a meeting, a colleague was describing a text file that he had received that had crashed his system. No way, I thought, could a simple text file crash your system. I offered to investigate and he was right. The text file crashed my system too, and repeatedly. Here's what I did to figure out how a simple text file could crash your computer.
  2. P.Mean: Finding duplicate records in a 19 million record database (created 2010-03-02). I was asked to help find duplicate records in a large database (19 million records). The number of duplicates was suspected to be small, possibly around 90. My colleague's approach was to run PROC FREQ in SAS on the "unique" id and then look for ids with a frequency greater than 1. That did not work: it took too long, overloaded the system, or both. So I wanted to look at alternatives that would identify duplicate records more efficiently.
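
As a rough illustration of the idea in item 2, here is a minimal sketch in Python (not SAS, and not necessarily the approach the article settles on). It makes a single pass through the ids and reports only those seen more than once, rather than building a full frequency table. The function name and the sample ids are hypothetical, and the sketch assumes the set of distinct ids fits in memory, which may or may not hold for 19 million records.

```python
def find_duplicate_ids(ids):
    """Return a dict mapping each repeated id to the number of times it occurs."""
    seen = set()        # every distinct id encountered so far
    repeats = {}        # only ids that occur more than once
    for record_id in ids:
        if record_id in seen:
            # First repeat sets the count to 2; later repeats add 1.
            repeats[record_id] = repeats.get(record_id, 1) + 1
        else:
            seen.add(record_id)
    return repeats


# Hypothetical ids, for illustration only.
sample_ids = ["A001", "A002", "A003", "A002", "A004", "A001", "A001"]
print(find_duplicate_ids(sample_ids))
```

Because `ids` can be any iterable, the same function could be fed one id at a time from a file reader instead of a list.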

2008
     
  3. P.Mean: A false sense of frugality (created 2008-12-17). A while back I received a data set that was very well documented, but there was one thing that I wish that the data entry person had not done. The demographic data was listed as 45f, 52m, 22m, 21f, etc. This was obvious shorthand for a 45 year old female, 52 year old male, and so forth.
  4. P.Mean: Naming conventions for variables (created 2008-07-30). For almost all statistical software programs, you can and should provide variable names for your data. A variable name is a brief description of what resides in a column of data. You should choose a variable name that is short and descriptive.
  5. P.Mean: Undeclared missing code leads to bad results (created 2008-07-15). I found this ticket in a computer store many years ago and am just now getting around to showing it. It demonstrates how failure to declare a missing value code can lead to laughably incorrect results.
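
Combined entries like those in item 3 can be split back into separate variables after the fact. The following is a minimal sketch in Python, assuming every value is a run of digits followed by a single m or f. The function name `split_age_sex` is hypothetical, and real data would likely need more cases handled.

```python
import re

# Digits for the age, then one letter for sex; surrounding spaces are tolerated.
AGE_SEX_PATTERN = re.compile(r"^\s*(\d+)\s*([mf])\s*$", re.IGNORECASE)


def split_age_sex(code):
    """Split a code such as '45f' into (age, sex), e.g. (45, 'f')."""
    match = AGE_SEX_PATTERN.match(code)
    if match is None:
        # Fail loudly rather than guess at a malformed entry.
        raise ValueError(f"unrecognized age/sex code: {code!r}")
    return int(match.group(1)), match.group(2).lower()


records = ["45f", "52m", "22m", "21f"]
print([split_age_sex(r) for r in records])
# prints [(45, 'f'), (52, 'm'), (22, 'm'), (21, 'f')]
```

Raising an error on anything that does not match keeps a stray entry from silently turning into a wrong age or sex.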

Outside resources:


Creative Commons License This work is licensed under a Creative Commons Attribution 3.0 United States License. This page was written by Steve Simon and was last modified on 2010-04-11.