Data management is the foundation of every good data analysis. You need to
consider issues like how your data are entered, documented, and stored. Careful
attention to these issues now will save you time and frustration during your
data analysis.
Articles are arranged by date with the most recent entries at the top. You can
find outside resources at the bottom of this page. Other entries about data
management can be found on the data management page at the StATS website.
2010
- P.Mean: Dealing with a large text file that
crashes your computer (created 2010-04-02). At a meeting, a colleague was
describing a text file that he had received that had crashed his system. No
way, I thought, could a simple text file crash your system. I offered to
investigate and he was right. The text file crashed my system too, and
repeatedly. Here's what I did to figure out how a simple text file could crash
your computer.
- P.Mean: Finding duplicate records in
a 19 million record database (created 2010-03-02). I was asked to help
find duplicate records in a large database (19 million records). The number of
duplicates was suspected to be small, possibly around 90. My
colleague's approach was running PROC FREQ in SAS on the "unique" id and then
looking for ids that have a frequency greater than 1. That did not work--it
took too long or it overloaded the system, or both. So I wanted to look at
alternatives for identifying duplicate records that would do this more
efficiently.
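As a rough illustration of one such alternative (a sketch only, and not necessarily the approach taken in the article), the ids can be streamed one at a time while keeping a set of ids already seen. This avoids building a full frequency table and reports only the ids that repeat. The function name `find_duplicate_ids` and the sample ids are hypothetical, and the set of seen ids must still fit in memory, which may or may not hold for 19 million records on a given machine.

```python
from collections.abc import Iterable


def find_duplicate_ids(ids: Iterable[str]) -> dict[str, int]:
    """Return {id: count} for every id that occurs more than once.

    The ids are consumed one at a time, so the input can be a generator
    that reads a large file line by line.
    """
    seen: set[str] = set()            # every id encountered so far
    dup_counts: dict[str, int] = {}   # only the ids that repeat
    for record_id in ids:
        if record_id in seen:
            # First repeat gives a count of 2, later repeats add 1 each.
            dup_counts[record_id] = dup_counts.get(record_id, 1) + 1
        else:
            seen.add(record_id)
    return dup_counts


# Hypothetical sample ids, for illustration only.
sample_ids = ["a1", "b2", "a1", "c3", "a1"]
duplicates = find_duplicate_ids(sample_ids)
print(duplicates)
```

Because only the repeated ids are stored with counts, the result stays small when duplicates are rare, as was suspected here.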
2008
- P.Mean: A false sense of frugality
(created 2008-12-17). A while back I received a data set that was very well
documented, but there was one thing that I wish that the data entry person had
not done. The demographic data was listed as 45f, 52m, 22m, 21f, etc. This was
obvious shorthand for a 45 year old female, 52 year old male, and so forth.
- P.Mean: Naming conventions for
variables (created 2008-07-30). For almost all statistical software
programs, you can and should provide variable names for your data. Variable
names are a short descriptive explanation of what resides in each column of
data. You should choose a variable name that is short, concise, and
descriptive.
- P.Mean: Undeclared missing code
leads to bad results (created 2008-07-15). I found this ticket in a
computer store many years ago and am just now getting around to showing it. It
demonstrates how failure to declare a missing value code can lead to laughably
incorrect results.
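Two of the 2008 entries above describe problems that are easy to show in code. The sketch below is illustrative only: it splits shorthand such as 45f into separate age and sex values, and it shows how an undeclared missing value code distorts an average. The function names and the missing code 999 are assumptions made for this example and do not come from the articles.

```python
import re
from statistics import mean

# Digits followed by a single f or m, e.g. "45f" or "52m".
_AGE_SEX = re.compile(r"^\s*(\d+)\s*([fm])\s*$", re.IGNORECASE)


def split_age_sex(code: str) -> tuple[int, str]:
    """Split shorthand like '45f' into (45, 'f')."""
    match = _AGE_SEX.match(code)
    if match is None:
        raise ValueError(f"unrecognized age/sex code: {code!r}")
    return int(match.group(1)), match.group(2).lower()


def mean_ignoring_missing(values, missing_code):
    """Average the values after dropping the declared missing code.

    Returns None when no valid values remain.
    """
    valid = [v for v in values if v != missing_code]
    return mean(valid) if valid else None


# The shorthand values quoted in the entry above.
pairs = [split_age_sex(code) for code in ["45f", "52m", "22m", "21f"]]

# Hypothetical ages where 999 stands for "missing".
ages = [45, 52, 999, 21]
naive_mean = mean(ages)                           # treats 999 as a real age
declared_mean = mean_ignoring_missing(ages, 999)  # drops 999 first
print(pairs, naive_mean, declared_mean)
```

Storing age and sex in separate columns from the start removes the need for the parsing step, and declaring the missing value code to the software removes the need for manual filtering.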
Outside resources:
- Interesting quote: Not even the most subtle and skilled analysis can
overcome completely the unreliability of basic data. - R. G. D. Allen.
- Pogue D. Should You Worry About Data Rot? The New York Times. 2009.
Available at: http://www.nytimes.com/2009/03/26/technology/personaltech/26pogue-email.html
[Accessed March 30, 2009]. Description: There are some physical aspects to
the data that you store. These can affect whether your data is readable 10 or
20 years later. David Pogue interviews Dag Spicer, an expert on the durability
(or lack thereof) of various storage media.
- Prospective, randomized evaluation of a personal
digital assistant-based research tool in the emergency department. M. L.
Rivera, J. Donnelly, B. A. Parry, A. Dinizio, C. L. Johnson, J. A. Kline, C.
Kabrhel. BMC Med Inform Decis Mak 2008: 8(1); 3.
Description: This article studied the use of a Personal Digital Assistant
(PDA) for data collection. Compared to a paper form, the PDA was faster and
more accurate.
- Stat/Transfer. Circle Systems. Excerpt: Since 1986,
Stat/Transfer has provided fast, reliable, and convenient data transfer
between popular software packages for thousands of users, worldwide. This
website was last verified on 2008-. URL: www.stattransfer.com
This work is licensed under a Creative Commons Attribution 3.0 United States
License. This page was written by Steve Simon and was last modified on
2010-04-11.