StATS: A clumsy attempt at anonymization (August 15, 2006)

Statisticians frequently deal with confidentiality issues when deciding what type of data and what amount of detail should be withheld to protect sensitive information about individual patients or institutions. It's not an easy task, and there are some subtle traps. And sometimes there are not-so-subtle traps.

At the request of some researchers, America Online (AOL) released data on 20 million web searches performed by 650 thousand AOL users over a three-month span. They released the data not just to those researchers, but to the general public. AOL quickly realized that this was a bad idea and removed the database, but it had already been copied to many locations. It is unlikely that they will ever be able to persuade the owners of all those other sites to take the files offline.

The data was anonymized by replacing the user name with a random number. This is important, because some of the search terms are for rather sensitive items. Examples of things that people searched on are

But replacing a name with a number did not come even close to anonymizing all of the records. The problem is that people do web searches about things that reveal hints about themselves. Actual searches listed in the database included things like geographic locations:

or places where the searchers shopped or banked or got health care,

or products that the searchers owned,

or their hobbies,

It gets even more revealing when people do web searches on their relatives or even themselves.

These individual searches are, according to one report, like individual pieces in a mosaic. Put enough of them together and you can get a really clear picture of who the searcher is. Can you actually identify people from their web searches? The answer is yes.

According to the article, user number 4417749 searched for

as well as the names of several people, all of whose last names were Arnold. It didn't take long for the New York Times to track down a 62-year-old widow named Thelma Arnold.

Ms. Arnold, who agreed to discuss her searches with a reporter, said she was shocked to hear that AOL had saved and published three months’ worth of them. “My goodness, it’s my whole personal life,” she said. “I had no idea somebody was looking over my shoulder.”
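To see why swapping a user name for a random number offers so little protection, here is a minimal sketch in Python. Everything in it is an assumption made for illustration: the search log is invented, and `pseudonymize` and `profiles` are hypothetical helpers, not AOL's actual procedure. The point is only that a number which stays the same for every search by one person still ties all of that person's searches together.

```python
import random
from collections import defaultdict

# Invented search log for illustration only: (user name, query).
log = [
    ("user_a", "garden centers in smallville"),
    ("user_b", "cheap flights"),
    ("user_a", "knee pain remedies"),
    ("user_a", "jane doe smallville"),
]

def pseudonymize(rows, seed=0):
    """Replace each user name with a distinct random number (hypothetical helper)."""
    names = sorted({name for name, _ in rows})
    rng = random.Random(seed)
    # sample() draws without replacement, so no two users share a number.
    ids = dict(zip(names, rng.sample(range(10_000_000), len(names))))
    return [(ids[name], query) for name, query in rows]

def profiles(rows):
    """Group queries by the pseudonymous number, as anyone holding the file could."""
    grouped = defaultdict(list)
    for user_id, query in rows:
        grouped[user_id].append(query)
    return dict(grouped)

released = pseudonymize(log)
for user_id, queries in profiles(released).items():
    print(user_id, queries)
```

The names are gone from the released rows, but the three searches by the same person still sit under a single number, and read together they point to a town, a health concern, and a name.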

This is an important lesson that statisticians have been aware of for some time. An individual piece of information by itself may not compromise someone's privacy, but it can do so when it is combined with other pieces of information. Knowing that someone lives in a small town still preserves anonymity, but when that small town name appears in a database of all pediatric heart transplant cases, you have a problem.
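One rough way to look for this kind of problem before releasing data is to count how many records share each combination of values and flag the combinations that are rare. The sketch below assumes invented records, a hypothetical helper named `small_cells`, and an arbitrary threshold `k`; it illustrates the idea and is not a complete disclosure review.

```python
from collections import Counter

# Invented records for illustration only: (town, procedure).
records = [
    ("Big City", "appendectomy"),
    ("Big City", "appendectomy"),
    ("Big City", "heart transplant"),
    ("Big City", "heart transplant"),
    ("Small Town", "appendectomy"),
    ("Small Town", "appendectomy"),
    ("Small Town", "heart transplant"),
]

def small_cells(rows, columns, k=2):
    """Return value combinations over `columns` shared by fewer than k rows."""
    counts = Counter(tuple(row[c] for c in columns) for row in rows)
    return {cell: n for cell, n in counts.items() if n < k}

print(small_cells(records, [0]))     # town alone
print(small_cells(records, [1]))     # procedure alone
print(small_cells(records, [0, 1]))  # town and procedure together
```

With these made-up records, neither the town nor the procedure flags anything on its own, but the two columns together single out the one heart transplant case from the small town.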

I posted this article on the Chance Wiki as well.

This page was written by Steve Simon while working at Children's Mercy Hospital. Although I do not hold the copyright for this material, I am reproducing it here as a service, as it is no longer available on the Children's Mercy Hospital website.