**What is principal components analysis? (created 2010-07-19)**.

I was asked to help someone who was reviewing a paper that used principal components analysis (PCA) as part of the statistical methodology. I have not yet seen the article, so I could only offer very general advice.

Principal components analysis (PCA) is typically used when there are a large number of variables and there is a need to find a small number of composite variables that summarize the behavior of these variables. PCA is often also considered one of the simpler (obviously simple is a relative term) forms of factor analysis. Factor analysis in general, and PCA in particular, is used to discover patterns among the interrelationships (correlations) of a batch of variables. Sometimes PCA is used to avoid issues of multicollinearity in a linear regression model.
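To make the idea of a "composite variable" concrete, here is a minimal sketch using NumPy. It is only an illustration, not the method of any particular paper: the function name `first_principal_component` and the simulated data are my own assumptions. It standardizes the variables, computes their correlation matrix, and takes the eigenvector with the largest eigenvalue as the weights for the first principal component.

```python
import numpy as np

def first_principal_component(X):
    """Return (loadings, scores, explained_ratio) for the first principal component.

    Hypothetical helper for illustration; X has one row per subject
    and one column per variable.
    """
    X = np.asarray(X, dtype=float)
    # Standardize each column so the analysis is based on correlations.
    Z = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
    corr = np.corrcoef(Z, rowvar=False)
    # eigh handles symmetric matrices and returns eigenvalues in ascending order.
    eigenvalues, eigenvectors = np.linalg.eigh(corr)
    loadings = eigenvectors[:, -1]
    # The sign of an eigenvector is arbitrary; make the largest loading positive.
    if loadings[np.argmax(np.abs(loadings))] < 0:
        loadings = -loadings
    scores = Z @ loadings  # one composite value per subject
    explained_ratio = eigenvalues[-1] / eigenvalues.sum()
    return loadings, scores, explained_ratio

# Simulated data (an assumption): four noisy measurements of one latent trait.
rng = np.random.default_rng(0)
latent = rng.normal(size=200)
X = np.column_stack([latent + 0.3 * rng.normal(size=200) for _ in range(4)])

loadings, scores, explained_ratio = first_principal_component(X)
print(loadings)         # weights of each variable in the composite
print(explained_ratio)  # share of total variance captured by the composite
```

When the variables are strongly correlated, as in this simulation, a single composite captures most of the variation, which is why one score can reasonably stand in for all four variables.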

Here's a nice article that shows how PCA is used to solve a real-world problem.

- Anne Caroline Krefis, Norbert Georg Schwarz, Bernard Nkrumah, et al.
**Principal component analysis of socioeconomic factors and their association with malaria in children from the Ashanti Region, Ghana**. Malaria Journal. 2010;9(1):201. [Accessed July 19, 2010]. Available at: http://www.malariajournal.com/content/9/1/201.

When studying socioeconomic status (SES) and its relationship to malaria,
the researchers noted the difficulty of measuring SES in Ghana: "*In malaria endemic
areas, however, valid classification of socioeconomic factors is difficult due
to the lack of standardized tax and income data.*" The researchers did
collect a range of variables that were related to SES, such as

- mother's and father's profession (employed/unemployed) and education (ability to read and write: yes/no),
- type of house the family is living in (cement/brick house or mud/wood house),
- water supply (open water source/closed water source),
- existence of an indoor kitchen (Yes/No),
- electricity (Yes/No),
- indoor toilet (Yes/No),
- use of freezing as measure of conservation (Yes/No),
- existence of a relative abroad who might financially support the family (Yes/No),
- the self-rated ability to manage with the available monthly income (difficult or not difficult), and
- membership in the national Health Insurance Scheme (Yes/No).

PCA created a single composite variable from these measures, a composite that (theoretically) would provide a reasonable indicator of SES. This composite variable was then converted into a three-level categorical variable, with the lowest third of the scores corresponding to "poor" families, the middle third corresponding to "average" families, and the top third corresponding to "rich" families. This new variable was then used as a predictor variable along with other variables (use of protection measures, age of the child, place of residence, ethnicity, number of children, sex of the child, and mother's age) to see what factors influenced the presence of malaria in the child.
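The step of cutting a composite score into thirds can be sketched as follows. This is an assumption about one reasonable way to do it, not the authors' actual code: the function name `tercile_labels` and the example scores are hypothetical.

```python
import numpy as np

def tercile_labels(scores, labels=("poor", "average", "rich")):
    """Assign each score to the lowest, middle, or top third.

    Hypothetical helper for illustration only.
    """
    scores = np.asarray(scores, dtype=float)
    # Cut points at the 1/3 and 2/3 quantiles of the observed scores.
    lower, upper = np.quantile(scores, [1 / 3, 2 / 3])
    # digitize gives 0 below the lower cut, 1 between the cuts, 2 at or above the upper cut.
    index = np.digitize(scores, [lower, upper])
    return np.array(labels)[index]

# Made-up composite scores for nine families.
example_scores = np.array([-2.0, -1.5, -1.0, -0.2, 0.0, 0.3, 1.1, 1.8, 2.5])
groups = tercile_labels(example_scores)
print(groups)
```

One caution: a composite built from yes/no variables takes only a limited number of distinct values, so many families can share the same score and the three groups may not come out exactly equal in size.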

A nice tutorial on PCA can be found at http://www.cs.otago.ac.nz/cosc453/student_tutorials/principal_components.pdf