StATS: A simple application of propensity scores (April 26, 2006).

In many research studies, you do not have the opportunity to randomly assign an exposure variable. A direct comparison of the outcome across the levels of exposure can then produce misleading results, because other covariates that are important predictors of the outcome may also be imbalanced across the levels of exposure. A propensity score model creates a new composite variable, the propensity score, which helps you identify pairs or groups of observations with similar covariate patterns. Stratifying or matching on the propensity score reduces the effect of imbalance in the measured covariates and allows for a fairer comparison of the exposure group with the control group.

A propensity score model assumes that you have a binary exposure variable and one or more covariates that are potentially imbalanced across the levels of the exposure variable. The outcome variable can take on many forms. It can be a continuous outcome, a categorical outcome, or a survival time. The propensity score is estimated by fitting a logistic regression model with the exposure variable as the dependent variable and all the important covariates as independent variables. The outcome variable is ignored during this stage of the analysis. For each observation in the data set, compute the predicted probability of being in the exposed group; this predicted probability is the propensity score. A large value for the propensity score represents a covariate pattern that is more likely to appear in the exposed category. A small value represents a covariate pattern that is more likely to appear in the control category. By matching or stratifying on the propensity score, you make implicit comparisons only among observations that have similar covariate patterns.
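The steps in the paragraph above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the data are simulated (two standardized covariates and a true exposure effect of 3, all invented here), the logistic regression is fit with a plain gradient ascent loop rather than a statistics package, and the helper names (`fit_logistic`, `quintiles`, `stratified_difference`) are hypothetical. Stratifying into five groups by rank is one common choice, not the only one.

```python
import math
import random

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def linear_predictor(beta, x):
    """Intercept plus the weighted sum of the covariates."""
    return beta[0] + sum(b * v for b, v in zip(beta[1:], x))

def fit_logistic(X, exposure, steps=1000, rate=0.5):
    """Fit exposure ~ covariates by gradient ascent on the log-likelihood."""
    n = len(X)
    beta = [0.0] * (len(X[0]) + 1)
    for _ in range(steps):
        grad = [0.0] * len(beta)
        for x, e in zip(X, exposure):
            resid = e - sigmoid(linear_predictor(beta, x))
            grad[0] += resid
            for j, v in enumerate(x):
                grad[j + 1] += resid * v
        beta = [b + rate * g / n for b, g in zip(beta, grad)]
    return beta

def quintiles(scores):
    """Stratum 0..4 for each observation, by rank of its propensity score."""
    order = sorted(range(len(scores)), key=lambda i: scores[i])
    strata = [0] * len(scores)
    for rank, i in enumerate(order):
        strata[i] = rank * 5 // len(scores)
    return strata

def mean(values):
    return sum(values) / len(values)

def stratified_difference(outcome, exposure, strata):
    """Average the exposed-minus-control mean difference over usable strata."""
    diffs = []
    for s in sorted(set(strata)):
        exposed = [y for y, e, k in zip(outcome, exposure, strata) if k == s and e == 1]
        control = [y for y, e, k in zip(outcome, exposure, strata) if k == s and e == 0]
        if exposed and control:  # skip strata that lack one of the groups
            diffs.append(mean(exposed) - mean(control))
    return mean(diffs)

# Simulated data (an assumption for illustration, not a real data set).
# Exposure is more likely when x1 and x2 are large, and both covariates
# also raise the outcome, so the raw comparison is confounded.
random.seed(1)
X, exposure, outcome = [], [], []
for _ in range(300):
    x1, x2 = random.gauss(0, 1), random.gauss(0, 1)
    e = 1 if random.random() < sigmoid(-0.5 + x1 + 0.5 * x2) else 0
    X.append((x1, x2))
    exposure.append(e)
    outcome.append(2 * x1 + x2 + 3 * e + random.gauss(0, 1))

# Stage 1: model the exposure only; the outcome is not used here.
beta = fit_logistic(X, exposure)
scores = [sigmoid(linear_predictor(beta, x)) for x in X]
strata = quintiles(scores)

# Stage 2: compare outcomes within strata of similar propensity score.
naive = (mean([y for y, e in zip(outcome, exposure) if e == 1])
         - mean([y for y, e in zip(outcome, exposure) if e == 0]))
adjusted = stratified_difference(outcome, exposure, strata)
print(f"naive difference: {naive:.2f}, stratified difference: {adjusted:.2f}")
```

In this simulation the naive difference overstates the exposure effect because exposed observations tend to have larger covariate values. The stratified difference should fall closer to the simulated effect of 3, although five strata usually leave some residual imbalance within each stratum.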

Here are two interesting published examples of propensity score analysis.

I have used a simple data set in some of my classes that is interesting and fun. It comes from the Data and Story Library (DASL) website. This data set contains the resale values for a random sample of 117 homes in Albuquerque, NM in 1993. The variables in the data set are

The eleven features included dishwasher, refrigerator, microwave, disposer, washer, intercom, skylight(s), compactor, dryer, handicap fit, and cable TV access. The original data set had selling price in hundreds of dollars, but I found it useful to convert this to dollars. This data set also had a column for annual taxes, which I did not include in this data set.

An interesting question is whether custom built houses sell for more than regular homes. A direct comparison of these two groups is unfair, though, because custom built houses tend to be bigger and to have more features than regular houses. There are also small differences in location, with custom built houses located slightly more often in the northeast sector of the city. There is almost no difference, however, in the proportion of houses found on corner lots.
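One way to put a number on this kind of imbalance is a standardized difference: the difference in group means divided by a pooled standard deviation. The sketch below is only an illustration. The square footage values are made up and are not the DASL data, and the function name `standardized_difference` is hypothetical.

```python
from statistics import mean, variance

def standardized_difference(exposed_values, control_values):
    """Difference in means divided by the pooled standard deviation."""
    pooled_sd = ((variance(exposed_values) + variance(control_values)) / 2) ** 0.5
    return (mean(exposed_values) - mean(control_values)) / pooled_sd

# Made-up square footage values, for illustration only (not the DASL data).
custom_sqft = [2100, 2400, 1900, 2650, 2300]
regular_sqft = [1500, 1700, 1450, 1800, 1600, 1550]

d = standardized_difference(custom_sqft, regular_sqft)
print(f"standardized difference in size: {d:.2f}")
```

A common rule of thumb treats an absolute standardized difference above roughly 0.1 as a sign of imbalance worth adjusting for. The same calculation can be repeated within propensity score strata to see whether the imbalance shrinks.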

You can correct for these disparities using an analysis of covariance model, and that works reasonably well for this data set. In many research settings, though, the number of covariates becomes so large that adjusting for all of them simultaneously in a single model is impractical. In these cases, a propensity score model is very useful.
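For comparison, an analysis of covariance amounts to regressing price on the custom built indicator together with the covariates and reading off the coefficient for the indicator. The sketch below shows that idea under loud assumptions: the houses are simulated rather than taken from the DASL file, the true custom built premium is set to 10,000 dollars by construction, and the least-squares fit is a small hand-written routine (`ols`, `solve_linear_system`, both hypothetical names) instead of a statistics package.

```python
import random

def solve_linear_system(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [rhs] for row, rhs in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            factor = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= factor * M[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][c] * x[c] for c in range(r + 1, n))) / M[r][r]
    return x

def ols(rows, y):
    """Least-squares coefficients for y on the columns of rows (intercept added)."""
    X = [[1.0] + list(r) for r in rows]
    p = len(X[0])
    XtX = [[sum(x[i] * x[j] for x in X) for j in range(p)] for i in range(p)]
    Xty = [sum(x[i] * yi for x, yi in zip(X, y)) for i in range(p)]
    return solve_linear_system(XtX, Xty)

# Simulated houses (NOT the DASL data). Custom built houses are made larger
# and better equipped on average, so the raw price gap is confounded.
random.seed(2)
rows, price = [], []
for _ in range(200):
    custom = 1 if random.random() < 0.3 else 0
    sqft_k = random.gauss(1.5 + 0.5 * custom, 0.3)  # thousands of square feet
    features = random.randint(0, 6) + 2 * custom
    rows.append((custom, sqft_k, features))
    price.append(50000 + 10000 * custom + 60000 * sqft_k + 2000 * features
                 + random.gauss(0, 5000))

# Analysis of covariance: the coefficient on `custom` is the adjusted difference.
intercept, custom_effect, sqft_effect, feature_effect = ols(rows, price)

custom_prices = [p for r, p in zip(rows, price) if r[0] == 1]
regular_prices = [p for r, p in zip(rows, price) if r[0] == 0]
naive_difference = (sum(custom_prices) / len(custom_prices)
                    - sum(regular_prices) / len(regular_prices))
print(f"naive: {naive_difference:.0f}, adjusted: {custom_effect:.0f}")
```

With only two covariates this direct adjustment is easy to fit. The propensity score approach becomes attractive when the list of covariates is long, because it collapses them into a single score before the outcome is compared.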

This page was written by Steve Simon while working at Children's Mercy Hospital. Although I do not hold the copyright for this material, I am reproducing it here as a service, as it is no longer available on the Children's Mercy Hospital website.