P.Mean: Explaining CART models in simple terms (created 2008-11-05)

I need some help understanding and explaining Classification and Regression Trees (CART). I am not personally familiar with this technique. When would someone choose it over a linear or logistic regression model?

CART was developed to handle problems with overfitting. This is its primary advantage over stepwise regression. Generally, CART models are considered exploratory rather than confirmatory.

From the perspective of critical appraisal, classification and regression trees (CART) are not much different from linear and logistic regression. They are an approach to making sense of data where there are multiple predictor variables. They work reasonably well if the data are of good quality, but as with any statistical procedure, the quality of the analysis is limited by the quality of the data coming in.
A classification tree is used when the outcome variable is categorical and a regression tree is used when the outcome variable is continuous. Both methods rely on a similar approach, known as recursive partitioning.
Generally, this approach is used when there are numerous predictor variables and the researcher desires a simple prediction involving a small number of these predictor variables.
Recursive partitioning divides each predictor variable into discrete groups. The groups are typically required to have a minimum sample size (usually 5), but otherwise are allowed to vary considerably. So if one of the predictor variables is the one-minute Apgar score, then the possible groups to be considered are
0 versus 1-10
0-1 versus 2-10
0-2 versus 3-10
. . .
0-9 versus 10
For a discrete variable with levels A, B, C and D and no particular order among the categories, the possible groups to be considered are
A versus BCD
AB versus CD
AC versus BD
AD versus BC
B versus ACD
. . .
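
To make the counting concrete, here is a minimal Python sketch that lists the candidate two-group splits for an ordered variable and for an unordered one. The function names `ordered_splits` and `unordered_splits` are my own for illustration and do not come from any CART package. The example assumes an Apgar score recorded as a whole number from 0 to 10, and it ignores the minimum sample size rule, which depends on the data at hand.

```python
from itertools import combinations


def ordered_splits(values):
    """Candidate binary splits for an ordered predictor.

    A cut is placed between each pair of adjacent distinct values.
    """
    levels = sorted(set(values))
    return [(levels[:i], levels[i:]) for i in range(1, len(levels))]


def unordered_splits(levels):
    """Candidate binary splits for an unordered categorical predictor.

    The first level is pinned to the left group so that each split is
    listed only once (A versus BCD is the same split as BCD versus A).
    """
    levels = list(levels)
    first, rest = levels[0], levels[1:]
    splits = []
    # k = number of other levels that join `first`; stopping before
    # len(rest) keeps the right-hand group from being empty.
    for k in range(len(rest)):
        for extra in combinations(rest, k):
            left = [first, *extra]
            right = [lvl for lvl in rest if lvl not in extra]
            splits.append((left, right))
    return splits


# Assumed example values: Apgar scores 0 through 10, categories A-D.
apgar_splits = ordered_splits(range(0, 11))
letter_splits = unordered_splits("ABCD")

print(len(apgar_splits))   # 10 cut points
print(len(letter_splits))  # 7 distinct splits
```

An ordered variable with k distinct values gives k-1 candidate splits, while an unordered variable with k categories gives 2^(k-1) - 1, which is why unordered variables with many categories are so much more expensive to search.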
A CART model examines all possible partitions of all the predictor variables and selects the partition that produces the best prediction of the outcome variable. Once that partition is selected, the same search is repeated separately within each of the two resulting subgroups, which is why the approach is called recursive partitioning.
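
The search just described can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not how any particular CART program is written: it handles ordered predictors only, it scores a split by the total sum of squared errors (one common choice for a continuous outcome, which I am assuming here), it uses the minimum group size of 5 mentioned above, and it stops at a fixed depth rather than pruning. The data set, the variable names `apgar`, `weight` and `y`, and the functions `best_split` and `grow` are all made up for the example.

```python
def sse(ys):
    """Sum of squared deviations from the group mean."""
    mean = sum(ys) / len(ys)
    return sum((y - mean) ** 2 for y in ys)


def best_split(rows, outcome, predictors, min_size=5):
    """Try every cut point of every predictor; keep the lowest total SSE.

    Returns (predictor, cut, total_sse), or None when no split leaves
    at least min_size rows on each side.
    """
    best = None
    for name in predictors:
        levels = sorted({row[name] for row in rows})
        for cut in levels[1:]:  # left group: value < cut
            left = [r[outcome] for r in rows if r[name] < cut]
            right = [r[outcome] for r in rows if r[name] >= cut]
            if len(left) < min_size or len(right) < min_size:
                continue
            total = sse(left) + sse(right)
            if best is None or total < best[2]:
                best = (name, cut, total)
    return best


def grow(rows, outcome, predictors, min_size=5, max_depth=2):
    """Recursively partition the rows into a nested dict."""
    ys = [r[outcome] for r in rows]
    split = None
    if max_depth > 0:
        split = best_split(rows, outcome, predictors, min_size)
    if split is None:
        # Leaf: predict the mean outcome of the rows that landed here.
        return {"prediction": sum(ys) / len(ys), "n": len(ys)}
    name, cut, _ = split
    left = [r for r in rows if r[name] < cut]
    right = [r for r in rows if r[name] >= cut]
    return {
        "split": (name, cut),
        "left": grow(left, outcome, predictors, min_size, max_depth - 1),
        "right": grow(right, outcome, predictors, min_size, max_depth - 1),
    }


# Made-up data: the outcome jumps when apgar reaches 4, with a small
# additional effect of weight.
rows = [
    {"apgar": a, "weight": w, "y": (10.0 if a < 4 else 20.0) + 0.1 * w}
    for a in range(0, 11)
    for w in (2, 3, 4)
]

tree = grow(rows, "y", ["apgar", "weight"])
print(tree["split"])  # ('apgar', 4)
```

In this toy data the first split separates Apgar scores below 4 from the rest, because that single cut removes almost all of the variation in the outcome. Real software adds refinements such as pruning and cross-validation that this sketch leaves out.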
The result of a CART model is a tree diagram. Here's an example from

Malaria in central Vietnam: analysis of risk factors by multivariate analysis and classification tree models. Ngo Duc Thang, Annette Erhart, Niko Speybroeck, Le Xuan Hung, Le Khanh Thuan, Cong Trinh Hung, Pham Van Ky, Marc Coosemans and Umberto D'Alessandro. Malaria Journal 2008, 7:28. doi:10.1186/1475-2875-7-28. This is an open access journal, so I can reproduce the image as long as I cite the original source.

Here's another regression tree. It is from

Explicit criteria for prioritization of cataract surgery. José M Quintana, Antonio Escobar and Amaia Bilbao for the IRYSS-Appropriateness Cataract Group. BMC Health Services Research 2006, 6:24. doi:10.1186/1472-6963-6-24.

As you can guess, the computations required for a CART model are considerable.
CART can be considered a replacement for other variable selection methods such as stepwise regression, but its predictions for a continuous variable are a bit different. It produces regions where the outcome variable is relatively constant and the predicted values will look like a stairstep.
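
Here is a small Python sketch of the stair step pattern. It fits a tiny regression tree to made-up data where the outcome rises in a perfectly straight line (y = 2x), then prints the predictions. The functions `fit_tree` and `predict` are illustrative names of my own, the squared-error criterion is an assumption, and the tree is limited to two levels of splitting so the output stays readable.

```python
def mean(values):
    return sum(values) / len(values)


def sse(ys):
    """Sum of squared deviations from the group mean."""
    m = mean(ys)
    return sum((y - m) ** 2 for y in ys)


def fit_tree(points, depth, min_size=5):
    """Fit a one-predictor regression tree to a list of (x, y) pairs.

    Returns a float (a leaf's mean outcome) or a tuple
    (cut, left_subtree, right_subtree).
    """
    ys = [y for _, y in points]
    if depth == 0:
        return mean(ys)
    best = None
    for cut in sorted({x for x, _ in points})[1:]:
        left = [p for p in points if p[0] < cut]
        right = [p for p in points if p[0] >= cut]
        if len(left) < min_size or len(right) < min_size:
            continue
        err = sse([y for _, y in left]) + sse([y for _, y in right])
        if best is None or err < best[0]:
            best = (err, cut, left, right)
    if best is None:
        return mean(ys)
    _, cut, left, right = best
    return (
        cut,
        fit_tree(left, depth - 1, min_size),
        fit_tree(right, depth - 1, min_size),
    )


def predict(tree, x):
    """Walk down the tree until a leaf (a plain number) is reached."""
    while isinstance(tree, tuple):
        cut, left, right = tree
        tree = left if x < cut else right
    return tree


# Made-up data: a perfectly linear relationship, y = 2x.
points = [(x, 2.0 * x) for x in range(20)]
tree = fit_tree(points, depth=2)
predictions = [predict(tree, x) for x in range(20)]

print(sorted(set(predictions)))  # [4.0, 14.0, 24.0, 34.0]
```

Even though the true relationship is a straight line, the tree predicts only four distinct values, one for each block of five consecutive x values. A linear regression would reproduce the line exactly; the tree approximates it with flat steps.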
A nice explanation of CART models is on the web at

www.saem.org/download/lewis1.pdf

I would not fuss too much about CART models. I always stress during the critical appraisal step that you should focus on how the data was collected, not how it was analyzed. If you collect the wrong data (e.g., bad control group, wrong outcome measure) it doesn't matter how fancy the analysis is.

This work is licensed under a Creative Commons Attribution 3.0 United States License. This page was written by Steve Simon and was last modified on 2010-04-01.