P.Mean: Another inquiry about slash and burn models (created 2008-08-20).

In a binary logistic regression model, do all variables including the constant need to be significant before you can include them in the model or is it just the constant that has to be significant?

This is an interesting question. First of all, the intercept term is usually not directly interpretable. It represents the estimated log odds when all the independent variables equal zero. In many, if not most, situations, a zero value is implausible, well outside the range of the data, and not of direct research interest. So statistical significance, or the lack of it, for the intercept term is usually not of serious concern.
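One way to make the intercept interpretable is to center the independent variables at their means, so that the intercept becomes the log odds for an "average" observation. Here is a minimal sketch of that idea; the coefficients and birth weights below are invented for illustration, not taken from any real model.

```python
import math

# Hypothetical fitted model: logit(p) = b0 + b1 * weight, where weight is
# birth weight in grams. The raw intercept b0 is the log odds at weight = 0,
# an impossible value. These coefficients are made up for illustration.
b0, b1 = 4.0, -0.002
weights = [2500.0, 3000.0, 3500.0]
mean_weight = sum(weights) / len(weights)  # 3000.0

def prob(logit):
    """Convert a log odds value to a probability."""
    return 1.0 / (1.0 + math.exp(-logit))

# Centering the covariate changes only the intercept: the new intercept is
# the log odds for an average-weight infant, which IS interpretable.
b0_centered = b0 + b1 * mean_weight

# Every fitted probability is unchanged by centering.
for w in weights:
    p_raw = prob(b0 + b1 * w)
    p_centered = prob(b0_centered + b1 * (w - mean_weight))
    assert abs(p_raw - p_centered) < 1e-9
```

Centering does not change the model's fit or predictions at all; it only shifts the reference point that the intercept describes.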

The type of model you are talking about, a model where every independent variable achieves a p-value of 0.05 or less, is what I call a "slash and burn" model. You remove any variables that fail to achieve statistical significance, possibly with great care, possibly not, until the only things left standing have small p-values.
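The selection loop itself is simple, which is part of its appeal. Here is a minimal sketch of backward elimination; in real use the p-values would come from refitting a logistic regression at each step with a statistics package, but here they come from a stub function with made-up values so the loop can run stand-alone.

```python
def backward_eliminate(variables, p_values, alpha=0.05):
    """Drop the least significant variable, refit, and repeat until
    every remaining variable has a p-value at or below alpha."""
    kept = set(variables)
    while kept:
        pvals = p_values(kept)          # "refit" on the current variables
        worst = max(kept, key=lambda v: pvals[v])
        if pvals[worst] <= alpha:       # everything is significant: stop
            break
        kept.remove(worst)              # slash the worst offender
    return kept

# Stub "model fit": fixed p-values for four hypothetical predictors.
# A real refit would change the p-values as variables drop out.
FAKE_P = {"age": 0.01, "sex": 0.30, "bmi": 0.04, "smoker": 0.20}

selected = backward_eliminate(FAKE_P, lambda kept: {v: FAKE_P[v] for v in kept})
print(sorted(selected))  # ['age', 'bmi']
```

Note the comment in the stub: in a real refit, removing one variable can change every other variable's p-value, which is exactly why the order of removal matters and why these models do not always replicate.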

Is this a good thing to do? Well, it is a very common thing to do, and I have done it myself. But you need to ask yourself first what the objective is for your logistic regression model. Here are some possible objectives.

1. to develop a model that can predict the binary outcome for future observations.
2. to develop a model that provides an unbiased comparison of the treatment and control groups in the presence of covariate imbalance.
3. to explore factors which might be associated with the binary outcome variable.

There may be other objectives, of course.

A "slash and burn" model will probably not achieve either the first or the second objective very effectively. There is some evidence that you should select a set of independent variables based on your knowledge of the subject area and then include all of them, regardless of the statistical significance, or lack of it, of any particular independent variable. This may not be practical, though, if you have a lot of variables relative to your sample size.

If you are exploring relationships, you need to understand that the relationship of one independent variable with the outcome variable can change considerably in the presence or absence of other independent variables, especially if there is serious multicollinearity. If you are predicting infant mortality, for example, birth weight may be a very good predictor by itself, but after you include gestational age (a variable very highly correlated with birth weight), the effect of birth weight might disappear.
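Partial correlation gives a quick feel for this phenomenon. The sketch below uses small invented data and a continuous outcome as a stand-in for the binary mortality outcome (a deliberate simplification): the "birth weight" variable is strongly correlated with the outcome on its own, but that correlation collapses once the "gestational age" variable is controlled for, because gestational age is driving both.

```python
import math

def pearson(u, v):
    """Pearson correlation of two equal-length lists."""
    n = len(u)
    mu, mv = sum(u) / n, sum(v) / n
    du = [a - mu for a in u]
    dv = [b - mv for b in v]
    sxy = sum(a * b for a, b in zip(du, dv))
    return sxy / math.sqrt(sum(a * a for a in du) * sum(b * b for b in dv))

def partial(r_xy, r_xz, r_yz):
    """Correlation of x and y after controlling for z."""
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz**2) * (1 - r_yz**2))

# Toy data: ga (gestational age proxy) drives the outcome; bw (birth weight
# proxy) tracks ga plus noise that is unrelated to the outcome.
ga = [1, 2, 3, 4, 5, 6, 7, 8]
bw = [g + d for g, d in zip(ga, [0.1, -0.1, 0.1, -0.1, 0.1, -0.1, 0.1, -0.1])]
y  = [g + d for g, d in zip(ga, [-0.1, 0.1, 0.1, -0.1, -0.1, 0.1, 0.1, -0.1])]

r_bwy = pearson(bw, y)                                    # marginal: very strong
r_adj = partial(r_bwy, pearson(bw, ga), pearson(y, ga))   # controlling for ga
```

Here `r_bwy` comes out above 0.99 while `r_adj` is essentially zero: the apparent birth-weight effect was entirely borrowed from gestational age.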

There are some newer methods for logistic regression that work well in place of a "slash and burn" model. Propensity scores, for example, do a very good job of providing an unbiased comparison between a treatment and a control group in the presence of covariate imbalance. But these models are more complex and may require specialized software.
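The core idea can be shown without specialized software. The sketch below uses stratification on a nonparametrically estimated propensity score (the treated fraction within each level of a single binary covariate), which is a deliberate simplification of the usual logistic-regression-based score; all the counts are invented. Sicker patients are both more likely to be treated and more likely to have the outcome, so the naive comparison is badly biased, while the propensity-stratified comparison recovers the built-in treatment effect.

```python
from collections import defaultdict

# Records: (severity, treated, outcome). Sicker patients (severity 1) are
# more likely to be treated AND more likely to have the outcome. The true
# treatment effect is +0.1 on the outcome rate. All numbers are invented.
records = (
      [(0, 1, 1)] * 3  + [(0, 1, 0)] * 7   # severity 0, treated: 3/10 = 0.3
    + [(0, 0, 1)] * 6  + [(0, 0, 0)] * 24  # severity 0, control: 6/30 = 0.2
    + [(1, 1, 1)] * 21 + [(1, 1, 0)] * 9   # severity 1, treated: 21/30 = 0.7
    + [(1, 0, 1)] * 6  + [(1, 0, 0)] * 4   # severity 1, control: 6/10 = 0.6
)

def mean(xs):
    return sum(xs) / len(xs)

# Naive comparison ignores the confounding by severity.
raw = (mean([y for _, t, y in records if t])
       - mean([y for _, t, y in records if not t]))

# Propensity stratification: with one binary covariate, the estimated
# propensity score is just the treated fraction within each severity level,
# so stratifying on the score is the same as stratifying on severity.
strata = defaultdict(list)
for s, t, y in records:
    strata[s].append((t, y))

adjusted = 0.0
for rows in strata.values():
    diff = (mean([y for t, y in rows if t])
            - mean([y for t, y in rows if not t]))
    adjusted += diff * len(rows) / len(records)  # weight by stratum size

print(round(raw, 2), round(adjusted, 2))  # 0.3 (biased) vs 0.1 (true effect)
```

The naive difference (0.3) triples the true effect because treatment is concentrated among the sickest patients; comparing like with like inside each propensity stratum and averaging gives back the true 0.1.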

The third objective is a bit tricky, and in some situations no model, no matter how carefully crafted, will provide a definitive answer. Models with a high degree of multicollinearity may represent situations where the statistical results are inherently ambiguous. Is it how small you are when you are born that matters, or how many weeks early you appear? You can sometimes disentangle these variables, but sometimes you have to live with an ambiguous result.

That being said, there is nothing terrible about a "slash and burn" model. It usually produces a simple model with only a small number of independent variables. If nothing else, this satisfies the parsimony principle. On the downside, "slash and burn" models do not always replicate well, especially when the sample size is small.

I would suggest first, though, that you think carefully about why you are running the logistic regression model in the first place. Then find an approach that meets your goal rather than relying on a simplistic approach like "slash and burn."