Stats: Guidelines for logistic regression models (September 27, 1999)

Guidelines for logistic regression models (created September 27, 1999)

This page is moving to a new website.

There are three steps in a typical logistic regression model.

Fit a crude model

Fit an adjusted model

Examine the predicted probabilities.

Step 1. Fit a crude model.

There are two types of models, crude models and adjusted models. A crude model looks at how a single factor affects your outcome measure and ignores potential covariates. An adjusted model incorporates these potential covariates. Start with a crude model. It's simpler and it helps you to get a quick overview of how things are panning out. Then continue by making adjustments for important confounders.

If the factor that you use to predict your binary outcome is itself binary, you can visualize how the logistic regression model works by arranging your data in a two by two table.

In this example, the treatment group (also labeled "ng tube" in other parts of this website) represents a group of children who received feeding by ng tube when the mother was not in the hospital while the control group (also labeled "bottle" in other parts of this website) received bottles when the mother was not in the hospital.

The Feeding type * Exclusive bf at discharge Crosstabulation shows us the frequency for the four possible combinations of feeding type and breast feeding status at discharge. It helps to also look at the row percentages and the risk option.

The table above shows row percentages for the exclusive breast feeding status at discharge. Notice that a much greater fraction of the Treatment group were exclusive breast feeding at discharge (86.8% versus 41.3% for the control group).

The Risk Estimate table appears when we select the RISK option. This table provides information about the odds ratio and two different risk ratios. The odds ratio is 9.379. You should always be careful about this estimate, because it is dependent on how we arrange the table. If we reversed the rows, for example, and placed the NG Tube group on top, the odds ratio would be inverted. We would have an odds ratio of of 0.107 (=1/9.379). If an odds ratio seems inconsistent with your previous results, be sure to compute the inverse and see if that is consistent.

Notice that SPSS provides two additional estimates. These two additional estimates are risk ratios and are computed by dividing one row percentage by the other. The value of 4.461 is the ratio of 58.7% divided by 13.2%. This is the increase in the probability of not exclusively breast feeding at discharge when we compare the NG Tube group to the Bottle Fed group.

The other estimate, 0.476 (= 41.3/86.8), represents the change in the probability of exclusive breast feeding when we compare the NG Tube group to the Bottle Fed group.

The logistic regression output from SPSS is quite extensive. We will break it apart into pieces and discuss each piece individually.

The Case Processing Summary table shows you information on missing cases and unselected cases. Make sure that you are not losing data unexpectedly.

The Dependent Variable Encoding table shows you which of the categories is labeled as 0 and which is labeled as 1. If the estimates that you get later in the output go in the opposite direction from what you would expect, check here to see if the encoding is reversed from what you expected.

We will skip any discussion of all of the tables in Step 0. These represent the status of a null model with no independent variables other than an intercept. These values are more likely to be interesting if you are fitting a sequential series of logistic regression models.

The Omnibus Tests of Model Coefficients table is mostly of interest for more complex logistic regression models. It provides a test of the joint predictive ability of all the covariates in the model.

The Model Summary table in Step 1 shows three measures of how well the logistic regression model fits the data. These measures are useful when you are comparing several different logistic regression models.

The Classification Table in Step1 is often useful for logistic regression models which involve diagnostic testing, but you usually have to set the Classification Cut-off field to a value other than the default of 0.5. You might want to try instead to use the prevalence of disease in your sample as your cut-off. Under certain circumstances, the percentage correct could relate to sensitivity and specificity (or the reverse), though the use of these terms is a bit unusual for a breast feeding study since this represents a condition not related to disease.

In the Variables in the Equation table for Step 1, the B column represents the estimated log odds ratio. The Sig. column represents the p-value for testing whether feeding type is significantly associated with exclusive breast feeding at discharge. The Exp(B) column represents the odds ratio. Notice that this odds ratio (0.107) is quite a bit different than the one computed using the crosstabulation (9.379). But it is just the inverse; check it out on your own calculator.

We can also get a confidence interval for the odds ratio by clicking on the Options button and selecting the the CI for exp(B) option box.

If we were interested in the earlier odds ratio of 9.379 instead of 0.107, then we would compute the reciprocal of the confidence limits. Thus 3.1 (=1/0.323) and 28.6 (=1/0.035) represent 95% confidence limits.

Let's look at another logistic regression model, where we try to predict exclusive breast feeding at discharge using the mother's age as a continuous covariate.

The log odds ratio is 0.157 and the p-value is 0.001. The odds ratio is 1.170. This implies that the estimated odds of successful breast feeding at discharge improve by about 17% for each additional year of the mother's age.

The confidence limit is 1.071 to 1.278, which tells you that even after allowing for sampling error, the estimated odds will increase by at least 7% for each additional year of age.

If you wanted to see how much the odds would change for each additional five years of age, take the odds ratio and raise it to the fifth power. This gets you a value of 2.19, which implies that a change of five years in age will more than double the odds of exclusive breast feeding.

Step 2. Fit an adjusted model

The crude model shown in step 1, tells you that the odds of breast feeding is nine times higher in the ng tube group than in the bottle group. A previous descriptive analysis, however, told you that older mothers were more likely to be in the ng tube group and younger mothers were more likely to be in the bottle fed group. This was in spite of randomization. So you may wish to see how much of the impact of feeding type on breast feeding can be accounted for by the discrepancy in mothers' ages. This is an adjusted logistic model.

When you run this model, put FEED_TYP as a covariate in the first block and put MOM_AGE as a covariate in the second block. The full output has much in common with the output for the crude model. Important excerpts appear below.

The Omnibus Tests of Model Coefficients table and the Model Summary table for Block 1 are identical to those in the crude model with MOM_AGE as the covariate. We wish to contrast these with the same tables for Block 2.

The Chi-square values in the Omnibus Tests of Model Coefficients table in Block 2 show some changes.

The test in the Model row shows the predictive power of all of the variables in Block 1 and Block 2. The large Chi-square value (28.242) and the small p-value (0.000) show you that either feeding type or mother's age or both are significantly associated with exclusive breast feeding at discharge.

The test in the Block row represents a test of the predictive power of all the variables in Block 2, after adjusting for all the variables in Block 1. The large Chi-square value (12.398) and the small p-value (0.000) indicates that feeding type is significantly associated with exclusive breast feeding at discharge, even after adjusting for mother's age. The Chi-square value is computed as the difference between the -2 Log likelihood at Block 1 (95.797) and Block 2 (83.399).

Notice that the two R-squared measures are larger. This also tells you that feeding type helps in predicting breastfeeding outcome, above and beyond mother's age.

The odds ratio for mother's age is 1.1367. That tells you that each for additional year of the mother's age, the odds of breast feeding increase by 1.14 (or 14%), assuming that the feeding type is held constant.

The odds ratio for feeding type is 0.1443 or, if we invert it, 6.9. This tells us that the odds for breast feeding are about 7 times great in the ng tube group than in the bottle fed group, assuming that mother's age is held constant. Notice that the effect of feeding type adjusting for mother's age is not quite as large as the crude odds ratio, but it is still large and it still is statistically significant (the p-value is .001 and the confidence interval excludes the value of 1.0).

Step 3. Examine the predicted probabilities.

The logistic regression model produces estimated or predicted probabilities and we should compare these to probabilities observed in the data. A large discrepancy indicates that you should look more closely at your data and possibly consider some alternative models.

If you coded your outcome variable as 0 and 1, then you can compute the average to get probabilities observed in the data. But if you have a lot of values for your covariate, you have to group it first.

The Report table shows average predicted probabilities (Predicted probability column) and observed probabilities (Exclusive bf at discharge column) for mother's age. We had to create a new variable where we created five groups of roughly equal size. The first group represented the 15 mothers with the youngest ages and the fifth group represented the 17 mothers with the oldest ages. The last column (Mother's age column) shows the average age in each of the five groups.

The Hosmer and Lemeshow Test table provides a formal test for whether the predicted probabilities for a covariate match the observed probabilities. A large p-value indicates a good match. A small p-value indicates a poor match, which tells you that you should look for some alternative ways to describe the relationship between this covariate and the outcome variable. In our example, the p-value is large (0.545), indicating a good match.

The Contingency Table for Hosmer and Lemeshow Test table shows more details. This test divides your data up into approximately ten groups. These groups are defined by increasing order of estimated risk. The first group corresponds to those subjects who have the lowest predicted risk. In this model it represents the seven subjects where the mother's age is 16, 17, or 18 years. Notice that in this group of 16-18 year old mothers, six were not successful BF and one was. This corresponds to the observed counts in the first three rows of the Mother's age * Exclusive bf at discharge Crosstabulation table (shown below, with the bottom half editted out). The second group of eight mothers represents 19 and 20 year olds, where 4 were exclusive breast feeding at discharge. The third group represents nine mothers aged 21 and 22 years old, and so forth.

The next group corresponds to those with the next lowest risk, those mothers who were 19 and 20 years old.

Summary

There are three steps in a typical logistic regression model.

First, fit a crude model that looks at how a single covariate influences your outcome.

Second, fit an adjusted model that looks at how two or more covariates influence your outcome.

Third, examine the predicted probabilities. If they do not match up well with the observed probabilities, consider modifying the relationship of this covariate.

Further reading

Logistic Regression. David Garson. (Accessed on November 19, 2002) Excerpt: "Binomial (or binary) logistic regression is a form of regression which is used when the dependent is a dichotomy and the independents are continuous variables, categorical variables, or both. Multinomial logistic regression exists to handle the case of dependents with more classes. Logistic regression applies maximum likelihood estimation after transforming the dependent into a logit variable (the natural log of the odds of the dependent occurring or not). In this way, logistic regression estimates the probability of a certain event occurring. Note that logistic regression calculates changes in the log odds of the dependent, not changes in the dependent itself as OLS regression does." www2.chass.ncsu.edu/garson/pa765/logistic.htm