P.Mean: What's the difference between regression and ANOVA? (created 2008-10-15).

Someone asked me to explain the difference between regression and ANOVA. That's challenging because regression and ANOVA are like flip sides of the same coin. They are different, but they have more in common than you might think at first glance.

A very simple explanation is that regression is the statistical model that you use to predict a continuous outcome on the basis of one or more continuous predictor variables. In contrast, ANOVA is the statistical model that you use to predict a continuous outcome on the basis of one or more categorical predictor variables. Most people will carve out one big exception to the "one or more categorical variables" statement. If you have a single categorical variable, and it only has two levels (in other words, a binary category), then most people would describe the method as a two-sample t-test. A single categorical predictor with three or more levels, or two or more categorical predictor variables with any number of levels, would be considered an ANOVA model.

So if you're trying to predict the duration of breastfeeding in weeks using mother's age as a predictor variable, then you would use a regression model. If you are trying to predict the duration of breastfeeding in weeks using mother's marital status (single, married, divorced, widowed), then you would use an ANOVA model. If you are trying to predict the duration of breastfeeding in weeks using prenatal smoking status (smoked during pregnancy, did not smoke during pregnancy), then you would use a two-sample t-test. If you added delivery type (vaginal/c-section) to prenatal smoking status, then the two binary predictor variables would be analyzed using an ANOVA model.

What if you had two predictor variables, one continuous and one categorical? Suppose, for example, that you wanted to predict duration of breastfeeding in weeks using both the mother's age and the delivery type? Is it a regression model, an ANOVA model, or a t-test? Some people would use an entirely new term to describe this model, ANCOVA (Analysis of Covariance). Others might quibble with this terminology. In general, the language of statistics is not as standardized as you might like, and sometimes different people will use different terms for essentially the same model.
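Whatever name you give it, the model with one continuous and one categorical predictor can be fit with ordinary least squares. Here is a minimal sketch in Python with NumPy. The function name `fit_ancova`, the variable names, and the data values are all invented for illustration; they do not come from a real study.

```python
import numpy as np

def fit_ancova(age, csection, weeks):
    """Least squares fit of weeks on age (continuous) and csection (0/1 indicator).

    Returns the coefficients in the order (intercept, age slope, c-section shift).
    """
    # Design matrix: a column of ones, the continuous predictor, the indicator.
    X = np.column_stack([np.ones(len(weeks)), age, csection])
    coef, *_ = np.linalg.lstsq(X, weeks, rcond=None)
    return coef

# Hypothetical data, made up for this example.
age = np.array([22.0, 25.0, 31.0, 28.0, 35.0, 40.0])     # mother's age in years
csection = np.array([0.0, 1.0, 0.0, 1.0, 0.0, 1.0])      # 1 = c-section, 0 = vaginal
weeks = np.array([14.0, 10.0, 20.0, 13.0, 24.0, 19.0])   # breastfeeding duration

intercept, age_slope, csection_shift = fit_ancova(age, csection, weeks)
```

In this sketch, `age_slope` is the estimated change in duration per year of age, and `csection_shift` is the estimated difference between delivery types at a fixed age. The same three numbers come out whether you call the model regression or ANCOVA.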

But one thing you should always keep in mind is that regression and ANOVA have a lot in common. First, both models are applicable only when you have a continuous outcome variable. A categorical outcome variable would rule out the use of either a regression model or an ANOVA model.

Second, you can use the regression algorithm, which is based on the principle of least squares, to fit an ANOVA model. You don't have to use the least squares principle because there are other ways to produce the ANOVA model. But because least squares, the basis for regression models, also works for ANOVA models, some people consider the regression model to be the more general model. You can incorporate categorical predictors into a regression model by using indicator variables. An indicator variable is equal to one for a particular category and zero for the remaining categories. If you have a categorical predictor variable with k levels, then you can input k-1 indicator variables (the last indicator is always redundant) in a regression program and effectively get the same results as an ANOVA model.
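The indicator variable idea can be sketched in a few lines of Python with NumPy. The helper name `indicator_columns` and the data are hypothetical. This sketch drops the alphabetically first level rather than the last one; which level you drop is arbitrary, and the dropped level becomes the reference category.

```python
import numpy as np

def indicator_columns(labels):
    """Return (levels, matrix) where matrix has k-1 indicator columns.

    The first level in sorted order is the reference category and gets no column.
    Assumes at least two distinct levels.
    """
    levels = sorted(set(labels))
    cols = [[1.0 if lab == lev else 0.0 for lab in labels] for lev in levels[1:]]
    return levels, np.array(cols).T

# Hypothetical data, made up for this example.
status = ["single", "married", "divorced", "married",
          "single", "divorced", "widowed", "widowed"]
weeks = np.array([10.0, 20.0, 12.0, 24.0, 14.0, 16.0, 8.0, 12.0])

levels, D = indicator_columns(status)

# Regression with an intercept plus k-1 indicators.
X = np.column_stack([np.ones(len(weeks)), D])
coef, *_ = np.linalg.lstsq(X, weeks, rcond=None)
fitted = X @ coef

# The ANOVA view of the same data: one mean per group.
status_arr = np.array(status)
group_means = {lev: weeks[status_arr == lev].mean() for lev in levels}
```

With these numbers the intercept equals the mean of the reference group (divorced), and each remaining coefficient equals the difference between that group's mean and the reference mean. The fitted value for every observation is simply its group mean, which is exactly what the ANOVA model predicts.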

Third, the concept of partitioning variation into sums of squares (SS) in an ANOVA model also provides a nice way to examine complex regression models. In an ANOVA model, the total variation (total SS) is partitioned into variation between groups (between SS) and variation within groups (within SS). You can do the same sort of thing for a regression model, partitioning total variation into variation due to the model (model SS) and variation unexplained by the model (error SS).
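To see that the partition works the same way in both settings, here is a small sketch in Python with NumPy. The function name `sums_of_squares` and the data are invented for illustration, and the function assumes the design matrix includes an intercept column (the partition relies on that).

```python
import numpy as np

def sums_of_squares(X, y):
    """Fit y on X by least squares and return (total SS, model SS, error SS).

    X must contain an intercept column for total = model + error to hold.
    """
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    fitted = X @ coef
    total_ss = np.sum((y - y.mean()) ** 2)
    model_ss = np.sum((fitted - y.mean()) ** 2)
    error_ss = np.sum((y - fitted) ** 2)
    return total_ss, model_ss, error_ss

# Hypothetical data, made up for this example.
weeks = np.array([14.0, 10.0, 20.0, 13.0, 24.0, 19.0])

# Regression: one continuous predictor.
age = np.array([22.0, 25.0, 31.0, 28.0, 35.0, 40.0])
X_reg = np.column_stack([np.ones(6), age])
reg_total, reg_model, reg_error = sums_of_squares(X_reg, weeks)

# ANOVA: three groups coded with two indicator columns.
group = np.array([0, 0, 1, 1, 2, 2])
X_anova = np.column_stack([np.ones(6), group == 1, group == 2]).astype(float)
aov_total, aov_between, aov_within = sums_of_squares(X_anova, weeks)
```

In the ANOVA case the model SS is the between SS and the error SS is the within SS. Only the labels change.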

Fourth, regression models and ANOVA models share many of the same diagnostic procedures (procedures used to examine the underlying assumptions). In particular, you can compute residuals in both models and the plots involving those residuals are often very helpful.
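As a rough illustration, the same residual calculation serves both models. This is a sketch in Python with NumPy; the function name `residuals` and the data are hypothetical.

```python
import numpy as np

def residuals(X, y):
    """Return observed minus fitted values from a least squares fit of y on X."""
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return y - X @ coef

# Hypothetical data, made up for this example.
weeks = np.array([14.0, 10.0, 20.0, 13.0, 24.0, 19.0])

# Regression design: intercept plus a continuous predictor.
age = np.array([22.0, 25.0, 31.0, 28.0, 35.0, 40.0])
reg_resid = residuals(np.column_stack([np.ones(6), age]), weeks)

# ANOVA design: intercept plus indicators for groups 1 and 2 (group 0 is the reference).
group = np.array([0, 0, 1, 1, 2, 2])
X_anova = np.column_stack([np.ones(6), group == 1, group == 2]).astype(float)
aov_resid = residuals(X_anova, weeks)
```

In the ANOVA case each residual is just the observation minus its group mean. Either set of residuals could then go into the usual plots, such as residuals against fitted values.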

In most scientific circles, there are "lumpers" and "splitters." The lumpers try to find common elements in diverse objects and organize things into a small number of groups with many members. The splitters try to create a large number of groups, each with fewer members that have many things in common. If you are a lumper, then you would describe most if not all models involving a continuous outcome as regression models. ANOVA models and even the t-test are quite different from most other regression models, but the lumpers find enough commonality to use a single term for all these models. The splitters are perfectly comfortable with different names and would draw a careful distinction between regression, ANOVA, and t-tests, and would come up with new terms like ANCOVA to handle other situations.

I'm a lumper in my heart, but it's easier during publication time to behave like a splitter. So I use ANOVA in a paper, even though in my heart, I view the ANOVA model much in the same way I treat a regression model. The problem is that the people who make a sharp distinction between regression and ANOVA are more likely to give you a hard time about your publication than the people who view regression and ANOVA as pretty much the same thing.