The analysis of variance table in a linear regression model

Blog post
Author

Steve Simon

Published

September 25, 2025

The analysis of variance table is used for an analysis of variance model, but it is also used for a linear regression model. I want to offer some intuition on what all the entries mean in this table using a simple artificial dataset.

The artificial dataset

# A tibble: 6 × 2
      x     y
  <dbl> <dbl>
1     4    34
2     6    20
3     8    10
4    10    32
5    12     6
6    14    24

Most of the time, I try to avoid using artificial datasets. I am making an exception here because I want to show the internal calculations using numbers that are easy to work with.

There are two variables with generic names, x and y. Treat x as the independent variable and y as the dependent variable.
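If you want to follow along in R, here is a minimal sketch for recreating the dataset. The tibble name, demo, is my own choice; any name will do.

library(tibble)

# Recreate the artificial dataset: six evenly spaced x values
# and the six y values shown above
demo <- tibble(
  x = c(4, 6, 8, 10, 12, 14),
  y = c(34, 20, 10, 32, 6, 24)
)
demo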

Descriptive statistics

# A tibble: 1 × 2
  x_mean x_stdev
   <dbl>   <dbl>
1      9    3.74
# A tibble: 1 × 2
  y_mean y_stdev
   <dbl>   <dbl>
1     21    11.4

Here are the means and standard deviations for the two variables. I would normally include an interpretation here, but for an artificial data set, no interpretation is needed.
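If you are recreating these summaries in R, something along these lines should reproduce them, assuming the demo tibble defined earlier.

library(dplyr)

# Means and standard deviations for the two variables
summarize(demo,
  x_mean = mean(x), x_stdev = sd(x),
  y_mean = mean(y), y_stdev = sd(y)
)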

The analysis of variance table

  • Typically three rows
    • Model or Regression or name of the independent variable
    • Error or Residual
    • Total or Corrected Total
  • Typically five columns
    • Sum of squares or SS
    • Degrees of freedom or df
    • Mean square or MS
    • F or F-ratio
    • p or p-value

Regression results from R

Analysis of Variance Table

Response: y
          Df Sum Sq Mean Sq F value Pr(>F)
x          1     70      70  0.4861 0.5241
Residuals  4    576     144               

This is what the analysis of variance table looks like in R. The first row is labeled with the name of the independent variable, in this case just x. The second row is labeled Residuals. The third row is…well, there is no third row for total or corrected total.

The columns are in a slightly different order. The degrees of freedom, labeled “Df”, comes first, then the sum of squares, labeled “Sum Sq”, followed by the mean square column, labeled “Mean Sq”, the F-ratio, labeled “F value”, and finally the p-value, labeled “Pr(>F)”.
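For reference, here is one way to produce this table in R, again assuming the demo tibble from earlier; the object name fit is my own choice.

# Fit the linear regression and request the analysis of variance table
fit <- lm(y ~ x, data = demo)
anova(fit)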

Analysis of variance table in SAS

In SAS, the first row is Model, the second is Error, and the third is Corrected Total. SAS also puts the degrees of freedom column first, labeled “DF”, followed by “Sum of Squares”, “Mean Square”, “F Value”, and “Pr>F”.

Analysis of variance table in SPSS, Linear Regression

SPSS provides two different analysis of variance tables. If you select Linear Regression from the menu, this is the table you get. The rows are labeled “Regression”, “Residual”, and “Total”.

The columns are in a more logical order, in my opinion, than in R or SAS. The first column, “Sum of Squares”, is followed by the degrees of freedom column (“df”) and the mean square column. The F-ratio, labeled just “F”, and the p-value, labeled “Sig”, round out the rest of the table.

Analysis of variance table in SPSS, General Linear Model

If you select General Linear Model, you get six rows. You only need three, but for more complex models you may need more. The first and third rows, “Corrected Model” and “x”, are identical here, but they will differ in more complex regression models. SPSS includes an “Intercept” row, which you will almost always ignore. The fourth row, labeled “Error”, is what you would normally see as the second row in most analysis of variance tables. The last row, “Corrected Total”, is what is normally placed as the third or last row in most other analysis of variance tables. The “Total” row, like the “Intercept” row, is another row that you almost always ignore.

The first column is labeled “Type III Sum of Squares”. The “Type III” is not important in this example, but becomes important when you have multiple independent variables. The rest of the columns match the other SPSS analysis of variance table.

Keeping only the important rows

It is easy to remove rows from a table in SPSS. Here is what the table would look like without rows 2, 3, and 5. This table deliberately kept the blank rows visible, but you can easily tighten things up by suppressing the blank rows entirely.

Analysis of variance by Andy Field

  • Does not arrange sums of squares into a table
    • Model: \(SS_M\), \(MS_M\)
    • Residual: \(SS_R\), \(MS_R\)
    • Total: \(SS_T\)

I am using a book by Andy Field in a class I am teaching, and thought it would be good to see how he lays out the analysis of variance table. Unfortunately, he does not provide a table in the chapter on linear regression. In a later chapter on analysis of variance, he does produce such a table, but the row labels only make sense when you have defined the analysis of variance model as something different from the linear regression model. More on that in a separate webpage, maybe.

So what, exactly, is the analysis of variance table measuring?

  • Model/Regression: Explained variation (signal)
  • Error/Residual: Unexplained variation (noise)
  • Total/Corrected total: Total variation

There is a relationship between the various sums of squares. The sum of squares for the model or regression represents explained variation. The sum of squares for error or residual represents unexplained variation. The two combined add up to the sum of squares total or total variation. The relative size of explained versus unexplained variation indicates whether there is a strong relationship, a weak relationship, or little or no relationship between the independent and dependent variables. Let’s see what this means with a few graphs and a few calculations.

You can also think of the sum of squares for the model as the signal and the sum of squares for error as the noise. The total variation in your data is partitioned into variation associated with the signal and variation associated with noise.

Scatterplot of the hypothetical data

In any linear regression model, even an artificial one like this, you should always display a graph. There appears to be a weak negative trend in the data.

Least squares regression

  • Collect a sample
    • \((X_1,Y_1),\ (X_2,Y_2),\ ...\ (X_n,Y_n)\)
  • Compute residuals
    • \(e_i=Y_i-(b_0+b_1 X_i)\)
    • Choose \(b_0\) and \(b_1\) to minimize \(\Sigma e_i^2\)

The least squares principle is an approach to finding a line that is close to most of the data. It involves compromise because if you try to get too close to one datapoint, the distance to the other data points can get worse.

You measure closeness by computing the residuals. These represent a deviation between the actual data and the prediction you would get using a straight line.

Let’s look at some examples.

A bad fitting line

Just to try to illustrate how linear regression works, I want to show some “trial and error” equations. The actual linear regression algorithm jumps immediately to the best equation, but let’s start with a very bad guess.

This graph illustrates what a line with an intercept of 20 and a slope of 1 would look like. Notice that it tends to be too high. Most of the residuals (the difference between the observed values and the values on the fitted line) are negative.

Sum of squared residuals for the bad fit

(34-24)² + (20-26)² + (10-28)² + (32-30)² + (6-32)² + (24-34)²
(10)² + (-6)² + (-18)² + (2)² + (-26)² + (-10)²
100 + 36 + 324 + 4 + 676 + 100
1240

You measure how well (or in this case how poorly) an equation fits by squaring the residuals and adding them up. Squaring does two things. First, it treats deviations above and below the line equally. Being five units above contributes a squared residual of 25 and being five units below also contributes a squared residual of 25.

Squaring the residual also tends to emphasize the larger deviations. You get a squared deviation of 100 when you are ten units above or below the line compared to a squared deviation of 16 when you are four units above or below the line.

This emphasis on large deviations means that linear regression tries very hard to avoid large deviations even for just a single data point.

Notice that the large deviations for this line lead to a very large sum of squared residuals.
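If you want to experiment with your own trial lines, a small helper function like the sketch below (the name ss_residuals is my own) computes the sum of squared residuals for any intercept and slope, using the demo tibble from earlier.

# Sum of squared residuals for a trial line with a given intercept and slope
ss_residuals <- function(intercept, slope, x, y) {
  sum((y - (intercept + slope * x))^2)
}

ss_residuals(20, 1, demo$x, demo$y)  # the bad fit, 1240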

A better fitting line

You can improve things somewhat by shifting the line lower. This graph shows how well a line with an intercept of 12 and a slope of 1 performs.

Sum of squared residuals for a better fit

(34-16)² + (20-18)² + (10-20)² + (32-22)² + (6-24)² + (24-26)²
(18)² + (2)² + (-10)² + (10)² + (-18)² + (-2)²
324 + 4 + 100 + 100 + 324 + 4
856

An even better line

You can do even better by changing to a negative slope. The line shown here has an intercept of 30 and a slope of -1.

Sum of squared residuals for an even better fit

(34-26)² + (20-24)² + (10-22)² + (32-20)² + (6-18)² + (24-16)²
(8)² + (-4)² + (-12)² + (12)² + (-12)² + (8)²
64 + 16 + 144 + 144 + 144 + 64
576

The sum of squared residuals has declined even further.

It turns out that you can’t do any better than this line. If you tried to get closer to some of the data points, any gains would be offset by losses from the points that move further away.

This is the least squares principle. The best line is one that is close to most of the data points. It minimizes the sum of squared residuals.
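To confirm this, you can let R find the least squares line directly. Assuming the demo tibble from earlier, the fitted coefficients match the intercept of 30 and slope of -1 used above.

# The least squares fit lands on the intercept and slope that
# minimize the sum of squared residuals
coef(lm(y ~ x, data = demo))  # intercept 30, slope -1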

SS regression plot

This plot shows the variation of the predicted values. You can think of this as a comparison of how far the predicted values from the estimated regression line deviate from a flat line at the mean of Y. A flat line represents no signal.

If the regression line is very close to flat, this is evidence of a weak signal. A steep line (large positive or large negative slope) is evidence of a strong signal.

SS regression calculations

(26-21)² + (24-21)² + (22-21)² + (20-21)² + (18-21)² + (16-21)²
(5)² + (3)² + (1)² + (-1)² + (-3)² + (-5)²
25 + 9 + 1 + 1 + 9 + 25
70

This shows the actual calculation of the sum of squares for the model.

You might ask: is this value small (indicating a weak signal) or large (indicating a strong signal)? To answer this question, you need to compare the variation in the predicted values, that is, the variation in the signal, to something else.
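If you prefer to let R do this arithmetic, the same value comes from comparing the fitted values to the mean of y, assuming the fit object defined earlier.

# Sum of squares for the model: predicted values versus the mean of y
sum((fitted(fit) - mean(demo$y))^2)  # 70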

SS regression in the analysis of variance table

           SS  df   MS     F  p-value
Model      70
Residual
Total

You place the sum of squares regression in the first row, first column.

SS total plot

To compute the sum of squares total, you compare the observed data to a flat line at the mean of Y.

SS total calculations

(34-21)² + (20-21)² + (10-21)² + (32-21)² + (6-21)² + (24-21)²
(13)² + (-1)² + (-11)² + (11)² + (-15)² + (3)²
169 + 1 + 121 + 121 + 225 + 9
646

There is a lot of variation in the observed values, much more than the variation in the predicted values.
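The same calculation in R, assuming the demo tibble from earlier, compares each observed value to the mean of y.

# Sum of squares total: observed values versus the mean of y
sum((demo$y - mean(demo$y))^2)  # 646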

SS total in the analysis of variance table

           SS  df   MS     F  p-value
Model      70
Residual
Total     646

You place the sum of squares total in the third row, first column.

Calculate R-squared

  • \(R^2=\frac{SS_M}{SS_T}=\frac{70}{646}=0.108\)

The sum of squares total is always larger than the sum of squares model. Sum of squares total represents variation both explained and unexplained. The ratio of sum of squares model to sum of squares total represents the proportion of explained variation. It is denoted by the symbol \(R^2\).

In the hypothetical model, you look at the ratio of 70 to 646, which is 0.108. You interpret this as a weak relationship. A regression model using X to predict Y can only explain about 11% of the variation in Y.
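As a quick check in R, the ratio of the two sums of squares matches the R-squared that the regression output reports, again assuming the demo tibble from earlier.

# R-squared as the proportion of explained variation
70 / 646                                   # 0.108
summary(lm(y ~ x, data = demo))$r.squared  # about 0.108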

SS residual plot

The other sum of squares is the sum of squares for error or residuals. It represents the deviation of the observed values from the predicted values.

SS residual calculations

(34-26)² + (20-24)² + (10-22)² + (32-20)² + (6-18)² + (24-16)²
(8)² + (-4)² + (-12)² + (12)² + (-12)² + (8)²
64 + 16 + 144 + 144 + 144 + 64
576

SS residual in the analysis of variance table

           SS  df   MS     F  p-value
Model      70
Residual  576
Total     646

You place the sum of squares residual in the second row, first column.

Notice how the sums of squares add up. Explained variation (SS model = 70) plus unexplained variation (SS residual = 576) equals total variation (SS total = 646).

Additive relationship among the sum of squares

\(\begin{matrix} \text{Model} & \Sigma(\hat{Y}_i-\bar{Y})^2 & \Sigma(\text{Predicted}_i-\text{Mean})^2 & \text{Signal / Explained variation}\\ \text{Residual} & \Sigma(Y_i-\hat{Y}_i)^2 & \Sigma(\text{Observed}_i-\text{Predicted}_i)^2 & \text{Noise / Unexplained variation}\\ \text{Total} & \Sigma(Y_i-\bar{Y})^2 & \Sigma(\text{Observed}_i-\text{Mean})^2 & \text{Total variation} \end{matrix}\)
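Here is a short R check of this additive relationship, assuming the demo tibble and fit object from the earlier sketches; the object names are my own.

# The three sums of squares and their additive relationship
ss_model    <- sum((fitted(fit) - mean(demo$y))^2)  # 70
ss_residual <- sum(resid(fit)^2)                    # 576
ss_total    <- sum((demo$y - mean(demo$y))^2)       # 646
ss_model + ss_residual                              # 646, the same as ss_total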

df model

  • \(df_M\) = number of independent variables

           SS  df   MS     F  p-value
Model      70   1
Residual  576
Total     646

The degrees of freedom represent the amount of data that is contributing to the various sum of squares quantities. The degrees of freedom for the model is equal to the number of independent variables. In our example, there is only one independent variable, x, so the degrees of freedom associated with the sum of squares model is 1.

df total

  • \(df_T\) = number of observations - 1

           SS  df   MS     F  p-value
Model      70   1
Residual  576
Total     646   5

The degrees of freedom for the total sum of squares is equal to the number of observations minus 1. You lose a degree of freedom because this sum of squares includes an estimate of the mean.

df residual

  • \(df_R\) = number of observations - number of independent variables - 1

           SS  df   MS     F  p-value
Model      70   1
Residual  576   4
Total     646   5

The degrees of freedom for the residuals or error is the number of observations, with a loss of a degree of freedom for each independent variable and a loss of another degree of freedom for the estimated intercept.

Notice how the degrees of freedom add up: 1 + 4 = 5.

Mean square calculations, 1

  • \(MS_M\) = \(SS_M\) / \(df_M\)

           SS  df   MS     F  p-value
Model      70   1   70
Residual  576   4
Total     646   5

The mean square column is just the sum of squares column divided by the degrees of freedom column. In a linear regression analysis with only one independent variable, there is no difference between \(SS_M\) and \(MS_M\). This will change when you have multiple independent variables.

Mean square calculations, 2

  • \(MS_R\) = \(SS_R\) / \(df_R\)

           SS  df   MS     F  p-value
Model      70   1   70
Residual  576   4  144
Total     646   5

Do the same calculation for mean square residual by dividing the sum of squares residual by the degrees of freedom for the residual.

Although you could do a similar calculation for the total row, dividing the sum of squares total by the degrees of freedom total, this is not normally done. Leave this spot in the analysis of variance table blank.
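In R, the two mean squares are simple enough to compute directly; the object names below are my own.

# Mean squares: sums of squares divided by their degrees of freedom
ms_model    <- 70 / 1    # 70
ms_residual <- 576 / 4   # 144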

F ratio

  • F = \(MS_M\) / \(MS_R\)
  • Tests the hypothesis
    • \(H_0:\ \beta_1=0\)
    • Accept \(H_0\) if F is close to 1
    • Reject \(H_0\) if F is much larger than 1

           SS  df   MS     F  p-value
Model      70   1   70  0.49
Residual  576   4  144
Total     646   5

The F-ratio is the mean square model divided by the mean square residual. This ratio tests the null hypothesis that \(\beta_1\) equals zero.

If the mean square model is about equal to the mean square residual, the F-ratio is close to one. This is evidence that the signal, if there is one, is being swamped by the noise. A large F-ratio tells you that the signal stands out prominently relative to the noise in the data.

Place the F-ratio in the fourth column, first row. Nothing goes in any of the other spots in the fourth column, at least for a linear regression with only one independent variable.
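The corresponding arithmetic in R, using the mean squares computed above, matches the F value shown in the R output earlier.

# F-ratio: mean square model divided by mean square residual
f_ratio <- 70 / 144
f_ratio  # about 0.49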

p-value

  • Tests the hypothesis
    • \(H_0:\ \beta_1=0\)
    • Accept \(H_0\) if p-value is large
    • Reject \(H_0\) if p-value is small

           SS  df   MS     F  p-value
Model      70   1   70  0.49    0.524
Residual  576   4  144
Total     646   5

The p-value provides an alternative way to test the null hypothesis. You should accept the null hypothesis if the p-value is large.

Place the p-value in the last column, first row. Nothing goes in any of the other spots in the last column, at least for a linear regression with only one independent variable.
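In R, the p-value is the upper tail area of the F distribution with 1 and 4 degrees of freedom, evaluated at the observed F-ratio. This reproduces the 0.5241 shown in the R output earlier.

# p-value: upper tail of the F distribution with df1 = 1 and df2 = 4
pf(70 / 144, df1 = 1, df2 = 4, lower.tail = FALSE)  # about 0.524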

Reminder: what is the population model?

This is a hypothetical dataset and it represents a sample of 6 hypothetical patients. I am showing, hypothetically, what the population that this sample was drawn from might look like.

The points in black represent the sample and the points in gray represent the population. Notice how much bigger the population is!

The line in black represents the regression line that you would estimate based on the sample of 6 patients. The line in gray represents the regression line that you would estimate based on the population.

Notice that the two lines disagree and this is to be expected. A sample is a subset of a population and even if it is selected without any bias, there will still be a difference between the results of the sample and the results in the population. This is sampling error.

Notice how the gray line is flat. It is showing no relationship between X and Y, even though there is a negative relationship (admittedly a weak negative relationship) when you look just at the sample.

What is the difference between R-squared and the F-ratio?

  • R-squared measures strength of the relationship
  • F-ratio measures statistical evidence of a relationship
    • A large F-ratio implies lots of evidence of a relationship
    • It could be a strong relationship with a small sample size
    • It could also be a weak relationship with a large sample size
  • R-squared is purely descriptive
    • R-squared does not change as the sample size increases
  • F-ratio is inferential
    • F-ratio increases as the sample size increases (if there is indeed a relationship of any strength)

Although it seems like R-squared and the F-ratio are just different measures of the same thing, there are some important differences.

R-squared measures the strength of the relationship. A value larger than 0.5 implies a strong relationship. A value between 0.1 and 0.5 implies a weak relationship. A value less than 0.1 implies little or no relationship.

The F-ratio measures evidence of a relationship. A large F-ratio implies lots of evidence of a relationship. This could be a strong relationship found with a small sample size. It could just as easily be a weak relationship found with a large enough sample size.

R-squared is purely descriptive. It does not change much as the sample size increases. In contrast, the F-ratio is inferential. As your sample size increases, the F-ratio also increases, assuming that there is indeed a relationship between the dependent and independent variables.