P.Mean: Example of power calculation for a repeated measures design (created 2008-10-19).


I was asked how to calculate power for an interaction term in a repeated measures design. There were two groups (treatment and control), and subjects in each group were measured at four time points. The interaction involving the third time point was considered most critical.

Let's think about what an interaction means. Each group is going to have a time trend across the four time points. What is that trend likely to look like? I don't know what this person had in mind, but let's suppose that he said that there is a decreasing trend in both groups, but he expected a much sharper decrease in the treatment group.

It's impossible to list all the possible scenarios where there are decreasing trends with a sharper decrease in the treatment group. Instead, find one, two, or three plausible scenarios. The general assumption (and one that is usually reasonably justified in my experience) is that if a research study has good power for one scenario, then it has comparable power for comparable scenarios.

It is very important to get an impression about the expected time trends in the study from the subject matter expert. It may be that instead of a decreasing trend in both groups, you might expect an increasing trend in both groups because the disease process is increasingly severe over time. What you would hope is that the increasing trend in the treatment group is attenuated. In other words, the best you can hope for is a slowing in the progression of the disease.

In other situations, an interaction might reflect an expectation that the trend is flat in the control group and decreasing in the treatment group. In some situations, you might even expect a crossing pattern: a decrease in the treatment group and an increase in the control group.

The outcome variable in this study is the Oswestry Disability Index which ranges from 0 to 100. I do not know the standard deviation for this outcome measure, but a quick review using PubMed showed hundreds of articles that use this scale. Here is a table from one of these studies:


You need to read carefully elsewhere to discover that the number reported after the +/- is a standard deviation rather than a standard error. Tables 1 and 2 include the words "Mean +/- SD" so it is safe to assume that table 4 uses that same format, even though it is not explicitly stated.

The problem with this table is that the standard deviation is computed across patients, so it mixes between-subject and within-subject variation. The interaction test requires knowledge of the variation across patients and the variation within patients as separate quantities. It would be worthwhile to search through some of the other publications at this point.

Another publication states:

The mean difference in change in ODI scores of those participants in the FairMed and those in TENS groups was 0.4; this difference in change was not significant (p = 0.85). www.biomedcentral.com/1471-2474/9/97

This does use a change score and a change score represents variation within a patient, but this paper does not report a standard deviation or a confidence interval, just a p-value. It is possible to use a p-value here to backcalculate a standard deviation, but it is a poor approximation. I will try to show how you would do this in a separate web page.
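A minimal sketch of the back-calculation mentioned above, assuming a normal approximation and two equal-sized groups. The function names and the group size of 30 are hypothetical; the quoted passage gives only the difference (0.4) and the p-value (0.85). Because the p-value is rounded and close to 1, the implied test statistic is small and the result is very sensitive to rounding, which is one reason the approximation is poor.

```python
from math import sqrt
from statistics import NormalDist


def se_from_p_value(difference, p_value):
    """Standard error implied by a two-sided p-value (normal approximation)."""
    z = NormalDist().inv_cdf(1.0 - p_value / 2.0)  # |test statistic| implied by p
    return abs(difference) / z


def sd_from_se(se, n_per_group):
    """Common SD implied by the SE of a difference between two equal-sized groups."""
    return se / sqrt(2.0 / n_per_group)


# Difference and p-value are the ones quoted in the passage above.
se = se_from_p_value(difference=0.4, p_value=0.85)

# HYPOTHETICAL group size: the quoted passage does not report it.
n_per_group = 30
sd = sd_from_se(se, n_per_group)

print(f"implied SE = {se:.2f}, implied SD of change score = {sd:.2f}")
```

Substitute the paper's actual group sizes before relying on the implied standard deviation.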

A third publication has a graph.


This publication shows a confidence interval, but only graphically. It is possible to take a ruler and measure the width of the confidence interval so as to calculate a standard deviation. Again this is a poor approximation, but I will try to show how it works in a separate web page.
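As a rough illustration of the ruler method, the sketch below converts the end points of a confidence interval into a standard deviation. It assumes the interval is a 95% interval for a single group mean based on a normal approximation. The function name, the interval limits, and the sample size are all hypothetical, because the graph itself is not reproduced here.

```python
from math import sqrt
from statistics import NormalDist


def sd_from_ci(lower, upper, n, level=0.95):
    """SD implied by a confidence interval for one group mean (normal approximation)."""
    z = NormalDist().inv_cdf(0.5 + level / 2.0)  # about 1.96 for a 95% interval
    se = (upper - lower) / (2.0 * z)             # half-width divided by z
    return se * sqrt(n)                          # SE of a mean is SD / sqrt(n)


# HYPOTHETICAL readings from a graph: limits of 20 and 28 with 40 patients.
sd = sd_from_ci(lower=20.0, upper=28.0, n=40)
print(f"implied SD = {sd:.1f}")
```

If the plotted interval is for a difference between two groups rather than a single mean, the last step changes: the standard error of a difference is the standard deviation times the square root of 2/n.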

It is in a fourth publication that we strike gold.


This publication has two tables. The first table shows the standard deviation for the ODI (Oswestry Disability Index) across patients and the second shows the standard deviation for the change score (within patients).

Now I am a total novice when it comes to the Oswestry Disability Index, so these papers are chosen more or less haphazardly. A careful person would pick a target population that was not too different from the one being considered for the current research study. There are literally hundreds of papers that use this scale.

Let's calculate power using the last study. Notice that the standard deviations for within patient comparisons (16.8, 11.7, 13.5, and 9.5) are not much smaller than the standard deviations for across patient comparisons (16.2 and 14.5).

Suppose you wanted to compute the power for an across patient comparison. An example would be comparing the baseline values for two groups. Let's use a sample size of 50 patients per group and assume that a 10 unit difference is considered clinically important. Then the formula for power would be the same as the one you have already used for calculating power for a two-sample t-test.

Plugging in the values of alpha=0.05, D=10, n=50, and sigma=16.2, we get

Three comments are worth mentioning here. First, this formula is not the best formula (the best formula uses a noncentral t distribution), but it is a reasonable approximation. Second, most people, including myself, would probably use a program to calculate power. It is very easy to leave out a small detail like a square root and end up getting a result that is grossly incorrect. Third, most researchers using a repeated measures design probably have limited interest, at best, in a simple comparison of two means at baseline. I am including this example to contrast it with a more complex calculation.
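Since a program is the safer route, here is a minimal Python sketch of a two-sample power calculation. It uses the normal approximation, not the noncentral t distribution, and assumes a two-sided test with equal group sizes and a common standard deviation. The function and argument names are illustrative labels only. The number it prints is simply what this approximation gives for the stated inputs; it is not guaranteed to match the value in the original formula, which is not reproduced here.

```python
from math import sqrt
from statistics import NormalDist


def power_two_sample(delta, sigma, n_per_group, alpha=0.05):
    """Approximate power of a two-sided, two-sample comparison of means.

    Normal approximation; the exact calculation would use a noncentral t.
    """
    z = NormalDist()
    se = sigma * sqrt(2.0 / n_per_group)   # standard error of the difference in means
    z_crit = z.inv_cdf(1.0 - alpha / 2.0)  # two-sided critical value
    shift = abs(delta) / se                # standardized true difference
    # Probability of rejecting in either tail.
    return z.cdf(shift - z_crit) + z.cdf(-shift - z_crit)


# Inputs stated in the text: alpha=0.05, D=10, n=50 per group, sigma=16.2.
baseline_power = power_two_sample(delta=10, sigma=16.2, n_per_group=50)
print(f"approximate power = {baseline_power:.3f}")
```

The same function can be reused for a change score comparison by passing the within-patient standard deviation as `sigma`.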

A second hypothesis is that there is an interaction involving the baseline and the third time points. The contrast for this interaction would look like

If you looked at just the first four elements of this contrast, it is subtracting the baseline treatment mean from the time 3 treatment mean. This is essentially a change score. The last four elements represent a subtraction of the time 3 control mean from the baseline control mean. This is the negative of a change score. So one interpretation of the time 3 interaction is that it represents a difference between the change score in the treatment group and the change score in the control group.
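The description above can be checked numerically. This sketch assumes the eight cell means are ordered as treatment at time points 1 to 4 followed by control at time points 1 to 4, with baseline as the first time point. That ordering is an assumption, since the contrast itself is not reproduced here. The cell means are hypothetical and serve only to show that the contrast equals the treatment change score minus the control change score.

```python
# ASSUMED ordering: treatment at times 1-4, then control at times 1-4,
# with baseline as time 1 and the critical time point as time 3.
contrast = [-1, 0, 1, 0, 1, 0, -1, 0]

# HYPOTHETICAL cell means, for illustration only.
treatment_means = [30.0, 20.0, 10.0, 10.0]
control_means = [30.0, 24.0, 17.5, 17.5]

cell_means = treatment_means + control_means
interaction = sum(c * m for c, m in zip(contrast, cell_means))

# The same quantity written as a difference of change scores.
treatment_change = treatment_means[2] - treatment_means[0]
control_change = control_means[2] - control_means[0]

print(interaction, treatment_change - control_change)
```

The coefficients sum to zero within each group, so any constant shift in a group's means leaves the contrast unchanged. Only the within-group changes matter.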

We already know how variable the change scores are, based on the last publication. The standard deviations are 16.8, 11.7, 13.5, and 9.5. In the previous power calculation, we used the standard deviation for the primary sector patients, so for consistency, let's consider only the first two standard deviations in the change scores since they also represent primary sector patients.

Again, I would ask the subject matter expert a bit about the two treatments in this study (LBP only, Leg pain +/- LBP) to see which standard deviation was more likely to reflect the results in the new study. If there was no reason to think that one group was more relevant than the other, you could average the two values or choose the larger of the two values if you wanted to be conservative. For this example, let's suppose that the second standard deviation (11.7) was the more relevant value.

What size interaction should you be looking for? Let's suppose that both groups show a decrease from baseline to time 3, but that the decrease is quite a bit larger in the treatment group. For example, the control group might decrease from 30 to 17.5 (a change of 12.5) and the treatment group might decrease from 30 to 10 (a change of 20). If the difference in these change scores (20-12.5=7.5) represents the smallest disparity that would be considered clinically important, then we would use D=7.5 in the sample size formula.

The calculated power would again use the formula for a two sample t-test, but this time the standard deviation would represent variation within a patient.

Notice that for both power calculations, the power was inadequate. There is a general consensus in the research community that a good study should have at least 80% power, and some advocate 90% power. So at this point, I would ask about whether the sample size could be increased without causing serious logistical and budgetary problems or if the clinically important difference could be loosened somewhat.

We were very fortunate to find a paper that has the standard deviation of the change score, but in some situations, the papers will be like the first one, where the only standard deviation reported is a standard deviation across patients. This, sad to say, occurs even in papers that use change scores and/or repeated measures designs. The researchers do not understand how important it is to state the variation that occurs within subjects.

In a separate web page, I will show how to extrapolate a standard deviation for a change score when the only variation shown in a paper is variation across patients.

This work is licensed under a Creative Commons Attribution 3.0 United States License. This page was written by Steve Simon and was last modified on 2010-04-01.