P.Mean: Hedging your bets on an informative prior (created 2013-09-26).

There are many situations where the careful use of an informative prior distribution in Bayesian data analysis can greatly improve the precision of your results. When there is a sharp divergence, however, between the prior distribution and the data actually collected, the prior can seriously bias your results. We present a hedging prior distribution that provides a means to discount the prior distribution when the newly acquired data is in sharp conflict with it. The hedging prior is closely related to the modified power prior, but is simpler to implement. We show that the hedging prior offers good statistical properties both when the data and the prior distribution agree closely and when they diverge.

Many informative Bayesian prior distributions can be parameterized in terms of a prior sample size. This prior sample size controls how much weight is given to the prior distribution relative to the data. For example, the beta prior for the parameter pi in a binomial distribution has two parameters, alpha and beta. The prior sample size for the beta distribution is simply the sum of alpha and beta. When this sum is small (for example, in a beta(1,1) or uniform prior distribution), the prior receives very little weight and the estimate is close to that of the data itself.

You may instead propose an informative prior, such as a beta(18,42) distribution. The mean of this beta distribution (0.3) represents your belief, prior to data collection, that pi is around 0.3. The prior sample size in this setting (60) represents a level of certainty about your belief that is equivalent to 60 observations. The Bayesian posterior distribution would put more weight on the prior distribution than on the data until you accumulated more than 60 observations.
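To make this weighting concrete, here is a quick R calculation with the beta(18,42) prior and, for illustration, the 6 successes in 20 trials that appear in the example below. The conjugate update adds the successes to alpha and the failures to beta, so the posterior mean is a weighted average of the prior mean and the sample proportion.

a0 <- 18; b0 <- 42
n0 <- a0 + b0                    # prior sample size, 60
x  <- 6;  n  <- 20               # illustrative data: 6 successes in 20 trials
post.mean <- (a0 + x)/(n0 + n)   # conjugate beta posterior mean, 0.3
weighted  <- (n0/(n0+n))*(a0/n0) + (n/(n0+n))*(x/n)   # same value, written as a weighted average
c(post.mean, weighted)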

An informative prior can be based on historical information, or it can be elicited through expert opinions. When such information is available, it can greatly increase the precision of the posterior estimate, relative to an estimate that disregards this information. Sometimes, though, the new data diverges from the historical data. Sometimes the experts are wrong. When this happens, the Bayesian posterior estimate using an informative prior can produce seriously biased results.

We propose a simple alternative to the informative prior. It sacrifices part of the precision that an informative prior provides, but in return offers a downweighting of the prior distribution when the data is sharply divergent from the prior distribution. It can be thought of as an insurance policy that provides protection against a bad informative prior.

For a prior distribution that can be parameterized to produce a prior sample size, we propose weighting this prior sample size by a hyperparameter tau. A simple (though not recommended) choice for the distribution of tau is the uniform(0,1) distribution. If the data that you collect is consistent with the informative prior, the distribution of tau remains largely unchanged. But if the data is inconsistent with the informative prior, the distribution of tau will slide down to lower values, producing smaller average weights for the prior distribution.

There's no free lunch, however. By spreading the prior sample size across a hyperprior, you are introducing an additional source of uncertainty. If you want a model that is capable of downweighting the prior distribution when needed, you need to sacrifice some precision when the prior distribution is actually on target.

Here's a simple conceptual example of the use of a hyperprior distribution for the prior sample size: Suppose you have an informative prior distribution that is beta(18,42). Create a uniform(0,1) hyperprior tau that controls the prior sample size through an adjustment to the two parameters of the beta distribution. Note that although we start the conceptual example with a uniform(0,1) distribution, we do not recommend its use in practice.

Suppose that you collect data with 20 total observations and 6 successes. In this setting the mean of the data is identical to the mean of the prior distribution. Here is code for the model that can run under BUGS or JAGS.

model {
  tau ~ dunif(lower,upper)    # hyperprior: tau scales the prior sample size
  a <- x0*tau                 # downweighted prior successes
  b <- (n0-x0)*tau            # downweighted prior failures
  p ~ dbeta(a,b)              # hedged informative prior for the binomial proportion
  x ~ dbin(p,n)               # binomial likelihood for the observed data
}

You can run this model through R using the rjags package, assuming the model code above has been saved in the text file named below. Here is the R code that would be needed.

library("rjags")
Fnm <- "JagsHedgedUniformBetaBinomial.txt"
Dat <- list(x0=18,n0=60,x=6,n=20,lower=0,upper=1)
Mon <- c("tau","p")
Mod <- jags.model(Fnm,data=Dat,quiet=TRUE)
update(Mod,1000,progress.bar=NULL)
simple.example <- coda.samples(Mod,variable.names=Mon,n.iter=50000)
summary(simple.example)
plot(simple.example)

If you fit this model, the posterior distribution for tau is shifted slightly to the right.

Density of posterior distribution for tau

The distribution is shifted slightly to the right, perhaps, but it still covers most of the original uniform(0,1) range. The mean is 0.6 and the 2.5 and 97.5 percentiles are 0.1 and 0.98, respectively.

One unusual feature is that the distribution for tau does not continue to shift toward one as you accumulate more data, even if the data is perfectly consistent with the prior distribution.

Here is the posterior distribution for tau if you collect 200 observations and get 60 successes.

Graph of posterior density for tau

The mean is 0.55 and the 2.5 and 97.5 percentiles are 0.06 and 0.98. If you collect 2,000 observations and get 600 successes, the distribution still fails to change much.

Posterior distribution for tau

The mean in this case is only 0.52 and the percentiles are 0.04 and 0.98. It may seem strange at first that tau fails to increase markedly as you obtain more evidence of the consistency between the data and the historical prior. Recall, though, that this consistency means that the estimate of the binomial proportion remains largely unchanged no matter how much or how little weight you put on the prior. It is analogous, perhaps, to the distribution of the p-value, which is uniform under the null hypothesis.

There are some issues with convergence that I have not fully investigated, but there is pretty strong evidence that tau fails to converge because of the behavior of the hyperprior near zero. Even if tau did converge, values of tau very close to zero cause problems that warrant replacing the uniform(0,1) distribution for tau with something else.
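One natural way to look at convergence, though I have not pursued it in detail, is to run several chains and inspect standard diagnostics from the coda package. Here is a sketch, reusing the model file and data from the code above.

library("rjags")
library("coda")
Fnm <- "JagsHedgedUniformBetaBinomial.txt"
Dat <- list(x0=18,n0=60,x=6,n=20,lower=0,upper=1)
Mod <- jags.model(Fnm,data=Dat,n.chains=3,quiet=TRUE)  # several chains for diagnostics
update(Mod,1000,progress.bar="none")
sim <- coda.samples(Mod,variable.names=c("tau","p"),n.iter=50000)
gelman.diag(sim)      # potential scale reduction factors (values near 1 are good)
effectiveSize(sim)    # effective sample sizes for tau and p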

If the data is inconsistent with the prior, you will notice a more dramatic shift in the posterior distribution of tau. Suppose you have the same beta(18,42) prior, but among the 20 data values you collect, you observe 12 successes. The data mean (0.6) is quite a bit higher than the prior mean (0.3).
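You can rerun the same model for this case simply by changing the data list; the model file and monitored parameters from the code above are reused.

Dat <- list(x0=18,n0=60,x=12,n=20,lower=0,upper=1)   # now 12 successes in 20 trials
Mod <- jags.model(Fnm,data=Dat,quiet=TRUE)
update(Mod,1000,progress.bar="none")
divergent.example <- coda.samples(Mod,variable.names=Mon,n.iter=50000)
summary(divergent.example)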

Posterior distribution for tau

The posterior distribution of tau is now noticeably shifted to the left. The posterior mean of tau is 0.4 and the 2.5 and 97.5 percentiles are 0.04 and 0.94, showing that, on average, the model discounts the weight of the prior distribution somewhat relative to the uniform(0,1) hyperprior. The shift becomes more marked as the sample size increases. If you collect 200 data values and observe 120 successes, the posterior distribution is decidedly shifted towards zero.

Posterior distribution of tau

The posterior mean for tau is now only 0.08 and the percentiles are 0.005 and 0.40. This represents a drastic reduction in the weight given to the prior.

The posterior distribution of tau shifts even further toward zero as the deviation of the data from the prior mean becomes more marked. With the same prior distribution (beta(18,42)), suppose that you collect 20 observations and note an astonishing 18 successes.

Posterior distribution of tau

The posterior distribution for this setting is quite extreme. The mean is only 0.12, representing a downweighting of the prior sample size to about 7 observations. This is an amazing feat when you consider that the prior sample size of 60 is three times larger than the data sample size of 20. An application of the simple beta binomial model would have required that the posterior mean for pi be weighted 75% towards the prior distribution.

The results of these simple examples suggest a couple of changes. The first change is to widen the uniform distribution to the range 0 to 2, so that the mean is 1 rather than 0.5. The posterior distribution does not shift markedly towards the high end even when the data is in perfect agreement with the prior, so there is no serious risk of putting too much weight on the prior distribution with a uniform(0,2) hyperprior.

A second problem with the uniform distribution, however, is its problematic behavior near zero. Because the uniform distribution is continuous, it can downweight the prior distribution to a small fraction of a single observation, which makes the prior variance extremely large. As a result, the MCMC estimates with a uniform hyperprior can be unstable. Even if they were not unstable, the occasional use of a prior worth only a fraction of an observation would introduce an unacceptable amount of imprecision into the model.

The simplest way to avoid this problem is to bound the uniform distribution away from 0. This places a limit on the amount of downweighting that could be done, but substantially improves the precision, especially when the data and the prior are in close agreement. A lower bound of 0.1 would still allow the prior to be downweighted by up to 90% if needed. Other possibilities for the hyperprior are also worth exploring, such as the use of a scaled beta distribution (e.g., a beta(2,2) rescaled to fit the range of 0 to 2) and the use of a discrete uniform distribution.
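Here is a rough sketch of what these alternatives could look like; the model string and object names are mine, and the settings are not tuned. The bounded uniform needs no new model code at all, since you just change the bounds passed in the data list. The rescaled beta(2,2) needs a small change to the model, shown here as a model string passed through textConnection().

# Alternative 1: uniform hyperprior bounded away from zero (same model as above)
Dat <- list(x0=18,n0=60,x=6,n=20,lower=0.1,upper=1.9)

# Alternative 2: beta(2,2) hyperprior rescaled to the range (0,2)
scaled.beta.model <- "
model {
  tau0 ~ dbeta(2,2)
  tau <- 2*tau0
  a <- x0*tau
  b <- (n0-x0)*tau
  p ~ dbeta(a,b)
  x ~ dbin(p,n)
}
"
Dat2 <- list(x0=18,n0=60,x=6,n=20)
Mod2 <- jags.model(textConnection(scaled.beta.model),data=Dat2,quiet=TRUE)
update(Mod2,1000,progress.bar="none")
scaled.example <- coda.samples(Mod2,variable.names=c("tau","p"),n.iter=50000)
summary(scaled.example)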

The harmonic mean of the hyperprior distribution is a useful index of the amount of precision lost. A simple informative prior behaves like n0 extra observations, so its contribution to the posterior variance is roughly proportional to 1/n0; for a hedged prior, the corresponding quantity is roughly the expectation of 1/(tau*n0), so the effective prior sample size is about n0 times the harmonic mean of tau. As the lower bound of the uniform is moved further away from zero, the harmonic mean increases, showing improved performance when the historical prior and the data agree. But the further you bound the uniform away from zero, the less downweighting is possible when the data diverges from the historical prior. You can think of the lower bound as an insurance premium. If your lower bound is large, you pay a small premium, but you limit the benefits of the hedging prior. If your lower bound is close to zero, you can get more benefits, but you pay a higher premium in terms of precision lost when the data agrees with the historical prior. If you have a uniform(a,b) distribution, the harmonic mean is

(b - a) / ln(b/a)

The problem with a uniform(0,1) hyperprior is that its harmonic mean is zero: the expectation of 1/tau is infinite. Even if the model converged, you would get no added precision, on average, from this hyperprior, which means no benefit, on average, from the historical information.

Here is a table showing the harmonic mean for various uniform distributions.

U(0.10,1.90) 0.61
U(0.25,1.75) 0.77
U(0.50,1.50) 0.91

This provides a rough idea of the loss of precision caused by hedging. The U(0.1,1.9) reduces the effective prior sample size by roughly 40%, the U(0.25,1.75) reduces it by about a quarter, and the U(0.5,1.5) reduces it by only about 1/10.
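These values come straight from the formula, as a one-line calculation in R shows.

hm <- function(a,b) (b - a)/log(b/a)   # harmonic mean of a uniform(a,b)
round(c(hm(0.10,1.90), hm(0.25,1.75), hm(0.50,1.50)), 2)
# 0.61 0.77 0.91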

We ran a small simulation study to quantify the amount of downweighting that the hedging hyperprior produces in the beta binomial setting, along with the accompanying loss of precision. The simple beta binomial model has a closed form solution that provides easy reference points for this simulation. Here's an illustration of those reference points.

Reference lines

In this graph, the horizontal axis represents the sample size for the observed data and the vertical axis represents the posterior estimate of pi. These reference lines represent the setting where the prior mean is set to 0.1, and the true mean is actually 0.3. The dark green curve labelled 100% represents a fully weighted prior with a prior sample size of 60. As the data set grows larger, the posterior mean is pulled closer to the true mean, as the data gets greater weight. At n=60, the weight is equal between the prior and the data, leading to a posterior mean that is exactly halfway between 0.1 and 0.3. The dark red curve labelled 0% represents an estimate based solely on the data with no weight on the prior distribution. This curve is flat because there is no averaging between the disparate means of the prior and the data. The intermediate values represent various levels of downweighting. The curve labelled 10%, for example, represents a prior based on 6 rather than 60 observations.
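These mean reference curves are just weighted averages of the prior mean and the data mean, so they are easy to reproduce. Here is a sketch in R that treats the observed proportion as equal to the true mean, which is what the reference lines represent; the variable names are mine.

prior.mean <- 0.1
true.mean  <- 0.3
n0 <- 60                              # full prior sample size
n  <- seq(10,200,by=10)               # data sample sizes
for (w in c(1,0.5,0.25,0.1,0)) {      # fraction of the prior retained
  post.mean <- (w*n0*prior.mean + n*true.mean)/(w*n0 + n)
  cat(sprintf("%3.0f%%:",100*w), round(post.mean,3), "\n")
}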

You can produce a similar graph for precision.

Plot of reference lines for precision

The dark green curve labelled 100% represents the posterior standard deviation for pi using a fully weighted prior and the dark red curve labelled 0% represents the posterior standard deviation using only the data. In general, the posterior standard deviation is smallest for the fully weighted prior, though the interplay between the mean and variance of a beta distribution can sometimes cause anomalies in extreme situations.
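The standard deviation reference curves come from the same conjugate beta posterior. A sketch along the same lines as the previous one:

beta.sd <- function(a,b) sqrt(a*b/((a + b)^2*(a + b + 1)))  # sd of a beta(a,b)
prior.mean <- 0.1
true.mean  <- 0.3
n0 <- 60
n  <- seq(10,200,by=10)
for (w in c(1,0.5,0.25,0.1,0)) {
  a <- w*n0*prior.mean + n*true.mean            # posterior alpha
  b <- w*n0*(1-prior.mean) + n*(1-true.mean)    # posterior beta
  cat(sprintf("%3.0f%%:",100*w), round(beta.sd(a,b),4), "\n")
}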

Here is a graph showing the performance of four estimators when the prior mean is 0.1 and the true mean is 0.3. The "hlf" curve represents a uniform(0.5,1.5) hyperprior, "qrt" represents a uniform(0.25,1.75), and "ten" represents a uniform(0.1,1.9). Finally, "sng" represents a hyperprior whose lower bound is set to the equivalent of a single observation; for a prior sample size of 60, this corresponds to a uniform(0.0167,1.9833).

Simulation results

Notice that all of the estimates downweight the prior distribution somewhat, even for data with only 10 observations. The amount of downweighting also accelerates for all estimates as the size of the data increases. The downweighting is strongest for "sng" which has the lower bound closest to zero.

The following graph repeats this simulation in a more extreme setting where the prior mean is 0.1 and the true mean is 0.5.

Simulation results

The amount of downweighting is sharper in this setting, and for data with 120 observations the downweighting appears to reach the maximum possible for each estimate.

Here are the simulation results for a prior mean of 0.1 and a true mean of 0.7.

Simulation results

Notice that as the true mean diverges from the prior mean, the downweighting starts immediately and rapidly progresses to the maximum possible downweighting.

We've run simulations for prior means of 0.3 and 0.5, and the patterns are similar.

The precision results for a prior mean of 0.1 and a true mean of 0.1 appear below.

Simulation results

Notice that all of the hedged priors show a loss of precision relative to the simple informative prior (represented by the dark green curve labelled 100%). The loss of precision is smallest for the "hlf" hyperprior. As your data set gets larger, the inefficiencies of the hyperpriors become less serious. The following graphs show the precision for a prior mean of 0.1 and a true mean of 0.3,

Simulation results

a prior mean of 0.1 and a true mean of 0.5,

Simulation results

and a prior mean of 0.1 and a true mean of 0.7.

Simulation results

Notice that the precision tends to correspond to the amount of downweighting done. For the "sng" hyperprior, the precision is close to that of a purely data-based posterior (the dark red curve labelled 0%), or sometimes even a bit worse: when you downweight almost all of the prior, the precision has to come from the data alone. The "hlf" hyperprior, on the other hand, never downweights by more than 50% and never has a precision worse than the 50% curve.

The hedged prior can be easily adapted to other settings.

Estimating a normal mean when the variance is known is a very simple example. An informative prior distribution for the mean, mu, in this setting would take the form

Normal distribution

which can be rewritten in terms of the known variance, sigma^2, and a prior sample size, n0.

Normal distribution

The hedging hyperprior would further modify this to

Normal distribution with hedged hyperprior
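In symbols, and writing mu0 for the prior mean of mu, these three forms would be roughly

mu ~ Normal(mu0, sigma0^2)

mu ~ Normal(mu0, sigma^2/n0)

mu ~ Normal(mu0, sigma^2/(tau*n0))

so that the prior variance sigma0^2 is re-expressed as sigma^2/n0, and tau once again scales the prior sample size n0, with tau=1 recovering the fully weighted informative prior.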

A gamma prior is often used for the rate parameter, lambda, in a Poisson distribution. With a hedged hyperprior, this would take the form

Gamma distribution with hedged hyperprior
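In symbols, the hedged gamma prior would be roughly

lambda ~ Gamma(tau*alpha, tau*beta)

where alpha is the shape parameter and beta the rate parameter of the original informative prior.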

The tau parameter scales both parameters: beta plays the role of the prior sample size in the gamma-Poisson setting, and alpha is the corresponding prior event count. Multiplying both by tau downweights the prior while ensuring that it keeps the same mean, alpha/beta.

The concept of the hedged hyperprior is similar to that of the modified power prior described in Ibrahim and Chen 2001. In that approach, the historical likelihood is raised to the power tau, which leads to a new prior proportional to

Modified power prior for beta distribution

With the hedging hyperprior, the prior is instead proportional to

Hedged prior

With an informative prior, the values of alpha and beta will be large, making the two prior kernels nearly the same. Similar results could be shown for the normal prior for a normal mean and the gamma prior for the Poisson rate.
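In symbols, and writing alpha and beta for the two parameters of the informative beta prior (alpha = x0 and beta = n0 - x0 in the code above), the two kernels being compared are roughly

(p^alpha * (1-p)^beta)^tau = p^(tau*alpha) * (1-p)^(tau*beta)     [modified power prior]

p^(tau*alpha - 1) * (1-p)^(tau*beta - 1)                          [hedged prior, beta(tau*alpha, tau*beta)]

The exponents differ only by one, a difference that is negligible when alpha and beta are large.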

This is a very brief outline of the concept. I hope to spend some time soon updating this page and providing more details.

Creative Commons License This page was written by Steve Simon and is licensed under the Creative Commons Attribution 3.0 United States License. Need more information? I have a page with general help resources. You can also browse for pages similar to this one at Bayesian Statistics.