Ambiguity in the definition of the exponential distribution (created 2010-11-16).

I'm trying to run some Bayesian analyses using a program called BUGS (Bayesian inference Using Gibbs Sampling), and this requires me to specify a prior distribution for the parameter associated with an exponential waiting time. I'm having more trouble than I should because the exponential distribution is defined two different ways.

Here's a very simple example of how things can go wrong. I needed, as part of a larger project, to generate 350 random waiting times from an exponential distribution with a mean waiting time of 3.1 days. If you enter the R command

> rexp(n=350, rate=3.1)

you will get the following output.

  [1] 0.142598868 0.051273736 0.148351169 0.193698824 0.040124614 0.177293930
  [7] 0.229432136 0.097639574 0.729434047 0.154231783 0.802208609 0.243872164
 [13] 0.478097452 0.021161563 0.086523378 0.212651465 0.161178541 0.235346940
[343] 0.021009860 0.009656010 0.200375699 0.209521361 0.022620818 0.070489280
[349] 0.122053142 0.779715014

I can look at that and say that this is obviously wrong. There's no way that a random variable with a mean of 3.1 would produce a set of values consistently less than one.
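
Incidentally, the same trap exists outside R. NumPy's exponential generator, for example, is parameterized by a scale (the mean), not a rate, so it fails in the mirror-image way. A quick sanity check in Python (NumPy here is just a stand-in for R's rexp):

```python
import numpy as np

rng = np.random.default_rng(42)

# NumPy's exponential takes a *scale* (the mean), so this gives mean ~3.1
right = rng.exponential(scale=3.1, size=350)

# Treating 3.1 as a rate (scale = 1/3.1) reproduces my mistake: mean ~0.32
wrong = rng.exponential(scale=1/3.1, size=350)

print(right.mean())  # close to 3.1
print(wrong.mean())  # close to 0.32
```

Checking the sample mean against the mean you intended is the cheapest possible diagnostic, and it catches this class of error immediately.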

Now the math-savvy readers among you will probably sniff at my mistake, and of course it's obvious what needs to be fixed in this example. The problem is that in a program like BUGS, a series of random numbers is generated in sequence and combined, and if there is a mistake somewhere in the middle of the sequence, you'll get numbers that are obviously bogus, but it won't be immediately apparent where you made your mistake.

When I saw my BUGS program creating bogus results, I first tried a quick fix. Oh, it's obviously this line, I said. So I'd change the line and the results would still be bogus. Then I'd say, it's obviously this line. Fix that line and still more bogus results. Those were DIFFERENT bogus results, so I thought I was making progress. No I wasn't. I was randomly shifting things around from numerator to denominator and denominator to numerator and nothing seemed to fix things.

After about a dozen attempts, I realized I had to go back to first principles. What, exactly, are the probability distributions that I am working with, and what do their parameters mean? That meant recalling some tedious formulas from my graduate school training 30 years ago. That's okay. My friends all know that "tedious" is my middle name. Let me outline all the tedium so you won't make the same dumb mistakes that I do.

Wikipedia defines the exponential distribution as a continuous distribution with probability density function

f(x) = λ e^(-λx),  x ≥ 0

It seems simple enough, but there is an alternate probability density function that Wikipedia mentions

f(x) = (1/θ) e^(-x/θ),  x ≥ 0

Wikipedia actually uses the Greek letter β instead of θ, but I made the switch to simplify some of the discussion below.

Why would you choose the second form, which is slightly more complicated? Some people prefer this more complicated form because the parameter θ represents the mean of the exponential distribution. In the first form, the mean is 1/λ.

Some people interpret the parameter λ as a rate, because the hazard function for the exponential distribution is a constant rate, λ. There's also a relationship between the Poisson distribution, which is often used to represent counts, and the exponential distribution. The waiting time between successive events is exponential when the count of the total number of events is Poisson. The parameter λ that describes the Poisson distribution of the counts is the same parameter that describes the waiting times between successive events for the exponential distribution.
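
This Poisson/exponential duality is easy to check by simulation. The sketch below (Python with NumPy as an illustration; the rate λ = 4 is an arbitrary choice) generates exponential waiting times with rate λ and confirms that the number of events per day behaves like a Poisson count with mean λ:

```python
import numpy as np

rng = np.random.default_rng(1)
lam = 4.0  # arbitrary rate: 4 events per day

# Exponential waiting times between events have mean 1/lam
gaps = rng.exponential(scale=1/lam, size=200_000)
arrivals = np.cumsum(gaps)

# Count the events that fall in each whole day
days = int(arrivals[-1])
counts, _ = np.histogram(arrivals, bins=np.arange(days + 1))

# Poisson(lam) counts have mean and variance both equal to lam
print(counts.mean(), counts.var())
```

Both the mean and the variance of the daily counts come out close to 4, as a Poisson distribution requires.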

You can interpret the parameter θ as an expected waiting time. If the rate increases, the waiting time decreases, and vice versa. You can also interpret it as a scale parameter. The variance of the exponential distribution is θ², which implies that large values of θ correspond to a widely spread distribution and small values of θ correspond to a tightly concentrated distribution.

So which exponential distribution is used by BUGS? It turns out that BUGS chooses to be largely consistent with R, and in R, the exponential distribution is defined in terms of a rate, λ, rather than a mean, θ.

The formula provided in the R help file is

f(x) = λ e^(-λx)

and the formula provided in the help file for BUGS is

f(x) = λ e^(-λx)

So here is what I should have written in R:

> rexp(n=350, rate=1/3.1)

which would have produced the following output

  [1] 2.727229565 0.012210728 4.643225766 1.220943091 0.141121131
  [6] 1.399026191 1.154146283 0.266013528 0.382612274 3.594326553
 [11] 7.330894980 0.552629477 6.289188506 3.562178279 1.366771249
[341] 1.190045900 6.657467753 0.953907900 1.008005608 3.851789183
[346] 6.540581247 5.153978824 0.108987121 0.061849800 3.064767451

Much better!

Now in BUGS, I had to deal not only with the exponential distribution itself, which modeled the waiting times, but with a prior distribution for the parameter of the exponential distribution.

So what prior distribution should you use? If you define the exponential distribution in terms of a rate (λ), then the gamma distribution is a commonly used prior. The gamma distribution has some nice mathematical properties. Now nice mathematical properties, by themselves, should not dominate your considerations completely. But at a minimum, it does represent a simple starting point for further exploration. You could choose an alternative to the gamma distribution, and if you make sure that the prior distribution doesn't force you to do something weird like divide by zero, take the square root of a negative number, or implicitly create a negative probability, you'll probably be okay.

The gamma distribution has probability density function

f(x) = (β^α / Γ(α)) x^(α-1) e^(-βx),  x ≥ 0

The Γ function is defined as

Γ(α) = ∫₀^∞ t^(α-1) e^(-t) dt

which looks really complicated, but for positive whole numbers, the Γ function simplifies to

Γ(n) = (n-1)!

I like to think of the Γ function as a continuous interpolation of the factorial computation. So Γ(4.5) is somewhere between Γ(4)=3!=6 and Γ(5)=4!=24 (it is actually 11.63173). Note that the gamma distribution uses the gamma function. There are only so many Greek letters to go around, so some level of duplication is unavoidable. Here, I suspect that the duplication is actually intentional.
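
Python's standard library exposes this same function as math.gamma, which makes the interpolation easy to see (a small check, nothing more):

```python
import math

# For positive whole numbers, Gamma(n) = (n-1)!
assert math.gamma(4) == math.factorial(3)  # 6
assert math.gamma(5) == math.factorial(4)  # 24

# And it interpolates smoothly in between
print(math.gamma(4.5))  # roughly 11.63173
```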

Just like the exponential distribution, there is an alternate form for the gamma distribution as well.

f(x) = (1 / (Γ(α) δ^α)) x^(α-1) e^(-x/δ),  x ≥ 0

Again, I am changing symbols from the Wikipedia entry on the gamma distribution.

Why two forms again? Well, the gamma distribution is a generalization of the exponential distribution, and is equal to the exponential distribution when α=1. The first definition of the gamma distribution is an extension of the first definition of the exponential distribution (note that the exponential function has -λx for the exponential distribution and -βx for the gamma distribution). The second definition of the gamma distribution is an extension of the second definition of the exponential distribution (note that the exponential function has -x/θ for the exponential distribution and -x/δ for the gamma distribution).

The parameter δ is called a scale parameter because it behaves much like a standard deviation. A large value for δ produces a probability distribution with a very wide spread and a small value for δ produces a probability distribution that is tightly concentrated. The value of β in the alternate definition of the gamma distribution is also sometimes called a scale parameter, but it works in the opposite direction of a scale parameter. A small value of β corresponds to a probability distribution that is widely spread and a large value of β corresponds to a probability distribution that is tightly concentrated. It is probably more accurate to call the value of β a rate parameter because it corresponds to the exponential rate parameter when α=1.

The expected value and variance of the gamma distribution are α/β and α/β² for the first definition and αδ and αδ² for the second definition.

Which versions do R and BUGS use? It turns out that R does not help us figure this out, because R defines the gamma distribution either in terms of a rate (β) or a scale (δ). The formula that R provides is

f(x) = (1 / (s^a Γ(a))) x^(a-1) e^(-x/s)

and I assume that s represents "scale" rather than "rate". The R help file tries to further clarify things by reminding you about the mean and variance of the gamma distribution.

E(X) = a·s and Var(X) = a·s²

There is one way to find out for sure. If you generate 20 random gamma random variables with a rate of 800, you get

> rgamma(n=20,shape=5,rate=800)
 [1] 0.0058680244 0.0057204477 0.0109982475 0.0043613837 0.0006404141
 [6] 0.0112152769 0.0064985870 0.0028100728 0.0070199387 0.0132604991
[11] 0.0058106632 0.0090521526 0.0037188298 0.0085797938 0.0058939413
[16] 0.0126721646 0.0047010977 0.0064032151 0.0075825553 0.0070805447

and if you generate 20 random gamma random variables with a scale of 800, you get

> rgamma(n=20,shape=5,scale=800)
 [1] 2982.486 5666.349 6269.452 6085.399 3092.336 7902.355 2581.400 6867.902
 [9] 2979.230 4309.074 4226.503 3619.330 3778.685 5234.710 4648.940 5349.982
[17] 3072.502 4540.159 1980.021 4176.649

Clearly, the mean is about 5*800=4000 in the second case but not the first. So "rate" in R is the same as β in the first definition of the gamma distribution and "scale" in R is the same as δ in the second definition of the gamma distribution.
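
NumPy behaves like R's scale argument: its gamma generator takes a shape and a scale, and there is no rate argument at all, so to use a rate you must invert it yourself. A small check in Python (shape 5 and the value 800 mirror the R example above):

```python
import numpy as np

rng = np.random.default_rng(7)

# NumPy's gamma is parameterized by shape and *scale*, like R's scale=
by_scale = rng.gamma(shape=5, scale=800, size=20_000)

# There is no rate argument; a rate of 800 means scale = 1/800
by_rate = rng.gamma(shape=5, scale=1/800, size=20_000)

print(by_scale.mean())  # near 5 * 800 = 4000
print(by_rate.mean())   # near 5 / 800 = 0.00625
```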

In BUGS, the help file defines the probability distribution function for the gamma distribution as

f(x) = (λ^α / Γ(α)) x^(α-1) e^(-λx)

Here λ is a rate and corresponds to β in the first definition of the gamma distribution.

Now how does the gamma distribution work as a prior distribution? Recall the basic formula for Bayes theorem

p(λ | data) = p(data | λ) p(λ) / p(data)

or more simply

posterior ∝ likelihood × prior

The gamma is used frequently as a prior distribution for the parameter of an exponential distribution because it produces a simple form for the posterior distribution. Assume that we observe times t1, t2, ..., tm from an exponential distribution. The likelihood is the exponential probability density function, evaluated at each data point and then multiplied together.

L(λ) = λ e^(-λt1) × λ e^(-λt2) × ... × λ e^(-λtm)

With a bit of work, this simplifies to

λ^m e^(-λT), where T = t1 + t2 + ... + tm

The prior distribution for λ is

(β^α / Γ(α)) λ^(α-1) e^(-βλ)

When you multiply these two together, you get

(β^α / Γ(α)) λ^(m+α-1) e^(-λ(T+β))

and with a bit of work you can show that this is equivalent to

λ^((α+m)-1) e^(-(β+T)λ)

This is the heart of a gamma distribution with parameters

α* = α + m and β* = β + T (where T is the sum of the observed times)

It turns out that if you used the alternate definition of the gamma distribution, you'd get the same results. The prior using a scale parameter version of the gamma distribution is

(1 / (Γ(α) δ^α)) λ^(α-1) e^(-λ/δ)

Multiply this by the likelihood to get

(1 / (Γ(α) δ^α)) λ^(m+α-1) e^(-λ(T + 1/δ))

which can be simplified to

λ^((α+m)-1) e^(-λ(1 + δT)/δ)

The heart of this posterior is also a gamma distribution, and here the parameters are

α* = α + m and δ* = δ / (1 + δT)

This is effectively the same result with β replaced by 1/δ.
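
Both conjugate updates are easy to verify numerically. The sketch below (Python with NumPy and SciPy; the prior parameters and waiting times are made-up illustrative numbers, not the trial data) compares a brute-force grid posterior against the gamma(α + m, rate β + T) formula, and confirms that the rate form and the scale form agree:

```python
import numpy as np
from scipy.stats import gamma as gamma_dist

# Made-up prior and data, purely for illustration
alpha, beta = 2.0, 3.0                   # gamma prior on lambda (shape, rate)
t = np.array([1.2, 0.4, 2.7, 0.9, 1.8])  # observed waiting times
m, T = len(t), t.sum()

# Brute force: evaluate likelihood * prior on a grid and normalize
lam = np.linspace(0.001, 5, 2000)
post = lam**m * np.exp(-lam*T) * lam**(alpha - 1) * np.exp(-beta*lam)
post /= post.sum() * (lam[1] - lam[0])

# Conjugate result: gamma with shape alpha + m and rate beta + T
conj = gamma_dist.pdf(lam, a=alpha + m, scale=1/(beta + T))
print(np.abs(post - conj).max())  # small: the grid matches the formula

# Scale form gives the same answer: 1/delta* equals beta + T
delta = 1/beta
delta_star = delta / (1 + delta*T)
assert abs((beta + T) - 1/delta_star) < 1e-9
```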

Suppose I wanted to place a prior distribution not on λ, derived from the first definition of the exponential distribution, but on θ, derived from the second definition of the exponential distribution. For this, we should consider the inverse gamma distribution as a prior distribution. The inverse gamma random variable is defined as the inverse of (one divided by) a gamma random variable. The probability density function of the inverse gamma distribution is

f(θ) = (β^α / Γ(α)) θ^(-α-1) e^(-β/θ),  θ > 0

which looks eerily like the gamma probability density function, but a few things have slipped from the numerator to the denominator. The likelihood is now

L(θ) = (1/θ) e^(-t1/θ) × (1/θ) e^(-t2/θ) × ... × (1/θ) e^(-tm/θ)

which simplifies to

θ^(-m) e^(-T/θ)

and the prior distribution for θ is

(β^α / Γ(α)) θ^(-α-1) e^(-β/θ)

The posterior distribution is

(β^α / Γ(α)) θ^(-(m+α)-1) e^(-(T+β)/θ)

which simplifies to

θ^(-(α+m)-1) e^(-(β+T)/θ)

The heart of this distribution is an inverse gamma distribution, with parameters

α* = α + m and β* = β + T

So it doesn't really matter how you set up your problem; you will produce the same effective posterior distribution. But if you don't know the internal mechanics, you won't be able to write the program properly.
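
Since the inverse gamma is just one-over-a-gamma, you can also check its mean formula, β/(α-1), by inverting gamma draws. A small Monte Carlo sketch (the α = 5, β = 10 values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(3)
alpha, beta = 5.0, 10.0  # arbitrary inverse gamma parameters

# An inverse gamma draw is 1 over a gamma(alpha, rate=beta) draw
theta = 1 / rng.gamma(shape=alpha, scale=1/beta, size=200_000)

# Its mean should be beta / (alpha - 1) = 2.5
print(theta.mean())
```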

Let's see how this works in practice. The following data represent waiting times between successive patients in a clinical trial.


We are interested in getting an estimate of the average waiting time, using just the first 41 observations, to help forecast the duration of the remainder of the trial. We have a strong prior belief about waiting times, corresponding to an inverse gamma distribution with alpha=175 and beta=547.5. What does this prior distribution mean?

The mean of an inverse gamma is β/(α-1), which in this example is about 3.1 days. The goal is to recruit 350 patients total, which would be expected to take about 3.1 × 350 ≈ 1,100 days, or roughly 3 years. How much uncertainty is there in this prior distribution? You can't get the percentiles of the inverse gamma easily in R, but you can get the percentiles of the gamma distribution using the qgamma function and then invert them.

> 1/qgamma(c(0.975,0.025),shape=175,rate=547.5)
[1] 2.712253 3.649225

Notice that the probability list is inverted, because an upper percentile for the gamma distribution corresponds to a lower percentile of the inverse gamma distribution. What would we predict the duration of the entire trial to be, based on the prior distribution? You can simulate this by randomly selecting a value of θ from the prior inverse gamma distribution, and then sampling 350 exponential waiting times with that value of θ. Repeat this process with a new θ, and then another and another. About a thousand of these should do just fine. A shortcut is worth noting: the sum of 350 exponential random variables is a gamma random variable with shape parameter 350.

> theta <- 1/rgamma(n=1000,shape=175,rate=547.5)
> lambda <- 1/theta
> trial.duration <- rgamma(n=1000,shape=350,rate=lambda)
> quantile(trial.duration,probs=c(0.025,0.975))
2.5% 97.5%
908.2524 1315.1199

With this prior distribution, you would expect the trial to last between roughly 908 and 1,315 days with 95% probability (approximately 2.5 to 3.6 years). I should mention that there are a few inefficiencies in the code, such as the double inversion. I didn't want to spend a lot of time optimizing the code at the expense of clarity.

Now the total waiting time for the first 41 observations is 239 days, corresponding to an average waiting time of 5.8 days: quite a bit higher than the prior mean. The posterior distribution for the waiting time is inverse gamma with α* = 175 + 41 = 216 and β* = 547.5 + 239 = 786.5. The mean of this inverse gamma distribution is 786.5 / 215 ≈ 3.66 days expected waiting time per patient. This corresponds to an estimated trial duration of 239 + (350-41) × 3.66 ≈ 1,369 days, or about 3.7 years. The 2.5 and 97.5 percentiles are

> 1/qgamma(c(0.975,0.025),shape=216,rate=786.5)
[1] 3.200529 4.180107
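
The posterior bookkeeping is simple enough to check by hand, or in a few lines of Python (the numbers are the ones from the text):

```python
# Prior: inverse gamma with alpha = 175, beta = 547.5
alpha, beta = 175, 547.5

# Data: first 41 waiting times totaling 239 days
m, T = 41, 239

# Conjugate update: just add the count and the total waiting time
alpha_star = alpha + m   # 216
beta_star = beta + T     # 786.5

# Posterior mean waiting time, and the implied trial duration
mean_wait = beta_star / (alpha_star - 1)
duration = T + (350 - m) * mean_wait
print(round(mean_wait, 2), round(duration))
```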

The estimated duration of the trial, using the posterior distribution, is

> theta <- 1/rgamma(n=1000,shape=216,rate=786.5)
> lambda <- 1/theta
> trial.duration <- 239 + rgamma(n=1000,shape=309,rate=lambda)
> quantile(trial.duration,probs=c(0.025,0.975))
    2.5%    97.5%
1188.516 1580.067

which corresponds to 3.3 to 4.3 years.

Now, how would you tackle this problem in BUGS? You don't really need BUGS because there is an analytic solution, and BUGS uses a Markov Chain Monte Carlo approximation. I need to run this model in BUGS, though, because I'm interested in some generalizations that do not have analytic solutions.

Here's the BUGS model

model {
  for (i in 1:m) {
    t[i] ~ dexp(lambda)
  }
  theta <- 1/lambda
  alpha <- n*P
  beta  <- T*P
  lambda ~ dgamma(alpha,beta)
  for (i in 1:m) {
    tnew[i] <- t[i]
  }
  for (i in (m+1):n) {
    tnew[i] ~ dexp(lambda)
  }
  t.days <- sum(tnew[])
  t.years <- t.days/365
}

Note a couple of things. First, I kept the loop and the individual exponentials because I eventually want to model a more complex situation where the exponential rate starts out low and gradually reaches a plateau. Second, note that I define the exponential distribution in terms of the rate, lambda, which is the only way to do it in BUGS, but I also compute a parameter theta representing the mean waiting time,

theta <- 1/lambda

I want a prior distribution for theta, using the inverse gamma distribution, but this is the same as placing a gamma prior distribution on lambda. I use a value P=0.5 in the prior distribution parameters, which gives the prior roughly the same weight as half of the total final sample size.

The mean of the inverse gamma is beta/(alpha-1), which in this example is about 3.1. The mean of the gamma is alpha/beta, which is about 0.32. An average waiting time of about 3 days is equivalent to a rate of about one-third of a patient per day.
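
Here is that prior arithmetic spelled out (I take T to be the 1,095-day target duration, which is the assumption that makes T*P come out to the 547.5 used above):

```python
# Data constants fed to the BUGS model
n = 350     # total patients to recruit
T = 1095    # assumed target trial duration in days (3 years)
P = 0.5     # prior weight: half the final sample size

alpha = n * P   # 175
beta = T * P    # 547.5

# Prior mean of theta (inverse gamma) and of lambda (gamma)
prior_mean_theta = beta / (alpha - 1)  # about 3.1 days per patient
prior_mean_lambda = alpha / beta       # about 0.32 patients per day
print(prior_mean_theta, prior_mean_lambda)
```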

The two remaining loops calculate an expected duration of the trial, using the existing waiting times for the first 41 patients and simulating patients 42 through 350 from the posterior distribution.

When you run the model, the statistics on the posterior distribution for theta (based on 4,000 samples after a 1,000 sample burn-in period) are

      mean  sd    MC_error val2.5pc median val97.5pc start sample
theta 3.655 0.249 0.003533 3.199    3.646  4.165     1000  4000

The posterior mean of the waiting time is approximately 3.7 days. The 2.5 and 97.5 percentiles are 3.2 and 4.2 days. This matches up almost perfectly with the analytic solution computed earlier in R.

The statistics on t.years are

        mean  sd     MC_error val2.5pc median val97.5pc start sample
t.years 3.746 0.2697 0.003919 3.259    3.729  4.302     1000  4000

The expected duration of the trial is 3.7 years, and there is a 95% probability that the duration will be between 3.3 and 4.3 years. This is close to the results in R.

What are the basic lessons I learned?

  1. You need to understand the difference between a scale parameter for a distribution and a rate parameter. In general, the rate parameters show up in the numerators and the scale parameters show up in the denominators.
  2. Try running a few samples and see if they cluster around what you think is the mean. If the results are way off, maybe you are getting the rate and scale parameters confused.
  3. Alternate forms of the probability distributions will lead to the same results if you are careful and consistent.
  4. You can't run a complex Bayesian model until you are comfortable with what all the parameters of the probability distributions really mean. Building a model based on guesswork cannot replace a careful review of the underlying formulas.

Now there's an exact closed form solution for the expected duration of a clinical trial, which I derive elsewhere.