Weblog entries from the 2004 JSM (August 7-11, 2004)


I started this page as an experiment. I wanted to write down what I learned as I attended the various talks at the Joint Statistical Meetings. I never had time to polish these pages, though, so I decided it would be best to remove this page. Sorry!

(Note: these entries should probably be on individual pages and have individual categories.)

I just returned from Toronto and the Joint Statistical Meetings (JSM). I brought along my laptop and tried to take notes during some of the talks and classes. I will clean up these notes and turn them into entries on my weblog or other web pages. One change which I have already incorporated is an update to my bootstrap page.

Correlated data in S-plus (August 11, 2004)

Jose Pinheiro and Ed Chao taught a two-hour class introducing a new library, S+CorrelatedData, that allows for correlated observations and a flexible range of distributions for your outcome variables.

Jose Pinheiro started with an overview of the Generalized Linear Model (GLM). The GLM is an extension of the linear regression model that allows for non-normal outcomes, especially binomial and Poisson distributions. It also allows you to model a relationship between the mean level and the spread of the data. In many problems, the spread of the data changes systematically as the mean changes. For example, the Poisson distribution, which typically represents a good starting distribution for data involving counts, has a simple relationship between the average value and the spread. The GLM incorporates the distribution and the mean-spread relationship into a single coherent model.

Fitting a GLM is an iterative process with working values for the regression coefficients and working weights. You use the weights to update and improve the regression coefficients, and then use the updated regression coefficients to improve the estimates of the weights.
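The iterative scheme described here is essentially iteratively reweighted least squares. Below is a minimal sketch of that loop for a Poisson model with a log link; the simulated data and convergence tolerance are mine, not from the class, and in practice you would simply call glm.

set.seed(1)
x <- runif(100)
y <- rpois(100, lambda = exp(1 + 2 * x))
X <- cbind(1, x)                    # design matrix
beta <- c(0, 0)                     # working regression coefficients
for (i in 1:25) {
  eta <- X %*% beta                 # linear predictor
  mu  <- exp(eta)                   # fitted means (inverse of the log link)
  w   <- as.vector(mu)              # working weights (Poisson: variance = mean)
  z   <- eta + (y - mu) / mu        # working response
  beta.new <- solve(t(X) %*% (w * X), t(X) %*% (w * z))
  if (max(abs(beta.new - beta)) < 1e-8) break
  beta <- beta.new
}
beta   # should agree with coef(glm(y ~ x, family = poisson))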

The glm function has the general format

glm(formula, family, data)

For example

glm(formula=resp~trt,family=binomial,data=Clinic)

Jose Pinheiro recommends

options(contrasts=c('contr.treatment','contr.poly'))

to improve interpretability of the models. I use this option also because it uses indicator variables for any categorical data in the model.

He explained the use of an offset variable in glm. In a study of seizures, patients gave a count of the number of seizures they had in an eight-week period prior to randomization, and then three consecutive seizure counts after randomization. These later counts, though, are over two-week periods rather than an eight-week period, so the model needs an offset to put the counts on a common scale.
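Here is a minimal sketch of how an offset enters the glm call. The seizure data is simulated and the variable names are my own; the key idea is that the offset log(weeks) puts counts from two-week and eight-week periods on a common per-week scale.

weeks  <- rep(c(8, 2), each = 50)    # length of the observation period
trt    <- rep(0:1, 50)
counts <- rpois(100, lambda = weeks * exp(0.5 - 0.3 * trt))
fit <- glm(counts ~ trt + offset(log(weeks)), family = poisson)
summary(fit)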

The Generalized Linear Mixed Model (GLMM) is an extension that allows for grouping. Grouping might represent multiple observations within clusters, repeated measurements on patients, and so forth. The GLMM allows coefficients to vary with the group, which formally accounts for correlation in the data. The glme function extends the lme function in S-plus because (like glm) it allows for fitting different distributions like Poisson and binomial.

The GLMM model has three components: an outcome variable, factors associated with fixed effects, and factors associated with random effects. The S-plus function for fitting a GLMM is glme(). A typical call for this function is

glme(fixed, data, random, family)

and an example is

glme(fixed = resp ~ trt, data = Clinic, random = ~ 1 | clinic, family = binomial)

The output of glme is an object that has information about the fit.

If you want to try a different method for fitting the GLMM, you can use method='LAPLACE' or method='AGQUAD' in the glme() function.

The glme function allows for multiple levels of clustering. For example, in many education research studies, you sample schools from a district, classrooms from a school, and students from a classroom. This sampling method has two levels of clustering (schools, classrooms) that you can account for in the glme function. You fit this model by including an argument like random = ~ 1 | school/classroom in the glme function.

In one of the models, subject #49 seemed to have a large influence on the findings. To explore this further, you could fit a model excluding this subject: just add the argument subset = id != 49 to the glme function.
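Putting these pieces together, a call using both nested random effects and the subset argument might look like the sketch below. This follows the signatures described in the class; the data set and variable names here are hypothetical.

glme(fixed = pass ~ trt, data = Schools,
     random = ~ 1 | school/classroom,
     family = binomial,
     subset = id != 49)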

Tony Chao talked about Generalized Estimating Equations (GEE) methods. He motivated the GEE model by discussing a simple paired t-test complicated by issues such as correlation, heterogeneity, unequal sample sizes, and non-normality.

The GEE approach specifies an estimating equation, the link, the variance function, the dispersion scalar, and a working correlation matrix. The typical call to the gee function is

gee(formula, cluster, variance, correlation, family, link, data)

The formula, family, link, and data arguments are similar to those in glm. The cluster argument identifies the grouping. The variance argument specifies either a fixed (glm.1) or a variable (glm.scale) estimate of the dispersion parameter. You can even allow for heterogeneity across groups, with a different dispersion for males and females, for example. The correlation argument specifies the working correlation matrix. The options for the working correlation matrix are independent, fixed, exchangeable, autoregressive, continuous autoregressive, stationary, nonstationary, and unstructured.
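As a sketch, a call following this signature might look like the example below. The argument values are my guesses at the library's conventions and are not verified against its documentation.

gee(formula = resp ~ trt, cluster = clinic,
    variance = glm.1, correlation = "exchangeable",
    family = binomial, data = Clinic)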

For advanced approaches use the geeDesign() function with a varDesign and corDesign object, then follow with a gee.fit() function. This approach allows you to specify differing levels of correlation to reflect differing levels in a hierarchical model.

Tony Chao also showed how you can use GEE methods for multinomial data. Work is currently ongoing for both the gee() and the glme() functions in S-plus. You can download the latest version of the S+CorrelatedData library from the Insightful web site.

Data and Safety Monitoring Committees (August 10, 2004)

Janet Wittes (substituting for Lawrence Friedman) talked about the NIH perspective on Data Safety Monitoring Committees.

The DSMB makes recommendations about

The NIH document on DSMBs provides very general guidance. It is important to note that monitoring should be commensurate with risks and represents a continuum from having the principal investigator monitor the trial to having a full DSMB committee.

All phase III trials and earlier trials if

DSMB needs to be

There is uncertainty about who appoints the DSMB and to whom the board reports. The DSMB doesn't review the protocol, except to the extent that the members should not monitor a study that they are not comfortable with.

The DSMB summary report should provide regular feedback to the IRBs, documenting that a review took place and offering recommendations.

The DSMB has to maintain

The DSMB is not the only group responsible for protecting patient safety:

Issues the IRB should consider in its review of participant safety are

To maintain study integrity, the DSMB should:

Tools must be distinguished from goals and when they are in conflict, the goals should supersede the tools.

John Conlon provided an industry perspective. He described two oncology trials where he was the DSMB support statistician. Both studies used O'Brien-Fleming interim analyses.

In the second study there was an unplanned power analysis to see if it would be worth continuing. He supported the earlier viewpoint that the DSMB should view the unblinded data.

The DSMB should have a clear communication plan with the industry sponsor. Publicly traded companies have an obligation to disclose information relating to their stock price, so the DSMB needs to think about whether and how the results of the interim analysis should be disclosed to the public. Even fairly benign things like when a study will end could affect the stock price, so you need to plan ahead for these contingencies.

It is important to protect the conduct of the study during the interim analysis. This means keeping the DSMB separate from the industry sponsor of the study. These people are interested in the results, and they may pester you for information that is not in their best interest to know.

Should the trial subjects be aware of the results? This is an open question, but if the company and stockholders get some information, it seems only fair that the research subjects should get it as well.

Janet Wittes also gave a talk that she herself had prepared, arguing that we need less rigidity in how the DSMB operates. The purpose of the DSMB (in decreasing priority) is

Unfortunately, statisticians tend to have the opposite priorities, focusing on the mathematical rules for early stopping. But the safety data is essentially astatistical.

How the DSMB can ensure the integrity of the study:

Independence does not mean ignorance. It means the person involved has no gain/loss from early stopping and no gain/loss from continuing. This person needs the intellectual ability to remain neutral. This is hard because continued involvement on the DSMB may cloud your judgment.

Even if safety is boring, it is our job as statisticians to make sense of the data. The clinicians might see things that we don't see, so we need to better appreciate the physiology, biology, and so forth. We might see things that they don't see because we are more experienced.

Speak clearly and don't use a big statistical stick. There is no single statistical view. Just because you are the only statistician on the DSMB doesn't mean that your opinion represents the consensus of all statisticians. Try to present competing statistical viewpoints fairly and impartially.

Walt Offen was the discussant for these speakers. He re-emphasized that the DSMB reports have to go to the IRBs. There are times when the sponsor ignores the recommendations of the DSMB and they may do this for very good reasons. But when this happens, the various IRBs reviewing the study should know about this choice because they may decide that it is not in their best interests to let the study continue at their institutions.

He mentioned that for phase II or IV trials, a DSMB is not needed, but an internal committee would be sufficient.

A DSMB does not look at safety in isolation but has to consider it in tandem with efficacy data.

Further reading

Council of Chapters Business Meeting (August 10, 2004)

Here are a few notes from the Council of Chapters (COC) business meeting that I need to share with the other officers.

Check out the COC web site. They have a new webmaster and there are a lot of recent additions.

ASA has noted a large number of chapter dues checks that are not being cashed. Please check with each chapter treasurer to make sure this is being done.

The COC has had a lot of requests for developing a credit card mechanism. This is especially important for short courses, where some members have trouble getting a company check. This could be worked out, but they have to talk to the individual chapters. If there is a need, talk to the COC for details.

The COC has a booth at the JSM. They recommended that each chapter have an application form for joining.

It is not too late to file a chapter report. It is important to note the date of the last meeting, the date of the last elections, and the term of the current chapter officers. The 2004 chapter report is due January 31, 2005, which is much earlier than previous deadlines. This makes it easier for us because the information should still be fresh in our minds.

The COC is encouraging chapters to have a career day. ASA will offer a grant of $500 for a chapter that is putting on a career day. There are details in a recent Amstat News, by email at careergrants at amstat.org and on the web at www.amstat.org/careers/careerday.html.

John Boyer talked about work for judging the International Science and Engineering Fair (ISEF). He encouraged chapters to get involved with judging local and regional science fairs. If you are interested but have no experience, contact John Boyer and he will put you in touch with some people who have experience in this area.

Mu Sigma Rho is an honorary society for students in Statistics. It sponsors an educational award and the "College Bowl" competition. There are only 20 chapters, and many students at locations without a chapter could not join. The COC wants to see if ASA chapters can serve as affiliated chapters for students at locations where there is not already a Mu Sigma Rho chapter. This is an empowerment, not an obligation, and does not entail any financial commitment.

Our own Ron Wasserstein was elected to the Council of Chapters as the chair-elect for 2005.

Individual chapters can request a visit from the ASA president. This is at no cost.

The traveling course is undergoing some changes in format, and the COC asked for suggestions. I mentioned that making the locations geographically close might be a mistake, because we drew some attendees at our short course who might otherwise have attended at Columbia. There is a lot of interest among chapters in the traveling course, so they will try to meet this demand. We probably won't be competitive for this because of our recent visit from Charles Davis, but the details on the traveling course will be available in November.

The romance of hidden components (August 9, 2004)

David Donoho gave the president's invited address on hidden components. Hidden components are a powerful set of ideas, originated by statisticians. They are having a big impact, but mostly outside of statistics. Hidden components are like the black sheep of Statistics.

Much of the work with hidden components was developed by people interested in evolution and intelligence. Stephen Jay Gould criticized many of these approaches, and it is important to keep this criticism in mind.

The tendency has always been strong to believe that whatever received a name must be an entity or being, having an independent existence of its own. - John Stuart Mill

There are analogs in the physical world similar to statistical hidden components. Newton noticed how light could be decomposed into the colors of a rainbow by a prism and that these components could not be further broken down by a prism. These colors represented a hidden component of white light.

Fourier developed a different physical manifestation of hidden components. Fourier analysis is a kind of prism that takes a signal and decomposes it into sine waves of various frequencies.

Einstein derived the concept of Brownian motion, and Wiener(?) showed that Brownian motion can be decomposed into sine components of various frequencies where the weights for these frequencies are themselves random.

There are a lot of new ideas being developed in this area. Three interesting ideas that extend the concepts of hidden components are:

A Science article published in 2000 discusses Isomap, a procedure for reducing high dimensional data into a small number of dimensions using a nonlinear smooth transformation.

Causal inference (August 9, 2004)

Several talks at the Joint Statistical Meetings discussed extensions and applications of the Rubin Causal Model.

He used the Rubin Causal Model, where causal effects are effectively a missing data problem. This is a very attractive approach because we statisticians know a lot about missing data problems. When you are comparing two treatments, subjects typically can receive only one of the two treatments. They are missing data on the potential outcome that would have occurred if they got the treatment that they did not receive.

If we don't know what the outcome under drug A would be for half of our patients, then the best possible guess (assuming that the values are missing at random) is the average outcome for those patients who DID receive drug A. Similarly, if we don't know what the outcome under drug B would be for the other half of our patients, then the best possible guess is the average outcome for those patients who DID receive drug B. Let's make those substitutions and see what happens.
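Here is a minimal numeric sketch of that argument in R. The data are simulated, and the potential outcomes are fully known only because this is a simulation.

set.seed(42)
n  <- 100
yA <- rnorm(n, mean = 10)              # potential outcome under drug A
yB <- rnorm(n, mean = 12)              # potential outcome under drug B
z  <- sample(rep(c("A", "B"), n / 2))  # randomized treatment assignment
# Impute each missing potential outcome with the observed group mean.
impA <- ifelse(z == "A", yA, mean(yA[z == "A"]))
impB <- ifelse(z == "B", yB, mean(yB[z == "B"]))
mean(impB - impA)                          # imputation-based estimate
mean(yB[z == "B"]) - mean(yA[z == "A"])    # simple difference in averages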

Notice that the overall effect is simply the differences in averages. This is a rather obvious finding, but it shows how we can apply the theory of imputing missing data to verify that the treatment effect observed in a randomized trial is indeed a valid estimate.

In an observational study, we can't rely on the data being missing at random. Treatment assignment is often related to one or more covariates. But if we observe all the possible covariates, and we assume strong ignorability (after controlling for all the covariates, the treatment assignment is independent of the outcome) then we can impute the missing potential outcomes using standard statistical tools.

The use of the Rubin model is very powerful, because we statisticians know a lot about when and how we can impute missing values. Imputation is a well studied area, so the extension allows us to rigorously study models that can show cause and effect relationships.

Roderick Little talked about inference in hybrid intervention trials involving treatment choice. His motivating example is the Women Take Pride Intervention Study. Randomization is not always ethically feasible. Sometimes inferences are limited to subjects willing to be randomized and may exclude subjects with strong preferences. A behavioral treatment may be more successful if subjects are allowed to choose.

In general, we don't allow people to choose their treatments.

What is the treatment effect for people who prefer treatment A, and is it different for those who prefer treatment B?

The study used females aged 60 or older with cardiac disease, with two interventions: a group format and a self-directed home format. The outcomes included physical functioning, psychosocial functioning, symptoms, health-care utilization, and compliance. There were two levels of randomization: women were first randomized to a random arm or a choice arm. The random arm was then randomized to group intervention, home intervention, or usual care. The choice arm allowed women to choose group or home intervention.

The choice arm better reflects the real world setting; an educational intervention can't be blinded anyway, which makes the randomization a bit less effective; and women who get to choose may be more enthusiastic and have better compliance.

This design allows you to compare compliance among those who prefer a particular treatment and are randomized to the opposite. The patients in the randomization arm who get group treatment are a mixture of those who prefer group treatment and those who prefer self-directed home treatment. We can disentangle these groups if we assume that the proportion who would prefer group treatment is the same as in the choice arm (a very reasonable assumption since we randomized who got a choice) and if

How does this compare to a Zelen design? When you randomize into the choice arm and the randomization arm, when do you get informed consent?

Michael Hudgens discussed vaccine effects on binary post-infection outcomes. The slides are on his web site. He used the nomenclature of Halloran et al 1997.

Two examples.

Principal Stratification and SUTVA (noninterference).

Miguel Hernan discussed postmenopausal hormone therapy and coronary heart disease. This is a reanalysis of the Nurses' Health Study (NHS), a case study comparing randomized and observational trials. The NHS found a lower risk of CHD in users of postmenopausal hormone therapy. The Women's Health Initiative (WHI), a randomized trial, found a greater risk of CHD in users.

He is essentially treating the NHS as a single randomized trial in which patients are enrolled multiple times.

Hormones have a beneficial effect if treatment is initiated shortly after menopause, and a harmful effect if treatment is initiated when women are at a more advanced stage of atherosclerosis. WHI participants were older.

Limitations of the NHS: lack of comparability between initiators and non-initiators. Do women who initiate have healthier lifestyles? Do women who continue therapy tend to be more health-conscious than others? There was also uncertainty about the time of therapy initiation. Data were collected every two years, so the time of initiation within each interval is unknown.

Miguel Hernan assumed that there are no unmeasured confounders in the NHS. Under these assumptions, you can conceptualize the study as a series of randomized trials with unknown randomization probabilities.

The eligibility criteria between the two trials are similar, but the average age is quite different (50 vs 38). Consider the randomization date for the NHS to be the time of return of the 1984 questionnaire. Then do the same for women who returned a questionnaire in 1986, 1988, etc., and then pool the results. These women can participate in a maximum of 8 trials.

The analyses indicate that the protective effect exists only for women who initiate hormone therapy early and/or when they are relatively young.

Unmeasured confounding variables are not large enough.

can be explained by shorter average time from menopause to therapy initiation.

He argues that the big difference is that randomized trials are analyzed with ITT and observational studies are not. You also can't compare two trials analyzed under ITT if the noncompliance rates differ markedly.

Thomas Richardson, the discussant, pointed out that group therapy might violate the SUTVA assumption.

Robins and Wasserman 1997 is a possible reference to look at.

Daniel McCaffrey gave a talk the next day on causal inference. He also used the Rubin model with an assumption of strong ignorability. This assumption is that treatment assignment and the outcome variables are independent after adjusting for a set of known covariates. You then compute propensity scores and estimate weights to produce a weighted mean effect. It is a challenge to estimate these propensity scores, so this speaker used a generalized boosted model (GBM). GBM inherits the nice features of regression trees: it is computationally fast and can incorporate continuous, nominal, ordinal, and missing covariates. It is an iterative approach, so part of the "art" of using it is knowing when to stop. If you stop too early, you get a poor fit to the observed data and you do little to reduce bias and variance. If you stop too late, you will overfit and the weights will be too variable.
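Here is a minimal sketch of the propensity score weighting idea. The speaker used a generalized boosted model to estimate the propensity scores; to keep the example self-contained I substitute plain logistic regression, and the data are simulated.

set.seed(7)
n  <- 500
x  <- rnorm(n)
z  <- rbinom(n, 1, plogis(0.8 * x))         # treatment assignment depends on x
y  <- 1 + 0.5 * z + x + rnorm(n)            # true treatment effect is 0.5
ps <- fitted(glm(z ~ x, family = binomial)) # estimated propensity scores
w  <- ifelse(z == 1, 1 / ps, 1 / (1 - ps))  # inverse probability weights
# The weighted difference in means estimates the average treatment effect.
weighted.mean(y[z == 1], w[z == 1]) - weighted.mean(y[z == 0], w[z == 0])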

Jennifer Hill also talked about causal models and gave a nice overview. Ignorability is assuming that Y(1), Y(0) are independent of Z given X. Rosenbaum and Rubin (1983) is the first reference to propensity scores. You want to weight and narrow the sample to a point where the two groups are similar with respect to the covariates. Bayesian Additive Regression Trees (BART) is also an approach based on regression trees. The LaLonde (1986) data is a classic data set for testing causal models. Most of the methods at the time did not do well. A propensity score approach (Dehejia and Wahba 1999) did perform well. A reference (Hill and McCulloch 2003) investigates the use of classification trees for this problem.

Samantha Cook talked about ITT analysis when a drug becomes open label. The problem is that when a drug gets approved, people on the placebo arm may go out and get the drug. The actual drug is Fabrazyme, which treats Fabry disease, a genetic disease that causes creatinine buildup in the blood. The actual trial had 1/3 assigned to placebo and 2/3 assigned to the drug. The primary outcome is time to a clinical event (death, dialysis, stroke, or a large increase in serum creatinine). For the patients on placebo who drop out, they want to impute the missing values. The drug company has historical data on untreated patients. This historical data was pared down to include patients who looked like they would have qualified for the trial if it had been available at the time.

They use a change point model because creatinine levels stay relatively constant until a certain event occurs and then serum creatinine increases until dialysis starts. The rate of increase is variable from one person to another.

Siddhartha Chib also talked about causal inference without the joint distribution of counterfactuals. This is for the case where the ignorability assumption does not hold. He assumes the existence of an instrumental variable.

Proteomics (August 8, 2004)

Francoise Seiller-Moiseiwitsch and Anindya Roy provided two introductory lectures about proteomics.

Francoise Seiller-Moiseiwitsch reviewed the derivation of the word "proteome," which represents PROTEins expressed by a genOME. This is not a fixed feature of an organism but depends on developmental stage, tissue, and environmental conditions. There are more proteins than genes because different parts of a gene can create different proteins.

These talks focused mostly on 2D PAGE (PolyAcrylamide Gel Electrophoresis) but also discussed mass spectrometry.

A 2D gel is like a messy microarray where you have to find where the protein lies in the gel. The 2D gel separates proteins in two independent dimensions. The first dimension is isoelectric focusing, where each protein moves to the point where its charge is neutral. The proteins are then coated with SDS, which masks their intrinsic charge. Then the proteins are separated in a second dimension by their mass.

Francoise Seiller-Moiseiwitsch discussed an important cautionary tale about proteomics gone bad. Petricoin et al used a genetic algorithm for clustering samples and found near-perfect results. Keith Baggerly and others have documented problems with the Petricoin analysis.

MS has instrument drift, and there is poor reproducibility within a lab (humidity, temperature, sample storage and preparation, and chip quality). There is no information about the sensitivity of this approach.


In MALDI-TOF, a sample is mixed with an energy-absorbing matrix fixed on a metal plate and placed in a vacuum chamber. A laser excites these materials, which vaporizes and ionizes the sample; the time of flight of the ions to a detector then determines their mass-to-charge ratio.

The first important issue with 2D gels is noise reduction. These 2D gels have smears, streaks, and dust specks. There is a package, MELANIE, which performs spot detection. It uses the second derivatives and saturation levels to decide what is a spot. The thresholds for the second derivatives are highly sensitive to cutoff values, and selecting these is more of an art than a science.

Spot matching is also tricky because there is some warping of the image from one gel to another. A linear, quadratic, or cubic polynomial transformation of the x and y coordinates is matched between one gel and the other using some "landmark" spots.

Analysis of MS data involves baseline removal (robust polynomial fitting or wavelets), normalization, peak detection, and peak alignment. Normalization takes the intensity minus the minimum intensity and divides by the full range of the data (maximum minus minimum intensity).
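The normalization step described above is simple enough to write down directly; this one-line function is my own paraphrase of it.

normalize <- function(intensity)
  (intensity - min(intensity)) / (max(intensity) - min(intensity))

normalize(c(120, 350, 900))   # rescales intensities to the 0-1 range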

Smoothing the 2D gels uses two-dimensional wavelets. In two dimensions, the wavelets split the image into four pieces: an approximation piece plus horizontal, vertical, and diagonal detail pieces.

Spots, spikes, and streaks have different gradient properties (different first and second derivatives) and you use these to remove the spikes and streaks while keeping the spots. Some biologists might argue that the streaks have biological meaning, so you might want to remove only the spikes.

Anindya Roy continued the discussion of noise removal. Streaks and spots look very similar from a wavelet perspective. Instead, you have to look at the gradients (derivative). The change in intensity will rise at the edge of a spot, drop off as

The watershed algorithm (Vincent and Soille, 1991) is used to break the image into segments representing individual spots. The attributes of these segments are

Anindya Roy also talked about alignment.

Differential expression is an area of active research. Dowsey has a nice 2003 review paper in Proteomics. This work draws on a lot of the same tools as microarray differential expression, such as the false discovery rate.

Microarray models (August 8, 2004)

I walked into the middle of Elizabeth Hauser's talk about association studies of Coronary Artery Disease (CAD) and microarray expression, looking at expression levels in tissue from the aorta. She is trying to relate SNPs (single nucleotide polymorphisms) to disease and to expression levels. The combination should lead to more rapid identification of genes associated with CAD.

Kim-Anh Do talked about recent developments for massive multiple comparisons and clustering of microarray gene expression data.

Current methods for differential expression are just variants of Student's t-test. SAM (Tusher 2002) and mixture models (EMMIX-GENE) are promising alternatives.

The gene chip that they used had 6500 genes and they examined the 2000 genes that had the highest expression levels. Her approach is a mixed model (mixture model?) that extends Efron's 2003 results.

The False Discovery Rate (FDR) is an approach to control for the large number of genes that you are testing simultaneously. Benjamini-Hochberg (1995) and Storey (2003) provide a frequentist approach to the FDR, while Genovese and Wasserman (2002, 2003) provide a Bayesian computation and interpretation of the FDR.
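Here is a minimal sketch of the Benjamini-Hochberg procedure on simulated p-values; in R, p.adjust implements the 1995 method directly.

set.seed(3)
p <- c(runif(1900), rbeta(100, 1, 50))   # mostly null genes, a few signals
q <- p.adjust(p, method = "BH")          # BH-adjusted p-values
sum(q < 0.05)                            # genes declared significant at FDR 5%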

Measuring the probability of joint expression is very easy using the posterior distribution.

http://odin.mdacc.tmc.edu/~pm/DMT02.pdf gives technical details on this model. http://odin.mdacc.tmc.edu/~kim/bayesmix includes software for this approach.

All of the clustering approaches boil down to either unsupervised (class discovery) methods or supervised (class prediction) methods.

Gene Shaving (Hastie et al 2000) and mixture model based clustering (McLachlan) represent two good approaches for defining clusters. Gene shaving tries to find clusters of genes that are similar to each other and show large variance across samples. An implementation of gene shaving is available at her web site, as well as an S implementation at statlib.

Kim-Anh Do also gave a quick look at GeoScene, a nice networking tool for documenting network relationships among genes.

Recommended books:

I have a couple of books by Terry Speed and by Amaratunga and Cabrera that are pretty good. I like the second book a bit better.

Lisa Ying talked about using microarrays to detect alternative splicing events. She talked about NM_000229 (LCAT). This gene has six exons, but sometimes a seventh exon is inserted. Johnson et al 2000 estimate that more than 70% of all genes are alternatively spliced. Lisa Ying used a tiled microarray that includes overlapping segments of DNA sequences across several introns and exons. One of her references was Johnson (2003) Science 302: 2141.

Graphical models in R (August 8, 2004)

A graph is a set of nodes and a set of edges. Usually the nodes are random variables and the edges represent dependencies among these variables. There is a lot of research into the statistical analysis of graphs, but the software to run these models is limited. A research group is working on the gR project, an effort to develop graphical models using the R programming language. This is a friendly anarchy of developers that sets general standards and encourages others to collaborate. They provide a library of efficient algorithms and a simple graphical user interface.

There are three components: gRbase, dynamicGraph (which uses Tcl/Tk), and gRaph. There are other packages that have been available in R (ggm, deal, and mimR). There are also other programs (TETRAD, BUGS, and CoCo) that represent stand-alone packages, and the authors of these systems are working to link them to R or to make their code run under R.

The gRbase software has a gmData class that represents the data and the metadata for a graph. The gModel class contains information on the generators of the model (model formula) and metadata. A hierarchical log-linear model (hllm) is an example of a model that inherits from the gModel class. The loglm model in Venables and Ripley represents one way that you can analyze such a model.

Giovanni Marchetti then talked about ggm, which provides symbolic definitions of directed acyclic graphs (DAGs), undirected graphs, and ancestral graphs. It allows maximum likelihood fitting of Gaussian graphical Markov models.

The DAG function (e.g., DAG(Y~X+U, X~U+Z)) represents a directed acyclic graph. The UG function (e.g., UG(~A*B*C+C*E+D*E)) represents an undirected graph, and the BG function (e.g., BG(~U*Z)) represents a bidirected graph.
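A minimal sketch of how this looks in practice, assuming the ggm package is installed from CRAN; the variable names are arbitrary.

library(ggm)
d <- DAG(Y ~ X + U, X ~ U + Z)   # returns the adjacency matrix of the DAG
d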

Bootstrap Methods and Permutation tests (August 8, 2004)

The bootstrap provides you with a way to estimate the sampling error in a statistic. It is very useful when you don't want to use standard formulas that give an approximation to the sampling error or when there are no other ways to estimate the sampling error.

Consider a company that offers repairs to two classes of customers. In one class of customers there are 1664 repairs with a mean of 8.4 hours. In the other group, there are 23 repairs with an average of 16.5 hours. The repair times are skewed, with several outliers and a tendency for values to cluster near multiples of 24 hours.

You might not trust the simple t-test for this data because of these unusual patterns in the distributions of repair times.

To compute a bootstrap, you would draw a thousand resamples (with replacement, each the same size as the original sample) from the 1,664 repair times in the first group, compute a mean for each resample, and then evaluate those thousand bootstrap means. Then from the sample of 23 repairs in the second group, you would draw a thousand resamples and compute and analyze those thousand sample means.

A very simple confidence interval is the original statistic plus or minus a multiple of the standard error estimated from the bootstrap distribution. You could also compute the 2.5 and 97.5 percentiles of the bootstrap distribution.
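Here is a minimal sketch of both intervals for the difference in mean repair times. The repair data is simulated to be skewed like the example; the real data from the class is not reproduced here.

set.seed(123)
g1 <- rgamma(1664, shape = 2, scale = 4.2)   # stand-in for the 1,664 repairs
g2 <- rgamma(23,   shape = 2, scale = 8.25)  # stand-in for the 23 repairs
boot.diff <- replicate(1000,
  mean(sample(g2, replace = TRUE)) - mean(sample(g1, replace = TRUE)))
# Simple interval: statistic plus or minus two bootstrap standard errors.
(mean(g2) - mean(g1)) + c(-2, 2) * sd(boot.diff)
# Percentile interval: the 2.5 and 97.5 percentiles.
quantile(boot.diff, c(0.025, 0.975))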

If the sample in the original experiment has special properties, then you would want your bootstrap procedure to reflect them. For example, if you have a stratified sample, the bootstrap should take repeated stratified resamples from the data. If your data are drawn from a finite population, create a pseudo-population by repeating the sample enough times to equal the population size, and then draw each bootstrap sample without replacement from that pseudo-population.

The estimate of sampling error from a bootstrap is a bit too small, especially when your sample sizes are small. There are several remedies that might help.

Regression is tricky. If you are studying a model with two independent variables, you could resample triads of data (the dependent variable and the corresponding two independent variables). This treats the whole problem as a random effects model. In some problems you could end up with a few bootstrap samples where you have a perfect correlation among your variables.
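Here is a minimal sketch of that case-resampling approach, with simulated data. Each bootstrap sample resamples whole rows, so each dependent value stays with its two independent values.

set.seed(99)
d <- data.frame(x1 = rnorm(50), x2 = rnorm(50))
d$y <- 1 + 2 * d$x1 - d$x2 + rnorm(50)
boot.slope <- replicate(1000, {
  rows <- sample(nrow(d), replace = TRUE)
  coef(lm(y ~ x1 + x2, data = d[rows, ]))["x1"]
})
sd(boot.slope)   # bootstrap standard error for the x1 coefficient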

Time series data are even trickier. You could fit a model, compute residuals, and then bootstrap the residuals to create a new time series. You could also resample blocks of observations. This retains some of the short-term correlation that is inherent in the time series, but not any long-term correlation (or seasonal variation). You also lose the short-term correlation at the block boundaries.
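Here is a minimal sketch of a non-overlapping block bootstrap, using simulated AR(1) data and an arbitrary block length of 10.

set.seed(5)
y <- arima.sim(list(ar = 0.6), n = 200)
block.len <- 10
starts <- seq(1, 200, by = block.len)    # starting point of each block
boot.mean <- replicate(1000, {
  pick <- sample(starts, length(starts), replace = TRUE)
  yb <- unlist(lapply(pick, function(s) y[s:(s + block.len - 1)]))
  mean(yb)
})
sd(boot.mean)   # standard error that respects the short-term correlation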

There are two major sources of variation in a bootstrap distribution: your original sample is chosen randomly from a population, and the bootstrap is a random simulation from that sample. The latter source of variation is not a serious issue. With a thousand bootstrap samples, the simulation variation is some constant divided by the square root of a thousand. In most situations, this will be pretty small. Certain procedures for computing confidence intervals will be a bit worse, and for them you might want to compute five or ten thousand bootstrap samples.

Methods for computing bootstrap confidence intervals include

Edgeworth expansions show how these procedures work and why the latter two methods are more efficient.

As a general rule, skewness in the sampling distribution decreases in proportion to one over the square root of the sample size.

Resources

Bayesian approaches to clinical trials and health care evaluation (August 7, 2004)

My first day at the Joint Statistical Meetings was in a class on Bayesian statistics taught by David Spiegelhalter. This course is based on the text

The book has an accompanying website (http://www.mrc-bsu.cam.ac.uk/bayeseval/welcome.html). All the examples he discusses are on the web page.

This class sought to give me an understanding of the potential role of Bayesian methods in health-care evaluation in the public sector and the pharmaceutical industry.

Bayesian statistics is based on a posthumous publication in 1763 by Thomas Bayes that demonstrated a fundamental finding in probability which is now known as Bayes theorem. A Bayesian analysis asks the question: how does this current data change our opinion about the treatment effect?

This analysis explicitly states

A Bayesian statistical analysis combines the prior distribution with the likelihood to draw a conclusion.
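Here is a minimal sketch of that combination in the simplest conjugate case: a beta prior with binomial data. The numbers are invented. With a Beta(a, b) prior and s successes in n trials, the posterior is Beta(a + s, b + n - s).

a <- 3; b <- 7     # prior: success rate around 30%
s <- 12; n <- 20   # current data: 12 successes in 20 patients
curve(dbeta(x, a + s, b + n - s), 0, 1,
      xlab = "success rate", ylab = "density")  # posterior
curve(dbeta(x, a, b), add = TRUE, lty = 2)      # prior, for comparison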

The specification of a prior distribution is the controversial aspect of Bayesian analysis, because it introduces a level of subjectivity to the analysis. Different people might specify different distributions, and they might therefore derive different conclusions from the same data. The counter-argument is that everyone brings subjectivity into their interpretation already; Bayesian analysis just makes this process explicit.

David Spiegelhalter offers three guidelines for the use of prior information. This information must be

One philosophical viewpoint is that all practical probabilities are subjective; for example, the probability of aliens openly visiting earth in the next 10 years. They express your relationship to the event, and different people have different information about this probability. People who worked at Area 51, for example, might have a different probability. You should make sure that these probabilities are coherent, and you can then make rational choices based on your view of the world.

You should be cautious about traditional hypothesis testing. He offered an amusing example. Suppose your drug company is testing 200 drugs in 200 different trials. Each trial is tested at an alpha level of 5% and power of 80%. Suppose only 10% of the drugs work; that means there are 20 (10% of 200) drugs that are effective. Let's do the math. Since we have 80% power, 16 of the 20 effective drugs will produce a significant result. But 5% of the 180 ineffective drugs, or 9, will also produce a significant result. So 9 of the 25 significant results, more than a third, are false positives.

Likelihood ratios

There is a useful teaching program, First Bayes, that illustrates some simple Bayesian analyses. An interesting point he made with this software is that you might be interested in mixing an optimistic and a pessimistic prior distribution together.

You need to resolve any glaring inconsistencies between the data and the prior. There is a clever quote:

A Bayesian is one who, vaguely expecting a horse and catching a glimpse of a donkey, strongly concludes that the animal is a mule (Senn 1997).

This is something that had always bothered me about Bayesian analysis. When the prior distribution and the data itself are in sharp conflict, it really makes no sense to combine them to get a posterior distribution. David Spiegelhalter recommends that you conduct a formal test of hypothesis comparing the prior distribution and the likelihood.

Laplace's Law of Succession provides the probability that the sun will rise tomorrow.

A very rough rule of thumb for the log odds ratio: Var(log odds ratio) ≈ 4/m, where m is the number of events.

Pocock and Spiegelhalter (1992) BMJ 305: 1015 shows a useful elicitation of the prior distribution.

Random effects models give perhaps the best justification for Bayesian analysis. In a multi-center clinical trial

Higgins and Spiegelhalter (2002) IJE 31: 96-104.

Markov Chain Monte Carlo (MCMC) produces useful Bayesian analyses for complex situations. A program called WinBUGS will use MCMC to conduct a Bayesian analysis.

www.mrc-bsu.cam.ac.uk/bugs/winbugs/contents.shtml

It is generally unrealistic to hope for large treatment effects, but it might be reasonable to hope that a new treatment for acute stroke or acute myocardial infarction could reduce recurrent stroke or death rates in hospital from 10% to 9% or 8%, but not to hope that it could halve in-hospital mortality. (Peto and Baigent 1998)

[There was an excellent slide on known biases when eliciting prior beliefs. Find those details and insert them here.]

Parmar (2001) Lancet 358: 375-81 gives a good example of eliciting prior beliefs from clinicians. Unfortunately, these clinicians tend to be overly optimistic.

GUSTO is a study of streptokinase versus TPA.

Pocock and White (1999) state that randomization is unethical

Kass and Greenhouse (1989) say that it is ethical to randomize as long as a cautious but reasonable skeptic would not yet be convinced.

Fayers et al (2000) looks at the conflict between the physicians' prior beliefs and the power calculations. The power was estimated at an effect size that most physicians thought was unreasonably high. You can estimate power across the range of values that physicians thought was plausible and weight according to the prior probabilities. This resulted in an average power of only 30%, which showed that the design of the study was overly optimistic.

If there is uncertainty about the input into a power calculation, treat the uncertainties as prior distributions and get a posterior estimate of power.
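Here is a minimal sketch of that idea (sometimes called assurance or expected power): draw effect sizes from the prior, compute the power at each draw, and average. The prior parameters and sample size are invented.

set.seed(11)
delta <- rnorm(2000, mean = 0.3, sd = 0.15)  # prior draws of the effect size
pow <- sapply(delta, function(d)
  power.t.test(n = 50, delta = abs(d), sd = 1)$power)  # two-sided power depends only on |d|
mean(pow)   # expected power, averaged over the prior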

Bayesian analysis also provides some guidance about early stopping of clinical trials. You shouldn't stop the study at the first moment that the data crosses a certain p-value threshold. Instead, combine the data with a skeptical prior. When the posterior distribution combining the data with a skeptical prior crosses a threshold, then you have enough evidence to stop the study.
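Here is a minimal sketch of that calculation with normal distributions on, say, a log hazard ratio scale, where negative values favor the treatment. All the numbers are invented.

prior.mean <- 0;     prior.var <- 0.04   # skeptical prior, centered on no effect
data.mean  <- -0.35; data.var  <- 0.02   # estimate from the interim data
post.var  <- 1 / (1 / prior.var + 1 / data.var)
post.mean <- post.var * (prior.mean / prior.var + data.mean / data.var)
pnorm(0, mean = post.mean, sd = sqrt(post.var))  # P(treatment is beneficial)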

To formalize a futility analysis, adopt an optimistic prior and stop the study when the posterior distribution with an optimistic prior would show convincing evidence of no effect.

An alternate approach for futility analysis is to ask the following question: given the data so far, what's the chance that you will see a statistically significant result at the end of the study? If the estimated probability drops too low, stop the study and spend your resources elsewhere.

What, exactly, is a skeptical prior? An optimistic prior? A skeptical prior is one

Ellenberg (2002) Data Monitoring Committees in Clinical Trials: a Practical Perspective.

ASTIN: an adaptive dose-response study of UK-279,276 in acute ischaemic stroke (Krams (2003) Stroke 34: 2543-8).

There are also Bayesian approaches to modelling biases in research.

Biases to internal validity

Biases due to external validity

Bayesian analysis also provides a natural approach for institutional comparisons. Marshall and Spiegelhalter (1998) BMJ 317: 1701-4 compare the rankings of institutions providing In Vitro Fertilization. Even though there is substantial variation in the success rates, the uncertainty of the rankings is so large as to make it difficult to distinguish between any but the very best and the very worst institutions.

The greatest strength of Bayesian analysis is that it allows you to synthesize evidence in a variety of different contexts. Bayesian meta-analysis, for example, will allow for

Bayesian analysis allows for a formal derivation of quality weights. You would input some prior evidence about the additional uncertainty and additional biases that certain types of studies (e.g., unblinded studies) might have, and then the Bayesian analysis produces an estimate of how much these types of studies should be downweighted.

Another example of evidence synthesis is the common problem of comparing two drugs that are tested extensively, but not against each other. You can model study effects using a Bayesian approach and draw an inference about the estimated effects of the two drugs. This estimate would directly incorporate the uncertainty of the indirect comparison because the Bayesian analysis formally models the variability from study to study.

For this and other approaches mentioned above, there are other solutions based on more traditional statistical methods. The Bayesian analysis is still worth considering because

Bayesian analysis can also help compute cost-effectiveness acceptability curves (CEAC).

O'Hagan and Luce is a freely downloadable resource that gives a general overview of Bayesian analysis.