Statistical Analysis of Microarrays by Insightful (August 31, 2005).


I attended a seminar presented by Michael O'Connell of Insightful Corporation on microarray analysis. Insightful has a program, S+ArrayAnalyzer, and this talk showed some of the capabilities of that software. The archived version of this presentation appears at

The first example data set comes from Gretchen Darlington at the Huffington Center on Aging, Baylor College of Medicine. There were mice of two ages (young and old), and a surgery-induced injury was performed. The liver was sampled at time 0, 0.5 hours, 1 hour, 6 hours, and 24 hours. There were three replications at each combination of time and age, for a total sample size of 30. With six mice at each time point there are 5 degrees of freedom; the young vs. old contrast uses 1, leaving only 4 degrees of freedom for estimating error at each time point.

The goals are to identify genes that are differentially expressed, minimizing false negatives (or, equivalently, maximizing the power of the statistical test) while keeping the probability of false discoveries acceptably low. You need to be careful throughout the analysis pipeline and filter out genes that are not interesting (based on a priori hypotheses or on lack of variation).

The basic workflow is to read in the data, run quality control checks, filter the arrays and genes, normalize, test for differential expression, and follow up with gene enrichment and prediction analyses.

S+ArrayAnalyzer reads in CDF (chip description file), CEL (probe-level cell intensity), and CHP binary files. All the standard Affymetrix chips are handled using the AADM links. This software also handles all Agilent two-channel microarrays.

Quality control methods use graphical diagnostics such as the Bland-Altman plots, box plots, and RNA degradation plots discussed below.

Filtering allows you to select a subset of arrays and/or genes. You can look, for example, for genes whose expression value exceeds some threshold on at least one chip.
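As a minimal sketch of that kind of filter (the matrix and threshold here are made up, and I'm assuming the data sit in a genes-by-chips NumPy array):

    import numpy as np

    # Hypothetical expression matrix: 1000 genes (rows) by 30 chips (columns).
    rng = np.random.default_rng(0)
    expr = rng.lognormal(mean=6, sigma=1, size=(1000, 30))

    threshold = 100  # hypothetical cutoff for an "interesting" expression level

    # Keep a gene if its expression exceeds the threshold on at least one chip.
    keep = (expr > threshold).any(axis=1)
    filtered = expr[keep]
    print(f"kept {keep.sum()} of {expr.shape[0]} genes")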

The Bland-Altman plot uses a hexbin format, where the graph is divided into hexagonal regions (like a honeycomb) and the count of genes in each region is shown by shades of color. [Note to myself: I need to find some good examples of this plotting technique.]
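Here is a small example I worked out for myself with matplotlib's hexbin function; the two chips are simulated stand-ins for real data:

    import numpy as np
    import matplotlib.pyplot as plt

    # Simulated log2 expression values for two chips.
    rng = np.random.default_rng(1)
    x = rng.normal(8, 2, 20000)
    y = x + rng.normal(0, 0.3, 20000)

    a = (x + y) / 2   # average log2 expression
    m = y - x         # log2 difference between the chips

    plt.hexbin(a, m, gridsize=60, cmap="Blues")  # honeycomb bins, shaded by count
    plt.colorbar(label="genes per hexagon")
    plt.xlabel("average log2 expression (A)")
    plt.ylabel("log2 difference (M)")
    plt.show()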

In this example, the Bland-Altman plots showed some outliers, which indicate that a more aggressive normalization such as RMA or GC-RMA might be needed. These methods compress the variability. The Insightful web site describes these normalization approaches:

In the Affymetrix system, each gene is represented by 11-20 PM and MM pairs of probes, each probing a different region of the mRNA transcript, typically within 600 base pairs of the 3' end. The RMA method of Irizarry et al. (2003) models PM intensity as a sum of exponential and Gaussian distributions for signal and background respectively, and uses quantile normalization (Bolstad et al., 2003) and a log-scale expression effect plus probe effect model that is fit robustly (median polish) to define the robust multi-array analysis (RMA) expression estimate for each gene. The GC-RMA method of Wu et al. (2004) describes an algorithm similar to RMA, but incorporating the MM using a model based on GC content (GC-RMA). (www.insightful.com/products/s-plus_arrayanalyzer/features.asp)

Dr. O'Connell liked these methods, though he did mention that RMA tended to bias the fold changes downward because it does not use the MM (mismatch) adjustment. I'm not quite sure what this means. He also characterized quantile normalization as good, but somewhat aggressive. Again, I'm not quite sure what this means.
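My understanding is that quantile normalization forces every chip to share the same empirical intensity distribution, which may be what "aggressive" refers to. A minimal NumPy sketch (ignoring the tie handling that production implementations add):

    import numpy as np

    def quantile_normalize(expr):
        """Map every column (chip) onto the average empirical distribution."""
        ranks = expr.argsort(axis=0).argsort(axis=0)          # rank within each chip
        mean_quantiles = np.sort(expr, axis=0).mean(axis=1)   # average sorted values
        return mean_quantiles[ranks]                          # look up values by rank

After this transformation every chip has exactly the same set of values, just assigned to different genes, which shows how strong an assumption the method makes.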

A box plot shows that the median expression level does indeed vary from chip to chip, which emphasizes the need to normalize across chips.
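A quick way to make that diagnostic yourself, again assuming a genes-by-chips matrix of raw intensities (simulated here):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    # Simulated intensities for 6 chips with deliberately different scales.
    expr = rng.lognormal(mean=6, sigma=1, size=(1000, 6)) * rng.uniform(0.8, 1.25, 6)

    plt.boxplot(np.log2(expr), showfliers=False)  # one box per chip
    plt.xlabel("chip")
    plt.ylabel("log2 intensity")
    plt.show()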

Dr. O'Connell highlighted a paper (Choe 2005) that is worth studying carefully. It seems to support the use of a variance function (CyberT and LPE) over the basic t-test or SAM (I couldn't locate the Tibshirani reference that he mentioned, but Efron 2001 gives a good overview of SAM). Dr. O'Connell stressed that the LPE approach is especially useful when the degrees of freedom are small.

The use of a variance function was supported by the Bland-Altman plots, which showed that variability as a function of average expression is consistent across arrays. To estimate a variance function, divide the genes into a large number of bins (say 100) by average expression, compute a measure of variation within each bin, and smooth those estimates to form a variance function of the mean. This approach borrows strength across the local pool of genes, which addresses low replication and unreliable per-gene variance estimates. There is strong precedent for the LPE test; variance functions have been used in bioassay models for several decades. Some good references for the use of a variance function are Jain (2003) and Lee (2003).
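A rough Python sketch of the binning-and-smoothing idea; the inputs I'm assuming (per-gene average expression and replicate differences) and the crude running-mean smoother are my own simplifications, not the actual LPE implementation:

    import numpy as np

    def variance_function(means, diffs, n_bins=100):
        """Bin genes by average expression, estimate the variance of the
        replicate differences within each bin, and smooth across bins."""
        order = np.argsort(means)
        bins = np.array_split(order, n_bins)
        bin_means = np.array([means[b].mean() for b in bins])
        bin_vars = np.array([diffs[b].var() for b in bins])
        smooth_vars = np.convolve(bin_vars, np.ones(7) / 7, mode="same")
        # Interpolate so the variance can be looked up at any expression level.
        return lambda m: np.interp(m, bin_means, smooth_vars)

The returned function can then supply a pooled variance estimate for the test statistic in place of the unreliable per-gene estimate.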

Some competing approaches for estimating differential expression include PDNN (he mentioned a Zhang (2003?) reference, which I could not find, but Nielsen 2005 also seems to describe this approach), PLIER (Hubbell; I can't find this reference), and MBEI (Li 2001). He also mentioned a resampling approach for testing (Reiner 2003).

Other references that I rapidly copied down, but have not had time to check, are: Durbin and Rocke; Huber et al.; Lin et al.; and Kamb and Ramaswami.

Dr. O'Connell also showed some volcano plots. I need to write a web page about this. In a volcano plot, the y-axis shows the p-value (typically on a negative log10 scale), and the x-axis shows the mean log2 fold change. This plot allows you to look at genes that have both a large fold change and a small p-value.
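A toy volcano plot with simulated fold changes and p-values (the dashed cutoffs are arbitrary); the interesting genes land in the upper left and upper right corners:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(3)
    fold = rng.normal(0, 1, 5000)    # simulated mean log2 fold changes
    pvals = rng.uniform(size=5000)   # simulated p-values

    plt.scatter(fold, -np.log10(pvals), s=3, alpha=0.4)
    plt.axvline(-1, linestyle="--")               # two-fold decrease
    plt.axvline(1, linestyle="--")                # two-fold increase
    plt.axhline(-np.log10(0.01), linestyle="--")  # p = 0.01
    plt.xlabel("mean log2 fold change")
    plt.ylabel("-log10 p-value")
    plt.show()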

He also discussed the RNA degradation plot. This plot shows mean probe intensity by probe position, ordered from the 5' end to the 3' end of the transcript, with one line per chip. The ideal RNA degradation plot shows a series of parallel lines, and ideally these lines will also be more or less horizontal. I can't seem to find a good reference for this approach.
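As best I understand it, the plot is built something like the toy sketch below, where I assume probe-level intensities indexed by gene, 5'-to-3' probe position, and chip:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(4)
    n_genes, n_positions, n_chips = 500, 11, 6
    probes = rng.lognormal(6, 1, (n_genes, n_positions, n_chips))
    # Mimic mild RNA degradation: intensities rise toward the 3' end.
    probes *= np.linspace(0.8, 1.2, n_positions)[None, :, None]

    mean_by_position = probes.mean(axis=0)   # positions by chips
    plt.plot(mean_by_position)               # one line per chip
    plt.xlabel("probe position (5' to 3')")
    plt.ylabel("mean intensity")
    plt.show()

Parallel lines suggest any degradation affected all chips similarly; a chip whose line crosses the others is suspect.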

Dr. O'Connell spent a fair amount of time discussing the gene enrichment test. This is something I want to elaborate on when I have time. S+ArrayAnalyzer uses the Onto-Express web site for the gene enrichment test.
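I don't know exactly what Onto-Express computes, but the standard gene enrichment (over-representation) test is a hypergeometric tail probability. A sketch with made-up counts:

    from scipy.stats import hypergeom

    # Hypothetical counts: 10,000 genes on the chip, 200 annotated to some
    # category, 300 called differentially expressed, 15 in both groups.
    N, K, n, k = 10000, 200, 300, 15

    # P(at least k category genes among the n selected genes).
    p_value = hypergeom(N, K, n).sf(k - 1)
    print(p_value)  # about 6 category genes expected by chance, so 15 is a lot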

He also spent some time on the identification of disease specific gene expression and relationships to survival. He listed several classic references in this area:

More recent work involves filter methods (eliminating genes prior to building the prediction model) and wrapper methods (eliminating genes recursively during model estimation, using machine learning weights, variable importance, etc.). One reference in this area that caught my eye was Furlanello (2003), because it used an entropy measure. Feature selection must be repeated within each fold of the cross-validation to avoid selection bias (Ambroise 2002).

Dr. O'Connell was very impressed with the methods used in Wang (2005).

Some other references in predicting survival include a cross-validation approach (Ambroise 2002) and univariate filtering (Chen 2003). I also learned a new acronym: RFE (recursive feature elimination).
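To make the selection-bias point concrete for myself, here is a scikit-learn sketch (the simulated data and parameter choices are all made up): because the RFE step sits inside the pipeline, genes are re-selected within every cross-validation fold instead of once on the full data set.

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFE
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVC

    # Simulated stand-in for an expression matrix: 60 samples, 500 genes.
    X, y = make_classification(n_samples=60, n_features=500,
                               n_informative=10, random_state=0)

    model = Pipeline([
        # Recursive feature elimination, dropping 20% of genes per step.
        ("select", RFE(SVC(kernel="linear"), n_features_to_select=20, step=0.2)),
        ("classify", SVC(kernel="linear")),
    ])

    # Each fold re-runs RFE on its own training data, avoiding the
    # selection bias that Ambroise (2002) describes.
    print(cross_val_score(model, X, y, cv=5).mean())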

Here are some of the references mentioned in this talk: