Statistical Analysis of Microarrays by Insightful (August 31, 2005).


I attended a seminar presented by Michael O'Connell of Insightful Corporation on microarray analysis. Insightful has a program, S+ArrayAnalyzer, and this talk showed some of the capabilities of that software. The archived version of this presentation appears at

The first example data set comes from Gretchen Darlington at the Huffington Center on Aging, Baylor College of Medicine. There were mice of two ages (young and old), and a surgery-induced injury was performed. The liver was sampled at time 0, 0.5 hours, 1 hour, 6 hours, and 24 hours. There were three replications at each combination of time and age, for a total sample size of 30. With six mice at each time point there are 5 degrees of freedom; the young vs. old contrast uses 1, leaving only 4 degrees of freedom for estimating error at each time point.

The goals are to identify genes that are differentially expressed, minimizing false negatives (or, equivalently, maximizing the power of the statistical test) while keeping the probability of false discoveries acceptably low. You need to be careful throughout the analysis pipeline and filter out genes that are not interesting (based on a priori hypotheses or on lack of variation).

The basic workflow is to read in the data, run quality control checks, filter the arrays and genes, normalize, test for differential expression, and follow up with gene enrichment and prediction analyses.

S+ArrayAnalyzer reads in CDF (chip description file), CEL (probe-level cell intensity), and CHP binary files. All the standard Affymetrix chips are handled using the AADM links. This software also handles all Agilent two-channel microarrays.

Quality control methods use graphical diagnostics such as the Bland-Altman plots, box plots, and RNA degradation plots discussed below.

Filtering allows you to select a subset of arrays and/or genes. You can look, for example, for genes whose expression value exceeds some threshold on at least one chip.
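As a minimal sketch of that kind of filter (the matrix and threshold here are made up, and I'm assuming the data sit in a genes-by-chips NumPy array):

    import numpy as np

    # Hypothetical expression matrix: 1000 genes (rows) by 30 chips (columns).
    rng = np.random.default_rng(0)
    expr = rng.lognormal(mean=6, sigma=1, size=(1000, 30))

    threshold = 100  # hypothetical cutoff for an "interesting" expression level

    # Keep a gene if its expression exceeds the threshold on at least one chip.
    keep = (expr > threshold).any(axis=1)
    filtered = expr[keep]
    print(f"kept {keep.sum()} of {expr.shape[0]} genes")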

The Bland-Altman plot uses a hexbin format, where the graph is divided into hexagonal regions (like a honeycomb) and the count of genes in each region is shown by shades of color. [Note to myself: I need to find some good examples of this plotting technique.]
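Here is a small example I worked out for myself with matplotlib's hexbin function; the two chips are simulated stand-ins for real data:

    import numpy as np
    import matplotlib.pyplot as plt

    # Simulated log2 expression values for two chips.
    rng = np.random.default_rng(1)
    x = rng.normal(8, 2, 20000)
    y = x + rng.normal(0, 0.3, 20000)

    a = (x + y) / 2   # average log2 expression
    m = y - x         # log2 difference between the chips

    plt.hexbin(a, m, gridsize=60, cmap="Blues")  # honeycomb bins, shaded by count
    plt.colorbar(label="genes per hexagon")
    plt.xlabel("average log2 expression (A)")
    plt.ylabel("log2 difference (M)")
    plt.show()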

In this example, the Bland-Altman plots showed some outliers, which indicate that a more aggressive normalization such as RMA or GC-RMA might be needed. These methods compress the variability. The Insightful web site describes these normalization approaches:

In the Affymetrix system, each gene is represented by 11-20 PM and MM pairs of probes, each probing a different region of the mRNA transcript, typically within 600 base pairs of the 3' end. The RMA method of Irizarry et al. (2003) models PM intensity as a sum of exponential and Gaussian distributions for signal and background respectively, and uses quantile normalization (Bolstad et al., 2003) and a log-scale expression effect plus probe effect model that is fit robustly (median polish) to define the robust multi-array analysis (RMA) expression estimate for each gene. The GC-RMA method of Wu et al. (2004) describes an algorithm similar to RMA, but incorporating the MM using a model based on GC content (GC-RMA). (www.insightful.com/products/s-plus_arrayanalyzer/features.asp)

Dr. O'Connell liked these methods, though he did mention that RMA tended to bias the fold changes downward because it does not use the MM (mismatch) adjustment. I'm not quite sure what this means. He also characterized quantile normalization as good, but somewhat aggressive. Again, I'm not quite sure what this means.
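My understanding is that quantile normalization forces every chip to share the same empirical intensity distribution, which may be what "aggressive" refers to. A minimal NumPy sketch (ignoring the tie handling that production implementations add):

    import numpy as np

    def quantile_normalize(expr):
        """Map every column (chip) onto the average empirical distribution."""
        ranks = expr.argsort(axis=0).argsort(axis=0)          # rank within each chip
        mean_quantiles = np.sort(expr, axis=0).mean(axis=1)   # average sorted values
        return mean_quantiles[ranks]                          # look up values by rank

After this transformation every chip has exactly the same set of values, just assigned to different genes, which shows how strong an assumption the method makes.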

A box plot shows that the median expression level does indeed vary from chip to chip, which emphasizes the need to normalize across chips.
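A quick way to make that diagnostic yourself, again assuming a genes-by-chips matrix of raw intensities (simulated here):

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(2)
    # Simulated intensities for 6 chips with deliberately different scales.
    expr = rng.lognormal(mean=6, sigma=1, size=(1000, 6)) * rng.uniform(0.8, 1.25, 6)

    plt.boxplot(np.log2(expr), showfliers=False)  # one box per chip
    plt.xlabel("chip")
    plt.ylabel("log2 intensity")
    plt.show()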

Dr. O'Connell highlighted a paper (Choe 2005) that is worth studying carefully. It seems to support the use of a variance function (CyberT and LPE) over the basic t-test or SAM (I couldn't locate the Tibshirani reference that he mentioned, but Efron 2001 gives a good overview of SAM). Dr. O'Connell stressed that the LPE approach is especially useful when the degrees of freedom are small.

The use of a variance function was supported by the Bland-Altman plots, which showed that variability as a function of average expression is consistent across arrays. To estimate a variance function, divide the genes into a large number of bins (say 100) by average expression, compute a measure of variation within each bin, and smooth those estimates to form a variance function of the mean. This approach borrows strength across the local pool of genes, which addresses low replication and unreliable per-gene variance estimates. There is strong precedent for the LPE test; variance functions have been used in bioassay models for several decades. Some good references for the use of a variance function are Jain (2003) and Lee (2003).
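A rough Python sketch of the binning-and-smoothing idea; the inputs I'm assuming (per-gene average expression and replicate differences) and the crude running-mean smoother are my own simplifications, not the actual LPE implementation:

    import numpy as np

    def variance_function(means, diffs, n_bins=100):
        """Bin genes by average expression, estimate the variance of the
        replicate differences within each bin, and smooth across bins."""
        order = np.argsort(means)
        bins = np.array_split(order, n_bins)
        bin_means = np.array([means[b].mean() for b in bins])
        bin_vars = np.array([diffs[b].var() for b in bins])
        smooth_vars = np.convolve(bin_vars, np.ones(7) / 7, mode="same")
        # Interpolate so the variance can be looked up at any expression level.
        return lambda m: np.interp(m, bin_means, smooth_vars)

The returned function can then supply a pooled variance estimate for the test statistic in place of the unreliable per-gene estimate.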

Some competing approaches for estimating differential expression include PDNN (he mentioned a Zhang (2003?) reference, which I could not find, but Nielsen 2005 also seems to describe this approach), PLIER (Hubbell; I can't find this reference), and MBEI (Li 2001). He also mentioned a resampling approach for testing (Reiner 2003).

Other references that I rapidly copied down, but have not had time to check, are: Durbin and Rocke; Huber et al.; Lin et al.; and Kamb and Ramaswami.

Dr. O'Connell also showed some volcano plots. I need to write a web page about this. In a volcano plot, the y-axis shows the p-value (typically on a negative log10 scale), and the x-axis shows the mean log2 fold change. This plot allows you to look at genes that have both a large fold change and a small p-value.
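A toy volcano plot with simulated fold changes and p-values (the dashed cutoffs are arbitrary); the interesting genes land in the upper left and upper right corners:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(3)
    fold = rng.normal(0, 1, 5000)    # simulated mean log2 fold changes
    pvals = rng.uniform(size=5000)   # simulated p-values

    plt.scatter(fold, -np.log10(pvals), s=3, alpha=0.4)
    plt.axvline(-1, linestyle="--")               # two-fold decrease
    plt.axvline(1, linestyle="--")                # two-fold increase
    plt.axhline(-np.log10(0.01), linestyle="--")  # p = 0.01
    plt.xlabel("mean log2 fold change")
    plt.ylabel("-log10 p-value")
    plt.show()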

He also discussed the RNA degradation plot. This plot shows mean probe intensity by probe position, ordered from the 5' end to the 3' end of the transcript, with one line per chip. The ideal RNA degradation plot shows a series of parallel lines, and ideally these lines will also be more or less horizontal. I can't seem to find a good reference for this approach.
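As best I understand it, the plot is built something like the toy sketch below, where I assume probe-level intensities indexed by gene, 5'-to-3' probe position, and chip:

    import numpy as np
    import matplotlib.pyplot as plt

    rng = np.random.default_rng(4)
    n_genes, n_positions, n_chips = 500, 11, 6
    probes = rng.lognormal(6, 1, (n_genes, n_positions, n_chips))
    # Mimic mild RNA degradation: intensities rise toward the 3' end.
    probes *= np.linspace(0.8, 1.2, n_positions)[None, :, None]

    mean_by_position = probes.mean(axis=0)   # positions by chips
    plt.plot(mean_by_position)               # one line per chip
    plt.xlabel("probe position (5' to 3')")
    plt.ylabel("mean intensity")
    plt.show()

Parallel lines suggest any degradation affected all chips similarly; a chip whose line crosses the others is suspect.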

Dr. O'Connell spent a fair amount of time discussing the gene enrichment test. This is something I want to elaborate on when I have time. S+ArrayAnalyzer uses the Onto-Express web site for the gene enrichment test.
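I don't know exactly what Onto-Express computes, but the standard gene enrichment (over-representation) test is a hypergeometric tail probability. A sketch with made-up counts:

    from scipy.stats import hypergeom

    # Hypothetical counts: 10,000 genes on the chip, 200 annotated to some
    # category, 300 called differentially expressed, 15 in both groups.
    N, K, n, k = 10000, 200, 300, 15

    # P(at least k category genes among the n selected genes).
    p_value = hypergeom(N, K, n).sf(k - 1)
    print(p_value)  # about 6 category genes expected by chance, so 15 is a lot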

He also spent some time on the identification of disease specific gene expression and relationships to survival. He listed several classic references in this area:

More recent work involves filter methods (eliminating genes prior to building the prediction model) and wrapper methods (eliminating genes recursively during model estimation, using machine learning weights, variable importance, etc.). One reference in this area that caught my eye was Furlanello (2003), because it used an entropy measure. Feature selection must be repeated within each fold of the cross-validation to avoid selection bias (Ambroise 2002).

Dr. O'Connell was very impressed with the methods used in Wang (2005).

Some other references in predicting survival include a cross-validation approach (Ambroise 2002) and univariate filtering (Chen 2003). I also learned a new acronym: RFE (recursive feature elimination).
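To make the selection-bias point concrete for myself, here is a scikit-learn sketch (the simulated data and parameter choices are all made up): because the RFE step sits inside the pipeline, genes are re-selected within every cross-validation fold instead of once on the full data set.

    from sklearn.datasets import make_classification
    from sklearn.feature_selection import RFE
    from sklearn.model_selection import cross_val_score
    from sklearn.pipeline import Pipeline
    from sklearn.svm import SVC

    # Simulated stand-in for an expression matrix: 60 samples, 500 genes.
    X, y = make_classification(n_samples=60, n_features=500,
                               n_informative=10, random_state=0)

    model = Pipeline([
        # Recursive feature elimination, dropping 20% of genes per step.
        ("select", RFE(SVC(kernel="linear"), n_features_to_select=20, step=0.2)),
        ("classify", SVC(kernel="linear")),
    ])

    # Each fold re-runs RFE on its own training data, avoiding the
    # selection bias that Ambroise (2002) describes.
    print(cross_val_score(model, X, y, cv=5).mean())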

Here are some of the references mentioned in this talk: