Data mining (created 2007-06-18).

Data mining is a broad class of statistical tools designed for massive data sets. Many of the links in this category refer to methods for genetic data sets, especially microarray studies. Articles are arranged by date with the most recent entries at the top. You can find outside resources at the bottom of this page. Other entries about data mining can be found on the data mining page of the StATS website.

2008

  1. P.Mean: Comparing a set of microarray experiments to a model experiment (created 2008-11-01). I have a matrix of effect sizes from numerous microarray experiments. For example, in one matrix I have 200 genes (rows) and 107 experiments (columns). In addition, I also have a sort of “model experiment” which contains the values in which I am most interested. I am trying to determine which genes are not statistically different from their “model experiment” values.
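
The comparison described in this entry could be sketched roughly as follows. This is only an illustration under stated assumptions, not the analysis from the article itself: the data are simulated, the names `gene_p_values` and `benjamini_hochberg` are hypothetical, the experiments are treated as independent replicates, and a normal approximation stands in for the one-sample t-test (reasonable with 107 experiments per gene). Note that failing to find a difference is not the same as demonstrating equivalence; a formal equivalence test would address that question more directly.

```python
import math
import random
from statistics import NormalDist, mean, stdev


def gene_p_values(effects, model):
    """Two-sided p-value per gene for H0: mean effect size equals the model value.

    effects: list of rows, one per gene, each a list of effect sizes.
    model:   list of model-experiment values, one per gene.
    Uses a normal approximation to the one-sample t statistic (an assumption).
    """
    std_normal = NormalDist()
    p_values = []
    for row, target in zip(effects, model):
        se = stdev(row) / math.sqrt(len(row))
        z = (mean(row) - target) / se
        p_values.append(2 * (1 - std_normal.cdf(abs(z))))
    return p_values


def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Simulated stand-in for the real data: 200 genes by 107 experiments.
random.seed(1)
n_genes, n_experiments = 200, 107
model = [random.gauss(0, 1) for _ in range(n_genes)]
# The first 20 genes are shifted away from the model value; the rest agree.
effects = [
    [random.gauss(model[g] + (1.0 if g < 20 else 0.0), 1.0)
     for _ in range(n_experiments)]
    for g in range(n_genes)
]

adjusted = benjamini_hochberg(gene_p_values(effects, model))
different = [g for g, p in enumerate(adjusted) if p < 0.05]
print(len(different), "genes flagged as different from the model experiment")
```

Genes that are not flagged are the candidates for "not statistically different," subject to the caveat above. The false discovery rate adjustment is included because 200 tests are run at once.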

Outside resources:

  1. E E Schadt, C Li, C Su, W H Wong. Analyzing high-density oligonucleotide gene expression array data. J. Cell. Biochem. 2000;80(2):192-202. Abstract: "We have developed methods and identified problems associated with the analysis of data generated by high-density, oligonucleotide gene expression arrays. Our methods are aimed at accounting for many of the sources of variation that make it difficult, at times, to realize consistent results. We present here descriptions of some of these methods and how they impact the analysis of oligonucleotide gene expression array data. We will discuss the process of recognizing the "spots" (or features) on the Affymetrix GeneChip(R) probe arrays, correcting for background and intensity gradients in the resulting images, scaling/normalizing an array to allow array-to-array comparisons, monitoring probe performance with respect to hybridization efficiency, and assessing whether a gene is present or differentially expressed. Examples from the analyses of gene expression validation data are presented to contrast the different methods applied to these types of data." [Accessed February 22, 2011]. Available at: http://www.ncbi.nlm.nih.gov/pubmed/11074587.
  2. Eugene Chudin, Randal Walker, Alan Kosaka, et al. Assessment of the relationship between signal intensities and transcript concentration for Affymetrix GeneChip arrays. Genome Biol. 2002;3(1):RESEARCH0005. Abstract: "BACKGROUND: Affymetrix microarrays have become increasingly popular in gene-expression studies; however, limitations of the technology have not been well established for commercially available arrays. The hybridization signal has been shown to be proportional to actual transcript concentration for specialized arrays containing hundreds of distinct probe pairs per gene. Additionally, the technology has been described as capable of distinguishing concentration levels within a factor of 2, and of detecting transcript frequencies as low as 1 in 2,000,000. Using commercially available arrays, we assessed these representations directly through a series of 'spike-in' hybridizations involving four prokaryotic transcripts in the absence and presence of fixed eukaryotic background. The contribution of probe-target interactions to the mismatch signal was quantified under various analyte concentrations. RESULTS: A linear relationship between transcript abundance and signal was consistently observed between 1 pM and 10 pM transcripts. The signal ceased to be linear above the 10 pM level and commenced saturating around the 100 pM level. The 0.1 pM transcripts were virtually undetectable in the presence of eukaryotic background. Our measurements show that preponderance of the signal for mismatch probes derives from interactions with the target transcripts. CONCLUSIONS: Landmark studies outlining an observed linear relationship between signal and transcript concentration were carried out under highly specialized conditions and may not extend to commercially available arrays under routine operating conditions. Additionally, alternative metrics that are not based on the difference in the signal of members of a probe pair may further improve the quantitative utility of the Affymetrix GeneChip array." [Accessed February 22, 2011]. Available at: http://genomebiology.com/2001/3/1/research/0005.
  3. Torsten Hothorn, Berthold Lausen, Axel Benner, Martin Radespiel-Tröger. Bagging survival trees. Stat Med. 2004;23(1):77-91. Abstract: "Predicted survival probability functions of censored event free survival are improved by bagging survival trees. We suggest a new method to aggregate survival trees in order to obtain better predictions for breast cancer and lymphoma patients. A set of survival trees based on B bootstrap samples is computed. We define the aggregated Kaplan-Meier curve of a new observation by the Kaplan-Meier curve of all observations identified by the B leaves containing the new observation. The integrated Brier score is used for the evaluation of predictive models. We analyse data of a large trial on node positive breast cancer patients conducted by the German Breast Cancer Study Group and a smaller 'pilot' study on diffuse large B-cell lymphoma, where prognostic factors are derived from microarray expression values. In addition, simulation experiments underline the predictive power of our proposal." [Accessed February 22, 2011]. Available at: http://www.ncbi.nlm.nih.gov/pubmed/14695641.
  4. Paola Rancoita, Marcus Hutter, Francesco Bertoni, Ivo Kwee. Bayesian DNA copy number analysis. BMC Bioinformatics. 2009;10(1):10. Abstract: "BACKGROUND: Some diseases, like tumors, can be related to chromosomal aberrations, leading to changes of DNA copy number. The copy number of an aberrant genome can be represented as a piecewise constant function, since it can exhibit regions of deletions or gains. Instead, in a healthy cell the copy number is two because we inherit one copy of each chromosome from each our parents. Bayesian Piecewise Constant Regression (BPCR) is a Bayesian regression method for data that are noisy observations of a piecewise constant function. The method estimates the unknown segment number, the endpoints of the segments and the value of the segment levels of the underlying piecewise constant function. The Bayesian Regression Curve (BRC) estimates the same data with a smoothing curve. However, in the original formulation, some estimators failed to properly determine the corresponding parameters. For example, the boundary estimator did not take into account the dependency among the boundaries and succeeded in estimating more than one breakpoint at the same position, losing segments. RESULTS: We derived an improved version of the BPCR (called mBPCR) and BRC, changing the segment number estimator and the boundary estimator to enhance the fitting procedure. We also proposed an alternative estimator of the variance of the segment levels, which is useful in case of data with high noise. Using artificial data, we compared the original and the modified version of BPCR and BRC with other regression methods, showing that our improved version of BPCR generally outperformed all the others. Similar results were also observed on real data. CONCLUSIONS: We propose an improved method for DNA copy number estimation, mBPCR, which performed very well compared to previously published algorithms. In particular, mBPCR was more powerful in the detection of the true position of the breakpoints and of small aberrations in very noisy data. Hence, from a biological point of view, our method can be very useful, for example, to find targets of genomic aberrations in clinical cancer samples." [Accessed February 24, 2009]. Available at: http://www.biomedcentral.com/1471-2105/10/10.
  5. Joseph G Ibrahim, Ming-Hui Chen, Robert J Gray. Bayesian Models for Gene Expression With DNA Microarray Data. Journal of the American Statistical Association. 2002;97(457):88-99. Abstract: "Two of the critical issues that arise when examining DNA microarray data are (1) determination of which genes best discriminate among the different types of tissue, and (2) characterization of expression patterns in tumor tissues. For (1), there are many genes that characterize DNA expression, and it is of critical importance to try and identify a small set of genes that best discriminate between normal and tumor tissues. For (2), it is critical to be able to characterize the DNA expression of the normal and tumor tissue samples and develop suitable models that explain patterns of DNA expression for these types of tissues. Toward this goal, we propose a novel Bayesian model for analyzing DNA microarray data and propose a model selection methodology for identifying subsets of genes that show different expression levels between normal and cancer tissues. In addition, we propose a novel class of hierarchical priors for the parameters that allow us to borrow strength across genes for making inference. The properties of the priors are examined in detail. We introduce a Bayesian model selection criterion for assessing the various models, and develop Markov chain Monte Carlo algorithms for sampling from the posterior distributions of the parameters and for computing the criterion. We present a detailed case study in endometrial cancer to demonstrate our proposed methodology." [Accessed February 22, 2011]. Available at: http://pubs.amstat.org/doi/abs/10.1198/016214502753479257.
  6. Chang Gue Son, Sven Bilke, Sean Davis, Braden T. Greer, Jun S. Wei, Craig C. Whiteford, Qing-Rong Chen, Nicola Cenacchi, Javed Khan. Database of mRNA gene expression profiles of multiple human organs. Genome Research. 2005;15(3):443-450. Abstract: "Genome-wide expression profiling of normal tissue may facilitate our understanding of the etiology of diseased organs and augment the development of new targeted therapeutics. Here, we have developed a high-density gene expression database of 18,927 unique genes for 158 normal human samples from 19 different organs of 30 different individuals using DNA microarrays. We report four main findings. First, despite very diverse sample parameters (e.g., age, ethnicity, sex, and postmortem interval), the expression profiles belonging to the same organs cluster together, demonstrating internal stability of the database. Second, the gene expression profiles reflect major organ-specific functions on the molecular level, indicating consistency of our database with known biology. Third, we demonstrate that any small (i.e., n ~ 100), randomly selected subset of genes can approximately reproduce the hierarchical clustering of the full data set, suggesting that the observed differential expression of >90% of the probed genes is of biological origin. Fourth, we demonstrate a potential application of this database to cancer research by identifying 19 tumor-specific genes in neuroblastoma. The selected genes are relatively underexpressed in all of the organs examined and belong to therapeutically relevant pathways, making them potential novel diagnostic markers and targets for therapy. We expect this database will be of utility for developing rationally designed molecularly targeted therapeutics in diseases such as cancer, as well as for exploring the functions of genes." [Accessed October 10, 2011]. Available at: http://genome.cshlp.org/content/15/3/443.abstract.
  7. Amber Hackstadt, Ann Hess. Filtering for Increased Power for Microarray Data Analysis. BMC Bioinformatics. 2009;10(1):11. Abstract: "BACKGROUND: Due to the large number of hypothesis tests performed during the process of routine analysis of microarray data, a multiple testing adjustment is certainly warranted. However, when the number of tests is very large and the proportion of differentially expressed genes is relatively low, the use of a multiple testing adjustment can result in very low power to detect those genes which are truly differentially expressed. Filtering allows for a reduction in the number of tests and a corresponding increase in power. Common filtering methods include filtering by variance, average signal or MAS detection call (for Affymetrix arrays). In this paper, we study the effects of filtering in combination with the Benjamini-Hochberg method for false discovery rate control and q-value for false discovery rate estimation. RESULTS: Three case studies are used to compare three different filtering methods in combination with the two false discovery rate methods and three different preprocessing methods. For the case studies considered, filtering by detection call and variance (on the original scale) consistently led to an increase in the number of differentially expressed genes identified. On the other hand, filtering by variance on the log2 scale had a detrimental effect when paired with MAS5 and PLIER preprocessing methods, even when the testing was done on the log2 scale. A simulation study was done to further examine the effect of filtering by variance. We find that filtering by variance leads to higher power, often with a decrease in false discovery rate, when paired with either false discovery rate method. This holds regardless of the proportion of genes which are differentially expressed or whether we assume dependence or independence among genes. CONCLUSIONS: The case studies show that both detection call and variance filtering are viable methods of filtering which can increase the number of differentially expressed genes identified. The simulation study demonstrates that when paired with a false discovery rate method, filtering by variance can increase power while still controlling the false discovery rate. Filtering out 50% of probe sets seems reasonable as long as the majority of genes are not expected to be differentially expressed." [Accessed February 24, 2009]. Available at: http://www.biomedcentral.com/1471-2105/10/11.
  8. J. C. Barrett, B. Fry, J. Maller, M. J. Daly. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics. 2005;21(2):263-265. Abstract: "Summary: Research over the last few years has revealed significant haplotype structure in the human genome. The characterization of these patterns, particularly in the context of medical genetic association studies, is becoming a routine research activity. Haploview is a software package that provides computation of linkage disequilibrium statistics and population haplotype patterns from primary genotype data in a visually appealing and interactive interface." [Accessed February 22, 2011]. Available at: http://bioinformatics.oxfordjournals.org/content/21/2/263.abstract.
  9. Newspaper article: Gina Kolata. How a New Hope in Cancer Fell Apart. The New York Times. July 7, 2011. Excerpt: "The episode is a stark illustration of serious problems in a field in which the medical community has placed great hope: using patterns from large groups of genes or other molecules to improve the detection and treatment of cancer. Companies have been formed and products have been introduced that claim to use genetics in this way, but assertions have turned out to be unfounded. While researchers agree there is great promise in this science, it has yet to yield many reliable methods for diagnosing cancer or identifying the best treatment." [Accessed July 9, 2011]. Available at: http://www.nytimes.com/2011/07/08/health/research/08genes.html.
  10. Jing Tang, Chong Tan, Matej Oresic, Antonio Vidal-Puig. Integrating post-genomic approaches as a strategy to advance our understanding of health and disease. Genome Med. 2009;1(3):35. [Accessed April 22, 2009]. Available at: http://genomemedicine.com/content/1/3/35/.
  11. T. R. Golub, D. K. Slonim, P. Tamayo, et al. Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression Monitoring. Science. 1999;286(5439):531-537. Abstract: "Although cancer classification has improved over the past 30 years, there has been no general approach for identifying new cancer classes (class discovery) or for assigning tumors to known classes (class prediction). Here, a generic approach to cancer classification based on gene expression monitoring by DNA microarrays is described and applied to human acute leukemias as a test case. A class discovery procedure automatically discovered the distinction between acute myeloid leukemia (AML) and acute lymphoblastic leukemia (ALL) without previous knowledge of these classes. An automatically derived class predictor was able to determine the class of new leukemia cases. The results demonstrate the feasibility of cancer classification based solely on gene expression monitoring and suggest a general strategy for discovering and predicting cancer classes for other types of cancer, independent of previous biological knowledge." [Accessed February 22, 2011]. Available at: http://www.sciencemag.org/content/286/5439/531.abstract.
  12. Victoria Martin-Requena, Antonio Munoz-Merida, M.Gonzalo Claros, Oswaldo Trelles. PreP+07: improvements of a user friendly tool to pre-process and analyse microarray data. BMC Bioinformatics. 2009;10(1):16. Abstract: "BACKGROUND: Nowadays, microarray gene expression analysis is a widely used technology that scientists handle but whose final interpretation usually requires the participation of a specialist. The need for this participation is due to the requirement of some background in statistics that most users lack or have a very vague notion of. Moreover, programming skills could also be essential to analyse these data. An interactive, easy to use application seems therefore necessary to help researchers to extract full information from data and analyse them in a simple, powerful and confident way. RESULTS: PreP+07 is a standalone Windows XP application that presents a friendly interface for spot filtration, inter- and intra-slide normalization, duplicate resolution, dye-swapping, error removal and statistical analyses. Additionally, it contains two unique implementation of the procedures--double scan and Supervised Lowess--, a complete set of graphical representations--MA plot, RG plot, QQ plot, PP plot, PN plot-- and can deal with many data formats, such as tabulated text, GenePix GPR and ArrayPRO. PreP+07 performance has been compared with the equivalent functions in Bioconductor using a tomato chip with 13056 spots. The number of differentially expressed genes considering p-values coming from the PreP+07 and Bioconductor Limma packages were statistically identical when the data set was only normalized; however, a slight variability was appreciated when the data was both normalized and scaled. CONCLUSIONS: PreP+07 implementation provides a high degree of freedom in selecting and organizing a small set of widely used data processing protocols, and can handle many data formats. Its reliability has been proven so that a laboratory researcher can afford a statistical pre-processing of his/her microarray results and obtain a list of differentially expressed genes using PreP+07 without any programming skills. All of this gives support to scientists that have been using previous PreP releases since its first version in 2003." [Accessed February 24, 2009]. Available at: http://www.biomedcentral.com/1471-2105/10/16.
  13. Richard Pearson, Xuejun Liu, Guido Sanguinetti, et al. puma: a Bioconductor package for propagating uncertainty in microarray analysis. BMC Bioinformatics. 2009;10(1):211. Abstract: "BACKGROUND: Most analyses of microarray data are based on point estimates of expression levels and ignore the uncertainty of such estimates. By determining uncertainties from Affymetrix GeneChip data and propagating these uncertainties to downstream analyses it has been shown that we can improve results of differential expression detection, principal component analysis and clustering. Previously, implementations of these uncertainty propagation methods have only been available as separate packages, written in different languages. Previous implementations have also suffered from being very costly to compute, and in the case of differential expression detection, have been limited in the experimental designs to which they can be applied. RESULTS: puma is a Bioconductor package incorporating a suite of analysis methods for use on Affymetrix GeneChip data. puma extends the differential expression detection methods of previous work from the 2-class case to the multi-factorial case. puma can be used to automatically create design and contrast matrices for typical experimental designs, which can be used both within the package itself but also in other Bioconductor packages. The implementation of differential expression detection methods has been parallelised leading to significant decreases in processing time on a range of computer architectures. puma incorporates the first R implementation of an uncertainty propagation version of principal component analysis, and an implementation of a clustering method based on uncertainty propagation. All of these techniques are brought together in a single, easy-to-use package with clear, task-based documentation. CONCLUSION: For the first time, the puma package makes a suite of uncertainty propagation methods available to a general audience. These methods can be used to improve results from more traditional analyses of microarray data. puma also offers improvements in terms of scope and speed of execution over previously available methods. puma is recommended for anyone working with the Affymetrix GeneChip platform for gene expression analysis and can also be applied more generally." [Accessed August 19, 2009]. Available at: http://www.biomedcentral.com/1471-2105/10/211.
  14. Alexander Karpikov, Joel Rozowsky, Mark Gerstein. Tiling array data analysis: a multiscale approach using wavelets. BMC Bioinformatics. 2011;12(1):57. Abstract: "BACKGROUND: Tiling array data is hard to interpret due to noise. The wavelet transformation is a widely used technique in signal processing for elucidating the true signal from noisy data. Consequently, we attempted to denoise representative tiling array datasets for ChIP-chip experiments using wavelets. In doing this, we used specific wavelet basis functions, Coiflets, since their triangular shape closely resembles the expected profiles of true ChIP-chip peaks. RESULTS: In our wavelet-transformed data, we observed that noise tends to be confined to small scales while the useful signal-of-interest spans multiple large scales. We were also able to show that wavelet coefficients due to non-specific cross-hybridization follow a log-normal distribution, and we used this fact in developing a thresholding procedure. In particular, wavelets allow one to set an unambiguous, absolute threshold, which has been hard to define in ChIP-chip experiments. One can set this threshold by requiring a similar confidence level at different length-scales of the transformed signal. We applied our algorithm to a number of representative ChIP-chip data sets, including those of Pol II and histone modifications, which have a diverse distribution of length-scales of biochemical activity, including some broad peaks. CONCLUSIONS: Finally, we benchmarked our method in comparison to other approaches for scoring ChIP-chip data using spike-ins on the ENCODE Nimblegen tiling array. This comparison demonstrated excellent performance, with wavelets getting the best overall score." [Accessed February 22, 2011]. Available at: http://www.biomedcentral.com/1471-2105/12/57.
  15. Don Maier, Farrell Wymore, Gavin Sherlock, Catherine Ball. The XBabelPhish MAGE-ML and XML Translator. BMC Bioinformatics. 2008;9(1):28. Abstract: "BACKGROUND: MAGE-ML has been promoted as a standard format for describing microarray experiments and the data they produce. Two characteristics of the MAGE-ML format compromise its use as a universal standard: First, MAGE-ML files are exceptionally large - too large to be easily read by most people, and often too large to be read by most software programs. Second, the MAGE-ML standard permits many ways of representing the same information. As a result, different producers of MAGE-ML create different documents describing the same experiment and its data. Recognizing all the variants is an unwieldy software engineering task, resulting in software packages that can read and process MAGE-ML from some, but not all producers. This Tower of MAGE-ML Babel bars the unencumbered exchange of microarray experiment descriptions couched in MAGE-ML. RESULTS: We have developed XBabelPhish - an XQuery-based technology for translating one MAGE-ML variant into another. XBabelPhish's use is not restricted to translating MAGE-ML documents. It can transform XML files independent of their DTD, XML schema, or semantic content. Moreover, it is designed to work on very large (> 200 Mb.) files, which are common in the world of MAGE-ML. CONCLUSION: XBabelPhish provides a way to inter-translate MAGE-ML variants for improved interchange of microarray experiment information. More generally, it can be used to transform most XML files, including very large ones that exceed the capacity of most XML tools." [Accessed February 22, 2011]. Available at: http://www.biomedcentral.com/1471-2105/9/28.

All of the material above this paragraph is licensed under a Creative Commons Attribution 3.0 United States License. This page was written by Steve Simon and was last modified on 2010-04-11. The material below this paragraph links to my old website, StATS. Although I wrote all of the material listed below, my ex-employer, Children's Mercy Hospital, has claimed copyright ownership of this material. The brief excerpts shown here are included under the fair use provisions of U.S. copyright law.

2008

[[Links to the old website have been temporarily misplaced. I hope to fix this soon.]]
