Steps in a typical micro array analysis (no date)

This page is moving to a new website.

I am not an expert in micro array data analysis. In fact, I'm just starting. I thought that outlining some of the things I am learning as I start to do micro array data analyses would be helpful to others.

This page is a composite of several different pages:

You may also want to look at some of the additional material about microarrays on my weblog. Entries without a link represent more recent entries which were not yet archived when I added them to this list.

Keep an eye on my weblog for more recent entries (newer than July 20, 2004)


What is a microarray? (no date) Category: Data mining [incomplete]

What is a microarray?

A microarray is a tool for measuring the amount of messenger RNA (mRNA) that is circulating in a cell. It is the mRNA that transfers information from the genes from DNA inside the nucleus of a cell to create various proteins. Even though they have the exact same DNA, different cells have different amounts of various mRNA because they need to produce different proteins. For example, only certain cells in the pancreas produce insulin even though the DNA code for producing insulin exists inside all cells.

Genes that produce a lot of mRNA are said to be upregulated and genes that produce little or no mRNA are said to be downregulated.

Microarrays simultaneously measure the amount of circulating mRNA for hundreds or even thousands of different mRNAs. They have applications in many different research areas.

  1. Because the types and amounts of circulating mRNA effectively distinguish one type of cell from another, microarrays can help us better understand the factors that differentiate cells.
  2. Stems cells are undifferentiated cells that have the potential to transform into many different types of cells. Microarrays can help us understand how the transformation from a stem cell to a differentiated cell occurs.
  3. Within a cell, the types and amounts of mRNA vary during different stages of a cells life, so microarrays can help us understand what proteins are needed during these different stages.
  4. Certain diseases occur because of the failure of cells to produce the proteins that they are expected to produce, so microarrays can help us understand these diseases better and identify targets for new therapies.
  5. Other diseases like cancer occur when a group of cells grows out of control. Microarrays can help us better understand what triggers this rapid uncontrolled growth.

How does a microarray work?

Circulating mRNA from a clump of cells is converted to the complementary DNA strands (cDNA). The cDNA is amplified by polymerase chain reaction (PCR) and a molecular tag that glows is attached to each piece of cDNA. This mixture is then washed over a slide that has spots where the cDNA can bind. Any loose or unbound cDNA is then washed away. The amount of fluorescence at a particular spot on the microarray gives you an indication as to how much mRNA of a particular type was in the original sample.

The Polymerase Chain Reaction. Tabitha M. Powledge.

Spotted arrays

[Insert a brief discussion here.]

The Affy chip

Affymetrix produces a different type of microarray that uses photolithography to build the microarray, a process similar to the approach used to create semiconductor chips.  The Affy chips as they are often called, place thousands or tens of thousands of genes on a slide that can fit in the palm of your hand. Unlike spotted arrays, that places spots of the full sequence of the gene (or Expressed Sequence Tag) on a slide, the Affy chip selects 20 probes for each gene, each of which has a length of 25 base pairs. Next to each of these probes is another probe that represents the same 25 base pair sequence except that the middle base is changed. The 20 probes are called PM (Perfect Match) probes and the probes with the changed middle base are called MM (MisMatch) probes. The MM probes are an attempt to measure and control for cross-hybridization, the tendency for genes that are similar, but not identical to a particular gene sequence to bind weakly to these sites.


Importing data from microarray studies (no date) Category: Data mining [incomplete]

There are so many different ways that data can come to you in a microarray experiment that it is hard to document how to import the data. Here are a few examples, plus some random notes and thoughts.

The data structure for Affymetrix chips

Affymetrix has several formats, DAT (image), CEL (probe), and CHP. [Explain what these formats are]

The data structure for cDNA arrays

A cDNA array (spotted array) is interesting from a statistical perspective because there is a pairing that adds precision but which also adds a layer of complexity.

Red signal is also called the Cy5 signal. The green signal is also called the Cy3 signal.

Typically, data from these arrays appears in large text files with tab or comma delimiters. There are several header lines at the top of the file before you see data on individual spots. Because these formats are text files, you can manipulate them easily. But there is only limited standardization of these files at this stage. A nice summary of the conflicting data formats appears at www.scmbb.ulb.ac.be/~jvanheld/web_course_microarrays/practicals/data_formats.html.

There is usually one file per slide with a header of several lines. Here's an example of the header file for array lc4b007 from the website providing supplemental data for the Alizadeh 2000 Nature Study. This file was produced by the ScanAlyze software system which is available for free for academic and non-profit researchers from Eisen Lab at Lawrence Berkeley National Lab..

The first line provides names for each of the columns to follow. The SPOT, GRID, TOP, LEFT, BOT, RIGHT ROW, and COL give information about the physical location of the spot. The values for CH*I (* equals 1 for channel 1 and 2 for channel 2) are the green and red intensities for an individual spot on the array. The values for CH*B are median background intensities and CH*BA are mean background intensities. SPIX and BGPIX are the number of pixels used for the spot and the background, respectively. The quantities MRAT, REGR, CORR, and LFRAT represent quality checks on the spot image. In a large production environment, these values might be useful in a quality control chart. Additional parameters for assessing the quality of a spot are CH*GTB1 which represent how many pixels in the spot exceed background. For a weak spot, these values will be close to 0.5. Alternate measures CH*GTB2, CH*KSD, and CH*KSP represent other criteria for identifying weak spots. FLAG is a user defined variable to identify spots that the user has special information on.

Here's what the header looks like for another file, produced locally with Affymetrix Jaguar Image Analysis software:

The spot intensities are stored under "Girl 647" and "Normal 555." The standard deviation of the spot intensity is labelled SD, Pixels is the number of pixels in the spot, BG is the intensity of the background, and BGSD is the standard deviation of the background. The 647/555 column represents the ratio of spot intensities (unadjusted for background levels).

Genepix uses a format (GPR) that has similar information about the signal and background, with some quality checks along the lines of those produced by SnanAlyze software. The specification for the GPR format is on the web at www.axon.com/gn_GenePix_File_Formats.html#gpr.

The GPR format is also a text file. It has several header lines that give information about the particular experiment. After the header lines comes data about particular spots:

The first seven columns of data give the location and information about individual spots. The green signal information is labeled as 635 and the red signal information is labeled as 532, though later in the file the signals are 1 and 2, respectively. The letter F refers to the Foreground and B refers to the background.

Spot is an open source software program for image analysis of microarrays that also uses the R programming language. The files produced by Spot has a single header line

The letter G represents the green signal, R represents red. and bg represents background information.

Layout files

A second file will provide information linking particular spots for an array to gene names. For the Alizadeh study, this appears as two separate text files. The first text file links spots on a particular slide to an inhouse DNA ID number. The second text file links the inhouse DNA ID number to CLONEID and curated names. This is a bit messy because several different batches of chips were used and the gene locations moved around from one batch to another.

Genepix software has a format for a layout file (GAL format) that appears to be widely used. The specification for the GAL format is on the web at www.axon.com/gn_GenePix_File_Formats.html#gal.

Here the header lines for a GAL file that comes with Bioconductor.

The "19 5" in the second line tells you that there are 19 header lines and that the information about individual spots comes in 5 columns. This particular array has a four by four grid of printing tips and the Block information gives details about where these tips produced spots on the microarray. On this microarray, each spot is uniquely identified by the printing tip (block), row and column. Id is an internal code name and Name is the actual name of the gene. 

If your layout information does not come in a GAL file, it is not too difficult to convert it to this format. You can skip all of the block information, if you like. It helps some software link particular data values to locations on the image file, but that is completely optional. A minimal GAL file would have the following header:

If your data has only one block then number that block "1" for all of the spots on your array.

Here's some R code that reads in the two layout files for the Alizadeh study, selects the rows corresponding to the

Image files

In addition to these text files, microarray data sets will often include an image of the array in TIFF format.

Reading information into Bioconductor

In all of these files, the four key pieces of information you definitely need are the green signal, the green background, the red signal, and the red background.

The marrayInput library of Bioconductor has

for reading various microarray formats. While you are reading in the data, you can also specify layout values, probe sequence information, and/or target sample information.

MAGE-ML or Microarray Gene Expression Markup Language

[Explain]

Importing data from the prenatal liver study

I received 22 Excel files for this project.

These were tab delimited files, with key information stored in the name of the file itself. The filename can be split into eight pieces:

 The first line in each file included the following names for the data:

The second column, the signal, is the most important piece of information. Third column gives a detection code based on the fourth column, the detection p-value. The three codes are:

I had to convert from a tab delimited file to a comma separated file. There are several ways to do this. For example, you can read the tab delimited file into Excel and then save it as a .CSV file). I found it faster to search and replace the TAB character with a comma. It is hard to search directly for a tab character, but if your software allows it, you can look for the ASCII code 09.

Once I had the comma separated values, I created the pieces of the file separately and pasted them together. This took longer than just typing the names of the files, but in the long run I save time because several of these pieces become important variables in the analysis.

Here is the R code for importing this data.

us <- "_"
dir <- "x:/sleeder/csv/"
id1 <- as.character(c(6286:6294,7446:7458))
loc <- c(   "UM",   "UM",   "UM",    "H",    "H",    "H",
             "H",    "H",    "H",   "UM",   "UM",   "UM",
            "UM",    "H",    "H",    "H",    "H",    "H",
             "H",   "WU",   "WU",   "WU")
id2 <- c( "1589", "1589", "1589","18058","18058","18058",
         "17869","17869","17869", "1690", "1621", "1631",
          "1566","18354","18381","18390","18401","18508",
	 "18535", "0831", "3881", "5025")
tis <- c(   "Ki",   "Li",   "Lu",   "Ki",   "Lu",   "Li",
            "Lu",   "Ki",   "Li",   "Li",   "Li",   "Li",
            "Li",   "Li",   "Li",   "Li",   "Li",   "Li",
            "Li",   "Li",   "Li",   "Li")
age <- c(  "133",  "133",  "133",  "098",  "098",  "098",
           "075",  "075",  "075",  "140",  "130",  "133",
           "134",  "096",  "096",  "094",  "076",  "076",
	   "075",  "1.7",  "3.0",  "2.7")
unt <- c(     "",     "",     "",     "",     "",     "",
              "",     "",     "",     "",     "",     "",
              "",     "",     "",     "",     "",     "",
              "",    "y",    "y",    "y")
sex <- c(    "F",    "F",    "F",     "F",    "F",   "F",
             "X",    "X",    "X",     "M",    "M",   "M",
             "M",    "M",    "M",     "M",    "X",   "X",
             "X",    "F",    "M",     "F")
rac <- c(   "AA",   "AA",   "AA",   "XX",   "XX",   "XX",
            "CA",   "CA",   "CA",   "CA",   "CA",   "AA",
            "AA",   "XX",   "XX",   "XX",   "XX",   "XX",
            "XX",   "CA",   "AA",   "XX")
ext <- ".csv"

fz <- paste(dir,id1,us,loc,id2,tis,us,age,unt,us,sex,rac,ext,sep="")

signal.all <- matrix(-1,54675,22)
detect.all <- matrix(-1,54675,22)
for (i in 1:22) {
	tmp <- read.csv(fz[i])
	signal.all[,i] <- tmp[,2]
	detect.all[,i] <- tmp[,3]
}

The value of breaking the filenames up into segments becomes apparent when you want to look at a particular subset of the genes. For example, the R code

signal.li <- signal.all[,tis=="Li"]

will create an array consisting of the 16 chips associated with liver tissue. We can get the gender information for these 16 patients with the command

gender.li <- gender[tis=="Li"]

and so forth.

Averting a disaster in the prenatal liver study

When I was trying to normalize the data, I noticed that three of the arrays had rather unusual properties. When trying to normalize array 6287 versus the median array, the R vs I plot looked like

which was much more scattered than most of the other plots, such as 7446.

When I plotted pairs of arrays versus each other, it became even more apparent. Here is what 6287 versus 7446 looked like.

Compare this to 7446 versus 7447.

It turns out that the order of the genes were not the same in all of the files. For example in file 6287, the first ten genes were

  1. 1007_s_at
  2. 1053_at
  3. 117_at
  4. 121_at
  5. 1255_g_at
  6. 1294_at
  7. 1316_at
  8. 1320_at
  9. 1405_i_at
  10. 1431_at

while in file 7446, the first ten genes were

  1. 117_at
  2. 121_at
  3. 177_at
  4. 179_at
  5. 320_at
  6. 336_at
  7. 564_at
  8. 632_at
  9. 823_at
  10. 1053_at

By assuming that all the files listed their genes in the exact same order, I had effectively shuffled the values of three of the arrays and effectively ruined any analyses. To fix this, I had to sort the CSV files to insure that the gene names were in the same order for each file. Then I added a couple of extra lines of code to double-check that the files were now in a consistent order. First, I got the probeset list from the first file. Then when reading in the remaining files, I compared the probeset list to the first file. If there were any mismatches, then the sum would equal a value larger than zero.

tmp <- read.csv(fz[1])
signal.all[,1] <- tmp[,2]
detect.all[,1] <- tmp[,3]
gene.probeset <- trimWhiteSpace(as.character(tmp$Probeset))

for (i in 2:22) { tmp <- read.csv(fz[i])
        signal.all[,i] <- tmp[,2]
        detect.all[,i] <- tmp[,3]
        check.probeset <- trimWhiteSpace(as.character(tmp$Probeset))
        print(sum(gene.probeset!=check.probeset))
}

I should have been more careful at the beginning, but at least I caught the problem before I ran any serious analyses. Whew!


Data management in a microarray experiment (no date) Category: Data mining [incomplete]

Selecting a subset of genes from the prenatal liver study

I was asked to concentrate on a set of genes associated with the Folate pathway. This list of 43 genes,

ABCB1, ABCC1, ABCC3, ABCC4, ABCG2, AHCYL1, AHCYL1, AMT, ATIC, BHMT, BHMT2, CBS, CTH, DHFR, FOLH1, FOLH2, FOLR1, FOLR2, FOLR2L, FOLR3, FPGS, FTCD, FTH1, FTHFD, FTHFSDC1, GART, GCH1, GGH, GNMT, MAT1A, MAT2A, MAT2B, MTHFD1, MTHFD2, MTHFR, MTHFS, MTR, MTRR, PPAT, PTS, RUVBL2, SHMT1, SLC19A1, and TYMS

were stored in an Excel file called FolateGeneList.xls. I converted this file to a csv format, and read it into R.

f0 <- "x:/sleeder/csv/FolateGeneList.csv"
folate.dat <- read.csv(f0,header=F)

Unfortunately, none of these names matched the names found in the Probeset column of the data sets. I had to look instead at a separate file, HG-U133_Plus_2_annot.xls, which I also converted to csv format and read into R.

f1 <- "x:/sleeder/csv/HG-U133_Plus_2_annot.csv"
annotate.dat <- read.csv(f1)
dim(annotate.dat)

[1] 54675  43

names(annotate.dat)

 [1] "Probe.Set.ID"                     "GeneChip.Array"
 [3] "Species.Scientific.Name"          "Annotation.Date"
 [5] "Sequence.Type"                    "Sequence.Source"
 [7] "Transcript.ID"                    "Target.Description"
 [9] "Representative.Public.ID"         "Archival.UniGene.Cluster"
[11] "UniGene.ID"                       "Genome.Version"
[13] "Alignments"                       "Gene.Title"
[15] "Gene.Symbol"                      "Chromosomal.Location"
[17] "Unigene.Cluster.Type"             "Ensembl"
[19] "LocusLink"                        "SwissProt"
[21] "EC"                               "OMIM"
[23] "RefSeq.Protein.ID"                "RefSeq.Transcript.ID"
[25] "FlyBase"                          "AGI"
[27] "WormBase"                         "MGI.Name"
[29] "RGD.Name"                         "SGD.accession.number"
[31] "Gene.Ontology.Biological.Process" "Gene.Ontology.Cellular.Component"
[33] "Gene.Ontology.Molecular.Function" "Pathway"
[35] "Protein.Families"                 "Protein.Domains"
[37] "InterPro"                         "Trans.Membrane"
[39] "QTL"                              "Annotation.Description"
[41] "Annotation.Transcript.Cluster"    "Transcript.Assignments"
[43] "Annotation.Notes"

This file is so complex in part because there are so many different ways that you can name a gene. For this particular experiment, here is what the columns represent.

  1. "Probe.Set.ID" is a medium length string (6-27 characters) that represents the Affymetrix name. The values are unique. Examples include 236631_at and AFFX-r2-Ec-bioB-M_at.
  2. "GeneChip.Array" is Human Genome U133 Plus 2.0 Array for all rows.
  3. "Species.Scientific.Name" is Homo sapiens for all rows.
  4. "Annotation.Date" is  7-Dec-04 for all rows.
  5. "Sequence.Type" is Consensus sequence (39,534), Exemplar sequence (15,079), or Control sequence (62). "An Exemplar is a single nucleotide sequence taken directly from a public database. This sequence could be an mRNA or EST. A Consensus sequence, is a nucleotide sequence assembled by Affymetrix, based on one or more sequence taken from a public database." www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GPL93
  6. "Sequence.Source" is GenBank (54,600) or Affymetrix Proprietary Database (75).
  7. "Transcript.ID" is a medium length string (6-24 characters). The values are not unique. Examples include Hs.121419.0 and Hs.138751.0. It is never missing in this data set.
  8. "Target.Description" is a long text string (56 to 1,023 characters). An example is U48705 /FEATURE=mRNA /DEFINITION=HSU48705 Human receptor tyrosine kinase DDR gene, complete cds 51842 Levels: Cluster Incl. AB001325:Human AQP3 gene for aquaporine 3 (water channel), partail cds /cds=(60,938) /gb=AB001325 /gi=1854373 /ug=Hs.234642 /len=1442 ...
  9. "Representative.Public.ID" is a medium length string (6-24 characters). The values are not unique. Examples include NM_001785 and AL525367. It is never missing in this data set.
  10. "Archival.UniGene.Cluster" is a short text string (3-17 characters). The values are not unique and it is missing for 12,430 rows. This and the next variable are part of "an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location." Examples include Hs.220998, --- /// Hs.900316, and Hs.6336.
  11. "UniGene.ID" is a medium length string (3-135 characters). The values are not unique and they are missing for 12,878 rows. Examples include Hs.278441, Hs.383534 /// Hs.449450 /// Hs.511725 /// Hs.511727 /// Hs.511728 /// Hs.534484 and Hs.249587 /// Hs.406224.
  12. "Genome.Version" is May 2004 (NCBI 35) for all rows,
  13. "Alignments" is a long text string (3-1,004 characters). An example is chr6:30964144-30975910 (+) // 95.63 // p21.33 50315 Levels: --- chr1:100038786-100101597 (+) // 99.57 // p21.2 ... chrY:9897982-9899549 (+) // 98.23 // p11.2 /// chrY:9938577-9940161 (+) // 96.58 // p11.2 /// chrY:9238824-9240391 (+) // 98.16 // p11.2 /// chrY:9279448-9281015 (+) // 98.16 // p11.2 /// chrY:9958907-9960474 (+) // 98.16 // p11.2 /// chrY:9218491-9220076 (+) // 96.51 // p11.2 /// chrY:9259103-9260669 (+) // 98.16 // p11.2 /// chrY:9918294-9919879 (+) // 96.39 // p11.2 /// chrY:6157682-6159243 (+) // 97.72 // p11.2 /// chrY:9979090-9980643 (+) // 90.56 // p11.2.
  14. "Gene.Title" is a long text string (3-30,000 characters). It is missing for 12,267 rows. An example is androgen receptor (dihydrotestosterone receptor; testicular feminization; spinal and bulbar muscular atrophy; Kennedy disease) /// androgen receptor (dihydrotestosterone receptor; testicular feminization; spinal and bulbar muscular atrophy; Kennedy disease).
  15. "Gene.Symbol" is a long text string (1-8,911 characters). Examples include RANBP2, MGC27165 /// IGH@ /// IGHG1, and PLK4. The values are not unique and they are missing for 16,943 rows.
  16. "Chromosomal.Location" gives the physical location of the gene. Examples include chr5q35.3 and chr6p22.1-p21.2. It is missing for 15,891 rows.
  17. "Unigene.Cluster.Type" is full length (33,859), est /// full length (30), or est (3). It is missing for 20,783 rows.
  18. "Ensembl" Examples include ENSG00000166478 and ENSG00000006606.  It is missing for 22,243 rows.
  19. "LocusLink" is a classification scheme developed by the National Center for Biotechnology Information, but it has been superceded by Entrez Gene. Examples include 399655 /// 9534 and 440033. It is missing for 15,777 rows.
  20. "SwissProt" is a repository for protein data developed by the Swiss Institute of Bioinformatics and the European Bioinformatics Institute. Examples include Q8N6H1 /// Q6UXD7 /// Q6XYD4 /// Q9H6H6 and P29992 /// Q96DH5. It is missing for 19,162 rows.
  21. "EC" is the Enzyme Commission family number.
  22. "OMIM" is the Online Mendelian Inheritance in Man accession number.
  23. "RefSeq.Protein.ID" is the ID of the protein sequence in the NCBI RefSeq database.
  24. "RefSeq.Transcript.ID"
  25. "FlyBase"
  26. "AGI"
  27. "WormBase"
  28. "MGI.Name"
  29. "RGD.Name"
  30. "SGD.accession.number"
  31. "Gene.Ontology.Biological.Process"
  32. "Gene.Ontology.Cellular.Component"
  33. "Gene.Ontology.Molecular.Function"
  34. "Pathway"
  35. "Protein.Families"
  36. "Protein.Domains"
  37. "InterPro"
  38. "Trans.Membrane"
  39. "QTL"
  40. "Annotation.Description"
  41. "Annotation.Transcript.Cluster"
  42. "Transcript.Assignments"
  43. "Annotation.Notes"

EC Enzyme Commission family number. OMIM OMIM: Online Mendelian Inheritance in Man accession number. RefSeq Protein ID ID of the protein sequence in the NCBI RefSeq database. RefSeq Transcript ID References to multiple sequences in RefSeq. The field contains the ID and Description for each entry, and there can be multiple entries per ProbeSet. Gene Ontology Biological Process Gene Ontology Consortium Biological Process derived from LocusLink. Each annotation consists of three parts: "Accession Number // Description // Evidence". The description corresponds directly to the GO ID. The evidence can be "direct", or "extended". Gene Ontology Cellular Component Gene Ontology Consortium Cellular Component derived from LocusLink. Each annotation consists of three parts: "Accession Number // Description // Evidence". The description corresponds directly to the GO ID. The evidence can be "direct", or "extended". Gene Ontology Molecular Function Gene Ontology Consortium Molecular Function derived from LocusLink. Each annotation consists of three parts: "Accession Number // Description // Evidence". The description corresponds directly to the GO ID. The evidence can be "direct", or "extended". Also see www.affymetrix.com/support/technical/manual/taf_manual.affx

Obviously, we need to match with column 15 (Gene.Symbol). One difficulty is that the order that the probesets are presented in the HG-U133_Plus_2_annot file is not the same as the order that the probesets are presented in data files. So you have to match twice. First you find the Gene.Symbol values in the annotate data set that match the symbols in the FolateGeneList file. Get the associated Probe.Set.ID value from the HG-U133_Plus_2_annot file. Now match this with the appropriate probeset values in the data files. Here is the R code to do all this.

f0 <- "x:/sleeder/csv/FolateGeneList.csv"
folate.dat <- read.csv(f0,header=F)
f1 <- "x:/sleeder/csv/HG-U133_Plus_2_annot.csv"
annotate.dat <- read.csv(f1)
folate.symbol <- trimWhiteSpace(as.character(folate.dat$V1))
annotate.symbol <- trimWhiteSpace(as.character(annotate.dat$Gene.Symbol))
annotate.probeset <- trimWhiteSpace(as.character(annotate.dat$Probe.Set.ID))
symbol.match <- annotate.symbol %in% folate.symbol
folate.probeset <- annotate.probeset[symbol.match]
probeset.match <- gene.probeset %in% folate.probeset

Now you might want to know what the symbols are for each of the probeset.match values. You just reverse the search process.

reverse.match <- annotate.probeset %in% gene.probeset[probeset.match]
gene.symbol <- annotate.symbol[reverse.match]

Now let's see what sort of graphs we can draw using the folate pathway. Note that the 14th through 16th values represent postnatal ages in years and have to be converted to post-conceptual age in days to match the rest of the data.

age.pca <- as.numeric(age[tis=="Li"])
age.pca[14:16] <- age.pca[14:16]*365+270
age.group <- (age.pca>76)+(age.pca>98)+(age.pca>140)
win.print(width=7,height=7,pointsize=12,printer="Adobe PDF")
for (i in 1:94) {
        plot(age.group,folate.signal[i,])
        title(gene.symbol[i],"0=75-76, 1=94-98, 2=130-140, 3=postnatal")
}
dev.off()

Dumping data to a text file

In the prenatal liver study, I needed to give some of the normalized gene expression levels to a researcher in a form he could use. The data he needed was in a data frame with 94 rows and 16 columns (folate.signal). But unfortunately, the names of the rows (gene.symbol) and columns (liver.names) were stored in separate objects. Here's one way to match the values back up.

 First, there are duplicates in the gene.symbol list, so to create new names, use the makeUnique function found in the limma library. Then change folate.signal from a data frame to a matrix. Use the dimnames function to add the row and column names to the matrix. Finally, use the write.table function to create a text file.

gene.unique <- makeUnique(paste(gene.symbol," "))
folate.matrix <- as.matrix(folate.signal)
dimnames(folate.matrix) <- list(gene.unique,liver.names)
write.table(folate.matrix,"x:/sleeder/folate.txt")


Software for microarray data analysis (no date) Category: Data mining [incomplete]

R and the Bioconductor package

There is a wide range of software available for the analysis of microarrays. I will use Bioconductor which is a set of libraries for a statistical programming language called R. Both Bioconductor and R are open source, which means that you can obtain the pacakge at no cost. Open source also means that you can modify the program, if you are so inclined, to add new features. Most of us (myself especially) do not have the programming skills to modify Bioconductor or R, but there are scores of graduate students and faculty who do have this skill and who open source software as a way to quickly publicize and promote their research. As a result, R and Bioconductor have some of the best and most cutting edge analysis methods around.

Instructions for setting up R on your computer are on the web at

For Windows users, this simply means downloading and running an install program. To install Bioconductor, type in the following two commands:

This will get the default set of packages. You can install a particular set of packages, such as the "affy" packages by typing in

and if you have a high speed connection and some free time,  you can download the entire Bioconductor system by typing


Design of microarray experiments (no date) Category: Data mining [incomplete]

There are a variety of research designs that you can use in a microarray experiment.

Designs for cDNA arrays

Loop designs [explain]

A dye swap design use the treated DNA as red on the first array and as green on the second array. The control is green on the first array and red on the second. A second pair of treated and control DNA would use a third and fourth array.

In contrast, a loop design has overlap between pairs of treated and control DNA. It still uses the first treated DNA sample as red on the first array and green on the second. The first control uses red on the second array and green on the third. The second treated DNA sample is red on the third array and green on the fourth. This continues similarly. The last control DNA sample is red on the last array and green on the first array. The return to the last sample to the first array is the loop design.

Expression level polymorphism is a quantitative variation in gene expression associated with a complex process such as resistance.

Duplicate samples give the biggest boost in power. Triplicates and higher levels of replication provide smaller but still significant gains.

Designs for Affy arrays

[Explain]


Normalization for microarray data (no date) Category: Data mining [incomplete]

Normalization is the process of adjusting values in a microarray experiment to improve consistency and reduce bias.

Image processing

[explain]

Background correction

[explain]

Log transformation

[explain]

Normalization

If you run the same biological sample on two separate microarrays (or on both the red and the green channel of a two color array), you will get slightly different results. This is just part of the inherent variation that you have with any laboratory assay.

Normalization is a method that attempts to remove some of this variation. There are several approaches which can be used separately or in combination to normalize a set of microarrays.

1. Multiply each array by a constant to make the mean (median) intensity the same for each individual array.

2. Adjust the arrays using some control or housekeeping genes that you would expect to have the same intensity level across all of the samples.

3. Match the percentiles of each array.

4. Adjust using a nonlinear smoothing curve.

5. Adjust using control genes

Amaratunga and Cabrera (2004) include a library of R functions with their book and that library includes several sample data sets. The data set mice2.txt represents a simple microarray experiment with four control mice and four treated mice. There is substantial variation in the arrays.

The following program reads the mice2.txt file, compute the base 2 logarithm for each value, and then estimates the average intensity.

rDirectory <- "c:/Program Files/R/rw1090"
FileName <- "/library/DNAMR/data/mice2.txt"
m2.dat <- read.table(file=paste(rDirectory,FileName,sep=""))
m2.log <- log(m2.dat,base=2)
m2.array.means <- apply(m2.log,2,mean)
print(round(m2.array.means,2))

Here is what the output looks like (I slightly modified the appearance of this output and some of the later output so it would fit better on this web page).

cnt1011a1 9.07
cnt1011a2 9.03
cnt1011b1 8.93
cnt1011b2 9.2
trt2501a1 8.8
trt2501a2 9.08
trt2501b1 9.49
trt2501b2 9.23

boxplot(split(as.matrix(m2.log),col(as.matrix(m2.log))))

Notice that even within the control and treatment arrays there is some variation in the average intensity. There are several factors that can cause this. Perhaps one array got slightly more DNA, or maybe there are slight variations during the production of the arrays. Maybe there were variations in the laboratory environment (temperature or humidity) during the preparation of these samples that influenced the readings.

Here's a direct comparison of the seventh and eighth arrays.

par(mar=c(5,4,0,0)+0.1)
plot(m2.log[,7],m2.log[,8],
  xlim=range(m2.log),ylim=range(m2.log))
abline(a=0,b=1)

Notice how the bulk of the data lies below the diagonal line. This is an indication that signal intensity was lower, across the board, for the eighth array.

What you are seeing with these eight arrays is not too much different than what you might experience if you burn your own mix of songs onto a CD. Some of the songs might be recorded at a louder or softer volume than others, and if you take no steps to adjust the recording, then you would have to be constantly adjusting your volume as you listen to that CD.

The simplest adjustment is to take each signal and divide it by the average signal for each array. This guarantees that the adjusted signals will all have a mean of 1.0. Here is the R code to normalize each array to have the same mean intensity.

m2.n.genes <- dim(m2.log)[[1]]
m2.n.arrays <- dim(m2.log)[[2]]
m2.normalize1 <- matrix(NA,m2.n.genes,m2.n.arrays)
for (i in 1:m2.n.arrays) {
  m2.normalize1[,i] <- m2.log[,i]/mean(m2.log[,i])
}

By the way, experts in R would probably point out the inefficiencies in this code. In general, if you can use matrix operations to avoid using loops, you should. Here's a bit more efficient code

one <- matrix(1,3434,1)
m2.normalize1a <- m2.log / (one%*%m2.array.means)

You might wonder if this normalization is good or bad. It is possible, perhaps, that the treatment changes the cells by upregulating most genes to produce more mRNA. This is indeed possible, and normalization would effectively remove most of the effect of the treatment before you had a chance to analyze it.

Fortunately, the more likely scenario is that a treatment would upregulate only a few genes, downregulate only a few more genes, and would leave the vast majority of genes unaffected. There's safety in numbers for most microarrays, so normalization does not end up distorting your data.

Since the mean is influenced by outliers, some people prefer to adjust by the median intensity level rather than the mean level. Another commonly used choice is to adjust using the 75th percentile. The rationale for this choice is that about half the genes might not show any significant expression, so the 75th percentile represents the median of the remaining 50%.

> round(f.concor(m2.log),2)

     cna1 cna2 cnb1 cnb2 tra1 tra2 trb1 trb2
cna1 1.00 0.97 0.97 0.96 0.78 0.80 0.75 0.78
cna2 0.97 1.00 0.97 0.96 0.78 0.81 0.76 0.79
cnb1 0.97 0.97 1.00 0.96 0.82 0.80 0.75 0.79
cnb2 0.96 0.96 0.96 1.00 0.77 0.81 0.79 0.81
tra1 0.78 0.78 0.82 0.77 1.00 0.92 0.85 0.89
tra2 0.80 0.81 0.80 0.81 0.92 1.00 0.92 0.96
trb1 0.75 0.76 0.75 0.79 0.85 0.92 1.00 0.95
trb2 0.78 0.79 0.79 0.81 0.89 0.96 0.95 1.00

> round(f.concor(m2.normalize1),2)

     cna1 cna2 cnb1 cnb2 tra1 tra2 trb1 trb2
cna1 1.00 0.97 0.98 0.97 0.80 0.80 0.79 0.79
cna2 0.97 1.00 0.97 0.97 0.79 0.81 0.80 0.80
cnb1 0.98 0.97 1.00 0.98 0.82 0.81 0.80 0.81
cnb2 0.97 0.97 0.98 1.00 0.80 0.82 0.81 0.81
tra1 0.80 0.79 0.82 0.80 1.00 0.94 0.95 0.94
tra2 0.80 0.81 0.81 0.82 0.94 1.00 0.96 0.97
trb1 0.79 0.80 0.80 0.81 0.95 0.96 1.00 0.96
trb2 0.79 0.80 0.81 0.81 0.94 0.97 0.96 1.00

plot(m2.normalize1[,7],m2.normalize1[,8],
  xlim=range(m2.normalize1),ylim=range(m2.normalize1))
abline(a=0,b=1)

boxplot(split(as.matrix(m2.normalize1),
  col(as.matrix(m2.normalize1))))

This normalization helps only a little. There is still too much of the data below the diagonal line. The problem is that the variation from array to array is often intensity dependent and these arrays show systematic variations at low intensities that differ from variations seen at medium and high intensities.

Another approach is quantile normalization. If you look at the sorted values of any two arrays, they will deviate from an identity line.

plot(sort(m2.log[,7]),sort(m2.log[,8]),
  xlim=range(m2.log),ylim=range(m2.log))
abline(a=0,b=1)

Quantile normalization will match up these values across all the arrays so that the smallest value on each array is identical, the second smallest is identical, and so forth. Note that the smallest value for one array might be a different gene than the smallest value on another array.

The DNAMR package includes a function, f.qn() that performs quantile normalization. Let's apply this to the data and see what we get.

library("DNAMR")
m2.normalize2 <- f.qn(m2.log)
plot(sort(m2.normalize2[,7]),sort(m2.normalize2[,8]),
 xlim=range(m2.normalize2),ylim=range(m2.normalize2))
abline(a=0,b=1)

> round(f.concor(m2.normalize2),2)

     cna1 cna2 cnb1 cnb2 tra1 tra2 trb1 trb2
cna1 1.00 0.97 0.98 0.97 0.80 0.82 0.83 0.81
cna2 0.97 1.00 0.97 0.97 0.80 0.82 0.83 0.81
cnb1 0.98 0.97 1.00 0.99 0.82 0.82 0.84 0.83
cnb2 0.97 0.97 0.99 1.00 0.81 0.82 0.83 0.82
tra1 0.80 0.80 0.82 0.81 1.00 0.95 0.97 0.94
tra2 0.82 0.82 0.82 0.82 0.95 1.00 0.96 0.97
trb1 0.83 0.83 0.84 0.83 0.97 0.96 1.00 0.97
trb2 0.81 0.81 0.83 0.82 0.94 0.97 0.97 1.00

plot(m2.normalize2[,7],m2.normalize2[,8],
  xlim=range(m2.normalize2),ylim=range(m2.normalize2))
abline(a=0,b=1)
 

boxplot(split(as.matrix(m2.normalize2),
  col(as.matrix(m2.normalize2))))

Notice that the concordance correlations are just a bit better. The seventh and eighth arrays line up better as well, with about half the genes above and below the diagonal.

It's interesting to look at the code behind the f.qn() function.

> f.qn
function (x)
{
  xm <- apply(x, 2, sort)
  xxm <- f.rmedian.na(xm)
  xr <- c(apply(x, 2, rank))
  array(approx(1:nrow(x), xxm, xr)$y, dim(x), dimnames(x))
}

The program creates a temporary matrix xm consisting of the sorted columns of x. Then it computes a median across all the rows of xm. Then the approx() function applies linear interpolation to the data.

A third approach to normalization uses smoothing curves to adjust. Define a median signal across all the arrays and then look at the relationship between the smooth fit and the diagonal line. Adjust the array values up or down depending on whether the smooth fit is below or above the diagonal line.

m2.median <- apply(m2.log,1,median)
m2.deviation <- matrix(NA,m2.n.genes,m2.n.arrays)
for (i in 1:8) {
  smooth.fit <- fitted(loess(m2.log[,i]~m2.median))
  m2.deviation[,i] <- smooth.fit - m2.median
  plot(m2.median,m2.log[,i],
    xlim=range(m2.log),ylim=range(m2.log))
  abline(a=0,b=1)
  points(m2.median,smooth.fit,col=3)
}

m2.normalize3 <- m2.log - m2.deviation
round(f.concor(m2.normalize3),2)


     cna1 cna2 cnb1 cnb2 tra1 tra2 trb1 trb2
cna1 1.00 0.98 0.98 0.98 0.83 0.82 0.80 0.82
cna2 0.98 1.00 0.98 0.98 0.82 0.82 0.79 0.81
cnb1 0.98 0.98 1.00 0.99 0.85 0.83 0.81 0.82
cnb2 0.98 0.98 0.99 1.00 0.82 0.81 0.78 0.80
tra1 0.83 0.82 0.85 0.82 1.00 0.95 0.96 0.94
tra2 0.82 0.82 0.83 0.81 0.95 1.00 0.95 0.96
trb1 0.80 0.79 0.81 0.78 0.96 0.95 1.00 0.96
trb2 0.82 0.81 0.82 0.80 0.94 0.96 0.96 1.00

plot(m2.normalize3[,7],m2.normalize3[,8],
  xlim=range(m2.normalize2),ylim=range(m2.normalize2))
abline(a=0,b=1)

boxplot(split(as.matrix(m2.normalize3),
  col(as.matrix(m2.normalize3))))

This normalization seems to work about as well as the quantile normalization did.

You can get a better feel for the need for normalization by using an MvA plot. This plot effectively rotates the plot 45 degrees so that the diagonal reference line becomes a horizontal reference line. Here's some code for creating an MvA plot and an example of how that plot differs from the one shown just above.

mva.plot <- function(x,y) {
  m <- y-x
  a <- (x+y)/2
  plot(a,m)
  abline(h=0)
}
mva.plot(m2.normalize3[,7],m2.normalize3[,8])

You can do your normalization calculations on the MvA plot, though they should not be significantly different from the  normalizations described above. The MvA plot is also called a Bland-Altman plot.

Depending on what you know about the microarrays, you might want to consider additional normalization steps. For example, if you know the physical location of the spots on the microarray, you might want to see if there are physical regions of the array where signals are stronger or weaker. If the array was printed used a set of 16 print tips, you might want to see if some of the tips produced stronger or weaker signals than the others.

I did not discuss the use of control genes for normalization. There are several problems with the use of control genes. First, these genes might not cover the full range of intensity levels on the array. Second, there is some question about whether certain genes will truly serve effectively as controls, and they may not have the same expression levels across different samples.

Once the data is normalized, you may wish to store in an object of class "exprSet". Here are the R commands to do this.

library("marray")
m2.indicators <- data.frame(ic=c(1,1,1,1,0,0,0,0),
  it=c(0,0,0,0,1,1,1,1),row.names=dimnames(m2.dat)[[2]])
covLabels <- list("Indicator for Control","Indicator for Treatment")
names(covLabels) <- names(m2.indicators)
m2.phenoData <- new("phenoData",pData=m2.indicators,varLabels=covLabels)
m2.exprSet <- new("exprSet",exprs=as.matrix(m2.normalize3),
  phenoData=m2.phenoData)
print(m2.exprSet)

Expression Set (exprSet) with
3434 genes
8 samples
phenoData object with 2 variables and 8 cases
varLabels
ic: Indicator for Control
it: Indicator for Treatment

You can then use this object with most of the Bioconductor routines for differential expression, clustering, and so forth.


Differential expression in microarray data (no date) Category: Data mining [incomplete]

You can compute an expression ratio for each gene by taking the average of the log expression levels in the treatment group and subtracting the average of the log expression levels in the control group. This actually produces a log ratio, and you can compute the actual ratio by taking the antilog.

m2.control <- as.matrix(m2.normalize3[,1:4])
m2.treat <- as.matrix(m2.normalize3[,5:8])
m2.logratio <- apply(m2.treat,1,mean)-apply(m2.control,1,mean)

Which of these expression levels is significantly different from 1? The answer to this question is surprisingly difficult. Here are some approaches.

  1. Declare any ratio bigger than 2 or smaller than 0.5 to be significant
  2. Compute a two-sample t-test for each gene
  3. Modify the significance level of the t-test using Bonferroni
  4. Add a fudge factor to the denominator of each t-test.
  5. Pool results across similar genes

The simplest approach, but an approach that is probably not defensible, is to state that any ratios beyond a certain threshold represent genes with differential expression. Typically, this approach highlights ratios larger than 2 (and smaller than 1/2) though threshold of 3 (1/3) and 5 (1/5) have also been used.

Although I cannot recommend this approach, here is how it would work in R. Recall that a ratios of 2 and 1/2 correspond to log ratios of -1 and +1 if you use base 2 logarithms.

diff1 <- m2.logratio[abs(m2.logratio)>1]
names(diff1)


[1] "5" "11" "16" "17" "37" "49" "50" "53" "54" "57"
[11] "58" "60" "63" "64" "65" "68" "80" "98" "102" "105"
[21] "107" "116" "125" "133" "134" "147" "150" "158" "162" "166"
[31] "168" "171" "173" "177" "193" "205" "218" "224" "232" "236"
.
.
.
[601] "3253" "3260" "3261" "3271" "3272" "3282" "3289" "3290" "3311" "3331"
[611] "3348" "3350" "3354" "3358" "3367" "3379" "3404" "3421" "3423"

Another approach is to apply a two-sample t-test or a related statistic to each gene.

f.custom.t <- function(x) {
  t.test(x[1:4],x[5:8])$p.value
}
m2.t.tests <- apply(m2.normalize3,1,f.custom.t)
diff2 <- m2.t.tests[m2.t.tests < 0.05]
names(diff2)


[1] "3" "5" "8" "9" "11" "13" "16" "17" "19" "21"
[11] "22" "26" "27" "30" "32" "33" "35" "36" "37" "40"
[21] "42" "47" "49" "50" "53" "54" "55" "57" "58" "59"
[31] "60" "63" "64" "65" "68" "72" "76" "79" "80" "81"
.
.
.
[1731] "3401" "3403" "3404" "3405" "3407" "3408" "3410" "3412" "3421" "3423"
[1741] "3430" "3433"

The problem with this approach, of course, is that you are running thousands of t-test simultaneously. Some of these tests will end up being significant, even if nothing is going on.

To adjust for the large number of multiple tests, you can use a correction, such as Bonferroni. Here's how Bonferroni would work.

diff3 <- m2.t.tests[m2.t.tests < (0.05/m2.n.genes)]
names(diff3)


[1] "5" "64" "102" "224" "232" "305" "311" "339" "354" "384"
[11] "462" "516" "560" "591" "603" "736" "745" "758" "778" "798"
[21] "805" "834" "931" "967" "987" "988" "1000" "1029" "1065" "1103"
[31] "1129" "1157" "1202" "1235" "1263" "1269" "1290" "1305" "1352" "1356"
.
.
.
[101] "2905" "2910" "2990" "3031" "3063" "3142" "3195" "3209" "3220" "3225"
[111] "3252" "3267" "3311" "3354" "3358"

The Bonferroni approach has been criticized for being too conservative. There are several alternatives that look at p-values in sequence and apply a more strict standard for the smallest p-values, but loosen this standard as more p-values are examined.

The Holm-Bonferroni approach is the easiest one to follow. Sort the p-values from smallest to largest. Adjust the smallest p-value by mulitplying by K where K is the number of genes being tested. Multiply the next smallest p-value by (K-1), the next smallest by (K-2) and so forth.

o <- order(m2.t.tests)
m2.sorted.pvalues <- m2.t.tests[o]
m2.sorted.names <- names(m2.t.tests)[o]
m2.adjusted.pvalues <- cummax(pmin(m2.sorted.pvalues*(m2.n.genes:1),1.0))
m2.sorted.names[m2.adjusted.pvalues < 0.05]

[1] "354" "2854" "1103" "1379" "3225" "2252" "2766" "3358" "1458" "1546"
[11] "1696" "3195" "2067" "1690" "2744" "1430" "384" "1629" "560" "1758"
[21] "1269" "3031" "1915" "1499" "2199" "988" "2445" "3063" "1751" "2910"
[31] "798" "1821" "1434" "339" "3252" "3311" "1548" "1352" "745" "2038"
.
.
.
[101] "1931" "3267" "1925" "311" "591" "834" "1889" "1671" "64" "516"
[111] "2905" "2810" "2651" "2863" "2023" "2686"

Differential expression (DE)

DE - Within gene analysis

[Explain]

DE - Borrowing strength across genes

[Explain]

Local pooled error. At various levels of expression, pool standard deviations of multiple genes.

Model variance as a function of the mean. Sometimes this approach is not flexible enough.

Shrinkage estimates. Add a constant to the denominator.

DE - Adjusting for multiple tests

Family Wise Error Rate: Bonferroni, Holm (1979) step down, Hochberg.

False Discovery Rate

Classic reference Alizadeh et al (2000)


Unsupervised learning (no date) Category: Data mining [incomplete]

Here are some documented examples of how to use unsupervised learning methods of the analysis of microarray data.

Unsupervised learning

The examples in this section willl use the Khan data set that is part of the DNAMR library. To read in the Khan data set, use the following commands:

rDirectory <- "c:/Program Files/R/rw1090"
FileName <- "/library/DNAMR/data/Khan.txt"
kh.dat <- read.table(file=paste(rDirectory,FileName,sep=""))
kh.log <- log(kh.dat,base=2)[,-1]

The [,-1] at the end of the last command omits the first column from the data set before computing logs, since the first column is an image id.

Unsupervised learning (class discovery)

Hierarchical clustering, partitioning clustering (silhouette widths), model based clustering.

Clustering methods in the Acuity software system:


Supervised learning (no date) Category: Data mining [incomplete]

Here are some documented examples of how to use supervised learning methods of the analysis of microarray data.

Supervised learning

The Bioconductor package has a sample data set, golubEsets, that is available on the web at

http://www-genome.wi.mit.edu/mpr/data_set_ALL_AML.html

and is based on the publication

Molecular Classification of Cancer: Class Discovery and Class Prediction by Gene Expression. Golub TR, Slonim DK, Tamayo P, Huard C, Gaasenbeek M, Mesirov JP, Coller H, Loh M, Downing JR, Caligiuri MA, Bloomfield CD, Lander ES. Science 1999: 286; 531-537. [PDF]

To get this data, use the following commands:

library("marray")
library("golubEsets")
data(golubTrain)
data(golubTest)
data(golubMerge)

The phenotypic data for the training data set looks like

> pData(golubTrain)

  
Samples ALL.AML BM.PB T.B.cell ... Source
1        1     ALL    BM   B-cell ...   DFCI
2        2     ALL    BM   T-cell ...   DFCI
3        3     ALL    BM   T-cell ...   DFCI
.
.
.
32      32     AML    BM     <NA> ...  CALGB
33      33     AML    BM     <NA> ...  CALGB

table(pData(golubTrain)$ALL.AML)

ALL AML
27 11

table(pData(golubTest)$ALL.AML)

ALL AML
20 14

table(pData(golubMerge)$ALL.AML)

ALL AML
47 25

x <- exprs(GolubTrain)

Supervised learning (class prediction)

Regression analysis, discriminant analysis, cart, neural net

support vector machines: http://www.acm.org/sigs/sigkdd/explorations/issue2-2/bennett.pdf