Data management in a microarray experiment (no date) [incomplete]

This page is moving to a new website.

Selecting a subset of genes from the prenatal liver study

I was asked to concentrate on a set of genes associated with the Folate pathway. This list of 43 genes,

ABCB1, ABCC1, ABCC3, ABCC4, ABCG2, AHCYL1, AHCYL1, AMT, ATIC, BHMT, BHMT2, CBS, CTH, DHFR, FOLH1, FOLH2, FOLR1, FOLR2, FOLR2L, FOLR3, FPGS, FTCD, FTH1, FTHFD, FTHFSDC1, GART, GCH1, GGH, GNMT, MAT1A, MAT2A, MAT2B, MTHFD1, MTHFD2, MTHFR, MTHFS, MTR, MTRR, PPAT, PTS, RUVBL2, SHMT1, SLC19A1, and TYMS

were stored in an Excel file called FolateGeneList.xls. I converted this file to a csv format, and read it into R.

f0 <- "x:/sleeder/csv/FolateGeneList.csv"
folate.dat <- read.csv(f0,header=F)

Unfortunately, none of these names matched the names found in the Probeset column of the data sets. I had to look instead at a separate file, HG-U133_Plus_2_annot.xls, which I also converted to csv format and read into R.

f1 <- "x:/sleeder/csv/HG-U133_Plus_2_annot.csv"
annotate.dat <- read.csv(f1)
dim(annotate.dat)

[1] 54675  43

names(annotate.dat)

 [1] "Probe.Set.ID"                     "GeneChip.Array"
 [3] "Species.Scientific.Name"          "Annotation.Date"
 [5] "Sequence.Type"                    "Sequence.Source"
 [7] "Transcript.ID"                    "Target.Description"
 [9] "Representative.Public.ID"         "Archival.UniGene.Cluster"
[11] "UniGene.ID"                       "Genome.Version"
[13] "Alignments"                       "Gene.Title"
[15] "Gene.Symbol"                      "Chromosomal.Location"
[17] "Unigene.Cluster.Type"             "Ensembl"
[19] "LocusLink"                        "SwissProt"
[21] "EC"                               "OMIM"
[23] "RefSeq.Protein.ID"                "RefSeq.Transcript.ID"
[25] "FlyBase"                          "AGI"
[27] "WormBase"                         "MGI.Name"
[29] "RGD.Name"                         "SGD.accession.number"
[31] "Gene.Ontology.Biological.Process" "Gene.Ontology.Cellular.Component"
[33] "Gene.Ontology.Molecular.Function" "Pathway"
[35] "Protein.Families"                 "Protein.Domains"
[37] "InterPro"                         "Trans.Membrane"
[39] "QTL"                              "Annotation.Description"
[41] "Annotation.Transcript.Cluster"    "Transcript.Assignments"
[43] "Annotation.Notes"

This file is so complex in part because there are so many different ways that you can name a gene. For this particular experiment, here is what the columns represent.

  1. "Probe.Set.ID" is a medium length string (6-27 characters) that represents the Affymetrix name. The values are unique. Examples include 236631_at and AFFX-r2-Ec-bioB-M_at.
  2. "GeneChip.Array" is Human Genome U133 Plus 2.0 Array for all rows.
  3. "Species.Scientific.Name" is Homo sapiens for all rows.
  4. "Annotation.Date" is  7-Dec-04 for all rows.
  5. "Sequence.Type" is Consensus sequence (39,534), Exemplar sequence (15,079), or Control sequence (62). "An Exemplar is a single nucleotide sequence taken directly from a public database. This sequence could be an mRNA or EST. A Consensus sequence, is a nucleotide sequence assembled by Affymetrix, based on one or more sequence taken from a public database." www.ncbi.nlm.nih.gov/projects/geo/query/acc.cgi?acc=GPL93
  6. "Sequence.Source" is GenBank (54,600) or Affymetrix Proprietary Database (75).
  7. "Transcript.ID" is a medium length string (6-24 characters). The values are not unique. Examples include Hs.121419.0 and Hs.138751.0. It is never missing in this data set.
  8. "Target.Description" is a long text string (56 to 1,023 characters). An example is U48705 /FEATURE=mRNA /DEFINITION=HSU48705 Human receptor tyrosine kinase DDR gene, complete cds 51842 Levels: Cluster Incl. AB001325:Human AQP3 gene for aquaporine 3 (water channel), partail cds /cds=(60,938) /gb=AB001325 /gi=1854373 /ug=Hs.234642 /len=1442 ...
  9. "Representative.Public.ID" is a medium length string (6-24 characters). The values are not unique. Examples include NM_001785 and AL525367. It is never missing in this data set.
  10. "Archival.UniGene.Cluster" is a short text string (3-17 characters). The values are not unique and it is missing for 12,430 rows. This and the next variable are part of "an experimental system for automatically partitioning GenBank sequences into a non-redundant set of gene-oriented clusters. Each UniGene cluster contains sequences that represent a unique gene, as well as related information such as the tissue types in which the gene has been expressed and map location." Examples include Hs.220998, --- /// Hs.900316, and Hs.6336.
  11. "UniGene.ID" is a medium length string (3-135 characters). The values are not unique and they are missing for 12,878 rows. Examples include Hs.278441, Hs.383534 /// Hs.449450 /// Hs.511725 /// Hs.511727 /// Hs.511728 /// Hs.534484 and Hs.249587 /// Hs.406224.
  12. "Genome.Version" is May 2004 (NCBI 35) for all rows,
  13. "Alignments" is a long text string (3-1,004 characters). An example is chr6:30964144-30975910 (+) // 95.63 // p21.33 50315 Levels: --- chr1:100038786-100101597 (+) // 99.57 // p21.2 ... chrY:9897982-9899549 (+) // 98.23 // p11.2 /// chrY:9938577-9940161 (+) // 96.58 // p11.2 /// chrY:9238824-9240391 (+) // 98.16 // p11.2 /// chrY:9279448-9281015 (+) // 98.16 // p11.2 /// chrY:9958907-9960474 (+) // 98.16 // p11.2 /// chrY:9218491-9220076 (+) // 96.51 // p11.2 /// chrY:9259103-9260669 (+) // 98.16 // p11.2 /// chrY:9918294-9919879 (+) // 96.39 // p11.2 /// chrY:6157682-6159243 (+) // 97.72 // p11.2 /// chrY:9979090-9980643 (+) // 90.56 // p11.2.
  14. "Gene.Title" is a long text string (3-30,000 characters). It is missing for 12,267 rows. An example is androgen receptor (dihydrotestosterone receptor; testicular feminization; spinal and bulbar muscular atrophy; Kennedy disease) /// androgen receptor (dihydrotestosterone receptor; testicular feminization; spinal and bulbar muscular atrophy; Kennedy disease).
  15. "Gene.Symbol" is a long text string (1-8,911 characters). Examples include RANBP2, MGC27165 /// IGH@ /// IGHG1, and PLK4. The values are not unique and they are missing for 16,943 rows.
  16. "Chromosomal.Location" gives the physical location of the gene. Examples include chr5q35.3 and chr6p22.1-p21.2. It is missing for 15,891 rows.
  17. "Unigene.Cluster.Type" is full length (33,859), est /// full length (30), or est (3). It is missing for 20,783 rows.
  18. "Ensembl" Examples include ENSG00000166478 and ENSG00000006606.  It is missing for 22,243 rows.
  19. "LocusLink" is a classification scheme developed by the National Center for Biotechnology Information, but it has been superceded by Entrez Gene. Examples include 399655 /// 9534 and 440033. It is missing for 15,777 rows.
  20. "SwissProt" is a repository for protein data developed by the Swiss Institute of Bioinformatics and the European Bioinformatics Institute. Examples include Q8N6H1 /// Q6UXD7 /// Q6XYD4 /// Q9H6H6 and P29992 /// Q96DH5. It is missing for 19,162 rows.
  21. "EC" is the Enzyme Commission family number.
  22. "OMIM" is the Online Mendelian Inheritance in Man accession number.
  23. "RefSeq.Protein.ID" is the ID of the protein sequence in the NCBI RefSeq database.
  24. "RefSeq.Transcript.ID"
  25. "FlyBase"
  26. "AGI"
  27. "WormBase"
  28. "MGI.Name"
  29. "RGD.Name"
  30. "SGD.accession.number"
  31. "Gene.Ontology.Biological.Process"
  32. "Gene.Ontology.Cellular.Component"
  33. "Gene.Ontology.Molecular.Function"
  34. "Pathway"
  35. "Protein.Families"
  36. "Protein.Domains"
  37. "InterPro"
  38. "Trans.Membrane"
  39. "QTL"
  40. "Annotation.Description"
  41. "Annotation.Transcript.Cluster"
  42. "Transcript.Assignments"
  43. "Annotation.Notes"

EC Enzyme Commission family number. OMIM OMIM: Online Mendelian Inheritance in Man accession number. RefSeq Protein ID ID of the protein sequence in the NCBI RefSeq database. RefSeq Transcript ID References to multiple sequences in RefSeq. The field contains the ID and Description for each entry, and there can be multiple entries per ProbeSet. Gene Ontology Biological Process Gene Ontology Consortium Biological Process derived from LocusLink. Each annotation consists of three parts: "Accession Number // Description // Evidence". The description corresponds directly to the GO ID. The evidence can be "direct", or "extended". Gene Ontology Cellular Component Gene Ontology Consortium Cellular Component derived from LocusLink. Each annotation consists of three parts: "Accession Number // Description // Evidence". The description corresponds directly to the GO ID. The evidence can be "direct", or "extended". Gene Ontology Molecular Function Gene Ontology Consortium Molecular Function derived from LocusLink. Each annotation consists of three parts: "Accession Number // Description // Evidence". The description corresponds directly to the GO ID. The evidence can be "direct", or "extended". Also see www.affymetrix.com/support/technical/manual/taf_manual.affx

Obviously, we need to match with column 15 (Gene.Symbol). One difficulty is that the order that the probesets are presented in the HG-U133_Plus_2_annot file is not the same as the order that the probesets are presented in data files. So you have to match twice. First you find the Gene.Symbol values in the annotate data set that match the symbols in the FolateGeneList file. Get the associated Probe.Set.ID value from the HG-U133_Plus_2_annot file. Now match this with the appropriate probeset values in the data files. Here is the R code to do all this.

f0 <- "x:/sleeder/csv/FolateGeneList.csv"
folate.dat <- read.csv(f0,header=F)
f1 <- "x:/sleeder/csv/HG-U133_Plus_2_annot.csv"
annotate.dat <- read.csv(f1)
folate.symbol <- trimWhiteSpace(as.character(folate.dat$V1))
annotate.symbol <- trimWhiteSpace(as.character(annotate.dat$Gene.Symbol))
annotate.probeset <- trimWhiteSpace(as.character(annotate.dat$Probe.Set.ID))
symbol.match <- annotate.symbol %in% folate.symbol
folate.probeset <- annotate.probeset[symbol.match]
probeset.match <- gene.probeset %in% folate.probeset

Now you might want to know what the symbols are for each of the probeset.match values. You just reverse the search process.

reverse.match <- annotate.probeset %in% gene.probeset[probeset.match]
gene.symbol <- annotate.symbol[reverse.match]

Now let's see what sort of graphs we can draw using the folate pathway. Note that the 14th through 16th values represent postnatal ages in years and have to be converted to post-conceptual age in days to match the rest of the data.

age.pca <- as.numeric(age[tis=="Li"])
age.pca[14:16] <- age.pca[14:16]*365+270
age.group <- (age.pca>76)+(age.pca>98)+(age.pca>140)
win.print(width=7,height=7,pointsize=12,printer="Adobe PDF")
for (i in 1:94) {
        plot(age.group,folate.signal[i,])
        title(gene.symbol[i],"0=75-76, 1=94-98, 2=130-140, 3=postnatal")
}
dev.off()

Dumping data to a text file

In the prenatal liver study, I needed to give some of the normalized gene expression levels to a researcher in a form he could use. The data he needed was in a data frame with 94 rows and 16 columns (folate.signal). But unfortunately, the names of the rows (gene.symbol) and columns (liver.names) were stored in separate objects. Here's one way to match the values back up.

 First, there are duplicates in the gene.symbol list, so to create new names, use the makeUnique function found in the limma library. Then change folate.signal from a data frame to a matrix. Use the dimnames function to add the row and column names to the matrix. Finally, use the write.table function to create a text file.

gene.unique <- makeUnique(paste(gene.symbol," "))
folate.matrix <- as.matrix(folate.signal)
dimnames(folate.matrix) <- list(gene.unique,liver.names)
write.table(folate.matrix,"x:/sleeder/folate.txt")