StATS: Methods for haplotype analysis (May 31, 2006).

I am not an expert on haplotype analysis, but as I understand it, a haplotype is a combination of alleles at several SNPs (Single Nucleotide Polymorphisms) that sit together on the same strand of DNA. Such a combination may show a stronger association with disease than any single SNP does.

Haplotype analysis is difficult because you often have only partial information about the genomes. A genotype tells you which two alleles a subject carries at each SNP, but not which allele sits on which strand. Here are the genotypes at the first fifteen SNPs on chromosome 22 for a subject in the HapMap project.

rs3016036  AA
rs2334386  GG
rs2844882  AA
rs11089130 GG
rs738829   GG
rs7510853  CC
rs10154488 CC
rs915674   AG
rs915675   AC
rs915677   GG
rs9604648  GG
rs7286962  CC
rs9604721  CC
rs12159982 CC
rs4389403  AG

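Three of these SNPs (rs915674, rs915675, and rs4389403) are heterozygous, and each of them can be arranged in two ways, so there are 2 x 2 x 2 = 8 possible ways that these SNPs could arrange themselves on the two strands of DNA: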

Haplotype 1: AGAGGCCAAGGCCCA and
             AGAGGCCGCGGCCCG

Haplotype 2: AGAGGCCAAGGCCCG and
             AGAGGCCGCGGCCCA

Haplotype 3: AGAGGCCACGGCCCA and
             AGAGGCCGAGGCCCG

Haplotype 4: AGAGGCCACGGCCCG and
             AGAGGCCGAGGCCCA

Haplotype 5: AGAGGCCGAGGCCCA and
             AGAGGCCACGGCCCG

Haplotype 6: AGAGGCCGAGGCCCG and
             AGAGGCCACGGCCCA

Haplotype 7: AGAGGCCGCGGCCCA and
             AGAGGCCAAGGCCCG

Haplotype 8: AGAGGCCGCGGCCCG and
             AGAGGCCAAGGCCCA

Actually, if you look closely at this, there are only four unique pairs of haplotypes, because swapping the two strands gives the same pair (1 and 8, 2 and 7, 3 and 6, and 4 and 5 are effectively the same).
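The listing above can be reproduced with a few lines of code. This is only a minimal sketch in Python: the function names ordered_phasings and unique_pairs are made up for illustration, and the genotypes are typed in by hand from the table above rather than read from a HapMap file.

```python
from itertools import product

# Genotypes for the fifteen SNPs listed above, in the same order.
genotypes = ["AA", "GG", "AA", "GG", "GG", "CC", "CC", "AG",
             "AC", "GG", "GG", "CC", "CC", "CC", "AG"]

def ordered_phasings(genotypes):
    """Return every (strand1, strand2) arrangement consistent with the genotypes."""
    per_snp = []
    for a, b in genotypes:
        # A homozygous SNP has one arrangement; a heterozygous SNP has two.
        per_snp.append([(a, b)] if a == b else [(a, b), (b, a)])
    phasings = []
    for choice in product(*per_snp):
        strand1 = "".join(pair[0] for pair in choice)
        strand2 = "".join(pair[1] for pair in choice)
        phasings.append((strand1, strand2))
    return phasings

def unique_pairs(phasings):
    """Collapse arrangements that differ only in which strand is listed first."""
    return sorted({tuple(sorted(p)) for p in phasings})

ordered = ordered_phasings(genotypes)
unordered = unique_pairs(ordered)
print(len(ordered), "ordered arrangements")
print(len(unordered), "unique pairs")
for strand1, strand2 in unordered:
    print(strand1, strand2)
```

The number of ordered arrangements doubles with every heterozygous SNP, which is why listing them by hand stops being practical very quickly.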

In most realistic situations, you do not know which particular pair of haplotypes a patient has. You could sequence the DNA strand to figure out which of these haplotype combinations is actually present, but sequencing is very expensive. Instead, you might be able to infer the likelihood of these haplotypes by looking at multiple patients and making assumptions consistent with Hardy-Weinberg equilibrium.

These inferences are effectively the same as many missing data problems, and they use an approach, the EM (expectation-maximization) algorithm, that is commonly relied on to help with this sort of problem. There is a library of programs for R called haplo.stats.
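To make the idea concrete, here is a rough sketch of how an EM estimate of haplotype frequencies might look. It is written in Python and is not the haplo.stats interface; the function names, the two-SNP genotypes for six subjects, and the fixed number of iterations are all invented for illustration. The only modelling assumption is Hardy-Weinberg equilibrium, so the probability of a pair of haplotypes is the product of their frequencies (doubled when the two haplotypes differ).

```python
from itertools import product
from collections import defaultdict

def compatible_pairs(genotype):
    """List the unordered haplotype pairs consistent with an unphased genotype."""
    per_snp = [[(a, b)] if a == b else [(a, b), (b, a)] for a, b in genotype]
    pairs = set()
    for choice in product(*per_snp):
        h1 = "".join(p[0] for p in choice)
        h2 = "".join(p[1] for p in choice)
        pairs.add(tuple(sorted((h1, h2))))
    return sorted(pairs)

def em_haplotype_frequencies(genotypes, n_iter=50):
    """Estimate haplotype frequencies by EM, assuming Hardy-Weinberg equilibrium."""
    candidates = [compatible_pairs(g) for g in genotypes]
    haplotypes = sorted({h for pairs in candidates for pair in pairs for h in pair})
    # Start from equal frequencies for every haplotype that could occur.
    freq = {h: 1.0 / len(haplotypes) for h in haplotypes}
    for _ in range(n_iter):
        counts = defaultdict(float)
        for pairs in candidates:
            # E-step: weight each compatible pair by its Hardy-Weinberg probability.
            weights = [freq[h1] * freq[h2] * (1 if h1 == h2 else 2)
                       for h1, h2 in pairs]
            total = sum(weights)
            for (h1, h2), w in zip(pairs, weights):
                counts[h1] += w / total
                counts[h2] += w / total
        # M-step: every subject contributes two haplotypes to the counts.
        freq = {h: counts[h] / (2 * len(genotypes)) for h in haplotypes}
    return freq

# Hypothetical genotypes at two SNPs for six subjects (made-up data).
subjects = [
    ["AA", "CC"], ["AA", "CC"], ["GG", "TT"],
    ["AG", "CT"], ["AG", "CT"],  # double heterozygotes: the phase is ambiguous
    ["AA", "CT"],
]
estimates = em_haplotype_frequencies(subjects)
for haplotype, p in sorted(estimates.items()):
    print(haplotype, round(p, 4))
```

In this made-up data set, only the two double heterozygotes are ambiguous. The other four subjects supply direct evidence about which haplotypes are common, and the algorithm uses that evidence to decide how the ambiguous subjects are most likely phased.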

I've run some experiments on applying information theory to the HapMap project, and I might investigate whether this provides an alternative way of identifying haplotypes.

I attended a talk last week by Pengyuan Liu, who described how to assess haplotype information, with special attention to the case where you have data on siblings. Some of the references mentioned in the talk are worth reviewing.

This work is licensed under a Creative Commons Attribution 3.0 United States License. It was written by Steve Simon and was last modified on 04/01/2010.

This page was written by Steve Simon while working at Children's Mercy Hospital. Although I do not hold the copyright for this material, I am reproducing it here as a service, as it is no longer available on the Children's Mercy Hospital website.