How to use the Haplotyper macro (README):


README for Haplotyper V 1.0

David G. Cox
Genome Analysis Group
International Agency for Research on Cancer
cox@iarc.fr

Changes:

September 6, 2001

After a period of working in Spain, I have gotten to the bottom
of the language problem. See below. I have also changed the
GOLD preparation so that the macro does not need the Arlequin
output file to get all its information. This will make it
considerably faster. The last change deals in how the sheets
are saved. You shouldn't see that pesky "File already exists"
message, unless you run the same data more than once. If anyone
has problems with this, please let me know.

September 13, 2001

Upon further discussion between Dr. Canzian and myself, we have
decided to include information about "Delta" in the GOLD data.
Formulas for calculating it are at the end of this document, along
with the references...

November 7, 2001

Due to a question from Jason Simon at Schering-Plough Research
Institute, I changed the "Organize" function to allow missing data.
USER BEWARE!!!! Arlequin will NOT accept missing data in diploid
data! It will treat missing data as a separate allele!

January 08, 2002
Finished making changes for Microsats....
We can now handle any type of polymorphism, SNP or Micro etc...
Also, at this point, Genotype Transposer is now Haplotyper!


May 08, 2002
Fixed a bug in Excel 2000 versions where a space is added to the
alleles extracted from Arlequin for bi-locus haplotypes.

Changed the name of delta to r, and added it's p value
Also, I added the confidence intervals of r, based on fisher's
exact test. See the readme for more info

July 24, 2002

I added a macro that will make an input file sheet to use David Clayton's STATA program to pick out htSNPs.  There will be a button on the "Multi-locus Haplos" worksheet that will make you the input file.  You can only do this with SNP data.  You will need to download the STATA program, in order to do it, see Dr. Clayton's web site for more information.  When you are asked to give a name to the input file, it will automatically add the .csv extension, so don't worry.  Just be careful to NOTE WHERE THE FILE WILL BE SAVED!  It should be in the same area as your Arlequin files etc.  I would then suggest moving the csv file into wherever you use STATA.  I give no guarantees as to the way people use the htSNP stuff....

August 23, 2002

I have fixed a bug in the macro that reconstructs haplotypes, so it should now work for polymorphisms with more than 9 alleles.  I also found and fixed a bug that caused the MakeArlequin module to malfunction, and individuals heterozygous for a polymorphism with more than 9 alleles was counted as homozygous.  As you can tell, I don't work very often with this sort of data!  Additionally, you will find that I now use buttons instead of a toolbar item.  In order to run the macro, you need to click the button that is found on the "README" page!  This will keep you from having a large number of "Genotype_Transposer" items on your menu bar.  Additionally, the buttons for reconstructing haplotypes and making the htSNP input are dynamic, making them disappear when the macros run.


YOU DO NOT NEED A PASSWORD TO OPEN THE FILE!!!!
The password is only needed if you want to 
modify the worksheets or macros AND save the
modifications under the same file name.

Known Incompatabilities!!!!

This program has been run on multiple platforms of Windows and Mac.
However, there has been limited work done with various versions or
languages of Excel. It has run on Windows 95, 98, NT, and 2000 with
Excel 5, 97, and 2000. On the Mac, it has been run on OS 9, with Excel
2000. 

Also, I have had success with French versions of Excel, BUT NOT
WITH ITALIAN!!!!(Thank you to Giuseppe Matullo in Turin for helping)
This does not mean you shouldn't try it. By all means, do. However,
please do report any incompatabilities, or compatabilities not mentioned
here to me! I have gotten to the bottom of the problem with languages,
and hope to have fixed it. Please please please foreign users give it
a try!

As of July, 2002, most of the language problems have been fixed. However,
do still contact me if you run into something strange!

This program provides time saving shortcuts in SNP genotype analysis.
Each SNP is given a worksheet with data containing allele and genotype
counts and frequencies, as well as a test for Hardy-Weinberg 
equilibrium. Then, your data are put into a format suitable for entry
the Arlequin program (see below). When running Arlequin, the only
NECESSARY thing to do would be to calculate either the bi-loci or
multi-loci haplotypes (or both). CALCULATING LINKAGE DISEQUILIBRIUM
WITH ARLEQUIN IS VERY COMPUTER INTENSIVE! This means it will take
a long time. This program will calculate D and D' (with a p-value)
from the bi-loci haplotype frequencies given by Arlequin. These are
the general statistics of linkage disequilibrium (see the notes below).


Below, you will find pertinent information about the programs involved,
as well as the formulas used in calculating the various statistics.

Arlequin can be found at the following web site:

http://lgb.unige.ch/arlequin/

GOLD can be found at the following web site:

http://www.well.ox.ac.uk/asthma/GOLD/

Programs/Instructions/Formulas:

There should be no need to install or change anything in the programs.
The software is written in Microsoft Visual Basic. The code should 
run smoothly on Mac or PC. We have not been able to fully test 
this on the Mac, due to our inablity to run the Arlquin software
on the Mac. All the macros have been run on the Mac, with the actual
Arlequin manipulations taking place on a PC.

It is recommended that you open this macro in Excel (and not from the 
recent documents list!), and then save it in a new folder. Then, 
when you run the macros, each new file will be saved in this folder.
Also, you should not move the .txt file that is to be read by Arlequin.
In order to read the Arlequin output, the results folder from Arlequin
needs to be in the same folder as the worksheet you are running the macro
from.

General Format:
(See also worksheet "Example" in the software)

Your data needs to be in the worksheet titled "Genotypes".
Cell A1 (Top left-hand corner) needs to contain the name you would
like to give to the project. This will be used in the future as
the Arlequin "Project title". The "Project title" will also be the
names of the files saved by this program. So, it needs to be a 
valid file name. 
The first row of the worksheet (not including Cell A1) will
contain a discriptive name of each polymorphism. This will be the
name given to each separate worksheet in the "Organize" macro. 
The Cell A2 (just bellow the "Project title") will become
the "Sample name" in Arlequin. For this reason, it needs to be one
descriptive word of your sample set. The rest of the row is the
template for each polymorphism. It needs to consist of each allele,
separated by a spacer character. The following are a few examples:

A/G or A:G or A?G or A\G or A|G etc. 

Any character can go between your alleles. 

If you prefer to use numbers to determine your alleles, you will need
to use something like 1/1 1/2 or 2/2. In the row that has information
regarding the alleles of your polymorphism, you will need to use 2/2.
This is the moment some of you have been waiting for... I can now
accept micorsats or other multi-allelic markers. All you need to do
is use the number of alleles separated by a separater in the second
row of the excel sheet. So, for 2 alleles (SNP), you would have 2/2.
However, if you had a microsat with 10 alleles, you would input 10/10.
I hope you don't have any polymorphisms with more than 99 alleles, 
because we can't handle them. 

If you want to run Arlequin, your data CAN NOT contain any 
missing genotypes. This will not be a problem for figuring out the 
allele and genotype frequencies to check Hardy-Weinberg Equilibrium. There
is a problem with the Arlequin software, in that with Genotypic data,
missing data are considered as another allele.

Option 1: Organize Data and Check Hardy-Weinberg Equilibrium

This macro reads the data from your "Genotypes" sheet, and creates
a new worksheet for each polymorphism. This new worksheet will take its 
name from the first cell of the column, and contain the allele and genotype
counts of the column, as well as information regarding Hardy-Weinberg 
equilibrium. The formulas used to check Hardy-Weinberg equilibrium are as
follows:

Calculations for Expected genotype counts:

N = number of samples

p^2 + 2pq + q^2 = 1 (This is Hardy-Weinberg equilibrium!)
where p = observed major allele frequency
q = observed minor allele frequency

p^2 * N = expected number of homozygous major
2pq * N = expected number of heterozygous
q^2 * N = expected number of homozygous minor

Calculations for Chi-squared test of H-W equilibrium:

Chi-squared = Sum for each genotype class of (Obs counts - Exp Counts)^2 / Exp Counts

For the P-value, the Excel formula of ChiDist was used, with 1 degree of freedom.



Option 2: Put data into Arlequin format:

This macro again reads the data from your "Genotypes" sheet. Then
it creates a text file to be read by Arlequin. The text file will be saved in
the same place as the original file. It is suggested that you run Arlequin 
immediately at this point. We have left the choice of Arlequin features you
want to use up to you. To continue using the next macros, you will need to run
at least the Arlequin functions of Maximum likelihood frequencies for both the
Multi- and Bi- loci options. If you wish to compare more than one sample set,
all you need to do is run the macro on each, and then cut and paste the sample
information into one worksheet. Do not forget to change the number of samples
in the Arlequin input, and to save it as a text (*.txt) file. One more thing.
It is important that you are NOT appending Arlequin output. In the configuration
of Arlequin, there is a checkbox to this effect. If you append the results, 
they will be obviously at the end, and anything you read with the macro will just
be with the first run...


Options 3-5: Extracting Multi- or Bi- Loci haplotypes.
When you run Arlequin, it will create a new folder in the same directory
as the input file. This directory will have multiple files. The main file of
interest will have the same name as the .txt file, but with the extension .htm
These macros will search this file for the type of haplotypes you have asked for.
IT IS VERY IMPORTANT THAT ALL YOUR FILES STAY IN THE SAME PLACE! This means that
the worksheet from which you are running the macro needs to be in the same folder
as the folder with the Arlequin results. A good idea would be to make a separate
folder each time you START running the macro!

Extracting Multi-loci Haplotypes:
Here, a new worksheet will be created with the Haplotype numbers, the
haplotype frequencies, and the actual haplotype.

Extracting Bi-loci haplotypes:
This macro will extract each bi-locus haplotype for each pair of alleles.
The allele names will be taken from the "Genotypes" sheet, and each of the four
possible haplotypes will be shown. Also, the observed and expected haplotype
frequencies will be shown.

Prepare GOLD Sheet:
This macro is similar to "Extracting Bi-loci haplotypes", but creates
a new worksheet named "GOLD Data". This will contain each pair of loci, along
with the D, D', and p value for that pair. A new text file will be created and
saved, with the extension ".xt". This file can be read directly by GOLD. We are
currently working on a way to automatically create a GOLD map file, and hope to
have that in a future release.

Calculating D: 

Devlin, B. and Risch, N. (1995) A comparison of linkage disequilibrium measures 
for fine-scale mapping. Genomics., 29, 311-322

These allele frequencies are taken from the Arlequin output:

fD1A1 = The frequency of the haplotype consisting of both major alleles
fD2A2 = The frequency of the haplotype consisting of both minor alleles
fD1A2 = The frequency of the haplotype consisiting of major allele one and minor allele 2
fD2A1 = The frequency of the haplotype consisiting of minor allele one and major allele 2

D = (fD1A1 * fD2A2) - (fD1A2 * fD2A1)


Calculating D'

These frequencies are taken from the information on each polymorphism sheet

fD1 = The frequency of the major allele for the first polymorphism
fD2 = The frequency of the minor allele for the first polymorphism
fA1 = The frequency of the major allele for the second polymorphism
fA2 = The frequency of the minor allele for the second polymorphism

If D > 0 

D' = D / min(fD1 * fA2, fD2 * fA1)

If D < 0

D' = D / min(fD1 * fA1, fD2* FA2)

See:

Devlin, B. and Risch, N. (1995) A comparison of linkage disequilibrium measures 
for fine-scale mapping. Genomics., 29, 311-322


Calculating p-values

This is actually the significance of D' from 0, distributed as
chi-square with 1 degree of freedom.

Xsqr = (D^2 * (N)) / (fD1 * fD2 * fA1 * f A2)

Where N is the number of chromosomes.

See:

Elbein, S.C. (1992) Linkage disequilibrium among RFLPs at the insulin-receptor locus 
despite intervening Alu repeat sequences. Am. J. Hum. Genet., 51, 1103-1110.


Calculating Delta:

Delta = |D/(fD1 * fA1 * fD2 * fA2)^.5|

See:
Devlin, B. and Risch, N. (1995) A comparison of linkage disequilibrium measures 
for fine-scale mapping. Genomics., 29, 311-322

Since Delta is the corrolation coefficient of linkage disequilibrium, we can calculate
the confidence intervals of it. This makes it possible to compare LD between polymorphisms,
between populations, or any combination thereof. This is done using Fisher's exact test
in Excel.

Reconstructing individual level haplotypes:

Once the multi-locus haplotypes have been extracted from the Arlequin output, a button
will appear on the Genotypes worksheet, which will allow you to reconstruct the individual
level haplotypes. This is done using a maximum likelihood method. What the program does
is determines the number of haplotypes possible for each individual given their combination
of genotypes. For individuals that are homozygous at each polymorphism, or heterozygous at
only one, there is only one pair of haplotypes possible. However, when an individual is
heterozygous at more than one polymorphism, the haplotype needs to be inferred from the
haplotypes present in the population. The program will then check to see how many combinations
of haplotypes are possible given the genotypes of the individual, and the haplotypes in the
population. If only one combination is possible, then we can determine the haplotype with a 
very high degree of certainty. Howver, when more than one combination is possible, the program
will calculate the relative probability of each haplotype, given the genotypes. This is done
with the following formula:

T= (pq)_max
Where p and q are the two haplotypes making up the combination

Xi = T_max/ SUM(pq)

This sum is the sum of the frequency of each combination of haplotypes.

The most likely haplotype is given in bold, and the program gives the other possibilities
in the following columns.

Back to the program list.

Description of the macro.

Download the macro.

This page was last updated on: August 28, 2002 by David G. Cox