ShapeIT

ShapeIT is an haplotype inference software allowing to phase whole chromosomes into samples of unrelated individuals and nuclear families.

1. Preparing your data ( expand / collapse )

1.0. READ ME FIRST

1.1. Filtering SNPs (optional)

1.2. Filtering individuals (optional)

1.3. Remove Mendel error (required for trios/duos)

To remove individuals or snps based on Mendel errors rates:

plink --file example --me 0.05 0.1 --recode --out example_me_filtered

Here, the families and the SNPs with respectively more than 5% and 10% of Mendel errors are removed from the dataset.

You MUST set to missing all the remaining Mendel errors. Otherwise, ShapeIT will give you an error message without phasing your data. You can do that with the following command:

plink --file example --me 1 1 --set-me-missing --recode --out example_me_blanked

1.4. Splitting the dataset by chromosomes (required)

1.5. Genotype file formats for Unrelated individuals (required)

ShapeIT can read several formats for the genotype data of unrelated individuals :

Plink PED/MAP format as described here
Plink BED/BIM/FAM format as described here
Phase v2.1 INP format as described here

The PED/MAP and INP formats are text format which allows you to open those files into a classical text editor. However, for large datasets, a large amount of HDD memory is required and ShapeIT needs more time to read the data since it has to sparse the files. This is why, we advice the users to use the Binary format (BED/BIM/FAM). To use this format, convert your PED/MAP data to BED/BIM/FAM by using:

plink --file example_text --make-bed --out example_binary

To recovert the PED/MAP format, just use:

plink --bfile example_binary --recode --out example_text

Note 1: The transposed PED format is not handled, so convert it to the classical PED/MAP format.
Note 2: The Binary formats can be given in both SNP or individual major modes.

Sometimes, it can be usefull to use the INP format of Phase v2.1, especially if you want to run both software (ShapeIT and Phase v2.1) or if all your data were formated for Phase v2.1. Cedric Coulonges kindly provided a patch for Plink to output the genotypes into Phase v2.1 format :

Download Plink source code here
Unzip it with unzip plink-1.07-src.zip
Go into the source file folder with cd plink-1.07-src
Download the patch file here
Apply the patch with patch < patch_plink-1.07-src.diff
Then compile with make

With this patched plink, you can use the following command to convert your data files into INP format with:

plink_patched --file example --recode-phase --out example_phase

1.6. Genotype file formats for Nuclear families (required)

1.7. Missing data code (optional)

1.8. The genetic map (required)

It is strongly recommended to provide to ShapeIT a genetic map of the chromosome you want to phase. It will greatly improve accuracy of phasing. You can find genetic maps for Humans here:

The genetic map provided to ShapeIT should have 3 columns delimited by a single space character. The first column is the physical position (in bp), the second one is the rate (in cM/Mb) and the third one is the genetic position (in cM). Only the first and the third columns are used by ShapeIT, so the second on can be set to any values if you generate this file on your own. The HapMap phase II NCBI b36 genetic map is already formatted corectly for ShapeIT. If you use the HapMap phase II NCBI b37, you have to change the format of the file in order to use it with ShapeIT. To change the format, use the following command:

for chr in $seq(1 22); do cat genetic_mapGRC_chr$chr.txt | awk '{ print $2, $3, $4}' > genetic_mapGRC_chr$chr_reformatted.txt; done

The physical positions in the snp map file (MAP or BIM or INP) and in the genetic map file MUST be of the same NCBI version. If your positions in the BIM/MAP file of your data are from the b36, you MUST use the HapMap phase II b36 genetic map!

2. Phasing your data ( expand / collapse )

2.0. READ ME FIRST

If you want to phase the example dataset provided with the software with the default settings, use this command line:

shapeit --phase --input-ped example.ped example.map --input-map example.gmap --output-max example.phased

The meanings of the arguments are the following:

--phase : This flag specifies to ShapeIT to use MCMC iterations to phase the data
--input-ped : This option gives the filenames of the genotype input files (example.ped/example.map) and that the genotypes are in PED/MAP format.
--input-map : This option gives the genetic map file (example.gmap) formatted as described in section 1.8.
--output-max : This option gives the filename of the phased haplotypes (example.phased). The most likely pair of haplotypes is given for each individual.

2.1. Input files (required)

If you want to phase a dataset in PED/MAP format, use the option --input-ped to specify the input files as follows:

shapeit --phase --input-ped example.ped example.map --input-map example.gmap --output-max example.phased

If you want to phase a dataset in BED/BIM/FAM format, use the option --input-bed to specify the input files as follows:

shapeit --phase --input-bed example.bed example.bim example.fam --input-map example.gmap --output-max example.phased

If you want to phase a dataset in INP format, use the option --input-phase to specify the input file as follows:

shapeit --phase --input-phase example.inp --input-map example.gmap --output-max example.phased

2.2. Phasing nuclear families (required for trios/duos)

If you have some duos or trios into your dataset and you want that ShapeIT takes into account the corresponding pedigree information during the phasing process, you MUST add the --pedigree flag to the command line as follows:

shapeit --phase --pedigree --input-ped example.ped example.map --input-map example.gmap --output-max example.phased

If the --pedigree flag is not given, then all the individuals are phased as unrelated. In the example.phased, you will have the most likely pair of haplotypes for each individuals (children and parents).

2.3. Output files (required)

You can specify to ShapeIT to output the most likely pairs of haplotypes in Phase v2.1 format (described here) as follows:

shapeit --phase --input-ped example.ped example.map --input-map example.gmap --output-max example.phased

Or you can alternatinely output the haplotypes in the HapMap phase II format (described here) as follows:

shapeit --phase --input-ped example.ped example.map --input-map example.gmap --output-hapmap2 example.haplotypes example.legend example.sample

Note that, conservely to the original HapMap phase II haplotype format, the haplotypes of the children are written in the example.haplotypes file.

2.4. Algorithm parameters (optional)

Iteration number

You can change the default number of MCMC iterations with the following options:

--burn1 X : where X is the number of first stage burn-in iterations. The default value is 10 which is sufficient for most application. Those iterations are used to find a good starting point estimate for the haplotypes.
--burn2 Y : where Y is the number of iterations of the second stage of burn-in iteration. The default value is 10. Those iterations are used to obtain a parsimonious haplotype graph for each genotype.
--main Z : where Z is the number of main iterations. The default value is 50. The final haplotype estimate is found by averaing over those iterations.

Hereafter, an example of a run of ShapeIT with 15 first stage burn-in iterations, 15 second stage burn-in iterations and 100 main iterations:

shapeit --phase --input-ped example.ped example.map --input-map example.gmap --output-max example.phased --burn1 15 --burn2 15 --main 100

Seed specification

You can change the seed of the random number generator as follows:

shapeit --phase --input-ped example.ped example.map --input-map example.gmap --output-max example.phased --seed 123456789

Note that ShapeIt initialize automatically the seed by calling the stdlib function time(NULL) if the option --seed is not specified.

Multithreading

You can change the number of thread that ShapeIT use for parallel computing. The default is to use 1 thread (monothreaded). This option allows ShapeIT to run on multi-core computers. For example, if you have 4 cores into your computer and 1000 individuals in your dataset, then you can run the following command:

shapeit --phase --input-ped example.ped example.map --input-map example.gmap --output-max example.phased --thread 4

Then your 4 cores will be used to phase 4 individuals in parallel conditionnal upon the 1000-4=996 others individuals. It allows to divide the running time of ShapeIT by almost 4 in this case, so 4 days becomes 1!
Note1: This option is recommended and usefull only if you have a large number of individuals in your dataset since you reduce the number of conditioning haplotypes.
Note2: Setting the number of thread higher than the number of available cores will not decrease more the runinning times.

Conditioning states

You can increase accuracy of ShapeIT by increasing the number of conditioning states on which haplotype inference is based. By default, ShapeIT uses ~50 states per SNP across the dataset which gives good accuracy while maintaining raisonable running times. The complexity of the algorithm is linear with the number of conditioning states, so feel free to increase this number. For instance, if you set this number to 100, it will be two times longer than with 50 states. You can set the number of conditioning states with the --states option as follows:

shapeit --phase --input-ped example.ped example.map --input-map example.gmap --output-max example.phased --states 100

2.5. Model parameters (optional)

ShapeIT uses the Li & Stephens hidden Markov model [] which requires recombination rates estimates between successive SNPs. ShapeIT offers two ways to estimate them:

You can compute them from a genetic map (position in cM of the SNPs) and an effective population size as done in [].
You can infer them from the data as done in [].

Case1 : I have a genetic map

You can specify the genetic map of the chromosome you want to phase and the effective population size with respectively the options --input-map and --effective-size. To download and format correctly a genetic map for Human, see section 1.8. The default value for the effective population size is 14000. Hereafter, a command line example to specify to use example.gmap as genetic map and 11418 for the effective population size:

shapeit --phase --input-ped example.ped example.map --input-map example.gmap --output-max example.phased --effective-size 11418

Notes for the genetic positions:
Most of the SNPs of your dataset MUST have a genetic position specified into this file. For the SNPs that don't have a genetic position, it is internally determined with linear interpolation. If the intersection between your snp map (example.bim example.map) and your genetic map (example.gmap) is poor, you should verify that the positions in both files are from the same NCBI genome build (b36 or b37).
Notes for the effective population size:
You can use the effective population sizes estimated for the HapMap phase II populations:

European (CEU) : 11418
African (YRI) : 17469
Asian (CHB+JPT) : 14269

Case2 : I don't have a genetic map

If you don't have a genetic map for your data, you can estimates the recombination rates with Mettropolis-Hasting (MH) algorithm as done in []. However, it will require a substancial additional computational effort for phasing the data. The MH algorithm has the following parameters:

--init-rho X : Where X is the initial recombination rate value between SNPs. The default value is 0.0004.
--states-rho Y : Where Y is the number of haplotypes used to infer recombination rates between SNPs. A large value of Y will increase accuracy of the estimation as well as the running times (the MH algorithm is quadratic with Y). The default value is 50 which allows to phase large dataset with tractable running times.
--infer-rho : This flag indicates to ShapeIT to use the MH algorithm to estimate recombination rates. If this flag is not provided, then ShapeIT does not use the MH algorithm and assumes a constant recombination rate accross the genome whose value is determined by the --init-rho option.

Hereafter, an example of recombination rates inference:

shapeit --phase --input-ped example.ped example.map --output-max example.phased --infer-rho

Note that int the above command, no genetic map (example.gmap) is provided since the flag --infer-rho is used.

2.6. Phasing a window of SNPs (optional)

2.7. Excluding individuals (optional)

3. Capturing and using phasing uncertainty ( expand / collapse )

3.0. READ ME FIRST

In section 2.3, we shown how to output the most likely pair of haplotypes for each genotype. In this section, we show how to output compact representations of the phasing uncertainty. This uncertainty is captured by ShapeIT thanks to graph representations of the possible haplotype space of haplotypes associted with each genotype. Nodes of these graphs are haplotype segments while edges between nodes are the possible phases between haplotype segments. Each edge is weighted by a phase probability.

In a first section (3.1), we describe the options required to output these haplotype graphs.
In a second section (3.2), we describe the options required to specify to ShapeIT to read these haplotypes graphs for posterior uses.
In the following sections, we decribe how the functionnalities currently implemented in ShapeIT to use these haplotype graphs.

Now, let see a short example of phasing the example dataset like we have done in section 2.0, but this time in 2 steps. First, we phase the dataset while capturing the uncertainty with the following command:

shapeit --phase --input-ped example.ped example.map --input-map example.gmap --output-tgraph example.tgraph example_dup.map

The meanings of the arguments are the following:

--phase : This flag specifies to ShapeIT to use MCMC iterations to phase the data !
--input-ped : Input files (example.ped/example.map) in PED/MAP format.
--input-map : Genetic map (example.gmap).
--output-tgraph : This option gives the filenames of the infered haplotype graphs and the associated SNP map. The file example.tgraph contains the haplotype graphs in text format and the file example_dup.map contains the associated SNP map.

Now that we have the haplotype graphs for each genotype, we use them to determine the most likely pairs of haplotypes. For that, use the following command:

shapeit --input-tgraph example.tgraph example_dup.map --output-max example.phased

The meanings of the arguments are the following:

--input-tgraph : Input files (example.tgraph/example_dup.map) in TGRAPH/MAP format. Those files were obtained thanks to the previous command.
--output-map : This option specifies to ShapeIT to output the most likely haplotypes for the sample in the file example.phased. Those haplotypes were determined by finding the most likely paths into the haplotype graphs.

3.1. Output uncertainty

To output the haplotypes graphs, you can use the text format by using the --output-tgraph option as in the following example:

shapeit --phase --input-ped example.ped example.map --input-map example.gmap --output-tgraph example.tgraph example_dup.map

However, the text files can require a large amount of HDD memory and large sparsing times for posterior uses. For this reason, we have implemented a more compact file format, named Binary haplotype graph. To make use of it, use the --output-bgraph option as follows:

shapeit --phase --input-ped example.ped example.map --input-map example.gmap --output-bgraph example.bgraph example.bim example.fam

The file example.bgraph will be musch smaller than example.tgraph, but it remains impossible to open it into a text editor. Note that the files example.bim and example.fam are in the same format than the binary files of Plink.

3.2. Input uncertainty

3.3. Finding the most probable haplotype pair

As previously shown, to infer the most likely pair of haplotypes for each individual from the haplotype graphs, use the following command:

shapeit --input-tgraph example.tgraph example_dup.map --output-max example.phased

Alternatively, to output the haplotypes into the HapMap2 file format, use:

shapeit --input-tgraph example.tgraph example_dup.map --output-hapmap2 example.haplotypes example.legend example.sample

These options work also when specifying haplotypes graphs with --input-bgraph.

3.4. Sampling in the haplotype graphs

4. Option list ( expand / collapse )

5. Download ( expand / collapse )

To know which version you need, use the following command on your computer:

uname -m

You will usually obtain x86_64 or i686 which indicates you which version you should download. Then, try the dynamic version compiled for your architecture. If it doesn't work, try the static. If a problem persists, then contact me.

Platform File

GNU/Linux x86_64 static ShapeIT.Linux.x86_64.static.tar.gz

GNU/Linux x86_64 dynamic ShapeIT.Linux.x86_64.dynamic.tar.gz

GNU/Linux i686 static ShapeIT.Linux.i686.static.tar.gz

GNU/Linux i686 dynamic ShapeIT.Linux.i686.dynamic.tar.gz

Platform	File
GNU/Linux x86_64 static	ShapeIT.Linux.x86_64.static.tar.gz
GNU/Linux x86_64 dynamic	ShapeIT.Linux.x86_64.dynamic.tar.gz
GNU/Linux i686 static	ShapeIT.Linux.i686.static.tar.gz
GNU/Linux i686 dynamic	ShapeIT.Linux.i686.dynamic.tar.gz

ShapeIT documentation

1. Preparing your data ( expand / collapse )

2. Phasing your data ( expand / collapse )

Iteration number

Seed specification

Multithreading

Conditioning states

Case1 : I have a genetic map

Case2 : I don't have a genetic map

3. Capturing and using phasing uncertainty ( expand / collapse )

4. Option list ( expand / collapse )

5. Download ( expand / collapse )