ShapeIT is an haplotype inference software allowing to phase whole chromosomes into samples of unrelated individuals and nuclear families.
We strongly advise to prepare your data with the Plink software available here since the standard file formats of plink (PED/MAP and BED/BIM/FAM) are handled by ShapeIT.
To remove SNPs with a very low minor allele frequency (MAF) and/or monomorphic (MAF=0) and/or with a high missing data rate (MD), use the following command :
plink --file example --maf 0.001 --geno 0.2 --recode --out example_snp_filtered |
Here, the SNPs with a MAF below 0.001 and with more than 20% of MD are removed from the dataset.
To remove individuals with a high missing data rate (MD), use the following command :
plink --file example --mind 0.2 --recode --out example_ind_filtered |
Here, the individuals with more than 20% of MD are removed from the dataset.
To remove individuals or snps based on Mendel errors rates:
plink --file example --me 0.05 0.1 --recode --out example_me_filtered |
Here, the families and the SNPs with respectively more than 5% and 10% of Mendel errors are removed from the dataset.
You MUST set to missing all the remaining Mendel errors. Otherwise, ShapeIT will give you an error message without phasing your data. You can do that with the following command:
plink --file example --me 1 1 --set-me-missing --recode --out example_me_blanked |
You have to split the dataset by chromosomes. ShapeIT can just phase one chromosome at a time. If you are working on GWA data, you can proceed as follows :
for chr in $(seq 1 22) ; do plink --file example --chr $chr --recode --out example_chr$chr ; done |
You will obtain 44 files; example_chr1.ped, example_chr1.map, ..., example_chr22.ped, example_chr22.map that you can phase separately.
ShapeIT can read several formats for the genotype data of unrelated individuals :
plink --file example_text --make-bed --out example_binary |
To recovert the PED/MAP format, just use:
plink --bfile example_binary --recode --out example_text |
Note 1: The transposed PED format is not handled, so convert it to the classical PED/MAP format.
Note 2: The Binary formats can be given in both SNP or individual major modes.
Sometimes, it can be usefull to use the INP format of Phase v2.1, especially if you want to run both software (ShapeIT and Phase v2.1) or if all your data were formated for Phase v2.1. Cedric Coulonges kindly provided a patch for Plink to output the genotypes into Phase v2.1 format :
With this patched plink, you can use the following command to convert your data files into INP format with:
plink_patched --file example --recode-phase --out example_phase |
ShapeIt accept mixed sample of unrelated individuals, duos and trios. The allowed file formats for these datasets are the following:
Don't forget to remove the Mendel errors !
In the case of text files (PED/MAP or INP), the missing data code has to be specified. By default, ShapeIT considers 0 as a missing allele as Plink does. You can change the missing data code to ? for example with Plink :
plink --file example --missing-genotype ? --recode --out example_mc? |
Alternatively, you can specify to ShapeIT that the missing character is ? with the option --missing-code ?
It is strongly recommended to provide to ShapeIT a genetic map of the chromosome you want to phase. It will greatly improve accuracy of phasing. You can find genetic maps for Humans here:
The genetic map provided to ShapeIT should have 3 columns delimited by a single space character. The first column is the physical position (in bp), the second one is the rate (in cM/Mb) and the third one is the genetic position (in cM). Only the first and the third columns are used by ShapeIT, so the second on can be set to any values if you generate this file on your own. The HapMap phase II NCBI b36 genetic map is already formatted corectly for ShapeIT. If you use the HapMap phase II NCBI b37, you have to change the format of the file in order to use it with ShapeIT. To change the format, use the following command:
for chr in $seq(1 22); do cat genetic_mapGRC_chr$chr.txt | awk '{ print $2, $3, $4}' > genetic_mapGRC_chr$chr_reformatted.txt; done |
The physical positions in the snp map file (MAP or BIM or INP) and in the genetic map file MUST be of the same NCBI version. If your positions in the BIM/MAP file of your data are from the b36, you MUST use the HapMap phase II b36 genetic map!
If you want to phase the example dataset provided with the software with the default settings, use this command line:
shapeit --phase --input-ped example.ped example.map --input-map example.gmap --output-max example.phased |
The meanings of the arguments are the following:
If you want to phase a dataset in PED/MAP format, use the option --input-ped to specify the input files as follows:
shapeit --phase --input-ped example.ped example.map --input-map example.gmap --output-max example.phased |
If you want to phase a dataset in BED/BIM/FAM format, use the option --input-bed to specify the input files as follows:
shapeit --phase --input-bed example.bed example.bim example.fam --input-map example.gmap --output-max example.phased |
If you want to phase a dataset in INP format, use the option --input-phase to specify the input file as follows:
shapeit --phase --input-phase example.inp --input-map example.gmap --output-max example.phased |
If you have some duos or trios into your dataset and you want that ShapeIT takes into account the corresponding pedigree information during the phasing process, you MUST add the --pedigree flag to the command line as follows:
shapeit --phase --pedigree --input-ped example.ped example.map --input-map example.gmap --output-max example.phased |
If the --pedigree flag is not given, then all the individuals are phased as unrelated. In the example.phased, you will have the most likely pair of haplotypes for each individuals (children and parents).
You can specify to ShapeIT to output the most likely pairs of haplotypes in Phase v2.1 format (described here) as follows:
shapeit --phase --input-ped example.ped example.map --input-map example.gmap --output-max example.phased |
Or you can alternatinely output the haplotypes in the HapMap phase II format (described here) as follows:
shapeit --phase --input-ped example.ped example.map --input-map example.gmap --output-hapmap2 example.haplotypes example.legend example.sample |
Note that, conservely to the original HapMap phase II haplotype format, the haplotypes of the children are written in the example.haplotypes file.
You can change the default number of MCMC iterations with the following options:
shapeit --phase --input-ped example.ped example.map --input-map example.gmap --output-max example.phased --burn1 15 --burn2 15 --main 100 |
You can change the seed of the random number generator as follows:
shapeit --phase --input-ped example.ped example.map --input-map example.gmap --output-max example.phased --seed 123456789 |
Note that ShapeIt initialize automatically the seed by calling the stdlib function time(NULL) if the option --seed is not specified.
You can change the number of thread that ShapeIT use for parallel computing. The default is to use 1 thread (monothreaded). This option allows ShapeIT to run on multi-core computers. For example, if you have 4 cores into your computer and 1000 individuals in your dataset, then you can run the following command:
shapeit --phase --input-ped example.ped example.map --input-map example.gmap --output-max example.phased --thread 4 |
Then your 4 cores will be used to phase 4 individuals in parallel
conditionnal upon the 1000-4=996 others individuals. It allows to divide
the running time of ShapeIT by almost 4 in this case, so 4 days becomes
1!
Note1: This option is recommended and usefull
only if you have a large number of individuals in your dataset since you
reduce the number of conditioning haplotypes.
Note2: Setting the number of thread higher than the number of available
cores will not decrease more the runinning times.
You can increase accuracy of ShapeIT by increasing the number of conditioning states on which haplotype inference is based. By default, ShapeIT uses ~50 states per SNP across the dataset which gives good accuracy while maintaining raisonable running times. The complexity of the algorithm is linear with the number of conditioning states, so feel free to increase this number. For instance, if you set this number to 100, it will be two times longer than with 50 states. You can set the number of conditioning states with the --states option as follows:
shapeit --phase --input-ped example.ped example.map --input-map example.gmap --output-max example.phased --states 100 |
ShapeIT uses the Li & Stephens hidden Markov model [] which requires recombination rates estimates between successive SNPs. ShapeIT offers two ways to estimate them:
You can specify the genetic map of the chromosome you want to phase and the effective population size with respectively the options --input-map and --effective-size. To download and format correctly a genetic map for Human, see section 1.8. The default value for the effective population size is 14000. Hereafter, a command line example to specify to use example.gmap as genetic map and 11418 for the effective population size:
shapeit --phase --input-ped example.ped example.map --input-map example.gmap --output-max example.phased --effective-size 11418 |
Notes for the genetic positions:
Most of the SNPs of your dataset MUST have a genetic position specified
into this file. For the SNPs that don't have a genetic position, it is
internally determined with linear interpolation. If the intersection
between your snp map (example.bim example.map) and your genetic map
(example.gmap) is poor, you should verify that the positions in both
files are from the same NCBI genome build (b36 or b37).
Notes for the effective population size:
You can use the effective population sizes estimated for the HapMap
phase II populations:
If you don't have a genetic map for your data, you can estimates the recombination rates with Mettropolis-Hasting (MH) algorithm as done in []. However, it will require a substancial additional computational effort for phasing the data. The MH algorithm has the following parameters:
shapeit --phase --input-ped example.ped example.map --output-max example.phased --infer-rho |
Note that int the above command, no genetic map (example.gmap) is provided since the flag --infer-rho is used.
To phase only the SNPs located in a window, use the following options:
shapeit --phase --input-ped example.ped example.map --input-map example.gmap --output-max example.phased --from-bp 10e6 --to-bp 20e6 |
To exclude individuals from the phasing process, first make a space delimited text file where one line contains a familyID and an individualID. This file should contain the famID and the indID as they appear in the PED/FAM file. Then, give this file to ShapeIT with the option --exclude-ind. Hereafter a short example:
shapeit --phase --input-ped example.ped example.map --input-map example.gmap --output-max example.phased --exclude-ind inds_to_exclude.txt |
Where cat inds_to_exclude.txt gives:
FAM1 IND1
FAM1 IND2
FAM2 IND2
... ...
In section 2.3, we shown how to output the most likely pair of haplotypes for each genotype.
In this section, we show how to output compact representations of the phasing uncertainty.
This uncertainty is captured by ShapeIT thanks to graph representations of the possible haplotype
space of haplotypes associted with each genotype. Nodes of these graphs are haplotype segments while
edges between nodes are the possible phases between haplotype segments. Each edge is weighted by
a phase probability.
Now, let see a short example of phasing the example dataset like we have done in section 2.0, but this time in 2 steps. First, we phase the dataset while capturing the uncertainty with the following command:
shapeit --phase --input-ped example.ped example.map --input-map example.gmap --output-tgraph example.tgraph example_dup.map |
The meanings of the arguments are the following:
Now that we have the haplotype graphs for each genotype, we use them to determine the most likely pairs of haplotypes. For that, use the following command:
shapeit --input-tgraph example.tgraph example_dup.map --output-max example.phased |
The meanings of the arguments are the following:
To output the haplotypes graphs, you can use the text format by using the --output-tgraph option as in the following example:
shapeit --phase --input-ped example.ped example.map --input-map example.gmap --output-tgraph example.tgraph example_dup.map |
However, the text files can require a large amount of HDD memory and large sparsing times for posterior uses. For this reason, we have implemented a more compact file format, named Binary haplotype graph. To make use of it, use the --output-bgraph option as follows:
shapeit --phase --input-ped example.ped example.map --input-map example.gmap --output-bgraph example.bgraph example.bim example.fam |
The file example.bgraph will be musch smaller than example.tgraph, but it remains impossible to open it into a text editor. Note that the files example.bim and example.fam are in the same format than the binary files of Plink.
To specify haplotype graphs files to ShapeIT, use the following command with text files:
shapeit --input-tgraph example.tgraph example.map --output-max example.phased |
Or the following command with binary files:
shapeit --input-bgraph example.bgraph example.bim example.fam --output-max example.phased |
As previously shown, to infer the most likely pair of haplotypes for each individual from the haplotype graphs, use the following command:
shapeit --input-tgraph example.tgraph example_dup.map --output-max example.phased |
Alternatively, to output the haplotypes into the HapMap2 file format, use:
shapeit --input-tgraph example.tgraph example_dup.map --output-hapmap2 example.haplotypes example.legend example.sample |
These options work also when specifying haplotypes graphs with --input-bgraph.
Instead of maxiamizing the path as done in section 3.3, ShapeIT can also sample a pair of haplotypes for each genotype according to the phase probabilities into the haplotypes graphs.
shapeit --input-tgraph example.tgraph example.map --output-max example.phased |
Or the following command with binary files:
shapeit --input-bgraph example.bgraph example.bim example.fam --output-max example.phased |
To know which version you need, use the following command on your computer:
uname -m |
You will usually obtain x86_64 or i686 which indicates you which version you should download. Then, try the dynamic version compiled for your architecture. If it doesn't work, try the static. If a problem persists, then contact me.
Platform | File |
---|---|
GNU/Linux x86_64 static | ShapeIT.Linux.x86_64.static.tar.gz |
GNU/Linux x86_64 dynamic | ShapeIT.Linux.x86_64.dynamic.tar.gz |
GNU/Linux i686 static | ShapeIT.Linux.i686.static.tar.gz |
GNU/Linux i686 dynamic | ShapeIT.Linux.i686.dynamic.tar.gz |