Filtering SNPs

Created Mar 16, 2011

To remove SNPs that have a very low minor allele frequency (MAF), are monomorphic (MAF=0), and/or have a high missing data rate (MD), use the following command:

plink --file example --maf 0.001 --geno 0.2 --recode --out example_snp_filtered

Here, SNPs with a MAF below 0.001 or with more than 20% missing data are removed from the dataset.

Note: ShapeIT handles monomorphic SNPs in the dataset (MAF=0.0), but they slow down computations. In contrast, totally missing SNPs are not handled and must be removed from the dataset.
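
To see which SNPs are totally missing before filtering, one option is to inspect a per-SNP missingness report. The sketch below assumes such a report was produced with plink --missing (for example: plink --file example --missing --out example), and that the resulting .lmiss file has a header line containing SNP and F_MISS columns. The function name fully_missing_snps is only illustrative.

```shell
# List SNPs whose missing rate (F_MISS) equals 1, i.e. totally missing SNPs.
# $1 is assumed to be a per-SNP missingness report (.lmiss) with a header line.
fully_missing_snps() {
    awk '
        NR == 1 {
            # Locate the columns by name rather than by fixed position.
            for (i = 1; i <= NF; i++) {
                if ($i == "SNP")    snp = i
                if ($i == "F_MISS") fmiss = i
            }
            next
        }
        snp && fmiss && $fmiss + 0 == 1 { print $snp }
    ' "$1"
}

# Example: fully_missing_snps example.lmiss
```

An empty output suggests that no SNP is totally missing. Any SNP listed here is also removed by the --geno 0.2 filter above.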

Splitting the dataset by chromosome

Created Mar 17, 2011

ShapeIT can only phase one chromosome at a time, so you have to split the dataset by chromosome. If you are working with GWA data, you can proceed as follows:

for chr in $(seq 1 22) ; do plink --file example --chr $chr --recode --out example_chr$chr ; done

You will obtain 44 files (example_chr1.ped, example_chr1.map, ..., example_chr22.ped, example_chr22.map) that you can phase separately.
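
As a quick sanity check after the split, the following sketch lists any of the 44 expected files that is missing or empty. The function name missing_chr_files is hypothetical, and the prefix is assumed to match the one given to --out (here example_chr).

```shell
# Print the name of every expected per-chromosome file that is missing or empty.
# $1 is the file prefix used with --out (assumed here to be "example_chr").
missing_chr_files() {
    prefix=$1
    for chr in $(seq 1 22); do
        for ext in ped map; do
            f="${prefix}${chr}.${ext}"
            # -s is true only if the file exists and is not empty.
            [ -s "$f" ] || echo "$f"
        done
    done
}

# Example: missing_chr_files example_chr
```

If nothing is printed, all 22 .ped/.map pairs are present.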

SNP labels

Created Mar 18, 2011

Each SNP must have a unique name, for example its RS number.
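
One possible way to check this is to look for repeated identifiers in the .map file, where the second column holds the SNP name. This is a minimal sketch; the function name duplicated_snp_names is only illustrative.

```shell
# List SNP names that occur more than once in a .map file.
# Column 2 of a .map file holds the SNP identifier.
duplicated_snp_names() {
    awk '{ print $2 }' "$1" | sort | uniq -d
}

# Example: duplicated_snp_names example.map
```

An empty output means that every SNP name in the file is unique.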

SNP positions

Created Mar 18, 2011

Each SNP must have a unique physical position. You must check that all the SNPs have distinct positions and that they are ordered by physical position.

To order your SNPs by physical position, you can use the following command, since plink automatically reorders the SNPs:

plink --file example --recode --out example_reordered
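
To verify that positions are distinct and increasing, the .map file can be scanned line by line. The sketch below assumes the usual four-column .map layout (chromosome, SNP name, genetic distance, physical position); the function name check_snp_positions is hypothetical.

```shell
# Report SNPs whose position is not strictly greater than that of the
# previous SNP on the same chromosome (duplicated or out-of-order positions).
# Assumed .map columns: chromosome, SNP name, genetic distance, bp position.
check_snp_positions() {
    awk '
        NR > 1 && $1 == chr && $4 + 0 <= pos { print $2, $4; bad = 1 }
        { chr = $1; pos = $4 + 0 }
        END { exit bad + 0 }
    ' "$1"
}

# Example: check_snp_positions example_reordered.map
```

The function prints the name and position of each offending SNP and returns a non-zero status if it finds any, so it can be used in a script as a guard before phasing.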