Filtering SNPs
To remove SNPs that have a very low minor allele frequency (MAF), are monomorphic (MAF=0), or have a high missing data rate (MD), use the following command:
plink --file example --maf 0.001 --geno 0.2 --recode --out example_snp_filtered
Here, SNPs with a MAF below 0.001 or with more than 20% missing data are removed from the dataset.
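As a quick sanity check, you can compare the number of SNPs before and after filtering. The sketch below assumes the standard plink .map layout with one SNP per line; the helper names count_snps and snps_removed are hypothetical and are not part of plink.

```shell
# Hypothetical helpers, not plink commands.
# Assumes each line of a .map file describes exactly one SNP.

# Print the number of SNPs (lines) in a .map file.
count_snps() {
    awk 'END { print NR }' "$1"
}

# Print how many SNPs were removed between two .map files.
snps_removed() {
    echo $(( $(count_snps "$1") - $(count_snps "$2") ))
}

# Usage, assuming both files exist in the current directory:
# snps_removed example.map example_snp_filtered.map
```

If the number of removed SNPs is unexpectedly large, the thresholds may be too strict for your dataset.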
Splitting the dataset by chromosome
The dataset must be split by chromosome, because ShapeIT can only phase one chromosome at a time. If you are working on GWA data, you can proceed as follows:
for chr in $(seq 1 22) ; do plink --file example --chr $chr --recode --out example_chr$chr ; done
You will obtain 44 files (example_chr1.ped, example_chr1.map, ..., example_chr22.ped, example_chr22.map) that you can phase separately.
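Before phasing, it can be worth confirming that all 44 files were actually written. The following is a minimal sketch, assuming the example_chrN.ped / example_chrN.map naming used above; the function name missing_chr_files is hypothetical.

```shell
# Hypothetical helper, not a plink command.
# Prints the name of every per-chromosome .ped/.map file that is
# missing or empty for the given prefix (chromosomes 1 to 22).
missing_chr_files() {
    prefix=$1
    for chr in $(seq 1 22); do
        for ext in ped map; do
            f="${prefix}_chr${chr}.${ext}"
            # -s is true only if the file exists and is not empty.
            [ -s "$f" ] || echo "$f"
        done
    done
}

# Usage, assuming the split files are in the current directory:
# missing_chr_files example
```

An empty output means that all expected files are present.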
SNP labels
Each SNP must have a unique name, for example its RS number.
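One way to look for duplicated names is to inspect the second column of the .map file, which holds the SNP identifier in the standard plink layout. This is only a sketch; the function name duplicate_snp_ids is hypothetical.

```shell
# Hypothetical helper, not a plink command.
# Assumes a plink .map file with whitespace-separated columns:
# chromosome, SNP name, genetic distance, base-pair position.
# Prints every SNP name that occurs more than once.
duplicate_snp_ids() {
    awk '{ print $2 }' "$1" | sort | uniq -d
}

# Usage, assuming example.map exists in the current directory:
# duplicate_snp_ids example.map
```

An empty output means that all SNP names are unique.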
SNP positions
Each SNP must have a unique physical position. Check that all SNPs have distinct positions and that they are ordered by physical position.
To order your SNPs by physical position, you can use the following command, since plink automatically reorders the SNPs:
plink --file example --recode --out example_reordered
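To check positions in a .map file, for instance before or after reordering, a small awk script can flag SNPs that share a position or that appear out of order. This is a minimal sketch, assuming the standard plink .map layout (chromosome in column 1, SNP name in column 2, base-pair position in column 4); the function name check_map_positions and its output labels are hypothetical.

```shell
# Hypothetical helper, not a plink command.
# Assumes a plink .map file with whitespace-separated columns:
# chromosome, SNP name, genetic distance, base-pair position.
check_map_positions() {
    awk '
    {
        key = $1 SUBSEP $4
        # Same chromosome and position already seen earlier in the file.
        if (key in seen) print "duplicate_position", $2
        seen[key] = 1
        # Position smaller than the previous SNP on the same chromosome.
        if (NR > 1 && $1 == prev_chr && $4 + 0 < prev_pos) print "out_of_order", $2
        prev_chr = $1
        prev_pos = $4 + 0
    }' "$1"
}

# Usage, assuming example_reordered.map exists in the current directory:
# check_map_positions example_reordered.map
```

An empty output means that no duplicated or misordered positions were found. Note that reordering alone does not remove SNPs that share a position; those still have to be dealt with separately.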