File splitting¶
File splitting divides the whole genome into loci. There are several ways to do this; the systematic approach used here is to split by approximately independent LD blocks.
Approximately independent LD block¶
- Read the range of each LD block from the approximately independent LD block file for the matching population
- Use tabix to extract the SNPs in each LD block range from the GWAS file
- Filter the resulting loci by P-value or Z-score, since there are thousands of LD blocks
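The filtering step in the list above could be sketched as follows. This is only an illustration, not the pipeline's own implementation: the function name `filter_loci`, the P-value column index, and the threshold are assumptions, and the per-locus files are assumed to be whitespace-delimited with no header line (as tabix region queries produce).

```shell
# Print the name of every locus file that contains at least one SNP
# with a P-value below the threshold.
# Assumptions (hypothetical, adjust to your data):
#   - files are whitespace-delimited and have no header line
#   - the P-value sits in column ${pcol}
filter_loci() {
    pcol=$1       # column index of the P-value
    threshold=$2  # significance threshold, e.g. 5e-8
    shift 2
    for f in "$@"; do
        awk -v c="${pcol}" -v t="${threshold}" '
            # only consider numeric-looking values, so "NA" is ignored
            $c ~ /^[0-9.]/ && ($c + 0) < (t + 0) { found = 1; exit }
            END { exit !found }
        ' "${f}" && echo "${f}"
    done
}
```

For example, `filter_loci 9 5e-8 output.*` would list only the loci that contain a genome-wide significant SNP, assuming the P-value is in column 9 of your files.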
Example LD block file¶
chr start stop
chr1 10583 1892607
chr1 1892607 3582736
chr1 3582736 4380811
chr1 4380811 5913893
...
Example code¶
Note
Do not run this code as-is: the file names and column indices are specific to the example files and must be adapted to your own data.
# compress the file using bgzip
bgzip gwas_summary_temp_sorted
# build the index with tabix (column 1: chromosome, column 2: position, skip 1 header line)
tabix -s 1 -b 2 -e 2 -S 1 gwas_summary_temp_sorted.gz
# extract the range of each LD block
cat fourier_ls-all.bed | while read line
do
    array=(${line})
    # strip the "chr" prefix and any spaces
    chrnum=${array[0]#*chr}
    chrnum=${chrnum// /}
    start=${array[1]// /}
    end=${array[2]// /}
    # skip the header line of the LD block file
    if [ "${start}" != "start" ]; then
        # use tabix to extract the LD block from the GWAS file
        tabix gwas_summary_temp_sorted.gz "${chrnum}:${start}-${end}" > "output.${chrnum}_${start}_${end}"
    fi
done
Other methods¶
It is not necessary to split the genome into loci by approximately independent LD blocks. A common alternative is to select the lead SNP and include all SNPs within 500 kb on either side of it. If you have already defined your loci in some other way, you can skip this step.
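As a rough sketch of the lead-SNP approach, the function below takes the SNP with the smallest P-value as the lead SNP and prints every SNP on the same chromosome within a fixed window around it. The function name `extract_lead_window` and the column indices are assumptions for illustration, and the input is assumed to be a whitespace-delimited file without a header line.

```shell
# Print all SNPs within +/- window bp of the lead SNP (smallest P-value).
# Assumptions (hypothetical, adjust to your data):
#   - the file is whitespace-delimited and has no header line
#   - chromosome, position and P-value columns are given by index
extract_lead_window() {
    file=$1
    chrcol=$2
    poscol=$3
    pcol=$4
    window=${5:-500000}  # default: 500 kb on each side
    # the file is read twice: pass 1 finds the lead SNP, pass 2 prints the window
    awk -v c="${chrcol}" -v b="${poscol}" -v p="${pcol}" -v w="${window}" '
        NR == FNR {
            # ignore non-numeric P-values such as "NA"
            if ($p ~ /^[0-9.]/ && (!seen || ($p + 0) < min)) {
                min = $p + 0; chr = $c; pos = $b + 0; seen = 1
            }
            next
        }
        seen && $c == chr && ($b + 0) >= pos - w && ($b + 0) <= pos + w
    ' "${file}" "${file}"
}
```

For example, `extract_lead_window locus.txt 1 2 9 > lead_window.txt` would write the 500 kb window around the lead SNP, assuming chromosome, position, and P-value are in columns 1, 2, and 9. This handles one lead SNP per input file; with a tabix-indexed GWAS file, the same window could instead be retrieved with a region query as in the example code above.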