Skip to content

SciLifeLab/joint_variant_calling

This branch is up to date with vezzi/joint_variant_calling:master.

Folders and files

NameName
Last commit message
Last commit date

Latest commit

author
Francesco Vezzi
Mar 15, 2017
afdce07 · Mar 15, 2017

History

16 Commits
Mar 15, 2017
Feb 15, 2017
Aug 3, 2016
Aug 31, 2016
Feb 15, 2017
Feb 15, 2017
Aug 3, 2016
Feb 15, 2017
Sep 22, 2016
Jan 30, 2017

Repository files navigation

joint_variant_calling

Python package to run Join Calling on population at NGI (National Genomics Infrastructure) Sweden. The script joint_variant_calling.py implements the GATK-Workflow described in https://www.broadinstitute.org/gatk/guide/article?id=3893

With option --mixed-positions VQSR step is executed following best-practice described in http://gatkforums.broadinstitute.org/gatk/discussion/2805/howto-recalibrate-variant-quality-scores-run-vqsr in order to avoid the problem with MIXED positions (i.e., position where an indel overlaps a SNP). This is the mode used to run SweGen dataset.

Samples to be be join called can be specified in two ways:

  • in the sample field of the config.yaml file to be provided as input. Complete path to the gvcf file needs to be provided
  • a file named 00_samples.txt in the cwd. If present samples specified in this file will be used instead of those speccified in the config

If run like python joint_variant_calling.py --config config.yaml it creates the following folder structure

  • 00_intervals: optional see Intervals section
  • 00_samples.txt: samples that are processed (i.e., samples that are join called)
  • 01_CombineGVCFs: step one is CombineGVCFs, batching gvcf files together
  • 02_GenotypeGVCFs: then run GenotypeGVCFs
  • 03_CatVariants: merge the gvcfs into one in case computation has been spread out into intervals (see Intervals section)
  • 04_SelectVariants: extract SNPs and INDELs and run eveluation tool from GATK to prepare VQSR
  • 05_VariantRecalibrator: first step of VQSR
  • 06_ApplyRecalibration: second step of VQSR

If run like python joint_variant_calling.py --config config.yaml --mixed-positions it creates the following folder structure

Folders 01, 02, ..., 06 are all organised in the same way:

  • sbatch: sbatch files to be submitted to the slurm queue (Uppmax assumed)
  • std_err: output of the standard error
  • std_out: output of the standard output
  • VCF: contains the gvcf or vcf files (in general results files, in case of 05_VariantRecalibrator contains recalibration tables)

Folder 01_CombineGVCFs contains an extra sub-folder:

  • batches: used to restart joint calling if new samples are available [UNSTABLE: UNDER TEST]

If run like python joint_variant_calling.py --config config.yaml --resume it resumes the joint calling adding new samples and recomputing only the last (if needed) and the new batches (in 01_CombineGVCFs)

Intervals

The most time consuming steps of the workflow are 01_CombineGVCFs and 02_GenotypeGVCFs. These two steps can be parallised running the commands on non overlapping sections of the genome. For this purpose an utility script is provided: python utils/create_interval_list.py --dict intervals/human_g1k_v37.dict --block 1000000000 the file intervals/human_g1k_v37.dict can be found in this repo. If run in a directory (e.g. 00_intervals) it creates the intrvals files (.intervals). This folder need to be provided in the config.yaml file. Running the example command will create 4 blocks, first 3 of 1Gbp and the last one of ~200Mbp.

The example directory in this repo contains a dry-run of joint_variant_calling.py run on intervals generated by this command, on 7 samples, in batches of 4 (i.e.,this creates two batches, one of 4 and one of 3 samples)

About

No description, website, or topics provided.

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Shell 98.8%
  • Python 1.2%