Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

👨‍🔬 Beta Workflow to Modify Ploidy #145

Merged
merged 11 commits into from
Oct 21, 2024
34 changes: 34 additions & 0 deletions docs/KFDRC_GATK_HC_MOD_PLOIDY_README.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
# Kids First DRC GATK HaplotypeCaller Modified Ploidy BETA Workflow
This is a research workflow for users wishing to modify the ploidy of certain
regions of their existing GVCF calls.

## Inputs
### Ploidy-related
- input_gvcf: GVCF generated in standard workflow
- region: Specific region to pull, in format 'chr21' or 'chr3:1-1000'
- re_calling_interval_list: Interval list to re-call __in GATK Picard-style Interval List format__
- sample_ploidy: If sample/interval is expected to not have ploidy=2, enter expected ploidy
> For example, for trisomy 21, one might do:
>- region = chr21
>- sample_ploidy = 3
>- re_calling_interval_list = the regions of chr21 that are expected to be ploidy = 3
### Typical haplotype calling
- input_cram: Input CRAM file
- biospecimen_name: String name of biospcimen
- output_basename: String to use as the base for output filenames
- reference_fasta: FASTA file that was used during alignment. Also need
corresponding `.fai` and `.dict` files.
- dbsnp_vcf: dbSNP vcf file
- dbsnp_idx: dbSNP vcf index file
- contamination: Precalculated contamination value. Providing the value here
will skip the run of VerifyBAMID and use the provided value as ground truth.
- contamination_sites_bed: .Bed file for markers used in this
analysis,format(chr\tpos-1\tpos\trefAllele\taltAllele)
- contamination_sites_mu: .mu matrix file of genotype matrix
- contamination_sites_ud: .UD matrix file from SVD result of genotype matrix
- wgs_evaluation_interval_list: Target intervals to restrict GVCF metric
analysis (for VariantCallingMetrics)

## Outputs

- mixed_ploidy_gvcf: Updated complete GVCF in which the desired region has had its ploidy updated
2 changes: 1 addition & 1 deletion docs/dockers_gatk_cramtogvcf.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,7 +7,7 @@ gatk_indexfeaturefile.cwl|pgc-images.sbgenomics.com/d3b-bixu/gatk:4.1.7.0R
picard_collectgvcfcallingmetrics.cwl|pgc-images.sbgenomics.com/d3b-bixu/picard:2.18.9R
picard_intervallisttools.cwl|pgc-images.sbgenomics.com/d3b-bixu/picard:2.18.9R
picard_mergevcfs_python_renamesample.cwl|pgc-images.sbgenomics.com/d3b-bixu/picard:2.18.9R
samtools_cram_to_bam.cwl|pgc-images.sbgenomics.com/d3b-bixu/samtools:1.8-dev
samtools_cram_to_bam.cwl|pgc-images.sbgenomics.com/d3b-bixu/samtools:1.9
samtools_idxstats_xy_ratio.cwl|pgc-images.sbgenomics.com/d3b-bixu/samtools:1.9
untar_indexed_reference_2.cwl|None
verifybamid_contamination_conditional.cwl|pgc-images.sbgenomics.com/d3b-bixu/verifybamid:1.0.2
6 changes: 4 additions & 2 deletions subworkflows/kfdrc_bam_to_gvcf.cwl
Original file line number Diff line number Diff line change
Expand Up @@ -12,15 +12,16 @@ inputs:
contamination_sites_bed: File
contamination_sites_mu: File
contamination_sites_ud: File
input_bam: File
indexed_reference_fasta: File
input_bam: { type: File, secondaryFiles: ['^.bai'] }
indexed_reference_fasta: { type: File, secondaryFiles: ['.fai', '^.dict'] }
output_basename: string
wgs_calling_interval_list: File
dbsnp_vcf: File
dbsnp_idx: File?
reference_dict: File
wgs_evaluation_interval_list: File
conditional_run: int
sample_ploidy: { type: 'int?', doc: "If sample/interval is expected to not have ploidy=2, enter expected ploidy" }

outputs:
verifybamid_output: {type: File, outputSource: verifybamid_checkcontam_conditional/output}
Expand Down Expand Up @@ -63,6 +64,7 @@ steps:
input_bam: input_bam
interval_list: picard_intervallisttools/output
reference: indexed_reference_fasta
sample_ploidy: sample_ploidy
scatter: [interval_list]
out: [output]

Expand Down
46 changes: 46 additions & 0 deletions tools/bcftools_amend_vcf_header.cwl
Original file line number Diff line number Diff line change
@@ -0,0 +1,46 @@
cwlVersion: v1.2
class: CommandLineTool
id: bcftools-amend-vcf-header
doc: "A specialized tool to make clear that GATK HC was run again"
requirements:
- class: ShellCommandRequirement
- class: DockerRequirement
dockerPull: 'pgc-images.sbgenomics.com/d3b-bixu/bcftools:1.20'
- class: ResourceRequirement
ramMin: 16000
coresMin: 8
- class: InlineJavascriptRequirement
- class: InitialWorkDirRequirement
listing:
- entryname: "amend_header.sh"
entry: |
#!/usr/bin/env bash
set -xeo pipefail

awk 'NR == FNR && $line ~ /^##GATKCommandLine.HaplotypeCaller/ {sub(/Caller,/, "Caller_rpt_subset,"); newcmd=$line}; NR != FNR; NR != FNR && $line ~ /^##GATKCommandLine.HaplotypeCaller/ {print newcmd}' <(bcftools head $2) <(bcftools head $1) > header_build.txt

baseCommand: []
arguments:
- position: 0
shellQuote: false
valueFrom: >-
/bin/bash amend_header.sh
- position: 3
shellQuote: false
valueFrom: >-
&& bcftools reheader --threads $(inputs.threads) -h header_build.txt $(inputs.input_vcf.path) > $(inputs.output_basename).ploidy_mod.g.vcf.gz
&& tabix $(inputs.output_basename).ploidy_mod.g.vcf.gz

inputs:
input_vcf: { type: File, secondaryFiles: ['.tbi'], doc: "Reconstituted VCF with modified ploidy region calls",
inputBinding: { position: 1 } }
mod_vcf: { type: File, secondaryFiles: ['.tbi'], doc: "VCF with modified region cals only. Header will be used to modift input_vcf",
inputBinding: { position: 2 } }
threads: { type: 'int?', default: 4 }
output_basename: string
outputs:
header_amended_vcf:
type: File
outputBinding:
glob: "*.{v,b}cf{,.gz}"
secondaryFiles: ['.tbi?']
43 changes: 43 additions & 0 deletions tools/bedtools_intersect.cwl
Original file line number Diff line number Diff line change
@@ -0,0 +1,43 @@
cwlVersion: v1.0
class: CommandLineTool
id: bedtools_intersect
doc: "Subset VCF with bedtools intersect. Can add -v flag for negative selection"
requirements:
- class: ShellCommandRequirement
- class: InlineJavascriptRequirement
- class: ResourceRequirement
ramMin: 8000
coresMin: 4
- class: DockerRequirement
dockerPull: 'pgc-images.sbgenomics.com/d3b-bixu/vcfutils:latest'

baseCommand: [/bin/bash -c]
arguments:
- position: 1
shellQuote: false
valueFrom: >-
'set -eo pipefail; bedtools intersect -wa -header
- position: 3
shellQuote: false
valueFrom: >-
| bgzip -c -@ 4 > $(inputs.output_basename).bed_intersect.vcf.gz'
- position: 4
shellQuote: false
valueFrom: >-
&& tabix $(inputs.output_basename).bed_intersect.vcf.gz

inputs:
input_vcf: { type: File, secondaryFiles: ['.tbi'], doc: "Input VCF file.",
inputBinding: { position: 2, prefix: "-a" } }
input_bed_file: { type: File, doc: "bed intervals to intersect with.",
inputBinding: { position: 2, prefix: "-b" } }
output_basename: string
inverse: {type: 'boolean?', doc: "Select whatever is NOT in the interval bed file",
inputBinding: { position: 2, prefix: "-v"} }

outputs:
intersected_vcf:
type: File
outputBinding:
glob: '*.bed_intersect.vcf.gz'
secondaryFiles: ['.tbi']
2 changes: 2 additions & 0 deletions tools/gatk_haplotypecaller.cwl
Original file line number Diff line number Diff line change
Expand Up @@ -40,5 +40,7 @@ inputs:
input_bam: { type: File, secondaryFiles: [^.bai], doc: "Input bam file" }
interval_list: { type: File, doc: "File containing one or more genomic intervals over which to operate" }
contamination: { type: float, doc: "Fraction of contamination in sequencing data (for all samples) to aggressively remove" }
sample_ploidy: { type: 'int?', doc: "If sample/interval is expected to not have ploidy=2, enter expected ploidy",
inputBinding: {position: 1, prefix: "--sample_ploidy" } }
outputs:
output: { type: File, outputBinding: { glob: '*.vcf.gz' }, secondaryFiles: [.tbi] }
26 changes: 26 additions & 0 deletions tools/gatk_intervallist_to_bed.cwl
Original file line number Diff line number Diff line change
@@ -0,0 +1,26 @@
cwlVersion: v1.0
class: CommandLineTool
id: gatk4_intervallist2bed
requirements:
- class: ShellCommandRequirement
- class: InlineJavascriptRequirement
- class: DockerRequirement
dockerPull: 'pgc-images.sbgenomics.com/d3b-bixu/gatk:4.1.1.0'
- class: ResourceRequirement
ramMin: 2000

baseCommand: [/gatk, IntervalListToBed, --java-options, -Xmx100m]
arguments:
- position: 1
shellQuote: false
valueFrom: >-
-O $(inputs.interval_list.nameroot).bed

inputs:
interval_list: { type: File, doc: "Interval list to convert",
inputBinding: { position: 0, prefix: "-I"} }
outputs:
output:
type: File
outputBinding:
glob: '*.bed'
18 changes: 14 additions & 4 deletions tools/picard_mergevcfs.cwl
Original file line number Diff line number Diff line change
Expand Up @@ -10,24 +10,34 @@ requirements:
- class: ShellCommandRequirement
- class: ResourceRequirement
ramMin: 3000
coresMin: 4
- class: DockerRequirement
dockerPull: 'pgc-images.sbgenomics.com/d3b-bixu/picard:2.18.9R'
baseCommand: [ java, -Xms2000m, -jar, /picard.jar, MergeVcfs]
baseCommand: [ "/bin/bash", "-c" ]
arguments:
- position: 1
shellQuote: false
valueFrom: >-
OUTPUT=$(inputs.output_vcf_basename).g.vcf.gz
'set -eo pipefail;
java -Xms2000m -jar /picard.jar MergeVcfs
OUTPUT=/dev/stdout
CREATE_INDEX=false
VERBOSITY=WARNING
- position: 3
shellQuote: false
valueFrom: >-
| bgzip -c -@ 4 > $(inputs.output_vcf_basename).g.vcf.gz && tabix -p vcf $(inputs.output_vcf_basename).g.vcf.gz'
inputs:
input_vcf:
type:
type: array
items: File
inputBinding:
prefix: INPUT=
prefix: "INPUT="
separate: false
position: 2
inputBinding: { position: 2, shellQuote: false}
secondaryFiles: [.tbi]
doc: "List of input VCF files"
output_vcf_basename: { type: string, doc: "String to be used as the base filename for the output" }
outputs:
output: { type: File, outputBinding: { glob: '*.vcf.gz' }, secondaryFiles: [.tbi] }
19 changes: 12 additions & 7 deletions tools/samtools_cram_to_bam.cwl
Original file line number Diff line number Diff line change
Expand Up @@ -8,19 +8,24 @@ requirements:
ramMin: 16000
coresMin: $(inputs.threads)
- class: DockerRequirement
dockerPull: 'pgc-images.sbgenomics.com/d3b-bixu/samtools:1.8-dev'
baseCommand: [samtools, view]
dockerPull: 'pgc-images.sbgenomics.com/d3b-bixu/samtools:1.9'
baseCommand: [samtools, view, -b]
arguments:
- position: 1
- position: 4
shellQuote: false
valueFrom: >-
-b -T $(inputs.reference.path) -@ $(inputs.threads) $(inputs.input_cram.path) > $(inputs.output_basename).bam
> $(inputs.output_basename).bam
&& samtools index -@ $(inputs.threads) $(inputs.output_basename).bam $(inputs.output_basename).bai
inputs:
input_cram: File
input_cram: { type: File, secondaryFiles: ['.crai'], doc: "cram file to convert",
inputBinding: { position: 2 } }
region: { type: 'string?', doc: "Specific region to pull, in format 'chr21' or 'chr3:1-1000'",
inputBinding: { position: 3 } }
output_basename: string
threads: { type: 'int?', doc: "num threads to use", default: 8}
reference: {type: File, secondaryFiles: [.fai]}
threads: { type: 'int?', doc: "num threads to use", default: 8,
inputBinding: { position: 1, prefix: "-@"} }
reference: {type: File, secondaryFiles: [.fai], doc: "reference file use for cram",
inputBinding: { position: 1, prefix: "--reference" } }
outputs:
output:
type: File
Expand Down
Loading
Loading