Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

🎉 add hardfilter annotation filter plotting #59

Merged
merged 8 commits into from
Jan 16, 2025
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
1 change: 1 addition & 0 deletions docs/GERMLINE_SNV_README.md
Original file line number Diff line number Diff line change
Expand Up @@ -121,6 +121,7 @@ For more information on the specific annotations, please see the documentation.
- `peddy_html`: HTML metrics files from Peddy
- `peddy_csv`: CSV metrics for het_check, ped_check, and sex_check from Peddy
- `peddy_ped`: PED file with additional metrics information from Peddy
- `gatk_annotation_plots`: TAR file containing plots of all annotations used for hardfiltering
- Strelka2
- `strelka2_prepass_variants`: Raw variants output from Strelka2
- `strelka2_gvcfs`: gVCF output from Strelka2
Expand Down
1 change: 1 addition & 0 deletions docs/GERMLINE_VARIANT_README.md
Original file line number Diff line number Diff line change
Expand Up @@ -146,6 +146,7 @@ user must provide the associated gVCF in the `input_gvcf` input.
- `peddy_csv`: CSV metrics for het_check, ped_check, and sex_check from Peddy
- `peddy_ped`: PED file with additional metrics information from Peddy
- `verifybamid_output`: VerifyBAMID output, including contamination score
- `gatk_annotation_plots`: TAR file containing plots of all annotations used for hardfiltering
- Strelka2
- `strelka2_prepass_variants`: Raw variants output from Strelka2
- `strelka2_gvcfs`: gVCF output from Strelka2
Expand Down
3 changes: 3 additions & 0 deletions docs/dockers_gatk_genotyping.md
Original file line number Diff line number Diff line change
Expand Up @@ -15,13 +15,16 @@ gatk_gathervcfs.cwl|pgc-images.sbgenomics.com/d3b-bixu/gatk:4.0.12.0
gatk_genomicsdbimport_genotypegvcfs.cwl|pgc-images.sbgenomics.com/d3b-bixu/gatk:4.0.12.0
gatk_indelsvariantrecalibrator.cwl|pgc-images.sbgenomics.com/d3b-bixu/gatk:4.0.12.0
gatk_makesitesonlyvcf.cwl|pgc-images.sbgenomics.com/d3b-bixu/gatk:4.0.12.0
gatk_plot_annotations.cwl|pgc-images.sbgenomics.com/d3b-bixu/tidyverse:4.4.2-gatk-plotter
gatk_selectvariants.cwl|pgc-images.sbgenomics.com/d3b-bixu/gatk:4.2.0.0R
gatk_snpsvariantrecalibratorcreatemodel.cwl|pgc-images.sbgenomics.com/d3b-bixu/gatk:4.0.12.0
gatk_snpsvariantrecalibratorscattered.cwl|pgc-images.sbgenomics.com/d3b-bixu/gatk:4.0.12.0
gatk_variantfiltration.cwl|pgc-images.sbgenomics.com/d3b-bixu/gatk:4.2.0.0R
gatk_variantstotable.cwl|broadinstitute/gatk:4.6.1.0
generic_rename_outputs.cwl|None
kfdrc_peddy_tool.cwl|pgc-images.sbgenomics.com/d3b-bixu/peddy:latest
normalize_vcf.cwl|pgc-images.sbgenomics.com/d3b-bixu/vcfutils:latest
picard_collectvariantcallingmetrics.cwl|pgc-images.sbgenomics.com/d3b-bixu/gatk:4.0.12.0
script_dynamicallycombineintervals.cwl|pgc-images.sbgenomics.com/d3b-bixu/python:2.7.13
tar.cwl|None
variant_effect_predictor_105.cwl|ensemblorg/ensembl-vep:release_105.0
3 changes: 3 additions & 0 deletions docs/dockers_germline_variant.md
Original file line number Diff line number Diff line change
Expand Up @@ -41,12 +41,14 @@ gatk_intervallisttobed.cwl|broadinstitute/gatk:4.4.0.0
gatk_intervallisttools.cwl|broadinstitute/gatk:4.4.0.0
gatk_makesitesonlyvcf.cwl|pgc-images.sbgenomics.com/d3b-bixu/gatk:4.0.12.0
gatk_mergevcfs.cwl|pgc-images.sbgenomics.com/d3b-bixu/gatk:4.1.1.0
gatk_plot_annotations.cwl|pgc-images.sbgenomics.com/d3b-bixu/tidyverse:4.4.2-gatk-plotter
gatk_postprocessgermlinecnvcalls.cwl|broadinstitute/gatk:4.2.0.0
gatk_preprocessintervals.cwl|pgc-images.sbgenomics.com/d3b-bixu/gatk:4.2.0.0R
gatk_selectvariants.cwl|pgc-images.sbgenomics.com/d3b-bixu/gatk:4.2.0.0R
gatk_snpsvariantrecalibratorcreatemodel.cwl|pgc-images.sbgenomics.com/d3b-bixu/gatk:4.0.12.0
gatk_snpsvariantrecalibratorscattered.cwl|pgc-images.sbgenomics.com/d3b-bixu/gatk:4.0.12.0
gatk_variantfiltration.cwl|pgc-images.sbgenomics.com/d3b-bixu/gatk:4.2.0.0R
gatk_variantstotable.cwl|broadinstitute/gatk:4.6.1.0
generic_rename_outputs.cwl|None
guess_bin_size.cwl|None
kfdrc_peddy_tool.cwl|pgc-images.sbgenomics.com/d3b-bixu/peddy:latest
Expand All @@ -60,5 +62,6 @@ scatter_ploidy_calls_by_sample.cwl|pgc-images.sbgenomics.com/d3b-bixu/gatk:4.2.0
script_dynamicallycombineintervals.cwl|pgc-images.sbgenomics.com/d3b-bixu/python:2.7.13
strelka2_germline.cwl|pgc-images.sbgenomics.com/d3b-bixu/strelka:v2.9.10
svaba.cwl|pgc-images.sbgenomics.com/d3b-bixu/svaba:1.1.0
tar.cwl|None
variant_effect_predictor_105.cwl|ensemblorg/ensembl-vep:release_105.0
verifybamid_contamination_conditional.cwl|pgc-images.sbgenomics.com/d3b-bixu/verifybamid:1.0.2
57 changes: 57 additions & 0 deletions scripts/gatk_plot_annotations.R
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
#!/usr/bin/env Rscript

# plotting.R script loads ggplot and gridExtra libraries and defines functions to plot variant annotations
# Usage:
# Rscript gatk_plot_annotations.R <output_basename> <input_type> <input_file> <input_field_1> <input_field_2> ... <input_field_n>
# <output_basename> <input_type> and <input_file> are required
# <input_field_1> <input_field_2> ... <input_field_n> are optional overrides to the field names that will be plotted

library(ggplot2)
library(gridExtra)
library(readr)

# Function for making density plots of a single annotation
makeDensityPlot <- function(dataframe, xvar, split, xmin=min(dataframe[xvar], na.rm=TRUE), xmax=max(dataframe[xvar], na.rm=TRUE), alpha=0.5, log10=FALSE) {
if(missing(split)) {
plot = ggplot(data=dataframe, aes_string(x=xvar)) + xlim(xmin,xmax) + geom_density()
if (log10) {
plot = plot + scale_x_log10()
}
return(plot)
}
else {
return(ggplot(data=dataframe, aes_string(x=xvar, fill=split)) + xlim(xmin,xmax) + geom_density(alpha=alpha) )
}
}

args <- commandArgs(trailingOnly=TRUE)

if (length(args) < 3) {
stop("Three arguments are required. First, an output_basename. Second, the input type. Third, the input table.", call.=FALSE)
}

output_basename <- args[1]
input_type <- args[2]
input_file <- args[3]
input_fields <- if (length(args) > 3) tail(args, -3)

message(paste("Loading", input_file))
sampleSNP <- read_delim(input_file, "\t", escape_double = FALSE, trim_ws = TRUE)

message(paste("Picking annotations for", input_type))
snp_annots <- c("QD", "QUAL", "SOR", "FS", "MQ", "MQRankSum", "ReadPosRankSum")
indel_annots <- c("QD", "QUAL", "FS", "ReadPosRankSum")
mode_annots <- if (input_type=="SNP") snp_annots else indel_annots
picked_annots <- if (is.null(input_fields)) mode_annots else input_fields

for (annot in picked_annots) {
message(paste("Creating Density Plot for", annot))
if (annot %in% c("FS", "QUAL")) {
plot <- makeDensityPlot(sampleSNP, annot, log10 = TRUE)
}
else {
plot <- makeDensityPlot(sampleSNP, annot)
}
message(paste("Saving plot to", paste(output_basename, input_type, annot, "plot.pdf", sep=".")))
ggsave(plot, filename=paste(output_basename, input_type, annot, "plot.pdf", sep="."))
}
64 changes: 64 additions & 0 deletions subworkflows/gatk_plot_genotyping_annotations.cwl
Original file line number Diff line number Diff line change
@@ -0,0 +1,64 @@
cwlVersion: v1.2
class: Workflow
id: gatk_plot_genotyping_annotations
requirements:
- class: StepInputExpressionRequirement
- class: InlineJavascriptRequirement
doc: |-
This workflow performs manual site-level variant filtration on an input VCF using the generic hard-filtering thresholds and example commands in the
[documentation from Broad](https://gatk.broadinstitute.org/hc/en-us/articles/360035531112--How-to-Filter-variants-either-with-VQSR-or-by-hard-filtering#2).

The input VCF is split into SNP and INDEL VCFs using GATK SelectVariants. Those individual VCFs are then filtered using GATK VariantFiltration.
Finally the VCFs are merged back together using bcftools concat and returned.

inputs:
snps_vcf: {type: 'File', secondaryFiles: [{pattern: '.tbi', required: false}], doc: "Input VCF containing SNP variants"}
indels_vcf: {type: 'File', secondaryFiles: [{pattern: '.tbi', required: false}], doc: "Input VCF containing INDEL variants"}
output_basename: {type: 'string', doc: "String value to use as the base for the filename of the output"}
snp_plot_annots: {type: 'string[]?', doc: "The name of a standard VCF field or an INFO field to include in the output table for SNPs"}
indel_plot_annots: {type: 'string[]?', doc: "The name of a standard VCF field or an INFO field to include in the output table for INDELs"}

outputs:
annotation_plots: {type: 'File', outputSource: tar_plots/output}

steps:
gatk_variantstotable_snps:
run: ../tools/gatk_variantstotable.cwl
in:
input_vcf: snps_vcf
fields: snp_plot_annots
output_filename: { valueFrom: "temp.snp.tsv"}
out: [output]
gatk_variantstotable_indels:
run: ../tools/gatk_variantstotable.cwl
in:
input_vcf: indels_vcf
fields: indel_plot_annots
output_filename: {valueFrom: "temp.indel.tsv"}
out: [output]
gatk_plot_annotations_snps:
run: ../tools/gatk_plot_annotations.cwl
in:
input_table: gatk_variantstotable_snps/output
input_type: {valueFrom: "SNP"}
output_basename: output_basename
annotation_fields: snp_plot_annots
out: [plots]
gatk_plot_annotations_indels:
run: ../tools/gatk_plot_annotations.cwl
in:
input_table: gatk_variantstotable_indels/output
input_type: {valueFrom: "INDEL"}
output_basename: output_basename
annotation_fields: indel_plot_annots
out: [plots]
tar_plots:
run: ../tools/tar.cwl
in:
output_filename:
source: output_basename
valueFrom: '$(self).genotyping_annotation_plots.tar.gz'
input_files:
source: [gatk_plot_annotations_snps/plots, gatk_plot_annotations_indels/plots]
valueFrom: '$(self[0].concat(self[1]))'
out: [output]
13 changes: 13 additions & 0 deletions subworkflows/kfdrc-gatk-hardfiltering.cwl
Original file line number Diff line number Diff line change
Expand Up @@ -4,6 +4,7 @@ id: kfdrc-gatk-hardfiltering
requirements:
- class: StepInputExpressionRequirement
- class: InlineJavascriptRequirement
- class: SubworkflowFeatureRequirement
doc: |-
This workflow performs manual site-level variant filtration on an input VCF using the generic hard-filtering thresholds and example commands in the
[documentation from Broad](https://gatk.broadinstitute.org/hc/en-us/articles/360035531112--How-to-Filter-variants-either-with-VQSR-or-by-hard-filtering#2).
Expand All @@ -14,6 +15,8 @@ doc: |-
inputs:
input_vcf: {type: 'File', secondaryFiles: [{pattern: '.tbi', required: true}], doc: "Input VCF containing INDEL and SNP variants"}
output_basename: {type: 'string', doc: "String value to use as the base for the filename of the output"}
snp_plot_annots: {type: 'string[]?', doc: "The name of a standard VCF field or an INFO field to include in the output table for SNPs"}
indel_plot_annots: {type: 'string[]?', doc: "The name of a standard VCF field or an INFO field to include in the output table for INDELs"}
snp_hardfilters: {type: 'string', doc: "String value of hardfilters to set for SNPs in input_vcf" }
indel_hardfilters: {type: 'string', doc: "String value of hardfilters to set for INDELs in input_vcf" }
snp_filtration_extra_args: {type: 'string?', doc: "Any extra arguments for SNP VariantFiltration" }
Expand All @@ -22,6 +25,7 @@ inputs:
filtration_ram: { type: 'int?', doc: "GB of RAM to allocate to GATK VariantFiltration" }

outputs:
annotation_plots: {type: 'File', outputSource: gatk_plot_genotyping_annotations/annotation_plots}
hardfiltered_vcf: {type: 'File', secondaryFiles: [{pattern: '.tbi', required: true}], outputSource: bcftools_concat_snps_indels/output}

steps:
Expand All @@ -39,6 +43,15 @@ steps:
output_basename: output_basename
selection: {valueFrom: "INDEL"}
out: [output]
gatk_plot_genotyping_annotations:
run: ../subworkflows/gatk_plot_genotyping_annotations.cwl
in:
snps_vcf: gatk_selectvariants_snps/output
indels_vcf: gatk_selectvariants_indels/output
output_basename: output_basename
snp_plot_annots: snp_plot_annots
indel_plot_annots: indel_plot_annots
out: [annotation_plots]
gatk_variantfiltration_snps:
run: ../tools/gatk_variantfiltration.cwl
in:
Expand Down
6 changes: 6 additions & 0 deletions tools/filtering_defaults.cwl
Original file line number Diff line number Diff line change
Expand Up @@ -32,6 +32,8 @@ outputs:
indel_ts_filter_level: float?
snp_hardfilter: string?
indel_hardfilter: string?
snp_plot_annots: string[]?
indel_plot_annots: string[]?
expression: |
${
var OUTPUTS = {
Expand All @@ -44,6 +46,8 @@ expression: |
"indel_ts_filter_level": null,
"snp_hardfilter": null,
"indel_hardfilter": null,
"snp_plot_annots": null,
"indel_plot_annots": null,
};
var IS_LOW_DATA = {
"WGS": false,
Expand Down Expand Up @@ -79,10 +83,12 @@ expression: |
'-filter "MQ < 40.0" --filter-name "MQ40_SNP" ' +
'-filter "MQRankSum < -12.5" --filter-name "MQRankSum-12.5_SNP" ' +
'-filter "ReadPosRankSum < -8.0" --filter-name "ReadPosRankSum-8_SNP"',
"snp_plot_annots": ["QD", "QUAL", "SOR", "FS", "MQ", "MQRankSum", "ReadPosRankSum"],
"indel_hardfilter": '-filter "QD < 2.0" --filter-name "QD2" ' +
'-filter "QUAL < 30.0" --filter-name "QUAL30" ' +
'-filter "FS > 200.0" --filter-name "FS200_INDEL" ' +
'-filter "ReadPosRankSum < -20.0" --filter-name "ReadPosRankSum-20_INDEL"',
"indel_plot_annots": ["QD", "QUAL", "FS", "ReadPosRankSum"],
};
var PICKED = IS_LOW_DATA[inputs.experiment_type] ? LOW_DATA_FILTERS : HIGH_DATA_FILTERS[inputs.experiment_type];
return Object.assign({}, OUTPUTS, PICKED);
Expand Down
31 changes: 31 additions & 0 deletions tools/gatk_plot_annotations.cwl
Original file line number Diff line number Diff line change
@@ -0,0 +1,31 @@
cwlVersion: v1.2
class: CommandLineTool
id: gatk_plot_annotations
doc: |
Plot annotations relevant to GATK hardfiltering. Take a TSV table generated
by VariantsToTable and create density charts for relevant annotations.

TAR the results and return.
requirements:
- class: InlineJavascriptRequirement
- class: ShellCommandRequirement
- class: ResourceRequirement
ramMin: $(inputs.ram * 1000)
coresMin: $(inputs.cpu)
- class: DockerRequirement
dockerPull: 'pgc-images.sbgenomics.com/d3b-bixu/tidyverse:4.4.2-gatk-plotter'
- class: InitialWorkDirRequirement
listing:
- entryname: gatk_plot_annotations.R
entry:
$include: ../scripts/gatk_plot_annotations.R
baseCommand: [Rscript, gatk_plot_annotations.R]
inputs:
input_table: { type: 'File', inputBinding: { position: 8 }, doc: "TSV table with variants and annotations." }
input_type: { type: 'string', inputBinding: { position: 7 }, doc: "Type of input. SNP or INDEL." }
output_basename: { type: 'string', inputBinding: { position: 6 }, doc: "String to use as basename for outputs." }
annotation_fields: { type: 'string[]?', inputBinding: { position: 9 }, doc: "Annotation fields being examined." }
ram: { type: 'int?', default: 4, doc: "GB of RAM to allocate to the task." }
cpu: { type: 'int?', default: 2, doc: "Minimum reserved number of CPU cores for the task." }
outputs:
plots: { type: 'File[]', outputBinding: { glob: '*.pdf' } }
34 changes: 34 additions & 0 deletions tools/gatk_variantstotable.cwl
Original file line number Diff line number Diff line change
@@ -0,0 +1,34 @@
cwlVersion: v1.2
class: CommandLineTool
id: gatk_variantstotable
doc: |
Extract fields from a VCF file to a tab-delimited table
requirements:
- class: InlineJavascriptRequirement
- class: ShellCommandRequirement
- class: ResourceRequirement
ramMin: $(inputs.ram * 1000)
coresMin: $(inputs.cpu)
- class: DockerRequirement
dockerPull: 'broadinstitute/gatk:4.6.1.0'
baseCommand: [gatk, VariantsToTable]
inputs:
input_vcf: { type: 'File', secondaryFiles: [{pattern: '.tbi', required: false}], inputBinding: { position: 2, prefix: "--variant" }, doc: "The input VCF file to convert to a table." }
input_intervals: { type: 'File?', inputBinding: { position: 2, prefix: "--intervals" }, doc: "One or more genomic intervals over which to operate" }
reference: { type: 'File?', secondaryFiles: [{pattern: '.fai', required: true}, {pattern: '^.dict', required: true}], inputBinding: { position: 2, prefix: "--reference" }, doc: "Reference sequence" }
fields:
type:
- 'null'
- type: array
items: string
inputBinding:
prefix: '--fields'
inputBinding:
position: 2
doc: "The name of a standard VCF field or an INFO field to include in the output table"
output_filename: { type: 'string?', default: "out.interval_list", inputBinding: { position: 2, prefix: "--output" }, doc: "Name for output file." }
extra_args: { type: 'string?', inputBinding: { position: 2 }, doc: "Extra args for this task" }
ram: { type: 'int?', default: 4, doc: "GB of RAM to allocate to the task." }
cpu: { type: 'int?', default: 2, doc: "Minimum reserved number of CPU cores for the task." }
outputs:
output: { type: 'File', outputBinding: { glob: $(inputs.output_filename) } }
29 changes: 29 additions & 0 deletions tools/tar.cwl
Original file line number Diff line number Diff line change
@@ -0,0 +1,29 @@
cwlVersion: v1.2
class: CommandLineTool
id: tar
requirements:
- class: ShellCommandRequirement
- class: InlineJavascriptRequirement
- class: LoadListingRequirement
- class: ResourceRequirement
coresMin: $(inputs.cpu)
ramMin: $(inputs.ram * 1000)
- class: InitialWorkDirRequirement
listing: $(inputs.input_files)
baseCommand: [tar, czf]
inputs:
output_filename: { type: 'string', inputBinding: { position: 1 } }
input_files:
type:
type: array
items: File
inputBinding:
valueFrom: $(self.basename)
inputBinding: { position: 9 }
cpu: { type: 'int?', default: 1, doc: "Number of threads to use." }
ram: { type: 'int?', default: 4, doc: "GB of RAM to allocate to this task." }
outputs:
output:
type: File
outputBinding:
glob: $(inputs.output_filename)
3 changes: 2 additions & 1 deletion workflows/kfdrc-germline-snv-wf.cwl
Original file line number Diff line number Diff line change
Expand Up @@ -279,6 +279,7 @@ outputs:
peddy_html: {type: 'File[]', doc: 'html summary of peddy results', outputSource: single_sample_genotyping/peddy_html}
peddy_csv: {type: 'File[]', doc: 'csv details of peddy results', outputSource: single_sample_genotyping/peddy_csv}
peddy_ped: {type: 'File[]', doc: 'ped format summary of peddy results', outputSource: single_sample_genotyping/peddy_ped}
gatk_annotation_plots: {type: 'File?', doc: "TAR file containing plots of all annotations used for hardfiltering", outputSource: single_sample_genotyping/annotation_plots}
freebayes_unfiltered_vcf: {type: 'File', outputSource: freebayes/unfiltered_vcf}
strelka2_prepass_variants: {type: 'File', outputSource: strelka2/prepass_variants_vcf}
strelka2_gvcfs: {type: 'File[]', outputSource: strelka2/genome_vcfs}
Expand Down Expand Up @@ -479,7 +480,7 @@ steps:
cadd_indels: cadd_indels
cadd_snvs: cadd_snvs
intervar: intervar
out: [collectvariantcallingmetrics, peddy_html, peddy_csv, peddy_ped, vep_annotated_vcf]
out: [collectvariantcallingmetrics, peddy_html, peddy_csv, peddy_ped, vep_annotated_vcf, annotation_plots]
hints:
- class: "sbg:maxNumberOfParallelInstances"
value: 3
Expand Down
4 changes: 3 additions & 1 deletion workflows/kfdrc-germline-variant-wf.cwl
Original file line number Diff line number Diff line change
Expand Up @@ -456,6 +456,7 @@ outputs:
peddy_html: {type: 'File[]?', doc: 'html summary of peddy results', outputSource: snv/peddy_html}
peddy_csv: {type: 'File[]?', doc: 'csv details of peddy results', outputSource: snv/peddy_csv}
peddy_ped: {type: 'File[]?', doc: 'ped format summary of peddy results', outputSource: snv/peddy_ped}
gatk_annotation_plots: {type: 'File?', doc: "TAR file containing plots of all annotations used for hardfiltering", outputSource: snv/gatk_annotation_plots}
vep_annotated_gatk_vcf: {type: 'File[]?', outputSource: snv/vep_annotated_gatk_vcf}
vep_annotated_strelka_vcf: {type: 'File[]?', outputSource: snv/vep_annotated_strelka_vcf}
vep_annotated_freebayes_vcf: {type: 'File[]?', outputSource: snv/vep_annotated_freebayes_vcf}
Expand Down Expand Up @@ -609,7 +610,8 @@ steps:
run_freebayes: run_freebayes
run_strelka: run_strelka
out: [gatk_gvcf, gatk_gvcf_metrics, verifybamid_output, gatk_vcf_metrics, peddy_html, peddy_csv, peddy_ped, vep_annotated_gatk_vcf,
freebayes_unfiltered_vcf, vep_annotated_freebayes_vcf, vep_annotated_strelka_vcf, strelka2_prepass_variants, strelka2_gvcfs]
freebayes_unfiltered_vcf, vep_annotated_freebayes_vcf, vep_annotated_strelka_vcf, strelka2_prepass_variants, strelka2_gvcfs,
gatk_annotation_plots]
sv:
run: ../workflows/kfdrc-germline-sv-wf.cwl
in:
Expand Down
Loading