|
| 1 | +# Kids First Data Resource Center BWA-GATK Short Reads Alignment and HaplotypeCaller Workflow |
| 2 | + |
| 3 | +<p align="center"> |
| 4 | + <img src="./kids_first_logo.svg" alt="Kids First repository logo" width="660px" /> |
| 5 | +</p> |
| 6 | + |
| 7 | +The Kids First Data Resource Center (KFDRC) BWA-GATK Short Reads Alignment and |
| 8 | +GATK Haplotyper Workflow is a Common Workflow Language (CWL) implementation of |
| 9 | +various software used to take reads generated by next generation sequencing |
| 10 | +(NGS) technologies and use those reads to generate alignment and, optionally, |
| 11 | +variant information. |
| 12 | + |
| 13 | +This workflow is an all-in-one alignment workflow capable of handling any kind |
| 14 | +of reads inputs: BAM inputs, PE reads and mates inputs, SE reads inputs, or |
| 15 | +any combination of these. The workflow will naively attempt to process these |
| 16 | +depending on what you tell it you have provided. The user informs the workflow |
| 17 | +of which inputs to process using three boolean inputs: `run_bam_processing`, |
| 18 | +`run_pe_reads_processing`, and `run_se_reads_processing`. Providing `true` |
| 19 | +values for these as well their corresponding inputs will result in those inputs |
| 20 | +being processed. |
| 21 | + |
| 22 | +In addition to the core alignment functionality, this workflow is also capable |
| 23 | +of the following: |
| 24 | +- Performing Haplotype Calling to generate a gVCF as well as associated metrics |
| 25 | +- Collecting Hybrid-Selection (HS) Sequencing Metrics |
| 26 | +- Collecting Whole Genome Sequencing (WGS) Metrics |
| 27 | +- Collecting Aggregate Sequencing Metrics: Alignment Summary, Insert Size, Sequencing Artifact, GC Bias, and Quality Score Distribution |
| 28 | +- Collecting Sex Chromosome Metrics |
| 29 | +- Collecting HLA Genotyping Data |
| 30 | + |
| 31 | +## Pipeline Software and Versions |
| 32 | + |
| 33 | +Below is a breakdown of the software used in the workflow. Please note that this |
| 34 | +breakdown is manually updated and, as such, may fall out of date with our workflow. |
| 35 | +For the most up to date versions of our software please refer to the Docker |
| 36 | +images present in the workflow. A table of the images and their usage can be |
| 37 | +found [here](./dockers_bwagatk_alignment.md). |
| 38 | + |
| 39 | +| Step | KFDRC BWA-GATK | |
| 40 | +|----------------------------|-------------------------------------| |
| 41 | +| Bam to Read Group (RG) BAM | samtools split | |
| 42 | +| RG Bam to Fastq | biobambam2 bamtofastq | |
| 43 | +| Adapter Trimming | cutadapt | |
| 44 | +| Fastq to RG Bam | bwa mem | |
| 45 | +| Merge RG Bams | sambamba merge | |
| 46 | +| Sort Bam | sambamba sort | |
| 47 | +| Mark Duplicates | samblaster | |
| 48 | +| BaseRecalibration | GATK BaseRecalibrator | |
| 49 | +| ApplyRecalibration | GATK ApplyBQSR | |
| 50 | +| Gather Recalibrated BAMs | Picard GatherBamFiles | |
| 51 | +| Bam to Cram | samtools view | |
| 52 | +| Metrics | Picard | |
| 53 | +| Sex Metrics | samtools idxstats | |
| 54 | +| HLA Genotyping | T1k | |
| 55 | +| Contamination Calculation | VerifyBamID | |
| 56 | +| gVCF Calling | GATK HaplotypeCaller | |
| 57 | +| Gather VCFs | Picard MergeVcfs | |
| 58 | +| Metrics | Picard CollectVariantCallingMetrics | |
| 59 | + |
| 60 | +- [biobambam2](https://github.com/gt1/biobambam2): [2.0.50](https://github.com/gt1/biobambam2/releases/tag/2.0.50-release-20160705161609) |
| 61 | +- [bwa](https://github.com/lh3/bwa): [0.7.17-r1188](https://github.com/lh3/bwa/releases/tag/v0.7.17) |
| 62 | +- [Cutadapt](https://github.com/marcelm/cutadapt): [4.6](https://github.com/marcelm/cutadapt/releases/tag/v4.5) |
| 63 | +- [GATK](https://github.com/broadinstitute/gatk) Our workflow uses three versions of GATK: |
| 64 | + - HaplotypeCaller: [4.beta.1](https://github.com/broadinstitute/gatk/releases/tag/4.beta.1) |
| 65 | + - BaseRecalibration: [4.0.3.0](https://github.com/broadinstitute/gatk/releases/tag/4.0.3.0) |
| 66 | + - File Indexing: [4.1.7.0](https://github.com/broadinstitute/gatk/releases/tag/4.1.7.0) |
| 67 | +- [Picard](https://github.com/broadinstitute/picard): [2.18.9](https://github.com/broadinstitute/picard/releases/tag/2.18.9) |
| 68 | +- [Sambamba](https://github.com/biod/sambamba): [0.6.3](https://github.com/biod/sambamba/releases/tag/v0.6.3) |
| 69 | +- [samblaster](https://github.com/GregoryFaust/samblaster): [0.1.24](https://github.com/GregoryFaust/samblaster/releases/tag/v.0.1.24) |
| 70 | +- [Samtools](https://github.com/samtools/samtools): |
| 71 | + - idxstats, split: [1.9](https://github.com/samtools/samtools/releases/tag/1.9) |
| 72 | + - BAM to CRAM conversion: [1.8](https://github.com/samtools/samtools/releases/tag/1.8) |
| 73 | +- [T1k](https://github.com/mourisl/T1K): [1.0.5](https://github.com/mourisl/T1K/releases/tag/v1.0.5) |
| 74 | +- [VerifyBamID](https://github.com/Griffan/VerifyBamID): [1.0.2](https://github.com/Griffan/VerifyBamID/releases/tag/1.0.2) |
| 75 | + |
| 76 | +### Alignment |
| 77 | + |
| 78 | +The principle function of this pipeline is the realignment of short reads to a reference genome. |
| 79 | + |
| 80 | +#### Inputing Reads |
| 81 | + |
| 82 | +Read information are passed in through the following inputs: |
| 83 | +```yaml |
| 84 | + input_bam_list: "List of input BAM/CRAM/SAM files" |
| 85 | + cram_reference: "If input_bam_list contains CRAM, provided the reference used to generate that CRAM here" |
| 86 | + input_pe_reads_list: "List of input R1 paired end FASTQ reads" |
| 87 | + input_pe_mates_list: "List of input R2 paired end FASTQ reads" |
| 88 | + input_pe_rgs_list: "List of RG strings to use in PE processing" |
| 89 | + input_se_reads_list: "List of input single end FASTQ reads" |
| 90 | + input_se_rgs_list: "List of RG strings to use in SE processing" |
| 91 | +``` |
| 92 | +
|
| 93 | +This workflow accepts reads in any of the of the following formats: |
| 94 | +- Aligned BAM/CRAM/SAM |
| 95 | +- Unaligned BAM/CRAM/SAM |
| 96 | +- Paired End FASTQ |
| 97 | +- Single End FASTQ |
| 98 | +
|
| 99 | +Additionally, these files must have appropriate read group information. For |
| 100 | +aligned and unaligned BAM/CRAM/SAM files, we assume that information is |
| 101 | +contained within the file header. If your files do not have read group (`@RG`) |
| 102 | +header information, you **must** add that information to the file header for |
| 103 | +the pipeline to work! |
| 104 | + |
| 105 | +If the user provides CRAM inputs, they must also provide the reference that was |
| 106 | +used to generate the CRAM file. |
| 107 | + |
| 108 | +FASTQ files do not have headers; their header information must be provided by |
| 109 | +the user at runtime. The inputs that the user provides are handed directly to |
| 110 | +the bwa `-R` option. From bwa: |
| 111 | +``` |
| 112 | +-R Complete read group header line. '\t' can be used in STR and will be |
| 113 | + converted to a TAB in the output SAM. The read group ID will be attached to |
| 114 | + every read in the output. An example is '@RG\tID:foo\tSM:bar'. [null] |
| 115 | +``` |
| 116 | + |
| 117 | +If the user provides multiple single end FASTQs or multiple sets of paired end |
| 118 | +FASTQs, they must keep the files and read group lines in the same order. |
| 119 | +For example: |
| 120 | +``` |
| 121 | +input_pe_reads_list: |
| 122 | + - sample1_R1.fq |
| 123 | + - sample2_R1.fq |
| 124 | + - sample3_R1.fq |
| 125 | +input_pe_mates_list: |
| 126 | + - sample1_R2.fq |
| 127 | + - sample2_R2.fq |
| 128 | + - sample3_R2.fq |
| 129 | +input_pe_rgs_list: |
| 130 | + - sample1_rg_line |
| 131 | + - sample2_rg_line |
| 132 | + - sample3_rg_line |
| 133 | +``` |
| 134 | + |
| 135 | +##### Activating Reads Processing |
| 136 | + |
| 137 | +Once the user has provided their reads, they must then choose which processing |
| 138 | +the pipeline will perform. This choice is controlled by the following |
| 139 | +parameters: |
| 140 | +```yaml |
| 141 | + run_bam_processing: "BAM/CRAM/SAM processing will be run. Requires: input_bam_list" |
| 142 | + run_pe_reads_processing: "PE reads processing will be run. Requires: input_pe_reads_list, input_pe_mates_list, input_pe_rgs_list" |
| 143 | + run_se_reads_processing: "SE reads processing will be run. Requires: input_se_reads_list, input_se_rgs_list" |
| 144 | +``` |
| 145 | + |
| 146 | +If the user provides BAM/CRAM/SAMs they should set `run_bam_processing` to true. |
| 147 | +If the user provides paired end FASTQs, they should set `run_pe_reads_processing` to true. |
| 148 | +If the user provides single end FASTQs, they should set `run_se_reads_processing` to true. |
| 149 | + |
| 150 | +### Alignment Metrics |
| 151 | + |
| 152 | +After alignment, the pipeline contains a number of optional metrics collection |
| 153 | +steps that the user can enable. These tools can be enabled using the following |
| 154 | +parameters: |
| 155 | +```yaml |
| 156 | + run_hs_metrics: "HsMetrics will be collected. Only recommended for WXS inputs. Requires: wxs_bait_interval_list, wxs_target_interval_list" |
| 157 | + run_wgs_metrics: "WgsMetrics will be collected. Only recommended for WGS inputs. Requires: wgs_coverage_interval_list" |
| 158 | + run_agg_metrics: "AlignmentSummaryMetrics, GcBiasMetrics, InsertSizeMetrics, QualityScoreDistribution, and SequencingArtifactMetrics will be collected." |
| 159 | + run_sex_metrics: "idxstats will be collected and X/Y ratios calculated" |
| 160 | +``` |
| 161 | + |
| 162 | +In all but the most frugal cases, we recommend running the agg(regate) and sex |
| 163 | +metrics. These steps require no additional inputs. |
| 164 | + |
| 165 | +HS and WGS Metrics, however, should only be enabled for appropriate inputs. If |
| 166 | +the inputs are WGS, we recommend enabling WGS metrics. If the inputs are WXS, |
| 167 | +we recommend enabling WXS metrics. Both of these metrics tools require |
| 168 | +additional inputs. For WGS, the user must provide a |
| 169 | +`wgs_coverage_interval_list`. For WXS, the user must provide both a |
| 170 | +`wxs_bait_interval_list` and `wxs_target_interval_list`. While we provide a |
| 171 | +`wgs_coverage_interval_list` with the workflow, the `wxs_bait_interval_list` |
| 172 | +and `wxs_target_interval_list` are going to be unique to the sequencing |
| 173 | +experiment used to generate the reads. These inputs will vary run to run. |
| 174 | +Therefore, the user must provide these intervals. |
| 175 | + |
| 176 | +### Haplotype Calling |
| 177 | + |
| 178 | +Following alignment, this workflow can also perform GATK haplotype calling to |
| 179 | +generate a gVCF and metrics files. To generate the gVCF, set |
| 180 | +`run_gvcf_processing` to `true` and provide the following optional files: |
| 181 | +`dbsnp_vcf`, `contamination_sites_bed`, `contamination_sites_mu`, |
| 182 | +`contamination_sites_ud`, `wgs_calling_interval_list`, and |
| 183 | +`wgs_evaluation_interval_list`. |
| 184 | + |
| 185 | +These are the gVCF processing relevant inputs and descriptions: |
| 186 | +```yaml |
| 187 | + dbsnp_vcf: "dbSNP vcf file" |
| 188 | + dbsnp_idx: "dbSNP vcf index file" |
| 189 | + contamination_sites_bed: ".bed file for markers used in this analysis,format(chr\tpos-1\tpos\trefAllele\taltAllele)" |
| 190 | + contamination_sites_mu: ".mu matrix file of genotype matrix" |
| 191 | + contamination_sites_ud: ".UD matrix file from SVD result of genotype matrix" |
| 192 | + wgs_calling_interval_list: "WGS interval list used to aid scattering Haplotype caller" |
| 193 | + wgs_evaluation_interval_list: "Target intervals to restrict gVCF metric analysis (for VariantCallingMetrics)" |
| 194 | + run_gvcf_processing: "gVCF will be generated. Requires: dbsnp_vcf, contamination_sites_bed, contamination_sites_mu, contamination_sites_ud, wgs_calling_interval_list, wgs_evaluation_interval_list" |
| 195 | +``` |
| 196 | + |
| 197 | +### HLA Genotyping |
| 198 | + |
| 199 | +This workflow can also collect HLA genotyping data. To read more on this |
| 200 | +feature please see [our documentation](./T1K_README.md). This feature is |
| 201 | +on by default but can be disabled by setting `run_t1k` to `false`. HLA |
| 202 | +genotyping also requires the `hla_dna_ref_seqs` and `hla_dna_gene_coords` |
| 203 | +inputs. The public app provides default files for these. If you have are using |
| 204 | +a different reference genome, see the [T1k readme](https://github.com/mourisl/T1K?tab=readme-ov-file#install) |
| 205 | +for instructions on creating these files. |
| 206 | + |
| 207 | +## Outputs |
| 208 | + |
| 209 | +Below are the complete list of the outputs that can be generated by the pipeline: |
| 210 | +```yaml |
| 211 | + cram: "(Re)Aligned Reads File" |
| 212 | + gvcf: "Genomic VCF generated from the realigned alignment file." |
| 213 | + verifybamid_output: "Output from VerifyBamID that is used to calculate contamination." |
| 214 | + cutadapt_stats: "Stats from Cutadapt runs, if run. One or more per read group." |
| 215 | + bqsr_report: "Recalibration report from BQSR." |
| 216 | + gvcf_calling_metrics: "General metrics for gVCF calling quality." |
| 217 | + hs_metrics: "Picard CollectHsMetrics metrics for the analysis of target-capture sequencing experiments." |
| 218 | + wgs_metrics: "Picard CollectWgsMetrics metrics for evaluating the performance of whole genome sequencing experiments." |
| 219 | + alignment_metrics: "Picard CollectAlignmentSummaryMetrics high level metrics about the alignment of reads within a SAM/BAM file." |
| 220 | + gc_bias_detail: "Picard CollectGcBiasMetrics detailed metrics about reads that fall within windows of a certain GC bin on the reference genome." |
| 221 | + gc_bias_summary: "Picard CollectGcBiasMetrics high level metrics that capture how biased the coverage in a certain lane is." |
| 222 | + gc_bias_chart: "Picard CollectGcBiasMetrics plot of GC bias." |
| 223 | + insert_metrics: "Picard CollectInsertSizeMetrics metrics about the insert size distribution of a paired-end library." |
| 224 | + insert_plot: "Picard CollectInsertSizeMetrics insert size distribution plotted." |
| 225 | + artifact_bait_bias_detail_metrics: "Picard CollectSequencingArtifactMetrics bait bias artifacts broken down by context." |
| 226 | + artifact_bait_bias_summary_metrics: "Picard CollectSequencingArtifactMetrics summary analysis of a single bait bias artifact." |
| 227 | + artifact_error_summary_metrics: "Picard CollectSequencingArtifactMetrics summary metrics as a roll up of the context-specific error rates, to provide global error rates per type of base substitution." |
| 228 | + artifact_pre_adapter_detail_metrics: "Picard CollectSequencingArtifactMetrics pre-adapter artifacts broken down by context." |
| 229 | + artifact_pre_adapter_summary_metrics: "Picard CollectSequencingArtifactMetrics summary analysis of a single pre-adapter artifact." |
| 230 | + qual_metrics: "Quality metrics for the realigned CRAM." |
| 231 | + qual_chart: "Visualization of quality metrics." |
| 232 | + idxstats: "samtools idxstats of the realigned reads file." |
| 233 | + xy_ratio: "Text file containing X and Y reads statistics generated from idxstats." |
| 234 | + t1k_genotype_tsv: "HLA genotype results from T1k" |
| 235 | +``` |
| 236 | + |
| 237 | +## Caveats: |
| 238 | +- Duplicates are flagged in a process that is connected to bwa mem. The |
| 239 | + implication of this design decision is that duplicates are flagged only on the |
| 240 | + inputs of that are scattered into bwa. Duplicates, therefore, are not being |
| 241 | + flagged at a library level and, for large BAM and FASTQ inputs, duplicates are |
| 242 | + only being detected within a portion of the read group. |
| 243 | + |
| 244 | +## Tips for running: |
| 245 | +- For the FASTQ input file lists (PE or SE), make sure the lists are properly |
| 246 | + ordered. The items in the arrays are processed based on their position. These |
| 247 | + lists are dotproduct scattered. This means that the first file in |
| 248 | + `input_pe_reads_list` is run with the first file in `input_pe_mates_list` and |
| 249 | + the first string in `input_pe_rgs_list`. This also means these arrays must be |
| 250 | + the same length or the workflow will fail. |
| 251 | +- The input for the `reference_tar` must be a tar file containing the |
| 252 | + reference fasta along with its indexes. The required indexes are |
| 253 | + `[.64.ann,.64.amb,.64.bwt,.64.pac,.64.sa,.dict,.fai]` and are generated by bwa, |
| 254 | + Picard, and Samtools. Additionally, an `.64.alt` index is recommended. The |
| 255 | + following is an example of a complete reference tar input: |
| 256 | +``` |
| 257 | +~ tar tf Homo_sapiens_assembly38.tgz |
| 258 | +Homo_sapiens_assembly38.dict |
| 259 | +Homo_sapiens_assembly38.fasta |
| 260 | +Homo_sapiens_assembly38.fasta.64.alt |
| 261 | +Homo_sapiens_assembly38.fasta.64.amb |
| 262 | +Homo_sapiens_assembly38.fasta.64.ann |
| 263 | +Homo_sapiens_assembly38.fasta.64.bwt |
| 264 | +Homo_sapiens_assembly38.fasta.64.pac |
| 265 | +Homo_sapiens_assembly38.fasta.64.sa |
| 266 | +Homo_sapiens_assembly38.fasta.fai |
| 267 | +``` |
| 268 | +- If you are making your own bwa indexes make sure to use the `-6` flag to |
| 269 | + obtain the `.64` version of the indexes. Indexes that do not match this naming |
| 270 | + schema will cause a failure in certain runner ecosystems. |
| 271 | +- Should you decide to create your own reference indexes and omit the ALT |
| 272 | + index file from the reference, or if its naming structure mismatches the other |
| 273 | + indexes, then your alignments will be equivalent to the results you would |
| 274 | + obtain if you run BWA-MEM with the `-j` option. |
| 275 | +- For advanced usage, users can skip the knownsite indexing by providing the |
| 276 | + `knownsites_indexes` input. This file list should contain the indexes for each |
| 277 | + of the files in your knownsites input. Please note this list must be ordered in |
| 278 | + such a way where the position of the index file in the `knownsites_indexes` |
| 279 | + list must correspond with the position of the VCF file in the knownsites list |
| 280 | + that it indexes. Failure to order in this way will result in the pipeline |
| 281 | + failing or generating erroneous files. |
| 282 | +- For large BAM inputs, users may encounter a scenario where jobs fail during |
| 283 | + the bamtofastq step. The given error will recommend that users try increasing |
| 284 | + disk space. Increasing the disk space will solve the error for all but the |
| 285 | + largest inputs. For those extremely large inputs with many read groups that |
| 286 | + continue to get this error, it is recommended that users increase the value for |
| 287 | + `bamtofastq_cpu`. Increasing this value will decrease the number of concurrent |
| 288 | + bamtofastq jobs that run on the same instance. |
| 289 | + |
| 290 | +## Basic Info |
| 291 | +- [D3b dockerfiles](https://github.com/d3b-center/bixtools) |
| 292 | +- Testing Tools: |
| 293 | + - [CAVATICA Platform](https://cavatica.sbgenomics.com/) |
| 294 | + - [Common Workflow Language reference implementation (cwltool)](https://github.com/common-workflow-language/cwltool/) |
| 295 | + |
| 296 | +## References |
| 297 | +- KFDRC AWS S3 bucket: s3://kids-first-seq-data/broad-references/ |
| 298 | +- CAVATICA: https://cavatica.sbgenomics.com/u/kfdrc-harmonization/kf-references/ |
| 299 | +- Broad Institute Goolge Cloud: https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg38/v0 |
0 commit comments