Skip to content

Commit d4e818c

Browse files
committed
📚 update readmes
🧹 fix typos 🧹 fix typo 🧹 fix typo 🧹 typos
1 parent 01cde00 commit d4e818c

5 files changed

+597
-785
lines changed
Original file line numberDiff line numberDiff line change
@@ -0,0 +1,299 @@
1+
# Kids First Data Resource Center BWA-GATK Short Reads Alignment and HaplotypeCaller Workflow
2+
3+
<p align="center">
4+
<img src="./kids_first_logo.svg" alt="Kids First repository logo" width="660px" />
5+
</p>
6+
7+
The Kids First Data Resource Center (KFDRC) BWA-GATK Short Reads Alignment and
8+
GATK Haplotyper Workflow is a Common Workflow Language (CWL) implementation of
9+
various software used to take reads generated by next generation sequencing
10+
(NGS) technologies and use those reads to generate alignment and, optionally,
11+
variant information.
12+
13+
This workflow is an all-in-one alignment workflow capable of handling any kind
14+
of reads inputs: BAM inputs, PE reads and mates inputs, SE reads inputs, or
15+
any combination of these. The workflow will naively attempt to process these
16+
depending on what you tell it you have provided. The user informs the workflow
17+
of which inputs to process using three boolean inputs: `run_bam_processing`,
18+
`run_pe_reads_processing`, and `run_se_reads_processing`. Providing `true`
19+
values for these as well their corresponding inputs will result in those inputs
20+
being processed.
21+
22+
In addition to the core alignment functionality, this workflow is also capable
23+
of the following:
24+
- Performing Haplotype Calling to generate a gVCF as well as associated metrics
25+
- Collecting Hybrid-Selection (HS) Sequencing Metrics
26+
- Collecting Whole Genome Sequencing (WGS) Metrics
27+
- Collecting Aggregate Sequencing Metrics: Alignment Summary, Insert Size, Sequencing Artifact, GC Bias, and Quality Score Distribution
28+
- Collecting Sex Chromosome Metrics
29+
- Collecting HLA Genotyping Data
30+
31+
## Pipeline Software and Versions
32+
33+
Below is a breakdown of the software used in the workflow. Please note that this
34+
breakdown is manually updated and, as such, may fall out of date with our workflow.
35+
For the most up to date versions of our software please refer to the Docker
36+
images present in the workflow. A table of the images and their usage can be
37+
found [here](./dockers_bwagatk_alignment.md).
38+
39+
| Step | KFDRC BWA-GATK |
40+
|----------------------------|-------------------------------------|
41+
| Bam to Read Group (RG) BAM | samtools split |
42+
| RG Bam to Fastq | biobambam2 bamtofastq |
43+
| Adapter Trimming | cutadapt |
44+
| Fastq to RG Bam | bwa mem |
45+
| Merge RG Bams | sambamba merge |
46+
| Sort Bam | sambamba sort |
47+
| Mark Duplicates | samblaster |
48+
| BaseRecalibration | GATK BaseRecalibrator |
49+
| ApplyRecalibration | GATK ApplyBQSR |
50+
| Gather Recalibrated BAMs | Picard GatherBamFiles |
51+
| Bam to Cram | samtools view |
52+
| Metrics | Picard |
53+
| Sex Metrics | samtools idxstats |
54+
| HLA Genotyping | T1k |
55+
| Contamination Calculation | VerifyBamID |
56+
| gVCF Calling | GATK HaplotypeCaller |
57+
| Gather VCFs | Picard MergeVcfs |
58+
| Metrics | Picard CollectVariantCallingMetrics |
59+
60+
- [biobambam2](https://github.com/gt1/biobambam2): [2.0.50](https://github.com/gt1/biobambam2/releases/tag/2.0.50-release-20160705161609)
61+
- [bwa](https://github.com/lh3/bwa): [0.7.17-r1188](https://github.com/lh3/bwa/releases/tag/v0.7.17)
62+
- [Cutadapt](https://github.com/marcelm/cutadapt): [4.6](https://github.com/marcelm/cutadapt/releases/tag/v4.5)
63+
- [GATK](https://github.com/broadinstitute/gatk) Our workflow uses three versions of GATK:
64+
- HaplotypeCaller: [4.beta.1](https://github.com/broadinstitute/gatk/releases/tag/4.beta.1)
65+
- BaseRecalibration: [4.0.3.0](https://github.com/broadinstitute/gatk/releases/tag/4.0.3.0)
66+
- File Indexing: [4.1.7.0](https://github.com/broadinstitute/gatk/releases/tag/4.1.7.0)
67+
- [Picard](https://github.com/broadinstitute/picard): [2.18.9](https://github.com/broadinstitute/picard/releases/tag/2.18.9)
68+
- [Sambamba](https://github.com/biod/sambamba): [0.6.3](https://github.com/biod/sambamba/releases/tag/v0.6.3)
69+
- [samblaster](https://github.com/GregoryFaust/samblaster): [0.1.24](https://github.com/GregoryFaust/samblaster/releases/tag/v.0.1.24)
70+
- [Samtools](https://github.com/samtools/samtools):
71+
- idxstats, split: [1.9](https://github.com/samtools/samtools/releases/tag/1.9)
72+
- BAM to CRAM conversion: [1.8](https://github.com/samtools/samtools/releases/tag/1.8)
73+
- [T1k](https://github.com/mourisl/T1K): [1.0.5](https://github.com/mourisl/T1K/releases/tag/v1.0.5)
74+
- [VerifyBamID](https://github.com/Griffan/VerifyBamID): [1.0.2](https://github.com/Griffan/VerifyBamID/releases/tag/1.0.2)
75+
76+
### Alignment
77+
78+
The principle function of this pipeline is the realignment of short reads to a reference genome.
79+
80+
#### Inputing Reads
81+
82+
Read information are passed in through the following inputs:
83+
```yaml
84+
input_bam_list: "List of input BAM/CRAM/SAM files"
85+
cram_reference: "If input_bam_list contains CRAM, provided the reference used to generate that CRAM here"
86+
input_pe_reads_list: "List of input R1 paired end FASTQ reads"
87+
input_pe_mates_list: "List of input R2 paired end FASTQ reads"
88+
input_pe_rgs_list: "List of RG strings to use in PE processing"
89+
input_se_reads_list: "List of input single end FASTQ reads"
90+
input_se_rgs_list: "List of RG strings to use in SE processing"
91+
```
92+
93+
This workflow accepts reads in any of the of the following formats:
94+
- Aligned BAM/CRAM/SAM
95+
- Unaligned BAM/CRAM/SAM
96+
- Paired End FASTQ
97+
- Single End FASTQ
98+
99+
Additionally, these files must have appropriate read group information. For
100+
aligned and unaligned BAM/CRAM/SAM files, we assume that information is
101+
contained within the file header. If your files do not have read group (`@RG`)
102+
header information, you **must** add that information to the file header for
103+
the pipeline to work!
104+
105+
If the user provides CRAM inputs, they must also provide the reference that was
106+
used to generate the CRAM file.
107+
108+
FASTQ files do not have headers; their header information must be provided by
109+
the user at runtime. The inputs that the user provides are handed directly to
110+
the bwa `-R` option. From bwa:
111+
```
112+
-R Complete read group header line. '\t' can be used in STR and will be
113+
converted to a TAB in the output SAM. The read group ID will be attached to
114+
every read in the output. An example is '@RG\tID:foo\tSM:bar'. [null]
115+
```
116+
117+
If the user provides multiple single end FASTQs or multiple sets of paired end
118+
FASTQs, they must keep the files and read group lines in the same order.
119+
For example:
120+
```
121+
input_pe_reads_list:
122+
- sample1_R1.fq
123+
- sample2_R1.fq
124+
- sample3_R1.fq
125+
input_pe_mates_list:
126+
- sample1_R2.fq
127+
- sample2_R2.fq
128+
- sample3_R2.fq
129+
input_pe_rgs_list:
130+
- sample1_rg_line
131+
- sample2_rg_line
132+
- sample3_rg_line
133+
```
134+
135+
##### Activating Reads Processing
136+
137+
Once the user has provided their reads, they must then choose which processing
138+
the pipeline will perform. This choice is controlled by the following
139+
parameters:
140+
```yaml
141+
run_bam_processing: "BAM/CRAM/SAM processing will be run. Requires: input_bam_list"
142+
run_pe_reads_processing: "PE reads processing will be run. Requires: input_pe_reads_list, input_pe_mates_list, input_pe_rgs_list"
143+
run_se_reads_processing: "SE reads processing will be run. Requires: input_se_reads_list, input_se_rgs_list"
144+
```
145+
146+
If the user provides BAM/CRAM/SAMs they should set `run_bam_processing` to true.
147+
If the user provides paired end FASTQs, they should set `run_pe_reads_processing` to true.
148+
If the user provides single end FASTQs, they should set `run_se_reads_processing` to true.
149+
150+
### Alignment Metrics
151+
152+
After alignment, the pipeline contains a number of optional metrics collection
153+
steps that the user can enable. These tools can be enabled using the following
154+
parameters:
155+
```yaml
156+
run_hs_metrics: "HsMetrics will be collected. Only recommended for WXS inputs. Requires: wxs_bait_interval_list, wxs_target_interval_list"
157+
run_wgs_metrics: "WgsMetrics will be collected. Only recommended for WGS inputs. Requires: wgs_coverage_interval_list"
158+
run_agg_metrics: "AlignmentSummaryMetrics, GcBiasMetrics, InsertSizeMetrics, QualityScoreDistribution, and SequencingArtifactMetrics will be collected."
159+
run_sex_metrics: "idxstats will be collected and X/Y ratios calculated"
160+
```
161+
162+
In all but the most frugal cases, we recommend running the agg(regate) and sex
163+
metrics. These steps require no additional inputs.
164+
165+
HS and WGS Metrics, however, should only be enabled for appropriate inputs. If
166+
the inputs are WGS, we recommend enabling WGS metrics. If the inputs are WXS,
167+
we recommend enabling WXS metrics. Both of these metrics tools require
168+
additional inputs. For WGS, the user must provide a
169+
`wgs_coverage_interval_list`. For WXS, the user must provide both a
170+
`wxs_bait_interval_list` and `wxs_target_interval_list`. While we provide a
171+
`wgs_coverage_interval_list` with the workflow, the `wxs_bait_interval_list`
172+
and `wxs_target_interval_list` are going to be unique to the sequencing
173+
experiment used to generate the reads. These inputs will vary run to run.
174+
Therefore, the user must provide these intervals.
175+
176+
### Haplotype Calling
177+
178+
Following alignment, this workflow can also perform GATK haplotype calling to
179+
generate a gVCF and metrics files. To generate the gVCF, set
180+
`run_gvcf_processing` to `true` and provide the following optional files:
181+
`dbsnp_vcf`, `contamination_sites_bed`, `contamination_sites_mu`,
182+
`contamination_sites_ud`, `wgs_calling_interval_list`, and
183+
`wgs_evaluation_interval_list`.
184+
185+
These are the gVCF processing relevant inputs and descriptions:
186+
```yaml
187+
dbsnp_vcf: "dbSNP vcf file"
188+
dbsnp_idx: "dbSNP vcf index file"
189+
contamination_sites_bed: ".bed file for markers used in this analysis,format(chr\tpos-1\tpos\trefAllele\taltAllele)"
190+
contamination_sites_mu: ".mu matrix file of genotype matrix"
191+
contamination_sites_ud: ".UD matrix file from SVD result of genotype matrix"
192+
wgs_calling_interval_list: "WGS interval list used to aid scattering Haplotype caller"
193+
wgs_evaluation_interval_list: "Target intervals to restrict gVCF metric analysis (for VariantCallingMetrics)"
194+
run_gvcf_processing: "gVCF will be generated. Requires: dbsnp_vcf, contamination_sites_bed, contamination_sites_mu, contamination_sites_ud, wgs_calling_interval_list, wgs_evaluation_interval_list"
195+
```
196+
197+
### HLA Genotyping
198+
199+
This workflow can also collect HLA genotyping data. To read more on this
200+
feature please see [our documentation](./T1K_README.md). This feature is
201+
on by default but can be disabled by setting `run_t1k` to `false`. HLA
202+
genotyping also requires the `hla_dna_ref_seqs` and `hla_dna_gene_coords`
203+
inputs. The public app provides default files for these. If you have are using
204+
a different reference genome, see the [T1k readme](https://github.com/mourisl/T1K?tab=readme-ov-file#install)
205+
for instructions on creating these files.
206+
207+
## Outputs
208+
209+
Below are the complete list of the outputs that can be generated by the pipeline:
210+
```yaml
211+
cram: "(Re)Aligned Reads File"
212+
gvcf: "Genomic VCF generated from the realigned alignment file."
213+
verifybamid_output: "Output from VerifyBamID that is used to calculate contamination."
214+
cutadapt_stats: "Stats from Cutadapt runs, if run. One or more per read group."
215+
bqsr_report: "Recalibration report from BQSR."
216+
gvcf_calling_metrics: "General metrics for gVCF calling quality."
217+
hs_metrics: "Picard CollectHsMetrics metrics for the analysis of target-capture sequencing experiments."
218+
wgs_metrics: "Picard CollectWgsMetrics metrics for evaluating the performance of whole genome sequencing experiments."
219+
alignment_metrics: "Picard CollectAlignmentSummaryMetrics high level metrics about the alignment of reads within a SAM/BAM file."
220+
gc_bias_detail: "Picard CollectGcBiasMetrics detailed metrics about reads that fall within windows of a certain GC bin on the reference genome."
221+
gc_bias_summary: "Picard CollectGcBiasMetrics high level metrics that capture how biased the coverage in a certain lane is."
222+
gc_bias_chart: "Picard CollectGcBiasMetrics plot of GC bias."
223+
insert_metrics: "Picard CollectInsertSizeMetrics metrics about the insert size distribution of a paired-end library."
224+
insert_plot: "Picard CollectInsertSizeMetrics insert size distribution plotted."
225+
artifact_bait_bias_detail_metrics: "Picard CollectSequencingArtifactMetrics bait bias artifacts broken down by context."
226+
artifact_bait_bias_summary_metrics: "Picard CollectSequencingArtifactMetrics summary analysis of a single bait bias artifact."
227+
artifact_error_summary_metrics: "Picard CollectSequencingArtifactMetrics summary metrics as a roll up of the context-specific error rates, to provide global error rates per type of base substitution."
228+
artifact_pre_adapter_detail_metrics: "Picard CollectSequencingArtifactMetrics pre-adapter artifacts broken down by context."
229+
artifact_pre_adapter_summary_metrics: "Picard CollectSequencingArtifactMetrics summary analysis of a single pre-adapter artifact."
230+
qual_metrics: "Quality metrics for the realigned CRAM."
231+
qual_chart: "Visualization of quality metrics."
232+
idxstats: "samtools idxstats of the realigned reads file."
233+
xy_ratio: "Text file containing X and Y reads statistics generated from idxstats."
234+
t1k_genotype_tsv: "HLA genotype results from T1k"
235+
```
236+
237+
## Caveats:
238+
- Duplicates are flagged in a process that is connected to bwa mem. The
239+
implication of this design decision is that duplicates are flagged only on the
240+
inputs of that are scattered into bwa. Duplicates, therefore, are not being
241+
flagged at a library level and, for large BAM and FASTQ inputs, duplicates are
242+
only being detected within a portion of the read group.
243+
244+
## Tips for running:
245+
- For the FASTQ input file lists (PE or SE), make sure the lists are properly
246+
ordered. The items in the arrays are processed based on their position. These
247+
lists are dotproduct scattered. This means that the first file in
248+
`input_pe_reads_list` is run with the first file in `input_pe_mates_list` and
249+
the first string in `input_pe_rgs_list`. This also means these arrays must be
250+
the same length or the workflow will fail.
251+
- The input for the `reference_tar` must be a tar file containing the
252+
reference fasta along with its indexes. The required indexes are
253+
`[.64.ann,.64.amb,.64.bwt,.64.pac,.64.sa,.dict,.fai]` and are generated by bwa,
254+
Picard, and Samtools. Additionally, an `.64.alt` index is recommended. The
255+
following is an example of a complete reference tar input:
256+
```
257+
~ tar tf Homo_sapiens_assembly38.tgz
258+
Homo_sapiens_assembly38.dict
259+
Homo_sapiens_assembly38.fasta
260+
Homo_sapiens_assembly38.fasta.64.alt
261+
Homo_sapiens_assembly38.fasta.64.amb
262+
Homo_sapiens_assembly38.fasta.64.ann
263+
Homo_sapiens_assembly38.fasta.64.bwt
264+
Homo_sapiens_assembly38.fasta.64.pac
265+
Homo_sapiens_assembly38.fasta.64.sa
266+
Homo_sapiens_assembly38.fasta.fai
267+
```
268+
- If you are making your own bwa indexes make sure to use the `-6` flag to
269+
obtain the `.64` version of the indexes. Indexes that do not match this naming
270+
schema will cause a failure in certain runner ecosystems.
271+
- Should you decide to create your own reference indexes and omit the ALT
272+
index file from the reference, or if its naming structure mismatches the other
273+
indexes, then your alignments will be equivalent to the results you would
274+
obtain if you run BWA-MEM with the `-j` option.
275+
- For advanced usage, users can skip the knownsite indexing by providing the
276+
`knownsites_indexes` input. This file list should contain the indexes for each
277+
of the files in your knownsites input. Please note this list must be ordered in
278+
such a way where the position of the index file in the `knownsites_indexes`
279+
list must correspond with the position of the VCF file in the knownsites list
280+
that it indexes. Failure to order in this way will result in the pipeline
281+
failing or generating erroneous files.
282+
- For large BAM inputs, users may encounter a scenario where jobs fail during
283+
the bamtofastq step. The given error will recommend that users try increasing
284+
disk space. Increasing the disk space will solve the error for all but the
285+
largest inputs. For those extremely large inputs with many read groups that
286+
continue to get this error, it is recommended that users increase the value for
287+
`bamtofastq_cpu`. Increasing this value will decrease the number of concurrent
288+
bamtofastq jobs that run on the same instance.
289+
290+
## Basic Info
291+
- [D3b dockerfiles](https://github.com/d3b-center/bixtools)
292+
- Testing Tools:
293+
- [CAVATICA Platform](https://cavatica.sbgenomics.com/)
294+
- [Common Workflow Language reference implementation (cwltool)](https://github.com/common-workflow-language/cwltool/)
295+
296+
## References
297+
- KFDRC AWS S3 bucket: s3://kids-first-seq-data/broad-references/
298+
- CAVATICA: https://cavatica.sbgenomics.com/u/kfdrc-harmonization/kf-references/
299+
- Broad Institute Goolge Cloud: https://console.cloud.google.com/storage/browser/gcp-public-data--broad-references/hg38/v0

docs/KFDRC_SENTIEON_ALIGNMENT_GVCF_WORKFLOW_README.md

+6-3
Original file line numberDiff line numberDiff line change
@@ -1,14 +1,14 @@
1-
# Kids First Data Resource Center Sentieon Short Reads Alignment and gVCF Workflow
1+
# Kids First Data Resource Center Sentieon Short Reads Alignment and Haplotyper Workflow
22

33
<p align="center">
44
<img src="./kids_first_logo.svg" alt="Kids First repository logo" width="660px" />
55
</p>
66

7-
The Kids First Data Resource Center (KFDRC) Short Reads Alignment and GATK
7+
The Kids First Data Resource Center (KFDRC) Sentieon Short Reads Alignment and
88
Haplotyper Workflow is a Common Workflow Language (CWL) implementation of
99
various software used to take reads generated by next generation sequencing
1010
(NGS) technologies and use those reads to generate alignment and, optionally,
11-
variant information. This workflow mirrors the approach of our existing [Input Agnostic Alignment Workflow](https://github.com/kids-first/kf-alignment-workflow#input-agnostic-alignment-workflow),
11+
variant information. This workflow mirrors the approach of our existing [BWA-GATK Workflow](./KFDRC_BWA_GATK_ALIGNMENT_GVCF_WORKFLOW_README.md),
1212
and the two have been internally benchmarked as functionally equivalent. The key
1313
difference between the two workflows is found in the tools used during the
1414
alignment process.
@@ -62,6 +62,7 @@ that match our existing Picard metrics suite.
6262
|----------------------------|-----------------------|-----------------------------------|
6363
| Bam to Read Group (RG) BAM | samtools split | samtools split |
6464
| RG Bam to Fastq | biobambam2 bamtofastq | biobambam2 bamtofastq |
65+
| Adapter Trimming | cutadapt | cutadapt |
6566
| Fastq to RG Bam | bwa mem | Sentieon bwa mem |
6667
| Merge RG Bams | sambamba merge | Sentieon ReadWriter |
6768
| Sort Bam | sambamba sort | Sentieon ReadWriter |
@@ -71,6 +72,8 @@ that match our existing Picard metrics suite.
7172
| Gather Recalibrated BAMs | Picard GatherBamFiles | No splitting occurs in Sentieon |
7273
| Bam to Cram | samtools view | Sentieon ReadWriter |
7374
| Metrics | Picard | Sentieon |
75+
| Sex Metrics | samtools idxstats | samtools idxstats |
76+
| HLA Genotyping | T1k | T1k |
7477

7578
## Sentieon gVCF Creation: Similarities and Differences
7679

0 commit comments

Comments
 (0)