- Computes summary statistics from three inputs: a raw FASTQ file, a Kraken 2 report, and a Bracken report.
- A Python script, tax_stats.py, reports: 1) the number of reads in the raw FASTQ file, 2) the number of classified reads (under root) in the Kraken 2 report, 3) the number of unclassified reads in the Kraken 2 report, 4) the number of classified reads (under root) in the Bracken report, and 5) the number of unclassified reads in the Bracken report (Kraken 2 classified reads minus Bracken classified reads).
Usage: python3 tax_stats.py --file1 </path/to/kraken_report> --file2 </path/to/bracken_report> --file3 </path/to/fastq_file>
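The first statistic, the number of reads in the raw FASTQ file, can be sketched as below. This is a minimal illustration and not the code in tax_stats.py: the function name `count_fastq_reads` is hypothetical, and it assumes a well-formed FASTQ file with exactly four lines per record (optionally gzip-compressed, detected by a `.gz` suffix).

```python
import gzip


def count_fastq_reads(path):
    """Count reads in a FASTQ file, assuming four lines per record.

    Hypothetical helper for illustration; files ending in ".gz" are
    opened with gzip, everything else as plain text.
    """
    opener = gzip.open if str(path).endswith(".gz") else open
    with opener(path, "rt") as handle:
        n_lines = sum(1 for _ in handle)
    if n_lines % 4 != 0:
        # A line count that is not a multiple of 4 suggests a truncated
        # file or multi-line records, which this sketch does not handle.
        raise ValueError(f"{path}: {n_lines} lines is not a multiple of 4")
    return n_lines // 4
```

Counting lines rather than `@` characters avoids miscounting, because `@` is also a valid quality-score character and can start a quality line.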
Kraken 2 is a k-mer-based taxonomic classification system that assigns taxonomic labels to DNA sequences. It relies on exact k-mer matches: the classifier maps each k-mer within a query sequence to the lowest common ancestor (LCA) of all genomes containing that k-mer. Rather than storing full k-mers, Kraken 2 stores the minimizer (an l-mer, with l ≤ k) of each k-mer, and treats every k-mer as having the same LCA as its minimizer. Only minimizers of the k-mers in the query sequences are used as database queries, and only minimizers of the k-mers in the reference sequences of the database's genomic library are stored in the database. By default, k = 35 and l = 31.
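The minimizer idea can be illustrated with a small sketch. It is a simplification and not Kraken 2's implementation: here the minimizer is just the lexicographically smallest l-mer of a k-mer, the database is a plain dictionary, and the function names, the toy values k = 5 and l = 3, and the taxonomy IDs in `toy_db` are all made up for the example.

```python
def kmers(seq, k):
    """Return all overlapping k-mers of seq, left to right."""
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def minimizer(kmer, l):
    """Return the lexicographically smallest l-mer within kmer (l <= k)."""
    if l > len(kmer):
        raise ValueError("l must not exceed the k-mer length")
    return min(kmer[i:i + l] for i in range(len(kmer) - l + 1))


def lookup_kmers(seq, k, l, db):
    """Look up each k-mer via its minimizer; db maps minimizer -> LCA taxid.

    Every k-mer inherits the LCA stored for its minimizer, or None when
    the minimizer is absent from the (toy) database.
    """
    return [db.get(minimizer(km, l)) for km in kmers(seq, k)]


# Toy database with made-up taxonomy IDs, for illustration only.
toy_db = {"ACG": 562, "GTT": 1224}
hits = lookup_kmers("ACGTTGCA", k=5, l=3, db=toy_db)
print(hits)  # [562, None, 1224, None]
```

Because several adjacent k-mers often share one minimizer, storing minimizers instead of full k-mers shrinks the database, at the cost of treating all k-mers with the same minimizer as having the same LCA.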
- phanta_kraken2.report
- The above is an example of the standard Kraken 2 sample report format. It is tab-delimited, with one line per taxon.
- The fields of the output, from left to right:
- Percentage of fragments (reads) covered by the clade rooted at this taxon
- Number of fragments covered by the clade rooted at this taxon
- Number of fragments assigned directly to this taxon
- A rank code, indicating (U)nclassified, (R)oot, (D)omain, (K)ingdom, (P)hylum, (C)lass, (O)rder, (F)amily, (G)enus, or (S)pecies. Taxa that are not at any of these 10 ranks have a rank code that is formed by using the rank code of the closest ancestor rank with a number indicating the distance from that rank. E.g., "G2" is a rank code indicating a taxon is between genus and species and the grandparent taxon is at the genus rank.
- NCBI taxonomic ID number
- Indented scientific name
- By default, taxa with no reads assigned to (or under) them will not have any output produced.
- Refer to Kraken 2 report
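The six fields listed above can be read with a short parser like the following sketch, which also pulls out the classified (root) and unclassified read counts. It assumes the standard six-column report; the names `ReportRow`, `parse_report_line`, and `classified_and_unclassified` are hypothetical, and the sample lines are invented for illustration rather than taken from phanta_kraken2.report.

```python
from typing import NamedTuple


class ReportRow(NamedTuple):
    percent: float      # percentage of fragments in the clade
    clade_reads: int    # fragments covered by the clade rooted at this taxon
    direct_reads: int   # fragments assigned directly to this taxon
    rank: str           # rank code, e.g. "U", "R", "G", "S", "G2"
    taxid: int          # NCBI taxonomic ID
    name: str           # scientific name with indentation stripped


def parse_report_line(line):
    """Split one tab-delimited report line into a ReportRow."""
    fields = line.rstrip("\n").split("\t")
    if len(fields) != 6:
        raise ValueError(f"expected 6 tab-separated fields, got {len(fields)}")
    pct, clade, direct, rank, taxid, name = fields
    return ReportRow(float(pct), int(clade), int(direct),
                     rank.strip(), int(taxid), name.strip())


def classified_and_unclassified(lines):
    """Return (reads under root, unclassified reads) from report lines."""
    classified = unclassified = 0
    for line in lines:
        if not line.strip():
            continue
        row = parse_report_line(line)
        if row.rank == "U":
            unclassified = row.clade_reads
        elif row.rank == "R":  # exact match, so "R1", "R2", ... are skipped
            classified = row.clade_reads
    return classified, unclassified


# Invented sample lines, for illustration only.
sample = [
    "  5.00\t50\t50\tU\t0\tunclassified\n",
    " 95.00\t950\t10\tR\t1\troot\n",
    " 90.00\t900\t5\tR1\t131567\t  cellular organisms\n",
]
print(classified_and_unclassified(sample))  # (950, 50)
```

Matching the rank code exactly against "R" matters because intermediate taxa just below root carry codes such as "R1", and their clade counts are already included in the root row.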
- Bracken is a companion program to Kraken 1, KrakenUniq, or Kraken 2. While Kraken classifies reads at multiple levels of the taxonomic tree, Bracken estimates abundance at a single level using those classifications (e.g., the abundance of each species within a sample).
- Unclassified reads will not be included in the report.
- The format is similar to the Kraken 2 report. Refer to Bracken-report
- If Kraken fails to identify a species (e.g., if the species was missing from the Kraken database), Bracken too will not identify that species.
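The Bracken-side statistics described at the top of this document (classified reads under root, and unclassified reads as Kraken 2 classified minus Bracken classified) can be sketched as follows. This is an illustration under assumptions, not the tax_stats.py implementation: it assumes the Bracken report uses the same column layout as the Kraken 2 report, with the clade read count in the second column and the rank code in the fourth, and the function names and sample lines are hypothetical.

```python
def root_clade_reads(report_lines):
    """Return the clade read count of the root (rank code "R") row."""
    for line in report_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) >= 4 and fields[3].strip() == "R":
            return int(fields[1])
    raise ValueError("no root (R) row found in report")


def bracken_unclassified(kraken_lines, bracken_lines):
    """Kraken 2 classified reads minus Bracken classified reads."""
    return root_clade_reads(kraken_lines) - root_clade_reads(bracken_lines)


# Invented report lines, for illustration only.
kraken_sample = [
    "  5.00\t50\t50\tU\t0\tunclassified\n",
    " 95.00\t950\t10\tR\t1\troot\n",
]
bracken_sample = [
    "100.00\t940\t0\tR\t1\troot\n",
]
print(bracken_unclassified(kraken_sample, bracken_sample))  # 10
```

Since the Bracken report omits unclassified reads, the difference is derived from the two root rows rather than read from a "U" row.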