The quicksand helper-pipeline
See the Github Pages of quicksand for a comprehensive documentation of the pipeline.
This repostory is an addition to the mpieva/quicksand pipeline see here. Starting quicksand-build will download the mitochondiral genomes from the current NCBI/RefSeq release and create - for the given taxa - the datastructure and files required by the quicksand pipeline.
Make sure to check the RefSeq Website and note down the current RefSeq Release that is used for your database
The output of the pipeline is structured as followes
ncbi:
mitochondrion.{n}.genomic.gbff.gz - raw downloaded files from NCBI
genomes:
genomes/{family}/{species}.fasta - The indexed mitochondrial genomes used for mapping with bwa
genomes/taxid_map.tsv - A table with all nodes in the database - used to get all reference genomes for one taxon ID
masked:
masked/{species}.masked.bed - Bed files for all species in the database showing low-complexity regions
kraken:
kraken/Mito_db_kmer{kmersize} - A preindexed Kraken-database for the given kmers containing all the species in the database
work: contains nextflow-specific files and can be deleted after the run
To run the pipeline the following programms need to be installed:
- Nextflow (tested on v.20.04.10): Installation
- Singularity (tested on v3.7.1): Installation or Docker
To run the pipeline with default parameters open the terminal and type
nextflow run mpieva/quicksand-build -profile singularity
This will construct the kraken-database for kmer 22 from all mitochondrial genomes in the current refseq-release \
The pipeline accepts the following parameters:
Pipeline ARGS
--outdir PATH : Directory to save the output in. Default = "out"
--kmers KMERS : Comma-separated list of kmers for which databases are created (e.g. 21,22,23). Default=22
--include STRING : comma-separated string of Taxa that should be in the DB, e.g. "Mammalia". Default='root'
--exclude STRING : comma-separated string of Taxa that mustn't be in the DB, e.g. "Pan,Gorilla".
Nextflow ARGS (only one dash!)
-profile PROFILE : Run the pipeline with the assigned profile (see profiles below)
-resume : Resume the previous run (if it was stopped in the mean time)
-w PATH : Specify a different "work" directory for intermediate files
-c PATH : Path to a nextflow.config file that provides ADDITIONAL parameters
To integrate the created datastructure, run the quicksand pipeline with the following parameters:
--genome <OUTDIR>/genomes
--bedfiles <OUTDIR>/masked
--db <OUTDIR>/kraken/Mito_db_kmer<KMER>/