Multi-threading #7

Open
baraaorabi opened this issue Sep 26, 2019 · 4 comments
Labels
enhancement New feature or request

Comments

@baraaorabi

Is your feature request related to a problem? Please describe.

The simulator is very slow at two steps:

  • Adjusting the lengths of reference contigs. This might not be an issue for human genomes, but it is a big issue for transcriptomes (>180K contigs vs. 23).
  • Generating reads.

Both of these steps should have straightforward data parallelism.

Describe the solution you'd like
Multithreading of the two steps (and possibly others?)
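As a rough illustration of the data parallelism requested here, the per-contig step could be mapped over a pool of workers. This is only a sketch: `adjust_contig` is a hypothetical stand-in for the real per-contig work, not a Badread function, and the `(name, sequence)` layout is an assumption. Worker processes are the default because CPU-bound pure-Python code gains little from threads under the GIL.

```python
from concurrent.futures import ProcessPoolExecutor


def adjust_contig(item):
    """Hypothetical stand-in for the per-contig length adjustment."""
    name, seq = item
    # Placeholder work: the real, expensive adjustment would go here.
    return name, len(seq)


def adjust_all(contigs, workers=4, executor_cls=ProcessPoolExecutor):
    """Apply adjust_contig to every (name, sequence) pair in parallel."""
    with executor_cls(max_workers=workers) as pool:
        # chunksize batches many small contigs per task, which limits
        # inter-process overhead when there are >100K short contigs.
        results = pool.map(adjust_contig, contigs.items(), chunksize=256)
        return dict(results)
```

With a process pool, the calling script needs the usual `if __name__ == "__main__":` guard on platforms that spawn workers. Passing `executor_cls=ThreadPoolExecutor` keeps the same interface for quick experiments.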

Describe alternatives you've considered
Adding a program command to prepare the reference contigs and pickle the results so rerunning won't be slow. That won't really resolve the read generation speed, though.
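The caching alternative could look roughly like the following. It is a sketch under assumptions: `prepare` is a hypothetical callable that does the slow contig preparation, and the cache is treated as valid only while it is at least as new as the reference file.

```python
import pickle
from pathlib import Path


def load_or_prepare(reference_path, cache_path, prepare):
    """Return prepared contigs, reusing a pickle cache when it is current.

    `prepare` is a hypothetical callable that performs the slow step.
    """
    reference_path, cache_path = Path(reference_path), Path(cache_path)
    # Reuse the cache only if it is not older than the reference file.
    if (cache_path.exists()
            and cache_path.stat().st_mtime >= reference_path.stat().st_mtime):
        with cache_path.open("rb") as handle:
            return pickle.load(handle)
    prepared = prepare(reference_path)
    with cache_path.open("wb") as handle:
        pickle.dump(prepared, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return prepared
```

Pickle files should only be loaded when they come from a trusted source, since unpickling can execute arbitrary code.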

Additional context
I am building a wrapper around Badread for transcriptomic reads. It's still in the design stage. I plan to code the multithreading described above on a separate branch and make PR

@baraaorabi baraaorabi added the enhancement New feature or request label Sep 26, 2019
@W-L

W-L commented Jul 2, 2021

For others stumbling across this issue, here's a little Snakemake template that mimics multi-threading by running badread multiple times and concatenating the FASTQ files at the end.

threads = list(range(10))
genome = "genome.fa"

# example to pass through parameters
rlen_mean = 15000
rlen_sd = 13000
sim_params = {"rlen": f"{rlen_mean},{rlen_sd}"}

rule all:
    input:
        expand("reads_{t}.fq", t=threads),
        "sim_reads.fq"

# run badread simulate multiple times on the same input genome
rule badread_sim:
    input: genome
    output: "reads_{t}.fq"
    params:
        rlen = lambda wildcards: sim_params['rlen']
    shell:
        "badread simulate --reference {input} --length {params.rlen} >{output}"

# afterwards simply concatenate all output read files
rule concat_sim:
    input: expand("reads_{t}.fq",t=threads)
    output: "sim_reads.fq"
    shell:
        "cat {input} > {output}"

Just saw that there is already a wiki entry for doing exactly the same thing in bash. Anyway, maybe this is still useful for someone.
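For readers without Snakemake, the same run-then-concatenate pattern can be sketched in plain Python. The helper names and the `part_{i}.fq` file naming are assumptions made for this example; each command is passed as an argument list, so nothing here depends on a particular badread option.

```python
import shutil
import subprocess
from concurrent.futures import ThreadPoolExecutor
from pathlib import Path


def run_to_file(command, out_path):
    """Run one command and write its stdout to out_path."""
    with open(out_path, "wb") as out:
        subprocess.run(command, stdout=out, check=True)
    return out_path


def run_and_concat(commands, merged_path, workdir=".", workers=4):
    """Run commands concurrently, then concatenate their outputs in order."""
    parts = [Path(workdir) / f"part_{i}.fq" for i in range(len(commands))]
    # Threads suffice here: each one only waits on an external process.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        list(pool.map(run_to_file, commands, parts))
    with open(merged_path, "wb") as merged:
        for part in parts:
            with open(part, "rb") as handle:
                shutil.copyfileobj(handle, merged)
    return merged_path
```

To mirror the template above, each entry in `commands` would be something like `["badread", "simulate", "--reference", "genome.fa", "--length", "15000,13000"]`, repeated once per desired run.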

@jsgounot

Before anyone does what I did and blindly follows W-L's answer, note that doing so will on some occasions generate the same read name multiple times. This might affect your pipeline, especially if you clean your reads later: minimap2 does not care whether multiple reads share a name and will simply map each one individually, leading to secondary / chimeric alignments.
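A quick way to check merged output for this problem is to count read names across the per-run files before concatenating them. The sketch below assumes plain four-line FASTQ records (no wrapped sequence lines); the function name is made up for this example.

```python
from collections import Counter


def duplicated_read_names(fastq_paths):
    """Return {read_name: count} for names that occur more than once.

    Assumes four-line FASTQ records without wrapped sequence lines.
    """
    counts = Counter()
    for path in fastq_paths:
        with open(path) as handle:
            for line_number, line in enumerate(handle):
                # Every fourth line, starting at 0, is a record header.
                if line_number % 4 == 0 and line.strip():
                    # Name = text after '@' up to the first whitespace.
                    counts[line[1:].split()[0]] += 1
    return {name: n for name, n in counts.items() if n > 1}
```

An empty result means every read name is unique across the given files.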

@mbhall88

@jsgounot did you get the same read name multiple times? If so, you should buy a lottery ticket, as the read names are generated with uuid:

read_name = uuid.UUID(int=random.getrandbits(128))
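Names built this way are only as unique as the random bits behind them. One way identical names could arise, offered as an assumption rather than a diagnosis of the report above, is several runs starting from the same random state, for example the same seed. The sketch mirrors the quoted line but uses a private `random.Random` instance; `read_names` is a name made up for this example.

```python
import random
import uuid


def read_names(seed, n=3):
    """Generate n UUID-style read names from an RNG seeded with `seed`."""
    rng = random.Random(seed)  # private RNG, global state is untouched
    return [str(uuid.UUID(int=rng.getrandbits(128))) for _ in range(n)]


# Same seed: the name sequences are identical, so merged runs would collide.
same = read_names(1) == read_names(1)
# Different seeds: the sequences differ.
different = read_names(1) != read_names(2)
```

Under that assumption, giving each parallel run its own seed, or leaving the runs unseeded, would avoid the collisions.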

@jsgounot

I know, but I'm not as lucky with the lottery, sadly...


4 participants