Multi-threading #7

baraaorabi · 2019-09-26T23:17:47Z

Is your feature request related to a problem? Please describe.

The simulator is very slow when it comes to:

Adjusting the lengths of reference contigs. This might not be an issue for human genomes, but it is a big issue for the transcriptome (>180K contigs vs. 23).
Generating reads.

Both of these steps should have straightforward data parallelism.

Describe the solution you'd like
Multithreading of the two steps (and possibly others?)
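To make the "data parallelism" idea concrete, here is a minimal sketch of processing contigs independently with a worker pool. It is not Badread's code: `adjust_contig` is a hypothetical stand-in for the per-contig preparation step, and the function and parameter names are assumptions made up for illustration.

```python
from concurrent.futures import ThreadPoolExecutor


def adjust_contig(name, seq):
    """Hypothetical placeholder for per-contig work (NOT a Badread function).

    Here it only returns the contig length so the sketch stays runnable.
    """
    return name, len(seq)


def _adjust_item(item):
    # Top-level helper so it can also be used with a process pool.
    name, seq = item
    return adjust_contig(name, seq)


def prepare_contigs(contigs, workers=4, executor_cls=ThreadPoolExecutor):
    """Process each contig independently and collect the results in a dict."""
    with executor_cls(max_workers=workers) as pool:
        # map() preserves input order; each item is handled on its own.
        return dict(pool.map(_adjust_item, contigs.items()))
```

In CPython, threads are unlikely to speed up CPU-bound pure-Python work because of the global interpreter lock, so for a real speedup one would probably pass `concurrent.futures.ProcessPoolExecutor` as `executor_cls` instead.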

Describe alternatives you've considered
Adding a program command to prepare the reference contigs and pickle the results so rerunning won't be slow. That won't really resolve the read generation speed, though.
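The pickling alternative could look roughly like the sketch below. The `prepare` callable, the cache path, and the modification-time check are all assumptions for illustration; nothing here is an existing Badread command or function.

```python
import os
import pickle


def load_or_prepare(reference_path, cache_path, prepare):
    """Return prepared contigs, reusing a pickle cache when it is up to date.

    `prepare` is a hypothetical callable that takes the reference path and
    returns any picklable object.
    """
    cache_is_fresh = (
        os.path.exists(cache_path)
        and os.path.getmtime(cache_path) >= os.path.getmtime(reference_path)
    )
    if cache_is_fresh:
        with open(cache_path, "rb") as handle:
            return pickle.load(handle)

    # Cache missing or older than the reference: redo the slow step once.
    prepared = prepare(reference_path)
    with open(cache_path, "wb") as handle:
        pickle.dump(prepared, handle, protocol=pickle.HIGHEST_PROTOCOL)
    return prepared
```

As the comment above already notes, this only helps with repeated runs on the same reference; it does nothing for read generation speed.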

Additional context
I am building a wrapper around Badread for transcriptomic reads. It's still in the design stage. I plan to code the multithreading described above on a separate branch and open a PR.

W-L · 2021-07-02T13:06:47Z

For others stumbling across this issue, here's a little snakemake template that mimics multi-threading by running badread multiple times and concatenating the fastq files at the end.

threads = list(range(10))
genome = "genome.fa"

# example to pass through parameters
rlen_mean = 15000
rlen_sd = 13000
sim_params = {"rlen": f"{rlen_mean},{rlen_sd}"}

rule all:
    input:
        expand("reads_{t}.fq", t=threads),
        "sim_reads.fq"

# run badread simulate multiple times on the same input genome
rule badread_sim:
    input: genome
    output: "reads_{t}.fq"
    params:
        rlen = lambda wildcards: sim_params['rlen']
    shell:
        "badread simulate --reference {input} --length {params.rlen} >{output}"

# afterwards simply concatenate all output read files
rule concat_sim:
    input: expand("reads_{t}.fq",t=threads)
    output: "sim_reads.fq"
    shell:
        "cat {input} > {output}"

Just saw that there is already a wiki entry for doing exactly the same thing in bash. Anyway, maybe this is still useful for someone.

jsgounot · 2022-03-25T06:25:36Z

Before anyone does the same thing I did and blindly follows W-L's answer, note that doing so will on some occasions generate the same read name multiple times. This might affect your pipeline, especially if you clean your reads later, since minimap2 does not care if multiple reads with the same name appear and will just map them individually, leading to secondary / chimeric alignments.
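For anyone who wants to check their own concatenated output, a small sketch along these lines can list read names that occur more than once. It assumes plain four-line FASTQ records (no wrapped sequences) and treats the first whitespace-separated token after `@` as the read name; the function name is made up for illustration.

```python
from collections import Counter


def duplicate_read_names(fastq_lines):
    """Return {name: count} for read names that appear more than once.

    Assumes four-line FASTQ records, so every 4th line (0, 4, 8, ...) is a
    header line starting with '@'.
    """
    counts = Counter()
    for index, line in enumerate(fastq_lines):
        if index % 4 != 0:
            continue
        fields = line.strip()[1:].split()  # drop '@', split off the comment
        if fields:
            counts[fields[0]] += 1
    return {name: n for name, n in counts.items() if n > 1}
```

It can be called on an open file handle, for example `duplicate_read_names(open("sim_reads.fq"))`, where `sim_reads.fq` is the concatenated file from the template above. An empty dict means no name was repeated.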

mbhall88 · 2023-06-14T05:20:39Z

@jsgounot did you get the same read name multiple times? If so, you should buy a lottery ticket, as the read names are generated with uuid.

Badread/badread/simulate.py

Line 77 in 09fb308

read_name = uuid.UUID(int=random.getrandbits(128))
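As a side note on how the quoted line behaves, the sketch below builds names the same way but from an explicit `random.Random` instance (an assumption made so the seed is visible; the quoted line uses the module-level `random`). It only illustrates a general property of seeded generators: names drawn from 128 random bits are practically never repeated within a run, but two generators started from the same seed produce identical names. Whether any set of parallel Badread runs actually shared generator state is not established in this thread.

```python
import random
import uuid


def read_names(seed, count):
    """Generate `count` UUID-style names from a seeded generator.

    Illustrative only: mirrors uuid.UUID(int=random.getrandbits(128)) but
    with a local random.Random so the seed is explicit.
    """
    rng = random.Random(seed)
    return [str(uuid.UUID(int=rng.getrandbits(128))) for _ in range(count)]


# Identical seeds give identical name sequences; different seeds do not.
same_seed_matches = read_names(1, 3) == read_names(1, 3)
different_seed_matches = read_names(1, 3) == read_names(2, 3)
```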

jsgounot · 2023-06-14T09:14:50Z

I know but I'm not as lucky with the lottery sadly ...

baraaorabi added the enhancement New feature or request label Sep 26, 2019
