
Advice for optimising ABySS assemblies #490

Open
NatJWalker-Hale opened this issue Jan 21, 2025 · 5 comments

@NatJWalker-Hale

Dear ABySS team,

Thanks very much for developing and supporting ABySS!

I'm working on a large multispecies plant genome sequencing project where, due to the quality of the input DNA (an unavoidable constraint), we can only sequence relatively short-insert libraries with PE150 reads. I've been comparing the performance of ABySS and SOAPdenovo2. We have very high coverage (in this particular test case, close to 300x for a diploid genome with a ~1 Gb haploid size and 0.12% heterozygosity), so prior to assembly I've used Brian Bushnell's tadpole to do error correction and bbnorm to normalise k-mer coverage to 60x (31-mers by default). Because of the short inserts, I've also tested using merged reads in the assembly (with bbmerge); typically a very high proportion (~70-80%) of pairs overlap and are merged.
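For reference, the preprocessing was along these lines (file names are illustrative placeholders rather than the exact commands I ran):

    # error correction with tadpole (BBTools)
    tadpole.sh in1=raw_1.fq.gz in2=raw_2.fq.gz out1=ecc_1.fq.gz out2=ecc_2.fq.gz mode=correct
    # normalise k-mer coverage to ~60x with bbnorm (31-mers by default)
    bbnorm.sh in1=ecc_1.fq.gz in2=ecc_2.fq.gz out1=norm_1.fq.gz out2=norm_2.fq.gz target=60
    # merge overlapping pairs with bbmerge, keeping unmerged pairs separately
    bbmerge.sh in1=ecc_1.fq.gz in2=ecc_2.fq.gz out=merged.fq.gz outu1=unmerged_1.fq.gz outu2=unmerged_2.fq.gz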

I've used the same inputs for ABySS and compared a couple of options: one run with unmerged reads, and a second run with both the original unmerged reads and the merged reads (that is, not just the pairs surviving merging, but the whole original dataset). In each case, I've done a grid search over k=53-123 (step size 10) and kc=2-3. I've found that using the merged reads is generally slightly deleterious to scaffold N50 and BUSCO completeness. However, I'm wondering whether I'm actually giving ABySS the best setup to succeed. For example, I'm wondering if Konnector2 might work well given our high coverage, and I'm thinking it should probably be run prior to any coverage normalisation to make the most of the excess coverage. I'm also curious about your opinion on error correction - I'm assuming it will generally improve assemblies, or at least not be deleterious.
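For concreteness, the grid search above was roughly of this form (B, name, and the directory layout are illustrative placeholders, not the exact commands):

    for k in $(seq 53 10 123); do
        for kc in 2 3; do
            mkdir -p k${k}_kc${kc}
            (cd k${k}_kc${kc} && abyss-pe k=$k kc=$kc B=50G name=asm \
                lib='pe' pe='../norm_1.fq.gz ../norm_2.fq.gz')
        done
    done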

The ABySS pipeline I'm planning to test is:

  • run Konnector2 on the tadpole error-corrected reads (no normalisation).
  • assemble the Konnector2 output alongside the normalised paired-end reads (a rough sketch of the Konnector2 step follows).
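A rough sketch of the Konnector2 step I have in mind (k, threads, Bloom filter size, and the output prefix are placeholders, and the exact options would need checking against konnector --help):

    konnector -j16 -k96 -b40G -o kon ecc_1.fq.gz ecc_2.fq.gz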

My major question is whether the normalisation could be deleterious, and whether it would instead be better to run ABySS on the original or error-corrected reads and simply grid search over a higher range of kc values.

Ultimately I can test this comparatively, but I just wanted to ask for your thoughts before embarking on it in case there are any obvious pitfalls.

Thanks!

Nat

@warrenlr
Contributor

Hello Nat,
Thank you for your message and interest in ABySS.

Error correction is typically not a concern for DBG-based assemblers, as error k-mers form short branches in the graph that are pruned by the ABySS algorithms. To help get some guidance on setting kc (a sweep is generally a great first approach), I recommend plotting the k-mer multiplicity histogram with ntCard and seeing where the error threshold sits.
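For example, something like the following (k, threads, and file names are placeholders):

    ntcard -t16 -k96 -p freq reads_1.fq.gz reads_2.fq.gz

This should write freq_k96.hist; a reasonable kc typically sits just past the error peak, where the low-multiplicity tail meets the genomic peak.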

So, yes, run Konnector2 and ABySS-mergepairs; we have assembled several spruce genomes using this strategy: first merge your libraries, but also use both the raw PE reads and the merged reads as a source of k-mers.

Rene

@NatJWalker-Hale
Author

NatJWalker-Hale commented Jan 23, 2025

Thanks @warrenlr,

Just to clarify: as in that paper, one first runs Konnector with cascading k-mers and then uses anything that is not connected as the input to ABySS-mergepairs?

When I input multiple libraries to ABySS, could you clarify whether there is a particular usage of lib='' that needs to be specified for this to work optimally? So far, when trialling both merged and raw PE reads, I have done the following:

lib='pe se' pe='raw_1.fq.gz raw_2.fq.gz' se='merged.fq.gz'

such that the input with Konnector reads would be:

lib='pe se kon' pe='raw_1.fq raw_2.fq' se='merged.fq' kon='konnector.fq'

Is that correct?

Thanks very much again for the help!

@warrenlr
Contributor

warrenlr commented Jan 23, 2025

Q1: it makes more sense to run ABySS-mergepairs first, on the PE reads with sequence overlap. Non-overlapping PE reads would then be merged/connected with Konnector, since it is the more computationally intensive/demanding step.
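Roughly (the -o prefixes and the names of the unmerged-read outputs are illustrative; check abyss-mergepairs --help and konnector --help for the exact options and output naming):

    # 1) merge the pairs that overlap
    abyss-mergepairs -o olap reads_1.fq.gz reads_2.fq.gz
    # 2) connect the pairs that did not overlap (output file names assumed)
    konnector -j16 -k96 -b40G -o kon olap_reads_1.fastq olap_reads_2.fastq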

Q2:
From the docs:

abyss-pe k=96 B=2G name=ecoli lib='pea peb' \
	pea='pea_1.fa pea_2.fa' peb='peb_1.fa peb_2.fa' \
	se='se1.fa se2.fa'

This is more likely:

	lib='pe' pe='raw_1.fq raw_2.fq' se='merged.fq konnector.fq'

@jwcodee : could you please confirm?

@jwcodee
Member

jwcodee commented Feb 18, 2025

@warrenlr, that appears to be correct.

@NatJWalker-Hale, the lib variable is primarily a convenience when working with multiple paired-end libraries; specifying just pe should generally suffice. The lib variable is used to pass paired-end reads to a fixmate step, which relies on read-pair information. Including single-end (SE) reads (such as Konnector or merged pairs) there would increase runtime without providing additional benefit, so it is advisable to keep SE reads out of lib and pass them via se=, as shown in Rene's example.
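In other words, a full invocation along the lines of Rene's example would look roughly like this (k, kc, B, and name are placeholders to adjust for your data):

    abyss-pe k=96 kc=2 B=50G name=asm \
        lib='pe' pe='raw_1.fq.gz raw_2.fq.gz' \
        se='merged.fq.gz konnector.fq.gz'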
