
Advice for optimising ABySS assemblies #490

NatJWalker-Hale opened this issue Jan 21, 2025 · 3 comments
@NatJWalker-Hale

Dear ABySS team,

Thanks very much for developing and supporting ABySS!

I'm working on a large multispecies plant genome sequencing project where, due to the quality of the input DNA (an unavoidable constraint), we can only sequence relatively short-insert libraries with PE150 reads. I've been comparing the performance of ABySS and SOAPdenovo2. We have very high coverage (in this particular test case, close to 300x for a diploid genome with a ~1 Gb haploid size and 0.12% heterozygosity), so prior to assembly I've used Brian Bushnell's tadpole to do error correction and bbnorm to normalise k-mer coverage to 60x (31-mers by default). Because of the short inserts, I've also tested using merged reads in the assembly (with bbmerge) - typically a very high proportion (~70-80%) are overlapping and merged.
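For reference, the preprocessing described above might look something like the following sketch (file names are placeholders; the flags are standard BBTools options, but check your version's defaults):

```shell
# Error-correct reads with tadpole (uses 31-mers by default)
tadpole.sh in1=raw_1.fq.gz in2=raw_2.fq.gz \
    out1=ecc_1.fq.gz out2=ecc_2.fq.gz mode=correct

# Normalise k-mer coverage to ~60x with bbnorm
bbnorm.sh in1=ecc_1.fq.gz in2=ecc_2.fq.gz \
    out1=norm_1.fq.gz out2=norm_2.fq.gz target=60

# Merge overlapping pairs with bbmerge; unmerged pairs go to outu1/outu2
bbmerge.sh in1=ecc_1.fq.gz in2=ecc_2.fq.gz \
    out=merged.fq.gz outu1=unmerged_1.fq.gz outu2=unmerged_2.fq.gz
```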

I've used the same inputs for ABySS and compared a couple of options: one run with unmerged reads, and a second run with both the original unmerged reads and the merged reads (that is, not just the pairs surviving merging, but the whole original dataset). In each case, I've done a grid search over k=53-123 (step size 10) and kc=2-3. I've found that using the merged reads is generally slightly deleterious to scaffold N50 and BUSCO completeness. However, I'm wondering whether I'm actually giving ABySS the best setup to succeed. For example, I'm wondering if Konnector2 might work well given our high coverage, and I'm thinking it should probably be run prior to any coverage normalisation to make the most of the excess coverage. I'm also curious about your opinion on error correction - I'm assuming it will generally improve assemblies, or at least not be deleterious.
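A grid search like the one described can be scripted as a simple loop (paths and the Bloom filter memory `B` are illustrative; `k` and `kc` are real abyss-pe parameters):

```shell
# Sweep k=53..123 (step 10) and kc=2..3, one directory per combination
for k in $(seq 53 10 123); do
  for kc in 2 3; do
    mkdir -p k${k}_kc${kc}
    ( cd k${k}_kc${kc} && \
      abyss-pe k=${k} kc=${kc} B=50G name=asm \
          lib='pe' pe='../raw_1.fq.gz ../raw_2.fq.gz' )
  done
done
```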

The ABySS pipeline I'm planning to test is:

  • run Konnector2 on the tadpole error-corrected reads (no normalisation).
  • assemble Konnector2 reads alongside normalised paired end reads.

My major question is if the normalisation could be potentially deleterious and if instead it would be better to run ABySS on the original or error-corrected reads and just grid search over a higher range of kc values?

Ultimately I can test this comparatively, but I just wanted to ask for your thoughts before embarking on it in case there are any obvious pitfalls.

Thanks!

Nat

@warrenlr
Contributor

Hello Nat,
Thank you for your message and interest in ABySS.

Error correction is typically not a concern for DBG-based assemblers: error k-mers form short branches in the graph, which are pruned by the ABySS algorithms. To help guide your choice of kc (a sweep is generally a great first approach), I recommend plotting the k-mer multiplicity histogram with ntCard and seeing where the error threshold sits.
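As a sketch of that diagnostic (ntCard writes one histogram per k value, named `<prefix>_k<k>.hist`; reading off the threshold is manual):

```shell
# Count k-mer multiplicities for the assembly k values of interest
ntcard -k 53,83,113 -p freq raw_1.fq.gz raw_2.fq.gz

# Inspect e.g. freq_k83.hist: the first local minimum after
# multiplicity 1 separates error k-mers from genomic k-mers,
# and is a reasonable starting point for kc.
```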

So, yes, run Konnector2 and ABySS-mergepairs; we have assembled several spruce genomes using this strategy: first merge your libraries, then use both the raw PE reads and the merged reads as a source of k-mers.

Rene

@NatJWalker-Hale
Author

NatJWalker-Hale commented Jan 23, 2025

Thanks @warrenlr,

Just to clarify, as in that paper, one first runs Konnector with cascading k-mers and then uses anything that is not connected as the input for ABySS-mergepairs?

When I input multiple libraries to ABySS, could you clarify whether a particular usage of lib='' needs to be specified for this to work optimally? So far, when trialling both merged and raw PE reads, I have done the following:

lib='pe se' pe='raw_1.fq.gz raw_2.fq.gz' se='merged.fq.gz'

such that the input with Konnector reads would be:

lib='pe se kon' pe='raw_1.fq raw_2.fq' se='merged.fq' kon='konnector.fq'

Is that correct?

Thanks very much again for the help!

@warrenlr
Contributor

warrenlr commented Jan 23, 2025

Q1: it makes more sense to run ABySS-mergepairs first, on PE reads with sequence overlap. Non-overlapping PE reads would then be merged/connected with Konnector, since Konnector is the more computationally intensive step.
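In command form, the order suggested here might look like the following (file names are illustrative; `-o` and `-k` are the standard options, but the exact output file names for unmerged/unconnected reads vary by ABySS version, so check what each tool writes):

```shell
# 1. Merge overlapping pairs first (the cheap step)
abyss-mergepairs -o merged raw_1.fq.gz raw_2.fq.gz

# 2. Connect the remaining non-overlapping pairs with Konnector
#    (inputs are the reads left unmerged by step 1)
konnector -k 96 -o connected \
    merged_reads_1.fastq merged_reads_2.fastq
```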

Q2:
From the docs:

abyss-pe k=96 B=2G name=ecoli lib='pea peb' \
	pea='pea_1.fa pea_2.fa' peb='peb_1.fa peb_2.fa' \
	se='se1.fa se2.fa'

This is more likely:
lib='pe' pe='raw_1.fq raw_2.fq' se='merged.fq konnector.fq'
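Putting that together, a full invocation along those lines (names and resource settings illustrative) would be:

```shell
abyss-pe k=96 kc=3 B=50G name=asm lib='pe' \
    pe='raw_1.fq raw_2.fq' \
    se='merged.fq konnector.fq'
```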

@jwcodee : could you please confirm?
