
Advice for optimising ABySS assemblies #490

NatJWalker-Hale opened this issue Jan 21, 2025 · 3 comments
@NatJWalker-Hale

Dear ABySS team,

Thanks very much for developing and supporting ABySS!

I'm working on a large multispecies plant genome sequencing project where, due to the quality of the input DNA (an unavoidable constraint), we can only sequence relatively short-insert libraries with PE150 reads. I've been comparing the performance of ABySS and SOAPdenovo2. We have very high coverage (in this particular test case, close to 300x for a diploid genome with a ~1 Gb haploid size and 0.12% heterozygosity), so prior to assembly I've used Brian Bushnell's tadpole to do error correction and bbnorm to normalise k-mer coverage to 60x (31-mers by default). Because of the short inserts, I've also tested using merged reads in the assembly (with bbmerge) - typically a very high proportion (~70-80%) are overlapping and merged.
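For reference, the preprocessing described above might look something like the following sketch (file names are placeholders; the flags are standard BBTools options, but check your version's defaults):

```shell
# Error-correct reads with tadpole (uses 31-mers by default)
tadpole.sh in1=raw_1.fq.gz in2=raw_2.fq.gz \
    out1=ecc_1.fq.gz out2=ecc_2.fq.gz mode=correct

# Normalise k-mer coverage to ~60x with bbnorm
bbnorm.sh in1=ecc_1.fq.gz in2=ecc_2.fq.gz \
    out1=norm_1.fq.gz out2=norm_2.fq.gz target=60

# Merge overlapping pairs with bbmerge; unmerged pairs go to outu1/outu2
bbmerge.sh in1=ecc_1.fq.gz in2=ecc_2.fq.gz \
    out=merged.fq.gz outu1=unmerged_1.fq.gz outu2=unmerged_2.fq.gz
```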

I've used the same inputs for ABySS and compared a couple of options: one run with unmerged reads, and a second run with both the original unmerged reads and the merged reads (that is, not just the pairs surviving merging, but the whole original dataset). In each case, I've done a grid search over k=53-123 (step size 10) and kc=2-3. I've found that using the merged reads is generally slightly deleterious to scaffold N50 and BUSCO completeness. However, I'm wondering whether I'm actually giving ABySS the best setup to succeed. For example, I'm wondering if Konnector2 might work well given our high coverage, and I'm thinking it should probably be run prior to any coverage normalisation to make the most of the excess coverage. I'm also curious about your opinion on error correction - I'm assuming it will generally improve assemblies, or at least not be deleterious.
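A grid search like the one described can be scripted as a simple loop (paths and the Bloom filter memory `B` are illustrative; `k` and `kc` are real abyss-pe parameters):

```shell
# Sweep k=53..123 (step 10) and kc=2..3, one directory per combination
for k in $(seq 53 10 123); do
  for kc in 2 3; do
    mkdir -p k${k}_kc${kc}
    ( cd k${k}_kc${kc} && \
      abyss-pe k=${k} kc=${kc} B=50G name=asm \
          lib='pe' pe='../raw_1.fq.gz ../raw_2.fq.gz' )
  done
done
```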

The ABySS pipeline I'm planning to test is:

  • run Konnector2 on the tadpole error-corrected reads (no normalisation).
  • assemble Konnector2 reads alongside normalised paired end reads.

My major question is if the normalisation could be potentially deleterious and if instead it would be better to run ABySS on the original or error-corrected reads and just grid search over a higher range of kc values?

Ultimately I can test this comparatively, but I just wanted to ask for your thoughts before embarking on it in case there are any obvious pitfalls.

Thanks!

Nat

@warrenlr
Contributor

Hello Nat,
Thank you for your message and interest in ABySS.

Error correction is typically not a concern for DBG-based assemblers: error k-mers form short branches in the graph, which are pruned by the ABySS algorithms. To help guide your choice of kc (a sweep is generally a great first approach), I recommend plotting the k-mer multiplicity histogram with ntCard and seeing where the error threshold sits.
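As a sketch of that diagnostic (ntCard writes one histogram per k value, named `<prefix>_k<k>.hist`; reading off the threshold is manual):

```shell
# Count k-mer multiplicities for the assembly k values of interest
ntcard -k 53,83,113 -p freq raw_1.fq.gz raw_2.fq.gz

# Inspect e.g. freq_k83.hist: the first local minimum after
# multiplicity 1 separates error k-mers from genomic k-mers,
# and is a reasonable starting point for kc.
```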

So, yes, run Konnector2 and ABySS-mergepairs; we have assembled several spruce genomes using this strategy: first merge your libraries, then use both the raw PE reads and the merged reads as a source of k-mers.

Rene

@NatJWalker-Hale
Author

NatJWalker-Hale commented Jan 23, 2025

Thanks @warrenlr,

Just to clarify, as in that paper, one first runs Konnector with cascading k-mers and then uses anything that is not connected as the input for ABySS-mergepairs?

When I input multiple libraries to ABySS, could you clarify whether a particular usage of lib='' needs to be specified for this to work optimally? So far, when trialling both merged and raw PE reads, I have done the following:

lib='pe se' pe='raw_1.fq.gz raw_2.fq.gz' se='merged.fq.gz'

such that the input with Konnector reads would be:

lib='pe se kon' pe='raw_1.fq raw_2.fq' se='merged.fq' kon='konnector.fq'

Is that correct?

Thanks very much again for the help!

@warrenlr
Contributor

warrenlr commented Jan 23, 2025

Q1: it makes more sense to run ABySS-mergepairs first, on PE reads with sequence overlap. Non-overlapping PE reads would then be merged/connected with Konnector, since Konnector is the more computationally intensive step.
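In command form, the order suggested here might look like the following (file names are illustrative; `-o` and `-k` are the standard options, but the exact output file names for unmerged/unconnected reads vary by ABySS version, so check what each tool writes):

```shell
# 1. Merge overlapping pairs first (the cheap step)
abyss-mergepairs -o merged raw_1.fq.gz raw_2.fq.gz

# 2. Connect the remaining non-overlapping pairs with Konnector
#    (inputs are the reads left unmerged by step 1)
konnector -k 96 -o connected \
    merged_reads_1.fastq merged_reads_2.fastq
```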

Q2:
From the docs:

abyss-pe k=96 B=2G name=ecoli lib='pea peb' \
	pea='pea_1.fa pea_2.fa' peb='peb_1.fa peb_2.fa' \
	se='se1.fa se2.fa'

This is more likely:
lib='pe' pe='raw_1.fq raw_2.fq' se='merged.fq konnector.fq'
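Putting that together, a full invocation along those lines (names and resource settings illustrative) would be:

```shell
abyss-pe k=96 kc=3 B=50G name=asm lib='pe' \
    pe='raw_1.fq raw_2.fq' \
    se='merged.fq konnector.fq'
```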

@jwcodee : could you please confirm?
