Advice for optimising ABySS assemblies #490
Comments
Hello Nat, Error correction is typically not a concern for DBG-based assemblers, because erroneous k-mers form short branches in the graph that are pruned by the ABySS algorithms. For guidance on setting kc (a parameter sweep is generally a great first approach), I recommend plotting the k-mer multiplicity histogram with ntCard and seeing where the error threshold sits. So, yes, run Konnector2 and ABySS-mergepairs; we have assembled several spruce genomes using this strategy, first merging the libraries and then using both the raw PE reads and the merged reads as a source of k-mers. Rene
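As a rough illustration of reading the error threshold off a k-mer multiplicity histogram, here is a minimal sketch. It assumes the histogram has already been loaded as (multiplicity, count) pairs; the function name `suggest_kc` and the toy counts are made up for illustration and are not part of ntCard or ABySS. It simply reports the first valley between the low-multiplicity error peak and the main coverage peak, which is one possible starting point for a kc sweep.

```python
def suggest_kc(hist):
    """Return the multiplicity at the first valley of a k-mer histogram.

    hist: list of (multiplicity, count) pairs sorted by multiplicity.
    Returns None if the counts never stop decreasing (no valley found).
    """
    for i in range(1, len(hist) - 1):
        prev_count = hist[i - 1][1]
        mult, count = hist[i]
        next_count = hist[i + 1][1]
        # A valley: lower than the previous bin, not higher than the next.
        if count < prev_count and count <= next_count:
            return mult
    return None


# Made-up counts for illustration only (not real data).
toy_hist = [(1, 900), (2, 300), (3, 120), (4, 150), (5, 400), (6, 700), (7, 650)]
print(suggest_kc(toy_hist))
```

Real histograms are noisier than this toy example, so plotting the histogram and checking the valley by eye is still advisable before fixing kc.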
Thanks @warrenlr, Just to clarify, as in that paper, one first runs Konnector with cascading k-mer sizes and then uses any pairs that remain unconnected as the input for ABySS-mergepairs? When I input multiple libraries to ABySS, could you clarify if there is a particular usage of
such that the input with Konnector reads would be:
Is that correct? Thanks very much again for the help!
Q1: It makes more sense to run ABySS-mergepairs first, for PE reads with sequence overlap. Non-overlapping PE reads would then be connected with Konnector, since that is the more computationally demanding step. Q2:
This is more likely: @jwcodee: could you please confirm?
Dear ABySS team,
Thanks very much for developing and supporting ABySS!
I'm working on a large multispecies plant genome sequencing project where, due to the quality of the input DNA (an unavoidable constraint), we can only sequence relatively short insert libraries with PE150 reads. I've been comparing the performance of ABySS and SOAPdenovo2. We have very high coverage (in this particular test case, close to 300x for a diploid genome with a 1 Gb haploid size and 0.12% heterozygosity), and so prior to assembly I've used Brian Bushnell's tadpole to do error correction and bbnorm to normalise k-mer coverage to 60x (using 31-mers, the default). Because of the short inserts, I've also tested using merged reads in the assembly (with bbmerge); typically a very high proportion (~70-80%) of pairs overlap and are merged.
I've used the same inputs for ABySS and compared a couple of options, one run with unmerged reads and a second run with both the original unmerged reads and the merged reads (that is, not just the pairs surviving merging, but the whole original dataset). In each case, I've done a grid search over k=53-123 (step size 10) and kc=2-3. I've found that using the merged reads is generally slightly deleterious to scaffold N50 and BUSCO completeness. However, I'm wondering if I'm actually providing ABySS the best setup to succeed. For example, I'm wondering if Konnector2 might work well due to our high coverage, and I'm thinking that it should probably be used prior to any coverage normalisation to make the most of the excess coverage. I'm also curious for your opinion on error correction - I'm assuming it will generally improve assemblies, or at least not be deleterious.
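For reference, the grid described above (k=53-123 in steps of 10, kc=2-3) could be enumerated with a short sketch like the following. The read file names, the run-name pattern and the Bloom filter size `B=50G` are placeholders chosen for illustration, not values from this thread; `abyss_grid` is a hypothetical helper that only builds command strings and does not run anything.

```python
from itertools import product


def abyss_grid(k_values, kc_values, reads, bloom_size="50G"):
    """Build one abyss-pe command string per (k, kc) combination.

    reads and bloom_size are placeholders; adjust them to the real data.
    """
    commands = []
    for k, kc in product(k_values, kc_values):
        name = f"asm_k{k}_kc{kc}"  # hypothetical run-name pattern
        commands.append(
            f"abyss-pe name={name} k={k} kc={kc} B={bloom_size} in='{reads}'"
        )
    return commands


k_values = range(53, 124, 10)  # 53, 63, ..., 123
kc_values = (2, 3)
cmds = abyss_grid(k_values, kc_values, "reads_1.fq.gz reads_2.fq.gz")
for cmd in cmds:
    print(cmd)
```

Running each command in its own directory keeps the outputs of the different (k, kc) combinations separate, and widening `kc_values` is enough to extend the sweep to higher thresholds.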
The ABySS pipeline I'm planning to test is:
My major question is whether the normalisation could be deleterious, and whether it would instead be better to run ABySS on the original or error-corrected reads and simply grid search over a higher range of kc values.
Ultimately I can test this comparatively, but I just wanted to ask for your thoughts before embarking on it in case there are any obvious pitfalls.
Thanks!
Nat