Assemblathon 1

I’ve been a little slow to catch up on the results of the Assemblathon, a competitive assembly event where teams use their best method(s) to generate assemblies from raw read data and the results are compared by a variety of metrics. The results from the first assemblathon, using simulated read sets, are now available pre-publication from Genome Research. The second assemblathon, using real (Illumina and Illumina+454) data from eukaryotic genomes, is happening now.

Firstly, I think this is a brilliant idea, and there should be far more of it in bioinformatics! So many of us are engaged in the same basic analysis tasks for dealing with short read sequence data – assembly, mapping and variant calling – but there are so many different programs & approaches out there (see this compilation over at seqanswers.com) that it quickly becomes overwhelming.

So, the results are in, but, as is usual when comparing methods, there is no clear-cut winner. Each assembly was assessed using an enormous set of metrics (>100 apparently), including N50 (at contig and scaffold levels), miscalled bases, depth of coverage, misassemblies, etc., and unsurprisingly no single assembly scored top on all of them. BGI’s SOAPdenovo, Broad’s ALLPATHS and Sanger’s SGA were consistently among the best on most metrics, but with clear differences. For example, SOAPdenovo and ALLPATHS were both superior to SGA on contig N50, while SGA performed better than the others on scaffold N50. SGA had the fewest substitution errors, but SOAPdenovo had fewer copy number errors and ALLPATHS had the best contig-level stats. For all the gory details see the results website, the summaries in the paper [free at Genome Res], or this talk [PDF] presented at the Cold Spring Harbor Laboratory Biology of Genomes meeting.
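For readers less familiar with the N50 statistic mentioned above: it is the contig (or scaffold) length L such that pieces of length ≥ L account for at least half the total assembly size. A minimal sketch in Python (not taken from any of the assemblers discussed):

```python
def n50(lengths):
    """Return the N50 of a list of contig/scaffold lengths:
    the length L such that pieces of length >= L cover at
    least half of the total assembled bases."""
    total = sum(lengths)
    running = 0
    # Walk from the longest piece down, accumulating bases
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0  # empty input
```

So an assembly of contigs [100, 50, 30, 20] has an N50 of 100, even though most contigs are much shorter – which is exactly why N50 alone can be a misleading quality measure and the assemblathon used so many complementary metrics.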

I am trying to wrap my head around how informative this is for assembling bacterial genomes. I know a lot of people run their own in-house comparisons to determine the best approach for a particular project, but the assemblathon approach is systematic, and manages to be both competitive and collaborative, which is an awesome combination. While bacterial genomes are small, and therefore raise fewer of the computational issues associated with large data and memory requirements, assembling them is still far from trivial and is often a crucial element of the analysis, because gene content is so variable even among very closely-related bacteria. The parameters I usually have in mind when considering bacterial assemblers are:

  • impact of different sequencing platforms & error profiles
  • impact of different insert sizes for paired or mate-pair reads
  • genomes with high or low GC
  • genomes with excessive IS elements

I guess the only aspect of this that’s missing from the current/planned assemblathon datasets is the effect of high or low G+C content, i.e. low complexity sequence, which isn’t really bacteria-specific anyway (think e.g. P. falciparum, the malaria parasite).
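For concreteness, the G+C content discussed above is just the fraction of G and C bases, often scanned in windows along the genome to spot locally biased regions; a trivial sketch (my own illustration, not from the assemblathon):

```python
def gc_content(seq):
    """Fraction of G and C bases in a sequence (case-insensitive)."""
    seq = seq.upper()
    if not seq:
        return 0.0
    return (seq.count("G") + seq.count("C")) / len(seq)

def gc_windows(seq, size):
    """G+C fraction in consecutive non-overlapping windows,
    e.g. to flag AT-rich stretches like those in P. falciparum."""
    return [gc_content(seq[i:i + size]) for i in range(0, len(seq), size)]
```

Extremely AT-rich (or GC-rich) windows tend to produce low-complexity, repetitive k-mer spectra, which is where assemblers struggle regardless of whether the genome is bacterial.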