Visualising trees

I just came across this very cool visual dictionary of tree visualisation methods – treevis, http://www.informatik.uni-rostock.de/~hs162/treeposter/poster.html, which made me think about the language of phylogenetic trees. It’s interesting to think how our ideas about phylogenetic inference, and the ways in which we interpret trees, are influenced by the way the tree is represented.

Whether studying bacterial populations or bacterial communities, we tend to use phylogenetics to break down, comprehend and represent the relationships between bacterial genomes. A basic phylogenetic tree can capture so much information… a great example is the recent Nature paper on the seventh cholera pandemic, out of the Sanger Institute (see PubMed entry, the paper itself is behind the NPG paywall). The phylogenetic tree structure showing relationships between Vibrio cholerae strains (obtained of course by whole genome sequencing with Illumina), together with the time and location that each strain was isolated, reveals an incredible amount of detail about how the pandemic has spread around the world over the last few decades.

But interpreting trees can be difficult. And the way the trees are represented can make them more or less difficult to interpret. As a simple example, I’m often surprised how many people are unaware that these two trees are just alternative representations of the same structure:

(If you don’t believe me, copy the tree structure below into a text file, open it in a tree viewer like Dendroscope or FigTree and click through the different representations.)

((A:0,B:0):0.2,(C:0.5,((J:0,K:0):2,(D:8,(E:12,(F:5,(G:5,(H:0.5,I:0):3):1):3):2):5):2):0.2);



In the unrooted tree on the right, it’s easy to see for example that E is about equidistant from all other leaves on the tree. But from the (randomly) rooted tree on the left, this is less apparent and requires some thinking about… many people interpret this as E being closer to F, G, H & I than to the other points. Granted, a proper rooting of the dendrogram on the left would help the situation, but still it’s a good example of how the visual representation makes interpretation more or less intuitive.
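The claim about E can also be checked numerically rather than by eye. The sketch below is a minimal, standard-library-only illustration and not how Dendroscope or FigTree work internally: it parses the Newick string given above with a small hand-written parser (the function names and the `_n0`-style labels for internal nodes are my own, hypothetical choices) and sums branch lengths along the path from E to every other leaf. It assumes simple Newick only: labelled leaves, unlabelled internal nodes, no quoted names or comments.

```python
import re
from collections import defaultdict

NEWICK = ("((A:0,B:0):0.2,(C:0.5,((J:0,K:0):2,(D:8,(E:12,(F:5,(G:5,"
          "(H:0.5,I:0):3):1):3):2):5):2):0.2);")

PUNCT = {"(", ")", ",", ";"}


def parse_newick(text):
    """Return an undirected adjacency map {node: {neighbour: branch_length}}.

    Leaves keep their labels; unlabelled internal nodes get made-up names
    such as '_n0'. Only simple Newick is handled (see the assumptions above).
    """
    tokens = re.findall(r"[(),;]|[^(),;]+", text.strip())
    adj = defaultdict(dict)
    pos = 0
    n_internal = 0

    def node():
        nonlocal pos, n_internal
        children = []
        if tokens[pos] == "(":
            pos += 1                      # consume '('
            children.append(node())
            while tokens[pos] == ",":
                pos += 1                  # consume ','
                children.append(node())
            pos += 1                      # consume ')'
        # Optional 'name:length' token after a leaf or a closing bracket.
        name, length = "", 0.0
        if pos < len(tokens) and tokens[pos] not in PUNCT:
            name, _, raw_length = tokens[pos].partition(":")
            name = name.strip()
            length = float(raw_length) if raw_length else 0.0
            pos += 1
        if not name:
            name = f"_n{n_internal}"
            n_internal += 1
        for child, child_length in children:
            adj[name][child] = child_length
            adj[child][name] = child_length
        return name, length

    node()
    return adj


def distances_from(adj, start):
    """Path length (sum of branch lengths) from start to every other node."""
    dist = {start: 0.0}
    stack = [start]
    while stack:
        current = stack.pop()
        for neighbour, length in adj[current].items():
            if neighbour not in dist:
                dist[neighbour] = dist[current] + length
                stack.append(neighbour)
    return dist


adj = parse_newick(NEWICK)
leaf_dist = {name: d for name, d in distances_from(adj, "E").items()
             if name != "E" and not name.startswith("_n")}
for name in sorted(leaf_dist):
    print(name, round(leaf_dist[name], 2))
```

For this tree the distances from E to the other leaves all fall between 19 (to I) and 22 (to D), which is what the unrooted layout suggests at a glance. Because the adjacency map is undirected, the result does not depend on where the root happens to be drawn.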

Many of the representations in treevis were created by computer scientists for purposes entirely unrelated to phylogenetics, so it would probably take a bit of effort to apply them to your favourite phylogeny…but could be worth it depending on what you are trying to convey.

An easy option for overlaying annotations of all kinds onto traditional phylogenetic tree structures (rectangular, circular and unrooted) is iTOL, the interactive tree of life. It’s a handy web tool where you can upload your tree file in Newick format (like the one given above) or Nexus format, plus some text files of annotations for nodes or leaves, and display the annotations overlaid on the tree in all kinds of cool ways. Below are examples from the iTOL paper in Nucleic Acids Research’s web server issue (open access), or just see the iTOL website for loads more examples and to try it out.

(Figure 1 from “Interactive Tree Of Life v2: online annotation and display of phylogenetic trees made easy”, Ivica Letunic & Peer Bork, NAR 39 (S2):W475-W478)


Assemblathon 1

I’ve been a little slow to catch up on the results of the Assemblathon, a competitive assembly event where teams use their best method(s) to generate assemblies from raw read data and the results are compared by a variety of metrics. The results from the first assemblathon, using simulated read sets, are now available pre-publication from Genome Research. The second assemblathon, using real (Illumina and Illumina+454) data from eukaryotic genomes, is happening now.

Firstly I think this is a brilliant idea and there should be far more of it in bioinformatics! So many of us are engaged in the same basic analysis tasks for dealing with short read sequence data – assembly, mapping and variant calling – but there are so many different programs & approaches (see this compilation over at seqanswers.com) out there that it quickly becomes overwhelming.

So, the results are in but, as always in the comparison of methods, there is not really a clear-cut winner. Each assembly was assessed using an enormous set of metrics (>100 apparently), including N50 (at contig and scaffold levels), miscalled bases, depth of coverage, misassemblies, etc… and unsurprisingly there was no single assembly that scored top on all metrics. BGI’s SOAPdenovo, Broad’s ALLPATHS, and Sanger’s SGA were consistently among the best for most metrics… but with clear differences. E.g. for contig N50 SOAPdenovo and ALLPATHS were both superior to SGA, which performed better than the others on scaffold N50. SGA had the fewest substitution errors, but SOAPdenovo had fewer copy number errors and ALLPATHS had the best contig-level stats. For all the gory details see the results website or summaries in the paper [free at Genome Res] or this talk [PDF] presented at the Cold Spring Harbor Laboratory Biology of Genomes meeting.
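For anyone who hasn’t met N50 before, here is a minimal sketch of the conventional definition: sort the contig (or scaffold) lengths from longest to shortest and report the length at which the running total first reaches half of all assembled bases. The function name and the toy lengths are made up for illustration, and this is not necessarily the exact variant the Assemblathon organisers computed.

```python
def n50(lengths):
    """Length L such that sequences of length >= L hold at least half the bases."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    raise ValueError("no sequence lengths supplied")


# Hypothetical contig lengths (in kb) for two toy assemblies of the same genome.
fragmented = [80, 70, 50, 40, 30, 20, 10]   # total 300
contiguous = [200, 60, 40]                  # total 300
print(n50(fragmented), n50(contiguous))
```

The two toy assemblies contain the same number of bases, but the second has the higher N50 because more of those bases sit in one long contig. N50 says nothing about whether the contigs are correct, which is one reason so many other metrics were needed.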

I am trying to wrap my head around how informative this is for assembling bacterial genomes. I know a lot of people run their own in-house comparisons to determine the best approach for a particular project, but the assemblathon approach is systematic, and manages to be both competitive and collaborative, which is an awesome combination. While bacterial genomes are small and therefore raise fewer computational issues associated with large data & memory requirements, assembling them is still far from trivial and is often a crucial element of the analysis, because gene content is so variable among even very closely related bacteria. The parameters I usually have in mind when considering bacterial assemblers are:

  • impact of different sequencing platforms & error profiles
  • impact of different insert sizes for paired or mate-pair reads
  • genomes with high or low GC
  • genomes with excessive IS elements

I guess the only aspect of this that’s missing from the current/planned assemblathon datasets is the effect of high or low G+C content, i.e. low complexity sequence, which isn’t really bacteria-specific anyway (think e.g. P. falciparum, the malaria parasite).
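As a quick way to see where a genome sits on the G+C scale, the fraction can be computed directly from the sequence. This is a minimal sketch with a made-up function name and toy sequences; it ignores ambiguous bases such as N and counts only A, C, G and T.

```python
def gc_fraction(sequence):
    """Fraction of unambiguous bases (A, C, G, T) that are G or C."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence contains no unambiguous bases")
    return gc / (gc + at)


# Hypothetical toy sequences, not real genome data.
print(round(gc_fraction("ATGCGCNN"), 3))
print(gc_fraction("ATATATTA"))
```

In practice you would run this over each contig or in sliding windows, since local extremes of G+C are what tend to cause coverage and assembly problems.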