Are we doing FAKEPHY? REALPHY paper in MBE and the ‘inaccuracy’ of phylogenies based on whole genome SNPs identified by mapping to a reference

Just been reading an interesting paper in Molecular Biology & Evolution, titled “Automated reconstruction of whole genome phylogenies from short sequence reads”. Read it, it’s open access :)

The authors simulated sequences under different tree structures and divergence levels, and compared how well maximum likelihood tree inference (PhyML) recovered the true tree (both topology and branch lengths) using either (i) the full sequence alignment, (ii) alignments of SNPs identified by mapping to a single reference, or (iii) their REALPHY method, which combines data from mapping to multiple references and includes not only SNPs but invariant sites as well.

Unsurprisingly (and comfortingly!), they found that using the full sequence alignment gave the right tree (whew!). However, they also showed that using SNPs identified by mapping to a single reference could give incorrect topologies, and that considering only the SNPs (ie excluding invariant sites) could screw with branch lengths.

Uh oh. This is what we do routinely for teasing apart phylogenies within clonal bacterial populations. The authors even name-check some of my papers and those of my Sanger Institute colleagues…

“In recent years, numerous studies (e.g., (Harris et al. 2010; Croucher et al. 2011; Mutreja et al. 2011; Holt et al. 2012; McAdam et al. 2012)) have reconstructed phylogenetic trees for large numbers of closely related bacterial strains by mapping short sequence reads to a reference genome sequence. Here we have analysed the performance of such methods on simulated and real sequence data and have shown that there are two primary pitfalls to this approach. The most readily apparent is that when SNP alignments are used to construct trees with maximum likelihood methods, it can lead to incorrect tree topologies and inaccuracy in the inferred branch lengths. Furthermore, when query sequences are sufficiently divergent from the reference sequence, the most divergent regions of the genome may fail to map, and this mapping bias may lead to incorrect branch lengths and incorrect topologies.”

Have we been getting it wrong for the last 6 years?

Well, not really, no.

Firstly, we have actually thought about this before and looked at the effect of reference choice and excluding invariant sites. But we didn’t publish any of these observations, and probably should have.

Secondly, the scenarios tested in the MBE paper are not representative of the highly clonal bacterial populations to which we have been applying the map-to-a-single-ref-and-analyse-SNP-sites approach.

Figure 2 in the MBE paper shows the accuracy of topology inference over a range of tree shapes and sequence divergence levels. The levels of divergence they considered were 0.5%, 1%, 2%, 4% and 8%. From this figure, topologies were always accurate for 0.5% divergence and usually accurate for 1% divergence too, depending a little on the tree shape. The thing is, we only use this method for phylogenetic analysis of clonal bacteria, where divergence is very low. For example, in our Shigella sonnei paper, the total divergence across the whole data set (132 strains from global sources representing the entire extant population of Shigella sonnei) was 1 SNP every 432 bases, ie 0.23% divergence (incidentally this is detailed in paragraph 1 of the paper). The actual pairwise divergence between strains is much lower still. The same is true for the other papers cited (and loads of others that we have used this method for), which deal with high-resolution analyses (hundreds of genomes) of clonal populations including single sequence types of S. aureus, S. pneumoniae, and a recent clonal expansion of V. cholerae.

But what about branch lengths? Figure 4 in the MBE paper indicates that branch lengths are often underestimated when using SNP alignments. Again, the effect is more pronounced at higher divergence levels. However, at 0.5% divergence it looks like half of all branch lengths were precisely accurate, and the vast majority (looks like ~90%) were accurate to within 10%. At <0.5% divergence this effect would probably be even smaller.

So. I guess I don’t need to retract all my papers just yet.

But what about the future?

1. Including all sites, not just SNPs. Point taken. We are increasingly moving to phylogenetic inference on all sites, not just SNPs. The main issue here though is analysis time and memory, especially for BEAST. Performing Bayesian phylogenetic and demographic analysis on a 10 GB alignment is not trivial. And if most of the genome is invariant then this can be really quite excessive (e.g. for our Shigella sonnei with 0.23% divergence, we would only need a 23 MB alignment instead of 10 GB). One option is to analyse variant codons rather than just variant SNPs, as this at least allows use of a codon model and incorporates some invariant sites (a rough sketch of what I mean is below, after point 2).

2. Getting away from mapping to a single reference. I agree this is useful (and in fact necessary) when dealing with diverse populations, but I don’t see any reason to drop this approach for the within-clone whole-genome analyses that we are routinely doing (and which were the subject of the cited papers). The most common approach in more diverse populations is to work from the assembled genomes and generate alignments of each orthologous gene. See e.g. the first stage of the analysis of 3000 S. pneumoniae genomes recently reported by Stephen Bentley in Nature Genetics, followed by within-clone analysis using the usual mapping-to-close-reference approach.
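
Back to point 1 for a second: here is a minimal sketch, using the ape package in R, of pulling out the variant sites (or the whole codons containing them) from a whole genome alignment. It assumes an in-frame, coding-only alignment in FASTA format and hypothetical file names; in reality you would need to restrict to coding regions and respect each gene’s frame.

library(ape)

# read a whole genome alignment (hypothetical file name)
aln <- read.dna("core_alignment.fasta", format = "fasta", as.matrix = TRUE)

# positions of variable (segregating) sites
var_sites <- seg.sites(aln)

# SNP-only alignment: what we have been feeding to the tree builders so far
snp_aln <- aln[, var_sites]

# variant-codon alignment: keep the whole codon around each variable site,
# so some invariant sites are retained and a codon model can be used
codon_idx  <- unique((var_sites - 1) %/% 3)                 # 0-based codon numbers
codon_cols <- sort(unique(as.vector(sapply(codon_idx, function(i) i * 3 + 1:3))))
codon_cols <- codon_cols[codon_cols <= ncol(aln)]
codon_aln  <- aln[, codon_cols]

write.dna(snp_aln,   "snp_alignment.fasta",   format = "fasta")
write.dna(codon_aln, "codon_alignment.fasta", format = "fasta")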

The authors of the MBE paper present an alternative solution, which involves mapping to multiple references (with Bowtie2) and combines the alignments from these mappings (see the paper for details, esp Supp Figure 6). This is implemented in a program called REALPHY (ouch! I am trying hard not to take this personally… do they think my current analysis pipeline should be called FAKEPHY?), which is available as a web service or to download here. Now I’m all for this and will give it a try. But the ‘computational resources’ section in the Methods makes me wary…

“The disk space required (~60MB per Mbase × #genomes × #references) and the computing time (~2 min per Mbase × #genomes × #references) are linearly dependent on these three factors. Furthermore, the amount of RAM required depends primarily on the sequence length and the number of genomes (~250MB per Mbase × #genomes).”

Now, the papers I’m working on at the moment include some with 100-300 genomes, one with 800 and a couple with 2000 genomes. Not quite as big as the 3000 Strep pneumo, but getting there :). I’m not so concerned about disk space: assuming ~5 Mb genomes and a single reference, this would be 30 GB – 600 GB. Time is a bit more problematic – for 100 genomes it would take ~16 hours, for 200 it would be ~33 hours (call it 1.5 days), but for 2000 it would be two weeks. Compare that to a few hours for our current pipelines, since each genome can be mapped to a single reference independently of all the others and distributed over a cluster with hundreds or thousands of cores. RAM is also concerning: 125 GB for 100 genomes, 2.5 TB for 2000 genomes. Even our supercomputer would struggle with that! I’m not entirely sure, but it seems that these estimates include both the generation of the alignment and tree construction using PhyML and dnapars (from PHYLIP). I doubt the time & RAM calculations would hold across the range of genome numbers I’m talking about here, and there are ways to do the tree building much faster, including RAxML. But I do know that using all sites in a whole genome alignment increases the time taken, especially for the BEAST analysis which we usually need to do.
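
For what it’s worth, here is the back-of-the-envelope arithmetic behind those numbers, straight from the formulas quoted above; the ~5 Mb genome size and the single reference are my assumptions, not the authors’.

# REALPHY resource estimates from the quoted formulas
# (assumes ~5 Mb genomes and a single reference - my assumptions)
genome_mb <- 5
n_refs    <- 1
n_genomes <- c(100, 200, 800, 2000)

disk_gb <- 60  * genome_mb * n_genomes * n_refs / 1000   # ~60 MB per Mbase x genomes x refs
time_hr <- 2   * genome_mb * n_genomes * n_refs / 60     # ~2 min per Mbase x genomes x refs
ram_gb  <- 250 * genome_mb * n_genomes / 1000            # ~250 MB per Mbase x genomes

data.frame(genomes = n_genomes, disk_GB = disk_gb,
           time_hours = round(time_hr, 1), RAM_GB = ram_gb)
#   genomes disk_GB time_hours RAM_GB
# 1     100      30       16.7    125
# 2     200      60       33.3    250
# 3     800     240      133.3   1000
# 4    2000     600      333.3   2500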

If I do want to look at more diverse populations, I will have to decide between using an approach like REALPHY or deriving gene alignments from the assemblies. The latter avoids the problem of mapping entirely and allows you to detect differences in the evolutionary histories of individual genes, but it does introduce a new problem of ortholog detection. So perhaps the answer is mapping to a single reference for deeply sampled intra-clone analysis; REALPHY for intermediate cases (1-5% divergence, ie within species and with lower resolution required); and ortholog alignments for anything more diverse?

So. I think I will be sticking with the mapping-to-single-reference approach for my current projects. But I do like this paper, as it made me think again and throws another free tool into the mix.

Inferring transmission from genomic + epi data

There is a flurry of new papers on this, how exciting! One is through peer review and published in final form in PLoS Computational Biology, and the others are in preprint (on arXiv and the new bioRxiv)!!

It will take me a while to read and understand these, especially the subtleties of the different applications they are aiming for… so in the meantime let’s just compare them the lazy way – is the software available, and who has the best pictures?

1. In PLoS Computational Biology, using SARS to demonstrate utility:

Bayesian Reconstruction of Disease Outbreaks by Combining Epidemiologic and Genomic Data

Thibaut Jombart, Xavier Didelot, Simon Cauchemez, Christophe Fraser, Neil Ferguson

Abstract: Recent years have seen progress in the development of statistically rigorous frameworks to infer outbreak transmission trees (“who infected whom”) from epidemiological and genetic data. Making use of pathogen genome sequences in such analyses remains a challenge, however, with a variety of heuristic approaches having been explored to date. We introduce a statistical method exploiting both pathogen sequences and collection dates to unravel the dynamics of densely sampled outbreaks. Our approach identifies likely transmission events and infers dates of infections, unobserved cases and separate introductions of the disease. It also proves useful for inferring numbers of secondary infections and identifying heterogeneous infectivity and super-spreaders. After testing our approach using simulations, we illustrate the method with the analysis of the beginning of the 2003 Singaporean outbreak of Severe Acute Respiratory Syndrome (SARS), providing new insights into the early stage of this epidemic. Our approach is the first tool for disease outbreak reconstruction from genetic data widely available as free software, the R package outbreaker. It is applicable to various densely sampled epidemics, and improves previous approaches by detecting unobserved and imported cases, as well as allowing multiple introductions of the pathogen. Because of its generality, we believe this method will become a tool of choice for the analysis of densely sampled disease outbreaks, and will form a rigorous framework for subsequent methodological developments.

Software: R package ‘outbreaker’, which includes functions for simulating outbreak data. Very nice.

👍👍👍

Pictures: Pretty good!

👍👍👍

2. In preprint on bioRxiv, using tuberculosis as an example:

Bayesian inference of infectious disease transmission from whole genome sequence data

Xavier Didelot, Jennifer Gardy, Caroline Colijn

Abstract: Genomics is increasingly being used to investigate disease outbreaks, but an important question remains unanswered — how well do genomic data capture known transmission events, particularly for pathogens with long carriage periods or large within-host population sizes? Here we present a novel Bayesian approach to reconstruct densely-sampled outbreaks from genomic data whilst considering within-host diversity. We infer a time-labelled phylogeny using BEAST, then infer a transmission network via a Monte-Carlo Markov Chain. We find that under a realistic model of within-host evolution, reconstructions of simulated outbreaks contain substantial uncertainty even when genomic data reflect a high substitution rate. Reconstruction of a real-world tuberculosis outbreak displayed similar uncertainty, although the correct source case and several clusters of epidemiologically linked cases were identified. We conclude that genomics cannot wholly replace traditional epidemiology, but that Bayesian reconstructions derived from sequence data may form a useful starting point for a genomic epidemiology investigation.

Software: No software package. Uses BEAST trees; for data visualisation, uses DensiTree, R and Cytoscape.

👍

Pictures: I probably shouldn’t post these as it’s still a preprint, so you’ll have to just look at the PDF. But there are some nice figures and a few different ways of presenting the data. I particularly like Figure 4A, which shows the phylogenetic tree against a binary matrix indicating SNP alleles in each sample, for those that vary within the group. This would be good to have handy for inspecting fine phylogenies… it looks like it was done in R, so I think I might add this to my tree visualisation R functions (to be posted soon; a rough sketch of the idea is below).

👍
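
Here is a minimal sketch of what that Figure 4A-style plot might look like in R with ape: the phylogeny on the left and a binary matrix of the within-group SNP alleles on the right, with rows reordered to line up with the tips. The file names and the 0/1 CSV format are hypothetical, and exact vertical alignment of the two panels would need a bit more fiddling with ylim and margins.

library(ape)

tree <- read.tree("outbreak.nwk")                               # hypothetical tree file
snps <- as.matrix(read.csv("snp_alleles.csv", row.names = 1))   # isolates x SNPs, coded 0/1

layout(matrix(1:2, nrow = 1), widths = c(1, 2))
par(mar = c(4, 1, 1, 0))
plot(tree, show.tip.label = FALSE)

# recover the y position of each tip from the last plot, so the matrix rows
# can be drawn in the same vertical order as the plotted tips
lastPP    <- get("last_plot.phylo", envir = .PlotPhyloEnv)
tip_order <- order(lastPP$yy[1:Ntip(tree)])
snps      <- snps[tree$tip.label[tip_order], ]

par(mar = c(4, 0, 1, 1))
image(x = 1:ncol(snps), y = 1:nrow(snps), z = t(snps),
      col = c("white", "black"), xlab = "SNP", ylab = "", yaxt = "n")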

3. In preprint on arXiv, using a Staph. aureus hospital outbreak as an example:

Reconstructing transmission networks for communicable diseases using densely sampled genomic data: a generalized approach

Colin J. Worby, Philip D. O’Neill, Theodore Kypraios, Julie V. Robotham, Daniela De Angelis, Edward J. P. Cartwright, Sharon J. Peacock and Ben S. Cooper

Software: No mention of any software package or how this was implemented.

Pictures: I probably shouldn’t post these as it’s still a preprint, so you’ll have to just look at the PDF. But in my opinion the figures are much less clear and harder to follow than those in the other papers (and mostly black and white… why???).

Eddie Holmes reviews Peter Doherty’s new book “Pandemics”

I love this because it involves two fantastic Australian-based infectious disease researchers, and is published in PLoS Biology. Yay PLoS!

The wonderful Eddie Holmes, who recently moved to the University of Sydney in Australia, has written a great review of Nobel prize winner Peter Doherty’s latest book, “Pandemics – what everyone needs to know”. Peter Doherty is of course Australian, and has been based at the University of Melbourne for the last few years… and just last week he moved into the brand new building that bears his name – the Doherty Institute for Infection and Immunity. He has written many great books, including the fabulous “Beginner’s Guide to Winning the Nobel Prize: Advice for Young Scientists”.

Anyway, back to the review. As always, Eddie’s review is clear and interesting, and makes me really want to read the book!

 

If, like me, you don’t have time to read it now, I also found this video of Peter Doherty talking about the book over on BookTV, which I am going to listen to while doing some boring admin tasks this afternoon…

epidemic – and How To Read A Phylogenetic Tree

Bacterial molecular epidemiologists have a lot to learn from our virus-studying friends.

A great example is the ‘epidemic’ website over at http://epidemic.bio.ed.ac.uk/, which has the byline “Molecular Epidemiology and Evolution of Viral Pathogens” and is led by the wonderful Andrew Rambaut of BEAST fame.

The site currently covers Coronavirus (where the current concern is Middle East Respiratory Syndrome or MERS, which is a close relative of SARS) and influenza. There are really cool features, including the visual display of spatial and epidemiological data on current outbreaks, collated from numerous sources (http://epidemic.bio.ed.ac.uk/coronavirus_background), as well as regular blog posts about new cases, sequences and analyses. The data is stored on GitHub, which means anyone can easily access it (fork it, in git language), add and edit data, and generally share the love.

There is also a fantastic blog post by Andrew Rambaut explaining How To Read A Phylogenetic Tree, with a particular focus on epidemiological analyses, e.g. how to interpret pathogen trees from different hosts to learn about transmission between hosts.

As I often say, if there is one thing better than a good tree, it’s a good tree with a map… so I’m super excited by the phylogeography software Pandemix, which facilitates visualisation of tree + map data over the web, with a temporal animation thrown in for good measure (!). Go here to see it in action on MERS data.

Screenshot of Pandemix showing MERS data from Cotten et al (2013) in The Lancet. See http://epidemic.bio.ed.ac.uk/mers-cov_phylogeography

PhD position in TB host-pathogen genomics

I am looking for a PhD student to work on host-pathogen genomics of tuberculosis (TB) at the University of Melbourne (with co-supervisors across the Departments of Pathology; Biochemistry & Molecular Biology; Bio21 Institute; and Nossal Global Health Institute). There will also be opportunities to interact with collaborators at the Oxford University Clinical Research Unit in Vietnam and the National University of Singapore.

The project will involve analysis of large numbers of TB genomes, paired with host genotype data, using phylogenetic, computational and statistical analysis.

Interested students with a background in microbiology, bioinformatics, phylogenomics and/or biostatistics can contact me here.

SPAdes vs Velvet assembly comparison

This is a guest post from Dave Savage, a post-doc working on resistance gene analysis at the University of Melbourne. We have been trying out SPAdes as a replacement for Velvet + Velvet Optimizer, which we routinely use for assembling bacterial genomes. The SPAdes paper, and the GAGE-B assembler comparison, show that SPAdes does better than most other assemblers on a small set of bacterial read sets, but we routinely assemble hundreds of genomes, so we needed to see whether this performance holds across large sets of reads with variable coverage etc. The results confirm that SPAdes consistently performs well, and better than Velvet. This, and the fact that SPAdes gets you from fastq reads to error-corrected contigs and scaffolds in a single command with no need for tuning parameters, makes it a clear winner for our purposes.

Dave says:

SPAdes (http://bioinf.spbau.ru/spades) is a new(ish) genome assembly program that is particularly suitable for assembly of bacterial genomes. SPAdes makes a number of improvements on the de Bruijn graph algorithms used by assemblers like Velvet, IDBA and SOAPdenovo, including iteration over values of kmer sizes, and incorporation of paired kmers (k-bimers), which allows information from paired end reads to be introduced into the computation at an earlier stage.
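
For anyone not familiar with the jargon, here is a toy illustration in R of the basic de Bruijn graph idea that these assemblers build on: each k-mer in the reads becomes an edge joining two overlapping (k-1)-mers. This is nothing like SPAdes’ actual implementation, just the underlying concept, and the read sequences are made up.

# toy de Bruijn graph: k-mers are edges between overlapping (k-1)-mer nodes
debruijn_edges <- function(reads, k) {
  kmers <- unlist(lapply(reads, function(r) {
    if (nchar(r) < k) return(character(0))
    substring(r, 1:(nchar(r) - k + 1), k:nchar(r))
  }))
  data.frame(from = substr(kmers, 1, k - 1),   # (k-1)-mer prefix
             to   = substr(kmers, 2, k),       # (k-1)-mer suffix
             stringsAsFactors = FALSE)
}

reads <- c("ACGTACGGT", "GTACGGTTA")   # made-up reads
debruijn_edges(reads, k = 5)           # smaller k = more sensitive, larger k = more specific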

In their paper, the authors of SPAdes make comparisons with a number of popular assemblers using single-cell and multi-cell E. coli datasets. Their results show that SPAdes can in fact deliver better assemblies, with a higher N50 length and larger contigs. Since we regularly use Velvet for assembly, we wanted to perform some further comparisons between Velvet and SPAdes on the types of data sets we typically deal with. To this end we ran SPAdes and Velvet on 114 Shigella sonnei 54 bp paired end Illumina read sets (from our 2012 paper, reads available at ERP000182), and 67 unpaired 37 bp Staphylococcus aureus read sets (from Harris 2010, available at ERP000070). We then calculated the N50 length, the L50, the number of contigs > 500 bp, and the total contig length produced by each assembler. The figures below show the distributions of each of these values for the two data sets.

Shigella sonnei assemblies. Dark = SPAdes; light = Velvet Optimizer.

Staphylococcus aureus assemblies. Dark = SPAdes; light = Velvet Optimizer.

There’s a few things to notice about these plots. Firstly, there’s quite a bit of difference between the S. sonnei and the S. aureus assembly statistics, although this is hardly surprising given that the S. sonnei reads are paired end, while the S. aureus reads are unpaired (and only 37 bp long). What’s apparent though is that SPAdes and Velvet perform similarly on the S. aureus reads, but have very different performance on the S. sonnei reads, where SPAdes clearly outperforms Velvet. We can probably attribute this to the fact that SPAdes includes an improved algorithm for handling paired end reads.

For both species, SPAdes frequently results in a higher N50 than Velvet and a larger total contig length. For the paired end S. sonnei reads, SPAdes also results in a larger number of contigs greater than 500 bp in length. For the L50 metric, Velvet gives quite low scores for the S. sonnei reads, with 10 or fewer contigs frequently accounting for at least half of the total contig length. However, given the larger total contig length obtained using SPAdes, the much higher L50 values obtained using SPAdes are not unexpected.
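
In case the metrics are unfamiliar, this is roughly how they can be calculated from a file of assembled contigs, again using ape in R; the FASTA path is just an example, any contig file from SPAdes or Velvet would do.

library(ape)

assembly_stats <- function(fasta_file, min_len = 500) {
  contigs <- read.FASTA(fasta_file)                  # contigs as a list of sequences
  len     <- sort(as.numeric(sapply(contigs, length)), decreasing = TRUE)
  total   <- sum(len)
  i50     <- which(cumsum(len) >= total / 2)[1]      # contigs needed to cover half the assembly
  c(N50 = len[i50],                                  # length of the contig reaching the halfway point
    L50 = i50,
    contigs_over_min = sum(len > min_len),
    total_length = total)
}

assembly_stats("spades_output/contigs.fasta")        # example path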

In our current assembly pipeline, we run Velvet Optimizer a number of times using smaller and smaller steps for k in order to home in on an optimal value. Using SPAdes, not only is this step incorporated into the assembly program, but the information from each iteration is combined to produce the final output. Thus SPAdes is able to capture the information obtained using small, highly sensitive kmer sizes, as well as large, highly specific kmer sizes, while simplifying our assembly pipeline at the same time. We’re currently beginning a new project looking at the pan-genome of a number of bacterial species, and we anticipate that this project will involve quite a bit of assembly. Based on the results shown above, and the simpler pipeline process for SPAdes, it looks like we’ll be using SPAdes to do the bulk of this work.

Run specifics:

  • Velvet was run on the S. sonnei paired end reads using VelvetOptimiser, with a minimum kmer size of 29 and a maximum kmer size of 89.
  • SPAdes was run on the S. sonnei paired end reads using the default settings.
  • Velvet was run on the S. aureus unpaired reads using VelvetOptimiser, with a minimum kmer size of 15 and a maximum kmer size of 35.
  • SPAdes was run on the S. aureus unpaired reads using kmer sizes of 15, 23 and 35.