Are we doing FAKEPHY? REALPHY paper in MBE and the ‘inaccuracy’ of phylogenies based on whole genome SNPs identified by mapping to a reference

Just been reading an interesting paper in Molecular Biology & Evolution, titled “Automated reconstruction of whole genome phylogenies from short sequence reads“. Read it, it’s open access ūüôā

The authors simulated sequences with different tree structures and divergence levels, and compared the accuracy of maximum likelihood tree inference (PhyML) to recover the true tree from the data (examining topology and branch length) using either (i) full sequence alignment, (ii) alignments of SNPs identified by mapping to a single reference or (iii) their REALPHY method which combines data from mapping to multiple references and includes not only SNPs but invariable sites as well. 

Unsurprising (and comfortingly!) they found that using the full sequence alignment gave the right tree (whew!). However they also showed that using SNPs identified by mapping to a single reference could give incorrect topologies, and considering only the SNPs (ie excluding invariant sites) could screw with branch lengths.

Uh oh. This is what we do routinely for teasing apart phylogenies within clonal bacterial populations.¬†The authors even name-check some of my papers and those of my Sanger Institute colleagues…

“In recent years, numerous studies (e.g., (Harris et al. 2010; Croucher et al. 2011; Mutreja et al. 2011; Holt et al. 2012; McAdam et al. 2012)) have reconstructed phylogenetic trees for large numbers of closely related bacterial strains by mapping short sequence reads to a reference genome sequence. Here we have analysed the performance of such methods on simulated and real sequence data and have shown that there are two primary pitfalls to this approach. The most readily apparent is that when SNP alignments are used to construct trees with maximum likelihood methods, it can lead to incorrect tree topologies and inaccuracy in the inferred branch lengths. Furthermore, when query sequences are sufficiently divergent from the reference sequence, the most divergent regions of the genome may fail to map, and this mapping bias may lead to incorrect branch lengths and incorrect topologies.”

Have we been getting it wrong for the last 6 years?

Well, not really, no.

Firstly, we have actually thought about this before and looked at the effect of¬†reference choice and excluding invariant sites. But we didn’t publish any of these observations, and probably should have.

Secondly, the scenarios tested in the MBE paper are not representative of the highly clonal bacterial populations to which we have been applying the map-to-a-single-ref-and-analyse-SNP-sites approach.

Figure 2 in the MBE paper shows the accuracy of topology inference over a range of tree shapes and sequence divergence levels. The levels of divergence they considered were 0.5%, 1%, 2%, 4% and 8%. From this figure, topologies were always accurate for 0.5% divergence and usually accurate for 1% divergence too, depending a little on the tree shape. The thing is, we only use this method for phylogenetic analysis of clonal bacteria, where the divergence rate is very low, e.g. in our Shigella sonnei paper, the total divergence across the whole data set (132 strains from global sources representing the entire extant population of Shigella sonnei) was 1 SNP every 432 bases, ie 0.23% divergence (incidentally this is detailed in paragraph 1). The actual pairwise divergence between strains is much lower still. The same is true for the other papers cited (and loads of others that we have used this method for), which deal with high-resolution analyses (hundreds of genomes) of clonal populations including single sequences types of S. aureus, S. pneumoniae, and recent clonal expansion of V. cholerae.

But what about branch lengths? Figure 4 in the MBE paper indicates that branch lengths are often underestimated when using SNP alignments. Again, the effect is more pronounced for higher divergence levels. However at 0.5% divergence it looks like half of all branch lengths were precisely accurate, and the vast majority of branch lengths (looks like ~90%) were accurate within 10%. At <0.5% divergence this effect would probably be even smaller.

So. I guess I don’t need to retract all my papers¬†just yet.

But what about the future?

1. Including all sites, not just SNPs. Point taken. We are increasingly moving to phylogenetic inference an all sites, not just SNPs. The main issue here though is analysis time and memory, especially for BEAST. Performing Bayesian phylogenetic and demographic analysis on a 10GB alignment is not trivial. And if most of the genome is invariant then this can be really quite excessive (e.g. for our Shigella sonnei with 0.23% divergence, we would only need a 23 MB alignment instead of 10 GB). One option is to use analyse variant codons rather than variant SNPs, as this at least allows use of a codon model and incorporates some invariant sites.

2.¬†Getting away from mapping to a single reference. I agree this is useful (and in fact necessary)¬†when dealing with diverse populations, but I don’t see any reason to drop this approach for within-clone whole-genome analysis that we are routinely doing (and was the subject of the cited papers). The most common approach in¬†more diverse populations is to work from the assembled genomes and generate alignments of each orthologous gene. See e.g.¬†the first stage of analysis of 3000 S. pneumoniae genomes recently reported by Stephen Bentley in Nature Genetics, followed by within-clone analysis using the usual mapping-to-close-reference approach.

The authors of the MBE paper present an alternative solution, which involves mapping to multiple references (with Bowtie2) and combines the alignments from these mappings (see the paper for details, esp Supp Figure 6). This is implemented in a program called REALPHY (ouch! I am trying hard not to take this personally… do they think my¬†current analysis pipeline should be called¬†FAKEPHY?), which is available as a web service or to download here. Now I’m all for this and will give it a try. But the ‘computational resources’ section in the Methods makes me wary…

“The disk space required (~60MB per Mbase √ó #genomes √ó #references) and the computing time (~2 min per Mbase √ó #genomes √ó #references) are linearly dependent on these three factors. Furthermore, the amount of RAM required depends primarily on the sequence length and the number of genomes (~250MB per Mbase √ó #genomes).”

Now, the papers I’m working on at the moment include some with 100-300¬†genomes, one¬†with 800 and a couple¬†with 2000 genomes. Not quite as big as the 3000 Strep pneumo, but getting there :). I’m not so concerned about disk space. This would be 30GB – 600 GB. Time is a bit more problematic – for 100 genomes it would take¬†16h, for 200 it would be 33 hours (call it 1.5 days) but for 2000 it would be 2 weeks. Compare that to a few hours for our current pipelines, since each genome can be mapped to a single¬†reference independently of all others and distributed over a cluster with hundreds or thousands of cores. RAM is also concerning. 125 GB for 100 genomes, 2.5 TB for 2000 genomes. Even our supercomputer would struggle with that! I’m not entirely sure, but it seems that these estimates include both the generation of the alignment and tree construction using PhyML and dnapars (from phylip). I doubt the time & RAM¬†calculations would hold across¬†the range of genomes I’m talking about here. And there are ways to do this much faster, including RAxML. But I do know that using all sites in a whole genome alignment increases the time taken, especially for BEAST analysis which we usually need to do.

If I do want to look at more diverse populations, I will have to decide between using an approach like REALPHY or deriving gene alignments from the assemblies. The latter avoids the problem of mapping entirely and allows you to detect differences in evolutionary histories of individual genes, but it does introduce a new problem of ortholog detection. So perhaps the answer is mapping to single reference for deeply sampled intra-clone analysis; REALPHY for intermediate cases (1-5% divergence, ie within species and with lower resolution required) and ortholog alignments for anything more?

So. I think I will be sticking with the mapping-to-single-reference approach for my current projects. But I do like this paper, as it made me think again and throws another free tool into the mix.