Are we doing FAKEPHY? REALPHY paper in MBE and the ‘inaccuracy’ of phylogenies based on whole genome SNPs identified by mapping to a reference

Just been reading an interesting paper in Molecular Biology & Evolution, titled “Automated reconstruction of whole genome phylogenies from short sequence reads“. Read it, it’s open access 🙂

The authors simulated sequences with different tree structures and divergence levels, and compared the accuracy of maximum likelihood tree inference (PhyML) to recover the true tree from the data (examining topology and branch length) using either (i) full sequence alignment, (ii) alignments of SNPs identified by mapping to a single reference or (iii) their REALPHY method which combines data from mapping to multiple references and includes not only SNPs but invariable sites as well. 

Unsurprising (and comfortingly!) they found that using the full sequence alignment gave the right tree (whew!). However they also showed that using SNPs identified by mapping to a single reference could give incorrect topologies, and considering only the SNPs (ie excluding invariant sites) could screw with branch lengths.

Uh oh. This is what we do routinely for teasing apart phylogenies within clonal bacterial populations. The authors even name-check some of my papers and those of my Sanger Institute colleagues…

“In recent years, numerous studies (e.g., (Harris et al. 2010; Croucher et al. 2011; Mutreja et al. 2011; Holt et al. 2012; McAdam et al. 2012)) have reconstructed phylogenetic trees for large numbers of closely related bacterial strains by mapping short sequence reads to a reference genome sequence. Here we have analysed the performance of such methods on simulated and real sequence data and have shown that there are two primary pitfalls to this approach. The most readily apparent is that when SNP alignments are used to construct trees with maximum likelihood methods, it can lead to incorrect tree topologies and inaccuracy in the inferred branch lengths. Furthermore, when query sequences are sufficiently divergent from the reference sequence, the most divergent regions of the genome may fail to map, and this mapping bias may lead to incorrect branch lengths and incorrect topologies.”

Have we been getting it wrong for the last 6 years?

Well, not really, no.

Firstly, we have actually thought about this before and looked at the effect of reference choice and excluding invariant sites. But we didn’t publish any of these observations, and probably should have.

Secondly, the scenarios tested in the MBE paper are not representative of the highly clonal bacterial populations to which we have been applying the map-to-a-single-ref-and-analyse-SNP-sites approach.

Figure 2 in the MBE paper shows the accuracy of topology inference over a range of tree shapes and sequence divergence levels. The levels of divergence they considered were 0.5%, 1%, 2%, 4% and 8%. From this figure, topologies were always accurate for 0.5% divergence and usually accurate for 1% divergence too, depending a little on the tree shape. The thing is, we only use this method for phylogenetic analysis of clonal bacteria, where the divergence rate is very low, e.g. in our Shigella sonnei paper, the total divergence across the whole data set (132 strains from global sources representing the entire extant population of Shigella sonnei) was 1 SNP every 432 bases, ie 0.23% divergence (incidentally this is detailed in paragraph 1). The actual pairwise divergence between strains is much lower still. The same is true for the other papers cited (and loads of others that we have used this method for), which deal with high-resolution analyses (hundreds of genomes) of clonal populations including single sequences types of S. aureus, S. pneumoniae, and recent clonal expansion of V. cholerae.

But what about branch lengths? Figure 4 in the MBE paper indicates that branch lengths are often underestimated when using SNP alignments. Again, the effect is more pronounced for higher divergence levels. However at 0.5% divergence it looks like half of all branch lengths were precisely accurate, and the vast majority of branch lengths (looks like ~90%) were accurate within 10%. At <0.5% divergence this effect would probably be even smaller.

So. I guess I don’t need to retract all my papers just yet.

But what about the future?

1. Including all sites, not just SNPs. Point taken. We are increasingly moving to phylogenetic inference an all sites, not just SNPs. The main issue here though is analysis time and memory, especially for BEAST. Performing Bayesian phylogenetic and demographic analysis on a 10GB alignment is not trivial. And if most of the genome is invariant then this can be really quite excessive (e.g. for our Shigella sonnei with 0.23% divergence, we would only need a 23 MB alignment instead of 10 GB). One option is to use analyse variant codons rather than variant SNPs, as this at least allows use of a codon model and incorporates some invariant sites.

2. Getting away from mapping to a single reference. I agree this is useful (and in fact necessary) when dealing with diverse populations, but I don’t see any reason to drop this approach for within-clone whole-genome analysis that we are routinely doing (and was the subject of the cited papers). The most common approach in more diverse populations is to work from the assembled genomes and generate alignments of each orthologous gene. See e.g. the first stage of analysis of 3000 S. pneumoniae genomes recently reported by Stephen Bentley in Nature Genetics, followed by within-clone analysis using the usual mapping-to-close-reference approach.

The authors of the MBE paper present an alternative solution, which involves mapping to multiple references (with Bowtie2) and combines the alignments from these mappings (see the paper for details, esp Supp Figure 6). This is implemented in a program called REALPHY (ouch! I am trying hard not to take this personally… do they think my current analysis pipeline should be called FAKEPHY?), which is available as a web service or to download here. Now I’m all for this and will give it a try. But the ‘computational resources’ section in the Methods makes me wary…

“The disk space required (~60MB per Mbase × #genomes × #references) and the computing time (~2 min per Mbase × #genomes × #references) are linearly dependent on these three factors. Furthermore, the amount of RAM required depends primarily on the sequence length and the number of genomes (~250MB per Mbase × #genomes).”

Now, the papers I’m working on at the moment include some with 100-300 genomes, one with 800 and a couple with 2000 genomes. Not quite as big as the 3000 Strep pneumo, but getting there :). I’m not so concerned about disk space. This would be 30GB – 600 GB. Time is a bit more problematic – for 100 genomes it would take 16h, for 200 it would be 33 hours (call it 1.5 days) but for 2000 it would be 2 weeks. Compare that to a few hours for our current pipelines, since each genome can be mapped to a single reference independently of all others and distributed over a cluster with hundreds or thousands of cores. RAM is also concerning. 125 GB for 100 genomes, 2.5 TB for 2000 genomes. Even our supercomputer would struggle with that! I’m not entirely sure, but it seems that these estimates include both the generation of the alignment and tree construction using PhyML and dnapars (from phylip). I doubt the time & RAM calculations would hold across the range of genomes I’m talking about here. And there are ways to do this much faster, including RAxML. But I do know that using all sites in a whole genome alignment increases the time taken, especially for BEAST analysis which we usually need to do.

If I do want to look at more diverse populations, I will have to decide between using an approach like REALPHY or deriving gene alignments from the assemblies. The latter avoids the problem of mapping entirely and allows you to detect differences in evolutionary histories of individual genes, but it does introduce a new problem of ortholog detection. So perhaps the answer is mapping to single reference for deeply sampled intra-clone analysis; REALPHY for intermediate cases (1-5% divergence, ie within species and with lower resolution required) and ortholog alignments for anything more?

So. I think I will be sticking with the mapping-to-single-reference approach for my current projects. But I do like this paper, as it made me think again and throws another free tool into the mix.


7 thoughts on “Are we doing FAKEPHY? REALPHY paper in MBE and the ‘inaccuracy’ of phylogenies based on whole genome SNPs identified by mapping to a reference

  1. Pingback: Links 3/30/14 | Mike the Mad Biologist

  2. Many thanks for the thought and time you invest in this very interesting blog!
    Have you seen kSNP?
    We were trying to get around the problem of picking a reference and RAM + time scaling issues with whole genome alignments. k-mer based analyses have some drawbacks, and more simulations are needed to fully assess the pros and cons, but they are fast and convenient for big data sets.
    So long as you don’t use a reference sequence, using just the SNP positions for the phylogeny will overestimate actual branch lengths, but relative branch lengths should be decent, if I understand correctly? Can you just scale them by the average length of the assembled genomes in the tree?
    Thanks for your insights!

  3. Thanks for this post! Always nice to know that there actually people out there that read some of what I have written :). You have actually mentioned a lot of things that we were also thinking about, but obviously have not mentioned them explicitly in our paper. And one thing we have never thought about: the name of our program. We only wanted a googlable and pronouncable name and took the first one that came to mind, without thinking about what it would look like to other people. In hindsight a bit clumsy I must admit…sorry for that!

    The time requirement is another issue (trade-off between time and accuracy)…but I guess discussing this takes more space than I should use in a comment to a blog post 🙂


    • Thanks Frederick, you have come out with an excellent piece of work on this.
      As for the name, I don’t think it is a problem but i believe the author expressed his opinion creatively with the fake and real terms, which captured our attention in the first place. There is always pros and cons in the methods especially comes to time and accuracy, but I believe things are getting better over time isn’t it when new things discovered? However, there are few questions in my mind when come to reference-guided alignments, will it be bias when choosing reference genomes (even we can choose more reference sequences according to the method to enhance accuracy) when we have to consider factors including genes evolving at different rates under different selection pressure, gene gain and loss, gene contents, sequencing quality and differences, recombination and HGT events, and diverse sequences used for alignment. Could this be accurate to use on largely diverse genomes or just effective on very closely related sequences? Anyway, i think the authors have made important improvement in refining the methods. Bravo!

      Kien-Pong Yap

  4. Dear Kat,

    thanks for the comments. I would like to add a few comments myself to what Frederic already said. First, briefly about the name. This seems a bit silly to me. The REAL in the name just stands for REference ALignment. As Frederic mentioned we never even considered that somebody might take it to imply that we think phylogenies produced through other means are ‘fake’. Maybe we should have, but it honestly never occurred to us and, frankly, we’re not THAT full of hubris.

    First about using only SNP sites. I find it hard to see any justification for this. Phylogeny inference programs such as RAxML and PhyML ASSUME that you are given them all sites. So if you leave out all the SNP sites these programs are going to get ‘confused’ about the substitution patterns they observed, because they are at odds with their models (i.e. their models would predict there must be many invariant sites if the species are close, but there are none such invariant sites in the data you give the program). Second, all these programs assume that different alignment columns evolve independently from each other. So the only thing that matters for the likelihood of a given tree is HOW MANY of each alignment-column-type they see, not their order. Although I haven’t looked at the source code of these programs, I would bet they internally just use the numbers of alignment columns of different types and so, once the input data is parsed, computational complexity is not at all increased by adding a large number of invariant columns.

    A more subtle issue is when one could expect reference bias to be a serious problem in real world situations. You seem to be taking some of our results on simulated alignments on 4-taxon trees as a guide for when, in real world scenarios (with large number of taxa), reference bias may or may not be a problem. I think this is a very dangerous thing to do. Our simulated data were of the cleanest possible type one could have: all sites evolve under the same constant rate, there are no insertions or deletions, no rearrangements, no recombination, no horizontal transfers, and the number of taxa is as small as it can possibly be. These simulations were just to show that even in these extremely simplified settings, serious reference bias can occur in some situations.
    What happens when you have more realistic scenarios that include:
    -different genes substituting at vastly different rates.
    -gain and losses of genes such that closer taxa share larger gene content.
    -possible recombinations and horizontal transfers
    -vastly larger number of taxa
    is really not clear to me. I would be very hesitant to assume that, as long as the taxa are close, everything will be fine. That’s not to say that I am convinced that it is often a problem, but just that I am generally worried about the asymmetry that is introduced by mapping all taxa to one reference that may be much closer to some taxa than to others.

  5. For _diverse_ species where you expect reference bias to be a problem (and/or you have no reference genome at all) try our new approach (SISRS). Read it at (software on github – see link in paper). fyi If running RAxML on variable site data use the acquisition bias option (e.g. ASC_GTRGAMMA) as recommended by the manual. So far relative branch lengths have been perfect. If running MrBayes using coding=variable.

    I also recommend you derive gene alignments from reads rather than assemblies. It is faster than doing the assemblies, prevents calling errors, and allows identification of multi-copy genes. Software for that is on my github site too.

Leave a Reply

Fill in your details below or click an icon to log in: Logo

You are commenting using your account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s