Thanks to Konrad Paszkiewicz from the University of Exeter for this SNP-based analysis of the 3 E. coli outbreak genomes.
He used MUMmer to compare each complete E. coli genome available in NCBI to the Ec55989 chromosome, and identify single nucleotide polymorphisms (SNPs, i.e. substitution mutations, where one DNA base is swapped for another). He ignored SNPs in regions that are repeated in the Ec55989 genome, since we can’t distinguish whether these represent true substitution mutations in homologous sequences (relevant phylogenetic signal), or variation between copies in the same genome (noise). The results are available here.
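The repeat-masking step can be illustrated with a short sketch. This is not Konrad's actual script: the function names are hypothetical, and it assumes the SNP positions and the repeat regions of Ec55989 have already been extracted as 1-based coordinates.

```python
from bisect import bisect_right


def merge_intervals(intervals):
    """Merge overlapping or adjacent 1-based inclusive (start, end) intervals."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    return [(s, e) for s, e in merged]


def filter_repeat_snps(snp_positions, repeat_intervals):
    """Return the SNP positions that do not fall inside any repeat interval."""
    merged = merge_intervals(repeat_intervals)
    starts = [s for s, _ in merged]
    kept = []
    for pos in snp_positions:
        # Find the last interval starting at or before this position.
        i = bisect_right(starts, pos) - 1
        if i >= 0 and merged[i][0] <= pos <= merged[i][1]:
            continue  # inside a repeat: ambiguous, so drop it
        kept.append(pos)
    return kept


# Hypothetical coordinates, for illustration only.
print(filter_repeat_snps([5, 150, 250, 1000], [(100, 200), (180, 300)]))
# prints [5, 1000]
```

Merging the intervals first keeps the lookup to a single binary search per SNP, which matters when there are hundreds of thousands of calls.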
He did the same comparison for the three outbreak genome assemblies: strain TY2482 from BGI (NCBI: AFOB00000000; Illumina HiSeq+Ion Torrent), strain LB226692 from Life Sciences (NCBI: AFOG00000000; Ion Torrent), and strain H112180280 from the UK Health Protection Agency (454 mate pair). In addition, he did the same comparison for the first MIRA assembly of TY2482 (Ion Torrent), and I analysed SNPs in TY2482 using the Illumina HiSeq reads released by BGI (using bwa and samtools).
Konrad then used the annotation of Ec55989 to identify the coding change caused by each SNP – i.e. whether it is in a gene, and whether the DNA base change results in a change in the encoded protein sequence (non-synonymous SNP) or not (synonymous SNP). See resulting table here.
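To make the synonymous/non-synonymous distinction concrete, here is a minimal sketch of how a single SNP inside a gene could be classified. It is an illustration, not the script used for the table: `classify_snp` is a hypothetical helper, and it assumes the gene sequence is already given in coding orientation with a 0-based offset for the SNP. The codon table uses the standard amino acid assignments.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def classify_snp(cds, offset, alt_base):
    """Classify a substitution at 0-based `offset` in a coding sequence.

    `cds` must be in coding orientation and start at the first codon.
    Returns "synonymous" or "non-synonymous".
    """
    codon_start = offset - offset % 3
    ref_codon = cds[codon_start:codon_start + 3].upper()
    if len(ref_codon) != 3:
        raise ValueError("SNP falls in an incomplete codon")
    alt_codon = list(ref_codon)
    alt_codon[offset % 3] = alt_base.upper()
    alt_codon = "".join(alt_codon)
    same_residue = CODON_TABLE[ref_codon] == CODON_TABLE[alt_codon]
    return "synonymous" if same_residue else "non-synonymous"


# Toy gene: ATG GCT AAA encodes Met-Ala-Lys.
print(classify_snp("ATGGCTAAA", 5, "C"))  # GCT -> GCC, both Ala: prints synonymous
print(classify_snp("ATGGCTAAA", 3, "A"))  # GCT -> ACT, Ala -> Thr: prints non-synonymous
```

SNPs outside any annotated gene would be labelled intergenic before this function is ever called, and genes on the reverse strand need reverse-complementing first.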
I used this annotation to remove SNP calls in genes annotated with the keywords ‘phage’, ‘transposase’, ‘insertion sequence’ or ‘IS’, in an effort to remove obviously horizontally transferred DNA, since this is subject not only to divergence by mutation (the phylogenetic signal we are looking for) but also to recombination. The result is just over 400,000 SNPs (nearly 10% of the Ec55989 chromosome). The alignment of those SNPs is available here.
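A keyword filter of this kind might look like the sketch below. The exact matching rules are an assumption: here ‘IS’ is matched case-sensitively as a whole token (optionally followed by digits, as in IS1), so that it does not hit ordinary words containing "is". The function name and the example annotations are made up for illustration.

```python
import re

# Case-insensitive keywords taken from the text.
MOBILE_WORDS = re.compile(r"phage|transposase|insertion sequence", re.IGNORECASE)
# 'IS' as a standalone, upper-case token, e.g. "IS", "IS1", "IS629" (assumed rule).
IS_TOKEN = re.compile(r"\bIS\d*\b")


def is_mobile_element(product):
    """True if a gene product description looks like mobile or phage DNA."""
    return bool(MOBILE_WORDS.search(product) or IS_TOKEN.search(product))


# Hypothetical SNP records: (position, annotated gene product).
snps = [
    (1200, "isoleucine--tRNA ligase"),
    (5300, "putative prophage tail protein"),
    (9100, "IS1 transposase InsA"),
    (15000, "DNA gyrase subunit A"),
]
kept = [(pos, product) for pos, product in snps if not is_mobile_element(product)]
print([pos for pos, _ in kept])  # prints [1200, 15000]
```

A filter this blunt will also throw away a few genes that are not mobile at all (anything described as a "phage shock protein", for example), which is acceptable when the aim is a conservative SNP set.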
I used SplitsTree4 to draw a phylogenetic network for the alignment. The result clearly shows that the outbreak genomes (green) are very similar to Ec55989 (also green, labelled ‘reference’), and very different to other sequenced E. coli (note in particular the group of EHEC O157:H7 on the middle left, which are very distant from the outbreak strains). Since this is a network, it would reveal if there were major recombinations between this strain and the other E. coli chromosomes, which there aren’t. This confirms that the outbreak strain is truly an EAEC, with very close similarity to Ec55989 and not to classical EHEC. This is backed by the presence of an EAEC plasmid, carrying the aggregative adherence fimbriae I cluster (AAF/I; agg operon; see this post for details).
I also used SplitsTree4 to draw a phylogenetic tree of the data, using K2P distances and the BioNJ tree-drawing algorithm. This shows the same result, with all three outbreak strains’ genomes clustering very tightly with the EAEC strain Ec55989.
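For readers unfamiliar with K2P (Kimura 2-parameter) distances: the model corrects the raw proportion of differences for multiple hits, treating transitions (A<->G, C<->T) and transversions separately. With P the proportion of transitions and Q the proportion of transversions, the distance is d = -0.5 * ln((1 - 2P - Q) * sqrt(1 - 2Q)). The sketch below implements that formula for one pair of aligned sequences; it is an illustration only and not how SplitsTree4 is implemented internally.

```python
import math

PURINES = {"A", "G"}
PYRIMIDINES = {"C", "T"}


def k2p_distance(seq_a, seq_b):
    """Kimura 2-parameter distance between two aligned DNA sequences.

    Columns where either sequence has a gap or ambiguous base are skipped.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    compared = transitions = transversions = 0
    for x, y in zip(seq_a.upper(), seq_b.upper()):
        if x not in "ACGT" or y not in "ACGT":
            continue
        compared += 1
        if x == y:
            continue
        if {x, y} <= PURINES or {x, y} <= PYRIMIDINES:
            transitions += 1
        else:
            transversions += 1
    if compared == 0:
        raise ValueError("no comparable sites")
    p = transitions / compared
    q = transversions / compared
    if 1 - 2 * q <= 0 or 1 - 2 * p - q <= 0:
        raise ValueError("sequences too divergent for the K2P correction")
    return -0.5 * math.log((1 - 2 * p - q) * math.sqrt(1 - 2 * q))


# One transition in ten sites: slightly more than the raw 0.1 difference.
print(round(k2p_distance("AAAAAAAAAA", "GAAAAAAAAA"), 4))  # prints 0.1116
```

BioNJ then builds the tree from the full matrix of these pairwise distances.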
So where are the SNPs located? This figure shows the distribution of SNPs around the Ec55989 chromosome:
The outer blue rings indicate the Ec55989 genes encoded on the forward (outermost) and reverse strands. The purple/green wiggly line indicates the full set of SNPs found among all available E. coli chromosomes in NCBI. This is a relative plot, so purple bits indicate where SNPs are relatively rare (lower density than the average across the genome) and green bits indicate where SNPs are particularly common (higher density than the average across the genome).
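The idea behind a relative density plot can be sketched as follows: count SNPs in fixed-size windows along the chromosome and divide each window's density by the genome-wide average, so values below 1 would be drawn purple and values above 1 green. The window size and function name here are assumptions for illustration, not the settings used to draw the figure.

```python
def relative_snp_density(snp_positions, genome_length, window):
    """Per-window SNP density divided by the genome-wide SNP density.

    Positions are 1-based. The last window may be shorter than `window`,
    so densities are computed per base rather than per window.
    """
    n_windows = (genome_length + window - 1) // window
    counts = [0] * n_windows
    for pos in snp_positions:
        if not 1 <= pos <= genome_length:
            raise ValueError(f"position {pos} is outside the genome")
        counts[(pos - 1) // window] += 1
    total = sum(counts)
    if total == 0:
        return [0.0] * n_windows
    ratios = []
    for i, count in enumerate(counts):
        length = min(window, genome_length - i * window)
        # (count / length) / (total / genome_length), rearranged.
        ratios.append(count * genome_length / (length * total))
    return ratios


# Toy genome of 200 bp with 100 bp windows.
print(relative_snp_density([1, 2, 3, 150], 200, 100))  # prints [1.5, 0.5]
```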
The red ring indicates the location of SNPs that are shared among the three outbreak genomes compared to Ec55989, i.e. those that distinguish the outbreak from Ec55989 (~600 SNPs, or 0.12% divergence). For comparison, commonly circulating Salmonella enterica serotype Typhi, which are considered a tight-knit clonal group compared to other Salmonella and cause typhoid fever as opposed to the gastroenteritis caused by most Salmonella enterica, can differ from each other by >600 SNPs. So the outbreak strains should be considered quite closely related to the EAEC diarrhea strain Ec55989. In fact, some of these SNPs (red ring) are clustered together, suggesting that they actually represent variation in horizontally transferred sequences such as phage rather than genuine substitution mutations, so the real number of substitution mutations differentiating the outbreak strain from Ec55989 is probably more like 200-300, or 0.05% divergence.
The inner rings (green, blue, black) show the location of SNPs that are specific to just one of the three outbreak genomes. For the HPA and LB226692 strains these are VERY likely to be false SNP calls, because they rely on Ion Torrent and 454 data, which have trouble determining the length of homopolymeric tracts (e.g. GGGTTT can easily be misread as GGTTTT, masquerading as a substitution of G->T). For TY2482, where we have Illumina read data to call SNPs (which doesn’t have this homopolymer issue), we find only 2 SNPs that are unique to TY2482 compared to the other two outbreak strains (black inner ring). This is easier to believe, as we expect very small numbers of mutations to accumulate during the few weeks of evolution that separate these sequenced isolates.
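The homopolymer example is easy to check by hand, but a few lines of code make the point explicit: when one run is read one base too short and the neighbouring run one base too long, the read keeps the same length and a column-by-column comparison reports a single substitution. Both helper functions below are hypothetical illustrations.

```python
from itertools import groupby


def homopolymer_runs(seq):
    """Collapse a sequence into (base, run_length) pairs."""
    return [(base, len(list(group))) for base, group in groupby(seq)]


def apparent_substitutions(ref, read):
    """Column-by-column mismatches as (1-based position, ref base, read base)."""
    return [
        (i + 1, r, q) for i, (r, q) in enumerate(zip(ref, read)) if r != q
    ]


ref, read = "GGGTTT", "GGTTTT"
print(homopolymer_runs(ref))   # prints [('G', 3), ('T', 3)]
print(homopolymer_runs(read))  # prints [('G', 2), ('T', 4)]
print(apparent_substitutions(ref, read))  # prints [(3, 'G', 'T')]
```

The run-length view shows what really happened (two miscounted runs), while the position-wise view is what a naive SNP caller sees: one G->T change at position 3.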