So far, two sequences have been released relating to the EHEC outbreak in Europe; see details here and links to public data & analyses here on Nick Loman’s blog:
For the first sequence, TY2482, BGI has released fastqs and an assembly (methods undescribed); Nick Loman did an assembly using MIRA.
The second sequence is LB2226692, for which only an assembly is available (using a combination of mapping and de novo approaches, see here).
So how similar are the two? As a really basic first-pass analysis, I used MUMmer to map the two assemblies to the closest reference sequence, E. coli 55989 (accession CU928145). After excluding indels, and SNPs called within 100 bp of a contig end or another variant, there are 331 SNPs that the two novel genomes share relative to the reference genome; in addition, each of the novel genomes has 28-40 unique SNPs of its own. This is the tree (bioNJ, but really doesn’t matter as it’s so simple):
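A minimal Python sketch of the filtering and shared/unique comparison described above. This is an illustration only, not the pipeline actually used for these numbers: the `Variant` record, its field names and the toy coordinates are all hypothetical, and it assumes the variant calls for each assembly have already been parsed from the aligner output into this form.

```python
from typing import Dict, Iterable, NamedTuple, Set, Tuple


class Variant(NamedTuple):
    """Hypothetical record for one variant call against the reference."""
    pos: int        # 1-based coordinate on the reference
    ref: str        # reference allele
    alt: str        # allele in the assembly ("-" or "." marks an indel here)
    end_dist: int   # distance (bp) to the nearest contig end


def is_snp(v: Variant) -> bool:
    """True for single-base substitutions; anything else is treated as an indel."""
    return (len(v.ref) == 1 and len(v.alt) == 1
            and v.ref in "ACGT" and v.alt in "ACGT")


def filter_snps(variants: Iterable[Variant], min_dist: int = 100) -> Dict[int, str]:
    """Keep SNPs more than min_dist bp from a contig end and from any other variant."""
    ordered = sorted(variants, key=lambda v: v.pos)
    kept: Dict[int, str] = {}
    for i, v in enumerate(ordered):
        if not is_snp(v) or v.end_dist <= min_dist:
            continue
        near_prev = i > 0 and v.pos - ordered[i - 1].pos <= min_dist
        near_next = i + 1 < len(ordered) and ordered[i + 1].pos - v.pos <= min_dist
        if near_prev or near_next:
            continue
        kept[v.pos] = v.alt
    return kept


def classify(a: Dict[int, str], b: Dict[int, str]) -> Tuple[Set[int], Set[int], Set[int]]:
    """Split positions into shared (same allele in both), only-in-a and only-in-b."""
    shared = {p for p in a if b.get(p) == a[p]}
    return shared, set(a) - shared, set(b) - shared


# Toy data (made up) standing in for the two assemblies' variant calls.
calls_a = [
    Variant(1000, "A", "G", 5000), Variant(1050, "C", "T", 5000),  # too close together
    Variant(5000, "G", "A", 50),                                   # near a contig end
    Variant(9000, "T", "C", 4000),
    Variant(12000, "A", "-", 4000),                                # indel
    Variant(20000, "C", "G", 3000),
]
calls_b = [Variant(9000, "T", "C", 2000), Variant(30000, "G", "T", 2000)]

shared, only_a, only_b = classify(filter_snps(calls_a), filter_snps(calls_b))
print(len(shared), len(only_a), len(only_b))
```

Treating "within 100 bp" as an inclusive cut-off is a choice made for this sketch; a stricter or looser threshold would change the counts.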
Now, this seems quite a large number of isolate-specific SNPs if the two isolates were really linked via a recent chain of transmission, but it is probably due to errors in the alignment or SNP calling… These could be checked more easily if quality scores were available for the assemblies, and by examining the regions around the SNPs to see if they are in repetitive sequence. I’d usually prefer to do SNP analysis direct from reads, but since these aren’t available for both genomes I decided to keep the methodology consistent for both. (Ignore the scale on the tree; it is relative to the total number of SNPs in the alignment, 399.)
So where are the SNPs located? This figure shows the reference genome, E. coli 55989 and the location of SNPs. Black = E. coli 55989 genes, pale blue = E. coli 55989 pseudogenes; purple = SNPs shared by the outbreak strains; blue = SNPs only in LB2226692; red = SNPs only in TY2482.
At first glance, TY2482 has a private SNP in acrB, which is associated with efflux… A few of the SNPs are in IS elements and should really be removed. The SNP locations are available here (in Artemis-viewable EMBL format; coordinates relate to the E. coli 55989 reference):
I’ve also had a go at aligning the various assemblies and reference sequence to start to look at which genes are shared and different between the outbreak strains… but it’s Saturday and time to do other things! So far it seems there are several kb of sequence difference, but since I don’t have the reads for both genomes it’s difficult to say whether this is an assembly issue or real differences… more to follow.
Update 2: A lot of these SNPs seem to be associated with homopolymers, mostly of the form [X]n[Y]n where the SNP is X<->Y, e.g. AAAGG becomes AAGGG. I don’t know the specific error profile for Ion Torrent (we did our first run on the department’s new machine only a few days ago) but if it is similar to 454, as I believe it is, these are likely to be sequencing errors rather than genuine SNPs.
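To make the [X]n[Y]n pattern concrete, here is a rough Python sketch of one way to flag such SNPs from the reference sequence alone. It is an illustration under stated assumptions, not the check actually used here: the function names are hypothetical, and the `min_run=2` threshold for what counts as a homopolymer run is an arbitrary choice.

```python
def run_length(seq: str, start: int, step: int, base: str) -> int:
    """Count consecutive copies of `base` from `start`, walking in direction `step`."""
    n = 0
    i = start
    while 0 <= i < len(seq) and seq[i] == base:
        n += 1
        i += step
    return n


def is_homopolymer_boundary_snp(ref_seq: str, idx: int, alt: str, min_run: int = 2) -> bool:
    """True if the SNP at 0-based `idx` just shifts the boundary between an X-run and a Y-run.

    X is the reference base at idx and Y is the alternate base, so the
    change looks like AAAGG -> AAGGG (or the mirror image, GGAAA -> GGGAA).
    """
    ref_seq = ref_seq.upper()
    alt = alt.upper()
    x = ref_seq[idx]
    if alt == x:
        return False
    # Case 1: an X-run ends at idx and a Y-run starts immediately after it.
    x_left = run_length(ref_seq, idx, -1, x)
    y_right = run_length(ref_seq, idx + 1, +1, alt)
    # Case 2: a Y-run ends immediately before idx and an X-run starts at idx.
    x_right = run_length(ref_seq, idx, +1, x)
    y_left = run_length(ref_seq, idx - 1, -1, alt)
    return ((x_left >= min_run and y_right >= min_run)
            or (x_right >= min_run and y_left >= min_run))


# The example from the text: AAAGG becomes AAGGG (third base, A -> G).
print(is_homopolymer_boundary_snp("AAAGG", 2, "G"))
# A substitution in non-repetitive sequence is not flagged.
print(is_homopolymer_boundary_snp("ACGTACGT", 2, "T"))
```

A flag like this only says the SNP sits at a homopolymer boundary; whether it is a sequencing error still needs the reads.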
Excluding such cases, along with two SNPs in IS elements (which are likely to be mismapping between paralogous copies), there are then 12 SNPs on the LB2226692 branch and 3 on the TY2482 (MIRA assembly) branch, one of which is in prophage. However, 5 of the 12 LB2226692 SNPs were also found in the TY2482 BGI assembly (which incorporates 2 more runs of data than the current MIRA assembly I’ve been looking at), implying that they were present in the common ancestor of the outbreak strains… so that takes the number of LB2226692-only SNPs down to 7, of which 4 are in prophage.
So, in summary, there are 3 non-phage SNPs specific to LB2226692 and 2 non-phage SNPs specific to TY2482. This is much more in line with what we’d expect to emerge during the few weeks that presumably separate these two isolates. They could still be sequencing errors, but since we don’t have reads for LB2226692 it will be hard to check the quality of evidence behind these SNPs. I will be able to check the TY2482 SNPs by mapping the reads to these regions, but that will have to wait. Download updated SNP list here.
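The bookkeeping behind that summary can be written out as a short Python sketch. The counts are the ones quoted in the text above; the function and key names are just labels chosen for this illustration.

```python
def tally_branch_snps() -> dict:
    """Reproduce the arithmetic from the text using the counts given there."""
    lb_branch = 12        # SNPs on the LB2226692 branch after the exclusions
    lb_also_in_bgi = 5    # of those, also present in the TY2482 BGI assembly
    lb_prophage = 4       # LB2226692-only SNPs that fall in prophage
    ty_branch = 3         # SNPs on the TY2482 (MIRA assembly) branch
    ty_prophage = 1       # of those, in prophage

    lb_only = lb_branch - lb_also_in_bgi
    return {
        "LB2226692_only": lb_only,
        "LB2226692_non_phage": lb_only - lb_prophage,
        "TY2482_non_phage": ty_branch - ty_prophage,
    }


print(tally_branch_snps())
```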