EHEC genomes

Two genome sequences relating to the EHEC outbreak in Europe have been released so far; see details here, and links to public data & analyses here on Nick Loman’s blog.

For the first sequence, TY2482, BGI has released FASTQ files and an assembly (methods undescribed); Nick Loman did an assembly using MIRA.

The second sequence is LB2226692, for which only an assembly is available (using a combination of mapping and de novo approaches, see here).

So how similar are the two? As a really basic first-pass analysis, I used MUMmer to map the two assemblies to the closest reference sequence, E. coli 55989 (accession CU928145). After excluding indels, and any SNPs called within 100 bp of a contig end or another variant, this leaves 331 SNPs that the two novel genomes share relative to the reference genome; in addition, each of the novel genomes has 28-40 unique SNPs of its own. This is the tree (BioNJ, but the method really doesn’t matter as the tree is so simple):

Basic EHEC tree (BioNJ)

Now, 28-40 unique SNPs per genome seems quite a large number if the two isolates were really linked via a recent chain of transmission, but this is probably due to errors in the alignment or SNP calling… which could be checked if quality scores were available for the assemblies, and by examining the regions around the SNPs to see if they are in repetitive sequence. I’d usually prefer to do SNP analysis directly from reads, but since these aren’t available for both genomes I decided to keep the methodology consistent for both. (Ignore the scale on the tree; it is relative to the total number of SNPs in the alignment, 399.)
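For anyone wanting to try something similar, here is a rough Python sketch of the kind of filtering and counting described above. It is not the actual pipeline used for this post: the function names, the `(position, kind, allele)` tuple layout and the list of contig end positions are all assumptions made for illustration, and you would need to parse them out of your own MUMmer output first.

```python
def filter_snps(variants, contig_ends, min_dist=100):
    """Return {ref_pos: allele} for SNPs that pass the distance filter.

    variants    -- list of (ref_pos, kind, allele), kind is "snp" or "indel"
                   (hypothetical layout, parsed from your own variant calls)
    contig_ends -- reference positions where an aligned contig starts or stops
    min_dist    -- SNPs within this many bp of a contig end or of any other
                   variant are dropped; indels are always dropped
    """
    kept = {}
    for i, (pos, kind, allele) in enumerate(variants):
        if kind != "snp":
            continue
        near_end = any(abs(pos - end) <= min_dist for end in contig_ends)
        near_variant = any(
            abs(pos - other) <= min_dist
            for j, (other, _kind, _allele) in enumerate(variants)
            if j != i
        )
        if not (near_end or near_variant):
            kept[pos] = allele
    return kept


def partition_snps(snps_a, snps_b):
    """Split two {ref_pos: allele} dicts into shared and genome-specific sites.

    A site counts as shared only if both genomes carry the same allele there.
    A site where the two genomes differ from the reference in different ways
    ends up in both of the genome-specific sets.
    """
    shared = {p for p, allele in snps_a.items() if snps_b.get(p) == allele}
    only_a = set(snps_a) - shared
    only_b = set(snps_b) - shared
    return shared, only_a, only_b
```

The pairwise distance check is quadratic in the number of variants, which is fine for a few hundred calls but would want sorting and a sliding window for anything much bigger.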

So where are the SNPs located? This figure shows the reference genome, E. coli 55989, and the location of the SNPs. Black = E. coli 55989 genes; pale blue = E. coli 55989 pseudogenes; purple = SNPs shared by the outbreak strains; blue = SNPs only in LB2226692; red = SNPs only in TY2482.

Position of SNPs around the EHEC genomes

At first glance, TY2482 has a private SNP in acrB, which is associated with efflux… a few of the SNPs are in IS elements and should really be removed. The SNP locations are available here (in Artemis-viewable EMBL format; coordinates relate to the E. coli 55989 reference):

TY2482 (MIRA assembly) SNPs

LB2226692 SNPs

Shared SNPs
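If you want to make this sort of file for your own SNP calls, the sketch below writes one SNP as EMBL feature-table lines. It assumes the standard feature-table column layout (feature key starting in column 6, location and qualifiers starting in column 22) and uses the `variation` feature key with `/replace` and `/note` qualifiers. The function name is made up for illustration, and the files linked above may well be laid out differently.

```python
def snp_to_embl(pos, alt, note):
    """Return EMBL feature-table lines for one SNP.

    pos  -- 1-based position on the reference (here, E. coli 55989)
    alt  -- base observed in the query genome
    note -- free-text label, e.g. which genome carries the SNP
    """
    # "FT" + 3 spaces, then the feature key padded so the location
    # starts in column 22.
    key_line = "FT   " + "variation".ljust(16) + str(pos)
    # Qualifier lines: "FT" followed by spaces up to column 22.
    qualifier_prefix = "FT" + " " * 19
    return [
        key_line,
        f'{qualifier_prefix}/replace="{alt.lower()}"',
        f'{qualifier_prefix}/note="{note}"',
    ]


if __name__ == "__main__":
    for line in snp_to_embl(837910, "T", "SNP only in LB2226692"):
        print(line)
```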

I’ve also had a go at aligning the various assemblies and reference sequence to start to look at which genes are shared and which differ between the outbreak strains… but it’s Saturday and time to do other things! So far it seems there are several kb of sequence difference, but since I don’t have the reads for both genomes it’s difficult to say whether this is an assembly issue or real differences… more to follow.

Update here

Update 2: A lot of these SNPs seem to be associated with homopolymers, mostly of the form [X]n[Y]n where the SNP is X<->Y…. e.g. AAAGG becomes AAGGG. I don’t know the specific error profile for Ion Torrent (we did our first run on the department’s new machine only a few days ago) but if it is similar to 454 as I believe it is, these are likely to be sequencing errors rather than genuine SNPs.
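To make the [X]n[Y]n pattern concrete, here is a minimal Python sketch of how such SNPs could be flagged automatically. This is an illustration rather than the check used here: the function names and the `min_run` threshold are assumptions, positions are 0-based indices into the reference string, and sequences are assumed to be upper case.

```python
def run_length(seq, start, step, base):
    """Count consecutive copies of `base` from `start`, moving by `step`."""
    count = 0
    i = start
    while 0 <= i < len(seq) and seq[i] == base:
        count += 1
        i += step
    return count


def is_homopolymer_shift(ref_seq, pos, alt, min_run=2):
    """True if the SNP at `pos` just moves the boundary between two runs.

    Example: in AAAGG, changing the third A to G gives AAGGG, so the
    reference base extends a run of A on one side and the alternative
    base matches a run of G on the other.
    """
    ref = ref_seq[pos]
    if alt == ref:
        return False
    left_ref = run_length(ref_seq, pos - 1, -1, ref)
    right_ref = run_length(ref_seq, pos + 1, 1, ref)
    left_alt = run_length(ref_seq, pos - 1, -1, alt)
    right_alt = run_length(ref_seq, pos + 1, 1, alt)
    # Reference run on the left (including the SNP base), alt run on the right.
    ref_then_alt = left_ref + 1 >= min_run and right_alt >= min_run
    # Mirror image: alt run on the left, reference run on the right.
    alt_then_ref = left_alt >= min_run and right_ref + 1 >= min_run
    return ref_then_alt or alt_then_ref
```

For example, `is_homopolymer_shift("AAAGG", 2, "G")` flags the AAAGG to AAGGG case above, while a substitution in non-repetitive sequence is not flagged.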

Excluding such cases, and two SNPs in IS elements (which are likely to be due to mismapping between paralogous copies), there are then 12 SNPs on the LB2226692 branch and 3 on the TY2482 (MIRA assembly) branch (one in prophage). However, 5 of the 12 LB2226692 SNPs were also found in the TY2482 BGI assembly (which incorporates 2 more runs of data than the current MIRA assembly I’ve been looking at), implying that they were present in the common ancestor of the outbreak strains… so that takes the number of LB2226692-only SNPs down to 7, of which 4 are in prophage.

So, in summary, there are 3 non-phage SNPs specific to LB2226692, and 2 non-phage SNPs specific to TY2482. This is much more in line with what we’d expect to emerge during the few weeks that presumably separate these two isolates. They could still be sequencing errors, but since we don’t have reads for LB2226692 it will be hard to check the quality of evidence behind these SNPs. I will be able to check the TY2482 SNPs by mapping the reads to these regions, but that will have to wait. Download updated SNP list here.

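| Position | Ec55989 | LB2226692 | TY2482 | Change | Phage? |
|---|---|---|---|---|---|
| 837910 | C | T | C | intergenic | |
| 1438734 | A | C | A | synonymous | phage |
| 1441823 | A | C | A | nonsynon | phage |
| 1457668 | C | T | C | nonsynon | phage |
| 2171808 | G | T | G | nonsynon | phage |
| 2247817 | C | G | C | nonsynon | |
| 2555662 | T | G | T | nonsynon | |
| 524547 | T | T | C | nonsynon | |
| 1111495 | G | G | A | synonymous | phage |
| 4775745 | G | G | T | synonymous | |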
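As a quick sanity check, the summary counts can be recomputed from the table. The snippet below is just a tally written for illustration (the `ROWS` list is the table typed in by hand, and the function name is made up); it assigns each SNP to whichever outbreak genome differs from the reference and splits the counts by phage or non-phage.

```python
from collections import Counter

# (position, Ec55989 base, LB2226692 base, TY2482 base, in prophage?)
ROWS = [
    (837910, "C", "T", "C", False),
    (1438734, "A", "C", "A", True),
    (1441823, "A", "C", "A", True),
    (1457668, "C", "T", "C", True),
    (2171808, "G", "T", "G", True),
    (2247817, "C", "G", "C", False),
    (2555662, "T", "G", "T", False),
    (524547, "T", "T", "C", False),
    (1111495, "G", "G", "A", True),
    (4775745, "G", "G", "T", False),
]


def tally_private_snps(rows):
    """Count SNPs private to each genome, split by phage / non-phage."""
    counts = Counter()
    for _pos, ref, lb, ty, in_phage in rows:
        region = "phage" if in_phage else "non-phage"
        if lb != ref and ty == ref:
            counts[("LB2226692", region)] += 1
        elif ty != ref and lb == ref:
            counts[("TY2482", region)] += 1
    return counts


if __name__ == "__main__":
    counts = tally_private_snps(ROWS)
    for key in sorted(counts):
        print(key, counts[key])
```

This gives 7 SNPs private to LB2226692 (4 phage, 3 non-phage) and 3 private to TY2482 (1 phage, 2 non-phage), matching the numbers quoted in the text.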

10 thoughts on “EHEC genomes”

  3. BTW I ran Nick’s MIRA assembly earlier today and you can get at it as a guest (login guest and passwd guest) via job number #26553

  4. Hi Kat,
    Fab stuff! I had started an analysis along these lines on Friday but you’ve pretty much done what I had in mind. I’ve posted my results, which have SNP phylogenies against all GenBank E. coli, but they don’t have the filtering you applied, which makes them less useful for identifying exactly what is going on.
