BGI assembly of German E. coli outbreak strain

BGI has released a ‘complete’ assembly of TY2482. Here is what it looks like, with the 4 other available outbreak genomes mapped against it (4 inner rings) plus available references.

BGI assembly (TY2482, reference) vs 4 other outbreak genome assemblies (from inner: purple=GOS1, pink=GOS2, green=HPA strain, blue=LB226692). Reference sequences: for the chromosome, yellow=Ec55989 and red=VT2 phage; for the IncI plasmid, yellow=pEC_Bactec; for the mini plasmid, yellow=pO26-S1; for the EAEC plasmid, red=55989p, yellow=pO86A1, aqua=pAA, orange=agg operon.



BGI assembly (TY2482, reference) vs 4 other outbreak genome assemblies (from inner: purple=GOS1, pink=GOS2, green=HPA strain, blue=LB226692). Yellow = Ec55989, red=VT2 phage.


EAEC plasmid, a novel version carrying the agg operon (aggregative adherence fimbriae class I; AAF/I):

BGI assembly (TY2482, reference) vs 4 other outbreak genome assemblies (from inner: purple=GOS1, pink=GOS2, green=HPA strain, blue=LB226692). Red = 55989p, yellow = pO86A1, aqua = pAA, orange=agg operon.


IncI plasmid, carrying the extended-spectrum beta-lactamase gene blaCTX-M-15:

BGI assembly (TY2482, reference) vs 4 other outbreak genome assemblies (from inner: purple=GOS1, pink=GOS2, green=HPA strain, blue=LB226692). Yellow = pEC_Bactec.


Mini plasmid, replication regions only:

Mini plasmid ("selfish plasmid", carrying only rep genes). Yellow = pO26-S1.

Genome comparisons for 4 available outbreak genomes

Two new genomes were released today (well today my time, yesterday European time!) by the Göttingen Genomics Lab. They say:

We just released the 454 data of another two isolates from the German E. coli outbreak. You can find it on our website:
The link to the ftp server is
User name and password are ‘EAHEC_GOS’.

We just released the 454 assembly data of another two isolates (GOS1
and GOS2) from the German E. coli O104:H4 outbreak.
The shotgun libraries were sequenced on a GS FLX using Titanium
chemistry. 1.5 medium lanes of a Titanium picotiter plate was used for
each strain. Reads were assembled de novo using the Roche Newbler
assembly software 2.3.
We’ll submit the read data in the NCBI SRA as well.

There is also a new assembly from BGI, which I’ve not looked at yet; see the GitHub wiki to find out more.

I downloaded the first genome assembly (GOS1) but the second kept timing out; maybe they are a very popular download today! So here is the latest comparison of four available genomes from the outbreak, with their closest available reference sequences. This just illustrates what we knew from earlier analyses, and confirms that at least one of the newest genomes is pretty much the same as all the other outbreak genomes.

Comparison of 4 available assemblies (note there is a 5th but I couldn't get it to download!). For chromosome, phage and IncI plasmid, the reference sequences from NCBI are used. For the aggregative adhesion plasmid, it is so different to published reference sequences that I have used the HPA scaffold as the reference, and mapped all others (including available EAEC reference plasmids and the agg operon) to this.

Point mutations (SNPs) in outbreak strain relative to Ec55989

This post is really a reply to PatrikD’s comments on my last post about the SNP-based phylogeny of E. coli, showing the close relationship between the outbreak EAEC O104:H4 strain and the finished genome sequence of Ec55989 (well, chromosome). I can’t post much in a comment reply so it requires a new post.

PatrikD said:

Since I’m looking for SNPs that agree between the three assemblies, I’m not too worried about remaining sequencing errors. Given how closely these strains were sampled, I’m not interested in any minor differences between them, but more in the difference between 55989 and the three outbreak strains.

Now, I think the best way to do that at this point would be to use the SNP calls from HiSeq reads of TY2482. I get 1781 high quality SNP calls compared to Ec55989, see table here (using bwa & samtools to map reads and call SNPs). Most of these are in prophage regions (annotated in this artemis-readable file and flagged in the SNP table in the ‘mobile’ column); excluding these there are 564 probable SNPs (SNP table, artemis-readable file).

This is their distribution (dark blue = phage regions and IS elements; red = SNP calls overlapping with these mobile regions; green = 564 SNP calls not in these regions; inner ring=GC content):

Green = SNPs in the outbreak genome (TY2482, HiSeq) compared to Ec55989
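As a rough illustration of the filtering step described above, here is a minimal Python sketch of how SNP calls could be flagged as 'mobile' when they fall inside prophage or IS element coordinates. This is not the script behind the SNP table; the function name `flag_mobile` and all coordinates are made up for the example, and regions are assumed to be 1-based inclusive (start, end) pairs.

```python
from bisect import bisect_right

def flag_mobile(snp_positions, mobile_regions):
    """Return {position: bool}, True when the SNP lies inside a mobile region.

    mobile_regions: list of (start, end) tuples, 1-based inclusive; may overlap.
    """
    # Merge overlapping or adjacent regions so one binary search per SNP is enough.
    merged = []
    for start, end in sorted(mobile_regions):
        if merged and start <= merged[-1][1] + 1:
            merged[-1][1] = max(merged[-1][1], end)
        else:
            merged.append([start, end])
    starts = [s for s, _ in merged]

    flags = {}
    for pos in snp_positions:
        i = bisect_right(starts, pos) - 1  # last region starting at or before pos
        flags[pos] = i >= 0 and pos <= merged[i][1]
    return flags

# Hypothetical coordinates, for illustration only (not taken from the SNP table).
regions = [(1000, 5000), (4500, 6000), (20000, 21000)]
snps = [999, 1000, 5500, 6001, 20500]
flags = flag_mobile(snps, regions)
non_mobile = [p for p in snps if not flags[p]]
print(non_mobile)
```

Merging the regions first keeps the lookup simple when prophage and IS annotations overlap, which they often do.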

Of the 1230 SNPs in mobile regions (i.e. subject to horizontal transfer or homologous recombination with typically horizontally transferred genes), there are 713 non-synonymous mutations (i.e. changing the encoded amino acid) and 424 synonymous changes (i.e. silent mutations, not affecting the encoded amino acid sequence). While it is dodgy to do dN/dS on this kind of intraspecies data, it would be ~0.5, consistent with purifying selection.
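For a back-of-envelope check of a figure like this, one option is to divide the raw count ratio by an assumed ratio of non-synonymous to synonymous sites. The sketch below assumes roughly 3 non-synonymous sites per synonymous site, a common rule of thumb for coding sequence and not necessarily the value used for the estimate above; `approx_dn_ds` and `site_ratio` are names invented for the example.

```python
def approx_dn_ds(nonsyn, syn, site_ratio=3.0):
    """Crude dN/dS: (nonsyn / syn) scaled by an assumed site ratio.

    site_ratio is the assumed number of non-synonymous sites per synonymous
    site. A proper estimate would count sites from the actual codons.
    """
    if syn == 0:
        raise ValueError("no synonymous changes; ratio is undefined")
    return (nonsyn / syn) / site_ratio

# Counts quoted in the text for SNPs in mobile regions.
mobile = approx_dn_ds(713, 424)
print(round(mobile, 2))
```

Under that assumption the result lands in the same range as the rough value quoted above; changing `site_ratio` shows how sensitive the estimate is to the assumed site counts.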

For the 564 SNPs in non-mobile regions of the genome (i.e. not subject to horizontal transfer or homologous recombination with typically transferred genes), 284 are non-synonymous and 145 synonymous (dN/dS~0.6). Again it’s tempting to call this purifying selection, as it looks like mutations affecting protein function are being selectively removed, relative to silent mutations. Also, 136 SNPs (24%) were in intergenic regions. Since only 13% of the chromosome is non-coding, this suggests that mutations in non-coding regions are better tolerated than those in coding sequences, again consistent with purifying selection against mutations affecting protein function.
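The intergenic comparison can be made a little more concrete with a simple binomial calculation. This is only a sketch: it assumes each SNP lands independently and uniformly along the chromosome, which clustered recombination events would violate, and the helper name `binom_upper_tail` is invented here. The three input numbers are the ones quoted above.

```python
from math import comb

def binom_upper_tail(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), summed exactly term by term."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

n_snps = 564               # SNPs outside mobile regions
n_intergenic = 136         # of which intergenic
noncoding_fraction = 0.13  # share of the chromosome that is non-coding

observed_fraction = n_intergenic / n_snps
expected_count = n_snps * noncoding_fraction
p_value = binom_upper_tail(n_intergenic, n_snps, noncoding_fraction)

print(f"observed intergenic fraction: {observed_fraction:.3f}")
print(f"expected intergenic SNPs under uniform placement: {expected_count:.1f}")
print(f"one-sided binomial tail probability: {p_value:.3g}")
```

A small tail probability here only says the excess is unlikely under uniform, independent placement; it does not by itself separate selection from other causes such as mutation hotspots.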

There are 4,762 genes annotated in Ec55989, and only 564 SNPs…so the majority of genes don’t have any point mutations (although I haven’t looked at insertions/deletions yet, only substitution mutations where one DNA base is exchanged for another). But do any genes have a high number of changes?

The answer is yes, but these mostly look like recombination events:

ibrA, an immunoglobulin-binding regulator, has 20 SNPs, mainly in the C-terminal domain. It’s quite likely this is a recombination event rather than diversifying selection, but it deserves a closer look. A neighbouring gene, EC55989_3315 (conserved hypothetical), also has 18 SNPs, so recombination is a likely culprit.

The Ag43 gene, EC55989_3357, has 17 SNP calls; however, we know from the assemblies that the outbreak strain has acquired divergent copies of Ag43 genes [see posts on analysis of integrated sequence, and biofilm genes], so these SNP calls probably reflect the differences between two distinct copies of Ag43 genes rather than point mutations acquired in the EC55989_3357 gene itself.

There are 6 SNPs in EC55989_4669, an Iha adhesin receptor. This could be real, or could be a divergent copy acquired by the outbreak strain.

Hypothetical proteins EC55989_3283, EC55989_3284, EC55989_3293, EC55989_3355, EC55989_4845, EC55989_4890, EC55989_4891, EC55989_4896, EC55989_4897 each have 3-4 SNPs. These are probably phage remnants or similar to phage genes, but some follow-up analysis could confirm this. There are 7 SNPs in shiA, a fragment of a gene similar to a Shigella SHI-2 pathogenicity island gene. EC55989_4671 is an IS element that slipped through my filter, and contains 5 SNP calls.
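A per-gene tally like the one summarised above can be sketched in a few lines of Python. This assumes each SNP call has already been assigned to a gene (or to no gene, for intergenic calls); the function name `genes_with_many_snps`, the `min_snps` threshold and the toy input are all hypothetical and do not reproduce the real SNP table.

```python
from collections import Counter

def genes_with_many_snps(snp_genes, min_snps=3):
    """Return (gene, count) pairs with at least min_snps calls, highest first.

    snp_genes: one entry per SNP call; None marks an intergenic SNP.
    """
    counts = Counter(g for g in snp_genes if g is not None)
    return [(g, n) for g, n in counts.most_common() if n >= min_snps]

# Toy input for illustration only; the counts are not the real ones.
calls = ["ibrA"] * 4 + ["shiA"] * 3 + [None] * 2 + ["EC55989_4669"]
print(genes_with_many_snps(calls))
```

Genes that stand out in such a tally still need checking by eye, since clusters of calls may reflect recombination or divergent gene copies rather than independent point mutations.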