E. coli outbreak – PacBio data and PLoS One paper on 2001 O104:H4 strain

It’s been a while since my last post, mainly because my attention has had to return to other things (my day job, ASM, and holidays in the Australian Snowy Mountains).

But a fair bit has happened in the last few weeks on the E. coli front.

PacBio has released some data from an outbreak strain plus a few related strains. As far as I know this is the first time PacBio data has been released publicly so it’s a good opportunity to have a play…I will at some point but not tonight! The data includes very long reads (average ~3 kbp) but with 85% accuracy (ouch! but useful for assembly)…using their circularised read approach, they get much better accuracy (average ~98%) but much shorter reads (average 430 bp).

Muenster & Life Tech have now published their analysis of outbreak strain and the 2001 O104:H4 STEC strain from Germany in PLoS One. Data is available here at NCBI.

The key finding of interest, I think, is that the stx2 phage (i.e. Shiga toxin) was present in the 2001 strain – apparently identical and in the same position – suggesting that the phage was acquired by a common ancestor of the 2001 and 2011 German O104:H4 strains. However the stx2 phage does favour certain insertion sites, so it is still possible that this represents two separate acquisitions.

This is their model:

Mellman et al, PLoS One 2011 - Model for O104:H4 STEC evolutionMellman et al, PLoS One 2011 - Model for O104:H4 STEC evolution

The two strains carry different aggregative adhesion plasmids (AAF/III in 2001; AAF/I in 2011) and different resistance plasmids, consistent with some evolutionary time separating them.

The paper says that each strain has accumulated 87-95 SNPs among 1,444 chromosomal genes since they shared a common ancestor…but I’ve not looked in enough detail to be convinced the authors have corrected sufficiently for homopolymers.

Interestingly their tree suggests that the common ancestor of the 2001 & 2011 German strains is also the common ancestor of the African (stx2-free) EAEC strain Ec55989 (which has picked up 24 SNPs)…again I’m not sure whether this is correct until I inspect the data myself, and the lower number of SNPs in Ec55989 makes me a little suspicious that the others are over-estimates….since Ec55989 was isolated ~1999 I think so should have a similar number of SNPs to the 2001 strain. But still a very interesting development.

This is their tree:

Mellman 2011 PLoS One - minimum spanning tree


SNP analysis and plasmid copy number among 5 outbreak E. coli

Since the Illumina (MiSeq) reads from 5 HPA genomes were recently released, I thought it would be interesting to compare these to the ‘complete’ assembly of TY2482 from BGI and look for substitution mutations and read depths for plasmids vs chromosome.

SNP analysis

I decided to try using Nesoni, the python-based analysis pipeline written by Paul Harrison & Torsten Seeman from the Victorian Bioinformatics Consortium here in Melbourne. It uses shrimp2 to map reads, and samtools and to generate and process sorted alignments and call variants. It has a nice feature where, with a single command (nesoni nway) you can generate a table showing an n-way comparison of consensus allele calls in a set of genomes, at each of the loci called as a variant in any genome.

The complete resulting output is here. It reports not only the consensus call, but the evidence behind the call, so it’s simple to see whether you believe it or not. The result is 220 calls, of which I might believe 9 (pink in figure below, and Excel file here).

I mapped MiSeq reads from 4 of the 5 HPA genomes (shrimp2 didn’t like the fastq file for sample 280, need to sort this out) and the HiSeq reads from BGI’s TY2482, to the complete reference assembly for TY2482. For 194 of the 220 variants called, the TY2482 read mapping resulted in a variant call compared to the TY2482 reference, which means that the variant is unlikely to be real. This could happen for a variety of reasons relating to the mapping & variant calling process, and I was just using the default settings in Nesoni so some tweaking might remove these. In any case, I will ignore these variants for now because I don’t believe they are real (but you can see the full table here).

This leaves 24 variant calls, where the allele detected in one or more of the 4 HPA genomes is different from the TY2482 assembly+reads from BGI, shown in the table below. This includes 9 in the chromosome (highlighted in pink), with 2 SNPs called in all 4 genomes; 1 SNP called in sample 283 only, 3 SNPs called in sample 540 and 3 SNPs called in sample 541.

SNPs identified in HPA strains compared to the complete BGI assembly for TY2482

In pTY1, which is the IncI plasmid bearing the beta-lactamase CTX-M-15 gene, the variants detected (yellow above) were all within the shufflon proteins…this region is able to ‘shuffle’ via inversions between homologous sequences, and these variant calls will most likely represent shuffling in the region rather than point mutation. In fact, the alignment suggests that there are multiple versions of these sequences in each DNA sample, suggesting that the shufflon was active and generating a mixed population of plasmids in the HPA data (but not the BGI strain TY2482, which had homozygous calls in this region). This might be worthy of some further investigation by someone who understands shufflons a lot better than I do!

Finally, there were a few variant calls in the tiny plasmid pTY3, clustered within its rep gene. These calls are heterozygous (see table) in all four HPA strains, suggesting that the mapping is picking up two different versions of the rep gene, which could be due to homology with other replication proteins in plasmids pTY1 and pTY2.

Plasmid copy number

The massive variation in read depth for plasmid sequences compared to the chromosome reminded me it might be interesting to try to infer the average copy number for each plasmid based on read depth. To do this, I used the depth plots output by nesoni (which gives the mean read depth per base in the reference sequence, based on read mapping). I calculated the mean read depth across each reference sequence (ie the completed BGI TY2482 assembly, chromosome + 3 plasmids) from this, and then calculated the ratio of read depths for plasmid:chromosome. Assuming each bacterial cell has ~1 copy of the chromosome (i.e. ignoring cells caught in the act of replication when there will be >1), this should give an approximation of the mean copy number of each plasmid per cell. We know some plasmids are maintained quite stably at 1 per cell, while others can exist at high copy number. This is the result:

Mean read depth and mean plasmid copy number for outbreak strains

So for the TY2482 data, it looks as though the IncI1 resistance plasmid (pTY1) and the aggregative adhesion plasmid (pTY2) are maintained at ~1 per cell, while the mini plasmid (which contains little more than a plasmid replication gene) is present at ~9 per cell. This is pretty much in line with expectation.

Interestingly, the HPA strains appear to have much higher copy numbers, around 20 per cell for pTY1 and pTY2 and hundreds of copies of pTY3. The numbers are pretty consistent across the HPA strains, but are remarkably higher than in TY2482.

I don’t have a good explanation for this apparent difference…. it could be an artefact in the sequencing (MiSeq likes plasmid DNA???) or in the mapping (not sure how this could be, especially since the mean depth plots produced by nesoni exclude regions that map to multiple locations in the reference genome).This could be examined by looking at results from different mapping programs, or analysis of reads from different platforms (Ion Torrent for TY2482 & LB226692, 454 reads for the HPA & C2L genomes).

If it is a real difference, I wonder if it could be differences in growth rate or culture conditions in the two labs. Or a mutation in the chromosome that affects the normal control of plasmid replication? Could having lots of copies of the aggregative adhesion plasmid enhance virulence or transmission of the bug?

Eurosurveillance editorial on E. coli outbreak credits crowdsourcing

This editorial in Eurosurveillance gives a nice overview of the microbiological side of the German E. coli outbreak investigation, including applauding the public data release & analysis efforts:

The data sets from these sequencing initiatives were instantly released for public access, resulting in data analysis among bioinformaticians and other researchers around the world. Results from these preliminary analyses have been rapidly communicated via blogs, Twitter and private web pages, outside the standard peer-reviewed scientific publication route. These initiatives have confirmed the microbiological characterisation of the outbreak strain made in the public health laboratories by targeted genotyping and phenotyping of facultative E. coli virulence genes. Most importantly, among compared E. coli genome sequences, the genome of the 2011 outbreak strain clustered closest to an EAggEC strain isolated in 2002, with the addition of stx2 and antibiotic resistance genes.

The details of findings to date are outlined in this article in the same issue, including details of Shiga toxin-producing enteroaggregative E. coli O104:H4 from an outbreak in Georgia in 2009 (main difference seems to be that it was ESBL negative, unlike the current strain which has acquired an IncI plasmid carrying the ESBL gene blaCTX-M). They also discuss a rapid PCR test for the outbreak strain direct from food samples, involving enrichment+incubation (18-24h) followed by PCR for stx2 gene from extracted DNA, followed by PCR for O104 and then confirmation from pure cultures.

There are now reports of E. coli O104 from a stream in Germany, located downstream from a sewage plant… although this is more likely to be caused by the outbreak than a cause of it, it highlights that even highly industrialized countries need to be vigilant with sanitation and hygiene to prevent the spread of dangerous human pathogens [sourced from ProMed].

Enteroaggregative E. coli on LigerCat

Reading Jon Eisen’s blog this week I was rather taken with this post about LigerCat. LigerCat is an online tool that searches pubmed for whatever you ask it to, and displays a cloud of the MeSH terms (keywords attached to articles) associated with the pubmed results. It also shows a neat bar chart of article counts by year.

Since I’ve just been introduced to enteroaggregative E. coli thanks to the German E. coli outbreak, I thought I’d search for “Enteroaggregative E. coli”… this is the result.

I think this shows quite nicely that (at least according to the literature) this organism is defined by adhesion, normally associated with diarrhea in children and babies and commonly tested for by PCR.

According to this it was first described as enteroaggregative E. coli in 1989 and has been the subject of some attention, but not a lot, ever since (~15 articles per year):

The picture is quite different for “Shiga toxin, most associated with E. coli O157 and hemolytic-uremic syndrome, with first mention in 1942 and a mass of interest since the 1980s, now with >200 papers each year:

BGI assembly of German E. coli outbreak strain

BGI has released a ‘complete’ assembly of TY2482. Here is what it looks like, with the 4 other available outbreak genomes mapped against it (4 inner rings) plus available references.

BGI assembly (TY2482, reference) vs 4 other outbreak genome assemblies (from inner: purple=GOS1, pink=GOS2, green=HPA strain, blue=LB226692). Yellow = reference sequence... for chromosome, Ec55989 and red=VT2 phage; for IncI plasmid, pEC_Bactec; for mini plasmid, pO26-S1; for EAEC plasmid, red/yellow/aqua = 55989p, pO86A1, pAA, orange=agg operon.



BGI assembly (TY2482, reference) vs 4 other outbreak genome assemblies (from inner: purple=GOS1, pink=GOS2, green=HPA strain, blue=LB226692). Yellow = Ec55989, red=VT2 phage.


EAEC plasmid, novel version and carrying agg operon (aggregative adhesion fimbriae class I; AAF/I):

BGI assembly (TY2482, reference) vs 4 other outbreak genome assemblies (from inner: purple=GOS1, pink=GOS2, green=HPA strain, blue=LB226692). Red = 55989p, yellow = pO86A1, aqua = pAA, orange=agg operon.


IncI plasmid, carrying extended spectrum beta-lactamase blaCTX-M-15:

BGI assembly (TY2482, reference) vs 4 other outbreak genome assemblies (from inner: purple=GOS1, pink=GOS2, green=HPA strain, blue=LB226692). Yellow = pEC_Bactec.


Mini plasmid, replication regions only:

Mini plasmid ("selfish plasmid", carrying only rep genes). Yellow = pO26-S1.

Genome comparisons for 4 available outbreak genomes

Two new genomes were released today (well today my time, yesterday European time!) by the Göttingen Genomics Lab. They say:

We just released the 454 data of another two isolates from the German E. coli outbreak. You can find it on our website:
The link to the ftp server is
User name and password are ‘EAHEC_GOS’.

We just released the 454 assembly data of another two isolates (GOS1
and GOS2) from the German E. coli O104:H4 outbreak.
The shotgun libraries were sequenced on a GS FLX using Titanium
chemistry. 1.5 medium lanes of a Titanium picotiter plate was used for
each strain. Reads were assembled de novo using the Roche Newbler
assembly software 2.3.
We’ll submit the read data in the NCBI SRA as well.

There is also a new assembly from BGI which I’ve not looked at yet, go here to the github wiki to find out more.

I downloaded the first genome assembly (GOS1) but the second kept timing out, maybe they are a very popular download today! So here is the latest comparison of four available genomes from the outbreak, with their closest available reference sequences. This just illustrates what we knew from earlier analyses, and confirms that at least one of the newest genomes is pretty much the same as all the other outbreak genomes.

Comparison of 4 available assemblies (note there is a 5th but I couldn't get it to download!) For chromosome, phage and IncI plasmid, the reference sequences from NCBI are used. For the aggregative adhesion plasmid, it is so different to published references sequences that I have used the HPA scaffold as the reference, and mapped all others (including available EAEC reference plasmids and the agg operon) to this.