E. coli outbreak – PacBio data and PLoS One paper on 2001 O104:H4 strain

It’s been a while since my last post, mainly because my attention has had to return to other things (my day job, ASM, and holidays in the Australian Snowy Mountains).

But a fair bit has happened in the last few weeks on the E. coli front.

PacBio has released some data from an outbreak strain plus a few related strains. As far as I know this is the first time PacBio data has been released publicly so it’s a good opportunity to have a play…I will at some point but not tonight! The data includes very long reads (average ~3 kbp) but with 85% accuracy (ouch! but useful for assembly)…using their circularised read approach, they get much better accuracy (average ~98%) but much shorter reads (average 430 bp).

Muenster & Life Tech have now published their analysis of the outbreak strain and the 2001 O104:H4 STEC strain from Germany in PLoS One. Data is available here at NCBI.

The key finding of interest, I think, is that the stx2 phage (i.e. Shiga toxin) was present in the 2001 strain – apparently identical and in the same position – suggesting that the phage was acquired by a common ancestor of the 2001 and 2011 German O104:H4 strains. However the stx2 phage does favour certain insertion sites, so it is still possible that this represents two separate acquisitions.

This is their model:

Mellmann et al, PLoS One 2011 - Model for O104:H4 STEC evolution

The two strains carry different aggregative adhesion plasmids (AAF/III in 2001; AAF/I in 2011) and different resistance plasmids, consistent with some evolutionary time separating them.

The paper says that each strain has accumulated 87-95 SNPs among 1,444 chromosomal genes since they shared a common ancestor…but I’ve not looked in enough detail to be convinced the authors have corrected sufficiently for homopolymers.

Interestingly, their tree suggests that the common ancestor of the 2001 & 2011 German strains is also the common ancestor of the African (stx2-free) EAEC strain Ec55989 (which has picked up 24 SNPs)…again I’m not sure whether this is correct until I inspect the data myself, and the lower number of SNPs in Ec55989 makes me a little suspicious that the others are over-estimates: I think Ec55989 was isolated ~1999, so it should have a similar number of SNPs to the 2001 strain. But still a very interesting development.

This is their tree:

Mellmann 2011 PLoS One - minimum spanning tree

SNP analysis and plasmid copy number among 5 outbreak E. coli

Since the Illumina (MiSeq) reads from 5 HPA genomes were recently released, I thought it would be interesting to compare these to the ‘complete’ assembly of TY2482 from BGI and look for substitution mutations and read depths for plasmids vs chromosome.

SNP analysis

I decided to try using Nesoni, the Python-based analysis pipeline written by Paul Harrison & Torsten Seemann from the Victorian Bioinformatics Consortium here in Melbourne. It uses shrimp2 to map reads, and samtools to generate and process sorted alignments and call variants. It has a nice feature where, with a single command (nesoni nway), you can generate a table showing an n-way comparison of consensus allele calls in a set of genomes, at each of the loci called as a variant in any genome.
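To make the idea of an n-way comparison concrete, here is a minimal Python sketch. It is not Nesoni's code or its output format; the function name `nway_table` and the toy data are made up for illustration, and it assumes each sample's consensus is available as a simple position-to-base mapping.

```python
def nway_table(reference, consensus_by_sample):
    """Return (samples, rows) with one row per position that is variant in any sample.

    reference:            {position: reference base}
    consensus_by_sample:  {sample name: {position: consensus base}}
    Positions with no call in a sample are reported as "N".
    """
    samples = sorted(consensus_by_sample)
    rows = []
    for pos in sorted(reference):
        ref_base = reference[pos]
        calls = [consensus_by_sample[s].get(pos, "N") for s in samples]
        # Keep the locus if at least one sample has a confident non-reference call.
        if any(c != ref_base and c != "N" for c in calls):
            rows.append((pos, ref_base, *calls))
    return samples, rows


# Toy example with invented bases (not real data).
reference = {1: "A", 2: "C", 3: "G"}
consensus = {
    "sample_1": {1: "A", 2: "T", 3: "G"},
    "sample_2": {1: "A", 2: "C", 3: "G"},
}
samples, rows = nway_table(reference, consensus)
for row in rows:
    print(row)
```

The point of the table is that every sample gets a column at every variant locus, so a call in one genome can be checked against what the others show at the same position.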

The complete resulting output is here. It reports not only the consensus call, but the evidence behind the call, so it’s simple to see whether you believe it or not. The result is 220 calls, of which I might believe 9 (pink in figure below, and Excel file here).

I mapped MiSeq reads from 4 of the 5 HPA genomes (shrimp2 didn’t like the fastq file for sample 280, need to sort this out) and the HiSeq reads from BGI’s TY2482, to the complete reference assembly for TY2482. For 194 of the 220 variants called, the TY2482 read mapping resulted in a variant call compared to the TY2482 reference, which means that the variant is unlikely to be real. This could happen for a variety of reasons relating to the mapping & variant calling process, and I was just using the default settings in Nesoni so some tweaking might remove these. In any case, I will ignore these variants for now because I don’t believe they are real (but you can see the full table here).
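The filtering step described above can be sketched roughly as follows. This is only an illustration of the logic, assuming the n-way table has been loaded as a list of dictionaries; the column names (`position`, `reference`, `TY2482_reads`) and the function name are hypothetical, not Nesoni's actual output headers.

```python
def filter_self_variants(rows, reference_sample):
    """Split rows into (kept, discarded).

    A row is discarded when the reference strain's own reads disagree with
    the reference assembly at that position, since such a call is more likely
    a mapping or calling artefact than a real difference between strains.
    """
    kept, discarded = [], []
    for row in rows:
        if row[reference_sample] != row["reference"]:
            discarded.append(row)
        else:
            kept.append(row)
    return kept, discarded


# Toy rows with invented calls (not real data).
rows = [
    {"position": 10, "reference": "A", "TY2482_reads": "G", "283": "G"},
    {"position": 20, "reference": "C", "TY2482_reads": "C", "283": "T"},
]
kept, discarded = filter_self_variants(rows, "TY2482_reads")
print(len(kept), "kept;", len(discarded), "discarded")
```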

This leaves 24 variant calls, where the allele detected in one or more of the 4 HPA genomes is different from the TY2482 assembly+reads from BGI, shown in the table below. This includes 9 in the chromosome (highlighted in pink), with 2 SNPs called in all 4 genomes; 1 SNP called in sample 283 only, 3 SNPs called in sample 540 and 3 SNPs called in sample 541.

SNPs identified in HPA strains compared to the complete BGI assembly for TY2482

In pTY1, which is the IncI plasmid bearing the beta-lactamase CTX-M-15 gene, the variants detected (yellow above) were all within the shufflon proteins…this region is able to ‘shuffle’ via inversions between homologous sequences, and these variant calls will most likely represent shuffling in the region rather than point mutation. In fact, the alignment suggests that there are multiple versions of these sequences in each DNA sample, suggesting that the shufflon was active and generating a mixed population of plasmids in the HPA data (but not the BGI strain TY2482, which had homozygous calls in this region). This might be worthy of some further investigation by someone who understands shufflons a lot better than I do!

Finally, there were a few variant calls in the tiny plasmid pTY3, clustered within its rep gene. These calls are heterozygous (see table) in all four HPA strains, suggesting that the mapping is picking up two different versions of the rep gene, which could be due to homology with other replication proteins in plasmids pTY1 and pTY2.

Plasmid copy number

The massive variation in read depth for plasmid sequences compared to the chromosome reminded me it might be interesting to try to infer the average copy number for each plasmid based on read depth. To do this, I used the depth plots output by nesoni (which gives the mean read depth per base in the reference sequence, based on read mapping). I calculated the mean read depth across each reference sequence (ie the completed BGI TY2482 assembly, chromosome + 3 plasmids) from this, and then calculated the ratio of read depths for plasmid:chromosome. Assuming each bacterial cell has ~1 copy of the chromosome (i.e. ignoring cells caught in the act of replication when there will be >1), this should give an approximation of the mean copy number of each plasmid per cell. We know some plasmids are maintained quite stably at 1 per cell, while others can exist at high copy number. This is the result:

Mean read depth and mean plasmid copy number for outbreak strains

So for the TY2482 data, it looks as though the IncI1 resistance plasmid (pTY1) and the aggregative adhesion plasmid (pTY2) are maintained at ~1 per cell, while the mini plasmid (which contains little more than a plasmid replication gene) is present at ~9 per cell. This is pretty much in line with expectation.
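The depth-ratio calculation is simple enough to sketch in a few lines of Python. This assumes the per-base depths for each reference sequence have already been read into lists; the function names and the depth values in the example are invented for illustration and are not the real numbers from these samples.

```python
def mean_depth(depths):
    """Mean read depth over all bases of one reference sequence."""
    return sum(depths) / len(depths)


def plasmid_copy_numbers(depth_by_replicon, chromosome="chromosome"):
    """Estimate mean copies per cell as plasmid depth / chromosome depth.

    Assumes ~1 chromosome copy per cell, as in the text above.
    """
    chrom_mean = mean_depth(depth_by_replicon[chromosome])
    return {
        name: mean_depth(depths) / chrom_mean
        for name, depths in depth_by_replicon.items()
        if name != chromosome
    }


# Made-up per-base depths, far shorter than any real replicon.
toy_depths = {
    "chromosome": [40, 50, 60],
    "pTY1": [45, 55],
    "pTY3": [400, 500],
}
copy_numbers = plasmid_copy_numbers(toy_depths)
print(copy_numbers)
```

A median instead of a mean would be less sensitive to a few very deep repeat regions, so that might be worth trying as a sanity check.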

Interestingly, the HPA strains appear to have much higher copy numbers, around 20 per cell for pTY1 and pTY2 and hundreds of copies of pTY3. The numbers are pretty consistent across the HPA strains, but are remarkably higher than in TY2482.

I don’t have a good explanation for this apparent difference…. it could be an artefact in the sequencing (MiSeq likes plasmid DNA???) or in the mapping (not sure how this could be, especially since the mean depth plots produced by nesoni exclude regions that map to multiple locations in the reference genome). This could be examined by looking at results from different mapping programs, or analysis of reads from different platforms (Ion Torrent for TY2482 & LB226692, 454 reads for the HPA & C2L genomes).

If it is a real difference, I wonder if it could be differences in growth rate or culture conditions in the two labs. Or a mutation in the chromosome that affects the normal control of plasmid replication? Could having lots of copies of the aggregative adhesion plasmid enhance virulence or transmission of the bug?

Eurosurveillance editorial on E. coli outbreak credits crowdsourcing

This editorial in Eurosurveillance gives a nice overview of the microbiological side of the German E. coli outbreak investigation, including applauding the public data release & analysis efforts:

The data sets from these sequencing initiatives were instantly released for public access, resulting in data analysis among bioinformaticians and other researchers around the world. Results from these preliminary analyses have been rapidly communicated via blogs, Twitter and private web pages, outside the standard peer-reviewed scientific publication route. These initiatives have confirmed the microbiological characterisation of the outbreak strain made in the public health laboratories by targeted genotyping and phenotyping of facultative E. coli virulence genes. Most importantly, among compared E. coli genome sequences, the genome of the 2011 outbreak strain clustered closest to an EAggEC strain isolated in 2002, with the addition of stx2 and antibiotic resistance genes.

The details of findings to date are outlined in this article in the same issue, including details of Shiga toxin-producing enteroaggregative E. coli O104:H4 from an outbreak in Georgia in 2009 (main difference seems to be that it was ESBL negative, unlike the current strain which has acquired an IncI plasmid carrying the ESBL gene blaCTX-M). They also discuss a rapid PCR test for the outbreak strain direct from food samples, involving enrichment+incubation (18-24h) followed by PCR for stx2 gene from extracted DNA, followed by PCR for O104 and then confirmation from pure cultures.

There are now reports of E. coli O104 from a stream in Germany, located downstream from a sewage plant… although this is more likely to be caused by the outbreak than a cause of it, it highlights that even highly industrialized countries need to be vigilant with sanitation and hygiene to prevent the spread of dangerous human pathogens [sourced from ProMed].

Enteroaggregative E. coli on LigerCat

Reading Jon Eisen’s blog this week I was rather taken with this post about LigerCat. LigerCat is an online tool that searches PubMed for whatever you ask it to, and displays a cloud of the MeSH terms (keywords attached to articles) associated with the PubMed results. It also shows a neat bar chart of article counts by year.

Since I’ve just been introduced to enteroaggregative E. coli thanks to the German E. coli outbreak, I thought I’d search for “Enteroaggregative E. coli”… this is the result.

I think this shows quite nicely that (at least according to the literature) this organism is defined by adhesion, normally associated with diarrhea in children and babies and commonly tested for by PCR.

According to this it was first described as enteroaggregative E. coli in 1989 and has been the subject of some attention, but not a lot, ever since (~15 articles per year):

The picture is quite different for “Shiga toxin”, most associated with E. coli O157 and hemolytic-uremic syndrome, with first mention in 1942 and a mass of interest since the 1980s, now with >200 papers each year:

BGI assembly of German E. coli outbreak strain

BGI has released a ‘complete’ assembly of TY2482. Here is what it looks like, with the 4 other available outbreak genomes mapped against it (4 inner rings) plus available references.

BGI assembly (TY2482, reference) vs 4 other outbreak genome assemblies (from inner: purple=GOS1, pink=GOS2, green=HPA strain, blue=LB226692). Yellow = reference sequence... for chromosome, Ec55989 and red=VT2 phage; for IncI plasmid, pEC_Bactec; for mini plasmid, pO26-S1; for EAEC plasmid, red/yellow/aqua = 55989p, pO86A1, pAA, orange=agg operon.

 

Chromosome:

BGI assembly (TY2482, reference) vs 4 other outbreak genome assemblies (from inner: purple=GOS1, pink=GOS2, green=HPA strain, blue=LB226692). Yellow = Ec55989, red=VT2 phage.

 

EAEC plasmid, novel version and carrying agg operon (aggregative adhesion fimbriae class I; AAF/I):

BGI assembly (TY2482, reference) vs 4 other outbreak genome assemblies (from inner: purple=GOS1, pink=GOS2, green=HPA strain, blue=LB226692). Red = 55989p, yellow = pO86A1, aqua = pAA, orange=agg operon.

 

IncI plasmid, carrying extended spectrum beta-lactamase blaCTX-M-15:

BGI assembly (TY2482, reference) vs 4 other outbreak genome assemblies (from inner: purple=GOS1, pink=GOS2, green=HPA strain, blue=LB226692). Yellow = pEC_Bactec.

 

Mini plasmid, replication regions only:

Mini plasmid ("selfish plasmid", carrying only rep genes). Yellow = pO26-S1.

Genome comparisons for 4 available outbreak genomes

Two new genomes were released today (well today my time, yesterday European time!) by the Göttingen Genomics Lab. They say:

We just released the 454 data of another two isolates from the German E. coli outbreak. You can find it on our website:
http://www.g2l.bio.uni-goettingen.de/
The link to the ftp server is ftp://134.76.70.117/
User name and password are ‘EAHEC_GOS’.

We just released the 454 assembly data of another two isolates (GOS1
and GOS2) from the German E. coli O104:H4 outbreak.
The shotgun libraries were sequenced on a GS FLX using Titanium
chemistry. 1.5 medium lanes of a Titanium picotiter plate was used for
each strain. Reads were assembled de novo using the Roche Newbler
assembly software 2.3.
We’ll submit the read data in the NCBI SRA as well.

There is also a new assembly from BGI which I’ve not looked at yet, go here to the github wiki to find out more.

I downloaded the first genome assembly (GOS1) but the second kept timing out, maybe they are a very popular download today! So here is the latest comparison of four available genomes from the outbreak, with their closest available reference sequences. This just illustrates what we knew from earlier analyses, and confirms that at least one of the newest genomes is pretty much the same as all the other outbreak genomes.

Comparison of 4 available assemblies (note there is a 5th but I couldn't get it to download!) For chromosome, phage and IncI plasmid, the reference sequences from NCBI are used. For the aggregative adhesion plasmid, it is so different to published reference sequences that I have used the HPA scaffold as the reference, and mapped all others (including available EAEC reference plasmids and the agg operon) to this.

Point mutations (SNPs) in outbreak strain relative to Ec55989

This post is really a reply to PatrikD’s comments on my last post about the SNP-based phylogeny of E. coli, showing the close relationship between the outbreak EAEC O104:H4 strain and the finished genome sequence of Ec55989 (well, chromosome). I can’t post much in a comment reply so it requires a new post.

PatrikD said:

Since I’m looking for SNPs that agree between the three assemblies, I’m not too worried about remaining sequencing errors. Given how closely these strains were sampled, I’m not interested in any minor differences between them, but more in the difference between 55989 and the three outbreak strains.

Now, I think the best way to do that at this point would be to use the SNP calls from HiSeq reads of TY2482. I get 1781 high quality SNP calls compared to Ec55989, see table here (using bwa & samtools to map reads and call SNPs). Most of these are in prophage regions (annotated in this artemis-readable file and flagged in the SNP table in the ‘mobile’ column); excluding these there are 564 probable SNPs (SNP table, artemis-readable file).
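Flagging SNPs that fall in mobile regions boils down to an interval-overlap check. Here is a minimal Python sketch of that step; the function names and coordinates are invented for illustration, and it assumes the prophage and IS element annotations have been reduced to simple (start, end) pairs with inclusive ends.

```python
def in_any_region(position, regions):
    """True if position falls inside any (start, end) region, ends inclusive."""
    return any(start <= position <= end for start, end in regions)


def split_by_mobile(snp_positions, mobile_regions):
    """Return (mobile, core): SNPs inside and outside the mobile regions."""
    mobile = [p for p in snp_positions if in_any_region(p, mobile_regions)]
    core = [p for p in snp_positions if not in_any_region(p, mobile_regions)]
    return mobile, core


# Toy coordinates (not real annotation).
mobile_regions = [(100, 200), (500, 600)]
snps = [50, 150, 200, 300, 550]
mobile, core = split_by_mobile(snps, mobile_regions)
print("mobile:", mobile)
print("core:", core)
```

The linear scan over regions is fine for a few hundred annotated elements; a sorted list with bisection would be the obvious upgrade for larger annotation sets.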

This is their distribution (dark blue = phage regions and IS elements; red = SNP calls overlapping with these mobile regions; green = 564 SNP calls not in these regions; inner ring=GC content):

Green = SNPs in the outbreak genome (TY2482, Hiseq) compared to Ec55989

Of the 1230 SNPs in mobile regions (i.e. subject to horizontal transfer or homologous recombination with typically horizontally transferred genes), there are 713 non-synonymous mutations (i.e. changing the encoded amino acid) and 424 synonymous changes (i.e. silent mutations, not affecting the encoded amino acid sequence)…while it is dodgy to do dN/dS on this kind of intraspecies data, it would be ~0.5, consistent with purifying selection.

For the 564 SNPs in non-mobile regions of the genome (i.e. not subject to horizontal transfer or homologous recombination with typically transferred genes), 284 are non-synonymous and 145 synonymous (dN/dS~0.6). Again it’s tempting to call this purifying selection, as it looks like mutations affecting protein function are being selectively removed, relative to those silent mutations. Also, 136 SNPs (24%) were in intergenic regions. Since only 13% of the chromosome is non-coding, this suggests that mutations in non-coding regions are better tolerated than those in coding sequences, again consistent with purifying selection against mutations affecting protein function.
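A back-of-envelope version of these ratios can be written out in Python. The SNP counts are the ones quoted above; the assumption of roughly three nonsynonymous sites per synonymous site is a rule of thumb introduced here for illustration, not something measured from these genomes, and a proper dN/dS estimate would count sites codon by codon.

```python
def rough_dn_ds(nonsyn, syn, site_ratio=3.0):
    """Crude dN/dS: ratio of SNP counts divided by an ASSUMED ratio of
    nonsynonymous to synonymous sites (site_ratio is a rule of thumb)."""
    return (nonsyn / syn) / site_ratio


mobile_ratio = rough_dn_ds(713, 424)   # SNPs in mobile regions
core_ratio = rough_dn_ds(284, 145)     # SNPs in non-mobile regions

# Intergenic SNPs compared with the non-coding share of the chromosome.
intergenic_fraction = 136 / 564
noncoding_fraction = 0.13
enrichment = intergenic_fraction / noncoding_fraction

print(f"mobile dN/dS ~ {mobile_ratio:.2f}")
print(f"core dN/dS ~ {core_ratio:.2f}")
print(f"intergenic SNP fraction = {intergenic_fraction:.2f}, "
      f"enrichment over non-coding share = {enrichment:.1f}x")
```

Under that assumption both ratios come out well below 1, in the same ballpark as the figures quoted, and the exact values shift with whatever site ratio is plugged in.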

There are 4,762 genes annotated in Ec55989, and only 564 SNPs…so the majority of genes don’t have any point mutations (although I haven’t looked at insertion/deletions yet, only substitution mutations where one DNA base is exchanged for another). But do any genes have a high number of changes?

The answer is yes, but these mostly look like recombination events:

ibrA, an immunoglobulin-binding regulator, has 20 SNPs, mainly in the C-terminal domain. It’s quite likely this is a recombination event rather than diversifying selection, but deserves a closer look. A neighbouring gene, EC55989_3315 (conserved hypothetical) also has 18 SNPs, so recombination is a likely culprit.

The Ag43 gene, EC55989_3357, has 17 SNP calls, however we know from the assemblies that the outbreak strain has acquired divergent copies of Ag43 genes [see posts on analysis of integrated sequence, and biofilm genes], so these SNP calls probably reflect the differences between two distinct copies of Ag43 genes rather than point mutations acquired in the EC55989_3357 gene itself.

There are 6 SNPs in EC55989_4669, an Iha adhesin receptor. This could be real, or could be a divergent copy acquired by the outbreak strain.

Hypothetical proteins EC55989_3283, EC55989_3284, EC55989_3293, EC55989_3355, EC55989_4845, EC55989_4890, EC55989_4891, EC55989_4896, EC55989_4897 each have 3-4 SNPs. These are probably phage remnants or similar to phage genes, but some follow-up analysis could confirm this. There are 7 SNPs in shiA, a fragment of a gene similar to a Shigella SHI-2 pathogenicity island gene. EC55989_4671 is an IS element that slipped through my filter, and contains 5 SNP calls.
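Finding genes with unusually many SNPs, as in the list above, is a matter of counting SNP positions per annotated gene. A minimal Python sketch follows; the gene names, coordinates and the threshold of 3 are placeholders for illustration, and genes are assumed to be given as (name, start, end) with inclusive ends.

```python
from collections import Counter


def snps_per_gene(snp_positions, genes):
    """Count SNPs falling within each gene; genes is a list of (name, start, end)."""
    counts = Counter()
    for pos in snp_positions:
        for name, start, end in genes:
            if start <= pos <= end:
                counts[name] += 1
    return counts


def hotspots(counts, min_snps=3):
    """Genes with at least min_snps SNPs, most SNP-dense first."""
    return [(gene, n) for gene, n in counts.most_common() if n >= min_snps]


# Toy annotation and SNP positions (not real data).
genes = [("geneA", 1, 100), ("geneB", 200, 300)]
snps = [5, 10, 50, 250, 400]
counts = snps_per_gene(snps, genes)
print(hotspots(counts))
```

A raw count favours long genes, so dividing by gene length would be a sensible refinement before reading too much into any one hit.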

E. coli outbreak and biofilms

Now that the outbreak has been firmly traced back to sprouts, attention is turning to how the sprouts were contaminated in the first place. The initial assumption was that this would be a classic case of contamination with E. coli via feces from cows or other livestock, however in this case it is more likely the original source was humans. Some clues to this:

  • Serotype O104:H4 has not been reported in animals, only in humans
  • Enteroaggregative E. coli (EAEC) is usually isolated from humans, not animals
  • EAEC is not always symptomatic in humans, it can be carried asymptomatically
  • There are no animals kept on the sprout farm linked to the outbreak, nor is there an obvious route by which contamination with animal feces could have occurred

So it could be somewhat similar to the situation with Salmonella Typhi, where people can become colonized but never develop typhoid fever symptoms and so don’t know they are carriers… but they are shedding the bacteria in their feces, which can cause illness in other people. Thus localised outbreaks are sometimes linked to food handlers (in EAEC as well as typhoid)… the most famous example being “Typhoid Mary”, a cook who was a typhoid carrier and spread typhoid fever to dozens, probably hundreds of people. Water contaminated with human feces can also transmit the bacteria. In this outbreak a lot of cases are linked to eating sprouts, so the logic would be that the sprouts have become contaminated. However there are reports of people attending the same function, but not eating the vegetables, also getting sick, which is consistent with secondary transmission via human carriers.

The European Food Safety Authority report summarizes the evidence for a human source of EAEC:

On 21st May 2011, Germany reported an ongoing outbreak of Shiga-toxin producing Escherichia coli bacteria (STEC), serotype O104:H4. [...] In the past STEC O104:H4 had been isolated in humans twice in Germany in 2001 (Mellmann et al., 2008) and once in Korea in 2005 (Bae et al., 2006). In addition, according to the information reported to ECDC, a total of 10 persons were infected with STEC O104 in the EU Member States from 2004 to 2009.

…..

The German outbreak strain seems to share virulence characteristics of STEC and EAEC strains. STEC strains usually have an animal reservoir, while EAEC have a human reservoir.

…..

Outbreaks of diarrhoeal illness due to EAEC have been reported and linked to the ingestion of food which was contaminated by food handlers. In addition, it has been shown that EAEC carriage by humans is possible (Huang et al., 2003; Huang et al., 2006).

…..

EAEC have rarely been identified in animals, suggesting that they are not zoonotic, but exclusive to humans as a pathogen (Cassar et al., 2004).

…..

Outbreaks may have more than one exposure route involved. For example, primary human infection may originate from consumption of contaminated food or direct contact with an animal carrying STEC, while secondary infection may occur by the faecal-oral route, after contamination of food through handling by an infected person shedding the bacteria. As a result, especially during the late stages of an outbreak multiple exposure routes are likely.

One of their recommendations is:

Since there is evidence of asymptomatic carriers of STEC in humans, screening of humans involved in food handling is relevant. The monitoring and/or exclusion of STEC carriers from food handling should be considered as a mitigation option.

So what can the genomes tell us?

It’s likely that the combination of Shiga-toxin production in an EAEC strain is particularly dangerous because EAEC are particularly ‘sticky’ or adhesive. The cells autoaggregate, forming biofilms, and they are also good at sticking to human cells (in fact EAEC is defined by HEp-2 cell-adherence assay). They can also form mixed biofilms with other bacteria. These abilities are probably what make EAEC good at establishing long-term colonization of humans, which can sometimes result in chronic diarrhea or long-term asymptomatic carriage. Similar properties could aid transmission via sticking to plants.

Several gene families are known to be involved in biofilm formation in E. coli, see this review.

Fimbriae:

the direct contribution of adhesive organelles of the fimbrial family to the irreversible attachment of bacteria to surfaces has been amply demonstrated. Three classes of fimbriae have a role in strengthening the bacteria-to-surface interactions: type 1 fimbriae, curli, and conjugative pili.

The outbreak genomes have swapped their aggregative adherence fimbriae relative to their closest known relatives. Both the African diarrheic strain Ec55989 (the closest related strain to have its complete genome sequenced) and the 2001 German O104:H4 strain, HUSEC041, expressed type III fimbriae (AAF/III+). [see report for HUSEC041, here for Ec55989]. AAF/I fimbriae are not uncommon, but they are quite different to AAF/III and may be relevant. They are plasmid borne, and the plasmid they are carried on in the outbreak strain is a bit different to those that have been sequenced previously from EAEC (see this post for details).

The outbreak genome shares with Ec55989 a fimbrial adhesion operon yehABC, however it has also acquired, adjacent to this operon, an insertion of genes yehI-yehQ which are present in E. coli O157:H7 and include proteins that are probable regulators…could they regulate the fimbriae?:

It also shares with Ec55989 several other fimbrial clusters including lpf (long polar fimbriae), but I can’t find any other fimbriae-related genes in the annotation of the HPA outbreak strain that are not in Ec55989.

Conjugative pili:

most tested conjugative plasmids directly contribute, upon derepression of their conjugative function, to bacterial host capacity to form a biofilm (Ghigo 2001)

The outbreak strain contains a conjugative plasmid of the IncI type, carrying the ESBL (extended spectrum beta-lactamase) gene CTX-M. The plasmid encodes a type IV pilus: could it be contributing to the fitness of the strain by enhancing biofilm production of the host bacterium? And/or promoting horizontal transfer of DNA? This paper suggests that a type IV pilus encoded on an IncI plasmid enhances biofilm formation in E. coli, although the sequences of the type IV pili are different to the IncI plasmid in the outbreak strain.

Type V secretion system and autotransporters:

…the type V secretion pathway enables a family of proteins to reach the surface with a very limited number of accessory secretion factors because most information necessary to the translocation process is contained within the secreted protein itself. These proteins, which can therefore carry out their own transport to the outer membrane, are called autotransported or autotransporter proteins.

…..

The flu gene encodes antigen 43 (Ag43), a major outer membrane protein found in most commensal and pathogenic E. coli. Although E. coli K-12 has only one copy of flu, most other strains of E. coli have several copies of this gene.

Ag43 is a self-recognizing surface autotransporter protein that does not seem to be involved in non-specific initial adhesion to abiotic surfaces, but rather, promotes cell-to-cell adhesion (Kjaergaard et al. 2000a). While, in liquid culture, this property leads to autoaggregation and clump formation rapidly followed by bacterial sedimentation, it also facilitates bacteria–bacteria adhesion and leads to the three-dimensional development of the biofilm (Owen et al. 1996; Henderson et al. 1997a; Hasman et al. 1999; Kjaergaard et al. 2000a; Schembri et al. 2003a). When expressed in different species, Ag43 can also be used to promote mixed biofilm formation between different bacteria, for example, between E. coli and Pseudomonas aeruginosa (Kjaergaard et al. 2000a, 2000b).

There are three Flu/Ag43 genes in the outbreak strain (using HPA assembly and ERA7 annotation). One is novel compared to Ec55989, and has been acquired via an integrase-mediated insertion in the same locus as the multidrug resistance genes (see post here for details). Another appears to be present in the same prophage as the Shiga-toxin, although it is incomplete (662984-664255 in HPA assembly). A third is conserved in Ec55989 (EC55989_3357). AidA adhesin proteins appear to be conserved between the outbreak strain and Ec55989.

So the differences so far between the outbreak strain and Ec55989 (EAEC diarrhea), with respect to known biofilm-associated adhesins are:

  • replaced AAF/III with AAF/I (EAEC plasmid)
  • acquired novel Ag43 gene via integrase-mediated insertion
  • type IV conjugative pili in IncI plasmid
  • acquired partial Ag43 in same phage as Shiga-toxin
  • acquired cluster of genes adjacent to yehABC fimbrial cluster, yehI-yehQ, which are present in E. coli O157:H7 and include regulators…maybe regulators of fimbriae?

Pic proteins

I also had a look at the pic genes in the outbreak strain. These are mucinases that have recently been shown to promote mucus secretion in the gut, responsible for mucoid diarrhea that is a classic symptom of Shigella and EAEC infection. Ec55989 contains three of these, two on the chromosome (intact, EC55989_4682, EC55989_3279) and one on the EAEC plasmid 55989p (truncated), all of which are conserved in the German outbreak strain. In addition, there is a fourth pic gene present in the EAEC plasmid of the outbreak strain (but missing from Ec55989), which seems to be intact. An NCBI blastp search turned up the same protein sequence in multiple Shigella genomes, and also the EAEC plasmid pO86A1:

>Novel pic gene from HPA assembly
MNKIYYLKYCHITKSLIAVSELARRVTCKSHRRLSRRVILTSVAALSLSSAWPALSATVS
AEIPYQIFRDFAENKGQFTPGTTNISIYDKQGNLVGKLDKAPMADFSSATITTGSLPPGN
HTLYSPQYVVTAKHVSGSDTMSFGYAKNTYTAVGTNNNSGLDIKTRRLSKLVTEVAPAEV
SDIGAVSGAYQAGGRFTAFYRLGGGMQYVKDKNGNRTQVYTNGGFLVGGTVSALNSYNNG
QMITAQTGDIFNPANGPLANYLNMGDSGSPLFAYDSLQKKWVLIGVLSSGTNYGNNWVVT
TQDFLGQQPQNDFDKTIAYTSGEGVLQWKYDAANGTGTLTQGNTTWDMHGKKGNDLNAGK
NLLFTGNNGEVVLQNSVNQGAGYLQFAGDYRVSALNGQTWMGGGIITDKGTHVLWQVNGV
AGDNLHKTGEGTLTVNGTGVNAGGLKVGDGTVILNQQADADGKVQAFSSVGIASGRPTVV
LSDSQQVNPDNISWGYRGGRLELNGNNLTFTRLQAADYGAIITNNSEKKSTVTLDLQTLK
ASDINVPVNTVSIFGGRGAPGDLYYDSSTKQYFILKASSYSPFFSDLNNSSVWQNVGKDR
NKAIDTVKQQKIEASSQPYMYHGQLNGNMDVNIPQLSGKDVLALDGSVNLPEGSITKKSG
TLIFQGHPVIHAGTTTSSSQSDWETRQFTLEKLKLDAATFHLSRNGKMQGDINATNGSTV
ILGSSRVFTDRSDGTGNAVSSVEGSATATTVGDQSDYSGNVTLENKSSLQIMERFTGGIE
AYDSTVSVTSQNAVFDRVGSFVNSSLTLGKGAKLTAQSGIFSTGAVDVKENASLTLTGMP
SAQKQGYYSPVISTTEGINLEDKASFSVKNMGYLSSDIHAGTTAATINLGDSDADAGKTD
SPLFSSLMKGYNAVLRGSITGAQSTVNMINALWYSDGKSEAGALKAKGSRIELGDGKHFA
TLQVKELSADNTTFLMHTNNSWADQLNVTDKLSGSNNSVLVDFLNKPASEMSVTLITAPK
GSDEKTFTAGTQQIGFSNVTPVISTEKTNDATKWVLTGYQTTADAGASKAAKDFMASGYK
SFLTEVNNLNKRMGDLRDTQGDAGVWARIMNGTGSADGDYSDNYTHVQIGVDRKHELDGV
DLFTGALLTYTDSNASSHAFSGKTKSVGGGLYASALFNSGAYFDLIGKYLHHDNQHTANF
ASLGTKDYSSHSWYAGAEVGYRYHLTKESWVEPQIELVYGSVSGKAFSWEDRGMALSMKD
KDYNPLIGRTGVDVGRAFSGDDWKITARAGLGYQFDLLANGETVLQDASGEKRFEGEKDS
RMLMTVGMNAEIKDNMRLGLELEKSAFGKYNVDNAINANFRYVF

German E. coli – phage analysis by Nico Petty

Nico Petty from University of Queensland has done some additional analysis of the prophage which she’s asked me to post here. Thanks Nico!

The stx phage and others in the O104 outbreak strain

Following on from our finding earlier in the week, that the O104 outbreak strain has acquired a phage that carries the stx2A and stx2B Shiga toxin genes, and Kat’s finding that only part of the stx phage in Sakai was present in the German outbreak strain, I’ve done a bit more analysis of the phage.

Using Kat’s ordering of the contigs of the latest BGI assembly (6/6 Illumina + Ion Torrent), a blast (see below) against the stx phages in two EHEC O157:H7 strains – Sp5 in the Sakai genome and 933W in the EDL933 genome shows that, even though the region is in lots of small contigs in the O104 genome (middle, alternating orange and brown), there is some similarity from the start to the end of the stx phages. However, there is clearly much less similarity in the larger contigs in the region to the left of the stx genes (highlighted in red).

Ordered82 contigs (middle) vs EHEC O157:H7 prophage: Sp5 from Sakai genome (top) and 933W in EDL933 genome (bottom)

This could be a simple case of misassembly as this is just an early draft genome and was ordered against the closely related EAEC 55989 genome, which doesn’t have this phage. I had a look through the rest of the O104 genome and found that there are contigs elsewhere in the genome assembly which have similarity to the left hand side of the O157 stx phages. I reordered the contigs to place these contigs at the left hand side of the stx phage in O104 (see below) and a blast against Sp5 and 933W as before showed a little more similarity, particularly with phage 933W (bottom). These contigs (13 and 492) also result in a phage of similar size to Sp5 and 933W, which makes it a more likely fit. However, although contig 492 does encode phage-related genes, they are still quite different from those in the syntenic region of the related EHEC phages. The reason for this region of difference in the stx phage (other than misassembly) could be that phage genomes are chimeras: they consist of different genetic modules, acquired from different ancestors.

TY2482 contigs (middle) reordered against Sp5 from Sakai genome (top) and 933W in EDL933 genome (bottom)

The genome of E. coli is full of repeat sequences which contribute to gene flux – both through the acquisition of new genetic material, e.g. phages and antibiotic resistance genes as we have seen in this outbreak strain, and also through recombination. This is particularly the case for the lambdoid prophages of E. coli, which are highly related to each other and often the source of recombination – swapping phage modules amongst themselves and also contributing to rearrangements in the bacterial chromosome through recombination between highly repetitive sequences.

As in other E. coli strains, there are several lambdoid prophages in the genome of this outbreak strain (the Stx phage is one of these). Due to their repetitive nature, it can be difficult to distinguish which prophage regions belong to which prophage, making assembly of these regions in bacterial genomes very difficult. This is certainly the case in the German outbreak strain as most of the prophage regions are in several contigs. However, given the relative novelty of a Shiga toxin producing EAEC, I suspect that this stx phage was only a recent acquisition (comparison with genome sequence of the 2001 German O104 strain when it is available will give us more clues about the evolution of this outbreak strain) and is likely to be intact.

The other phage region not in EAEC 55989

The other phage that Kat mentioned (a set of 20 phage genes on a single contig not found in 55989) is less than half the size required to be an intact prophage by itself, and is either a prophage remnant or could be part of another phage in the O104 genome. A blast shows this prophage region is similar to an intact prophage (highlighted in green in the picture) in the genome of UPEC UTI89 (see below), and also ExPEC S88. It is possible that a similar, intact phage might be present in the O104 genome as lots of the small contigs at the end of the ordered contigs match the rest of the UTI89 prophage, but it is impossible to tell with the current assembly.

TY2482 phage remnant (top; contig 17) vs UPEC UTI89 prophage (bottom; highlighted in green)

More sequence data, particularly paired-end sequencing and longer reads than we have in the current assembly, are really needed to resolve these prophage regions and work out how many phages there are in the genome and which bit of genome belongs to which phage. Once the phages are assembled properly, we will be able to determine whether the phages are intact and encode all the genes necessary to produce functional virions, and whether they carry any other virulence genes in addition to the stx2A and stx2B genes.

Nico Petty

n.petty@uq.edu.au

9/6/11