Phage annotation with PHAST

Just a quick post to say how much I love PHAST, the PHAge Search Tool.

It looks for possible prophages in your bacterial genomes, and makes such beautiful pictures of the results, like this summary of the five phage it found in a new Salmonella genome:

It also draws nice circular diagrams to show you where the phage are located, like this:

And it will even show you a nicely annotated figure of indidual phage it found, using an interactive Flash viewer:

My only gripe is that unlike some of the more visualization-challenged phage finders, PHAST doesn’t output actual annotation files, like GenBank or GFF or even a  simple text table that would be straightforward to convert into GenBank… the format in which it prints out the actual information on where each phage is located in your sequence seems to be a home-grown text format that is not easy to parse with existing tools.

Oh well, I suppose I will have to write a little script to turn PHAST’s phage hunt results into a proper annotation… unless someone else has already done this?


German E. coli – phage analysis by Nico Petty

Nico Petty from University of Queensland has done some additional analysis of the prophage which she’s asked me to post here. Thanks Nico!

The stx phage and others in the O104 outbreak strain

Following on from our finding earlier in the week, that the O104 outbreak strain has acquired a phage that carries the stx2A and stx2B Shiga toxin genes, and Kat’s finding that only part of the stx phage in Sakai was present in the German outbreak strain, I’ve done a bit more analysis of the phage.

Using Kat’s ordering of the contigs of the latest BGI assembly (6/6 Illumina + Ion Torrent), a blast (see below) against the stx phages in two EHEC O157:H7 strains – Sp5 in the Sakai genome and 933W in the EDL933 genome shows that, even though the region is in lots of small contigs in the O104 genome (middle, alternating orange and brown), there is some similarity from the start to the end of the stx phages. However, there is clearly much less similarity in the larger contigs in the region to the left of the stx genes (highlighted in red).

Ordered82 contigs (middle) vs EHEC O157:H7 prophage: Sp5 from Sakai genome (top) and 933W in EDL993 genome (bottom)

This could be a simple case of misassembly as this is just an early draft genome and was ordered against the closely related EAEC 55989 genome, which doesn’t have this phage. I had a look through the rest of the O104 genome and found that there are contigs elsewhere in the genome assembly which have similarity to the to the left hand side of the O157 stx phages. I reordered the contigs to replace these contigs at the left hand side of the stx phage in O104 (see below) and a blast against Sp5 and 933W as before showed a little more similarity, particularly with phage 933W (bottom). These contigs (13 and 492) also result in a phage of similar size to Sp5 and EDL933, which makes it a more likely fit. However, although contig 492 does encode phage-related genes, they are still quite different from those in the syntenic region of the related EHEC phages. The reason for this region of difference in the stx phage (other than missassemly) could be that phage genomes are chimeras, they consist of different genetic modules, acquired from different ancestors.

TY2482 contigs (middle) reordered against Sp5 from Sakai genome (top) and 933W in EDL993 genome (bottom)

The genome of E. coli is highly repetitive and full of repeat sequences which contribute to gene flux – both through the acquisition of new genetic material, e.g. phages and antibiotic resistance genes as we have seen in this outbreak strain, and also through recombination. This is particularly the case for the lambdoid prophages of E. coli, which are highly related to each other and often the source of recombination – swapping phage modules amongst themselves and also contributing to rearrangements in the bacterial chromosome through recombination between highly repetitive sequences.

As in other E. coli strains, there are several lambdoid prophages in the genome of this outbreak strain (the Stx phage is one of these). Due to their repetitive nature, it can be difficult to distinguish which prophage regions belong to which prophage, making assembly of these regions in bacterial genomes very difficult. This is certainly the case in the German outbreak strain as most of the prophage regions are in several contigs. However, given the relative novelty of a Shiga toxin producing EAEC, I suspect that this stx phage was only a recent acquisition (comparison with genome sequence of the 2001 German O104 strain when it is available will give us more clues about the evolution of this outbreak strain) and is likely to be intact.

The other phage region not in EAEC 55989

The other phage that Kat mentioned (a set of 20 phage genes on a single contig not found in 55989) is less than half the size required to be an intact prophage by itself, and is either a prophage remnant or could be part of another phage in the O104 genome. A blast shows this prophage region is similar to an intact prophage (highlighted in green in the picture) in the genome of UPEC UTI89 (see below), and also ExPEC S88. It is possible that a similar, intact phage might be present in the O104 genome as lots of the small contigs at the end of the ordered contigs match the rest of the UTI889 prophage, but it is impossible to tell with the current assembly.

TY2482 phage remnant (top; contig 17) vs UPEC UTI89 prophage (bottom; highlighted in green)

More sequence data, particularly paired-end sequencing and longer reads than we have in the current assembly are really needed to resolve these prophage regions and work out how many phages there are in the genome and which bit of genome belongs to which phage. Once the phages are assembled properly, we will be able to determine if the phages are intact and encode all the genes necessary to produce functional virions, we will also be able to determine if they carry any other virulence genes in addition to the stx2A and stx2B genes.

Nico Petty


Phage analysis of German E. coli outbreak genome

I’ve given up trying to call this EHEC, EAEC, STEC or HUSEC because to me it seems more confusing than hepful! But we are talking about the E. coli responsible for an outbreak of HUS in Germany.

Now that BGI has released a more complete assembly (using HiSeq data, now in 513 contigs), it is worthwhile attempting some analysis of phage sequences in the genome.

First I ordered the BGI contigs contigs against a composite reference sequence including the Ec55989 chromosome and plasmid, the other two plasmids I discussed here and the VT2 (shiga toxin) prophage as discussed here and here. I did this using Mauve, the alignment & ordered contigs are available here and this is how it looks in Mauve (compound reference on top, Ty2482 contigs from BGI on bottom):

I then submitted the contigs to Prophage Finder, which uses BLASTX to compare contigs to predicted prophage sequences. The raw output is here. I then used a combination of manual processing in text editors and Excel, and a table-to-embl format conversion script I wrote some time ago, to generate an annotation of prophage sequences in the BGI contigs. You can see it as an Excel table, or open this EMBL file in Artemis (it contains the ordered contig sequences plus the phage annotations).

Basically, it shows that the outbreak strain contains several prophage, most of which it shares with Ec55989. However it has acquired two additional prophages…

Stx phage

One of course is the Shiga-toxin producing phage, in which the Prophage Finder annotation found 64 genes in 46 kbp worth of contigs. This is shorter than most published Stx phage, and is consistent with my earlier findings (circular map; ACT diagram) that only part of the 90 kbp VT2-Sakai Shiga toxin phage was present in the German genome. Perhaps someone with expertise in Stx phage will know why this is.

The phylogenetic tree of NCBI blastn hits to these contigs (using ‘Distance tree of results’ option, Fast Minimum Evolution and other default settings) looks like this. No obvious source revealed here.

A second phage?

In addition to this there is a set of 20 phage genes in a single contig of ~14.1 kbp (BGI ), which is not present Ec55989. The phage is most similar to one present in the E. coli UT189 genome (accession CP000243), with 100% identity across ~80% of the sequence. I don’t know if this is a complete phage or if it could be part of the Stx2 phage, perhaps someone who knows how to ‘read’ phage genomes can check if it has all the necessary bits to be its own phage! Download the sequence and annotation here.

Phylogenetic tree for this phage, based on blast hits in NCBI:

Keep up to date on crowd-sourced analysis at the github site here.

STEC/EHEC outbreak – horizontally transferred genes

In the German outbreak bacteria, as in most E. coli, plenty of horizontal transfer has gone on to create the genome we are now looking at.

I’ve done about all I’m going to on this analysis, at least until some more complete data is released… but I did generate a summary plot and have a quick look at the origins of the stx, ter and other acquired genes.

This is a quick look at what the outbreak strain’s genome looks like:

Previously known sequences that are present in the outbreak genome

What is this showing us? Firstly, as established by other’s work mapping reads and contigs to the available E. coli reference genome sequences, the chromosome of the outbreak strain is most similar to strain Ec55989, an enteroaggregative E. coli (EAEC) isolated in Africa over a decade ago [central circle in figure]. It shares with this strain part of the EAEC plasmid [55989p, top right] carrying aggregative adhesion operons aat, the regulator aggR and some other bits, but it has a different aggregative adhesion fimbrial complement (AAF/I) from Ec55989. It has also acquired the stx2 phage carrying shiga-toxin 2 genes stx2A, stx2B [top left]; a plasmid sharing high similarity with the IncI plasmid pEC_Bactec, including blaCTX-M and blaTEM-1 beta-lactamase (antibiotic resistance) genes [bottom left] and a lot of sequence similar to plasmid pCVM29188_101 from Salmonella entericaKentucky [bottom left]. The circles represent the sequence of the plasmids and phage (previously sequenced and deposited in GenBank) that are most similar to sequences in the novel strain. The green rings indicate which parts of these references sequences are also present in the novel German strain (via BLAST comparison with TY2482/MIRA contigs)….so nearly all of the Ec55989 chromosome and pEC_Bactec plasmid, and not quite all of the other phage & plasmid sequences.

There is a further 300-500 kbp of sequence that doesn’t match any of these 5 reference sequences, but we can get a feel for these by searching deeper in the GenBank database via BLAST, and using the wonderful annotation provided by ERA7. [Annotation for just these contigs here.] I haven’t had a chance to look through these properly yet, but of course there is the tellurium resistance operon ter, which we expect because phenotypically the strain was noted as tellurium resistant some time ago.

The origin of the Shiga toxin phage is interesting. The toxin genes themselves (subunits A & B) are 100% identical at the nucleotide level to other stx2 toxins in NCBI, see alignment here showing precisely identical reference sequences. I mapped contigs (TY2482, MIRA assembly) to the VT2 phage to identify those that are likely to be part of the acquired phage. Using these sequences to search NCBI (nr, blastn), the closest match was to Stx2 phage I (accession AP004402, 100% identity across 81%)…but obviously the phage acquired by the German strain is a bit different because the whole of Stx2 phage I is not present (approx 20% missing, top left in figure above).

The tellerium resistance genes are also quite similar to those seen before in a variety of E. coli. I used the ERA7 annotation to identify contigs carrying the ter operon, and did a BLASTN search in NCBI for matches to these contigs. I aligned them properly with Muscle, made a bio-NJ tree and used the ‘Consensus’ function in Dendroscope (LSA tree) to combine the trees into a consensus tree. The result shows the ter operon is very similar to that found in other EHEC, especially O157:H7:

Consensus tree for ter operon (German outbreak strains highlighted)

Finally, I had a look at one contig that I noticed wasn’t present in Ec55989 but had homology to the E. coli O157:H7 Sakai chromosome… it is contig husec41_c1441, containing a probably transporter protein and two other genes of unknown function. Interestingly, a BLAST search of NCBI showed this sequence is usually chromosomally encoded, and was most similar to genes in Shigella flexneri and Shigella boydii, which cause bacterial dysentery [alignment of BLAST hits; tree drawn with FigTree this time]. So this is just a hint that there are still plenty of novel and potentially important genes to be discovered in this genome!

TY2482/MIRA assembled, contig 1441

EAEC plasmids

I’ve begun to look at the EAEC plasmids in the german outbreak sequence data, guided by the review of E. coli plasmids in Johnson & Nolan, 2009. I mapped the novel sequence data to the three available plasmids: 55989p (from the Ec55989 isolate, whose chromosome is most similar to the outbreak strains), p042 (from the 042 isolate), and pO86A1. None of them are present in their entirety, only 40-60% of plasmid genes were covered. The closest was 55989p (59% covered by the LB2226692 assembly, 56% covered by the MIRA assembly of Ty2428 and 54% covered by the BGI assembly of TY2428)…perhaps unsurprising since the outbreak strain was most similar to its Ec55989 host chromosome also. The same was evident from mapping BGI reads to the reference plasmids:

blue = aggR; red = agg3 aggregative adhesion fimbrial cluster

The available plasmids have AAF/II and AAF/III (AAF = aggregative adhesion fimbriae) clusters. As I reported earlier, the novel strains appear to contain a different AAF locus, AAF/I (aggABCD), which is quite distinct. One contig from the LB2226692 genome (gi_334716751_gb_AFOB01000328.1_ Escherichia coli LB226692 Contig26) contains N-terminal of aggC and whole of aggD (from AAF/I), adjacent to an IS element (closest to IS1294, IS91 family) and >8 kbp of additional sequence.

That additional sequence in LB2226992 Contig 26 maps to Salmonella enterica subsp. enterica serovar Kentucky str. CVM29188 pCVM29188_101 CP001121.1. Notably, the rest of this plasmid is also present in both outbreak strains, so it is possible that the agg operon was inserted into a pCVM29188_101-like plasmid and has somehow replaced the AAF genes in this strain. However, it is likely that the appearance of the aggCD and pCVM plasmid genes in the same contig could be down to mis-assembly – more data will be needed to sort this out. Using BWA to map TY2482 reads to the LB2226992 contig 26, I couldn’t find any reads that cover the boundary between the agg and plasmid sequences, but I would need the LB2226992 reads to check the evidence for this in the LB2226992 assembly, which are not yet public. The genes in this shared sequence (SeKA_C0033-SeKA_C0041) include some with basic plasmid functions which, if the assembly was correct, would be pretty good evidence of integration of aggABCD directly into the plasmid backbone.

LB2226692 (top, contig carrying aggABCD in green) vs pCVM29188_101

Apparently EAEC usually have only one AAF cluster, and the outbreak strain is no exception – it does not carry the agg3 or aaf AAF operons of 55989p, p042 or pO86A1. I don’t know why this happens… The outbreak strain also has the ipd gene from pO86A1 (encoding an extracellular serineprotease) but not pet from 042. The aatPABCD operon and aggR genes, plasmid-borne genes characteristic of EAEC, are present and appear to be intact.

A target for PCR typing?

I’m not familiar with EAEC typing, but the little I’ve read in the last few days suggests that (a) AAF operons are a marker for EAEC, and (b) AAF/I is relatively rare among EAEC. In which case PCR testing for a sequence within this AAF/I (aggABCD) cluster could be a good way to differentiate between the outbreak strain and other STEC (i.e. aggA+, stx2+, eae-).

Apparently the current strains are alredy being tested by PCR for  aatA and coming up positive, consistent with the fact that read mapping and assembly identify the complete aatPABCD operon (antiaggregation protein transporter system) in the outbreak strains…. but typing for aggA could eliminate other EAEC.

Update: The Koch Institute is already doing exactly this, they have characterised the outbreak strain as aatA+, aggR+, aap+, aggA+ and aggC+ by PCR (ST678). Must have been missing from the earlier info I saw. Great to see that genomics can provide the same conclusion as the experts, even for an E. coli newbie.

Update 2: Still, the genomic analysis shows/confirms that aatPABCD, aggR, aap are conserved markers of EAEC, whereas the aggA (AAF/I) is a bit more specific to this outbreak. I would argue that the combination of HUS symptoms, stx2+, eae-, aggA+ would be a pretty convincing first pass at typing, especially if you are limited (as most are) in the number of PCRs you can use to screen each isolate.

Clearly some data on frequency of this combination of markers from public health labs would be needed to confirm this has value…but given the initial ‘surprise’ at HUS caused by stx+, LEE-, EAEC (and rarity of reports in the literature) I would hazard a guess that this would be pretty discriminatory in Europe at the moment.

Update 3: BGI has release a new assembly incorporating 200x data from Illumina Hiseq, see post. But the agg cluster forms its own contig within this assembly (contig 43), without even the adjacent IS elements, so no further clues to its genomic context. Maybe if the reads were released this could be look at?