SNP analysis and plasmid copy number among 5 outbreak E. coli

Since the Illumina (MiSeq) reads from 5 HPA genomes were recently released, I thought it would be interesting to compare these to the ‘complete’ assembly of TY2482 from BGI and look for substitution mutations and read depths for plasmids vs chromosome.

SNP analysis

I decided to try using Nesoni, the python-based analysis pipeline written by Paul Harrison & Torsten Seeman from the Victorian Bioinformatics Consortium here in Melbourne. It uses shrimp2 to map reads, and samtools and to generate and process sorted alignments and call variants. It has a nice feature where, with a single command (nesoni nway) you can generate a table showing an n-way comparison of consensus allele calls in a set of genomes, at each of the loci called as a variant in any genome.

The complete resulting output is here. It reports not only the consensus call, but the evidence behind the call, so it’s simple to see whether you believe it or not. The result is 220 calls, of which I might believe 9 (pink in figure below, and Excel file here).

I mapped MiSeq reads from 4 of the 5 HPA genomes (shrimp2 didn’t like the fastq file for sample 280, need to sort this out) and the HiSeq reads from BGI’s TY2482, to the complete reference assembly for TY2482. For 194 of the 220 variants called, the TY2482 read mapping resulted in a variant call compared to the TY2482 reference, which means that the variant is unlikely to be real. This could happen for a variety of reasons relating to the mapping & variant calling process, and I was just using the default settings in Nesoni so some tweaking might remove these. In any case, I will ignore these variants for now because I don’t believe they are real (but you can see the full table here).

This leaves 24 variant calls, where the allele detected in one or more of the 4 HPA genomes is different from the TY2482 assembly+reads from BGI, shown in the table below. This includes 9 in the chromosome (highlighted in pink), with 2 SNPs called in all 4 genomes; 1 SNP called in sample 283 only, 3 SNPs called in sample 540 and 3 SNPs called in sample 541.

SNPs identified in HPA strains compared to the complete BGI assembly for TY2482

In pTY1, which is the IncI plasmid bearing the beta-lactamase CTX-M-15 gene, the variants detected (yellow above) were all within the shufflon proteins…this region is able to ‘shuffle’ via inversions between homologous sequences, and these variant calls will most likely represent shuffling in the region rather than point mutation. In fact, the alignment suggests that there are multiple versions of these sequences in each DNA sample, suggesting that the shufflon was active and generating a mixed population of plasmids in the HPA data (but not the BGI strain TY2482, which had homozygous calls in this region). This might be worthy of some further investigation by someone who understands shufflons a lot better than I do!

Finally, there were a few variant calls in the tiny plasmid pTY3, clustered within its rep gene. These calls are heterozygous (see table) in all four HPA strains, suggesting that the mapping is picking up two different versions of the rep gene, which could be due to homology with other replication proteins in plasmids pTY1 and pTY2.

Plasmid copy number

The massive variation in read depth for plasmid sequences compared to the chromosome reminded me it might be interesting to try to infer the average copy number for each plasmid based on read depth. To do this, I used the depth plots output by nesoni (which gives the mean read depth per base in the reference sequence, based on read mapping). I calculated the mean read depth across each reference sequence (ie the completed BGI TY2482 assembly, chromosome + 3 plasmids) from this, and then calculated the ratio of read depths for plasmid:chromosome. Assuming each bacterial cell has ~1 copy of the chromosome (i.e. ignoring cells caught in the act of replication when there will be >1), this should give an approximation of the mean copy number of each plasmid per cell. We know some plasmids are maintained quite stably at 1 per cell, while others can exist at high copy number. This is the result:

Mean read depth and mean plasmid copy number for outbreak strains

So for the TY2482 data, it looks as though the IncI1 resistance plasmid (pTY1) and the aggregative adhesion plasmid (pTY2) are maintained at ~1 per cell, while the mini plasmid (which contains little more than a plasmid replication gene) is present at ~9 per cell. This is pretty much in line with expectation.

Interestingly, the HPA strains appear to have much higher copy numbers, around 20 per cell for pTY1 and pTY2 and hundreds of copies of pTY3. The numbers are pretty consistent across the HPA strains, but are remarkably higher than in TY2482.

I don’t have a good explanation for this apparent difference…. it could be an artefact in the sequencing (MiSeq likes plasmid DNA???) or in the mapping (not sure how this could be, especially since the mean depth plots produced by nesoni exclude regions that map to multiple locations in the reference genome).This could be examined by looking at results from different mapping programs, or analysis of reads from different platforms (Ion Torrent for TY2482 & LB226692, 454 reads for the HPA & C2L genomes).

If it is a real difference, I wonder if it could be differences in growth rate or culture conditions in the two labs. Or a mutation in the chromosome that affects the normal control of plasmid replication? Could having lots of copies of the aggregative adhesion plasmid enhance virulence or transmission of the bug?

BGI assembly of German E. coli outbreak strain

BGI has released a ‘complete’ assembly of TY2482. Here is what it looks like, with the 4 other available outbreak genomes mapped against it (4 inner rings) plus available references.

BGI assembly (TY2482, reference) vs 4 other outbreak genome assemblies (from inner: purple=GOS1, pink=GOS2, green=HPA strain, blue=LB226692). Yellow = reference sequence... for chromosome, Ec55989 and red=VT2 phage; for IncI plasmid, pEC_Bactec; for mini plasmid, pO26-S1; for EAEC plasmid, red/yellow/aqua = 55989p, pO86A1, pAA, orange=agg operon.



BGI assembly (TY2482, reference) vs 4 other outbreak genome assemblies (from inner: purple=GOS1, pink=GOS2, green=HPA strain, blue=LB226692). Yellow = Ec55989, red=VT2 phage.


EAEC plasmid, novel version and carrying agg operon (aggregative adhesion fimbriae class I; AAF/I):

BGI assembly (TY2482, reference) vs 4 other outbreak genome assemblies (from inner: purple=GOS1, pink=GOS2, green=HPA strain, blue=LB226692). Red = 55989p, yellow = pO86A1, aqua = pAA, orange=agg operon.


IncI plasmid, carrying extended spectrum beta-lactamase blaCTX-M-15:

BGI assembly (TY2482, reference) vs 4 other outbreak genome assemblies (from inner: purple=GOS1, pink=GOS2, green=HPA strain, blue=LB226692). Yellow = pEC_Bactec.


Mini plasmid, replication regions only:

Mini plasmid ("selfish plasmid", carrying only rep genes). Yellow = pO26-S1.

Genome comparisons for 4 available outbreak genomes

Two new genomes were released today (well today my time, yesterday European time!) by the Göttingen Genomics Lab. They say:

We just released the 454 data of another two isolates from the German E. coli outbreak. You can find it on our website:
The link to the ftp server is
User name and password are ‘EAHEC_GOS’.

We just released the 454 assembly data of another two isolates (GOS1
and GOS2) from the German E. coli O104:H4 outbreak.
The shotgun libraries were sequenced on a GS FLX using Titanium
chemistry. 1.5 medium lanes of a Titanium picotiter plate was used for
each strain. Reads were assembled de novo using the Roche Newbler
assembly software 2.3.
We’ll submit the read data in the NCBI SRA as well.

There is also a new assembly from BGI which I’ve not looked at yet, go here to the github wiki to find out more.

I downloaded the first genome assembly (GOS1) but the second kept timing out, maybe they are a very popular download today! So here is the latest comparison of four available genomes from the outbreak, with their closest available reference sequences. This just illustrates what we knew from earlier analyses, and confirms that at least one of the newest genomes is pretty much the same as all the other outbreak genomes.

Comparison of 4 available assemblies (note there is a 5th but I couldn't get it to download!) For chromosome, phage and IncI plasmid, the reference sequences from NCBI are used. For the aggregative adhesion plasmid, it is so different to published references sequences that I have used the HPA scaffold as the reference, and mapped all others (including available EAEC reference plasmids and the agg operon) to this.

Resistance genes plus adhesin in the chromosome?

Two new assemblies of the German E. coli outbreak strains were released today, one from BGI (452 scaffolds/contigs; Illumina Hiseq paired end, 500bp inserts) and one from HPA (13 scaffolds, 454 mate pair). In the HPA assembly, the resistance genes for streptomycin, trimethoprim, sufamethoxazole, streptomycin and mercury (some of which are carried by a Tn21 transposon and IntI1 integron) are present in the same scaffold as the Ec55989 chromosome (scaffold 2). The picture below shows the mapping of this scaffold 2 to the chromosome in blue, the IncI plasmid  in green (which carries the blaTEM and blaCTX-M genes) and  the resistance genes of pAKU_1 in red (plasmid from Paratyphi A, using this as a reference because this is a plasmid I’ve worked with previously so am familiar with interpreting).

What you can see is that most of the scaffold maps to the chromosome, but the the red resistance genes are also present (near 100 kbp marker). Upstream of the resistance genes, in this scaffold at least, there are some additional regions of low similarity with the Ec55989 chromosome, consistent with a whole stretch of DNA being inserted into the chromosome. Very little of this has any homology to the IncI plasmid which we know is present in the strain, consistent with the idea that these resistance genes are not present on the IncI plasmid. This whole region is conserved in the BGI genome, shown in purple.

This could all be a scaffolding and assembly error, but looking at the mate pair reads should confirm or deny this.


Update: A closer look at the region suggests the scaffold may be correct. Below is a mapping of the new scaffold (second line) against Ec55989 (top line) and E. coli S88 (bottom line), showing the site of the possible insertion. The novel sequence is inserted into a tRNA sequence, which is typical of many integrase-mediated insertions. On the right of the insertion is an integrase gene with 100% identity to integrase sai in the Shigella flexneri 2002017. Part of the tRNA is duplicated on the left hand side of the insertion, again typical of a real integrase-mediated insertion.

Possible insertion of resistance genes and adhesin in chromosome

Acquired adhesin

So what exactly is in the insertion? To the left is a stretch of sequence with homology to pathogenicity island genes from several E. coli and Shigella genomes, including E. coli S88, E. coli 042, E. coli SE15, Shigella flexneri 2a SRL pathogenicity island. The homology to S88 is is shown in the figure. This region contains a protein with an autotransporter domain, annotated in some genomes as flu, Ag43, others as aidA-like adhesin, etc. So it is associated with adhesion.

Here is a phylogenetic tree showing the closest matching proteins (by NCBI blastp):

Adhesin inserted in O104:H4 outbreak strain (tree of similar proteins)

And the closest matching DNA sequences (by NCBI blastn):

Adhesin inserted in O104:H4 outbreak strain (tree of DNA sequences)

Acquired multidrug resistance

The rest of the insertion contains small hypothetical genes of unknown function, plus several common mobile elements associated with drug resistance.

Immediately adjacent to the “pathogenicity island” sequences described above (and present in the same contig) are part of tniAdelta (part of Tn21), a pecM-like permease and two tetracycline resistance genes (tetA, tetR):

The next contig contains a mercury resistance operon usually found in Tn21:

And the next contig a sequence containing strA, strB (resistance to streptomycin) and sul2 (resistance to sulfonamides… and then a new contig containing part of a Tn21-like transposon including tnpA, tnpM, tnpR and an IntI1-like integron which appears to have two resistance genes in the cassette (sul1 and a dihydrofolate reductase, I think it is A7). Next door in the same contig is an IS1 transposase and then the sai integrase, and then we are back to the tRNA-Sec sequence and chromosomal genes:

So given the sequence context, it is likely that the scaffold is correct in grouping these contigs together in this order, as it looks like a common and plausible gene order, with a possible mechanism for mobilisation. I’m used to seeing these resistance genes in plasmids, so to convince myself they can also be integrated into the chromosome I had a quick look for similar integrations in other E. coli. Here is one with a very similar set of resistance elements, even in the same order, inserted into the chromosome of EAEC E. coli 042 genome (although in this case it is not associated with the adhesion element mentioned above).

A similar syntenic set of resistance genes are present in E. coli 042 (EAEC) chromosome (top) and the German outbreak strain's chromosome (bottom)

Data: My manual annotation of this region in the HPA assembly is available here. It can be loaded into Artemis as an entry on top of the HPA scaffold. ACT comparisons on request but you can easily make your own using WebACT.

Aggregative plasmid – new E. coli genome from HPA

The Health Protection Agency in the UK has released a third E. coli O104:H4 genome from the German outbreak, strain H112180280. 454 reads and scaffold are available here: (BGI has also released an updated assembly but not sure yet how it was done.)

It contains a 73 kbp scaffold with matches to the other available EAEC aggregative adhesion plasmids (scaffold 8 in the HPA assembly). Here is a plot of the newly assembled scaffold and its homology to the other plasmids:

Aggregative plasmid from outbreak strain (reference) and its homology to three available EAEC plasmids. The location of the agg operon (encoding aggregative adhesion fimbriae I, or AAF/I) is shown in orange.

This was made using BRIG, freely available here from the Beatson lab at UQ, in about 5 minutes. Great program.

The inner ring is the novel plasmid scaffold, and its GC contig is shown in the black squiggly line. Because this is a scaffold, it contains some bits of unknown sequence between contigs (where it is known from mate pair reads that the contigs are ordered in this orientation, but there are small gaps between the contigs where we don’t know what the sequence is). These gaps are indicated with a series of N (as opposed to A, C, G or T) of the estimated length of the gap…so in the black wiggly line, these gaps show up as solid blocks of black, e.g. the one around the 70kbp mark at the top.

The other three rings (blue, red, green) indicate where there are similar sequences in the other three plasmids as shown in the central legend. You can see that pO86A1 is most similar to the novel plasmid, as it has similar sequence in parts of the novel plasmid where the other two plasmids don’t have matches (eg. between the 50-50kbp mark). There are also some parts where none of the three reference plasmids have good matches.

On the left in orange is the location of the agg operon encoding aggregative adhesion fimbriae (AAF/I). This has only very weak similarity to the other plasmids (the rings are very pale here, indicating low % identity), because the others encode type II and type III aggregative adhesion fimbriae instead (AAF/II, AAF/III). The bit upstream (10-15 kbp mark) that also has low homology contains insertion sequence (IS) elements and downstream is a resolvase, so it is possible the operon is mobile…although proper annotation is needed to sort this out.

As with Ion Torrent, the 454 is prone to errors in homopolymeric tracts (ie runs of the same base, difficult to tell between AAAAA and AAAAAA), which introduces frameshifts into the assembly. So it would be great to have an ERA7 error-tolerant annotation to sort this out.

Update: Most of the sequence that is shared with pO86A1 but not the others are transposases, with the exception of a serine protease autotransporter from a family of mucinases including pic, ipd, sepA:

Location of sepA insertion in EAEC plasmid

The gene is mostly similar to sepA genes found in the virulence plasmids of Shigella:

Fast ME tree of NCBI blastp hits to inserted serine protease autotransporter

MLST of IncI blaCTX-M plasmid in German outbreak strain

Overnight I received an email from Scott Weissman at the Seattle Children’s Hospital. He has done some analysis of the IncI, blaCTX-M bearing plasmid from the outbreak strain using the plasmid MLST database. Here’s what he did:

To facilitate comparisons to other plasmids, I analyzed the LB226692 contigs in order to identify a plasmid Sequence Type (pST) for this outbreak strain’s IncI1 plasmid carrying CTX-M-15 and TEM-1.  I extracted fragments for the 5 MLST loci (as described at from the GenBank contigs, and obtained allele assignments as follows: repI 3 | ardA 4 | trbA 6| sogS 3 | pilL 3, which corresponds to pST31.  (I should note that the extracted sequence for trbA contained a 1-nt “insertion” relative to reference allele 6, which I assume to be sequencing artifact, although a novel allele cannot be excluded – given the indel occurrence within a poly-T tract of 4 T’s).

The database contains 15 plasmid entries for IncI1 pST31 (see below), including pEC_Bactec (described by Smet et al, PLoS One, 2011;5:e11202).  All of these entries carry the CTX-M-15 and TEM-1 enzymes, so there are no headlines here.

I would note, however, that this CTX-M-15 plasmid is distinct from the IncF-family plasmids that have been globally distributed by E. coli ST131 (eg, pC15a-1a, as described by Boyd et al, AAC 2004;48:3758-64) and detected subsequently in multiple Klebsiella pneumoniae clones (see Oteo et al, JAC, 2009;64:524-528).

IncI pST31 entries in the IncI pMLST database

To supplement this I had a quick look at the latest BGI assembly of TY2482, the other outbreak strain that has been sequenced. I found the same results, but this time with a precise trbA allele 6 (i.e. Scott was right in guessing this is an error in the Ion Torrent data at a homopolymeric tract).

Re the table above, the paper describing the 2004 Shigella sonnei plasmid is here, I’m not sure if the others are published.

This is what the eBURST diagram looks like for the data in the IncI plasmid MLST database…the German outbreak sequence type, pST31, is pointed out with a red arrow.

eBurst diagram for IncI plasmid MLST

The pMLST sequences from TY2482 (BGI assembly 2, June 6) are:


STEC/EHEC outbreak – horizontally transferred genes

In the German outbreak bacteria, as in most E. coli, plenty of horizontal transfer has gone on to create the genome we are now looking at.

I’ve done about all I’m going to on this analysis, at least until some more complete data is released… but I did generate a summary plot and have a quick look at the origins of the stx, ter and other acquired genes.

This is a quick look at what the outbreak strain’s genome looks like:

Previously known sequences that are present in the outbreak genome

What is this showing us? Firstly, as established by other’s work mapping reads and contigs to the available E. coli reference genome sequences, the chromosome of the outbreak strain is most similar to strain Ec55989, an enteroaggregative E. coli (EAEC) isolated in Africa over a decade ago [central circle in figure]. It shares with this strain part of the EAEC plasmid [55989p, top right] carrying aggregative adhesion operons aat, the regulator aggR and some other bits, but it has a different aggregative adhesion fimbrial complement (AAF/I) from Ec55989. It has also acquired the stx2 phage carrying shiga-toxin 2 genes stx2A, stx2B [top left]; a plasmid sharing high similarity with the IncI plasmid pEC_Bactec, including blaCTX-M and blaTEM-1 beta-lactamase (antibiotic resistance) genes [bottom left] and a lot of sequence similar to plasmid pCVM29188_101 from Salmonella entericaKentucky [bottom left]. The circles represent the sequence of the plasmids and phage (previously sequenced and deposited in GenBank) that are most similar to sequences in the novel strain. The green rings indicate which parts of these references sequences are also present in the novel German strain (via BLAST comparison with TY2482/MIRA contigs)….so nearly all of the Ec55989 chromosome and pEC_Bactec plasmid, and not quite all of the other phage & plasmid sequences.

There is a further 300-500 kbp of sequence that doesn’t match any of these 5 reference sequences, but we can get a feel for these by searching deeper in the GenBank database via BLAST, and using the wonderful annotation provided by ERA7. [Annotation for just these contigs here.] I haven’t had a chance to look through these properly yet, but of course there is the tellurium resistance operon ter, which we expect because phenotypically the strain was noted as tellurium resistant some time ago.

The origin of the Shiga toxin phage is interesting. The toxin genes themselves (subunits A & B) are 100% identical at the nucleotide level to other stx2 toxins in NCBI, see alignment here showing precisely identical reference sequences. I mapped contigs (TY2482, MIRA assembly) to the VT2 phage to identify those that are likely to be part of the acquired phage. Using these sequences to search NCBI (nr, blastn), the closest match was to Stx2 phage I (accession AP004402, 100% identity across 81%)…but obviously the phage acquired by the German strain is a bit different because the whole of Stx2 phage I is not present (approx 20% missing, top left in figure above).

The tellerium resistance genes are also quite similar to those seen before in a variety of E. coli. I used the ERA7 annotation to identify contigs carrying the ter operon, and did a BLASTN search in NCBI for matches to these contigs. I aligned them properly with Muscle, made a bio-NJ tree and used the ‘Consensus’ function in Dendroscope (LSA tree) to combine the trees into a consensus tree. The result shows the ter operon is very similar to that found in other EHEC, especially O157:H7:

Consensus tree for ter operon (German outbreak strains highlighted)

Finally, I had a look at one contig that I noticed wasn’t present in Ec55989 but had homology to the E. coli O157:H7 Sakai chromosome… it is contig husec41_c1441, containing a probably transporter protein and two other genes of unknown function. Interestingly, a BLAST search of NCBI showed this sequence is usually chromosomally encoded, and was most similar to genes in Shigella flexneri and Shigella boydii, which cause bacterial dysentery [alignment of BLAST hits; tree drawn with FigTree this time]. So this is just a hint that there are still plenty of novel and potentially important genes to be discovered in this genome!

TY2482/MIRA assembled, contig 1441