Bacterial genomics tutorial

This is a shameless plug for an article and accompanying tutorial I’ve just published together with David Edwards, my excellent MSc Bioinformatics student from the University of Melbourne. It’s currently available as a PDF pre-pub from BMC Microbial Informatics and Experimentation, but the web version will be available soon. The accompanying tutorial is available here.

The idea for this came from discussions at last year’s ASM (Australian Society of Microbiology) meeting, where it was highlighted that there was a lack of courses and tutorials available for biologists to learn the basics of genomic analysis so that they can make use of next gen sequencing. Michael Wise, a founding editor of BMC Microbial Informatics and Experimentation based at UWA in Perth, suggested the new journal would be an ideal home for such a tutorial… so here we are:

Beginner’s guide to comparative bacterial genome analysis using next-generation sequence data

http://www.microbialinformaticsj.com/content/3/1/2/

High throughput sequencing is now fast and cheap enough to be considered part of the toolbox for investigating bacteria, and there are thousands of bacterial genome sequences available for comparison in the public domain. Bacterial genome analysis is increasingly being performed by diverse groups in research, clinical and public health labs alike, who are interested in a wide array of topics related to bacterial genetics and evolution. Examples include outbreak analysis and the study of pathogenicity and antimicrobial resistance. In this beginner’s guide, we aim to provide an entry point for individuals with a biology background who want to perform their own bioinformatics analysis of bacterial genome data, to enable them to answer their own research questions. We assume readers will be familiar with genetics and the basic nature of sequence data, but do not assume any computer programming skills. The main topics covered are assembly, ordering of contigs, annotation, genome comparison and extracting common typing information. Each section includes worked examples using publicly available E. coli data and free software tools, all which can be performed on a desktop computer.

Four great tools

In the paper and tutorial, we introduce the four tools which we rely on most for basic analysis of bacterial genome assemblies: Velvet, ACT, Mauve and BRIG. All except ACT were developed as part of a PhD project, and have endured well beyond the original PhD to become well-known bioinformatics tools. New students take note!

In the paper, each tool is highlighted in its own figure, which includes some basic instructions. This is reproduced below, but is covered in much more detail in the tutorial that comes with the paper (link at the bottom).

1. Velvet for genome assembly

Possibly the most popular and widely used short read assembler, developed by the amazing Dan Zerbino during his PhD at EBI in Cambridge. Quite a PhD project!

Download | Paper | Protocol ]

Figure1_Velvet 

Reads are assembled into contigs using Velvet and VelvetOptimiser in two steps, (1) velveth converts reads to k-mers using a hash table, and (2) velvetg assembles overlapping k-mers into contigs via a de Bruijn graph. VelvetOptimiser can be used to automate the optimisation of parameters for velveth and velvetg and generate an optimal assembly. To generate an assembly of E. coli O104:H4 using the command-line tool Velvet:

• Download Velvet [23] (we used version 1.2.08 on Mac OS X, compiled with a maximum k-mer length of 101 bp)

• Download the paired-end Illumina reads for E. coli O104:H4 strain TY-2482 (ENA accession SRR292770)

• Convert the reads to k-mers using this command:

velveth out_data_35 35 -fastq.gz -shortPaired -separate SRR292770_1.fastq.gz SRR292770_2.fastq.gz

• Then, assemble overlapping k-mers into contigs using this command:

velvetg out_data_35 -clean yes -exp_cov 21 -cov_cutoff 2.81 -min_contig_lgth 200

This will produce a set of contigs in multifasta format for further analysis. See Additional file 1: Tutorial for further details, including help with downloading reads and using VelvetOptimiser.

2. ACT for pairwise genome comparison

Part of the Sanger Institute’s Artemis suite of tools. Also look at Artemis (single genome viewer), DNA Plotter (which can draw circular diagrams of your genomes) and BAMView (which can display mapped reads overlaid on a reference genome), they are all available here.

Download | Paper | Manual ]

Figure2_ACT

Artemis and ACT are free, interactive genome browsers (we used ACT 11.0.0 on Mac OS X).

• Open the assembled E. coli O104:H4 contigs in Artemis and write out a single, concatenated sequence using File -> Write -> All Bases -> FASTA Format.

• Generate a comparison file between the concatenated contigs and 2 alternative reference genomes using the website WebACT.

• Launch ACT and load in the reference sequences, contigs and comparison files, to get a 3-way comparison like the one shown here.

Here, the E. coli O104:H4 contigs are in the middle row, the enteroaggregative E. coli strain Ec55989 is on top and the enterohaemorrhagic E. coli strain EDL933 is below. Details of the comparison can be viewed by zooming in, to the level of genes or DNA bases.

3. Mauve for contig ordering and multiple genome comparison

Developed by the wonderful Aaron Darling during his PhD, he is now Associate Professor at University of Technology Sydney. Also see Mauve Assembly Metrics, an optional plugin for assessing assembly quality which was developed for the Assemblathon.

Download | Paper | User Guide ]

Fig3_Mauve

Mauve is a free alignment tool with an interactive browser for visualising results (we used Mauve 2.3.1 on Mac OS X).

• Launch Mauve and select File -> Align with progressiveMauve

• Click ‘Add Sequence…’ to add your genome assembly (e.g. annotated E. coli O104:H4 contigs) and other reference genomes for comparison.

• Specify a file for output, then click ‘Align…’

• When the alignment is finished, a visualization of the genome blocks and their homology will be displayed, as shown here. E. coli O104:H4 is on the top, red lines indicate contig boundaries within the assembly. Sequences outside coloured blocks do not have homologs in the other genomes.

4. BRIG (BLAST Ring Image Generator) for multiple genome comparison

From Nabil-Fareed Alikhan at the University of Queensland, also as part of a graduate project, which I believe is still in progress…

Download | Download BLAST | Paper | Tutorial ]

Fig4_BRIG

BRIG is a free tool that requires a local installation of BLAST (we used BRIG 0.95 on Mac OS X). The output is a static image.

• Launch BRIG and set the reference sequence (EHEC EDL933 chromosome) and the location of other E. coli sequences for comparison. If you include reference sequences for the Stx2 phage and LEE pathogenicity island, it will be easy to see where these sequences are located.

• Click ‘Next’ and specify the sequence data and colour for each ring to be displayed in comparison to the reference.

• Click ‘Next’ and specify a title for the centre of the image and an output file, then click ‘Submit’ to run BRIG.

• BRIG will create an output file containing a circular image like the one shown here. It is easy to see that the Stx2 phage is present in the EHEC chromosomes (purple) and the outbreak genome (black), but not the EAEC or EPEC chromosomes.

Tutorial

The tutorial accompanying the article is available here. To give you an idea of what’s covered, here is the table of contents:

1. Genome assembly and annotation…………………………………………………………… 2

1.1 Downloading E. coli sequences for assembly…………………………………………….. 2

1.2 Examining quality of reads (FastQC)………………………………………………………… 2

1.3 Velvet – assembling reads into contigs………………………………………………………. 4

1.3.1 Using VelvetOptimiser to optimise de novo assembly with Velvet………….. 6

1.4 Ordering contigs against a reference using Mauve………………………………………. 7

1.4.1 Viewing the ordered contigs (Mauve)………………………………………………… 10

1.4.2 Viewing the ordered contigs (ACT)……………………………………………………. 13

1.5 Mauve Assembly Metrics – Statistical View of the Contigs………………………… 15

1.6 Annotation with RAST……………………………………………………………………………. 15

1.6.1 Alternatives to RAST………………………………………………………………………. 19

2. Comparative genome analysis……………………………………………………………….. 20

2.1 Downloading E. coli genome sequences for comparative analysis………………. 20

2.2 Mauve – for multiple genome alignment……………………………………………………. 21

2.3 ACT – for detailed pairwise genome comparisons……………………………………… 24

2.3.1 Generating comparison files for ACT…………………………………………………. 24

2.3.2 Viewing genome comparisons in ACT……………………………………………….. 27

2.4 BRIG – Visualizing reference-based comparisons of multiple sequences……… 29

3. Typing and specialist tools……………………………………………………………………. 34

3.1 PHAST – for identification of phage sequences…………………………………………. 34

3.2 ResFinder – for identification of resistance gene sequences………………………… 34

3.3 Multilocus sequence typing…………………………………………………………………….. 34

3.4 PATRIC – online genome comparison tool………………………………………………… 34

E. coli outbreak – PacBio data and PLoS One paper on 2001 O104:H4 strain

It’s been a while since my last post, mainly because my attention has had to return to other things (my day job, ASM, and holidays in the Australian Snowy Mountains).

But a fair bit has happened in the last few weeks on the E. coli front.

PacBio has released some data from an outbreak strain plus a few related strains. As far as I know this is the first time PacBio data has been released publicly so it’s a good opportunity to have a play…I will at some point but not tonight! The data includes very long reads (average ~3 kbp) but with 85% accuracy (ouch! but useful for assembly)…using their circularised read approach, they get much better accuracy (average ~98%) but much shorter reads (average 430 bp).

Muenster & Life Tech have now published their analysis of outbreak strain and the 2001 O104:H4 STEC strain from Germany in PLoS One. Data is available here at NCBI.

The key finding of interest, I think, is that the stx2 phage (i.e. Shiga toxin) was present in the 2001 strain – apparently identical and in the same position – suggesting that the phage was acquired by a common ancestor of the 2001 and 2011 German O104:H4 strains. However the stx2 phage does favour certain insertion sites, so it is still possible that this represents two separate acquisitions.

This is their model:

Mellman et al, PLoS One 2011 - Model for O104:H4 STEC evolutionMellman et al, PLoS One 2011 - Model for O104:H4 STEC evolution

The two strains carry different aggregative adhesion plasmids (AAF/III in 2001; AAF/I in 2011) and different resistance plasmids, consistent with some evolutionary time separating them.

The paper says that each strain has accumulated 87-95 SNPs among 1,444 chromosomal genes since they shared a common ancestor…but I’ve not looked in enough detail to be convinced the authors have corrected sufficiently for homopolymers.

Interestingly their tree suggests that the common ancestor of the 2001 & 2011 German strains is also the common ancestor of the African (stx2-free) EAEC strain Ec55989 (which has picked up 24 SNPs)…again I’m not sure whether this is correct until I inspect the data myself, and the lower number of SNPs in Ec55989 makes me a little suspicious that the others are over-estimates….since Ec55989 was isolated ~1999 I think so should have a similar number of SNPs to the 2001 strain. But still a very interesting development.

This is their tree:

Mellman 2011 PLoS One - minimum spanning tree

SNP analysis and plasmid copy number among 5 outbreak E. coli

Since the Illumina (MiSeq) reads from 5 HPA genomes were recently released, I thought it would be interesting to compare these to the ‘complete’ assembly of TY2482 from BGI and look for substitution mutations and read depths for plasmids vs chromosome.

SNP analysis

I decided to try using Nesoni, the python-based analysis pipeline written by Paul Harrison & Torsten Seeman from the Victorian Bioinformatics Consortium here in Melbourne. It uses shrimp2 to map reads, and samtools and to generate and process sorted alignments and call variants. It has a nice feature where, with a single command (nesoni nway) you can generate a table showing an n-way comparison of consensus allele calls in a set of genomes, at each of the loci called as a variant in any genome.

The complete resulting output is here. It reports not only the consensus call, but the evidence behind the call, so it’s simple to see whether you believe it or not. The result is 220 calls, of which I might believe 9 (pink in figure below, and Excel file here).

I mapped MiSeq reads from 4 of the 5 HPA genomes (shrimp2 didn’t like the fastq file for sample 280, need to sort this out) and the HiSeq reads from BGI’s TY2482, to the complete reference assembly for TY2482. For 194 of the 220 variants called, the TY2482 read mapping resulted in a variant call compared to the TY2482 reference, which means that the variant is unlikely to be real. This could happen for a variety of reasons relating to the mapping & variant calling process, and I was just using the default settings in Nesoni so some tweaking might remove these. In any case, I will ignore these variants for now because I don’t believe they are real (but you can see the full table here).

This leaves 24 variant calls, where the allele detected in one or more of the 4 HPA genomes is different from the TY2482 assembly+reads from BGI, shown in the table below. This includes 9 in the chromosome (highlighted in pink), with 2 SNPs called in all 4 genomes; 1 SNP called in sample 283 only, 3 SNPs called in sample 540 and 3 SNPs called in sample 541.

SNPs identified in HPA strains compared to the complete BGI assembly for TY2482

In pTY1, which is the IncI plasmid bearing the beta-lactamase CTX-M-15 gene, the variants detected (yellow above) were all within the shufflon proteins…this region is able to ‘shuffle’ via inversions between homologous sequences, and these variant calls will most likely represent shuffling in the region rather than point mutation. In fact, the alignment suggests that there are multiple versions of these sequences in each DNA sample, suggesting that the shufflon was active and generating a mixed population of plasmids in the HPA data (but not the BGI strain TY2482, which had homozygous calls in this region). This might be worthy of some further investigation by someone who understands shufflons a lot better than I do!

Finally, there were a few variant calls in the tiny plasmid pTY3, clustered within its rep gene. These calls are heterozygous (see table) in all four HPA strains, suggesting that the mapping is picking up two different versions of the rep gene, which could be due to homology with other replication proteins in plasmids pTY1 and pTY2.

Plasmid copy number

The massive variation in read depth for plasmid sequences compared to the chromosome reminded me it might be interesting to try to infer the average copy number for each plasmid based on read depth. To do this, I used the depth plots output by nesoni (which gives the mean read depth per base in the reference sequence, based on read mapping). I calculated the mean read depth across each reference sequence (ie the completed BGI TY2482 assembly, chromosome + 3 plasmids) from this, and then calculated the ratio of read depths for plasmid:chromosome. Assuming each bacterial cell has ~1 copy of the chromosome (i.e. ignoring cells caught in the act of replication when there will be >1), this should give an approximation of the mean copy number of each plasmid per cell. We know some plasmids are maintained quite stably at 1 per cell, while others can exist at high copy number. This is the result:

Mean read depth and mean plasmid copy number for outbreak strains

So for the TY2482 data, it looks as though the IncI1 resistance plasmid (pTY1) and the aggregative adhesion plasmid (pTY2) are maintained at ~1 per cell, while the mini plasmid (which contains little more than a plasmid replication gene) is present at ~9 per cell. This is pretty much in line with expectation.

Interestingly, the HPA strains appear to have much higher copy numbers, around 20 per cell for pTY1 and pTY2 and hundreds of copies of pTY3. The numbers are pretty consistent across the HPA strains, but are remarkably higher than in TY2482.

I don’t have a good explanation for this apparent difference…. it could be an artefact in the sequencing (MiSeq likes plasmid DNA???) or in the mapping (not sure how this could be, especially since the mean depth plots produced by nesoni exclude regions that map to multiple locations in the reference genome).This could be examined by looking at results from different mapping programs, or analysis of reads from different platforms (Ion Torrent for TY2482 & LB226692, 454 reads for the HPA & C2L genomes).

If it is a real difference, I wonder if it could be differences in growth rate or culture conditions in the two labs. Or a mutation in the chromosome that affects the normal control of plasmid replication? Could having lots of copies of the aggregative adhesion plasmid enhance virulence or transmission of the bug?

Eurosurveillance editorial on E. coli outbreak credits crowdsourcing

This editorial in Eurosurveillance gives a nice overview of the microbiological side of the German E. coli outbreak investigation, including applauding the public data release & analysis efforts:

The data sets from these sequencing initiatives were instantly released for public access, resulting in data analysis among bioinformaticians and other researchers around the world. Results from these preliminary analyses have been rapidly communicated via blogs, Twitter and private web pages, outside the standard peer-reviewed scientific publication route. These initiatives have confirmed the microbiological characterisation of the outbreak strain made in the public health laboratories by targeted genotyping and phenotyping of facultative E. coli virulence genes. Most importantly, among compared E. coli genome sequences, the genome of the 2011 outbreak strain clustered closest to an EAggEC strain isolated in 2002, with the addition of stx2 and antibiotic resistance genes.

The details of findings to date are outlined in this article in the same issue, including details of Shiga toxin-producing enteroaggregative E. coli O104:H4 from an outbreak in Georgia in 2009 (main difference seems to be that it was ESBL negative, unlike the current strain which has acquired an IncI plasmid carrying the ESBL gene blaCTX-M). They also discuss a rapid PCR test for the outbreak strain direct from food samples, involving enrichment+incubation (18-24h) followed by PCR for stx2 gene from extracted DNA, followed by PCR for O104 and then confirmation from pure cultures.

There are now reports of E. coli O104 from a stream in Germany, located downstream from a sewage plant… although this is more likely to be caused by the outbreak than a cause of it, it highlights that even highly industrialized countries need to be vigilant with sanitation and hygiene to prevent the spread of dangerous human pathogens [sourced from ProMed].

Enteroaggregative E. coli on LigerCat

Reading Jon Eisen’s blog this week I was rather taken with this post about LigerCat. LigerCat is an online tool that searches pubmed for whatever you ask it to, and displays a cloud of the MeSH terms (keywords attached to articles) associated with the pubmed results. It also shows a neat bar chart of article counts by year.

Since I’ve just been introduced to enteroaggregative E. coli thanks to the German E. coli outbreak, I thought I’d search for “Enteroaggregative E. coli”… this is the result.

I think this shows quite nicely that (at least according to the literature) this organism is defined by adhesion, normally associated with diarrhea in children and babies and commonly tested for by PCR.

According to this it was first described as enteroaggregative E. coli in 1989 and has been the subject of some attention, but not a lot, ever since (~15 articles per year):

The picture is quite different for “Shiga toxin, most associated with E. coli O157 and hemolytic-uremic syndrome, with first mention in 1942 and a mass of interest since the 1980s, now with >200 papers each year:

BGI assembly of German E. coli outbreak strain

BGI has released a ‘complete’ assembly of TY2482. Here is what it looks like, with the 4 other available outbreak genomes mapped against it (4 inner rings) plus available references.

BGI assembly (TY2482, reference) vs 4 other outbreak genome assemblies (from inner: purple=GOS1, pink=GOS2, green=HPA strain, blue=LB226692). Yellow = reference sequence... for chromosome, Ec55989 and red=VT2 phage; for IncI plasmid, pEC_Bactec; for mini plasmid, pO26-S1; for EAEC plasmid, red/yellow/aqua = 55989p, pO86A1, pAA, orange=agg operon.

 

Chromosome:

BGI assembly (TY2482, reference) vs 4 other outbreak genome assemblies (from inner: purple=GOS1, pink=GOS2, green=HPA strain, blue=LB226692). Yellow = Ec55989, red=VT2 phage.

 

EAEC plasmid, novel version and carrying agg operon (aggregative adhesion fimbriae class I; AAF/I):

BGI assembly (TY2482, reference) vs 4 other outbreak genome assemblies (from inner: purple=GOS1, pink=GOS2, green=HPA strain, blue=LB226692). Red = 55989p, yellow = pO86A1, aqua = pAA, orange=agg operon.

 

IncI plasmid, carrying extended spectrum beta-lactamase blaCTX-M-15:

BGI assembly (TY2482, reference) vs 4 other outbreak genome assemblies (from inner: purple=GOS1, pink=GOS2, green=HPA strain, blue=LB226692). Yellow = pEC_Bactec.

 

Mini plasmid, replication regions only:

Mini plasmid ("selfish plasmid", carrying only rep genes). Yellow = pO26-S1.

SNP-base phylogeny confirms similarity of E. coli outbreak to EAEC Ec55989

Thanks to Konrad Paszkiewicz from University of Exeter for this SNP-based analysis of the 3 E. coli outbreak genomes.

He used MUMmer to compare each complete E. coli genome available in NCBI to the Ec55989 chromosome, and identify single nucleotide polymorphisms (SNPs, i.e. substitution mutations, where one DNA base is swapped for another). He ignored SNPs in regions that are repeated in the Ec55989 genome, since we can’t distinguish whether these represent true substitution mutations in homologous sequences (relevant phylogenetic signal), or variation between copies in the same genome (noise). The results are available here.

He did the same comparison for the three outbreak genome assemblies, including the strain TY2482 from BGI (NCBI: AFOB00000000; Illumina HiSeq+Ion Torrent), strain LB226692 from Life Sciences (NCBI: AFOG00000000; Ion Torrent), and strain H112180280 from the UK Health Protection agency (454 mate pair). In addition he did the same comparison for the first MIRA assembly of TY2482 (Ion Torrent) and I analysed SNPs in Ty2482 using the Illumina HiSeq reads released by BGI (using bwa and samtools).

Konrad then used the annotation of Ec55989 to identify the coding change cause by each SNP – i.e. whether it is in a gene, and whether the DNA base change results in a change in the encoded protein sequence (non-synonymous SNP) or not (synonymous SNP). See resulting table here.

I used this annotation to remove SNP calls in genes annotated with the keywords ‘phage’, ‘transposase’, ‘insertion sequence’ or ‘IS’ in an effort to remove obviously horizontally transferred DNA…since this is not only subject to divergence by mutation (the phylogenetic signal we are looking for) but also to recombination. The result is just over 400,000 SNPs (nearly 10% of the Ec55989 chromosome). The alignment of those SNPs is available here.

I used SplitsTree4 to draw a phylogenetic network for the alignment. The result clearly shows that the outbreak genomes (green) are very similar to Ec55989 (also green, labelled ‘reference’), and very different to other sequenced E. coli (note in particular the group of EHEC O157:H7 on the middle left, which are very distant from the outbreak strains). Since this is a network, it would reveal if there were major recombinations between this strain and the other E. coli chromosomes, which there aren’t. This confirms that the outbreak strain is truly an EAEC, with very close similarity to Ec55989 and not to classical EHEC. This is backed by the presence of an EAEC plasmid, carrying aggregative adhesion fimbrial cluster I (AAF/I; agg operon; see this post for details).

I also used SplitsTree4 to draw a phylogenetic tree of the data, using K2P distances and the BioNJ tree-drawing algorithm. This shows the same result, with all three outbreak strain’s genomes clustering very tightly with the EAEC strain Ec55989.

So where are the SNPs located? This figure shows the distribution of SNPs around the Ec55989 chromosome:

SNPs relative to the Ec55989 chromosome

The outer blue rings indicate the Ec55989 genes encoded on the forward (outermost) and reverse strands. The purple/green wiggly line indicates the full set of SNPs found among all avaialable E. coli chromosomes in NCBI…this is a relative plot, so purple bits indicate where SNPs are relatively rare (lower density than the average across the genome) and green bits indicate where SNPs are particularly common (higher density than average across the genome).

The red ring indicates the location of SNPs that are shared among the three outbreak genomes compared to Ec55989, it that distinguish the outbreak from Ec55989 (~600 SNPs, or 0.12% divergence). For comparison, commonly circulating Salmonella enterica serotype Typhi, which are considered a tight-knit clonal group compared to other Salmonella and cause typhoid fever as opposed to the gastroenteritis caused by most Salmonella enterica, can differ from each other by >600 SNPs. So the outbreak strains should be considered quite closely related to the EAEC diarrhea strain Ec55989. In fact, some of these SNPs (red ring) are clustered together, suggesting that they actually represent variation in horizontally transferred sequences such as phage rather than genuine substitution mutations…so the real number of substitution mutations differentiating the outbreak strain from Ec55989 is probably more like 200-300, or 0.05% divergence.

The inner rings (green, blue, black) show the location of SNPs that are specific to just one of the three outbreak genomes. For the HPA and LB226692 strains these are VERY likely to be false SNP calls, due to the fact that they are reliant on Ion Torrent and 454 data which have trouble distinguishing between homopolymeric tracts (e.g. GGGTTT can easily be misread as GGTTTT, masquerading as a substitution of G->T). For TY2482, where we have Illumina read data to call SNPs (which doesn’t have this homopolymer issue), we find only 2 SNPs that are unique to Ty2482 compared to the other two outbreak strains (black inner ring). This is easier to believe, as we expect very small numbers of mutations to accumulate during the few weeks of evolution that separate these sequenced isolates.

E. coli data released under Creative Commons 0 license

BGI has now formally released their data, including Illumina reads, under Creative Commons 0 (CC0) license. This is the most open license possible, and includes this statement:

The person who associated a work with this deed has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.

They have set up this page with links to, and details of, all the data available. I understand HPA will do something similar.

This is an important step for crowdsourcing! In particular, it clarifies the position that this data is released for the benefit of the public & public health, and there are no restrictions on publication of data and analysis. This is important. The genomics community has long been used to pre-publication data release of ‘community resource’ genomics data (begun during the Human Genome Project), on the understanding (in broad terms) that the research community would not attempt to publish analysis of the released data until the data generators had completed their data generating and had a chance to publish their own analysis. In short, releasing data does not imply that others have the right to publish analysis of it. But the CC0 license does.

I’ve been asked by a few people about this, over the past week. There is understandably some discomfort about people publicly sharing analysis of data that they did not generate. My position has been that the data generators released their data freely for public analysis (with public health as the driver), and this is very different from a research project where they have released the data with the expectation that no-one will ‘publish’ anything before they themselves have. But it is important to formalise this…. I guess it just takes time for the legalities to catch up.

For the crowd-sourcers part, all the analysis posted at the GitHub E. coli O104:H4 Genome Analysis Crowdsourcing site will be available under the CC0 license too, this is being formalised today thanks to the folks at ERA7.

E. coli outbreak and biofilms

Now that the outbreak has been firmly traced back to sprouts, attention is turning to how the sprouts were contaminated in the first place. The initial assumption was that this would be a classic case of contamination with E. coli via feces from cows or other livestock, however in this case it is more likely the original source was humans. Some clues to this:

  • Serotype O104:H4 has not been reported in animals, only in humans
  • Enteroagreggative E. coli (EAEC) is usually isolated from humans, not animals
  • EAEC is not always symptomatic in humans, it can be carried asymptomatically
  • There are no animals kept on the sprout farm linked to the outbreak, nor is there an obvious route by which contamination with animal feces could have occurred

So it could be somewhat similar to the situation with Salmonella Typhi, where people can become colonized but never develop typhoid fever symptoms and so don’t know they are carriers… but they are shedding the bacteria in their feces, which can cause illness in other people. Thus localised outbreaks are sometimes linked to food handlers (in EAEC as well as typhoid)… the most famous example being “Typhoid Mary”, a cook who was a typhoid carrier and spread typhoid fever to dozens, probably hundreds of people. Water contaminated with human feces can also transmit the bacteria. In this outbreak a lot of cases are linked to eating sprouts, so the logic would be that the sprouts have become contaminated. However there are reports of people attending the same function, but not eating the vegetables, also getting sick, which is consistent with secondary transmission via human carriers.

The European Food Safety Authority report summarizes the evidence for a human source of EAEC:

On 21st May 2011, Germany reported an ongoing outbreak of Shiga-toxin producing Escherichia coli- bacteria (STEC4), serotype O104:H4. [...] In the past STEC O104:H4 had been isolated in humans twice in Germany in 2001 (Mellmann et al., 2008) and once in Korea in 2005 (Bae et al., 2006). In addition, according to the information reported to ECDC, a total of 10 persons were infected with STEC O104 in the EU Member States from 2004 to 2009.

…..

The German outbreak strain seems to share virulence characteristics of STEC and EAEC strains. STEC strains usually have an animal reservoir, while EAEC have a human reservoir.

…..

Outbreaks of diarrhoeal illness due to EAEC have been reported and linked
to the ingestion of food which was contaminated by food handlers. In
addition, it has been shown that EAEC carriage by humans is possible (Huang
et al., 2003
; Huang et al., 2006).
…..
EAEC have rarely been identified in animals, suggesting that they are not
zoonotic, but exclusive to humans as a pathogen (Cassar et al., 2004).

…..

Outbreaks may have more than one exposure route involved. For example, primary human infection may originate from consumption of contaminated food or direct contact with an animal carrying STEC, while secondary infection may occur by the faecal-oral route, after contamination of food through handling by an infected person shedding the bacteria. As a result, especially during the late stages of an outbreak multiple exposure routes are likely.

One of their recommendations is:

Since there is evidence of asymptomatic carriers of STEC in humans, screening of humans involved in food handling is relevant. The monitoring and/or exclusion of STEC carriers from food handling should be considered as a mitigation option.

So what can the genomes tell us?

It’s likely that the combination of Shiga-toxin production in an EAEC strain is particularly dangerous because EAEC are particularly ‘sticky’ or adhesive. The cells autoaggregate, forming biofilms, and they are also good at sticking to human cells (in fact EAEC is defined by HEp-2 cell-adherence assay). They can also form mixed biofilms with other bacteria. These abilities are probably what make EAEC good at establishing long-term colonization of humans, which can sometimes result in chronic diarrhea or long-term asymptomatic carriage. Similar properties could aid transmission via sticking to plants.

Several gene families are known to be involved in biofilm formation in E. coli, see this review.

Fimbriae:

the direct contribution of adhesive organelles of the fimbrial family to the irreversible attachment of bacteria to surfaces has been amply demonstrated. Three classes of fimbriae have a role in strengthening the bacteria-to-surface interactions: type 1 fimbriae, curli, and conjugative pili.

The outbreak genomes have swapped their aggregative adherence fimbriae relative to their closest known relatives. Both the African diarrheic strain Ec55989 (the closest related strain to have its complete genome sequenced) and the 2001 German O104:H4 strain, HUSEC O41, expressed type III fimbriae (AAF/III+). [see report for HUSEC O41, here for Ec55989]. AAF/I fimbriae are not uncommon, but they are quite different to AAF/III and may be relevant. They are plasmid borne, and the plasmid they are carried on in the outbreak strain is a bit different to those that have been sequenced previously from EAEC (see this post for details).

The outbreak genome shares with Ec55989 a fimbrial adhesion operon yehABC, however it has also acquired, adjacent to this operon, an insertion of genes yehI-yehQ which are present in E. coli O157:H7 and include proteins that are probable regulators…could they regulate the fimbriae?:

It also shares with Ec55989 several other fimbrial clusters including lpf (long polar fimbriae), but I can’t find any other fimbriae-related genes in the annotation of the HPA outbreak strain that are not in Ec55989.

Conjugative pili:

most tested conjugative plasmids directly contribute, upon derepression of their conjugative function, to bacterial host capacity to form a biofilm (Ghigo 2001)

The outbreak strain contains a conjugative plasmid of the IncI type, carrying the ESBL (extended spectrum beta-lactamase) gene CTX-M. This encodes a type IV pilus, could it be contributing to the fitness of the strain by enhancing biofilm production of the host bacterium? And/or promoting horizontal transfer of DNA? This paper suggests that a type IV pilus encoded on an IncI plasmid enhances biofilm formation in E. coli, although the sequences of the type IV pili are different to the IncI plasmid in the outbreak strain.

Type V secretion system and autotransporters:

…the type V secretion pathway enables a family of proteins to reach the surface with a very limited number of accessory secretion factors because most information necessary to the translocation process is contained within the secreted protein itself. These proteins, which can therefore carry out their own transport to the outer membrane, are called autotransported or autotransporter proteins.

…..

The flu gene encodes antigen 43 (Ag43), a major outer membrane protein found in most commensal and pathogenic E. coli. Although E. coli K-12 has only one copy of flu, most other strains of E. coli have several copies of this gene.

Ag43 is a self-recognizing surface autotransporter protein that does not seem to be involved in non-specific initial adhesion to abiotic surfaces, but rather, promotes cell-to-cell adhesion (Kjaergaard et al. 2000a). While, in liquid culture, this property leads to autoaggregation and clump formation rapidly followed by bacterial sedimentation, it also facilitates bacteria–bacteria adhesion and leads to the three-dimensional development of the biofilm (Owen et al. 1996; Henderson et al. 1997a; Hasman et al. 1999; Kjaergaard et al. 2000a; Schembri et al. 2003a). When expressed in different species, Ag43 can also be used to promote mixed biofilm formation between different bacteria, for example, between E. coli and Pseudomonas aeruginosa (Kjaergaard et al. 2000a, 2000b).

There are three Flu/Ag43 in the outbreak strain (using HPA assembly and ERA7 annotation). One is novel compared to Ec55989, and has been acquired via an integrase-mediated insertion in the same locus as the multidrug resistance genes (see post here for details). Another appears to present in the same prophage as the Shiga-toxin, although it is incomplete (662984-664255 in HPA assembly). A third is conserved in Ec55989 (EC55989_3357). AidA adhesin proteins appear to be conserved between the outbreak stain and Ec55989.

So the differences so far between the outbreak strain and Ec55989 (EAEC diarrhea), with respect to known biofilm-associated adhesins are:

  • replaced AAF/III with AAF/I (EAEC plasmid)
  • acquired novel Ag43 gene via integrase-mediated insertion
  • type IV conjugative pili in IncI plasmid
  • acquired partial Ag43 in same phage as Shiga-toxin
  • acquired cluster of genes adjacent to yehABC fimbrial cluster, yehI-yehQ, which are present in E. coli O157:H7 and include regulators…maybe regulators of fimbriae?

Pic proteins

I also had a look at the pic genes in the outbreak strain. These are mucinases that have recently been shown to promote mucous secretion in the gut, responsible for mucoid diarrhea that is a classic symptom of Shigella and EAEC infection. Ec55989 contains three of these, two on the chromosome (intact, EC55989_4682, EC55989_3279) and one on the EAEC plasmid 55989p (truncated), all of which are conserved in the German outbreak strain. In addition, there is a fourth pic gene present in the EAEC plasmid of the outbreak strain (but missing from Ec55989), which seems to be intact. An NCBI blastp search turned up the same protein sequence in multiple Shigella genomes, and also the EAEC plasmid pO86A1:

>Novel pic gene from HPA assembly
MNKIYYLKYCHITKSLIAVSELARRVTCKSHRRLSRRVILTSVAALSLSSAWPALSATVS
AEIPYQIFRDFAENKGQFTPGTTNISIYDKQGNLVGKLDKAPMADFSSATITTGSLPPGN
HTLYSPQYVVTAKHVSGSDTMSFGYAKNTYTAVGTNNNSGLDIKTRRLSKLVTEVAPAEV
SDIGAVSGAYQAGGRFTAFYRLGGGMQYVKDKNGNRTQVYTNGGFLVGGTVSALNSYNNG
QMITAQTGDIFNPANGPLANYLNMGDSGSPLFAYDSLQKKWVLIGVLSSGTNYGNNWVVT
TQDFLGQQPQNDFDKTIAYTSGEGVLQWKYDAANGTGTLTQGNTTWDMHGKKGNDLNAGK
NLLFTGNNGEVVLQNSVNQGAGYLQFAGDYRVSALNGQTWMGGGIITDKGTHVLWQVNGV
AGDNLHKTGEGTLTVNGTGVNAGGLKVGDGTVILNQQADADGKVQAFSSVGIASGRPTVV
LSDSQQVNPDNISWGYRGGRLELNGNNLTFTRLQAADYGAIITNNSEKKSTVTLDLQTLK
ASDINVPVNTVSIFGGRGAPGDLYYDSSTKQYFILKASSYSPFFSDLNNSSVWQNVGKDR
NKAIDTVKQQKIEASSQPYMYHGQLNGNMDVNIPQLSGKDVLALDGSVNLPEGSITKKSG
TLIFQGHPVIHAGTTTSSSQSDWETRQFTLEKLKLDAATFHLSRNGKMQGDINATNGSTV
ILGSSRVFTDRSDGTGNAVSSVEGSATATTVGDQSDYSGNVTLENKSSLQIMERFTGGIE
AYDSTVSVTSQNAVFDRVGSFVNSSLTLGKGAKLTAQSGIFSTGAVDVKENASLTLTGMP
SAQKQGYYSPVISTTEGINLEDKASFSVKNMGYLSSDIHAGTTAATINLGDSDADAGKTD
SPLFSSLMKGYNAVLRGSITGAQSTVNMINALWYSDGKSEAGALKAKGSRIELGDGKHFA
TLQVKELSADNTTFLMHTNNSWADQLNVTDKLSGSNNSVLVDFLNKPASEMSVTLITAPK
GSDEKTFTAGTQQIGFSNVTPVISTEKTNDATKWVLTGYQTTADAGASKAAKDFMASGYK
SFLTEVNNLNKRMGDLRDTQGDAGVWARIMNGTGSADGDYSDNYTHVQIGVDRKHELDGV
DLFTGALLTYTDSNASSHAFSGKTKSVGGGLYASALFNSGAYFDLIGKYLHHDNQHTANF
ASLGTKDYSSHSWYAGAEVGYRYHLTKESWVEPQIELVYGSVSGKAFSWEDRGMALSMKD
KDYNPLIGRTGVDVGRAFSGDDWKITARAGLGYQFDLLANGETVLQDASGEKRFEGEKDS
RMLMTVGMNAEIKDNMRLGLELEKSAFGKYNVDNAINANFRYVF