MLST from short read data

Our paper on a mapping-based approach to extracting MLST data from Illumina short reads was recently published in BMC Genomics. We used read mapping because this has greater sensitivity than approaches which rely on assembly, especially for low-coverage data sets of genomes with extreme GC content or other sequencing issues. The approach is called SRST (short read sequence typing), and code and usage instructions are available from

However, it is obviously useful to be able to extract MLST info from genome assemblies too. For example, many finished or WGS genome sequences in NCBI do not have ST information attached to them, or it is hard to find. Also, for 454 and perhaps Ion Torrent data, it can be easier to deal with homopolymer issues at the assembly level by using newbler/gsAssembler and then working with contigs.

There is a web service available that is designed to do this, i.e. you can upload your genomes and choose a MLST scheme, and it will return the ST. It is described in this paper and available at this URL. However, unfortunately I have never been able to get the website to load in any of my web browsers, so I’ve not been able to try it. Also, it is a pain to have to upload large amounts of data over the web, and this becomes completely infeasible when dealing with lots of genomes, so instead I use a simple script to extract MLST info via blast, which runs locally on my laptop or cluster.

The script and a short readme containing usage instructions are available at:

I’m sure many people have written in-house scripts for this same task, but a few people have asked for mine recently and I figure it might save some others reinventing the wheel. The script simply uses BioPython to run a set of nucleotide blast searches in order to assign STs to genome assemblies. The inputs are just the latest set of allele sequences and profiles for the MLST scheme, and whatever genome assemblies you wish to determine STs for. The script will then determine the ST for each input genome, and if an exact match can’t be found, it will try to figure out the closest matching alleles and ST.

Happy sequence typing!

Typhoid in Kathmandu and Open Biology OA journal

And now for some shameless self-promotion…

A paper I’ve been working on for a few years on typhoid in Kathmandu yesterday had the honour of being the first paper ever published by the new open access journal of the Royal Society, Open Biology. I’m very keen on open access publishing and always try to submit to OA journals, but there is still a limited choice of truly OA journals. I love PLoS and BMC and submit to both regularly, but I think it’s really important to have a diverse range of OA journals – and therefore diversity in editors, editorial policies & styles, subject areas, etc – to make open access work for everyone.

So I’m excited to be have a paper in the new Open Biology, who publish under a Creative Commons 3 license (reuse/modify/distribute with attribution). Only time will tell how well the journal does, but it will only become great if us scientists are willing to submit good manuscripts. One incentive to do this is that Open Biology aims for a quick turn-around time of 4 weeks from submission to decision. Much as I love PLoS and BMC, they’ve never managed anywhere near that sort of turn-around. For info on Open Biology, see their ‘About’ page

So what is the paper? Thanks to OA, I can reproduce it here… (or you can read online or PDF)

Combined high-resolution genotyping and geospatial analysis reveals modes of endemic urban typhoid fever transmission

Stephen Baker1,2,*,, Kathryn E. Holt3,4,, Archie C. A. Clements5, Abhilasha Karkey2, Amit Arjyal2, Maciej F. Boni1,6, Sabina Dongol2, Naomi Hammond4, Samir Koirala2, Pham Thanh Duy1, Tran Vu Thieu Nga1, James I. Campbell1, Christiane Dolecek1,2, Buddha Basnyat2, Gordon Dougan4 and Jeremy J. Farrar1,2

Open Biol October 2011 1:110008; doi:10.1098/rsob.110008.

Basically, it uses genotyping and GPS to study typhoid fever in Kathmandu, Nepal. We examined 4-years worth of typhoid cases and looked at where the patients lived within the city (using GPS) and subtyped the bacteria responsible for their infections using high throughput SNP typing.

Firstly, we found that about 3/4 of the patients were infected with Salmonella Typhi and 1/4 were infected with Salmonella Paratyphi A. If you aren’t familiar with Salmonella, these are two serotypes of Salmonella enterica which, rather than causing gastrointestinal disease (ie food poisoning) like most Salmonella serotypes, cause the systemic infection known as typhoid. Typhi and Paratyphi A are quite different genetically, but have undergone convergent evolution to cause the same disease syndrome (see earlier paper in BMC Genomics).

Temporal distribution of Typhi (red) and Paratyphi A (blue) cases

Then we looked at the spatial distribution of the patients homes, and found that they were clustered in specific “hotspot” areas of Kathmandu:

Spatial risk model for Typhi infection (see paper for separate map for Paratyphi A risk)

Contrary to expectation, these hotspots weren’t the most densely populated areas…you might expect more people = more cases, but this wasn’t the case. Some complicated spatial statistics, done by Archie Clements at University of Queensland, confirmed that the hostpots weren’t associated with population density or hospital referral patterns, but were in low-elevation areas local people source their water from stone waterspouts.

Spatial distribution of Typhi cases, and location of water spouts

To see if the waterspouts could really be a source of typhoid transmission, we tested water samples for the presence of Typhi or Paratyphi A using culture and PCR. Culturing didn’t work, but it is notoriously difficult to culture Typhi from water samples that are not pre-enriched for bacteria…however PCR (using this method we published earlier in BMC Infectious Diseases) detected Typhi in 3/4 of water samples and Paratyphi A in 2/3.

Stone water spouts in Kathmandu (taken by co-author Stephen Baker)

We also looked at the population of bacteria causing the typhoid fever. We examined Salmonella Typhi isolated from the blood of typhoid fever patients, and used SNP typing to analyse the Typhi DNA and examine the population structure. We typed 113 SNPs (single nucleotide polymorphisms, ie point mutations) that we already knew about from previous variation discovery efforts. About 2/3 of isolates had the same haplotype, so to discriminate further within this local subgroup we sequenced 40 of the Typhi to identify novel SNPs arising in the local population (local microevolution) and typed these SNPs as well. Most of the Typhi belonged to the H58 lineage, which is common in other typhoid endemic zones we’ve looked at previously (Mekong Delta Vietnam – Holt & Dolecek 2011, PLoS NTD [OA]; Nairobi, Kenya – Kariuki 2010, J Clin Micro [free]; globally – Holt & Phan 2011, PLoS NTD [OA], Roumagnac 2006, Science [free in PMC]).

Typhi tree, red bars indicate frequency of genotypes in Kathmandu collection; red zones are H58 lineage and H58G sublineage

As the map above shows, the different Typhi genotypes were distributed randomly, with no spatial or temporal clustering. The exception was a probable outbreak in the west of the city, outside the hotspot zone, where 28 cases of infection with the same Typhi genotype were recorded in a two-month period – see yellow shaded area in map above, and zoomed in below:

Localised outbreak of Typhi genotype H58G-b4

Finally, we looked at what was happening in households from which multiple typhoid infections were studied. You might expect these household disease clusters to represent shared infections, which are transmitted between members of the household. However in most of these household typhoid clusters, the cases were caused by different organisms – either Paratyphi in one case and Typhi in the other, or multiple cases caused by different genotypes of Typhi. Cases with the same causative Typhi genotype are linked with dashed lines in this figure, you can see they are the exception rather than the rule:

Distribution of typhoid-causing bacteria in households with multiple typhoid cases

So, all in all we found typhoid fever infections clustered in spatial hotspots within Kathmandu, and that this clustering was explained not by population density but by low elevation and proximity to stone water spouts which are used to supply water. This implicates the water spouts in typhoid transmission via dissemination of Typhi and Paratyphi A around the city, supported by the detection of Typhi and Paratyphi A in the majority of water samples taken from these spouts. The diversity of Typhi genotypes we detected indicates that transmission occurs via water that is contaminated with a diverse population of Typhi, rather than point source outbreaks (with the exception of one outbreak, which actually occurred outside the hotspot zone). The diversity of Typhi genotypes within households suggests that this sort of transmission – ie dissemination via contamination of the water supply – contributes more to the overall typhoid burden than direct person-to-person transmission.

How does this contamination happen? Well, it is possible for people to carry Typhi and Paratyphi A in the gall bladder, without ever noticing an infection…for example, Typhoid Mary was a famous carrier of Typhi. Carriers shed the bacteria in their feces, so any food or water contaminated with their fecal material becomes a vehicle for typhoid transmission. Our study suggests that there are many Typhi and Paratyphi A carriers in Kathmandu who are unknowingly shedding the bacteria, so that whenever sewage seeps into the groundwater that feeds the stone water spout, the water becomes contaminated and can pass on the infection to those who drink the water. Most of the typhoid cases occur in the monsoon season, when flooding is likely to promote seepage of sewage into the underground aquifers that supply the water spouts. Hence the study suggests that endemic typhoid in Kathmandu is essentially a question of water infrastructure, and could potentially be dramatically reduced by supplying clean drinking water to people living in these few hotspot areas.

Genomics Workshop – Melbourne

I’m helping to organise a genomics workshop in Melbourne, designed to accommodate student and postdoc researchers with a background in either biology or computational/technical sciences. It will run twice, on June 30 & July 19. It’s filling up fast so if you’re local and interested, register quick.


VLSCI Workshop. Life Sciences meet Technical Sciences: Strings attached to ‘omics.

Program overview Download workshop details Register

  • Date: A one-day workshop held on both Thursday 30 June and Tuesday 19 July 2011
  • Location: Parkville
  • Maximum 16 participants per workshop; attendance is free.
  • Participants will be researchers at graduate and post-graduate level.

This workshop is designed for biologists, computer scientists, mathematicians and engineers:

  • to learn about ‘omics research
  • to establish some lines of communication across the disciplines
  • to look for synergies through active learning and application exercises.

Cutting edge scientific research demands interdisciplinary collaboration. This is especially obvious in the Life Sciences, where computational techniques are revolutionising the way many problems are tackled. However, establishing links between the Life Sciences and the Technical Sciences remains difficult because of communication barriers. This workshop aims to break down those barriers.  Participants will exchange ideas, enjoy some ‘cultural exchanges’ and look for potential for joint research projects. With a focus on ‘omics research, ie. sequencing technology and analysis, we plan to create a group of like-minded researchers who are interested in interdisciplinary exchange and possible research synergies.

The morning program covers sequencing technology and analysis and learning about some of the challenges in faced by Life Scientists involved in ‘omics research. The afternoon will shift focus on to learning about how Computer Scientists approach programming. Participants will work in teams of two and in small groups to foster discussion and communication.

Prerequisite: Participants need to have a laptop with a wireless connection. A console needs to be installed (Windows: Putty; Mac: Terminal).

Max. participants: 16 (8 from Life Sciences and 8 from Computer Science). If a workshop is oversubscribed, participants are chosen at random and alternate dates for a follow on workshop will be offered to meet the demand.

German E. coli – phage analysis by Nico Petty

Nico Petty from University of Queensland has done some additional analysis of the prophage which she’s asked me to post here. Thanks Nico!

The stx phage and others in the O104 outbreak strain

Following on from our finding earlier in the week, that the O104 outbreak strain has acquired a phage that carries the stx2A and stx2B Shiga toxin genes, and Kat’s finding that only part of the stx phage in Sakai was present in the German outbreak strain, I’ve done a bit more analysis of the phage.

Using Kat’s ordering of the contigs of the latest BGI assembly (6/6 Illumina + Ion Torrent), a blast (see below) against the stx phages in two EHEC O157:H7 strains – Sp5 in the Sakai genome and 933W in the EDL933 genome shows that, even though the region is in lots of small contigs in the O104 genome (middle, alternating orange and brown), there is some similarity from the start to the end of the stx phages. However, there is clearly much less similarity in the larger contigs in the region to the left of the stx genes (highlighted in red).

Ordered82 contigs (middle) vs EHEC O157:H7 prophage: Sp5 from Sakai genome (top) and 933W in EDL993 genome (bottom)

This could be a simple case of misassembly as this is just an early draft genome and was ordered against the closely related EAEC 55989 genome, which doesn’t have this phage. I had a look through the rest of the O104 genome and found that there are contigs elsewhere in the genome assembly which have similarity to the to the left hand side of the O157 stx phages. I reordered the contigs to replace these contigs at the left hand side of the stx phage in O104 (see below) and a blast against Sp5 and 933W as before showed a little more similarity, particularly with phage 933W (bottom). These contigs (13 and 492) also result in a phage of similar size to Sp5 and EDL933, which makes it a more likely fit. However, although contig 492 does encode phage-related genes, they are still quite different from those in the syntenic region of the related EHEC phages. The reason for this region of difference in the stx phage (other than missassemly) could be that phage genomes are chimeras, they consist of different genetic modules, acquired from different ancestors.

TY2482 contigs (middle) reordered against Sp5 from Sakai genome (top) and 933W in EDL993 genome (bottom)

The genome of E. coli is highly repetitive and full of repeat sequences which contribute to gene flux – both through the acquisition of new genetic material, e.g. phages and antibiotic resistance genes as we have seen in this outbreak strain, and also through recombination. This is particularly the case for the lambdoid prophages of E. coli, which are highly related to each other and often the source of recombination – swapping phage modules amongst themselves and also contributing to rearrangements in the bacterial chromosome through recombination between highly repetitive sequences.

As in other E. coli strains, there are several lambdoid prophages in the genome of this outbreak strain (the Stx phage is one of these). Due to their repetitive nature, it can be difficult to distinguish which prophage regions belong to which prophage, making assembly of these regions in bacterial genomes very difficult. This is certainly the case in the German outbreak strain as most of the prophage regions are in several contigs. However, given the relative novelty of a Shiga toxin producing EAEC, I suspect that this stx phage was only a recent acquisition (comparison with genome sequence of the 2001 German O104 strain when it is available will give us more clues about the evolution of this outbreak strain) and is likely to be intact.

The other phage region not in EAEC 55989

The other phage that Kat mentioned (a set of 20 phage genes on a single contig not found in 55989) is less than half the size required to be an intact prophage by itself, and is either a prophage remnant or could be part of another phage in the O104 genome. A blast shows this prophage region is similar to an intact prophage (highlighted in green in the picture) in the genome of UPEC UTI89 (see below), and also ExPEC S88. It is possible that a similar, intact phage might be present in the O104 genome as lots of the small contigs at the end of the ordered contigs match the rest of the UTI889 prophage, but it is impossible to tell with the current assembly.

TY2482 phage remnant (top; contig 17) vs UPEC UTI89 prophage (bottom; highlighted in green)

More sequence data, particularly paired-end sequencing and longer reads than we have in the current assembly are really needed to resolve these prophage regions and work out how many phages there are in the genome and which bit of genome belongs to which phage. Once the phages are assembled properly, we will be able to determine if the phages are intact and encode all the genes necessary to produce functional virions, we will also be able to determine if they carry any other virulence genes in addition to the stx2A and stx2B genes.

Nico Petty


Typing of German E. coli

I know a lot of people will have seen this from other sources, but the people at Am Institut für Hygiene at Münster University have some very useful strain typing data on HUSEC publicly available here. You can see that there is an entry of HUSEC of serotype O104:H4, ST678 (same as the current outbreak), strain 01-09591 from 2001, carrying stx2 & terE but not saa; this matches the current HUSEC strains. On June 6, Münster university announced they had sequenced this older strain, however no data or analysis has been released yet.

Münster have also released a multiplex PCR for typing the outbreak strain, which incorporates primers for stx2, terD, rfbO (for O104) and flicH4 (for H4). So this will capture the presence of Shiga toxin, tellurite resistance, and O104:H4 serotype. Seems pretty comprehensive, although I’m not sure how useful the detection of tellurite resistance is since it seems to be quite widely distributed in enterohemorrhagic E. coli (see my tree of ter operon here).

Yesterday, BGI released their own typing scheme, targeting stx2 and what they call AFF_I-2 (primers F-CAGTCAGGAAAGTGGGAAAGC, R-AGGTCTGTATAGTGCCGTCTG). I’m pretty sure this is a typo and they mean AAF/I, i.e. aggregative adhesive fimbriae type I. Their primers target the aggC gene of the AAF/I gene cluster, see EMBL file showing off agg operon and primer binding sites here.

This is exactly what I suggested a few days ago based on the detection of stx2 and AAF/I fimbrial cluster (agg operon) in the outbreak strain (as opposed to the AAF/II operon found in Ec55989). The rationale for me was that this combination tells you that you have an enteroaggregative E. coli (and a specific type of aggregative fimbriae) in combination with Shiga toxin, which tells you that it is not a typical EHEC, and is more informative than any old EAEC + shiga toxin. I don’t have access to databases which would show how rare this combination is, but given that the AAF/I cluster is not present in any GenBank genomes, I would hazard a guess it is at least somewhat rare.

So it seems to me the best option would be a combination of the two schemes – take the Münster scheme and replace terD primers with aggC primers, to end up with stx2, aggC, rfbO and flicH4. This will give a definitive diagnosis of serotype O104:H4 enteroaggregative E. coli (AAF/I type) producing Shiga-toxin.

Phage analysis of German E. coli outbreak genome

I’ve given up trying to call this EHEC, EAEC, STEC or HUSEC because to me it seems more confusing than hepful! But we are talking about the E. coli responsible for an outbreak of HUS in Germany.

Now that BGI has released a more complete assembly (using HiSeq data, now in 513 contigs), it is worthwhile attempting some analysis of phage sequences in the genome.

First I ordered the BGI contigs contigs against a composite reference sequence including the Ec55989 chromosome and plasmid, the other two plasmids I discussed here and the VT2 (shiga toxin) prophage as discussed here and here. I did this using Mauve, the alignment & ordered contigs are available here and this is how it looks in Mauve (compound reference on top, Ty2482 contigs from BGI on bottom):

I then submitted the contigs to Prophage Finder, which uses BLASTX to compare contigs to predicted prophage sequences. The raw output is here. I then used a combination of manual processing in text editors and Excel, and a table-to-embl format conversion script I wrote some time ago, to generate an annotation of prophage sequences in the BGI contigs. You can see it as an Excel table, or open this EMBL file in Artemis (it contains the ordered contig sequences plus the phage annotations).

Basically, it shows that the outbreak strain contains several prophage, most of which it shares with Ec55989. However it has acquired two additional prophages…

Stx phage

One of course is the Shiga-toxin producing phage, in which the Prophage Finder annotation found 64 genes in 46 kbp worth of contigs. This is shorter than most published Stx phage, and is consistent with my earlier findings (circular map; ACT diagram) that only part of the 90 kbp VT2-Sakai Shiga toxin phage was present in the German genome. Perhaps someone with expertise in Stx phage will know why this is.

The phylogenetic tree of NCBI blastn hits to these contigs (using ‘Distance tree of results’ option, Fast Minimum Evolution and other default settings) looks like this. No obvious source revealed here.

A second phage?

In addition to this there is a set of 20 phage genes in a single contig of ~14.1 kbp (BGI ), which is not present Ec55989. The phage is most similar to one present in the E. coli UT189 genome (accession CP000243), with 100% identity across ~80% of the sequence. I don’t know if this is a complete phage or if it could be part of the Stx2 phage, perhaps someone who knows how to ‘read’ phage genomes can check if it has all the necessary bits to be its own phage! Download the sequence and annotation here.

Phylogenetic tree for this phage, based on blast hits in NCBI:

Keep up to date on crowd-sourced analysis at the github site here.

New German STEC/EHEC data from BGI

June 13. BGI has now formally released their data, including Illumina reads, under Creative Commons 0 (CC0) license. This is the most open license possible, and includes this statement:

The person who associated a work with this deed has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.

They have set up this page with links to, and details of, all the data available. I understand HPA will do something similar.

This is an important step for crowdsourcing! In particular, it clarifies the position that this data is released for the benefit of the public & public health, and there are no restrictions on publication of data and analysis. This is important. The genomics community has long been used to pre-publication data release of ‘community resource’ genomics data (begun during the Human Genome Project), on the understanding (in broad terms) that the research community would not attempt to publish analysis of the released data until the data generators had completed their data generating and had a chance to publish their own analysis. In short, releasing data does not imply that others have the right to publish analysis of it. But the CC0 license does.

I’ve been asked by a few people about this, over the past week. There is understandably some discomfort about people publicly sharing analysis of data that they did not generate. My position has been that the data generators released their data freely for public analysis (with public health as the driver), and this is very different from a research project where they have released the data with the expectation that no-one will ‘publish’ anything before they themselves have. But it is important to formalise this…. I guess it just takes time for the legalities to catch up.

For the crowd-sourcers part, all the analysis posted at the GitHub E. coli O104:H4 Genome Analysis Crowdsourcing site will be available under the CC0 license too, this is being formalised today thanks to the folks at ERA7.


So BGI have released a new assembly of TY2482 [here], which they say includes 200x of data from HiSeq. I’m concerned that they’ve not released any reads this time, and still no information on how they are doing the assembly. How long are the reads? Were they paired end? What insert size? Why only 200x coverage from a HiSeq run…even with short reads (35 bp) single end I’d expect a lot more than that…

The new assembly includes more than 200x single-end reads from the Illumina HighSeq Platform, which allowed BGI to provide a more complete genome map and to correct any assembly errors from the previous version. More importantly, this version is a completely de novo assembly, whereas the previous versions by BGI and others used a reference-based assembly method to obtain a consensus sequence. The new assembly continues to support the finding that this infectious strain carries disease-causing genes from two types of pathogenic E. coli: enteroaggregative E. coli(EAEC) and enterohemorrhagic E. coli (EHEC).

To properly analyse this, the reads are really important, so why hasn’t BGI released them? For crowd sourcing analysis to work (and be worth the crowd’s effort) the full data needs to be public. BGI released the reads from their first 7 Ion Torrent runs plus an assembly, which was really useful. A lot of the annotation and analysis effort has been based around Nick Loman’s assembly of these reads using MIRA, and ERA7’s annotation of that assembly [see collated analyses at github]. In comparison, the BGI assembly has been drawn on less by analysts, because there is no detail about how it was created. So releasing reads worked for BGI first time around…why not now?

Life Tech released only an assembly and has still not released reads (Ion torrent) several days later. Similarly, their assembly has not been publicly annotated or the focus of concerted analysis efforts, because it is dependent on an assembly that we don’t know much about.

So I guess the point is, crowd-sourcing analysis relies on transparent data release. If we can’t see the raw data, we’re less likely to trust an assembly and proceed with analysis.

P.S. The press release from BGI implies that all prior assemblies were reference based and not de novo. However the MIRA assembly of runs 1-5 of BGI reads, which has been annotated by ERA7 and analysed extensively, was a de novo assembly.

Update: Ion Torrent reads now available for LB226692, from Life Technologies