MLST from short read data

Our paper on a mapping-based approach to extracting MLST data from Illumina short reads was recently published in BMC Genomics. We used read mapping because it is more sensitive than approaches that rely on assembly, especially for low-coverage data sets of genomes with extreme GC content or other sequencing issues. The approach is called SRST (short read sequence typing), and code and usage instructions are available from srst.sourceforge.net.

However, it is obviously useful to be able to extract MLST info from genome assemblies too. For example, many finished or WGS genome sequences in NCBI do not have ST information attached to them, or it is hard to find. Also, for 454 and perhaps Ion Torrent data, it can be easier to deal with homopolymer issues at the assembly level by using newbler/gsAssembler and then working with contigs.

There is a web service designed to do this: you upload your genomes, choose an MLST scheme, and it returns the ST. It is described in this paper and available at this URL. Unfortunately, I have never been able to get the website to load in any of my web browsers, so I’ve not been able to try it. It is also a pain to upload large amounts of data over the web, and this becomes completely infeasible when dealing with lots of genomes. Instead, I use a simple script to extract MLST info via blast, which runs locally on my laptop or cluster.

The script and a short readme containing usage instructions are available at: http://sourceforge.net/projects/srst/files/mlstBLAST/

I’m sure many people have written in-house scripts for this same task, but a few people have asked for mine recently and I figure it might save some others reinventing the wheel. The script simply uses BioPython to run a set of nucleotide blast searches in order to assign STs to genome assemblies. The inputs are just the latest set of allele sequences and profiles for the MLST scheme, and whatever genome assemblies you wish to determine STs for. The script will then determine the ST for each input genome, and if an exact match can’t be found, it will try to figure out the closest matching alleles and ST.
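To make the logic concrete, here is a minimal sketch of the allele-calling and ST lookup steps. It is not the mlstBLAST script itself: exact substring matching (on both strands) stands in for the BLAST search, so it only handles perfect matches and does no closest-match reporting. The function names, the toy loci and the data structures are all made up for illustration.

```python
# Sketch only: exact substring matching stands in for the BLAST search,
# and all names below (functions, loci, data layout) are hypothetical.

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def call_alleles(contigs, alleles):
    """Find which known allele of each locus occurs exactly in the assembly.

    contigs: list of contig sequences (strings)
    alleles: {locus: {allele_number: allele_sequence}}
    Returns {locus: allele_number}, with None where no exact match exists.
    """
    forward = [c.upper() for c in contigs]
    # Search both strands, since a locus may sit on either one.
    targets = forward + [reverse_complement(c) for c in forward]
    calls = {}
    for locus, variants in alleles.items():
        calls[locus] = None
        for number, seq in variants.items():
            if any(seq.upper() in target for target in targets):
                calls[locus] = number
                break
    return calls


def lookup_st(calls, profiles):
    """Return the ST whose profile matches every allele call, else None.

    profiles: {st: {locus: allele_number}}
    """
    for st, profile in profiles.items():
        if all(calls.get(locus) == number for locus, number in profile.items()):
            return st
    return None


# Toy two-locus scheme, invented for this example.
toy_alleles = {
    "locA": {1: "ACGTAC", 2: "ACGTTC"},
    "locB": {1: "GGATCC", 2: "GGTTCA"},
}
toy_profiles = {10: {"locA": 1, "locB": 1}, 11: {"locA": 2, "locB": 2}}
toy_contigs = ["TTTACGTTCAAA", "CCCTGAACCGGG"]  # locB lies on the reverse strand

toy_calls = call_alleles(toy_contigs, toy_alleles)
print(toy_calls, lookup_st(toy_calls, toy_profiles))
```

In practice a similarity search such as BLAST is what lets you report the nearest allele when there is no perfect match, which this sketch cannot do.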

Happy sequence typing!

MLST of IncI blaCTX-M plasmid in German outbreak strain

Overnight I received an email from Scott Weissman at the Seattle Children’s Hospital. He has done some analysis of the IncI, blaCTX-M bearing plasmid from the outbreak strain using the plasmid MLST database. Here’s what he did:

To facilitate comparisons to other plasmids, I analyzed the LB226692 contigs in order to identify a plasmid Sequence Type (pST) for this outbreak strain’s IncI1 plasmid carrying CTX-M-15 and TEM-1. I extracted fragments for the 5 MLST loci (as described at http://pubmlst.org/plasmid/primers/incI1.shtml) from the GenBank contigs, and obtained allele assignments as follows: repI 3 | ardA 4 | trbA 6 | sogS 3 | pilL 3, which corresponds to pST31. (I should note that the extracted sequence for trbA contained a 1-nt “insertion” relative to reference allele 6, which I assume to be a sequencing artifact given that the indel occurs within a poly-T tract of 4 T’s, although a novel allele cannot be excluded.)

The database contains 15 plasmid entries for IncI1 pST31 (see below), including pEC_Bactec (described by Smet et al, PLoS One, 2011;5:e11202).  All of these entries carry the CTX-M-15 and TEM-1 enzymes, so there are no headlines here.

I would note, however, that this CTX-M-15 plasmid is distinct from the IncF-family plasmids that have been globally distributed by E. coli ST131 (eg, pC15a-1a, as described by Boyd et al, AAC 2004;48:3758-64) and detected subsequently in multiple Klebsiella pneumoniae clones (see Oteo et al, JAC, 2009;64:524-528).

IncI pST31 entries in the IncI pMLST database

To supplement this I had a quick look at the latest BGI assembly of TY2482, the other outbreak strain that has been sequenced. I found the same results, but this time with a precise trbA allele 6 (i.e. Scott was right in guessing this is an error in the Ion Torrent data at a homopolymeric tract).
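As an aside, a quick way to flag this kind of discrepancy is to collapse every homopolymer run to a single base and compare what is left. The sketch below is a rough heuristic of my own for illustration, not what Scott or I actually ran, and the function names are hypothetical. It will also flag genuine variation in homopolymer length, so treat a hit as "worth checking against another data set" rather than proof of a sequencing error.

```python
from itertools import groupby


def collapse_homopolymers(seq):
    """Reduce every run of identical bases to a single base."""
    return "".join(base for base, _ in groupby(seq.upper()))


def differs_only_in_homopolymers(query, reference):
    """True if the sequences differ, but only in the lengths of homopolymer runs."""
    if query.upper() == reference.upper():
        return False
    return collapse_homopolymers(query) == collapse_homopolymers(reference)


# Toy sequences (not the real trbA allele): one extra T in a run of four T's.
reference = "GACTTTTGCA"
query = "GACTTTTTGCA"
print(differs_only_in_homopolymers(query, reference))
```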

Re the table above, the paper describing the 2004 Shigella sonnei plasmid is here; I’m not sure whether the others are published.

This is what the eBURST diagram looks like for the data in the IncI plasmid MLST database. The German outbreak sequence type, pST31, is marked with a red arrow.

eBURST diagram for IncI plasmid MLST

The pMLST sequences from TY2482 (BGI assembly 2, June 6) are:

>repI1-3
GAGAGATGGCATGTACGGGCAGTAAGTCAGAAGACTGAAGATGCTCCGGAAGCCATAAAA
GGAAAACCCCCACTATCTTTCTTACGAACTTGGCGGAACGACGAA
>ardA-4
AATACAACTGTGGAAGCATCGCCGGACGCTGGTTTGACCTGACCACGTTTGATGATGAGC
GCGACTTTTTCGCCGCCTGCCGTGCTCTTCACCAGGATGAAGCCGATCCTGAACTGATGT
TTCAGGATTATGAGGGATTCCCGGGGAATATGGCCTCTGAATGCCATATCAACTGGGCCT
GGGTTGAAGGCTTCCGCCTGGCACGGGATGAAGGCTGCGAAGAGGCTTATCGTCTCTGGG
TGGAGGATACCGGTGAGACGGATTTTGACACCTTCCGCGATGCCTGGTGGGGCGAGGCTG
ACAGTGAGGAGGCTTTTGCGGTTGAGTTCGCCAGTGATACCGG
>trbA-6
GCAACCCGCCGCTCAGGCCGTTTGCCACCATGAAAGAGTTTTTCCGGATCACCATCTGCC
AGTACTGGGGCGATAGCAGGGGACAACGAGGCAAAGATGTGTGGCAGTCGGGTAATATCT
ACAGGTCTGCGGGTGAAACGGCTTTGTCCCGGGTGTTGATACCATTCCCATAAACACCAG
AGTGTCACAGGTAAAAGATACATCCACAGAATACCTATGGTCTGCTCCATGACGTTAATC
CACTGGCTATAGCTTATGTTGGCGGCGTTGTTACCGGTCATGGCAAGCAGGTTGTATCGT
GGTGCTGCATAGTTATGAAATGGTCCCCAGTCGACCAGTCCCCATAAGGTATGAAGAATC
AGGCAGCTGGCGTAAACCACTTCCGGCAAGAATAACCATATGACGAATAGCAGCAGGATA
AGAAGGACACCAACAGCGCCCCATATCTGCATAGGATCTTCTGCGACAGGCTGTCGATTG
TAAGA
>sogS-3
GTCGTCGTGGTTTCCGCTGAGGGCGTGGGATCACTGTTCTCATGCGCCTGTGAATCCGTT
TTTTTACGCGTAAAAAGGCCACGCGCTTTGTCGAGAAACGATGAAGTATTATCAGAAGGT
GATGTGCTCTGAACAGGTTGCTGCGGAGTGGGTTCATCCCGGACAGCCGGTTCTATAGTG
GCTGTTGTGGCCTGAAGTTCTGACTCATCTGCCTGAACGTGGCCTGTGTCCGGTTGCGAC
GGCATATCACTGTT
>pilL-3
TTGATGCCATGCTTTCGCATTTTGTTTCTTCTGCCCACTTAATAATGTTTTCCCTTAATG
TAGTGCCTGCCGGCGCACGCCACTCTTTACCCTGAGATACCGGTTTGACAGGTGTCCCGG
TCATGAGTGGGATAGACTTGACTGTAGAGCCGGTCGGAGTCGGGATTGCTGCGGGCGTAG
ACGGAGATACGCTGTTTCCCCTGAATGGGTTTCGTGGTTTGTTCTGGCTATTTGCTGTCG
TTGAAGACTCCGGGGAAGTGGATGTGGTTACCATGG