MLST from short read data

Our paper on a mapping-based approach to extracting MLST data from Illumina short reads was recently published in BMC Genomics. We used read mapping because this has greater sensitivity than approaches which rely on assembly, especially for low-coverage data sets of genomes with extreme GC content or other sequencing issues. The approach is called SRST (short read sequence typing), and code and usage instructions are available from srst.sourceforge.net.

However, it is obviously useful to be able to extract MLST info from genome assemblies too. For example, many finished or WGS genome sequences in NCBI do not have ST information attached to them, or it is hard to find. Also, for 454 and perhaps Ion Torrent data, it can be easier to deal with homopolymer issues at the assembly level by using newbler/gsAssembler and then working with contigs.

There is a web service available that is designed to do this, i.e. you can upload your genomes and choose a MLST scheme, and it will return the ST. It is described in this paper and available at this URL. However, unfortunately I have never been able to get the website to load in any of my web browsers, so I’ve not been able to try it. Also, it is a pain to have to upload large amounts of data over the web, and this becomes completely infeasible when dealing with lots of genomes, so instead I use a simple script to extract MLST info via blast, which runs locally on my laptop or cluster.

The script and a short readme containing usage instructions are available at: http://sourceforge.net/projects/srst/files/mlstBLAST/

I’m sure many people have written in-house scripts for this same task, but a few people have asked for mine recently and I figure it might save some others reinventing the wheel. The script simply uses BioPython to run a set of nucleotide blast searches in order to assign STs to genome assemblies. The inputs are just the latest set of allele sequences and profiles for the MLST scheme, and whatever genome assemblies you wish to determine STs for. The script will then determine the ST for each input genome, and if an exact match can’t be found, it will try to figure out the closest matching alleles and ST.

Happy sequence typing!