MLST from short read data

Our paper on a mapping-based approach to extracting MLST data from Illumina short reads was recently published in BMC Genomics. We used read mapping because this has greater sensitivity than approaches which rely on assembly, especially for low-coverage data sets of genomes with extreme GC content or other sequencing issues. The approach is called SRST (short read sequence typing), and code and usage instructions are available from srst.sourceforge.net.

However, it is obviously useful to be able to extract MLST info from genome assemblies too. For example, many finished or WGS genome sequences in NCBI do not have ST information attached to them, or it is hard to find. Also, for 454 and perhaps Ion Torrent data, it can be easier to deal with homopolymer issues at the assembly level by using newbler/gsAssembler and then working with contigs.

There is a web service designed to do this: you upload your genomes, choose an MLST scheme, and it returns the ST. It is described in this paper and available at this URL. Unfortunately, I have never been able to get the website to load in any of my web browsers, so I’ve not been able to try it. Uploading large amounts of data over the web is also a pain, and becomes completely infeasible when dealing with lots of genomes. Instead, I use a simple script to extract MLST info via blast, which runs locally on my laptop or cluster.

The script and a short readme containing usage instructions are available at: http://sourceforge.net/projects/srst/files/mlstBLAST/

I’m sure many people have written in-house scripts for this same task, but a few people have asked for mine recently and I figure it might save some others reinventing the wheel. The script simply uses BioPython to run a set of nucleotide blast searches in order to assign STs to genome assemblies. The inputs are just the latest set of allele sequences and profiles for the MLST scheme, and whatever genome assemblies you wish to determine STs for. The script will then determine the ST for each input genome, and if an exact match can’t be found, it will try to figure out the closest matching alleles and ST.
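To make the last step concrete, here is a minimal Python sketch of the ST lookup, assuming the blast searches have already produced one allele number per locus. This is not the code from the mlstBLAST script: the function name `assign_st`, the dictionary layout and the locus names in the usage note are all made up for illustration, and "closest" is simply taken to mean the profile with the fewest mismatching loci.

```python
def assign_st(allele_calls, profiles):
    """Return (st, n_mismatches) for the profile closest to the allele calls.

    allele_calls: dict mapping locus name -> allele number (None if no hit).
    profiles: dict mapping ST -> dict of locus name -> allele number.
    Both layouts are assumptions made for this sketch.
    """
    best_st, best_mismatches = None, None
    for st, profile in profiles.items():
        # Count loci where the called allele differs from this profile.
        mismatches = sum(
            1 for locus, allele in profile.items()
            if allele_calls.get(locus) != allele
        )
        # Keep the first profile with the fewest mismatches seen so far.
        if best_mismatches is None or mismatches < best_mismatches:
            best_st, best_mismatches = st, mismatches
    return best_st, best_mismatches
```

With hypothetical profiles such as `{1: {"aroC": 1, "dnaN": 1}, 2: {"aroC": 1, "dnaN": 2}}`, calling `assign_st({"aroC": 1, "dnaN": 2}, profiles)` gives an exact match to ST 2 with zero mismatches. A non-zero mismatch count flags a closest match rather than an exact ST.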

Happy sequence typing!

Workshop materials – phylogenetics and evolutionary analysis

Just came across this while looking for some info on character trait evolution:

2011 Bodega workshop – Phylogenetics wiki

It includes some great lecture slides and tutorials from a workshop on applied phylogenetics held in March 2011. The tutorials cover software packages like MrBayes, RAxML, R, BEAST and BayesTraits, and topics including model selection, divergence dating, discrete & continuous character evolution, and diversification rates.

There is also some good advice for aspiring phylogeneticists, including advice on becoming a programmer.

Clarifier: Bacterial populations and communities

Two areas where next-gen sequencing is making a big impact in the bacterial world are the analysis of ‘bacterial populations’ and ‘bacterial communities’. While these might sound similar, they are actually very different.

In common parlance we sometimes use ‘population’ and ‘community’ somewhat interchangeably in talking about groups of humans. We might say that in Melbourne, coffee drinking is common in the local population, or that it is common in the local community. What we mean is that it’s common among people living in Melbourne.

In biology, the term ‘population’ has a specific meaning – a group of individuals of the same species (i.e. able to interbreed; but note this concept is complex in bacteria), defined by time and space. So we can talk about the current human population of the Earth, or the human population of Melbourne 20 years ago. Note that this is intimately tied up with the concept of species, as separation into two distinct populations is a key step towards diverging into different species. On the other hand, ‘community’ refers more generally to the group of organisms inhabiting a particular ecological niche, which could include any number of species. So for example we could talk about the population of karri trees (a species of eucalyptus found in the south west of Western Australia), or the community of plants inhabiting the karri forest.

 

Bacterial populations

When we talk about bacterial populations, what we mean is investigating the population structure of a particular bacterial species or subtype. In theory this aims to understand the population in its entirety, but in practice it usually involves studying lots of individual members of the population and making inferences about the population as a whole. We can attempt to understand populations at different levels of localisation: for example, we can study a highly localised population, like the population of Salmonella Typhi inhabiting the gall bladder of a typhoid carrier, or more expansive populations of Salmonella Typhi circulating in a city, a country or around the globe.

Sequencing has been a great tool for understanding bacterial populations, by allowing lots of individual members of a population (i.e. individual bacterial isolates or colonies) to be compared at the sequence level. Sequence data is ideal for this, as the differences between individuals are often tiny (i.e. there is very little variation) since they belong to a single population, and DNA sequence data allows us to detect single nucleotide changes (i.e. it provides high resolution). Also, since we have well-developed models of sequence evolution (i.e. how nucleotide changes accumulate), sequence data can be interpreted using phylogenetic analysis. This really kicked off a decade ago with multi-locus sequence typing (MLST; see wikipedia entry(!) or Maiden et al, 1998 for more info) and is now expanding rapidly with the advent of sequencing platforms that allow whole genomes of hundreds of isolates to be sequenced (e.g. 96 bacterial isolates can be readily sequenced in a single run of the Illumina GAIIx or HiSeq, using multiplexing).
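As a rough illustration of what comparing isolates at the sequence level means, here is a minimal Python sketch that counts single nucleotide differences between pairs of isolates. It assumes the sequences are already aligned to the same length (for example, bases called against a common reference), and it skips any position where either isolate has something other than A, C, G or T. The function names and data layout are made up for this example, and a real analysis would go on to phylogenetic tools rather than stop at raw counts.

```python
from itertools import combinations

def snp_distance(seq_a, seq_b):
    """Count positions where two aligned sequences differ.

    Positions where either sequence has a non-ACGT character
    (e.g. N or a gap) are ignored rather than counted as SNPs.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    return sum(
        1 for a, b in zip(seq_a.upper(), seq_b.upper())
        if a != b and a in "ACGT" and b in "ACGT"
    )

def pairwise_snp_distances(alignment):
    """Return {(name_1, name_2): SNP count} for every pair of isolates.

    alignment: dict mapping isolate name -> aligned sequence string.
    """
    return {
        (x, y): snp_distance(alignment[x], alignment[y])
        for x, y in combinations(sorted(alignment), 2)
    }
```

Closely related isolates from a single population will typically give small counts here, which is exactly why single-nucleotide resolution matters.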

This kind of analysis can be used in public health microbiology and infectious disease epidemiology to trace outbreaks or transmission (sometimes called molecular epidemiology or genomic epidemiology). It can also be used to study the evolution of drug resistance or pathogenesis/virulence in bacterial populations (microevolution, since it is occurring within populations), or the impact of a novel vaccine or drug on a given bacterial population, all of which can be useful for designing and monitoring public health interventions or making treatment recommendations.

A great recent example is the study by Nick Croucher (an immensely talented PhD student) from the Sanger Institute, and numerous collaborators, who compared the genomes of 240 Streptococcus pneumoniae isolates of the PMEN1 subtype, collected from all over the world since 1984. By comparing the genomes of these isolates, they found evidence of frequent homologous recombination with other S. pneumoniae, including exchange of genes encoding the capsule targeted by vaccination and acquisition of drug resistance genes. Assuming the sequenced isolates are reasonably representative of the global population of S. pneumoniae PMEN1, this indicates that the PMEN1 population is not isolated from the rest of the S. pneumoniae population but that there is constant gene exchange within and between S. pneumoniae groups, allowing the bacteria to escape the effects of human interventions including vaccine-induced immunity and exposure to antimicrobial drugs. We already ‘knew’ this could happen in bacterial pathogen populations, but this study provides direct evidence of it occurring in response to a specific vaccine and specific drugs used for treatment. See the pubmed entry; unfortunately you need access to Science magazine to read the article.

 

Bacterial communities

On the other hand, when we talk about bacterial communities, what we mean is investigating the communities of bacteria present in a given sample. This is akin to walking through the forest and taking note of each plant you see, and the analysis methods borrow heavily from ecology. Studies of bacterial communities are being done in just about every kind of sample in which you would expect to find bacteria – from environmental samples (e.g. underwater caves; windscreen splatter) to human body sites (faeces or the gut; skin; nasal passages; read more at the Human Microbiome Project site).

The analysis usually focuses on determining which bacterial taxa (e.g. a genus, species or subgroup) were present in each sample, and their relative abundance. These can be compared across samples to identify taxa that are only present in certain kinds of environments, or whose presence is associated with another property of the sample (e.g. presence in the nose may be associated with development of otitis media). Communities can be examined more holistically to identify broad differences in the bacterial community structures associated with different samples.
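The relative abundance part is simple arithmetic: for each sample, divide the number of reads assigned to each taxon by the total number of assigned reads in that sample. A minimal Python sketch, assuming the reads have already been assigned to taxa by some upstream tool (the function name and the input layout are made up for this example):

```python
def relative_abundance(counts):
    """Convert per-taxon read counts for one sample into proportions.

    counts: dict mapping taxon name -> number of reads assigned to it.
    Returns a dict mapping taxon name -> fraction of all assigned reads.
    """
    total = sum(counts.values())
    if total == 0:
        # No reads assigned: report zero for every taxon instead of dividing by zero.
        return {taxon: 0.0 for taxon in counts}
    return {taxon: n / total for taxon, n in counts.items()}
```

Working with proportions rather than raw counts is what makes samples with different sequencing depths comparable, although it only tells you about abundance relative to the other taxa in the same sample.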

Sequencing has dramatically improved the ease with which bacterial communities can be studied, via sequencing of DNA extracted from a given sample (e.g. a soil sample; a fecal sample). Two approaches are possible – sequence the raw DNA extract or amplify a conserved bacterial gene (using PCR) and sequence that. The first is true ‘metagenomics’, as you are sequencing all of the genomes present in the original sample, but this takes a lot of sequencing effort and you may not need or want to know every single gene present in the sample. At the moment, Illumina platforms are most appropriate for this application as they have the highest throughput; however, their short read lengths (max 250 bp using paired end sequencing) make assembly and analysis difficult. The second approach, which usually targets the conserved 16S ribosomal RNA gene (‘16S sequencing’), is a more tractable way of determining what species/subgroups of bacteria are present in the sample and estimating their relative abundance. Multiplexing can be achieved by incorporating sample-specific barcodes into the amplicons during PCR, allowing hundreds of samples to be analysed in a single run of the 454 (for longer reads) or Illumina platforms (shorter reads but greater depth).
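To show how barcoded multiplexing works on the analysis side, here is a minimal Python sketch that sorts reads back into samples. It makes several simplifying assumptions that real pipelines do not: the barcode sits at the very start of each read, all barcodes are the same length, and only exact matches are accepted (no error correction). The function name, the "unassigned" label and the barcodes in the usage note are made up for illustration.

```python
from collections import defaultdict

def demultiplex(reads, barcode_to_sample):
    """Group reads by sample using a barcode at the start of each read.

    reads: iterable of read sequences (strings).
    barcode_to_sample: dict mapping barcode sequence -> sample name;
        all barcodes are assumed to have the same length.
    Returns a dict mapping sample name -> list of reads with the barcode
    trimmed off; reads with no matching barcode go under "unassigned".
    """
    if not barcode_to_sample:
        raise ValueError("at least one barcode is required")
    barcode_len = len(next(iter(barcode_to_sample)))
    by_sample = defaultdict(list)
    for read in reads:
        sample = barcode_to_sample.get(read[:barcode_len])
        if sample is None:
            by_sample["unassigned"].append(read)
        else:
            by_sample[sample].append(read[barcode_len:])
    return dict(by_sample)
```

For example, with barcodes `{"ACGT": "sample1", "TGCA": "sample2"}`, the read `"ACGTGGGG"` is assigned to sample1 as `"GGGG"`, while a read starting with any other four bases ends up under "unassigned".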

 

What I do in these areas

Most of my work is in bacterial population genomics, using whole genome sequencing and/or SNP typing to study populations of Salmonella Typhi (typhoid fever), Shigella sonnei (dysentery), Klebsiella pneumoniae (wide range of infections) and other bacterial pathogens of humans. Some things we’ve found using this approach are:

  • that Typhi is undergoing genome degradation, by accumulating mutations that inactivate or delete genes, but not gaining novel genes [full text];
  • that lots of different Typhi types co-circulate in localised areas, without much evidence for direct competition, recombination or replacement of old types by new ones [e.g. studies in Jakarta, Nairobi, Kathmandu, Mekong Delta]… but a single subtype (which we call H58) appears to have swept the globe in the last 1-2 decades;
  • that the global population of Shigella sonnei is actually made up of at least three quite distinct groups, with different genetic properties.

To do these studies I’ve used Illumina/Solexa (mostly multiplexing 12 isolates per lane; done by Sanger or AGRF) and 454 (not since the original Typhi paper; done at Sanger) and, when we get it up and running, I’ll try out my department’s new Ion Torrent too (although throughput is currently too low to justify switching from Illumina + multiplexing). I have also done a lot of the Typhi work using SNP typing via Illumina GoldenGate arrays or Sequenom (both at Sanger).

I do the analysis using tools like bwa & samtools for read mapping and SNP calling, Velvet for genome assembly, MUMmer, blastn and ACT for comparing contigs, and programs like RAxML and BEAST for phylogenetic analysis. I use Python to write scripts/pipelines to stitch it all together and run efficiently on VLSCI.

I’ve recently begun using 16S amplicon sequencing (454 Titanium; 2 gasket-separated regions each with 50 barcoded samples to achieve 100-plex per run; at the Ramaciotti Centre in NSW) to investigate bacterial communities in nasal samples from babies. These samples were collected as part of a large cohort study looking at the development of asthma and allergy during childhood, long before it was feasible to look at bacterial colonization in this way. Luckily the study’s designers had the foresight (and funding support) to collect and keep the samples, in the hope that novel technologies and avenues of research would open up down the track. It’s exciting to be able to work with such a well-characterised cohort.

For this analysis, I’ve been relying mainly on the QIIME virtual box, but will branch out and try some other methods when I get the next lot of data.