Replacing the blog with a new lab website

This will be the last post on this site, just to say that I am replacing this blog with a new lab website –http://holtlab.net/… Partly because now that I have my own group, I need a website for the whole lab. And partly because I haven’t had time to post much here for the last couple of years anyway, as things have got busier and I have started using twitter as a quicker/briefer way of sharing things.

I am leaving this blog in place, as I think it’s an interesting record of where things were at in 2011-12, especially around the impact of the first desktop sequencers and the crowdsourced genomic analysis of the infamous E. coli outbreak in Germany.

I have migrated a few of my favourite posts from this blog to the new site, however all new posts will go on the new lab website.

Cheers

Kat

SRST2 for short read sequence typing of bacterial pathogens – behind the paper

I’m pleased to say our paper SRST2: Rapid genomic surveillance for public health and hospital microbiology labs is now out in the Genome Medicine/Genome Biology special issue Genomics of Infectious Diseases!

Having posted the code to GitHub over a year ago, and the paper to BioRxiv in June, it is great to have it finally published. I must say I love the BioRxiv experience, it is great to know the paper was viewed >1500 times while it was stuck in the peer review/publication process.

This project has been a fun journey, which started when I was lucky enough to get a fantastic summer intern, Harriet Dashnow (follow: @hdashnow), funded by VSLCI (our wonderful local supercompute cluster). Harriet’s job was to figure out how to apply our SRST code, which was designed to extract MLST information from Illumina reads (see code and BMC Genomics paper), to detect and type resistance genes and alleles. She did a fantastic job, but in doing so we realised that we needed to overhaul the scoring system to something that was faster and more robust. So, Mike Inouye designed a new scoring approach from the ground up, which Harriet and Bernie Pope from VLSCI turned into the new Python code that is SRST2. We then did a whole lot of testing on public data (mainly by Lesley Raven and Mark Schultz from the Holt & Inouye labs, with some guidance from Justin Zobel in the Computing & Info Systems department) and had our friends at the local reference lab, the Microbiological Diagnostic Unit (mainly Takehiro Tomita), test it out for themselves on a load of Listeria monocytogenes they had recently sequenced on their MiSeq.

What we came up with was some code that draws on the fundamental tools of the genomics trade (bowtie2, samtools & python) to rapidly and reliably detect genes and alleles direct from short read data. SRST2 has two modes, (i) MLST, where you can provide MLST alleles and profile definitions, and it will report sequence types; and (ii) gene detection and allele calling, suitable for screening for acquired genes such as resistance genes, virulence genes and/or plasmid genes.

Figure 1 from the paper outlines how SRST2 works:

The coolest thing about SRST2 is that it is consistent and reliable, even at low read depths, which is absolutely critical for use in a routine diagnostic/surveillance workflow. The robustness of SRST2 stems from the fact that we use a mapping approach coupled with a pretty clever approach to scoring potential matches.

The alternative is to generate de novo assemblies, and then interrogate them with BLAST or other tools… the problem with that approach is that, while de novo assemblers like Velvet and SPAdes can do a great a job of assembling bacterial genomes, there is no way to guarantee you will get a good assembly with every read set. Anyone who works with biggish bacterial data sets of 100s-1000s of genomes knows – it is very hard to standardize the assembly process and establish a protocol that works consistently and reliably every time. And if you are needing to type loads of strains as part of routine service or investigations, you need to know you will get a reliable result every time. Figure 3 in the paper compares the overall performance (call rate x accuracy) of SRST2 vs BLAST analysis of highly optimised assemblies:

As you can see, once you get up to 15-20x read depth, SRST2 will consistently give you an answer, and that answer will be accurate. Whereas using assemblies – even highly optimised ones – you will always miss out on 1-5% of the answers.

We were also asked by a reviewer to compare SRST2 with the only other publicly-available program designed for MLST typing from short reads: the MLST Typer web service at the Danish Center for Genomic Epidemiology. Because this requires uploading read sets to the other side of the world, we only did this for one of our test data sets, 44 Salmonella genomes. The results are in Table 5, and showed that the website got the correct ST for only 14/44 genomes. SRST2 got correct STs for 43/44 genomes (for the 44th strain, SRST2 returned 6/7 alleles correctly and suggested that there is actually a SNP in the 7th allele, which could actually be correct but we cannot verify; MLST Typer did not call anything for this strain).

When it comes to detecting the presence of genes, such as acquired antibiotic resistance genes, SRST2 also outperformed assembly based approaches. We got accurate detection at depths >3x, and picked up many resistance genes that did not assemble well, especially at depths below 15x (see Fig 5 of the paper). Again, we compared to ResFinder on the Center for Genomic Epidemiology website, and found SRST2 performs much better (see Table 3 in the paper). I don’t want to criticize the tools on the Center for Genomic Epidemiology website, as they are really useful for people who have limited resources to analyse genome data… but I do think people need to be aware that the successful use of those tools rely heavily on first obtaining a high quality assembly, and you can very easily miss important antibiotic resistance genes.

 

So what are the main uses for SRST2?

The key jobs SRST2 is good for are:

  • detecting the presence of genes
  • identifying specific alleles for those genes from a known set of sequencescalculating sequence types (STs) specified by MLST-style schemas.

So it is great for detecting acquired genes, such as antibiotic resistance genes, virulence genes or plasmid genes; or extracting MLST information. To this end, the SRST2 download includes:

  • a script to help you pull down the latest MLST data from pubmlst.org for >100 MLST schemes
  • instructions and code for typing virulence genes from the Virulence Factor Database (VFDB)
  • SRST2-compatible versions of popular resistance gene databases (ARG-Annot, ResFinder) and the PlasmidFinder database of plasmid replicons

For example, in the paper we show the use of SRST2 for investigating changing patterns of vancomycin resistance amongst Enterococcus faecium isolated at a local hospital (data from this 2013 MBio paper). We analysed the Illumina readsets using SRST2 to detect resistance genes and perform MLST. SRST2 returned a table of strains with columns to indicate sequence types, alleles for each locus in the MLST scheme, and the presence of acquired resistance genes (shown here viewed in Excel):

srst2_data_excel_Enterococcus

You will notice that the depth values are VERY high for this data set, as the sequencing was done at relatively low-plex on a HiSeq. This means that it takes a long time to map (and VERY long time to assemble) these read sets… however note that you can obtain the same results much faster by telling SRST2 to use just the first 1 or 2 million reads per strain, which is more than enough to get accurate results (–stop_after 1000000).

We used R to summarize and plot the results, which are shown in Figure 7 in the paper (R code is in the /scripts directory in the SRST2 download):

Panel (a) shows that vanB vancomycin resistance became more frequent over the 12 years investigated, while panel (b) shows that the STs also changed during this time period… so is the increase in resistance linked to the introduction of new van-resistant ST252 and ST203 strains?

Panel (c) puts the ST & resistance results together… the tree on the left is a dendrogram depicting the similarity between sequence types based on number of shared alleles; MLST profiles (STs and their component alleles) are shown next to this. On the right, the frequency of each ST is shown as a bar graph, which highlights that there are 3 main types, ST17, ST252 and ST203, which all share 5-6 alleles. The heatmap in the middle shows the frequency of the various resistance genes within each ST – it is clear that the vanB locus is present in all 3 STS, but only at ~50% in each… so the increase in vancomycin resistance is not explained by the introduction of resistant ST252 and ST203 clones; rather, vancomycin resistance comes and goes, via movement of the vanB operon, in all 3 common STs.

The whole genome phylogenetic analysis in the 2013 MBio paper shows this in much finer detail of course; but the basic conclusions could be drawn from a very rapid analysis with SRST2.

The same is true for hospital outbreak investigations (see examples in the paper) – in a few minutes SRST2 can show whether a potential transmission case has the same ST and acquired genes profile as confirmed outbreak cases; if the ST matches, genome-wide SNP trees can be used to investigate further.

 

My next post will have some pro tips for using SRST2.

Are we doing FAKEPHY? REALPHY paper in MBE and the ‘inaccuracy’ of phylogenies based on whole genome SNPs identified by mapping to a reference

Just been reading an interesting paper in Molecular Biology & Evolution, titled “Automated reconstruction of whole genome phylogenies from short sequence reads“. Read it, it’s open access 🙂

The authors simulated sequences with different tree structures and divergence levels, and compared the accuracy of maximum likelihood tree inference (PhyML) to recover the true tree from the data (examining topology and branch length) using either (i) full sequence alignment, (ii) alignments of SNPs identified by mapping to a single reference or (iii) their REALPHY method which combines data from mapping to multiple references and includes not only SNPs but invariable sites as well. 

Unsurprising (and comfortingly!) they found that using the full sequence alignment gave the right tree (whew!). However they also showed that using SNPs identified by mapping to a single reference could give incorrect topologies, and considering only the SNPs (ie excluding invariant sites) could screw with branch lengths.

Uh oh. This is what we do routinely for teasing apart phylogenies within clonal bacterial populations. The authors even name-check some of my papers and those of my Sanger Institute colleagues…

“In recent years, numerous studies (e.g., (Harris et al. 2010; Croucher et al. 2011; Mutreja et al. 2011; Holt et al. 2012; McAdam et al. 2012)) have reconstructed phylogenetic trees for large numbers of closely related bacterial strains by mapping short sequence reads to a reference genome sequence. Here we have analysed the performance of such methods on simulated and real sequence data and have shown that there are two primary pitfalls to this approach. The most readily apparent is that when SNP alignments are used to construct trees with maximum likelihood methods, it can lead to incorrect tree topologies and inaccuracy in the inferred branch lengths. Furthermore, when query sequences are sufficiently divergent from the reference sequence, the most divergent regions of the genome may fail to map, and this mapping bias may lead to incorrect branch lengths and incorrect topologies.”

Have we been getting it wrong for the last 6 years?

Well, not really, no.

Firstly, we have actually thought about this before and looked at the effect of reference choice and excluding invariant sites. But we didn’t publish any of these observations, and probably should have.

Secondly, the scenarios tested in the MBE paper are not representative of the highly clonal bacterial populations to which we have been applying the map-to-a-single-ref-and-analyse-SNP-sites approach.

Figure 2 in the MBE paper shows the accuracy of topology inference over a range of tree shapes and sequence divergence levels. The levels of divergence they considered were 0.5%, 1%, 2%, 4% and 8%. From this figure, topologies were always accurate for 0.5% divergence and usually accurate for 1% divergence too, depending a little on the tree shape. The thing is, we only use this method for phylogenetic analysis of clonal bacteria, where the divergence rate is very low, e.g. in our Shigella sonnei paper, the total divergence across the whole data set (132 strains from global sources representing the entire extant population of Shigella sonnei) was 1 SNP every 432 bases, ie 0.23% divergence (incidentally this is detailed in paragraph 1). The actual pairwise divergence between strains is much lower still. The same is true for the other papers cited (and loads of others that we have used this method for), which deal with high-resolution analyses (hundreds of genomes) of clonal populations including single sequences types of S. aureus, S. pneumoniae, and recent clonal expansion of V. cholerae.

But what about branch lengths? Figure 4 in the MBE paper indicates that branch lengths are often underestimated when using SNP alignments. Again, the effect is more pronounced for higher divergence levels. However at 0.5% divergence it looks like half of all branch lengths were precisely accurate, and the vast majority of branch lengths (looks like ~90%) were accurate within 10%. At <0.5% divergence this effect would probably be even smaller.

So. I guess I don’t need to retract all my papers just yet.

But what about the future?

1. Including all sites, not just SNPs. Point taken. We are increasingly moving to phylogenetic inference an all sites, not just SNPs. The main issue here though is analysis time and memory, especially for BEAST. Performing Bayesian phylogenetic and demographic analysis on a 10GB alignment is not trivial. And if most of the genome is invariant then this can be really quite excessive (e.g. for our Shigella sonnei with 0.23% divergence, we would only need a 23 MB alignment instead of 10 GB). One option is to use analyse variant codons rather than variant SNPs, as this at least allows use of a codon model and incorporates some invariant sites.

2. Getting away from mapping to a single reference. I agree this is useful (and in fact necessary) when dealing with diverse populations, but I don’t see any reason to drop this approach for within-clone whole-genome analysis that we are routinely doing (and was the subject of the cited papers). The most common approach in more diverse populations is to work from the assembled genomes and generate alignments of each orthologous gene. See e.g. the first stage of analysis of 3000 S. pneumoniae genomes recently reported by Stephen Bentley in Nature Genetics, followed by within-clone analysis using the usual mapping-to-close-reference approach.

The authors of the MBE paper present an alternative solution, which involves mapping to multiple references (with Bowtie2) and combines the alignments from these mappings (see the paper for details, esp Supp Figure 6). This is implemented in a program called REALPHY (ouch! I am trying hard not to take this personally… do they think my current analysis pipeline should be called FAKEPHY?), which is available as a web service or to download here. Now I’m all for this and will give it a try. But the ‘computational resources’ section in the Methods makes me wary…

“The disk space required (~60MB per Mbase × #genomes × #references) and the computing time (~2 min per Mbase × #genomes × #references) are linearly dependent on these three factors. Furthermore, the amount of RAM required depends primarily on the sequence length and the number of genomes (~250MB per Mbase × #genomes).”

Now, the papers I’m working on at the moment include some with 100-300 genomes, one with 800 and a couple with 2000 genomes. Not quite as big as the 3000 Strep pneumo, but getting there :). I’m not so concerned about disk space. This would be 30GB – 600 GB. Time is a bit more problematic – for 100 genomes it would take 16h, for 200 it would be 33 hours (call it 1.5 days) but for 2000 it would be 2 weeks. Compare that to a few hours for our current pipelines, since each genome can be mapped to a single reference independently of all others and distributed over a cluster with hundreds or thousands of cores. RAM is also concerning. 125 GB for 100 genomes, 2.5 TB for 2000 genomes. Even our supercomputer would struggle with that! I’m not entirely sure, but it seems that these estimates include both the generation of the alignment and tree construction using PhyML and dnapars (from phylip). I doubt the time & RAM calculations would hold across the range of genomes I’m talking about here. And there are ways to do this much faster, including RAxML. But I do know that using all sites in a whole genome alignment increases the time taken, especially for BEAST analysis which we usually need to do.

If I do want to look at more diverse populations, I will have to decide between using an approach like REALPHY or deriving gene alignments from the assemblies. The latter avoids the problem of mapping entirely and allows you to detect differences in evolutionary histories of individual genes, but it does introduce a new problem of ortholog detection. So perhaps the answer is mapping to single reference for deeply sampled intra-clone analysis; REALPHY for intermediate cases (1-5% divergence, ie within species and with lower resolution required) and ortholog alignments for anything more?

So. I think I will be sticking with the mapping-to-single-reference approach for my current projects. But I do like this paper, as it made me think again and throws another free tool into the mix.

Inferring transmission from genomic + epi data

There is a flurry of new papers on this, how exciting! One is through peer review and published in final form at PLoS, and the others in preprint (arXiv and the new BioRXiv)!!

It will take me a while to read and understand these, especially the subtleties of the different applications they are aiming for…. so in the meantime let’s just compare them the lazy way – is the software available, and who has the best pictures?

1. In PLoS Computational Biology, using SARS to demonstrate utility:

Bayesian Reconstruction of Disease Outbreaks by Combining Epidemiologic and Genomic Data

Thibaut Jombart, Xavier Didelot, Simon Cauchemez, Christophe Fraser, Neil Ferguson

Abstract: Recent years have seen progress in the development of statistically rigorous frameworks to infer outbreak transmission trees (“who infected whom”) from epidemiological and genetic data. Making use of pathogen genome sequences in such analyses remains a challenge, however, with a variety of heuristic approaches having been explored to date. We introduce a statistical method exploiting both pathogen sequences and collection dates to unravel the dynamics of densely sampled outbreaks. Our approach identifies likely transmission events and infers dates of infections, unobserved cases and separate introductions of the disease. It also proves useful for inferring numbers of secondary infections and identifying heterogeneous infectivity and super-spreaders. After testing our approach using simulations, we illustrate the method with the analysis of the beginning of the 2003 Singaporean outbreak of Severe Acute Respiratory Syndrome (SARS), providing new insights into the early stage of this epidemic. Our approach is the first tool for disease outbreak reconstruction from genetic data widely available as free software, the R package outbreaker. It is applicable to various densely sampled epidemics, and improves previous approaches by detecting unobserved and imported cases, as well as allowing multiple introductions of the pathogen. Because of its generality, we believe this method will become a tool of choice for the analysis of densely sampled disease outbreaks, and will form a rigorous framework for subsequent methodological developments.

Software: R package ‘outbreaker’, which includes functions for simulating outbreak data. Very nice.

thumbsupthumbsupthumbsup

Pictures: Pretty good!

thumbsupthumbsupthumbsup

2. In preprint on BioRXiv, using tuberculosis as an example:

Bayesian inference of infectious disease transmission from whole genome sequence data

Xavier Didelot, Jennifer Gardy, Caroline Colijn

Abstract: Genomics is increasingly being used to investigate disease outbreaks, but an important question remains unanswered — how well do genomic data capture known transmission events, particularly for pathogens with long carriage periods or large within-host population sizes? Here we present a novel Bayesian approach to reconstruct densely-sampled outbreaks from genomic data whilst considering within-host diversity. We infer a time-labelled phylogeny using BEAST, then infer a transmission network via a Monte-Carlo Markov Chain. We find that under a realistic model of within-host evolution, reconstructions of simulated outbreaks contain substantial uncertainty even when genomic data reflect a high substitution rate. Reconstruction of a real-world tuberculosis outbreak displayed similar uncertainty, although the correct source case and several clusters of epidemiologically linked cases were identified. We conclude that genomics cannot wholly replace traditional epidemiology, but that Bayesian reconstructions derived from sequence data may form a useful starting point for a genomic epidemiology investigation.

Software: No software package. Uses BEAST trees, for data visualisation uses DensiTree, R and Cytoscape.

thumbsup

Pictures: I probably shouldn’t post these as it’s still preprint, so you’ll have to just look at the PDF. But there are some nice figures and a few different ways of presenting the data. I particularly like Figure 4A, which shows the phylogenetic tree against a binary matrix indicating SNP alleles in each sample, for those that vary within the group. This would be good to have handy for inspecting fine phylogenies… it looks like it was done in R, I think I might add this to my tree visualisation R functions (to be posted soon).

thumbsup

3. In preprint on arXiv, using a Staph. aureus hospital outbreak example

Reconstructing transmission networks for communicable diseases using densely sampled genomic data: a generalized approach

Colin J. Worby1, Philip D. O’Neill, Theodore Kypraios, Julie V. Robotham, Daniela De Angelis, Edward J. P. Cartwright, Sharon J. Peacock and Ben S. Cooper

Abstract: Genomics is increasingly being used to investigate disease outbreaks, but an important question remains unanswered — how well do genomic data capture known transmission events, particularly for pathogens with long carriage periods or large within-host population sizes? Here we present a novel Bayesian approach to reconstruct densely-sampled outbreaks from genomic data whilst considering within-host diversity. We infer a time-labelled phylogeny using BEAST, then infer a transmission network via a Monte-Carlo Markov Chain. We find that under a realistic model of within-host evolution, reconstructions of simulated outbreaks contain substantial uncertainty even when genomic data reflect a high substitution rate. Reconstruction of a real-world tuberculosis outbreak displayed similar uncertainty, although the correct source case and several clusters of epidemiologically linked cases were identified. We conclude that genomics cannot wholly replace traditional epidemiology, but that Bayesian reconstructions derived from sequence data may form a useful starting point for a genomic epidemiology investigation.

Software: No mention of any software package or how this was implemented.

Pictures: I probably shouldn’t post these as it’s still preprint, so you’ll have to just look at the PDF. But in my opinion figures are much less clear and harder to follow than the others (and mostly black and white…why???).

Eddie Holmes reviews Peter Doherty’s new book “Pandemics”

I love this because it involves two fantastic Australian-based infectious disease researchers, and is published in PLoS Biology. Yay PLoS!

The wonderful Eddie Holmes, who recently moved to the University of Sydney in Australia, has written a great review of Nobel prize winner Peter Doherty’s latest book, “Pandemics – what everyone needs to know”. Peter Doherty is of course Australian, and has been based at the University of Melbourne for the last few years… and just last week has moved into the brand new building that bears his name – the Doherty Institute for Infection and Immunity. He has written many great books, including the fabulous “Beginner’s Guide to Winning the Nobel Prize: Advice for Young Scientists”

 

Anyway, back to the review. As always, Eddie’s review is clear and interesting, and makes me really want to read the book!

 

If like me, you don’t have time to read it now, I also found this video of Peter Doherty talking about the book, over on BookTV which I am going to listen to while doing some boring admin tasks this afternoon…

epidemic – and How To Read A Phylogenetic Tree

Bacterial molecular epidemiologists have a lot to learn from our virus-studying friends.

A great example is the ‘epidemic’ website over at http://epidemic.bio.ed.ac.uk/, which has the byline “Molecular Epidemiology and Evolution of Viral Pathogens” and is led by the wonderful Andrew Rambaut of BEAST fame.

The site currently covers Coronavirus (where the current concern is Middle Eastern Respiratory Syndrome or MERS, which is a close relative of SARS) and influenza. There are really cool features, including the visual display of spatial and epidemiological data on current outbreaks, collated from numerous sources (http://epidemic.bio.ed.ac.uk/coronavirus_background) as well as regular blog posts about new cases, sequences and analyses. The data is stored on github, which means anyone can easily access it (fork it in git language), add and edit data, and generally share the love.

There is also a fantastic blog post by Andrew Rambaut explaining How To Read A Phylogenetic Tree, with a particular focus on epidemiological analyses, e.g. how to interpret pathogen trees from different hosts to learn about transmission between hosts.

As I often say, if there is one thing better than a good tree, it’s a good tree with a map… so I’m super excited by the phylogeography software Pandemix which facilitates visualisation of tree + map data, with a temporal animation thrown in for good measure (!), over the web. Go here to see it in action on MERS data.

mersmap

Screenshot of Pandemix showing MERS data from Cotten et al (2013) in The Lancet. See http://epidemic.bio.ed.ac.uk/mers-cov_phylogeography

PhD position in TB host-pathogen genomics

TB_poster

I am looking for a PhD student to work on host-pathogen genomics of tuberculosis (TB) at the University of Melbourne (with co-supervisors across the Departments of Pathology; Biochemistry & Molecular Biology; Bio21 Institute; and Nossal Global Health Institute). There will also be opportunities to interact with collaborators at the Oxford University Clinical Research Unit in Vietnam and the National University of Singapore.

The project will involve analysis of large numbers of TB genomes, paired with host genotype data, using phylogenetic, computational and statistical analysis.

Interested students with a background in microbiology, bioinformatics, phylogenomics and/or biostatistics, can contact me here.

Multidrug_resistant_Mycobacterium_tuberculosis_15058261_1

SPAdes vs Velvet assemby comparison

This is a guest post from Dave Savage, a post-doc working on resistance gene analysis at the University of Melbourne. We have been trying out SPAdes as a replacement for Velvet + Velvet Optimizer which we routinely use for assembling bacterial genomes. The SPAdes paper, and the B-GAGE assembler comparison, show that SPAdes does better than most other assemblers on a small set of bacterial read sets, but we routinely assemble hundreds of genomes so needed to see whether this performance holds across large sets of reads with variable coverage etc. The results confirm SPAdes consistently performs well, and better than Velvet. This, and the fact that SPAdes gets you from fastq reads to error-corrected contigs and scaffolds in a single command with no need for tuning parameters, makes it a clear winner for our purposes.

Dave says:

SPAdes (http://bioinf.spbau.ru/spades) is a new(ish) genome assembly program that is particularly suitable for assembly of bacterial genomes. SPAdes makes a number of improvements on the de Bruijn graph algorithms used by assemblers like Velvet, IDBA and SOAPdenovo, including iteration over values of kmer sizes, and incorporation of paired kmers (k-bimers), which allows information from paired end reads to be introduced into the computation at an earlier stage.

In their paper, the authors of SPAdes make comparisons with a number of popular assemblers using single-cell and multi-cell E. coli datasets. Their results show that SPAdes can in fact deliver better assemblies, with a higher N50 length and larger contigs. Since we regularly use Velvet for assembly, we wanted to perform some further comparisons between Velvet and SPAdes on the types of data sets we typically deal with. To this end we ran SPAdes and Velvet on 114 Shigella sonnei 54 bp paired end Illumina read sets (from our 2012 paper, reads available at ERP000182), and 67 unpaired 37 bp Staphylococcus aureus samples (from Harris 2010, available at ERP000070). We then calculated the N50, length, L50, the number of contigs > 500 bp, and the total contig length produced by each assembler. The figures below show the distributions of each of these values for the two data sets.

Dark = SPAdes; light = Velvet Optimizer

Shigella sonnei assemblies. Dark = SPAdes; light = Velvet Optimizer

Staphylococcus aureus assemblies. Dark = SPAdes; light = Velvet Optimizer.

Staphylococcus aureus assemblies. Dark = SPAdes; light = Velvet Optimizer.

There’s a few things to notice about these plots. Firstly there’s quite a bit of difference between the S. shigella and the S. aureus assembly statistics, although this is hardly surprising given that the S. sonnei reads are paired ends, while the S. aureus are unpaired (and only 37 bp long). What’s apparent though is that SPAdes and Velvet perform similarly on the S. aureus reads, but have very different performance on the S. sonnei reads, where SPAdes clearly outperforms Velvet. We can probably attribute this to the fact that SPAdes includes an improved algorithm for handling paired end reads.

For both species, SPAdes frequently results in a higher N50 than Velvet and a larger total contig length. For the paired end S. sonnei reads SPAdes also results in a larger number of contigs greater than 500 bp in length. For the L50 metric, Velvet results in quite low scores for the S. sonnei reads, with 10 or less contigs frequently accounting for at least half of the total contig size. However, given the total contig length obtained using SPAdes, the much higher values for L50 obtained using SPAdes are not unexpected.

In our current assembly pipeline, we run Velvet Optimizer a number of times using smaller and smaller steps for k in order to hone in on an optimal value. Using SPAdes, not only is this step incorporated into the assembly program, but the information from each iteration is combined to produce the final output. Thus SPAdes is able to capture the information obtained using small, highly sensitive kmer sizes, as well as large, highly specific, kmer sizes, and at the same time simplify our assembly pipeline. We’re currently beginning a new project looking at the pan-genome of a number of bacterial species, and we anticipate that this project will involve quite a bit of assembly. Based on the results shown above, and the pipeline process for SPAdes, it looks like we’ll be using SPAdes to do the bulk of this work.

Run specifics:

  • Velvet was run on the S. sonnei pared end reads using VelvetOptimser with a minimum kmer size of 29 and a maximum kmer size of 89.
  • SPAdes was run on the S. sonnei pared end reads using the default settings.
  • Velvet was run on the S. sonnei pared end reads using VelvetOptimser with a minimum kmer size of 15 and a maximum kmer size of 35.
  • SPAdes was run on the S. sonnei pared end reads using the kmer sizes of 15, 23, 35.

Get coverage stats for a LOT of reference sequences

Sometimes we need to map reads to a multi-fasta reference with lots of sequences, e.g. to screen for large sets of genes or plasmids. The mapping works fine, but it can be tricksy to get alignment statistics (such as % of reads mapped) broken down by reference sequence, using common tools like Samtools, Bamtools or Bamstats.

Today we came across a tool that can do the job: sam-stats from the ea-utils fastq processing package.

Getting the statistics is easy:

sam-stats -A -B aln.bam > aln-stats.txt

(The ‘A’ option turns on reporting for all ‘chromosomes’, whilst ‘B’ tell the program the file is a BAM)

The details of the output can be found on the sam-stats wiki page. There you can also find more options for statistics, as well as utilities for removing adapter sequences and de-multiplexing fastqs.