Graphing and visualising data

I came across this today…

Top Ten Worst Graphs – http://www.biostat.wisc.edu/~kbroman/topten_worstgraphs/

…which got me thinking about data visualisation for scientific papers, posters and talks. This is another one of those things that isn’t taught to scientists, but really should be, as it really affects the impact and credibility of a study. I know I spend a lot of time on my figures – maybe too much? – but it’s a really crucial aspect of communicating research.

So, here are some more visualisation ideas:

And, crucially, colour schemes that work for different kinds of data (Cynthia Brewer)

Steve Baker’s article on failure of genomics to advance diseases of poverty

A colleague of mine, Steve Baker from Oxford Uni Clinical Research Unit in Ho Chi Minh City, has written an excellent but unsettling piece for Nature’s “World View” column on the impact of genomics on infectious disease in the developing world. Steve and I have worked together on several projects related to typhoid and diarrheal diseases, including our latest paper on typhoid in Nepal that I posted about a few days ago.

I’m generally optimistic about the great promise that genomics holds for infectious disease research and public health practice, and promote this whenever I can (and Steve does too). But as Steve points out in the article, the funding for these things is not directed at diseases that are common in developing countries, where the vast majority of the infectious disease burden occurs. He makes some very interesting points about how several factors – including local power structures and advocacy, well-meaning international donors like the Gates foundation, and a genuine lack of diagnostic field tests that are actually useable in the areas where the disease burden is – are resulting in very slow progress for diseases of poverty like typhoid.

The article is free to read here on the Nature site.

A general point that the article highlights, is that diseases and health issues need advocacy in order for change to happen. This is beginning to happen for typhoid, through the international consortium Coalition against Typhoid (see Wikipedia entry and the Sabin Institute). Their stated goal is “to define the barriers for adoption of typhoid vaccine and the key activities needed for the barriers to be overcome”, in short they aim to provide the advocacy much needed to get typhoid – which is essentially preventable through clean water and vaccination – onto the political & health agenda wherever the disease is endemic.

The global typhoid burden (from CaT site)

Typhoid in Kathmandu and Open Biology OA journal

And now for some shameless self-promotion…

A paper I’ve been working on for a few years on typhoid in Kathmandu yesterday had the honour of being the first paper ever published by the new open access journal of the Royal Society, Open Biology. I’m very keen on open access publishing and always try to submit to OA journals, but there is still a limited choice of truly OA journals. I love PLoS and BMC and submit to both regularly, but I think it’s really important to have a diverse range of OA journals – and therefore diversity in editors, editorial policies & styles, subject areas, etc – to make open access work for everyone.

So I’m excited to be have a paper in the new Open Biology, who publish under a Creative Commons 3 license (reuse/modify/distribute with attribution). Only time will tell how well the journal does, but it will only become great if us scientists are willing to submit good manuscripts. One incentive to do this is that Open Biology aims for a quick turn-around time of 4 weeks from submission to decision. Much as I love PLoS and BMC, they’ve never managed anywhere near that sort of turn-around. For info on Open Biology, see their ‘About’ page http://rsob.royalsocietypublishing.org/site/misc/about.xhtml.

So what is the paper? Thanks to OA, I can reproduce it here… (or you can read online or PDF)

Combined high-resolution genotyping and geospatial analysis reveals modes of endemic urban typhoid fever transmission

Stephen Baker1,2,*,, Kathryn E. Holt3,4,, Archie C. A. Clements5, Abhilasha Karkey2, Amit Arjyal2, Maciej F. Boni1,6, Sabina Dongol2, Naomi Hammond4, Samir Koirala2, Pham Thanh Duy1, Tran Vu Thieu Nga1, James I. Campbell1, Christiane Dolecek1,2, Buddha Basnyat2, Gordon Dougan4 and Jeremy J. Farrar1,2

Open Biol October 2011 1:110008; doi:10.1098/rsob.110008.

Basically, it uses genotyping and GPS to study typhoid fever in Kathmandu, Nepal. We examined 4-years worth of typhoid cases and looked at where the patients lived within the city (using GPS) and subtyped the bacteria responsible for their infections using high throughput SNP typing.

Firstly, we found that about 3/4 of the patients were infected with Salmonella Typhi and 1/4 were infected with Salmonella Paratyphi A. If you aren’t familiar with Salmonella, these are two serotypes of Salmonella enterica which, rather than causing gastrointestinal disease (ie food poisoning) like most Salmonella serotypes, cause the systemic infection known as typhoid. Typhi and Paratyphi A are quite different genetically, but have undergone convergent evolution to cause the same disease syndrome (see earlier paper in BMC Genomics).

Temporal distribution of Typhi (red) and Paratyphi A (blue) cases

Then we looked at the spatial distribution of the patients homes, and found that they were clustered in specific “hotspot” areas of Kathmandu:

Spatial risk model for Typhi infection (see paper for separate map for Paratyphi A risk)

Contrary to expectation, these hotspots weren’t the most densely populated areas…you might expect more people = more cases, but this wasn’t the case. Some complicated spatial statistics, done by Archie Clements at University of Queensland, confirmed that the hostpots weren’t associated with population density or hospital referral patterns, but were in low-elevation areas local people source their water from stone waterspouts.

Spatial distribution of Typhi cases, and location of water spouts

To see if the waterspouts could really be a source of typhoid transmission, we tested water samples for the presence of Typhi or Paratyphi A using culture and PCR. Culturing didn’t work, but it is notoriously difficult to culture Typhi from water samples that are not pre-enriched for bacteria…however PCR (using this method we published earlier in BMC Infectious Diseases) detected Typhi in 3/4 of water samples and Paratyphi A in 2/3.

Stone water spouts in Kathmandu (taken by co-author Stephen Baker)

We also looked at the population of bacteria causing the typhoid fever. We examined Salmonella Typhi isolated from the blood of typhoid fever patients, and used SNP typing to analyse the Typhi DNA and examine the population structure. We typed 113 SNPs (single nucleotide polymorphisms, ie point mutations) that we already knew about from previous variation discovery efforts. About 2/3 of isolates had the same haplotype, so to discriminate further within this local subgroup we sequenced 40 of the Typhi to identify novel SNPs arising in the local population (local microevolution) and typed these SNPs as well. Most of the Typhi belonged to the H58 lineage, which is common in other typhoid endemic zones we’ve looked at previously (Mekong Delta Vietnam – Holt & Dolecek 2011, PLoS NTD [OA]; Nairobi, Kenya – Kariuki 2010, J Clin Micro [free]; globally – Holt & Phan 2011, PLoS NTD [OA], Roumagnac 2006, Science [free in PMC]).

Typhi tree, red bars indicate frequency of genotypes in Kathmandu collection; red zones are H58 lineage and H58G sublineage

As the map above shows, the different Typhi genotypes were distributed randomly, with no spatial or temporal clustering. The exception was a probable outbreak in the west of the city, outside the hotspot zone, where 28 cases of infection with the same Typhi genotype were recorded in a two-month period – see yellow shaded area in map above, and zoomed in below:

Localised outbreak of Typhi genotype H58G-b4

Finally, we looked at what was happening in households from which multiple typhoid infections were studied. You might expect these household disease clusters to represent shared infections, which are transmitted between members of the household. However in most of these household typhoid clusters, the cases were caused by different organisms – either Paratyphi in one case and Typhi in the other, or multiple cases caused by different genotypes of Typhi. Cases with the same causative Typhi genotype are linked with dashed lines in this figure, you can see they are the exception rather than the rule:

Distribution of typhoid-causing bacteria in households with multiple typhoid cases

So, all in all we found typhoid fever infections clustered in spatial hotspots within Kathmandu, and that this clustering was explained not by population density but by low elevation and proximity to stone water spouts which are used to supply water. This implicates the water spouts in typhoid transmission via dissemination of Typhi and Paratyphi A around the city, supported by the detection of Typhi and Paratyphi A in the majority of water samples taken from these spouts. The diversity of Typhi genotypes we detected indicates that transmission occurs via water that is contaminated with a diverse population of Typhi, rather than point source outbreaks (with the exception of one outbreak, which actually occurred outside the hotspot zone). The diversity of Typhi genotypes within households suggests that this sort of transmission – ie dissemination via contamination of the water supply – contributes more to the overall typhoid burden than direct person-to-person transmission.

How does this contamination happen? Well, it is possible for people to carry Typhi and Paratyphi A in the gall bladder, without ever noticing an infection…for example, Typhoid Mary was a famous carrier of Typhi. Carriers shed the bacteria in their feces, so any food or water contaminated with their fecal material becomes a vehicle for typhoid transmission. Our study suggests that there are many Typhi and Paratyphi A carriers in Kathmandu who are unknowingly shedding the bacteria, so that whenever sewage seeps into the groundwater that feeds the stone water spout, the water becomes contaminated and can pass on the infection to those who drink the water. Most of the typhoid cases occur in the monsoon season, when flooding is likely to promote seepage of sewage into the underground aquifers that supply the water spouts. Hence the study suggests that endemic typhoid in Kathmandu is essentially a question of water infrastructure, and could potentially be dramatically reduced by supplying clean drinking water to people living in these few hotspot areas.

Peter Doherty on scientists in the spotlight

Interesting article from Peter Doherty (who is based in the Microbiology & Immunology department at the University of Melbourne where I work) about what it was like to be thrown into the spotlight by winning the Nobel prize.

I particularly like his line at the end:

“Real status in science is based on current performance, not on past achievement.”

This seems to me to be the attitude of the true scientist – whatever amazing things you’ve already done, developed or discovered, there is still a whole world of the unknown to explore.

Visualising trees

I just came across this very cool visual dictionary of tree visualisation methods – treevis, http://www.informatik.uni-rostock.de/~hs162/treeposter/poster.html, which made me think about the language of phylogenetic trees. It’s interesting to think how our ideas about phylogenetic inference, and the ways in which we interpret trees, are influenced by the way the tree is represented.

Whether studying bacterial populations or bacterial communities, we tend to use phylogenetics to breakdown, comprehend and represent the relationships between bacterial genomes. A basic phylogenetic tree can capture so much information… a great example is the recent Nature paper on the 7th Cholera pandemic, out of the Sanger Institute (see pubmed entry, the paper itself is behind the NPG paywall). The phylogenetic tree structure showing relationships between Vibrio cholera strains (obtained of course by whole genome sequencing with Illumina), together with the time and location that each strain was isolated, reveals an incredible amount of detail about how the pandemic has spread around the world over the last few decades.

But interpreting trees can be difficult. And the way the trees are represented can make them more or less difficult to interpret. As a simple example, I’m often surprised how many people are unaware that these two trees are just alternative representations of the same structure:

(If you don’t believe me, copy the tree structure below into a text file and open it up in a tree viewer like DendroScope or FigTree and click around the different representations.)

((A:0,B:0):0.2,(C:0.5,((J:0,K:0):2,(D:8,(E:12,(F:5,(G:5,(H:0.5,I:0):3):1):3):2):5):2):0.2);



In the unrooted tree on the right, it’s easy to see for example that E is about equidistant from all other leaves on the tree. But from the (randomly) rooted tree on the left, this is less apparent and requires some thinking about… many people interpret this as E being closer to F, G, H & I than to the other points. Granted, a proper rooting of the dendrogram on the left would help the situation, but still it’s a good example of how the visual representation makes interpretation more or less intuitive.

Many of the representations in treevis were created by computer scientists for purposes entirely unrelated to phylogenetics, so it would probably take a bit of effort to apply them to your favourite phylogeny…but could be worth it depending on what you are trying to convey.

An easy option for overlaying annotations of all kinds onto traditional phylogenetic tree structures (rectangular, circular and unrooted) is iTOL, the interactive tree of life. It’s a handy webtool where you can upload your tree file in newick format (like the one given above) or nexus format, plus some text files of annotations for nodes or leaves, and display the annotations overlaid on the tree in all kinds of cool ways. These are the examples in the iTOL paper in Nucleic Acids Research’s web server issue (open access), or just see the iTOL website for loads more examples and to try it out.

(Figure 1 from “Interactive Tree Of Life v2: online annotation and display of phylogenetic trees made easy”, Ivica Letunic & Peer Bork, NAR 39 (S2):W475-W478)

Assemblathon 1

I’ve been a little slow to catch up on the results of the Assemblathon, a competitive assembly event where teams use their best method(s) to generate assemblies from raw read data and the results are compared by a variety of metrics. The results from the first assemblathon, using simulated read sets, are now available pre-publication from Genome Research. The second assemblathon, using real (Illumina and Illumina+454) data from eukaryotic genomes, is happening now.

Firstly I think this is a brilliant idea and there should be far more of it in bioinformatics! So many of us are engaged in the same basic analysis tasks for dealing with short read sequence data – assembly, mapping and variant calling – but there are so many different programs & approaches (see this compilation over at seqanswers.com) out there that it quickly becomes overwhelming.

So, the results are in but, as always in the comparison of methods, there is not really a clear-cut winner. Each assembly was assessed using an enormous set of metrics (>100 apparently), including N50 (at contig and scaffold levels), miscalled bases, depth of coverage, misassemblies, etc… and unsurprisingly there was no single assembly that scored top on all metrics. BGI’s SOAPdenovo, Broad’s ALLPATHS, and Sanger’s SGA were consistently among the best for most metrics… but with clear differences. E.g. for contig N50 SOAPdenovo and ALLPATHS were both superior to SGA, which performed better than the others on scaffolding N50. SGA had the least substitution errors, but SOAPdenovo had fewer copy number errors and ALLPATHS had the best contig-level stats. For all the gory details see the results website or summaries in the paper [free at Genome Res] or this talk [PDF] presented at the Cold Spring Harbour Lab Biology of Genomes meeting.

I am trying to wrap my head around how informative this is for assembling bacterial genomes. I know a lot of people run their own in-house comparisons to determine the best approach for a particular project, but the assemblathon approach is systematic, and manages to be both competitive and collaborative, which is an awesome combination. While bacterial genomes are small and therefore raise fewer computational issues associated with large data & memory requirements, assembling them is still far from trivial and is often a crucial element of the analysis, because gene content is so variable among even very closely-related bacteria. The parameters I usually have in mind when considering bacterial assemblers are:

  • impact of different sequencing platforms & error profiles
  • impact of different insert sizes for paired or mate-pair reads
  • genomes with high or low GC
  • genomes with excessive IS elements

I guess the only aspect of this that’s missing from the current/planned assemblathon datasets is the effect of high or low G+C content, i.e. low complexity sequence, which isn’t really bacteria-specific anyway (think e.g. P. falciparum, the malaria parasite).

E. coli outbreak – PacBio data and PLoS One paper on 2001 O104:H4 strain

It’s been a while since my last post, mainly because my attention has had to return to other things (my day job, ASM, and holidays in the Australian Snowy Mountains).

But a fair bit has happened in the last few weeks on the E. coli front.

PacBio has released some data from an outbreak strain plus a few related strains. As far as I know this is the first time PacBio data has been released publicly so it’s a good opportunity to have a play…I will at some point but not tonight! The data includes very long reads (average ~3 kbp) but with 85% accuracy (ouch! but useful for assembly)…using their circularised read approach, they get much better accuracy (average ~98%) but much shorter reads (average 430 bp).

Muenster & Life Tech have now published their analysis of outbreak strain and the 2001 O104:H4 STEC strain from Germany in PLoS One. Data is available here at NCBI.

The key finding of interest, I think, is that the stx2 phage (i.e. Shiga toxin) was present in the 2001 strain – apparently identical and in the same position – suggesting that the phage was acquired by a common ancestor of the 2001 and 2011 German O104:H4 strains. However the stx2 phage does favour certain insertion sites, so it is still possible that this represents two separate acquisitions.

This is their model:

Mellman et al, PLoS One 2011 - Model for O104:H4 STEC evolutionMellman et al, PLoS One 2011 - Model for O104:H4 STEC evolution

The two strains carry different aggregative adhesion plasmids (AAF/III in 2001; AAF/I in 2011) and different resistance plasmids, consistent with some evolutionary time separating them.

The paper says that each strain has accumulated 87-95 SNPs among 1,444 chromosomal genes since they shared a common ancestor…but I’ve not looked in enough detail to be convinced the authors have corrected sufficiently for homopolymers.

Interestingly their tree suggests that the common ancestor of the 2001 & 2011 German strains is also the common ancestor of the African (stx2-free) EAEC strain Ec55989 (which has picked up 24 SNPs)…again I’m not sure whether this is correct until I inspect the data myself, and the lower number of SNPs in Ec55989 makes me a little suspicious that the others are over-estimates….since Ec55989 was isolated ~1999 I think so should have a similar number of SNPs to the 2001 strain. But still a very interesting development.

This is their tree:

Mellman 2011 PLoS One - minimum spanning tree

SNP analysis and plasmid copy number among 5 outbreak E. coli

Since the Illumina (MiSeq) reads from 5 HPA genomes were recently released, I thought it would be interesting to compare these to the ‘complete’ assembly of TY2482 from BGI and look for substitution mutations and read depths for plasmids vs chromosome.

SNP analysis

I decided to try using Nesoni, the python-based analysis pipeline written by Paul Harrison & Torsten Seeman from the Victorian Bioinformatics Consortium here in Melbourne. It uses shrimp2 to map reads, and samtools and to generate and process sorted alignments and call variants. It has a nice feature where, with a single command (nesoni nway) you can generate a table showing an n-way comparison of consensus allele calls in a set of genomes, at each of the loci called as a variant in any genome.

The complete resulting output is here. It reports not only the consensus call, but the evidence behind the call, so it’s simple to see whether you believe it or not. The result is 220 calls, of which I might believe 9 (pink in figure below, and Excel file here).

I mapped MiSeq reads from 4 of the 5 HPA genomes (shrimp2 didn’t like the fastq file for sample 280, need to sort this out) and the HiSeq reads from BGI’s TY2482, to the complete reference assembly for TY2482. For 194 of the 220 variants called, the TY2482 read mapping resulted in a variant call compared to the TY2482 reference, which means that the variant is unlikely to be real. This could happen for a variety of reasons relating to the mapping & variant calling process, and I was just using the default settings in Nesoni so some tweaking might remove these. In any case, I will ignore these variants for now because I don’t believe they are real (but you can see the full table here).

This leaves 24 variant calls, where the allele detected in one or more of the 4 HPA genomes is different from the TY2482 assembly+reads from BGI, shown in the table below. This includes 9 in the chromosome (highlighted in pink), with 2 SNPs called in all 4 genomes; 1 SNP called in sample 283 only, 3 SNPs called in sample 540 and 3 SNPs called in sample 541.

SNPs identified in HPA strains compared to the complete BGI assembly for TY2482

In pTY1, which is the IncI plasmid bearing the beta-lactamase CTX-M-15 gene, the variants detected (yellow above) were all within the shufflon proteins…this region is able to ‘shuffle’ via inversions between homologous sequences, and these variant calls will most likely represent shuffling in the region rather than point mutation. In fact, the alignment suggests that there are multiple versions of these sequences in each DNA sample, suggesting that the shufflon was active and generating a mixed population of plasmids in the HPA data (but not the BGI strain TY2482, which had homozygous calls in this region). This might be worthy of some further investigation by someone who understands shufflons a lot better than I do!

Finally, there were a few variant calls in the tiny plasmid pTY3, clustered within its rep gene. These calls are heterozygous (see table) in all four HPA strains, suggesting that the mapping is picking up two different versions of the rep gene, which could be due to homology with other replication proteins in plasmids pTY1 and pTY2.

Plasmid copy number

The massive variation in read depth for plasmid sequences compared to the chromosome reminded me it might be interesting to try to infer the average copy number for each plasmid based on read depth. To do this, I used the depth plots output by nesoni (which gives the mean read depth per base in the reference sequence, based on read mapping). I calculated the mean read depth across each reference sequence (ie the completed BGI TY2482 assembly, chromosome + 3 plasmids) from this, and then calculated the ratio of read depths for plasmid:chromosome. Assuming each bacterial cell has ~1 copy of the chromosome (i.e. ignoring cells caught in the act of replication when there will be >1), this should give an approximation of the mean copy number of each plasmid per cell. We know some plasmids are maintained quite stably at 1 per cell, while others can exist at high copy number. This is the result:

Mean read depth and mean plasmid copy number for outbreak strains

So for the TY2482 data, it looks as though the IncI1 resistance plasmid (pTY1) and the aggregative adhesion plasmid (pTY2) are maintained at ~1 per cell, while the mini plasmid (which contains little more than a plasmid replication gene) is present at ~9 per cell. This is pretty much in line with expectation.

Interestingly, the HPA strains appear to have much higher copy numbers, around 20 per cell for pTY1 and pTY2 and hundreds of copies of pTY3. The numbers are pretty consistent across the HPA strains, but are remarkably higher than in TY2482.

I don’t have a good explanation for this apparent difference…. it could be an artefact in the sequencing (MiSeq likes plasmid DNA???) or in the mapping (not sure how this could be, especially since the mean depth plots produced by nesoni exclude regions that map to multiple locations in the reference genome).This could be examined by looking at results from different mapping programs, or analysis of reads from different platforms (Ion Torrent for TY2482 & LB226692, 454 reads for the HPA & C2L genomes).

If it is a real difference, I wonder if it could be differences in growth rate or culture conditions in the two labs. Or a mutation in the chromosome that affects the normal control of plasmid replication? Could having lots of copies of the aggregative adhesion plasmid enhance virulence or transmission of the bug?

Four new outbreak isolates sequenced

Just a quick post to say that, in addition to the 454 scaffold for H112180280, the UK’s Health Protection Agency has now posted Illumina MiSeq data (reads + assembly) for H112180280 and a further 4 isolates: http://www.hpa-bioinformatics.org.uk/lgp/genomes

So enters the 4th sequencing platform…

 

Eurosurveillance editorial on E. coli outbreak credits crowdsourcing

This editorial in Eurosurveillance gives a nice overview of the microbiological side of the German E. coli outbreak investigation, including applauding the public data release & analysis efforts:

The data sets from these sequencing initiatives were instantly released for public access, resulting in data analysis among bioinformaticians and other researchers around the world. Results from these preliminary analyses have been rapidly communicated via blogs, Twitter and private web pages, outside the standard peer-reviewed scientific publication route. These initiatives have confirmed the microbiological characterisation of the outbreak strain made in the public health laboratories by targeted genotyping and phenotyping of facultative E. coli virulence genes. Most importantly, among compared E. coli genome sequences, the genome of the 2011 outbreak strain clustered closest to an EAggEC strain isolated in 2002, with the addition of stx2 and antibiotic resistance genes.

The details of findings to date are outlined in this article in the same issue, including details of Shiga toxin-producing enteroaggregative E. coli O104:H4 from an outbreak in Georgia in 2009 (main difference seems to be that it was ESBL negative, unlike the current strain which has acquired an IncI plasmid carrying the ESBL gene blaCTX-M). They also discuss a rapid PCR test for the outbreak strain direct from food samples, involving enrichment+incubation (18-24h) followed by PCR for stx2 gene from extracted DNA, followed by PCR for O104 and then confirmation from pure cultures.

There are now reports of E. coli O104 from a stream in Germany, located downstream from a sewage plant… although this is more likely to be caused by the outbreak than a cause of it, it highlights that even highly industrialized countries need to be vigilant with sanitation and hygiene to prevent the spread of dangerous human pathogens [sourced from ProMed].

Follow

Get every new post delivered to your Inbox.