New German STEC/EHEC data from BGI

June 13. BGI has now formally released their data, including Illumina reads, under the Creative Commons Zero (CC0) license. This is the most open license possible, and includes this statement:

The person who associated a work with this deed has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.

They have set up this page with links to, and details of, all the data available. I understand HPA will do something similar.

This is an important step for crowdsourcing! In particular, it clarifies the position that this data is released for the benefit of the public and public health, and that there are no restrictions on publication of data and analysis. This is important. The genomics community has long been used to pre-publication release of ‘community resource’ genomics data (a practice begun during the Human Genome Project), on the understanding (in broad terms) that the research community would not attempt to publish analysis of the released data until the data generators had finished generating it and had a chance to publish their own analysis. In short, releasing data does not imply that others have the right to publish analysis of it. But the CC0 license does.

I’ve been asked about this by a few people over the past week. There is understandably some discomfort about people publicly sharing analysis of data that they did not generate. My position has been that the data generators released their data freely for public analysis (with public health as the driver), and this is very different from a research project where they have released the data with the expectation that no-one will ‘publish’ anything before they themselves have. But it is important to formalise this. I guess it just takes time for the legalities to catch up.

For the crowd-sourcers’ part, all the analysis posted at the GitHub E. coli O104:H4 Genome Analysis Crowdsourcing site will be available under the CC0 license too. This is being formalised today thanks to the folks at ERA7.


So BGI have released a new assembly of TY2482 [here], which they say includes 200x of data from HiSeq. I’m concerned that they’ve not released any reads this time, and that there is still no information on how they are doing the assembly. How long are the reads? Were they paired-end? What insert size? Why only 200x coverage from a HiSeq run? Even with short (35 bp) single-end reads I’d expect a lot more than that.
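To put rough numbers on that coverage question, here is a back-of-the-envelope sketch using the usual estimate of coverage = number of reads × read length / genome size. The genome size (about 5.3 Mb) and the read count used for the HiSeq example are my own illustrative assumptions, not figures from BGI, and the 35 bp read length is just the short-read case mentioned above.

```python
import math


def fold_coverage(n_reads: int, read_length_bp: int, genome_size_bp: int) -> float:
    """Expected average fold coverage: total sequenced bases / genome size."""
    return n_reads * read_length_bp / genome_size_bp


def reads_needed(target_coverage: float, read_length_bp: int, genome_size_bp: int) -> int:
    """Smallest number of reads that reaches the target average coverage."""
    return math.ceil(target_coverage * genome_size_bp / read_length_bp)


# Assumption: approximate genome size for this strain, not a figure from BGI.
GENOME_SIZE_BP = 5_300_000
# The short single-end read length used as the pessimistic case in the text.
READ_LENGTH_BP = 35
# Assumption: a purely illustrative read count, not a reported yield.
ILLUSTRATIVE_READS = 100_000_000

reads_for_200x = reads_needed(200, READ_LENGTH_BP, GENOME_SIZE_BP)
illustrative_coverage = fold_coverage(ILLUSTRATIVE_READS, READ_LENGTH_BP, GENOME_SIZE_BP)

if __name__ == "__main__":
    print(f"Reads needed for 200x at {READ_LENGTH_BP} bp: {reads_for_200x:,}")
    print(f"Coverage from {ILLUSTRATIVE_READS:,} reads: {illustrative_coverage:.0f}x")
```

Under these assumptions, 200x needs only around 30 million 35 bp reads, which is why the reported figure looks low to me without knowing the read length, the number of lanes, or whether the sample was multiplexed.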

The new assembly includes more than 200x single-end reads from the Illumina HighSeq Platform, which allowed BGI to provide a more complete genome map and to correct any assembly errors from the previous version. More importantly, this version is a completely de novo assembly, whereas the previous versions by BGI and others used a reference-based assembly method to obtain a consensus sequence. The new assembly continues to support the finding that this infectious strain carries disease-causing genes from two types of pathogenic E. coli: enteroaggregative E. coli (EAEC) and enterohemorrhagic E. coli (EHEC).

To properly analyse this, the reads are really important, so why hasn’t BGI released them? For crowd-sourced analysis to work (and be worth the crowd’s effort) the full data needs to be public. BGI released the reads from their first 7 Ion Torrent runs plus an assembly, which was really useful. A lot of the annotation and analysis effort has been based around Nick Loman’s assembly of these reads using MIRA, and ERA7’s annotation of that assembly [see collated analyses at github]. In comparison, the BGI assembly has been drawn on less by analysts, because there is no detail about how it was created. So releasing reads worked for BGI the first time around. Why not now?

Life Tech released only an assembly and, several days later, has still not released the Ion Torrent reads. Similarly, their assembly has not been publicly annotated or been the focus of concerted analysis efforts, because any such work would depend on an assembly that we don’t know much about.

So I guess the point is, crowd-sourcing analysis relies on transparent data release. If we can’t see the raw data, we’re less likely to trust an assembly and proceed with analysis.

P.S. The press release from BGI implies that all prior assemblies were reference-based and not de novo. However, the MIRA assembly of runs 1-5 of BGI reads, which has been annotated by ERA7 and analysed extensively, was a de novo assembly.

Update: Ion Torrent reads now available for LB226692, from Life Technologies