E. coli data released under Creative Commons 0 license

BGI has now formally released their data, including Illumina reads, under the Creative Commons 0 (CC0) license. This is the most open license possible, and includes this statement:

The person who associated a work with this deed has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.

They have set up this page with links to, and details of, all the data available. I understand HPA will do something similar.

This is an important step for crowdsourcing! In particular, it clarifies the position that this data is released for the benefit of the public & public health, and that there are no restrictions on publication of data and analysis. This matters because the genomics community has long been used to pre-publication release of ‘community resource’ genomics data (a practice begun during the Human Genome Project), on the understanding (in broad terms) that the research community would not attempt to publish analysis of the released data until the data generators had finished generating their data and had a chance to publish their own analysis. In short, under that convention, releasing data does not imply that others have the right to publish analysis of it. But the CC0 license does.

A few people have asked me about this over the past week. There is understandably some discomfort about people publicly sharing analysis of data that they did not generate. My position has been that the data generators released their data freely for public analysis (with public health as the driver), and this is very different from a research project where they have released the data with the expectation that no-one will ‘publish’ anything before they themselves have. But it is important to formalise this… I guess it just takes time for the legalities to catch up.

For the crowd-sourcers’ part, all the analysis posted at the GitHub E. coli O104:H4 Genome Analysis Crowdsourcing site will be available under the CC0 license too; this is being formalised today thanks to the folks at ERA7.

Short read data storage – where is the future?

The Sequence Read Archive (SRA) is an archive of next-generation sequencing data, mirrored (like the other sequence databases) at NCBI (US), EBI (Europe) and DDBJ (Japan).

SRA links: @NCBI, @EBI’s European Nucleotide Archive (ENA), @DDBJ.

A few weeks ago, the NCBI announced they would be discontinuing the NCBI-hosted SRA.

The reason cited by NCBI is budget constraints [see announcement], and a couple of other NCBI databases are also affected. The SRA continues to be supported by EBI & DDBJ, so in the short term this decision is unlikely to have much impact on day-to-day availability of the SRA data. It does, however, raise questions about the prospects for long-term storage of sequence data, and whether it makes more sense to store sequence data (raising problems related to long-term storage, accessibility, backup & format issues) or simply store the DNA.

I’m somewhat torn on this. On the one hand, if we are going to be sequencing every bacterium that comes into a public health/hospital microbiology lab, it makes no sense to store all of the sequence data. On the other hand, I don’t think we know quite enough yet to decide which information we really need to extract and keep, and what is extraneous. Another major issue is that while sequence data is (relatively) easily shared, searched and otherwise accessed by groups all over the world, storage of strains or DNA necessarily entails individual labs not only coming up with long-term storage solutions for their samples, but also finding ways to make these accessible to other groups. I’ve heard too many horror stories about labs throwing out their “old strains” (for reasons ranging from containment level/health & safety requirements to simply having no money for additional freezer space) to rely on the long-term storage of DNA.

Genome Biology (open access) recently ran a Q&A on this issue with 5 researchers (David Lipman – NCBI; Paul Flicek – EBI; Steven Salzberg; Mark Gerstein; Rob Knight).

Some of these guys highlight that other databases are actually more useful for NGS storage, especially for metagenomics/bacterial 16S data. I find it interesting that some of them seemed not to realise (or at least not acknowledge in their answers) that the SRA is not just an NCBI/NIH/US venture, but is in fact an international effort and continues to be maintained at EBI.