Short read data storage – where is the future?

The Sequence Read Archive (SRA) is an archive of next-gen sequencing data, mirrored (like the other sequence databases) at NCBI (US), EBI (Europe) and DDBJ (Japan).

SRA links: @NCBI, @EBI’s European Nucleotide Archive (ENA), @DDBJ.

A few weeks ago, the NCBI announced they would be discontinuing the NCBI-hosted SRA.

The reason cited by NCBI is budget constraints [see announcement], and a couple of other NCBI databases are also affected. The SRA continues to be supported by EBI & DDBJ, so in the short term this decision is unlikely to have much impact on day-to-day availability of the SRA data. It does, however, raise questions about the prospects for long-term storage of sequence data, and whether it makes more sense to store sequence data (raising problems related to long-term storage, accessibility, backup & format issues) or simply store the DNA.

I’m somewhat torn on this. On the one hand, if we are going to be sequencing every bacterium that comes into a public health/hospital microbiology lab, it makes no sense to store it all. On the other hand, I don’t think we know quite enough yet to decide which information we really need to extract and keep, and what is extraneous. Another major issue is that while sequence data is (relatively) easily shared, searched and otherwise accessed by groups all over the world, storage of strains or DNA necessarily entails individual labs not only coming up with long-term storage solutions for their samples, but also finding ways to make these accessible to other groups. I’ve heard too many horror stories about labs throwing out their “old strains” (for a variety of reasons ranging from containment level/health & safety requirements to simply having no money for additional freezer space) to rely on the long-term storage of DNA.

Genome Biology (open access) recently ran a Q&A on this issue with 5 researchers (David Lipman – NCBI; Paul Flicek – EBI; Steven Salzberg; Mark Gerstein; Rob Knight).

Some of these guys point out that other databases are actually more useful for NGS storage, especially for metagenomics/bacterial 16S data. I find it interesting that some of them seemed not to realise (or at least not to acknowledge in their answers) that the SRA is not just an NCBI/NIH/US venture, but is in fact an international effort and continues to be maintained at EBI.