New German STEC/EHEC data from BGI

June 13. BGI has now formally released their data, including Illumina reads, under Creative Commons 0 (CC0) license. This is the most open license possible, and includes this statement:

The person who associated a work with this deed has dedicated the work to the public domain by waiving all of his or her rights to the work worldwide under copyright law, including all related and neighboring rights, to the extent allowed by law.

They have set up this page with links to, and details of, all the data available. I understand HPA will do something similar.

This is an important step for crowdsourcing! In particular, it clarifies the position that this data is released for the benefit of the public & public health, and there are no restrictions on publication of data and analysis. This is important. The genomics community has long been used to pre-publication data release of ‘community resource’ genomics data (begun during the Human Genome Project), on the understanding (in broad terms) that the research community would not attempt to publish analysis of the released data until the data generators had completed their data generating and had a chance to publish their own analysis. In short, releasing data does not imply that others have the right to publish analysis of it. But the CC0 license does.

I’ve been asked by a few people about this, over the past week. There is understandably some discomfort about people publicly sharing analysis of data that they did not generate. My position has been that the data generators released their data freely for public analysis (with public health as the driver), and this is very different from a research project where they have released the data with the expectation that no-one will ‘publish’ anything before they themselves have. But it is important to formalise this…. I guess it just takes time for the legalities to catch up.

For the crowd-sourcers’ part, all the analysis posted at the GitHub E. coli O104:H4 Genome Analysis Crowdsourcing site will be available under the CC0 license too; this is being formalised today thanks to the folks at ERA7.

—–

So BGI have released a new assembly of TY2482 [here], which they say includes 200x of data from HiSeq. I’m concerned that they’ve not released any reads this time, and still no information on how they are doing the assembly. How long are the reads? Were they paired end? What insert size? Why only 200x coverage from a HiSeq run…even with short reads (35 bp) single end I’d expect a lot more than that…
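To put rough numbers on the coverage question, here is a minimal back-of-the-envelope sketch. It uses the standard relation that mean coverage is total sequenced bases divided by genome size. The genome size of 5.3 Mb is an assumed round figure for illustration, not a value taken from BGI's release, and the function names are my own.

```python
import math


def expected_coverage(n_reads: int, read_length: int, genome_size: int) -> float:
    """Mean depth of coverage = total sequenced bases / genome size."""
    return n_reads * read_length / genome_size


def reads_needed(coverage: float, read_length: int, genome_size: int) -> int:
    """Number of single-end reads needed to reach a given mean coverage."""
    # Round up: a fraction of a read still means one more read is required.
    return math.ceil(coverage * genome_size / read_length)


# ASSUMPTION: approximate genome size in bp, chosen for illustration only.
GENOME_SIZE = 5_300_000

# 35 bp is the short-read case guessed at in the post; 90 bp is the read
# length reported for these data in the comments.
for read_length in (35, 90):
    n = reads_needed(200, read_length, GENOME_SIZE)
    print(f"{read_length} bp single-end reads: {n:,} reads for 200x")
```

Under that assumed genome size, 200x works out to roughly 30 million reads at 35 bp, or roughly 12 million reads at 90 bp.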

The new assembly includes more than 200x single-end reads from the Illumina HighSeq Platform, which allowed BGI to provide a more complete genome map and to correct any assembly errors from the previous version. More importantly, this version is a completely de novo assembly, whereas the previous versions by BGI and others used a reference-based assembly method to obtain a consensus sequence. The new assembly continues to support the finding that this infectious strain carries disease-causing genes from two types of pathogenic E. coli: enteroaggregative E. coli(EAEC) and enterohemorrhagic E. coli (EHEC).

To properly analyse this, the reads are really important, so why hasn’t BGI released them? For crowd sourcing analysis to work (and be worth the crowd’s effort) the full data needs to be public. BGI released the reads from their first 7 Ion Torrent runs plus an assembly, which was really useful. A lot of the annotation and analysis effort has been based around Nick Loman’s assembly of these reads using MIRA, and ERA7’s annotation of that assembly [see collated analyses at github]. In comparison, the BGI assembly has been drawn on less by analysts, because there is no detail about how it was created. So releasing reads worked for BGI first time around…why not now?
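As an illustration of why the reads matter: once a FASTQ file is public, basic questions such as "how many reads?" and "how long are they?" can be answered in a few lines. The sketch below is a generic summary, not anything BGI or the crowdsourcing group has published. It assumes plain four-line FASTQ records (no wrapped sequence lines), and the function name is hypothetical.

```python
import gzip
from collections import Counter


def summarise_fastq(path):
    """Return (read count, total bases, read-length tally) for a FASTQ file.

    ASSUMPTION: each record is exactly four lines (header, sequence,
    '+' separator, qualities), which is the usual layout for Illumina
    output but is not guaranteed by the format.
    """
    # Transparently handle gzipped files based on the file extension.
    opener = gzip.open if path.endswith(".gz") else open
    lengths = Counter()
    with opener(path, "rt") as handle:
        for i, line in enumerate(handle):
            if i % 4 == 1:  # second line of each record is the sequence
                lengths[len(line.strip())] += 1
    n_reads = sum(lengths.values())
    total_bases = sum(length * count for length, count in lengths.items())
    return n_reads, total_bases, lengths
```

Calling `summarise_fastq("reads.fq.gz")` (a placeholder filename) gives the read count and total bases; dividing the total bases by the genome size gives the mean coverage, which can then be checked against the figure quoted for an assembly.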

Life Tech released only an assembly and has still not released reads (Ion Torrent) several days later. Similarly, their assembly has not been publicly annotated or been the focus of concerted analysis efforts, because any such work would depend on an assembly that we don’t know much about.

So I guess the point is, crowd-sourcing analysis relies on transparent data release. If we can’t see the raw data, we’re less likely to trust an assembly and proceed with analysis.

P.S. The press release from BGI implies that all prior assemblies were reference-based and not de novo. However, the MIRA assembly of runs 1-5 of BGI reads, which has been annotated by ERA7 and analysed extensively, was a de novo assembly.

Update: Ion Torrent reads now available for LB226692, from Life Technologies

8 thoughts on “New German STEC/EHEC data from BGI”

  1. Hi Kat,

    Just to say the reads on BGI’s website are single-end 90bp and they were part of a multiplexed lane, which explains the relatively low coverage.

    It would of course help if BGI explained this in their README file.

    I’m still checking to see that they map to the reference genomes.

    All the best!

    • Thanks for this Marina, I was hoping you would 🙂

      I am really impressed with your annotation, it has made things that much easier to analyse.

      Could I just point out one thing? It is understandable but confusing to call the beta-lactamase CTX-M-3, as it is actually a CTX-M-15. I know that by homology you will get CTX-M-3 coming up, but in this case the correct allele is 15. The numbers at the end designate variants which usually differ by just one base/aa, so although it doesn’t seem like much, this distinction actually helps to differentiate among the huge proliferation of these genes.

      This is not a criticism, just a minor point in the scheme of things.

      Great job, and I hope to get time to take a proper look at the new annotation! When you get a chance, I’d be interested to know more about your service – how much do you charge? Do you sell software or just individual analyses?

      All the best,

      Kat

  2. Hi Kat,

    thanks for the correction to the annotation. Our system is not able to distinguish two proteins that differ by only one or two amino acids. It’s an automated system, and we always encourage people to take a close look and check these details manually. Besides, I think this is something everyone should do with any automated annotation.

    Anyway, we’re thinking about incorporating people’s manual annotations into ours, so that in time we could have a very rich annotation done by the community. We’ll include your correction in the annotation; it’d be great if you could tell me the gene ID of the beta-lactamase CTX-M-3, just to be sure we correct the right gene.

    Regards

    Marina

  3. Hey,

    I forgot a very important thing. Thanks very much for taking a look at our annotations! 😀

    It’s very exciting to see how suddenly a lot of people have started working on a dataset and sharing results and findings. This crowdsourcing thing is great! 🙂

  4. Pingback: Are there plasmids in the E. coli TY-2482 genome? | the oh no sequences! blog

  5. Hi Kat,

    Sorry for only just getting to this, but our Illumina reads have been available; we just didn’t highlight them properly in the readme file. You can get them here: http://ftp.genomics.org.cn/pub/Ecoli_TY-2482/110601_I238_FCB067HABXX_L3_ESCqslRAADIAAPEI-2_1.fq.gz
    I’ve been trying to clarify a few things on the excellent github wiki and have also put a portal page together to make things clearer in the future. I’m sorry if, in the mad rush of producing all this data, we have not invested as much time as we should have in providing all the background information.

    Love the work you and all of the crowdsourcers are doing, and please let us know if you ever need further clarification or information on anything. We’ve just released another assembly this morning, so I’m sure some people will have a busy weekend.

    Best wishes,

    Scott

  6. Pingback: Rum and Reason » Some O104:H4 Links [Mike the Mad Biologist]
