Genome comparisons for 4 available outbreak genomes

Two new genomes were released today (well today my time, yesterday European time!) by the Göttingen Genomics Lab. They say:

We just released the 454 data of another two isolates from the German E. coli outbreak. You can find it on our website:
http://www.g2l.bio.uni-goettingen.de/
The link to the ftp server is ftp://134.76.70.117/
User name and password are ‘EAHEC_GOS’.

We just released the 454 assembly data of another two isolates (GOS1
and GOS2) from the German E. coli O104:H4 outbreak.
The shotgun libraries were sequenced on a GS FLX using Titanium
chemistry. 1.5 medium lanes of a Titanium picotiter plate was used for
each strain. Reads were assembled de novo using the Roche Newbler
assembly software 2.3.
We’ll submit the read data in the NCBI SRA as well.

There is also a new assembly from BGI which I’ve not looked at yet, go here to the github wiki to find out more.

I downloaded the first genome assembly (GOS1) but the second kept timing out, maybe they are a very popular download today! So here is the latest comparison of four available genomes from the outbreak, with their closest available reference sequences. This just illustrates what we knew from earlier analyses, and confirms that at least one of the newest genomes is pretty much the same as all the other outbreak genomes.

Comparison of 4 available assemblies (note there is a 5th but I couldn't get it to download!) For chromosome, phage and IncI plasmid, the reference sequences from NCBI are used. For the aggregative adhesion plasmid, it is so different to published references sequences that I have used the HPA scaffold as the reference, and mapped all others (including available EAEC reference plasmids and the agg operon) to this.

Advertisements

3 thoughts on “Genome comparisons for 4 available outbreak genomes

  1. I looked a bit at the fourth BGI assembly. It offers finished versions of the main chromosome of size 5278900 bp and three plasmids of size 88695, 75330 and 1549 bp, so vanquished are intermediate scaffolds, strings of Ns, the uncertain lower case atcg seen in the HPA, and conceivably the majority of homopolymer run and base calling errors.

    BGI notes “roughly 5000 coding proteins” encoding some 1,532,203 amino acids (an 87% payload) on the main chromosome, suggesting only an initial pass since 800-900 previously annotated proteins would be missing. For comparison, the chromosome of E coli K-12 consists of 4,639,221 bp encoding 4290 proteins and that of O157:H7 5,444,000 bp encoding an additional 1346 proteins.

    In terms of possible base miscalls, I did see an early internal stop codon within YafB of the largest plasmid (pEC_Bactec) that could be corrected by taa -> caa (pyrimidine re-call) implied by the glutamine codon at this site in O157:H7; pseudogenization is implausible here because of the otherwise near-perfect agreement. However the horse plasmid has this very same stop codon, making me think in O157:H7 the initial methionine was mis-annotated (the single most common error at GenBank) and indeed other orthologous entries begin with MGKN.

    atgaattgtgatattattaaagagtaaaaccagccagccgaggtatttatgggaaaaaat
     M  N  C  D  I  I  K  E  -  N  Q  P  A  E  V  F  M  G  K  N 

    I didn’t see any notable candidates offhand for homopolymer run errors. These typically are off by 1-2 bp so throw off the reading frame, which manifests itself on all vs all blastp as hits broken into two pieces, which can also come about from middling indels or regions of non-alignment. These also arise very commonly from replication slippage mutations in compositionally simple regions, so the problem comes in deciding whether a given homopolymer ‘event’ is a sequencing problem or real. Here we have so many sequenced isolates that continuous reading frame in any of them can be taken as the preferred alternative. Note near-terminal shift in reading frame does not necessarily mean pseudogenization as read-through soon leads to another stop codon.

    I chased down what BGI meant with “the biggest plasmid is highly homologous to a previously sequenced plasmid isolated from a horse and carrying additional multi-drug resistance genes” to a [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2887853/?tool=pubmed June 2010 Plos article] by Smet et al. Their 92,970 bp plasmid, called pEC_Bactec (annotation view [http://www.ebi.ac.uk/ena/data/view/GU371927]) was isolated in 2009 or so from the joint of a Belgian horse with arthritis. It belongs to the incompatibility group IncI1 and carries among its [http://www.plosone.org/article/fetchSingleRepresentation.action?uri=info:doi/10.1371/journal.pone.0011202.s001 86 conserved open reading frames] two beta-lactamase genes blaTEM-1 and blaCTX-M-15 which are indeed (as required by the nomenclature) identical as proteins to those in the BGI human isolate assembly.

    The two resistance genes of pc_BacTec were [http://www.ncbi.nlm.nih.gov/pmc/articles/PMC2764225/ swapped in from another plasmid] with concommitant loss of 12 other genes, with gain of ampicillin and ESBL cephalosporin resistance but loss of arsenic, tetracycline, and streptomycin resistance and, equally puzzling, loss of two systems used to [http://www.ttk.pte.hu/biologia/genetika/bact_gen/refs/plasm_adict.pdf maintain plasmids in descendent cells], the toxin-antidote mck/kor addiction system and the parA/parB system for partitioning plasmid copies to daughter cells.

    So it is a bit mysterious what is now maintaining the pc_BacTec plasmid According conventional wisdom, there is either another trans acting addiction or partition system on the main chromosome or second large plasmid, or there is continuous selective advantage to the beta-lactamases (there being little else in the other plasmid genes to sustain it). [http://news.vin.com/VINNews.aspx?articleId=18659 According to the FDA], the United States used 28,900,000 pounds of antibiotics a year in confined animal feeding operations (CAFOs) plus an additional 7,300,000 pounds in humans, not to mention exports.

    If I may embellish on an earlier note of Kat on the significance of CTX-M-15, this enxyme is but a single step forward, namely D240G, in the post-Kluyvera evolutionary trajectory of ancestral CTX-M-3, “a diversification process can only be explained in a complex selective landscape with at least two antibiotics (cefotaxime and ceftazidime).” (Novais et al. [http://www.plospathogens.org/article/info%3Adoi%2F10.1371%2Fjournal.ppat.1000735 PLoS Pathog Jan 2010]).

    In other words, CTX-M-15 may tell us a bit where this plasmid has been, prior to joining the party at O104:H4. Not the natural world where syn-methoxyimino third-generation chemical derivatives of cephalosporin don’t exist, much less simultaneously at elevated concentration. CTX-M-15 has been recovered from chicken, turkey, gulls, pig, cattle, horse, and human. TEM-1 in contrast is an older widespread enzyme for garden variety ampicillins.

  2. I took another look at the pseudogene issue on June 18th using the new era7 proteomic analysis of the fourth and final BGI assembly. The era7 annotation up at github is very nicely done and provides columns allowing convenient extraction of proteins that seem to have internal stop codons (326 reported) and internal frameshifting indels (273). While their pipeline approach does an excellent job at finding candidates, not all of these can be expected to pan out upon manual curation.

    One thing sure to go awry is the selenoproteome. During translation, TGA stop codons in certain places in certain genes are interpreted as selenocysteine, our 21st amino acid U. This can give rise to false calls of internal stop codons. That is fixable today by era7 methodology if you recall the ten year battle to get GenBank to stop truncating selenoprotein entries and UniProt to stop replacing them with non-encoded cysteines is over. Here era7 could help by explicitly listing the stop codon. I determined that this strain of E. coli, like its cousin O157:H7, has the usual 3 selenoproteins in its formate dehydrogenases but no others (thanks to era7 work). This amounts to fixing 3 of the 6007 protein coding gene calls in the era7 annotation of the main chromosome and three plasmids. (There is a fair amount of transposase etc debris that can also be scrubbed.)

    I looked next at the TGA stop codon in E6AKY5, thinking as a 4Fe-4S redox protein it might have an unsuspected selenocysteine. However what happened here is two vaguely related genes on the same strand were cheek to jowl with little intervening sequence and no other stop codons than the TGA. It is clear enough blastp that this is not a case of a gene fusion followed by a stop codon mutation, it is simply two separate but juxtaposed genes with the first happening to have a conventional TGA stop codon. I don’t know how widespread this second category of miscall is.

    Finally, I collapsed the 4-5 verbose comment columns into a single field and pulled out the enzymes (the -ases) plus a seemingly lethal errant ribosomal protein. These, if validated, would be interesting because their loss indicates non-use (lack of selective pressure in the immediate reservoir of this strain). Here the first thing to look at would be position within the enzyme because extremely distal stop codons might shorten the protein trivially in area not needed for catalysis.

    E6AKY5  pre_stop  459  2x  4Fe-4S electron carrier 
    B7L9E5  pre_stop 1016  Se  Formate dehydrogenase-O
    B7LB28  pre_stop  715  Se  Formate dehydrogenase-H 
    B7L7H4  pre_stop 1015  Se  Formate dehydrogenase-N
    C8UE52  pre_stop  315  ..  Hydrogenase
    Q8ZGV9  pre_stop  490  ..  Betaine aldehyde dehydrogenase
    Q8Z2M4  pre_stop  850  ..  Trimethylamine-N-oxide reductase
    C6UXH9  pre_stop  777  ..  Biotin sulfoxide reductase
    Q4MVY2  pre_stop  324  ..  Acetyl-coenzyme A carboxylase
    B7LQL5  pre_stop  132  ..  Endoribonuclease
    Q06067  pre_stop  608  ..  histidine-protein kinase
    E9XIL4  pre_stop  356  ..  LPS heptosyltransferase
    Q7CGZ5  pre_stop  748  ..  PEP phosphotransferase
    B7LEX1  pre_stop  260  ..  Pseudouridine synthase
    B5YQB0  pre_stop  714  ..  Methylmalonyl-CoA mutase
    Q8Z913  pre_stop  652  ..  N-methyltransferase
    D8C0T9  pre_stop  112  ..  Cytosine methyltransferase
    Q0WKB1  pre_stop  266  ..  Peptidyl-prolyl cis-trans isomerase
    Q4MMX4  pre_stop  788  ..  cyclic-phosphodiesterase
    Q4MJ41  pre_stop  365  ..  Carbamoyl-phosphate synthase
    E3XVQ7  pre_stop  182  ..  Phenylacetic acid transferase  
    D3GWC2  pre_stop  247  ..  4'-phosphopantetheinyl transferase
    Q8Z8I8  pre_stop  341  ..  citrate N-acetyltransferase+
    C4ZUJ6  pre_stop  179  ..  30S ribosomal protein S7
    
  3. Fixing homopolymer run errors (frameshifts in coding regions seen by the era7 annotation of the final BGI assembly) won’t be easy. The first one I looked at, the lipopolysaccharide biosynthethic enzyme 3-deoxy-D-manno-octulosonate cytidylyltransferase (gene KdsB), has a string of 6 As that appear to be a homopolymer run error: upon blastx the highest scoring match is to two reading frames of EFZ43318.

    However what really happened here is GenBank entry EFZ43318 (from EPECa14 22x 454 Celera Assembler v6.0) has the homopolymer run error (5 As) but was submitted it to GenBank anyway. This could not be a pseudogene in EPECa14 because this enzyme is essential to viability except in supplemented media never encountered in nature.

    The homopolymer run error causes the stop codon to be in the wrong reading frame. The next stop codon is a ways downstream, causing a spurious C-terminal extension to the protein. That in turn caused a longer match to O154:H4 dna leading to a higher scoring hit despite the second reading frame needed. There are hundreds of correct entries, all shorter and lower scoring. The BGI sequence is thus correct.

    This won’t be easy for era7 to fix because they rely on the integrity of GenBank entries (hah!). GenBank however is filling to the brim with homopolymer run errors from next-generation sequencing projects, causing an error cascade to downstream to users treating them as fiducial.

    Finally, since era7 reliance on EFZ43318 caused 93 bp following the true stop codon to mis-assigned to it, I looked at what it might encode as it too would be misassigned. These bp do not belong to another protein because the next such feature in the genome, MukF killing factor KicB, kicks in at position 148. So far it appears that the intervening 147 bp are just dead wood.

    cttcgtgttctgtggtacggcgAAAAAAtccatgttgctgttgctcaggaagtt O154:H4  Exp = 2e-131 next best matches
     L  R  V  L  W  Y  G  E  K  I  H  V  A  V  A  Q  E  V  
    
    cttcgtgttctgtggtacggcgAAAAAtccatgttgctgttgctcaggaagttc EFZ43318 Exp = 9e-145 best match
     L  R  V  L  W  Y  G  E  K  S  M  L  L  L  L  R  K  F 
    
    >93extra bp following the true stop codon mis-assigned to kdsB, followed by 59 bp, followed by start of MukF
    ttcacttcacgacacttcagccaattttgggaggagtgtcgtaccgttacgattttcctcaatttttcttttcaacaattgatctcattcagg
    TGACATCTTTTATATTGGCGCTCATTATGAAAGCAGTAGCTTTTATGAGGGTAATCTGA
    atggaacagctgcgtgccgaattaagccatttactgggcgaaaaactcagtcgtattgagtgcgtcaatga

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s