This post is really a reply to PatrikD’s comments on my last post about the SNP-based phylogeny of E. coli, which showed the close relationship between the outbreak EAEC O104:H4 strain and the finished genome sequence of Ec55989 (well, its chromosome). I can’t fit much into a comment reply, so it needs a new post.
Since I’m looking for SNPs that agree between the three assemblies, I’m not too worried about remaining sequencing errors. Given how closely these strains were sampled, I’m not interested in any minor differences between them, but more in the difference between 55989 and the three outbreak strains.
Now, I think the best way to do that at this point is to use the SNP calls from the HiSeq reads of TY2482. Using bwa and samtools to map the reads and call SNPs, I get 1781 high-quality SNP calls compared to Ec55989 (see table here). Most of these are in prophage regions (annotated in this Artemis-readable file and flagged in the ‘mobile’ column of the SNP table); excluding these, there are 564 probable SNPs (SNP table, Artemis-readable file).
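The mobile-region filter can be sketched roughly as follows. This is a minimal illustration rather than the script actually used for the table: the coordinates are made up, and the function names (`merge_regions`, `in_mobile`) are hypothetical. It assumes mobile regions are given as 1-based inclusive (start, end) pairs.

```python
from bisect import bisect_right

def merge_regions(regions):
    """Merge overlapping or touching (start, end) intervals (1-based, inclusive)."""
    merged = []
    for start, end in sorted(regions):
        if merged and start <= merged[-1][1] + 1:
            # Overlaps or abuts the previous interval: extend it.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged

def in_mobile(pos, merged, starts):
    """True if pos falls inside any merged interval (binary search on starts)."""
    i = bisect_right(starts, pos) - 1
    return i >= 0 and merged[i][0] <= pos <= merged[i][1]

# Hypothetical coordinates, for illustration only (not real Ec55989 features).
mobile_regions = [(1000, 5000), (4500, 6000), (20000, 21000)]
snp_positions = [500, 1000, 5500, 6001, 20500, 30000]

merged = merge_regions(mobile_regions)
starts = [s for s, _ in merged]

# SNPs outside every mobile region are kept as "probable" SNPs.
core_snps = [p for p in snp_positions if not in_mobile(p, merged, starts)]
print(core_snps)  # [500, 6001, 30000]
```

Merging the intervals first means each SNP needs only one binary search, which matters little for ~1800 SNPs but keeps the lookup simple.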
This is the distribution of the SNP calls around the chromosome (dark blue = phage regions and IS elements; red = SNP calls overlapping these mobile regions; green = the 564 SNP calls not in these regions; inner ring = GC content):
Of the 1230 SNPs in mobile regions (i.e. subject to horizontal transfer or homologous recombination with typically horizontally transferred genes), 713 are non-synonymous mutations (i.e. they change the encoded amino acid) and 424 are synonymous changes (i.e. silent mutations that do not affect the encoded amino acid sequence). While it is dodgy to do dN/dS on this kind of intraspecies data, it would be ~0.5, consistent with purifying selection.
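For anyone wanting to see where a figure like that comes from, here is a back-of-envelope sketch. It is not necessarily how the number above was derived: it assumes roughly three non-synonymous sites for every synonymous site (a common rule of thumb, passed in as the hypothetical `site_ratio` parameter) instead of counting sites codon by codon.

```python
def crude_dn_ds(n_nonsyn, n_syn, site_ratio=3.0):
    """Rough dN/dS from raw SNP counts.

    site_ratio is the ASSUMED number of non-synonymous sites per synonymous
    site; a proper estimate would count sites from the actual codons.
    """
    if n_syn == 0:
        raise ValueError("need at least one synonymous SNP")
    return (n_nonsyn / n_syn) / site_ratio

# Counts for SNPs in mobile regions, as quoted in the text.
mobile = crude_dn_ds(713, 424)
print(f"{mobile:.2f}")  # 0.56
```

The result lands in the same ballpark as the ~0.5 quoted, and it shifts with the assumed site ratio, which is one more reason to treat the value as indicative only.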
For the 564 SNPs in non-mobile regions of the genome (i.e. not subject to horizontal transfer or homologous recombination with typically transferred genes), 284 are non-synonymous and 145 are synonymous (dN/dS ~0.6). Again it’s tempting to call this purifying selection, as it looks like mutations affecting protein function are being selectively removed relative to silent mutations. Also, 136 SNPs (24%) are in intergenic regions. Since only 13% of the chromosome is non-coding, this suggests that mutations in non-coding regions are better tolerated than those in coding sequences, again consistent with purifying selection against mutations affecting protein function.
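To put a rough number on the intergenic excess, one option is a one-sided binomial test. The sketch below assumes a simple null model in which each of the 564 SNPs lands in non-coding sequence independently with probability 0.13; that is an assumption for illustration (the 13% figure is for the whole chromosome, not just the non-mobile part), and `binom_sf` is a hypothetical helper written with the standard library only.

```python
import math

def binom_sf(k, n, p):
    """P(X >= k) for X ~ Binomial(n, p), with each term computed in log space."""
    log_p, log_q = math.log(p), math.log1p(-p)
    total = 0.0
    for i in range(k, n + 1):
        log_term = (
            math.lgamma(n + 1) - math.lgamma(i + 1) - math.lgamma(n - i + 1)
            + i * log_p + (n - i) * log_q
        )
        total += math.exp(log_term)  # tiny terms underflow harmlessly to 0.0
    return total

n_snps = 564          # SNPs outside mobile regions (from the text)
n_intergenic = 136    # of which intergenic (from the text)
p_noncoding = 0.13    # non-coding fraction of the chromosome (from the text)

expected = n_snps * p_noncoding
p_value = binom_sf(n_intergenic, n_snps, p_noncoding)
print(f"expected intergenic SNPs under the null: {expected:.1f}")  # 73.3
print(f"one-sided p-value: {p_value:.3g}")
```

Under this null, 136 observed against roughly 73 expected is a large excess, though the test says nothing about why the excess exists.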
There are 4,762 genes annotated in Ec55989 and only 564 SNPs, so the majority of genes don’t have any point mutations (although I haven’t looked at insertions/deletions yet, only substitution mutations where one DNA base is exchanged for another). But do any genes have a high number of changes?
The answer is yes, but these mostly look like recombination events:
ibrA, an immunoglobulin-binding regulator, has 20 SNPs, mainly in the C-terminal domain. It’s quite likely this is a recombination event rather than diversifying selection, but it deserves a closer look. A neighbouring gene, EC55989_3315 (conserved hypothetical), also has 18 SNPs, so recombination is a likely culprit.
The Ag43 gene, EC55989_3357, has 17 SNP calls. However, we know from the assemblies that the outbreak strain has acquired divergent copies of Ag43 genes [see posts on analysis of integrated sequence, and biofilm genes], so these SNP calls probably reflect the differences between two distinct copies of Ag43 genes rather than point mutations acquired in the EC55989_3357 gene itself.
There are 6 SNPs in EC55989_4669, an Iha adhesin receptor. This could be real, or could be a divergent copy acquired by the outbreak strain.
Hypothetical proteins EC55989_3283, EC55989_3284, EC55989_3293, EC55989_3355, EC55989_4845, EC55989_4890, EC55989_4891, EC55989_4896 and EC55989_4897 each have 3-4 SNPs. These are probably phage remnants or similar to phage genes, but some follow-up analysis could confirm this. There are 7 SNPs in shiA, a fragment of a gene similar to a Shigella SHI-2 pathogenicity island gene. EC55989_4671 is an IS element that slipped through my filter, and contains 5 SNP calls.
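Per-gene tallies like the ones above are easy to reproduce from a SNP table. The sketch below is only an illustration: the rows and gene names (`geneA` etc.) are invented, and it assumes each SNP row has already been annotated with the gene it falls in (or `None` for intergenic SNPs).

```python
from collections import Counter

# Hypothetical (position, gene) rows; None marks an intergenic SNP.
snp_rows = [
    (1200, "geneA"), (1250, "geneA"), (1301, "geneA"),
    (5400, "geneB"),
    (9100, None),
    (9900, "geneC"), (9950, "geneC"),
]

def snps_per_gene(rows):
    """Count SNPs per gene, ignoring intergenic SNPs."""
    return Counter(gene for _, gene in rows if gene is not None)

def hotspots(counts, min_snps):
    """Genes with at least min_snps SNPs, most SNP-dense first."""
    return [(gene, n) for gene, n in counts.most_common() if n >= min_snps]

counts = snps_per_gene(snp_rows)
print(hotspots(counts, min_snps=2))  # [('geneA', 3), ('geneC', 2)]
```

Raw counts favour long genes, so dividing by gene length would be a sensible next step before reading much into any single hotspot.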