SNP analysis and plasmid copy number among 5 outbreak E. coli

Since the Illumina (MiSeq) reads from 5 HPA genomes were recently released, I thought it would be interesting to compare these to the ‘complete’ assembly of TY2482 from BGI and look for substitution mutations and read depths for plasmids vs chromosome.

SNP analysis

I decided to try using Nesoni, the python-based analysis pipeline written by Paul Harrison & Torsten Seeman from the Victorian Bioinformatics Consortium here in Melbourne. It uses shrimp2 to map reads, and samtools and to generate and process sorted alignments and call variants. It has a nice feature where, with a single command (nesoni nway) you can generate a table showing an n-way comparison of consensus allele calls in a set of genomes, at each of the loci called as a variant in any genome.

The complete resulting output is here. It reports not only the consensus call, but the evidence behind the call, so it’s simple to see whether you believe it or not. The result is 220 calls, of which I might believe 9 (pink in figure below, and Excel file here).

I mapped MiSeq reads from 4 of the 5 HPA genomes (shrimp2 didn’t like the fastq file for sample 280, need to sort this out) and the HiSeq reads from BGI’s TY2482, to the complete reference assembly for TY2482. For 194 of the 220 variants called, the TY2482 read mapping resulted in a variant call compared to the TY2482 reference, which means that the variant is unlikely to be real. This could happen for a variety of reasons relating to the mapping & variant calling process, and I was just using the default settings in Nesoni so some tweaking might remove these. In any case, I will ignore these variants for now because I don’t believe they are real (but you can see the full table here).

This leaves 24 variant calls, where the allele detected in one or more of the 4 HPA genomes is different from the TY2482 assembly+reads from BGI, shown in the table below. This includes 9 in the chromosome (highlighted in pink), with 2 SNPs called in all 4 genomes; 1 SNP called in sample 283 only, 3 SNPs called in sample 540 and 3 SNPs called in sample 541.

SNPs identified in HPA strains compared to the complete BGI assembly for TY2482

In pTY1, which is the IncI plasmid bearing the beta-lactamase CTX-M-15 gene, the variants detected (yellow above) were all within the shufflon proteins…this region is able to ‘shuffle’ via inversions between homologous sequences, and these variant calls will most likely represent shuffling in the region rather than point mutation. In fact, the alignment suggests that there are multiple versions of these sequences in each DNA sample, suggesting that the shufflon was active and generating a mixed population of plasmids in the HPA data (but not the BGI strain TY2482, which had homozygous calls in this region). This might be worthy of some further investigation by someone who understands shufflons a lot better than I do!

Finally, there were a few variant calls in the tiny plasmid pTY3, clustered within its rep gene. These calls are heterozygous (see table) in all four HPA strains, suggesting that the mapping is picking up two different versions of the rep gene, which could be due to homology with other replication proteins in plasmids pTY1 and pTY2.

Plasmid copy number

The massive variation in read depth for plasmid sequences compared to the chromosome reminded me it might be interesting to try to infer the average copy number for each plasmid based on read depth. To do this, I used the depth plots output by nesoni (which gives the mean read depth per base in the reference sequence, based on read mapping). I calculated the mean read depth across each reference sequence (ie the completed BGI TY2482 assembly, chromosome + 3 plasmids) from this, and then calculated the ratio of read depths for plasmid:chromosome. Assuming each bacterial cell has ~1 copy of the chromosome (i.e. ignoring cells caught in the act of replication when there will be >1), this should give an approximation of the mean copy number of each plasmid per cell. We know some plasmids are maintained quite stably at 1 per cell, while others can exist at high copy number. This is the result:

Mean read depth and mean plasmid copy number for outbreak strains

So for the TY2482 data, it looks as though the IncI1 resistance plasmid (pTY1) and the aggregative adhesion plasmid (pTY2) are maintained at ~1 per cell, while the mini plasmid (which contains little more than a plasmid replication gene) is present at ~9 per cell. This is pretty much in line with expectation.

Interestingly, the HPA strains appear to have much higher copy numbers, around 20 per cell for pTY1 and pTY2 and hundreds of copies of pTY3. The numbers are pretty consistent across the HPA strains, but are remarkably higher than in TY2482.

I don’t have a good explanation for this apparent difference…. it could be an artefact in the sequencing (MiSeq likes plasmid DNA???) or in the mapping (not sure how this could be, especially since the mean depth plots produced by nesoni exclude regions that map to multiple locations in the reference genome).This could be examined by looking at results from different mapping programs, or analysis of reads from different platforms (Ion Torrent for TY2482 & LB226692, 454 reads for the HPA & C2L genomes).

If it is a real difference, I wonder if it could be differences in growth rate or culture conditions in the two labs. Or a mutation in the chromosome that affects the normal control of plasmid replication? Could having lots of copies of the aggregative adhesion plasmid enhance virulence or transmission of the bug?

Point mutations (SNPs) in outbreak strain relative to Ec55989

This post is really a reply to PatrikD’s comments on my last post about the SNP-based phylogeny of E. coli, showing the close relationship between the outbreak EAEC O104:H4 strain and the finished genome sequence of Ec55989 (well, chromosome). I can’t post much in a comment reply so it requires a new post.

PatrikD said:

Since I’m looking for SNPs that agree between the three assemblies, I’m not too worried about remaining sequencing errors. Given how closely these strains were sampled, I’m not interested in any minor differences between them, but more in the difference between 55989 and the three outbreak strains.

Now, I think the best way to do that at this point would be to use the SNP calls from HiSeq reads of TY2482. I get 1781 high quality SNP calls compared to Ec55989, see table here (using bwa & samtools to map reads and call SNPs). Most of these are in prophage regions (annotated in this artemis-readable file and flagged in the SNP table in the ‘mobile’ column); excluding these there are 564 probable SNPs (SNP table, artemis-readable file).

This is their distribution (dark blue = phage regions and IS elements; red = SNP calls overlapping with these mobile regions; green = 564 SNP calls not in these regions; inner ring=GC content):

Green = SNPs in the outbreak genome (TY2482, Hiseq) compared to Ec55989

Of the 1230 SNPs in mobile regions (i.e. subject to horizontal transfer or homologous recombination with typically horizontally transferred genes), there are 713 non-synonymous mutations (i.e. changes the encoded amino acid) and 424 synonymous changes (i.e. silient mutations, not affecting the encoded amino acid sequence)…while it is dodgy to do dN/dS on this kind of intraspecies data, it would be ~0.5, consistent with purifying selection.

For the 564 SNPs in non-mobile regions of the genome (i.e. not subject to horizontal transfer or homologous recombination with typically transferred genes), 284 are non-synonymous and 145 synonymous (dN/dS~0.6). Again it’s tempting to call this purifying selection, as it looks like mutations affecting protein function are being selectively removed, relative to those silent mutations. Also, 136 SNPs (24%) were in intergenic regions. Since only 13% of the chromosome is non-coding, this suggests that mutations in non-coding regions are better tolerated than those in coding sequences, again consistent with purifying selection against mutations affecting protein function.

There are 4,762 genes annotated in Ec55989, and only 564 SNPs…so the majority of genes don’t have any point mutations (although I haven’t looked at insertion/deletions yet, only subsitution mutations where one DNA base is exhanged for another). But do any genes have a high number of changes?

The answer is yes, but these mostly look like recombination events:

ibrA, an immunoglobulin-binding regulator, has 20 SNPs, mainly in the C-terminal domain. It’s quite likely this is a recombination event rather than diversifying selection, but deserves a closer look. A neighbouring gene, EC55989_3315 (conserved hypothetical) also has 18 SNPs, so recombination is a likely culprit.

The Ag43 gene, EC55989_3357, has 17 SNP calls, however we know from the assemblies that the outbreak strain has acquired divergent copies of Ag43 genes [see posts on analysis of integrated sequence, and biofilm genes], so these SNP calls probably reflect the differences between two distinct copies of Ag43 genes rather than point mutations acquired in the EC55989_3357 gene itself.

There are 6 SNPs in EC55989_4669, an Iha adhesin receptor. This could be real, or could be a divergent copy acquired by the outbreak strain.

Hypothetical proteins EC55989_3283, EC55989_3284, EC55989_3293, EC55989_3355, EC55989_4845, EC55989_4890, EC55989_4891, EC55989_4896, EC55989_4897 each have 3-4 SNPs. These are probably phage remnants or similar to phage genes, but some followup analysis could confirm this. There are 7 SNPs in shiA, a fragment of a gene similar to a shigella SHI-2 pathogenicity island gene. EC55989_4671 is an IS element that slipped through my filter, and contains 5 SNP calls.

SNP-base phylogeny confirms similarity of E. coli outbreak to EAEC Ec55989

Thanks to Konrad Paszkiewicz from University of Exeter for this SNP-based analysis of the 3 E. coli outbreak genomes.

He used MUMmer to compare each complete E. coli genome available in NCBI to the Ec55989 chromosome, and identify single nucleotide polymorphisms (SNPs, i.e. substitution mutations, where one DNA base is swapped for another). He ignored SNPs in regions that are repeated in the Ec55989 genome, since we can’t distinguish whether these represent true substitution mutations in homologous sequences (relevant phylogenetic signal), or variation between copies in the same genome (noise). The results are available here.

He did the same comparison for the three outbreak genome assemblies, including the strain TY2482 from BGI (NCBI: AFOB00000000; Illumina HiSeq+Ion Torrent), strain LB226692 from Life Sciences (NCBI: AFOG00000000; Ion Torrent), and strain H112180280 from the UK Health Protection agency (454 mate pair). In addition he did the same comparison for the first MIRA assembly of TY2482 (Ion Torrent) and I analysed SNPs in Ty2482 using the Illumina HiSeq reads released by BGI (using bwa and samtools).

Konrad then used the annotation of Ec55989 to identify the coding change cause by each SNP – i.e. whether it is in a gene, and whether the DNA base change results in a change in the encoded protein sequence (non-synonymous SNP) or not (synonymous SNP). See resulting table here.

I used this annotation to remove SNP calls in genes annotated with the keywords ‘phage’, ‘transposase’, ‘insertion sequence’ or ‘IS’ in an effort to remove obviously horizontally transferred DNA…since this is not only subject to divergence by mutation (the phylogenetic signal we are looking for) but also to recombination. The result is just over 400,000 SNPs (nearly 10% of the Ec55989 chromosome). The alignment of those SNPs is available here.

I used SplitsTree4 to draw a phylogenetic network for the alignment. The result clearly shows that the outbreak genomes (green) are very similar to Ec55989 (also green, labelled ‘reference’), and very different to other sequenced E. coli (note in particular the group of EHEC O157:H7 on the middle left, which are very distant from the outbreak strains). Since this is a network, it would reveal if there were major recombinations between this strain and the other E. coli chromosomes, which there aren’t. This confirms that the outbreak strain is truly an EAEC, with very close similarity to Ec55989 and not to classical EHEC. This is backed by the presence of an EAEC plasmid, carrying aggregative adhesion fimbrial cluster I (AAF/I; agg operon; see this post for details).

I also used SplitsTree4 to draw a phylogenetic tree of the data, using K2P distances and the BioNJ tree-drawing algorithm. This shows the same result, with all three outbreak strain’s genomes clustering very tightly with the EAEC strain Ec55989.

So where are the SNPs located? This figure shows the distribution of SNPs around the Ec55989 chromosome:

SNPs relative to the Ec55989 chromosome

The outer blue rings indicate the Ec55989 genes encoded on the forward (outermost) and reverse strands. The purple/green wiggly line indicates the full set of SNPs found among all avaialable E. coli chromosomes in NCBI…this is a relative plot, so purple bits indicate where SNPs are relatively rare (lower density than the average across the genome) and green bits indicate where SNPs are particularly common (higher density than average across the genome).

The red ring indicates the location of SNPs that are shared among the three outbreak genomes compared to Ec55989, it that distinguish the outbreak from Ec55989 (~600 SNPs, or 0.12% divergence). For comparison, commonly circulating Salmonella enterica serotype Typhi, which are considered a tight-knit clonal group compared to other Salmonella and cause typhoid fever as opposed to the gastroenteritis caused by most Salmonella enterica, can differ from each other by >600 SNPs. So the outbreak strains should be considered quite closely related to the EAEC diarrhea strain Ec55989. In fact, some of these SNPs (red ring) are clustered together, suggesting that they actually represent variation in horizontally transferred sequences such as phage rather than genuine substitution mutations…so the real number of substitution mutations differentiating the outbreak strain from Ec55989 is probably more like 200-300, or 0.05% divergence.

The inner rings (green, blue, black) show the location of SNPs that are specific to just one of the three outbreak genomes. For the HPA and LB226692 strains these are VERY likely to be false SNP calls, due to the fact that they are reliant on Ion Torrent and 454 data which have trouble distinguishing between homopolymeric tracts (e.g. GGGTTT can easily be misread as GGTTTT, masquerading as a substitution of G->T). For TY2482, where we have Illumina read data to call SNPs (which doesn’t have this homopolymer issue), we find only 2 SNPs that are unique to Ty2482 compared to the other two outbreak strains (black inner ring). This is easier to believe, as we expect very small numbers of mutations to accumulate during the few weeks of evolution that separate these sequenced isolates.