Comparative genomic analysis of Shiga toxin-producing and non-Shiga toxin-producing Escherichia coli O157 isolated from outbreaks in Korea

Background The Shiga toxin-producing Escherichia coli (STEC) O157 strain NCCP15739 and non-STEC O157 strain NCCP15738 were isolated from outbreaks in Korea. We characterized NCCP15739 and NCCP15738 by genome sequencing and a comparative genomic analysis using two additional strains, E. coli K-12 substr. MG1655 and O157:H7 EDL933. Results Using the Illumina HiSeq 2000 platform and the RAST server, the whole genomes of NCCP15739 and NCCP15738 were obtained and annotated. NCCP15739 and NCCP15738 clustered with different E. coli strains based on a whole-genome phylogeny and multi-locus sequence typing analysis. Functional annotation clustering indicated enrichment for virulence plasmid and hemolysis-related genes in NCCP15739 and conjugation- and flagellum-related genes in NCCP15738. Defense mechanism- and pathogenicity-related pathways were enriched in NCCP15739 and pathways related to the assimilation of energy sources were enriched in NCCP15738. We identified 66 and 18 virulence factors from the NCCP15739 and NCCP15738 genome, respectively. Five and eight antibiotic resistance genes were identified in the NCCP15739 and NCCP15738 genomes, respectively. Based on a comparative analysis of phage-associated regions, NCCP15739 and NCCP15738 had specific prophages. The prophages in NCCP15739 carried virulence factors, but those in NCCP15738 did not, and no antibiotic resistance genes were found in the phage-associated regions. Conclusions Our whole-genome sequencing and comparative genomic analysis revealed that NCCP15739 and NCCP15738 have specific genes and pathways. NCCP15739 had more genes (410), virulence factors (48), and phage-related regions (11) than NCCP15738. However, NCCP15738 had three more antibiotic resistance genes than NCCP15739. These differences may explain differences in pathogenicity and biological characteristics. Electronic supplementary material The online version of this article (doi:10.1186/s13099-017-0156-2) contains supplementary material, which is available to authorized users.


Background
In 1983, outbreaks of EHEC O157:H7 in humans were first reported [1][2][3]. Since then, EHEC has been recognized as an important food-borne pathogen that causes hemorrhagic colitis and hemolytic uremic syndrome [4,5]. Shiga toxin (Stx) is the major virulence factor and a defining characteristic of EHEC. Shiga toxin-producing Escherichia coli (STEC) strains produce one or two major Shiga toxins, designated Stx1 and Stx2 [4]. Typical STEC strains possess a 35-kb locus of enterocyte effacement (LEE) pathogenicity island containing eae [6], which encodes an outer membrane protein (intimin) required for intimate attachment to epithelial cells; this pathogenicity island is also found in EPEC strains. LEE encodes a type III secretion system (TTSS) through which E. coli secretes proteins, resulting in the delivery of effector molecules to the host cell and disrupting the host cytoskeleton [7][8][9][10]. STEC strains cause hemolyticuremic syndrome and hemorrhagic colitis.
Numerous comparative genomics studies of STEC O157 and non-O157 STEC have been performed, but non-STEC O157 has not been a focus of past research. Few cases of non-STEC O157 have been reported in human patients with diarrhea [11]. Moreover, there are no whole-genome sequencing data or comparative genomics studies of non-STEC O157 strains. However, there was a recent outbreak of non-STEC O157 in human hosts in Korea [12]. Even though non-STEC O157 does not produce Shiga-like toxins, it could be a public health problem because it is pathogenic and causes diarrhea in humans. STEC NCCP15739 [13] and non-STEC NCCP15738 [12] were isolated from the feces of two Korean patients with diarrhea. To characterize NCCP15739 and NCCP15738 as well as the origin of pathogenicity, whole-genome sequencing and comparative genomic analyses using two additional strains, E. coli K-12 substr. MG1655 and O157:H7 EDL933 (as non-STEC and STEC reference strains, respectively), were performed.

Strain, isolation, and serotyping
Escherichia coli were isolated from patients with diarrhea using MacConkey agar and Trypticase Soy Broth containing vancomycin (Sigma Co., St. Louis, MO, USA). Candidate colonies were identified based on phenotypes and biochemical properties using the API20E system (Biomerieux, Marcy l'Etoile, France). The O antigen of the isolates was determined using the methods of Guinee et al. [14] with all available O (O1 to O181) antisera (Lugo, Spain, http://www.lugo.usc.es/ecoli). The isolated strains have been deposited in the National Culture Collection for Pathogens (NCCP) at the Korea National Institute of Health under accession numbers NCCP15739 [13] and NCCP15738 [12]. E. coli K-12 substr. MG1655 and NCCP15738 were used as reference strains for non-STEC and EHEC O157:H7 str. EDL933 was used as the reference strain for STEC.

Library preparation and whole genome sequencing
The Illumina HiSeq 2000 platform was used for the whole genome sequencing of NCCP15739 and NCCP15738 (Theragen Etex Bio Institute, Suwon, Republic of Korea).

Genome assembly and annotation
A de novo assembly was performed using SOAPdenovo (version 1.05) [15]. Only scaffolds longer than 500 bp were used for further analysis. Annotated open reading frames of the NCCP15739 and NCCP15738 genomes were identified using the RAST (Rapid Annotation using Subsystem Technology, version 4.0) [16] server. The genomes of two reference strains, K-12 substr. MG1655 and O157:H7 str. EDL933, were re-annotated using the RAST server. For the comparison of the coding sequences (CDSs) of the four strains, OrthoMCL (version 2.0.9) was used [17]. The sequence similarity and coverage [18] were considered simultaneously to assess the orthologous proteins of all four E. coli strains.
Functional annotation enrichment in the set of genes in NCCP15739 and NCCP15738 was performed using the Database for Annotation, Visualization and Integrated Discovery (DAVID) (http://david.abcc.ncifcrf.gov). To identify lineage-specific genes in the NCCP15739 and NCCP15738 genomes, the BLAST Score Ratio (BSR) was calculated. Unique genes with a BSR score of ≤0.4 were selected. A comparative KEGG metabolic pathway analysis was conducted for the total CDSs of NCCP15739 and NCCP15738 using Model SEED (version 1.0). To investigate the virulence factor genes in the four E. coli strains, a BLAST search of the total ORFs of the four E. coli strains against the virulence factor genes of E. coli listed in VFDB [19] was performed with an e-value threshold of 1e-5. We determined the antibiotic resistance genes in the genome sequences of the four E. coli strains using ResFinder 2.1 (https://cge.cbs.dtu.dk/services/ResFinder/) [20]. To compare the genomic structures among the four strains, the genomic scaffolds of NCCP15739, NCCP15738, E. coli K-12 substr. MG1655, and O157:H7 EDL933 were aligned using the progressive alignment algorithm of Mauve (version 2.3.1) [21]. After the alignment, the scaffolds of NCCP15739 were reordered against the complete genome of E. coli O157:H7 EDL933 using the Move Contig tool of Mauve. The scaffolds of NCCP15738 were reordered against the genome sequence of E. coli K-12 substr. MG1655. The BLAST algorithm was used to identify syntenic genes and to analyze the genes of interest. The resulting reordered scaffolds and syntenic genes were visualized using Circos (version 0.64) [22].

Analysis of mobile genetic elements
To identify insertion sequences (ISs), all ISs were downloaded from the IS Finder DB (http://www-is.biotoul. fr), and the genome sequences of four E. coli strains, NCCP15739, NCCP15738, K-12 substr. MG1655, and EDL933, were mapped to the sequence database using RepeatMasker (version 4.0.1) (http://www.repeatmasker.org). Phage-associated regions in the genome sequences of the four E. coli strains were predicted using the PHAST server [31]. Genomic scaffolds, including prophages, were confirmed based on the RAST annotation results.

Quality assurance
The genomic DNAs were purified from a pure culture of a single bacterial isolate of NCCP15739 and NCCP15738. Potential contamination of the genomic libraries by other microorganisms was evaluated using a BLAST search against the non-redundant database.

General features
The draft genome size of NCCP15739 was 5,373,767 bp and NCCP15738 was 5,005,278 bp. The G+C contents of NCCP15739 and NCCP15738 were 50.25 and 50.65%, respectively. The genomic features of E. coli strains used in the analysis, including NCCP15739 and NCCP15738, are summarized in Table 1. Based on a RAST analysis, 5190 putative CDSs from NCCP15739 and 4780 putative CDSs from NCCP15738 ( Fig. 1; Additional file 1: Table S1) were identified. The syntenic regions between NCCP15739 and three other E. coli strains based on a BLAST search are depicted on the reordered contigs of NCCP15739 in Fig. 1.

Phylogenetic analysis
A whole-genome phylogenetic analysis of 44 E. coli strains revealed that NCCP15739 is closely related to the pathogenic E. coli Xuzhou21 and TW14588 (Fig. 2a). However, a multilocus sequence analysis showed that NCCP15739 is closely related to O157:H7 serotypes, such as E. coli O157:H7 Sakai, EDL933, TW14588, and E. coli Xuzhou21 (Fig. 2b). The serotype O157:H7 clustered into a recently diverged group according to the MLSTbased phylogeny. Based on the whole-genome phylogenetic analysis, NCCP15738 was grouped with UMNK88 ( Fig. 2a), but it grouped with DH1 (ME8569) based on MLST analyses (Fig. 2b). The clusters in the wholegenome phylogenetic tree and the MLST phylogenetic tree were different; we think the difference comes from how many genotypes were considered in the phylogenetic analysis. The whole-genome phylogenetic tree considered all of variation throughout the whole-genome, but MLST phylogenetic tree only considered the genotypes of the seven MLST genes. Based on the phylogenetic analysis, we concluded that NCCP15739 and NCCP15738 are different strains belonging to their own groups.

Functional annotation clustering
Based on BSR scores, we selected 534 genes from NCCP15739 and 651 genes from NCCP15738 for functional annotation clustering (Additional file 1: Table S1). According to this analysis, 534 genes of NCCP5739 were classified into 7 groups and 651 genes of NCCP15738 were classified into 8 groups. In NCCP15739, the virulence plasmid and hemolysis-related genes were enriched, while the NCCP15738 genome exhibited enrichment for conjugation-and flagellum-related genes ( Table 2). In particular, the flagellum is an important characteristic of NCCP15738 because the strain has a dual flagellar system [12], like those found in Vibrio parahaemolyticus, Aeromonas spp., and Rhodospirillum centenum [32]. NCCP15738 had 65 genes encoding flagellar biosynthesis or structural proteins.

Metabolic pathway comparison
Based on a metabolic pathway comparison, we found that seven pathways were more developed in NCCP15739 than in NCCP15738. Genes in the pathways that determine folate biosynthesis, purine metabolism, amino sugar metabolism, atrazine degradation [33], urea cycle, amino acid metabolism, and the biosynthesis of siderophores [34][35][36] were more highly enriched in NCCP15739. For example, the folate biosynthesis pathway had more genes in NCCP15739 than in NCCP15738 (Additional file 2: Table S2). Folate is important for frequent divisions and rapid cell growth because it is required for methylation reactions and nucleic acid synthesis [37]. The pathways enriched in NCCP15739 were closely related to defense mechanisms and the pathogenicity of bacteria. NCCP15739 is pathogenic and causes hemolytic-uremic syndrome in the host [13]. By contrast, sixteen pathways were more developed in NCCP15738. The enriched pathways in NCCP15738 were responsible for the assimilation of various energy sources (Additional file 2: Table S2). Genes in the pathways that determine tyrosine metabolism, pentose and glucuronate interconversion [38], phenylalanine metabolism [39], galactose metabolism [40], glycerolipid metabolism, and ascorbate and aldarate metabolism were more highly enriched in NCCP15738. A comparative genomic analysis with the reference strains E. coli K-12 substr. MG1655 and O157:H7 EDL933 showed that NCCP15738 has a dual flagellar system [12]. However, we did not observe its locomotion and did not test its function in the strain; the structure and function should be investigated in further studies.

Virulence factors
We detected 66 and 18 [12] virulence factors from NCCP15739 and NCCP15738, respectively (Additional Fig. 1 Circular map of the NCCP15739 and NCCP15738 draft genomes. Circular map of genes and genome statistics were visualized for NCCP15739 and NCCP15738 using Circos (version 0.64). All CDSs are syntenic regions of NCCP15739 that were determined using BLAST searches file 3: Table S3). All 18 virulence genes of NCCP15738 were shared with NCCP15739; NCCP15738 did not contain any unique virulence factors. The 66 virulence genes of NCCP15739 were grouped into 7 categories: adherence, autotransporter, iron uptake, LEE-encoded TTSS effectors, non-LEE-encoded TTSS effectors, secretion      [44] were present in NCCP15739, but no toxin genes were found in NCCP15738. In brief, the NCCP15738 strain had fewer virulence factors than NCCP15739. However, NCCP15738 is pathogenic and causes diarrhea in human hosts. We propose that the strain NCCP15738 is a model organism for studies of pathogenicity in non-STEC O157 strains because its genome does not contain toxin-related genes. To identify the virulence genes related to diarrhea in humans, additional studies are needed.

Phage-associated regions
Prophages are mobile genetic elements that deliver antimicrobial-resistance genes [45] or virulence factors [46] to bacterial hosts and contribute to the diversity of host genomes [47]. We identified sixteen phage-associated regions (S1-S16) from the NCCP15739 genome and five phage-associated regions (N1-N5) from the NCCP15738 genome using the PHAST algorithm (Additional file 4: Table S4). Only five of the sixteen phages in NCCP15739 were intact, whereas all five phages in NCCP15738 were intact. Based on a BLAST search, only one phage-associated region, i.e., the N3 region from NCCP15738, was identical to the S2 region from the NCCP15739 genome, whereas the four remaining phages (N1, N2, N4, and N5) were specific to NCCP15738. In terms of virulence, the S2, S4, S11, S12, and S13 regions in NCCP15739 had the virulence factors nleC, stx2A and stx2B, paa, nleA/ espI, espJ and stx1A, and stx1B, respectively. Meanwhile, NCCP15738 had no virulence factors in the phage-associated regions. Therefore, we hypothesized that prophages are not causal factors of virulence in NCCP15738. In addition, we examined antibiotic resistance-related genes in prophage regions of NCCP15739 and NCCP15738, but no antibiotic resistance genes were found in either genome (Additional file 5: Table  S5). According to these results, we concluded that prophages are not vehicles of antibiotic resistance genes in NCCP15739 and NCCP15738.

Future directions
STEC O157 NCCP15739 and non-STEC NCCP15738 belong to the O157 serotype, which has strong pathogenicity and can cause foodborne disease. In this study, we performed a comparative genomic analysis of NCCP15739, NCCP15738, E. coli K-12 substr. MG1655, and O157:H7 EDL933. We found that NCCP15739 and NCCP15738 have specific functional genes and pathways related to pathogenicity and motility, and their genomes contained specific prophages. NCCP15739 had more genes (410), virulence factors (48), and phage-related regions (11) than NCCP15738. However, NCCP15738 had three more antibiotic resistance genes than NCCP15739. To investigate the effect of these differences on pathogenicity and biological properties, further studies are needed.