Genetic variants of Helicobacter pylori type IV secretion system components CagL and CagI and their association with clinical outcomes

Background Helicobacter pylori infection is associated with risk for chronic gastritis (CG), gastric ulcer (GU), duodenal ulcer (DU), and gastric cancer (GC). The H. pylori Cag type IV secretion system (TFSS) translocates the virulence factor cytotoxin-associated gene A protein into host cells and plays an important role in initiating gastric carcinogenesis. The CagL and CagI proteins are components of the TFSS. The Arg-Gly-Asp (RGD) motif of CagL, and the six most distal C-terminal amino acids of CagL and CagI (Ser-Lys-Ile-Ile-Val-Lys and Ser-Lys-Val-Ile-Val-Lys, respectively), are essential for TFSS adhesion to host cells. Additionally, the CagL variant Tyr58Glu59 was previously shown to be associated with GC. Results We obtained 43 H. pylori isolates from 17 CG, 8 GU, 8 DU, and 10 GC patients in Southeast Asia. Total DNA was extracted and sequenced with MiSeq. H. pylori strain ATCC 26695, which was isolated from a CG patient, was used as a reference. We examined the full sequences of H. pylori cagL and cagI using whole-genome sequencing (WGS), and analyzed whether single nucleotide variants and amino acid changes (AACs) correlated with adverse clinical outcomes. Three isolates were excluded from the analysis due to cagPAI rearrangements. CagL RGD motifs were conserved in 39 isolates (97.5%). CagL-Glu59 and Ile234 in the C-terminal motif were more common in the 10 H. pylori isolates from GC patients (p < 0.001 and p < 0.05, respectively). When the 5 Vietnamese isolates from GC patients were excluded, CagL-Glu59 remained significant (p < 0.05), but Ile234 did not. CagL-Tyr58 was seen in only one isolate. The CagI C-terminal motif was completely conserved across all 40 isolates, and there were no significant AACs in CagI. Conclusions Using WGS, we analyzed genetic variants in clinical H. pylori isolates and identified putative novel and candidate variants in uncharacterized CagL and CagI sequences that are related to gastric carcinogenesis. In particular, CagL-Glu59 is possibly associated with GC.
Electronic supplementary material The online version of this article (doi:10.1186/s13099-017-0165-1) contains supplementary material, which is available to authorized users.


Background
The infection rate for the Gram-negative bacterium Helicobacter pylori is around 50% worldwide [1,2]. H. pylori infection increases the risk of chronic gastritis (CG), gastric ulcer (GU), duodenal ulcer (DU), and gastric cancer (GC). H. pylori translocates its major virulence protein, cytotoxin-associated gene A (CagA), into host gastric epithelial cells using a type IV secretion system (TFSS) [5,6]. The role of the H. pylori TFSS and CagA translocation was examined in a series of previous studies, which showed that Src-mediated phosphorylation of CagA tyrosines is important for H. pylori virulence [5,7,8]. In East Asia in particular, nearly all H. pylori infections are CagA positive, which complicates assessment of how clinical H. pylori isolates are associated with disease outcomes [9-11]. Moreover, the mechanisms by which H. pylori expresses and regulates its TFSS injection apparatus when adapting to human epithelial cell receptors are unclear.
A recent study identified integrin α5β1 expressed on gastric epithelial cells as the putative host receptor for the H. pylori TFSS [12]. The H. pylori CagL protein was found to be an adhesion target on the injected pilus surface that binds host integrin α5β1 through the CagL Arg-Gly-Asp (RGD) motif [13]. This initial CagL-integrin binding is needed to position the bacterial TFSS properly before CagA translocation and to activate host tyrosine kinases [12,14]. The interaction between the H. pylori TFSS and host integrin α5β1 can activate NF-κB and several important pro-inflammatory cytokines, which may lead to more adverse clinical outcomes such as gastric carcinogenesis.
CagI is another H. pylori protein, but its function is less clear [12,15]. CagI has no sequence similarities to any other TFSS components, or to other known proteins [16,17]. Although an isogenic cagI mutant has been examined, there were conflicting reports about whether CagI is required for TFSS function [3,18]. Based on H. pylori transcriptome evidence [19], cagI is certainly part of an operon containing cagPAI genes involved in the TFSS, but the actual contribution of CagI to clinical phenotypes is unknown.
Here, we used whole-genome sequencing (WGS) to analyze genetic variants of 43 H. pylori isolates from patients in Southeast Asia who had different clinical diseases. Using the WGS data, we examined whether CagL and/or CagI amino acid changes (AACs) correlated with adverse clinical outcomes such as GC.

Characteristics of clinical H. pylori isolates
We previously performed WGS on 19 H. pylori clinical isolates that we deposited under accession number DRA001250 (see "Methods"). Here we undertook WGS of 24 new clinical H. pylori isolates, and analyzed a total of 43 H. pylori whole genome sequences (Table 1). The 43 isolates were from 17 chronic gastritis (CG), 8 gastric ulcer (GU), 8 duodenal ulcer (DU), and 10 gastric cancer (GC) patients whose diagnoses were based on endoscopy results. The isolates also differed in geographic origin: 31, 7, and 5 were obtained from Japanese, Chinese, and Vietnamese patients, respectively.

Sequence reads mapping to ATCC 26695 and quality check
The total reads for the 43 H. pylori isolates ranged from 1.99 to 10.87 million (Table 1). Sequencing data were mapped to the genome of the H. pylori strain ATCC 26695, which was isolated from a CG patient, as a reference. Total consensus length ranged from 1,503,522 to 1,664,897 bp, and total consensus coverage ranged from 90.15 to 99.82%. Average coverage ranged from 79.5- to 669.4-fold.
Following the initial quality check, we focused on the 28 genes in the cagPAI region (Additional file 1: Table S1). Among the 43 isolates, strain ID 189 had lower coverage (under 100-fold) in the cagPAI region, strain ID 194 had no genes in the cagPAI region, and strain ID F51 carried the cagA gene alone. Due to these major sequence differences in the cagPAI region, we excluded data for these three isolates, which were all from Japanese patients, such that 40 clinical H. pylori isolates were subjected to further analysis. Of these 40 isolates, 15, 8, 7, and 10 were from CG, GU, DU, and GC patients, respectively, and 28, 7, and 5 isolates were derived from Japanese, Chinese, and Vietnamese patients, respectively. CagA motifs differed among the 40 clinical isolates (Additional file 2: Table S2).
After the quality check, the average coverage of the remaining 40 isolates ranged from 99.6- to 361.4-fold for cagL, and from 105.4- to 416.3-fold for cagI (that is, at or very close to 100-fold or higher). Consistent with our earlier report, the WGS data in this study had high sequencing coverage, and were of sufficiently high quality to allow detection of SNVs in the H. pylori genome [20].

CagL variants in patients with different clinical disease outcomes
We translated the CagL nucleotide sequences into amino acid sequences (residues 1-237) with Genomics Workbench 8.5.1, and analyzed CagL variants based on clinical disease outcomes. Table 2 lists CagL variants, and the partial alignments of CagL amino acid changes (AACs) and their locations are shown in Fig. 1. In particular, we characterized AACs present in 10 clinical H. pylori isolates derived from GC patients.
More recently, the CagL variants Tyr58 and/or Glu59 (CagL-Y58E59) were found to occur at significantly higher rates in H. pylori isolates from Taiwanese GC patients. CagL-Tyr58Glu59 can induce higher integrin α5β1 expression levels in the upper stomach and increase inflammation in the corpus [21]. Consistent with that report, CagL-Glu59 was more common in the isolates from GC patients in this study (p < 0.001). Meanwhile, CagL-Tyr58 was present in only one isolate (HZ67) from a GC patient, and its frequency was not significant. Aspartic acid was the most common amino acid at position 58 (Asp58), occurring in 38 of 40 isolates (95.0%). The remaining isolate (F32) had CagL-Asn58, as did the reference strain ATCC 26695. Only one isolate (HZ67) among the 43 tested had a CagL sequence with both Tyr58 and Glu59.
Fig. 1 Partial alignment of CagL amino acid sequences. Residues 1-100 of the reference strain ATCC 26695: MKTLVKNTIS SFLLLSVLMA EDITSGLKQL DSTYQETNQQ VLKNLDEIFS TTSPSANNEM GEEDALNIKK AAIALRGDLA LLKANFEANE LFFISEDVIF
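The residue calls at positions 58 and 59 can be illustrated with a minimal sketch. It assumes 1-based residue numbering and uses the first 100 residues of the ATCC 26695 CagL sequence as they appear in the Fig. 1 alignment. The helper names `residue_at` and `find_motif` are hypothetical and are not part of the Genomics Workbench workflow used in this study.

```python
# First 100 residues of CagL from reference strain ATCC 26695,
# copied from the Fig. 1 alignment (blocks of 10 residues).
ATCC26695_CAGL_1_100 = (
    "MKTLVKNTIS" "SFLLLSVLMA" "EDITSGLKQL" "DSTYQETNQQ" "VLKNLDEIFS"
    "TTSPSANNEM" "GEEDALNIKK" "AAIALRGDLA" "LLKANFEANE" "LFFISEDVIF"
)


def residue_at(seq: str, position: int) -> str:
    """Return the one-letter residue at a 1-based position."""
    if not 1 <= position <= len(seq):
        raise ValueError(f"position {position} is outside 1..{len(seq)}")
    return seq[position - 1]


def find_motif(seq: str, motif: str):
    """Return the 1-based start of the first occurrence of motif, or None."""
    index = seq.find(motif)
    return index + 1 if index >= 0 else None


pos58 = residue_at(ATCC26695_CAGL_1_100, 58)
pos59 = residue_at(ATCC26695_CAGL_1_100, 59)
rgd_start = find_motif(ATCC26695_CAGL_1_100, "RGD")

print(f"position 58: {pos58}, position 59: {pos59}, RGD starts at: {rgd_start}")
```

For the reference fragment this reports Asn (N) at position 58, in agreement with the text, and Glu (E) at position 59. Applying the same lookup to each isolate's translated CagL sequence would give the per-isolate calls summarized above.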
The C-terminal motifs that include the most distal amino acids of both CagL and CagI are functionally important for the TFSS [22]. In CagL, the sequence of this motif is Ser-Lys-Ile-Ile-Val-Lys (232-237). In this study, we found that Ile234 occurred at a significantly (p = 0.018) higher rate in isolates from GC patients (7/10 isolates, 70.0%) than in isolates from non-GC patients (7/30 isolates, 23.3%). However, Ile234 lost significance when the 5 Vietnamese isolates from GC patients were excluded (Table 3). The other five residues showed no significant amino acid sequence differences among the disease outcomes tested.
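The Ile234 comparison can be checked with a short calculation. The statistical test is not named in this section, so the sketch below assumes a two-sided Fisher's exact test on the 2x2 table of Ile234 carriers (7 of 10 GC isolates versus 7 of 30 non-GC isolates). The function name `fisher_exact_two_sided` is illustrative only.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1 = a + b            # e.g. GC isolates
    col1 = a + c            # e.g. isolates carrying the variant
    n = a + b + c + d
    denom = comb(n, row1)

    def prob(k: int) -> float:
        return comb(col1, k) * comb(n - col1, row1 - k) / denom

    k_min = max(0, row1 - (n - col1))
    k_max = min(row1, col1)
    p_obs = prob(a)
    # Small relative tolerance guards against floating-point ties.
    total = sum(prob(k) for k in range(k_min, k_max + 1)
                if prob(k) <= p_obs * (1 + 1e-9))
    return min(1.0, total)


# Ile234: 7 of 10 GC isolates, 7 of 30 non-GC isolates (counts from the text).
p_value = fisher_exact_two_sided(7, 3, 7, 23)
print(f"Ile234, GC vs non-GC: p = {p_value:.3f}")
```

Under this assumption the calculation gives p of approximately 0.018, which matches the value reported above. The same function could be applied to the counts in Table 3 once the Vietnamese isolates are removed.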

CagI variants in patients with different clinical disease outcomes
We also translated CagI nucleotide sequences into amino acid sequences (1-381), and analyzed rates and locations of CagI variants based on clinical disease outcomes (Table 4; Fig. 2).
As with CagL, the C-terminal motif of Ser-Lys-Val-Ile-Val-Lys (376-381) in CagI is functionally essential for the TFSS. In our analysis, all 40 H. pylori isolates from both GC and non-GC patients had the same motif, which had a completely conserved sequence.
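A conservation check of this kind can be sketched as follows, assuming each isolate's translated protein sequence is available as a one-letter string. The sequences in `toy_sequences` are hypothetical placeholders, not data from this study, and the helper names are illustrative.

```python
# Three-letter to one-letter codes for the residues in the two motifs.
THREE_TO_ONE = {"Ser": "S", "Lys": "K", "Val": "V", "Ile": "I"}

CAGL_MOTIF = "Ser-Lys-Ile-Ile-Val-Lys"   # CagL residues 232-237
CAGI_MOTIF = "Ser-Lys-Val-Ile-Val-Lys"   # CagI residues 376-381


def motif_to_one_letter(motif: str) -> str:
    """Convert a hyphen-separated three-letter motif to one-letter code."""
    return "".join(THREE_TO_ONE[residue] for residue in motif.split("-"))


def has_c_terminal_motif(protein: str, motif: str) -> bool:
    """Return True if the protein ends with the given C-terminal motif."""
    return protein.endswith(motif_to_one_letter(motif))


# Hypothetical toy sequences (NOT real CagI sequences).
toy_sequences = {
    "toy_conserved": "MAAAGGSKVIVK",
    "toy_variant": "MAAAGGSKVIVR",
}

results = {name: has_c_terminal_motif(seq, CAGI_MOTIF)
           for name, seq in toy_sequences.items()}
print(results)
```

In the study, all 40 isolates would correspond to the conserved case for CagI.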

Phylogenetic implications of H. pylori CagL and CagI diversity
Phylogenetic trees were constructed using MEGA7 [23]. In general, CagL sequences showed no characteristic clusters by disease outcome (Fig. 3a), although the five Vietnamese isolates clustered together (Fig. 3b). Meanwhile, CagI sequences showed no characteristic clusters by either region or disease outcome (Fig. 3c, d).

Discussion
Using the advantages provided by whole-genome sequencing (WGS), we analyzed candidate and novel variants of CagL and CagI proteins in 40 clinical H. pylori isolates from patients in Southeast Asia. We showed that CagL from H. pylori isolates derived from GC patients carried several specific amino acid changes (AACs), but we detected no significant changes in the CagI amino acid sequence.
Whole-genome sequencing technology has recently been applied to clarify the pathogenicity and evolution of H. pylori, as well as to identify its virulence factors [24,25]. Using WGS, we and others detected potential mutations throughout the H. pylori genome and identified variants when sequence changes were present [20,24,25]. Here, we used WGS technology to detect novel variants in uncharacterized cagPAI genes associated with H. pylori pathogenicity. cagPAI is a 37 kb segment of H. pylori DNA that contains 28 genes [3,4], and is found in about 60% of Western isolates, whereas nearly all East Asian isolates are cagPAI positive [26]. We analyzed cagPAI integrity and found rearrangement of this island in three Japanese isolates (189, 194 and F51). Although cagPAI was most intact in Japanese isolates, it was disrupted in isolates collected throughout the world [27]. Since the pathogenic role of the cagPAI is well defined as a whole or in part, we excluded the three isolates that had cagPAI rearrangements.

Table 3 Seven variants of CagL in GC and non-GC isolates without 5 Vietnamese isolates
Several Cag proteins have been detected as constituents of the H. pylori cag TFSS apparatus and have important roles in CagA translocation [14,15,22]. The CagL and CagI proteins have been previously characterized [16,17], so in this study we used WGS to screen 40 clinical H. pylori isolates for CagL and CagI variants, and analyzed the relationship between amino acid sequence and clinical outcomes. Consistent with a previous report [21], we detected complete RGD motifs in CagL sequences from all isolates. These data highlight the importance of the RGD motif for CagL function in the TFSS. We also checked whether other AACs in CagL and CagI were correlated with clinical outcomes.
The C-terminal motif in CagL and CagI consisting of six amino acids (Ser-Lys-Ile-Ile-Val-Lys, and Ser-Lys-Val-Ile-Val-Lys, respectively) is important for TFSS function [22]. However, whether these sequences were conserved among genomes of clinical H. pylori isolates was unclear.
Here, we showed that the CagI C-terminal motif was completely conserved. Although the CagL C-terminal motif was also well conserved, we found a significant difference at position 234 of CagL between H. pylori isolates derived from GC and non-GC patients. However, Ile234 lost significance when the 5 Vietnamese isolates from GC patients were excluded. Future studies on additional H. pylori isolates could validate whether CagL-Ile234 could serve as a marker that indicates an increased risk for gastric carcinogenesis.

Conclusions
We analyzed genetic variants of H. pylori using WGS, which has significant advantages over other approaches that examine only a fraction of the genome at any one time. WGS identified several putative novel variants of CagL and CagI sequences from previously uncharacterized H. pylori isolates. These variants, particularly CagL-Glu59, may affect TFSS activity and may be relevant to clinical outcomes.

H. pylori samples
Forty-three H. pylori clinical isolates were obtained from gastric epithelium biopsy tissues taken during upper gastroduodenal endoscopy procedures performed at Okinawa Prefectural Chubu Hospital, Kobe University Hospital, and Fukui University Hospital in Japan, as well as Zhejiang University Hospital in China and Cho Ray Hospital in Vietnam. All patients gave written informed consent for use of their samples in this study, which was performed according to the principles of the Declaration of Helsinki. The major reference strain, ATCC 26695 (NC_000915), was isolated from a CG patient in the United Kingdom [30], and its sequence served as the reference sequence.

H. pylori culture
Gastric biopsy specimens were first inoculated onto trypticase soy agar II (TSA-II)-5% sheep blood plates (Becton, Dickinson and Company: BD) and cultured under microaerophilic conditions (O2 5%; CO2 5%; N2 90%) at 37 °C for 3-5 days. Then, one colony was picked from each primary culture plate, seeded onto a Columbia Helicobacter pylori agar plate containing vancomycin (10 mg/l), trimethoprim (5 mg/l), amphotericin B (5 mg/l), and polymyxin B (2500 units/l), and cultured under the same conditions. A colony was picked from this second plate, seeded onto a TSA-II plate, and cultured under the same conditions. Several colonies were picked from the third plate, transferred into Brucella Broth medium (2 ml) containing 10% fetal calf serum, and cultured for 18 h under the same conditions.
A portion of each culture was stored at −80 °C in 0.01 M phosphate-buffered saline (PBS), pH 7.4, containing 20% glycerol. H. pylori DNA was extracted from bacterial pellets prepared from liquid cultures using the protease-phenol-chloroform method. The extracted DNA was suspended in 100 μl distilled water and stored at 4 °C.

Whole-genome sequencing (WGS)
Total DNA from the H. pylori patient isolates and the reference strain ATCC 26695 was sequenced. The bacterial DNA concentration of each sample was measured with a Qubit dsDNA HS assay kit (Q32851; Invitrogen, Carlsbad, CA) and was between 250 and 320 pg/μl. A DNA library of H. pylori isolates was prepared using a Nextera XT DNA Sample Prep Kit (Illumina, San Diego, CA), which was used according to the manufacturer's instructions to uniformly shear the DNA into 500 bp fragments and add unique adapter sequences to the fragments. The resulting DNA library was run on a MiSeq sequencer (Illumina) with a reagent kit (300 cycle, paired-end). Fluorescence images were analyzed using MiSeq Control Software, and FASTQ-formatted sequence data were generated using MiSeq Reporter Analysis.

Sequence read mapping and single nucleotide variant (SNV) detection
For the DNA sequence data analyzed, data in which more than 80% of bases had a quality score of Q30 or higher were selected, according to recommendations by Illumina. After a quality check and data trimming, the sequence reads were assembled with Genomics Workbench 8.5.1 (CLC bio, Aarhus, Denmark). The read mapping module, CLC Assembly Cell 4.0, is based on an uncompressed suffix array that represents the entire reference genome in a single data structure (White paper on CLC read mapper; October 10, 2012). Sequence reads were mapped against the ATCC 26695 genome (NC_000915) as a reference, and single nucleotide variants (SNVs) were identified with the Fixed Ploidy Variant Detection module using default parameters and minor modifications to the mapping algorithm. Variant detection of the software was set to 1.
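The Q30 criterion can be illustrated with a minimal sketch. It assumes Phred+33 quality encoding in the FASTQ files and interprets the criterion as the fraction of bases with a quality score of 30 or higher. The function name `q30_fraction` and the quality strings in `toy_qualities` are hypothetical, and this is not the filtering code used by MiSeq Reporter or Genomics Workbench.

```python
def q30_fraction(quality_strings, offset: int = 33, threshold: int = 30) -> float:
    """Return the fraction of bases whose Phred score is >= threshold.

    quality_strings: iterable of FASTQ quality lines (one per read).
    offset: ASCII offset of the encoding (33 assumed here).
    """
    total = 0
    passing = 0
    for quality in quality_strings:
        for char in quality:
            total += 1
            if ord(char) - offset >= threshold:
                passing += 1
    if total == 0:
        raise ValueError("no bases found in the quality strings")
    return passing / total


# Hypothetical quality lines: 'I' is Q40, '?' is Q30, '#' is Q2.
toy_qualities = ["IIIIIIII#I", "IIIII?IIII"]

fraction = q30_fraction(toy_qualities)
passes_q30 = fraction > 0.80   # criterion stated in the text: above 80%
print(f"bases >= Q30: {fraction:.1%}, passes: {passes_q30}")
```

In practice the quality lines would be read from the FASTQ files produced by the sequencer, taking every fourth line of each record.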