Diversity of 3′ variable region of cagA gene in Helicobacter pylori strains isolated from Chinese population

Background The cytotoxin-associated gene A (cagA) is one of the most important virulence factors of Helicobacter pylori (H. pylori). The C-terminal region of the CagA protein contains a highly polymorphic Glu-Pro-Ile-Tyr-Ala (EPIYA) repeat region, which is thought to play an important role in the pathogenesis of gastrointestinal diseases. The aim of this study was to investigate the diversity of the cagA 3′ variable region and the amino acid polymorphisms in the EPIYA segments of the CagA C-terminal region of H. pylori, and their association with gastroduodenal diseases.
Methods A total of 515 H. pylori strains were collected from patients in 14 different geographical regions of China. Genomic DNA was extracted from each strain and the cagA 3′ variable region was amplified by polymerase chain reaction (PCR). The PCR products were sequenced and analyzed using MEGA 7.0 software.
Results A total of 503 (97.7%) H. pylori strains were cagA-positive, and 1,587 EPIYA motifs were identified, comprising 12 types of EPIYA or EPIYA-like sequences. In addition to the four reported major segments, several rare segments (e.g., B′, B″ and D′) were defined and 20 different sequence types (e.g., ABD, ABC) were found. A total of 481 (95.6%) strains carried the East Asian type CagA, and the ABD subtype was the most prevalent (82.1%). Only 22 strains carried the Western type CagA, which included the AC, ABC, ABCC and ABCCCC subtypes. The prevalence of the CagA-ABD subtype differed significantly among geographical regions (P = 0.006). Seven amino acid polymorphisms were found in the sequences surrounding the EPIYA motifs, among which those at residues 893 and 894 were significantly associated with gastric cancer (P = 0.004).
Conclusions In this study, 503 CagA sequences were analyzed in depth. In the Chinese population, most H. pylori strains were of the CagA-ABD subtype, and its presence was associated with gastroduodenal diseases. Amino acid polymorphisms at residues 893 and 894 flanking the EPIYA motifs had a statistically significant association with gastric cancer.

mucosal-associated lymphoid tissue (MALT) lymphoma. Therefore, the World Health Organization classified H. pylori as a group I carcinogen in 1994 [34]. Epidemiological surveys show that about 50% of adults are infected with H. pylori and that chronic H. pylori infection plays an important role in the development of gastric carcinoma [24]. Despite the high prevalence of H. pylori infection, more than 80% of carriers present with asymptomatic gastritis, only 10%-20% develop CG and PUD, and a minority of carriers develop gastric cancer (GC) [22]. Variation in the virulence of strains is thought to be an important reason for the different clinical outcomes of H. pylori infection [43]. The cytotoxin-associated gene A (cagA), one of the most important virulence genes of H. pylori, is located at the end of the cag pathogenicity island (cag PAI) and encodes the 120-145 kDa CagA protein [7]. Studies have confirmed that cagA-positive strains are more virulent than cagA-negative strains and can cause more severe gastric inflammation [5]. The CagA protein can be transported into gastric epithelial cells by the type IV secretion system (T4SS) encoded by the cag PAI. After CagA translocation, the tyrosine residues of the EPIYA (Glu-Pro-Ile-Tyr-Ala) motifs in the CagA C-terminal region can be rapidly phosphorylated by Src family kinases (SFKs) [21,30]. Based on the amino acid sequences flanking the EPIYA motifs, the motifs can be subdivided into four distinct peptide segments: EPIYA-A, EPIYA-B, EPIYA-C and EPIYA-D [23]. According to the different combinations of these four EPIYA segments, H. pylori CagA can be divided into two types, namely the East Asian type and the Western type [15].
CagA can specifically bind to the SH2 domain of the Src homology 2 (SH2)-containing protein tyrosine phosphatase SHP-2, which induces a conformational change in SHP-2 and activates it [40]. SHP-2 is involved in signal transduction downstream of growth factor receptors and regulates cell growth, differentiation and adhesion, thereby inducing morphological transformation and abnormal proliferation of gastric epithelial cells [6]. The binding of CagA to SHP-2 can lead to cytoskeletal rearrangement of the host gastric epithelial cells, known as the hummingbird phenotype, which plays an important role in the development of gastric cancer [14]. Studies showed that East Asian type CagA containing the EPIYA-D segment displayed stronger SHP-2 binding activity and caused more severe damage to cells than Western type CagA. Among Western strains, CagA with more EPIYA-C segments showed a stronger ability to bind SHP-2 and was more prone to induce the hummingbird phenotype than CagA with fewer EPIYA-C segments [9]. Phosphorylated CagA can also interact with C-terminal Src kinase (Csk) and inactivate Src kinase, resulting in cytoskeleton rearrangement and cell elongation [13]. In addition to SHP-2 and Csk, the CagA protein can interact with PI3K (phosphatidylinositol 3-kinase), Grb2 (growth factor receptor bound protein 2) and ZO-1 (zonula occludens-1) in a tyrosine phosphorylation-dependent manner [1,18]. The interaction of CagA with these proteins activates abnormal signaling pathways that can lead to cell dysfunction. In addition, the CagA C-terminus includes a 16-amino-acid stretch, named the CagA-multimerization (CM) sequence or the conserved repeat responsible for phosphorylation-independent activity (CRPIA) motif, located distal to the EPIYA-C or EPIYA-D segment [36]. The CRPIA motif can mediate dimer formation of the CagA protein and stabilize its binding to SHP-2 [25]. It can also bind to the polarity regulatory kinase partitioning defective 1/microtubule affinity regulating kinase (PAR1/MARK) and inhibit its kinase activity [29].
The incidence of H. pylori infection and gastric cancer in China is much higher than that in Western countries [35]. However, reports on the relationship between the CagA type and gastroduodenal diseases are contradictory [28,44]. This controversy may be due to regional diversity or differences in research methods. In fact, a comprehensive analysis of the sequence characteristics of the cagA 3′ variable region is lacking. Moreover, few studies have examined the detailed amino acid polymorphisms surrounding the EPIYA motifs and their association with clinical outcomes [3]. The aim of this study was to investigate the diversity of the cagA 3′ variable region, the amino acid polymorphisms surrounding the EPIYA motifs, and their relationship with gastroduodenal diseases through sequence alignment and statistical analysis of 503 CagAs in H. pylori strains isolated from different Chinese populations.

cagA gene status
Of the 515 H. pylori strains obtained from 14 different geographical regions in China, 503 (97.7%) were cagA-positive. The cagA-positive rate was 91.1% (82 strains) in Shandong, 94.9% (75 strains) in Guangxi and 100% in the other twelve regions. There was no significant difference in the distribution among different regions (χ² = 0.933, P > 0.05). The sequencing results showed that PCR products of the cagA 3′ variable region ranged from 480 to 858 bp, most of which were approximately 630 bp (Fig. 1). The alignment of the corresponding amino acid sequences revealed amino acid mutations, such as substitutions, insertions and deletions, in the CagA C-terminal region.

Characteristics of EPIYA segments flanking sequences
We classified the EPIYA segments according to the sequences flanking the EPIYA motifs. In addition to the four major segments, we defined several rare segments, including EPIYA-B′, EPIYA-B″ and EPIYA-D′. Representative segment types obtained from the 503 CagAs are listed in Table 1. Sequence alignment revealed amino acid differences within segments of the same type. The two most frequent sequences of segments A, B, C and D are shown in Table 2. There were obvious differences between segments EPIYA-C and EPIYA-D when analyzed using WebLogo 3. As shown in Fig. 2, the lengths and sequences of segments A C and A D were very similar, whereas those of segments B C and B D were quite different. The sequence variation started from the six amino acids QVAKKV in segments B C and B D, and the sequences of segments C and D were completely different. As shown in Fig. 3, the most common CRPIA motif in Western type CagA was FPLKRHDKVDDLSKVG and the most common CRPIA motif in East Asian type CagA was FPLRRSAAVNDLSKVG. Western and East Asian CRPIA motifs varied at positions 4, 6, 7, 8 and 10. Western strains with three EPIYA motifs contained two CRPIA motifs, and the CRPIA motif before the EPIYA-C motif was of the Western type in all 22 Western type strains.
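The positional comparison of the two most common CRPIA motifs can be checked with a few lines of Python. This is only an illustrative sketch, not the WebLogo analysis used in the study; the function names are hypothetical, and the two motifs are written without line-break hyphenation.

```python
# Most common CRPIA motifs quoted in the text (16 residues each).
WESTERN_CRPIA = "FPLKRHDKVDDLSKVG"
EAST_ASIAN_CRPIA = "FPLRRSAAVNDLSKVG"


def differing_positions(a: str, b: str) -> list[int]:
    """Return the 1-based positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have the same length")
    return [i for i, (x, y) in enumerate(zip(a, b), start=1) if x != y]


def percent_identity(a: str, b: str) -> float:
    """Percentage of aligned positions that carry the same residue."""
    return 100.0 * (len(a) - len(differing_positions(a, b))) / len(a)


if __name__ == "__main__":
    print(differing_positions(WESTERN_CRPIA, EAST_ASIAN_CRPIA))  # [4, 6, 7, 8, 10]
    print(percent_identity(WESTERN_CRPIA, EAST_ASIAN_CRPIA))     # 68.75
```

Five differing positions out of 16 correspond to 11/16 = 68.75% identity, in line with the figure of approximately 70% given in the Discussion.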
The alignment of the amino acid sequences confirmed that the EPIYA motifs in the EPIYA-C and EPIYA-D segments were highly conserved, whereas the EPIYA motifs in the EPIYA-A and EPIYA-B segments showed evidence of mutations. A total of 1,587 EPIYA motifs were obtained from the 503 CagAs, including 12 types of EPIYA or EPIYA-like sequences (Table 3). The three most frequent EPIYA motifs were EPIYA (1461/1587 = 92.1%), EPIYT (4.7%) and ESIYA (1.4%). The EPIYA-B motif had a high degree of variation in its five amino acids (e.g., EPIYA, EPIYT and ESIYA). As shown in Table 4 and Fig. 2, the sequences KVNK and QVNK were the main types in segments A C and A D, respectively. QVAK was the main sequence in segments B C and B D. In the present study, EPIYA motifs were assigned to segments C and D if they were followed by TIDD and TIDF, respectively. Sequence alignment showed that motifs followed by TIED or TIDE also belonged to segment C.
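The rule for telling segments C and D apart by the four residues that follow the motif can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the procedure used in the study: only the three most frequent motifs named above are searched (the full set of 12 is given in Table 3), segments A and B are not distinguished, and the toy sequence and function name are hypothetical.

```python
import re

# The three most frequent motifs named in the text; Table 3 lists 12 in total.
MOTIF = re.compile(r"EPIYA|EPIYT|ESIYA")

# Tetrapeptides following the motif, as described in the text.
C_SUFFIXES = {"TIDD", "TIED", "TIDE"}
D_SUFFIXES = {"TIDF"}


def classify_motifs(protein: str) -> list[tuple[int, str, str]]:
    """Return (1-based start, motif, label) for each motif found in the sequence."""
    results = []
    for match in MOTIF.finditer(protein):
        suffix = protein[match.end():match.end() + 4]
        if suffix in C_SUFFIXES:
            label = "C"
        elif suffix in D_SUFFIXES:
            label = "D"
        else:
            label = "A or B"  # not resolved by this simplified rule
        results.append((match.start() + 1, match.group(), label))
    return results


if __name__ == "__main__":
    # Hypothetical toy sequence, not a real CagA fragment.
    toy = "NNEPIYAKVNKGGEPIYTQVAKKVEPIYATIDFDD"
    for hit in classify_motifs(toy):
        print(hit)
```

For the toy sequence, the first two motifs are left as "A or B" and the last one, followed by TIDF, is labelled "D".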

Correlation between CagA sequence types and geographical regions
There were some differences in CagA sequence types among geographical regions (Table 6). In Yunnan, strains containing 4 or more EPIYA motifs accounted for 40% (29/73). There was a significant correlation between the CagA-AB′BD type and Yunnan isolates (χ² = 81.523, P < 0.001). Most of the Western strains were from Neimenggu, and the association of the CagA-ABC type with Neimenggu isolates was highly significant (χ² = 25.468, P < 0.01). The presence of the CagA-ABD type also differed significantly between geographical regions (χ² = 80.067, P < 0.01).

Correlation between CagA sequence types and clinical outcomes
Clinical data were available for 131 of the 503 cagA-positive H. pylori strains. Based on gastrointestinal endoscopy and pathological examination, CG was diagnosed in 85 patients, GC in 22, gastric ulcer (GU) in 10, duodenal ulcer (DU) in 10 and MALT lymphoma in 4. A total of 12 Western type strains were found, 11 of which were from patients with CG. Among the 131 CagAs, 86 were of the type ABD, 25 of the type AB′BD, 3 of the type AB′B′BD, 2 of the type AB′B′B′B′BD and 3 of the type AD. The distribution of the CagA sequence types across clinical outcomes is shown in Table 7. We compared the types ABD and AB′BD in relation to clinical outcomes; other CagA types were excluded because their numbers were relatively small. As shown in Table 7, 58.1% (50/86) of the ABD strains were from patients with CG, whereas only 22.1% (19/86) were from patients with GC and 9.3% (8/86) from patients with GU. The ratio of AB′BD/ABD was therefore higher in CG (20/50 = 0.4) than in GC (1/19 = 0.05), and the differences were statistically significant (χ² = 71.500/80.067, P < 0.01).
The amino acid polymorphisms in the sequences flanking the EPIYA motifs in the 86 ABD subtypes are detailed in Table 8 and Fig. 4. The absence of amino acids 893 and 894 was synchronous. Strains lacking residues 893 and 894 had a statistically significant association with GC compared with CG (χ² = 21.778, P < 0.01). Most patients with CG, GU, DU and MALT lymphoma had a glutamic acid (Glu) at residue 894, whereas in some patients with GC this Glu was deleted or substituted by other amino acids, such as threonine (Thr) or asparagine (Asn). These changes at residue 894 differed significantly between GC patients and those with other diseases (χ² = 4.908, P < 0.05). Apart from the seven polymorphic positions, the other amino acids were relatively conserved, except for occasional deletions or substitutions.

Discussion
CagA is an important oncoprotein that can be translocated into the gastric epithelial cells and subsequently tyrosine-phosphorylated at residues of the EPIYA motifs [11]. The phosphorylated CagA can activate the phosphatase SHP-2 and then cause actin cytoskeleton rearrangement, hummingbird phenotype, which disturbs the normal signal transduction pathway of cells and promotes abnormal proliferation of gastric epithelial cells [6]. A recent study shows that the CagA EPIYA segments can interact with SHP-1, Grb2, Grb7, PI3K and Ras-GAP1 in addition to SHP-2 and Csk [32]. The interaction between CagA and these proteins suggests that CagA plays an important role in the development of gastrointestinal diseases caused by H. pylori. Therefore, we used molecular epidemiological methods to study the diversity of cagA 3′ variable region and the amino acid polymorphisms in the EPIYA segments of the CagA C-terminal region, and their association with gastroduodenal diseases.
The tyrosine phosphorylation sites are located in the EPIYA repeat sequences at the CagA C-terminus, and the number of EPIYA repeats directly affects the binding of CagA to SHP-2 and its ability to cause morphological changes in gastric epithelial cells [27]. Therefore, variation in the EPIYA repeat sequences may be an important reason for differences in the virulence of H. pylori strains and in clinical outcome. In our study, EPIYA (92.1%) was the predominant type, followed by EPIYT (4.7%) and ESIYA (1.4%). This result differed from a previous study of 710 EPIYA motifs from 206 CagAs, in which 77.8% were EPIYA, whereas EPIYT and ESIYT accounted for only 14.8% and 6.2%, respectively [17]. Variation of the EPIYA motif was most frequent in the EPIYA-B segment. It was reported that EPIYA sequences were significantly correlated with gastric cancer, whereas EPIYT sequences were significantly associated with DU [42]. The role of EPIYT sequences in the development of gastrointestinal diseases needs further study. In the EPIYA-C and EPIYA-D segments, the amino acids following the EPIYA motif are generally TIDD and TIDF, respectively, which form an important structural element for SHP-2 binding. Our study confirmed that an EPIYA motif belonged to segment C if it was followed by TIED or TIDE. In addition, it has been reported that EPIYA is also identified as segment C if it is followed by TIEE, SIDD, TIDG, TIAE or TIAD, and as segment D if it is followed by TIDS [40].
According to the sequences flanking the EPIYA motifs, we defined several segments, including B′ D, B″ D and D′. The sequences of the B′ D, B″ D and D′ segments differed somewhat from those of the B and D segments. For example, in the B″ D segment the sequences before EPIYA were similar to those of the D segment, whereas in the B′ D segment the sequences after EPIYA were similar to those of the A D segment. It has been reported that the distribution of CagA EPIYA segments shows great geographical differences. The EPIYA-A and EPIYA-B segments appeared in almost all cagA-positive strains, whereas the EPIYA-C and EPIYA-D segments were characteristic of Western and East Asian CagA strains, respectively [41]. As expected, 95.6% (481/503) of the CagA strains contained segment D. In contrast, 4.4% (22/503) contained segment C instead of segment D. Some studies showed that Western type CagA was the most frequent type in Mongolian and Russian patients, and that all H. pylori strains from gastric cancer patients possessed Western type CagA [26,37]. Southeast Asian countries, such as Thailand and Myanmar, formed the geographical boundary between segments C and D, and the prevalence of CagA strains containing segments C and D was similar in these countries [20,39]. In our study, 77.3% (17/22) of the Western CagAs were from Neimenggu, Heilongjiang and Yunnan, which may be due to human migration or direct transmission.
Table 7 CagA sequence types and clinical outcomes. * The prevalence of CagA-ABD was significantly higher in CG than in GC (P < 0.01). † The prevalence of CagA-AB′BD was significantly higher in CG than in GC.
Studies have reported that there was no significant correlation between CagA-ABD and the types of gastroduodenal diseases [44]. In contrast, our study found a significant correlation between the ABD subtype and gastroduodenal diseases (P < 0.01). Studies have shown that East Asian CagA is more pathogenic than Western CagA, which may explain why the incidence of GC in Eastern countries is significantly higher than that in Western countries [19,31]. In our study, the CRPIA motif of Western CagA showed approximately 70% identity with that of East Asian CagA, which was consistent with previous studies [16]. Western CagA strains carried two CRPIA motifs, one located within the EPIYA-C segment and the other distal to its end, whereas East Asian CagA possessed a single CRPIA motif located after the EPIYA-D segment. CagA can be phosphorylated by SFKs at the tyrosine residues of the EPIYA motifs [30]. The tyrosine-phosphorylated C and D segments specifically bind to SHP-2, which plays an important role in the development of gastric cancer [6]. The tyrosine-phosphorylated A and B segments can bind and activate C-terminal Src kinase (Csk), which negatively regulates SFKs and thus forms a negative feedback loop. The inhibition of SFKs can decrease the amount of phosphorylated CagA protein, which to some extent explains how H. pylori can persist in gastric epithelial cells for a long time without causing extensive gastric injury [8,38]. Therefore, it is thought that CagA with more A and B segments can inhibit SFKs more effectively and thereby reduce cell damage [10]. In the present study, we found 20 CagA sequence types with different numbers of EPIYA-A or EPIYA-B segments, such as AAABD, ABDABD and BD. The number of EPIYA-A and EPIYA-B segments may contribute to differences in the type and severity of gastrointestinal diseases. The relationship between EPIYA segments and gastrointestinal diseases needs to be further explored.
Research has shown that the pathogenicity of CagA is determined by its ability to bind SHP-2, which is also related to the number of tyrosine phosphorylation sites [6]. Souza [33] reported that the SH2 domains bound to closely related sequences, and the binding motif was pY-(S/T/A/V/I)-X-(V/I/L)-X-(W/F). Interestingly, the binding ability of East Asian CagA (pY-A-T-I-D-F) to SHP-2 was higher than that of Western CagA (pY-A-T-I-D-D), which can lead to more severe gastroduodenal diseases. Higashi et al. [12,13] demonstrated that a single amino acid difference accounted for the difference in SHP-2 binding activity between East Asian and Western CagA proteins. Therefore, research on amino acid polymorphisms and their association with gastrointestinal diseases may have important clinical value. In our study, we identified seven amino acid polymorphisms in the sequences surrounding the EPIYA motifs: residues 893, 894, 900, 906, 909, 910 and 963. The absence of amino acids 893 and 894 had a statistically significant association with GC. In most patients with CG, GU, DU and MALT lymphoma, the amino acids at residues 893 and 894 were asparagine (Asn) and glutamic acid (Glu), respectively, whereas 36.8% (7/19) of the isolates from GC patients lacked these two amino acids. This change may affect CagA tyrosine phosphorylation and binding to SHP-2, and alter the spatial conformation of the CagA protein, thereby accelerating the development of gastrointestinal diseases.
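The consensus motif pY-(S/T/A/V/I)-X-(V/I/L)-X-(W/F) quoted above can be written as a regular expression to see why the East Asian sequence fits it while the Western one does not. This is a minimal sketch with hypothetical names; it treats the tyrosine as a plain Y (phosphorylation is not modelled) and says nothing about measured binding affinity.

```python
import re

# pY-(S/T/A/V/I)-X-(V/I/L)-X-(W/F), with the phosphotyrosine written as "Y"
# and X standing for any residue.
SH2_CONSENSUS = re.compile(r"Y[STAVI].[VIL].[WF]")


def matches_sh2_consensus(hexapeptide: str) -> bool:
    """True if the six-residue peptide fits the consensus pattern exactly."""
    return SH2_CONSENSUS.fullmatch(hexapeptide) is not None


if __name__ == "__main__":
    print(matches_sh2_consensus("YATIDF"))  # East Asian EPIYA-D context: True
    print(matches_sh2_consensus("YATIDD"))  # Western EPIYA-C context: False
```

The two peptides differ only at the last position, where F satisfies the (W/F) requirement and D does not, which matches the single-residue difference described in the text.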

Conclusions
In this study, 503 CagA sequences were analyzed in depth and we defined several novel segment types, including B′ D, B″ D and D′. We demonstrated that most H. pylori isolates from the Chinese population were of the CagA-ABD subtype, which was statistically correlated with the type of gastroduodenal disease. Strains in which residues 893 and 894 were absent or mutated had a significant association with GC. Therefore, amino acid polymorphisms in the EPIYA segments might affect the function of the CagA protein and thereby contribute to the development of gastrointestinal diseases, especially GC.

PCR amplification
The cagA 3′ variable region of H. pylori was amplified with the following primers: forward, 5′-TGC GTG TGT GGC TGT TAG TAG-3′ and reverse, 5′-CCC TAG TCG GTA ATG GGT TGT-3′ [4]. PCR was performed in a volume of 25 μl containing 1 μl of each primer, 1 μl of template DNA, 12.5 μl of GoTaq® Green Master Mix (Promega, USA) and 9.5 μl of nuclease-free water. Amplification was carried out in a thermocycler system (Bio-Rad, USA) under the following conditions: initial denaturation at 94 °C for 5 min; 35 cycles of 94 °C for 30 s, 54 °C for 30 s and 72 °C for 40 s; and a final extension at 72 °C for 10 min. The amplified products were identified by electrophoresis on a 1.5% agarose gel with GelStain in 1 × TAE buffer at 110 V for 30 min. A gel documentation system (Bio-Rad, USA) was used to detect the DNA bands and obtain images of the PCR products.
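As a quick consistency check on the protocol above, the following Python sketch computes the primer lengths and GC content, the total reaction volume, and the programmed hold time of the cycling profile. The variable and function names are hypothetical, and the time estimate ignores temperature ramping, so the actual run would take longer.

```python
# Primers from the text, written without spaces.
FORWARD = "TGCGTGTGTGGCTGTTAGTAG"
REVERSE = "CCCTAGTCGGTAATGGGTTGT"

# Reaction components in microlitres, as listed in the text.
REACTION_UL = {
    "forward primer": 1.0,
    "reverse primer": 1.0,
    "template DNA": 1.0,
    "master mix": 12.5,
    "nuclease-free water": 9.5,
}


def gc_fraction(primer: str) -> float:
    """Fraction of G and C bases in a primer."""
    primer = primer.upper()
    return sum(base in "GC" for base in primer) / len(primer)


def programmed_seconds(cycles: int = 35) -> int:
    """Sum of the programmed hold times; ramping between steps is ignored."""
    initial_denaturation = 5 * 60
    per_cycle = 30 + 30 + 40
    final_extension = 10 * 60
    return initial_denaturation + cycles * per_cycle + final_extension


if __name__ == "__main__":
    for name, seq in (("forward", FORWARD), ("reverse", REVERSE)):
        print(name, len(seq), f"{100 * gc_fraction(seq):.1f}% GC")
    print("total volume (ul):", sum(REACTION_UL.values()))
    print("programmed time (min):", programmed_seconds() / 60)
```

Both primers are 21 nt long with 11 G/C bases each, and the listed components add up to the stated 25 μl.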

Sequencing and analysis of the diversity of the cagA 3′ variable region
Positive PCR products were sent to the Beijing Genomics Institute (BGI) for purification and sequencing. The nucleotide sequences of the cagA 3′ variable region were submitted to the China National Microbiological Data Center (accession numbers NMDCN0000IOV to NMDCN0000J0V and NMDCN0000LT9 to NMDCN0000M4U). EditPlus (version 5.3.0, Korea) was used to collect sequence information, sort the sequences and create files in FASTA format. BioEdit was used to align the sequences and obtain the amino acid sequences of the CagA protein. The EPIYA segment types and CRPIA motifs of CagA were analyzed using the program WebLogo (http://weblogo.threeplusone.com/). The cagA gene of the Western strain 26695 (GenBank No. CP003904) was used as a reference sequence. MEGA software (version 7.0.18, USA) was used for sequence alignments to analyze the diversity of the cagA 3′ variable region.
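The step from FASTA-formatted nucleotide sequences to amino acid sequences can be illustrated with a short, self-contained Python sketch. It is not the BioEdit workflow used in the study: it assumes the standard genetic code, a known reading frame and no alignment, and the record name and sequence below are hypothetical toy data.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def parse_fasta(text: str) -> dict[str, str]:
    """Parse FASTA text into {record name: sequence}."""
    records: dict[str, list[str]] = {}
    name = None
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].split()[0]
            records[name] = []
        elif name is not None:
            records[name].append(line.upper())
    return {key: "".join(parts) for key, parts in records.items()}


def translate(dna: str, frame: int = 0) -> str:
    """Translate DNA in the given frame; unknown codons become 'X'."""
    codons = (dna[i:i + 3] for i in range(frame, len(dna) - 2, 3))
    return "".join(CODON_TABLE.get(codon, "X") for codon in codons)


if __name__ == "__main__":
    toy_fasta = ">toy_strain hypothetical example\nGAACCTATT\nTATGCTACT\n"
    for name, dna in parse_fasta(toy_fasta).items():
        print(name, translate(dna))  # toy_strain EPIYAT
```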

Statistics
Statistical data were analyzed using SPSS 20.0 (SPSS, Chicago, USA). The χ² test and Fisher's exact test were used to test for statistical differences in CagA subtypes and amino acid polymorphisms among different gastroduodenal diseases. A P-value < 0.05 was considered to indicate a statistically significant difference.
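For readers who wish to see how Fisher's exact test works on a 2 × 2 table, a minimal Python sketch using only the standard library is given below. It is an illustration, not the SPSS procedure used in the study, and it is not expected to reproduce the reported χ² statistics. The example table is taken from the counts in the Results (ABD and AB′BD strains in CG and GC); the function name is hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact test for the 2 x 2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    denom = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        # Probability that the top-left cell equals x, given fixed margins.
        return comb(row1, x) * comb(row2, col1 - x) / denom

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    tolerance = 1e-9  # guards against floating-point ties
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + tolerance)
    )
    return min(1.0, p_value)


if __name__ == "__main__":
    # Rows: CG, GC; columns: ABD, AB'BD (counts from the Results).
    cg_abd, cg_abbd, gc_abd, gc_abbd = 50, 20, 19, 1
    print("AB'BD/ABD in CG:", cg_abbd / cg_abd)
    print("AB'BD/ABD in GC:", round(gc_abbd / gc_abd, 2))
    print("Fisher P:", fisher_exact_two_sided(cg_abd, cg_abbd, gc_abd, gc_abbd))
```

Fisher's exact test is generally preferred over the χ² test when expected cell counts are small, as is the case here with a single AB′BD strain in the GC group.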