New tool for phylogenetic analysis of Helicobacter pylori

The aim of this study was to detect canonical insertion/deletion (INDEL) markers in the genome of Helicobacter pylori and to propose an INDEL-typing method for differentiating H. pylori strains. For comparative analysis of the H. pylori genomes available in the GenBank database, a local database of nucleotide sequences of 69 H. pylori strains was created. To detect all INDEL markers with a preset size of 6-20 bp, a pairwise comparison of more than 1500 open reading frames (ORFs) in the genomes of the local database strains was performed. Ten loci containing INDEL markers were found. The five most variable loci were tested in silico on 21 strains of known geographical origin from the most common populations: hpEurope, hspWAfrica, and hspEAsia. Fifteen individual genotypes with a high diversity index (DI=0.95) were identified. For cluster analysis, the minimal spanning tree (MST) method was used, which showed a clear distribution of clusters according to the geographical origin of the strains tested. INDEL-typing of 21 regional strains from the Astrakhan region was performed in vitro. The vast majority of them were shown to belong to the hpEurope population. The findings of this study indicate that the proposed INDEL-typing method almost perfectly reflects the geographical distribution of H. pylori strains determined by multilocus sequence typing (MLST), even though the two methods target completely different genes. Further research is needed to determine the geographical origin of H. pylori strains in Russia.


Introduction
The main factor behind the high genetic variability of H. pylori strains is a high frequency of endogenous mutations and a correspondingly low occurrence of identical alleles [1]. In addition, H. pylori cells are naturally competent: they can integrate foreign DNA into their own genome through recombination and exchange genetic information between different strains in the same patient [2]. Therefore, H. pylori is a species without a significant clonal structure [2,3]. The genetic diversity of H. pylori has been confirmed by various methodological approaches [3][4][5][6][7][8].
Previously, Achtman et al. [3] established the global population structure of H. pylori according to geographical origin using multilocus sequence typing (MLST) of seven housekeeping genes (ureI, mutY, efp, ppa, yphC, atpA, trpC). Initially, they described two groups of strains with a weak clonal structure: Asian Clone (Asian strains) and Clone2 (strains from other parts of the world). Later, Falush et al. [9] identified four groups of strains with different geographical origins: hpEurope, hpAfrica 1 (later divided into hspWAfrica and hspSAfrica), hpAfrica 2, and hpEastAsia (consisting of hspAmerind, hspEAsia, and hspMaori). Currently, MLST is the most common method for determining the geographical origin of H. pylori strains worldwide, but it has some disadvantages. The method requires sequencing 3850 nucleotides and then identifying SNPs within a sequence of more than 1400 nucleotides for each strain studied, which necessitates high-tech equipment and complex software. This may explain why the population structure of H. pylori strains circulating in Russia remains practically unstudied. At the same time, INDEL typing [10][11][12][13], which relies on widely accessible PCR, is commonly used alongside MLST to study the phylogenetic relationships between strains of some other microorganisms. To date, INDEL typing of H. pylori strains has not been described in the literature.
Therefore, the objective of this study was to detect INDEL markers in the genome of Helicobacter pylori and to develop an INDEL-typing method for differentiating H. pylori strains.

Methods of bioinformatic analysis
Original software was used to create a local database of H. pylori genomes. Genomes available as reads were assembled using SPAdes [14]. Comparative analysis of open reading frames (ORFs) was performed using the original program Gene Expert, written in Java script. The variability of the detected insertions/deletions (INDELs) was estimated using Simpson's diversity index (DI), calculated using the formula given in [15]. Primers were designed and PCR was performed in silico using the in-house programs Primer M and Virtual PCR. Cluster analysis was performed using the UPGMA method, and the dendrogram was constructed in MEGA 5 [16]. The set of INDEL loci was optimized using the program AuSeTTS (Automated Selection of Typing Target Subsets) [17]. The phylogenetic tree was constructed using the minimal spanning tree (MST) method (BioNumerics 7.6 software package).
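As an illustration of the diversity index, the sketch below assumes the Hunter-Gaston form of Simpson's index, DI = 1 - Σ n_j(n_j - 1) / (N(N - 1)), where N is the number of strains and n_j the number of strains with genotype j. Whether this matches the formula in [15] exactly is an assumption, and the genotype labels are hypothetical.

```python
from collections import Counter


def simpson_diversity(genotypes):
    """Simpson's diversity index in the Hunter-Gaston form (assumed form).

    genotypes: one genotype label per strain.
    """
    n = len(genotypes)
    if n < 2:
        raise ValueError("at least two strains are required")
    counts = Counter(genotypes).values()
    return 1.0 - sum(c * (c - 1) for c in counts) / (n * (n - 1))


# Hypothetical genotype labels for six strains (not data from this study).
example = ["g1", "g1", "g2", "g3", "g3", "g4"]
print(round(simpson_diversity(example), 3))
```

A value close to 1 means that two strains drawn at random are very likely to have different genotypes.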

Bacterial strains and DNA isolation
Twenty-one H. pylori isolates were obtained from the collection of Astrakhan State Medical University. Genomic DNA was extracted from the bacterial isolates using a Probe NA Kit (DNA-Technology, Russia) according to the manufacturer's instructions.

PCR amplification
Amplification was carried out as described earlier [7], with one exception: annealing was performed at 50 °C for locus hp3660 and at 55 °C for the other loci. Each PCR product was resolved by 8% polyacrylamide gel electrophoresis, and allele sizes were estimated using pBluescript DNA/MspI digest (MBI Fermentas, Vilnius, Lithuania) as a size marker. Gels were visualised by UV transillumination, and images were captured using the ChemiDoc XRS System (Bio-Rad).

In silico INDEL-typing
For comparative analysis of the H. pylori genomes available in the GenBank database, a local database of nucleotide sequences of 69 H. pylori strains was created using in-house software. Using the Gene Expert program, a pairwise comparison of more than 1,500 open reading frames in the genomes of the local database strains was performed to detect all INDEL markers with a preset size of 6-20 bp. Ten loci containing INDEL markers were detected.
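The size filter used in this step can be illustrated with a small sketch. It assumes the two ORF sequences have already been aligned (gaps written as "-") and reports gap runs whose length falls within the 6-20 bp window. This is not the algorithm of Gene Expert, which is not described here; the function name and the example sequences are hypothetical.

```python
def find_indels(aligned_a, aligned_b, min_len=6, max_len=20):
    """Return (start, length, side) for gap runs of min_len..max_len columns.

    side is "a" if the gap is in aligned_a and "b" if it is in aligned_b.
    Positions are 0-based alignment columns.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    indels = []
    run_start, run_side = None, None
    # A sentinel column without a gap flushes a run that reaches the end.
    for i, (x, y) in enumerate(zip(aligned_a + "N", aligned_b + "N")):
        if x == "-" and y != "-":
            side = "a"
        elif y == "-" and x != "-":
            side = "b"
        else:
            side = None
        if side != run_side:
            if run_side is not None and min_len <= i - run_start <= max_len:
                indels.append((run_start, i - run_start, run_side))
            run_start, run_side = i, side
    return indels


# Hypothetical aligned fragments: a 9-bp gap in the first sequence.
seq_a = "ATGAAA" + "-" * 9 + "GGCTAA"
seq_b = "ATGAAA" + "CCCGGGTTT" + "GGCTAA"
print(find_indels(seq_a, seq_b))
```

Gap runs shorter than 6 or longer than 20 columns are ignored, mirroring the preset size window.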
The variability of each of the 10 loci containing INDEL markers was determined. The six most variable loci were selected for further research (hp3330, hp5605, hp6405, hp340, hp1390, hp3660). To optimize this set of six INDEL loci, the program AuSeTTS (Automated Selection of Typing Target Subsets) [17] was applied to obtain the maximum number of individual genotypes.
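The goal of this optimization step can be sketched as an exhaustive search: for each subset size, find a subset of loci that still resolves the largest number of distinct genotypes. This only illustrates the idea and is not the AuSeTTS algorithm itself; the locus names and allele sizes below are hypothetical.

```python
from itertools import combinations


def best_subsets(profiles, loci):
    """For each subset size, return (loci subset, number of genotypes resolved).

    profiles: dict mapping strain -> dict mapping locus -> allele
    (for example, amplicon size in bp).
    """
    results = {}
    for k in range(1, len(loci) + 1):
        best = None
        for subset in combinations(loci, k):
            genotypes = {
                tuple(profile[locus] for locus in subset)
                for profile in profiles.values()
            }
            # Keep the first subset that reaches the highest genotype count.
            if best is None or len(genotypes) > best[1]:
                best = (subset, len(genotypes))
        results[k] = best
    return results


# Hypothetical allele sizes (bp) for four strains at three loci.
profiles = {
    "s1": {"locA": 100, "locB": 200, "locC": 300},
    "s2": {"locA": 100, "locB": 212, "locC": 300},
    "s3": {"locA": 109, "locB": 200, "locC": 300},
    "s4": {"locA": 109, "locB": 212, "locC": 309},
}
for k, (subset, n) in best_subsets(profiles, ["locA", "locB", "locC"]).items():
    print(k, subset, n)
```

In this toy data set, two loci already resolve all four strains, so the third locus adds no resolution and could be dropped.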
Excluding the hp3330 locus did not affect the resolution of the method: the remaining five loci still distinguished fifteen individual genotypes with a high diversity index (DI=0.95) among 21 H. pylori strains. Primers for in silico PCR were designed using the original program Primer M (Table 1).
Recently, the minimal spanning tree (MST) method, regarded as the most accurate, has been increasingly used to clarify the phylogenetic relationships between microbial strains of different species [18][19][20]. The five INDEL loci (hp5605, hp6405, hp340, hp1390, hp3660) were used to type in silico 21 H. pylori strains from the GenBank database. The results of the MST analysis (Figure 1) demonstrated an accurate division of strains into clusters according to their geographical origin. There is significant variability of individual genotypes within each population: 3 genotypes in hspWAfrica, 5 in hspEAsia, and 7 in hpEurope. The high degree of genetic diversity in hpEurope is attributed to the mixing of two ancestral populations, AE1 and AE2, which may have reached Europe at different times and from different sources: AE1 was present throughout Central Asia in addition to Europe, and AE2 came from North Africa and Central Asia [21]. In a recent study, Vale et al. [22] were able to differentiate two European populations, represented mainly in Northern Europe (with genetic affinity to AE1) and Southern Europe (with genetic affinity to AE2), by typing the nucleotide sequences of two H. pylori prophages. The distribution of INDEL genotypes in the hpEurope cluster also indicates the existence of two distinct groups of strains (Fig. 1).
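To illustrate how an MST can be derived from INDEL profiles, the sketch below applies Prim's algorithm to genotypes, taking the number of loci with differing alleles as the distance between two genotypes. Commercial packages may apply additional tie-breaking rules that are not reproduced here; the distance measure is an assumption, and the genotype names and allele sizes are hypothetical.

```python
def hamming(p, q):
    """Number of loci at which two allele profiles differ."""
    return sum(a != b for a, b in zip(p, q))


def minimum_spanning_tree(profiles):
    """Prim's algorithm; returns a list of edges (name_u, name_v, distance)."""
    names = sorted(profiles)
    if not names:
        return []
    in_tree = {names[0]}
    edges = []
    while len(in_tree) < len(names):
        # Pick the cheapest edge joining the tree to a node outside it;
        # ties are broken by node name through tuple comparison.
        d, u, v = min(
            (hamming(profiles[u], profiles[v]), u, v)
            for u in in_tree
            for v in names
            if v not in in_tree
        )
        edges.append((u, v, d))
        in_tree.add(v)
    return edges


# Hypothetical genotypes: allele sizes (bp) at five loci.
genotypes = {
    "G1": (100, 200, 300, 400, 500),
    "G2": (100, 200, 300, 400, 512),
    "G3": (100, 209, 300, 400, 512),
    "G4": (118, 209, 306, 400, 512),
}
for edge in minimum_spanning_tree(genotypes):
    print(edge)
```

Genotypes that differ at a single locus end up directly connected, so closely related strains form contiguous branches of the tree.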

INDEL-typing of clinical isolates
At the primer design stage, we faced the problem of extreme polymorphism of the H. pylori genome and, consequently, very high variability in the regions flanking the INDEL markers. Nevertheless, the problem was solved, and the vast majority of primers were also specific in vitro (Fig. 2). For each locus, different alleles are clearly separated by electrophoresis in 8% polyacrylamide gel and can be easily identified (Fig. 2). (Table 2). According to the AuSeTTS program [17], the 21 strains are represented by 16 individual INDEL genotypes, which indicates significant heterogeneity of the strains studied.

Building and analyzing a phylogenetic tree
To clarify the phylogenetic relationships between the clinical isolates and the GenBank strains of known geographical origin, the MST procedure was applied; the result is shown in Figure 3. The structure of phylogenetic relationships differs somewhat from the original one (Figure 1). Cluster A, represented by hspWAfrica strains, was supplemented by one Astrakhan strain. Cluster B, represented by strains of the hspEAsia population, underwent some changes: one strain was "displaced" to cluster C. The remaining 20 strains of the Astrakhan population belong to cluster C, which also contains the hpEurope strains from the GenBank database. The strains of the Astrakhan population are represented both by INDEL genotypes shared with GenBank strains (4) and by unique individual ones (11), which emphasizes the high genetic heterogeneity of hpEurope population strains. These additional genotypes are represented by three phylogenetic lineages that extend the genetic diversity of European strains. In total, the 42 strains studied are represented by 27 individual genotypes with a high diversity index (DI=0.95).
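The tally of shared and unique genotypes can be expressed as simple set operations on allele profiles. The sketch below is only an illustration with hypothetical strain names and allele sizes; it is not the procedure used by the typing software.

```python
def compare_genotypes(reference, isolates):
    """Split isolate genotypes into shared and unique relative to a reference set.

    Both arguments map strain name -> allele profile (a tuple of allele sizes).
    Returns (shared genotypes, unique genotypes, total distinct genotypes).
    """
    ref_set = set(reference.values())
    iso_set = set(isolates.values())
    shared = iso_set & ref_set
    unique = iso_set - ref_set
    return shared, unique, len(ref_set | iso_set)


# Hypothetical two-locus profiles (allele sizes in bp).
reference = {"r1": (100, 200), "r2": (100, 212)}
isolates = {"i1": (100, 200), "i2": (109, 200), "i3": (109, 200)}
shared, unique, total = compare_genotypes(reference, isolates)
print(len(shared), len(unique), total)
```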

Figure 3
Common MST phylogenetic tree of 21 H. pylori strains from the GenBank database and 21 Astrakhan strains

Thus, the proposed INDEL-typing method almost perfectly reflects the geographical distribution of H. pylori strains determined by the MLST method, despite the fact that the primary object of research is completely different genes.

Conclusion
An INDEL-typing method for differentiating H. pylori strains, based on widely accessible PCR, is proposed for the first time. The method determines the alleles of five INDEL loci by PCR and then examines their phylogenetic relationships by MST. Twenty-one H. pylori strains of known geographical origin from the GenBank database were typed. The results show very good agreement with MLST typing data. The method was successfully tested in vitro on regional Russian strains.