Draft genomes of Shigella strains used by the STOPENTERICS consortium

Background: Despite a significant global burden of disease, no vaccine against shigellosis is widely available. One aim of the European Union-funded STOPENTERICS consortium is to develop vaccine candidates against Shigella. Given the importance of translational vaccine coverage, we aimed to characterise the Shigella strains being used by the consortium by whole genome sequencing, and to report on the stability of strains cultured in different laboratories or through serial passage.
Methods: We sequenced, de novo assembled and annotated 20 Shigella strains being used by the consortium. These comprised 16 different isolates belonging to 7 serotypes, and 4 derivative strains. Derivative strains from common isolates were manipulated in different laboratories or had undergone multiple passages in the same laboratory. Reads from each strain were mapped against reference genomes to detect SNP variation, and phylogenetic analysis was performed.
Results: The genomes assembled into similar total lengths (range 4.14–4.83 Mbp) and had similar numbers of predicted coding sequences (average of 4,400). Mapping analysis showed the genetic stability of strains through serial passage and culturing in different laboratories, as well as varying levels of similarity to published reference genomes. Phylogenetic analysis revealed three main clades among the strains and published references: one containing the Shigella flexneri serotype 6 strains, a second containing the remaining S. flexneri serotypes, and a third comprising the Shigella sonnei strains.
Conclusions: This work increases the number of publicly available Shigella genomes and specifically provides information on strains being used for vaccine development by STOPENTERICS. It also provides information on the variability among strains maintained in different laboratories and through serial passage. This work will guide the selection of strains for further vaccine development.


*Correspondence: nrt@sanger.ac.uk. †Omar Rossi and Kate S Baker contributed equally.

Background
Shigella are Gram-negative bacteria and the etiologic agents of shigellosis, a global human health problem, especially in developing countries and in children younger than 5 years. Shigellosis is estimated to cause 125 million cases and 100,000 deaths annually [1], and is one of the main causes of traveller's diarrhea. The genus Shigella comprises four serogroups (Shigella dysenteriae, Shigella sonnei, Shigella flexneri and Shigella boydii), subdivided into 50 different serotypes based on the carbohydrate composition of the O antigen of their lipopolysaccharide [2], and the serotypes present vary among regions and over time [3]. As no vaccines are currently widely available, one of the aims of the European Union-funded STOPENTERICS consortium (Vaccination against Shigella and ETEC: novel antigens, novel approaches) [4] is to develop novel vaccine candidates against Shigella [e.g. the Generalized Modules for Membrane Antigens (GMMA) approach [5,6]], as well as to improve the immunogenicity of existing antigens (e.g. synthetic chemistry for glycoconjugates [7]). To this end, partners of the STOPENTERICS consortium have been integrating basic research, particularly genomics, transcriptomics, proteomics, and other high-throughput technologies, with novel vaccine technologies and synthetic chemistry [7]. To bring together Shigella expertise so that novel vaccine candidates can be identified and rapidly taken through to clinical trials, the research is carried out across academic institutions (e.g. University of Oxford, Wellcome Trust Sanger Institute, Institut Pasteur) and vaccine companies (Novartis Vaccines Institute for Global Health and Sanofi-Pasteur).
To ensure the congruence of strains between laboratories, and to create a public resource for vaccine development and further Shigella research, we sequenced the whole genomes of the Shigella strains used by the STOPENTERICS consortium, which were chosen because they offer the most effective breadth of cross-protection against Shigella spp. in endemic areas [8], and report the assembly and annotation of their draft genomes. We assessed SNP variation between strains and against references, defined their phylogenetic relationships, and compared the genetic stability of strains maintained in different consortium laboratories and after serial passage.

Bacterial strains
The Shigella strains analysed in this study and relevant metadata are summarized in Table 1. Strains were serotyped by slide agglutination using commercially available monovalent antisera (Denka Seiken, Japan) to all type-specific somatic antigens and group factor antigens [9].

DNA extraction and genome sequencing
Bacterial cultures were grown overnight in liquid Luria-Bertani (LB) medium to an optical density (measured at 600 nm) of approximately 3. Genomic DNA was isolated using the Wizard kit (Promega, Madison, WI, USA) according to the manufacturer's instructions. Purified DNA was then sequenced at the Wellcome Trust Sanger Institute (WTSI). Paired-end libraries with an insert size of approximately 500 bp were generated and sequenced (150 bp reads) on the Illumina MiSeq instrument (San Diego, CA, USA) according to in-house protocols [10,11]. Sequence data for each of the strains were deposited in the European Nucleotide Archive (accession numbers in Table 1).

Genomic analysis
Resulting sequencing reads were trimmed using Trimmomatic v0.27 [12] to remove adapters, bases with a PHRED score of <30, and remaining reads with lengths <50 bp.
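The quality and length criteria above can be illustrated with a minimal sketch. It is not Trimmomatic itself: it only trims bases below the PHRED threshold from both ends of a read (in the spirit of Trimmomatic's LEADING, TRAILING and MINLEN steps) and then applies the length filter. The exact Trimmomatic steps used in the study are not stated, so the end-trimming strategy, the Phred+33 encoding and the function names are assumptions.

```python
def phred_scores(quality_string, offset=33):
    """Decode a FASTQ quality string into integer PHRED scores.

    Assumes Phred+33 encoding (an assumption, typical for recent Illumina data).
    """
    return [ord(ch) - offset for ch in quality_string]


def trim_read(sequence, quality_string, min_quality=30, min_length=50):
    """Trim low-quality bases from both ends, then apply the length filter.

    Returns (sequence, quality_string) for a retained read, or None when the
    trimmed read is shorter than min_length.
    """
    if len(sequence) != len(quality_string):
        raise ValueError("sequence and quality string differ in length")

    scores = phred_scores(quality_string)
    start, end = 0, len(scores)

    # Walk inwards from the 5' end until a base meets the threshold.
    while start < end and scores[start] < min_quality:
        start += 1
    # Walk inwards from the 3' end in the same way.
    while end > start and scores[end - 1] < min_quality:
        end -= 1

    if end - start < min_length:
        return None
    return sequence[start:end], quality_string[start:end]
```

Adapter removal is not covered by this sketch, and low-quality bases in the interior of a read are left untouched.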
High-quality reads were then mapped to the relevant reference strains (Table 1) using SMALT (http://www.sanger.ac.uk/resources/software/smalt/), and single nucleotide polymorphisms (SNPs) were called using Samtools [13]. Nucleotides where mapping quality was below 30 and genotyping quality was below 50 were excluded from further analysis. Mapping coverage of all isolates was approximately 70-fold.
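As a rough illustration of the site filtering, the sketch below keeps a variant site only when both quality thresholds are met. Reading the criterion this way (a site is retained only if mapping quality is at least 30 and genotype quality is at least 50) is an assumption, as are the `Site` record and its field names; the study itself used Samtools output rather than this data structure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Site:
    """Hypothetical per-site record; the field names are illustrative only."""
    position: int
    ref_base: str
    alt_base: str
    mapping_quality: float
    genotype_quality: float


def passes_filters(site, min_mapping_quality=30, min_genotype_quality=50):
    """Return True when a site meets both quality thresholds (assumed reading)."""
    return (site.mapping_quality >= min_mapping_quality
            and site.genotype_quality >= min_genotype_quality)


def filter_snps(sites, min_mapping_quality=30, min_genotype_quality=50):
    """Keep sites that differ from the reference and pass both thresholds."""
    return [
        site for site in sites
        if site.alt_base != site.ref_base
        and passes_filters(site, min_mapping_quality, min_genotype_quality)
    ]
```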
De novo assembly was performed using Velvet Optimiser [14] and contiguous sequences were annotated using Prokka [15]. Clustering and BLAST comparisons were used to determine the presence/absence of genes in annotated assemblies as previously described [16].
To prepare a multiple sequence alignment for phylogenetic analysis, sequencing reads from strains in this study and simulated FASTQ reads generated from published reference genomes were mapped to the chromosome of S. flexneri 2457T (GenBank accession: NC_004741.1). The other reference isolates (and their accessions) used in this analysis were: S. boydii Sb 227 (NC_007613.1). Core genes (n = 2,427) with 100% mapping coverage in all isolates were identified, and phylogenetic analysis was performed using RAxML v7.0.3 [17] on the 43,349 variable sites (subset from 2,306,256 bp) of these core genes.
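Reducing a core-gene alignment to its variable sites can be sketched as follows. This is only an illustration of the idea, not the pipeline used in the study: the alignment is assumed to be a dictionary of equal-length sequences, and a column is treated as variable when it contains more than one unambiguous base (A, C, G or T). How gaps and ambiguous bases were actually handled is not stated, so ignoring them here is an assumption.

```python
def variable_site_columns(alignment):
    """Return the indices of alignment columns with more than one base.

    alignment maps a strain name to its aligned sequence; all sequences
    must have the same length. Characters other than A, C, G, T (e.g. N
    or '-') are ignored when deciding whether a column varies.
    """
    sequences = list(alignment.values())
    if not sequences:
        return []
    length = len(sequences[0])
    if any(len(seq) != length for seq in sequences):
        raise ValueError("aligned sequences must have equal length")

    columns = []
    for index in range(length):
        bases = {seq[index].upper() for seq in sequences} & set("ACGT")
        if len(bases) > 1:
            columns.append(index)
    return columns


def extract_variable_sites(alignment):
    """Build a reduced alignment containing only the variable columns."""
    columns = variable_site_columns(alignment)
    return {
        name: "".join(seq[index] for index in columns)
        for name, seq in alignment.items()
    }
```

Applied to the core genes described above, a reduction of this kind takes the 2,306,256 bp alignment down to the 43,349 variable sites passed to the tree-building step.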
In silico molecular serotyping of S. flexneri isolates was performed on the de novo assembly of each isolate, as described in [18]. Briefly, the presence/absence of, and known differences in, the gtr genes (encoding enzymes responsible for the type-specific antigens I, II, IV, V, X and IC), the oac genes (encoding enzymes that mediate O-acetylation in serotypes 1b, 3a, 3b and 4b) and wzx6 (specific for serotype 6) were analyzed, allowing the six different S. flexneri serotypes to be differentiated.

Results and discussion
Sixteen different Shigella isolates belonging to seven different serotypes were sequenced (listed in Table 1). These included S. sonnei (2 isolates) and different S. flexneri serotypes including 1a, 1b (2 isolates), 2a, 3a, 5a and 6 (eight different isolates), plus four derivative strains that had either undergone serial passage (S. sonnei 53G, S. flexneri 2a 2457T) or been cultivated and had their DNA extracted in different laboratories (S. flexneri 3a 6865 and S. flexneri 6 10.5302). Derivative strains from the same isolate but manipulated in different laboratories of the STOPENTERICS consortium were denoted '_1' and '_2', whereas those that had undergone serial passage (~10 passages) in the same laboratory were denoted '_p'. The derivatives allowed us to assess the genetic stability of strains across laboratories and through serial passage. Results of genomic assembly and annotation were similar for all strains (Table 1). The strains assembled into an average of 381 contigs (range 265-446), with an average contig length of 12,141 bp (range 9,897-15,619) and an N50 of 28,620 bp (range 22,494-35,991). The resulting genome size was similar for all the strains and fell within the range of 4.14-4.83 Mbp. Similarly, automated annotation predicted an average of 4,400 coding sequences per genome (range 4,044-4,583; Table 1). The serotypes of the Shigella strains were confirmed based on the combinations of gtr and oac genes, encoding the relevant enzymes for the serotype-specific OAg modifications [18] (not shown).
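The assembly metrics reported above (contig count, mean contig length, total length and N50) can be computed from a list of contig lengths. The sketch below uses the common definition of N50: the length of the shortest contig in the smallest set of longest contigs that together cover at least half of the assembly. The function name and the returned dictionary keys are illustrative, and the study's own statistics may have been produced by a different tool.

```python
def assembly_stats(contig_lengths):
    """Summarise an assembly from its contig lengths (in bp)."""
    lengths = sorted(contig_lengths, reverse=True)
    if not lengths:
        raise ValueError("at least one contig is required")

    total = sum(lengths)
    running = 0
    n50 = 0
    # Add contigs from longest to shortest until half the assembly is covered.
    for length in lengths:
        running += length
        if running * 2 >= total:
            n50 = length
            break

    return {
        "contigs": len(lengths),
        "total_length": total,
        "mean_length": total / len(lengths),
        "n50": n50,
    }
```

For example, contigs of 10, 8, 6, 4 and 2 bp give a total of 30 bp; the two longest contigs cover 18 bp, which is more than half, so the N50 is 8 bp.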
To facilitate strain comparisons and phylogenetic analysis, sequence reads were mapped to existing Shigella reference genomes (Table 1). The percentage of the reference genome covered by mapped reads ranged from 87 to 98%, and the number of SNPs varied depending on the isolate (Table 1). Comparatively few SNPs (<200) were found when an isolate was compared to a previously published reference of itself (as in the case of S. sonnei 53G, S. flexneri 2a 2457T and S. flexneri 5a M90T). Higher numbers of SNPs were seen where no such reference was available. For example, several hundred SNPs were seen when an isolate was mapped to a reference genome of a different isolate of the same serotype (e.g. Ss_25931 mapped against Ss_53G), and several thousand SNPs were seen when an isolate was mapped to a reference isolate from a phylogenetically related but distinct serotype (e.g. S. flexneri 6 isolates mapped against S. boydii strain Sb227).
To assess the genomic stability of isolates held at different laboratories and through serial passage within the same laboratory, we resequenced a number of isolates and compared their mapping results to the relevant reference (Table 1). Two isolates (original and passaged) of S. sonnei 53G had only two SNPs relative to the published reference genome, and these SNPs were the same in both isolates. Similarly, the sequences of original and passaged S. flexneri 2a strain 2457T were very similar, with 195 and 192 SNPs, respectively, relative to the published reference genome. Among these SNPs, 188 were common to both isolates, and the remaining seven and four sites, respectively, were not resolved in the other isolate, indicating that the two isolates were likely identical to each other. The level of genetic variation compared to the reference strain (~200 SNPs) was surprising and may have biological significance, showing the utility of obtaining up-to-date genetic information for the exact strain being worked with in a given project. Two strains, Sf 3a_6865 and Sf 6_10.5302, were manipulated for sequencing in separate laboratories in the consortium. The two derivatives of these strains differed by only one and two SNPs, respectively, indicating that over a 2-3 year time period, isolate genomes remain relatively stable through passage and between laboratories, but may differ substantially from published references.
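The comparison of SNP calls between two derivatives of the same isolate amounts to simple set arithmetic. The sketch below is a hypothetical helper, not the study's code: it takes the SNP positions called in each derivative and reports the shared and derivative-specific sites. With 195 and 192 SNPs of which 188 are shared, the derivative-specific counts are 195 - 188 = 7 and 192 - 188 = 4.

```python
def compare_snp_sets(snps_a, snps_b):
    """Split two collections of SNP positions into shared and unique sites."""
    set_a, set_b = set(snps_a), set(snps_b)
    return {
        "shared": set_a & set_b,
        "only_a": set_a - set_b,
        "only_b": set_b - set_a,
    }


# Placeholder positions chosen only to reproduce the reported counts;
# they are not real coordinates from the 2457T genome.
shared_sites = range(1, 189)                      # 188 shared SNPs
derivative_a = list(shared_sites) + list(range(1001, 1008))  # 188 + 7 = 195
derivative_b = list(shared_sites) + list(range(2001, 2005))  # 188 + 4 = 192

summary = compare_snp_sets(derivative_a, derivative_b)
counts = {name: len(sites) for name, sites in summary.items()}
```

A site that appears in only one set is not necessarily a true difference: as noted above, such sites may simply be unresolved in the other derivative, so they would need to be checked against the mapping data before being interpreted as mutations.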
To assess the phylogenetic relationships of the isolates, we constructed a maximum likelihood phylogenetic tree from a large core genome shared among the strains (Figure 1). Consistent with expectations based on prior evolutionary studies of shigellae [19,20], the strains were divided into three main clades, with the S. flexneri 6 strains being phylogenetically removed from the remaining S. flexneri serotypes, and the S. sonnei strains forming a separate clade.

Conclusions
The work presented here increases the number of publicly available Shigella genomes, including, for the first time, sequencing data for S. sonnei 25931, two S. flexneri 1b, one S. flexneri 1a, one S. flexneri 3a and eight S. flexneri 6 isolates. We provide details on the draft genomes generated from these sequencing data, and report SNP variation in strains maintained in different laboratories and after serial passage. We also describe the relatedness of the strains and isolates used by the STOPENTERICS consortium, and have deposited these data as a public resource. The data presented in this work will guide the selection of strains for further vaccine development and contribute to a growing awareness of diversity in Shigella.

Authors' contributions
OR, KB and NRT analyzed the sequencing data. OR, KB, AP, FXW, FC, PJS, CG and NRT participated in data collection and analysis and contributed to the writing of the manuscript. All authors read and approved the final manuscript.
1 Novartis Vaccines Institute for Global Health, s.r.l., a GSK Company, Siena, Italy.
2 Wellcome Trust Sanger Institute, Hinxton, UK.
3 Institut Pasteur, Paris, France.