A draft genome of Escherichia coli sequence type 127 strain 2009-46

Background: Escherichia coli are a frequent cause of urinary tract infections (UTI) and are thought to have a foodborne origin. E. coli with sequence type 127 (ST127) are emerging pathogens increasingly implicated as a cause of UTI globally. An ST127 isolate (2009-46) resistant to ampicillin and trimethoprim was recovered from the urine of a 56-year-old patient with a UTI at a hospital in Sydney, Australia, and is characterised here. Results: We sequenced the genome of Escherichia coli 2009-46 using the Illumina Nextera XT and MiSeq technologies. Assembly of the sequence data reconstructed a 5.14 Mbp genome in 89 scaffolds with an N50 of 161 kbp. The genome has extensive similarity to other sequenced uropathogenic E. coli genomes, but also has several genes potentially related to virulence and pathogenicity that are not present in the reference E. coli strain. Conclusion: E. coli 2009-46 is a multiple antibiotic resistant, phylogroup B2 isolate recovered from a patient with a UTI. This is the first description of a drug resistant E. coli ST127 in Australia.


Background
Escherichia coli infections of the urinary tract are among the most frequent infections reported in the developed world, with an estimated 130-175 million cases per annum worldwide [1]. E. coli that cause urinary tract infections (UTI) are classified as uropathogenic Escherichia coli (UPEC), a subgroup of extraintestinal pathogenic E. coli (ExPEC). ExPEC also cause a range of other afflictions, including meningitis, septicaemia, and pneumonia, and are genotypically and phenotypically distinct from diarrhoeagenic E. coli (DEC) [2]. ExPEC are thought to be acquired orally via the consumption of contaminated food and are considered to be zoonotic pathogens [3][4][5]. The emergence of multiple antibiotic resistance among ExPEC poses a serious health threat because antibiotics are an important treatment strategy for controlling UTI.
Multilocus sequence typing (MLST) is currently the gold standard for characterising E. coli causing UTI.

Methods
The isolate was supplied on a Sensi-agar plate from the SAN laboratories in Sydney, Australia. To confirm pure culture, a loopful of the isolate was streaked onto a Luria Bertani (LB) agar plate and incubated at 37°C. A single colony was picked from the plate and subcultured in 10 mL LB broth at 37°C overnight. For long-term storage at −80°C, 7 mL of the overnight culture was used to prepare a glycerol stock, and genomic DNA for sequencing was prepared from the remaining 3 mL using the ISOLATE II gDNA extraction kit from Bioline.

Genome sequencing
DNA was quantified using Qubit fluorometry, and 0.5 ng of gDNA was used as template to construct the sequencing library with the Illumina Nextera XT library preparation protocol, following the manufacturer's instructions with one exception: the "PCR Clean-Up" and "Library Normalization" steps were omitted, and size selection was instead performed by running balanced and pooled samples in a 1% agarose gel and excising the 600 bp to 1200 bp region of interest. The DNA was then purified from the agarose using Promega's Wizard SV Gel and PCR Clean-Up System. Finally, an Agilent 2100 Bioanalyzer, with a High Sensitivity DNA Kit, was used to quantitate the pooled DNA library before loading onto the MiSeq with other multiplexed samples. Two MiSeq runs were carried out, one with paired-end 250 nt reads on MiSeq V2 chemistry and another with paired-end 300 nt reads on V3 chemistry. The first library had an average insert size of 368 ± 157 nt, while the second library had an average insert size of 497 ± 118 nt.

Assembly and annotation
The genome was assembled using the A5-miseq pipeline, a version of the A5 pipeline [10] that has been revised to process reads up to 500 nt long. Briefly, the A5-miseq pipeline consists of five stages: (1) read quality filtering and error correction, (2) contig assembly, (3) permissive draft scaffolding, (4) misassembly detection, and (5) conservative scaffolding. The revised A5 pipeline uses a new version of idba_ud that uses read pairing information, and that has been modified to accept reads up to 500 nt long and to construct de Bruijn graphs with k-mers up to 500 nt. These modifications provide substantial improvements in assembly contiguity.
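As a rough illustration of what constructing a de Bruijn graph from reads involves, the sketch below builds the graph for a single value of k from two toy reads. It is not the idba_ud or A5-miseq implementation; the function name `de_bruijn_edges` and the example reads are hypothetical, and real assemblers add error correction, graph simplification, and the use of read pairing information.

```python
from collections import defaultdict


def de_bruijn_edges(reads, k):
    """Build a de Bruijn graph for one k (illustrative only).

    Nodes are (k-1)-mers; each k-mer seen in a read adds an edge from
    its (k-1)-mer prefix to its (k-1)-mer suffix.
    """
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return dict(graph)


# Hypothetical toy reads that overlap by five bases.
toy_reads = ["ACGTAC", "CGTACG"]
toy_graph = de_bruijn_edges(toy_reads, k=4)
for node in sorted(toy_graph):
    print(node, "->", sorted(toy_graph[node]))
```

Longer k-mers resolve more repeats but require longer error-free stretches of read, which is one reason support for longer reads and larger k can improve contiguity.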
The genome was annotated with the RAST annotation system using FigFAM release 70 [11]. Putative antibiotic resistance genes and other genes of interest identified by RAST annotation were manually curated using the NCBI ORF finder and iterative BLASTn and BLASTp searches.

Quality assurance
The A5 pipeline includes a quality checking step that detects putative misassemblies by identifying clusters of read pairs that map to disjoint locations in the assembled genome. This method did not detect any putative misassemblies.
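The idea behind this check can be sketched in a few lines. The example below is a simplified stand-in, not the A5 algorithm: it assumes read pairs have already been mapped to the assembly, treats a pair as discordant when its mates map to different scaffolds or farther apart than a maximum insert size, and reports fixed-width bins that collect several such pairs. The function name, the binning scheme, and every threshold are assumptions chosen for illustration.

```python
from collections import Counter


def flag_discordant_clusters(pairs, max_insert, window, min_pairs):
    """Return (scaffold, bin_start) bins holding >= min_pairs discordant pairs.

    pairs: iterable of (scaffold1, pos1, scaffold2, pos2) mapping positions.
    A pair is discordant if its mates map to different scaffolds or lie
    more than max_insert apart on the same scaffold.
    """
    counts = Counter()
    for scaf1, pos1, scaf2, pos2 in pairs:
        discordant = scaf1 != scaf2 or abs(pos2 - pos1) > max_insert
        if discordant:
            # Bin by the position of the first mate.
            counts[(scaf1, pos1 // window * window)] += 1
    return sorted(key for key, n in counts.items() if n >= min_pairs)


# Hypothetical mapped read pairs: three pairs near position 5000 on s1
# point to scaffold s2, and one isolated pair spans an implausible distance.
toy_pairs = [
    ("s1", 100, "s1", 400),
    ("s1", 5100, "s2", 200),
    ("s1", 5300, "s2", 350),
    ("s1", 5600, "s2", 500),
    ("s1", 9000, "s1", 30000),
]
print(flag_discordant_clusters(toy_pairs, max_insert=1200, window=1000, min_pairs=3))
```

Requiring a cluster of pairs rather than a single pair keeps isolated chimeric or mismapped reads from being reported as misassemblies.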

Initial findings
Sequencing generated 1,702,236 read pairs for a total of 483,658,987 nt that were assembled to reconstruct the 5,139,229 bp genome of E. coli 2009-46 in 89 scaffolds, with a scaffold N50 of 161 kbp and an N90 of 30.8 kbp. The raw (unfiltered) coverage is 94x, and after read filtering the assembly has a median depth of coverage of 61x. The annotation of this assembly identified 5084 predicted CDS and 106 predicted RNA genes. Nineteen genes were identified as possibly missing from the assembly by the RAST system. The overall functional profile of the genome is shown in Figure 1.
We conducted a phylogenetic analysis of E. coli 2009-46 using the PhyloSift software [12] to identify the most closely related organism with an available reference genome. PhyloSift works by identifying homologs in the draft genome to a set of 37 genes that are universally conserved among bacteria and archaea and present in single copy. It then adds any homologs found in the draft genome to an existing multiple sequence alignment containing the 37 genes from a subset of all genomes publicly available in the NCBI and EBI databases, chosen to span the phylogenetic diversity of these databases. The PhyloSift reference database includes only a single representative from groups of closely related organisms. To gain additional resolution within Escherichia, we used PhyloSift to construct a multiple alignment of the 37 marker genes from all finished E. coli genomes available in the NCBI database as of September 2013. We then inferred a phylogeny from that alignment using FastTree2 [13]. The resulting analysis, shown in Figure 2, identified E. coli 536 as the most closely related isolate with a finished genome available, although there was some uncertainty in the 37 gene alignment as to whether E. coli 2009-46 diverged on the same lineage as E. coli 536. We used the closely related genome of E. coli 536 as a reference for further comparative analysis.
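The contiguity and coverage figures reported at the start of this section follow from simple arithmetic, sketched below. The helper `nx` and the toy scaffold lengths are hypothetical (the real scaffold lengths are not listed here); only the total sequenced nucleotides and the assembly size are taken from the text.

```python
def nx(lengths, fraction):
    """Return the Nx statistic, e.g. fraction=0.5 gives N50.

    Scaffolds are visited from longest to shortest; the result is the
    length of the scaffold at which the running total first reaches
    `fraction` of the assembly size.
    """
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running >= fraction * total:
            return length
    raise ValueError("lengths must be non-empty")


# Hypothetical toy scaffold lengths, not the 2009-46 assembly.
toy_lengths = [40, 30, 15, 10, 5]
print("toy N50:", nx(toy_lengths, 0.5))
print("toy N90:", nx(toy_lengths, 0.9))

# Raw coverage from the figures reported in the text.
total_sequenced_nt = 483_658_987
assembly_size_bp = 5_139_229
raw_coverage = total_sequenced_nt / assembly_size_bp
print(f"raw coverage: {raw_coverage:.1f}x")
```

The raw coverage computed this way rounds to the 94x quoted above; the lower median depth after filtering reflects reads and bases removed during quality filtering.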
Figure 2 Phylogeny of E. coli and Shigella including the 2009-46 isolate. A phylogeny inferred on a concatenated set of codon alignments from 37 universally conserved genes is shown, as calculated by PhyloSift [12] and FastTree2 [13]. The phylogeny has been rooted on the branch leading to Salmonella and internal nodes are labeled with SH-like support values.
The scaffolds of E. coli 2009-46 were reordered to match the order in the finished genome of the closely related strain E. coli 536 using the Mauve Contig Mover [14]. After reordering, the genomes had 82 predicted rearrangement breakpoints. Many of these cluster in regions containing annotated transposase genes and multi-copy transporter gene families, suggesting that either homology-mediated rearrangement or misassembly has occurred at these repetitive sequences. To further characterize the structure of the genome we used the CGview webserver [15] to plot matches to annotated proteins and the GC skew of the genome, with scaffolds ordered according to the E. coli 536 reference. The CGview plot is shown in Figure 3. Of note, the GC skew in the E. coli 2009-46 genome appears to fluctuate frequently. This pattern is in sharp contrast to the GC skew of the E. coli 536 reference, which shows a strong pattern coinciding with the chromosome's replication arms (data not shown). This suggests that E. coli 2009-46 has undergone substantial genome rearrangement in the recent past, that the true genome arrangement does not match the E. coli 536 reference very closely, that undetected misassembly errors exist in the E. coli 2009-46 genome, or that some combination of these three situations exists. We note that our assembly pipeline contains a step to detect and fix misassembly errors; none were found in the genome of E. coli 2009-46.
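For readers unfamiliar with GC skew, the quantity plotted is conventionally (G − C)/(G + C) computed in windows along the sequence. The sketch below shows that calculation on a toy sequence; it is not the CGview implementation, and the function name, the non-overlapping window scheme, and the example sequence are assumptions made for illustration.

```python
def gc_skew(sequence, window):
    """Return (G - C) / (G + C) for consecutive non-overlapping windows.

    Windows without any G or C are assigned a skew of 0.0; a trailing
    partial window is ignored.
    """
    skews = []
    for start in range(0, len(sequence) - window + 1, window):
        chunk = sequence[start:start + window].upper()
        g = chunk.count("G")
        c = chunk.count("C")
        skews.append((g - c) / (g + c) if g + c else 0.0)
    return skews


# Hypothetical toy sequence: a G-rich window followed by a C-rich window.
toy_sequence = "GGGGCCAATT" + "CCCCGGAATT"
print(gc_skew(toy_sequence, window=10))
```

In a correctly ordered circular bacterial chromosome the sign of the skew typically switches near the replication origin and terminus, so frequent sign changes along ordered scaffolds are consistent with the rearrangement or ordering problems discussed above.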
Comparison of the gene content between E. coli 2009-46 and the finished E. coli 536 reference genome identified 164 annotated gene functions predicted to be present only in E. coli 2009-46. Included among these are several genes related to scavenging iron, a type VII secretion system, an IncF conjugation system, mediators of hyperadherence, and copper and mercury resistance genes. The gene functions found only in 2009-46 and those that 2009-46 lacks relative to the reference isolate are listed in Additional files 1 and 2, respectively.
The blaTEM-1 gene, conferring resistance to ampicillin, was present on scaffold 78.1 (2551 nt), while the sul2-strA-strB genes, conferring resistance to sulphonamides and streptomycin, were located on scaffold 67.1 (5064 nt). The ends of both scaffolds had a partial copy of the insertion element IS26. The isolate also houses a clinical class 1 integron and two associated resistance genes on scaffold 71.1. One of the two resistance genes is a variant of the dihydrofolate reductase (dhfr) gene, which confers trimethoprim resistance, and the other (aadA) confers resistance to aminoglycoside antibiotics. However, scaffold 71.1 is only 3,863 nt long and also has a copy of IS26 at both ends. We identified the 3'-CS of a class 1 integron on scaffold 58.1 (6679 nt), which had an IS26 at one end and an IS1 element at the other. During assembly, the presence of IS26 elements at both ends of seven scaffolds resulted in scaffold breaks around a region of the genome that most likely harbours a complex resistance locus (CRL). We were therefore unable to confirm

Antibiotic resistance profile
The antibiotic resistance profile of E. coli 2009-46 was experimentally determined using the disk diffusion method. This strain was found to be resistant to ampicillin, trimethoprim, sulphafurazole, tetracycline, streptomycin, apramycin, kanamycin, and azithromycin. A full list of the antibiotics tested and the susceptibility of E. coli 2009-46 to each is provided in Additional file 3.
To better understand the genomic basis for the observed antibiotic resistance traits, the genome was searched for specific genes known to confer antibiotic resistance. A listing of these genes and their presence or absence in E. coli 2009-46 is provided in Additional file 3.

Future directions
Improved efficiency of clinical genomics pipelines will eventually enable fine-scale epidemiological monitoring of E. coli outbreaks in real time. When fully developed, this capacity will influence clinical and public health decisions related to treatment and control of pathogen outbreaks. Genomic data such as those presented here will aid in the interpretation of data from future outbreaks.

Availability of supporting data
The draft genome assembly has been submitted to NCBI and is associated with BioSample accession SAMN02725027. Genome annotations are available from the RAST web server under accession 562.3620. The Illumina sequence reads have been deposited to the Short Read Archive under accessions SRX514806 and SRX514807.