Fusion Protein Antibiotic Resistance in One Reading
VAMPr: VAriant Mapping and Prediction of antibiotic resistance via explainable features and machine learning
- Jiwoong Kim,
- David E. Greenberg,
- Reed Pifer,
- Shuang Jiang,
- Guanghua Xiao,
- Samuel A. Shelburne,
- Andrew Koh,
- Yang Xie,
- Xiaowei Zhan
- Published: January 13, 2020
- https://doi.org/10.1371/journal.pcbi.1007511
Abstract
Antimicrobial resistance (AMR) is an increasing threat to public health. Current methods of determining AMR rely on inefficient phenotypic approaches, and there remains incomplete understanding of AMR mechanisms for many pathogen-antimicrobial combinations. Given the rapid, ongoing increase in availability of high-density genomic data for a diverse array of bacteria, development of algorithms that could utilize genomic data to predict phenotype could both be useful clinically and help with discovery of heretofore unrecognized AMR pathways. To facilitate understanding of the connections between DNA variation and phenotypic AMR, we developed a new bioinformatics tool, variant mapping and prediction of antibiotic resistance (VAMPr), to (1) derive gene ortholog-based sequence features for protein variants; (2) interrogate these explainable gene-level variants for their known or novel associations with AMR; and (3) build accurate models to predict AMR based on whole genome sequencing data. We curated the publicly available sequencing data for 3,393 bacterial isolates from 9 species that contained AMR phenotypes for 29 antibiotics. We detected 14,615 variant genotypes and built 93 association and prediction models. The association models confirmed known genetic antibiotic resistance mechanisms, such as blaKPC and carbapenem resistance, consistent with the accurate nature of our approach. The prediction models achieved high accuracies (mean accuracy of 91.1% for all antibiotic-pathogen combinations) internally through nested cross validation and were also validated using external clinical datasets. The VAMPr variant detection method, association models and prediction models will be valuable tools for AMR research for basic scientists, with potential for clinical applicability.
Author summary
Antimicrobial resistance (AMR) is a global health threat. The current method to determine AMR is inefficient, and a complete understanding of the mechanisms of AMR is lacking. With the increased feasibility of sequencing bacterial genomes, it is now easier, faster and cheaper to gain genomic insights into AMR. In this manuscript, we propose a novel bioinformatic tool for variant mapping and prediction of antibiotic resistance (VAMPr). We curated 3,393 bacterial genomes from 9 bacterial species that contained AMR phenotypes for 29 antibiotics. We used protein orthology and detected 14,615 variants. Combined with AMR phenotypes, we built 93 association and prediction models. The association models confirm known genetic AMR mechanisms, and the prediction models achieved high accuracies. Together, our work will be valuable for AMR research for basic scientists, with the potential for clinical applicability.
Citation: Kim J, Greenberg DE, Pifer R, Jiang S, Xiao G, Shelburne SA, et al. (2020) VAMPr: VAriant Mapping and Prediction of antibiotic resistance via explainable features and machine learning. PLoS Comput Biol 16(1): e1007511. https://doi.org/10.1371/journal.pcbi.1007511
Editor: Morgan Langille, DAL, CANADA
Received: April 30, 2019; Accepted: October 25, 2019; Published: January 13, 2020
Copyright: © 2020 Kim et al. This is an open access article distributed under the terms of the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original author and source are credited.
Data Availability: All data are available from NCBI SRA (www.ncbi.nlm.nih.gov/sra) and NCBI Antibiograms (www.ncbi.nlm.nih.gov/biosample/docs/antibiogram/). The accession numbers for the bacterial isolates used in this manuscript are available in the S2 Table.
Funding: This work was supported by the National Institutes of Health [5P30CA142543, 1R01GM12647901A1] (XZ), [2T32AI007520-21] (RP) and the UTSW DocStars Award (DEG); Cancer Prevention Research Institute (CPRIT) [RP150596] (JK) and [RP180319] (XZ). The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.
Competing interests: The authors have declared that no competing interests exist.
Introduction
Antimicrobial resistance (AMR) is an urgent worldwide threat [1]. Decreased efficacy of antibiotics can lead to prolonged hospitalization and increased mortality [2]. Current phenotypic methods for determining whether an isolate is sensitive or resistant to a particular antibiotic can, in some instances, take days, resulting in delays in providing effective therapy [3]. Targeted methods for AMR determination, such as PCR, are limited in that they identify only a subset of resistance genes and therefore do not provide a full explanation for a particular resistance phenotype [4].
Next-generation sequencing (NGS) technology enabling whole genome sequencing (WGS) of bacterial isolates is now both cheap and widely used [5]. A recent review illustrates how this promising technology could enable genome-based prediction of antibiotic resistance [6]. We have previously shown that NGS can identify AMR determinants for a limited number of β-lactam antimicrobials using a rule-based method and that genotype correlated well with classic phenotypic testing [7]. However, that study focused on a narrow set of both antibiotics and pathogens because the links between genotype and phenotype are relatively well understood for those antibiotic/pathogen combinations. To build prediction models for a broader spectrum of antimicrobials, it is necessary to use model-based methods to study the complex relationships among resistance loci. For example, other groups have utilized NGS data to identify the presence of genes or short nucleotide sequences that confer resistance in a variety of pathogens using k-nn or adaBoost algorithms [8–10]. However, these studies have not taken advantage of gene orthology features. In addition, mechanisms of AMR for many pathogen-antibiotic combinations are not well delineated, which hinders the understanding of genotypic-phenotypic relationships. Therefore, we sought to utilize large bacterial data collections in order to develop novel approaches (association and prediction models) to characterize explainable genetic features that correlate with antimicrobial resistance.
Results
VAMPr: A novel bioinformatics resource to study microbial resistance
In order to more fully explore genotypic prediction of antibiotic resistance and build upon our previous efforts, we have developed novel methods for utilizing NGS data to better 1) characterize amino-acid based variant features, 2) expand the knowledge base of genetic associations with AMR, and 3) construct accurate prediction models for determining phenotypic resistance from NGS data in a broad array of pathogen-antibiotic combinations. We developed a novel bioinformatics resource, VAriant Mapping and Prediction of antibiotic resistance, VAMPr (Fig 1). It was built utilizing a large dataset of bacterial genomes from the NCBI Sequence Read Archive (SRA) along with paired antibiotic susceptibility data from the NCBI BioSample Antibiogram. VAMPr utilizes two different approaches, association models and prediction models, to assess genotype-phenotype relationships. In the association analysis, data-driven association models utilizing a gene ortholog approach were constructed. This allowed for unbiased screening of genotype and phenotype across a broad array of bacterial isolates. In the prediction analysis, we utilized a machine learning algorithm to develop prediction models that take NGS data and predict resistance for every pathogen-drug combination. These approaches not only confirmed known genetic mechanisms of antibacterial resistance, but also identified potentially novel or underreported correlates of resistance.
Fig 1. Overview of the VAMPr workflow.
The VAMPr pipeline processed sequence data from the NCBI Sequence Read Archive (SRA) and NCBI BioSample Antibiograms for phenotypes. The curated AMR genotypes and AMR phenotypes were used to create both association and prediction models.
https://doi.org/10.1371/journal.pcbi.1007511.g001
The VAMPr workflow is depicted in Fig 1. First, we downloaded publicly available bacterial genomes from the NCBI Sequence Read Archive (SRA) and paired antibiotic susceptibility data from the NCBI BioSample Antibiograms project. In order to identify bacterial genetic variants, we performed de novo assembly and aligned the assembled scaffolds to a curated Antimicrobial Resistance (AMR) KEGG orthology database (KO) [11]. Through this process, KO-based sequence variants were identified. CLSI breakpoints were used to determine the antibiotic phenotype (sensitive versus resistant; isolates with intermediate susceptibility were not included in the analysis) [12]. Finally, using both genetic variants and antibiotic resistance phenotypes, association and prediction models were constructed. These models are available to the research community through our website (see Data Access).
Construction of NCBI datasets of curated genotypes and phenotypes
Focusing on the isolates reported in the NCBI Antibiogram database, we retrieved 4,515 bacterial whole genome sequence datasets (Illumina platform) from NCBI SRA and their antimicrobial resistance phenotypes from the NCBI BioSample Antibiograms project. Sequence reads were de novo assembled and aligned to Multi Locus Sequence Typing (MLST) databases to validate reported bacterial species identification [13]. A total of 1,100 isolates were excluded from analysis because of inaccurate species identification. Our final analysis cohort included 3,393 isolates representing nine species: Salmonella enterica (1349 isolates), Acinetobacter baumannii (772), Escherichia coli (350), Klebsiella pneumoniae (344), Streptococcus pneumoniae (317), Pseudomonas aeruginosa (83), Enterobacter cloacae (79), Klebsiella aerogenes (68), and Staphylococcus aureus (31). A total of 38,871 MIC (minimal inhibitory concentration, the lowest antibiotic concentration that inhibits bacterial growth) values were reported for 29 different antibiotics (S1 Table and S2 Table). In total, there were 38,248 individual pathogen-drug data points identified (Fig 2A).
Fig 2. Summary of significant variant associations and prediction accuracies from 93 species-antibiotic combinations.
Both heatmaps display the counts of curated isolates by the combination of 9 bacterial species and 29 antibiotics from 13 drug categories. Boxes without a number indicate that no isolates were available for that particular bacterial species and antibiotic combination. A) The color of the boxes indicates the number of gene-antibiotic resistance associations with FDR-adjusted p-values <0.05 from VAMPr association models, and the actual numbers are shown inside the parentheses; B) the color indicates cross-validated prediction accuracies from VAMPr prediction models, and the accuracies are shown inside the parentheses.
https://doi.org/10.1371/journal.pcbi.1007511.g002
After curation, we analyzed isolates with de novo assembled genomes and MIC values, and this dataset included 93 species/antibiotic combinations for building association and prediction models (detailed in the next three sections). The fraction of resistant isolates for any given bacterium and antibiotic varied greatly (the median fraction of resistant isolates was 50.0%). As an example, for S. enterica and trimethoprim-sulfamethoxazole, the fraction of resistant isolates was 0.6%, while for K. pneumoniae and cefazolin, the fraction of resistant isolates was 97.3%. This dataset was used in both the association and prediction models.
Characterization of explainable AMR sequence variants
We curated a list of 537 Antimicrobial Resistance (AMR) KEGG ortholog (KO) genes (S3 Table) and then identified the corresponding UniRef protein sequences (a total of 298,760 sequences). Protein sequences were then clustered (using a minimal sequence similarity of 0.7). This resulted in 96,462 KO gene clusters to serve as a reference AMR protein sequence database. Next, we analyzed 3,393 de novo assembled genomes, identified the gene locations on the assembled genomes, and aligned the gene sequences to the reference AMR protein sequence database. Based on the alignment results and stringent filtering, we identified AMR genes for each isolate. Finally, the AMR genes were examined for the presence of mutations (e.g. amino acid substitutions) using multiple sequence alignment software. We defined an identifier format to represent the sequence variants. For example, K01990.129|290|TN|ID indicates that the 129th cluster of the K01990 KO gene has a mutation starting at its 290th amino acid, from threonine (T) and asparagine (N) to isoleucine (I) and aspartic acid (D).
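The identifier format described above can be unpacked programmatically. The following is a minimal sketch, not part of the VAMPr source code: the `Variant` tuple and `parse_variant` function are hypothetical names, and the assumption that an identifier without `|` fields (such as K18768.0) denotes the gene cluster itself with no amino acid change is inferred from the examples in the text. Only the substitution form shown above is handled.

```python
from typing import NamedTuple, Optional


class Variant(NamedTuple):
    """Hypothetical container for one VAMPr-style variant identifier."""
    ko: str                  # KEGG ortholog, e.g. "K01990"
    cluster: int             # cluster number within the KO gene
    position: Optional[int]  # 1-based start position on the cluster sequence
    ref: Optional[str]       # reference amino acids
    alt: Optional[str]       # observed amino acids


def parse_variant(identifier: str) -> Variant:
    """Split 'KO.cluster|position|ref|alt' into its parts.

    Assumption: an identifier with no '|' fields names the gene cluster only.
    """
    fields = identifier.split("|")
    ko, cluster = fields[0].split(".")
    if len(fields) == 1:
        return Variant(ko, int(cluster), None, None, None)
    if len(fields) != 4:
        raise ValueError(f"unexpected identifier layout: {identifier!r}")
    position, ref, alt = fields[1:]
    return Variant(ko, int(cluster), int(position), ref, alt)


example = parse_variant("K01990.129|290|TN|ID")
print(example.ko, example.cluster, example.position, example.ref, example.alt)
```

Keeping the KO number, cluster and position as separate fields is what makes each feature explainable: a model feature can be traced back to a named gene and a specific residue.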
Association models between sequence variants and antibiotic resistance phenotypes retain accuracy
We interrogated the strength of the association between genetic variants and antibiotic susceptibility phenotypes for each bacterial species and antibiotic combination. For a number of pathogen-antibiotic pairs, the association model accuracy was greater than 95% (range: 69.6% for Pseudomonas aeruginosa-aztreonam to 100.0% for S. pneumoniae-tetracycline; mean accuracy was 91.1%) (Fig 2B). Utilizing contingency tables of variant carrying status and resistance phenotypes with the appropriate statistical analysis (odds ratios and p-values from Fisher's exact tests), we examined a subset of 5,359 associations with false discovery rates less than 0.05. In many instances, a significantly strong association confirmed an expected antibiotic resistance mechanism (Fig 3). For example, the sequence variant K18768.0 represents the β-lactamase (Bla) encoding gene blaKPC, the K. pneumoniae carbapenemase, whose presence is significantly associated with resistance to meropenem in K. pneumoniae (P-value <0.0001) [10] (Fig 3A). Variant K18093.13 is oprD, a major porin responsible for uptake of carbapenems in Pseudomonas [14]. Loss of porin activity by Pseudomonas is well known to result in carbapenem resistance [15] in this pathogen, and absence of wild-type oprD is strongly associated with imipenem resistance (P-value <0.0001) (Fig 3B). Other examples (OXA-1 and aac-(6')-lb) of strong associations are illustrated in Fig 3C and 3D.
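To make the contingency-table analysis concrete, the sketch below computes an odds ratio and a two-sided Fisher's exact p-value for a single variant using only the Python standard library. It is an illustration under stated assumptions rather than the authors' implementation (the Methods report that R was used); the function names and the example counts are hypothetical.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]].

    Rows: variant carriers / non-carriers. Columns: resistant / susceptible.
    The p-value sums the hypergeometric probabilities of all tables that are
    no more likely than the observed one, with the margins held fixed.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total_tables = comb(row1 + row2, col1)

    def prob(x: int) -> float:
        return comb(row1, x) * comb(row2, col1 - x) / total_tables

    p_observed = prob(a)
    lowest, highest = max(0, col1 - row2), min(row1, col1)
    p_value = sum(
        prob(x)
        for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)  # tolerance for float ties
    )
    return min(1.0, p_value)


def odds_ratio(a: int, b: int, c: int, d: int) -> float:
    """Sample odds ratio (a*d)/(b*c); infinite when b*c is zero and a*d is not."""
    if b * c == 0:
        return float("inf") if a * d > 0 else float("nan")
    return (a * d) / (b * c)


# Hypothetical counts: 8 resistant and 2 susceptible carriers,
# 1 resistant and 5 susceptible non-carriers.
print(odds_ratio(8, 2, 1, 5), fisher_exact_two_sided(8, 2, 1, 5))
```

An exact test is a natural choice here because many variants are carried by only a handful of isolates, where large-sample approximations such as the chi-squared test are unreliable.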
Fig 3. Examples of variant-phenotype relationships determined by the association models.
(A) K18768.0 indicates blaKPC, the K. pneumoniae carbapenemase. The presence of blaKPC is associated with resistance to ceftazidime in K. pneumoniae as shown. The numbers in the plots represent the total number of isolates with the given MIC (minimal inhibitory concentration) value. (B) K18093.13 is oprD, an imipenem/basic amino acid-specific outer membrane pore; absence of oprD is associated with resistance to imipenem in P. aeruginosa. (C) K18790.0 represents blaOXA-1, the beta-lactamase class D OXA-1. Its presence is associated with resistance to cefepime in E. coli. (D) K19278.0 is the aac6-lb gene. The presence of this variant is associated with amikacin resistance in A. baumannii. The "+" and "-" signs on the X-axis indicate whether the wild-type gene exists or not. The red horizontal lines mark the mean and standard error of the groupwise MIC measurements. Each gray dot represents an MIC value. P-values are calculated based on Fisher's exact test. MIC: minimal inhibitory concentration.
https://doi.org/10.1371/journal.pcbi.1007511.g003
Antibiotic resistance prediction models developed utilizing machine learning
Our association studies demonstrated the accuracy of our genotypic approach for known AMR elements. To begin to explore the capacity of our approach to take sequence data and generate robust predictions, we first developed 93 different prediction models using the VAMPr pipeline. The most promising prediction models were based on an extreme gradient boosting tree algorithm, and all hyper-parameters were fine-tuned in the inner 5-fold cross validation. Other prediction models (e.g. elastic net [16], support vector machines, 3-layer neural network, and adaptive boosting) were evaluated but did not exhibit superior prediction performance (S4 Fig). For all models, we used nested cross validation to report prediction performance metrics (Table 1, S4 Table). Among the 93 models, half had prediction accuracies greater than 90%. The pathogen-antibiotic combinations that displayed the highest accuracy were S. pneumoniae (clindamycin (100.0%), meropenem (100.0%), and tetracycline (100.0%)) and E. coli and kanamycin (100.0%). Eleven prediction models for S. enterica had very high accuracies (the minimal prediction accuracy was 98.0%), likely due to the larger dataset of S. enterica isolates. A similar tendency was also seen in the performance of the models for A. baumannii.
Table 1. Prediction metrics for 32 VAMPr prediction models.
Among the 93 prediction models, we listed the top 32 models that have mean prediction accuracies higher than 95%. The isolate and variant counts derived from sequencing were used to build the prediction model using gradient boosting tree algorithms. The accuracy is reported using a nested cross validation approach. The 10-fold outer cross validation was used to report accuracy and the 5-fold inner cross validation was used for hyperparameter tuning.
https://doi.org/10.1371/journal.pcbi.1007511.t001
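The nested cross validation scheme (10-fold outer loop for reporting accuracy, 5-fold inner loop for hyperparameter tuning) can be sketched with the standard library alone. This is an illustration of the fold structure only: a toy nearest-neighbour classifier stands in for the gradient boosting trees, the tuned hyperparameter is the neighbour count, and the data are made up. All function names are hypothetical.

```python
import random


def k_folds(indices, k, seed):
    """Shuffle the indices and deal them into k nearly equal folds."""
    shuffled = list(indices)
    random.Random(seed).shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def knn_predict(train, query, k):
    """Toy stand-in model: majority vote of the k nearest isolates (Hamming distance)."""
    nearest = sorted(train, key=lambda row: sum(a != b for a, b in zip(row[0], query)))[:k]
    votes = sum(label for _, label in nearest)
    return int(2 * votes > k)


def evaluate(data, fit_idx, test_idx, k):
    """Accuracy on test_idx of a model fitted on fit_idx."""
    train = [data[i] for i in fit_idx]
    hits = sum(knn_predict(train, data[i][0], k) == data[i][1] for i in test_idx)
    return hits / len(test_idx)


def complement(folds, held_out):
    """All indices except those in the held-out fold."""
    return [i for g, fold in enumerate(folds) if g != held_out for i in fold]


def nested_cv(data, grid, outer_k=10, inner_k=5, seed=0):
    """Mean outer-fold accuracy; hyperparameters are chosen on inner folds only."""
    outer = k_folds(range(len(data)), outer_k, seed)
    scores = []
    for f, test_idx in enumerate(outer):
        train_idx = complement(outer, f)
        inner = k_folds(train_idx, inner_k, seed + 1)

        def inner_score(k):
            return sum(
                evaluate(data, complement(inner, h), inner[h], k) for h in range(inner_k)
            ) / inner_k

        best_k = max(grid, key=inner_score)  # tuned without touching test_idx
        scores.append(evaluate(data, train_idx, test_idx, best_k))
    return sum(scores) / len(scores)


# Hypothetical toy data: (variant presence vector, resistant=1 / susceptible=0).
data = [((i % 2, (i // 2) % 2, (i // 4) % 2), i % 2) for i in range(60)]
print(nested_cv(data, grid=(1, 3, 5)))
```

The point of the nesting is that the isolates used to report accuracy never influence the choice of hyperparameters, so the reported accuracy is not inflated by tuning.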
Validation of the VAMPr prediction model using an external dataset
To validate the prediction performance of VAMPr, we utilized 13 E. cloacae, 31 E. coli, 24 K. pneumoniae and 21 P. aeruginosa isolates that were genetically and phenotypically profiled in a prior study but not present in the NCBI Antibiogram database [7]. All isolates had been previously tested against 3 antibiotics (cefepime, ceftazidime, and meropenem). Importantly, approximately 62%, 15%, 28% and 31% of the discovered variants of these strains, respectively, were not detected in the NCBI isolates. As these variants are specific to the validation datasets, their roles in antibiotic resistance could not be modelled by the NCBI datasets. In Fig 4, we show three prediction results with the highest AUROC (area under the receiver operating characteristic curve) values, as well as the important genetic variants that frequently appear in the gradient boosting tree models. In the E. coli and meropenem model, VAMPr reached an AUROC of 1.0 (Fig 4A) and the most important predictor was the presence of the blaNDM gene (New Delhi metallo-beta-lactamase; Class B). VAMPr had a similarly high prediction performance for K. pneumoniae and ceftazidime (Fig 4B). This model has an AUROC value of 0.99, and the significant predictors were the presence of KPC (K. pneumoniae carbapenemase) and the presence of wildtype ddl, D-alanine-D-alanine ligase (in 4 isolates, variants of ddl were associated with sensitivity to ceftazidime). In Fig 4C, the AUROC of the prediction model for P. aeruginosa and meropenem is 0.95, and the three significant predictors were ebr (small multidrug resistance pump), mexA (membrane fusion protein, multidrug efflux system) and oprD (imipenem/basic amino acid-specific outer membrane pore). Among all bacteria and antibiotic combinations, the minimal AUROC value for all VAMPr prediction models is 0.70 (Table 2). Additionally, we retrieved 1,688 K. pneumoniae isolates to validate the VAMPr models (S1 Text: Validation of the VAMPr prediction model using 1,668 K. pneumoniae isolates) and observed similar AUROC values. These results suggest that the VAMPr prediction models identify both known AMR-related genes as well as genes or variants that are not currently considered as contributing to resistance.
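For readers less familiar with the metric, the AUROC equals the probability that a randomly chosen resistant isolate receives a higher predicted resistance score than a randomly chosen susceptible one, with ties counted as one half. The sketch below computes it directly from that definition; the function name and the example scores are hypothetical, and it returns `None` when only one phenotype is present, which is the situation reported as n/a for E. cloacae and meropenem in Table 2.

```python
from typing import Optional, Sequence


def auroc(labels: Sequence[int], scores: Sequence[float]) -> Optional[float]:
    """Area under the ROC curve via pairwise comparison of scores.

    labels: 1 = resistant, 0 = susceptible. scores: predicted resistance probability.
    Returns None when only one class is present, because the curve is undefined.
    """
    resistant = [s for y, s in zip(labels, scores) if y == 1]
    susceptible = [s for y, s in zip(labels, scores) if y == 0]
    if not resistant or not susceptible:
        return None
    wins = sum(
        (r > s) + 0.5 * (r == s)  # a tie counts as half a win
        for r in resistant
        for s in susceptible
    )
    return wins / (len(resistant) * len(susceptible))


# Hypothetical predictions for four isolates.
print(auroc([0, 0, 1, 1], [0.1, 0.4, 0.35, 0.8]))
```

The pairwise form is quadratic in the number of isolates, which is acceptable for validation sets of the size used here (tens of isolates per species).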
Fig 4. Validation performance metrics using an external dataset.
AUROC (Area under the Receiver Operating Characteristic) for the prediction of the external dataset and top three predictors (KEGG ortholog variants based on importance) from the prediction models are reported. A) The AUROC curve for E. coli and meropenem; B) The AUROC curve for K. pneumoniae and ceftazidime; C) The AUROC curve for P. aeruginosa and meropenem; D) The top three predictors for E. coli and meropenem; E) The top three predictors for K. pneumoniae and ceftazidime; F) The top three predictors for P. aeruginosa and meropenem.
https://doi.org/10.1371/journal.pcbi.1007511.g004
Table 2. External validation of VAMPr prediction model.
The external dataset includes 13 Enterobacter cloacae, 31 Escherichia coli, 24 Klebsiella pneumoniae and 21 Pseudomonas aeruginosa isolates. All isolates were tested against 3 antibiotics (cefepime, ceftazidime and meropenem). We reported the accuracy as the fraction of correct predictions, and the AUROC as the area under the receiver operating characteristic curve. The AUROC value is n/a for E. cloacae as all 13 isolates are susceptible to meropenem.
https://doi.org/10.1371/journal.pcbi.1007511.t002
Online and offline resources for VAMPr pipeline
Online resources–VAMPr association and prediction models.
We provide a pre-calculated antibiotic resistance-associated variant database at https://qbrc.swmed.edu/softwares.php (Fig 5). Users can browse KO genetic variants and examine the strength of evidence based on calculated odds ratios and P-values from Fisher's exact test. For prediction models, an online website and an offline computational tool are available for users to predict antibiotic resistance from their own isolate sequences (see Data Access). The user input is the assembled FASTA files; the online website will examine whether the sequence contains AMR genes and, if so, the exact variant of the assembled sequence. Afterwards, our prediction model will give the probability of resistance based on these sequence variants. The VAMPr model is highly efficient, as the running time of the analysis is typically 60 seconds.
Fig 5. VAMPr provides rich sets of online resources for association models and prediction models.
Users have the flexibility to explore known or novel antibiotic resistance-associated variants, and can upload their own sequence assembly and obtain predictions on antibiotic resistance. (A) Association results webpage: users can explore variants, their interpretations, and their statistical significance assessments; (B) detailed information, contingency table and odds ratio for variant K18768 in the association model, and distribution plots; (C) distribution plots for variant K18768 in the association model page; (D) prediction models allow for uploads of users' sequence data for antibiotic resistance prediction.
https://doi.org/10.1371/journal.pcbi.1007511.g005
Offline resources–VAMPr source code.
We provide the source code that was used to create the association and prediction models. This allows users to curate and analyze their own sequence data for convenient offline usage. For example, users can provide FASTA sequence files and predict antibiotic resistance for multiple antibiotics without an internet connection.
Discussion
With the growing threat of antibiotic resistance and the rapidly decreasing costs associated with bacterial whole-genome sequencing, there is an opportunity for developing improved methods to detect resistance genes from genomic data [17]. However, prior to the routine use of genomic data to identify bacterial AMR status, there are several hurdles to be overcome, including improving our understanding of the genetic mechanisms underlying AMR for a broad array of pathogen-antimicrobial combinations [18]. To this end, we have developed the VAMPr pipeline to detect variant-level genetic features from NGS reads, which can then be correlated with phenotypic AMR data. We anticipate that with the continued generation of WGS data for numerous medically important pathogens, the widespread employment of VAMPr will assist with both strengthening associations between genomic data and AMR as well as developing new lines of AMR mechanism research.
An important advance of our study was our utilization of a novel approach to classifying variants based on gene orthologs. Our approach differs from other prediction models, such as those in PATRIC, which utilized the adaptive boosting (adaboost) algorithm [9, 18–20]. Our results were comparable or better in performance depending on the antibiotic-pathogen combination (S1 Text: Comparison with existing prediction models). In addition, our approach is in contrast to other popular ways of looking at gene variants such as k-mers [21]. In the k-mer method, the frequencies of k consecutive nucleotide or amino acid bases are counted as sequence features. Although the k-mer approach is straightforward to compute, it is not straightforward to explain a k-mer in the context of genes, which requires extra analysis steps to interpret. To avoid these limitations, we instead utilized gene orthologs. By aligning the bacterial genomes with a group of consensus orthologous gene sequences, we determined the variants that are present for any particular AMR gene in a particular isolate. As the sequence variants are linked to ortholog genes, this approach can not only identify the presence or absence of known resistance genes, but can also give additional insight into the impact of amino acid variants on diverse resistance phenotypes (e.g., amino acid substitutions shown in Fig 4).
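The contrast with k-mer features can be illustrated briefly. In the sketch below (a generic k-mer counter, not code from any of the cited tools; the sequence is made up), each feature is an anonymous substring with a count, which carries no gene name or position. A gene ortholog-based identifier such as K01990.129|290|TN|ID, by comparison, names the gene cluster and the affected residues directly.

```python
from collections import Counter


def kmer_counts(sequence: str, k: int) -> Counter:
    """Count every window of k consecutive characters in the sequence."""
    return Counter(sequence[i:i + k] for i in range(len(sequence) - k + 1))


# Hypothetical nucleotide fragment.
features = kmer_counts("ATGATG", 3)
print(features)
```

Mapping a predictive k-mer such as `ATG` back to a gene requires a separate alignment step, which is the extra interpretation work that the ortholog-based features avoid.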
To understand how genetic variants were linked to AMR phenotypes, we built data-driven association models. We utilized a large collection of isolate sequence data from NCBI SRA and matching antibiotic resistance phenotypes reported in the NCBI BioSample Antibiogram. This allowed for high-throughput screening for statistically significant associations between genetic variants and specific antibiotics for a variety of pathogens. Thus, another strength of this study was the large data universe that these models were built upon, with 38,248 pathogen-antibiotic comparisons performed. Other groups have developed some similar tools, including recent efforts to predict AMR for drugs used in the treatment of Mycobacterium tuberculosis [22]. An advantage of VAMPr over existing tools is its ability to analyze data from any bacterial species, provided that there are sufficient numbers of bacterial genomes with AMR phenotypic information to develop robust models. The publicly available nature of VAMPr and the NCBI Antibiogram means that the predictive models of VAMPr should significantly improve moving forward.
Our effort to develop prediction models utilizing machine learning algorithms and large-scale datasets allowed for the identification of genes that are associated with resistance to a particular antibiotic in an unbiased fashion. This could allow for both confirmation of known resistance markers as well as a discovery tool to find novel genes that contribute to resistance. It is important to note that the genes and variants that we identified as predictive of resistance do not imply causation. These are correlations, and further work will be needed to see whether identified genes that are not currently known to contribute to resistance are biologically active or merely bystanders with other causal genes [23]. Future efforts will include testing whether some of these predicted genes or variants in genes are in fact biologically relevant. Under certain antibiotic-species combinations, the numbers of resistant and susceptible isolates are imbalanced. Although our current method can reach good prediction accuracy, a specialized machine learning method for imbalanced data (e.g., SMOTE) could be employed to study model performance [24] (S1 Text: Handling imbalanced resistant and susceptible phenotypes).
There were other limitations of our study. Our effort to validate the prediction models with a relatively small number of isolates that were not included in the original training set illustrates particular challenges. There was clearly strain diversity in the recently sequenced isolates that was not fully represented in the available NCBI training set, which impacted our ability to fully validate our prediction models. This indicates that there continues to be a need for increased genome sequencing that is more broadly representative for certain pathogens (S1 Text: Evaluations with additional bacterial isolates and antimicrobial susceptibility phenotypes). This is further illustrated by the increased accuracy that was seen when we included a large number of Klebsiella isolates and re-ran the model. In addition, some pathogens such as P. aeruginosa have a smaller number of genomes available in the NCBI dataset with paired antibiogram data, while other pathogens (such as Salmonella) have a large number of genomes with AMR phenotypes available. It is likely that increasing the number of genomes available for training purposes in pathogens like Pseudomonas will further improve the accuracy of our prediction model approach (S1 Text: Improving prediction models by augmenting external datasets) [6]. For example, the recent study of M. tuberculosis resistance collected 10,290 samples, and the large scale enabled accurate prediction of point mutations and antibiotic resistance [22]. Our future efforts are aimed at further refining the VAMPr models to include larger numbers of isolates with a mixture of antibiotic susceptibility phenotypes.
In conclusion, we are providing the VAMPr online resources for researchers to utilize in their efforts to better study and predict antibiotic resistance from bacterial whole genome sequence data. Widespread employment of VAMPr may help with moving whole genome sequencing of bacterial pathogens out of the research lab setting and into the realm of clinical practice.
Methods
Data acquisition
Bacterial isolates with antibiotic susceptibility data were identified in the NCBI BioSample Antibiograms database. Isolates were identified by querying "antibiogram[filter]" in the National Center for Biotechnology Information (NCBI) BioSample database (NCBI Resource Coordinators, 2018). The linked sequencing data were downloaded from the NCBI Sequence Read Archive (SRA). Finally, the antibiogram tables in the NCBI BioSample were downloaded using the NCBI API. Minimum inhibitory concentration (MIC) values and reported antibiotic susceptibility data were recorded and checked for accuracy according to CLSI guidelines [12]. MIC values that were clearly mis-annotated were removed. For the purposes of this analysis, isolates that were intermediate for any particular drug were excluded, as they only account for 0.6% of the total isolates. In addition, any bacterial isolate reported as both resistant and susceptible was excluded from analysis.
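The phenotype curation rules in this paragraph can be sketched as follows. This is an illustration only: the function names are hypothetical, and the breakpoint values in the example are placeholders chosen for readability, not CLSI values for any real drug-organism pair.

```python
def classify_mic(mic: float, susceptible_max: float, resistant_min: float) -> str:
    """Map one MIC value to S, I or R given a pair of breakpoints."""
    if mic <= susceptible_max:
        return "S"
    if mic >= resistant_min:
        return "R"
    return "I"


def curate_phenotypes(records, susceptible_max, resistant_min):
    """Keep isolates with one unambiguous S or R call for a given antibiotic.

    records: iterable of (isolate_id, mic) pairs for a single antibiotic.
    Isolates with an intermediate call, or with conflicting S and R calls,
    are dropped, mirroring the exclusion rules described in the text.
    """
    calls = {}
    for isolate, mic in records:
        calls.setdefault(isolate, set()).add(
            classify_mic(mic, susceptible_max, resistant_min)
        )
    return {
        isolate: next(iter(seen))
        for isolate, seen in calls.items()
        if seen in ({"S"}, {"R"})
    }


# Placeholder breakpoints (susceptible <= 1, resistant >= 4), not real CLSI values.
records = [("iso1", 0.5), ("iso2", 8), ("iso3", 2), ("iso4", 0.5), ("iso4", 16)]
print(curate_phenotypes(records, susceptible_max=1, resistant_min=4))
```

In this toy input, iso3 is dropped as intermediate and iso4 is dropped because it is reported as both susceptible and resistant.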
Creation of AMR protein database
A reference database consisting of KO genes with gene-based variants was created that included both AMR protein sequences and AMR-like protein sequences (decoy sequences). The AMR-like sequences are from genes known not to be involved in antibiotic resistance, and including them has been shown to improve variant calling accuracy [25]. To create the AMR protein database, a list of Kyoto Encyclopedia of Genes and Genomes (KEGG) orthologs (KOs) involved in antimicrobial resistance (AMR) (S1 Fig) was created. The protein sequences linked to AMR KOs by the KEGG API and UniProt ID mapping were downloaded from the UniProt database. These sequences were designated as AMR protein sequences. Further, protein sequences from KOs not related to AMR were also aligned to AMR protein sequences. AMR-like protein sequences were defined as those protein sequences with 80% identical amino acid alignment. The union of AMR protein sequences and AMR-like protein sequences formed the AMR protein database, which was used in all comparative alignment steps.
To facilitate the identification of variants, AMR protein sequences were clustered based on sequence identities using CD-HIT [26]. For each cluster, multiple sequence alignment (MSA) with MAFFT [27] was used to determine the cluster consensus sequence (CCS). Finally, bacterial isolate protein sequences were compared to the CCSs to identify the variants (see next section and S1 Text: Derive explainable KO gene-based sequence variants).
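The consensus step can be pictured with a small Python sketch that takes the rows of a multiple sequence alignment and picks the most frequent residue per column. This is a simplified stand-in, not the procedure in S1 Text: the majority rule, the tie-breaking (first residue seen wins), and dropping gap-majority columns are all assumptions.

```python
from collections import Counter


def cluster_consensus(msa_rows, gap="-"):
    """Column-wise majority consensus of equal-length aligned sequences.

    Columns whose most frequent symbol is a gap are dropped, so the result
    is an ungapped consensus sequence.
    """
    if not msa_rows:
        raise ValueError("need at least one aligned sequence")
    width = len(msa_rows[0])
    if any(len(row) != width for row in msa_rows):
        raise ValueError("all aligned sequences must have equal length")

    consensus = []
    for column in zip(*msa_rows):
        residue, _count = Counter(column).most_common(1)[0]
        if residue != gap:
            consensus.append(residue)
    return "".join(consensus)
```

For example, the aligned rows `MKTAY`, `MKTSY` and `MRTAY` give the consensus `MKTAY`.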
Characterization of AMR variants
We developed an algorithm to characterize the AMR-related variants at the protein level (S1 Text). For each individual bacterial isolate, de novo genome assembly was performed using SPAdes [28]. Open reading frames (ORFs) were identified and converted to amino acid sequences, and a protein BLAST search of these sequences against the aforementioned AMR protein database was performed. The same query sequence was aligned to both AMR and AMR-like reference protein sequences using DIAMOND [29]. After comparative alignment and removal of alignments with less than 80% identical amino acids, only alignments best matched to the AMR reference sequences were included (see S1 Text: Comparative alignments for filters on E-values, bit-scores and fraction of identical amino acids, and S2 Fig). Finally, the aligned protein scaffold sequences were compared to the CCS to distinguish a "normal" protein from a variant. For instance, given a perfect match, an isolate is designated as carrying the KO gene and thus denoted as normal. In contrast, mismatched amino acids within a CCS alignment were deemed novel variants, and in such cases the detected variants have the following nomenclature: KO number, KO cluster number, sequence variant type, and its details (substitution, insertion, or deletion). More details are provided in S3 Fig and S1 Text.
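To make the nomenclature concrete, the following Python sketch derives substitution labels from an isolate sequence aligned to its cluster consensus sequence. The label layout (cluster ID, 1-based consensus position, reference residue, isolate residue, separated by `|`) follows the examples in S3 Fig, such as `K20319.0|107|T|N`. The function name is hypothetical, and insertions and deletions are deliberately skipped because their encoding is not spelled out in this section.

```python
def call_substitutions(cluster_id, consensus_aln, isolate_aln, gap="-"):
    """Return substitution labels for an isolate aligned to a consensus (CCS).

    Both inputs are rows of one pairwise alignment (equal length).
    An empty list means no substitutions were found in the aligned columns.
    """
    if len(consensus_aln) != len(isolate_aln):
        raise ValueError("aligned sequences must have equal length")

    labels = []
    position = 0  # 1-based position in the ungapped consensus
    for ref, alt in zip(consensus_aln, isolate_aln):
        if ref == gap:
            continue  # insertion relative to the consensus: not encoded here
        position += 1
        if alt == gap:
            continue  # deletion relative to the consensus: not encoded here
        if ref != alt:
            labels.append(f"{cluster_id}|{position}|{ref}|{alt}")
    return labels


# Toy sequences, not real K20319.0 data.
labels = call_substitutions("K20319.0", "MKTAY", "MKNAY")
```

Each label can then serve directly as a binary feature (carrier or non-carrier) in the downstream models.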
VAMPr association model to characterize variants
To quantitatively assess the association between KO-based sequence variants and antibiotic resistance phenotypes, an association model for each species-antibiotic combination was created. In total, 52,479 associations between variants and antibiotic resistance were evaluated. Specifically, a 2-by-2 contingency table for all isolates based on carrier/non-carrier status of the variant and susceptible/resistant phenotypes was generated. The odds ratio and p-value based on Fisher's exact test were calculated in R 3.4.4, and p-values were adjusted for false discovery rate using the Benjamini-Hochberg procedure [30]. The fraction of resistant strains stratified by variant carrier status was visualized in bar plots.
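The association test can be sketched in plain Python as follows. The authors performed this analysis in R; this block is a stand-in using only the standard library, with function names chosen for illustration. One known difference: the odds ratio here is the simple sample ratio ad/bc, whereas R's `fisher.test` reports a conditional maximum likelihood estimate, so the two values generally differ slightly.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the table [[a, b], [c, d]].

    Assumed layout: rows are carrier / non-carrier, columns are
    resistant / susceptible. Sums the probabilities of all tables with the
    same margins that are no more likely than the observed one.
    """
    row1, row2, col1 = a + b, c + d, a + c
    total = comb(row1 + row2, col1)
    lo, hi = max(0, col1 - row2), min(row1, col1)
    probs = {x: comb(row1, x) * comb(row2, col1 - x) / total
             for x in range(lo, hi + 1)}
    cutoff = probs[a] * (1 + 1e-7)  # small tolerance for floating-point ties
    return min(1.0, sum(p for p in probs.values() if p <= cutoff))


def sample_odds_ratio(a, b, c, d):
    """Sample odds ratio ad/bc (infinite when b*c is zero)."""
    return float("inf") if b * c == 0 else (a * d) / (b * c)


def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p-value down
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted
```

With one table per variant, the p-values for a species-antibiotic combination would be collected into a list and passed to `benjamini_hochberg` together.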
VAMPr prediction model for antibiotic resistance
Prediction models for each species-antibiotic combination were developed. KO-based sequence variants were designated as features and curated antibiotic resistance phenotypes as labels. For each species-antibiotic combination, an optimal prediction model with tuned hyperparameters was generated. A gradient boosting tree approach was used, given its accurate performance profile and efficient implementation [31]. Nested cross-validation (CV) was used to report unbiased prediction performance [32, 33]. The outer CV was 10-fold, and the averaged prediction metrics, including accuracy, are reported (Table 1); the inner CV was 5-fold, and all inner folds were used for hyperparameter tuning based on prediction accuracy. The default hyperparameter search space was as follows: the number of rounds (the number of trees) was 50, 100, 500 or 1000; the maximum allowed depth of trees was 16 or 64; the learning rate was 0.025 or 0.05; the minimum loss reduction required to allow further partitioning of the trees was 0; the fraction of features used for constructing each tree was 0.8; the fraction of isolates used for constructing each tree was 0.9; and the minimum weight for each child tree was 0. The reported performance metrics included accuracy, F1-score, and area under the receiver operating characteristic curve (AUROC) [32].
We assessed the prediction accuracy using an independent dataset of bacterial isolates recovered from cancer patients with bloodstream infections [7]. In this study (11), all isolates were genetically (whole-genome sequencing) and phenotypically (antibiotic susceptibility testing by broth microdilution assays) profiled. We followed the same genotype and phenotype processing steps. The detected KO-based variants were used as predictors and the lab-measured antibiotic resistance phenotypes were used as the gold standard. Performance metrics were calculated as described above.
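The nested cross-validation scheme and search space described above can be outlined in Python as follows. This is a structural sketch only: the model itself is abstracted behind a hypothetical `fit_and_score` callback, the hyperparameter names follow common xgboost naming as an assumed mapping from the prose, and folds are split at random without stratification, which may differ from the authors' setup.

```python
import random
from itertools import product

# Search space from the text; the key names are an assumed mapping.
GRID = {
    "nrounds": [50, 100, 500, 1000],
    "max_depth": [16, 64],
    "eta": [0.025, 0.05],
    "gamma": [0],
    "colsample_bytree": [0.8],
    "subsample": [0.9],
    "min_child_weight": [0],
}


def expand_grid(grid):
    """All hyperparameter combinations as a list of dicts."""
    keys = list(grid)
    return [dict(zip(keys, values))
            for values in product(*(grid[k] for k in keys))]


def k_folds(indices, k, rng):
    """Shuffle indices and split them into k roughly equal folds."""
    shuffled = list(indices)
    rng.shuffle(shuffled)
    return [shuffled[i::k] for i in range(k)]


def nested_cv(n_samples, fit_and_score, outer_k=10, inner_k=5, seed=0):
    """Mean outer-fold accuracy, with tuning confined to the inner folds.

    fit_and_score(params, train_idx, test_idx) is a hypothetical callback that
    trains on train_idx and returns accuracy on test_idx.
    """
    rng = random.Random(seed)
    candidates = expand_grid(GRID)
    outer_scores = []
    for held_out in k_folds(range(n_samples), outer_k, rng):
        held = set(held_out)
        train = [i for i in range(n_samples) if i not in held]
        inner = k_folds(train, inner_k, rng)

        def inner_accuracy(params):
            scores = []
            for val in inner:
                val_set = set(val)
                fit_idx = [i for i in train if i not in val_set]
                scores.append(fit_and_score(params, fit_idx, val))
            return sum(scores) / len(scores)

        best = max(candidates, key=inner_accuracy)  # tuned on inner folds only
        outer_scores.append(fit_and_score(best, train, held_out))
    return sum(outer_scores) / len(outer_scores)
```

The key property is that each outer held-out fold is never seen during hyperparameter selection, which is what keeps the reported accuracy from being optimistically biased.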
Supporting information
S1 Table. Summary of bacterial species and antibiotic drugs combinations in association and prediction models.
The resistant and susceptible isolates were counted based on the cutoff MIC (minimal inhibitory concentration) values reported in the 2018 CLSI guidelines.
https://doi.org/10.1371/journal.pcbi.1007511.s002
(PDF)
S2 Table. Bacterial isolate information.
This table lists 3,393 bacterial isolates with their BioSample accession IDs and antimicrobial susceptibility measurements from NCBI Antibiogram. They are included in the analysis of VAMPr association and prediction models.
https://doi.org/10.1371/journal.pcbi.1007511.s003
(XLSX)
S1 Fig. Detailed steps in VAMPr variant characterization.
In VAMPr, we retrieved and curated antibiogram data from NCBI BioSample. The sequences of these isolates were retrieved from NCBI SRA, de novo assembled, and curated by quality control steps (MLST identity check and phenotype QC). Based on pre-processed AMR gene databases (including both AMR protein sequences and decoy sequences), we characterize sequence variants in 9 steps, from finding gene ORFs to denoting AMR gene variants based on KEGG ortholog (KO). These explainable variants, as well as the curated phenotypes, are used in downstream analyses (the association models and the prediction models).
https://doi.org/10.1371/journal.pcbi.1007511.s006
(TIF)
S2 Fig. A general schematic illustration of comparative alignment.
Each query protein sequence is aligned to both AMR protein sequences and decoy (AMR-like) protein sequences. We compared the best hit from the AMR protein sequences and the best hit from the decoy protein sequences. The better alignment result (denoted with ">") based on user-specified criteria (e.g. alignment scores with smaller E-values) is retained. This step can improve alignment specificity.
https://doi.org/10.1371/journal.pcbi.1007511.s007
(TIF)
S3 Fig. Derive explainable KO gene-based sequence variants.
(Upper: References (DB)) all known protein reference sequences from UniProt (IDs are listed on the right); (Middle: Consensus) a consensus sequence is derived from the UniProt sequences; (Bottom: Isolates) sequences from two isolates (SAMN04515808 and SAMN04254727) were compared to the consensus reference sequence, and their variants are denoted as K20319.0|94|P|I (the 94th codon of KO-gene cluster K20319.0 is changed from P to I) and K20319.0|107|T|N (the 107th codon of KO-gene cluster K20319.0 is changed from T to N). The two variants are close, but the former variant is suggestive of inducing ceftriaxone susceptibility in A. baumannii based on 2 isolates, and the latter variant is suggestive of inducing imipenem resistance based on 10 isolates.
https://doi.org/10.1371/journal.pcbi.1007511.s008
(TIF)
S4 Fig. Comparison of prediction models.
We compared adaptive boosting (adaboost) [34], elastic net [16], k-nearest neighbor, 3-layer neural network (perceptron), and support vector machines (with radial kernel) [35] to the extreme gradient boosting tree used in VAMPr (xgboost) [31]. The boxplots show the performance difference (prediction accuracy) between xgboost and the other models. All models are implemented in caret [36] and R [37]. A positive value indicates that the prediction accuracy of xgboost is higher than that of the other model.
https://doi.org/10.1371/journal.pcbi.1007511.s009
(TIF)
Acknowledgments
We thank Jessie Norris for suggestions that improved this manuscript, and Bo Yao and Wei Guo for their support with the software deployment.
References
- 1. Chioro A, Coll-Seck AM, Hoie B, Moeloek N, Motsoaledi A, Rajatanavin R, et al. Antimicrobial resistance: a priority for global health action. Bull World Health Organ. 2015;93(7):439. Epub 2015/07/15. pmid:26170498; PubMed Central PMCID: PMC4490824.
- 2. Ventola CL. The antibiotic resistance crisis: part 1: causes and threats. P T. 2015;40(4):277–83. Epub 2015/04/11. pmid:25859123; PubMed Central PMCID: PMC4378521.
- 3. Satlin MJ, Cohen N, Ma KC, Gedrimaite Z, Soave R, Askin G, et al. Bacteremia due to carbapenem-resistant Enterobacteriaceae in neutropenic patients with hematologic malignancies. J Infect. 2016;73(4):336–45. Epub 2016/07/13. pmid:27404978; PubMed Central PMCID: PMC5026910.
- 4. Evans SR, Tran TTT, Hujer AM, Hill CB, Hujer KM, Mediavilla JR, et al. Rapid Molecular Diagnostics to Inform Empiric Use of Ceftazidime/Avibactam and Ceftolozane/Tazobactam against Pseudomonas aeruginosa: PRIMERS IV. Clin Infect Dis. 2018. Epub 2018/09/22. pmid:30239599.
- 5. Rossen JWA, Friedrich AW, Moran-Gilad J, Genomic ESGf, Molecular D. Practical issues in implementing whole-genome-sequencing in routine diagnostic microbiology. Clin Microbiol Infect. 2018;24(4):355–60. Epub 2017/11/09. pmid:29117578.
- 6. Su M, Satola SW, Read TD. Genome-Based Prediction of Bacterial Antibiotic Resistance. Journal of clinical microbiology. 2019;57(3). Epub 2018/11/02. pmid:30381421; PubMed Central PMCID: PMC6425178.
- 7. Shelburne SA, Kim J, Munita JM, Sahasrabhojane P, Shields RK, Press EG, et al. Whole-Genome Sequencing Accurately Identifies Resistance to Extended-Spectrum β-Lactams for Major Gram-Negative Bacterial Pathogens. Clinical Infectious Diseases. 2017;65(5):738–45. Epub 2017/05/05. pmid:28472260; PubMed Central PMCID: PMC5850535.
- 8. Zankari E, Hasman H, Cosentino S, Vestergaard M, Rasmussen S, Lund O, et al. Identification of acquired antimicrobial resistance genes. J Antimicrob Chemother. 2012;67(11):2640–4. Epub 2012/07/12. pmid:22782487; PubMed Central PMCID: PMC3468078.
- 9. Wattam AR, Abraham D, Dalay O, Disz TL, Driscoll T, Gabbard JL, et al. PATRIC, the bacterial bioinformatics database and analysis resource. Nucleic acids research. 2014;42(Database issue):D581–91. Epub 2013/11/15. pmid:24225323; PubMed Central PMCID: PMC3965095.
- 10. Tamma PD, Fan Y, Bergman Y, Pertea G, Kazmi AQ, Lewis S, et al. Applying Rapid Whole-Genome Sequencing To Predict Phenotypic Antimicrobial Susceptibility Testing Results among Carbapenem-Resistant Klebsiella pneumoniae Clinical Isolates. Antimicrob Agents Chemother. 2019;63(1). Epub 2018/10/31. pmid:30373801; PubMed Central PMCID: PMC6325187.
- 11. Kanehisa M, Goto S, Sato Y, Furumichi M, Tanabe M. KEGG for integration and interpretation of large-scale molecular data sets. Nucleic acids research. 2012;40(Database issue):D109–14. pmid:22080510; PubMed Central PMCID: PMC3245020.
- 12. CLSI. Performance Standards for Antimicrobial Susceptibility Testing. 28th ed. CLSI supplement M100. Wayne, PA: Clinical and Laboratory Standards Institute; 2018.
- 13. Jolley KA, Bray JE, Maiden MCJ. Open-access bacterial population genomics: BIGSdb software, the PubMLST.org website and their applications. Wellcome Open Res. 2018;3:124. Epub 2018/10/23. pmid:30345391; PubMed Central PMCID: PMC6192448.
- 14. Kos VN, Deraspe M, McLaughlin RE, Whiteaker JD, Roy PH, Alm RA, et al. The resistome of Pseudomonas aeruginosa in relationship to phenotypic susceptibility. Antimicrob Agents Chemother. 2015;59(1):427–36. Epub 2014/11/05. pmid:25367914; PubMed Central PMCID: PMC4291382.
- 15. Lister PD, Wolter DJ, Hanson ND. Antibacterial-resistant Pseudomonas aeruginosa: clinical impact and complex regulation of chromosomally encoded resistance mechanisms. Clin Microbiol Rev. 2009;22(4):582–610. Epub 2009/10/14. pmid:19822890; PubMed Central PMCID: PMC2772362.
- 16. Zou H, Hastie T. Regularization and variable selection via the elastic net. Journal of the Royal Statistical Society: Series B (Statistical Methodology). 2005;67(2):301–20.
- 17. Ellington MJ, Ekelund O, Aarestrup FM, Canton R, Doumith M, Giske C, et al. The role of whole genome sequencing in antimicrobial susceptibility testing of bacteria: report from the EUCAST Subcommittee. Clin Microbiol Infect. 2017;23(1):2–22. Epub 2016/11/29. pmid:27890457.
- 18. Nguyen M, Brettin T, Long SW, Musser JM, Olsen RJ, Olson R, et al. Developing an in silico minimum inhibitory concentration panel test for Klebsiella pneumoniae. Sci Rep. 2018;8(1):421. Epub 2018/01/13. pmid:29323230; PubMed Central PMCID: PMC5765115.
- 19. Gillespie JJ, Wattam AR, Cammer SA, Gabbard JL, Shukla MP, Dalay O, et al. PATRIC: the comprehensive bacterial bioinformatics resource with a focus on human pathogenic species. Infect Immun. 2011;79(11):4286–98. Epub 2011/09/08. pmid:21896772; PubMed Central PMCID: PMC3257917.
- 20. Antonopoulos DA, Assaf R, Aziz RK, Brettin T, Bun C, Conrad N, et al. PATRIC as a unique resource for studying antimicrobial resistance. Briefings in bioinformatics. 2017.
- 21. Davis JJ, Boisvert S, Brettin T, Kenyon RW, Mao C, Olson R, et al. Antimicrobial Resistance Prediction in PATRIC and RAST. Sci Rep. 2016;6:27930. Epub 2016/06/15. pmid:27297683; PubMed Central PMCID: PMC4906388.
- 22. Consortium CR, the GP, Allix-Beguec C, Arandjelovic I, Bi L, Beckert P, et al. Prediction of Susceptibility to First-Line Tuberculosis Drugs by DNA Sequencing. N Engl J Med. 2018;379(15):1403–15. Epub 2018/10/04. pmid:30280646; PubMed Central PMCID: PMC6121966.
- 23. Knopp M, Andersson DI. Predictable Phenotypes of Antibiotic Resistance Mutations. MBio. 2018;9(3). Epub 2018/05/17. pmid:29764951; PubMed Central PMCID: PMC5954217.
- 24. Chawla NV, Bowyer KW, Hall LO, Kegelmeyer WP. SMOTE: synthetic minority over-sampling technique. Journal of artificial intelligence research. 2002;16:321–57.
- 25. Li H, Durbin R. Fast and accurate short read alignment with Burrows–Wheeler transform. Bioinformatics. 2009;25(14):1754–60. pmid:19451168
- 26. Fu L, Niu B, Zhu Z, Wu S, Li W. CD-HIT: accelerated for clustering the next-generation sequencing data. Bioinformatics. 2012;28(23):3150–2. Epub 2012/10/13. pmid:23060610; PubMed Central PMCID: PMC3516142.
- 27. Nakamura T, Yamada KD, Tomii K, Katoh K. Parallelization of MAFFT for large-scale multiple sequence alignments. Bioinformatics. 2018;34(14):2490–2. Epub 2018/03/06. pmid:29506019; PubMed Central PMCID: PMC6041967.
- 28. Bankevich A, Nurk S, Antipov D, Gurevich AA, Dvorkin M, Kulikov AS, et al. SPAdes: a new genome assembly algorithm and its applications to single-cell sequencing. J Comput Biol. 2012;19(5):455–77. Epub 2012/04/18. pmid:22506599; PubMed Central PMCID: PMC3342519.
- 29. Buchfink B, Xie C, Huson DH. Fast and sensitive protein alignment using DIAMOND. Nature methods. 2015;12(1):59–60. Epub 2014/11/18. pmid:25402007.
- 30. Benjamini Y, Hochberg Y. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the royal statistical society Series B (Methodological). 1995:289–300.
- 31. Chen T, Guestrin C, editors. Xgboost: A scalable tree boosting system. Proceedings of the 22nd acm sigkdd international conference on knowledge discovery and data mining; 2016: ACM.
- 32. Friedman J, Hastie T, Tibshirani R. The elements of statistical learning: Springer series in statistics Springer, Berlin; 2001.
- 33. Cawley GC, Talbot NLC. On Over-fitting in Model Selection and Subsequent Selection Bias in Performance Evaluation. Journal of Machine Learning Research. 2010;11(Jul):2079–107.
- 34. Freund Y, Schapire RE. A decision-theoretic generalization of on-line learning and an application to boosting. J Comput Syst Sci. 1997;55(1):119–39. WOS:A1997XT05700011.
- 35. Cortes C, Vapnik V. Support-Vector Networks. Machine Learning. 1995;20(3):273–97. WOS:A1995RX35400003.
- 36. Kuhn M. Caret package. Journal of statistical software. 2008;28(5):1–26.
- 37. McDonald JH. Handbook of biological statistics: sparky house publishing Baltimore, MD; 2009.
Source: https://journals.plos.org/ploscompbiol/article?id=10.1371/journal.pcbi.1007511