CN115171792B - A hybrid prediction method for virulence factors and antibiotic resistance genes - Google Patents
A hybrid prediction method for virulence factors and antibiotic resistance genes
- Publication number
- CN115171792B
- Authority
- CN
- China
- Prior art keywords
- pssm
- feature
- antibiotic resistance
- features
- resistance genes
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
- G16B40/20—Supervised data analysis
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/08—Learning methods
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B10/00—ICT specially adapted for evolutionary bioinformatics, e.g. phylogenetic tree construction or analysis
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
- G16B30/10—Sequence alignment; Homology search
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B50/00—ICT programming tools or database systems specially adapted for bioinformatics
Landscapes
- Engineering & Computer Science (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Theoretical Computer Science (AREA)
- Medical Informatics (AREA)
- General Health & Medical Sciences (AREA)
- Biophysics (AREA)
- Evolutionary Biology (AREA)
- Software Systems (AREA)
- Biotechnology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Data Mining & Analysis (AREA)
- Spectroscopy & Molecular Physics (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioethics (AREA)
- Databases & Information Systems (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Biomedical Technology (AREA)
- Computational Linguistics (AREA)
- Molecular Biology (AREA)
- Animal Behavior & Ethology (AREA)
- Physiology (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Epidemiology (AREA)
- Public Health (AREA)
- Apparatus Associated With Microorganisms And Enzymes (AREA)
- Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)
Abstract
The invention discloses a hybrid prediction method for virulence factors and antibiotic resistance genes, in the technical field of deep learning and bioinformatics. The method comprises the following steps: S1, obtaining known antibiotic resistance gene sequence data, virulence factor sequence data and negative sample gene sequence data from databases; S2, calculating multiple core gene features from the gene sequence information and using them to construct a deep learning neural network architecture and a classical ensemble learning architecture; S3, taking the three types of sequence data from S1 as samples and dividing them into a training data set and a test data set; S4, obtaining a new training data set from the outputs of multiple classification methods, constructing a classification model on the new training data set, and obtaining performance evaluation indexes of the classification model. The hybrid prediction method achieves good prediction performance and high prediction accuracy.
Description
Technical Field
The invention relates to the technical field of deep learning and bioinformatics, and in particular to a hybrid prediction method for virulence factors and antibiotic resistance genes.
Background
Microorganisms are critical to the internal ecosystems of hosts (e.g., humans, animals and plants) and to the maintenance of the external environment. Pathogenic microorganisms, in particular, cause disease by carrying virulence factors (VFs) and antibiotic resistance genes (ARGs), and can even threaten the life of the host. Identifying VFs and ARGs accurately and in a timely manner can effectively guide medical treatment, reduce host morbidity and mortality, and reduce economic losses in animal husbandry, aquaculture and similar industries.
Furthermore, although VFs and ARGs have followed different evolutionary pathways, they share common features: both are necessary for pathogenic bacteria to adapt to and survive in competitive microbial environments. In particular, VFs and ARGs are both frequently transferred between bacteria by horizontal gene transfer (HGT), and both make use of similar systems (i.e., two-component systems, efflux pumps, cell wall alterations and porins) to activate or repress the expression of various genes. Pathogens use VFs to cause disease in their hosts, and by acquiring or carrying ARGs they can colonize environments under selective antibiotic pressure. Thus, to understand the causal relationships between microbiome composition, function and disease, VFs and ARGs must be identified together; predicting them simultaneously can also save pathogen monitoring time, particularly for on-site detection of epidemic pathogens. However, conventional bioinformatics tools usually predict either ARGs or VFs independently, are relatively outdated, and have relatively low precision and recall. Moreover, conventional prediction methods for VFs and ARGs can identify only conserved genes, and therefore suffer from a high false negative rate, strong sensitivity to the cut-off threshold, and relatively poor prediction performance. It is therefore necessary to design a hybrid prediction method for virulence factors and antibiotic resistance genes.
Disclosure of Invention
To solve the above problems, the invention aims to provide a hybrid prediction method for virulence factors and antibiotic resistance genes that addresses the technical problems of the prior art: outdated prediction tools, low prediction precision and recall, and poor prediction performance.
In order to achieve the above object, the technical scheme of the present invention is as follows:
The invention provides a hybrid prediction method for virulence factors and antibiotic resistance genes, which comprises the following steps:
S1, acquiring known antibiotic resistance gene sequence data, virulence factor sequence data and negative sample gene sequence data from databases;
S2, calculating multiple core gene features from the gene sequence information, and using the core gene features to construct a deep learning neural network architecture and a classical ensemble learning architecture, respectively;
S3, taking the three types of sequence data in S1 as samples, randomly drawing samples to form the total data set, and randomly dividing the total data set five times, where in each division four fifths of the data form the training data set and the remaining fifth forms the test data set;
S4, acquiring a new training data set by using multiple classification methods, constructing a classification model on the new training data set based on extremely randomized trees (Extra Trees), and acquiring performance evaluation indexes of the classification model.
In one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, S1 specifically comprises the following steps:
S11, acquiring known antibiotic resistance gene sequence data from the databases ARDB, CARD and UniProt;
S12, acquiring known virulence factor sequence data from the databases VFDB, PATRIC, Victors and UniProt;
S13, acquiring negative sample gene sequence data from the database UniProt.
In one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, S2 comprises the following specific steps:
S21, using the gene sequence information to calculate, respectively, similarity features based on alignment scores, simple features based on one-hot encoding, features based on gene evolution information, and features based on gene sequence information;
S22, constructing a deep learning network architecture from the alignment-score-based similarity features and the one-hot-encoded simple features of the gene sequences, and training a neural network classification model in an end-to-end manner;
S23, constructing a classical ensemble learning architecture from the features based on gene evolution information and the features based on gene sequence information, and training classical machine learning classification models on this prior feature information.
In one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, calculating the alignment-score-based similarity features in S21 comprises the following specific steps:
The DIAMOND program is selected, and the gene sequences in the training data set are aligned, under the sensitive parameter setting, against the remaining 12724 known ARGs and 30945 known VFs that serve as the comparison data set;
The training data set is de-duplicated against the comparison data set with the CD-HIT program, and the alignment scores are normalized to the interval [0,1];
The bit-score-based similarity features of each gene sequence in the training data set are converted into a feature vector of fixed dimension 12724 + 30945 = 43669.
In one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, the features based on gene evolution information in S21 consist of three specific features built on the position-specific scoring matrix (PSSM): the PSSM-composition feature, the RPM-PSSM feature and the AADP-PSSM feature;
The PSSM-composition feature eliminates the variation caused by protein sequence length by summing and averaging all rows of the original PSSM profile that belong to each naturally occurring amino acid type, and is defined as follows:
where R_i represents the i-th row of the PSSM-composition feature matrix, R_k represents the k-th row of the normalized PSSM, p_k represents the k-th amino acid in the protein sequence, and a_i represents the i-th of the 20 standard amino acids;
The RPM-PSSM feature transforms the original PSSM by setting negative values to 0 while leaving positive values unchanged. The concept comes from the residue probing method, in which each amino acid corresponding to a particular column of the PSSM is treated as a probe; the original PSSM is converted into a 400-dimensional feature vector, defined as follows:
where M_i represents the i-th row of the RPM-PSSM feature matrix, M_k represents the k-th row of the PSSM, p_k represents the k-th amino acid in the protein sequence, and a_i represents the i-th of the 20 standard amino acids;
The AADP-PSSM feature extends the traditional AAC and DPC concepts to the PSSM. Its first component, AAC-PSSM, is a fixed-length 20-dimensional feature vector obtained by averaging the columns of the original PSSM profile, defined as follows:
where x_j represents the j-th element of the AAC-PSSM feature vector, i.e. the average proportion of amino acid mutations during evolution, and p_i,j represents the entry in row i and column j of the original PSSM;
Its second component, DPC-PSSM, is a fixed-length 400-dimensional feature vector designed to avoid the information loss caused by X (unknown residues) in the protein, defined as follows:
Combining the two components, AADP-PSSM is a feature vector of fixed length 20 + 400 = 420.
In one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, the features based on gene sequence information in S21 include the amino acid composition feature, the dipeptide composition feature, and the dipeptide deviation from expected mean feature;
The amino acid composition feature represents the frequencies of the 20 natural amino acids in the protein sequence and is calculated as follows:
$$f(a) = \frac{N(a)}{N}$$
where N(a) represents the number of occurrences of a specific amino acid a, N represents the sequence length of the protein or peptide, and the values f(a) over the 20 amino acids form the resulting 20-dimensional feature vector;
The dipeptide composition feature represents the frequencies of dipeptides in the protein or polypeptide sequence and is calculated as follows:
$$D(a,b) = \frac{N_{ab}}{N-1}$$
where N_ab denotes the number of occurrences of a given dipeptide ab, N denotes the sequence length of the protein or peptide, and the values D(a,b) over all dipeptides form the resulting 400-dimensional feature vector.
In one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, the dipeptide deviation from expected mean (DDE) feature is a combination of three quantities: the theoretical mean (TM), the dipeptide composition (DPC) and the theoretical variance (TV);
The TM feature is calculated as follows:
$$TM(a,b) = \frac{C_a}{C_N} \times \frac{C_b}{C_N}$$
where C_a and C_b represent the numbers of codons encoding amino acids a and b, respectively, and C_N is equal to 61, the total number of possible codons excluding the three stop codons.
The TV feature is calculated as follows:
$$TV(a,b) = \frac{TM(a,b)\,\bigl(1 - TM(a,b)\bigr)}{N-1}$$
where TM represents the TM feature, TV represents the TV feature, and N represents the sequence length of the protein or peptide.
The DDE feature is calculated as follows:
$$DDE(a,b) = \frac{DPC(a,b) - TM(a,b)}{\sqrt{TV(a,b)}}$$
where DPC represents the DPC feature, TM represents the TM feature, and TV represents the TV feature.
In one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, the classical machine learning classification models trained on the prior feature information in S23 include the random forest, extremely randomized trees (Extra Trees), XGBoost, GradientBoosting and AdaBoost classification algorithms.
In one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, S4 comprises the following steps:
S41, applying a stacking algorithm over multiple classification methods, and taking the prediction scores that the different classification methods produce on the training data as a new training data set;
S42, constructing an Extra Trees based classification model on the new training data set, scoring the model on the test data set, repeating the experiment five times, and taking the average result of the five experiments as the performance evaluation index of the model.
In one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, S41 specifically comprises the following steps:
S411, integrating multiple base-level classification models through a meta model;
S412, training the base-level classification models on the whole training data set, with the meta model using the outputs of the base-level classification models as its training features;
S413, training each base-level classification model using 5-fold cross-validation.
By adopting the technical scheme, the invention has the following advantages:
1. The invention provides a hybrid prediction method for virulence factors and antibiotic resistance genes that makes full use of multiple key core gene features and combines the strengths of classical ensemble learning methods and deep learning to efficiently predict potential virulence factors and antibiotic resistance genes at the same time, with a sound scientific basis and high accuracy of the prediction results.
2. The invention can accurately predict virulence factors, drug resistance genes and negative sample genes (genes that are neither virulence factors nor antibiotic resistance genes) simultaneously, and can also flexibly and accurately predict each class independently. It overcomes the drawbacks of the traditional best-hit method, namely a high false negative rate, strong sensitivity to the cut-off threshold, and the ability to identify only conserved genes, and achieves a better prediction effect.
3. For novel virulence factors and drug resistance genes, for virulence factors and drug resistance genes in real metagenomic data, and for pseudo virulence factors and drug resistance genes (gene fragments), the precision and recall of the invention are higher than those of previous traditional prediction tools. By using a computational method that combines machine learning with deep learning neural networks, the results of the invention are competitive with the most advanced prediction tools.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings.
FIG. 1 is a flow chart of the hybrid prediction method for virulence factors and antibiotic resistance genes of the present invention;
FIG. 2 is a bar chart comparing the results of the hybrid prediction method of the present invention with those of other computational methods for simultaneously predicting virulence factors and antibiotic resistance genes.
Detailed Description
The features and advantages of the present invention are described in detail below with reference to the embodiments, and will be apparent to those skilled in the art from the description, claims and drawings disclosed in this specification.
Referring to FIG. 1, a hybrid prediction method for virulence factors and antibiotic resistance genes in microbial data comprises the following steps:
S1, acquiring known antibiotic resistance gene sequence data, virulence factor sequence data and negative sample gene sequence data (sequences that are neither antibiotic resistance genes nor virulence factors) from databases;
S1 comprises the following specific steps:
S11, acquiring known antibiotic resistance gene sequence data from the databases ARDB, CARD and UniProt;
S12, acquiring known virulence factor sequence data from the databases VFDB, PATRIC, Victors and UniProt;
S13, acquiring negative sample gene sequence data from the database UniProt.
S2, calculating multiple core gene features from the gene sequence information, and using the core gene features to construct a deep learning neural network architecture and a classical ensemble learning architecture, respectively;
S2 comprises the following specific steps:
S21, the multiple core gene features comprise similarity features based on alignment scores, simple features based on one-hot encoding, features based on gene evolution information and features based on gene sequence information; each of these is calculated from the gene sequence information.
The alignment-score-based similarity feature consists of the alignment scores between a candidate sequence and the known virulence factors and antibiotic resistance genes. This feature considers the distribution of sequence similarity across the ARG and VF databases rather than only the best hit. The alignment (bit) score is used as the similarity index because, unlike the e-value, it reflects the degree of identity between sequences and is independent of the size of the database.
Calculating the alignment-score-based similarity features in S21 comprises the following specific steps:
The DIAMOND program, which is faster than BLAST, is selected, and the gene sequences in the training data set are aligned, under the sensitive parameter setting, against the remaining 12724 known ARGs and 30945 known VFs that serve as the comparison data set;
The training data set has been de-duplicated against the comparison data set using the CD-HIT program to avoid possible label leakage, and the alignment scores are normalized to the interval [0,1] so that they express sequence similarity in terms of distance;
The bit-score-based similarity features of each gene sequence in the training data set are converted into a feature vector of fixed dimension 12724 + 30945 = 43669, where each dimension is the alignment score output by DIAMOND between the full-length gene sequence and one of the available ARGs or VFs in the comparison data set.
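As an illustration of how such a fixed-length similarity vector could be assembled, the following Python sketch reads DIAMOND tabular hits (12 tab-separated columns with the bit score last, as in BLAST tabular output) and places each query's scores at fixed reference positions. The function names, the toy reference identifiers and the normalization (dividing by the query's highest bit score) are assumptions; the text only states that scores are normalized to [0,1].

```python
def parse_tabular_hits(lines):
    """Collect the best bit score per (query, reference) pair from tabular hits."""
    hits = {}
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip malformed or empty lines
        query, reference, bit_score = fields[0], fields[1], float(fields[11])
        per_query = hits.setdefault(query, {})
        per_query[reference] = max(bit_score, per_query.get(reference, 0.0))
    return hits


def similarity_vector(query_hits, reference_ids):
    """Build one fixed-length vector; references without a hit score 0.0.

    Normalization by the query's top bit score is an assumption of this sketch.
    """
    top = max(query_hits.values(), default=0.0)
    if top <= 0.0:
        return [0.0] * len(reference_ids)
    return [query_hits.get(ref, 0.0) / top for ref in reference_ids]


# Toy comparison set of three references instead of the 43669 in the text.
reference_ids = ["ARG_1", "ARG_2", "VF_1"]
lines = [
    "q1\tARG_2\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t50.0",
    "q1\tVF_1\t99.0\t100\t1\t0\t1\t100\t1\t100\t1e-90\t200.0",
]
hits = parse_tabular_hits(lines)
vector = similarity_vector(hits["q1"], reference_ids)
```

With the full comparison set, `reference_ids` would hold the 12724 ARG and 30945 VF identifiers in a fixed order, so every query yields a vector of the same length.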
The features based on gene evolution information consist of three specific features built on the position-specific scoring matrix (PSSM): the PSSM-composition feature, the RPM-PSSM feature and the AADP-PSSM feature. The PSSM-composition feature eliminates the variation caused by protein sequence length by summing and averaging all rows of the original PSSM profile that belong to each naturally occurring amino acid type, and is defined as follows:
where R_i represents the i-th row of the PSSM-composition feature matrix, R_k represents the k-th row of the normalized PSSM, p_k represents the k-th amino acid in the protein sequence, and a_i represents the i-th of the 20 standard amino acids.
The RPM-PSSM feature transforms the original PSSM by setting negative values to 0 while leaving positive values unchanged. The idea is derived from the residue probing method, in which each amino acid corresponding to a particular column of the PSSM is treated as a probe; the original PSSM is finally converted into a 400-dimensional feature vector using the following definition:
where M_i represents the i-th row of the RPM-PSSM feature matrix, M_k represents the k-th row of the PSSM, p_k represents the k-th amino acid in the protein sequence, and a_i represents the i-th of the 20 standard amino acids.
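Because the defining formulas for PSSM-composition and RPM-PSSM are not reproduced in this text, the following Python sketch is only one reading of the symbol definitions: rows of the PSSM are grouped by the residue found at that position, summed, and divided by the sequence length L, giving a 20 x 20 matrix that is flattened to 400 values. The function name `row_grouped_average`, the alphabetical residue order and the division by L in the RPM-PSSM case are assumptions rather than details confirmed by the source.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def row_grouped_average(sequence, pssm, clip_negative=False):
    """Group PSSM rows by residue type, sum them, and divide by sequence length.

    sequence: protein sequence of length L.
    pssm: list of L rows, each with 20 scores.
    clip_negative: if True, negative scores are set to 0 first (RPM-PSSM style).
    Returns a flat list of 20 * 20 = 400 values.
    """
    length = len(sequence)
    if length == 0 or length != len(pssm):
        raise ValueError("sequence and PSSM must have the same non-zero length")
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    matrix = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence, pssm):
        i = index.get(residue)
        if i is None:
            continue  # non-standard residues (e.g. X) contribute nothing
        for j, score in enumerate(row):
            if clip_negative and score < 0:
                score = 0.0
            matrix[i][j] += score
    return [value / length for row in matrix for value in row]


# Toy two-residue protein; only the first two PSSM columns are non-zero.
sequence = "AC"
pssm = [[-2.0, 4.0] + [0.0] * 18,
        [3.0, -1.0] + [0.0] * 18]
composition = row_grouped_average(sequence, pssm)             # PSSM-composition style
rpm = row_grouped_average(sequence, pssm, clip_negative=True)  # RPM-PSSM style
```

In practice the PSSM-composition variant would be applied to a normalized PSSM and the RPM-PSSM variant to the raw PSSM, as the text describes.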
The AADP-PSSM feature extends the traditional AAC and DPC concepts to the PSSM. First, AAC-PSSM is a fixed-length 20-dimensional feature vector obtained by averaging the columns of the original PSSM profile, defined as follows:
where x_j represents the j-th element of the AAC-PSSM feature vector, i.e. the average proportion of amino acid mutations during evolution, and p_i,j represents the entry in row i and column j of the original PSSM. Second, DPC-PSSM is a fixed-length 400-dimensional feature vector designed to avoid the information loss caused by X (unknown residues) in the protein, defined as follows:
Combining the two components, AADP-PSSM is a feature vector of fixed length 20 + 400 = 420.
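The two AADP-PSSM components can be sketched in Python as follows. The AAC-PSSM part follows the column-averaging description directly; for DPC-PSSM the defining formula is not reproduced in this text, so the sketch assumes the commonly used form in which the product of column i at position k and column j at position k+1 is averaged over the L-1 adjacent position pairs. All function names are hypothetical.

```python
def aac_pssm(pssm):
    """Average each of the 20 PSSM columns over the L sequence positions."""
    length = len(pssm)
    if length == 0:
        raise ValueError("PSSM must have at least one row")
    return [sum(row[j] for row in pssm) / length for j in range(20)]


def dpc_pssm(pssm):
    """Assumed form: mean of p[k][i] * p[k+1][j] over adjacent positions k."""
    length = len(pssm)
    if length < 2:
        raise ValueError("PSSM must have at least two rows")
    features = []
    for i in range(20):
        for j in range(20):
            total = sum(pssm[k][i] * pssm[k + 1][j] for k in range(length - 1))
            features.append(total / (length - 1))
    return features


def aadp_pssm(pssm):
    """Concatenate the 20 AAC-PSSM and 400 DPC-PSSM values (420 in total)."""
    return aac_pssm(pssm) + dpc_pssm(pssm)


# Toy PSSM with two positions; only the first two columns are non-zero.
pssm = [[1.0, 2.0] + [0.0] * 18,
        [3.0, 4.0] + [0.0] * 18]
features = aadp_pssm(pssm)
```

Because both components operate on PSSM columns rather than on residue identities, positions marked X in the sequence still contribute their PSSM rows.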
Features based on gene sequence information include the amino acid composition (AAC) feature, the dipeptide composition (DPC) feature, the dipeptide deviation from expected mean (DDE) feature, the pseudo amino acid composition (PAAC) feature and the quasi-sequence-order (QSO) feature.
The amino acid composition (AAC) feature represents the frequencies of the 20 natural amino acids (i.e. ACDEFGHIKLMNPQRSTVWY) in the protein sequence and can be calculated as:
$$f(a) = \frac{N(a)}{N}$$
where N(a) represents the number of occurrences of a specific amino acid a, N represents the sequence length of the protein or peptide, and the values f(a) over the 20 amino acids form the resulting 20-dimensional feature vector.
The DPC feature represents the frequencies of dipeptides in the protein or polypeptide sequence and can be calculated as:
$$D(a,b) = \frac{N_{ab}}{N-1}$$
where N_ab denotes the number of occurrences of a given dipeptide ab, N denotes the sequence length of the protein or peptide, and the values D(a,b) over all dipeptides form the resulting 400-dimensional feature vector. The DPC term in the DDE feature is calculated in the same way.
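A minimal Python sketch of the AAC and DPC features is given below, assuming the usual definitions f(a) = N(a)/N and D(a,b) = N_ab/(N-1), where N-1 is the number of overlapping dipeptides in a sequence of length N. The function names and the alphabetical feature order are assumptions.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence):
    """Frequency of each of the 20 amino acids (20 values)."""
    length = len(sequence)
    if length == 0:
        raise ValueError("sequence must not be empty")
    return [sequence.count(a) / length for a in AMINO_ACIDS]


def dpc(sequence):
    """Frequency of each of the 400 dipeptides among the N-1 overlapping pairs."""
    pairs = len(sequence) - 1
    if pairs < 1:
        raise ValueError("sequence must have at least two residues")
    counts = {}
    for k in range(pairs):
        dipeptide = sequence[k:k + 2]
        counts[dipeptide] = counts.get(dipeptide, 0) + 1
    return [counts.get(a + b, 0) / pairs for a in AMINO_ACIDS for b in AMINO_ACIDS]


# "ACAC" contains the dipeptides AC, CA, AC.
aac_features = aac("ACAC")
dpc_features = dpc("ACAC")
```

In this layout the dipeptide ab sits at index 20 * index(a) + index(b), so AC is at index 1 and CA at index 20.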
The DDE feature is a combination of three quantities: the theoretical mean (TM), the dipeptide composition (DPC) and the theoretical variance (TV). Specifically, the TM feature is calculated as follows:
$$TM(a,b) = \frac{C_a}{C_N} \times \frac{C_b}{C_N}$$
where C_a and C_b represent the numbers of codons encoding amino acids a and b, respectively, and C_N is equal to 61, the total number of possible codons excluding the three stop codons.
The TV feature is calculated as follows:
$$TV(a,b) = \frac{TM(a,b)\,\bigl(1 - TM(a,b)\bigr)}{N-1}$$
where TM represents the TM feature calculated as described above, and N represents the sequence length of the protein or peptide.
The DDE feature is calculated as follows:
$$DDE(a,b) = \frac{DPC(a,b) - TM(a,b)}{\sqrt{TV(a,b)}}$$
where DPC, TM and TV represent the DPC, TM and TV features calculated as described above.
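The DDE calculation can be sketched in Python as follows, assuming the standard form in which TM(a,b) = (C_a/61)(C_b/61), TV(a,b) = TM(1-TM)/(N-1) and DDE(a,b) = (DPC - TM)/sqrt(TV). The codon counts are those of the standard genetic code; the function name and the feature order are assumptions.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Number of codons encoding each amino acid in the standard genetic code.
CODONS = {"A": 4, "C": 2, "D": 2, "E": 2, "F": 2, "G": 4, "H": 2, "I": 3,
          "K": 2, "L": 6, "M": 1, "N": 2, "P": 4, "Q": 2, "R": 6, "S": 6,
          "T": 4, "V": 4, "W": 1, "Y": 2}
TOTAL_CODONS = 61  # 64 codons minus the three stop codons


def dde(sequence):
    """Dipeptide deviation from expected mean (400 values)."""
    pairs = len(sequence) - 1
    if pairs < 1:
        raise ValueError("sequence must have at least two residues")
    counts = {}
    for k in range(pairs):
        dipeptide = sequence[k:k + 2]
        counts[dipeptide] = counts.get(dipeptide, 0) + 1
    features = []
    for a in AMINO_ACIDS:
        for b in AMINO_ACIDS:
            observed = counts.get(a + b, 0) / pairs                    # DPC
            tm = CODONS[a] / TOTAL_CODONS * CODONS[b] / TOTAL_CODONS   # theoretical mean
            tv = tm * (1 - tm) / pairs                                 # theoretical variance
            features.append((observed - tm) / math.sqrt(tv))
    return features


dde_features = dde("ACAC")
```

Dipeptides that occur more often than their codon-based expectation receive positive values, and absent dipeptides receive negative values.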
S22, constructing a deep learning network architecture from the alignment-score-based similarity features and the one-hot-encoded simple features of the gene sequences, and training a neural network classification model in an end-to-end manner;
S23, constructing a classical ensemble learning architecture from the features based on gene evolution information and the features based on gene sequence information, and training classical machine learning classification models on this prior feature information.
The classical machine learning classification models trained on the prior feature information in S23 include the Random Forest, extremely randomized trees (Extra Trees), XGBoost, GradientBoosting and AdaBoost classification algorithms.
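The one-hot encoding named in S22 is not detailed further in the text. As a rough Python sketch, a protein sequence can be turned into a fixed-size binary matrix with one row per position and one column per amino acid; the padding length `max_len`, the truncation of longer sequences and the all-zero rows for padding or non-standard residues are assumptions of this sketch.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot(sequence, max_len):
    """Encode a sequence as a max_len x 20 binary matrix.

    Sequences longer than max_len are truncated; shorter ones are padded with
    all-zero rows. Residues outside the 20 standard amino acids also map to
    all-zero rows.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    matrix = [[0] * 20 for _ in range(max_len)]
    for position, residue in enumerate(sequence[:max_len]):
        column = index.get(residue)
        if column is not None:
            matrix[position][column] = 1
    return matrix


encoded = one_hot("ACX", 4)
```

A fixed `max_len` gives every sequence the same input shape, which is what a neural network input layer generally requires.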
S3, taking the known antibiotic resistance gene data from S1 as the first class of samples, the known virulence factor sequence data as the second class, and the known negative sample gene sequence data as the third class, randomly drawing samples of the three classes, and each time randomly dividing the whole data set into five parts, of which four are used as the training data set and the remaining one as the test data set;
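Read as five-fold cross-validation, the division in S3 can be sketched in Python with the standard library alone. The function name, the fixed random seed and the absence of per-class stratification are assumptions; the text does not say whether the split is stratified.

```python
import random


def five_fold_splits(n_samples, seed=0, k=5):
    """Yield (train_indices, test_indices) for k random, disjoint folds."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::k] for i in range(k)]
    for held_out in range(k):
        test = folds[held_out]
        train = [i for f, fold in enumerate(folds) if f != held_out for i in fold]
        yield train, test


# Ten toy samples: each split uses eight for training and two for testing.
splits = list(five_fold_splits(10))
```

Each sample appears in exactly one test fold, so averaging the five test scores uses every sample once for evaluation.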
S4, acquiring a new training data set by using multiple classification methods, constructing a classification model on the new training data set based on extremely randomized trees (Extra Trees), and acquiring performance evaluation indexes of the classification model.
S4 specifically comprises the following steps:
S41, applying a stacking algorithm over multiple classification methods, and taking the prediction scores that the different classification methods produce on the training data as a new training data set;
To obtain excellent predictive performance for virulence factors and antibiotic resistance genes, the classical machine learning methods and the deep learning model are integrated in a stacking algorithm.
S41 specifically comprises the following steps:
S411, integrating multiple base-level classification models through a meta model;
S412, training the base-level classification models on the whole training data set, with the meta model using the outputs of the base-level classification models as its training features;
S413, training each base-level classification model using 5-fold cross-validation to counter overfitting in the final prediction. In a specific embodiment, the stacking algorithm of the invention is given as pseudo code in Table 1.
TABLE 1 Pseudo code of the stacking algorithm
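The pseudo code of Table 1 is not reproduced in this text. The following Python sketch shows one common way to realize steps S411 to S413: each base model is trained on four folds and scores the held-out fold, and the concatenated out-of-fold scores become the meta model's training features. The nearest-centroid base model is a toy stand-in for the classifiers listed in S23, and all names here are hypothetical; the meta model itself (Extra Trees in the text) is not implemented.

```python
CLASSES = ["ARG", "VF", "NS"]


def centroid_fit(feature_index):
    """Toy base model: score each class by closeness to its mean on one feature."""
    def fit(X_train, y_train):
        means = {}
        for label in CLASSES:
            values = [x[feature_index] for x, y in zip(X_train, y_train) if y == label]
            if values:
                means[label] = sum(values) / len(values)

        def predict(x):
            return [-abs(x[feature_index] - means[c]) if c in means else float("-inf")
                    for c in CLASSES]
        return predict
    return fit


def out_of_fold_scores(fit, X, y, k=5):
    """Score every sample with a model that never saw it during training."""
    n = len(X)
    scores = [None] * n
    for start in range(k):
        held_out = list(range(start, n, k))
        held_set = set(held_out)
        train = [i for i in range(n) if i not in held_set]
        predict = fit([X[i] for i in train], [y[i] for i in train])
        for i in held_out:
            scores[i] = predict(X[i])
    return scores


def stack_features(base_fits, X, y, k=5):
    """Concatenate the out-of-fold scores of all base models per sample."""
    per_model = [out_of_fold_scores(fit, X, y, k) for fit in base_fits]
    return [sum((model[i] for model in per_model), []) for i in range(len(X))]


# Toy data: feature 0 separates the three classes, feature 1 does not.
X = [[float(i % 3), float(i)] for i in range(15)]
y = [CLASSES[i % 3] for i in range(15)]
meta_X = stack_features([centroid_fit(0), centroid_fit(1)], X, y)
```

The rows of `meta_X`, paired with `y`, would then be used to fit the meta model, for example an Extra Trees classifier from a machine learning library.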
S42, constructing an Extra Trees based classification model on the new training data set, scoring the model on the test data set, repeating the experiment five times, and taking the average result of the five experiments as the performance evaluation index of the model.
Example II
To better illustrate the effect of the prediction method of the present invention, a strict procedure with cross-validation was used to evaluate its effectiveness without bias. Table 2 lists the results of the hybrid prediction of virulence factors and drug resistance genes under five-fold cross-validation in this example:
TABLE 2 Results of simultaneous prediction of virulence factors and drug resistance genes by the present invention under five-fold cross-validation
In Table 2, Precision: precision, Recall: recall, F1-score: F1 score, VFs: virulence factors, ARGs: drug resistance genes, NSs: negative sample genes, Micro-average: micro average
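For reference, the per-class and micro-averaged metrics named in the table legend can be computed as in the following Python sketch. The function names and the toy labels are illustrative only and do not reproduce any value from Table 2.

```python
def precision_recall_f1(tp, fp, fn):
    """Return (precision, recall, F1), using 0.0 when a denominator is zero."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def classification_report(y_true, y_pred, classes):
    """Per-class metrics plus the micro average over all classes."""
    report = {}
    total_tp = total_fp = total_fn = 0
    for label in classes:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        report[label] = precision_recall_f1(tp, fp, fn)
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    # Micro average: pool the counts of all classes before computing the ratios.
    report["Micro-average"] = precision_recall_f1(total_tp, total_fp, total_fn)
    return report


y_true = ["VFs", "VFs", "ARGs", "ARGs", "NSs", "NSs"]
y_pred = ["VFs", "ARGs", "ARGs", "ARGs", "NSs", "VFs"]
report = classification_report(y_true, y_pred, ["VFs", "ARGs", "NSs"])
```

When every sample receives exactly one predicted label, the micro-averaged precision, recall and F1 all equal the overall accuracy.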
As can be seen from Table 2, this embodiment obtains high evaluation scores across the repeated cross-validation experiments, which indicates that the invention not only achieves simultaneous prediction of virulence factors and antibiotic resistance genes but also performs well in terms of precision and recall.
Example III
To test the predictive ability of the invention on unknown virulence factors (VFs), drug resistance genes (ARGs) and negative sample genes (NSs), an independent data set comprising 209 ARGs, 209 VFs and 209 NSs was constructed. Notably, these unknown genes are completely independent of the genes in the training data set: all identical or duplicate sequences were removed by setting the identity threshold of CD-HIT to 100%. In addition, the currently available VRprofile model (the latest computational model) was introduced as a comparison method, and the traditional "best hit" method (using the DIAMOND sequence alignment tool) was used as a baseline. Table 3 lists, in order, the results of simultaneous prediction of unknown virulence factors and drug resistance genes by the example of the invention (HyperVR), by VRprofile, and by the DIAMOND-based baseline under three different parameter settings:
TABLE 3 results of simultaneous prediction of unknown virulence factors and drug resistance genes by the inventive example, VRprofile model and baseline comparison method
In the table, precision: precision, recall: recall, F1-score: F1 score, VFs: virulence factor, ARGs: drug resistance gene, NSs: negative sample gene, micro-average: micro average
From table 3, we can see that, in comparison of the models of the invention (HyperVR) and VRprofile and the comparison method using the Diamond sequence alignment tool under three different parameters, the experimental result of the invention (HyperVR) obtains the highest evaluation score, and has more excellent performance in terms of accuracy and recall than other baseline comparison methods.
FIG. 2 shows histograms comparing the invention (HyperVR), the VRprofile model (the most recent computational model) and the baseline comparison method (under three different parameter settings) on the simultaneous prediction of unknown virulence factors and antibiotic resistance genes. In FIG. 2, bar a represents the F1 score, bar b represents recall and bar c represents precision; Diamond-81%, Diamond-64% and Diamond-21% denote the baseline run with the Diamond sequence alignment tool under the three parameter settings. The height of each bar indicates the prediction performance of the corresponding method. The comparison in FIG. 2 shows that the embodiment of the invention (HyperVR) achieves higher prediction performance than both VRprofile and the baseline under all three parameter settings, and that its overall performance is the best among the compared methods.
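A "best hit" baseline of the kind compared above can be illustrated with the following sketch. It rests on several assumptions that the patent does not confirm: that the three parameters (81%, 64%, 21%) are percent-identity cut-offs, that a query whose best hit falls below the cut-off (or has no hit) is labelled as a negative sample, and that the alignments are available in the 12-column BLAST tabular layout that Diamond can emit. The function name `best_hit_labels` and the sequence identifiers are hypothetical.

```python
def best_hit_labels(tabular_lines, ref_labels, min_identity, queries):
    """Label each query with the class of its highest-bitscore reference hit.

    tabular_lines: iterable of tab-separated lines with the columns
        qseqid sseqid pident length mismatch gapopen qstart qend
        sstart send evalue bitscore
    ref_labels: dict mapping reference sequence id -> "VF" or "ARG"
    min_identity: assumed percent-identity cut-off (e.g. 81, 64 or 21)
    queries: all query ids, so that queries without any hit are also labelled
    """
    best = {}  # query id -> (bitscore, percent identity, subject id)
    for line in tabular_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12:
            continue  # skip blank or malformed lines
        query, subject = fields[0], fields[1]
        pident, bitscore = float(fields[2]), float(fields[11])
        if query not in best or bitscore > best[query][0]:
            best[query] = (bitscore, pident, subject)

    labels = {}
    for query in queries:
        hit = best.get(query)
        if hit is not None and hit[1] >= min_identity and hit[2] in ref_labels:
            labels[query] = ref_labels[hit[2]]
        else:
            labels[query] = "NS"  # assumption: no confident hit means negative
    return labels


if __name__ == "__main__":
    hits = [
        "q1\trefA\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200",
        "q1\trefB\t95.0\t60\t3\t0\t1\t60\t1\t60\t1e-30\t150",
        "q2\trefB\t50.0\t80\t40\t0\t1\t80\t1\t80\t1e-10\t80",
    ]
    refs = {"refA": "VF", "refB": "ARG"}
    for cutoff in (81, 64, 21):
        print(cutoff, best_hit_labels(hits, refs, cutoff, ["q1", "q2", "q3"]))
```

Lowering the cut-off lets more distant homologues receive a positive label, which would typically trade precision for recall; that trade-off is presumably why the baseline is reported under several settings.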
Finally, it should be noted that although the invention has been described with reference to specific embodiments, those skilled in the art will understand that the above embodiments are provided for illustration only and do not limit the invention. Various equivalent changes or substitutions may be made without departing from the spirit of the invention; therefore, all such changes and modifications to the above embodiments fall within the scope of the appended claims.
Claims (7)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210781902.8A CN115171792B (en) | 2022-06-30 | 2022-06-30 | A hybrid prediction method for virulence factors and antibiotic resistance genes |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN115171792A (en) | 2022-10-11 |
| CN115171792B (en) | 2025-08-26 |
Family
ID=83490457
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202210781902.8A Active CN115171792B (en) | 2022-06-30 | 2022-06-30 | A hybrid prediction method for virulence factors and antibiotic resistance genes |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN115171792B (en) |
Families Citing this family (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN116631516B (en) * | 2023-05-06 | 2024-07-12 | 海南大学 | Antituberculous peptide prediction system based on integration of mixed characteristic model and lifting model |
| CN116541785B (en) * | 2023-07-05 | 2023-09-12 | 北京建工环境修复股份有限公司 | Toxicity prediction method and system based on deep integration machine learning model |
| CN118571309B (en) * | 2024-04-16 | 2025-04-18 | 四川大学华西医院 | Method, device and equipment for gene prediction or classification of antibiotic resistance genes or virulence factors |
| CN118098372B (en) * | 2024-04-23 | 2024-07-02 | 华东交通大学 | Virulence factor identification method and system based on self-attention encoding and pooling mechanism |
| CN118866125B (en) * | 2024-07-11 | 2025-04-29 | 中国疾病预防控制中心性病艾滋病预防控制中心 | A method, system, device and medium for predicting HIV genotype drug resistance |
| CN120319310B (en) * | 2025-06-10 | 2025-08-15 | 山东大学 | A method for identifying horizontally transferred genes based on model error |
Citations (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113724790A (en) * | 2021-09-07 | 2021-11-30 | 湖南大学 | PiRNA-disease association relation prediction method based on convolution denoising self-coding machine |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| WO2011119484A1 (en) * | 2010-03-23 | 2011-09-29 | Iogenetics, Llc | Bioinformatic processes for determination of peptide binding |
| CN116693695A (en) * | 2017-02-12 | 2023-09-05 | 百欧恩泰美国公司 | HLA-based methods and compositions and uses thereof |
| CN111210871B (en) * | 2020-01-09 | 2023-06-13 | 青岛科技大学 | Protein-protein interaction prediction method based on deep forests |
| US20240013862A1 (en) * | 2020-09-30 | 2024-01-11 | Zymergen Inc. | Methods to identify novel insecticidal proteins from complex metagenomic microbial samples |
Also Published As
| Publication number | Publication date |
|---|---|
| CN115171792A (en) | 2022-10-11 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| CN115171792B (en) | A hybrid prediction method for virulence factors and antibiotic resistance genes | |
| Brealey et al. | Dental calculus as a tool to study the evolution of the mammalian oral microbiome | |
| CN110010194A (en) | A Prediction Method of RNA Secondary Structure | |
| US20200294628A1 (en) | Creation or use of anchor-based data structures for sample-derived characteristic determination | |
| CN111816255A (en) | Fusion of multi-view and optimal multi-label chain learning for RNA-binding protein identification | |
| KR20200133067A (en) | Method and system for predicting disease from gut microbial data | |
| CN115997255A (en) | Molecular techniques for predicting bacterial phenotypic traits from genome | |
| CN119007829B (en) | Resistance polypeptide identification method based on deep learning | |
| CN114861940B (en) | Bayesian optimization ensemble learning method for predicting sORFs in plant lncRNAs | |
| TWI582631B (en) | Dna sequence analyzing system for analyzing bacterial species and method thereof | |
| US20240153588A1 (en) | Systems and methods for identifying microbial biosynthetic genetic clusters | |
| Liu et al. | A putative bacterial ecocline in Klebsiella pneumoniae | |
| Wickramarachchi | Models and algorithms for metagenomics analysis and Plasmid classification | |
| Biswa et al. | Tameness selection pressure affects gut virome diversity in mice | |
| CN119541645B (en) | Metagenomic plasmid identification method, metagenomic plasmid identification system, terminal and storage medium | |
| Debras | Analysis of secondary metabolite biosynthetic gene clusters in lichen metagenomes | |
| Naidenov | Unleashing Genomic Insights with AB Learning: A Self-Supervised Whole-Genome Language Model | |
| Miller et al. | RNA-seq Parent-of-Origin Classification with Machine Learning applied to Alignment Features | |
| Sigmon | Integration of Optical Maps and Short-Read Sequencing Data in Genomic Investigation | |
| CN117594125A (en) | Fusion gene detection method based on double-end RNA-S | |
| Lee et al. | E2D: A Novel Tool for Annotating Protein Domains in Expressed Sequence Tags | |
| Lehtinen | Comparison of normalization and statistical testing methods of 16S rRNA gene sequencing data | |
| Kim | The Development of a Robust Computational System for Integrating Multi-Omics and Microbiome Data to Extract High-Impact Knowledge for the Advancement of Biomedical Research | |
| CN118571309A (en) | Method, device and equipment for predicting or classifying antibiotic resistance genes or genes of virulence factors | |
| Sifat et al. | DeepIndel: A ResNet-Based Method for Accurate Insertion and Deletion Detection from Long-Read Sequencing. | |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |