
CN115171792B - A hybrid prediction method for virulence factors and antibiotic resistance genes - Google Patents

A hybrid prediction method for virulence factors and antibiotic resistance genes

Info

Publication number
CN115171792B
Authority
CN
China
Prior art keywords
pssm
feature
antibiotic resistance
features
resistance genes
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210781902.8A
Other languages
Chinese (zh)
Other versions
CN115171792A (en
Inventor
彭绍亮
姬博亚
皮文定
刘文娟
赵雄君
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Hunan University
Original Assignee
Hunan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Hunan University filed Critical Hunan University
Priority to CN202210781902.8A priority Critical patent/CN115171792B/en
Publication of CN115171792A publication Critical patent/CN115171792A/en
Application granted granted Critical
Publication of CN115171792B publication Critical patent/CN115171792B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • G16B40/20Supervised data analysis
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N20/00Machine learning
    • G06N20/20Ensemble learning
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06NCOMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00Computing arrangements based on biological models
    • G06N3/02Neural networks
    • G06N3/08Learning methods
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B10/00ICT specially adapted for evolutionary bioinformatics, e.g. phylogenetic tree construction or analysis
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B30/00ICT specially adapted for sequence analysis involving nucleotides or amino acids
    • G16B30/10Sequence alignment; Homology search
    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16BBIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B50/00ICT programming tools or database systems specially adapted for bioinformatics

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Theoretical Computer Science (AREA)
  • Medical Informatics (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Evolutionary Biology (AREA)
  • Software Systems (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Data Mining & Analysis (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Databases & Information Systems (AREA)
  • Computing Systems (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Animal Behavior & Ethology (AREA)
  • Physiology (AREA)
  • Chemical & Material Sciences (AREA)
  • Analytical Chemistry (AREA)
  • Proteomics, Peptides & Aminoacids (AREA)
  • Epidemiology (AREA)
  • Public Health (AREA)
  • Apparatus Associated With Microorganisms And Enzymes (AREA)
  • Measuring Or Testing Involving Enzymes Or Micro-Organisms (AREA)

Abstract

The invention discloses a hybrid prediction method for virulence factors and antibiotic resistance genes, in the technical field of deep learning and bioinformatics, comprising the following steps: S1, obtaining known antibiotic resistance gene sequence data, virulence factor sequence data, and negative-sample gene sequence data from databases; S2, computing multiple core gene features from the gene sequence information and using them to construct a deep learning neural network architecture and a classical ensemble learning architecture; S3, taking the three types of sequence data from S1 as samples and dividing them into a training data set and a test data set; S4, obtaining a new training data set using multiple classification methods, constructing a classification model on the new training data set, and obtaining performance evaluation metrics for the classification model. The hybrid prediction method achieves good predictive performance and high prediction accuracy.

Description

A hybrid prediction method for virulence factors and antibiotic resistance genes
Technical Field
The invention relates to the technical field of deep learning and bioinformatics, and in particular to a hybrid prediction method for virulence factors and antibiotic resistance genes.
Background
Microorganisms are critical to the internal ecosystems of hosts (e.g., humans, animals, and plants) and to the maintenance of the external environment. In particular, pathogenic microorganisms cause disease by carrying virulence factors (VFs) and antibiotic resistance genes (ARGs), and can even threaten the life of the host. Identifying VFs and ARGs accurately and promptly can effectively guide medical treatment, reduce host morbidity and mortality, and reduce economic losses in animal husbandry, aquaculture, and related fields.
Furthermore, although VFs and ARGs follow different evolutionary pathways, they share common features that pathogenic bacteria need in order to adapt to and survive in competitive microbial environments. In particular, both VFs and ARGs are often transferred between bacteria by horizontal gene transfer (HGT) and use similar systems (i.e., two-component systems, efflux pumps, cell wall alterations, and porins) to activate or repress the expression of various genes. Pathogens use VFs to cause disease in their hosts, and they can colonize environments under selective antibiotic pressure by acquiring or carrying ARGs. Thus, to understand the causal relationships among microbiome composition, function, and disease, VFs and ARGs must be identified together; predicting both at once also saves pathogen surveillance time, particularly for on-site detection of epidemic pathogens. However, conventional bioinformatics tools usually predict either ARGs or VFs independently, are relatively outdated, and have relatively low precision and recall. Moreover, conventional prediction methods for VFs and ARGs suffer from a high false negative rate and strong sensitivity to the cut-off threshold, and predict relatively poorly because they can identify only conserved genes. It is therefore necessary to design a hybrid prediction method for virulence factors and antibiotic resistance genes.
Disclosure of Invention
In order to solve the above problems, the invention aims to provide a hybrid prediction method for virulence factors and antibiotic resistance genes that addresses the technical problems of the prior art: outdated prediction tools, low precision and recall, and poor predictive performance.
In order to achieve the above object, the technical scheme of the present invention is as follows:
The invention provides a hybrid prediction method for virulence factors and antibiotic resistance genes, comprising the following steps:
S1, acquiring known antibiotic resistance gene sequence data, virulence factor sequence data, and negative-sample gene sequence data from databases;
S2, computing multiple core gene features from the gene sequence information, and using these features to construct a deep learning neural network architecture and a classical ensemble learning architecture, respectively;
S3, taking the three types of sequence data from S1 as samples, randomly drawing samples to form the full data set, and randomly dividing the full data set into five parts; in each of five rounds, four parts serve as the training data set and the remaining part as the test data set;
S4, obtaining a new training data set using multiple classification methods, constructing a classification model on the new training data set based on extremely randomized trees (Extra Trees), and obtaining performance evaluation metrics for the classification model.
In one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, S1 specifically comprises the following steps:
S11, acquiring known antibiotic resistance gene sequence data from the databases ARDB, CARD, and UniProt;
S12, acquiring known virulence factor sequence data from the databases VFDB, PATRIC, Victors, and UniProt;
S13, acquiring negative-sample gene sequence data from the UniProt database.
In one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, S2 comprises the following specific steps:
S21, using the gene sequence information to compute alignment-score-based similarity features, simple features based on one-hot encoding, features based on gene evolution information, and features based on gene sequence information;
S22, constructing a deep learning network architecture from the alignment-score-based similarity features and the one-hot-encoded simple features of the gene sequences, and training a neural network classification model end to end;
S23, constructing a classical ensemble learning architecture from the features based on gene evolution information and the features based on gene sequence information, and training classical machine learning classification models on this prior feature information.
In one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, computing the alignment-score-based similarity features in S21 comprises the following specific steps:
The DIAMOND program is selected, and the gene sequences in the training data set are aligned, under sensitive parameters, against the remaining 12724 known ARGs and 30945 known VFs that serve as the comparison data set;
The training data set is de-duplicated against the comparison data set with the CD-HIT program, and the alignment scores are normalized to the interval [0,1];
The bit-score-based similarity features of each gene sequence in the training data set are converted into a fixed feature vector of 12724 + 30945 = 43669 dimensions.
In one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, the features based on gene evolution information in S21 consist of three specific features derived from the position-specific scoring matrix (PSSM): the PSSM-composition feature, the RPM-PSSM feature, and the AADP-PSSM feature;
The PSSM-composition feature eliminates the variation caused by protein sequence length by summing and averaging, for each naturally occurring amino acid type, all rows of the original PSSM profile that correspond to that type; it is defined as follows:
where R_i is the i-th row of the PSSM-composition feature matrix, r_k is the k-th row of the normalized PSSM, p_k is the k-th amino acid in the protein sequence, and a_i is the i-th of the 20 standard amino acids;
The RPM-PSSM feature transforms the original PSSM by setting negative values to 0 while leaving positive values unchanged. The idea comes from the residue probing method: each amino acid corresponding to a particular column of the PSSM is treated as a probe, and the original PSSM is converted into a 400-dimensional feature vector using the following definition:
where M_i is the i-th row of the RPM-PSSM feature matrix, m_k is the k-th row of the PSSM, p_k is the k-th amino acid in the protein sequence, and a_i is the i-th of the 20 standard amino acids;
The AADP-PSSM feature extends the traditional AAC and DPC concepts to the PSSM. Its first component, AAC-PSSM, is a fixed-length 20-dimensional feature vector obtained by averaging the columns of the original PSSM profile, defined as follows:
where x_j is the j-th component of the AAC-PSSM feature vector, representing the average proportion of amino acid mutations during evolution, and p_{i,j} is the entry in row i and column j of the original PSSM;
The second component, DPC-PSSM, is a fixed-length 400-dimensional feature vector that avoids the information loss caused by X in the protein, defined as follows:
The AADP-PSSM feature combines the two components into a fixed-length feature vector of 20 + 400 = 420 dimensions.
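The formulas referred to in this aspect are not reproduced in this text. A reconstruction consistent with the symbol definitions given, assuming the standard definitions of these PSSM descriptors, is sketched below. Here L denotes the sequence length, r_k and m_k denote the k-th row of the normalized PSSM and of the PSSM with negative entries set to 0, δ(p_k, a_i) equals 1 when p_k = a_i and 0 otherwise, and y_{i,j} is a label introduced here for the DPC-PSSM components. The 1/L factor in the RPM-PSSM line is an assumption.

```latex
\begin{align}
R_i &= \frac{1}{L}\sum_{k=1}^{L} r_k\,\delta(p_k, a_i), & i &= 1,\dots,20 \\
M_i &= \frac{1}{L}\sum_{k=1}^{L} m_k\,\delta(p_k, a_i), & i &= 1,\dots,20 \\
x_j &= \frac{1}{L}\sum_{i=1}^{L} p_{i,j}, & j &= 1,\dots,20 \\
y_{i,j} &= \frac{1}{L-1}\sum_{k=1}^{L-1} p_{k,i}\,p_{k+1,j}, & 1 &\le i,j \le 20
\end{align}
```

Each R_i and M_i is a 20-dimensional row, so stacking the 20 rows gives the 400 dimensions stated for PSSM-composition and RPM-PSSM; the 20 values x_j and 400 values y_{i,j} together give the 420 dimensions of AADP-PSSM.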
In one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, the features based on gene sequence information in S21 include the amino acid composition feature, the dipeptide composition feature, and the dipeptide deviation from expected mean feature;
The amino acid composition feature represents the frequency of each of the 20 natural amino acids in the protein sequence, calculated as follows:
where N(a) is the number of occurrences of amino acid a, N is the sequence length of the protein or peptide, and f(a) forms the resulting 20-dimensional feature vector;
The dipeptide composition feature represents the frequency of each dipeptide in the protein or polypeptide sequence, calculated as follows:
where N_ab is the number of occurrences of the dipeptide ab, N is the sequence length of the protein or peptide, and D(a, b) forms the resulting 400-dimensional feature vector.
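The two calculation formulas are not reproduced in this text. Assuming the usual definitions of amino acid composition and dipeptide composition, which match the symbols defined above, they would read:

```latex
\begin{align}
f(a) &= \frac{N(a)}{N}, & a &\in \{A, C, D, \dots, Y\} \\
D(a, b) &= \frac{N_{ab}}{N - 1}, & a, b &\in \{A, C, D, \dots, Y\}
\end{align}
```

The denominator N - 1 in the second line is the number of overlapping dipeptides in a sequence of length N.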
In one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, the dipeptide deviation from expected mean (DDE) feature is a combination of three features: the theoretical mean (TM), the dipeptide composition (DPC), and the theoretical variance (TV);
The TM feature is calculated as follows:
where C_a and C_b are the numbers of codons encoding amino acids a and b, respectively, and C_N equals 61, the total number of possible codons excluding the three stop codons.
The TV feature is calculated as follows:
where TM is the TM feature, TV is the TV feature, and N is the sequence length of the protein or peptide.
The DDE feature is calculated as follows:
where DPC is the DPC feature, TM is the TM feature, and TV is the TV feature.
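The three calculation formulas are not reproduced in this text. A reconstruction that agrees with the symbol definitions above, assuming the commonly used definition of the DDE descriptor, is:

```latex
\begin{align}
TM(a, b) &= \frac{C_a}{C_N} \times \frac{C_b}{C_N} \\
TV(a, b) &= \frac{TM(a, b)\,\bigl(1 - TM(a, b)\bigr)}{N - 1} \\
DDE(a, b) &= \frac{DPC(a, b) - TM(a, b)}{\sqrt{TV(a, b)}}
\end{align}
```

Read this way, DDE is a standardized score: it measures how far the observed dipeptide frequency lies from the frequency expected from codon counts alone, in units of the theoretical standard deviation.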
In one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, the classical machine learning classification models trained on the prior feature information in S23 include the random forest, extremely randomized trees (Extra Trees), XGBoost, gradient boosting, and AdaBoost classification algorithms.
In one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, S4 comprises the following steps:
S41, running a stacking algorithm over multiple classification methods, taking the prediction scores that the different classification methods produce on the training data as a new training data set;
S42, constructing a classification model based on extremely randomized trees from the new training data set, scoring the model on the test data set, repeating the experiment five times, and taking the average of the five results as the performance evaluation metric of the model.
In one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, S41 specifically comprises the following steps:
S411, integrating multiple base-level classification models through a meta-model;
S412, training the base-level classification models on the whole training data set, with the meta-model using the outputs of the base-level classification models as its training features;
S413, training each base-level classification model using 5-fold cross-validation.
By adopting the technical scheme, the invention has the following advantages:
1. The invention provides a hybrid prediction method for virulence factors and antibiotic resistance genes that makes full use of multiple key core gene features and combines the strengths of classical ensemble learning and deep learning to predict potential virulence factors and antibiotic resistance genes efficiently and simultaneously; the method is scientifically sound and yields more accurate predictions.
2. The invention can accurately predict virulence factors, drug resistance genes, and negative-sample genes (neither virulence factors nor antibiotic resistance genes) simultaneously, and can also predict each flexibly and accurately on its own. It overcomes the drawbacks of the traditional best-hit method, namely a high false negative rate, strong sensitivity to the cut-off threshold, and the ability to identify only conserved genes, and achieves better predictive performance.
3. For novel virulence factors and drug resistance genes, for virulence factors and drug resistance genes in real metagenomic data, and for pseudo virulence factors and drug resistance genes (gene fragments), the precision and recall of the invention are higher than those of earlier traditional prediction tools. By using a computational method that combines machine learning with deep learning neural networks, the results of the invention are competitive with, and more scientifically sound than, the most advanced prediction tools.
Drawings
Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings.
FIG. 1 is a flow chart of a method of hybrid prediction of virulence factors and antibiotic resistance genes of the present invention;
FIG. 2 is a bar graph comparison of the results of the hybrid prediction method of the present invention and other calculation methods for predicting virulence factors and antibiotic resistance genes simultaneously.
Detailed Description
The features and advantages of the present invention are described in detail in the following embodiments, and will be apparent to one skilled in the art from the description, claims, and drawings disclosed in this specification.
Referring to FIG. 1, a hybrid prediction method for virulence factors and antibiotic resistance genes in microbial data comprises the following steps:
S1, acquiring known antibiotic resistance gene sequence data, virulence factor sequence data, and negative-sample gene sequence data (sequences that are neither antibiotic resistance genes nor virulence factors) from databases;
S1 comprises the following specific steps:
S11, acquiring known antibiotic resistance gene sequence data from the databases ARDB, CARD, and UniProt;
S12, acquiring known virulence factor sequence data from the databases VFDB, PATRIC, Victors, and UniProt;
S13, acquiring negative-sample gene sequence data from the UniProt database.
S2, computing multiple core gene features from the gene sequence information, and using these features to construct a deep learning neural network architecture and a classical ensemble learning architecture, respectively;
S2 comprises the following specific steps:
S21, the core gene features comprise alignment-score-based similarity features, simple features based on one-hot encoding, features based on gene evolution information, and features based on gene sequence information; each of these is computed from the gene sequence information.
The alignment-score-based similarity feature consists of the alignment scores between a candidate sequence and the known virulence factors and antibiotic resistance genes. This feature considers the distribution of similarity across the sequences in the ARG and VF databases rather than only the best hit. The bit score is used as the similarity index because, unlike the e-value, it reflects the degree of identity between sequences and is independent of the database size.
Computing the alignment-score-based similarity features in S21 comprises the following specific steps:
The DIAMOND program, which is faster than BLAST, is selected, and the gene sequences in the training data set are aligned, under sensitive parameters, against the remaining 12724 known ARGs and 30945 known VFs that serve as the comparison data set;
The training data set has been de-duplicated against the comparison data set using the CD-HIT program to avoid possible label leakage, and the alignment scores are normalized to the interval [0,1] so that they express sequence similarity as a distance;
The bit-score-based similarity feature of each gene sequence in the training data set is converted into a fixed feature vector of 12724 + 30945 = 43669 dimensions, where each dimension is the alignment score that DIAMOND reports between the full-length gene sequence and one of the available ARGs and VFs in the comparison data set.
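The conversion from alignment hits to a fixed-length vector can be sketched as follows. This is a minimal illustration, not the patented implementation: the function name `similarity_vector`, the toy reference identifiers, and the scaling rule (dividing each bit score by the query's largest bit score, so that higher means more similar) are all assumptions, since the text states only that scores are normalized to [0,1].

```python
from typing import Dict, List


def similarity_vector(hits: Dict[str, float], reference_ids: List[str]) -> List[float]:
    """Map per-reference bit scores to a fixed-length vector scaled to [0, 1].

    hits: bit scores for the references that aligned to the query.
    reference_ids: fixed ordering of all reference ARGs and VFs
                   (43669 entries in the text; 4 in this toy example).
    References with no alignment receive 0.0.
    """
    max_score = max(hits.values(), default=0.0)
    if max_score <= 0:
        return [0.0] * len(reference_ids)
    # Assumed normalization: divide by the best bit score of this query.
    return [hits.get(ref, 0.0) / max_score for ref in reference_ids]


refs = ["ARG_1", "ARG_2", "VF_1", "VF_2"]    # hypothetical reference IDs
hits = {"ARG_2": 120.0, "VF_1": 30.0}         # hypothetical bit scores
vec = similarity_vector(hits, refs)
print(vec)
```

Keeping one dimension per reference sequence, rather than only the best hit, is what lets a downstream model see the whole similarity distribution described above.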
The features based on gene evolution information consist of three specific features derived from the position-specific scoring matrix (PSSM): the PSSM-composition feature, the RPM-PSSM feature, and the AADP-PSSM feature. The PSSM-composition feature eliminates the variation caused by protein sequence length by summing and averaging, for each naturally occurring amino acid type, all rows of the original PSSM profile that correspond to that type; it is defined as follows:
where R_i is the i-th row of the PSSM-composition feature matrix, r_k is the k-th row of the normalized PSSM, p_k is the k-th amino acid in the protein sequence, and a_i is the i-th of the 20 standard amino acids.
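The row-summing step described above can be sketched in a few lines. This is an illustration under stated assumptions: `pssm_composition` is a hypothetical helper name, the PSSM is passed as an L x 20 list of already-normalized rows, and the division by the sequence length follows the standard definition of this descriptor rather than a formula given in the text.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pssm_composition(sequence, pssm):
    """Return the 400-dim PSSM-composition vector for an L x 20 PSSM.

    For each amino acid type a_i, the PSSM rows at positions where the
    sequence has a_i are summed, then divided by the sequence length L.
    """
    length = len(sequence)
    if length == 0 or length != len(pssm):
        raise ValueError("sequence and PSSM must have the same non-zero length")
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = [[0.0] * 20 for _ in range(20)]
    for residue, pssm_row in zip(sequence, pssm):
        i = index.get(residue)
        if i is None:          # non-standard residue such as X: skipped
            continue
        for j, score in enumerate(pssm_row):
            rows[i][j] += score
    return [value / length for row in rows for value in row]


toy_pssm = [[1.0] * 20, [2.0] * 20]           # toy 2 x 20 PSSM
composition = pssm_composition("AC", toy_pssm)
print(len(composition), composition[0], composition[20])
```

Because the output is always 20 x 20 values, proteins of any length map to vectors of the same size.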
The RPM-PSSM feature transforms the original PSSM by setting negative values to 0 while leaving positive values unchanged. The idea comes from the residue probing method: each amino acid corresponding to a particular column of the PSSM is treated as a probe, and the original PSSM is finally converted into a 400-dimensional feature vector using the following definition:
where M_i is the i-th row of the RPM-PSSM feature matrix, m_k is the k-th row of the PSSM, p_k is the k-th amino acid in the protein sequence, and a_i is the i-th of the 20 standard amino acids.
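A minimal sketch of the filtering-then-grouping idea follows. The helper name `rpm_pssm` is hypothetical, and the final division by the sequence length is an assumption borrowed from common descriptions of this descriptor; the text itself specifies only the clipping of negative values and the 400-dimensional output.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def rpm_pssm(sequence, pssm):
    """Return a 400-dim RPM-PSSM vector for an L x 20 PSSM.

    Negative PSSM entries are set to 0, positive entries are kept, and the
    filtered rows are accumulated per amino acid type of the sequence.
    """
    length = len(sequence)
    if length == 0 or length != len(pssm):
        raise ValueError("sequence and PSSM must have the same non-zero length")
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    rows = [[0.0] * 20 for _ in range(20)]
    for residue, pssm_row in zip(sequence, pssm):
        i = index.get(residue)
        if i is None:          # non-standard residue: skipped
            continue
        for j, score in enumerate(pssm_row):
            rows[i][j] += max(score, 0.0)     # filter negatives to 0
    # Assumed normalization by sequence length.
    return [value / length for row in rows for value in row]


toy = [[-1.0, 2.0] + [0.0] * 18, [3.0, -4.0] + [0.0] * 18]
rpm = rpm_pssm("AA", toy)
print(rpm[0], rpm[1])
```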
The AADP-PSSM feature extends the traditional AAC and DPC concepts to the PSSM. Its first component, AAC-PSSM, is a fixed-length 20-dimensional feature vector obtained by averaging the columns of the original PSSM profile, defined as follows:
where x_j is the j-th component of the AAC-PSSM feature vector, representing the average proportion of amino acid mutations during evolution, and p_{i,j} is the entry in row i and column j of the original PSSM. The second component, DPC-PSSM, is a fixed-length 400-dimensional feature vector that avoids the information loss caused by X in the protein, defined as follows:
The AADP-PSSM feature combines the two components into a fixed-length feature vector of 20 + 400 = 420 dimensions.
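The two components can be sketched together as below. The function name `aadp_pssm` is hypothetical, and the DPC-PSSM part assumes the standard definition (the average product of column i at position k and column j at position k+1), since the formula itself is not reproduced in the text.

```python
def aadp_pssm(pssm):
    """Return the 420-dim AADP-PSSM vector for an L x 20 PSSM (L >= 2).

    First 20 values: AAC-PSSM, the mean of each PSSM column.
    Next 400 values: DPC-PSSM, the mean over consecutive positions of
    pssm[k][i] * pssm[k + 1][j] for every column pair (i, j).
    """
    length = len(pssm)
    if length < 2:
        raise ValueError("PSSM needs at least two rows")
    aac = [sum(row[j] for row in pssm) / length for j in range(20)]
    dpc = [
        sum(pssm[k][i] * pssm[k + 1][j] for k in range(length - 1)) / (length - 1)
        for i in range(20)
        for j in range(20)
    ]
    return aac + dpc


toy_pssm = [[1.0] * 20, [2.0] * 20, [3.0] * 20]   # toy 3 x 20 PSSM
features = aadp_pssm(toy_pssm)
print(len(features), features[0], features[20])
```

Because both parts work on PSSM columns rather than on residue letters, positions holding the unknown residue X still contribute, which is the information-loss point made above.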
Features based on gene sequence information include the amino acid composition (AAC) feature, the dipeptide composition (DPC) feature, the dipeptide deviation from expected mean (DDE) feature, the pseudo amino acid composition (PAAC) feature, and the quasi-sequence-order (QSO) feature.
The amino acid composition (AAC) feature represents the frequency of each of the 20 natural amino acids (i.e., ACDEFGHIKLMNPQRSTVWY) in the protein sequence, and can be calculated as:
where N(a) is the number of occurrences of amino acid a, N is the sequence length of the protein or peptide, and f(a) forms the resulting 20-dimensional feature vector.
The DPC feature represents the frequency of each dipeptide in the protein or polypeptide sequence and can be calculated as:
where N_ab is the number of occurrences of the dipeptide ab, N is the sequence length of the protein or peptide, and D(a, b) forms the resulting 400-dimensional feature vector. Later references to the DPC feature use this calculation.
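Both frequency features are simple counts, as the following sketch shows. The function names `aac` and `dpc` are hypothetical, and the denominator N - 1 for dipeptides (the number of overlapping pairs) is the usual convention, assumed here because the formula is not reproduced in the text.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence):
    """20-dim amino acid composition: f(a) = N(a) / N."""
    n = len(sequence)
    if n == 0:
        raise ValueError("empty sequence")
    return [sequence.count(a) / n for a in AMINO_ACIDS]


def dpc(sequence):
    """400-dim dipeptide composition: D(a, b) = N_ab / (N - 1)."""
    n = len(sequence)
    if n < 2:
        raise ValueError("sequence needs at least two residues")
    counts = {a + b: 0 for a in AMINO_ACIDS for b in AMINO_ACIDS}
    for k in range(n - 1):
        pair = sequence[k:k + 2]
        if pair in counts:      # pairs with non-standard residues are ignored
            counts[pair] += 1
    return [counts[a + b] / (n - 1) for a in AMINO_ACIDS for b in AMINO_ACIDS]


aac_vec = aac("ACAC")
dpc_vec = dpc("ACAC")
print(aac_vec[0], aac_vec[1], dpc_vec[1], dpc_vec[20])
```

In the toy sequence "ACAC" the dipeptides are AC, CA, AC, so the AC entry (index 1) is 2/3 and the CA entry (index 20) is 1/3.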
The DDE feature is a combination of three features: the theoretical mean (TM), the dipeptide composition (DPC), and the theoretical variance (TV). Specifically, the TM feature is calculated as follows:
where C_a and C_b are the numbers of codons encoding amino acids a and b, respectively, and C_N equals 61, the total number of possible codons excluding the three stop codons.
The TV feature is calculated as follows:
where TM is the TM feature, calculated as described above, and N is the sequence length of the protein or peptide.
The DDE feature is calculated as follows:
where DPC is the DPC feature, TM is the TM feature, and TV is the TV feature, each calculated as described above.
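The three quantities can be combined as in the sketch below. It assumes the commonly used definition of the DDE descriptor, namely TM = (C_a / C_N)(C_b / C_N), TV = TM(1 - TM) / (N - 1), and DDE = (DPC - TM) / sqrt(TV), because the formulas themselves are not reproduced in the text. The codon counts are those of the standard genetic code, and the function name `dde` is hypothetical.

```python
import math

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# Number of codons encoding each amino acid in the standard genetic code.
CODONS = {
    "A": 4, "C": 2, "D": 2, "E": 2, "F": 2, "G": 4, "H": 2, "I": 3,
    "K": 2, "L": 6, "M": 1, "N": 2, "P": 4, "Q": 2, "R": 6, "S": 6,
    "T": 4, "V": 4, "W": 1, "Y": 2,
}
C_N = 61  # 64 codons minus the three stop codons


def dde(sequence):
    """400-dim dipeptide deviation from expected mean."""
    n = len(sequence)
    if n < 2:
        raise ValueError("sequence needs at least two residues")
    counts = {}
    for k in range(n - 1):
        pair = sequence[k:k + 2]
        counts[pair] = counts.get(pair, 0) + 1
    features = []
    for a in AMINO_ACIDS:
        for b in AMINO_ACIDS:
            dpc = counts.get(a + b, 0) / (n - 1)           # observed frequency
            tm = (CODONS[a] / C_N) * (CODONS[b] / C_N)     # theoretical mean
            tv = tm * (1 - tm) / (n - 1)                   # theoretical variance
            features.append((dpc - tm) / math.sqrt(tv))
    return features


dde_vec = dde("ACAC")
print(len(dde_vec), dde_vec[1] > 0, dde_vec[0] < 0)
```

A positive value means the dipeptide occurs more often than codon usage alone would predict, and a negative value means it occurs less often.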
S22, constructing a deep learning network architecture from the alignment-score-based similarity features and the one-hot-encoded simple features of the gene sequences, and training a neural network classification model end to end;
S23, constructing a classical ensemble learning architecture from the features based on gene evolution information and the features based on gene sequence information, and training classical machine learning classification models on this prior feature information.
The classical machine learning classification models trained on the prior feature information in S23 include the random forest (Random Forest), extremely randomized trees (Extra Trees), XGBoost, gradient boosting (GradientBoosting), and AdaBoost classification algorithms.
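The one-hot-encoded simple feature used in S22 can be illustrated as follows. The text does not give the encoding details, so the function name `one_hot_encode`, the zero-padding to a fixed `max_len`, and the all-zero row for non-standard residues are assumptions made only for this sketch.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def one_hot_encode(sequence, max_len):
    """Encode a protein sequence as a max_len x 20 binary matrix.

    Each position becomes a 20-dim row with a single 1 at the index of its
    amino acid. Positions past the end of the sequence, and non-standard
    residues, are encoded as all-zero rows; longer sequences are truncated.
    """
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    encoded = []
    for position in range(max_len):
        row = [0] * len(AMINO_ACIDS)
        if position < len(sequence):
            i = index.get(sequence[position])
            if i is not None:
                row[i] = 1
        encoded.append(row)
    return encoded


matrix = one_hot_encode("ACX", max_len=4)
print(len(matrix), len(matrix[0]))
```

A fixed-size matrix of this kind is the usual input shape for a neural network trained end to end on raw sequences.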
S3, taking the known antibiotic resistance gene data from S1 as the first class of samples, the known virulence factor sequence data as the second class, and the known negative-sample gene sequence data as the third class; randomly drawing samples from the three classes, and randomly dividing the whole data set into five parts each time, of which four parts are used as the training data set and the remaining part as the test data set;
S4, obtaining a new training data set using multiple classification methods, constructing a classification model on the new training data set based on extremely randomized trees, and obtaining performance evaluation metrics for the classification model.
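The five-part division in S3 can be sketched with the standard library alone. The helper name `five_fold_splits` and the fixed seed are assumptions; the sketch also does not stratify by class, which the text does not specify either way.

```python
import random


def five_fold_splits(n_samples, seed=0):
    """Yield (train_indices, test_indices) for five random folds.

    The sample indices are shuffled once and dealt into five parts; each
    part serves as the test set exactly once while the other four parts
    form the training set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::5] for i in range(5)]
    for held_out in range(5):
        test = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield train, test


splits = list(five_fold_splits(10))
for train, test in splits:
    print(len(train), len(test))
```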
S4 specifically comprises the following steps:
S41, running a stacking algorithm over multiple classification methods, taking the prediction scores that the different classification methods produce on the training data as a new training data set;
To obtain excellent predictive performance for virulence factors and antibiotic resistance genes, the strengths of classical machine learning methods and deep learning are integrated in a stacking algorithm.
S41 specifically comprises the following steps:
S411, integrating multiple base-level classification models through a meta-model;
S412, training the base-level classification models on the whole training data set, with the meta-model using the outputs of the base-level classification models as its training features;
S413, training each base-level classification model using 5-fold cross-validation to counter overfitting in the final prediction. In a specific embodiment, the stacking algorithm of the invention is given as pseudocode in Table 1 below.
Table 1. Pseudocode of the stacking algorithm
S42, constructing a classification model based on extremely randomized trees from the new training data set, scoring the model on the test data set, repeating the experiment five times, and taking the average of the five results as the performance evaluation metric of the model.
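Steps S41 and S42 can be approximated with scikit-learn's `StackingClassifier`, as in the sketch below. This is not the pseudocode of Table 1, which is not reproduced in the text. It runs on synthetic three-class data from `make_classification`, uses only four of the listed base learners (XGBoost and the neural network branch are left out to keep the example dependency-free), and all hyperparameters are placeholders.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    AdaBoostClassifier,
    ExtraTreesClassifier,
    GradientBoostingClassifier,
    RandomForestClassifier,
    StackingClassifier,
)
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the ARG / VF / negative-sample feature matrix.
X, y = make_classification(
    n_samples=300, n_features=20, n_informative=6, n_classes=3, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Base-level classifiers (S411); their cross-validated class probabilities
# become the meta-model's training features (S412, S413).
base_models = [
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
    ("et", ExtraTreesClassifier(n_estimators=50, random_state=0)),
    ("gb", GradientBoostingClassifier(n_estimators=50, random_state=0)),
    ("ada", AdaBoostClassifier(n_estimators=50, random_state=0)),
]

# Meta-model: extremely randomized trees, as in S42.
stack = StackingClassifier(
    estimators=base_models,
    final_estimator=ExtraTreesClassifier(n_estimators=100, random_state=0),
    cv=5,
    stack_method="predict_proba",
)
stack.fit(X_train, y_train)
predictions = stack.predict(X_test)
print("test accuracy:", stack.score(X_test, y_test))
```

With `cv=5`, the meta-model is trained on out-of-fold predictions, so it never sees base-model outputs for samples those models were fitted on; this is the overfitting safeguard that S413 describes.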
Example II
To better illustrate the effect of the prediction method of the present invention, a strict procedure was implemented and cross-validation was used to evaluate the invention without bias. Table 2 lists the results of hybrid prediction of virulence factors and drug resistance genes under five-fold cross-validation in this example:
Table 2. Results of simultaneous prediction of virulence factors and drug resistance genes under five-fold cross-validation
In Table 2, Precision: precision; Recall: recall; F1-score: F1 score; VFs: virulence factors; ARGs: drug resistance genes; NSs: negative-sample genes; Micro-average: micro average.
As Table 2 shows, this embodiment obtains high evaluation scores across the repeated cross-validation experiments. This result indicates that the invention not only achieves simultaneous prediction of virulence factors and antibiotic resistance genes, but also performs well in terms of precision and recall.
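The metrics named in the table legend can be computed as in the following sketch. The label strings and the toy predictions are invented for illustration and are not results from the patent; `per_class_metrics` is a hypothetical helper name.

```python
def per_class_metrics(y_true, y_pred, labels):
    """Return {label: (precision, recall, f1)} plus a "micro" entry.

    Micro-averaging pools true positives, false positives, and false
    negatives over all classes before computing the three scores.
    """
    def scores(tp, fp, fn):
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        total = precision + recall
        f1 = 2 * precision * recall / total if total else 0.0
        return precision, recall, f1

    report = {}
    tp_sum = fp_sum = fn_sum = 0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(t == label and p == label for t, p in pairs)
        fp = sum(t != label and p == label for t, p in pairs)
        fn = sum(t == label and p != label for t, p in pairs)
        report[label] = scores(tp, fp, fn)
        tp_sum, fp_sum, fn_sum = tp_sum + tp, fp_sum + fp, fn_sum + fn
    report["micro"] = scores(tp_sum, fp_sum, fn_sum)
    return report


y_true = ["ARG", "ARG", "VF", "VF", "NS", "NS"]   # toy labels
y_pred = ["ARG", "VF", "VF", "VF", "NS", "ARG"]   # toy predictions
report = per_class_metrics(y_true, y_pred, ["ARG", "VF", "NS"])
for name, (p, r, f1) in report.items():
    print(name, round(p, 3), round(r, 3), round(f1, 3))
```

When every sample has exactly one true label and one predicted label, as here, micro-averaged precision, recall, and F1 all equal overall accuracy.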
Example III
To test the predictive ability of the invention for unknown Virulence Factors (VFs), drug resistance genes (ARGs) and negative sample genes (NSs), the invention constructs an independent dataset comprising 209 ARGs, 209 VFs and 209 NSs, notably these unknown genes are completely independent of the genes in the training dataset, all identical or repetitive sequences are removed by setting the identity threshold of CD-HIT to 100%, in addition we have introduced the currently available VRprofile model (the latest computational model) as a comparison method and the traditional "best HIT" method as a baseline (using the Diamond sequence alignment tool) as a comparison method, table 3 lists the results of the simultaneous predictions of unknown virulence factors and drug resistance genes for the invention in sequence for example (HyperVR), VRprofile and using the Diamond sequence alignment tool as a comparison method under three different parameters:
TABLE 3 results of simultaneous prediction of unknown virulence factors and drug resistance genes by the inventive example, VRprofile model and baseline comparison method
In Table 3, the columns are precision, recall and F1 score; VFs denotes virulence factors, ARGs denotes drug resistance genes, NSs denotes negative sample genes, and micro-average denotes the micro-averaged score.
As Table 3 shows, when the invention (HyperVR) is compared with the VRprofile model and with the DIAMOND-based baseline under three different parameters, HyperVR obtains the highest evaluation scores and performs better in terms of precision and recall than the other methods.
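The "best hit" baseline can be sketched as follows. The text does not state what the three parameters (81%, 64%, 21%) control, so this sketch assumes they are minimum percent-identity cutoffs. It also assumes that each hit has already been parsed into a (label, percent identity, bit score) tuple, and that a query with no surviving hit is called a negative sample. The function name `best_hit_label` and the tuple layout are hypothetical.

```python
def best_hit_label(hits, min_identity):
    """Return the label of the highest-scoring hit that passes the cutoff.

    hits: iterable of (subject_label, percent_identity, bit_score) tuples,
          assumed to be parsed beforehand from tabular alignment output.
    min_identity: assumed percent-identity cutoff (e.g. 81, 64 or 21).
    """
    kept = [hit for hit in hits if hit[1] >= min_identity]
    if not kept:
        return "NS"  # assumption: no acceptable hit means negative sample
    # The best hit is the surviving alignment with the largest bit score.
    return max(kept, key=lambda hit: hit[2])[0]


# Toy example with made-up numbers, for illustration only.
example_hits = [("ARG", 85.0, 310.0), ("VF", 60.0, 420.0)]
strict_call = best_hit_label(example_hits, 81)
loose_call = best_hit_label(example_hits, 21)
```

Under these assumptions a stricter cutoff discards distant homologs, while a looser cutoff lets a low-identity but high-scoring hit decide the label. That is one way a single baseline could produce three different rows in Table 3.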
FIG. 2 shows comparison histograms of the present invention (HyperVR), the VRprofile model (the latest computational model) and the baseline comparison method (under three different parameters) for simultaneous prediction of unknown virulence factors and drug resistance genes. In FIG. 2, bar a represents the F1 score, bar b represents recall, and bar c represents precision; Diamond-81%, Diamond-64% and Diamond-21% denote the baseline using the DIAMOND sequence alignment tool under the three different parameters. The height of each bar indicates how well the method predicts. The comparison in FIG. 2 shows that the embodiment of the present invention (HyperVR) has higher prediction performance than both the latest computational model (VRprofile) and the baseline under all three parameters, and that its overall performance is the best among the compared methods.
Finally, while the invention has been described with reference to a specific embodiment, those skilled in the art will understand that the above embodiments are provided for illustration only and do not define the limits of the invention. Various equivalent changes or substitutions may be made without departing from the spirit of the invention; therefore, all changes and modifications to the above embodiments fall within the scope of the appended claims.

Claims (7)

1. A hybrid prediction method for virulence factors and antibiotic resistance genes, characterized by comprising the following steps:
S1. Obtaining known antibiotic resistance gene sequence data, virulence factor sequence data and negative sample gene sequence data from databases, respectively;
S2. Calculating multiple core gene features from gene sequence information, and constructing a deep learning neural network architecture and a classical ensemble learning architecture from these core gene features;
wherein S2 comprises the following specific steps:
S21. Using gene sequence information to calculate, respectively, similarity features based on alignment scores, simple gene sequence features based on one-hot encoding, features based on gene evolution information, and features based on gene sequence information;
S22. Constructing a deep learning network architecture from the similarity features based on alignment scores and the simple gene sequence features based on one-hot encoding, and training a neural network classification model in an end-to-end manner;
S23. Constructing a classical ensemble learning architecture from the features based on gene evolution information and the features based on gene sequence information, and training classical machine learning classification models with prior feature information;
S3. Taking the three types of sequence data in S1 as samples and randomly drawing from them to form the overall dataset; randomly partitioning the overall dataset five times, with four parts of each partition serving as the training dataset and the remaining part as the test dataset;
S4. Obtaining a new training dataset using multiple classification methods; constructing a classification model on the new training dataset based on extremely randomized trees, and obtaining performance evaluation indicators of the classification model;
wherein S4 comprises the following steps:
S41. Applying a stacking algorithm over multiple classification methods, using the prediction scores of the different classification methods on the training data as the new training dataset; to obtain superior prediction performance for virulence factors and antibiotic resistance genes, classical machine learning methods and deep learning are combined in a single stacking algorithm;
wherein S41 specifically comprises the following steps:
S411. Integrating multiple base-level classification models through a meta-model;
S412. Training the base-level classification models on the entire training dataset, and training the meta-model using the outputs of the base-level classification models as features;
S413. Training each base-level classification model separately using 5-fold cross-validation;
S42. Constructing a classification model on the new training dataset based on extremely randomized trees, scoring the model on the test dataset, repeating the experiment five times, and taking the average result of the five experiments as the performance evaluation indicator of the model.

2. The hybrid prediction method for virulence factors and antibiotic resistance genes according to claim 1, characterized in that S1 specifically comprises the following steps:
S11. Obtaining known antibiotic resistance gene sequence data from the databases ARDB, CARD and Uniprot;
S12. Obtaining known virulence factor sequence data from the databases VFDB, PATRIC, Victors and Uniprot;
S13. Obtaining negative sample gene sequence data from the database Uniprot.

3. The hybrid prediction method for virulence factors and antibiotic resistance genes according to claim 1, characterized in that calculating the similarity features based on alignment scores in S21 comprises the following specific steps:
selecting the DIAMOND program and aligning the gene sequences in the training dataset, using sensitive parameters, against the remaining 12,724 known ARGs and 30,945 known VFs used for comparison;
deduplicating the training dataset against the comparison dataset with the CD-HIT program, and normalizing the alignment scores to the interval [0, 1];
converting the bit-score-based similarity feature of each gene sequence in the training dataset into a fixed feature vector of 12,724 + 30,945 = 43,669 dimensions.

4. The hybrid prediction method for virulence factors and antibiotic resistance genes according to claim 1, characterized in that the features based on gene evolution information in S21 consist of three specific features built on the position-specific scoring matrix (PSSM), namely the PSSM-composition feature, the RPM-PSSM feature and the AADP-PSSM feature;
wherein the PSSM-composition feature removes the variation caused by protein sequence length by summing and averaging all rows of the original PSSM profile for each naturally occurring amino acid type, and is defined as follows:
[formula not reproduced in the source text]
where R_i denotes the i-th row of the PSSM-composition feature matrix, r_k denotes the k-th row of the normalized PSSM, p_k denotes the k-th amino acid in the protein sequence, and a_i denotes the i-th of the 20 standard amino acids;
the RPM-PSSM feature transforms the original PSSM by filtering negative values to 0 while leaving positive values unchanged; its idea comes from the residue probing method, in which each amino acid corresponding to a specific column of the PSSM is regarded as a probe; the original PSSM is converted into a 400-dimensional feature vector, defined as follows:
[formula not reproduced in the source text]
where M_i denotes the i-th row of the RPM-PSSM feature matrix, m_k denotes the k-th row of the PSSM, p_k denotes the k-th amino acid in the protein sequence, and a_i denotes the i-th of the 20 standard amino acids;
the AADP-PSSM feature extends the traditional AAC and DPC concepts to the PSSM; AAC-PSSM is obtained by averaging the columns of the original PSSM profile, giving a fixed-length 20-dimensional feature vector, defined as follows:
[formula not reproduced in the source text]
where x_j denotes the j-th element of the AAC-PSSM feature, representing the average proportion of amino acid mutations during evolution, and p_{i,j} denotes the entry in row i and column j of the original PSSM;
DPC-PSSM is converted into a fixed-length 400-dimensional feature vector, to avoid the information loss caused by X in the protein, and is defined as follows:
[formula not reproduced in the source text]
by combining these two components, AADP-PSSM is converted into a fixed-length feature vector of 20 + 400 = 420 dimensions.

5. The hybrid prediction method for virulence factors and antibiotic resistance genes according to claim 1, characterized in that the features based on gene sequence information in S21 include the amino acid composition feature, the dipeptide composition feature, and the dipeptide deviation from expected mean feature;
wherein the amino acid composition feature represents the frequencies of the 20 natural amino acids in the protein sequence, calculated as follows:
[formula not reproduced in the source text]
where N(a) denotes the number of occurrences of a specific amino acid a, N denotes the sequence length of the protein or peptide, and f(a) denotes the resulting 20-dimensional feature vector;
the dipeptide composition feature represents the frequencies of dipeptides in the protein or polypeptide sequence, calculated as follows:
[formula not reproduced in the source text]
where N_ab denotes the number of occurrences of a given dipeptide ab, N denotes the sequence length of the protein or peptide, and D(a,b) denotes the resulting 400-dimensional feature vector.

6. The hybrid prediction method for virulence factors and antibiotic resistance genes according to claim 5, characterized in that the dipeptide deviation from expected mean (DDE) feature is a combination of three features: the theoretical mean TM, the dipeptide composition DPC and the theoretical variance TV;
the TM feature is calculated as follows:
[formula not reproduced in the source text]
where C_a and C_b denote the number of codons encoding amino acids a and b, respectively, and C_N equals 61, the total number of possible codons excluding the three stop codons;
the TV feature is calculated as follows:
[formula not reproduced in the source text]
where TM denotes the TM feature, TV denotes the TV feature, and N denotes the sequence length of the protein or peptide;
the DDE feature is calculated as follows:
[formula not reproduced in the source text]
where DPC denotes the DPC feature, TM denotes the TM feature, and TV denotes the TV feature.

7. The hybrid prediction method for virulence factors and antibiotic resistance genes according to claim 1, characterized in that the classical machine learning classification models trained with prior feature information in S23 include a random forest classification algorithm, an extremely randomized trees classification algorithm, an XGBoost classification algorithm, a GradientBoosting classification algorithm and an AdaBoost classification algorithm.
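The sequence-based features of claims 5 and 6 can be illustrated with a short sketch. The formulas themselves are not reproduced in this text, so the code assumes the definitions commonly used for these descriptors: AAC as N(a)/N, DPC as N_ab/(N-1), TM as (C_a/C_N)(C_b/C_N), TV as TM(1-TM)/(N-1), and DDE as (DPC-TM)/sqrt(TV). The function names and the alphabetical ordering of amino acids are hypothetical, and the patented formulas may differ in detail.

```python
import math
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]

# Number of codons per amino acid in the standard genetic code (sums to 61).
CODONS = {
    "A": 4, "C": 2, "D": 2, "E": 2, "F": 2, "G": 4, "H": 2, "I": 3,
    "K": 2, "L": 6, "M": 1, "N": 2, "P": 4, "Q": 2, "R": 6, "S": 6,
    "T": 4, "V": 4, "W": 1, "Y": 2,
}
C_N = 61  # sense codons, excluding the three stop codons


def aac(seq):
    """Amino acid composition: 20 frequencies, assumed f(a) = N(a) / N."""
    return [seq.count(a) / len(seq) for a in AMINO_ACIDS]


def dpc(seq):
    """Dipeptide composition: 400 values, assumed D(a,b) = N_ab / (N - 1)."""
    counts = dict.fromkeys(DIPEPTIDES, 0)
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:  # pairs containing non-standard residues are skipped
            counts[pair] += 1
    windows = len(seq) - 1  # requires a sequence of at least two residues
    return [counts[d] / windows for d in DIPEPTIDES]


def dde(seq):
    """Dipeptide deviation from expected mean: 400 values."""
    windows = len(seq) - 1
    features = []
    for dipeptide, freq in zip(DIPEPTIDES, dpc(seq)):
        tm = (CODONS[dipeptide[0]] / C_N) * (CODONS[dipeptide[1]] / C_N)
        tv = tm * (1 - tm) / windows
        features.append((freq - tm) / math.sqrt(tv))
    return features
```

Concatenating `aac(seq)`, `dpc(seq)` and `dde(seq)` would give a 20 + 400 + 400 dimensional vector per protein. Whether the method concatenates them in this way is not stated in the claims.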
CN202210781902.8A 2022-06-30 2022-06-30 A hybrid prediction method for virulence factors and antibiotic resistance genes Active CN115171792B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210781902.8A CN115171792B (en) 2022-06-30 2022-06-30 A hybrid prediction method for virulence factors and antibiotic resistance genes

Publications (2)

Publication Number Publication Date
CN115171792A CN115171792A (en) 2022-10-11
CN115171792B CN115171792B (en) 2025-08-26

Family

ID=83490457

Country Status (1)

Country Link
CN (1) CN115171792B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant