CN115171792B

CN115171792B - A hybrid prediction method for virulence factors and antibiotic resistance genes

Info

Publication number: CN115171792B
Application number: CN202210781902.8A
Authority: CN
Inventors: 彭绍亮; 姬博亚; 皮文定; 刘文娟; 赵雄君
Original assignee: Hunan University
Current assignee: Hunan University
Priority date: 2022-06-30
Filing date: 2022-06-30
Publication date: 2025-08-26
Anticipated expiration: 2042-06-30
Also published as: CN115171792A

Abstract

The invention discloses a hybrid prediction method for virulence factors and antibiotic resistance genes, in the technical field of deep learning and bioinformatics. The method comprises the following steps: S1, obtaining known antibiotic resistance gene sequence data, virulence factor sequence data, and negative sample gene sequence data from databases; S2, calculating multiple core gene features from the gene sequence information and using them to construct a deep learning neural network architecture and a classical ensemble learning architecture; S3, taking the three types of sequence data from S1 as samples and dividing them into a training data set and a test data set; S4, obtaining a new training data set by applying multiple classification methods, constructing a classification model on the new training data set, and obtaining performance evaluation indexes for that model. The hybrid prediction method achieves a good prediction effect and high prediction accuracy.

Description

A hybrid prediction method for virulence factors and antibiotic resistance genes

Technical Field

The invention relates to the technical field of deep learning and bioinformatics, and in particular to a hybrid prediction method for virulence factors and antibiotic resistance genes.

Background

Microorganisms are critical to the internal ecosystems of hosts (e.g., humans, animals, and plants) and to the maintenance of the external environment. In particular, pathogenic microorganisms cause disease by carrying virulence factors (VFs) and antibiotic resistance genes (ARGs), and can even threaten the life of the host. Identifying VFs and ARGs accurately and promptly can effectively guide medical treatment, reduce host morbidity and mortality, and reduce economic losses in animal husbandry, aquaculture, and related fields.

Furthermore, although VFs and ARGs have followed different evolutionary pathways, they share common features: both are necessary for pathogenic bacteria to adapt to and survive in competitive microbial environments. In particular, VFs and ARGs are both frequently transferred between bacteria by horizontal gene transfer (HGT), and both use similar systems (i.e., two-component systems, efflux pumps, cell wall alterations, and porins) to activate or repress the expression of various genes. Pathogens use VFs to cause disease in their hosts, while acquiring or carrying ARGs lets them colonize environments under selective antibiotic pressure. Thus, to understand the causal relationships among microbiome composition, function, and disease, VFs and ARGs must be identified together, and predicting them simultaneously saves pathogen-monitoring time, particularly for on-site detection of epidemic pathogens. However, conventional bioinformatics tools usually predict ARGs or VFs independently; these tools are relatively outdated, and their precision and recall are relatively low. Moreover, conventional prediction methods for VFs and ARGs suffer from a high false negative rate, strong sensitivity to the cut-off threshold, and relatively poor prediction performance, because they can identify only conserved genes. It is therefore necessary to design a hybrid prediction method for virulence factors and antibiotic resistance genes.

Disclosure of Invention

To solve the above problems, the invention aims to provide a hybrid prediction method for virulence factors and antibiotic resistance genes, which addresses the technical problems of the prior art: outdated prediction tools, low prediction precision and recall, and poor overall prediction performance.

In order to achieve the above object, the technical scheme of the present invention is as follows:

The invention provides a hybrid prediction method for virulence factors and antibiotic resistance genes, which comprises the following steps:

S1, obtaining known antibiotic resistance gene sequence data, virulence factor sequence data, and negative sample gene sequence data from databases;

S2, calculating multiple core gene features from the gene sequence information, and using these features to construct a deep learning neural network architecture and a classical ensemble learning architecture, respectively;

S3, taking the three types of sequence data from S1 as samples, randomly drawing samples to form the total data set, and randomly dividing the total data set into five parts, wherein in each division four parts serve as the training data set and the remaining part serves as the test data set;

S4, obtaining a new training data set by applying multiple classification methods, constructing a classification model on the new training data set based on extremely randomized trees (Extra Trees), and obtaining performance evaluation indexes for the classification model.

As one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, S1 specifically comprises the following steps:

S11, obtaining known antibiotic resistance gene sequence data from the databases ARDB, CARD, and Uniprot;

S12, obtaining known virulence factor sequence data from the databases VFDB, PATRIC, Victors, and Uniprot;

S13, obtaining negative sample gene sequence data from the database Uniprot.

As one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, S2 comprises the following specific steps:

S21, using the gene sequence information to calculate similarity features based on alignment scores, simple gene sequence features based on one-hot encoding, features based on gene evolution information, and features based on gene sequence information;

S22, constructing a deep learning network architecture from the alignment-score similarity features and the one-hot-encoded gene sequence features, and training a neural network classification model in an end-to-end manner;

S23, constructing a classical ensemble learning architecture from the features based on gene evolution information and the features based on gene sequence information, and training classical machine learning classification models on this prior feature information.
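The one-hot encoding named in S21 and S22 can be illustrated with a minimal sketch. It is not the patented implementation: it assumes sequences are encoded at the amino acid level over the 20 standard residues, and the function name `one_hot_encode`, the fixed length `max_len`, and the zero-padding rule are assumptions introduced here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
AA_INDEX = {aa: i for i, aa in enumerate(AMINO_ACIDS)}


def one_hot_encode(sequence, max_len):
    """Return a max_len x 20 matrix (list of lists) for a protein sequence.

    Assumptions of this sketch: sequences longer than max_len are truncated,
    shorter ones are zero-padded, and non-standard residues (such as X)
    become all-zero rows.
    """
    matrix = [[0] * len(AMINO_ACIDS) for _ in range(max_len)]
    for pos, residue in enumerate(sequence[:max_len].upper()):
        idx = AA_INDEX.get(residue)
        if idx is not None:
            matrix[pos][idx] = 1
    return matrix


# Four residues (one of them unknown) padded to length 6.
encoded = one_hot_encode("MKXA", max_len=6)
```

A fixed-size matrix of this kind is what a neural network trained end to end would typically take as input alongside the alignment-score vector.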

As one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, calculating the similarity features based on alignment scores in S21 comprises the following specific steps:

The DIAMOND program is selected, and the gene sequences in the training data set are aligned, under sensitive parameters, against the remaining known 12724 ARGs and 30945 VFs, which serve as the comparison data set;

The training data set is de-duplicated against the comparison data set with the CD-HIT program, and the alignment scores are normalized to the interval [0,1];

The bit-score-based similarity feature of each gene sequence in the training data set is converted into a fixed 12724+30945=43669-dimensional feature vector.

As one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, the features based on gene evolution information in S21 consist of three specific features built on the position-specific scoring matrix (PSSM): the PSSM-composition feature, the RPM-PSSM feature, and the AADP-PSSM feature;

The PSSM-composition feature eliminates the variation caused by differing protein sequence lengths by summing and averaging all rows of the original PSSM profile that correspond to each naturally occurring amino acid type, and is defined as follows:

Wherein R_i represents the i-th row of the PSSM-composition feature matrix, r_k represents the k-th row of the normalized PSSM, p_k represents the k-th amino acid in the protein sequence, and a_i represents the i-th of the 20 standard amino acids;
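Under these symbol descriptions, the definition can be written as below. This is a reconstruction assuming the usual PSSM-composition form; L (the sequence length) and δ (an indicator function) are notation introduced here.

$$
R_i = \frac{1}{L} \sum_{k=1}^{L} r_k \,\delta(p_k, a_i), \qquad
\delta(p_k, a_i) =
\begin{cases}
1, & p_k = a_i \\
0, & \text{otherwise}
\end{cases}
\qquad (i = 1, \dots, 20)
$$

Each R_i has 20 entries, so stacking the 20 rows gives a 20 × 20 = 400-dimensional vector whose size does not depend on L.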

The RPM-PSSM feature transforms the original PSSM by setting negative values to 0 while leaving positive values unchanged. The idea comes from the residue probing method, in which each amino acid corresponding to a specific column of the PSSM is treated as a probe. The original PSSM is converted into a 400-dimensional feature vector using the following definition:

Wherein M_i represents the i-th row of the RPM-PSSM feature matrix, m_k represents the k-th row of the PSSM, p_k represents the k-th amino acid in the protein sequence, and a_i represents the i-th of the 20 standard amino acids;
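A form consistent with these symbols is given below as a reconstruction. It assumes the same averaging over the sequence length L as for PSSM-composition; the superscript + (negative entries replaced by 0) and the indicator δ are notation introduced here.

$$
M_i = \frac{1}{L} \sum_{k=1}^{L} m_k^{+} \,\delta(p_k, a_i), \qquad
m_{k,j}^{+} = \max(m_{k,j},\, 0), \qquad (i, j = 1, \dots, 20)
$$

As with PSSM-composition, the 20 rows of 20 entries each yield a 400-dimensional vector.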

The AADP-PSSM feature extends the traditional AAC and DPC concepts to the PSSM. Its first component, AAC-PSSM, is a fixed-length 20-dimensional feature vector obtained by averaging the columns of the original PSSM profile, defined as follows:

Wherein x_j represents the j-th element of the transformed AAC-PSSM feature and reflects the average proportion of amino acid mutations during evolution, and p_i,j represents the entry in row i and column j of the original PSSM;
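Written out, column averaging takes the following form, where L (the sequence length, i.e. the number of PSSM rows) is notation introduced here and the expression is a reconstruction from the description above:

$$
x_j = \frac{1}{L} \sum_{i=1}^{L} p_{i,j}, \qquad j = 1, \dots, 20
$$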

The second component, DPC-PSSM, is a fixed-length 400-dimensional feature vector designed to avoid the loss of information caused by X (unknown residues) in the protein, and is defined as follows:
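A reconstruction assuming the usual DPC-PSSM form is shown below. The symbol y_{i,j} and the sequence length L are notation introduced here; p_{k,i} is the entry in row k and column i of the original PSSM.

$$
y_{i,j} = \frac{1}{L-1} \sum_{k=1}^{L-1} p_{k,i} \; p_{k+1,j}, \qquad 1 \le i, j \le 20
$$

Because the product pairs column scores of consecutive positions rather than the residue letters themselves, a position occupied by X still contributes through its PSSM row.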

By combining the two components, AADP-PSSM becomes a fixed-length 20+400=420-dimensional feature vector.

As one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, the features based on gene sequence information in S21 include the amino acid composition feature, the dipeptide composition feature, and the feature describing the deviation of dipeptides from their expected mean;

The amino acid composition feature represents the frequency of the 20 natural amino acids in the protein sequence, and is calculated as follows:

Wherein N(a) represents the number of occurrences of a specific amino acid a, N represents the sequence length of the protein or peptide, and f(a) represents the resulting 20-dimensional feature vector;
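With these symbols the amino acid composition is a simple ratio (the standard AAC definition, reconstructed here):

$$
f(a) = \frac{N(a)}{N}, \qquad a \in \{A, C, D, \dots, Y\}
$$

For example, in the four-residue peptide ACAC, N(A) = 2 and N = 4, so f(A) = 0.5.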

The dipeptide composition feature represents the frequency of each dipeptide in the protein or polypeptide sequence, and is calculated as follows:

Wherein N_ab denotes the number of occurrences of a given dipeptide ab, N denotes the sequence length of the protein or peptide, and D(a, b) denotes the resulting 400-dimensional feature vector.
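In formula form, assuming the standard DPC definition in which counts are divided by the N − 1 overlapping dipeptide positions:

$$
D(a, b) = \frac{N_{ab}}{N - 1}, \qquad a, b \in \{A, C, D, \dots, Y\}
$$

For the peptide ACAC the overlapping dipeptides are AC, CA, and AC, so D(A, C) = 2/3 and D(C, A) = 1/3.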

As one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, the dipeptide deviation from expected mean (DDE) feature combines three quantities: the theoretical mean (TM), the dipeptide composition (DPC), and the theoretical variance (TV);

The TM feature is calculated as follows:

Wherein C_a and C_b represent the number of codons encoding amino acids a and b, respectively, and C_N equals 61, the total number of possible codons excluding the three stop codons.
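Reconstructed from these symbols, and assuming the standard DDE definition, the theoretical mean of dipeptide ab is the product of the two codon fractions:

$$
TM(a, b) = \frac{C_a}{C_N} \times \frac{C_b}{C_N}
$$

For example, leucine and serine are each encoded by 6 codons, so TM(L, S) = (6/61) × (6/61) = 36/3721 ≈ 0.0097.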

The TV feature is calculated as follows:

Wherein TM represents the TM feature, TV represents the TV feature, and N represents the sequence length of the protein or peptide.
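Assuming the standard DDE definition, the theoretical variance follows from treating each of the N − 1 dipeptide positions as a Bernoulli trial with success probability TM(a, b):

$$
TV(a, b) = \frac{TM(a, b)\,\bigl(1 - TM(a, b)\bigr)}{N - 1}
$$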

The DDE feature is calculated as follows:

Wherein DPC represents the DPC feature, TM represents the TM feature, and TV represents the TV feature.
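Combining the three quantities, again assuming the standard definition, DDE standardizes the observed dipeptide frequency against its theoretical mean and variance:

$$
DDE(a, b) = \frac{DPC(a, b) - TM(a, b)}{\sqrt{TV(a, b)}}
$$

A positive value means the dipeptide occurs more often than codon usage alone would predict, and a negative value means it occurs less often. Evaluating all 20 × 20 pairs gives a 400-dimensional vector.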

As one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, the classical machine learning classification models trained on the prior feature information in S23 include the random forest, Extra Trees (extremely randomized trees), XGBoost, GradientBoosting, and AdaBoost classification algorithms.

As one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, S4 comprises the following steps:

S41, applying a stacking algorithm over multiple classification methods, and taking the prediction scores that the different classification methods produce on the training data as the new training data set;

S42, constructing a classification model based on extremely randomized trees (Extra Trees) from the new training data set, scoring the model on the test data set, repeating the experiment five times, and taking the average of the five results as the performance evaluation index of the model.

As one aspect of the hybrid prediction method for virulence factors and antibiotic resistance genes, S41 specifically comprises the following steps:

S411, integrating multiple base-level classification models through a meta-model;

S412, training the base-level classification models on the whole training data set, with the meta-model using the outputs of the base-level classification models as its training features;

S413, training each base-level classification model separately using 5-fold cross-validation.

By adopting the technical scheme, the invention has the following advantages:

1. The invention provides a hybrid prediction method for virulence factors and antibiotic resistance genes that makes full use of multiple key core gene features and combines the strengths of classical ensemble learning and deep learning to predict potential virulence factors and antibiotic resistance genes efficiently and simultaneously; the method is scientifically sound and its prediction results are highly accurate.

2. The invention can accurately predict virulence factors, drug resistance genes, and negative sample genes (those that are neither virulence factors nor antibiotic resistance genes) simultaneously, and can also predict each class independently, flexibly and accurately. It overcomes the drawbacks of the traditional best-hit method, namely its high false negative rate, its strong sensitivity to the cut-off threshold, and its ability to identify only conserved genes, and achieves a better prediction effect.

3. For novel virulence factors and drug resistance genes, for virulence factors and drug resistance genes in real metagenomic data, and for pseudo virulence factors and drug resistance genes (gene fragments), the precision and recall of the invention are higher than those of earlier conventional prediction tools. By using a computational method that includes both machine learning and deep learning neural networks, the invention yields results that are competitive with the most advanced prediction tools.

Drawings

Other features, objects and advantages of the present invention will become more apparent upon reading of the detailed description of non-limiting embodiments, made with reference to the following drawings.

FIG. 1 is a flow chart of the hybrid prediction method for virulence factors and antibiotic resistance genes of the present invention;

FIG. 2 is a bar chart comparing the results of the hybrid prediction method of the present invention with those of other computational methods for simultaneously predicting virulence factors and antibiotic resistance genes.

Detailed Description

The features and advantages of the present invention are described in detail in the following embodiments, and will be apparent to those skilled in the art from the description, claims, and drawings disclosed in this specification.

Referring to FIG. 1, a hybrid prediction method for virulence factors and antibiotic resistance genes in microbial data comprises the following steps:

S1, obtaining known antibiotic resistance gene sequence data, virulence factor sequence data, and negative sample gene sequence data (sequences that are neither antibiotic resistance genes nor virulence factors) from databases;

S1 comprises the following specific steps:

S13, obtaining negative sample gene sequence data from the database Uniprot.

S2 comprises the following specific steps:

S21, because the multiple core gene features comprise similarity features based on alignment scores, simple gene sequence features based on one-hot encoding, features based on gene evolution information, and features based on gene sequence information, each of these is calculated from the gene sequence information.

The similarity feature based on alignment scores consists of the alignment scores between a candidate gene and the known virulence factors and antibiotic resistance genes. This feature considers the distribution of similarities to sequences across the ARG and VF databases rather than only the best hit. The bit score is used as the similarity index because, unlike the e-value, it reflects the degree of identity between sequences and is independent of database size.

Calculating the similarity features based on alignment scores in S21 comprises the following specific steps:

The DIAMOND program, which is faster than BLAST, is selected, and the gene sequences in the training data set are aligned, under sensitive parameters, against the remaining known 12724 ARGs and 30945 VFs, which serve as the comparison data set;

The training data set is de-duplicated against the comparison data set with the CD-HIT program to avoid possible label leakage, and the alignment scores are normalized to the interval [0,1] to represent sequence similarity in terms of distance;

The bit-score-based similarity feature of each gene sequence in the training data set is converted into a fixed 12724+30945=43669-dimensional feature vector, where each dimension is the alignment score that DIAMOND reports between the full-length gene sequence and one ARG or VF in the comparison data set.
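A minimal sketch of how such a vector could be assembled from alignment output is shown below. It assumes DIAMOND's default 12-column tabular output, in which the query id, subject id, and bit score are the first, second, and last fields. The text does not state the normalization rule, so dividing by the largest score is an assumption of this sketch, as is the function name `bitscore_feature_vector`.

```python
def bitscore_feature_vector(diamond_lines, query_id, reference_ids):
    """Build one normalized bit-score vector for `query_id`.

    diamond_lines: lines of tabular alignment output (12 tab-separated
                   columns: query id first, subject id second, bit score last).
    reference_ids: ordered ids of the comparison sequences (ARGs, then VFs).
    """
    position = {ref: i for i, ref in enumerate(reference_ids)}
    scores = [0.0] * len(reference_ids)
    for line in diamond_lines:
        fields = line.rstrip("\n").split("\t")
        if len(fields) < 12 or fields[0] != query_id:
            continue
        ref, bit_score = fields[1], float(fields[11])
        if ref in position:
            # Keep the best score if a reference is hit more than once.
            scores[position[ref]] = max(scores[position[ref]], bit_score)
    top = max(scores, default=0.0)
    # Assumed normalization: divide by the largest score so values lie in [0, 1].
    return [s / top for s in scores] if top > 0 else scores


# Toy input with three reference sequences instead of 43669.
hits = [
    "q1\tARG_2\t90.0\t100\t10\t0\t1\t100\t1\t100\t1e-50\t200.0",
    "q1\tVF_1\t40.0\t80\t48\t0\t1\t80\t5\t84\t1e-5\t50.0",
    "q2\tARG_1\t99.0\t120\t1\t0\t1\t120\t1\t120\t1e-70\t240.0",
]
vector = bitscore_feature_vector(hits, "q1", ["ARG_1", "ARG_2", "VF_1"])
# vector == [0.0, 1.0, 0.25]
```

References with no reported alignment keep a score of 0, so every query yields a vector of the same length regardless of how many hits it has.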

The features based on gene evolution information consist of three specific features built on the position-specific scoring matrix (PSSM): the PSSM-composition feature, the RPM-PSSM feature, and the AADP-PSSM feature. The PSSM-composition feature eliminates the variation caused by differing protein sequence lengths by summing and averaging all rows of the original PSSM profile that correspond to each naturally occurring amino acid type, and is defined as follows:

Wherein R_i represents the i-th row of the PSSM-composition feature matrix, r_k represents the k-th row of the normalized PSSM, p_k represents the k-th amino acid in the protein sequence, and a_i represents the i-th of the 20 standard amino acids.

The RPM-PSSM feature transforms the original PSSM by setting negative values to 0 while leaving positive values unchanged. The idea derives from the residue probing method, in which each amino acid corresponding to a specific column of the PSSM is regarded as a probe. Finally, the original PSSM is converted into a 400-dimensional feature vector using the following definition:

Wherein M_i represents the i-th row of the RPM-PSSM feature matrix, m_k represents the k-th row of the PSSM, p_k represents the k-th amino acid in the protein sequence, and a_i represents the i-th of the 20 standard amino acids.
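The PSSM-composition and RPM-PSSM features differ only in whether negative scores are set to 0 first, so one function can sketch both. The code below is illustrative rather than the patented implementation: it assumes the PSSM is supplied as one 20-value row per residue, that any normalization has already been applied by the caller, and that both features average over the full sequence length. The name `pssm_row_average` is hypothetical.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def pssm_row_average(sequence, pssm, clip_negative=False):
    """Return a 400-value vector (20 amino acid types x 20 PSSM columns).

    Block i holds the sum of the PSSM rows whose residue is amino acid i,
    divided by the full sequence length. With clip_negative=True, negative
    scores are set to 0 first (the RPM-PSSM variant).
    """
    length = len(sequence)
    if length == 0 or length != len(pssm):
        raise ValueError("sequence and PSSM must be non-empty and equally long")
    totals = {aa: [0.0] * 20 for aa in AMINO_ACIDS}
    for residue, row in zip(sequence, pssm):
        if residue not in totals:
            continue  # Non-standard residues contribute no row in this sketch.
        for j, value in enumerate(row):
            if clip_negative and value < 0:
                value = 0.0
            totals[residue][j] += value
    return [v / length for aa in AMINO_ACIDS for v in totals[aa]]


# Toy three-residue protein with constant PSSM rows.
seq = "AAC"
toy_pssm = [[1.0] * 20, [3.0] * 20, [-2.0] * 20]
composition = pssm_row_average(seq, toy_pssm)
rpm = pssm_row_average(seq, toy_pssm, clip_negative=True)
```

In the toy example the block for A averages to 4/3 in both variants, while the block for C is -2/3 for PSSM-composition and 0 for RPM-PSSM.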

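The AADP-PSSM feature extends the traditional AAC and DPC concepts to the PSSM. First, AAC-PSSM is a fixed-length 20-dimensional feature vector obtained by averaging the columns of the original PSSM profile, wherein x_j represents the j-th element of the transformed AAC-PSSM feature and reflects the average proportion of amino acid mutations during evolution, and p_i,j represents the entry in row i and column j of the original PSSM. Second, DPC-PSSM is a fixed-length 400-dimensional feature vector designed to avoid the loss of information caused by X (unknown residues) in the protein, and is defined as follows: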
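The two components and their concatenation can be sketched as follows. The code assumes the usual AAC-PSSM and DPC-PSSM definitions (column means, and means of products of column scores at consecutive positions) and a PSSM given as one 20-value row per residue; the function names are hypothetical.

```python
def aac_pssm(pssm):
    """20 values: the mean of each PSSM column over all sequence positions."""
    length = len(pssm)
    return [sum(row[j] for row in pssm) / length for j in range(20)]


def dpc_pssm(pssm):
    """400 values: for each column pair (i, j), the mean over consecutive
    positions k, k+1 of pssm[k][i] * pssm[k+1][j]."""
    length = len(pssm)
    if length < 2:
        raise ValueError("DPC-PSSM needs at least two positions")
    return [
        sum(pssm[k][i] * pssm[k + 1][j] for k in range(length - 1)) / (length - 1)
        for i in range(20)
        for j in range(20)
    ]


def aadp_pssm(pssm):
    """420 values: AAC-PSSM followed by DPC-PSSM."""
    return aac_pssm(pssm) + dpc_pssm(pssm)


# Toy PSSM with three positions and constant rows.
toy_pssm = [[1.0] * 20, [2.0] * 20, [3.0] * 20]
features = aadp_pssm(toy_pssm)
```

For the toy matrix every AAC-PSSM entry is (1 + 2 + 3) / 3 = 2 and every DPC-PSSM entry is (1·2 + 2·3) / 2 = 4.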

Features based on gene sequence information include the amino acid composition feature (AAC), the dipeptide composition feature (DPC), the dipeptide deviation from expected mean feature (DDE), the pseudo amino acid composition feature (PAAC), and the quasi-sequence-order feature (QSO).

The amino acid composition feature (AAC) represents the frequency of the 20 natural amino acids (i.e., ACDEFGHIKLMNPQRSTVWY) in the protein sequence, and can be calculated as:

Wherein N(a) represents the number of occurrences of a specific amino acid a, N represents the sequence length of the protein or peptide, and f(a) represents the resulting 20-dimensional feature vector.

The DPC feature represents the frequency of each dipeptide in the protein or polypeptide sequence, and can be calculated as:

Wherein N_ab denotes the number of occurrences of a given dipeptide ab, N denotes the sequence length of the protein or peptide, and D(a, b) denotes the resulting 400-dimensional feature vector.
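Both composition features reduce to counting, as the following sketch shows. It assumes the standard definitions, with AAC counts divided by N and DPC counts divided by the N − 1 overlapping dipeptide positions; the function names are illustrative.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def aac(sequence):
    """20 values: frequency of each standard amino acid in the sequence."""
    n = len(sequence)
    counts = Counter(sequence)
    return [counts[a] / n for a in AMINO_ACIDS]


def dpc(sequence):
    """400 values: frequency of each dipeptide among the n - 1 overlapping pairs."""
    n = len(sequence)
    if n < 2:
        raise ValueError("DPC needs at least two residues")
    counts = Counter(sequence[i:i + 2] for i in range(n - 1))
    return [counts[a + b] / (n - 1) for a in AMINO_ACIDS for b in AMINO_ACIDS]


aac_vector = aac("ACAC")  # A and C each have frequency 0.5
dpc_vector = dpc("ACAC")  # pairs AC, CA, AC: AC -> 2/3, CA -> 1/3
```

The dipeptide ab sits at index 20 × index(a) + index(b), so AC is entry 1 and CA is entry 20 of the 400-value vector.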

The DDE feature combines three quantities: the theoretical mean (TM), the dipeptide composition (DPC), and the theoretical variance (TV). Specifically, the TM feature is calculated as follows:

Wherein C_a and C_b represent the number of codons encoding amino acids a and b, respectively, and C_N equals 61, the total number of possible codons excluding the three stop codons.

The TV feature is calculated as follows:

Wherein TM represents the TM feature, calculated as described above, and N represents the sequence length of the protein or peptide.

The DDE feature is calculated as follows:

Wherein DPC, TM, and TV represent the DPC, TM, and TV features, each calculated as described above.
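The three-step DDE calculation can be sketched as follows, assuming the standard definition: TM is the product of the two codon fractions, TV = TM(1 − TM)/(N − 1), and DDE = (DPC − TM)/√TV. The codon counts come from the standard genetic code; the function name `dde` is illustrative.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
TOTAL_CODONS = 61  # 64 codons minus the three stop codons

# Number of codons encoding each amino acid in the standard genetic code.
CODON_COUNT = {
    "A": 4, "C": 2, "D": 2, "E": 2, "F": 2, "G": 4, "H": 2, "I": 3, "K": 2, "L": 6,
    "M": 1, "N": 2, "P": 4, "Q": 2, "R": 6, "S": 6, "T": 4, "V": 4, "W": 1, "Y": 2,
}


def dde(sequence):
    """400 values: standardized deviation of each dipeptide frequency
    from its codon-based theoretical mean."""
    n = len(sequence)
    if n < 2:
        raise ValueError("DDE needs at least two residues")
    pair_counts = Counter(sequence[i:i + 2] for i in range(n - 1))
    features = []
    for a in AMINO_ACIDS:
        for b in AMINO_ACIDS:
            dpc = pair_counts[a + b] / (n - 1)              # observed frequency
            tm = (CODON_COUNT[a] / TOTAL_CODONS) * (CODON_COUNT[b] / TOTAL_CODONS)
            tv = tm * (1 - tm) / (n - 1)                    # theoretical variance
            features.append((dpc - tm) / math.sqrt(tv))
    return features


dde_vector = dde("ACAC")
```

Dipeptides that never occur in the sequence, such as AA in this toy input, receive negative values, while over-represented ones such as AC receive positive values.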

The classical machine learning classification models trained on the prior feature information in S23 include the random forest (Random Forest), extremely randomized trees (Extra Trees), XGBoost, GradientBoosting, and AdaBoost classification algorithms.

S3, taking the known antibiotic resistance gene data from S1 as the first class of samples, the known virulence factor sequence data as the second class, and the known negative sample gene sequence data as the third class; randomly drawing samples from the three classes to form the total data set; and randomly dividing the total data set into five parts, of which four serve as the training data set and the remaining one serves as the test data set;
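The five-part division in S3 can be sketched as follows. This is a plain random split written with the standard library only; the text does not say whether the split is stratified by class, so none is applied, and the function name and `seed` parameter are assumptions of this sketch.

```python
import random


def five_fold_splits(n_samples, n_folds=5, seed=0):
    """Yield (train_indices, test_indices) pairs.

    The shuffled sample indices are dealt into n_folds parts; each part serves
    once as the test set while the other parts form the training set.
    """
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    folds = [indices[i::n_folds] for i in range(n_folds)]
    for held_out in range(n_folds):
        test = folds[held_out]
        train = [idx for f, fold in enumerate(folds) if f != held_out for idx in fold]
        yield train, test


splits = list(five_fold_splits(10))
```

With 10 samples, each of the five splits has 8 training and 2 test indices, and every sample appears in exactly one test set.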

S4 specifically comprises the following steps:

To obtain excellent predictive performance for virulence factors and antibiotic resistance genes, classical machine learning methods and deep learning are combined in a single stacking algorithm.

S41 specifically comprises the following steps:

S413, training each base-level classification model separately using 5-fold cross-validation, to counter overfitting in the final prediction. In a specific embodiment, the stacking algorithm of the invention is given as pseudo code in Table 1 below.

TABLE 1. Pseudo code of the stacking algorithm
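The core of a stacking procedure of this kind is generating out-of-fold prediction scores from each base model, which then become the training features of the meta-model. The sketch below illustrates that step only and is not the pseudo code of Table 1. The nearest-centroid base learner is a toy stand-in for the real base models, and all function names are hypothetical.

```python
import random


def make_centroid_learner(columns, classes):
    """Toy base learner: scores each class by closeness to its centroid,
    using only the given feature columns."""
    def fit(X_train, y_train):
        centroids = {}
        for c in classes:
            rows = [[x[j] for j in columns] for x, label in zip(X_train, y_train) if label == c]
            centroids[c] = [sum(col) / len(rows) for col in zip(*rows)] if rows else None

        def predict_scores(x):
            scores = []
            for c in classes:
                if centroids[c] is None:
                    scores.append(0.0)
                else:
                    d2 = sum((x[j] - m) ** 2 for j, m in zip(columns, centroids[c]))
                    scores.append(1.0 / (1.0 + d2))
            return scores
        return predict_scores
    return fit


def out_of_fold_meta_features(X, y, base_learners, n_folds=5, seed=0):
    """For every sample, collect the scores of each base learner trained on
    the folds that do NOT contain that sample."""
    order = list(range(len(X)))
    random.Random(seed).shuffle(order)
    folds = [order[i::n_folds] for i in range(n_folds)]
    meta = [[] for _ in X]
    for fit in base_learners:
        for held_out in folds:
            held = set(held_out)
            train_idx = [i for i in order if i not in held]
            model = fit([X[i] for i in train_idx], [y[i] for i in train_idx])
            for i in held_out:
                meta[i].extend(model(X[i]))
    return meta


X = [[0.0, 0.1], [0.1, 0.0], [0.2, 0.1], [0.0, 0.2], [0.1, 0.1],
     [1.0, 1.1], [1.1, 1.0], [0.9, 1.0], [1.0, 0.9], [1.2, 1.1]]
y = ["ARG"] * 5 + ["VF"] * 5
classes = ["ARG", "VF"]
learners = [make_centroid_learner([0], classes), make_centroid_learner([1], classes)]
meta_features = out_of_fold_meta_features(X, y, learners)
```

Each row of `meta_features` holds two scores per base learner. In the method described here, rows of this kind would be passed to an Extra Trees meta-classifier; because every score comes from a model that never saw the sample, the meta-model is less prone to overfitting.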

Example II

To better illustrate the effect of the prediction method of the present invention, a strict procedure with a cross-validation step was adopted so that the effectiveness of the invention could be evaluated without bias. Table 2 lists the results of the hybrid prediction of virulence factors and drug resistance genes under five-fold cross-validation in this example:

TABLE 2. Results of simultaneous prediction of virulence factors and drug resistance genes by the invention under five-fold cross-validation

In Table 2, VFs denotes virulence factors, ARGs denotes drug resistance genes, and NSs denotes negative sample genes; precision, recall, and F1 score are reported for each class together with their micro-average.

As can be seen from Table 2, this embodiment obtains high evaluation scores across the repeated cross-validation experiments. This result indicates that the invention not only achieves simultaneous prediction of virulence factors and antibiotic resistance genes, but also performs excellently in terms of precision and recall.
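The per-class precision, recall, and F1 score and their micro-average can be computed as in the sketch below. The labels and predictions are made-up toy values, not results from Table 2, and the function names are illustrative.

```python
def _prf(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def per_class_scores(y_true, y_pred, classes):
    """Return {class: (precision, recall, f1)} plus a 'micro-average' entry
    computed from the counts pooled over all classes."""
    report = {}
    total_tp = total_fp = total_fn = 0
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        report[c] = _prf(tp, fp, fn)
        total_tp, total_fp, total_fn = total_tp + tp, total_fp + fp, total_fn + fn
    report["micro-average"] = _prf(total_tp, total_fp, total_fn)
    return report


# Toy labels: six genes, two of each class.
y_true = ["ARG", "ARG", "VF", "VF", "NS", "NS"]
y_pred = ["ARG", "VF", "VF", "VF", "NS", "ARG"]
report = per_class_scores(y_true, y_pred, ["VF", "ARG", "NS"])
```

In a five-fold evaluation, a report like this would be produced for each fold and the five values of every metric averaged.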

Example III

To test the ability of the invention to predict unknown virulence factors (VFs), drug resistance genes (ARGs), and negative sample genes (NSs), an independent data set comprising 209 ARGs, 209 VFs, and 209 NSs was constructed. Notably, these unknown genes are completely independent of the genes in the training data set: all identical or duplicate sequences were removed by setting the CD-HIT identity threshold to 100%. In addition, the currently available VRprofile model (the latest computational model) was introduced as a comparison method, and the traditional "best hit" method, run with the Diamond sequence alignment tool, served as a baseline. Table 3 lists the results of simultaneously predicting unknown virulence factors and drug resistance genes for, in order, the embodiment of the invention (HyperVR), VRprofile, and the Diamond-based baseline under three different parameter settings:

TABLE 3. Results of simultaneous prediction of unknown virulence factors and drug resistance genes by the embodiment of the invention, the VRprofile model, and the baseline comparison method

In Table 3, VFs denotes virulence factors, ARGs denotes drug resistance genes, and NSs denotes negative sample genes; precision, recall, and F1 score are reported for each class together with their micro-average.

Table 3 shows that, compared with the VRprofile model and with the Diamond-based baseline under three different parameter settings, the invention (HyperVR) obtains the highest evaluation scores and outperforms the other comparison methods in terms of precision and recall.
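The threshold sensitivity of a best-hit baseline can be illustrated with a small sketch. It assumes, purely for illustration, that the three parameter settings are minimum percent-identity cut-offs; the text does not define them. The function name `best_hit_label` and the toy hits are hypothetical.

```python
def best_hit_label(hits, min_identity):
    """Assign the label of the highest-scoring hit that passes the identity
    cut-off, or 'NS' (negative sample) if no hit passes.

    hits: list of (reference_label, percent_identity, bit_score) tuples.
    """
    passing = [h for h in hits if h[1] >= min_identity]
    if not passing:
        return "NS"
    return max(passing, key=lambda h: h[2])[0]


# One query with a moderately similar ARG hit and a weak VF hit.
query_hits = [("ARG", 70.0, 150.0), ("VF", 30.0, 60.0)]
labels = {cutoff: best_hit_label(query_hits, cutoff) for cutoff in (81.0, 64.0, 21.0)}
```

Under the strictest cut-off the query is labeled NS, while the two looser cut-offs label it ARG. A divergent but genuine resistance gene is therefore missed at a strict threshold, which is the false-negative behavior attributed to best-hit methods.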

FIG. 2 shows bar charts comparing the invention (HyperVR) with the VRprofile model (the latest computational model) and the baseline comparison method (under three different parameter settings) for simultaneous prediction of unknown virulence factors and drug resistance genes. In FIG. 2, bar a represents the F1 score, bar b represents recall, and bar c represents precision; Diamond-81%, Diamond-64%, and Diamond-21% denote the best-hit baseline run with the Diamond sequence alignment tool under the three parameter settings. The height of each bar reflects the prediction performance of the corresponding method. The comparison in FIG. 2 shows that the embodiment of the invention (HyperVR) has higher prediction performance than both the latest computational model (VRprofile) and the baseline comparison method under all three parameter settings, with better overall performance and the best prediction effect among the compared methods.

Finally, it should be noted that although the invention has been described with reference to specific embodiments, those skilled in the art will understand that the above embodiments are provided for illustration only and do not limit the invention. Various equivalent changes or substitutions may be made without departing from the spirit of the invention; therefore, all changes and modifications to the above embodiments fall within the scope of the appended claims.

Claims

1. A hybrid prediction method for virulence factors and antibiotic resistance genes, characterized by comprising the following steps:

S1. Obtain known antibiotic resistance gene sequence data, virulence factor sequence data, and negative sample gene sequence data from the database respectively;

S2. Calculate multiple core gene features using gene sequence information and construct deep learning neural network architectures and classical ensemble learning architectures based on these core gene features.

The S2 includes the following specific steps:

S21. Using gene sequence information, calculate similarity features based on alignment scores, simple gene sequence features based on one-hot encoding, features based on gene evolution information, and features based on gene sequence information;

S22. Build a deep learning network architecture using similarity features based on alignment scores and simple gene sequence features based on one-hot encoding to train a neural network classification model in an end-to-end manner.

S23. Build a classic ensemble learning architecture using features based on gene evolution information and features based on gene sequence information, and train a classic machine learning classification model using prior feature information.

S3. Take the three types of sequence data in S1 as samples, randomly select them as the total data set, and randomly divide them into five parts. In each division, four parts are used as training data sets, and the remaining part is used as test data sets.

S4. Use multiple classification methods to obtain a new training data set; construct a classification model based on extremely randomized trees (Extra Trees) for the new training data set, and obtain performance evaluation indicators of the classification model;

The S4 comprises the following steps:

S41. Apply a stacking algorithm over multiple classification methods, using the prediction scores of the different classification methods on the training data as a new training dataset. To achieve superior prediction performance for virulence factors and antibiotic resistance genes, the strengths of classic machine learning methods and deep learning are combined in a single stacking algorithm.

The S41 specifically includes the following steps:

S411. Multiple base-level classification models are integrated through a meta-model;

S412. The base-level classification model is trained using the entire training dataset, and the meta-model uses the output of the base-level classification model as a training feature;

S413. Use 5-fold cross validation method to train the base-level classification model separately;

S42. Build a classification model based on extremely randomized trees (Extra Trees) using the new training dataset, score the model using the test dataset, repeat the experiment five times, and take the average result of the five experiments as the performance evaluation indicator of the model.

2. The hybrid prediction method for virulence factors and antibiotic resistance genes according to claim 1, wherein S1 specifically comprises the following steps:

S11. Obtain known antibiotic resistance gene sequence data from the ARDB, CARD, and Uniprot databases;

S12. Obtain known virulence factor sequence data from the databases VFDB, PATRIC, Victors, and Uniprot;

S13. Obtain negative sample gene sequence data from the Uniprot database.

3. The hybrid prediction method for virulence factors and antibiotic resistance genes according to claim 1, wherein the step of calculating the similarity features based on alignment scores in S21 comprises the following specific steps:

The DIAMOND program was selected to align the gene sequences in the training dataset with the remaining known 12,724 ARGs and 30,945 VFs for comparison under sensitive parameters;

The training dataset has been deduplicated with the dataset used for comparison using the CD-HIT program, and the comparison scores have been normalized to the interval [0,1].

The bit score-based similarity feature of each gene sequence in the training dataset is converted into a fixed 12724+30945=43669-dimensional feature vector.

4. The hybrid prediction method for virulence factors and antibiotic resistance genes according to claim 1, wherein the features based on gene evolution information in S21 consist of three features derived from the position-specific scoring matrix (PSSM), namely the PSSM-composition feature, the RPM-PSSM feature and the AADP-PSSM feature;

The PSSM-composition feature is defined as follows:

Where R_i represents the i-th row of the PSSM-composition feature matrix, r_k represents the k-th row of the normalized PSSM, p_k represents the k-th amino acid in the protein sequence, and a_i represents the i-th of the 20 standard amino acids;
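These symbol definitions are consistent with the usual row-summing form of PSSM-composition, sketched below. The sequence length L and the indicator δ_k are notation introduced here for illustration, and whether the sum is further divided by L is left open as an assumption.

```latex
R_i = \sum_{k=1}^{L} r_k \, \delta_k ,
\qquad
\delta_k =
\begin{cases}
1, & p_k = a_i \\
0, & \text{otherwise}
\end{cases}
\qquad (i = 1, \dots, 20)
```

Since each r_k has 20 entries, stacking the 20 rows R_i gives a 20 × 20 matrix, that is, 400 values per sequence.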

The RPM-PSSM feature transforms the original PSSM by filtering negative values to 0 while leaving positive values unchanged. The idea of the RPM-PSSM feature comes from the residue probe method, that is, each amino acid corresponding to a specific column in the PSSM is regarded as a probe. The original PSSM is converted into a 400-dimensional feature vector, which is defined as follows:

Where M_i represents the i-th row of the RPM-PSSM feature matrix, m_k represents the k-th row of the PSSM, p_k represents the k-th amino acid in the protein sequence, and a_i represents the i-th of the 20 standard amino acids;
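A minimal sketch of the RPM-PSSM transformation described above. It assumes the PSSM is supplied as one row of 20 scores per residue with columns in the PSI-BLAST order, and that the summed rows are divided by the sequence length. The text does not settle that normalization, so it is an assumption, and the function name `rpm_pssm` is illustrative.

```python
from typing import List

# Assumed column order of the PSSM (the order used in PSI-BLAST ASCII output).
AMINO_ACIDS = "ARNDCQEGHILKMFPSTWYV"


def rpm_pssm(sequence: str, pssm: List[List[float]]) -> List[float]:
    """Sum negative-clipped PSSM rows per residue type; return 400 values."""
    if not sequence or len(sequence) != len(pssm):
        raise ValueError("PSSM must have one row per residue")
    index = {aa: i for i, aa in enumerate(AMINO_ACIDS)}
    matrix = [[0.0] * 20 for _ in range(20)]
    for residue, row in zip(sequence, pssm):
        i = index.get(residue)
        if i is None:  # skip non-standard residues such as X
            continue
        for j, score in enumerate(row):
            matrix[i][j] += max(score, 0.0)  # negative scores filtered to 0
    length = len(sequence)
    # Length normalization is an assumption, not stated in the text.
    return [value / length for matrix_row in matrix for value in matrix_row]
```

Row i of the intermediate matrix collects the clipped PSSM rows of every position whose residue is amino acid a_i, which matches the roles of M_i, m_k, p_k and a_i given above.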

The AADP-PSSM feature extends the traditional AAC and DPC concepts to PSSM. AAC-PSSM converts the columns of the original PSSM profile into a fixed-length 20-dimensional feature vector, which is defined as follows:

Where x_j represents the j-th element of the AAC-PSSM feature vector, i.e., the average score for mutation to the j-th amino acid during evolution, and p_{i,j} represents the element in row i and column j of the original PSSM;

DPC-PSSM converts the original PSSM into a fixed-length 400-dimensional feature vector, which avoids the information loss caused by the unknown residue X in the protein sequence, and is defined as follows:

By combining these two components, AADP-PSSM yields a fixed-length feature vector of 20 + 400 = 420 dimensions.
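A minimal sketch of the 20 + 400 = 420 concatenation, assuming the commonly used definitions: AAC-PSSM as the column means x_j = (1/L) Σ_i p_{i,j}, and DPC-PSSM as y_{i,j} = (1/(L−1)) Σ_k p_{k,i} · p_{k+1,j}. These formulas, the function names, and the use of an unnormalized PSSM are assumptions made for illustration.

```python
from typing import List


def aac_pssm(pssm: List[List[float]]) -> List[float]:
    """Column means of the PSSM: one value per amino acid type (20 values)."""
    length = len(pssm)
    return [sum(row[j] for row in pssm) / length for j in range(20)]


def dpc_pssm(pssm: List[List[float]]) -> List[float]:
    """Mean product of column i at position k and column j at position k+1."""
    length = len(pssm)
    return [
        sum(pssm[k][i] * pssm[k + 1][j] for k in range(length - 1)) / (length - 1)
        for i in range(20)
        for j in range(20)
    ]


def aadp_pssm(pssm: List[List[float]]) -> List[float]:
    """Concatenate AAC-PSSM (20) and DPC-PSSM (400) into 420 values."""
    if len(pssm) < 2:
        raise ValueError("at least two PSSM rows are needed for DPC-PSSM")
    return aac_pssm(pssm) + dpc_pssm(pssm)
```

Both components are computed from PSSM columns only, so an unknown residue X in the sequence does not remove any term from the sums.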

5. The hybrid prediction method for virulence factors and antibiotic resistance genes according to claim 1, wherein the features based on gene sequence information in S21 include the amino acid composition (AAC) feature, the dipeptide composition (DPC) feature, and the dipeptide deviation from expected mean (DDE) feature;

The amino acid composition feature represents the frequency of 20 natural amino acids in the protein sequence, and the calculation formula is as follows:

Where N(a) represents the number of occurrences of amino acid a, N represents the sequence length of the protein or peptide, and f(a) represents the resulting 20-dimensional feature vector;

The dipeptide composition feature represents the frequency of dipeptides in a protein or polypeptide sequence, and is calculated as follows:

Where N_ab represents the number of occurrences of the dipeptide ab, N represents the sequence length of the protein or peptide, and D(a,b) represents the resulting 400-dimensional feature vector.
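A minimal sketch of both composition features, assuming the standard definitions f(a) = N(a)/N and D(a,b) = N_ab/(N−1), where N−1 is the number of overlapping dipeptides in a sequence of length N. The alphabetical amino acid ordering and the function names are illustrative choices.

```python
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # 20 standard amino acids, alphabetical


def amino_acid_composition(sequence: str) -> List[float]:
    """f(a) = N(a) / N for each of the 20 standard amino acids."""
    n = len(sequence)
    if n == 0:
        raise ValueError("sequence must not be empty")
    return [sequence.count(aa) / n for aa in AMINO_ACIDS]


def dipeptide_composition(sequence: str) -> List[float]:
    """D(a,b) = N_ab / (N - 1) for each of the 400 ordered dipeptides."""
    n = len(sequence)
    if n < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = {a + b: 0 for a in AMINO_ACIDS for b in AMINO_ACIDS}
    for k in range(n - 1):
        pair = sequence[k:k + 2]
        if pair in counts:  # pairs with non-standard residues are ignored
            counts[pair] += 1
    return [counts[a + b] / (n - 1) for a in AMINO_ACIDS for b in AMINO_ACIDS]
```

For the sequence `ACAC`, the dipeptides are AC, CA and AC, so D(A,C) is 2/3 and D(C,A) is 1/3.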

6. The hybrid prediction method for virulence factors and antibiotic resistance genes according to claim 5, wherein the dipeptide deviation from expected mean (DDE) feature is a combination of three quantities: the theoretical mean (TM), the dipeptide composition (DPC) and the theoretical variance (TV);

The TM feature is calculated as follows:

Where C_a and C_b represent the numbers of codons encoding amino acids a and b, respectively, and C_N equals 61, the total number of possible codons excluding the three stop codons;

The TV feature is calculated as follows:

Where TM represents the TM feature, TV represents the TV feature, and N represents the sequence length of the protein or peptide;

The DDE feature is calculated as follows:

Where DPC represents the DPC feature, TM represents the TM feature, and TV represents the TV feature.
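A minimal sketch of the DDE calculation, assuming the standard definitions TM(a,b) = (C_a/C_N)(C_b/C_N), TV(a,b) = TM(a,b)(1 − TM(a,b))/(N − 1) and DDE(a,b) = (DPC(a,b) − TM(a,b))/sqrt(TV(a,b)), with DPC(a,b) = N_ab/(N − 1). The codon counts come from the standard genetic code; the function name and the alphabetical ordering are illustrative.

```python
import math
from typing import List

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Number of codons encoding each amino acid in the standard genetic code.
CODONS = {
    "A": 4, "C": 2, "D": 2, "E": 2, "F": 2, "G": 4, "H": 2, "I": 3,
    "K": 2, "L": 6, "M": 1, "N": 2, "P": 4, "Q": 2, "R": 6, "S": 6,
    "T": 4, "V": 4, "W": 1, "Y": 2,
}
C_N = 61  # total codons excluding the three stop codons


def dde(sequence: str) -> List[float]:
    """Dipeptide deviation from expected mean: 400 values."""
    n = len(sequence)
    if n < 2:
        raise ValueError("sequence must contain at least two residues")
    counts = {}
    for k in range(n - 1):
        pair = sequence[k:k + 2]
        counts[pair] = counts.get(pair, 0) + 1
    features = []
    for a in AMINO_ACIDS:
        for b in AMINO_ACIDS:
            dpc = counts.get(a + b, 0) / (n - 1)          # observed frequency
            tm = (CODONS[a] / C_N) * (CODONS[b] / C_N)    # theoretical mean
            tv = tm * (1 - tm) / (n - 1)                  # theoretical variance
            features.append((dpc - tm) / math.sqrt(tv))
    return features
```

Because TM lies strictly between 0 and 1 for every dipeptide, TV is always positive and the square root is well defined; dipeptides absent from the sequence receive a negative DDE value.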

7. The hybrid prediction method for virulence factors and antibiotic resistance genes according to claim 1, wherein the classical machine learning classification models trained on the prior feature information in S23 include a random forest classifier, an extremely randomized trees (Extra-Trees) classifier, an XGBoost classifier, a gradient boosting classifier and an AdaBoost classifier.