CN113764045B - DNA-binding protein identification method and related products based on XGboost algorithm - Google Patents
- Publication number
- CN113764045B
- Authority
- CN
- China
- Prior art keywords
- dna
- binding protein
- algorithm
- feature
- protein identification
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B40/00—ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/213—Feature extraction, e.g. by transforming the feature space; Summarisation; Mappings, e.g. subspace methods
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/21—Design or setup of recognition systems or techniques; Extraction of features in feature space; Blind source separation
- G06F18/214—Generating training patterns; Bootstrap methods, e.g. bagging or boosting
- G06F18/2148—Generating training patterns; Bootstrap methods, e.g. bagging or boosting characterised by the process organisation or structure, e.g. boosting cascade
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F18/00—Pattern recognition
- G06F18/20—Analysing
- G06F18/24—Classification techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
- G06N20/20—Ensemble learning
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B20/00—ICT specially adapted for functional genomics or proteomics, e.g. genotype-phenotype associations
-
- G—PHYSICS
- G16—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
- G16B—BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
- G16B30/00—ICT specially adapted for sequence analysis involving nucleotides or amino acids
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Life Sciences & Earth Sciences (AREA)
- Data Mining & Analysis (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Evolutionary Biology (AREA)
- Bioinformatics & Computational Biology (AREA)
- Evolutionary Computation (AREA)
- Artificial Intelligence (AREA)
- Health & Medical Sciences (AREA)
- Medical Informatics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Biophysics (AREA)
- Software Systems (AREA)
- Spectroscopy & Molecular Physics (AREA)
- General Health & Medical Sciences (AREA)
- Biotechnology (AREA)
- Proteomics, Peptides & Aminoacids (AREA)
- Chemical & Material Sciences (AREA)
- Analytical Chemistry (AREA)
- Molecular Biology (AREA)
- Genetics & Genomics (AREA)
- Bioethics (AREA)
- Databases & Information Systems (AREA)
- Epidemiology (AREA)
- Public Health (AREA)
- Computing Systems (AREA)
- Mathematical Physics (AREA)
- Investigating Or Analysing Biological Materials (AREA)
Abstract
A DNA-binding protein identification method, system, storage medium and device based on the XGBoost algorithm, belonging to the technical field that combines computing with protein identification. The invention addresses the problem that existing DNA-binding protein identification methods cannot achieve both generality and identification accuracy. The invention uses a DNA-binding protein identification classifier to identify the protein to be identified. To determine the classifier, a processed DNA-binding protein feature dataset is first obtained; different extraction algorithms are used to extract features from the DNA-binding protein dataset, yielding multiple feature files; the sequence feature matrices produced by the different feature extraction algorithms are concatenated into a single feature matrix; the resulting feature matrix is normalized and then reduced in dimension with the MRMD algorithm; finally, the XGBoost algorithm is used to build and train the DNA-binding protein identification classifier model. The invention is mainly used for identifying DNA-binding proteins.
Description
Technical field
The invention belongs to the technical field that combines computing with protein identification, and in particular relates to a DNA-binding protein identification method, system, storage medium and device.
Background
Organisms contain many macromolecules, such as DNA and proteins, which carry the organism's genetic information and are essential components of all of its cells and tissues. To study the life activities of cells, it is necessary to study DNA, proteins and the interactions between them. Research on DNA-binding proteins plays an important role in the study of DNA replication and recombination, viral infection and proliferation. Studying an organism's gene expression at the molecular level requires studying the binding of DNA and proteins. Accurate identification of DNA-binding proteins is therefore a prerequisite for further research on the life activities of cells.
Current identification methods are usually either single or complex: they either pair one feature extraction method with one trained model, or rely on a more complex algorithm such as a convolutional neural network. Different feature extraction methods emphasize different aspects, and their identification results differ accordingly. Existing DNA-binding protein identification methods therefore cannot achieve both generality and identification accuracy.
Summary of the invention
To solve the problem that existing DNA-binding protein identification methods cannot achieve both generality and identification accuracy, the invention proposes a DNA-binding protein identification method based on the XGBoost algorithm, which identifies DNA-binding proteins more accurately while also improving generality.
The DNA-binding protein identification method based on the XGBoost algorithm uses a DNA-binding protein identification classifier to identify the DNA-binding protein to be identified. The classifier is determined by the following steps:
S1. Obtain the processed DNA-binding protein feature dataset; the dataset includes a training set and a test set;
S2. Use different extraction algorithms to extract features from the DNA-binding protein dataset, obtaining multiple feature files;
S3. Concatenate the sequence feature matrices extracted by the different feature extraction algorithms to obtain the concatenated feature matrix;
S4. Normalize the feature matrix generated in S3 to obtain the normalized feature matrix;
S5. Use the MRMD algorithm to reduce the dimension of the matrix generated in S4;
S6. Use the XGBoost algorithm to build and train the DNA-binding protein identification classifier model.
Further, the extraction algorithms used in S2 to extract features from the original DNA-binding protein dataset are the global encoding method of protein sequence, Multi-scale Continuous and Discontinuous, Novel Matrix-Based Sequence Representation Model with Amino Acid, and the Position-Specific Scoring Matrix (PSSM) based methods PSSM-AB, PSSM-Pse and PSSM-DWT.
Further, the normalization in S4 uses the zero-mean normalization algorithm.
Further, the MRMD algorithm is the MRMD3.0 algorithm.
A DNA-binding protein identification system based on the XGBoost algorithm, the system being used to execute the DNA-binding protein identification method based on the XGBoost algorithm.
A storage medium storing at least one instruction, the at least one instruction being loaded and executed by a processor to implement the DNA-binding protein identification method based on the XGBoost algorithm.
A device comprising a processor and a memory, the memory storing at least one instruction, the at least one instruction being loaded and executed by the processor to implement the DNA-binding protein identification method based on the XGBoost algorithm.
The beneficial effects of the invention are as follows:
The invention uses the composition of ribonucleotides to express the features of the protein sequence, enables accurate identification of DNA-binding proteins, and provides a theoretical basis for related research and development. When building the model, the invention concatenates and standardizes the feature matrices, which strengthens the relationship between the data features and the data labels and thereby improves accuracy. On this basis, the invention uses the MRMD3.0 algorithm to reduce the dimension of the data, achieving high identification accuracy with less data and improving the efficiency of DNA-binding protein identification.
The invention builds the classifier with the XGBoost algorithm, producing a DNA-binding protein identification model with better overall performance and wider applicability. Because the invention extracts features with different feature extraction methods and identifies proteins based on these different features, it further improves both the generality and the accuracy of DNA-binding protein identification: different proteins can all be identified with very high accuracy. Furthermore, the invention innovatively combines the feature sets extracted by six feature extraction methods, uses a dimensionality reduction algorithm to select the information with higher feature value, and performs identification with the XGBoost algorithm, which is simpler than a neural network. On top of its generality, it thus achieves a good identification result with a smaller feature set and a simpler algorithm.
Description of drawings
Figure 1 is a flowchart of the machine-learning-based DNA-binding protein identification method;
Figure 2 shows the comparison results on the unconcatenated and the concatenated independent datasets.
Detailed description
Exemplary embodiments of the invention are now described in detail with reference to the accompanying drawings. The embodiments shown and described in the drawings are exemplary only; they are intended to illustrate the principles and spirit of the invention, not to limit its scope.
Embodiment 1:
This embodiment is a DNA-binding protein identification method based on the XGBoost algorithm. As shown in Figure 1, it includes the following steps:
S1. Download the DNA-binding protein sequence data files and obtain the processed DNA-binding protein feature dataset.
In this embodiment, two training sets and two test sets are used to train and test the model. The training set PDB1075 with the test set PDB186, and the training set PDB14189 with the test set PDB2272, are downloaded from the Protein Database.
The training dataset PDB1075 consists of 525 DNA-binding proteins and 550 non-DNA-binding proteins. The test dataset PDB186 consists of 93 DNA-binding proteins and 93 non-DNA-binding proteins, each protein consisting of 100 amino acids. The training dataset PDB14189 consists of 7129 DNA-binding proteins and 7060 non-DNA-binding proteins, and the test dataset PDB2272 consists of 1153 DNA-binding proteins and 1119 non-DNA-binding proteins, with protein sequence similarity below 30%.
S2. Use different extraction algorithms to extract features from the original DNA-binding protein dataset, obtaining multiple feature files. During extraction, the different feature extraction algorithms are all applied to the same sequence.
In this embodiment, six feature extraction methods are used: the global encoding method of protein sequence (GE), Multi-scale Continuous and Discontinuous (MCD), Novel Matrix-Based Sequence Representation Model with Amino Acid (NMBAC), and the Position-Specific Scoring Matrix (PSSM) based methods PSSM-AB, PSSM-Pse and PSSM-DWT.
GE obtains its features through global encoding; MCD obtains sequence features through multi-scale continuous and discontinuous descriptors; the statistical principle of NMBAC is the normalized Moreau-Broto autocorrelation; PSSM-AB is the average block of the position-specific scoring matrix; PSSM-DCT is the discrete cosine transform of the PSSM; PSSM-DWT is the discrete wavelet transform of the PSSM. A protein sequence can be represented by these discrete values.
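To make one of these descriptors concrete, the following is a minimal sketch of a normalized Moreau-Broto autocorrelation for a single physicochemical property. It is an illustration under stated assumptions, not the NMBAC implementation used in the patent: the Kyte-Doolittle hydropathy scale, the function names, the example sequence and the maximum lag are all chosen here for demonstration only.

```python
from statistics import mean, pstdev

# Kyte-Doolittle hydropathy values, used only as an example property scale (assumption).
HYDROPATHY = {
    "A": 1.8, "R": -4.5, "N": -3.5, "D": -3.5, "C": 2.5,
    "Q": -3.5, "E": -3.5, "G": -0.4, "H": -3.2, "I": 4.5,
    "L": 3.8, "K": -3.9, "M": 1.9, "F": 2.8, "P": -1.6,
    "S": -0.8, "T": -0.7, "W": -0.9, "Y": -1.3, "V": 4.2,
}


def standardize_scale(scale):
    """Rescale a property table to zero mean and unit variance over the 20 residues."""
    values = list(scale.values())
    mu, sigma = mean(values), pstdev(values)
    return {aa: (v - mu) / sigma for aa, v in scale.items()}


def moreau_broto(sequence, scale, max_lag=5):
    """Return one normalized autocorrelation value for each lag 1..max_lag."""
    std = standardize_scale(scale)
    values = [std[aa] for aa in sequence]
    n = len(values)
    features = []
    for lag in range(1, max_lag + 1):
        if lag >= n:
            features.append(0.0)  # sequence too short for this lag
            continue
        total = sum(values[i] * values[i + lag] for i in range(n - lag))
        features.append(total / (n - lag))
    return features


# Hypothetical short sequence, for demonstration only.
example = moreau_broto("MKVLAAGIVRK", HYDROPATHY, max_lag=3)
```

Each lag yields one number per property, so a protein of any length maps to a fixed-length vector, which is what allows the descriptors to be stacked into a feature matrix.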
Other embodiments may use a different number of feature extraction methods or other feature extraction methods. It should be noted that the invention innovatively uses the feature sets extracted by these six methods and reduces their dimension in a later step. In-depth research and experiments on the structural and physiological properties of different DNA-binding proteins showed that the feature information extracted by these six methods provides a general representation of different DNA-binding proteins and therefore yields very good prediction results in terms of both generality and identification accuracy. In other words, the six feature extraction methods and the corresponding feature combination were arrived at through that research and experimentation.
Previously, those skilled in the art assumed that proteins of the same class must be very similar in structure and physiological form, which is why they are grouped into one class, and that proteins of one class are identified through their identical or similar structure. Most methods are therefore based on a single type of feature. Even where different features have been used for protein identification, this way of thinking means that researchers generally have not studied whether features from different extraction methods can represent a protein better in different spatial dimensions, nor which feature combinations favor the generality and accuracy of the identification method, and they would not have predicted which combinations work better.
For those skilled in the art, the expected effect is merely that the combined identification performance approximates the sum of the individual performances. If there are n features, each contributing an effect of 1, the combined effect is expected to be at most n and not to exceed it, because different features may be correlated with one another in the data space. Nor would those skilled in the art expect that "different features represent different aspects, and for identifying a protein as a whole, combining different features better expresses the protein from multiple directions in the data space, so that the different features act synergistically". The six features selected by the invention after research and experiment are strongly synergistic and achieve an identification effect exceeding n.
These six feature extraction methods also support the subsequent processing: a dimensionality reduction algorithm then selects the information with higher feature value, and the XGBoost algorithm, which is simpler than a neural network, performs the identification, so that a good identification result is achieved with a smaller feature set and a simpler algorithm.
S3. Generate the feature matrix from the feature files of S2: the features are extracted from the same sequences, and the sequence feature matrices produced by the different feature extraction algorithms are concatenated to obtain the concatenated feature matrix.
In this embodiment, the feature matrices are extracted from the same sequences by the six feature extraction methods, and the six resulting sequence feature matrices are concatenated.
Different feature extraction methods use different algorithms to extract sequence features, and each algorithm has its own focus and characteristics. Concatenating the sequence feature matrices extracted by the different methods compensates for the weaknesses of each individual method and improves accuracy, which is why this approach is used to construct the data.
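The concatenation in S3 can be sketched as a column-wise join of per-method matrices. This is a minimal illustration, assuming each matrix is a list of rows and that row i of every matrix describes the same protein; the function name and the toy matrices are hypothetical and do not come from the patent.

```python
def concatenate_features(blocks):
    """Join per-method feature matrices column-wise.

    `blocks` is a list of matrices (lists of rows); row i of every matrix
    must describe the same protein sequence.
    """
    n_rows = len(blocks[0])
    if any(len(block) != n_rows for block in blocks):
        raise ValueError("every feature matrix needs one row per sequence")
    merged = []
    for i in range(n_rows):
        row = []
        for block in blocks:
            row.extend(block[i])  # append this method's features for sequence i
        merged.append(row)
    return merged


# Toy matrices for two sequences; the widths 3 and 2 are arbitrary.
ge_like = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
pssm_like = [[1.0, 2.0], [3.0, 4.0]]
combined = concatenate_features([ge_like, pssm_like])
```

The row-count check matters in practice: if one extractor silently drops a sequence, every later row would pair features from different proteins.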
S4. Use the zero-mean normalization algorithm to normalize the feature matrix generated in S3, obtaining the normalized feature matrix.
Normalizing the feature matrix generated in S3 with zero-mean normalization strengthens the relationship between the data features and the data labels and puts the data on a uniform scale, which effectively improves accuracy.
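Zero-mean normalization is commonly implemented as a per-column z-score, (x - mean) / standard deviation. The sketch below assumes that interpretation; the function name and the handling of constant columns are choices made here for illustration, not details given in the patent.

```python
from statistics import mean, pstdev


def zscore_columns(matrix):
    """Return a copy of `matrix` with each column shifted to mean 0 and scaled to std 1."""
    columns = list(zip(*matrix))
    mus = [mean(col) for col in columns]
    sigmas = [pstdev(col) for col in columns]
    scaled = []
    for row in matrix:
        scaled.append([
            (x - mu) / sigma if sigma > 0 else 0.0  # constant columns become 0 (assumption)
            for x, mu, sigma in zip(row, mus, sigmas)
        ])
    return scaled


features = [[1.0, 10.0, 5.0], [2.0, 20.0, 5.0], [3.0, 30.0, 5.0]]
normalized = zscore_columns(features)
```

When a separate test set is used, the usual practice is to compute the column means and standard deviations on the training set only and reuse them for the test set.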
S5. Use the MRMD3.0 algorithm to reduce the dimension of the matrix generated in S4.
Max-Relevance-Max-Distance (MRMD3.0) is a maximum-relevance, maximum-distance dimensionality reduction method. The method, named Max-Relevance-Max-Distance (MRMD), was developed by Zou Quan et al. in 2015; the user guide and the complete runtime program can be downloaded from https://github.com/heshida01/MRMD3.0. Dimensionality reduction here means computing a weight for each sequence feature, comparing classification results, discarding the features with low weight, keeping the features with high weight, and recording the results. The method uses a distance function to judge the independence of the data and completes the reduction in three steps. First, it evaluates and quantifies the contribution of each feature to classification. Second, it computes the weights of the different features and ranks the selected features accordingly. Finally, it filters and classifies with different numbers of features and records the results.
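The idea of scoring a feature by its relevance to the label plus its distance from the other features can be sketched as follows. This is a simplified illustration and not the MRMD3.0 program: it assumes absolute Pearson correlation as the relevance term and mean Euclidean distance between feature columns as the distance term, and all function names, the `weight` parameter and the toy data are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    return cov / sqrt(vx * vy) if vx > 0 and vy > 0 else 0.0


def euclidean(x, y):
    return sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))


def rank_features(matrix, labels, weight=1.0):
    """Rank column indices by relevance to the label plus distance from other columns."""
    columns = [list(col) for col in zip(*matrix)]
    scores = []
    for i, col in enumerate(columns):
        relevance = abs(pearson(col, labels))
        others = [euclidean(col, other) for j, other in enumerate(columns) if j != i]
        distance = sum(others) / len(others) if others else 0.0
        scores.append((relevance + weight * distance, i))
    return [i for _, i in sorted(scores, reverse=True)]


def select_top(matrix, order, k):
    """Keep only the k best-ranked columns, in ranked order."""
    keep = order[:k]
    return [[row[i] for i in keep] for row in matrix]


# Toy data: column 0 follows the label, column 1 nearly duplicates it, column 2 is noise.
labels = [1, 1, 0, 0]
matrix = [[1.0, 0.9, 0.5], [1.0, 1.0, 0.4], [0.0, 0.1, 0.5], [0.0, 0.0, 0.4]]
order = rank_features(matrix, labels)
reduced = select_top(matrix, order, k=2)
```

Because the distance term depends on the scale of each column, a ranking of this kind is only meaningful after the features have been put on a common scale, as in S4. Choosing how many features to keep would then be done by training a classifier on the top-k columns for several values of k and comparing the results.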
S6. Use the XGBoost algorithm to build and train the DNA-binding protein identification classifier model.
XGBoost is a machine learning model that combines multiple weak learners into a stronger learner. It applies a second-order Taylor expansion to the loss function and uses several techniques to limit overfitting. The feature matrix produced in step S5 is stored in a CSV file, which is read by the XGBoost algorithm to perform the classification and generate the DNA-binding protein identification classifier.
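The second-order Taylor expansion can be illustrated on a single tree leaf. For the logistic loss, each sample contributes a gradient g = p - y and a Hessian h = p(1 - p), where p is the current predicted probability, and the leaf weight that minimizes the regularized second-order approximation is w = -G / (H + lambda), with G and H the sums over the samples in the leaf. The sketch below computes exactly that; it is not the patent's training code, and the function names and toy labels are assumptions made for illustration.

```python
from math import exp


def sigmoid(z):
    return 1.0 / (1.0 + exp(-z))


def grad_hess(raw_scores, labels):
    """First and second derivatives of the logistic loss with respect to the raw score."""
    grads, hessians = [], []
    for f, y in zip(raw_scores, labels):
        p = sigmoid(f)
        grads.append(p - y)
        hessians.append(p * (1.0 - p))
    return grads, hessians


def leaf_weight(grads, hessians, reg_lambda=1.0):
    """Weight minimizing the second-order approximation of the loss for one leaf."""
    return -sum(grads) / (sum(hessians) + reg_lambda)


# Four samples falling into one leaf, all starting from a raw score of 0 (p = 0.5).
labels = [1, 1, 1, 0]
raw_scores = [0.0, 0.0, 0.0, 0.0]
g, h = grad_hess(raw_scores, labels)
w = leaf_weight(g, h, reg_lambda=1.0)
```

Here G = -1.0 and H = 1.0, so the leaf weight is 0.5: the leaf pushes the raw score upward because three of its four samples are positive, and the L2 term `reg_lambda` shrinks the step, which is one of the ways overfitting is limited.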
The trained DNA-binding protein identification classifier is then used to identify the DNA-binding protein to be identified.
Performance comparison of the invention with other state-of-the-art DNA-binding protein models:
On the PDB1075 dataset, the performance of the concatenated sequence features and of the single-sequence features is evaluated by randomly holding out 30% of the data as the test set.
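A random 30% hold-out of this kind can be sketched with index shuffling. The text does not say whether the split was stratified or which seed was used, so the plain random split, the seed and the function name below are assumptions.

```python
import random


def holdout_split(n_samples, test_fraction=0.3, seed=0):
    """Return (train_indices, test_indices) for a random hold-out split."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # local generator keeps the split reproducible
    n_test = int(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


train_idx, test_idx = holdout_split(1075, test_fraction=0.3, seed=0)
```

For 1075 sequences this holds out 322 samples and leaves 753 for training; the same index lists can be applied to every feature matrix so that all feature types are compared on identical samples.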
Table 1: feature dimensions
Table 2: dataset information
Figure 2 and Tables 3 to 5 present the experimental results. Among the single-sequence features, PSSM-DWT (MCC: 0.4981) performs best. The concatenated sequence features outperform every single-sequence feature on all metrics and also achieve the best ROC performance (ROC: 0.85).
Table 3: results of the XGBoost algorithm with different feature extraction methods on the PDB1075 dataset
Table 4: results of different methods on the PDB2272 dataset
Table 5: results of different methods on the PDB186 test set (trained on PDB1075)
We use PDB1075 as the training set and PDB186 as the test set to evaluate our method, and compare its results with those of 13 other methods. Figure 2 shows the complete experimental results. The five methods MSDBP, MSFBinder, Local-DPP, MKSVM-HKA and Adilina's work all reach MCC values above 0.6 (0.606, 0.616, 0.625, 0.648 and 0.670, respectively), so these methods perform well. Although Adilina's work achieves the best sensitivity (SN: 95.0%), XGBoost achieves the best accuracy (Acc: 85.48%), MCC (0.713) and specificity (Spec: 80.6%). On PDB1075 and PDB186, XGBoost outperforms the other methods.
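The metrics quoted above follow from the four entries of a binary confusion matrix. The sketch below uses the standard definitions of accuracy, sensitivity, specificity and the Matthews correlation coefficient. The confusion-matrix counts are an assumption: they are not given in the text, and were chosen only because they reproduce the reported Acc, MCC and Spec on a test set of 93 positives and 93 negatives.

```python
from math import sqrt


def binary_metrics(tp, fn, tn, fp):
    """Accuracy, sensitivity, specificity and MCC from confusion-matrix counts."""
    acc = (tp + tn) / (tp + fn + tn + fp)
    sn = tp / (tp + fn)      # sensitivity: fraction of positives recovered
    spec = tn / (tn + fp)    # specificity: fraction of negatives recovered
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0
    return {"ACC": acc, "SN": sn, "Spec": spec, "MCC": mcc}


# Hypothetical counts for a 93/93 test set; not taken from the patent.
metrics = binary_metrics(tp=84, fn=9, tn=75, fp=18)
```

With these counts the function gives an accuracy of about 0.8548, a specificity of about 0.8065 and an MCC of about 0.713. MCC is the more informative single number when comparing methods, because it only approaches 1 when both classes are predicted well.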
We removed the proteins in PDB2272 that share more than 40% sequence homology with PDB14189, to avoid homology bias between the two datasets. PDB14189 is the training set and PDB2272 is the test set. We tested XGBoost independently on PDB2272, using PDB14189 as the training set, and compared it with 5 other classification methods. The detailed experimental results are shown in Figure 2. Compared with the other methods, XGBoost obtains the best ACC, MCC and Spec values: 78.26%, 0.5652 and 76.05%, respectively. On PDB2272, XGBoost performs better than the other classification methods.
Embodiment 2:
This embodiment is a DNA-binding protein identification system based on the XGBoost algorithm; the system is used to execute the DNA-binding protein identification method based on the XGBoost algorithm.
Embodiment 3:
This embodiment is a storage medium storing at least one instruction; the at least one instruction is loaded and executed by a processor to implement the DNA-binding protein identification method based on the XGBoost algorithm.
Embodiment 4:
This embodiment is a device comprising a processor and a memory; the memory stores at least one instruction, and the at least one instruction is loaded and executed by the processor to implement the DNA-binding protein identification method based on the XGBoost algorithm.
The above examples only illustrate the computational model and workflow of the invention in detail and do not limit its embodiments. Those of ordinary skill in the art can make other changes or variations on the basis of the above description. Not all embodiments can be listed exhaustively here; any obvious change or variation derived from the technical solution of the invention remains within its scope of protection.
Claims (6)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111056316.9A CN113764045B (en) | 2021-09-09 | 2021-09-09 | DNA-binding protein identification method and related products based on XGboost algorithm |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111056316.9A CN113764045B (en) | 2021-09-09 | 2021-09-09 | DNA-binding protein identification method and related products based on XGboost algorithm |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN113764045A (en) | 2021-12-07 |
| CN113764045B (en) | 2022-05-06 |
Family ID=78794332
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111056316.9A Active CN113764045B (en) | 2021-09-09 | 2021-09-09 | DNA-binding protein identification method and related products based on XGboost algorithm |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN113764045B (en) |
Families Citing this family (1)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114863998B (en) * | 2022-04-08 | 2024-09-13 | 苏州科技大学 | DNA binding protein recognition method based on deep sparse representation network |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN112906755A (en) * | 2021-01-27 | 2021-06-04 | 深圳职业技术学院 | Plant resistance protein identification method, device, equipment and storage medium |
| CN113095156A (en) * | 2021-03-23 | 2021-07-09 | 西安深信科创信息技术有限公司 | Double-current network signature identification method and device based on inverse gray scale mode |
Family Cites Families (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US8131039B2 (en) * | 2007-09-26 | 2012-03-06 | Siemens Medical Solutions Usa, Inc. | System and method for multiple-instance learning for computer aided diagnosis |
| US20110161259A1 (en) * | 2009-12-30 | 2011-06-30 | Hon Hai Precision Industry Co. Ltd. | System and method for simplification of a matrix based boosting algorithm |
| US9734393B2 (en) * | 2012-03-20 | 2017-08-15 | Facebook, Inc. | Gesture-based control system |
| CN112037221B (en) * | 2020-11-03 | 2021-02-02 | 杭州迪英加科技有限公司 | Multi-domain co-adaptation training method for cervical cancer TCT slice positive cell detection model |
Non-Patent Citations (2)
| Title |
|---|
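| xIP-seq Platform: An Integrative Framework for High-Throughput Sequencing Data Analysis; Xin Wang et al.; 2009 Ohio Collaborative Conference on Bioinformatics; 2009-06-17; pp. 1-6 * |
| Research on DNA-binding protein prediction based on comprehensive protein sequence features and ensemble learning; Chen Pengcheng (陈鹏丞); China Master's Theses Full-text Database (Electronic Journal); 2021-05-31; pp. 5-57 * |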
Also Published As
| Publication number | Publication date |
|---|---|
| CN113764045A (en) | 2021-12-07 |
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| CN104239858B (en) | A kind of method and apparatus of face characteristic checking | |
| CN102968626B (en) | A kind of method of facial image coupling | |
| EP2390810A2 (en) | Taxonomic classification of metagenomic sequences | |
| CN112199957B (en) | Character entity alignment method and system based on attribute and relationship information joint embedding | |
| Al-Ghalith et al. | BURST enables mathematically optimal short-read alignment for big data | |
| CN105184266B (en) | A kind of finger venous image recognition methods | |
| CN114821237A (en) | Unsupervised ship re-identification method and system based on multi-stage comparison learning | |
| CN114550831B (en) | A gastric cancer proteomic classification framework identification method based on deep learning feature extraction | |
| WO2021120587A1 (en) | Method and apparatus for retina classification based on oct, computer device, and storage medium | |
| CN118430654A (en) | Method for generating target antibacterial peptide | |
| CN113764045B (en) | DNA-binding protein identification method and related products based on XGboost algorithm | |
| CN111863135A (en) | False positive structural variation filtering method, storage medium and computing device | |
| Li et al. | Protein sequence comparison and DNA-binding protein identification with generalized PseAAC and graphical representation | |
| CN113724779A (en) | SNAREs protein identification method, system, storage medium and equipment based on machine learning technology | |
| Ayadi et al. | Evolutionary biclustering algorithm of gene expression data | |
| Wei et al. | A novel magnification-robust network with sparse self-attention for micro-expression recognition | |
| CN117668509A (en) | An interactive feature selection method based on maximum correlation and minimum redundancy | |
| dos Santos Jr et al. | Partial least squares for face hashing | |
| WO2020211248A1 (en) | Living body detection log parsing method and apparatus, storage medium and computer device | |
| CN118412042A (en) | Method for obtaining antibacterial peptide discriminator | |
| CN112309577B (en) | Multi-mode feature selection method for optimizing parkinsonism voice data | |
| Dong et al. | Protein remote homology detection based on binary profiles | |
| Ignacio | Intrinsic hierarchical clustering behavior recovers higher dimensional shape information | |
| KR102225231B1 (en) | IDENTIFYING METHOD FOR TUMOR PATIENT BASED ON miRNA IN EXOSOME AND APPARATUS FOR THE SAME | |
| CN115344531A (en) | Method and system for compressed fast medical interoperability resource (FHIR) file similarity search |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | CB03 | Change of inventor or designer information | Inventor after: Zhao Yuming, Wang Guohua, Zhao Ziye, Zou Quan. Inventor before: Wang Guohua, Zhao Ziye, Zou Quan. |
| | GR01 | Patent grant | |