
CN118969060B - A method for predicting protein function based on transfer learning and three-channel combined GNN - Google Patents

A method for predicting protein function based on transfer learning and three-channel combined GNN

Info

Publication number
CN118969060B
CN118969060B CN202411442388.0A
Authority
CN
China
Prior art keywords
protein
amino acid
network
database
sequence
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202411442388.0A
Other languages
Chinese (zh)
Other versions
CN118969060A (en)
Inventor
黎晟
石海鹏
石海鹤
曾子涵
龚嘉盈
陈彬
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Jiangxi Normal University
Original Assignee
Jiangxi Normal University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Jiangxi Normal University filed Critical Jiangxi Normal University
Priority to CN202411442388.0A priority Critical patent/CN118969060B/en
Publication of CN118969060A publication Critical patent/CN118969060A/en
Application granted granted Critical
Publication of CN118969060B publication Critical patent/CN118969060B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B15/00 - ICT specially adapted for analysing two-dimensional or three-dimensional molecular structures, e.g. structural or functional relations or structure alignment
    • G16B15/20 - Protein or domain folding
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/042 - Knowledge-based neural networks; Logical representations of neural networks
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/044 - Recurrent networks, e.g. Hopfield networks
    • G06N3/0442 - Recurrent networks, e.g. Hopfield networks characterised by memory or gating, e.g. long short-term memory [LSTM] or gated recurrent units [GRU]
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/04 - Architecture, e.g. interconnection topology
    • G06N3/045 - Combinations of networks
    • G06N3/0455 - Auto-encoder networks; Encoder-decoder networks
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/084 - Backpropagation, e.g. using gradient descent
    • G - PHYSICS
    • G06 - COMPUTING OR CALCULATING; COUNTING
    • G06N - COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 - Computing arrangements based on biological models
    • G06N3/02 - Neural networks
    • G06N3/08 - Learning methods
    • G06N3/096 - Transfer learning
    • G - PHYSICS
    • G16 - INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16B - BIOINFORMATICS, i.e. INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR GENETIC OR PROTEIN-RELATED DATA PROCESSING IN COMPUTATIONAL MOLECULAR BIOLOGY
    • G16B40/00 - ICT specially adapted for biostatistics; ICT specially adapted for bioinformatics-related machine learning or data mining, e.g. knowledge discovery or pattern finding
    • Y - GENERAL TAGGING OF NEW TECHNOLOGICAL DEVELOPMENTS; GENERAL TAGGING OF CROSS-SECTIONAL TECHNOLOGIES SPANNING OVER SEVERAL SECTIONS OF THE IPC; TECHNICAL SUBJECTS COVERED BY FORMER USPC CROSS-REFERENCE ART COLLECTIONS [XRACs] AND DIGESTS
    • Y02 - TECHNOLOGIES OR APPLICATIONS FOR MITIGATION OR ADAPTATION AGAINST CLIMATE CHANGE
    • Y02A - TECHNOLOGIES FOR ADAPTATION TO CLIMATE CHANGE
    • Y02A90/00 - Technologies having an indirect contribution to adaptation to climate change
    • Y02A90/10 - Information and communication technologies [ICT] supporting adaptation to climate change, e.g. for weather forecasting or climate simulation

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • General Health & Medical Sciences (AREA)
  • Biophysics (AREA)
  • Software Systems (AREA)
  • Evolutionary Computation (AREA)
  • Artificial Intelligence (AREA)
  • Data Mining & Analysis (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Biomedical Technology (AREA)
  • Medical Informatics (AREA)
  • Spectroscopy & Molecular Physics (AREA)
  • Biotechnology (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Evolutionary Biology (AREA)
  • Epidemiology (AREA)
  • Databases & Information Systems (AREA)
  • Public Health (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Bioethics (AREA)
  • Chemical & Material Sciences (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
  • Investigating Or Analysing Biological Materials (AREA)

Abstract


The present invention belongs to the field of deep learning and bioinformatics, and discloses a method for predicting protein function based on transfer learning and a three-channel combined GNN. The method parses protein spatial structure files to extract amino acid residue contact graphs with different connection densities; constructs an amino acid sequence mixed feature extractor based on transfer learning to obtain a residue-level mixed feature map of the amino acid sequence; introduces a graph convolutional network and a GATv2 network with a multi-head graph attention mechanism to construct a three-channel combined graph neural network for protein function prediction; and obtains the probability value of each protein Gene Ontology functional label through a multi-label classifier. The present invention improves protein function prediction performance and makes it convenient for researchers to efficiently and comprehensively obtain the complex association information of protein entries across different authoritative databases.

Description

Method for predicting protein function based on transfer learning and three-channel combined GNN
Technical Field
The invention relates to the field of deep learning and bioinformatics, in particular to a method for predicting protein function based on transfer learning and a three-channel combined GNN.
Background
Proteins are essential for vital activities and cellular metabolism. Research on protein function directly clarifies the specific mechanisms of proteins in physiology and pathology and reveals the processes of protein synthesis, degradation, processing and modification, allowing gene expression and regulation to be better understood. Protein function is currently measured mainly by techniques such as enzyme activity assays and cell biology experiments, and is classified, based on the Gene Ontology (GO), into three main categories: Molecular Function (MF), Biological Process (BP) and Cellular Component (CC).
With the development of protein structure determination technologies such as X-ray crystallography and cryo-electron microscopy, and the application of protein structure prediction models, the number of known amino acid sequences and structures now greatly exceeds the number of corresponding functional annotations. To avoid the high equipment and time costs of measuring protein function directly by experiment, researchers are increasingly turning to computational methods for function prediction. The sequence and molecular structure of a protein determine, to some extent, the specific functions it possesses; designing efficient and accurate protein function prediction methods that combine amino acid sequence or molecular structure information has therefore become a focus of protein function research in recent years.
Initially, researchers deduced protein function by aligning sequences for similarity and homology: sequences with high homology are likely to have similar functions, so their known functional annotations can be transferred to similar proteins. Based on this idea, the Blast2GO tool was developed for functional annotation transfer and analysis using sequence similarity. There are also methods that predict protein function by comparing structural similarity to infer that similar structures imply similar functions, and some studies have proposed computing similarity between protein structures using contact maps.
Predicting protein function directly from sequence or structural similarity and homology is highly error-prone. With the excellent performance of deep learning in many domains, its application to protein function prediction has been explored. For example, the DeepGO model uses a CNN to extract amino acid sequence features and combines them with protein interaction networks extracted from the STRING database to predict protein function. However, DeepGO models only the sequence and ignores the great influence of a protein's molecular structure on its specific function. On this basis, researchers proposed DeepFRI, a model that incorporates molecular structure into function prediction: 10M sequences were sampled from the full sequence set of the Pfam database, and an LSTM-based protein language model was trained as the feature extractor for amino acid sequences. The residue-level feature matrix of the amino acid sequence and the residue adjacency matrix extracted from the protein spatial structure file are then input into a GCN graph convolutional neural network to predict protein function. However, LSTMs have significant drawbacks when processing long sequence data, and using only GCNs has limitations when processing complex irregular graph data. For its experimental data, DeepFRI used only high-precision protein structures from the PDB database and homology-based structural models constructed with SWISS-MODEL, but most proteins lack such data. Meanwhile, in practice the protein databases are not directly linked, so key information such as protein sequence, molecular structure and functional annotation must be retrieved from different source databases through database cross-references, which greatly hinders researchers from obtaining protein-related information completely and conveniently; moreover, some downloadable databases, such as the protein-ligand interaction database ChEMBL, are MySQL-type relational databases whose access convenience and retrieval efficiency are lower than those of document-oriented NoSQL databases.
Disclosure of Invention
In order to solve the above problems, the invention provides a method for predicting protein function based on transfer learning and a three-channel combined GNN, which improves prediction accuracy, and also provides a protein document-type association database, which makes it convenient to efficiently acquire the detailed association information of related proteins across different protein databases; from this database a user can extract reference data sets for data-driven downstream protein prediction tasks.
The technical scheme adopted by the invention is that the method for predicting the protein function based on transfer learning and three-channel combined GNN comprises the following steps:
Step S1, acquiring an amino acid sequence of a protein, a protein space structure file and a protein Gene Ontology (GO) function tag from a protein public database, constructing a NoSQL type protein document type association database after cross-association and data cleaning of the acquired data, acquiring a reference data set of a protein function prediction task from the protein document type association database, and dividing the reference data set into a training set, a verification set and a test set;
S2, parsing the protein spatial structure file, extracting from it the spatial coordinates of the Cα atom of each amino acid residue in the protein molecule, and, based on the spatial contact relations of the Cα atoms of all residues, extracting amino acid residue contact graphs with different connection densities according to different contact thresholds, so as to map the complex spatial structure of the protein molecule;
S3, constructing an amino acid sequence mixed feature extractor based on transfer learning, and extracting multiple types of residue-level features from the amino acid sequence through the amino acid sequence mixed feature extractor to form a primary sequence embedded representation;
s4, a graph convolution network and a GATv network with a multi-head graph attention mechanism are introduced to construct a three-channel combined graph neural network, a residue-level mixed feature map of an amino acid sequence is used as an initial feature matrix input of the three-channel combined graph neural network, an amino acid residue contact graph is used as an adjacent matrix input of different feature channels, a complex graph relationship among amino acid residues in protein molecules is captured and learned, and protein function prediction is carried out by combining the residue-level mixed feature map of the amino acid sequence and multi-level structural features;
S5, constructing a multi-label classifier, carrying out global pooling and aggregation operation on the output node characteristics of the three-channel combined graph neural network, and mapping the aggregated characteristics into a probability space of 0-1 to obtain a probability value of each protein gene ontology functional label;
S6, forming a complete protein function prediction model by the sequence mixed characteristic extractor, the three-channel combined graph neural network and the multi-label classifier, pre-training the protein function prediction model on a training set, evaluating the performances of the protein function prediction model in different training stages by using a verification set and a test set, and performing protein function prediction by using a protein function prediction model qualified in training.
Further preferably, the spatial contact relation of the Cα atoms of the residues in step S2 is obtained as follows: a protein is composed of N amino acid residues; the protein spatial structure file is traversed and the spatial coordinates of the Cα atom of each amino acid residue are extracted from it; the spatial distances between all Cα atoms are calculated based on these coordinates, and a distance threshold is set to judge whether two residues are in contact, thereby obtaining an N × N amino acid residue contact matrix; then only the r edges having a contact relationship are kept and, according to the source node and target node of each edge, converted into an adjacency matrix of size 2 × 2r, which is taken as the amino acid residue contact graph.
Further preferably, the amino acid residue contact graphs of different connection densities in step S2 are obtained by computing contact graphs separately from the experimentally determined protein spatial structure files from the PDB and from the protein spatial structure files predicted by AlphaFold, and by computing contact graphs representing different contact densities according to different distance thresholds.
Further preferably, in step S3, the complete sequence or a sub-sequence of the protein is input into the amino acid sequence mixed feature extractor; a sequence embedded representation is first extracted by a Bi-LSTM network, a further sequence embedded representation is then extracted by an ESM2 network consisting of 33 layers of Transformer encoders, and the sequence embedded representations extracted by the Bi-LSTM network and the ESM2 network together serve as the primary sequence embedded representation of the amino acid sequence.
Further preferably, in step S3, feature fusion and enrichment operations are performed on the primary sequence embedded representation: the primary embedded representation is sliced by category into a sequence one-hot encoding map, residue-sequence deep context features and residue biochemical feature encodings; a learnable nonlinear mapping is applied to each slice, feature enrichment and dimensionality reduction are performed on the primary embedded representation by adjusting the dimension of each hidden layer, and the residue-level mixed feature map of the amino acid sequence is finally obtained through ReLU nonlinear mapping and feature fusion.
Further preferably, the three-channel combined graph neural network comprises a first feature channel formed by two GCN layers connected in series and a second feature channel formed by another two GCN layers connected in series; the first feature channel is parallel to the second feature channel, forming a 2 × 2 parallel dual-channel GCN network; a GATv2 network with a multi-head graph attention mechanism is introduced as the third feature channel and is connected in series after the parallel dual-channel GCN network.
The three-channel combined graph neural network is further preferably characterized in that the residue-level mixed feature map of the amino acid sequence is used as the initial feature matrix input of the 2 × 2 parallel dual-channel GCN network; the sparse amino acid residue contact graph extracted with the smaller contact threshold is used as the adjacency matrix input of one feature channel of the parallel network, and the dense amino acid residue contact graph extracted with the larger contact threshold is used as the adjacency matrix input of the other feature channel; the output features of the 2 × 2 parallel dual-channel GCN network are fused to obtain fusion features, which are used as the feature matrix input of the GATv2 network, with the sparse amino acid residue contact graph as the adjacency matrix input of the GATv2 network; after further noise filtering and feature extraction on the fusion features, the GATv2 network produces the output node features of the three-channel combined graph neural network.
Further preferably, in step S1, the NoSQL-type protein document-type association database is constructed based on MongoDB.
Further preferably, the step of constructing a protein document type association database of the NoSQL type is as follows:
Step S101, testing the open data interface of the Uniprot database using an interface test script and the test tool PostMan, and obtaining the Uniprot database identifiers corresponding to protein entries through screening conditions; each screened protein entry must have an experimentally determined protein spatial structure file or a protein spatial structure file predicted by the AlphaFold model, the protein gene ontology is selected as the functional label of the protein, the total number of protein gene ontology labels possessed by each screened protein entry must be greater than or equal to 3, and the screened Uniprot database identifiers are used as the source data IDs of the database;
Step S102, accessing the Uniprot database data interface through the screened Uniprot database identifiers, acquiring the cross-index IDs of the corresponding protein entries in the PDB and AlphaFold databases from the data packets returned by the interface, and at the same time associating the primary key target ID of the target protein information table of the ChEMBL database through the Uniprot database identifier and acquiring the remaining secondary table IDs in the ChEMBL database according to this association;
Step S103, requesting the data interfaces corresponding to each class of IDs, and parsing the XML and JSON data packets returned by the interfaces with the BeautifulSoup and jsonpath frameworks and regular expressions, to obtain a JSON dictionary composed of the protein association information in the Uniprot, PDB and AlphaFold databases and a JSON dictionary composed of the target protein association data in the ChEMBL database;
Step S104, based on the JSON dictionaries, constructing a protein information main table with the Pymongo framework using the Uniprot database identifier as the primary key, and constructing a ChEMBL association information table containing the target information using the primary key target ID as the primary key;
Step S105, cross-associating the primary key target ID in the ChEMBL association information table with the Uniprot database identifier in the protein information main table to obtain the final protein document-type association database, in which a user can retrieve the key information of a related protein entry in the Uniprot, PDB, AlphaFold and ChEMBL databases using only the Uniprot database identifier.
The beneficial effects of the invention are as follows:
(1) The invention provides a method for constructing a NoSQL-type protein document-type association database based on MongoDB. Through this database, a user can retrieve the detailed association data of related protein entries with a single primary key ID, without cross-indexing across multiple databases, which greatly improves the retrieval efficiency of protein association information. Compared with a directly downloaded complete MySQL-type public database, the NoSQL-type database is also much more convenient to access.
(2) Current protein function prediction models such as DeepFRI use only the high-precision protein structures in the PDB database and the homology-based structural models constructed by SWISS-MODEL, but most proteins lack such data. The experimental data set obtained from the document-type NoSQL protein association database can contain both experimentally determined high-precision PDB structural data and structural data predicted by the AlphaFold model, which reduces the data acquisition cost while improving the generalization performance of the model.
(3) The invention constructs an amino acid sequence mixed feature extractor based on transfer learning. The transfer learning is based on two protein language models, Bi-LSTM and Transformer, and the sequence one-hot encoding map, the residue-sequence deep context features and the residue biochemical feature encodings are combined to jointly form the primary sequence embedded representation of the amino acid sequence; a learnable nonlinear mapping is applied to this representation during model training, which reduces the experimental cost and further enriches the residue features, so that the resulting residue-level mixed feature map has stronger feature expression capability.
(4) The DeepFRI model architecture is formed by GCN networks connected in series and has certain limitations when processing complex irregular graphs such as protein spatial structures. The invention introduces a GATv2 network with a multi-head attention mechanism, allowing the protein function prediction model to learn multiple groups of attention weights simultaneously; each attention head can learn different attention weights, which enhances the model's ability to express complex graph structures.
(5) The invention combines a GCN network and a GATv2 network with a multi-head graph attention mechanism into a three-channel combined graph neural network architecture: the two types of residue contact graphs extracted with different contact thresholds are learned simultaneously by a 2 × 2 parallel GCN network, and the features output by the two parallel channels are then further noise-filtered by the serially connected GATv2 network. Meanwhile, in the final multi-label classification stage of the model, the invention fuses two types of global pooling information so that the protein function prediction model can capture more node features.
Drawings
In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.
Fig. 1 is a schematic diagram of the present invention.
FIG. 2 is a diagram of an amino acid sequence hybrid feature extractor architecture in accordance with the present invention.
Fig. 3 is a three-channel combined graph neural network architecture diagram in the present invention.
Fig. 4 is a diagram of a multi-label classifier architecture in accordance with the present invention.
Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
As shown in fig. 1, the method for predicting protein functions based on transfer learning and three-channel combined GNN comprises the following steps:
Step S1, acquiring the amino acid sequence of the protein, the protein spatial structure file and the protein Gene Ontology (GO) functional labels from public protein databases, constructing a NoSQL-type protein document-type association database based on MongoDB after cross-associating and cleaning the acquired data, obtaining a reference data set for the protein function prediction task from the protein document-type association database, and dividing the reference data set into a training set, a validation set and a test set, wherein the training set is used to train the protein function prediction model provided by the invention, the validation set is used to evaluate the model's performance in each round (epoch) of the training process, and the test set is used to test the overall performance of the model after training is finished. The protein spatial structure files include both experimentally determined files and files predicted by the AlphaFold model;
S2, parsing the protein spatial structure file, extracting from it the spatial coordinates of the Cα atom of each amino acid residue in the protein molecule, and, based on the spatial contact relations of the Cα atoms of all residues, extracting amino acid residue contact graphs with different connection densities according to different contact thresholds, so as to map the complex spatial structure of the protein molecule;
S3, constructing an amino acid sequence mixed feature extractor based on transfer learning, and extracting multiple types of residue-level features from the amino acid sequence through the amino acid sequence mixed feature extractor to form a primary sequence embedded representation;
S4, introducing a graph convolutional network and a GATv2 network with a multi-head graph attention mechanism to construct a three-channel combined graph neural network, taking the residue-level mixed feature map of the amino acid sequence as the initial feature matrix input of the three-channel combined graph neural network and the amino acid residue contact graphs as the adjacency matrix inputs of the different feature channels, capturing and learning the complex graph relationships among amino acid residues in the protein molecule, and performing protein function prediction by combining the residue-level mixed feature map of the amino acid sequence with multi-level structural features;
S5, constructing a multi-label classifier, carrying out global pooling and aggregation operation on the output node characteristics of the three-channel combined graph neural network, and mapping the aggregated characteristics into a probability space of 0-1 to obtain a probability value of each protein gene ontology functional label;
S6, forming a complete protein function prediction model by the sequence mixed characteristic extractor, the three-channel combined graph neural network and the multi-label classifier, pre-training the protein function prediction model on a training set, evaluating the performances of the protein function prediction model in different training stages by using a verification set and a test set, and performing protein function prediction by using a protein function prediction model qualified in training.
The step S1 of constructing a protein document type association database of a NoSQL type based on MongoDB comprises the following steps:
Step S101, testing the open data interface of the Uniprot database using an interface test script and the test tool PostMan, and obtaining the Uniprot database identifiers corresponding to protein entries through screening conditions; each screened protein entry must have an experimentally determined protein spatial structure file or a protein spatial structure file predicted by the AlphaFold model, the length of each protein's amino acid sequence must not exceed 1022 residues (the ESM2 network processes at most 1022 residues of a protein's amino acid sequence at a time), the protein Gene Ontology (GO) is selected as the functional label of the protein, and the total number of protein gene ontology labels possessed by each screened protein entry must be greater than or equal to 3; in this embodiment, 26335 Uniprot database identifiers meeting these conditions are screened out as the source data IDs of the database.
Step S102, accessing the Uniprot database data interface through the screened Uniprot database identifiers, acquiring the cross-index IDs of the corresponding protein entries in the PDB and AlphaFold databases from the data packets returned by the interface, associating the primary key Target ID of the target protein information table of the ChEMBL database through the Uniprot database identifier, and acquiring the remaining secondary table IDs in the ChEMBL database according to this association;
Step S103, requesting the data interfaces corresponding to each class of IDs, and parsing the XML and JSON data packets returned by the interfaces with the BeautifulSoup and jsonpath frameworks and regular expressions, to obtain a JSON dictionary composed of the protein association information in the Uniprot, PDB and AlphaFold databases and a JSON dictionary composed of the target protein association data in the ChEMBL database;
Step S104, based on the JSON dictionaries, constructing a protein information main table with the Pymongo framework using the Uniprot database identifier as the primary key, and constructing a ChEMBL association information table containing the target information using the primary key target ID as the primary key;
Step S105, cross-associating the primary key target ID in the ChEMBL association information table with the Uniprot database identifier in the protein information main table to obtain the final protein document-type association database, in which a user can retrieve the key information of a related protein entry in the Uniprot, PDB, AlphaFold and ChEMBL databases using only the Uniprot database identifier.
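As a concrete illustration of steps S101-S105, a minimal Python sketch is given below; the UniProt REST endpoint, the JSON field names and the MongoDB collection layout are assumptions made for illustration rather than the exact interfaces used in this embodiment.

# Hypothetical sketch of the document-type association database build (steps S101-S105).
# Endpoint path, field names and collection layout are illustrative assumptions.
import requests
from pymongo import MongoClient

UNIPROT_JSON = "https://rest.uniprot.org/uniprotkb/{acc}.json"  # assumed REST endpoint

client = MongoClient("mongodb://localhost:27017")
db = client["protein_assoc_db"]

def build_entry(uniprot_id: str) -> dict:
    """Fetch one UniProt entry and collect its PDB/AlphaFold/GO cross-references."""
    data = requests.get(UNIPROT_JSON.format(acc=uniprot_id), timeout=30).json()
    xrefs = data.get("uniProtKBCrossReferences", [])
    return {
        "_id": uniprot_id,                                              # primary key (S104)
        "pdb_ids": [x["id"] for x in xrefs if x.get("database") == "PDB"],
        "alphafold_ids": [x["id"] for x in xrefs if x.get("database") == "AlphaFoldDB"],
        "go_terms": [x["id"] for x in xrefs if x.get("database") == "GO"],
        "sequence": data.get("sequence", {}).get("value", ""),
    }

def insert_if_qualified(uniprot_id: str) -> bool:
    """Apply the S101 screening conditions and upsert the entry into MongoDB."""
    doc = build_entry(uniprot_id)
    has_structure = bool(doc["pdb_ids"]) or bool(doc["alphafold_ids"])
    if has_structure and len(doc["go_terms"]) >= 3:
        db.protein_main.replace_one({"_id": doc["_id"]}, doc, upsert=True)
        return True
    return False

# Example usage: insert_if_qualified("P05067")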
A reference data set for the protein function prediction task is obtained from the protein document-type association database. Specifically, in this embodiment 5000 protein entries with high-precision experimentally determined structures are screened from the 26335 protein data entries and recorded as Dataset1; a further 2000 protein entries with AlphaFold-predicted structures are screened and added to Dataset1, giving a mixed data set of 7000 entries recorded as Dataset2. To reduce the experimental cost, this embodiment only takes GO terms with experimental evidence codes (EXP, IDA, IPI, IMP, IGI, IEP, TAS, IC, HTP, HDA, HMP, HGI, HEP) as the protein gene ontology functional labels of a protein, and the extracted data set is divided into a training set, a validation set and a test set in the ratio 8:1:1, where the protein entries of the test set simultaneously possess functional labels of all three protein gene ontologies MF, BP and CC. The protein gene ontology functional label distribution of the data set is shown in Table 1.
TABLE 1 functional tag distribution of protein Gene ontologies of the data set extracted in the examples
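A rough sketch of the evidence-code filtering and the 8:1:1 split described above is shown below; the entry dictionaries are a hypothetical stand-in for the records actually extracted from the association database.

# Rough sketch of GO-label filtering by evidence code and the 8:1:1 split.
# The entry structure is hypothetical; only the codes and ratio come from the text.
import random

EXPERIMENTAL_CODES = {"EXP", "IDA", "IPI", "IMP", "IGI", "IEP", "TAS", "IC",
                      "HTP", "HDA", "HMP", "HGI", "HEP"}

def filter_go_labels(entries):
    """Keep only GO terms whose evidence code is in the experimental set."""
    for e in entries:
        e["go_terms"] = [t for t in e["go_terms"]
                         if t.get("evidence") in EXPERIMENTAL_CODES]
    return [e for e in entries if e["go_terms"]]

def split_8_1_1(entries, seed=42):
    """Shuffle and split into training / validation / test sets at 8:1:1."""
    rng = random.Random(seed)
    entries = entries[:]
    rng.shuffle(entries)
    n = len(entries)
    n_train, n_val = int(0.8 * n), int(0.1 * n)
    return (entries[:n_train],
            entries[n_train:n_train + n_val],
            entries[n_train + n_val:])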
The spatial contact relation of the Cα atoms of the residues in step S2 is obtained as follows: a protein is assumed to be composed of N amino acid residues; the protein spatial structure file is traversed and the spatial coordinates of the Cα atom of each amino acid residue are extracted from it; the spatial distances between all Cα atoms are calculated based on these coordinates, and a distance threshold is set to judge whether two residues are in contact, thereby obtaining an N × N amino acid residue contact matrix; then only the r edges having a contact relationship are kept and, according to the source node and target node of each edge, converted into an adjacency matrix of size 2 × 2r, which is taken as the amino acid residue contact graph.
The amino acid residue contact diagrams of different connection densities in the step S2 are respectively calculated according to the protein space structure file actually measured by the PDB and the amino acid residue contact diagram of the protein space structure file predicted by AlphaFold, and the amino acid residue contact diagrams representing different contact densities are calculated according to different distance threshold values. Specifically, for the protein entry with the Uniprot database identifier of P05067, the corresponding PDB experiment measurement structure index ID is 8OTF, and the alpha fold model prediction structure index ID is:
AF-P05067-F1, thereby obtaining protein space structure file of P05067, obtaining space coordinates of Ca atoms in each amino acid residue in the protein molecule by analyzing the file, the complete P05067 protein is composed of 770 amino acid residues, calculating space distance between Ca atoms and constructing amino acid residue contact matrix ,Representing dimensions, converting by taking only the side of the amino acid residue in contact matrix with contact relationAs an adjacency matrixAs a map of amino acid residue contact of protein P05067.
Specifically, the residue contact calculation may be defined as:
C(i, j) = 1 if d(i, j) < d0, and C(i, j) = 0 otherwise, where d(i, j) = ||x_i - x_j||;
here x_i and x_j are the three-dimensional coordinates of the Cα atoms of residues i and j extracted from the protein spatial structure file, d0 is the contact distance threshold, and d(i, j) is the spatial distance between the Cα atoms of residues i and j; two amino acid residues are considered to be in spatial contact when the calculated spatial distance is less than the threshold. In this embodiment, amino acid residue contact graphs are computed under the two contact thresholds of 8 Å and 10 Å, representing a sparse amino acid residue contact graph and a dense amino acid residue contact graph, respectively.
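The contact-graph construction defined above can be sketched as follows with NumPy and Biopython; the parsing details are a simplified, assumed reading of the structure file, but the thresholding follows the formula given above.

# Sketch of step S2: build Cα contact maps at two thresholds and convert them
# to edge-index form. PDB parsing via Biopython is an assumed implementation detail.
import numpy as np
from Bio.PDB import PDBParser

def ca_coordinates(pdb_path: str) -> np.ndarray:
    """Extract the Cα coordinate of every residue from a PDB structure file."""
    structure = PDBParser(QUIET=True).get_structure("protein", pdb_path)
    coords = [res["CA"].coord for res in structure.get_residues() if "CA" in res]
    return np.asarray(coords)                      # shape (N, 3)

def contact_edge_index(coords: np.ndarray, threshold: float) -> np.ndarray:
    """N x N contact matrix under `threshold` (in angstroms), returned as a 2 x E edge list."""
    diff = coords[:, None, :] - coords[None, :, :]
    dist = np.linalg.norm(diff, axis=-1)           # pairwise Cα distances
    contact = dist < threshold
    np.fill_diagonal(contact, False)               # drop self-contacts
    src, dst = np.nonzero(contact)                 # both directions of each contact
    return np.stack([src, dst])                    # shape (2, E)

# Example: sparse (8 Å) and dense (10 Å) residue contact graphs
# coords = ca_coordinates("8OTF.pdb")
# sparse_edges = contact_edge_index(coords, 8.0)
# dense_edges = contact_edge_index(coords, 10.0)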
As shown in FIG. 2, in step S3 the complete sequence or a sub-sequence of the protein is input into the amino acid sequence mixed feature extractor; a 6155-dimensional sequence embedded representation is first extracted by the Bi-LSTM network, a 1280-dimensional sequence embedded representation is then extracted by the ESM2 (Evolutionary Scale Modeling) network consisting of 33 layers of Transformer encoders, and together these form the primary sequence embedded representation of the amino acid sequence.
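For the ESM2 part of the extractor, the publicly available fair-esm package provides a 33-layer checkpoint whose final-layer representations are 1280-dimensional; the sketch below shows one plausible way to pull those residue embeddings, with the specific checkpoint name being an assumption consistent with the description above.

# Sketch of extracting 1280-dimensional residue embeddings from a 33-layer ESM2 model.
# The esm2_t33_650M_UR50D checkpoint is an assumed match for the description above.
import torch
import esm

model, alphabet = esm.pretrained.esm2_t33_650M_UR50D()
model.eval()
batch_converter = alphabet.get_batch_converter()

def esm2_residue_features(sequence: str) -> torch.Tensor:
    """Return per-residue ESM2 features of shape (L, 1280) for one amino acid sequence."""
    _, _, tokens = batch_converter([("query", sequence)])
    with torch.no_grad():
        out = model(tokens, repr_layers=[33])
    reps = out["representations"][33]          # (1, L + 2, 1280) incl. BOS/EOS tokens
    return reps[0, 1:len(sequence) + 1]        # strip special tokens -> (L, 1280)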
In step S3, feature fusion and enrichment operations are performed on the primary sequence embedded representation. When the amino acid sequence is very long, the overall computational complexity of the model increases significantly; to obtain more effective features while reducing the training cost, the primary embedded representation is sliced by category into the sequence one-hot encoding map, the residue-sequence deep context features and the residue biochemical feature encodings. A learnable nonlinear mapping is applied to each slice during training of the whole model, feature enrichment and dimensionality reduction are performed on the primary embedded representation by adjusting the dimension of each hidden layer, and the residue-level mixed feature map of the amino acid sequence is finally obtained through ReLU nonlinear mapping and feature fusion.
Specifically, for the protein entry with UPID P05067, the complete sequence (770 amino acids in total) is input into the amino acid sequence mixed feature extractor. The Bi-LSTM network extracts the sequence one-hot encoding map (20 standard amino acids plus one unknown-amino-acid class) and the residue-sequence deep context features, and the ESM2 network extracts the residue biochemical feature encodings, which together yield the primary sequence embedded representation. To avoid wasting resources when the number of amino acid residues is large, the primary embedded representation is sliced by category and passed through three learnable nonlinear mappings, with the hidden-layer dimensions of the linear transformations adjusted to 21, 512 and 320, respectively, so as to enrich the features and reduce their dimensionality. Finally, the three types of output features are fused after nonlinear ReLU mapping to obtain the residue-level mixed feature map of the amino acid sequence. In performing the above learnable nonlinear transformations, all weight parameters and the corresponding bias values are trained together with all parameters of the subsequent protein function prediction model.
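A minimal PyTorch sketch of the slice-wise learnable nonlinear mappings and the fusion described above is given below; apart from the 21-dimensional one-hot slice and the hidden sizes 21, 512 and 320 stated in the text, the slice widths are illustrative placeholders.

# Sketch of the slice-wise learnable nonlinear mappings and fusion (step S3).
# Slice widths other than the 21-dim one-hot part are illustrative placeholders.
import torch
import torch.nn as nn

class MixedFeatureFusion(nn.Module):
    def __init__(self, onehot_dim=21, context_dim=1024, bio_dim=1280):
        super().__init__()
        # one learnable projection per feature category, hidden sizes from the text
        self.proj_onehot = nn.Linear(onehot_dim, 21)
        self.proj_context = nn.Linear(context_dim, 512)
        self.proj_bio = nn.Linear(bio_dim, 320)
        self.act = nn.ReLU()

    def forward(self, primary: torch.Tensor) -> torch.Tensor:
        # primary: (L, onehot_dim + context_dim + bio_dim), sliced by category
        d1, d2 = self.proj_onehot.in_features, self.proj_context.in_features
        onehot, context, bio = primary[:, :d1], primary[:, d1:d1 + d2], primary[:, d1 + d2:]
        fused = torch.cat([self.act(self.proj_onehot(onehot)),
                           self.act(self.proj_context(context)),
                           self.act(self.proj_bio(bio))], dim=-1)
        return fused                              # residue-level mixed feature map (L, 853)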
As shown in FIG. 3, the three-channel combined graph neural network comprises a first feature channel formed by two GCN layers connected in series and a second feature channel formed by another identical two GCN layers connected in series; the first feature channel is parallel to the second feature channel, forming a 2 × 2 parallel dual-channel GCN network. A GATv2 network with a multi-head graph attention mechanism is introduced as the third feature channel and is connected in series after the parallel dual-channel GCN network. The activation functions of both the GCN networks and the GATv2 network are ReLU.
Specifically, the feature aggregation and propagation mode of the three-channel combined graph neural network is as follows: the residue-level mixed feature map of the amino acid sequence is used as the initial feature matrix input of the 2 × 2 parallel dual-channel GCN network; the sparse amino acid residue contact graph extracted with the 8 Å contact threshold is used as the adjacency matrix input of one feature channel of the parallel network, and the dense amino acid residue contact graph extracted with the 10 Å contact threshold is used as the adjacency matrix input of the other feature channel. The output features of the 2 × 2 parallel dual-channel GCN network are fused to obtain fusion features, which are fed into the third feature channel, i.e. used as the feature matrix input of the GATv2 network, while the sparse amino acid residue contact graph extracted with the 8 Å threshold is used as the adjacency matrix input of the GATv2 network. After further noise filtering and feature extraction on the fusion features, the GATv2 network produces the output node features of the three-channel combined graph neural network.
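Using PyTorch Geometric, the 2 × 2 parallel GCN channels followed by the serial GATv2 channel could be wired roughly as follows; the hidden sizes and the number of attention heads are assumptions, while the channel wiring follows the description above.

# Sketch of the three-channel combined GNN (step S4) with PyTorch Geometric.
# Hidden sizes and head count are assumptions; the channel wiring follows the text.
import torch
import torch.nn as nn
from torch_geometric.nn import GCNConv, GATv2Conv

class ThreeChannelGNN(nn.Module):
    def __init__(self, in_dim=853, hidden=512, heads=4):
        super().__init__()
        # channel 1: two GCN layers on the sparse (8 Å) contact graph
        self.gcn1a, self.gcn1b = GCNConv(in_dim, hidden), GCNConv(hidden, hidden)
        # channel 2: two GCN layers on the dense (10 Å) contact graph
        self.gcn2a, self.gcn2b = GCNConv(in_dim, hidden), GCNConv(hidden, hidden)
        # channel 3: GATv2 with multi-head attention, applied to the fused features
        self.gat = GATv2Conv(2 * hidden, hidden, heads=heads, concat=False)
        self.act = nn.ReLU()

    def forward(self, x, sparse_edges, dense_edges):
        h1 = self.act(self.gcn1b(self.act(self.gcn1a(x, sparse_edges)), sparse_edges))
        h2 = self.act(self.gcn2b(self.act(self.gcn2a(x, dense_edges)), dense_edges))
        fused = torch.cat([h1, h2], dim=-1)              # feature fusion of the two channels
        return self.act(self.gat(fused, sparse_edges))   # output node features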
In step S5, the process of performing the global pooling and aggregation operations on the output node features of the three-channel combined graph neural network is shown in FIG. 4;
mean pooling and sum pooling are applied separately to the output node features of the three-channel combined graph neural network, and the two kinds of pooled information are then aggregated to obtain more comprehensive node feature information.
Specifically, the global pooling operation may be defined as:
h_sum = Σ_{i=1..N} h_i;
h_mean = (1/N) Σ_{i=1..N} h_i;
h_agg = concat(h_sum, h_mean);
where h_i is the feature vector of the i-th node in the output node features, h_sum denotes the sum-pooled (accumulated) information, h_mean denotes the mean-pooled information, h_agg denotes the aggregated node features, N is the number of nodes, and concat denotes the aggregation (concatenation) operation.
The aggregated node features are then passed through batch normalization, ReLU activation and Dropout in sequence, mapped by a fully connected layer, and the feature output is mapped into a probability space of 0-1 by a sigmoid function, giving the probability value of each protein gene ontology functional label.
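The pooling and classification head shown in FIG. 4 can be sketched as follows; the hidden width, dropout rate and label count default are assumptions (the 482 MF labels mentioned later are used only as an example default).

# Sketch of the multi-label classifier (step S5): sum + mean pooling, concatenation,
# BatchNorm -> ReLU -> Dropout -> Linear -> sigmoid. Widths are assumptions.
import torch
import torch.nn as nn

class MultiLabelHead(nn.Module):
    def __init__(self, node_dim=512, num_go_labels=482, dropout=0.3):
        super().__init__()
        self.norm = nn.BatchNorm1d(2 * node_dim)
        self.drop = nn.Dropout(dropout)
        self.fc = nn.Linear(2 * node_dim, num_go_labels)

    @staticmethod
    def pool(node_feats: torch.Tensor) -> torch.Tensor:
        """Aggregate one protein's node features by sum and mean pooling."""
        return torch.cat([node_feats.sum(dim=0), node_feats.mean(dim=0)])

    def forward(self, pooled_batch: torch.Tensor) -> torch.Tensor:
        # pooled_batch: (B, 2 * node_dim), one pooled vector per protein in the batch
        h = self.drop(torch.relu(self.norm(pooled_batch)))
        return torch.sigmoid(self.fc(h))          # per-label probabilities in [0, 1]

# Usage sketch: stack MultiLabelHead.pool(node_feats) for each protein in a batch,
# then pass the stacked tensor through the head to obtain GO label probabilities.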
The protein function prediction model is pre-trained on the training set in step S6 as follows. The training set of the reference data set is selected as the training data of the protein function prediction model, and the MF, BP and CC labels are used respectively as the protein gene ontology functional labels during supervised learning. Adam is selected as the training optimizer with a learning rate of 0.0001, the binary cross-entropy function is selected as the training loss function, the batch size is set to 64, and the maximum number of iteration rounds (epochs) is set to 50. During training, the validation set is used to evaluate the performance of the protein function prediction model after each epoch: after each round of training, the model predicts the protein gene ontology functional labels of all protein entries in the validation set and the scores on three evaluation metrics, the area under the micro-averaged precision-recall curve (m-AUPR), the area under the macro-averaged precision-recall curve (M-AUPR) and the maximum F1 score (F-max), are computed; if a score improves over the previous round, the model pre-trained in that round is saved. Training stops when the maximum number of epochs is reached. Finally, the performance of the pre-trained protein function prediction model is evaluated on the test set.
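A compressed sketch of this training configuration (Adam, learning rate 0.0001, binary cross-entropy, batch size 64, 50 epochs) is shown below; the model, the data loaders and the metric helper are placeholders for the components described above.

# Sketch of the pre-training loop in step S6. Model, loaders and the metric helper
# are placeholders; optimizer, loss, batch size and epoch count come from the text.
import torch
import torch.nn as nn

def pretrain(model, train_loader, val_loader, epochs=50, lr=1e-4, device="cuda"):
    model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = nn.BCELoss()                      # binary cross-entropy on sigmoid outputs
    best_fmax = 0.0
    for epoch in range(epochs):
        model.train()
        for features, edge_sets, labels in train_loader:   # batches of 64 proteins
            optimizer.zero_grad()
            probs = model(features.to(device), *[e.to(device) for e in edge_sets])
            loss = criterion(probs, labels.to(device))
            loss.backward()
            optimizer.step()
        # after every epoch, score the validation set (m-AUPR, M-AUPR, F-max)
        fmax = evaluate_fmax(model, val_loader, device)     # assumed metric helper
        if fmax > best_fmax:
            best_fmax = fmax
            torch.save(model.state_dict(), "best_model.pt")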
The protein function prediction model provided by the invention is pre-trained separately on the training sets corresponding to Dataset1 and Dataset2 and tested on the test set of Dataset1; the scores of the model for predicting the MF labels (482 in total) on the three types of evaluation metrics are computed, and the highest score over 5 runs is taken. The results are shown in Table 2.
TABLE 2 Influence of expanding with AlphaFold-predicted protein structure data on model performance
The DeepFRI model and the protein function prediction model of the invention are evaluated and compared on the same data set, each pre-trained for 50 iteration rounds (epochs); the scores of each model on the three types of evaluation metrics are computed, the highest score over 5 runs is taken, and the prediction performance on the three types of protein gene ontology functional labels MF (517 in total), BP (2124 in total) and CC (408 in total) is discussed separately. The results are shown in Tables 3 and 4.
TABLE 3 DeepFRI model prediction of protein function
TABLE 4 prediction of protein function by protein function prediction model of the invention
As can be seen from the experimental results, the model of the invention outperforms the DeepFRI model in most cases, and using AlphaFold-predicted protein structure data as training input avoids the problems caused by protein entries lacking experimentally determined spatial structures and improves the generalization performance of the model. Meanwhile, the protein document-type association database provided by the invention greatly improves the retrieval efficiency of protein association information and facilitates downstream protein prediction tasks.
Finally, it should be noted that the foregoing description is only a preferred embodiment of the present invention, and the present invention is not limited to the foregoing embodiment, but may be modified or some of the technical features thereof may be substituted by those illustrated in the foregoing embodiment. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims (6)

1. The method for predicting protein functions based on transfer learning and three-channel combined GNN is characterized by comprising the following steps:
S1, acquiring an amino acid sequence, a protein space structure file and a protein gene ontology function tag of a protein from a protein public database, constructing a protein document type association database of a NoSQL type after cross-association and data cleaning of the acquired data, acquiring a reference data set of a protein function prediction task from the protein document type association database, and dividing the reference data set into a training set, a verification set and a test set;
S2, parsing the protein spatial structure file, extracting from it the spatial coordinates of the Cα atom of each amino acid residue in the protein molecule, and, based on the spatial contact relations of the Cα atoms of all residues, extracting amino acid residue contact graphs with different connection densities according to different contact thresholds, so as to map the complex spatial structure of the protein molecule;
S3, constructing an amino acid sequence mixed feature extractor based on transfer learning, and extracting multiple types of residue-level features from the amino acid sequence through the amino acid sequence mixed feature extractor to form a primary sequence embedded representation;
Inputting the complete sequence or a sub-sequence of the protein into the amino acid sequence mixed feature extractor, first extracting a sequence embedded representation with a Bi-LSTM network, then extracting a further sequence embedded representation with an ESM2 network consisting of 33 layers of Transformer encoders, and taking the sequence embedded representations extracted by the Bi-LSTM network and the ESM2 network together as the primary sequence embedded representation of the amino acid sequence;
Slicing the primary sequence embedded representation by category into a sequence one-hot encoding map, residue-sequence deep context features and residue biochemical feature encodings, applying a learnable nonlinear mapping to each slice, performing feature enrichment and dimensionality reduction on the primary embedded representation by adjusting the dimension of each hidden layer, and finally obtaining the residue-level mixed feature map of the amino acid sequence through ReLU nonlinear mapping and feature fusion operations;
S4, introducing a graph convolutional network and a GATv2 network with a multi-head graph attention mechanism to construct a three-channel combined graph neural network, taking the residue-level mixed feature map of the amino acid sequence as the initial feature matrix input of the three-channel combined graph neural network and the amino acid residue contact graphs as the adjacency matrix inputs of the different feature channels, capturing and learning the complex graph relationships among amino acid residues in the protein molecule, and performing protein function prediction by combining the residue-level mixed feature map of the amino acid sequence with multi-level structural features;
S5, constructing a multi-label classifier, carrying out global pooling and aggregation operation on the output node characteristics of the three-channel combined graph neural network, and mapping the aggregated characteristics into a probability space of 0-1 to obtain a probability value of each protein gene ontology functional label;
S6, forming a complete protein function prediction model by the sequence mixed characteristic extractor, the three-channel combined graph neural network and the multi-label classifier, pre-training the protein function prediction model on a training set, evaluating the performances of the protein function prediction model in different training stages by using a verification set and a test set, and performing protein function prediction by using a protein function prediction model qualified in training.
2. The method for predicting protein function based on transfer learning and a three-channel combined GNN according to claim 1, wherein the spatial contact relation of the Cα atoms of the residues in step S2 is obtained as follows: a protein is composed of N amino acid residues; the protein spatial structure file is traversed and the spatial coordinates of the Cα atom of each amino acid residue are extracted from it; the spatial distances between all Cα atoms are calculated based on these coordinates, and a distance threshold is set to judge whether two residues are in contact, thereby obtaining an N × N amino acid residue contact matrix; then only the r edges having a contact relationship are kept and, according to the source node and target node of each edge, converted into an adjacency matrix of size 2 × 2r, which is taken as the amino acid residue contact graph.
3. The method for predicting protein function based on transfer learning and a three-channel combined GNN according to claim 2, wherein the amino acid residue contact graphs of different connection densities in step S2 are obtained by computing contact graphs separately from the experimentally determined protein spatial structure files from the PDB and from the protein spatial structure files predicted by AlphaFold, and by computing contact graphs representing different contact densities according to different distance thresholds.
4. The method for predicting protein function based on transfer learning and a three-channel combined GNN according to claim 1, wherein the feature aggregation and propagation mode of the three-channel combined graph neural network is as follows: the residue-level mixed feature map of the amino acid sequence is used as the initial feature matrix input of the 2 × 2 parallel dual-channel GCN network; the sparse amino acid residue contact graph extracted with the smaller contact threshold is used as the adjacency matrix input of one feature channel of the parallel network, and the dense amino acid residue contact graph extracted with the larger contact threshold is used as the adjacency matrix input of the other feature channel; the output features of the 2 × 2 parallel dual-channel GCN network are fused to obtain fusion features, which are used as the feature matrix input of the GATv2 network, with the sparse amino acid residue contact graph as the adjacency matrix input of the GATv2 network; after further noise filtering and feature extraction on the fusion features, the GATv2 network produces the output node features of the three-channel combined graph neural network.
5. The method for predicting protein function based on transfer learning and a three-channel combined GNN according to claim 1, wherein in step S1 the NoSQL-type protein document-type association database is constructed based on MongoDB.
6. The method for predicting protein function based on transfer learning and a three-channel combined GNN according to claim 5, wherein the steps of constructing the NoSQL-type protein document-type association database are as follows:
Step S101, testing the open data interface of the Uniprot database using an interface test script and the test tool PostMan, and obtaining the Uniprot database identifiers corresponding to protein entries through screening conditions; each screened protein entry must have an experimentally determined protein spatial structure file or a protein spatial structure file predicted by the AlphaFold model, the protein gene ontology is selected as the functional label of the protein, the total number of protein gene ontology labels possessed by each screened protein entry must be greater than or equal to 3, and the screened Uniprot database identifiers are used as the source data IDs of the database;
Step S102, accessing the Uniprot database data interface through the screened Uniprot database identifiers, acquiring the cross-index IDs of the corresponding protein entries in the PDB and AlphaFold databases from the data packets returned by the interface, and at the same time associating the primary key target ID of the target protein information table of the ChEMBL database through the Uniprot database identifier and acquiring the remaining secondary table IDs in the ChEMBL database according to this association;
Step S103, requesting the data interfaces corresponding to each class of IDs, and parsing the XML and JSON data packets returned by the interfaces with the BeautifulSoup and jsonpath frameworks and regular expressions, to obtain a JSON dictionary composed of the protein association information in the Uniprot, PDB and AlphaFold databases and a JSON dictionary composed of the target protein association data in the ChEMBL database;
Step S104, based on the JSON dictionaries, constructing a protein information main table with the Pymongo framework using the Uniprot database identifier as the primary key, and constructing a ChEMBL association information table containing the target information using the primary key target ID as the primary key;
Step S105, cross-associating the primary key target ID in the ChEMBL association information table with the Uniprot database identifier in the protein information main table to obtain the final protein document-type association database, in which a user can retrieve the key information of a related protein entry in the Uniprot, PDB, AlphaFold and ChEMBL databases using only the Uniprot database identifier.
CN202411442388.0A 2024-10-16 2024-10-16 A method for predicting protein function based on transfer learning and three-channel combined GNN Active CN118969060B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202411442388.0A CN118969060B (en) 2024-10-16 2024-10-16 A method for predicting protein function based on transfer learning and three-channel combined GNN

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202411442388.0A CN118969060B (en) 2024-10-16 2024-10-16 A method for predicting protein function based on transfer learning and three-channel combined GNN

Publications (2)

Publication Number Publication Date
CN118969060A CN118969060A (en) 2024-11-15
CN118969060B true CN118969060B (en) 2025-01-21

Family

ID=93404010

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202411442388.0A Active CN118969060B (en) 2024-10-16 2024-10-16 A method for predicting protein function based on transfer learning and three-channel combined GNN

Country Status (1)

Country Link
CN (1) CN118969060B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119517171B (en) * 2025-01-20 2025-04-29 之江实验室 A method and device for mining and screening functional proteins

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111500611A (en) * 2020-04-14 2020-08-07 成都富岱生物医药有限公司 Method for attenuating and modifying ribosome inactivating protein for blocking receptor binding
CN112085247A (en) * 2020-07-22 2020-12-15 浙江工业大学 A Deep Learning-Based Method for Predicting Protein Residue Contacts

Family Cites Families (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP2566515B1 (en) * 2010-05-03 2017-08-02 aTyr Pharma, Inc. Innovative discovery of therapeutic, diagnostic, and antibody compositions related to protein fragments of arginyl-trna synthetases
CN112820350B (en) * 2021-03-18 2022-08-09 湖南工学院 Lysine propionylation prediction method and system based on transfer learning
CN113192559B (en) * 2021-05-08 2023-09-26 中山大学 Protein-protein interaction site prediction method based on deep graph convolution network
WO2023014912A1 (en) * 2021-08-05 2023-02-09 Illumina, Inc. Transfer learning-based use of protein contact maps for variant pathogenicity prediction
CN113868374B (en) * 2021-09-15 2024-04-12 西安交通大学 Graph convolution network biomedical information extraction method based on multi-head attention mechanism
CN114388064B (en) * 2021-12-15 2024-12-17 深圳先进技术研究院 Multi-modal information fusion method, system, terminal and storage medium for protein characterization learning
CN115545098B (en) * 2022-09-23 2023-09-08 青海师范大学 A node classification method for three-channel graph neural network based on attention mechanism
CN116417093A (en) * 2022-12-06 2023-07-11 苏州科技大学 A Drug-Target Interaction Prediction Method Combining Transformer and Graph Neural Network
CN116935951B (en) * 2023-06-09 2025-09-19 西安邮电大学 Method and system for identifying anticancer peptide based on attention mechanism and multi-granularity-level characteristics

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN111500611A (en) * 2020-04-14 2020-08-07 成都富岱生物医药有限公司 Method for attenuating and modifying ribosome inactivating protein for blocking receptor binding
CN112085247A (en) * 2020-07-22 2020-12-15 浙江工业大学 A Deep Learning-Based Method for Predicting Protein Residue Contacts

Also Published As

Publication number Publication date
CN118969060A (en) 2024-11-15

Similar Documents

Publication Publication Date Title
CN115472221B (en) A protein fitness prediction method based on deep learning
Jiang et al. Predicting protein function by multi-label correlated semi-supervised learning
CN114093515B (en) An age prediction method based on ensemble learning of intestinal flora prediction model
CN116417093A (en) A Drug-Target Interaction Prediction Method Combining Transformer and Graph Neural Network
Ma et al. MIDIA: exploring denoising autoencoders for missing data imputation
CN118969060B (en) A method for predicting protein function based on transfer learning and three-channel combined GNN
CN114708903A (en) Method for predicting distance between protein residues based on self-attention mechanism
Jabbar et al. An evolutionary algorithm for heart disease prediction
CN107291895A (en) A kind of quick stratification document searching method
CN111667880A (en) A protein residue contact map prediction method based on deep residual neural network
CN119691159B (en) Technological topic evolution stage prediction method and system based on multiple graph representation
CN115312125B (en) Deep learning method for predicting drug-target interaction based on biological substructure
CN119905145A (en) Method, medium and device for constructing a universal prediction model for classifying mutant enzymes-substrates
CN119580827A (en) Drug-target binding prediction method based on variational coding
CN119153100A (en) Disease risk characterization prediction system and method
CN117437976B (en) Disease risk screening method and system based on gene detection
CN112071362A (en) A method for detection of protein complexes that fuse global and local topologies
CN112735604B (en) Novel coronavirus classification method based on deep learning algorithm
Kelil et al. A general measure of similarity for categorical sequences
Bouzaachane Applying Face Recognition in Video Surveillance for Security Systems
Hu et al. Mining, modeling, and evaluation of subnetworks from large biomolecular networks and its comparison study
CN119811507B (en) Liquid-liquid phase separation protein prediction method and system based on multiple characteristics
CN120072105B (en) Task-specific drug molecule activity cliff prediction method
CN118016167B (en) Cell clustering method, device and medium for unbalanced single-cell RNA-seq data
CN118553303B (en) A method for representing single-cell transcriptomes based on multi-omics

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant