CN118969060B

CN118969060B - A method for predicting protein function based on transfer learning and three-channel combined GNN

Info

Publication number: CN118969060B
Application number: CN202411442388.0A
Authority: CN
Inventors: 黎晟; 石海鹏; 石海鹤; 曾子涵; 龚嘉盈; 陈彬
Original assignee: Jiangxi Normal University
Current assignee: Jiangxi Normal University
Priority date: 2024-10-16
Filing date: 2024-10-16
Publication date: 2025-01-21
Anticipated expiration: 2044-10-16
Also published as: CN118969060A

Abstract

The present invention belongs to the field of deep learning and bioinformatics, and discloses a method for predicting protein function based on transfer learning and three-channel combined GNN. The method extracts the contact graph of amino acid residues with different connection densities by parsing the protein spatial structure file; constructs an amino acid sequence mixed feature extractor based on transfer learning to obtain the residue-level mixed feature map of the amino acid sequence; introduces a graph convolutional network and a GATv2 network with a multi-head graph attention mechanism to construct a three-channel combined graph neural network for protein function prediction; and obtains the probability value of each protein gene ontology function label through a multi-label classifier. The present invention improves the protein function prediction performance and facilitates researchers to efficiently and comprehensively obtain the complex association information of protein entries in different authoritative databases.

Description

Method for predicting protein function based on transfer learning and three-channel combined GNN

Technical Field

The invention relates to the field of deep learning and bioinformatics, in particular to a method for predicting protein functions based on combination of transfer learning and three channels GNN.

Background

Proteins are necessary for vital activities and cellular metabolism, and research on functions of proteins directly clarifies various special mechanisms of the proteins in physiology or pathology, and reveals the processes of synthesis, degradation, processing and modification of the proteins, so that people can better understand the expression and regulation of genes. Protein functions are currently mainly measured by techniques such as enzyme activity measurement and cell biology experiments, and are classified into three main categories, namely molecular functions (Molecular Function, MF), biological processes (Biological Process, BP) and cell components (Cellular Component, CC) based on Gene Ontologies (GO).

With the development of protein sequencing technologies such as X-ray single crystal diffraction, a frozen electron microscope and the like and the development of structural analysis technologies thereof, and the application of a protein structure prediction model, known amino acid sequences and structure numbers have a great difference from corresponding functional annotations, and researchers are increasingly turning to a calculation method for performing functional prediction in order to avoid high equipment and time costs caused by directly measuring protein functions through experiments. The sequence and molecular structure characteristics of proteins determine to some extent the specific functions they possess, and by combining amino acid sequence or molecular structure information, the design of efficient and accurate protein function prediction methods has become an important point in recent years for protein function research.

Initially researchers have deduced the function of proteins by aligning the similarity and homology of sequences, i.e. sequences with high homology, which are also likely to be similar in function, and have transferred their known functional notes to similar proteins. Based on this theory, researchers have developed Blast2GO tools for functional annotation transfer and analysis using sequence similarity. There are also methods for predicting protein functions by comparing similarities between structures to infer that proteins have similar functions, and there are documents that suggest calculating similarities between protein structures by using a contact map.

Direct prediction of protein function based on sequence or structural similarity and homology is highly susceptible to error. With the excellent performance of deep learning technology in solving multi-domain problems, people begin to explore its application in protein function prediction. For example, the DeepGO model uses CNN to extract amino acid sequence features and combines with protein interaction networks extracted from the sting database to predict protein function. The DeepGO model only performs characteristic modeling on the sequence, and ignores the great influence of the molecular structure of the protein on the specific function possessed by the protein. Based on this, a model DeepFRI combining the function of molecular structure prediction was proposed by researchers, 10M sequences were sampled from the whole sequence set in Pfam database, and an LSTM-based protein language model was trained as a feature extractor for amino acid sequences. And (3) inputting the residue-level feature matrix of the amino acid sequence and the residue adjacent matrix extracted from the protein space structure file into a GCN graph convolution neural network to predict the protein function. However, LSTM has significant drawbacks when processing some long sequences of text data, and has limitations when processing some complex random-map data using GCN only. In experimental data selection DeepFRI used only the high-precision protein structure in the PDB protein database, as well as the homology-based structural MODEL constructed by SWISS-MODEL, but most proteins lacked these data. Meanwhile, in actual operation, since each protein database is not directly related, key information such as protein sequence, molecular structure, functional annotation and the like is required to be retrieved from different source databases through database cross references, which greatly prevents researchers from completely and conveniently obtaining relevant information of the proteins, and some databases providing downloading such as a protein-ligand interaction database ChEMBL are relational databases of MySQL type, and are lower in access convenience and retrieval efficiency than NoSQL databases of document type.

Disclosure of Invention

In order to solve the problems, the invention provides a method for predicting protein functions based on transfer learning and three-channel combination GNN, which improves prediction accuracy, and simultaneously provides a protein document type associated database, which is convenient for efficiently acquiring detailed associated information of related proteins in different protein databases, and based on the detailed associated information, a user can extract a reference data set of a protein downstream prediction task driven by data from the protein document type associated database.

The technical scheme adopted by the invention is that the method for predicting the protein function based on transfer learning and three-channel combined GNN comprises the following steps:

Step S1, acquiring an amino acid sequence of a protein, a protein space structure file and a protein Gene Ontology (GO) function tag from a protein public database, constructing a NoSQL type protein document type association database after cross-association and data cleaning of the acquired data, acquiring a reference data set of a protein function prediction task from the protein document type association database, and dividing the reference data set into a training set, a verification set and a test set;

S2, analyzing a protein space structure file, extracting space coordinates of Ca atoms in each amino acid residue in a protein molecule from the space coordinates, and extracting amino acid residue contact diagrams with different connection densities according to different contact thresholds based on space contact relations of Ca atoms in all residues so as to map a complex space structure of the protein molecule;

s3, constructing an amino acid sequence mixed feature extractor based on transfer learning, extracting multiple types of residue level features from the amino acid sequence through the amino acid sequence mixed feature extractor to form a sequence primary embedded representation;

s4, a graph convolution network and a GATv network with a multi-head graph attention mechanism are introduced to construct a three-channel combined graph neural network, a residue-level mixed feature map of an amino acid sequence is used as an initial feature matrix input of the three-channel combined graph neural network, an amino acid residue contact graph is used as an adjacent matrix input of different feature channels, a complex graph relationship among amino acid residues in protein molecules is captured and learned, and protein function prediction is carried out by combining the residue-level mixed feature map of the amino acid sequence and multi-level structural features;

S5, constructing a multi-label classifier, carrying out global pooling and aggregation operation on the output node characteristics of the three-channel combined graph neural network, and mapping the aggregated characteristics into a probability space of 0-1 to obtain a probability value of each protein gene ontology functional label;

S6, forming a complete protein function prediction model by the sequence mixed characteristic extractor, the three-channel combined graph neural network and the multi-label classifier, pre-training the protein function prediction model on a training set, evaluating the performances of the protein function prediction model in different training stages by using a verification set and a test set, and performing protein function prediction by using a protein function prediction model qualified in training.

Further preferably, the spatial contact relationship of Ca atoms in the residues in the step S2 is that a protein is composed of N amino acid residues, a protein spatial structure file is traversed, the spatial structure coordinates of Ca atoms in each amino acid residue are extracted from the protein spatial structure file, the spatial distance between all Ca atoms is calculated based on the coordinates, a distance threshold is set to judge whether two residues are contacted or not, and therefore an amino acid residue contact matrix of N x N is obtained, then only a side r with a contact relationship is taken, and an adjacent matrix with a distance matrix of 2 x 2r is converted according to a source node and a target node of the side r, and the adjacent matrix is taken as the amino acid residue contact graph.

Further preferably, the amino acid residue contact patterns of the different connection densities in step S2 are obtained by calculating the amino acid residue contact patterns of the protein space structure file actually measured by PDB and the protein space structure file predicted by AlphaFold, respectively, and calculating the amino acid residue contact patterns representing the different contact densities according to the different distance threshold values.

Further preferably, in step S3, the complete sequence or a certain segment of the subsequence of the protein is input into an amino acid sequence mixed feature extractor, the sequence extracted by the Bi-LSTM network is first represented by embedding, then the sequence extracted by the ESM2 network consisting of 33 layers of the transducer encoder is continuously extracted, and the sequence extracted by the Bi-LSTM network and the sequence extracted by the ESM2 network are represented by embedding together as a primary sequence of the amino acid sequence.

Further preferably, in the step S3, the feature fusion and enrichment operation is performed on the primary embedded representation of the sequence, wherein the primary embedded representation of the sequence is divided into a sequence single thermal coding mapping, a residue sequence depth context feature and a residue biochemical feature coding according to category slices, the slices are subjected to a leachable nonlinear mapping, the feature enrichment and dimension reduction are performed on the primary embedded representation of the sequence by adjusting the dimension of each hidden layer, and finally, the residue-level mixed feature mapping of the amino acid sequence is obtained through a ReLU nonlinear mapping and feature fusion operation.

Further preferably, the three-channel combined graph neural network comprises a first characteristic channel formed by two layers of GCN networks connected in series, a second characteristic channel formed by the other two layers of GCN networks connected in series, wherein the first characteristic channel is parallel to the second characteristic channel to form a2 x 2 parallel double-channel GCN network, a GATv network with a multi-head graph attention mechanism is introduced to form a third characteristic channel, and the GATv network is connected in series to the parallel double-channel GCN network.

The three-channel combined graph neural network is further preferably characterized in that a residue-level mixed feature map of an amino acid sequence is used as an initial feature matrix input of a2 x 2 parallel two-channel GCN network, a sparse amino acid residue contact graph extracted by a smaller contact threshold is used as an adjacent matrix input of one feature channel of the parallel network, a dense amino acid residue contact graph extracted by a larger contact threshold is used as an adjacent matrix input of the other feature channel, the output features of the 2 x 2 parallel two-channel GCN network are subjected to feature fusion to obtain fusion features, the fusion features are used as feature matrix input of the GATv network, the sparse amino acid residue contact graph is used as an adjacent matrix input of the GATv network, and the GATv network obtains output node features of the three-channel combined graph neural network after further noise filtering and feature extraction are performed on the fusion features.

Further preferably, in step S1, a NoSQL-type protein document type association database is constructed based on mongo db.

Further preferably, the step of constructing a protein document type association database of the NoSQL type is as follows:

Step S101, testing an open data interface of a Uniprot database by using an interface test script and a test tool PostMan, obtaining a Uniprot database identifier corresponding to a protein item through screening conditions, wherein the screened protein item needs to have a protein space structure file which is tested by an experiment or a protein space structure file which is predicted by a AlphaFold model, selecting a protein gene ontology as a functional label of protein, the total number of protein gene ontology labels which are possessed by each screened protein item is more than or equal to 3, and the screened Uniprot database identifier is used as a source data ID of the database;

Step S102, accessing a Uniprot database data interface through a Uniprot database identifier obtained by screening, acquiring a cross index ID of a corresponding protein item in a PDB, alphaFold database from a data packet returned by the interface, and simultaneously associating the main key target ID of a target protein information table of a ChEMBL database through the Uniprot database identifier, and acquiring the other secondary table IDs in a ChEMBL database according to the association;

Step S103, classifying data interfaces corresponding to the IDs are requested, and xml and JSON data packets returned by BeautifulSoup, jsonpath frames and regular expression analysis interfaces are used for obtaining a JSON dictionary composed of protein association information in Uniprot, PDB, alphaFold databases and a JSON dictionary composed of target protein association data in ChEMBL databases;

Step S104, based on the JSON dictionary, constructing a protein information main table by using a Pymongo frame by taking a Uniprot database identifier as a main key, and constructing a ChEMBL associated information table containing target information by taking a main key target ID as a main key;

Step S105, cross-correlating the main key target point ID in the ChEMBL associated information table with the Uniprot database identifier in the protein information main table to obtain a final protein document associated database, wherein a user can search key information of related protein items in the Uniprot, PDB, alphaFold, chEMBL database only through the Uniprot database identifier.

The beneficial effects of the invention are as follows:

(1) According to the method for constructing the NoSQL type protein document type association database based on the MongoDB, through the NoSQL type protein document type association database, a user can search detailed association data of related protein items only according to one main key target point ID without cross indexing in a plurality of databases, so that the search efficiency of protein association information is greatly improved. And compared with a directly downloaded complete MySQL type public database, the NoSQL type database has great improvement in access convenience.

(2) Most of the protein function prediction MODELs currently comprise DeepFRI high-precision protein structures in a PDB protein database and homology-based structural MODELs constructed by SWISS-MODEL, but most of the proteins lack corresponding data, so that an experimental data set obtained from a document type NoSQL protein association database can simultaneously have PDB high-precision structural data determined by experiments and structural data predicted by a AlphaFold MODEL, and the generalization performance of the MODEL can be improved while the data acquisition cost is reduced.

(3) The invention constructs an amino acid sequence mixed feature extractor based on transfer learning, wherein the transfer learning is based on two protein language models of Bi-LSTM and transfomer, and combines the sequence single thermal coding mapping, the residue sequence depth context feature and the residue biochemical feature coding to jointly form a sequence primary embedded representation of the amino acid sequence, and the sequence primary embedded representation is subjected to leachable nonlinear mapping in the model training process, so that the experimental cost is reduced, and the residue features are further enriched, thereby the feature expression capability of the obtained residue level mixed feature mapping is more powerful.

(4) The DeepFRI model architecture is formed by serially connecting GCN networks, and has certain limitation in processing complex irregular graphs such as protein space structures, the invention introduces a GATv network with a multi-head attention mechanism, allows a protein function prediction model to learn multiple groups of attention weights simultaneously, and each attention head can learn different attention weights, thereby enhancing the expression capability of the protein function prediction model on complex graph structures.

(5) The invention combines GCN network and GATv network with multi-head diagram attention mechanism to form a three-channel combined diagram neural network architecture, simultaneously learns two types of residue contact diagrams extracted by different contact thresholds through a2 x 2 parallel GCN network, and then further carries out noise filtration on the characteristics output by the two-channel parallel network through a GATv network connected in series. Meanwhile, in the final multi-label classification stage of the model, the invention enables the protein function prediction model to further acquire more node characteristics by fusing two types of global pooling information.

Drawings

In order to more clearly illustrate the technical solutions of the embodiments of the present invention, the drawings required for the description of the embodiments will be briefly described below, and it is apparent that the drawings in the following description are only some embodiments of the present invention, and other drawings may be obtained according to these drawings without inventive effort for a person skilled in the art.

Fig. 1 is a schematic diagram of the present invention.

FIG. 2 is a diagram of an amino acid sequence hybrid feature extractor architecture in accordance with the present invention.

Fig. 3 is a three-channel combined graph neural network architecture diagram in the present invention.

Fig. 4 is a diagram of a multi-label classifier architecture in accordance with the present invention.

Detailed Description

The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.

As shown in fig. 1, the method for predicting protein functions based on transfer learning and three-channel combined GNN comprises the following steps:

Step S1, acquiring an amino acid sequence of protein, a protein space structure file and a protein Gene Ontology (GO) functional label from a protein public database, constructing a protein document type association database of a NoSQL type based on MongoDB after cross-linking and data cleaning of the acquired data, acquiring a reference data set of a protein function prediction task from the protein document type association database, dividing the reference data set into a training set, a verification set and a test set, wherein the training set is used for training a protein function prediction model provided by the invention, the verification set is used for evaluating the concrete performance of each round (epoch) in the model training process, and the test set is used for testing the overall performance of the model after training is finished. The protein space structure file comprises protein space structure files which are predicted by experimental determination and AlphaFold model;

The step S1 of constructing a protein document type association database of a NoSQL type based on MongoDB comprises the following steps:

Step S101, testing an open data interface of a Uniprot database by using an interface test script and a test tool PostMan, obtaining a Uniprot database identifier corresponding to a protein item through screening conditions, wherein the screened protein item is required to have a protein space structure file which is determined by experiments or a protein space structure file which is predicted by a AlphaFold model, the length of the amino acid sequence of each protein is not more than 1022 (the ESM network processes the amino acid sequence of the protein consisting of 1022 residues at most once), a protein Gene Ontology (GO) is selected as a functional label of the protein, the total number of the protein gene ontology labels of each protein item which is screened out is more than or equal to 3, and 26335 Uniprot database identifiers which meet the conditions are screened out in the embodiment as source data IDs of the database.

Step S102, accessing a Uniprot database data interface through a Uniprot database identifier obtained by screening, acquiring a cross index ID of a corresponding protein item in a PDB, alphaFold database from a data packet returned by the interface, associating a primary key Target (Target) ID of a Target protein information table of a ChEMBL database through the Uniprot database identifier, and acquiring the rest secondary table IDs in a ChEMBL database according to the association;

The method comprises the steps of obtaining a basic data set of a protein function prediction task from a protein document type associated database, specifically, screening 5000 protein items with a high-precision experimental measurement structure from 26335 protein data items in the embodiment, namely Dataset, screening 2000 protein items with a AlphaFold model measurement structure, expanding the protein items to Dataset1, and obtaining a 7000 mixed data set marked as Dataset 2. In order to reduce the experimental cost, the embodiment only takes the GO term with evidence coding (EXP, IDA, IPI, IMP, IGI, IEP, TAS, IC, HTP, HDA, HMP, HGI, HEP) as the functional label of the protein gene ontology of the protein, and divides the extracted data set into a training set, a verification set and a test set according to the proportion of 8:1:1, wherein the protein items of the test set simultaneously have the functional labels of the three protein gene ontologies of MF, BP and CC, and the distribution of the functional labels of the protein gene ontologies of the data set is shown in Table 1.

TABLE 1 functional tag distribution of protein Gene ontologies of the data set extracted in the examples

The spatial contact relation of Ca atoms in the residues in the step S2 is that a protein is assumed to be composed of N amino acid residues, a protein spatial structure file is traversed, the spatial structure coordinates of the Ca atoms in each amino acid residue are extracted from the protein spatial structure file, the spatial distance between all Ca atoms is calculated based on the coordinates, a distance threshold is set to judge whether the two residues are in contact or not, and therefore an N-by-N amino acid residue contact matrix is obtained, then only a side r with the contact relation is taken, and a 2-by-2 r adjacent matrix is converted according to a source node and a target node of the side r, and is taken as the amino acid residue contact diagram.

The amino acid residue contact diagrams of different connection densities in the step S2 are respectively calculated according to the protein space structure file actually measured by the PDB and the amino acid residue contact diagram of the protein space structure file predicted by AlphaFold, and the amino acid residue contact diagrams representing different contact densities are calculated according to different distance threshold values. Specifically, for the protein entry with the Uniprot database identifier of P05067, the corresponding PDB experiment measurement structure index ID is 8OTF, and the alpha fold model prediction structure index ID is:

AF-P05067-F1, thereby obtaining protein space structure file of P05067, obtaining space coordinates of Ca atoms in each amino acid residue in the protein molecule by analyzing the file, the complete P05067 protein is composed of 770 amino acid residues, calculating space distance between Ca atoms and constructing amino acid residue contact matrix ,Representing dimensions, converting by taking only the side of the amino acid residue in contact matrix with contact relationAs an adjacency matrixAs a map of amino acid residue contact of protein P05067.

Specifically, the residue contact calculation formula may be defined as:

;

Wherein the method comprises the steps of Residues extracted from protein space structure filesSum residuesThe three-dimensional coordinates of the Ca atoms in (c),In order to contact the distance threshold value,Representing residuesSum residuesThe spatial distance of Ca atoms in the sequence is considered to be in spatial contact with two amino acid residues when the calculated spatial distance is less than the threshold. In this example, two types of contact patterns of amino acid residues under the contact threshold values 8 a and 10 a are calculated, representing a sparse amino acid residue contact pattern and a dense amino acid residue contact pattern, respectively.

As shown in fig. 2. In step S3, the complete sequence or a certain segment of subsequence of the protein is input into an amino acid sequence mixed characteristic extractor, a total 6155-dimensional sequence embedded representation is extracted from a Bi-LSTM network, then an ESM2 network (Evolutionary Scale Modelling) consisting of 33 layers of transformers encoders is continuously extracted to obtain a total 1280-dimensional sequence embedded representation, and the total sequence embedded representation is taken as a sequence primary embedded representation of the amino acid sequence.

In the step S3, the feature fusion and enrichment operation is carried out on the primary embedded representation of the sequence, namely when the length of the amino acid sequence is overlong, the overall calculation complexity of the model is obviously improved, and in order to obtain more effective features while the training cost is reduced, the primary embedded representation of the sequence is divided into sequence single-heat coding mapping, residue sequence depth context feature and residue biochemical feature coding according to class slices. And (3) carrying out leavable nonlinear mapping on the slices in the training process of the whole model, carrying out feature enrichment and dimension reduction on the primary embedded representation of the sequence by adjusting the dimension of each hidden layer, and finally obtaining residue-level mixed feature mapping of the amino acid sequence through ReLU nonlinear mapping and feature fusion operation.

Specifically, for the protein item with UPID of P05067, the complete sequence (770 amino acids total) is input into an amino acid sequence mixed characteristic extractor, and the Bi-LSTM network extracts the sequence single thermal coding mapping(20 Standard amino acids and a class of unknown amino acids) and residue sequence depth context characteristicsThe ESM2 network will extract its residue biochemical characteristic codeThereby obtaining a primary embedded representation of the sequence,Representing the dimension. In order to avoid resource waste caused by excessive number of amino acid residues, the method is toAnd slicing according to the category, performing three-time learnable nonlinear mapping, and performing feature enrichment and dimension reduction on the primary embedded representation of the sequence by adjusting the dimension of each hidden layer to be 21, 512 and 320 when the dimension is linearly changed. Finally, performing feature fusion on the three types of output features after non-linear ReLU mapping to obtain residue-level mixed feature mapping of the amino acid sequenceIn performing the above-described learnable nonlinear transformation, all weight parameters and corresponding bias values will be trained with all parameters in the subsequent whole protein function prediction model.

As shown in FIG. 3, the three-channel combined graph neural network comprises a first characteristic channel formed by two layers of GCN networks connected in series, a second characteristic channel formed by the other same two layers of GCN networks connected in series, wherein the first characteristic channel is parallel to the second characteristic channel to form a 2X 2 parallel double-channel GCN network, a GATv network with a multi-head graph attention mechanism is introduced to form a third characteristic channel, and the GATv network is connected in series to the parallel double-channel GCN network. The activation functions of the GCN network and GATv network employ a ReLU activation function.

Specifically, the three-channel combined graph neural network has characteristic aggregation and propagation modes that residue-level mixed characteristic mapping of amino acid sequences is usedAs an initial feature matrix input of the 2 x2 parallel dual-channel GCN network, a sparse amino acid residue contact diagram extracted by an 8 a contact threshold is used as an adjacent matrix input of one feature channel of the parallel network, and a dense amino acid residue contact diagram extracted by a 10a contact threshold is used as an adjacent matrix input of the other feature channel. The output characteristics of the 2 x2 parallel double-channel GCN network are subjected to characteristic fusion to obtain fusion characteristicsTaking the amino acid residue as a third characteristic channel, namely, inputting a characteristic matrix of GATv network, simultaneously taking a sparse amino acid residue contact diagram extracted by using an 8A threshold as an adjacent matrix input of GATv network, wherein the GATv network is a natural matrix of the GATv networkFurther noise filtering and feature extraction are carried out to obtain the output node features of the three-channel combined graph neural network。

In step S5, the process of global pooling and aggregation operation is performed on the output node characteristics of the three-channel combined graph neural network is shown in fig. 4;

Output node characteristics for three-channel combined graph neural network And respectively carrying out average pooling and accumulation pooling, and then aggregating two kinds of pooling information to obtain more comprehensive node characteristic information.

Specifically, the global pooling operation may be specifically defined as:

;

Wherein, Is the first in the output node characteristicsThe feature vectors of the individual nodes are used,Representing the accumulated pooling information is presented,Represents the information of the mean value pool,Indicating the characteristics of the aggregated nodes, N is the number of the nodes, and concat indicates the aggregation operation.

The node characteristics after aggregationAnd carrying out batch normalization, reLU activation and Dropout operation in sequence, mapping a full connection layer, and mapping the feature output into a probability space of 0-1 by using a sigmoid function to obtain the probability value of each protein gene ontology functional label.

The method for pre-training the protein function prediction model on the training set in the step S6 comprises the steps of selecting the training set in the reference data set as training data of the protein function prediction model, respectively taking MF, BP and CC labels as functional labels of protein gene ontology during supervised learning, selecting Adam as a training optimizer, setting the learning rate to be 0.0001, selecting a binary cross entropy function as a loss function during training, setting the size of a training Batch (Batch size) to be 64, and setting the maximum iteration round (epoch) to be 50. And in the training process, using a verification set to evaluate the protein function prediction model performance after each iteration round (epoch), namely after each round of training is finished, the protein function prediction model predicts protein gene ontology function labels of all protein items in the verification set, calculates the area (M-AUPR) under a micro precision recall curve, the area (M-AUPR) under the macro precision recall curve and the score on three evaluation indexes of the maximum F1 score (F-MAX) respectively, and if the score is improved compared with the previous round, stores the protein function prediction model pre-trained in the round. Training is stopped after the maximum iteration round (epoch) is reached. Finally, the performance of the pre-trained protein function prediction model is evaluated on a test set.

The protein function prediction model provided by the invention is used for respectively pre-training under training sets corresponding to Dataset and Dataset, testing under a testing set of Dataset1, respectively calculating the scores of the protein function prediction model for predicting MF labels (482 in total) on three types of evaluation indexes, and taking the highest score in 5 tests, wherein the results are shown in Table 2.

TABLE 2 expansion AlphaFold predicted influence of protein Structure data on model Performance

Model evaluation and comparison are carried out on the DeepFRI models and the protein function prediction model of the invention under a dataset Dataset, 50 iterative rounds (epochs) are pre-trained, the scores of the models on three types of evaluation indexes are calculated respectively, the highest scores in 5 tests are taken, the predicted performances of the models on three types of protein gene ontology function labels of MF (total 517), BP (total 2124) and CC (total 408) are discussed respectively, and the results are shown in tables 3 and 4.

TABLE 3 DeepFRI model prediction of protein function

TABLE 4 prediction of protein function by protein function prediction model of the invention

As can be seen from the experimental results, the model is superior to the DeepFRI model in most cases, and the AlphaFold predicted protein structure data is used as the training input of the model, so that the problem caused by the lack of the spatial structure of experimental measurement of protein items is avoided, and the generalization performance of the model is improved. Meanwhile, the protein document type associated database provided by the invention can greatly improve the retrieval efficiency of protein associated information and is convenient for the downstream prediction task of the protein.

Finally, it should be noted that the foregoing description is only a preferred embodiment of the present invention, and the present invention is not limited to the foregoing embodiment, but may be modified or some of the technical features thereof may be substituted by those illustrated in the foregoing embodiment. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.

Claims

1. The method for predicting protein functions based on transfer learning and three-channel combined GNN is characterized by comprising the following steps:

S1, acquiring an amino acid sequence, a protein space structure file and a protein gene ontology function tag of a protein from a protein public database, constructing a protein document type association database of a NoSQL type after cross-association and data cleaning of the acquired data, acquiring a reference data set of a protein function prediction task from the protein document type association database, and dividing the reference data set into a training set, a verification set and a test set;

Inputting the complete sequence or a certain segment of subsequence of the protein into an amino acid sequence mixed characteristic extractor, firstly extracting sequence embedded representation by a Bi-LSTM network, then continuously extracting sequence embedded representation by an ESM2 network consisting of 33 layers of transformers, and taking the sequence embedded representation extracted by the Bi-LSTM network and the ESM2 network together as sequence primary embedded representation of the amino acid sequence;

The sequence primary embedded representation is divided into sequence single-heat coding mapping, residue sequence depth context feature and residue biochemical characteristic coding according to category slices, leachable nonlinear mapping is carried out on the slices, feature enrichment and dimension reduction are carried out on the sequence primary embedded representation by adjusting the dimension of each hidden layer, and finally, residue level mixed feature mapping of the amino acid sequence is obtained through ReLU nonlinear mapping and feature fusion operation;

S4, a graph rolling network and a GATv network with a multi-head graph attention mechanism are introduced to construct a three-channel combined graph neural network, a residue-level mixed characteristic map of an amino acid sequence is used as an initial characteristic matrix input of the three-channel combined graph neural network, an amino acid residue contact graph is used as an adjacent matrix input of different characteristic channels, complex graph relations among amino acid residues in protein molecules are captured and learned, and protein function prediction is carried out by combining the residue-level mixed characteristic map of the amino acid sequence and multi-stage structural characteristics;

2. The method for predicting protein function based on combination of transfer learning and three-way GNN according to claim 1, wherein the spatial contact relation of Ca atoms in the residues in step S2 is that a protein is composed of N amino acid residues, the spatial structure file of the protein is traversed, the spatial structure coordinates of Ca atoms in each amino acid residue are extracted therefrom, the spatial distance between all Ca atoms is calculated based on the coordinates, a distance threshold is set to determine whether or not to contact between two residues, thereby obtaining an N x N amino acid residue contact matrix, then, only edge r with contact relation is taken, and an adjacent matrix with a distance matrix of 2 x 2r is converted according to the source node and the target node of the edge r, which is taken as the amino acid residue contact graph.

3. The method for predicting protein function based on combination of transfer learning and three-way GNN according to claim 2, wherein the amino acid residue contact patterns of different linkage densities in step S2 are amino acid residue contact patterns of protein space structure file and AlphaFold predicted protein space structure file actually measured by PDB are calculated separately, and amino acid residue contact patterns representing different contact densities are calculated according to different distance threshold values.

4. The method for predicting protein functions based on transfer learning and three-channel combined GNN according to claim 1 is characterized in that the characteristic aggregation and propagation mode of the three-channel combined graph neural network is that a residue-level mixed characteristic map of an amino acid sequence is used as an initial characteristic matrix input of a2 x 2 parallel two-channel GCN network, a sparse amino acid residue contact graph extracted by a smaller contact threshold is used as an adjacent matrix input of one characteristic channel of the parallel network, a dense amino acid residue contact graph extracted by a larger contact threshold is used as an adjacent matrix input of the other characteristic channel, the output characteristics of the 2 x 2 parallel two-channel GCN network are subjected to characteristic fusion to obtain fusion characteristics, the fusion characteristics are used as a characteristic matrix input of a GATv network, a sparse amino acid residue contact graph is used as an adjacent matrix input of the GATv network, and the GATv network obtains output node characteristics of the three-channel combined graph neural network after further noise filtering and characteristic extraction of the fusion characteristics.

5. The method for predicting protein function based on combination of transfer learning and three-way GNN according to claim 1, wherein in step S1, a NoSQL-type protein document type association database is constructed based on mongo db.

6. The method for predicting protein functionality based on combination of transfer learning and three-way GNN of claim 5, wherein the step of constructing a NoSQL-type protein document-type relational database is as follows:

Step S103, classifying data interfaces corresponding to the request types of IDs, and analyzing xml and JSON data packets returned by the interfaces through BeautifulSoup, jsonpath frames and regular expressions to obtain a JSON dictionary formed by protein association information in three databases Uniprot, PDB, alphaFold and a JSON dictionary formed by target protein association data in a ChEMBL database;

Step S105, cross-correlating the main key target point ID in the ChEMBL associated information table with the Uniprot database identifier in the protein information main table to obtain a final protein document associated database, and searching key information of related protein items in the Uniprot, PDB, alphaFold, chEMBL database only through the Uniprot database identifier by a user.