Disclosure of Invention
In order to solve the problems, the invention provides a method for predicting protein functions based on transfer learning and three-channel combination GNN, which improves prediction accuracy, and simultaneously provides a protein document type associated database, which is convenient for efficiently acquiring detailed associated information of related proteins in different protein databases, and based on the detailed associated information, a user can extract a reference data set of a protein downstream prediction task driven by data from the protein document type associated database.
The technical scheme adopted by the invention is that the method for predicting the protein function based on transfer learning and three-channel combined GNN comprises the following steps:
Step S1, acquiring an amino acid sequence of a protein, a protein space structure file and a protein Gene Ontology (GO) function tag from a protein public database, constructing a NoSQL type protein document type association database after cross-association and data cleaning of the acquired data, acquiring a reference data set of a protein function prediction task from the protein document type association database, and dividing the reference data set into a training set, a verification set and a test set;
S2, analyzing a protein space structure file, extracting space coordinates of Ca atoms in each amino acid residue in a protein molecule from the space coordinates, and extracting amino acid residue contact diagrams with different connection densities according to different contact thresholds based on space contact relations of Ca atoms in all residues so as to map a complex space structure of the protein molecule;
s3, constructing an amino acid sequence mixed feature extractor based on transfer learning, extracting multiple types of residue level features from the amino acid sequence through the amino acid sequence mixed feature extractor to form a sequence primary embedded representation;
s4, a graph convolution network and a GATv network with a multi-head graph attention mechanism are introduced to construct a three-channel combined graph neural network, a residue-level mixed feature map of an amino acid sequence is used as an initial feature matrix input of the three-channel combined graph neural network, an amino acid residue contact graph is used as an adjacent matrix input of different feature channels, a complex graph relationship among amino acid residues in protein molecules is captured and learned, and protein function prediction is carried out by combining the residue-level mixed feature map of the amino acid sequence and multi-level structural features;
S5, constructing a multi-label classifier, carrying out global pooling and aggregation operation on the output node characteristics of the three-channel combined graph neural network, and mapping the aggregated characteristics into a probability space of 0-1 to obtain a probability value of each protein gene ontology functional label;
S6, forming a complete protein function prediction model by the sequence mixed characteristic extractor, the three-channel combined graph neural network and the multi-label classifier, pre-training the protein function prediction model on a training set, evaluating the performances of the protein function prediction model in different training stages by using a verification set and a test set, and performing protein function prediction by using a protein function prediction model qualified in training.
Further preferably, the spatial contact relationship of Ca atoms in the residues in the step S2 is that a protein is composed of N amino acid residues, a protein spatial structure file is traversed, the spatial structure coordinates of Ca atoms in each amino acid residue are extracted from the protein spatial structure file, the spatial distance between all Ca atoms is calculated based on the coordinates, a distance threshold is set to judge whether two residues are contacted or not, and therefore an amino acid residue contact matrix of N x N is obtained, then only a side r with a contact relationship is taken, and an adjacent matrix with a distance matrix of 2 x 2r is converted according to a source node and a target node of the side r, and the adjacent matrix is taken as the amino acid residue contact graph.
Further preferably, the amino acid residue contact patterns of the different connection densities in step S2 are obtained by calculating the amino acid residue contact patterns of the protein space structure file actually measured by PDB and the protein space structure file predicted by AlphaFold, respectively, and calculating the amino acid residue contact patterns representing the different contact densities according to the different distance threshold values.
Further preferably, in step S3, the complete sequence or a certain segment of the subsequence of the protein is input into an amino acid sequence mixed feature extractor, the sequence extracted by the Bi-LSTM network is first represented by embedding, then the sequence extracted by the ESM2 network consisting of 33 layers of the transducer encoder is continuously extracted, and the sequence extracted by the Bi-LSTM network and the sequence extracted by the ESM2 network are represented by embedding together as a primary sequence of the amino acid sequence.
Further preferably, in the step S3, the feature fusion and enrichment operation is performed on the primary embedded representation of the sequence, wherein the primary embedded representation of the sequence is divided into a sequence single thermal coding mapping, a residue sequence depth context feature and a residue biochemical feature coding according to category slices, the slices are subjected to a leachable nonlinear mapping, the feature enrichment and dimension reduction are performed on the primary embedded representation of the sequence by adjusting the dimension of each hidden layer, and finally, the residue-level mixed feature mapping of the amino acid sequence is obtained through a ReLU nonlinear mapping and feature fusion operation.
Further preferably, the three-channel combined graph neural network comprises a first characteristic channel formed by two layers of GCN networks connected in series, a second characteristic channel formed by the other two layers of GCN networks connected in series, wherein the first characteristic channel is parallel to the second characteristic channel to form a2 x 2 parallel double-channel GCN network, a GATv network with a multi-head graph attention mechanism is introduced to form a third characteristic channel, and the GATv network is connected in series to the parallel double-channel GCN network.
The three-channel combined graph neural network is further preferably characterized in that a residue-level mixed feature map of an amino acid sequence is used as an initial feature matrix input of a2 x 2 parallel two-channel GCN network, a sparse amino acid residue contact graph extracted by a smaller contact threshold is used as an adjacent matrix input of one feature channel of the parallel network, a dense amino acid residue contact graph extracted by a larger contact threshold is used as an adjacent matrix input of the other feature channel, the output features of the 2 x 2 parallel two-channel GCN network are subjected to feature fusion to obtain fusion features, the fusion features are used as feature matrix input of the GATv network, the sparse amino acid residue contact graph is used as an adjacent matrix input of the GATv network, and the GATv network obtains output node features of the three-channel combined graph neural network after further noise filtering and feature extraction are performed on the fusion features.
Further preferably, in step S1, a NoSQL-type protein document type association database is constructed based on mongo db.
Further preferably, the step of constructing a protein document type association database of the NoSQL type is as follows:
Step S101, testing an open data interface of a Uniprot database by using an interface test script and a test tool PostMan, obtaining a Uniprot database identifier corresponding to a protein item through screening conditions, wherein the screened protein item needs to have a protein space structure file which is tested by an experiment or a protein space structure file which is predicted by a AlphaFold model, selecting a protein gene ontology as a functional label of protein, the total number of protein gene ontology labels which are possessed by each screened protein item is more than or equal to 3, and the screened Uniprot database identifier is used as a source data ID of the database;
Step S102, accessing a Uniprot database data interface through a Uniprot database identifier obtained by screening, acquiring a cross index ID of a corresponding protein item in a PDB, alphaFold database from a data packet returned by the interface, and simultaneously associating the main key target ID of a target protein information table of a ChEMBL database through the Uniprot database identifier, and acquiring the other secondary table IDs in a ChEMBL database according to the association;
Step S103, classifying data interfaces corresponding to the IDs are requested, and xml and JSON data packets returned by BeautifulSoup, jsonpath frames and regular expression analysis interfaces are used for obtaining a JSON dictionary composed of protein association information in Uniprot, PDB, alphaFold databases and a JSON dictionary composed of target protein association data in ChEMBL databases;
Step S104, based on the JSON dictionary, constructing a protein information main table by using a Pymongo frame by taking a Uniprot database identifier as a main key, and constructing a ChEMBL associated information table containing target information by taking a main key target ID as a main key;
Step S105, cross-correlating the main key target point ID in the ChEMBL associated information table with the Uniprot database identifier in the protein information main table to obtain a final protein document associated database, wherein a user can search key information of related protein items in the Uniprot, PDB, alphaFold, chEMBL database only through the Uniprot database identifier.
The beneficial effects of the invention are as follows:
(1) According to the method for constructing the NoSQL type protein document type association database based on the MongoDB, through the NoSQL type protein document type association database, a user can search detailed association data of related protein items only according to one main key target point ID without cross indexing in a plurality of databases, so that the search efficiency of protein association information is greatly improved. And compared with a directly downloaded complete MySQL type public database, the NoSQL type database has great improvement in access convenience.
(2) Most of the protein function prediction MODELs currently comprise DeepFRI high-precision protein structures in a PDB protein database and homology-based structural MODELs constructed by SWISS-MODEL, but most of the proteins lack corresponding data, so that an experimental data set obtained from a document type NoSQL protein association database can simultaneously have PDB high-precision structural data determined by experiments and structural data predicted by a AlphaFold MODEL, and the generalization performance of the MODEL can be improved while the data acquisition cost is reduced.
(3) The invention constructs an amino acid sequence mixed feature extractor based on transfer learning, wherein the transfer learning is based on two protein language models of Bi-LSTM and transfomer, and combines the sequence single thermal coding mapping, the residue sequence depth context feature and the residue biochemical feature coding to jointly form a sequence primary embedded representation of the amino acid sequence, and the sequence primary embedded representation is subjected to leachable nonlinear mapping in the model training process, so that the experimental cost is reduced, and the residue features are further enriched, thereby the feature expression capability of the obtained residue level mixed feature mapping is more powerful.
(4) The DeepFRI model architecture is formed by serially connecting GCN networks, and has certain limitation in processing complex irregular graphs such as protein space structures, the invention introduces a GATv network with a multi-head attention mechanism, allows a protein function prediction model to learn multiple groups of attention weights simultaneously, and each attention head can learn different attention weights, thereby enhancing the expression capability of the protein function prediction model on complex graph structures.
(5) The invention combines GCN network and GATv network with multi-head diagram attention mechanism to form a three-channel combined diagram neural network architecture, simultaneously learns two types of residue contact diagrams extracted by different contact thresholds through a2 x 2 parallel GCN network, and then further carries out noise filtration on the characteristics output by the two-channel parallel network through a GATv network connected in series. Meanwhile, in the final multi-label classification stage of the model, the invention enables the protein function prediction model to further acquire more node characteristics by fusing two types of global pooling information.
Detailed Description
The following describes in further detail the embodiments of the present invention with reference to the drawings and examples. The following examples are illustrative of the invention and are not intended to limit the scope of the invention.
As shown in fig. 1, the method for predicting protein functions based on transfer learning and three-channel combined GNN comprises the following steps:
Step S1, acquiring an amino acid sequence of protein, a protein space structure file and a protein Gene Ontology (GO) functional label from a protein public database, constructing a protein document type association database of a NoSQL type based on MongoDB after cross-linking and data cleaning of the acquired data, acquiring a reference data set of a protein function prediction task from the protein document type association database, dividing the reference data set into a training set, a verification set and a test set, wherein the training set is used for training a protein function prediction model provided by the invention, the verification set is used for evaluating the concrete performance of each round (epoch) in the model training process, and the test set is used for testing the overall performance of the model after training is finished. The protein space structure file comprises protein space structure files which are predicted by experimental determination and AlphaFold model;
S2, analyzing a protein space structure file, extracting space coordinates of Ca atoms in each amino acid residue in a protein molecule from the space coordinates, and extracting amino acid residue contact diagrams with different connection densities according to different contact thresholds based on space contact relations of Ca atoms in all residues so as to map a complex space structure of the protein molecule;
s3, constructing an amino acid sequence mixed feature extractor based on transfer learning, extracting multiple types of residue level features from the amino acid sequence through the amino acid sequence mixed feature extractor to form a sequence primary embedded representation;
s4, a graph convolution network and a GATv network with a multi-head graph attention mechanism are introduced to construct a three-channel combined graph neural network, a residue-level mixed feature map of an amino acid sequence is used as an initial feature matrix input of the three-channel combined graph neural network, an amino acid residue contact graph is used as an adjacent matrix input of different feature channels, a complex graph relationship among amino acid residues in protein molecules is captured and learned, and protein function prediction is carried out by combining the residue-level mixed feature map of the amino acid sequence and multi-level structural features;
S5, constructing a multi-label classifier, carrying out global pooling and aggregation operation on the output node characteristics of the three-channel combined graph neural network, and mapping the aggregated characteristics into a probability space of 0-1 to obtain a probability value of each protein gene ontology functional label;
S6, forming a complete protein function prediction model by the sequence mixed characteristic extractor, the three-channel combined graph neural network and the multi-label classifier, pre-training the protein function prediction model on a training set, evaluating the performances of the protein function prediction model in different training stages by using a verification set and a test set, and performing protein function prediction by using a protein function prediction model qualified in training.
The step S1 of constructing a protein document type association database of a NoSQL type based on MongoDB comprises the following steps:
Step S101, testing an open data interface of a Uniprot database by using an interface test script and a test tool PostMan, obtaining a Uniprot database identifier corresponding to a protein item through screening conditions, wherein the screened protein item is required to have a protein space structure file which is determined by experiments or a protein space structure file which is predicted by a AlphaFold model, the length of the amino acid sequence of each protein is not more than 1022 (the ESM network processes the amino acid sequence of the protein consisting of 1022 residues at most once), a protein Gene Ontology (GO) is selected as a functional label of the protein, the total number of the protein gene ontology labels of each protein item which is screened out is more than or equal to 3, and 26335 Uniprot database identifiers which meet the conditions are screened out in the embodiment as source data IDs of the database.
Step S102, accessing a Uniprot database data interface through a Uniprot database identifier obtained by screening, acquiring a cross index ID of a corresponding protein item in a PDB, alphaFold database from a data packet returned by the interface, associating a primary key Target (Target) ID of a Target protein information table of a ChEMBL database through the Uniprot database identifier, and acquiring the rest secondary table IDs in a ChEMBL database according to the association;
Step S103, classifying data interfaces corresponding to the IDs are requested, and xml and JSON data packets returned by BeautifulSoup, jsonpath frames and regular expression analysis interfaces are used for obtaining a JSON dictionary composed of protein association information in Uniprot, PDB, alphaFold databases and a JSON dictionary composed of target protein association data in ChEMBL databases;
Step S104, based on the JSON dictionary, constructing a protein information main table by using a Pymongo frame by taking a Uniprot database identifier as a main key, and constructing a ChEMBL associated information table containing target information by taking a main key target ID as a main key;
Step S105, cross-correlating the main key target point ID in the ChEMBL associated information table with the Uniprot database identifier in the protein information main table to obtain a final protein document associated database, wherein a user can search key information of related protein items in the Uniprot, PDB, alphaFold, chEMBL database only through the Uniprot database identifier.
The method comprises the steps of obtaining a basic data set of a protein function prediction task from a protein document type associated database, specifically, screening 5000 protein items with a high-precision experimental measurement structure from 26335 protein data items in the embodiment, namely Dataset, screening 2000 protein items with a AlphaFold model measurement structure, expanding the protein items to Dataset1, and obtaining a 7000 mixed data set marked as Dataset 2. In order to reduce the experimental cost, the embodiment only takes the GO term with evidence coding (EXP, IDA, IPI, IMP, IGI, IEP, TAS, IC, HTP, HDA, HMP, HGI, HEP) as the functional label of the protein gene ontology of the protein, and divides the extracted data set into a training set, a verification set and a test set according to the proportion of 8:1:1, wherein the protein items of the test set simultaneously have the functional labels of the three protein gene ontologies of MF, BP and CC, and the distribution of the functional labels of the protein gene ontologies of the data set is shown in Table 1.
TABLE 1 functional tag distribution of protein Gene ontologies of the data set extracted in the examples
The spatial contact relation of Ca atoms in the residues in the step S2 is that a protein is assumed to be composed of N amino acid residues, a protein spatial structure file is traversed, the spatial structure coordinates of the Ca atoms in each amino acid residue are extracted from the protein spatial structure file, the spatial distance between all Ca atoms is calculated based on the coordinates, a distance threshold is set to judge whether the two residues are in contact or not, and therefore an N-by-N amino acid residue contact matrix is obtained, then only a side r with the contact relation is taken, and a 2-by-2 r adjacent matrix is converted according to a source node and a target node of the side r, and is taken as the amino acid residue contact diagram.
The amino acid residue contact diagrams of different connection densities in the step S2 are respectively calculated according to the protein space structure file actually measured by the PDB and the amino acid residue contact diagram of the protein space structure file predicted by AlphaFold, and the amino acid residue contact diagrams representing different contact densities are calculated according to different distance threshold values. Specifically, for the protein entry with the Uniprot database identifier of P05067, the corresponding PDB experiment measurement structure index ID is 8OTF, and the alpha fold model prediction structure index ID is:
AF-P05067-F1, thereby obtaining protein space structure file of P05067, obtaining space coordinates of Ca atoms in each amino acid residue in the protein molecule by analyzing the file, the complete P05067 protein is composed of 770 amino acid residues, calculating space distance between Ca atoms and constructing amino acid residue contact matrix ,Representing dimensions, converting by taking only the side of the amino acid residue in contact matrix with contact relationAs an adjacency matrixAs a map of amino acid residue contact of protein P05067.
Specifically, the residue contact calculation formula may be defined as:
;
Wherein the method comprises the steps of Residues extracted from protein space structure filesSum residuesThe three-dimensional coordinates of the Ca atoms in (c),In order to contact the distance threshold value,Representing residuesSum residuesThe spatial distance of Ca atoms in the sequence is considered to be in spatial contact with two amino acid residues when the calculated spatial distance is less than the threshold. In this example, two types of contact patterns of amino acid residues under the contact threshold values 8 a and 10 a are calculated, representing a sparse amino acid residue contact pattern and a dense amino acid residue contact pattern, respectively.
As shown in fig. 2. In step S3, the complete sequence or a certain segment of subsequence of the protein is input into an amino acid sequence mixed characteristic extractor, a total 6155-dimensional sequence embedded representation is extracted from a Bi-LSTM network, then an ESM2 network (Evolutionary Scale Modelling) consisting of 33 layers of transformers encoders is continuously extracted to obtain a total 1280-dimensional sequence embedded representation, and the total sequence embedded representation is taken as a sequence primary embedded representation of the amino acid sequence.
In the step S3, the feature fusion and enrichment operation is carried out on the primary embedded representation of the sequence, namely when the length of the amino acid sequence is overlong, the overall calculation complexity of the model is obviously improved, and in order to obtain more effective features while the training cost is reduced, the primary embedded representation of the sequence is divided into sequence single-heat coding mapping, residue sequence depth context feature and residue biochemical feature coding according to class slices. And (3) carrying out leavable nonlinear mapping on the slices in the training process of the whole model, carrying out feature enrichment and dimension reduction on the primary embedded representation of the sequence by adjusting the dimension of each hidden layer, and finally obtaining residue-level mixed feature mapping of the amino acid sequence through ReLU nonlinear mapping and feature fusion operation.
Specifically, for the protein item with UPID of P05067, the complete sequence (770 amino acids total) is input into an amino acid sequence mixed characteristic extractor, and the Bi-LSTM network extracts the sequence single thermal coding mapping(20 Standard amino acids and a class of unknown amino acids) and residue sequence depth context characteristicsThe ESM2 network will extract its residue biochemical characteristic codeThereby obtaining a primary embedded representation of the sequence,Representing the dimension. In order to avoid resource waste caused by excessive number of amino acid residues, the method is toAnd slicing according to the category, performing three-time learnable nonlinear mapping, and performing feature enrichment and dimension reduction on the primary embedded representation of the sequence by adjusting the dimension of each hidden layer to be 21, 512 and 320 when the dimension is linearly changed. Finally, performing feature fusion on the three types of output features after non-linear ReLU mapping to obtain residue-level mixed feature mapping of the amino acid sequenceIn performing the above-described learnable nonlinear transformation, all weight parameters and corresponding bias values will be trained with all parameters in the subsequent whole protein function prediction model.
As shown in FIG. 3, the three-channel combined graph neural network comprises a first characteristic channel formed by two layers of GCN networks connected in series, a second characteristic channel formed by the other same two layers of GCN networks connected in series, wherein the first characteristic channel is parallel to the second characteristic channel to form a 2X 2 parallel double-channel GCN network, a GATv network with a multi-head graph attention mechanism is introduced to form a third characteristic channel, and the GATv network is connected in series to the parallel double-channel GCN network. The activation functions of the GCN network and GATv network employ a ReLU activation function.
Specifically, the three-channel combined graph neural network has characteristic aggregation and propagation modes that residue-level mixed characteristic mapping of amino acid sequences is usedAs an initial feature matrix input of the 2 x2 parallel dual-channel GCN network, a sparse amino acid residue contact diagram extracted by an 8 a contact threshold is used as an adjacent matrix input of one feature channel of the parallel network, and a dense amino acid residue contact diagram extracted by a 10a contact threshold is used as an adjacent matrix input of the other feature channel. The output characteristics of the 2 x2 parallel double-channel GCN network are subjected to characteristic fusion to obtain fusion characteristicsTaking the amino acid residue as a third characteristic channel, namely, inputting a characteristic matrix of GATv network, simultaneously taking a sparse amino acid residue contact diagram extracted by using an 8A threshold as an adjacent matrix input of GATv network, wherein the GATv network is a natural matrix of the GATv networkFurther noise filtering and feature extraction are carried out to obtain the output node features of the three-channel combined graph neural network。
In step S5, the process of global pooling and aggregation operation is performed on the output node characteristics of the three-channel combined graph neural network is shown in fig. 4;
Output node characteristics for three-channel combined graph neural network And respectively carrying out average pooling and accumulation pooling, and then aggregating two kinds of pooling information to obtain more comprehensive node characteristic information.
Specifically, the global pooling operation may be specifically defined as:
;
;
;
Wherein, Is the first in the output node characteristicsThe feature vectors of the individual nodes are used,Representing the accumulated pooling information is presented,Represents the information of the mean value pool,Indicating the characteristics of the aggregated nodes, N is the number of the nodes, and concat indicates the aggregation operation.
The node characteristics after aggregationAnd carrying out batch normalization, reLU activation and Dropout operation in sequence, mapping a full connection layer, and mapping the feature output into a probability space of 0-1 by using a sigmoid function to obtain the probability value of each protein gene ontology functional label.
The method for pre-training the protein function prediction model on the training set in the step S6 comprises the steps of selecting the training set in the reference data set as training data of the protein function prediction model, respectively taking MF, BP and CC labels as functional labels of protein gene ontology during supervised learning, selecting Adam as a training optimizer, setting the learning rate to be 0.0001, selecting a binary cross entropy function as a loss function during training, setting the size of a training Batch (Batch size) to be 64, and setting the maximum iteration round (epoch) to be 50. And in the training process, using a verification set to evaluate the protein function prediction model performance after each iteration round (epoch), namely after each round of training is finished, the protein function prediction model predicts protein gene ontology function labels of all protein items in the verification set, calculates the area (M-AUPR) under a micro precision recall curve, the area (M-AUPR) under the macro precision recall curve and the score on three evaluation indexes of the maximum F1 score (F-MAX) respectively, and if the score is improved compared with the previous round, stores the protein function prediction model pre-trained in the round. Training is stopped after the maximum iteration round (epoch) is reached. Finally, the performance of the pre-trained protein function prediction model is evaluated on a test set.
The protein function prediction model provided by the invention is used for respectively pre-training under training sets corresponding to Dataset and Dataset, testing under a testing set of Dataset1, respectively calculating the scores of the protein function prediction model for predicting MF labels (482 in total) on three types of evaluation indexes, and taking the highest score in 5 tests, wherein the results are shown in Table 2.
TABLE 2 expansion AlphaFold predicted influence of protein Structure data on model Performance
Model evaluation and comparison are carried out on the DeepFRI models and the protein function prediction model of the invention under a dataset Dataset, 50 iterative rounds (epochs) are pre-trained, the scores of the models on three types of evaluation indexes are calculated respectively, the highest scores in 5 tests are taken, the predicted performances of the models on three types of protein gene ontology function labels of MF (total 517), BP (total 2124) and CC (total 408) are discussed respectively, and the results are shown in tables 3 and 4.
TABLE 3 DeepFRI model prediction of protein function
TABLE 4 prediction of protein function by protein function prediction model of the invention
As can be seen from the experimental results, the model is superior to the DeepFRI model in most cases, and the AlphaFold predicted protein structure data is used as the training input of the model, so that the problem caused by the lack of the spatial structure of experimental measurement of protein items is avoided, and the generalization performance of the model is improved. Meanwhile, the protein document type associated database provided by the invention can greatly improve the retrieval efficiency of protein associated information and is convenient for the downstream prediction task of the protein.
Finally, it should be noted that the foregoing description is only a preferred embodiment of the present invention, and the present invention is not limited to the foregoing embodiment, but may be modified or some of the technical features thereof may be substituted by those illustrated in the foregoing embodiment. Any modification, equivalent replacement, improvement, etc. made within the spirit and principle of the present invention should be included in the protection scope of the present invention.