Detailed Description
Exemplary embodiments of the present disclosure are described below in conjunction with the accompanying drawings, which include various details of the embodiments of the present disclosure to facilitate understanding, and should be considered as merely exemplary. Accordingly, one of ordinary skill in the art will recognize that various changes and modifications of the embodiments described herein can be made without departing from the scope and spirit of the present disclosure. Also, descriptions of well-known functions and constructions are omitted in the following description for clarity and conciseness.
Artificial intelligence (ARTIFICIALINTELLIGENCE, AI for short) is a piece of technical science that studies, develops theories, methods, techniques and application systems for simulating, extending and expanding human intelligence. At present, the AI technology has the advantages of high automation degree, high accuracy and low cost, and is widely applied.
Deep learning (DEEP LEARNING, DL for short) is a new research direction in the field of machine learning (MACHINE LEARNING, ML for short), and learns the internal rules and presentation hierarchy of sample data, and the information obtained in the learning process is greatly helpful for explaining data such as text, images and sounds. Its final goal is to have the machine have analytical learning capabilities like a person, and to recognize text, image, and sound data. The method mainly comprises a convolutional operation-based neural network system, a multi-layer neuron-based self-coding neural network, and a deep confidence network for further optimizing the weight of the neural network by combining identification information by pre-training in a multi-layer self-coding neural network mode. Deep learning has achieved many results in search technology, data mining, machine learning, machine translation, natural language processing, multimedia learning, speech, recommendation and personalization techniques, and other related fields. The deep learning makes the machine imitate the activities of human beings such as audio-visual and thinking, solves a plurality of complex pattern recognition problems, and makes the related technology of artificial intelligence greatly advanced.
Graph Theory (Graph Theory) is a branch of mathematics. It takes the graph as the study object. A graph in a graph theory is a graph formed of a number of given points and lines connecting the two points, and this graph is generally used to describe a specific relationship between something, where the points represent something, and the lines connecting the two points represent that there is such a relationship between the corresponding two things.
Biotechnology (Biotechnology) is a comprehensive technology based on life sciences, which uses the characteristics and functions of living beings (or biological tissues, cells and other components) to design, construct new substances or new lines with desired properties, and process products or provide services in combination with engineering principles. The information technology (Information Science) is a technology for acquiring, transmitting and processing research information, and is formed by combining computer technology, communication technology and microelectronic technology, namely, the information is processed by a computer, and modern electronic communication technology is used for acquiring, storing, processing, utilizing information and manufacturing related products, developing technology and new subjects of information service. Information technology and biotechnology are both high and new technologies, and the two technologies are not mutually related in new economy, but complement each other, so that the rapid development of the 21 st century economy is jointly promoted.
Methods, apparatuses, electronic devices, and storage media for molecular screening according to embodiments of the present disclosure are described below with reference to the accompanying drawings.
Fig. 1 is a flow diagram of a method of molecular screening according to a first embodiment of the present disclosure.
As shown in fig. 1, the method for molecular screening according to the embodiments of the present disclosure may specifically include the following steps:
s101, acquiring a first tag map of molecules to be screened in a set of molecules to be screened and a second tag map of reference molecules.
In particular, the main body of the molecular screening method according to the embodiments of the present disclosure may be a molecular screening apparatus provided by the embodiments of the present disclosure, where the molecular screening apparatus may be a hardware device having a data information processing capability and/or software necessary for driving the hardware device to operate. Alternatively, the execution body may include a workstation, a server, a computer, a user terminal, and other devices. The user terminal comprises, but is not limited to, a mobile phone, a computer, intelligent voice interaction equipment, intelligent household appliances, vehicle-mounted terminals and the like.
The molecular screening method of the embodiment of the disclosure can be applied to the realization of Ligand-based drug virtual screening (LVS for short), and potential drug molecules are screened from a molecular database according to the principle that molecules with similar structures tend to have similar attributes.
In the embodiment of the disclosure, a plurality of molecules in a molecular set to be screened (such as a molecular database) are used as the molecules to be screened, and each molecule to be screened is compared with a reference molecule, so as to obtain a target molecule with higher similarity with the reference molecule, wherein the molecule can be a drug molecule. In some embodiments, for each molecule to be screened in the set of molecules to be screened, a tag map of the molecule to be screened (molecule a) is obtained as a first tag map (G A), and a tag map of the reference molecule (molecule B) is obtained as a second tag map (G B), wherein the relationship of the tag map vertices and edges corresponds to the structure of the molecule, which can be regarded as an undirected map of features comprising the molecule.
S102, each molecule to be screened and the reference molecule form a molecule pair, conflict information between mapping between vertexes of the first label graph and the second label graph is obtained for each molecule pair, and a mapping graph of the molecule pair is generated based on the conflict information between the mapping.
In the embodiment of the disclosure, each molecule to be screened forms a molecule pair with a reference molecule, and for each molecule pair, a mapping between vertices of a first tag map of the molecule to be screened and a second tag map of the reference molecule in the molecule pair, for example, a mapping between vertex a 1 in the first tag map G A and vertex B 1 in the second tag map G B, is obtained, where the mapping may be understood as constructing a correspondence between vertex a 1 and vertex B 1. It should be noted that the obtained mapping should be all possible mappings existing between vertices of the first label graph and the second label graph. Furthermore, conflict information between any two of all possible mappings is obtained, which can be understood as if there is a conflict between two mappings, then the two mappings cannot exist at the same time. Embodiments of the present disclosure generate a map (G AB) of the molecular pair based on the acquired map and conflict information between the maps.
And S103, sampling the mapping graph to obtain a maximum weight full-connection subgraph of the molecular pair.
In the embodiment of the disclosure, the map G AB is an undirected graph, including a vertex set, an edge set and a weight set, and the map of the molecular pair is sampled based on gaussian glass color sampling (Gaussian Boson Sampling, abbreviated as GBS) to obtain a maximum weight connection sub-graph of the molecular pair, where the maximum weight full connection sub-graph with better effect can be obtained by greedy shrinkage, local expansion and sampling post-processing by using historical sampling information.
S104, according to the maximum weight full-connection subgraph of each molecule pair, selecting the molecule to be screened with the maximum similarity with the reference molecule from the molecule set to be screened, and taking the molecule to be screened as the target molecule.
In the embodiment of the disclosure, a maximum weight full-connection subgraph of each molecule pair is obtained, the similarity between the molecules to be screened and the reference molecule is obtained according to the maximum weight full-connection subgraphs, the molecules to be screened with the maximum similarity with the reference molecule are screened from the molecule set to be screened, and the molecule to be screened with the maximum similarity is taken as the target molecule.
In summary, in the molecular screening method according to the embodiment of the present disclosure, a first tag map of a molecule to be screened and a second tag map of a reference molecule in a set of molecules to be screened are obtained, each molecule to be screened and the reference molecule form a molecule pair, for each molecule pair, mapping between vertices of the first tag map and the second tag map and conflict information between the mapping are obtained, a map of the molecule pair is generated based on the conflict information between the mapping and the mapping, the map is sampled, a maximum weight full-connection sub-map of the molecule pair is obtained, and the molecule to be screened with the maximum similarity to the reference molecule is screened from the set of molecules to be screened as a target molecule according to the maximum weight full-connection sub-map of each molecule pair. According to the method, molecules are modeled into an undirected graph with labels (namely characteristics), a mapping graph of molecular pairs is generated based on the label graph, a maximum weight full-connection subgraph of the mapping graph is obtained by utilizing Gaussian glass color sampling, a similarity score of the molecular pairs is obtained based on the maximum weight full-connection subgraph, and a framework for realizing molecular screening based on Gaussian glass color sampling is provided, so that the problem of virtual screening of medicines based on ligands can be effectively realized by utilizing Gaussian glass color sampling, and the efficiency of molecular screening is improved.
Fig. 2 is a flow diagram of a method of molecular screening according to a second embodiment of the present disclosure.
As shown in fig. 2, the method for molecular screening according to the embodiment of the disclosure may specifically include the following steps based on the embodiment shown in fig. 1:
S201, acquiring a first tag map of molecules to be screened in a set of molecules to be screened and a second tag map of reference molecules.
The step S102 may specifically include the steps of:
s202, performing key feature matching on each first vertex in the first label graph and each second vertex in the second label graph respectively to obtain mapping between the first vertex and the second vertex.
In the embodiment of the present disclosure, each vertex in the first label graph G A is regarded as a first vertex Ai, each vertex in the second label graph G B is regarded as a second vertex Bj, a key feature of one of the first vertices Ai is obtained as a first key feature, and a key feature of one of the second vertices Bj is obtained as a second key feature, whether the attributes of the first key feature and the second key feature are the same is determined, and if at least one attribute of the key feature is the same, a mapping between the first vertex Ai and the second vertex Bj is generated.
The above-mentioned key feature is a feature of an atom or a ring corresponding to a vertex, and the feature is plural and may be classified into a key feature and a non-key feature.
S203, obtaining two maps without conflict from all the maps to form a map pair, and forming edges between the two maps included in the map pair.
In the embodiment of the disclosure, according to the conflict information of the acquired mappings, two mappings without conflict are acquired from all the mappings, and the two mappings are determined as a pair of mappings without conflict. In generating the map of the molecular pairs, edges are formed between the two maps corresponding to the map pairs, and it is understood that the edges are not formed between the maps where there is a conflict.
S204, obtaining the weight corresponding to the mapping.
In the embodiment of the disclosure, for each mapping, the weight corresponding to the mapping is obtained according to the atomic numbers and the characteristics of the first vertex and the second vertex corresponding to the mapping.
S205, generating a mapping diagram by mapping to the vertexes of the mapping diagram, edges between conflict-free mapping pairs and the weight of the mapping.
In the disclosed embodiment, each possible mapping is taken as a vertex of the mapping graph, edges are formed between conflict-free mapping pairs, and a weight set is formed according to the weight of each mapping, so that the mapping graph comprising the vertex set, the edge set and the weight set is formed.
S206, sampling the mapping graph to obtain a maximum weight full-connection subgraph of the molecular pair.
S207, selecting the molecules to be screened with the maximum similarity with the reference molecules from the molecule set to be screened according to the maximum weight full-connection subgraph of each molecule pair, and taking the molecules to be screened as target molecules.
In the embodiment of the present disclosure, step S201 is the same as step S101 in the above embodiment, and steps S206 to S207 are the same as steps S103 to S104, which are not described herein.
On the basis of the above embodiment, as shown in fig. 3, the "obtaining two maps without collision from all maps to form a map pair" in step S203 may include the following steps:
S301, selecting two maps from all the maps, and determining a first vertex and a second vertex which correspond to the two maps respectively.
In the embodiment of the disclosure, two mappings are arbitrarily selected from all mappings of the first label graph and the second label graph, and a first vertex and a second vertex corresponding to each of the two mappings are determined. For example, one mapping corresponds to first vertex A 1 and second vertex B 1, and the other mapping corresponds to first vertex A 2 and second vertex B 2.
S302, obtaining a first atom or a first ring represented by a first vertex corresponding to each of the two mappings.
In an embodiment of the present disclosure, the atoms characterized by the first vertex a 1 (i.e., the first atom and the first ring) and the atoms or rings characterized by a 2 (i.e., the first atom or the first ring) are obtained.
S303, obtaining a second atom or a second ring represented by a second vertex corresponding to each of the two mappings.
In an embodiment of the present disclosure, the atoms characterized by the second vertex B 1 (i.e., the second atom and the second ring) and the atoms or rings characterized by B 2 (i.e., the second atom or the second ring) are obtained.
S304, in response to that no atomic conflict and no distance conflict exist between the two first atoms or the first rings, and no atomic conflict and no distance conflict exist between the two second atoms or the second rings, determining that the two mappings are two mappings without conflict, and forming a mapping pair.
In an embodiment of the present disclosure, it is determined whether there is an atomic conflict and a distance conflict between two first atoms or first rings, and whether there is an atomic conflict and a distance conflict between two second atoms or second rings:
The atomic conflict can be determined by comparing two first atoms or first rings and two second atoms or second rings, and determining that the two mappings have no atomic conflict if the same atoms are not present in the two first atoms or first rings and the same atoms are not present in the two second atoms or second rings, which can be represented by the following formula:
Wherein A 1 and A 2 in equation one represent a single atom or a collection of atoms in a ring characterized by two first vertices, respectively, and B 1 and B 2 represent a single atom or a collection of atoms in a ring characterized by two second vertices, respectively.
The distance conflict may be determined by obtaining a first distance between two first atoms or first rings, obtaining a second distance between two second atoms or second rings, and determining that there is no distance conflict for the two mappings in response to a difference between the first distance and the second distance being less than or equal to a preset distance.
It will be appreciated that the similarity of the molecular pairs depends on the number of labels of the common substructure between the molecules, that is, the similarity of the molecular pairs is determined based on the maximum common substructure (Labelled Maximum Common Substructure, abbreviated as LMCS) between the molecular pairs, and the maximum weight full-connection sub-graph of the mapping graph of the molecular pairs is obtained by using GBS, so as to be equivalent to the maximum common substructure between the molecular pairs, and in order to ensure that the obtained maximum weight full-connection sub-graph can effectively correspond to the common substructure, the distances between two vertices mapped to two vertices in the same label graph are approximately the same as a condition without distance collision. For example, for two mappings, the difference between the distance between two atoms corresponding to the first vertex of the two mappings (first distance) and the distance between two atoms corresponding to the second vertex of the two mappings (second distance) does not exceed a preset distance. Can be expressed by the following formula two:
|dist(A1,A2)-dist(B1,B2)|>τ
Wherein A 1 and A 2 in equation one represent a single atom or ring characterized by two first vertices, respectively, B 1 and B 2 represent a single atom or ring characterized by two second vertices, respectively, the first and second distances may be calculated based on Euclidean distances, wherein the 3D coordinates of the atoms are taken as atomic coordinates, the 3D coordinates of the geometric center of the ring are taken as coordinates of the ring, and τ may be taken as coordinates of the ring in some embodiments Is arranged as
On the basis of the above embodiment, the step of "obtaining the weight corresponding to the mapping" in step S204 may include the steps of determining, for each mapping, a first vertex and a second vertex corresponding to the mapping, generating the weight of the mapping according to the number of atoms and the characteristics of each of the first vertex and the second vertex, and the characteristics being the characteristics of atoms or rings corresponding to the vertices.
On the basis of the above embodiment, as shown in fig. 4, the method for molecular screening according to the embodiment of the present disclosure further includes a process of generating the mapped weight according to the number of atoms and the characteristics of each of the first vertex and the second vertex, which may be implemented by:
S401, obtaining an average value of the atomic number of the first vertex and the atomic number of the second vertex.
In the embodiment of the disclosure, for each mapping, an average value of the number of atoms of the first vertex and the number of atoms of the second vertex corresponding to the mapping is obtained.
S402, obtaining the number of the features with the same attribute or value between the first vertex and the second vertex.
In the embodiment of the present disclosure, feature matching is performed on a feature corresponding to a first vertex (i.e., a feature of an atom or a ring represented by the first vertex) and a feature corresponding to a second vertex (i.e., a feature of an atom or a ring represented by the second vertex), where feature matching may be performed by comparing whether the attribute or the value of the feature is the same or not, so as to obtain the number of features with the same attribute or value between the first vertex and the second vertex.
S403, determining the weight of the mapping between the first vertex and the second vertex according to the average value and the attribute or the number of the same characteristics.
In the embodiment of the present disclosure, for each mapping, the weight between the first vertex and the second vertex is determined according to the obtained average value and the attribute or the number of features with the same value, and may be expressed by the formula three:
Wherein V AB in formula three is the vertex set of map G AB=(VAB,EAB,WAB), m i represents the ith map in the map, a represents the first vertex corresponding to the ith map, a i represents a single atom or a set of multiple atoms in a ring represented by the first vertex, B represents the second vertex corresponding to the ith map, B i represents a single atom or a set of multiple atoms in a ring represented by the second vertex, and L a∩Lb represents the number of features having the same attribute or value between the first vertex and the second vertex.
On the basis of the above embodiment, as shown in fig. 5, the method for molecular screening according to the embodiment of the present disclosure further includes a process of generating a tag map corresponding to a molecule, where the molecule may be a molecule to be screened or a reference molecule, and may be implemented by the following steps:
S501, generating an undirected graph of any molecule according to the molecular structure of any molecule, wherein the top point in the undirected graph corresponds to a single atom of the molecule or a ring containing a plurality of atoms, and the side in the undirected graph corresponds to a chemical bond in the molecule.
In the embodiment of the disclosure, an undirected graph can be generated by taking atoms as vertexes and bonds as edges, for a ring containing multiple atoms, the ring is contracted into a single vertex, and for two rings with common bonds, an auxiliary edge is added between corresponding vertexes in the graph, and as shown in fig. 6, a label graph G A and a label graph G B are constructed by molecules A and molecules B.
S502, extracting the characteristics of each atom in any molecule, and generating a characteristic label set of any molecule based on the extracted characteristics, wherein the characteristic label set comprises the characteristics of a single atom and the characteristics of a ring, and the characteristics of the ring are obtained by polymerizing the characteristics of a plurality of atoms contained in the ring.
In the embodiment of the present disclosure, as shown in table 1, the features of the atoms may include a plurality of atomic numbers, implicit H bonds, form charges, degrees, and the like, the features of each atom included in the molecule are extracted, the feature type of the extracted features is identified, and key feature tags of the extracted features are determined based on the feature type, wherein the key feature tags are used to indicate whether the features are key features, and a feature tag set is generated based on the extracted features, the feature types of the features, and the key feature tags of the features.
Among other things, the chemical and pharmaceutical characteristics of atoms are considered in embodiments of the present disclosure, and the characteristics are classified into two types, critical (C) and non-critical (NC) based on a "critical" mechanism, to facilitate mapping. Wherein the feature types include additive chemical features, non-additive chemical features, and pharmaceutical features, the non-additive chemical features and the pharmaceutical features being key features, the additive chemical features being non-key features.
TABLE 1 atomic feature and Key feature tag correspondence table
Wherein "(" drug ") is used to label a pharmaceutical feature.
The ring characteristics can be obtained by polymerizing the characteristics of atoms contained in the ring. For example, for additive features (i.e., implicit H-bonds, formal charge, and degrees), we add the corresponding values for each atom in each ring. For non-additive features (i.e., atomic number and attached chemical bonds), we maintain an ordered list. Finally, for the other seven pharmaceutical features, if any atom in the ring has that feature, we set its feature tag to "true" and if no atom has that feature, we set the feature tag to "false".
S503, generating a label graph corresponding to any molecule according to the undirected graph and the characteristic label set of any molecule, wherein any molecule is a molecule to be screened, the corresponding label graph is a first label graph, any molecule is a reference molecule, and the corresponding label graph is a second label graph.
In the disclosed embodiment, two molecules are denoted as a and B, and their corresponding tag graphs are denoted as G A=(VA,EA,LA) and G B=(VB,EB,LB), respectively), the tag graphs including a vertex set V, an edge set E, and a feature tag set L.
On the basis of the above embodiment, as shown in fig. 7, the step S207 of "selecting the molecule to be screened having the greatest similarity with the reference molecule from the set of molecules to be screened according to the maximum weight full-connection subgraph of each molecule pair" may include the following steps:
S701, obtaining the similarity between the label graphs of the molecules to be screened in the molecular pairs and the reference molecules based on the maximum full-connection graph of the molecular pairs.
S702, screening out the molecules to be screened with the maximum similarity with the reference molecule from the molecule set to be screened, and taking the molecules to be screened as target molecules.
On the basis of the above embodiment, as shown in fig. 8, the step of "obtaining the similarity between the tag maps of the molecules to be screened and the reference molecules in the pair of molecules" in the step S701 based on the maximum full-ligation map of the pair of molecules may include the following steps:
S801, a first total weight of a first tag map is acquired.
In the embodiment of the present disclosure, a sum of weights of all vertices in the first label graph is taken as a first total weight, where the weight of each vertex may be determined by using a feature number corresponding to the vertex and an atomic number corresponding to the vertex.
S802, obtaining a second total weight of the second label graph.
In the embodiment of the present disclosure, the sum of the weights of all the vertices in the second label graph is taken as the second total weight, where the weight of each vertex may be determined by using the feature number corresponding to the vertex and the atomic number corresponding to the vertex.
S803, determining a third total weight of the maximum weight fully-connected subgraph based on the weights of the vertexes included in the maximum weight fully-connected subgraph.
S804, determining the similarity between the first label and the second label graph according to the first total weight, the second total weight and the third total weight.
In the disclosed embodiment, the average value of Bunke and Shearer similarity measures may be used as the similarity score, which may be expressed by the equation four:
Wherein w v is represented as the weight w v=|Lv |+|v| of the vertex in the label graph, where |v| is the number of atoms contained in the vertex v, M represents the vertex set of the maximum weight fully-connected subgraph, ω mi is the same as ω mi in equation three, and represents the weight of the vertex in the maximum weight fully-connected subgraph. When A and B are the same molecule, sim (A, B) =1, when When Sim (a, B) =0. Otherwise, sim (A, B) ε (0, 1). It can thus correctly characterize the similarity of molecules.
In summary, in the molecular screening method according to the embodiment of the present disclosure, a first tag map of a molecule to be screened and a second tag map of a reference molecule in a set of molecules to be screened are obtained, each molecule to be screened and the reference molecule form a molecule pair, for each molecule pair, mapping between vertices of the first tag map and the second tag map and conflict information between the mapping are obtained, a map of the molecule pair is generated based on the conflict information between the mapping and the mapping, the map is sampled, a maximum weight full-connection sub-map of the molecule pair is obtained, and the molecule to be screened with the maximum similarity to the reference molecule is screened from the set of molecules to be screened as a target molecule according to the maximum weight full-connection sub-map of each molecule pair. According to the method, molecules are modeled into an undirected graph with labels (namely characteristics), a mapping graph of molecular pairs is generated based on the label graph, a maximum weight full-connection subgraph of the mapping graph is obtained by utilizing Gaussian glass color sampling, a similarity score of the molecular pairs is obtained based on the maximum weight full-connection subgraph, and a framework for realizing molecular screening based on Gaussian glass color sampling is provided, so that the problem of virtual screening of medicines based on ligands can be effectively realized by utilizing Gaussian glass color sampling, and the efficiency of molecular screening is improved.
For clarity of describing the molecular screening method according to the embodiments of the present disclosure, reference is now made to fig. 9, fig. 9 is an overall schematic diagram of the molecular screening method according to the embodiments of the present disclosure, as shown in fig. 9, a molecule a is used as a molecule to be screened, a first tag graph G A corresponding to the molecule a to be screened and a second tag graph G B corresponding to the reference molecule B are generated according to the structure of the molecule, respectively, a mapping graph G AB corresponding to the molecule a to be screened and the reference molecule B is established according to the first tag graph G A and the second tag graph G B, a maximum weight full-connection sub-graph (Maximum Weight Clique, abbreviated as MWC) of the mapping graph G AB is generated based on the GBS algorithm, and a target atom (gray solid circle as shown in fig. 9) is mapped back to the first tag graph and the second tag graph, i.e. a common sub-structure of the two molecules is determined, so as to generate a similarity score (i.e. similarity) of the first tag graph and the second tag graph.
Fig. 10 is a block diagram of an apparatus for molecular screening according to a first embodiment of the present disclosure.
As shown in fig. 10, the molecular screening apparatus 1000 according to the embodiment of the present disclosure includes an obtaining module 1001, a mapping module 1002, a sampling module 1003, and a screening module 1004.
An obtaining module 1001 is configured to obtain a first tag map of molecules to be screened in a set of molecules to be screened and a second tag map of reference molecules.
The mapping module 1002 is configured to form a molecular pair from each molecule to be screened and a reference molecule, obtain, for each molecular pair, mapping between vertices of the first label graph and the second label graph and conflict information between the mapping, and generate a mapping graph of the molecular pair based on the mapping and the conflict information between the mapping.
And the sampling module 1003 is used for sampling the mapping graph to acquire the maximum weight full-connection subgraph of the molecular pair.
And a screening module 1004, configured to screen the molecules to be screened with the highest similarity with the reference molecule from the set of molecules to be screened as target molecules according to the maximum weight full-connection subgraph of each molecule pair.
It should be noted that the explanation of the above embodiments of the method for molecular screening is also applicable to the device for molecular screening according to the embodiments of the present disclosure, and specific processes are not described herein.
In summary, the device for molecular screening according to the embodiments of the present disclosure obtains a first tag map of molecules to be screened and a second tag map of reference molecules in a set of molecules to be screened, each molecule to be screened and the reference molecules form a pair of molecules, for each pair of molecules, conflict information between mapping between vertices of the first tag map and the second tag map is obtained, a map of the pair of molecules is generated based on the conflict information between the mapping and the mapping, the map is sampled, a maximum weight full connection subgraph of the pair of molecules is obtained, and the molecule to be screened with the maximum similarity to the reference molecule is screened from the set of molecules to be screened as a target molecule according to the maximum weight full connection subgraph of each pair of molecules. According to the method, molecules are modeled into an undirected graph with labels (namely characteristics), a mapping graph of molecular pairs is generated based on the label graph, a maximum weight full-connection subgraph of the mapping graph is obtained by utilizing Gaussian glass color sampling, a similarity score of the molecular pairs is obtained based on the maximum weight full-connection subgraph, and a framework for realizing molecular screening based on Gaussian glass color sampling is provided, so that the problem of virtual screening of medicines based on ligands can be effectively realized by utilizing Gaussian glass color sampling, and the efficiency of molecular screening is improved.
Fig. 11 is a block diagram of an apparatus for molecular screening according to a second embodiment of the present disclosure.
As shown in fig. 11, the apparatus 1100 for molecular screening according to an embodiment of the present disclosure includes an obtaining module 1101, a mapping module 1102, a sampling module 1103, and a screening module 1104.
The obtaining module 1101 has the same structure and function as the obtaining module 1001 in the previous embodiment, the mapping module 1102 has the same structure and function as the mapping module 1002 in the previous embodiment, the sampling module 1103 has the same structure and function as the sampling module 1003 in the previous embodiment, and the filtering module 1104 has the same structure and function as the filtering module 1004 in the previous embodiment.
Further, the obtaining module 1101 includes a first generating sub-module 11011 configured to generate an undirected graph of any molecule according to a molecular structure of any molecule, wherein a vertex in the undirected graph corresponds to a single atom of the molecule or a ring containing a plurality of atoms, and an edge in the undirected graph corresponds to a chemical bond in the molecule, a second generating sub-module 11012 configured to extract a feature of each atom in any molecule and generate a feature tag set of any molecule based on the extracted feature, where the feature tag set includes a feature of the single atom and a feature of the ring, and the feature of the ring is obtained by polymerizing features of a plurality of atoms contained in the ring, and a third generating sub-module 11013 configured to generate a tag graph corresponding to any molecule according to the undirected graph and the feature tag set of any molecule, where any molecule is a molecule to be screened, the corresponding tag graph is a first tag graph, and any molecule is a reference molecule, and the corresponding tag graph is a second tag graph.
Further, the second generating submodule 11012 includes a first determining unit configured to identify a feature type of the extracted feature and determine a key feature tag of the extracted feature based on the feature type, the key feature tag indicating whether the feature is a key feature, and a second generating unit configured to generate a feature tag set based on the extracted feature, the feature type of the feature, and the key feature tag of the feature.
Further, the molecule is a drug molecule, and the feature types include an additive chemical feature, a non-additive chemical feature, and a pharmaceutical feature, the non-additive chemical feature and the pharmaceutical feature being key features, the additive chemical feature being a non-key feature.
Further, the mapping module 1102 includes a feature matching sub-module, configured to match each first vertex in the first label graph with each second vertex in the second label graph to obtain a mapping between the first vertex and the second vertex, a fourth generating sub-module, configured to obtain two maps without collision from all the maps to form a mapping pair, and form an edge between the two maps included in the mapping pair, a first obtaining sub-module, configured to obtain a weight corresponding to the mapping, and a fifth generating sub-module, configured to generate the mapping graph with the mapping as the vertex of the mapping graph, the edge between the mapping pair without collision, and the weight of the mapping.
Further, the fourth generation submodule comprises a second determination unit, a first acquisition unit and a second acquisition unit, wherein the second determination unit is used for selecting two mappings from all the mappings and determining a first vertex and a second vertex which are respectively corresponding to the two mappings, the first acquisition unit is used for acquiring a first atom or a first ring which are characterized by the first vertex which is respectively corresponding to the two mappings, the second acquisition unit is used for acquiring a second atom or a second ring which is characterized by the second vertex which is respectively corresponding to the two mappings, and the third determination unit is used for determining that the two mappings are collision-free and form a mapping pair if no atom conflict and no distance conflict exist between the two first atoms or the first rings and no atom conflict and no distance conflict exists between the two second atoms or the second rings.
Further, the third determining unit comprises an atom comparison subunit for atom comparing two first atoms or first rings and atom comparing two second atoms or second rings, and a second determining subunit for determining that no atomic conflict exists in the two mappings in response to the fact that the same atom does not exist in the two first atoms or the first rings and the same atom does not exist in the two second atoms or the second rings.
Further, the third determining unit comprises a first obtaining subunit, a second obtaining subunit and a second determining subunit, wherein the first obtaining subunit is used for obtaining a first distance between two first atoms or first rings, the second obtaining subunit is used for obtaining a second distance between two second atoms or second rings, and the second determining subunit is used for determining that no distance conflict exists between the two mappings in response to the difference value between the first distance and the second distance is smaller than or equal to a preset distance.
Further, the feature matching sub-module comprises a third obtaining unit, a judging unit and a second generating unit, wherein the third obtaining unit is used for obtaining first key features of one of the first vertexes and second key features of one of the second vertexes, the judging unit is used for judging whether the attributes of the first key features and the second key features are identical, and the second generating unit is used for generating a mapping between one of the first vertexes and one of the second vertexes if the attributes of at least one key feature are identical.
Further, the obtaining submodule comprises a third generating unit, a first generating unit and a second generating unit, wherein the third generating unit is used for determining a first vertex and a second vertex corresponding to each mapping, and generating the weight of the mapping according to the number of atoms and the characteristics of the first vertex and the second vertex, and the characteristics are the characteristics of atoms or rings corresponding to the vertices.
Further, the third generating unit comprises a third obtaining subunit, a fourth obtaining subunit and a third determining unit, wherein the third obtaining subunit is used for obtaining the average value of the atomic number of the first vertex and the atomic number of the second vertex, the fourth obtaining subunit is used for obtaining the number of the characteristics with the same attribute or value between the first vertex and the second vertex, and the third determining unit is used for determining the mapping weight between the first vertex and the second vertex according to the average value and the number of the characteristics with the same attribute or value.
Further, the sampling module 1103 includes a sampling sub-module, configured to sample the map based on a gaussian glass sampling algorithm, so as to obtain the maximum weight fully-connected subgraph.
Further, the screening module 1104 includes a second obtaining submodule, configured to obtain a similarity between a to-be-screened molecule in the molecular pair and a tag map of the reference molecule based on a maximum full-connection map of the molecular pair, and a screening submodule, configured to screen out a to-be-screened molecule with the maximum similarity with the reference molecule from the to-be-screened molecule set, as a target molecule.
Further, the second obtaining sub-module comprises a fourth obtaining unit, a fifth obtaining unit, a third determining unit and a fourth determining unit, wherein the fourth obtaining unit is used for obtaining the first total weight of the first label graph, the fifth obtaining unit is used for obtaining the second total weight of the second label graph, the third determining unit is used for determining the third total weight of the maximum weight full-connection sub-graph based on the weight of the vertex included in the maximum weight full-connection sub-graph, and the fourth determining unit is used for determining the similarity between the first label graph and the second label graph according to the first total weight, the second total weight and the third total weight.
In summary, the device for molecular screening according to the embodiments of the present disclosure obtains a first tag map of molecules to be screened and a second tag map of reference molecules in a set of molecules to be screened, each molecule to be screened and the reference molecules form a pair of molecules, for each pair of molecules, conflict information between mapping between vertices of the first tag map and the second tag map is obtained, a map of the pair of molecules is generated based on the conflict information between the mapping and the mapping, the map is sampled, a maximum weight full connection subgraph of the pair of molecules is obtained, and the molecule to be screened with the maximum similarity to the reference molecule is screened from the set of molecules to be screened as a target molecule according to the maximum weight full connection subgraph of each pair of molecules. According to the method, molecules are modeled into an undirected graph with labels (namely characteristics), a mapping graph of molecular pairs is generated based on the label graph, a maximum weight full-connection subgraph of the mapping graph is obtained by utilizing Gaussian glass color sampling, a similarity score of the molecular pairs is obtained based on the maximum weight full-connection subgraph, and a framework for realizing molecular screening based on Gaussian glass color sampling is provided, so that the problem of virtual screening of medicines based on ligands can be effectively realized by utilizing Gaussian glass color sampling, and the efficiency of molecular screening is improved.
In the technical scheme of the disclosure, the related processes of collecting, storing, using, processing, transmitting, providing, disclosing and the like of the personal information of the user accord with the regulations of related laws and regulations, and the public order colloquial is not violated.
According to embodiments of the present disclosure, the present disclosure also provides an electronic device, a readable storage medium and a computer program product.
Fig. 12 shows a schematic block diagram of an example electronic device 1200 that can be used to implement embodiments of the present disclosure. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital processing, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions, are meant to be exemplary only, and are not meant to limit implementations of the disclosure described and/or claimed herein.
As shown in fig. 12, the electronic device 1200 includes a computing unit 1201 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 1202 or a computer program loaded from a storage unit 1208 into a Random Access Memory (RAM) 1203. In the RAM 1203, various programs and data required for the operation of the electronic device 1200 may also be stored. The computing unit 1201, the ROM1202, and the RAM 1203 are connected to each other via a bus 1204. An input/output (I/O) interface 1205 is also connected to the bus 1204.
Various components in the electronic device 1200 are connected to the I/O interface 1205, including an input unit 1206, such as a keyboard, mouse, etc., an output unit 1207, such as various types of displays, speakers, etc., a storage unit 1208, such as a magnetic disk, optical disk, etc., and a communication unit 1209, such as a network card, modem, wireless communication transceiver, etc. The communication unit 1209 allows the electronic device 1200 to exchange information/data with other devices through a computer network, such as the internet, and/or various telecommunications networks.
The computing unit 1201 may be a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of computing unit 1201 include, but are not limited to, a Central Processing Unit (CPU), a Graphics Processing Unit (GPU), various specialized Artificial Intelligence (AI) computing chips, various computing units running machine learning model algorithms, digital Signal Processors (DSPs), and any suitable processor, controller, microcontroller, etc. The computing unit 1201 performs the various methods and processes described above, such as the methods of molecular screening shown in fig. 1-9. For example, in some embodiments, the method of molecular screening may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as storage unit 1208. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 1200 via the ROM 1202 and/or the communication unit 1209. When the computer program is loaded into the RAM1203 and executed by the computing unit 1201, one or more steps of the semantic parsing method described above may be performed. Alternatively, in other embodiments, the computing unit 1201 may be configured to perform the method of molecular screening in any other suitable way (e.g., by means of firmware).
Various implementations of the systems and techniques described here above may be implemented in digital electronic circuitry, integrated circuit systems, field Programmable Gate Arrays (FPGAs), application Specific Integrated Circuits (ASICs), application Specific Standard Products (ASSPs), systems On Chip (SOCs), load programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include being implemented in one or more computer programs that are executable and/or interpretable on a programmable system including at least one programmable processor, which may be a special or general purpose programmable processor, operable to receive data and instructions from, and to transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present disclosure may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of this disclosure, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to the user and a keyboard and a pointing device (e.g., a mouse or a trackball) by which the user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user, for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback), and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a background component (e.g., as a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such background, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include a Local Area Network (LAN), a Wide Area Network (WAN), the Internet, and a blockchain network.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server can be a cloud server, also called a cloud computing server or a cloud host, and is a host product in a cloud computing service system, so that the defects of high management difficulty and weak service expansibility in the traditional physical hosts and VPS service ("Virtual PRIVATE SERVER" or simply "VPS") are overcome. The server may also be a server of a distributed system or a server that incorporates a blockchain.
According to an embodiment of the present disclosure, the present disclosure further provides a computer program product comprising a computer program, wherein the computer program when executed by a processor realizes the steps of the method of molecular screening according to the embodiments of the present disclosure described above.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps recited in the present disclosure may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed aspects are achieved, and are not limited herein.
The above detailed description should not be taken as limiting the scope of the present disclosure. It will be apparent to those skilled in the art that various modifications, combinations, sub-combinations and alternatives are possible, depending on design requirements and other factors. Any modifications, equivalent substitutions and improvements made within the spirit and principles of the present disclosure are intended to be included within the scope of the present disclosure.