CN114334022B - Solubility prediction model of compound molecule and application - Google Patents

Info

Publication number
CN114334022B
Authority
CN
China
Prior art keywords
model
cyclodextrin
compound
training
phase
Legal status
Active
Application number
CN202111683434.2A
Other languages
Chinese (zh)
Other versions
CN114334022A (en)
Inventor
陈凌云
王恺
李海燕
杨文明
王文首
赖才达
Current Assignee
Hangzhou Yetai Pharmaceutical Technology Co ltd
Original Assignee
Hangzhou Yetai Pharmaceutical Technology Co ltd

Landscapes

  • Polysaccharides And Polysaccharide Derivatives (AREA)
  • Other Investigation Or Analysis Of Materials By Electrical Means (AREA)

Abstract

The invention provides a transfer learning model for predicting the solubilizing effect of cyclodextrin on compound molecules. The model first learns the SMILES features of a large number of compound molecules to obtain a pre-trained model. The pre-trained parameters are then transferred to a model application domain adaptation fine-tuning stage and a QSPR modeling stage, in which the model learns cyclodextrin-included compound molecules and their corresponding solubility data. Finally, the model predicts the solubilizing effect of cyclodextrin inclusion on compound molecules. The method addresses the low accuracy that current machine learning approaches suffer because solubility data for cyclodextrin inclusion compounds are scarce, is applicable to a variety of cyclodextrin inclusion compounds, and generalizes well.

Description

Solubility prediction model of compound molecule and application
Technical Field
The invention relates to the field of artificial-intelligence-assisted pharmaceutical formulation research and development, and in particular to a transfer-learning-based model and method for predicting the solubilizing effect of cyclodextrin inclusion on compound molecules.
Background
A drug must be in solution at the site of absorption, and the solubility of the active pharmaceutical ingredient is an important parameter in achieving the desired concentration. About 40% of drug candidates are reported to be rejected because of poor solubility, so improving drug solubility is a frequent and important challenge in drug development.
Including a drug in cyclodextrin is a common way to increase its solubility, and various cyclodextrin inclusion compounds are already on the market as pharmaceutical products. Cyclodextrin inclusion can also increase the stability of a drug and mask its unpleasant odor. Conventional formulation development requires a solubility test after the drug has been experimentally included in cyclodextrin, which is laborious, time-consuming and wasteful of materials. If the actual solubilizing effect of cyclodextrin on the drug molecule fails to meet the development requirements, the time, labor and materials are wasted.
In view of the above, there is a need for a new technology that saves materials and quickly reveals the solubilizing effect of cyclodextrin on active pharmaceutical ingredients, so as to meet drug development requirements, speed up drug research and development, and reduce its cost.
Disclosure of Invention
The invention aims to solve the problems in the prior art and provides a prediction model and a prediction method for the solubility of cyclodextrin-included compound molecules. The prediction method is highly accurate, is suitable for predicting the solubilizing effect of various cyclodextrins on compound molecules, reduces the need for large numbers of conventional manual experiments, improves research and development efficiency, and saves development cost.
The purpose of the invention can be achieved by the following technical scheme:
The invention provides a method for predicting the solubilizing effect of cyclodextrin on compound molecules, using a transfer learning model.
The scarcity of solubility data for cyclodextrin inclusion compounds, that is, the small sample size, is a major obstacle to using artificial intelligence to predict the solubilizing effect of cyclodextrin inclusion. Prediction accuracy is also affected when data come from different sources. In the prior art, solubility data for 218 β-cyclodextrin inclusion compounds reported in the literature and for 220 SBE-β-cyclodextrin inclusion compounds obtained experimentally were used as samples, with 90% as the training set and the remaining 10% as the test set, to evaluate a random forest model and a Cubist model. The results showed that the random forest model performed more consistently across the two cyclodextrins, whereas the Cubist model was not suitable for SBE-β-cyclodextrin inclusion compounds.
According to the invention, a transfer learning model uses the inherent features of more than one million compound molecules as a pre-training set, which addresses the prior-art problems of small sample size and of errors between data from different sources.
In a preferred embodiment, the invention provides a transfer learning model comprising a pre-training phase and a QSPR modeling phase.
In a preferred embodiment, the data learned in the pre-training phase are features of compound molecules that are not included by cyclodextrin, preferably the SMILES sequences of the compound molecules.
In the transfer learning model provided by the invention, the SMILES sequences of compound molecules are the training data of the pre-training phase. Because one compound molecule can be written as several different SMILES sequences, the data can be augmented. After the SMILES features of millions of compound molecules have been learned in the pre-training phase, the QSPR modeling phase is trained on the SMILES features and corresponding Gibbs free energy changes (ΔG) of inclusion compounds formed by one cyclodextrin with different compound molecules, and the model then calculates ΔG for inclusion compounds of the same cyclodextrin with other compound molecules. The SMILES grammar learned in the pre-training phase lets the model distinguish the inherent basic chemical features of different molecules, and these features determine the solubilizing effect of cyclodextrin on a molecule, so the method is suitable for predicting the solubilizing effect of inclusion by a variety of cyclodextrins.
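As background for the ΔG target used above, the sketch below converts between a binding constant and a free energy change. It assumes, which the source does not state explicitly, that ΔG is the standard free energy of 1:1 complex formation, so that the textbook relation ΔG = -RT ln K applies. The function names and the default temperature of 298.15 K are illustrative choices only.

```python
import math

R_GAS = 8.314462618  # molar gas constant, J/(mol*K)

def delta_g_from_k(k_binding, temperature_k=298.15):
    """Standard free energy change in kJ/mol for a binding constant K (M^-1)."""
    return -R_GAS * temperature_k * math.log(k_binding) / 1000.0

def k_from_delta_g(delta_g_kj, temperature_k=298.15):
    """Inverse relation: binding constant K (M^-1) from delta G in kJ/mol."""
    return math.exp(-delta_g_kj * 1000.0 / (R_GAS * temperature_k))

# A binding constant of 1000 M^-1 corresponds to roughly -17.1 kJ/mol.
example_delta_g = delta_g_from_k(1000.0)
```

A more negative ΔG means a larger binding constant, which is why ΔG can serve as a regression target for the strength of inclusion.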
In a preferred embodiment, a model application domain adaptation fine-tuning phase may also be added after the pre-training phase.
In a preferred embodiment, the data learned in the model application domain adaptation fine-tuning phase are molecular features of cyclodextrin-included compounds, preferably the SMILES sequences of the cyclodextrin-included compound molecules.
Adding the model application domain adaptation fine-tuning phase lets the model learn the molecular features of cyclodextrin-included compounds, such as SMILES sequence features, so that the application domain of the transfer learning model fits the cyclodextrin inclusion compound system more closely.
In a preferred embodiment, the pre-training phase or the model application domain adaptation fine-tuning phase trains on the data using a self-supervised learning method.
In a preferred embodiment, the QSPR modeling phase shares the model parameters of the pre-training phase or the model application domain adaptation fine-tuning phase.
In a preferred embodiment, the cyclodextrin of the present invention may be α-, β- or γ-cyclodextrin or a derivative thereof, preferably β-cyclodextrin or SBE-β-cyclodextrin.
The transfer learning model provided by the invention is suitable for a variety of cyclodextrin inclusion compounds. The cyclodextrin can be any physiologically acceptable, substituted or unsubstituted, water-soluble cyclodextrin or a derivative thereof, such as α-, β- or γ-cyclodextrin or derivatives thereof, in particular derivatives in which one or more hydroxyl groups are substituted, for example by alkyl, hydroxyalkyl, cycloalkyl, alkylcarbonyl, hydroxyalkoxyalkyl, alkylcarboxyloxyalkyl or alkoxycarbonylalkyl. Preferably, the cyclodextrin of the present invention is β-cyclodextrin or SBE-β-cyclodextrin.
In a preferred embodiment, the molar ratio of compound molecules and cyclodextrin in the cyclodextrin inclusion compound is from 1 to 1, preferably 1.
Because the inherent characteristics of the compound molecules are taken as the pre-training set, and the training set in the QSPR stage can be the SMILES sequence of any cyclodextrin inclusion compound, the method for predicting the solubilizing effect of the cyclodextrin on the compound molecules based on the transfer learning model is not limited by the types of the cyclodextrin and the proportion of the cyclodextrin to the compound molecules.
Compared with the prior art, the model provided by the invention for predicting the solubilizing effect of cyclodextrin on compounds uses the inherent structural features of compounds as its data set. This addresses two prior-art problems: solubility data for cyclodextrin inclusion compounds are scarce, and prediction performance differs greatly between types of cyclodextrin inclusion compound. The prediction model is not limited by the type of cyclodextrin or by the ratio of cyclodextrin to compound in the inclusion compound. The R² values between predicted and experimental values are 0.949 for β-cyclodextrin and 0.917 for SBE-β-cyclodextrin inclusion compounds, indicating high accuracy and good generalization.
Drawings
FIG. 1 is a diagram of the transfer learning model architecture of the present invention.
Detailed Description
The technical solution of the present invention will be further described in detail with reference to the following embodiments. It is to be understood that the following examples are only illustrative and explanatory of the present invention and should not be construed as limiting the scope of the present invention. All techniques implemented based on the teachings of the present invention are within the intended scope of protection.
The data relating to the solubilization of cyclodextrins are taken from the literature: International Journal of Pharmaceutics 418 (2011), 207-216.
Example 1
The transfer learning model architecture is shown in FIG. 1. The embedding layer converts variable-length SMILES sequences into fixed-length vectorized numerical representations, the encoding layer learns the feature representation corresponding to each SMILES sequence, and the classifier/regressor layer uses the feature representation output by the encoding layer to make the final prediction.
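The input side of the embedding layer described above can be sketched as follows. This is a minimal illustration, not the patent's implementation: the regular-expression tokenizer, the `<pad>` and `<unk>` symbols and the maximum length of 12 are all assumptions made for the example. It shows how SMILES strings of different lengths become integer sequences of one fixed length, which an embedding layer could then map to vectors.

```python
import re

# Hypothetical tokenizer: bracket atoms, two-letter halogens, two-digit ring
# labels such as %12, then any single remaining character.
TOKEN_RE = re.compile(r"\[[^\]]+\]|Br|Cl|%\d{2}|.")

PAD, UNK = "<pad>", "<unk>"

def tokenize(smiles):
    """Split a SMILES string into tokens."""
    return TOKEN_RE.findall(smiles)

def build_vocab(smiles_list):
    """Assign an integer id to every token seen in the corpus."""
    vocab = {PAD: 0, UNK: 1}
    for smiles in smiles_list:
        for token in tokenize(smiles):
            vocab.setdefault(token, len(vocab))
    return vocab

def encode(smiles, vocab, max_len):
    """Map tokens to ids, truncate to max_len, then right-pad with PAD."""
    ids = [vocab.get(t, vocab[UNK]) for t in tokenize(smiles)][:max_len]
    return ids + [vocab[PAD]] * (max_len - len(ids))

corpus = ["CCO", "c1ccccc1Cl", "C[NH3+]"]
vocab = build_vocab(corpus)
encoded = [encode(s, vocab, max_len=12) for s in corpus]
```

Every row of `encoded` has length 12 regardless of the length of the original string, which is the fixed-length property the embedding layer relies on.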
1. Pre-training phase
One million small molecules were selected from the ChEMBL 27 database. Multiple SMILES sequences were generated for each molecule as data augmentation, and a pre-trained model was obtained by training with a self-supervised learning method.
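The source does not say which self-supervised objective was used. As one common possibility, and purely as an assumption, the sketch below builds masked-token training examples: some tokens of a SMILES sequence are hidden and the model would be trained to recover them, so no solubility labels are needed. The `<mask>` symbol, the masking probability of 0.15 and the character-level tokens are illustrative choices. Generating the alternative SMILES strings themselves would normally be delegated to a cheminformatics toolkit and is not shown.

```python
import random

MASK = "<mask>"

def make_masked_example(tokens, mask_prob=0.15, rng=None):
    """Return (inputs, targets) for a masked-token objective.

    targets[i] holds the original token where inputs[i] was masked,
    and None at positions that are not scored.
    """
    rng = rng or random.Random()
    inputs, targets = [], []
    for token in tokens:
        if rng.random() < mask_prob:
            inputs.append(MASK)
            targets.append(token)   # the model must recover this token
        else:
            inputs.append(token)
            targets.append(None)    # position not scored
    return inputs, targets

# Aspirin, split into single characters for simplicity.
tokens = list("CC(=O)Oc1ccccc1C(=O)O")
inputs, targets = make_masked_example(tokens, rng=random.Random(0))
```

The labels come from the sequence itself, which is what makes the objective usable on a million unlabeled molecules.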
2. Model application domain adaptation fine-tuning phase
The initial parameters of the model in the model application domain adaptation fine-tuning phase are taken entirely from the pre-training phase. For all molecules in the SBE-β-cyclodextrin solubilization data set from the literature, multiple SMILES sequences were generated per molecule as data augmentation, and these data were used to fine-tune the pre-trained model by a self-supervised learning method, so that the application domain of the model fits the SBE-β-cyclodextrin solubilization data set more closely.
3. QSPR modeling phase
The embedding layer and the encoding layer of the QSPR modeling phase share the parameters trained in the model application domain adaptation fine-tuning phase.
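The idea of sharing encoder parameters while training only a new regressor can be illustrated with a toy sketch. Everything here is a stand-in: `frozen_encoder` is a hand-written feature function rather than a learned network, the regressor is a plain linear model trained by stochastic gradient descent, and the ΔG labels are placeholder numbers invented for the example, not values from the source.

```python
import random

def frozen_encoder(smiles):
    # Stand-in for the shared embedding + encoding layers. It has no
    # trainable state, so nothing here changes during QSPR training.
    return [len(smiles) / 10.0, smiles.count("O") / 5.0, 1.0 if "c" in smiles else 0.0]

def predict(w, b, x):
    return sum(wi * xi for wi, xi in zip(w, x)) + b

def mse(w, b, feats):
    return sum((predict(w, b, x) - y) ** 2 for x, y in feats) / len(feats)

def train_regressor(data, epochs=500, lr=0.05, seed=0):
    """Train only the regressor weights on top of the fixed encoder output."""
    rng = random.Random(seed)
    feats = [(frozen_encoder(s), y) for s, y in data]  # fixed, so computed once
    w = [rng.uniform(-0.1, 0.1) for _ in feats[0][0]]  # random initial regressor
    b = 0.0
    start = mse(w, b, feats)
    for _ in range(epochs):
        for x, y in feats:
            err = predict(w, b, x) - y
            w = [wi - lr * err * xi for wi, xi in zip(w, x)]
            b -= lr * err
    return w, b, start, mse(w, b, feats)

# Placeholder labels, invented for illustration only (not from the source).
toy = [("CCO", -1.0), ("c1ccccc1O", -3.0), ("CC(=O)O", -1.5), ("c1ccccc1", -2.5)]
w, b, mse_before, mse_after = train_regressor(toy)
```

Because the encoder output never changes, it can be computed once and cached, and only the few regressor parameters have to be fitted to the small labeled data set.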
(1) Data augmentation. The SBE-β-cyclodextrin solubilization data set from the literature was divided into a training set (90%) and a test set (10%). For the training set, multiple pieces of data (X, the model input) were generated for each molecule by generating multiple SMILES sequences, and Gaussian noise was added to the original ΔG to generate the Y corresponding to each piece of data. For the test set, multiple pieces of data were generated for each molecule in the same way, but Y was left unchanged.
(2) Model training. The embedding layer and the encoding layer use the parameters trained in the model application domain adaptation fine-tuning phase and are frozen (all of their parameters are kept unchanged during training). The regressor starts from random initial parameters, and the model is then trained on the training set to obtain the final trained model.
(3) Model use. The augmented test-set data are input into the model, and a predicted ΔG value is calculated for each piece of data $(y'_1, y'_2, y'_3, \ldots, y'_n)$. The predicted value of solubility after cyclodextrin solubilization of each molecule is the mean of the predicted values $(y'_1, y'_2, y'_3, \ldots, y'_n)$ of that molecule's SMILES sequences:
$$\hat{y} = \frac{1}{n}\sum_{j=1}^{n} y'_j$$
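The per-molecule averaging in step (3) can be sketched as below. The row format, a molecule identifier paired with one prediction per SMILES variant, is an assumption made for the example.

```python
from collections import defaultdict
from statistics import mean

def average_per_molecule(rows):
    """Average the predictions of all SMILES variants of each molecule.

    rows: iterable of (molecule_id, predicted_value) pairs.
    """
    grouped = defaultdict(list)
    for molecule_id, y_pred in rows:
        grouped[molecule_id].append(y_pred)
    return {molecule_id: mean(values) for molecule_id, values in grouped.items()}

predictions = [("mol_a", -2.0), ("mol_a", -4.0), ("mol_b", -1.0)]
per_molecule = average_per_molecule(predictions)
```

Averaging over several SMILES variants of the same molecule reduces the dependence of the final value on how the molecule happened to be written.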
The results are shown in Table 1.
TABLE 1
[Table 1: image GDA0003514123180000051 in the original publication]
The results show that the solubilizing effect of SBE-β-cyclodextrin on the compounds, as predicted by the transfer learning model, is consistent with the experimental data in the literature.
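The data handling in step (1) of Example 1 can be sketched as follows. The split is done per molecule before augmentation, so all SMILES variants of one molecule stay on the same side. The seed, the noise standard deviation of 0.05, the placeholder ΔG of -1.0 and the function names are assumptions for illustration; the source gives only the 90%/10% ratio and the use of Gaussian noise on training labels.

```python
import random

def split_molecules(molecule_ids, test_fraction=0.1, seed=0):
    """Shuffle molecule ids and split them into (train, test) lists."""
    ids = list(molecule_ids)
    random.Random(seed).shuffle(ids)
    n_test = max(1, round(len(ids) * test_fraction))
    return ids[n_test:], ids[:n_test]

def augment(smiles_variants, delta_g, noise_sd, rng):
    """Pair each SMILES variant with a label; noise_sd=0 keeps the label unchanged."""
    return [(s, delta_g + rng.gauss(0.0, noise_sd)) for s in smiles_variants]

train_ids, test_ids = split_molecules(range(220))

rng = random.Random(1)
ethanol = ["CCO", "OCC", "C(O)C"]  # three SMILES strings for the same molecule
train_rows = augment(ethanol, delta_g=-1.0, noise_sd=0.05, rng=rng)  # noisy Y
test_rows = augment(ethanol, delta_g=-1.0, noise_sd=0.0, rng=rng)    # Y unchanged
```

Adding noise only to the training labels gives each SMILES variant a slightly different target, while the test labels stay at their measured values so that the evaluation is not distorted.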
Example 2
Other prior-art models were used for prediction on the same data set. The prediction accuracy of each model was evaluated with $R^2$, calculated as:
$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$
where $i$ is the index of a molecule in the test set (the $i$-th molecule), $\hat{y}_i$ is the predicted value of the solubility of the $i$-th molecule after cyclodextrin solubilization, $y_i$ is the experimentally determined ΔG of the $i$-th molecule, and $\bar{y}$ is the mean of all experimentally determined ΔG values in the test data set. The predictive performance of the different models is shown in Table 2.
TABLE 2
Model                     Test set R²
Transfer learning model   0.917
GBDT                      0.821
Random forest             0.857
AdaBoost                  0.781
The results show that the transfer learning model (test set R² = 0.917) predicts more accurately than the prior-art random forest model (0.857) and the other two models, GBDT (0.821) and AdaBoost (0.781).
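The R² metric used in Example 2 can be computed with a few lines of code. The sketch assumes the standard coefficient of determination, built from the three quantities the text defines (predicted values, experimental values and the mean of the experimental values); the input numbers are arbitrary illustrations.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_y = sum(y_true) / len(y_true)
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))
    ss_tot = sum((y - mean_y) ** 2 for y in y_true)
    return 1.0 - ss_res / ss_tot

perfect = r_squared([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])    # every prediction exact
mean_only = r_squared([1.0, 2.0, 3.0], [2.0, 2.0, 2.0])  # always predicts the mean
```

A model that always predicts the mean of the experimental values scores 0 and a perfect model scores 1, which gives a scale for reading the values in Table 2.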
Example 3
The β-cyclodextrin solubilization data set from the literature was divided into a training set (90%) and a test set (10%). The transfer learning model of the present invention was used to predict the solubilizing effect of β-cyclodextrin on the compound molecules in the test set; the results are shown in Table 3.
TABLE 3
[Table 3: images GDA0003514123180000064 and GDA0003514123180000071 in the original publication]
The result shows that the prediction accuracy of the transfer learning model of the invention on a beta-cyclodextrin solubilization system is also very high, which indicates that the transfer learning model of the invention has very good generalization.
It should be noted that the preferred embodiments above are further, non-limiting detailed descriptions of the technical solutions of the present invention and serve only to illustrate its technical concepts and features. They are intended to enable those skilled in the art to understand and implement the present invention, not to limit its scope. All equivalent changes and modifications made according to the spirit of the present invention fall within its scope of protection.

Claims (8)

1. A model for predicting the solubilizing effect of cyclodextrin on compound molecules, characterized in that: a transfer learning model is used, wherein the transfer learning model comprises a pre-training phase and a QSPR modeling phase, the data learned in the pre-training phase are the SMILES sequences of compound molecules that are not included by cyclodextrin, and the cyclodextrin is β-cyclodextrin or SBE-β-cyclodextrin.
2. The model of claim 1, wherein: the transfer learning model further comprises a model application domain adaptation fine-tuning phase after the pre-training phase.
3. The model of claim 2, wherein: the data learned in the model application domain adaptation fine-tuning phase are molecular features of the cyclodextrin-included compound.
4. The model of claim 3, wherein: the data learned in the model application domain adaptation fine-tuning phase are the SMILES sequences of the cyclodextrin-included compound molecules.
5. The model of claim 2, wherein: the pre-training phase or the model application domain adaptation fine-tuning phase trains the model using a self-supervised learning method.
6. The model of claim 2, wherein: the QSPR modeling phase shares the model parameters of the pre-training phase or the model application domain adaptation fine-tuning phase.
7. The model of claim 1, wherein: the molar ratio of the compound molecules to the cyclodextrin in the inclusion compound is 1.
8. The model of claim 7, wherein: the molar ratio of the compound molecules to the cyclodextrin in the inclusion compound is 1.
CN202111683434.2A 2021-12-31 2021-12-31 Solubility prediction model of compound molecule and application Active CN114334022B (en)

Publications (2)

Publication Number Publication Date
CN114334022A CN114334022A (en) 2022-04-12
CN114334022B CN114334022B (en) 2022-11-18

Family

ID=81022838

Country Status (1)

Country Link
CN (1) CN114334022B (en)

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant