CN114334022B

CN114334022B - Solubility prediction model of compound molecule and application

Info

Publication number: CN114334022B
Application number: CN202111683434.2A
Authority: CN
Inventors: 陈凌云; 王恺; 李海燕; 杨文明; 王文首; 赖才达
Original assignee: Hangzhou Yetai Pharmaceutical Technology Co ltd
Current assignee: Hangzhou Yetai Pharmaceutical Technology Co ltd
Priority date: 2021-12-31
Filing date: 2021-12-31
Publication date: 2022-11-18
Anticipated expiration: 2041-12-31
Also published as: CN114334022A

Abstract

The invention provides a transfer learning model for predicting the solubilizing effect of cyclodextrin on compound molecules. The model first learns SMILES features from a large number of compound molecules to obtain a pre-trained model. The pre-trained parameters are then transferred to a model application domain adaptation fine-tuning stage and a QSPR modeling stage, in which the model learns the compound molecules included by cyclodextrin and their corresponding solubility data, and finally predicts the solubilizing effect of cyclodextrin inclusion on compound molecules. The method addresses the low accuracy that machine learning currently suffers from because solubility data for cyclodextrin inclusion compounds are scarce, is applicable to a variety of cyclodextrin inclusion compounds, and generalizes well.

Description

Solubility prediction model of compound molecule and application

Technical Field

The invention relates to the field of artificial-intelligence-assisted formulation research and development, and in particular to a transfer-learning-based model and method for predicting the solubilizing effect of cyclodextrin inclusion on compound molecules.

Background

A drug must be in solution at the site of absorption, so the solubility of the active pharmaceutical ingredient is an important parameter in achieving the desired concentration. About 40% of drug candidates have been reported to be rejected because of poor solubility, so increasing drug solubility is a frequent and important challenge in drug development.

Including a drug in cyclodextrin is a common way to increase its solubility, and a number of cyclodextrin inclusion compounds are already marketed as pharmaceutical products. Cyclodextrin inclusion can also improve the stability of a drug and mask its unpleasant odor. Conventional formulation development requires preparing the cyclodextrin inclusion compound experimentally and then testing its solubility, which is laborious, time-consuming, and consumes material. If the actual solubilizing effect of the cyclodextrin on the drug molecule turns out not to meet the development requirements, that time, labor, and material are wasted.

In view of the above, a new technique is needed that saves material and quickly determines the solubilizing effect of cyclodextrin on active pharmaceutical ingredients, so as to meet drug development requirements, speed up development, and reduce its cost.

Disclosure of Invention

The invention aims to solve the above problems in the prior art by providing a model and a method for predicting the solubility of compound molecules included by cyclodextrin. The prediction method is highly accurate, is suitable for predicting the solubilizing effect of various cyclodextrins on compound molecules, reduces the number of conventional manual experiments required, improves research and development efficiency, and saves development cost.

The purpose of the invention can be achieved by the following technical solution:

The invention provides a method for predicting the solubilizing effect of cyclodextrin on compound molecules that uses a transfer learning model.

The scarcity of solubility data for cyclodextrin inclusion compounds, i.e. the small sample size, is a major obstacle to using artificial intelligence to predict the solubilizing effect of cyclodextrin inclusion on a compound. Prediction accuracy is also affected when data come from different sources. In the prior art, solubility data for 218 β-cyclodextrin inclusion compounds reported in the literature and for 220 SBE-β-cyclodextrin inclusion compounds obtained experimentally were used as samples, with 90% used as a training set and the remaining 10% as a test set, to evaluate a random forest model and a Cubist model. The results showed that the random forest model predicted the two cyclodextrins more consistently, whereas the Cubist model was not suitable for SBE-β-cyclodextrin inclusion compounds.

By means of a transfer learning model, the invention uses the inherent features of more than one million compound molecules as a pre-training set, which overcomes the prior-art problems of small sample size and of errors between data from different sources.

In a preferred embodiment, the invention provides a transfer learning model comprising a pre-training phase and a QSPR modeling phase.

In a preferred embodiment, the data learned in the pre-training phase are features of compound molecules that are not included by cyclodextrin, preferably the SMILES sequences of the compound molecules.

In the transfer learning model provided by the invention, the SMILES sequences of compound molecules are used as training data in the pre-training phase. Because one compound molecule can be written as several different SMILES sequences, the data can be augmented. After the SMILES features of millions of compound molecules have been learned in the pre-training phase, the QSPR modeling phase is trained on the SMILES features and the corresponding Gibbs free energy changes (ΔG) of inclusion compounds formed by one cyclodextrin with different compound molecules, and the model finally calculates ΔG for inclusion compounds of the same cyclodextrin with other compound molecules. Because the SMILES grammar information learned in the pre-training phase distinguishes the inherent basic chemical features of different molecules, and these features determine how strongly a cyclodextrin solubilizes a molecule, the method is suitable for predicting the solubilizing effect of inclusion by a variety of cyclodextrins.

In a preferred embodiment, a model application domain adaptation fine-tuning phase may also be added after the pre-training phase.

In a preferred embodiment, the data learned in the model application domain adaptation fine-tuning phase are molecular features of the compounds included by cyclodextrin, preferably the SMILES sequences of those compound molecules.

Adding a model application domain adaptation fine-tuning phase lets the model learn the molecular features of the cyclodextrin-included compounds, such as SMILES sequence features, so that the application domain of the transfer learning model is better adapted to the cyclodextrin inclusion compound system.

In a preferred embodiment, the model is trained by self-supervised learning in the pre-training phase or the model application domain adaptation fine-tuning phase.

In a preferred embodiment, the QSPR modeling phase shares the model parameters of the pre-training phase or the model application domain adaptation fine-tuning phase.

In a preferred embodiment, the cyclodextrin of the present invention may be α-, β-, or γ-cyclodextrin or a derivative thereof, preferably β-cyclodextrin or SBE-β-cyclodextrin.

The transfer learning model provided by the invention is suitable for a variety of cyclodextrin inclusion compounds. The cyclodextrin may be any physiologically acceptable, substituted or unsubstituted, water-soluble cyclodextrin or a derivative thereof, for example α-, β-, or γ-cyclodextrin or derivatives thereof, in particular derivatives in which one or more of the hydroxyl groups are substituted, for example by alkyl, hydroxyalkyl, cycloalkyl, alkylcarbonyl, hydroxyalkoxyalkyl, alkylcarboxyloxyalkyl or alkoxycarbonylalkyl. Preferably, the cyclodextrin of the present invention is β-cyclodextrin or SBE-β-cyclodextrin.

In a preferred embodiment, the molar ratio of compound molecules and cyclodextrin in the cyclodextrin inclusion compound is from 1 to 1, preferably 1.

Because the inherent features of compound molecules serve as the pre-training set, and the training set in the QSPR phase can be the SMILES sequences of any cyclodextrin inclusion compound, the transfer-learning-based method for predicting the solubilizing effect of cyclodextrin on compound molecules is not limited by the type of cyclodextrin or by the ratio of cyclodextrin to compound molecule.

Compared with the prior art, the model provided by the invention for predicting the solubilizing effect of cyclodextrin on a compound uses the inherent structural features of compounds as a data set. This solves the prior-art problems that solubility data samples for cyclodextrin inclusion compounds are scarce and that prediction performance differs greatly between types of cyclodextrin inclusion compound. The prediction model of the invention is not limited by the type of cyclodextrin or by the ratio of cyclodextrin to compound in the inclusion compound. The R² values between predicted and experimental values for β-cyclodextrin and SBE-β-cyclodextrin inclusion compounds are 0.949 and 0.917 respectively, showing high accuracy and good generalization.

Drawings

FIG. 1 is a diagram of the transfer learning model architecture of the present invention.

Detailed Description

The technical solution of the present invention will be further described in detail with reference to the following embodiments. It is to be understood that the following examples are only illustrative and explanatory of the present invention and should not be construed as limiting the scope of the present invention. All techniques implemented based on the teachings of the present invention are within the intended scope of protection.

The cyclodextrin solubilization data are taken from the literature: International Journal of Pharmaceutics 418 (2011), 207-216.

Example 1

The transfer learning model architecture is shown in FIG. 1. The embedding layer converts variable-length SMILES sequences into fixed-length vectorized numerical representations, the encoding layer learns the feature representation corresponding to each SMILES sequence, and the classifier/regressor layer uses the feature representation output by the encoding layer to make the final prediction.
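The input side of this architecture can be illustrated with a minimal sketch of how variable-length SMILES strings might be turned into fixed-length integer sequences before an embedding lookup. The patent does not describe its tokenizer, so the regular expression, the `<pad>`/`<unk>` tokens, the right-padding scheme and all function names below are assumptions made for illustration only.

```python
import re

# A commonly used style of regular expression for splitting SMILES into tokens:
# bracket atoms, two-letter halogens, %NN ring closures, atoms, digits, symbols.
SMILES_TOKEN = re.compile(
    r"(\[[^\]]+\]|Br|Cl|%\d{2}|[A-Za-z]|\d|[=#\-\+\\/\(\)\.:~@\?\*\$])"
)

PAD, UNK = "<pad>", "<unk>"  # hypothetical special tokens


def tokenize(smiles):
    """Split a SMILES string into tokens."""
    return SMILES_TOKEN.findall(smiles)


def build_vocab(smiles_list):
    """Map every token seen in the corpus to an integer id."""
    vocab = {PAD: 0, UNK: 1}
    for smi in smiles_list:
        for tok in tokenize(smi):
            vocab.setdefault(tok, len(vocab))
    return vocab


def encode(smiles, vocab, max_len):
    """Convert a SMILES string to a fixed-length list of token ids."""
    ids = [vocab.get(tok, vocab[UNK]) for tok in tokenize(smiles)]
    ids = ids[:max_len]                                # truncate long sequences
    return ids + [vocab[PAD]] * (max_len - len(ids))   # right-pad short ones


corpus = ["CCO", "c1ccccc1Cl", "C[C@H](N)C(=O)O"]
vocab = build_vocab(corpus)
encoded = [encode(s, vocab, max_len=16) for s in corpus]
```

Every entry of `encoded` has the same length regardless of the length of the original SMILES string, which is the property an embedding layer needs in order to process a batch.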

1. Pre-training phase

One million small molecules are selected from the ChEMBL 27 database. Multiple SMILES sequences are generated for each molecule as data augmentation, and a pre-trained model is obtained by training with a self-supervised learning method.
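The patent does not state which self-supervised objective is used. Purely as an illustration of how unlabeled SMILES can supply their own training signal, the sketch below assumes a masked-token objective: some tokens are hidden and the model would be asked to recover them. The `<mask>` token, the 15% masking rate and the character-level tokens are all assumptions, not details from the source.

```python
import random

MASK = "<mask>"  # hypothetical mask token


def mask_tokens(tokens, mask_prob=0.15, rng=None):
    """Return (masked_input, targets) for masked-token prediction.

    targets[i] holds the original token where position i was masked and
    None elsewhere, so a loss would be computed only on masked positions.
    """
    rng = rng or random.Random()
    masked, targets = [], []
    for tok in tokens:
        if rng.random() < mask_prob:
            masked.append(MASK)
            targets.append(tok)
        else:
            masked.append(tok)
            targets.append(None)
    return masked, targets


# Character-level tokens of one SMILES string (aspirin), for simplicity.
tokens = list("CC(=O)Oc1ccccc1C(=O)O")
masked, targets = mask_tokens(tokens, mask_prob=0.15, rng=random.Random(0))
```

No solubility label is needed to build these input/target pairs, which is why this kind of pre-training can draw on a large database of molecules rather than on the small cyclodextrin data set.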

2. Model application domain adaptation fine-tuning phase

The model in the model application domain adaptation fine-tuning phase is initialized with all of the parameters of the pre-trained model. For all molecules in the SBE-β-cyclodextrin solubilization data set from the literature, multiple SMILES sequences are generated per molecule as data augmentation, and these data are used to fine-tune the pre-trained model by a self-supervised learning method, so that the application domain of the model is better adapted to the SBE-β-cyclodextrin solubilization data set.

3. QSPR modeling phase

The embedding layer and the encoding layer in the QSPR modeling phase share the parameters trained in the model application domain adaptation fine-tuning phase.

(1) Data augmentation. The SBE-β-cyclodextrin solubilization data set from the literature is divided into a training set (90%) and a test set (10%). For the training set, multiple pieces of data (X, the model input) are generated for each molecule by generating multiple SMILES sequences, and Gaussian noise is added to the original ΔG to generate the Y corresponding to each piece of data. For the test set, multiple pieces of data are generated for each molecule in the same way, and Y is left unchanged.

(2) Model training. The embedding layer and the encoding layer use the parameters trained in the model application domain adaptation fine-tuning phase and are frozen (all of their parameters are kept unchanged during training), while the regressor starts from random initial parameters. The model is then trained on the training set to obtain the final trained model.

(3) Model use. The augmented test set data are input into the model, and a predicted ΔG value is calculated for each piece of data, giving $(y'_1, y'_2, y'_3, \ldots, y'_n)$ for the $n$ SMILES sequences of a molecule. The predicted value for each molecule after solubilization by cyclodextrin is the average of the predicted values of its SMILES sequences, $\frac{1}{n}\sum_{j=1}^{n} y'_j$.
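The data flow of steps (1) and (3) can be sketched as follows. This is an illustrative outline, not the patented implementation: the noise standard deviation, the record layout, the function names and the `stub_model` predictor are hypothetical, and the ΔG values are invented placeholders rather than measured data.

```python
import random
from collections import defaultdict
from statistics import mean


def split_molecules(records, test_fraction=0.1, seed=0):
    """Split at the molecule level, before augmentation, so that SMILES
    variants of one molecule never appear in both sets."""
    shuffled = records[:]
    random.Random(seed).shuffle(shuffled)
    n_test = max(1, round(len(shuffled) * test_fraction))
    return shuffled[n_test:], shuffled[:n_test]


def augment(records, noise_std, rng):
    """One (name, SMILES, Y) row per SMILES variant of each molecule.
    noise_std > 0 adds Gaussian noise to dG; noise_std == 0 keeps Y unchanged."""
    rows = []
    for rec in records:
        for smi in rec["smiles_variants"]:
            noise = rng.gauss(0.0, noise_std) if noise_std > 0 else 0.0
            rows.append((rec["name"], smi, rec["delta_g"] + noise))
    return rows


def aggregate_predictions(pairs):
    """Average the per-SMILES predictions of each molecule."""
    grouped = defaultdict(list)
    for name, y_pred in pairs:
        grouped[name].append(y_pred)
    return {name: mean(values) for name, values in grouped.items()}


def stub_model(smiles):
    """Hypothetical stand-in for the trained model; not a real predictor."""
    return -10.0 - 0.5 * smiles.count("(")


# Placeholder records: three equivalent SMILES of ethanol, invented dG values.
toy_records = [
    {"name": f"mol{i}", "smiles_variants": ["CCO", "OCC", "C(O)C"],
     "delta_g": -10.0 - i}
    for i in range(10)
]

rng = random.Random(42)
train, test = split_molecules(toy_records)
train_rows = augment(train, noise_std=0.1, rng=rng)  # noisy Y for training
test_rows = augment(test, noise_std=0.0, rng=rng)    # Y unchanged for testing
per_molecule = aggregate_predictions(
    (name, stub_model(smi)) for name, smi, _ in test_rows
)
```

Splitting before augmenting matters: if the SMILES variants were generated first and the rows split afterwards, different spellings of the same molecule could land in both sets and inflate the test score.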

The results are shown in Table 1.

TABLE 1

The results show that the solubilizing effect of SBE-β-cyclodextrin on the compounds predicted by the transfer learning model is consistent with the experimental data in the literature.
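The freezing described in step (2) of Example 1 can be illustrated with a dependency-free sketch. It assumes a fixed character-embedding table averaged over the string as a stand-in for the pre-trained embedding and encoding layers, and a linear regressor trained by stochastic gradient descent as the head. None of these choices, names, hyperparameters or ΔG values come from the patent; a real implementation would normally use a deep learning framework and disable gradient updates for the frozen layers.

```python
import random


class FrozenEncoder:
    """Stand-in for the pre-trained embedding + encoding layers."""

    def __init__(self, vocab, dim, seed=0):
        rng = random.Random(seed)
        self.table = {ch: [rng.uniform(-1, 1) for _ in range(dim)] for ch in vocab}

    def __call__(self, smiles):
        # Mean of per-character vectors gives a fixed-length feature vector.
        vecs = [self.table[ch] for ch in smiles]
        return [sum(col) / len(vecs) for col in zip(*vecs)]


class LinearRegressor:
    """Trainable head with random initial parameters."""

    def __init__(self, dim, seed=1):
        rng = random.Random(seed)
        self.w = [rng.uniform(-0.1, 0.1) for _ in range(dim)]
        self.b = 0.0

    def predict(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x)) + self.b


def mse(head, feats, ys):
    """Mean squared error of the head on pre-computed features."""
    return sum((head.predict(x) - y) ** 2 for x, y in zip(feats, ys)) / len(ys)


def train_head(encoder, head, data, lr=0.05, epochs=500):
    """Update the regressor only; the encoder parameters are never touched."""
    feats = [encoder(smi) for smi, _ in data]  # frozen features, computed once
    ys = [y for _, y in data]
    for _ in range(epochs):
        for x, y in zip(feats, ys):
            err = head.predict(x) - y
            head.w = [wi - lr * err * xi for wi, xi in zip(head.w, x)]
            head.b -= lr * err
    return mse(head, feats, ys)


data = [("CCO", -1.0), ("CCN", 0.5), ("c1ccccc1", 2.0)]  # placeholder dG values
vocab = sorted({ch for smi, _ in data for ch in smi})
encoder = FrozenEncoder(vocab, dim=8)
head = LinearRegressor(dim=8)

snapshot = {ch: vec[:] for ch, vec in encoder.table.items()}
loss_before = mse(head, [encoder(s) for s, _ in data], [y for _, y in data])
loss_after = train_head(encoder, head, data)
```

After training, `encoder.table` is identical to `snapshot`, while the head has moved away from its random initialization and the training loss has dropped. Keeping the encoder fixed means the small labeled data set only has to fit the few parameters of the regressor.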

Example 2

Other prior-art models were used for prediction on the same data set. The prediction accuracy of each model was evaluated with $R^2$, calculated as:

$$R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}$$

where $i$ is the index of a molecule in the test set (the $i$-th molecule), $\hat{y}_i$ is the predicted value of the solubility of the $i$-th molecule after solubilization by cyclodextrin, $y_i$ is the experimentally determined ΔG of the $i$-th molecule, and $\bar{y}$ is the mean of the experimentally determined ΔG values over the test data set. The prediction performance of the different models is shown in Table 2.

TABLE 2

Model	Test set R²
Transfer learning model	0.917
GBDT	0.821
Random forest	0.857
AdaBoost	0.781

The results show that the transfer learning model (test set R² of 0.917) predicts more accurately than the prior-art random forest model (0.857) and the other two prediction models (GBDT, 0.821; AdaBoost, 0.781).
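The R² metric reported in Table 2 can be computed as in the short sketch below. It assumes the standard coefficient of determination, one minus the residual sum of squares divided by the total sum of squares, which matches the quantities defined in Example 2 (predicted value, experimental ΔG, and mean experimental ΔG). The function name and the sample numbers are placeholders, not data from the patent.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination between experimental and predicted values."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equal in length")
    y_bar = sum(y_true) / len(y_true)                         # mean experimental value
    ss_res = sum((y - p) ** 2 for y, p in zip(y_true, y_pred))  # residual sum of squares
    ss_tot = sum((y - y_bar) ** 2 for y in y_true)              # total sum of squares
    return 1.0 - ss_res / ss_tot


# Placeholder values for illustration only.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [1.0, 2.0, 3.0, 5.0]
score = r_squared(y_true, y_pred)
```

A perfect prediction gives a score of 1, and a model that always predicts the mean experimental value gives 0, so the values in Table 2 can be read as the fraction of variance in the experimental ΔG that each model explains on the test set.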

Example 3

The β-cyclodextrin solubilization data set from the literature was divided into a training set (90%) and a test set (10%), and the transfer learning model of the present invention was used to predict the solubilizing effect of β-cyclodextrin on the compound molecules in the test set. The results are shown in Table 3.

TABLE 3

The results show that the transfer learning model of the invention also predicts the β-cyclodextrin solubilization system with high accuracy, indicating that the model generalizes well.

It should be noted that the above-mentioned several preferred embodiments are further non-limiting detailed descriptions of the technical solutions of the present invention, and are only used for illustrating the technical concepts and features of the present invention. It is intended that the present invention be understood and implemented by those skilled in the art, and not limited thereto. All equivalent changes and modifications made according to the spirit of the present invention should be covered within the protection scope of the present invention.

Claims

1. A model for predicting the solubilizing effect of cyclodextrin on compound molecules, characterized in that: a transfer learning model is used, wherein the transfer learning model comprises a pre-training phase and a QSPR modeling phase, the data learned in the pre-training phase are the SMILES sequences of compound molecules that are not included by cyclodextrin, and the cyclodextrin is β-cyclodextrin or SBE-β-cyclodextrin.

2. The model of claim 1, wherein: the transfer learning model further comprises a model application domain adaptation fine-tuning phase added after the pre-training phase.

3. The model of claim 2, wherein: the data trained on in the model application domain adaptation fine-tuning phase are the molecular features of the cyclodextrin-included compounds.

4. The model of claim 3, wherein: the data trained on in the model application domain adaptation fine-tuning phase are the SMILES sequences of the cyclodextrin-included compound molecules.

5. The model of claim 2, wherein: the model is trained by a self-supervised learning method in the pre-training phase or the model application domain adaptation fine-tuning phase.

6. The model of claim 2, wherein: the QSPR modeling phase shares model parameters of a pre-training phase or a model application domain adaptation fine-tuning phase.

7. The model of claim 1, wherein: the molar ratio of the compound molecules to the cyclodextrin in the inclusion compound is 1.

8. The model of claim 7, wherein: the molar ratio of the compound molecules to the cyclodextrin in the inclusion compound is 1.