
CN112420135B - Virtual sample generation method based on sample method and quantile regression - Google Patents

Virtual sample generation method based on sample method and quantile regression

Info

Publication number
CN112420135B
Authority
CN
China
Prior art keywords
sample
virtual
variable
independent variable
samples
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202011312051.XA
Other languages
Chinese (zh)
Other versions
CN112420135A (en
Inventor
朱群雄
朱梅玉
贺彦林
徐圆
张洋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Chemical Technology
Original Assignee
Beijing University of Chemical Technology
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Chemical Technology filed Critical Beijing University of Chemical Technology
Priority to CN202011312051.XA priority Critical patent/CN112420135B/en
Publication of CN112420135A publication Critical patent/CN112420135A/en
Application granted granted Critical
Publication of CN112420135B publication Critical patent/CN112420135B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • GPHYSICS
    • G16INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR SPECIFIC APPLICATION FIELDS
    • G16CCOMPUTATIONAL CHEMISTRY; CHEMOINFORMATICS; COMPUTATIONAL MATERIALS SCIENCE
    • G16C20/00Chemoinformatics, i.e. ICT specially adapted for the handling of physicochemical or structural data of chemical particles, elements, compounds or mixtures
    • G16C20/70Machine learning, data mining or chemometrics

Landscapes

  • Engineering & Computer Science (AREA)
  • Medical Informatics (AREA)
  • Chemical & Material Sciences (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Evolutionary Computation (AREA)
  • General Health & Medical Sciences (AREA)
  • Health & Medical Sciences (AREA)
  • Artificial Intelligence (AREA)
  • Crystallography & Structural Chemistry (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Bioinformatics & Cheminformatics (AREA)
  • Bioinformatics & Computational Biology (AREA)
  • Computing Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Complex Calculations (AREA)

Abstract

The invention discloses a virtual sample generation method based on a sample method and quantile regression. The input space of the original samples is divided by the sample method, virtual sample inputs are generated at sparse positions, the virtual sample outputs are predicted by Gaussian process regression, and the influence trend of the independent variables on the dependent variable is then analyzed by quantile regression so that the generated virtual samples can be screened. Experiments on a standard function dataset and an actual industrial dataset show that the invention effectively expands the sample set and thereby improves modeling accuracy. The invention uses the relative importance of the independent variables to the dependent variable to generate virtual samples according to their different degrees of influence. In addition, the invention uses the correlation between the independent variables and the dependent variable to delete the virtual samples that do not conform to that correlation.

Description

Virtual sample generation method based on sample method and quantile regression
Technical Field
The invention relates to the technical field of chemical industry prediction, in particular to a virtual sample generation method based on a sample method and quantile regression.
Background
Chemical processes are characterized by large production scale, complex process flows, and a wide variety of raw materials, and they are also affected by factors such as enterprise management, so process modeling is very important for automatically handling abnormal and irregular process events. The development of modern science and technology and the upgrading of storage equipment allow large amounts of data to be stored, and society has entered the cloud era. Nevertheless, it is difficult to obtain enough representative samples to build a model that captures the inherent features of routine operation and maintenance. On the one hand, small fluctuations between data, the high cost of data acquisition, and the low probability of abnormal events greatly reduce the representativeness of the acquired samples. On the other hand, nonlinearity, noise, missing values, and uncertainty in the data leave few data samples available for modeling. Thus, the "big data, small samples" problem remains prominent. The small sample problem refers to the situation in which the number of available samples is small, usually fewer than 30. It reflects a more fundamental problem of insufficient information: the limited samples cannot fully describe the whole sample feature space, so the overall features are inadequately expressed. Modeling directly with small samples compromises both the accuracy and the applicability of the model.
The small sample problem is not limited to the process industry; it is ubiquitous in fields such as computer science, biomedical engineering, and materials science. A number of well-established machine learning methods can learn directly from small sample datasets, including grey prediction models (GM), support vector machines (SVM), kernel regression, and Bayesian networks. The grey prediction model is one of the most widely used tools for modeling and analyzing uncertain systems in which part of the information is known and part is unknown. It is suited to one-dimensional data that varies exponentially, but it is difficult to apply to time-series data with uneven intervals. The SVM describes the data distribution through its margin structure, which relaxes the requirements on the size and distribution of the data. In many engineering applications, the SVM performs better than traditional neural networks when the training dataset is limited, but it is very sensitive to noisy data and outliers. Bayesian networks are directed acyclic graphs that capture the probabilistic dependencies between variables. Their main advantage over other methods is that existing data can easily be combined with expert judgment within a probabilistic framework for reasoning under uncertainty. However, it is generally not feasible to learn the structure and parameters of a Bayesian network from a small dataset.
Besides learning directly from the small sample dataset, the small sample problem can be addressed by generating new samples with virtual sample generation (VSG) techniques. The main idea is to generate virtual samples from the original small sample dataset by exploiting latent information such as prior knowledge or the sample distribution function, thereby filling the information gaps between samples, improving the generalization capability obtained from the original dataset, and improving model accuracy. In general, virtual sample generation techniques can be divided into four categories: (1) distribution-based techniques, represented by Bootstrap and SMOTE; (2) deep-learning-based techniques, represented by the generative adversarial network (GAN) and the variational autoencoder (VAE); (3) information-diffusion-based techniques, including mega-trend diffusion (MTD); and (4) SVM-based techniques. Distribution-based techniques fit a parameterized probability model to the original small sample and then sample from the fitted model to generate new samples. Bootstrap is a resampling method that obtains new samples by sampling with replacement and uses the sampling distribution to simulate the true distribution. In essence it does not generate samples that differ from the original small sample, so the generated samples carry no new information. SMOTE synthesizes new samples by repeatedly sampling along the line segments between a minority-class sample and its neighbors.
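The difference between the two distribution-based techniques named above can be illustrated with a minimal sketch. The function names `bootstrap` and `smote_like`, and the three example points, are hypothetical and chosen only for illustration; the second function shows only the interpolation step of SMOTE, not neighbor search or class handling.

```python
import random

def bootstrap(samples, k, rng):
    """Draw k samples with replacement; every draw is a copy of an original."""
    return [rng.choice(samples) for _ in range(k)]

def smote_like(a, b, rng):
    """Return a point on the line segment between sample a and its neighbor b."""
    t = rng.random()  # interpolation weight in [0, 1)
    return tuple(ai + t * (bi - ai) for ai, bi in zip(a, b))

rng = random.Random(0)
originals = [(0.0, 0.0), (1.0, 2.0), (2.0, 1.0)]  # hypothetical small sample
resampled = bootstrap(originals, 5, rng)          # contains no new points
synthetic = smote_like(originals[0], originals[1], rng)  # a new point on the segment
```

Every element of `resampled` is one of the originals, which is why Bootstrap adds no new information, whereas `synthetic` is a point that did not exist before.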
Deep-learning-based virtual sample generation methods (such as GANs and VAEs) use a nonlinear multi-layer structure that gives them strong feature representation capability, so they work well for generating structured virtual samples (such as synthetic images) in the field of image recognition. Information-diffusion-based virtual sample generation techniques generate virtual samples by filling the information gaps of the original small samples. MTD uses a triangular membership function to estimate the overall distribution and asymmetrically extends the boundaries of the small sample. SVM-based techniques generate virtual samples near the limit state function that the SVM approximates.
None of the existing methods considers the correlation between the independent variables and the dependent variable or the degree of influence of the independent variables on the dependent variable, so the virtual samples they generate are not constrained by the relationship between the independent variables and the dependent variable.
Disclosure of Invention
In order to solve the limitations and defects existing in the prior art, the invention provides a virtual sample generation method based on a sample method and quantile regression, which comprises the following steps:
Analyzing the importance of the independent variables relative to the dependent variable by dominance analysis, dividing the input space into sample squares, generating virtual inputs in the input space, generating virtual outputs by Gaussian process regression, analyzing the influence trend of the independent variables on the dependent variable according to quantile regression, and screening the virtual samples according to quantile regression;
The step of analyzing the importance of the independent variables relative to the dependent variable by dominance analysis comprises: performing dominance analysis on the independent variables and the dependent variable in the original small sample dataset to obtain the relative importance of each independent variable relative to the dependent variable;
The step of dividing the input space into sample squares comprises: dividing the input space into sample squares according to the proportional relation between the relative importance and the side length of the sample square, while controlling the division result through the total number of sample squares so that the number of sample squares after division is within a preset range;
The step of generating virtual inputs in the input space comprises: according to the divided sample squares, taking the Cartesian product of the projection values in each sample square to generate virtual inputs, wherein the Euclidean distance between the original small samples that supply the projections of each Cartesian product is smaller than a preset value;
The step of generating virtual outputs by Gaussian process regression comprises: modeling the original small sample dataset by Gaussian process regression, and predicting the virtual outputs corresponding to the virtual inputs;
The step of analyzing the influence trend of the independent variables on the dependent variable according to quantile regression comprises: modeling the original small sample by linear quantile regression, and analyzing the influence trend of the independent variables on the dependent variable;
The step of screening the virtual samples according to quantile regression comprises: screening the generated virtual samples and deleting those that do not conform to the correlation between the independent variables and the dependent variable, the remaining virtual samples being the finally generated virtual samples.
Optionally, the step of analyzing the importance of the independent variable relative to the dependent variable further comprises:
Quantifying the importance of the independent variables relative to the dependent variable using dominance analysis, and separating the relative importance of each independent variable into a direct effect, a total effect, and a partial effect, wherein the direct effect is the effect of the independent variable on the dependent variable on its own; the total effect is the effect of the independent variable on the dependent variable when it is included together with all other independent variables; and the partial effect is the effect of the independent variable on the dependent variable when it is included together with some of the other independent variables;
Calculating the relative importance of each independent variable as the average of the direct effect, the total effect, and the partial effect;
Comparing the relative importance of the independent variables.
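The dominance analysis steps above can be sketched as follows. This is a minimal illustration, not the patent's implementation: it assumes ordinary least squares with an intercept and uses the gain in R² as the "effect" of a variable, and the function names and the synthetic data are hypothetical. The gain with no other variables present plays the role of the direct effect, the gain with all others present the total effect, and the gains with intermediate subsets the partial effects.

```python
from itertools import combinations
import numpy as np

def r_squared(X, y, cols):
    """R^2 of a least-squares fit of y on the given columns plus an intercept."""
    if not cols:
        return 0.0
    A = np.column_stack([np.ones(len(y)), X[:, list(cols)]])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    return 1.0 - resid @ resid / np.sum((y - y.mean()) ** 2)

def relative_importance(X, y):
    """Average R^2 gain of each variable over subset sizes, normalized to sum to 1."""
    p = X.shape[1]
    weights = np.zeros(p)
    for i in range(p):
        others = [j for j in range(p) if j != i]
        per_size = []
        for k in range(p):  # k = number of other variables already in the model
            gains = [r_squared(X, y, s + (i,)) - r_squared(X, y, s)
                     for s in combinations(others, k)]
            per_size.append(np.mean(gains))
        weights[i] = np.mean(per_size)
    return weights / weights.sum()

# Hypothetical data: x1 influences y far more strongly than x2.
rng = np.random.default_rng(0)
X = rng.uniform(0.0, 1.0, size=(30, 2))
y = 4.0 * X[:, 0] + 0.5 * X[:, 1] + rng.normal(0.0, 0.05, size=30)
imp = relative_importance(X, y)
```

With this construction the importances are non-negative and sum to 1, so they can be compared directly across independent variables.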
Optionally, the step of dividing the input space into sample squares further includes: selecting the minimum Euclidean distance $dist_{\min}$ in the input space and the minimum relative importance $imp_{\min}$, and obtaining the side length of the sample square in each dimension according to the proportional relation between the minimum Euclidean distance and the minimum relative importance;
Obtaining the sample division number $sample_i$ of each dimension according to the initial sample square side length $l_i$ of each dimension and the total side length $L_i$ of each dimension, wherein the expression of the sample division number is as follows:
$$sample_i = \left\lceil \frac{L_i}{l_i} \right\rceil$$
The expression of the total side length $L_i$ is as follows:
$$L_i = x_{i,\max} - x_{i,\min} \quad (i = 0, 1, \ldots, n)$$
Multiplying the sample division numbers of all dimensions to obtain the total number of sample squares, wherein the expression of the total number of sample squares is as follows:
$$sample_{total} = \prod_{i} sample_i$$
If the total number of sample squares is larger than the preset number of sample squares, the side length of each sample square is set to 1.2 times its previous value, the sample squares are divided again with the new side lengths, and these steps are repeated until the total number of sample squares is within the preset number.
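The iterative division described above can be sketched as a short loop. This is a minimal illustration under stated assumptions: the division number of each dimension is taken as the total side length divided by the square side length, rounded up, and the preset limit `max_total=120` is a hypothetical value that does not come from the source. The side lengths and ranges are the ones reported for the standard function example in the embodiment.

```python
import math

def divide_sample_squares(total_lengths, side_lengths, max_total, growth=1.2):
    """Grow all side lengths by `growth` until the total square count fits."""
    sides = list(side_lengths)
    while True:
        # Number of squares per dimension, rounded up (assumed rounding rule).
        counts = [math.ceil(L / l) for L, l in zip(total_lengths, sides)]
        if math.prod(counts) <= max_total:
            return counts, sides
        sides = [l * growth for l in sides]

# Initial side lengths 0.00949 and 0.001 over ranges 0.9616 and 0.9169.
counts, sides = divide_sample_squares([0.9616, 0.9169], [0.00949, 0.001], max_total=120)
```

Because every side length is multiplied by the same factor, the ratio between the side lengths, and therefore the link to relative importance, is preserved while the total count shrinks.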
Optionally, the step of generating a virtual input by the input space further includes:
Projecting each original sample onto each dimension of the x space to obtain a projection set of the dimensions, wherein the expression of the projection set is as follows:
$$\Omega = \{x_{i,j}\} \quad (i = 0, 1, \ldots, n;\ j = 0, 1, \ldots, m)$$
Checking each sample square, starting from the dimension with the largest number of sample squares and from the smallest x value in that dimension: if the sample square contains a projection in that dimension, it is not processed; if it contains no projection in that dimension, a projection value is generated at the center of the sample square side and a point of the input space is generated, the values of the other dimensions being equal to the midpoint of the maximum value and the minimum value, wherein the expression of the midpoint is as follows:
$$x_{i,\mathrm{mid}} = \frac{x_{i,\max} + x_{i,\min}}{2}$$
Taking the Cartesian product of the projections of each dimension within each sample square to obtain samples with different combinations of values;
For each generated point, deleting the point if it coincides with the x value of an original sample point;
For the original sample points from which a point is generated, calculating their Euclidean distance in the input space, and deleting the point if this distance is greater than 1/7 of the maximum Euclidean distance, the retained points being the generated virtual inputs.
Optionally, the step of analyzing the influence trend of the independent variable on the dependent variable according to quantile regression includes:
Analyzing the influence trend of the independent variable on the dependent variable according to a linear quantile regression model to obtain a quantile regression model under 10 quantiles, wherein the expression of the linear quantile regression model is as follows:
The 10 quantiles are 0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, and 0.95.
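For reference, a linear quantile regression model at quantile level $\tau$ is commonly written in the textbook form below. The symbols $\beta_0$, $\beta_i$, and $\rho_\tau$ are standard notation assumed here and are not taken from the patent, whose own expression may use different symbols (for example $b$ for the regression coefficient).

```latex
Q_y(\tau \mid x) = \beta_0(\tau) + \sum_{i} \beta_i(\tau)\, x_i ,
\qquad
\hat{\beta}(\tau) = \arg\min_{\beta} \sum_{j} \rho_\tau\!\Bigl( y_j - \beta_0 - \sum_{i} \beta_i\, x_{i,j} \Bigr),
\qquad
\rho_\tau(u) = u \,\bigl( \tau - \mathbf{1}[u < 0] \bigr)
```

Fitting this model once for each of the 10 quantile levels gives one coefficient per independent variable and per quantile, which describes how the influence of that variable changes across the distribution of the dependent variable.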
Optionally, the step of screening the virtual samples according to quantile regression includes:
Establishing an unchecked queue and a qualified queue according to the generated virtual sample;
Searching the unchecked queue for the point closest to the qualified queue and checking it;
For the predicted value y1' of the virtual sample and the predicted value y2' of the point closest to it in the qualified queue, determining the quantiles of y1' and y2' according to the output value quantile points obtained from the original samples, and obtaining the regression coefficient b of the corresponding independent variable at those quantiles;
If the virtual sample conforms to the correlation between the independent variable and the dependent variable, the virtual sample is retained; otherwise it is deleted, wherein the expression of the correlation between the independent variable and the dependent variable is as follows:
wherein b is a regression coefficient.
The invention has the following beneficial effects:
The invention studies the relative importance of the independent variables to the dependent variable and uses it as the basis for dividing the input space into sample squares. In the traditional sample method, every sample square has the same side length; even when the side lengths differ, they are set without a principled basis and are mostly obtained from experience or by comparing enumeration experiments.
In the invention, the correlation between the independent variables and the dependent variable is taken into account when generating virtual samples: the generated samples are screened, and those that do not conform to the correlation are removed. Conventional virtual sample generation methods do not consider this and do not analyze whether the generated virtual samples are reasonable. The virtual samples screened by the invention conform better to the mechanism of the chemical process.
Drawings
Fig. 1 is a flowchart of a virtual sample generation method based on a sample method and quantile regression according to an embodiment of the present invention.
Fig. 2 is a flowchart of dividing an input space into sample squares according to an embodiment of the present invention.
Fig. 3 is a diagram showing the relative importance of the independent variables to the dependent variable for the f(x1, x2) standard function dataset according to an embodiment of the present invention.
Fig. 4 is a schematic diagram of the sample square division for the f(x1, x2) standard function dataset according to an embodiment of the present invention.
Fig. 5 is a diagram of virtual input generation in the input space of the f(x1, x2) standard function dataset according to an embodiment of the present invention.
Fig. 6 is a distribution diagram of the virtual samples in the input space of the f(x1, x2) standard function dataset after screening according to an embodiment of the present invention.
Fig. 7a is a diagram showing the effect of quantile regression screening on the f(x1, x2) standard function dataset according to an embodiment of the present invention.
Fig. 7b is a diagram showing another effect of quantile regression screening on the f(x1, x2) standard function dataset according to an embodiment of the present invention.
Fig. 8a is a diagram showing the result of uniform sample square division for the f(x1, x2) standard function dataset according to an embodiment of the present invention.
Fig. 8b is a diagram showing another result of uniform sample square division for the f(x1, x2) standard function dataset according to an embodiment of the present invention.
Detailed Description
In order to enable those skilled in the art to better understand the technical scheme of the present invention, the method for generating a virtual sample based on the sample method and quantile regression provided by the present invention is described in detail below with reference to the accompanying drawings.
Example 1
The technical scheme provided by this embodiment addresses the following problem: because small samples characterize the overall features of the population inadequately, models built directly on small samples have low accuracy and poor generalization capability. This embodiment provides a virtual sample generation technique that improves the accuracy of small sample modeling by increasing the number of samples.
This embodiment focuses on dividing the input space of the samples by the sample method, generating virtual sample inputs in data-sparse regions, generating the corresponding virtual outputs by Gaussian process regression, and finally screening the virtual samples by quantile regression. The main contribution is that the relative importance of the independent variables to the response variable is analyzed and virtual samples are generated according to the different degrees of influence, after which the correlation between the independent variables and the response variable is used to delete the virtual samples that do not conform to it. Compared with previous methods, models built with the virtual samples obtained by the method of this embodiment have smaller errors.
Specifically, this embodiment provides a virtual sample generation method based on a sample method and quantile regression, the method including the following processes: analyzing the importance of the independent variables relative to the dependent variable by dominance analysis, dividing the input space into sample squares, generating virtual inputs in the input space, generating virtual outputs by Gaussian process regression, analyzing the influence trend of the independent variables on the dependent variable according to quantile regression, and screening the virtual samples according to quantile regression.
In this embodiment, the process of analyzing the importance of the independent variables relative to the dependent variable by dominance analysis is: performing dominance analysis on the independent variables and the dependent variable in the original small sample dataset to obtain the relative importance of each independent variable relative to the dependent variable.
In this embodiment, the process of dividing the input space into sample squares is: dividing the input space into sample squares according to the proportional relation between the relative importance and the side length of the sample square, while controlling the division result through the total number of sample squares so that the number of sample squares after division is within a reasonable range.
In this embodiment, the process of generating virtual inputs in the input space is: according to the divided sample squares, taking the Cartesian product of the projection values in each sample square to generate virtual inputs, while ensuring that the Euclidean distance between the original small samples that supply the projections of each Cartesian product is smaller than a certain value, so that an excessive number of generated samples is avoided.
In this embodiment, the process of generating virtual outputs by Gaussian process regression is: modeling the original small sample dataset by Gaussian process regression, and predicting the virtual outputs corresponding to the virtual inputs.
In this embodiment, the process of analyzing the influence trend of the independent variables on the dependent variable according to quantile regression is: modeling the original small sample by linear quantile regression, and analyzing the influence trend of the independent variables on the dependent variable.
In this embodiment, the process of screening the virtual samples according to quantile regression is: screening the generated virtual samples and deleting those that do not conform to the correlation between the independent variables and the dependent variable, the remaining samples being the finally generated virtual samples.
Fig. 1 is a flowchart of a virtual sample generation method based on a sample method and quantile regression according to an embodiment of the present invention. As shown in fig. 1, the virtual sample generation process is divided into three steps: the first step generates virtual input values in the input space, the second step generates the output values of the corresponding virtual samples through Gaussian process regression, and the third step screens the virtual samples according to quantile regression, the screening result being the final set of virtual samples.
In the stage of generating virtual input values in the input space, dominance analysis is first performed on the small sample data to obtain the relative importance of the independent variables. A large relative importance means that the independent variable accounts for a large proportion of the explained variance of the dependent variable, so that, where the data take the same value in that dimension, the data are more dispersed in the other dimensions. The side length of the sample square in this dimension is set accordingly when the sample squares are divided.
Thus, when the sample squares are divided, the side length of the sample square is proportional to the relative importance of the independent variable. Fig. 2 is a flowchart of dividing an input space into sample squares according to an embodiment of the present invention. As shown in fig. 2, in this embodiment the minimum Euclidean distance $dist_{\min}$ in the input space is selected and matched to the minimum relative importance $imp_{\min}$, which gives the division of the sample squares in each dimension. The initial side length $l_i$ of the sample square in each dimension follows from this proportional relation, and the total side length of each dimension is
$$L_i = x_{i,\max} - x_{i,\min} \quad (i = 0, 1, \ldots, n) \tag{2}$$
The sample division number of each dimension is obtained as
$$sample_i = \left\lceil \frac{L_i}{l_i} \right\rceil \tag{3}$$
Multiplying the sample division numbers of all dimensions gives the total number of sample squares
$$sample_{total} = \prod_{i} sample_i \tag{4}$$
If the side length of the sample square is too small, there will be too many sample squares, so the side length is in turn controlled through the number of sample squares. If the total number of sample squares is larger than the set value, the side length of each sample square becomes 1.2 times its previous value, namely
$$l_{i,\mathrm{new}} = l_{i,\mathrm{old}} \times 1.2 \tag{5}$$
The sample squares are then divided again, and these steps are repeated until the total number of sample squares is within the specified range. In a data structure with multiple independent variables, some independent variables have relatively small importance, which can make the division along one dimension particularly dense; to avoid this, a maximum number of sample square divisions per dimension is specified, determined by the number of independent variables.
After the sample squares are divided, virtual inputs are generated in the input space. First, each original sample is projected onto each dimension of the input space, and the projection set of the dimensions is $\Omega = \{x_{i,j}\}\ (i = 0, 1, \ldots, n;\ j = 0, 1, \ldots, m)$.
Each dimension is then checked: if one of the sample square intervals of the dimension contains no projection value, a projection value is generated at the center point of that interval. At the same time, a virtual sample point is generated whose value in this dimension is the generated projection value and whose values in the other dimensions are the midpoints of their maximum and minimum values, for use in later calculation.
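The check for empty intervals along one dimension can be sketched as follows. This is a minimal illustration with hypothetical names and values: `lo` is the minimum of the dimension, `side` the sample square side length, and `count` the number of intervals in that dimension.

```python
def fill_empty_intervals(values, lo, side, count):
    """Return center points of the intervals along one dimension that hold no projection."""
    # Index of the interval each projection falls into (the maximum goes in the last one).
    occupied = {min(int((v - lo) / side), count - 1) for v in values}
    return [lo + (k + 0.5) * side for k in range(count) if k not in occupied]

# Hypothetical projections of four samples onto one dimension split into 4 intervals.
centers = fill_empty_intervals([0.05, 0.10, 0.80, 0.95], lo=0.0, side=0.25, count=4)
```

Here the two middle intervals are empty, so `centers` holds their center points; each of these would become the coordinate of a new point whose other coordinates are set to the midpoint of their dimension.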
When the virtual samples are generated, the Cartesian product of the projection values in each sample square is taken. The projection values come from the original sample points. Sample points that are far apart in Euclidean distance are only weakly correlated and can be ignored when forming the Cartesian product, so the sample points can be screened before generation. Screening principle: the Euclidean distance in the input space between the original samples used to generate a virtual sample should be less than 1/7 of the maximum Euclidean distance. The range of influence of an original sample point is limited; if the distance is too large, data are generated in very sparse regions, and the error of the output value obtained by Gaussian process regression can be very large. The deleted samples are further processed as follows: (1) Virtual samples should be generated within every sample square. According to the sample method, if virtual samples are generated in every sample square, they fill the input space effectively to a certain extent. Thus, for an empty sample square, one deleted point is randomly selected and restored. (2) Virtual samples generated in the edge region of the small sample space are deleted. Different generation methods should be used in the small sample space and in the extended space, and a method that performs well in the small sample space is not necessarily applicable in the extended space. The virtual samples in the edge region are therefore deleted; in this embodiment, the virtual samples in the two sample squares at the maximum and minimum ends of each dimension are deleted.
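The Cartesian product with the distance screening can be sketched as below. This is a simplified illustration, not the patent's procedure: it combines the coordinates of pairs of original points without restricting the pairs to one sample square, and it omits the restoration of deleted points and the edge-region deletion. The function name and the three example points are hypothetical.

```python
import math
from itertools import combinations, product

def cross_virtual_inputs(points, ratio=1 / 7):
    """Combine coordinates of nearby original points into new candidate inputs."""
    pairs = list(combinations(points, 2))
    max_dist = max(math.dist(a, b) for a, b in pairs)
    originals = set(points)
    virtual = set()
    for a, b in pairs:
        if math.dist(a, b) >= ratio * max_dist:
            continue  # source points too far apart: skip this combination
        # Cartesian product of the per-dimension projection values of a and b.
        for cand in product(*zip(a, b)):
            if cand not in originals:  # drop points coinciding with originals
                virtual.add(cand)
    return sorted(virtual)

points = [(0.10, 0.10), (0.15, 0.20), (0.90, 0.80)]
virtual = cross_virtual_inputs(points)
```

Only the first two points are closer than 1/7 of the maximum distance, so only their coordinates are recombined, giving the two new inputs (0.10, 0.20) and (0.15, 0.10).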
The second step is to generate virtual output values corresponding to the virtual inputs using gaussian process regression.
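The prediction of virtual outputs can be sketched with a small Gaussian process regression written in NumPy. This is a minimal sketch under stated assumptions: a zero-mean process, a squared-exponential kernel with a fixed length scale of 0.3, and a small noise term, none of which are specified by the source. The training data and virtual inputs are hypothetical stand-ins.

```python
import numpy as np

def rbf_kernel(A, B, length_scale=0.3):
    """Squared-exponential kernel between the rows of A and the rows of B."""
    sq = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return np.exp(-0.5 * sq / length_scale**2)

def gp_predict(X_train, y_train, X_new, noise=1e-6):
    """Posterior mean of a zero-mean Gaussian process at X_new."""
    K = rbf_kernel(X_train, X_train) + noise * np.eye(len(X_train))
    alpha = np.linalg.solve(K, y_train)
    return rbf_kernel(X_new, X_train) @ alpha

rng = np.random.default_rng(1)
X_train = rng.uniform(0.0, 1.0, size=(30, 2))            # stand-in for original inputs
y_train = np.sin(3.0 * X_train[:, 0]) + X_train[:, 1]    # hypothetical target function
X_virtual = np.array([[0.5, 0.5], [0.2, 0.8]])           # stand-in virtual inputs
y_virtual = gp_predict(X_train, y_train, X_virtual)
```

In practice the kernel hyperparameters would be fitted to the data rather than fixed, and the predictive variance could be used to judge how reliable each virtual output is.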
The third step is to analyze the influence trend of the independent variables on the dependent variable using quantile regression. For the same change in an independent variable, an independent variable of greater relative importance causes a larger change in the dependent variable, so the judgment is made along the dimension of greater relative importance. The sample squares that contain original samples are selected for screening.
The dimension with the greatest relative importance is found; for each projection in that dimension, the corresponding original sample point is found; starting from that original sample point, the virtual sample points that differ from it in the i-th dimension but are the same in the other dimensions are found; and two queues are established, an unchecked queue and a qualified queue.
Each time, the point in the unchecked queue nearest to the qualified queue is selected for checking.
In the checking process, the quantiles of the predicted value y1' of the virtual sample and of the predicted value y2' of the nearest point in the qualified queue are determined from the output value quantile points obtained from the original samples, and the regression coefficient b of the independent variable at the corresponding quantile is obtained. The regression coefficient expresses the relationship between the change in the independent variable of that dimension and the change in the dependent variable at that quantile. Thus, the virtual sample is retained if it satisfies the following relationship and deleted otherwise.
In this embodiment, to verify the validity of the virtual sample generation method based on the sample method and quantile regression, experiments were performed on both a standard function data set and an actual industrial data set.
In the standard function data set, the standard function f(x1, x2) is used, where x1 and x2 are mutually independent, together form x, and follow the uniform distribution U(0, 1).
Thirty sample points are randomly generated in the sample space, constituting a small-sample data set for training.
Dominance analysis is performed on the original samples, giving relative importances of 0.9422900361786678 and 0.05770996382133217 for x1 and x2, respectively. FIG. 3 is a diagram showing the relative importance of the independent variables to the dependent variable for the f(x1, x2) standard function data set according to an embodiment of the present invention. The Euclidean distances are then calculated: the minimum Euclidean distance is 0.001 and the minimum relative importance is 0.05770996382133217. On this basis, the initial side lengths of the sample squares are 0.00949 and 0.001. These side lengths are very small, so the number of sample squares would be very large, the number of generated virtual samples would increase accordingly, and the error would grow. The maximum and minimum values of each dimension are calculated, giving side lengths of 0.9616 and 0.9169 for the rectangle over which the data are distributed. FIG. 4 is a schematic diagram of the sample square division for the f(x1, x2) standard function data set according to an embodiment of the present invention. The division result is shown in FIG. 4: after iterative calculation, the numbers of sample squares are 3 and 38, and the side lengths of the sample squares are 0.41018627 and 0.02461118.
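The iterative division can be sketched as follows, assuming that the number of sample squares in a dimension is the data range divided by the side length, rounded up, and that `max_squares` is a hypothetical name for the preset upper bound. The growth factor of 1.2 is the one stated in claim 3.

```python
import math


def divide_sample_squares(side, total, max_squares, growth=1.2):
    """Enlarge the side lengths until the total number of sample squares fits.

    side:        initial side length of the sample square in each dimension
    total:       data range (max - min) in each dimension
    max_squares: hypothetical preset upper bound on the total number of squares
    """
    side = list(side)
    while True:
        # Assumed rule: squares per dimension = ceil(range / side length).
        counts = [math.ceil(L / l) for L, l in zip(total, side)]
        if math.prod(counts) <= max_squares:
            return counts, side
        side = [l * growth for l in side]  # every side grows by the same factor
```

The rounding-up assumption agrees with the figures reported above: for the final side lengths 0.41018627 and 0.02461118 and the ranges 0.9616 and 0.9169, it gives 3 and 38 sample squares.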
FIG. 5 is a diagram of the virtual inputs generated in the input space for the f(x1, x2) standard function data set according to an embodiment of the present invention. The virtual samples generated in the input space are shown in FIG. 5, where the square points are the generated virtual sample points, 114 in total; after the edge-region samples are screened out, 95 sample points remain.
The corresponding output values are predicted using Gaussian process regression, and the data are screened by quantile regression to obtain the virtual samples. FIG. 6 is a diagram of the distribution of the screened virtual samples in the input space for the f(x1, x2) standard function data set according to an embodiment of the present invention. The distribution in the input space is shown in FIG. 6, where the 36 virtual inputs are shown as square points. The true values generated from the standard function are compared with the predicted values, and the results are shown in Table 1.
Table 1. Comparison of the true values generated from the standard function and the predicted values
FIG. 7a is a graph showing one effect of quantile regression screening on the f(x1, x2) standard function data set according to an embodiment of the present invention. FIG. 7b is a graph showing another effect of quantile regression screening on the f(x1, x2) standard function data set according to an embodiment of the present invention. The 59 data points removed by quantile regression screening are compared with the true outputs in FIG. 7a, and the errors of the retained virtual samples are shown in FIG. 7b. The comparison shows that quantile regression screening is effective and removes the virtual samples with larger errors.
FIG. 8a is a graph showing the result of uniform sample square division for the f(x1, x2) standard function data set according to an embodiment of the present invention. FIG. 8b is another graph of uniform sample square division for the f(x1, x2) standard function data set according to an embodiment of the present invention. The effectiveness of the sample square division was also tested experimentally: the input space was divided into uniform sample squares, and virtual samples were then generated in the same way. The resulting errors are shown in FIG. 8b. The quality of these virtual samples is clearly inferior to that of the virtual samples generated by the method of this embodiment, which shows that the sample square division proposed in this embodiment is effective.
To verify the usefulness of this embodiment in a chemical process, experiments were performed using an industrial data set. The data set used is a high-density polyethylene (HDPE) data set with 15 inputs and 1 output. The HDPE data set has 180 data samples in total, of which 150 are used as the training set and 30 as the test set. In the virtual sample generation process, to avoid an excessive amount of computation for the Cartesian product, the strategy for generating virtual inputs in each sample square is adjusted. For each sample square, the original sample point closest to the center of the sample square is found and used as the basis for modification. Starting from the dimension with the largest number of sample squares, the coordinate of the original sample point in that dimension is replaced with a projection value of that dimension within the sample square, generating a virtual sample, and it is then checked whether the generated sample already exists. If it exists, the sample is regenerated using the next dimension in decreasing order of the number of sample squares; if it does not exist, the generated sample is accepted. In total 258 sample squares were divided and 39 virtual samples were finally generated. The model was tested on the test set, and the results are shown in Table 2. Compared with the results before virtual sample generation, the model accuracy is improved, and the results are also better than those obtained with Bootstrap and TTD.
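The adjusted strategy for high-dimensional data can be sketched as follows. The function and parameter names are hypothetical, and the sketch assumes that one projection value per dimension has already been chosen for the sample square.

```python
import math


def generate_for_square(center, projections, originals, counts, existing):
    """Generate one virtual input for a sample square without a full Cartesian product.

    center:      center point of the sample square
    projections: projections[d] is a projection value inside the square along dimension d
    originals:   original input samples (tuples)
    counts:      number of sample squares in each dimension
    existing:    set of inputs already present (originals and accepted virtual inputs)
    """
    # Original sample closest to the center of the square is the basis.
    base = min(originals, key=lambda p: math.dist(p, center))
    # Try dimensions in decreasing order of their number of sample squares.
    for d in sorted(range(len(counts)), key=lambda i: -counts[i]):
        candidate = base[:d] + (projections[d],) + base[d + 1:]
        if candidate not in existing:
            existing.add(candidate)
            return candidate
    return None  # every single-coordinate replacement already exists
```

Each call changes only one coordinate of an existing sample, so the cost per sample square grows linearly with the number of dimensions instead of exponentially, which is the motivation given above for a data set with 15 inputs.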
Table 2. Comparison of the results before virtual sample generation and with the Bootstrap, TTD, and proposed virtual sample generation methods
The results show that the virtual sample generation method provided by this embodiment improves small-sample modeling accuracy by generating virtual samples.
This embodiment provides a virtual sample generation method based on the sample method and quantile regression. Based on the relative importance of the independent variables, the method divides the input space of the original samples into sample squares, generates virtual inputs where samples are scarce, predicts the outputs of the virtual samples by Gaussian process regression, and then uses quantile regression to analyze the influence trend of the independent variables on the dependent variable so as to screen the generated virtual samples. Experiments on the standard function data set and the actual industrial data set show that this embodiment effectively expands the sample set and thereby improves modeling accuracy. This embodiment uses the relative importance of the independent variables to the dependent variable to generate virtual samples according to the degree of influence. In addition, it uses the correlation between the independent variables and the dependent variable and deletes the virtual samples that do not conform to this correlation.
It is to be understood that the above embodiments are merely exemplary embodiments used to illustrate the principles of the present invention, and the invention is not limited thereto. Various modifications and improvements may be made by those skilled in the art without departing from the spirit and substance of the invention, and such modifications and improvements are also considered to be within the scope of the invention.

Claims (6)

1. A virtual sample generation method based on a sample method and quantile regression, characterized by comprising the following steps:
Analyzing the importance of the independent variables relative to the dependent variable by dominance analysis, dividing the input space into sample squares, generating virtual inputs in the input space, generating virtual outputs by Gaussian process regression, analyzing the influence trend of the independent variables on the dependent variable according to quantile regression, and screening the virtual samples according to quantile regression;
The step of analyzing the importance of the independent variables relative to the dependent variable by dominance analysis comprises: performing dominance analysis on the independent variables and the dependent variable in an original small-sample data set to obtain the relative importance of the independent variables with respect to the dependent variable, wherein the original small-sample data set comprises an industrial data set, the industrial data set comprises a high-density polyethylene data set, and the high-density polyethylene data set has 15 inputs and 1 output and comprises 180 data samples;
The step of dividing the input space into sample squares comprises: dividing the input space into sample squares according to the proportional relation between the relative importance and the side length of the sample square, and controlling the division result according to the total number of sample squares so that the number of divided sample squares is within a preset range;
The step of generating virtual inputs in the input space comprises: according to the divided sample squares, taking the Cartesian product of the projection values in each sample square to generate the virtual inputs, wherein the Euclidean distances between the original small samples that supply the projections of each generated Cartesian product are smaller than a preset value;
The step of generating virtual outputs by Gaussian process regression comprises: modeling the original small-sample data set using Gaussian process regression, and predicting the virtual outputs corresponding to the virtual inputs;
The step of analyzing the influence trend of the independent variables on the dependent variable according to quantile regression comprises: modeling the original small samples using linear quantile regression, and analyzing the influence trend of the independent variables on the dependent variable;
The step of screening the virtual samples according to quantile regression comprises: screening the generated virtual samples and deleting the virtual samples that do not conform to the correlation between the independent variables and the dependent variable, the remaining virtual samples being the finally generated virtual samples.
2. The virtual sample generation method based on a sample method and quantile regression according to claim 1, wherein the step of analyzing the importance of the independent variables relative to the dependent variable by dominance analysis further comprises:
Quantifying the importance of the independent variables relative to the dependent variable using dominance analysis, and separating the relative importance of each independent variable into a direct effect, an overall effect, and a partial effect, wherein the direct effect is the effect of the independent variable on the dependent variable when taken alone; the overall effect is the effect of the independent variable on the dependent variable when it is included together with all other independent variables; and the partial effect is the effect of the independent variable on the dependent variable when it is included together with some of the other independent variables;
calculating the relative importance of each independent variable from the average of the direct effect, the overall effect, and the partial effect;
comparing the relative importance of the independent variables.
3. The virtual sample generation method based on a sample method and quantile regression according to claim 1, wherein the step of dividing the input space into sample squares further comprises: selecting the minimum Euclidean distance $dist_{min}$ and the minimum relative importance $imp_{min}$ in the input space, and obtaining the sample square division of each dimension according to the proportional relation between the minimum Euclidean distance and the minimum relative importance;
Obtaining the sample square division number $sample_i$ of each dimension according to the sample square side length $l_i$ of each dimension and the total side length $L_i$ of each dimension, wherein the expression of the sample square division number is as follows:
The expressions of the sample square side length $l_i$ and the total side length $L_i$ are as follows:
$L_i = x_{i,\max} - x_{i,\min} \quad (i = 0, 1, \ldots, n)$
multiplying the sample square division numbers of all dimensions to obtain the total number of sample squares, wherein the expression of the total number of sample squares is as follows:
$\prod_{i=0}^{n} sample_i$
If the total number of sample squares is greater than the preset number of sample squares, the side length of each sample square is set to 1.2 times its previous value, the sample squares are divided again according to the new side lengths, and these steps are repeated until the total number of sample squares is within the preset number of sample squares.
4. The virtual sample generation method based on a sample method and quantile regression according to claim 1, wherein the step of generating virtual inputs in the input space further comprises:
projecting each original sample onto each dimension of the x space to obtain a projection set for each dimension, wherein the expression of the projection set is as follows:
$\Omega = \{x_{i,j}\} \quad (i = 0, 1, \ldots, n;\ j = 0, 1, \ldots, m)$
Checking each sample square, starting from the dimension with the largest number of sample squares and from the smallest x in that dimension: if the sample square contains a projection of x in that dimension, it is not processed; if the sample square contains no projection of x in that dimension, a projection value is generated at the center of the side of the sample square in that dimension and a point of the input space is generated, the values of the other dimensions being equal to the midpoint of the maximum value and the minimum value, wherein the expression of the midpoint is as follows:
$x_i = \dfrac{x_{i,\max} + x_{i,\min}}{2}$
Taking the Cartesian product of the projections of all dimensions in each sample square to obtain samples with different combinations of values;
for each generated point, deleting the point if it coincides with the x value of an original sample point;
for the m original sample points used to generate a point, calculating the Euclidean distance between every two of them in the input space, and deleting the point if the Euclidean distance is greater than 1/7 of the maximum Euclidean distance, the retained points being the generated virtual inputs.
5. The virtual sample generation method based on a sample method and quantile regression according to claim 1, wherein the step of analyzing the influence trend of the independent variables on the dependent variable according to quantile regression comprises:
Analyzing the influence trend of the independent variables on the dependent variable according to a linear quantile regression model to obtain quantile regression models at 10 quantiles, wherein the expression of the linear quantile regression model is as follows:
The 10 quantiles are 0.05, 0.15, 0.25, 0.35, 0.45, 0.55, 0.65, 0.75, 0.85, and 0.95.
6. The virtual sample generation method based on a sample method and quantile regression according to claim 1, wherein the step of screening the virtual samples according to quantile regression comprises:
Establishing an unchecked queue and a qualified queue according to the generated virtual samples;
Searching the unchecked queue for the point closest to the qualified queue and checking it;
For the predicted value y1' of the virtual sample and the predicted value y2' of the point in the qualified queue closest to it, determining the quantile to which the predicted values belong according to the output-value quantiles obtained from the original samples, and obtaining the regression coefficient b of the corresponding independent variable at that quantile;
If the virtual sample conforms to the correlation between the independent variable and the dependent variable, the virtual sample is retained, and if it does not conform to the correlation between the independent variable and the dependent variable, the virtual sample is deleted, wherein the expression of the correlation between the independent variable and the dependent variable is as follows:
wherein b is the regression coefficient.
CN202011312051.XA 2020-11-20 2020-11-20 Virtual sample generation method based on sample method and quantile regression Active CN112420135B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202011312051.XA CN112420135B (en) 2020-11-20 2020-11-20 Virtual sample generation method based on sample method and quantile regression

Publications (2)

Publication Number Publication Date
CN112420135A CN112420135A (en) 2021-02-26
CN112420135B true CN112420135B (en) 2024-09-13

Family

ID=74777018

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202011312051.XA Active CN112420135B (en) 2020-11-20 2020-11-20 Virtual sample generation method based on sample method and quantile regression

Country Status (1)

Country Link
CN (1) CN112420135B (en)

Families Citing this family (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN115035962B (en) * 2022-01-26 2025-01-28 昆明理工大学 Virtual sample generation and soft sensor modeling method based on variational autoencoder and generative adversarial network
CN114637882B (en) * 2022-05-17 2022-08-19 深圳市华世智能科技有限公司 Method for generating marked sample based on computer graphics technology
CN117218564B (en) * 2023-10-12 2025-03-28 中山大学 A method, system, equipment and medium for generating collaborative decision-making criteria for unmanned aerial vehicles

Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107766290A (en) * 2016-08-18 2018-03-06 中国石油化工股份有限公司 Convergent multiple regression engineering statistics new method
CN111260149A (en) * 2020-02-10 2020-06-09 北京工业大学 A Prediction Method of Dioxin Emission Concentration

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7054828B2 (en) * 2000-12-20 2006-05-30 International Business Machines Corporation Computer method for using sample data to predict future population and domain behaviors
JP5293739B2 (en) * 2008-08-05 2013-09-18 富士通株式会社 Prediction model creation method, creation system and creation program by multiple regression analysis
CN106650774A (en) * 2016-10-11 2017-05-10 国云科技股份有限公司 A method of obtaining the regression relationship between dependent variable and independent variable in data analysis

Patent Citations (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN107766290A (en) * 2016-08-18 2018-03-06 中国石油化工股份有限公司 Convergent multiple regression engineering statistics new method
CN111260149A (en) * 2020-02-10 2020-06-09 北京工业大学 A Prediction Method of Dioxin Emission Concentration

Also Published As

Publication number Publication date
CN112420135A (en) 2021-02-26

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant