CN116719519B - Generalized linear model training method, device, equipment and medium in banking field

Info

Publication number: CN116719519B
Application number: CN202310714236.0A
Authority: CN
Inventors: Name withheld on request
Original assignee: Tianyun Rongchuang Data Science & Technology Beijing Co ltd
Current assignee: Tianyun Rongchuang Data Science & Technology Beijing Co ltd
Priority date: 2023-06-15
Filing date: 2023-06-15
Publication date: 2024-01-30
Anticipated expiration: 2043-06-15
Also published as: CN116719519A

Abstract

The application relates to a generalized linear model training method, device, equipment and medium in the banking field. The method comprises: displaying an interactive graphical interface; in response to a user input adding a plurality of target function components from a function component column to a canvas and a setting input configuring the parameters of each of the plurality of target function components in a component configuration column, displaying a target generalized linear model training flow constructed from the plurality of target function components; and, in response to an input to execute the target generalized linear model training flow based on target raw data, training the target generalized linear model through a first data reading component, a target data processing component set, a machine learning training component, a model transformation evaluation component, a data writing component, a second data reading component and a coefficient change analysis component, until the first model evaluation index is greater than or equal to an index threshold, at which point the target generalized linear model is output.

Description

Generalized linear model training method, device, equipment and medium in banking field

Technical Field

The present application relates to the field of computer technologies, and in particular, to a generalized linear model training method, apparatus, electronic device, and storage medium in the field of banking.

Background

The rapid development of machine learning has brought profound changes to society. Large computing clusters, running tens of thousands of iterations, fit models ever more closely to the data. This progress also drives researchers in related fields to look more deeply at both the phenomenon (the fitting result) and the essence (the feature dimensions). Nonlinear models, however, offer limited support for drawing conclusions and analyzing problems: they do not let the analyst see through the phenomenon to the essence, which leaves analysts without a clear direction. For this reason, attention returns to a classical statistical model, the generalized linear model (GLM). Owing to its inherent interpretability, the GLM has long held a place in industrial applications. It is a reliable tool for relating model results to feature dimensions and provides a sound theoretical basis for the particular requirements of parts of the financial industry.

However, training a generalized linear model for data analysis currently requires implementing a complex calculation process in a programming language, so model training efficiency is low.

Disclosure of Invention

The application provides a generalized linear model training method, device, electronic equipment and storage medium in the banking field, which can improve model training efficiency.

In a first aspect, the present application provides a generalized linear model training method in the banking field, including: displaying an interactive graphical interface, wherein the interactive graphical interface comprises a function component column, a canvas and a component configuration column; the function component column comprises various function components for constructing a generalized linear model training process, the canvas is used for constructing the generalized linear model training process, and the component configuration column is used for configuring the operation parameters of each function component in the constructed generalized linear model training process; in response to a user input adding a plurality of target function components from the function component column to the canvas and a setting input configuring the parameters of each of the plurality of target function components in the component configuration column, displaying a target generalized linear model training flow constructed from the plurality of target function components; the plurality of target function components comprise a first data reading component, a target data processing component set, a machine learning training component, a model transformation evaluation component, a data writing component, a second data reading component and a coefficient change analysis component; the target data processing component set includes a data column filtering component; in response to an input to execute the target generalized linear model training process based on target raw data, reading the target raw data through the first data reading component; processing the target raw data through the target data processing component set to obtain a target training set and a target verification set, each comprising at least one feature variable; training, through the machine learning training component and based on the target training set, an initial generalized linear model corresponding to the machine learning training component to obtain a target generalized linear model; acquiring, through the data writing component, the standardized coefficient of each feature variable of the target generalized linear model from the machine learning training component and storing it; evaluating the target generalized linear model based on the target verification set through the model transformation evaluation component to obtain a first model evaluation index corresponding to the target verification set; when the first model evaluation index is smaller than an index threshold, taking the target generalized linear model as the initial generalized linear model, returning to adjust the hyperparameters of the initial generalized linear model through the machine learning training component, continuing to train the initial generalized linear model through the machine learning training component, updating the target generalized linear model, and acquiring and storing, through the data writing component, the standardized coefficient of each feature variable of the updated target generalized linear model from the machine learning training component; reading, through the second data reading component, the standardized coefficients of each feature variable stored by the data writing component for the two successive training rounds; when a standardized coefficient of a feature variable to be deleted is not within the corresponding coefficient range, or the variation of the standardized coefficient of the feature variable to be deleted is not within the corresponding variation range, deleting the feature variable to be deleted from the target training set and from the target verification set through the data column filtering component to obtain an updated target training set and an updated target verification set; and taking the target generalized linear model as the initial generalized linear model, returning to train the initial generalized linear model through the machine learning training component, and updating the target generalized linear model, until the first model evaluation index is greater than or equal to the index threshold, at which point the target generalized linear model is output.

In a second aspect, the present application provides a generalized linear model training device in the banking field, including: a display module, used for displaying an interactive graphical interface, wherein the interactive graphical interface comprises a function component column, a canvas and a component configuration column; the function component column comprises various function components for constructing a generalized linear model training process, the canvas is used for constructing the generalized linear model training process, and the component configuration column is used for configuring the operation parameters of each function component in the constructed generalized linear model training process; the display module is also used for displaying, in response to a user input adding a plurality of target function components from the function component column to the canvas and a setting input configuring the parameters of each of the plurality of target function components in the component configuration column, a target generalized linear model training flow constructed from the plurality of target function components; the plurality of target function components comprise a first data reading component, a target data processing component set, a machine learning training component, a model transformation evaluation component, a data writing component, a second data reading component and a coefficient change analysis component; the target data processing component set includes a data column filtering component; a data reading module, used for reading the target raw data through the first data reading component in response to an input to execute the target generalized linear model training process based on the target raw data; a data processing module, used for processing the target raw data through the target data processing component set to obtain a target training set and a target verification set, each comprising at least one feature variable; a model training module, used for training, through the machine learning training component and based on the target training set, an initial generalized linear model corresponding to the machine learning training component to obtain a target generalized linear model; a data writing module, used for acquiring, through the data writing component, the standardized coefficient of each feature variable of the target generalized linear model from the machine learning training component and storing it; a transformation evaluation module, used for evaluating the target generalized linear model based on the target verification set through the model transformation evaluation component to obtain a first model evaluation index corresponding to the target verification set; a parameter adjusting module, used for taking the target generalized linear model as the initial generalized linear model and returning to adjust the hyperparameters of the initial generalized linear model through the machine learning training component when the first model evaluation index is smaller than the index threshold; the model training module is also used for continuing to train the initial generalized linear model through the machine learning training component and updating the target generalized linear model; the data writing module is also used for acquiring and storing, through the data writing component, the standardized coefficient of each feature variable of the updated target generalized linear model from the machine learning training component; the data reading module is also used for reading, through the second data reading component, the standardized coefficients of each feature variable stored by the data writing component for the two successive training rounds; a data deleting module, used for deleting the feature variable to be deleted from the target training set and from the target verification set through the data column filtering component when a standardized coefficient of the feature variable to be deleted is not within the corresponding coefficient range, or the variation of the standardized coefficient of the feature variable to be deleted is not within the corresponding variation range, so as to obtain an updated target training set and an updated target verification set; the model training module is also used for taking the target generalized linear model as the initial generalized linear model, returning to train the initial generalized linear model through the machine learning training component, and updating the target generalized linear model; and a model output module, used for outputting the target generalized linear model when the first model evaluation index is greater than or equal to the index threshold.

In a third aspect, the present application provides an electronic device, including: a processor for executing a computer program stored in a memory, which when executed by the processor implements the steps of any of the banking domain generalized linear model training methods provided in the first aspect.

In a fourth aspect, the present application provides a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of any of the banking domain generalized linear model training methods provided in the first aspect.

A fifth aspect of embodiments of the present application provides a computer program product, wherein the computer program product comprises a computer program or instructions which, when run on a processor, cause the processor to execute the computer program or instructions for implementing the steps of the method for generalized linear model training in banking as described in the first aspect.

In a sixth aspect of the embodiments of the present application, there is provided a chip, the chip including a processor, a memory and a communication interface, the communication interface being coupled to the processor, the memory being configured to store a program or instructions executable on the processor, the processor being configured to execute the program or instructions to implement the steps of the generalized linear model training method of banking according to the first aspect.

Compared with the prior art, the technical solution provided by the embodiments of the application has the following advantages: a user-friendly and highly practical interactive graphical interface integrates the function components commonly used in generalized linear modeling, such as data reading, data processing, generalized linear algorithms and model analysis. When a generalized linear model for data analysis needs to be trained, no programming language is needed to implement the complex calculation process; a result is obtained simply by selecting the corresponding function components and setting their operation parameters. This lowers the programming skill required of data modelers and improves the efficiency of machine learning modeling and analysis.

Drawings

The accompanying drawings, which are incorporated in and constitute a part of this specification, illustrate embodiments consistent with the application and together with the description, serve to explain the principles of the application.

In order to more clearly illustrate the embodiments of the present application or the technical solutions in the prior art, the drawings that are required to be used in the description of the embodiments or the prior art will be briefly described below, and it will be obvious to those skilled in the art that other drawings can be obtained from these drawings without inventive effort.

FIG. 1 is a schematic flow chart of a generalized linear model training method in the banking field;

FIG. 2 is a schematic diagram of the displayed interactive graphical interface provided herein;

FIG. 3 is a schematic flow chart of another method for training a generalized linear model in the banking domain provided by the present application;

FIG. 4 is a schematic diagram of time segment division of a training set and a verification set provided in the present application;

FIG. 5 is a schematic diagram of a univariate fitted curve provided herein;

FIG. 6 is a schematic representation of another univariate fit curve provided herein;

FIG. 7 is a schematic representation of yet another univariate fitted curve provided herein;

FIG. 8 is a schematic representation of yet another univariate fit curve provided herein;

FIG. 9 is a schematic representation of yet another univariate fit curve provided herein;

FIG. 10 is a schematic representation of yet another univariate fit curve provided herein;

FIG. 11 is a schematic illustration of yet another univariate fit curve provided herein;

FIG. 12 is a schematic representation of yet another univariate fit curve provided herein;

FIG. 13 is a schematic representation of yet another univariate fit curve provided herein;

FIG. 14 is a schematic representation of yet another univariate fit curve provided herein;

FIG. 15 is a schematic representation of yet another univariate fit curve provided herein;

FIG. 16 is a schematic representation of yet another univariate fit curve provided herein;

FIG. 17 is a schematic structural diagram of a generalized linear model training device in the banking field provided by the present application;

FIG. 18 is a schematic diagram of the hardware structure of an electronic device provided in the present application.

Detailed Description

In order that the above objects, features and advantages of the present application may be more clearly understood, a further description of the aspects of the present application will be provided below. It should be noted that, in the case of no conflict, the embodiments of the present application and the features in the embodiments may be combined with each other.

In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present application, but the present application may be practiced otherwise than as described herein; it will be apparent that the embodiments in the specification are only some, but not all, embodiments of the application.

The terms first, second and the like in the description and in the claims, are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged, as appropriate, such that embodiments of the present application may be implemented in sequences other than those illustrated or described herein, and that the objects identified by "first," "second," etc. are generally of a type and not limited to the number of objects, e.g., the first object may be one or more.

At present, training a generalized linear model for data analysis requires implementing a complex calculation process in a programming language, so model training efficiency is low.

In the embodiments of the application, a user-friendly and highly practical interactive graphical interface integrates the function components commonly used in generalized linear modeling, such as data reading, data processing, generalized linear algorithms and model analysis. When a generalized linear model for data analysis needs to be trained, no programming language is needed to implement the complex calculation process; a result is obtained simply by selecting the corresponding function components and setting their operation parameters. This lowers the programming skill required of data modelers and improves the efficiency of machine learning modeling and analysis.

The electronic device in the embodiment of the application can be a tablet computer, a notebook computer, a palm computer and the like, and can be specifically determined according to actual conditions without limitation.

The technical solutions of the present application are explained in detail below by means of several specific examples.

FIG. 1 is a schematic flow chart of a generalized linear model training method in the banking field provided in the present application. As shown in FIG. 1, the method may include the following steps 101 to 112.

101. Displaying an interactive graphical interface.

The interactive graphical interface comprises a functional component column, a canvas and a component configuration column; the function component column comprises various function components for constructing a generalized linear model training process, the canvas is used for constructing the generalized linear model training process, and the component configuration column is used for configuring the operation parameters of each function component in the constructed generalized linear model training process.

The function component column provides the data reading, conversion and calculation components involved in the generalized linear model training process, and a user can select function components according to actual requirements to construct the training process. The canvas is where the function components to be used and the processing flow are laid out. The canvas can provide page scaling and can save the current flow as a picture, so that the user can view the generalized linear model training flow clearly and intuitively. The component configuration column is used for configuring the operation parameters of each function component that the user selects to construct the generalized linear model training process.

Illustratively, the above function components include, but are not limited to: a data reading component, a data processing component (including a data cleaning component, a data integration component, a feature transformation component, etc.), a machine learning component, a model analysis component, etc.

102. In response to a user input adding a plurality of target function components in a function component column to the canvas and a setting input of a configuration parameter for each of the plurality of target function components in a component configuration column, a target generalized linear model training process constructed from the plurality of target function components is displayed.

The plurality of target function components comprise a first data reading component, a target data processing component set, a machine learning training component, a model transformation evaluation component, a data writing component, a second data reading component and a coefficient change analysis component; the target data processing component set includes a data column filtering component.

Specifically, in response to a user's selection input (for example a drag input or a double-click input) on each target function component in the function component column, each target function component is displayed in the canvas; and in response to the user's parameter configuration input and connection input for each target function component, the configuration parameters of each target function component and the connection relations between different target function components are determined, thereby obtaining the target generalized linear model training flow constructed from the plurality of target function components.

The user can select, as required, the function components for constructing the generalized linear model training process, and configure and connect each of them. Illustratively, the user can drag function components from the function component column onto the canvas, set up a generalized linear model training flow by connecting the inputs and outputs of different function components, and configure each function component by entering data or choosing parameter options in the component configuration column. The configuration items of a function component customize its operation parameters, and the data flow direction between function components determines the data processing flow.
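The idea that the connections between components determine the processing order can be sketched as a small dependency graph. This is only an illustration: the component names, the configuration keys and the use of Python's standard `graphlib` module are assumptions made for the example and do not come from the application.

```python
from graphlib import TopologicalSorter

# Hypothetical flow: each component maps to the components whose output it consumes.
flow = {
    "read_data": set(),
    "process_data": {"read_data"},
    "train_glm": {"process_data"},
    "write_coefficients": {"train_glm"},
    "evaluate_model": {"train_glm", "process_data"},
    "read_coefficients": {"write_coefficients"},
    "analyze_coefficient_change": {"read_coefficients"},
}

# Hypothetical operation parameters, as a user might enter them in the configuration column.
config = {
    "read_data": {"path": "transactions.csv"},
    "train_glm": {"family": "binomial", "l2": 0.1},
    "evaluate_model": {"metric": "auc", "threshold": 0.8},
}


def execution_order(graph):
    """Return one valid run order in which every component follows its inputs."""
    return list(TopologicalSorter(graph).static_order())


order = execution_order(flow)
for name in order:
    print(name, config.get(name, {}))
```

Any order that respects the connections is acceptable, so the sketch only guarantees that a component never runs before the components that feed it.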

103. In response to an input to perform a target generalized linear model training procedure based on the target raw data, the target raw data is read by a first data reading component.

The first data reading component may be a data frame reading component for reading the target raw data required for training the generalized linear model. The number of data reading components included in the first data reading component may be determined by how many data sets the target raw data comprises. Illustratively, the target raw data includes transaction behavior data and customer base information data, so the first data reading component includes two data reading components that read the transaction behavior data and the customer base information data, respectively.

104. Processing the target raw data through the target data processing component set to obtain a target training set and a target verification set, each comprising at least one feature variable.

The target data processing component set is used for processing the target raw data that has been read. It comprises a plurality of data processing components, which may include but are not limited to: a data association component; data processing operation components such as an outlier processing component and a missing value processing component; feature conversion operation components such as a feature encoding component, a data discretization component and a standardization component; and feature engineering operation components such as a feature selection component and a feature derivation component. The target data processing component set performs data operations such as data cleaning, feature transformation and feature derivation on the target raw data to form the target training set and the target verification set.
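Two of the operations named above, missing value processing and standardization, can be sketched in a few lines. This is a minimal illustration under stated assumptions: mean imputation and z-score scaling are common choices, but the application does not say which methods its components use, and the function names are hypothetical.

```python
from statistics import mean, pstdev


def fill_missing(column):
    """Replace None entries with the mean of the observed values (assumed strategy)."""
    observed = [v for v in column if v is not None]
    fill = mean(observed)
    return [fill if v is None else v for v in column]


def standardize(column):
    """Scale a column to zero mean and unit (population) standard deviation."""
    mu = mean(column)
    sigma = pstdev(column)
    if sigma == 0:
        # A constant column carries no information; map it to zeros.
        return [0.0 for _ in column]
    return [(v - mu) / sigma for v in column]


raw = [1.0, None, 3.0, 2.0]
filled = fill_missing(raw)
scaled = standardize(filled)
print(filled)
print(scaled)
```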

Illustratively, a data splitting component is connected to the data reading component that reads the transaction behavior data, and its configuration parameters split the data by time into observation period data and performance period data. Other data processing components are dragged in as needed and connected to the observation period data output of the data splitting component to form customer-dimension feature variables. An aggregation component is connected to the data output of the data splitting component and summarizes the number of overdue events per customer. A binarization component is connected to the aggregation component: a customer whose overdue count is greater than or equal to 5 is defined as a defaulting customer and marked 1, and all other customers are marked 0, forming the customer-dimension label variable. A data association component then joins the customer feature variables with the customer label variable to form the training set. The target verification set is computed in the same way.
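The aggregation, labeling and association steps in this example can be sketched as follows. The records, customer identifiers and feature name are invented for illustration; only the rule "overdue count of 5 or more is marked 1" comes from the text, and the records are assumed to come from the second period of the time split.

```python
from collections import Counter

# Hypothetical records from the second period of the time split: (customer_id, is_overdue).
later_period_records = [
    ("c1", True), ("c1", True), ("c1", True), ("c1", True), ("c1", True),
    ("c2", True), ("c2", False),
    ("c3", False),
]

# Hypothetical customer-dimension features built from the first period.
features = {
    "c1": {"avg_amount": 120.0},
    "c2": {"avg_amount": 80.0},
    "c3": {"avg_amount": 45.5},
}

OVERDUE_LIMIT = 5  # threshold taken from the example in the text


def count_overdue(records):
    """Aggregation step: number of overdue events per customer."""
    counts = Counter()
    for customer_id, is_overdue in records:
        counts[customer_id] += int(is_overdue)
    return counts


def binarize(counts, limit):
    """Labeling step: 1 if the overdue count reaches the limit, otherwise 0."""
    return {cid: 1 if n >= limit else 0 for cid, n in counts.items()}


def associate(feature_table, labels):
    """Association step: join features and labels on the customer id."""
    return [
        {"customer_id": cid, **feats, "label": labels[cid]}
        for cid, feats in sorted(feature_table.items())
        if cid in labels
    ]


labels = binarize(count_overdue(later_period_records), OVERDUE_LIMIT)
training_set = associate(features, labels)
print(training_set)
```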

105. Training, through the machine learning training component and based on the target training set, the initial generalized linear model corresponding to the machine learning training component to obtain a target generalized linear model.

The machine learning training component may comprise a generalized linear model component and a fitting component and is used for building a model on the training set: after data processing, the generalized linear model component and the fitting component are used together to fit a generalized linear model to the training set. The generalized linear model component may include a linear regression model component, a logistic regression model component, and the like.
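As a rough picture of what a logistic regression fitting step does, the sketch below fits a logistic model by batch gradient descent in plain Python. It is not the fitting component described in the application: the optimizer, the L2 penalty used as an example hyperparameter, the function names and the toy data are all assumptions. The optional `init` argument shows how a previously trained model could serve as the starting point for continued training.

```python
import math


def sigmoid(z):
    """Numerically stable logistic function."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def fit_logistic(rows, labels, learning_rate=0.1, l2=0.0, epochs=500, init=None):
    """Fit weights by batch gradient descent; the last weight is the intercept."""
    n_features = len(rows[0])
    weights = list(init) if init is not None else [0.0] * (n_features + 1)
    n = len(rows)
    for _ in range(epochs):
        grad = [0.0] * (n_features + 1)
        for x, y in zip(rows, labels):
            # zip stops at the shorter sequence, so the intercept is added separately.
            p = sigmoid(sum(w * v for w, v in zip(weights, x)) + weights[-1])
            err = p - y
            for j, v in enumerate(x):
                grad[j] += err * v
            grad[-1] += err
        for j in range(n_features):
            weights[j] -= learning_rate * (grad[j] / n + l2 * weights[j])
        weights[-1] -= learning_rate * grad[-1] / n  # intercept is not penalized
    return weights


# Toy data: one standardized feature, noisy but positively related to the label.
rows = [[-1.5], [-0.5], [0.5], [1.5], [-1.0], [1.0]]
labels = [0, 0, 1, 1, 1, 0]

weights = fit_logistic(rows, labels, l2=0.01)
# Continue training from the previous result with a different hyperparameter value.
refined = fit_logistic(rows, labels, l2=0.001, epochs=50, init=weights)
print(weights, refined)
```

If the features are standardized before fitting, the fitted weights can be compared across features, which is one common way to obtain standardized coefficients; whether the application computes them this way is not stated.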

106. Acquiring, through the data writing component, the standardized coefficient of each feature variable of the target generalized linear model from the machine learning training component, and storing it.

107. Evaluating the target generalized linear model based on the target verification set through the model transformation evaluation component to obtain a first model evaluation index corresponding to the target verification set.

The model transformation evaluation component comprises a model transformation component and a model evaluation component and is used for verifying the model effect on the target verification set: the model transformation component predicts on the target verification set, and the model evaluation component evaluates the model effect on the target verification set. The model evaluation component may be a classification evaluation component, a regression evaluation component, or another model evaluation component.

108. When the first model evaluation index is smaller than the index threshold, taking the target generalized linear model as the initial generalized linear model, returning to adjust the hyperparameters of the initial generalized linear model through the machine learning training component, continuing to train the initial generalized linear model through the machine learning training component, and updating the target generalized linear model.

The index threshold may be determined according to the actual situation and is not limited herein. For example, the index threshold is an AUC threshold of 0.8. A first model evaluation index smaller than the index threshold indicates that the target generalized linear model is not usable.

Illustratively, an AUC value greater than or equal to 0.8 indicates that the target generalized linear model is usable; the target generalized linear model may then be output, or univariate analysis may also be performed.
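The AUC check in this example can be illustrated with a direct pairwise computation. The sketch below is a plain definition-based implementation, not the evaluation component of the application; the function names and sample scores are hypothetical, and 0.8 is the example threshold from the text.

```python
def auc(labels, scores):
    """Share of positive/negative pairs ranked correctly; ties count as half."""
    positives = [s for y, s in zip(labels, scores) if y == 1]
    negatives = [s for y, s in zip(labels, scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = 0.0
    for p in positives:
        for n in negatives:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


AUC_THRESHOLD = 0.8  # example threshold from the text


def model_is_usable(labels, scores, threshold=AUC_THRESHOLD):
    """True when the evaluation index reaches the threshold."""
    return auc(labels, scores) >= threshold


validation_labels = [0, 0, 1, 1]
validation_scores = [0.1, 0.4, 0.35, 0.8]
print(auc(validation_labels, validation_scores))              # 0.75
print(model_is_usable(validation_labels, validation_scores))  # False
```

The pairwise form is quadratic in the sample count, which is fine for illustration; a rank-based computation would be the usual choice for large verification sets.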

It can be understood that taking the target generalized linear model as the initial generalized linear model, continuing to train it through the machine learning training component, and updating the target generalized linear model means that the machine learning training component is also used to further optimize the target generalized linear model obtained in the previous round of training. The machine learning training component is also used to adjust the hyperparameters of the generalized linear model (both the initial generalized linear model before training and the target generalized linear model after training); see step 108 above.

Illustratively, the generalized linear model component is a logistic regression model component and the model evaluation component is an AUC component. The fitting component is connected to the logistic regression model component and to the component that outputs the training set; running the fitting component trains the logistic regression model, and connecting the data writing component to the coefficient output of the fitting component saves the standardized coefficient of each feature variable of the generalized linear model. The model transformation component is connected to the output of the logistic regression model component and to the component that outputs the target verification set; running the model transformation component predicts on the target verification set, and connecting the AUC component to a data port of the model transformation component verifies the model effect. If the model effect is not acceptable (indicating that the model is not usable), or the change in the AUC value on the target verification set between two successive rounds is greater than or equal to the variation threshold, the model hyperparameters are adjusted in the logistic regression model component until the model verification effect is acceptable, or until the change in the AUC value on the target verification set between two successive rounds is smaller than the variation threshold, and then the next optimization is performed (optimization through univariate fitted curve analysis is described below).
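One way to read the stopping rule in this example is as a small decision function over the history of validation AUC values. The sketch below is an interpretation, not the application's logic: the action names are hypothetical, the variation threshold of 0.005 is an assumed value (the text gives none), and 0.8 is the example AUC threshold.

```python
def next_action(auc_history, auc_threshold=0.8, change_threshold=0.005):
    """Decide what to do after the latest evaluation round.

    auc_history holds the validation AUC of each round, oldest first.
    """
    latest = auc_history[-1]
    if latest >= auc_threshold:
        return "output_model"
    if len(auc_history) < 2:
        # Only one round so far: no change to measure yet.
        return "adjust_hyperparameters"
    change = abs(latest - auc_history[-2])
    if change >= change_threshold:
        # Tuning is still moving the metric, so keep adjusting.
        return "adjust_hyperparameters"
    # The metric has stalled below the threshold: move on to the next optimization.
    return "univariate_analysis"


print(next_action([0.70]))         # adjust_hyperparameters
print(next_action([0.70, 0.74]))   # adjust_hyperparameters
print(next_action([0.74, 0.742]))  # univariate_analysis
print(next_action([0.79, 0.81]))   # output_model
```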

109. Acquiring, through the data writing component, the standardized coefficient of each characteristic variable corresponding to the updated target generalized linear model from the machine learning training component, and storing it.

110. Reading, through the second data reading component, the standardized coefficients from the two training runs (the earlier and the later) stored by the data writing component for each characteristic variable.

111. In the case that a standardized coefficient from one training run corresponding to a characteristic variable to be deleted is not within the corresponding coefficient range, or the change between the two standardized coefficients corresponding to the characteristic variable to be deleted is not within the corresponding variation range, deleting the characteristic variable to be deleted from the target training set and the target verification set respectively through the data column filtering component, to obtain an updated target training set and target verification set.

The coefficient range and the variation range may be determined according to the actual situation and are not limited herein. The characteristic variable to be deleted is thus a characteristic variable that is unstable during model training. Through the data writing component, the coefficient change analysis component, and the data column filtering component included in the target data processing component set, unstable variables can be deleted quickly during model training, which improves the training efficiency of the model.

112. Taking the target generalized linear model as the initial generalized linear model, returning to the training of the initial generalized linear model through the machine learning training component, and updating the target generalized linear model, until the target generalized linear model is output in the case that the first model evaluation index is greater than or equal to the index threshold.

It can be understood that the target generalized linear model is used as an initial generalized linear model, the initial generalized linear model is trained through the machine learning training component, and the target generalized linear model is updated, namely the machine learning training component is also used for optimizing and training the target generalized linear model obtained by the previous training; see in particular step 112 above.

The coefficient change analysis component is used to analyze the change in the model's standardized coefficients between two model training runs (recorded as the earlier and later standardized coefficients); the two runs may be consecutive or non-consecutive.

Through the second data reading component, the model standardized coefficients from the two training runs stored by the data writing component are read. The input of the coefficient change analysis component is connected to the output of the second data reading component, and a coefficient variation range is configured in the coefficient change analysis component; when the change in the standardized coefficient corresponding to a characteristic variable exceeds the specified coefficient variation range, the characteristic variable is determined to be a characteristic variable to be deleted. Specifically, the standardized coefficients from the two training runs corresponding to each characteristic variable are analyzed through the coefficient change analysis component; in the case that a standardized coefficient from one training run corresponding to the characteristic variable to be deleted is not within the corresponding coefficient range, or the change between the two standardized coefficients corresponding to the characteristic variable to be deleted is not within the corresponding variation range, the characteristic variable to be deleted is deleted from the target training set and the target verification set respectively through the data column filtering component, to obtain the updated target training set and target verification set.
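The coefficient change analysis and the subsequent column filtering can be sketched as follows. All names (`find_variables_to_delete`, `drop_columns`), the sample coefficients, and the use of a single shared coefficient range and variation range for every variable are assumptions made for illustration; the application allows a separate range per characteristic variable and does not prescribe an implementation.

```python
def find_variables_to_delete(previous, current, coef_range, change_range):
    """Return variables whose coefficient, or coefficient change, is out of range.

    previous / current: dicts mapping variable name -> standardized coefficient
    from the earlier and later training runs.
    coef_range / change_range: (low, high) tuples, assumed shared by all variables.
    """
    coef_low, coef_high = coef_range
    change_low, change_high = change_range
    to_delete = []
    for name, new_coef in current.items():
        old_coef = previous[name]
        # Coefficient check: both stored coefficients must lie in the range.
        coef_ok = (coef_low <= old_coef <= coef_high
                   and coef_low <= new_coef <= coef_high)
        # Change check: the difference between the two runs must lie in the range.
        change_ok = change_low <= new_coef - old_coef <= change_high
        if not (coef_ok and change_ok):
            to_delete.append(name)
    return to_delete


def drop_columns(rows, names):
    """Mimic the data column filtering component on a list of row dicts."""
    return [{k: v for k, v in row.items() if k not in names} for row in rows]


# Hypothetical standardized coefficients from two training runs.
previous = {"age": 0.30, "income": 0.50, "tenure": 0.10}
current = {"age": 0.32, "income": -0.40, "tenure": 2.50}
unstable = find_variables_to_delete(previous, current, (-2.0, 2.0), (-0.5, 0.5))

training_set = [{"age": 35, "income": 5200, "tenure": 4, "label": 0}]
filtered_training_set = drop_columns(training_set, unstable)
```

In this sketch `income` is flagged because its coefficient changes by more than the allowed variation, and `tenure` is flagged because its later coefficient falls outside the coefficient range; the same filter would be applied to the verification set.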

Illustratively, FIG. 2 shows a schematic diagram of the interactive graphical interface, in which the left region is the function component column, the middle region is the canvas, and the right region is the component configuration column. In the training flow of the target generalized linear model, the output of the first data reading component is connected to the original data input of the target data processing component set, and the training set output of the target data processing component set is connected to the training set input of the machine learning training component. The coefficient output of the machine learning training component is connected to the input of the data writing component; the second data reading component reads data from the data writing component, and the output of the second data reading component is connected to the input of the coefficient change analysis component. The result of the coefficient change analysis component is displayed through the interactive graphical interface, after which the flow returns to the data column filtering component in the target data processing component set, and the characteristic variables to be deleted are deleted from the target training set and the target verification set respectively through the data column filtering component. The model output of the machine learning training component is connected to the model input of the model transformation component, the verification set output of the target data processing component set is connected to the verification set input of the model transformation component, and the output of the model transformation component is connected to the input of the model evaluation component; according to the output of the model evaluation component, it is then determined whether to return to the target data processing component set for data optimization or to return to the machine learning training component for hyperparameter optimization.

It should be noted that, the dashed line in fig. 2 is used to illustrate the flow direction of the data or the training flow in the model training process.

In the embodiment of the application, the interactive graphical interface has built-in functional components for data reading, data processing, model training, model evaluation and model analysis, which are commonly used in the generalized linear modeling process. The user does not need to implement a complex calculation process in a programming language, and can train the required generalized linear model simply by selecting the required functional components and setting their operating parameters. The functional components provided in the embodiment of the application are of many types, and the user can select the functional components that match the user's own requirements, which improves the efficiency of machine learning modeling analysis. In the embodiment of the application, a user-friendly and highly practical interactive graphical interface presents the complex data processing and model building and training process as an intuitive toolbox that is easy to understand and easy to use; the user realizes the generalized linear modeling process with the functional components provided by the system, which lowers the programming skill required of data modeling personnel.

In the embodiment of the application, the interactive graphical interface can be constructed based on the distributed file system, the execution efficiency of the distributed file system is higher, the ultra-large-scale data set can be effectively processed, and the method has the advantages of high stability, strong expandability and the like.

In some embodiments of the present application, in the case that the variation of the first model evaluation index relative to the previous first model evaluation index is greater than the variation threshold, the target generalized linear model is taken as the initial generalized linear model, the flow returns to the machine learning training component to adjust the hyperparameters of the initial generalized linear model, and the machine learning training component continues to train the target generalized linear model so as to optimize it and improve its first model evaluation index. This process may be repeated until the first model evaluation index is greater than or equal to the index threshold (indicating that the target generalized linear model is available), or until the variation of the first model evaluation index relative to the previous first model evaluation index is less than or equal to the variation threshold (indicating that the target generalized linear model is unavailable, but that optimization by adjusting the model hyperparameters has reached its limit, so that continuing to adjust the hyperparameters would yield little further improvement).

In some embodiments of the present application, under the condition that the variation of the first model evaluation index relative to the last first model evaluation index is smaller than or equal to the variation threshold, the prediction result corresponding to the target verification set output by the model transformation evaluation component may be analyzed by the variable analysis component to obtain an analysis result of each feature variable, then the analysis result of each feature variable is drawn by the univariate fitting drawing component to obtain a univariate fitting curve corresponding to each feature variable, and then the target generalized linear model is further optimized according to the univariate fitting curve corresponding to each feature variable, so that the first model evaluation index (the first model evaluation index is greater than or equal to the index threshold) and each univariate fitting curve indicate that the target generalized linear model is available.

The variation threshold may be determined according to the actual situation and is not limited herein. For example, the variation threshold is a proportional change, e.g., 5%.

Illustratively, when the AUC value is less than 0.8 and the improvement of the AUC value over the previous AUC value is greater than 0.5%, the target generalized linear model is taken as the initial generalized linear model and the hyperparameters of the initial generalized linear model are adjusted to optimize the target generalized linear model and improve its AUC value, until the AUC value is greater than or equal to 0.8, or the AUC value is less than 0.8 and the improvement of the AUC value over the previous AUC value is less than or equal to 0.5%, at which point univariate analysis is entered.
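The branching rule in this example can be written as a small decision function. This is only a sketch: the function name `next_action` and the returned labels are hypothetical, and the improvement is assumed here to be a relative change with respect to the previous AUC value, which the text does not state explicitly.

```python
def next_action(auc, previous_auc, index_threshold=0.8, change_threshold=0.005):
    """Decide the next step of the flow from the current and previous AUC.

    Returns "tune_hyperparameters" while the model is below the index threshold
    and still improving by more than the variation threshold; otherwise returns
    "univariate_analysis".
    """
    # Assumption: improvement is measured relative to the previous AUC value.
    improvement = (auc - previous_auc) / previous_auc
    if auc < index_threshold and improvement > change_threshold:
        return "tune_hyperparameters"
    return "univariate_analysis"


# Hypothetical AUC histories.
still_improving = next_action(0.75, 0.70)    # below 0.8, large gain
stalled = next_action(0.751, 0.750)          # below 0.8, gain of about 0.13%
good_enough = next_action(0.82, 0.79)        # at or above 0.8
```

Both exit conditions lead to univariate analysis, which matches steps 301 to 303: the univariate fitting curves are examined whether or not the index threshold has been reached.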

In some embodiments of the present application, determining that the target generalized linear model is available according to the univariate fitting curve corresponding to each characteristic variable includes: in the case that the univariate fitting curve corresponding to each characteristic variable meets a preset fitting condition, determining that the target generalized linear model is available. In the case that the target generalized linear model is a classification model, the preset fitting condition comprises: the value of the dependent variable corresponding to the target independent variable in the actual occurrence rate curve is less than or equal to the value of the dependent variable corresponding to the target independent variable in the predicted occurrence rate upper limit curve, and greater than or equal to the value of the dependent variable corresponding to the target independent variable in the predicted occurrence rate lower limit curve. The univariate fitting curve corresponding to each characteristic variable comprises an actual occurrence rate curve, a predicted occurrence rate upper limit curve and a predicted occurrence rate lower limit curve. The dependent variable corresponding to the target independent variable in the predicted occurrence rate upper limit curve is the sum of the dependent variable corresponding to the target independent variable in the corresponding predicted occurrence rate curve and a first value; the dependent variable corresponding to the target independent variable in the predicted occurrence rate lower limit curve is the difference between the dependent variable corresponding to the target independent variable in the corresponding predicted occurrence rate curve and a second value (the second value may be the same as or different from the first value, and both the first value and the second value are positive numbers).

In the case that the target generalized linear model is a regression model, the preset fitting condition comprises: the absolute value of the difference between the value of the dependent variable corresponding to the target independent variable in the actual value mean curve and the value of the dependent variable corresponding to the target independent variable in the predicted value mean curve is less than or equal to a difference threshold (the difference threshold may be determined according to the actual situation and is not limited herein). The target independent variable is any independent variable in the univariate fitting curve corresponding to each characteristic variable.
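The two preset fitting conditions can be checked point by point along a univariate fitting curve, as in the following sketch. The function names and the sample curve values are hypothetical; each list is assumed to hold the dependent-variable values of one curve at the same sequence of independent-variable values (for example, one value per bin of the characteristic variable).

```python
def classification_fit_ok(actual_rate, predicted_rate, first_value, second_value):
    """Classification condition: at every point, the actual occurrence rate must
    lie between (predicted - second_value) and (predicted + first_value)."""
    for actual, predicted in zip(actual_rate, predicted_rate):
        upper = predicted + first_value    # predicted occurrence rate upper limit
        lower = predicted - second_value   # predicted occurrence rate lower limit
        if not (lower <= actual <= upper):
            return False
    return True


def regression_fit_ok(actual_mean, predicted_mean, difference_threshold):
    """Regression condition: at every point, the absolute gap between the actual
    value mean and the predicted value mean must not exceed the threshold."""
    return all(abs(a - p) <= difference_threshold
               for a, p in zip(actual_mean, predicted_mean))


# Hypothetical curves for one characteristic variable.
fits = classification_fit_ok([0.10, 0.20, 0.30], [0.12, 0.18, 0.33], 0.05, 0.05)
misfits = classification_fit_ok([0.10, 0.20, 0.45], [0.12, 0.18, 0.33], 0.05, 0.05)
reg_fits = regression_fit_ok([10.0, 20.0], [10.5, 19.2], 1.0)
```

Under this sketch, the target generalized linear model would be treated as available only when the check passes for the curve of every characteristic variable.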

It can be understood that the univariate fitting effect is judged as follows: for a classification model, the fit is feasible when the actual occurrence rate curve lies within the range bounded by the predicted occurrence rate upper limit curve and the predicted occurrence rate lower limit curve; for a regression model, the closer the actual value mean curve is to the predicted value mean curve, the better. In the embodiment of the application, the system may judge the univariate fitting effect automatically; alternatively, the system may display the univariate fitting curve (and may also display the judgment principle at the same time), the user then judges manually according to the univariate fitting curve, and the system determines the univariate fitting effect according to the manual judgment result. This may be determined according to the actual situation and is not limited herein.

In the embodiment of the application, by setting the preset fitting conditions, whether the target generalized linear model is available can be determined quickly from the univariate fitting curve of each characteristic variable, which improves model training efficiency.

In some embodiments of the present application, the plurality of target function components further comprises: the univariate analysis component and the univariate fitting drawing component; before outputting the target generalized linear model in the case that the first model evaluation index is greater than or equal to the index threshold in the above step 112, the generalized linear model training method in the banking field provided in the embodiment of the present application may further include steps 301 to 303 described below, and in the case that the first model evaluation index is greater than or equal to the index threshold in the above step 112, outputting the target generalized linear model may be specifically achieved through step 303 described below.

301. In the case that the first model evaluation index is greater than or equal to the index threshold, or the first model evaluation index is less than the index threshold and the variation of the first model evaluation index relative to the previous first model evaluation index is less than or equal to the variation threshold, analyzing, through the univariate analysis component, the prediction result corresponding to the target verification set output by the model transformation evaluation component, to obtain an analysis result of each characteristic variable.

302. Drawing the analysis result of each characteristic variable through the univariate fitting drawing component, to obtain a univariate fitting curve corresponding to each characteristic variable.

303. Outputting the target generalized linear model in the case that the first model evaluation index is greater than or equal to the index threshold and the target generalized linear model is determined to be available according to the univariate fitting curve corresponding to each characteristic variable.

Illustratively, as shown in fig. 2, the input of the univariate analysis component is connected with the output of the model transformation component, the input of the univariate fitting drawing component (namely, the python notebook component) is connected with the output of the univariate analysis component, the result of the univariate fitting drawing component is displayed through the interactive graphical interface, and then the target data processing component set is returned to perform data optimization on the characteristic variables according to the result of the univariate fitting drawing component.

In the embodiment of the application, under the condition that the first model evaluation index is greater than or equal to the index threshold, the target generalized linear model is determined to be available by combining the univariate fitting curve corresponding to each characteristic variable, so that the output target generalized linear model is feasible in the overall fitting effect of the model, and the fitting effect of each characteristic variable is feasible, the target generalized linear model can better reflect the relation between the model result and the characteristic dimension, and the prediction effect of the target generalized linear model is better.

In some embodiments of the present application, following the step 302, the following step 304 may be further included.

304. Executing step S1 in loop iteration, until the target generalized linear model is output in the case that the target generalized linear model is determined to be available according to the univariate fitting curve corresponding to each characteristic variable.

Step S1 includes the following steps S11 to S13.

S11, determining a feature variable to be optimized in at least one target feature variable under the condition that a single-variable fitting curve corresponding to each feature variable indicates that the target generalized linear model is unavailable.

Each target characteristic variable is a characteristic variable, among the characteristic variables, whose univariate fitting curve indicates that the target generalized linear model is unavailable.

And S12, optimizing the feature variable to be optimized through the target data processing assembly set, and generating a new feature variable to update the target training set and the target verification set.

The target data processing component set is further used for performing optimization processing on the feature variables in the target training set and the target verification set in the model optimization process, and generating new feature variables, as in step S12.

It can be understood that the new feature variable is obtained by performing cross combination processing on the feature variable to be optimized and other feature variables, and also can be obtained by performing segmentation processing or polynomial processing on the feature variable to be optimized. The target training set and the target verification set are updated by using the new characteristic variables, namely the characteristic variables to be optimized in the target training set and the target verification set are replaced by the new characteristic variables, and other characteristic variables are unchanged.

Wherein the new feature variable may be one feature variable or a plurality of feature variables. When the new feature variable is a plurality of feature variables, the new feature variable may or may not include the corresponding feature variable to be optimized, which is not limited herein.
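The three ways of deriving a new feature variable mentioned above (cross combination, segmentation, polynomial processing) can be illustrated with a minimal sketch. The helper names, the choice of an element-wise product as the cross combination, and the cut points are assumptions for illustration only; the application does not fix the exact operations used by the target data processing component set.

```python
import bisect


def cross_feature(values_a, values_b):
    """Cross combination, assumed here to be the element-wise product."""
    return [a * b for a, b in zip(values_a, values_b)]


def segment_feature(values, cut_points):
    """Segmentation: map each value to the index of the bin it falls into.

    cut_points must be sorted; a value equal to a cut point goes to the upper bin.
    """
    return [bisect.bisect_right(cut_points, v) for v in values]


def polynomial_feature(values, degree=2):
    """Polynomial processing: raise each value to the given power."""
    return [v ** degree for v in values]


# Hypothetical columns from the target training set.
age = [18, 35, 70]
tenure = [1, 2, 3]
age_x_tenure = cross_feature(age, tenure)
age_segment = segment_feature(age, [30, 60])
tenure_squared = polynomial_feature(tenure)
```

Whichever derivation is used, the same transformation would be applied to both the target training set and the target verification set so that the two remain comparable.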

S13, taking the target generalized linear model as an initial generalized linear model, and returning to execute the training of the initial generalized linear model based on the target training set through the machine learning training component to obtain the target generalized linear model, so as to update the target generalized linear model until a univariate fitting curve corresponding to each characteristic variable is obtained through the univariate fitting drawing component.

It can be understood that executing step S1 in a loop means that each execution of step S1 optimizes one target feature variable (that is, the feature variable to be optimized), and the outcome determines what happens next. If, after new features are added for the feature variable to be optimized and the target generalized linear model is optimized, all feature variables are well fitted, no further optimization is required; the optimization ends and the target generalized linear model is output. If only some of the feature variables are well fitted (a positive effect), the well-fitted feature variables do not need to be optimized again, and the next target feature variable is optimized (the next feature variable to be optimized is determined) on the basis of the current optimization, until all feature variables are well fitted. If the feature variable to be optimized is still not well fitted, the newly added features of the feature variable to be optimized are deleted, and the next target feature variable (that is, the next feature variable to be optimized) is then optimized. If the feature variable to be optimized is well fitted but other feature variables are adversely affected, then: if the effect is weak (judged manually), no further optimization is generally needed; if the effect is serious (judged manually), the affected feature variable may be optimized as the next feature variable to be optimized; and if that optimization is ineffective, the feature variable to be optimized, including its newly added features, may be deleted. Because the other feature variables contain some of the information of the feature variable to be optimized, deleting it generally has no significant impact.

In this embodiment of the present application, after executing the step S1 for one feature variable to be optimized, if it is determined that the target generalized linear model is available according to the single-variable fitting curve corresponding to each feature variable, model optimization is ended, and the target generalized linear model is output, if it is determined that the target generalized linear model is still not available according to the single-variable fitting curve corresponding to each feature variable, the step S1 is executed for the next feature variable to be optimized, so that the step S1 is iteratively executed for a plurality of feature variables to be optimized, until it is determined that the target generalized linear model is available according to the single-variable fitting curve corresponding to each feature variable, model optimization is ended, and the target generalized linear model is output.

In the embodiment of the present application, step S1 above is executed in loop iteration until the target generalized linear model is output in the case that it is determined to be available according to the univariate fitting curve corresponding to each characteristic variable. The model effect is observed by univariate fitting analysis, and the model is optimized by improving the fit of individual variables; univariate analysis and similar model analysis are provided in the form of functional components, which improves the efficiency of generalized linear model optimization.

In the embodiment of the application, the univariate analysis component is used to perform univariate analysis of the characteristic variables of the model, and the python notebook component is used to draw the univariate fitting curve. The univariate analysis component is connected to the model transformation component, which outputs the prediction result of the verification set, and univariate analysis and calculation are performed; the input of the python notebook component is then connected to the output of the univariate analysis component, and the univariate fitting curve is drawn through the python notebook component. The optimization direction of the characteristic variables is then determined according to the univariate fitting curve, and the feature variable to be optimized is determined. The feature variable to be optimized is then optimized through the target data processing component set, thereby optimizing the model.

In some embodiments of the present application, the feature variable to be optimized is a feature variable with the greatest importance among at least one target feature variable.

The importance of a feature variable may be determined according to the model standardized coefficient corresponding to the generalized linear model; it may specifically be determined according to the actual situation and is not limited herein.

In the embodiment of the application, the feature variables with the importance degree ranked at the front are preferentially processed, so that the model effect of the target generalized linear model can be rapidly improved.
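Selecting the feature variable to be optimized can be sketched as follows. The function name and the data are hypothetical, and using the absolute value of the standardized coefficient as the importance measure is an assumption; the application only states that importance may be determined from the standardized coefficient.

```python
def pick_variable_to_optimize(target_variables, standardized_coefficients):
    """Return the target variable with the largest absolute standardized coefficient.

    target_variables: names whose univariate fitting curves indicate a poor fit.
    standardized_coefficients: dict mapping every variable name to its coefficient.
    """
    return max(target_variables,
               key=lambda name: abs(standardized_coefficients[name]))


# Hypothetical coefficients; "tenure" fits well, so it is not a target variable.
coefficients = {"age": 0.2, "income": -0.7, "tenure": 0.9}
chosen = pick_variable_to_optimize(["age", "income"], coefficients)
```

Only the poorly fitted target feature variables are candidates, so a well-fitted variable with a larger coefficient (here `tenure`) is not selected.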

In some embodiments of the present application, the plurality of target function components further comprises: a PSI component and a transformation correction component; after the step 304, the method for training a generalized linear model in a banking domain provided in the embodiment of the present application may further include the following steps 305 to 309.

305. After the loop iteration executes the step S1, determining that the univariate fitting curve of at least one first characteristic variable indicates that the target generalized linear model is not available according to the univariate fitting curve corresponding to each characteristic variable obtained by the univariate fitting drawing component.

Wherein, each first characteristic variable is a new characteristic variable generated by optimizing the second characteristic variable by executing the step S1 at least once; each first characteristic variable corresponds to the same or different second characteristic variable, and the second characteristic variable is one of at least one target characteristic variable; in each of the feature variables, a single-variable fitted curve of the feature variables other than the at least one first feature variable indicates that the target generalized linear model is available.

It will be appreciated that each first feature variable is the new feature variable generated by optimizing the second feature variable by performing step S1 described above one or more times.

Illustratively, taking the example that one first characteristic variable is generated by optimizing a corresponding second characteristic variable by executing the step S1 at a time, in the process of executing the step S1 at a time, one second characteristic variable in at least one target characteristic variable is used as the characteristic variable to be optimized, and cross-combining processing is performed on the second characteristic variable and other characteristic variables to generate the one first characteristic variable.

Taking as an example the case where one first characteristic variable is the new characteristic variable generated by optimizing the corresponding second characteristic variable through executing step S1 twice: in one execution of step S1, one second characteristic variable among the at least one target characteristic variable is taken as the feature variable to be optimized and is cross-combined with other characteristic variables to generate an intermediate characteristic variable; in another execution of step S1, the intermediate characteristic variable is taken as the feature variable to be optimized and is cross-combined with other characteristic variables to generate the first characteristic variable.

In some embodiments of the present application, by performing the step S1 at least once to optimize a second feature variable, one or more first feature variables may be generated, that is, the new feature variable corresponding to the second feature variable includes one or more first feature variables, so when the second feature variable corresponds to one first feature variable, the one first feature variable corresponds to the corresponding second feature variable one by one; when the second feature variable corresponds to a plurality of first feature variables, the plurality of first feature variables correspond to the same second feature variable, which may be specifically determined according to the actual situation, and is not limited herein.

In some embodiments of the present application, when a plurality of first feature variables correspond to the same second feature variable, the plurality of first feature variables may be different from the one second feature variable, and one first feature variable and one second feature variable may also be the same in the plurality of first feature variables, which may be specifically determined according to practical situations.

306. Returning to the target data processing component set, and updating each first characteristic variable in the target training set and the target verification set to the corresponding second characteristic variable respectively, so as to update the target training set and the target verification set.

The target data processing component set is further configured to, after optimizing the feature variables in the target training set and the target verification set in the model optimization process, return to the feature variables before the optimization, as in step 306 above.

In some embodiments of the present application, when the plurality of first feature variables correspond to the same second feature variable, the plurality of first feature variables corresponding to the one second feature variable are a set of first feature variables, and when the step S22 is performed, the set of first feature variables are replaced by the corresponding second feature variables as a whole.

307. Determining, through the PSI component, the PSI value between the data corresponding to each second characteristic variable in the target training set and the data corresponding to that second characteristic variable in the target verification set, to obtain at least one target PSI value.

The target PSI value indicates whether the distribution of a second feature variable in the target training set is consistent with its distribution in the target verification set. If a target PSI value is greater than the distribution threshold, the distribution of the corresponding second feature variable in the target training set is determined to be inconsistent with its distribution in the target verification set; if a target PSI value is less than or equal to the distribution threshold, the two distributions are determined to be consistent. For a second feature variable whose distributions are inconsistent, the beta coefficient of the target generalized linear model for that variable needs to be adjusted so that the adjusted model fits the verification set better. For a second feature variable whose distributions are consistent, the variable needs to be deleted from the target training set and the target verification set, after which training of the target generalized linear model continues so as to obtain a better-performing model.
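Illustratively, a PSI value for one feature variable could be computed as in the following minimal sketch. The description does not specify the binning scheme or the formula used by the PSI component, so the equal-width bins taken from the training data, the `eps` floor for empty bins, and the function name `psi` are assumptions made only for illustration; the formula is the commonly used population stability index.

```python
import math
from bisect import bisect_right


def psi(train_values, valid_values, n_bins=10, eps=1e-6):
    """Population stability index of one feature between two samples.

    Assumption: equal-width bins whose edges come from the training data.
    """
    lo, hi = min(train_values), max(train_values)
    width = (hi - lo) / n_bins
    # Interior cut points; values outside [lo, hi] fall into the end bins.
    cuts = [lo + width * i for i in range(1, n_bins)]

    def shares(values):
        counts = [0] * n_bins
        for v in values:
            counts[bisect_right(cuts, v)] += 1
        # eps keeps the logarithm finite when a bin is empty.
        return [max(c / len(values), eps) for c in counts]

    p = shares(train_values)   # expected shares (training set)
    q = shares(valid_values)   # actual shares (verification set)
    return sum((qi - pi) * math.log(qi / pi) for pi, qi in zip(p, q))


train = list(range(100))
same = psi(train, train)                       # identical distributions
shifted = psi(train, [v + 50 for v in train])  # strongly shifted distribution
```

Identical samples give a PSI of zero, while the shifted sample gives a value far above a threshold such as 0.25, so it would be treated as inconsistent.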

308. In the case that none of the at least one target PSI value is less than or equal to the distribution threshold, adjust, through a transformation correcting component, the beta coefficient of the target generalized linear model for each second feature variable to obtain an updated target generalized linear model.

When it is determined from the at least one target PSI value that the distribution of every second feature variable is inconsistent between the target training set and the target verification set, the beta coefficients of the target generalized linear model are adjusted by formula derivation or by manual input.

309. Take the updated target generalized linear model as the target generalized linear model, and return to the step of evaluating the target generalized linear model based on the target verification set through the model transformation evaluation component to obtain a first model evaluation index, until the target generalized linear model is determined to be available according to the univariate fitting curve corresponding to each feature variable obtained through the univariate fitting drawing component.

The PSI component calculates the PSI values of the feature variables between the target training set and the target verification set. The PSI component is connected to the two data processing components in the target data processing component set that output the target training set and the target verification set, and calculates the PSI value of each feature variable between the two sets. The transformation correcting component is then connected to the fitting component, and the logistic regression coefficients of the variables whose PSI value is greater than a PSI threshold (for example, 0.25) are modified. Next, the transformation correcting component is connected to the model transformation component, which is connected to the component that outputs the verification set, and the Python notebook component is re-executed to check the univariate fitting curves, so that the final model is better suited to the verification set.

Illustratively, as shown in fig. 2, the verification set output and the training set output of the target data processing component set are each connected to an input of the PSI component, and the output of the machine learning training component is connected to one input of the transformation correcting component. According to the displayed output of the PSI component, the user can adjust (modify) the beta coefficient of the target generalized linear model for each second feature variable through the transformation correcting component, so as to obtain an updated target generalized linear model.

In this embodiment of the present application, after step S1 has been performed cyclically on a feature variable to be optimized (a second feature variable), if the univariate fitting curve obtained for that feature variable is the one corresponding to a first feature variable, the first feature variable may be rolled back to the second feature variable (one of the at least one target feature variable) as it was before step S1 was performed cyclically. The first feature variable consists of the second feature variable and the at least one new feature variable obtained each time step S1 was performed; deleting all the new feature variables from the first feature variable yields the second feature variable. The PSI value of the second feature variable between the target training set and the target verification set is then calculated to confirm whether the data distribution of the second feature variable is consistent across the two data sets. If the distributions are inconsistent, a target beta coefficient corresponding to the second feature variable is determined, and the beta coefficient of the target generalized linear model for the second feature variable is adjusted based on the target beta coefficient to obtain the updated target generalized linear model. The target beta coefficient may be derived through a known generalized linear model formula, or set manually, according to the difference between each group's actual mean and predicted mean in the univariate fitting curve corresponding to the second feature variable. The manual mode means manually adjusting the beta coefficient of a given feature variable.

In the embodiment of the application, an inconsistent data distribution of a second feature variable across the two data sets indicates that a model trained on the training set may not predict the verification set effectively. Adjusting the beta coefficient manually or automatically allows the model to be applied to the verification set more effectively, so that the target generalized linear model performs better on the verification set and the optimization efficiency of the generalized linear model is improved.

In some embodiments of the present application, after step 307, the method for training a generalized linear model in the banking field provided in the embodiments of the present application may further include step 310 described below.

310. In the case that at least one PSI value less than or equal to the distribution threshold exists among the at least one target PSI value, delete, through the target data processing component set, each second feature variable corresponding to the at least one PSI value from the target training set and the target verification set; take the target generalized linear model as the initial generalized linear model; and return to the step of training the initial generalized linear model based on the target training set to obtain the target generalized linear model, so as to update the target generalized linear model, until the target generalized linear model is determined to be available according to the univariate fitting curve corresponding to each feature variable obtained through the univariate fitting drawing component.

The target data processing component set is further configured to delete the feature variables in the target training set and the target verification set in the model optimization process, as in step 310 above.

In the embodiment of the application, when it is determined from the target PSI value that the distribution of a second feature variable in the target training set is consistent with its distribution in the target verification set, this indicates that the second feature variable does not contribute a good fit in the target generalized linear model. The second feature variable is therefore deleted from the target training set and the target verification set, the target generalized linear model is taken as the initial generalized linear model, and the process returns to the step of training the initial generalized linear model based on the target training set to obtain the target generalized linear model, so as to update the target generalized linear model, until the target generalized linear model is determined to be available according to the univariate fitting curve corresponding to each feature variable obtained through the univariate fitting drawing component. As a result, the target generalized linear model performs well both for the multiple application objects as a whole and for each individual application object, while fewer models need to be built, the process is simple and less time-consuming, and managing the models online is simple.

It should be noted that, after step 307, if only some of the at least one target PSI value are less than or equal to the distribution threshold (that is, the second feature variables corresponding to those PSI values have consistent data distributions in the target training set and the target verification set), this indicates that those second feature variables do not contribute a good fit in the target generalized linear model. Those second feature variables are therefore deleted from the target training set and the target verification set, the target generalized linear model is taken as the initial generalized linear model, and the initial generalized linear model is trained based on the target training set to obtain the target generalized linear model, so as to update the target generalized linear model. The beta coefficients of the second feature variables corresponding to the remaining PSI values (that is, the PSI values greater than the distribution threshold among the at least one target PSI value) are then adjusted to update the target generalized linear model, and the updated target generalized linear model is output.

Illustratively, to help the user select the desired components, components with similar functions may be grouped. For example, the various data processing components, such as data cleansing components, data integration components and feature transformation components, may form one group; the various model analysis components, such as coefficient change analysis components, univariate analysis components and PSI components, may form one group; the model training components may form one group comprising training components, such as a linear regression model component and a logistic regression model component, and fitting and transformation components, such as a fitting component and a model transformation component; and the various model evaluation components, such as classification evaluation components and regression evaluation components, may form one group.

In some embodiments of the present application, the interactive graphical interface may further provide, for the same service scenario but different application objects, a generalized linear model adapted to multiple application objects. Illustratively, the above embodiment describes training from the overall perspective of a generalized linear model adapted to multiple application objects: the target training set is a training set covering the multiple application objects, and the target verification set is a verification set covering the multiple application objects. The embodiment of the application further includes training the generalized linear model adapted to multiple application objects from the perspective of each application object. For example, the plurality of target functional components includes a data splitting component, a second model transformation evaluation component (with the same function as the model transformation evaluation component in step 107 above), a second univariate analysis component (with the same function as the univariate analysis component in step 301 above), and a second univariate fitting drawing component (with the same function as the univariate fitting drawing component in step 302 above). The data splitting component groups the target verification set according to the different application objects to obtain a plurality of sub-verification sets.
The second model transformation evaluation component evaluates the target generalized linear model based on each sub-verification set to obtain a second model evaluation index corresponding to each sub-verification set, and whether the target generalized linear model is available for the plurality of sub-verification sets is then determined according to these indexes. If the model is not available, the process returns to modify the model hyperparameters and continue training; if it is available, the next step can be performed. For the specific process, refer to step 107, which is not repeated here. The second univariate analysis component and the second univariate fitting drawing component obtain at least one second univariate fitting curve corresponding to each sub-verification set (each second univariate fitting curve being the fitting curve for the values of one feature variable of the corresponding sub-verification set), and whether the target generalized linear model is available for the plurality of sub-verification sets is then determined according to these curves. If the model is not available, the process returns to optimize the feature variables for which the second univariate fitting curve indicates that the target generalized linear model is not available for the corresponding sub-verification set, and then continues training the model. For the specific process, refer to the relevant description of step 301.

The target generalized linear model obtained by the embodiment of the application is available both for the target verification set covering the plurality of application objects and for the sub-verification set corresponding to each application object. That is, the model performs well for the multiple application objects as a whole and for each individual application object, while fewer models need to be built, the process is simple and less time-consuming, and managing the models online is simple.

As shown in fig. 3, taking the training of a generalized linear model for multiple application objects as an example, a specific implementation process may include steps 401 to 420 described below.

401. Data comprising a plurality of application objects is processed to obtain a target training set and a target validation set comprising at least one feature variable.

402. Based on the target training set, training an initial generalized linear model to obtain a target generalized linear model.

403. Evaluate the target generalized linear model based on the target verification set to obtain a first model evaluation index and at least one first univariate fitting curve corresponding to the target verification set.

404. Determining whether the target generalized linear model is available for the target verification set according to the first model evaluation index and each first univariate fitting curve.

In the case that the target generalized linear model is determined, according to the first model evaluation index and each first univariate fitting curve, not to be available for the target verification set, the following step 405 is performed; in the case that the target generalized linear model is determined, according to the first model evaluation index and each first univariate fitting curve, to be available for the target verification set, the following step 406 is performed.

405. The target generalized linear model is determined as the initial generalized linear model.

The process then returns to step 402 or step 401 to continue training the target generalized linear model, and step 406 is performed once the target generalized linear model is determined, according to the first model evaluation index and each first univariate fitting curve, to be available for the target verification set.

406. Group the target verification set according to the different application objects to obtain the plurality of sub-verification sets.
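Illustratively, the grouping in step 406 can be sketched as follows. The row layout (a list of dictionaries), the key name `application_object` and the function name are hypothetical and are used only to make the example self-contained; the description does not specify how the data splitting component stores its data.

```python
from collections import defaultdict


def split_by_application_object(rows, object_key="application_object"):
    """Group verification rows into one sub-verification set per application object."""
    groups = defaultdict(list)
    for row in rows:
        groups[row[object_key]].append(row)
    return dict(groups)


# Hypothetical verification rows: one dict per customer.
target_verification_set = [
    {"application_object": "Beijing", "utilization": 0.42, "label": 0},
    {"application_object": "Shanghai", "utilization": 0.91, "label": 1},
    {"application_object": "Beijing", "utilization": 0.10, "label": 0},
]
sub_verification_sets = split_by_application_object(target_verification_set)
```

Each resulting sub-verification set can then be evaluated separately, as in step 407.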

407. Evaluate the target generalized linear model based on each sub-verification set to obtain a second model evaluation index corresponding to each sub-verification set and at least one second univariate fitting curve corresponding to each sub-verification set.

408. Determine whether the target generalized linear model is available for the plurality of sub-verification sets according to the second model evaluation index corresponding to each sub-verification set and the at least one second univariate fitting curve corresponding to each sub-verification set.

In the case that the target generalized linear model is determined, according to the second model evaluation index and the at least one second univariate fitting curve corresponding to each sub-verification set, to be available for every one of the plurality of sub-verification sets, step 409 is performed; in the case that the target generalized linear model is determined, according to the same indexes and curves, not to be available for at least one of the plurality of sub-verification sets, step 410 is performed.

409. Output the target generalized linear model.

410. At least one target feature variable corresponding to the at least one sub-verification set is determined.

Each target feature variable is a feature variable of the at least one sub-verification set whose second univariate fitting curve indicates that the target generalized linear model is not available for the corresponding sub-verification set.

In some embodiments of the present application, it may be determined whether the second univariate fitting curve of each feature variable corresponding to the at least one sub-verification set meets a preset fitting condition, and the feature variables whose second univariate fitting curve does not meet the preset fitting condition are determined as the at least one target feature variable.

In some embodiments of the present application, the second univariate fitting curve of each feature variable corresponding to the at least one sub-verification set may be displayed, and the at least one target feature variable is then determined in response to user input.

411. Determine whether to perform variable optimization.

In the case that it is determined to perform variable optimization, the following step 412 is performed; in the case that it is determined not to perform variable optimization, the following step 414 is performed.

In some embodiments of the present application, when it is determined that the at least one target feature variable includes a target feature variable that has not yet undergone variable optimization, it is determined to perform variable optimization; when it is determined that every target feature variable in the at least one target feature variable has undergone variable optimization, it is determined not to perform variable optimization.

In some embodiments of the present application, when it is determined that the at least one target feature variable includes a target feature variable whose number of variable optimizations is less than a preset number, it is determined to perform variable optimization; when it is determined that every target feature variable in the at least one target feature variable has undergone variable optimization and its number of variable optimizations equals the preset number, it is determined not to perform variable optimization.

In some embodiments of the present application, when a user input instructing that variable optimization be performed is received, it is determined to perform variable optimization; when a user input instructing that variable optimization not be performed is received, it is determined not to perform variable optimization.
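Illustratively, the count-based decision rules for step 411 can be sketched as a single check. The dictionary of optimization counts and the function name are hypothetical; with a preset number of 1, the check reduces to the rule that any target feature variable not yet optimized triggers variable optimization.

```python
def should_optimize(optimization_counts, preset_times=1):
    """Return True if some target feature variable has been optimized
    fewer than preset_times times (hypothetical bookkeeping structure)."""
    return any(count < preset_times for count in optimization_counts.values())


# "income" has not been optimized yet, so optimization proceeds to step 412.
go_to_412 = should_optimize({"utilization": 1, "income": 0})
# Every variable has reached the preset number, so the flow moves to step 414.
go_to_414 = not should_optimize({"utilization": 2, "income": 2}, preset_times=2)
```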

412. A feature variable to be optimized of the at least one target feature variable is determined.

413. Perform cross combination processing on the feature variable to be optimized and the different application object identifiers to generate new feature variables, so as to update the target training set and the target verification set.

The feature variable to be optimized in the target training set is updated to the new feature variables to obtain the updated target training set, and the feature variable to be optimized in the target verification set is updated to the new feature variables to obtain the updated target verification set.
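Illustratively, one possible reading of the cross combination in step 413 is an interaction between the feature variable and an indicator of each application object, which lets the model learn a separate coefficient per application object. The description does not fix the exact crossing method, so this interpretation, the row layout and all names below are assumptions.

```python
def cross_with_object(rows, feature, object_key, object_ids):
    """Replace `feature` with one new feature per application object.

    Assumption: the new feature equals the original value for rows of that
    application object and 0.0 for all other rows.
    """
    new_names = [f"{feature}_x_{obj}" for obj in object_ids]
    updated = []
    for row in rows:
        # Drop the feature to be optimized; it is replaced by the crossed features.
        new_row = {k: v for k, v in row.items() if k != feature}
        for obj, name in zip(object_ids, new_names):
            new_row[name] = row[feature] if row[object_key] == obj else 0.0
        updated.append(new_row)
    return updated, new_names


rows = [
    {"region": "Beijing", "utilization": 0.4, "label": 0},
    {"region": "Shanghai", "utilization": 0.9, "label": 1},
]
updated_rows, new_features = cross_with_object(
    rows, "utilization", "region", ["Beijing", "Shanghai"]
)
```

The same call would be applied to both the target training set and the target verification set so that the two stay aligned.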

The target generalized linear model is then taken as the initial generalized linear model, and the process returns to the step of training the initial generalized linear model based on the target training set to obtain the target generalized linear model, so as to update the target generalized linear model, until the second model evaluation index corresponding to each sub-verification set and the at least one second univariate fitting curve corresponding to each sub-verification set are obtained.

After step 413, step 405 is performed and steps 402 to 408 are performed again. If it is determined, according to the second model evaluation index and the at least one second univariate fitting curve corresponding to each sub-verification set, that the target generalized linear model is available for the plurality of sub-verification sets, the target generalized linear model is output. If the target generalized linear model is still not available for at least one of the plurality of sub-verification sets, whether to continue variable optimization is determined in step 410. If it is determined to continue, steps 412, 413, 405, 402 to 408 and 410 are performed cyclically in sequence until the target generalized linear model is determined to be available for each of the plurality of sub-verification sets, at which point the target generalized linear model is output. If it is determined in step 410 not to continue variable optimization, steps 414 to 420 below are performed to output the target generalized linear model.

414. At least one target feature variable is determined as at least one first feature variable.

Each first feature variable is a new feature variable generated by optimizing a second feature variable through performing step S1 at least once, and the first feature variables correspond to the same or different second feature variables; each second feature variable is one of the at least one target feature variable; and the second univariate fitting curves of the feature variables other than the at least one first feature variable, among the feature variables corresponding to the at least one sub-verification set, all indicate that the target generalized linear model is available for the corresponding sub-verification set.

415. Update each first feature variable in the target training set and the target verification set to its corresponding second feature variable, so as to update the target training set and the target verification set.

416. Determine the target PSI value between the data corresponding to each second feature variable in the target training set and the data corresponding to that variable in the target verification set, to obtain at least one target PSI value.

Each second feature variable corresponds to one target PSI value.

417. Determine whether the at least one target PSI value includes a PSI value less than or equal to the distribution threshold.

For any target PSI value greater than the distribution threshold, the distribution of the corresponding second feature variable in the target training set is determined to be inconsistent with its distribution in the target verification set, and the beta coefficient of the target generalized linear model for that variable needs to be adjusted to improve the effect of the target generalized linear model. For any target PSI value less than or equal to the distribution threshold, the distribution of the corresponding second feature variable in the target training set is determined to be consistent with its distribution in the target verification set, and that variable needs to be deleted from the target training set and the target verification set, after which the target generalized linear model is retrained and verified.

Accordingly, in the event that it is determined that there is no PSI value less than or equal to the distribution threshold value in the at least one target PSI value, step 418 is performed, and in the event that it is determined that there is at least one PSI value less than or equal to the distribution threshold value in the at least one target PSI value, step 420 is performed.
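Illustratively, the branch taken after step 417 can be sketched as follows. The dictionary of PSI values keyed by feature name and the function name `next_step` are hypothetical; the returned step numbers follow the flow described above.

```python
def next_step(target_psi, distribution_threshold):
    """Return the next step number and the second feature variables it acts on."""
    consistent = [
        name for name, value in target_psi.items()
        if value <= distribution_threshold
    ]
    if consistent:
        # Step 420: delete the consistently distributed variables, then retrain.
        return 420, consistent
    # Step 418: determine a target beta coefficient for every second feature variable.
    return 418, list(target_psi)


step_a, variables_a = next_step({"utilization": 0.31, "income": 0.40}, 0.25)
step_b, variables_b = next_step({"utilization": 0.31, "income": 0.05}, 0.25)
```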

418. Determine a target beta coefficient corresponding to each second feature variable.

419. Based on each target beta coefficient, respectively adjusting the beta coefficient of the target generalized linear model aiming at the corresponding second characteristic variable to obtain an updated target generalized linear model.

After step 419, the process returns to step 403: the target generalized linear model is evaluated based on the target verification set to obtain the first model evaluation index and the at least one first univariate fitting curve corresponding to the target verification set. This continues until the target generalized linear model is determined to be available for each sub-verification set according to the second model evaluation index and the at least one second univariate fitting curve corresponding to that sub-verification set, whereupon optimization ends and the target generalized linear model is output.

420. Delete each second feature variable corresponding to the at least one PSI value from the target training set and the target verification set.

After step 420, the process returns to step 405: the target generalized linear model is taken as the initial generalized linear model, and steps 402 to 419 continue to be performed until the target generalized linear model is output.

To make the objects, technical solutions and advantages of the present invention more apparent, exemplary embodiments of the present invention are described in further detail below with reference to the accompanying drawings. Take a bank credit card default risk prediction model as an example: for customers holding its credit card, the bank predicts the probability that a customer will default within the next year.

Fig. 4 is a schematic diagram of the time period division of the training set and the verification set, in which the observation period is 12 months and the expression period is 12 months. The observation period covers historical data before the observation point, and the expression period covers future data after the observation point; the observation period is used to derive the feature variables, and the expression period is used to derive the label variables (the tag variables of the samples).

Step 1: acquire data and define a training set and a verification set. The data in this example are transaction behavior data and basic customer information data covering 3 years in total, from January 1, 2015 to December 31, 2017, for 38 regions of China: 36 provinces plus the two municipalities of Beijing and Shanghai. The training set and the verification set are defined by time. Training set time window: the observation period is January 1, 2015 to December 31, 2015, and the expression period is January 1, 2016 to December 31, 2016. Verification set time window: the observation period is January 1, 2016 to December 31, 2016, and the expression period is January 1, 2017 to December 31, 2017.
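Illustratively, the time windows of step 1 can be written down as a small lookup table and used to select records. The record layout, the `date` key and the function names are hypothetical; the window boundaries are those stated in the example.

```python
from datetime import date

# Time windows from the example: (start, end), both inclusive.
WINDOWS = {
    "training": {
        "observation": (date(2015, 1, 1), date(2015, 12, 31)),
        "expression": (date(2016, 1, 1), date(2016, 12, 31)),
    },
    "verification": {
        "observation": (date(2016, 1, 1), date(2016, 12, 31)),
        "expression": (date(2017, 1, 1), date(2017, 12, 31)),
    },
}


def select(records, dataset, period, date_key="date"):
    """Keep the records whose date falls inside the requested window."""
    start, end = WINDOWS[dataset][period]
    return [r for r in records if start <= r[date_key] <= end]


records = [
    {"date": date(2015, 6, 1), "amount": 120.0},
    {"date": date(2016, 6, 1), "amount": 80.0},
    {"date": date(2017, 6, 1), "amount": 45.0},
]
training_features_source = select(records, "training", "observation")
training_label_source = select(records, "training", "expression")
```

Note that the year 2016 serves both as the expression period of the training set and as the observation period of the verification set.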

Step 2: perform data operations such as data cleaning, feature transformation and feature derivation on the data. In this example, the data within the training set observation period described in step 1 are processed by data cleaning, feature transformation, derivation and the like to form customer-level feature data; the data within the training set expression period described in step 1 are used to calculate a customer-level label variable, which takes the two values 0 and 1, where 0 means the customer does not default within the following year and 1 means the customer defaults within the following year; the feature data and the label variable are joined on the customer identifier to form the training set. The verification set is calculated in the same way.

Step 3: train a generalized linear model on the training set, save the model's standardized coefficients, and evaluate the model effect on the verification set. Classification model evaluation indexes include, but are not limited to, AUC and accuracy; regression model evaluation indexes include, but are not limited to, the goodness of fit R² and the mean squared error. In this example, a logistic regression model is trained on the training set from step 2, and the model effect is evaluated by the AUC value on the verification set.

Step 4: judge whether the model effect on the verification set is feasible. If it is feasible, model optimization ends; if not, proceed to step 5. In this example, the model effect is feasible when the AUC value on the verification set is greater than or equal to 0.8; when the AUC value is less than 0.8, the model is optimized by optimizing the algorithm parameters in step 5 or the feature variables in step 6, and the final AUC value reaches 0.82.
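Illustratively, the feasibility check of step 4 can be sketched with a direct pairwise computation of AUC. The function names are hypothetical, and the quadratic pairwise loop is chosen for readability rather than speed; the 0.8 threshold is the one used in the example.

```python
def auc(labels, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly,
    counting ties as half a correct pair."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative sample")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def model_is_feasible(labels, scores, threshold=0.8):
    """Step 4 rule from the example: feasible when AUC >= 0.8."""
    return auc(labels, scores) >= threshold


labels = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
verification_auc = auc(labels, scores)         # 3 of 4 pairs are ordered correctly
feasible = model_is_feasible(labels, scores)   # below 0.8, so optimization continues
```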

Step 5: optimize the hyperparameters of the generalized linear model.

Step 5.1: adjust the hyperparameters of the generalized linear model, retrain the model, and save the model's standardized coefficients. In the example, a logistic regression model is trained, hyperparameters such as the regularization strength are adjusted, and the standardized coefficients are saved each time the model is trained.

Step 5.2: analyze the standardized coefficients of the current model and their change relative to the standardized coefficients of the previous model. The change in a standardized coefficient is defined as $\Delta\beta = \beta / \beta_{\text{before}}$, where $\beta$ is the standardized coefficient of a feature variable in the current model and $\beta_{\text{before}}$ is the standardized coefficient of the same feature variable in the previous model. When $\beta$ or $\Delta\beta$ falls outside its specified range, the feature variable is abnormal; it is regarded as an unstable feature variable and is deleted. Illustratively, when $\beta$ is greater than or equal to 2, or $\Delta\beta$ is greater than or equal to 1.5, the feature variable is abnormal and is deleted.
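The coefficient change analysis of step 5.2 can be sketched as follows. The thresholds 2 and 1.5 are the illustrative values from the text and are applied literally; the function name, the feature names and the handling of a missing or zero previous coefficient are assumptions made for this example.

```python
BETA_LIMIT = 2.0        # illustrative bound on beta from the text
DELTA_BETA_LIMIT = 1.5  # illustrative bound on delta beta = beta / beta_before


def unstable_features(current, previous,
                      beta_limit=BETA_LIMIT, delta_limit=DELTA_BETA_LIMIT):
    """Return {feature: reason} for every feature flagged as unstable.

    current / previous map feature names to standardized coefficients
    of the current and the previous model.
    """
    flagged = {}
    for name, beta in current.items():
        if beta >= beta_limit:
            flagged[name] = "beta out of range"
            continue
        beta_before = previous.get(name)
        if beta_before is None or beta_before == 0:
            # Assumption: the ratio is undefined, so only the beta check applies.
            continue
        if beta / beta_before >= delta_limit:
            flagged[name] = "delta beta out of range"
    return flagged


# Hypothetical standardized coefficients from two consecutive trainings.
previous = {"max_consecutive_overdue": 0.8, "housing_type": 0.5, "interest_3m": 0.2}
current = {"max_consecutive_overdue": 0.9, "housing_type": 2.3, "interest_3m": 0.4}

flagged = unstable_features(current, previous)
kept = [name for name in current if name not in flagged]
```

The text states the rule on signed values only; whether negative coefficients should be compared by absolute value is not specified, so this sketch does not take absolute values.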

Step 5.3: judge whether the model effect on the verification set is feasible. If it is feasible, model optimization ends. If it is not feasible, the model hyperparameters are optimized iteratively; once the change in the model evaluation index falls below a specified range, the parameters are fixed and the procedure moves to step 6 for further model optimization. Illustratively, when the AUC improvement falls below 0.5%, the regularization strength in use at that point is fixed.
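The stopping rule of step 5.3 can be sketched as a small loop. Several points are assumptions here: the 0.5% improvement is read as an absolute AUC gain of 0.005, `evaluate` is a hypothetical callable standing in for "retrain with this regularization strength and return the verification-set AUC", and the AUC values in the usage example are invented.

```python
def tune(candidates, evaluate, auc_threshold=0.8, min_gain=0.005):
    """Try regularization strengths in order.

    Returns (strength, auc, outcome), where outcome is "feasible" when the
    AUC threshold is met, or "go to step 6" when the gain over the previous
    training drops below min_gain or the candidates run out.
    """
    if not candidates:
        raise ValueError("at least one candidate strength is required")
    previous_auc = None
    strength = current_auc = None
    for strength in candidates:
        current_auc = evaluate(strength)
        if current_auc >= auc_threshold:
            return strength, current_auc, "feasible"
        if previous_auc is not None and current_auc - previous_auc < min_gain:
            # Improvement too small: fix the then-current strength.
            return strength, current_auc, "go to step 6"
        previous_auc = current_auc
    return strength, current_auc, "go to step 6"


# Invented AUC values per regularization strength, for illustration only.
fake_auc = {1.0: 0.74, 0.5: 0.77, 0.1: 0.772}
result = tune([1.0, 0.5, 0.1], fake_auc.get)
```

In this run the gain from 0.77 to 0.772 is below the minimum gain, so the strength 0.1 is fixed and optimization continues with the feature variables.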

Step 6: on the basis of the result of step 5, perform univariate analysis and optimize the feature variables.

Step 6.1: draw the univariate fitting curves, select the top N features with a poor fitting effect in order of model feature importance, and form new features by methods such as feature crossing, polynomial processing or segmentation processing. In general, the features ranked highest in importance are processed first; because optimizing the highest-ranked features can affect the fit of the other features, the variable fits need to be adjusted iteratively. The standardized coefficients can serve as the feature importance of a generalized linear model.

The implementation method of the univariate fitting curve comprises the following steps:

Fig. 5 is a schematic diagram of a univariate fitting curve for a classification model. For a classification model, the univariate fitting curve is produced as follows:

predict on the verification set with the target generalized linear model to obtain, for each sample, the feature variables, the actual label (for example, a 0/1 flag indicating whether the customer defaulted) and the predicted probability;

enumerated data are left unchanged, and numerical data are grouped into bins;

for each group of each variable, compute the actual occurrence rate (the number of samples in the group in which the event actually occurred divided by the total number of samples in the group) and the mean of the predicted occurrence rate, to obtain an actual occurrence rate curve, a predicted occurrence rate curve (drawn from the mean of the predicted occurrence rate), a predicted occurrence rate upper limit curve and a predicted occurrence rate lower limit curve. The specific calculation formulas are shown in Table 1 below.

TABLE 1
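The grouping procedure for the classification model can be sketched as follows. The formulas of Table 1 are not reproduced in this text, so the upper and lower limit curves are modelled here as the predicted mean plus or minus a fixed `margin`, which is an assumption based on the fitting condition described later in the application; the function name, the bin edges and the sample data are also hypothetical.

```python
def univariate_fit(values, labels, probs, edges=None, margin=0.02):
    """Compare actual and predicted occurrence rates per group of one feature.

    values: feature values of the verification samples
    labels: actual 0/1 labels; probs: predicted probabilities
    edges:  ascending bin boundaries for numerical features;
            None keeps each enumerated value as its own group
    """
    def group_of(v):
        if edges is None:
            return v
        for i, edge in enumerate(edges):
            if v <= edge:
                return i
        return len(edges)

    groups = {}
    for v, y, p in zip(values, labels, probs):
        g = groups.setdefault(group_of(v), {"n": 0, "events": 0, "prob_sum": 0.0})
        g["n"] += 1
        g["events"] += y
        g["prob_sum"] += p

    curve = {}
    for key in sorted(groups):
        g = groups[key]
        predicted = g["prob_sum"] / g["n"]
        curve[key] = {
            "actual": g["events"] / g["n"],   # events in group / samples in group
            "predicted": predicted,           # mean predicted probability
            "upper": predicted + margin,      # assumed form of the upper limit
            "lower": predicted - margin,      # assumed form of the lower limit
        }
    return curve


# Enumerated feature: groups are the attribute values themselves.
housing_curve = univariate_fit(["A", "A", "B", "B", "B"],
                               [1, 0, 0, 0, 1],
                               [0.5, 0.3, 0.2, 0.1, 0.3])
# Numerical feature: one boundary at 5000 gives two groups (0 and 1).
interest_curve = univariate_fit([1000, 3000, 6000, 8000],
                                [0, 1, 0, 0],
                                [0.2, 0.4, 0.3, 0.5],
                                edges=[5000])
```

Plotting the four values of each group against the group key gives the actual, predicted, upper limit and lower limit curves for that feature.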

Fig. 6 is a schematic diagram of a univariate fitting curve for a regression model. For a regression model, the univariate fitting curve is produced as follows:

predict on the verification set with the model to obtain the feature variables, the actual value y and the predicted value y';

enumerated data are left unchanged, and numerical data are grouped into bins;

for each group of each variable, compute the mean of the actual values and the mean of the predicted values to obtain an actual value mean curve and a predicted value mean curve, as shown in Table 2 below.

TABLE 2

In the example, a logistic regression model (a generalized linear model of the classification type) is trained with the training set described in step 2, and the model effect is evaluated with the AUC value on the verification set.

In the example, the features ranked by standardized coefficient are, in order: the maximum number of consecutive overdue payments in the observation period, the housing type, the 3-month cumulative interest, the number of credit approval queries, the number of credit loan institutions, the minimum overdue amount in the observation period, and so on. Among these, the fitting effects of the housing type, the 3-month cumulative interest, the number of credit approval queries and the minimum overdue amount in the observation period are not feasible. These variables are adjusted one at a time in order of importance, and steps 6.2 and 6.3 are then carried out for iterative optimization.

The adjustments based on the univariate fitting curves are as follows. The housing type enters the model as a discrete variable. As the pre-optimization housing type chart in fig. 7 shows, its fitting effect is not feasible, so the housing type is crossed with the maximum number of consecutive overdue payments in the observation period to form new features. The housing type has three attribute values, A, B and C, and 2 features are generated: when the housing type of a sample is A, feature 1 takes the maximum number of consecutive overdue payments in the observation period, otherwise it is 0; when the housing type of a sample is B, feature 2 takes the maximum number of consecutive overdue payments in the observation period, otherwise it is 0. The two new features enter the model and the original housing type does not. The fitting effect after adding the two cross variables is shown in the post-optimization housing type chart in fig. 8.

Because of the two cross variables, the univariate fitting curve of the minimum overdue amount in the observation period changes from the pre-optimization state shown in fig. 9 to the post-optimization state shown in fig. 10.

The 3-month cumulative interest enters the model as a continuous variable. As the pre-optimization chart of the 3-month cumulative interest shows, the actual occurrence rate is essentially consistent with the predicted occurrence rate while the 3-month cumulative interest is below 5000 (corresponding to abscissa 2), and is lower than the predicted occurrence rate above 5000. A new segmentation feature is therefore added: when the 3-month cumulative interest of a sample is greater than 5000, feature 3 takes the actual value of the 3-month cumulative interest, otherwise it is 0. The fitting effect after adding this variable is shown in the post-optimization chart of the 3-month cumulative interest in fig. 12.

The number of credit approval queries enters the model as a continuous variable. As the pre-optimization chart in fig. 13 shows, its fitting effect is not feasible, so it is crossed with the number of credit loan institutions to form a new feature: feature 4 is the number of credit approval queries divided by the number of credit loan institutions. The fitting effect after adding this variable is shown in the post-optimization chart in fig. 14. The other variables are optimized in a similar way.
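The four derived features of the example can be sketched as follows. The dictionary keys and the function name are hypothetical, and the treatment of a sample with zero credit loan institutions (feature 4 set to 0.0) is an assumption, since the text does not address that case.

```python
def add_derived_features(sample):
    """Return a copy of the sample with features 1 to 4 added.

    Expected keys (hypothetical names): housing_type, max_consecutive_overdue,
    interest_3m, credit_queries, credit_lenders.
    """
    out = dict(sample)
    overdue = sample["max_consecutive_overdue"]

    # Features 1 and 2: housing type crossed with the maximum number of
    # consecutive overdue payments (type C is the implicit baseline).
    out["feature_1"] = overdue if sample["housing_type"] == "A" else 0
    out["feature_2"] = overdue if sample["housing_type"] == "B" else 0

    # Feature 3: segmentation of the 3-month cumulative interest at 5000.
    interest = sample["interest_3m"]
    out["feature_3"] = interest if interest > 5000 else 0

    # Feature 4: credit approval queries per credit loan institution.
    lenders = sample["credit_lenders"]
    out["feature_4"] = sample["credit_queries"] / lenders if lenders else 0.0

    # The original housing type no longer enters the model.
    del out["housing_type"]
    return out


sample = {
    "housing_type": "B",
    "max_consecutive_overdue": 3,
    "interest_3m": 6200.0,
    "credit_queries": 6,
    "credit_lenders": 4,
}
derived = add_derived_features(sample)
```

The same function would be applied to both the training set and the verification set so that the two data sets keep identical columns.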

Step 6.2: add the new features to form a new training set and a new verification set, retrain the optimized generalized linear model with the new training set, verify the model effect with the new verification set, and draw the univariate fitting curves. In the example, each time one variable is optimized and a new variable is added, a logistic regression model is trained with the new training set and the effect on the verification set is checked.

Step 6.3: if the model effect on the verification set is feasible, model optimization can end. If it is not feasible and the univariate fitting optimization is not yet complete, return to step 6 and continue optimizing the model according to the univariate fitting curves. If it is not feasible and the univariate fitting optimization is complete, proceed to step 7. If neither the model effect nor the univariate fitting curve improves after a new feature is added, the new variable is deleted before the subsequent operations are carried out; otherwise the feature is kept. In the example, after the first 4 feature variables are adjusted in turn, the AUC of every region is greater than 0.8.

Step 7: select a feature variable with a poor fitting effect, calculate its PSI value between the training set and the verification set, and confirm whether the data distribution of the feature variable is consistent across the two data sets. If it is consistent, model optimization ends; if it is inconsistent, proceed to step 8. Generally, the model effect on the verification set is essentially feasible once step 6 is complete. In the example, the AUC of every region at the end of step 6 is 0.8 or more, but the gender variable has remained poorly fitted throughout, as shown in the pre-optimization gender chart in fig. 15. The PSI of the gender variable between the training set and the verification set is therefore calculated. Its value is greater than 0.25, which indicates a large difference in the distribution of this feature between the training set and the verification set, so the beta coefficient of the variable is adjusted.
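The PSI check of step 7 can be sketched as follows. The application does not state the PSI formula; the commonly used definition, the sum over groups of (actual share - expected share) × ln(actual share / expected share), is assumed here, with the training set as the expected distribution. The function name, the `eps` floor for empty groups and the gender counts are hypothetical.

```python
import math

PSI_THRESHOLD = 0.25  # threshold used in the example


def psi(expected, actual, eps=1e-6):
    """Population stability index between two lists of (already grouped) values."""
    total = 0.0
    for category in set(expected) | set(actual):
        # Share of each group; eps avoids log(0) for groups missing on one side.
        e = max(expected.count(category) / len(expected), eps)
        a = max(actual.count(category) / len(actual), eps)
        total += (a - e) * math.log(a / e)
    return total


# Invented gender distributions: 50/50 in training, 20/80 in verification.
train_gender = ["F"] * 50 + ["M"] * 50
verify_gender = ["F"] * 20 + ["M"] * 80

value = psi(train_gender, verify_gender)
distribution_shifted = value > PSI_THRESHOLD
```

With these invented counts the PSI is about 0.416, above the 0.25 threshold, which corresponds to the branch in which the beta coefficient is adjusted in step 8.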

Step 8: according to the difference between the mean actual occurrence rate and the mean predicted occurrence rate of each group in the univariate fitting curve, the beta coefficient of the linear model can be adjusted either by formula derivation or manually; after the adjustment, the model fits the data distribution of the verification set better. Formula derivation means deducing the target value of the beta coefficient from the formula of the generalized linear model. Manual adjustment means changing the beta coefficient of a given variable, predicting on the verification set, drawing the corresponding univariate fitting curve, checking the fit of that variable, and fine-tuning iteratively until the univariate fitting curve is feasible; the beta coefficient at that point is taken as the target value. In the example, a logistic regression model is used, and the corresponding formula is as follows:

$$p = \frac{1}{1 + e^{-z}}$$

where $z = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \dots + \beta_n x_n$.

It follows that

$$\ln\frac{p}{1-p} = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + \dots + \beta_n x_n$$

where each $\beta$ is a beta coefficient.

With the coefficients of the other variables unchanged, the coefficient difference formula for $x_k$ is as follows:

$$\beta_k^{\text{target}} - \beta_k = \frac{1}{x_k}\left(\ln\frac{p_{\text{target}}}{1-p_{\text{target}}} - \ln\frac{p_{\text{predict}}}{1-p_{\text{predict}}}\right)$$

where $p_{\text{target}}$ is the target probability, $p_{\text{predict}}$ is the predicted probability, $\beta_k^{\text{target}}$ is the target coefficient, and $\beta_k$ is the coefficient obtained from model training.

In the example, following the univariate fitting curve procedure, the actual occurrence rate of any one attribute value of gender can be chosen as the target probability, and the coefficient adjustment is a coarse adjustment. Here the actual occurrence rate for females is chosen as the target probability, with a value of 0.015; the predicted occurrence rate for females is 0.01 and the model beta coefficient is -0.068, so:

$$\beta_k^{\text{target}} - (-0.068) = \frac{1}{x_k}\left(\ln\frac{0.015}{1-0.015} - \ln\frac{0.01}{1-0.01}\right)$$

The gender variable takes discrete values, so $x_k = 1$. For a continuous variable, $x_k$ can be taken as the mean of the group's data. Here:

$$\beta_k^{\text{target}} = -0.068 + \ln\frac{0.015}{0.985} - \ln\frac{0.01}{0.99} \approx 0.3425$$

The verification set is predicted again with the beta coefficient 0.3425, and the fitting curve of the gender variable for Beijing is redrawn, as shown in the post-optimization gender chart in fig. 16. After the adjustment, the model fits the verification set better.
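The arithmetic of the gender example can be checked with a short sketch. It assumes that the coefficient difference is the difference of the log-odds of the target and predicted probabilities divided by x_k, a reading that reproduces the value 0.3425 quoted in the text; the function names are hypothetical.

```python
import math


def logit(p):
    """Log-odds of a probability strictly between 0 and 1."""
    return math.log(p / (1.0 - p))


def adjust_beta(beta_k, p_target, p_predict, x_k=1.0):
    """Target coefficient for x_k with all other coefficients held fixed."""
    return beta_k + (logit(p_target) - logit(p_predict)) / x_k


# Values from the example: actual female rate 0.015, predicted rate 0.01,
# trained coefficient -0.068, and x_k = 1 for a discrete variable.
new_beta = adjust_beta(beta_k=-0.068, p_target=0.015, p_predict=0.01, x_k=1.0)
print(round(new_beta, 4))  # 0.3425
```

Because the group rates are averages of individual probabilities, shifting the log-odds by this amount only approximately moves the group's mean prediction to the target, which is consistent with the text describing the step as a coarse adjustment that is then checked against the redrawn curve.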

The application also provides a generalized linear model training device in the banking field. Fig. 17 is a schematic structural diagram of the generalized linear model training device in the banking field provided by the application. As shown in fig. 17, the device includes:

the display module 1701, configured to display an interactive graphical interface, where the interactive graphical interface comprises a function component column, a canvas and a component configuration column; the function component column comprises various function components for constructing a generalized linear model training process, the canvas is used for constructing the generalized linear model training process, and the component configuration column is used for configuring the operation parameters of each function component in the constructed generalized linear model training process;

the display module 1701 is further configured to display a target generalized linear model training process constructed by a plurality of target function components, in response to a user input adding the plurality of target function components in the function component column to the canvas and a setting input of a configuration parameter of each of the plurality of target function components in the component configuration column; the plurality of target function components comprise a first data reading component, a target data processing component set, a machine learning training component, a model transformation evaluation component, a data writing component, a second data reading component and a coefficient change analysis component; the target data processing component set includes a data column filtering component;

the data reading module 1702, configured to read target raw data through the first data reading component, in response to an input to execute the target generalized linear model training process based on the target raw data;

the data processing module 1703, configured to process the target raw data through the target data processing component set to obtain a target training set and a target verification set that include at least one feature variable;

the model training module 1704, configured to train, through the machine learning training component and based on the target training set, an initial generalized linear model corresponding to the machine learning training component, to obtain a target generalized linear model;

the data writing module 1705, configured to acquire from the machine learning training component, through the data writing component, the standardized coefficient of each feature variable corresponding to the target generalized linear model, and store the standardized coefficients;

the transformation evaluation module 1706, configured to evaluate the target generalized linear model based on the target verification set through the model transformation evaluation component, to obtain a first model evaluation index corresponding to the target verification set;

the parameter adjustment module 1707, configured to, when the first model evaluation index is smaller than the index threshold, take the target generalized linear model as the initial generalized linear model and return to adjust the hyperparameters of the initial generalized linear model through the machine learning training component;

the model training module 1704 is further configured to continue training the initial generalized linear model through the machine learning training component and update the target generalized linear model;

the data writing module 1705 is further configured to acquire from the machine learning training component, through the data writing component, the standardized coefficient of each feature variable corresponding to the updated target generalized linear model, and store the standardized coefficients;

the data reading module 1702 is further configured to read, through the second data reading component, the two successive standardized coefficients (previous and current) corresponding to each feature variable stored by the data writing component;

the data deleting module 1708, configured to, when the coefficient change analysis component analyzes the two successive standardized coefficients corresponding to each feature variable and determines that one of the standardized coefficients corresponding to a feature variable to be deleted is not within the corresponding coefficient range, or that the change between the two successive standardized coefficients corresponding to the feature variable to be deleted is not within the corresponding change range, delete the feature variable to be deleted from the target training set and the target verification set respectively through the data column filtering component, to obtain an updated target training set and an updated target verification set;

the model training module 1704 is further configured to take the target generalized linear model as the initial generalized linear model, return to training the initial generalized linear model through the machine learning training component, and update the target generalized linear model;

the model output module 1709, configured to output the target generalized linear model when the first model evaluation index is greater than or equal to the index threshold.

In some embodiments of the present application, the apparatus further comprises: the single-variable analysis module is used for analyzing a prediction result corresponding to the target verification set output by the model transformation evaluation component through the single-variable analysis component before outputting the target generalized linear model under the condition that the first model evaluation index is larger than or equal to the index threshold, or under the condition that the first model evaluation index is smaller than the index threshold and the variation of the first model evaluation index relative to the last first model evaluation index is smaller than or equal to the variation threshold, so as to obtain an analysis result of each characteristic variable; the curve drawing module is used for drawing the analysis result of each characteristic variable through the univariate fitting drawing component to obtain a univariate fitting curve corresponding to each characteristic variable; the model output module 1709 is specifically configured to output the target generalized linear model when the first model evaluation index is greater than or equal to the index threshold, and the target generalized linear model is determined to be available according to the univariate fitting curve corresponding to each characteristic variable.

In some embodiments of the present application, the plurality of target function components further comprises: the univariate analysis component and the univariate fitting drawing component; the model output module 1709 is specifically configured to analyze, by using the univariate analysis component, a prediction result corresponding to the target verification set output by the model transformation evaluation component, where the first model evaluation index is greater than or equal to the index threshold, to obtain an analysis result of each feature variable; drawing the analysis result of each characteristic variable through a univariate fitting drawing component to obtain a univariate fitting curve corresponding to each characteristic variable; and under the condition that the target generalized linear model is determined to be available according to the univariate fitting curve corresponding to each characteristic variable, outputting the target generalized linear model.

In some embodiments of the present application, the model output module 1709 is further configured to draw, by the univariate fitting drawing component, an analysis result of each feature variable, and after obtaining a univariate fitting curve corresponding to each feature variable, iterating and executing the following step S1 until, in a case where it is determined that the target generalized linear model is available according to the univariate fitting curve corresponding to each feature variable, outputting the target generalized linear model; wherein, step S1 includes: under the condition that a single-variable fitting curve corresponding to each characteristic variable indicates that the target generalized linear model is unavailable, determining a characteristic variable to be optimized in at least one target characteristic variable, wherein each target characteristic variable is at least one characteristic variable, and the single-variable fitting curve indicates that the target generalized linear model is unavailable; optimizing the feature variable to be optimized through the target data processing assembly set to generate a new feature variable so as to update the target training set and the target verification set; and taking the target generalized linear model as an initial generalized linear model, and returning to execute the training of the initial generalized linear model based on the target training set through the machine learning training component to obtain the target generalized linear model, so as to update the target generalized linear model until a univariate fitting curve corresponding to each characteristic variable is obtained through the univariate fitting drawing component.

In some embodiments of the present application, the plurality of target function components further comprises: a PSI component and a transformation correction component; the apparatus further comprises: the determining module is used for determining that the univariate fitting curve of at least one first characteristic variable indicates that the target generalized linear model is not available according to the univariate fitting curve corresponding to each characteristic variable obtained by the univariate fitting drawing component after the step S1 is executed in a cyclic iteration; each characteristic variable to be deleted is a new characteristic variable generated by optimizing the second characteristic variable by executing the step S1 at least once; each first characteristic variable corresponds to the same or different second characteristic variable, and the second characteristic variable is one of at least one target characteristic variable; in each characteristic variable, single-variable fitting curves of other characteristic variables except at least one first characteristic variable indicate that a target generalized linear model is available; the data processing module 1703 is further configured to update each of the first feature variables in the target training set and the target verification set to corresponding second feature variables respectively by returning to the target data processing assembly set, so as to update the target training set and the target verification set; the determining module is further used for determining data corresponding to each second characteristic variable in the target training set through the PSI component, and obtaining at least one target PSI value through target PSI values between the data corresponding to the second characteristic variable in the target verification set; under the condition that the PSI value less than or equal to the distribution threshold value does not exist in at least one target PSI value, 
respectively adjusting beta coefficients of the target generalized linear model aiming at each second characteristic variable through a transformation and correction component to obtain an updated target generalized linear model; the transformation evaluation module 1706 is configured to take the updated target generalized linear model as a target generalized linear model, and return to execute the target verification set-based execution through the model transformation evaluation component to evaluate the target generalized linear model, so as to obtain a first model evaluation index, until the target generalized linear model is determined to be available according to the univariate fitting curve corresponding to each characteristic variable obtained through the univariate fitting drawing component.

In some embodiments of the present application, the data processing module 1703 is further configured to, after determining, by the PSI component, data corresponding to each second feature variable in the target training set and a target PSI value between the data corresponding to the target verification set, obtain at least one target PSI value, delete, by the target data processing component, each second feature variable corresponding to at least one PSI value in the target training set and the target verification set if it is determined that there is at least one PSI value less than or equal to the distribution threshold in the at least one target PSI value, and the model training module 1704 is further configured to use the target generalized linear model as an initial generalized linear model, return to perform the training based on the target training set, and obtain the target generalized linear model, so as to update the target generalized linear model until the target generalized linear model is determined to be available according to the univariate fitting curve corresponding to each feature variable obtained by the univariate fitting drawing component.

In some embodiments of the present application, determining that the target generalized linear model is available according to the univariate fitting curve corresponding to each characteristic variable includes: determining that the target generalized linear model is available when the univariate fitting curve corresponding to each characteristic variable meets a preset fitting condition. When the target generalized linear model is a classification model, the preset fitting condition is as follows: the value of the dependent variable corresponding to the target independent variable in the actual occurrence rate curve is less than or equal to the value of the dependent variable corresponding to the target independent variable in the predicted occurrence rate upper limit curve, and greater than or equal to the value of the dependent variable corresponding to the target independent variable in the predicted occurrence rate lower limit curve. The univariate fitting curve corresponding to each characteristic variable comprises an actual occurrence rate curve, a predicted occurrence rate upper limit curve and a predicted occurrence rate lower limit curve. The dependent variable corresponding to the target independent variable in the predicted occurrence rate upper limit curve is the sum of the dependent variable corresponding to the target independent variable in the corresponding predicted occurrence rate curve and a first numerical value; the dependent variable corresponding to the target independent variable in the predicted occurrence rate lower limit curve is the difference between the dependent variable corresponding to the target independent variable in the corresponding predicted occurrence rate curve and a second numerical value (the second numerical value and the first numerical value may be the same or different, and both are positive numbers). When the target generalized linear model is a regression model, the preset fitting condition is as follows: the absolute value of the difference between the value of the dependent variable corresponding to the target independent variable in the actual value mean curve and the value of the dependent variable corresponding to the target independent variable in the predicted value mean curve is less than or equal to a difference threshold. The target independent variable is any independent variable in the univariate fitting curve corresponding to each characteristic variable.
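The two preset fitting conditions can be expressed as simple checks over the points of a univariate fitting curve. This is a sketch only: the function names, the concrete first and second numerical values and the sample curves are assumptions, since the application leaves these values open.

```python
def classification_fit_ok(actual, predicted, first_value, second_value):
    """True when every actual occurrence rate lies between the lower limit
    (predicted - second_value) and the upper limit (predicted + first_value)."""
    return all(p - second_value <= a <= p + first_value
               for a, p in zip(actual, predicted))


def regression_fit_ok(actual_mean, predicted_mean, diff_threshold):
    """True when every group's actual mean is within diff_threshold
    of the predicted mean."""
    return all(abs(a - p) <= diff_threshold
               for a, p in zip(actual_mean, predicted_mean))


# Hypothetical per-group values of one feature's fitting curve.
good_fit = classification_fit_ok([0.015, 0.030], [0.012, 0.034], 0.01, 0.01)
bad_fit = classification_fit_ok([0.015, 0.030], [0.012, 0.050], 0.01, 0.01)
reg_fit = regression_fit_ok([10.0, 12.5], [10.4, 12.0], diff_threshold=1.0)
```

A model would be treated as available only when the check passes for the curve of every characteristic variable, so in practice the function is applied once per feature and the results are combined.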

As shown in fig. 18, the embodiment of the present application further provides an electronic device 1800, where the electronic device 1800 may be the electronic device described above. The electronic device 1800 includes: the processor 1801, the memory 1802, and a computer program stored in the memory 1802 and capable of running on the processor 1801, where the computer program when executed by the processor 1801 implements each process executed by the generalized linear model training method in the banking field as described above, and the same technical effects can be achieved, and for avoiding repetition, a detailed description is omitted herein.

The embodiment of the invention also provides a computer readable storage medium, on which a computer program is stored, which when executed by a processor, implements each process executed by the generalized linear model training method in the banking field, and can achieve the same technical effect, and in order to avoid repetition, the description is omitted here.

The computer readable storage medium may be a Read-Only Memory (ROM), a random access Memory (Random Access Memory, RAM), a magnetic disk, an optical disk, or the like.

The present application further provides a computer program product which, when run on a computer, causes the computer to implement the generalized linear model training method in the banking field described above.

Finally, it should be noted that the above embodiments are only intended to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions described in the foregoing embodiments can still be modified, or some or all of their technical features can be replaced by equivalents; such modifications and substitutions do not cause the essence of the corresponding technical solutions to depart from the scope of the technical solutions of the embodiments of the present application.

The foregoing description, for purposes of explanation, has been presented in conjunction with specific embodiments. However, the illustrative discussions above are not intended to be exhaustive or to limit the embodiments to the precise forms disclosed above. Many modifications and variations are possible in light of the above teaching. The embodiments were chosen and described in order to best explain the principles and the practical application, to thereby enable others skilled in the art to best utilize the embodiments and various embodiments with various modifications as are suited to the particular use contemplated.

Claims

1. A method for training a generalized linear model in the field of banking, the method comprising:

displaying an interactive graphical interface, wherein the interactive graphical interface comprises a function component column, a canvas and a component configuration column; the function component column comprises various function components for constructing a generalized linear model training process, the canvas is used for constructing the generalized linear model training process, and the component configuration column is used for configuring the operation parameters of each function component in the constructed generalized linear model training process;

responsive to a user input adding a plurality of target function components in the function component column to the canvas and a setting input of a configuration parameter of each of the plurality of target function components in the component configuration column, displaying a target generalized linear model training process constructed by the plurality of target function components; the plurality of target function components comprise a first data reading component, a target data processing component set, a machine learning training component, a model transformation evaluation component, a data writing component, a second data reading component and a coefficient change analysis component; the set of target data processing components includes a data column filtering component;

reading, by the first data reading component, target raw data in response to input to perform the target generalized linear model training process based on the target raw data;

processing the target raw data through the target data processing component set to obtain a target training set and a target verification set, each comprising at least one characteristic variable;

training an initial generalized linear model corresponding to the machine learning training component based on the target training set through the machine learning training component to obtain a target generalized linear model;

obtaining and storing a standardized coefficient of each characteristic variable corresponding to the target generalized linear model from the machine learning training component through the data writing component;

evaluating the target generalized linear model based on the target verification set through the model transformation evaluation component to obtain a first model evaluation index corresponding to the target verification set;

in a case where the first model evaluation index is smaller than an index threshold, taking the target generalized linear model as the initial generalized linear model, adjusting the hyperparameters of the initial generalized linear model through the machine learning training component, continuing to train the initial generalized linear model through the machine learning training component to update the target generalized linear model, and obtaining and storing, through the data writing component, the standardized coefficient of each characteristic variable corresponding to the updated target generalized linear model from the machine learning training component;

reading, through the second data reading component, the two successive standardized coefficients corresponding to each characteristic variable stored by the data writing component;

analyzing, through the coefficient change analysis component, the two successive standardized coefficients corresponding to each characteristic variable and, in a case where it is determined that either of the standardized coefficients corresponding to a characteristic variable to be deleted is not within a corresponding coefficient range, or that the change between the two successive standardized coefficients corresponding to the characteristic variable to be deleted is not within a corresponding change range, deleting the characteristic variable to be deleted from the target training set and the target verification set respectively through the data column filtering component to obtain an updated target training set and an updated target verification set;

and taking the target generalized linear model as the initial generalized linear model, and returning to the step of training the initial generalized linear model through the machine learning training component to update the target generalized linear model, until the first model evaluation index is greater than or equal to the index threshold, in which case the target generalized linear model is output.
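As an illustration only, the coefficient screening recited in claim 1 might be sketched in Python as follows. The function names `variables_to_delete` and `filter_columns`, the use of one coefficient range and one change range shared by all characteristic variables, and the example values are assumptions introduced for this sketch; the claim recites a corresponding range for each characteristic variable and does not prescribe any particular implementation.

```python
from typing import Dict, List, Tuple


def variables_to_delete(
    previous: Dict[str, float],
    current: Dict[str, float],
    coef_range: Tuple[float, float],
    change_range: Tuple[float, float],
) -> List[str]:
    """Flag variables whose standardized coefficients fail either check.

    Assumption: a single (low, high) range is shared by all variables.
    """
    flagged = []
    for name, cur in current.items():
        prev = previous[name]
        # Check 1: either of the two stored coefficients lies outside the range.
        out_of_range = any(
            not (coef_range[0] <= c <= coef_range[1]) for c in (prev, cur)
        )
        # Check 2: the change between the two coefficients lies outside the range.
        change = cur - prev
        change_out = not (change_range[0] <= change <= change_range[1])
        if out_of_range or change_out:
            flagged.append(name)
    return flagged


def filter_columns(rows: List[dict], to_delete: List[str]) -> List[dict]:
    """Drop the flagged columns from every row (stand-in for column filtering)."""
    drop = set(to_delete)
    return [{k: v for k, v in row.items() if k not in drop} for row in rows]


# Hypothetical coefficients from two successive training rounds.
previous = {"age": 0.30, "income": 0.50, "tenure": 0.10}
current = {"age": 0.32, "income": 2.50, "tenure": 0.60}
flagged = variables_to_delete(previous, current, (-2.0, 2.0), (-0.2, 0.2))
training_rows = filter_columns([{"age": 1, "income": 2, "tenure": 3}], flagged)
```

In this sketch the same `filter_columns` call would be applied to both the training set and the verification set so that the two stay aligned before training resumes.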

2. The method of claim 1, wherein the plurality of target function components further comprises a univariate analysis component and a univariate fitting drawing component, and before outputting the target generalized linear model in a case where the first model evaluation index is greater than or equal to the index threshold, the method further comprises:

in a case where the first model evaluation index is greater than or equal to the index threshold, or the first model evaluation index is smaller than the index threshold and the change of the first model evaluation index relative to the previous first model evaluation index is less than or equal to a change threshold, analyzing, through the univariate analysis component, the prediction result corresponding to the target verification set output by the model transformation evaluation component to obtain an analysis result of each characteristic variable;

drawing the analysis result of each characteristic variable through the univariate fitting drawing component to obtain a univariate fitting curve corresponding to each characteristic variable;

wherein outputting the target generalized linear model in a case where the first model evaluation index is greater than or equal to the index threshold specifically comprises:

and outputting the target generalized linear model under the condition that the first model evaluation index is larger than or equal to the index threshold and the target generalized linear model is determined to be available according to the univariate fitting curve corresponding to each characteristic variable.
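The condition under which claim 2 triggers the univariate analysis can be expressed as a small predicate. The following is a minimal sketch; the function name, the treatment of the "change" as an absolute difference, and the handling of a missing previous index are assumptions not stated in the claim.

```python
from typing import Optional


def should_run_univariate_analysis(
    index: float,
    previous_index: Optional[float],
    index_threshold: float,
    change_threshold: float,
) -> bool:
    """Return True when the univariate analysis step should run."""
    # Branch 1: the evaluation index already meets the threshold.
    if index >= index_threshold:
        return True
    # Assumption: with no previous index there is no change to compare.
    if previous_index is None:
        return False
    # Branch 2: the index is below the threshold but has stopped improving
    # (assumption: the change is measured as an absolute difference).
    return abs(index - previous_index) <= change_threshold
```

Under these assumptions, an index of 0.70 against a threshold of 0.75 triggers the analysis only when it has moved by at most `change_threshold` since the previous evaluation.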

3. The method according to claim 2, wherein after the analysis result of each characteristic variable is drawn by the univariate fitting drawing component to obtain the univariate fitting curve corresponding to each characteristic variable, the method further comprises:

iteratively executing step S1 until the target generalized linear model is determined to be available according to the univariate fitting curve corresponding to each characteristic variable, and then outputting the target generalized linear model;

wherein step S1 comprises:

in a case where the univariate fitting curves corresponding to the characteristic variables indicate that the target generalized linear model is unavailable, determining a characteristic variable to be optimized among at least one target characteristic variable, wherein each target characteristic variable is a characteristic variable, among the at least one characteristic variable, whose univariate fitting curve indicates that the target generalized linear model is unavailable;

optimizing the characteristic variable to be optimized through the target data processing component set to generate a new characteristic variable, so as to update the target training set and the target verification set;

and taking the target generalized linear model as the initial generalized linear model, and returning to the step of training the initial generalized linear model based on the target training set through the machine learning training component to obtain the target generalized linear model, so as to update the target generalized linear model, and continuing until the univariate fitting curve corresponding to each characteristic variable is obtained through the univariate fitting drawing component.

4. The method according to claim 3, wherein the characteristic variable to be optimized is the characteristic variable of greatest importance among the at least one target characteristic variable.

5. The method of claim 3, wherein the plurality of target function components further comprises a PSI component and a transformation correction component, the method further comprising:

after iteratively executing step S1, determining, according to the univariate fitting curve corresponding to each characteristic variable obtained through the univariate fitting drawing component, that the univariate fitting curve of at least one first characteristic variable indicates that the target generalized linear model is unavailable; wherein each first characteristic variable is a new characteristic variable generated by optimizing a second characteristic variable through at least one execution of step S1; the first characteristic variables correspond to the same or different second characteristic variables, each second characteristic variable being one of the at least one target characteristic variable; and among the characteristic variables, the univariate fitting curves of the characteristic variables other than the at least one first characteristic variable indicate that the target generalized linear model is available;

returning to the target data processing component set, and replacing each first characteristic variable in the target training set and the target verification set with the corresponding second characteristic variable, so as to update the target training set and the target verification set;

determining, through the PSI component, a target PSI value between the data corresponding to each second characteristic variable in the target training set and the data corresponding to that second characteristic variable in the target verification set, to obtain at least one target PSI value;

in a case where no PSI value less than or equal to a distribution threshold exists among the at least one target PSI value, adjusting, through the transformation correction component, the beta coefficient of the target generalized linear model for each second characteristic variable to obtain an updated target generalized linear model;

and taking the updated target generalized linear model as the target generalized linear model, and returning to the step of evaluating the target generalized linear model based on the target verification set through the model transformation evaluation component to obtain the first model evaluation index, until the target generalized linear model is determined to be available according to the univariate fitting curve corresponding to each characteristic variable obtained through the univariate fitting drawing component.

6. The method of claim 5, wherein after determining, through the PSI component, the target PSI value between the data corresponding to each second characteristic variable in the target training set and the data corresponding to that second characteristic variable in the target verification set to obtain the at least one target PSI value, the method further comprises:

in a case where at least one PSI value less than or equal to the distribution threshold exists among the at least one target PSI value, deleting, through the target data processing component set, each second characteristic variable corresponding to the at least one PSI value from the target training set and the target verification set, taking the target generalized linear model as the initial generalized linear model, and returning to the step of training the initial generalized linear model based on the target training set to obtain the target generalized linear model, so as to update the target generalized linear model, until the target generalized linear model is determined to be available according to the univariate fitting curve corresponding to each characteristic variable obtained through the univariate fitting drawing component.
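The claims do not state how the PSI component computes a PSI value. The following is a minimal sketch, assuming that PSI denotes the commonly used population stability index, computed as the sum over bins of (a_i - e_i) * ln(a_i / e_i), where e_i and a_i are the proportions of training and verification data falling in bin i. The equal-width binning, the bin count, the `eps` floor for empty bins, and the names `psi` and `variables_at_or_below` are all assumptions of this sketch.

```python
import math
from typing import Dict, List, Sequence


def psi(
    expected: Sequence[float],
    actual: Sequence[float],
    bins: int = 10,
    eps: float = 1e-6,
) -> float:
    """Population stability index between two samples of one variable."""
    # Equal-width bin edges are derived from the expected (training) sample.
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # avoid zero width for constant data

    def proportions(values: Sequence[float]) -> List[float]:
        counts = [0] * bins
        for v in values:
            # Values outside the training range are clamped to the end bins.
            idx = min(max(int((v - lo) / width), 0), bins - 1)
            counts[idx] += 1
        # Floor each proportion at eps so the logarithm stays defined.
        return [max(c / len(values), eps) for c in counts]

    e = proportions(expected)
    a = proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ai, ei in zip(a, e))


def variables_at_or_below(psi_values: Dict[str, float], threshold: float) -> List[str]:
    """Names whose PSI is <= threshold (the condition recited in claim 6)."""
    return [name for name, value in psi_values.items() if value <= threshold]


# Hypothetical data: the verification sample is shifted relative to training.
train = [float(v) for v in range(100)]
valid = [float(v + 30) for v in range(100)]
shifted_psi = psi(train, valid)
identical_psi = psi(train, train)
```

The helper `variables_at_or_below` follows the branching exactly as worded in claims 5 and 6; if it returns an empty list, the beta coefficient adjustment of claim 5 would apply instead.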

7. The method according to any one of claims 2 to 6, wherein said determining that the target generalized linear model is available from the univariate fitted curve corresponding to each of the characteristic variables comprises:

in a case where the univariate fitting curve corresponding to each characteristic variable meets a preset fitting condition, determining that the target generalized linear model is available according to the univariate fitting curve corresponding to each characteristic variable;

wherein, in a case where the target generalized linear model is a classification model, the preset fitting condition comprises: the dependent variable value corresponding to a target independent variable in an actual occurrence rate curve is less than or equal to the dependent variable value corresponding to the target independent variable in a predicted occurrence rate upper limit curve, and is greater than or equal to the dependent variable value corresponding to the target independent variable in a predicted occurrence rate lower limit curve; the univariate fitting curve corresponding to each characteristic variable comprises the actual occurrence rate curve, a predicted occurrence rate curve, the predicted occurrence rate upper limit curve and the predicted occurrence rate lower limit curve, wherein the dependent variable value corresponding to the target independent variable in the predicted occurrence rate upper limit curve is the sum of the dependent variable value corresponding to the target independent variable in the predicted occurrence rate curve and a first numerical value; and the dependent variable value corresponding to the target independent variable in the predicted occurrence rate lower limit curve is the difference between the dependent variable value corresponding to the target independent variable in the predicted occurrence rate curve and a second numerical value;

wherein, in a case where the target generalized linear model is a regression model, the preset fitting condition comprises: the absolute value of the difference between the dependent variable value corresponding to the target independent variable in an actual value mean curve and the dependent variable value corresponding to the target independent variable in a predicted value mean curve is less than or equal to a difference threshold;

and the target independent variable is any independent variable in the univariate fitting curve corresponding to each characteristic variable.
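The preset fitting conditions of claim 7 can be checked point by point along each univariate fitting curve. The sketch below is illustrative only; it assumes each curve is given as a list of dependent variable values sampled at the same independent variable values, and the function and parameter names are hypothetical.

```python
from typing import Sequence


def classification_fit_ok(
    actual_rate: Sequence[float],
    predicted_rate: Sequence[float],
    first_value: float,
    second_value: float,
) -> bool:
    """Actual rate must lie within [predicted - second, predicted + first]."""
    return all(
        p - second_value <= a <= p + first_value
        for a, p in zip(actual_rate, predicted_rate)
    )


def regression_fit_ok(
    actual_mean: Sequence[float],
    predicted_mean: Sequence[float],
    diff_threshold: float,
) -> bool:
    """Actual and predicted means must differ by at most the threshold."""
    return all(
        abs(a - p) <= diff_threshold
        for a, p in zip(actual_mean, predicted_mean)
    )
```

Because the target independent variable is any independent variable on the curve, both checks use `all`, so a single point outside the band marks the curve as not meeting the condition.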

8. A generalized linear model training apparatus in the banking field, comprising:

the display module is used for displaying an interactive graphical interface, wherein the interactive graphical interface comprises a function component column, a canvas and a component configuration column; the function component column comprises various function components for constructing a generalized linear model training process, the canvas is used for constructing the generalized linear model training process, and the component configuration column is used for configuring the operation parameters of each function component in the constructed generalized linear model training process;

the display module is further used for displaying, in response to an input by which a user adds a plurality of target function components in the function component column to the canvas and a setting input of the configuration parameters of each target function component in the component configuration column, a target generalized linear model training process constructed by the plurality of target function components; the plurality of target function components comprise a first data reading component, a target data processing component set, a machine learning training component, a model transformation evaluation component, a data writing component, a second data reading component and a coefficient change analysis component; the target data processing component set includes a data column filtering component;

the data reading module is used for reading target raw data through the first data reading component in response to an input for executing the target generalized linear model training process based on the target raw data;

the data processing module is used for processing the target raw data through the target data processing component set to obtain a target training set and a target verification set, each comprising at least one characteristic variable;

the model training module is used for training an initial generalized linear model corresponding to the machine learning training component through the machine learning training component based on the target training set to obtain a target generalized linear model;

the data writing-out module is used for acquiring and storing the standardized coefficient of each characteristic variable corresponding to the target generalized linear model from the machine learning training component through the data writing component;

the transformation evaluation module is used for evaluating the target generalized linear model based on the target verification set through the model transformation evaluation component to obtain a first model evaluation index corresponding to the target verification set;

the parameter adjusting module is used for taking the target generalized linear model as the initial generalized linear model and returning to adjust the hyperparameters of the initial generalized linear model through the machine learning training component in a case where the first model evaluation index is smaller than an index threshold;

the model training module is further used for continuing to train the initial generalized linear model through the machine learning training component and updating the target generalized linear model;

the data writing-out module is further used for acquiring and storing the standardized coefficient of each characteristic variable corresponding to the updated target generalized linear model from the machine learning training component through the data writing component;

the data reading module is further used for reading, through the second data reading component, the two successive standardized coefficients corresponding to each characteristic variable stored by the data writing component;

the data deleting module is used for analyzing, through the coefficient change analysis component, the two successive standardized coefficients corresponding to each characteristic variable and, in a case where it is determined that either of the standardized coefficients corresponding to a characteristic variable to be deleted is not within a corresponding coefficient range, or that the change between the two successive standardized coefficients corresponding to the characteristic variable to be deleted is not within a corresponding change range, deleting the characteristic variable to be deleted from the target training set and the target verification set respectively through the data column filtering component, so as to obtain an updated target training set and an updated target verification set;

the model training module is further used for taking the target generalized linear model as the initial generalized linear model, returning to training the initial generalized linear model through the machine learning training component, and updating the target generalized linear model;

and the model output module is used for outputting the target generalized linear model under the condition that the first model evaluation index is greater than or equal to the index threshold value.

9. An electronic device, comprising: a processor for executing a computer program stored in a memory, wherein the computer program, when executed by the processor, implements the steps of the generalized linear model training method in the banking field according to any one of claims 1-7.

10. A computer-readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of the generalized linear model training method in the banking field according to any one of claims 1-7.