AU2019101189A4 - A financial mining method for credit prediction - Google Patents
- Publication number
- AU2019101189A4
- Authority
- AU
- Australia
- Prior art keywords
- data
- random forest
- forest algorithm
- classification
- default
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06Q—INFORMATION AND COMMUNICATION TECHNOLOGY [ICT] SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES; SYSTEMS OR METHODS SPECIALLY ADAPTED FOR ADMINISTRATIVE, COMMERCIAL, FINANCIAL, MANAGERIAL OR SUPERVISORY PURPOSES, NOT OTHERWISE PROVIDED FOR
- G06Q40/00—Finance; Insurance; Tax strategies; Processing of corporate or income taxes
- G06Q40/03—Credit; Loans; Processing thereof
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N20/00—Machine learning
Landscapes
- Business, Economics & Management (AREA)
- Accounting & Taxation (AREA)
- Finance (AREA)
- Engineering & Computer Science (AREA)
- Development Economics (AREA)
- Economics (AREA)
- Marketing (AREA)
- Strategic Management (AREA)
- Technology Law (AREA)
- Physics & Mathematics (AREA)
- General Business, Economics & Management (AREA)
- General Physics & Mathematics (AREA)
- Theoretical Computer Science (AREA)
- Financial Or Insurance-Related Operations Such As Payment And Settlement (AREA)
Abstract
How to evaluate and identify a borrower's potential default risk, or to estimate the borrower's probability of default before a loan is issued, is the basis of credit risk management in modern financial institutions and a significant link in that process. This paper studies the statistical analysis of the historical loan data of banks and other financial institutions using the ideas of imbalanced data classification, and establishes a loan default prediction model employing the Random Forest algorithm. The results show that the Random Forest algorithm outperforms the decision tree and logistic regression algorithms in predictive performance. In addition, by using the Random Forest algorithm to rank feature importance, the features that have the greatest impact on eventual default can be identified, which enables more effective judgment of lending risk in the financial field.
Index Terms: Random Forest, loan default prediction, data mining
Description
A financial mining method for credit prediction
This invention is in the field of Financial Big Data
With the vigorous development of the world economy and the gradual deepening of China's reform and opening up, loans have become an important way for enterprises and individuals to solve economic problems, whether for enterprise development or because of changing consumption habits. With the introduction of a wide variety of bank loan products and the expansion of growing demand, non-performing loans, that is, defaults, have also proliferated. To avoid defaults, banks and other financial institutions evaluate or score the borrower's credit risk before making a loan, predict the probability of default, and decide whether to lend according to the results. How to effectively evaluate and identify a borrower's potential default risk before granting a loan is the basis of credit risk management in financial institutions and an important link in that process; a scientific model and system for determining the risk of loan default can minimise risk and maximise profit.
This paper mainly studies how to apply the ideas of imbalanced data classification to the analysis of the historical loan data of banks and other financial institutions, and how to predict the likelihood of default with a Random Forest classification model. The first section introduces imbalanced data classification and the Random Forest algorithm; the second section covers data preprocessing and data analysis. The third section constructs a Random Forest classification model to forecast loan defaults, reports the results and AUC value of this model, and compares the Random Forest algorithm with the decision tree and logistic regression models, concluding that the Random Forest algorithm performs better. Finally, the importance of each feature is evaluated to determine which characteristics most influence the final default. The fourth section summarizes the full text.
Table 1 Default classification based on Random Forests

Inputs:
T = training set
Ntree = the number of decision trees
M = the number of variables in each sample
Mtry = the number of variables participating in the split at each tree node
Ssampsize = the sample size of each Bootstrap sample

Computation process:
For (itree = 0; itree < Ntree; itree++) {
1. Generate a Bootstrap sample of size Ssampsize from the training set T.
2. Build an unpruned tree on the Bootstrap sample; at each node, randomly choose Mtry variables and take the best of them as the split variable according to the Gini value.
}

Output:
Regression problems: the predicted result is the average of the values returned by all trees.
Classification problems: the predicted result is the class chosen by the majority of decision trees.
Table 2 Data set variables

| Variable name | Variable description | Type |
|---|---|---|
| SeriousDlqin2yrs | Whether the borrower defaulted | Y/N |
| RevolvingUtilizationOfUnsecuredLines | Total balance on credit cards and personal credit lines (excluding mortgages and instalment debt such as car loans) divided by the sum of credit lines | Percentage |
| age | Borrower age | Integer |
| NumberOfTime30-59DaysPastDueNotWorse | Number of times the borrower has been 30-59 days past due (but no worse) in the past two years | Integer |
| DebtRatio | Monthly debt repayments, alimony, living costs, etc. divided by total monthly income | Percentage |
| MonthlyIncome | Monthly income | Real |
| NumberOfOpenCreditLinesAndLoans | Number of open loans (instalments such as car loans and mortgages) and credit lines (such as credit cards) | Integer |
| NumberOfTimes90DaysLate | Number of times the borrower has been 90 days or more past due in the past two years | Integer |
| NumberRealEstateLoansOrLines | Number of mortgage and real estate loans, including mortgage-backed credit lines | Integer |
| NumberOfTime60-89DaysPastDueNotWorse | Number of times the borrower has been 60-89 days past due (but no worse) in the past two years | Integer |
| NumberOfDependents | Number of dependents in the family (spouse, children, etc.), excluding the borrower | Integer |
Table 3 Frequency distribution of variable age

| Age interval | Number of people | Percentage | Number of defaulters | Default rate within interval |
|---|---|---|---|---|
| Lower than 25 | 3028 | 2.02% | 338 | 11.16% |
| 26-35 | 18458 | 12.30% | 2053 | 11.12% |
| 36-45 | 29819 | 19.90% | 2628 | 8.80% |
| 46-55 | 36690 | 24.50% | 2786 | 7.60% |
| 56-65 | 33406 | 22.30% | 1531 | 4.60% |
| Higher than 65 | 28599 | 19.10% | 690 | 2.40% |
Table 4 Frequency distribution of variable NumberRealEstateLoansOrLines

| NumberRealEstateLoansOrLines | Number of people | Ratio | Number of defaulters | Default rate within range |
|---|---|---|---|---|
| Below 5 | 149207 | 99.47% | 9884 | 6.6% |
| 6-10 | 699 | 0.47% | 121 | 17.3% |
| 11-15 | 70 | 0.05% | 16 | 22.8% |
| 16-20 | 14 | 0.009% | 3 | 21.4% |
| Above 20 | 10 | 0.007% | 2 | 20% |
Table 5 Frequency distribution of variable NumberOfTime30-59DaysPastDueNotWorse

| NumberOfTime30-59DaysPastDueNotWorse | Number of people | Ratio | Number of defaulters | Default rate within interval |
|---|---|---|---|---|
| 0 | 126018 | 84% | 5041 | 4% |
| 1 | 16032 | 10.70% | 2409 | 15% |
| 2 | 4598 | 3.10% | 1219 | 26.50% |
| 3 | 1754 | 1.20% | 618 | 35.20% |
| 4 | 747 | 0.50% | 318 | 42.60% |
| 5 | 342 | 0.23% | 154 | 45% |
| 6 | 140 | 0.09% | 74 | 52.90% |
| 7 or more | 104 | 0.07% | 50 | 48.07% |
Table 6 Comparison of Random Forests and other algorithms

| Algorithm | AUC value |
|---|---|
| Random Forest | 0.86 |
| Decision Tree | 0.8 |
| Logistic Regression | 0.8 |
Table 7 Feature importance of each variable

| Variable | Feature importance |
|---|---|
| RevolvingUtilizationOfUnsecuredLines | 0.3411 |
| NumberOfTime30-59DaysPastDueNotWorse | 0.1694 |
| NumberOfTimes90DaysLate | 0.1594 |
| NumberOfTime60-89DaysPastDueNotWorse | 0.0727 |
| age | 0.0677 |
| DebtRatio | 0.0625 |
| MonthlyIncome | 0.0488 |
| NumberOfOpenCreditLinesAndLoans | 0.0442 |
| NumberRealEstateLoansOrLines | 0.0223 |
| NumberOfDependents | 0.0117 |
Figure 1 Analysis flow chart of credit forecast
Figure 2 Random Forests
Figure 3 Modeling flowcharts
Random Forest Algorithm
Imbalanced data classification
Imbalanced data, in which the number of samples in some classes (the majority) far exceeds the number in others (the minority), exists widely in network intrusion detection, financial transaction fraud, text classification and so on, and most of the time we are more interested in the classification of the minority. Imbalanced data classification can be handled by assigning penalty weights to the positive and negative samples. In detail, the approach is to give different weights to classes of different sample sizes during algorithm implementation, where the small class generally receives a high weight and the large class a low weight; the model is then trained and evaluated with these weights, as in the sketch below.
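The following minimal sketch illustrates this weighting idea with scikit-learn's RandomForestClassifier; the data and the class ratio are synthetic placeholders, not the paper's data set.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic placeholder data: class 0 is the majority, class 1 the minority
rng = np.random.RandomState(0)
y = np.array([0] * 950 + [1] * 50)
X = rng.rand(len(y), 4)

# Penalty weights: the minority class gets a weight inversely proportional
# to its frequency, the majority class a small weight
weights = {0: 1.0, 1: (y == 0).sum() / (y == 1).sum()}

clf = RandomForestClassifier(n_estimators=50, class_weight=weights, random_state=0)
clf.fit(X, y)
```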
Introduction of Random Forest
Random Forest, which builds a forest of trees using random techniques, is an ensemble algorithm based on randomized decision trees. The main method is to randomly select some variables or features to generate each split, to repeat this several times, and thereby to guarantee the independence of the trees. Once the Random Forest has been built, a new sample entering the forest is judged by every decision tree and is assigned to the class that receives the most votes (the process is visualized in Figure 2).
Random Forest algorithm principle and characteristics
The Random Forest algorithm covers both classification and regression problems; its steps are summarized in Table 1, and a minimal sketch of the procedure is given below.
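A minimal sketch of the Table 1 procedure, written with scikit-learn decision trees, might look as follows; the function and parameter names (build_random_forest, mtry, sampsize) are illustrative and not part of the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def build_random_forest(X, y, n_trees=100, mtry=None, sampsize=None, seed=0):
    """Sketch of Table 1: for each tree, draw a bootstrap sample of size
    `sampsize` and grow an unpruned tree that considers only `mtry`
    randomly chosen variables at each split (Gini criterion)."""
    rng = np.random.RandomState(seed)
    n, m = X.shape
    sampsize = sampsize or n
    mtry = mtry or max(1, int(np.sqrt(m)))
    forest = []
    for _ in range(n_trees):
        idx = rng.randint(0, n, size=sampsize)           # bootstrap sample
        tree = DecisionTreeClassifier(criterion="gini",  # unpruned tree
                                      max_features=mtry,
                                      random_state=rng)
        forest.append(tree.fit(X[idx], y[idx]))
    return forest

def predict_majority(forest, X):
    """Classification output: majority vote over all trees."""
    votes = np.stack([tree.predict(X) for tree in forest]).astype(int)
    return np.array([np.bincount(col).argmax() for col in votes.T])
```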
As can be seen from the procedure, the randomness of the Random Forest is mainly manifested in two aspects: randomness in the data space, implemented by Bagging (Bootstrap Aggregating), and randomness in the feature space, implemented by Random Subspace sampling. For classification problems, each decision tree in a Random Forest classifies and predicts new samples, and the decisions of these trees are then aggregated to give the final classification of the sample. Random Forests have the following features:
1. Randomness is introduced in both the rows (data records) and the columns (variables) of the data, so the Random Forest does not easily overfit.
2. Random Forest has good noise resistance.
3. When there are a large number of missing values in the data set, Random Forests can effectively estimate and process them.
4. It adapts well to different data sets: it can process both discrete and continuous data, and the data set does not need to be normalized.
5. It can rank the importance of the variables, which makes the variables easy to interpret. There are two methods for calculating variable importance in Random Forests. The first is based on the average decrease in accuracy on the OOB (Out of Bag) samples: while each decision tree is grown, the OOB samples are classified and the misclassified samples are recorded; then the values of one variable in the OOB samples are randomly permuted, the tree predicts again, and the number of misclassified samples is recorded a second time. The difference between the two error counts divided by the total number of OOB samples is the change in error rate for that tree, and averaging this change over all trees in the Random Forest gives the mean decrease in accuracy. The second method is based on the Gini decrease at splitting: the Random Forest grows its decision trees by splitting nodes according to the decrease in Gini impurity, and summing the Gini decreases over all nodes in the forest where a given variable is selected as the split variable gives that variable's Gini importance. Both measures are sketched below.
Random Forest in imbalanced data classification
By default, the weight of every class in a Random Forest is 1, which assumes that all misclassification costs are equal. In scikit-learn, the Random Forest classifier provides a class_weight parameter (a dict, a list of dicts, or a preset string) that explicitly specifies weights for the different classes. If the parameter is 'balanced', each class weight is inversely proportional to the class frequency in the input data, since the Random Forest automatically adjusts the weights according to the values of y.
The calculation formula is
n _samplesl(n _classes* np.bincount(y)) (1)
'balanced_subsample' is similar to 'balanced', except that the weights are computed from the bootstrap sample drawn for each tree rather than from the total number of samples. Therefore, the imbalanced data classification problem can be solved by this approach, as in the sketch below.
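The following short sketch evaluates formula (1) for a label vector with roughly the class ratio reported in the data set section; the numbers are illustrative.

```python
import numpy as np

# Labels with roughly a 93.3% / 6.7% split, as in the training data described later
y = np.array([0] * 93316 + [1] * 6684)

n_samples = len(y)
n_classes = len(np.unique(y))
weights = n_samples / (n_classes * np.bincount(y))   # formula (1)

print(dict(zip(np.unique(y), weights)))  # the minority (default) class gets the larger weight
```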
Data preprocessing and data analysis
Data Set
The data set used in this paper is a loan default data set of 250000 samples, comprising a 150000-sample training set and a 100000-sample test set.
The training set contains the historical data of 150,000 borrowers, among which 10026 default samples account for 6.684% of the total (a loan default rate of 6.684%) and 139974 non-default samples account for 93.316%. It can be seen that this data set is a typical example of highly imbalanced data. The data set covers the borrower's age, income, family situation and loan conditions, with a total of 11 variables, among which SeriousDlqin2yrs is the label and the other 10 variables are predictive features. Table 2 lists the variable names and data types; the class distribution can be checked as in the sketch below.
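A quick check of the class distribution might look like the following sketch; the file name cs-training.csv is an assumption, since the paper does not name the file.

```python
import pandas as pd

train = pd.read_csv("cs-training.csv")   # assumed file name for the training set

counts = train["SeriousDlqin2yrs"].value_counts()
print(counts)                  # expected: about 139974 non-default vs 10026 default samples
print(counts / counts.sum())   # default rate around 6.7%
```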
Data Analysis
The experimental environment used in this paper is Anaconda3 + Python 3. First, the data were preliminarily analyzed. This experiment mainly analyzed the distribution of the default rate over each independent variable and generated frequency distribution tables such as Table 3 (decimals are rounded).
It can be seen from Table 3 that the default rate of people younger than 25 and of people aged 26-35 is more than 10%, and that default rates fall as age increases. A frequency table of this kind can be produced as in the sketch below.
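A frequency distribution like Table 3 could be produced along the following lines; the bin edges are taken from the table and the file name is assumed.

```python
import pandas as pd

train = pd.read_csv("cs-training.csv")   # assumed file name

# Age bins matching Table 3
bins = [0, 25, 35, 45, 55, 65, 200]
labels = ["<=25", "26-35", "36-45", "46-55", "56-65", ">65"]
age_group = pd.cut(train["age"], bins=bins, labels=labels)

# Count of borrowers and default rate within each age interval
table3 = train.groupby(age_group)["SeriousDlqin2yrs"].agg(["count", "mean"])
table3["default_rate_%"] = 100 * table3["mean"]
print(table3)
```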
Table 4 shows that 99.47% of borrowers have fewer than 5 real estate and mortgage loans, but that the default rate of borrowers with more than 5 such loans rises significantly; among them, the default rate of borrowers with more than 10 loans is above 20%.
It can be seen from Table 5 that the default rate of borrowers who have never been 30-59 days past due is only about 4%, but as the number of delinquencies increases, the default rate rises significantly. For the other two variables, the frequency distribution tables of the number of times 60-89 days overdue and 90 or more days overdue show the same trend as Table 5. Therefore, it can be concluded that the more delinquencies occur, the higher the default rate.
Applying the same statistical analysis to each of the 10 predictive variables in the data set yields frequency distribution tables like those shown above. Apart from the variable NumberOfOpenCreditLinesAndLoans (the number of open loans and credit lines), which shows no obvious correlation with the default rate, all the other variables are related to whether the borrower eventually defaults.
Data Pre-processing
A preliminary exploration of the data reveals missing values in the MonthlyIncome and NumberOfDependents variables: 29731 and 3924 missing values, respectively.
Outliers: the minimum value of the age variable is 0, which is an outlier. In the three past-due variables, NumberOfTime30-59DaysPastDueNotWorse, NumberOfTime60-89DaysPastDueNotWorse and NumberOfTimes90DaysLate, there is a small number of values of 96 and 98, which may be abnormal values or special codes.
Data preprocessing: when reading the data with the pandas library in Python, the na_values parameter of pd.read_csv() is set to a user-defined list so that 0 in the age variable and 96 and 98 in the three past-due variables are treated as NaN values; the sklearn.preprocessing.Imputer class is then used to replace every NaN in the data set with the mean value of the corresponding column, as in the sketch below.
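A sketch of this preprocessing step is shown below. The original text names sklearn.preprocessing.Imputer, which has since been replaced by sklearn.impute.SimpleImputer in current scikit-learn releases; the file name is again an assumption.

```python
import pandas as pd
from sklearn.impute import SimpleImputer  # modern replacement for sklearn.preprocessing.Imputer

# Treat 0 in `age` and the codes 96/98 in the three past-due variables as missing
na_map = {
    "age": [0],
    "NumberOfTime30-59DaysPastDueNotWorse": [96, 98],
    "NumberOfTime60-89DaysPastDueNotWorse": [96, 98],
    "NumberOfTimes90DaysLate": [96, 98],
}
train = pd.read_csv("cs-training.csv", na_values=na_map)   # assumed file name

# Replace every NaN (including MonthlyIncome and NumberOfDependents)
# with the mean of the corresponding column
imputer = SimpleImputer(strategy="mean")
train[train.columns] = imputer.fit_transform(train)
```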
Building models and experiment results
Random Forest Model
In this experiment, we use the Python package scikit-learn (more specifically, sklearn.ensemble.RandomForestClassifier) to build the Random Forest model. The parameters and their settings are as follows (see the sketch after this list):
n_estimators: the number of decision trees, set to 100.
oob_score: whether to use out-of-bag samples to estimate accuracy, set to True.
min_samples_split: the minimum number of samples required to split an internal node, set to 2.
min_samples_leaf: the minimum number of samples required at a leaf node, set to 50.
n_jobs: the number of jobs to run in parallel, set to -1.
class_weight: controls the weight of each class, set to "balanced_subsample".
bootstrap: whether bootstrap samples are used when building trees, set to True.
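A sketch of the model with these settings is shown below; the file name and the simple mean-fill used here (in place of the imputation step described earlier) are assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("cs-training.csv")                     # assumed file name
train = train.fillna(train.mean(numeric_only=True))        # stand-in for the imputation step above

X = train.drop(columns=["SeriousDlqin2yrs"])
y = train["SeriousDlqin2yrs"]

rf = RandomForestClassifier(
    n_estimators=100,                    # number of decision trees
    oob_score=True,                      # estimate accuracy on out-of-bag samples
    min_samples_split=2,                 # minimum samples to split an internal node
    min_samples_leaf=50,                 # minimum samples in a leaf node
    n_jobs=-1,                           # build trees in parallel
    class_weight="balanced_subsample",   # per-tree balanced class weights
    bootstrap=True,                      # draw bootstrap samples for each tree
)
rf.fit(X, y)
print(rf.oob_score_)
```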
Model Assessment
We use AUC as the indicator to assess the models in this experiment. AUC is defined as the area under the ROC (Receiver Operating Characteristic) curve, so its value cannot exceed 1. The x-axis of the ROC curve is the FPR (False Positive Rate) and the y-axis is the TPR (True Positive Rate). Because the ROC curve normally lies above the line y = x, the value of AUC is between 0.5 and 1. We use AUC as the evaluation standard because the ROC curves themselves cannot always show clearly which classifier is better, whereas AUC, as a single numerical value, tells us more specifically which classifier is superior; a sketch of the computation is given below.
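A hedged sketch of the AUC computation with scikit-learn follows; the hold-out split, file name and simple NaN handling are assumptions rather than the paper's exact procedure.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

train = pd.read_csv("cs-training.csv").fillna(0)           # assumed file name and NaN handling
X = train.drop(columns=["SeriousDlqin2yrs"])
y = train["SeriousDlqin2yrs"]

# Hold out part of the labelled data for evaluation (split ratio assumed)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=50,
                            class_weight="balanced_subsample", n_jobs=-1, random_state=0)
rf.fit(X_tr, y_tr)

# AUC is computed from the predicted probability of the positive (default) class
scores = rf.predict_proba(X_te)[:, 1]
print("AUC:", roc_auc_score(y_te, scores))
```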
The results of comparing the three models, Random Forest, decision tree and logistic regression, are shown in Table 6. The Random Forest algorithm has the greatest AUC value among the three algorithms; hence its predictive performance is better than that of the other two.
Feature Importances of Variables
We use the feature_importances_ attribute of the sklearn.ensemble.RandomForestClassifier class in this experiment; the feature importance of each variable is listed in Table 7. From Table 7, the three variables RevolvingUtilizationOfUnsecuredLines, NumberOfTime30-59DaysPastDueNotWorse and NumberOfTimes90DaysLate have the top three feature importances and therefore the greatest impact on determining which borrowers may break their contracts and bring economic losses to lenders. Hence, when granting a loan, companies can examine these features of an applicant to lower the risk, for example as in the sketch below.
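A sketch of ranking the variables by feature_importances_ is shown below; as before, the file name and the simple NaN handling are assumptions.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

train = pd.read_csv("cs-training.csv").fillna(0)           # assumed file name and NaN handling
X = train.drop(columns=["SeriousDlqin2yrs"])
y = train["SeriousDlqin2yrs"]

rf = RandomForestClassifier(n_estimators=100, min_samples_leaf=50,
                            class_weight="balanced_subsample", random_state=0).fit(X, y)

# feature_importances_ holds the impurity-based (Gini) importance of each predictor
importances = pd.Series(rf.feature_importances_, index=X.columns)
print(importances.sort_values(ascending=False))
```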
Conclusion
This paper studied loan default prediction, a common problem in the financial sector, and established a default prediction model using the Random Forest method for imbalanced data classification. The basic idea of Random Forest is that, while constructing each individual tree, some variables or features are randomly selected to participate in the node splits; repeating this many times ensures the independence of the trees. For the imbalanced data, parameter adjustment allows the Random Forest to adjust the class weights automatically according to the values of y, which effectively solves the imbalanced classification problem. Experiments show that the Random Forest algorithm performs better than the decision tree and logistic regression classification models, which has important reference value for loan default prediction problems in the financial field. In addition, based on the feature importance measurements, this experiment shows that the borrower's revolving utilization of unsecured credit lines and past delinquency counts (30-59 days and 90 or more days overdue) have the greatest influence on the final default, and this feature importance measure also has important reference value for other feature selection problems in data mining.
There is one page of the claims only.
Claims (1)
1. A financial mining method for credit prediction, wherein the experimental environment used in this experiment is Anaconda3 + Python 3; first, the data is preliminarily analyzed: this experiment mainly analyzes the distribution of the default rate over each independent variable and generates a frequency distribution table; data preprocessing: when reading data using the pandas library in Python, the na_values parameter of the function pd.read_csv() is set to a user-defined list so that 0 in the age variable and the values 96 and 98 in the three overdue variables are treated as NaN values, and then the sklearn.preprocessing.Imputer class is used to replace all NaNs in the dataset with the average of the corresponding columns.
Figure 1 (analysis flow chart): introduces unbalanced data classification and the Random Forest algorithm; data preprocessing and data analysis; compares models of three different algorithms; conclusion that the Random Forest algorithm has better performance; summarizes the paper.
Figure 2: Random Forests (classification results of the individual trees combined by voting).
Figure 3 (modeling flow chart): inputs data to Python; uses sklearn to build models based on three different algorithms; calculates the AUC value (area under the ROC curve); compares the AUC values of the three models; conclusion that the Random Forest algorithm has better performance.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| AU2019101189A AU2019101189A4 (en) | 2019-10-02 | 2019-10-02 | A financial mining method for credit prediction |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| AU2019101189A AU2019101189A4 (en) | 2019-10-02 | 2019-10-02 | A financial mining method for credit prediction |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| AU2019101189A4 true AU2019101189A4 (en) | 2020-01-23 |
Family
ID=69160470
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| AU2019101189A Ceased AU2019101189A4 (en) | 2019-10-02 | 2019-10-02 | A financial mining method for credit prediction |
Country Status (1)
| Country | Link |
|---|---|
| AU (1) | AU2019101189A4 (en) |
Cited By (9)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN113393169A (en) * | 2021-07-13 | 2021-09-14 | 大商所飞泰测试技术有限公司 | Financial industry transaction system performance index analysis method based on big data technology |
| CN113393169B (en) * | 2021-07-13 | 2024-03-01 | 大商所飞泰测试技术有限公司 | Financial industry transaction system performance index analysis method based on big data technology |
| CN113792935A (en) * | 2021-09-27 | 2021-12-14 | 武汉众邦银行股份有限公司 | Small micro enterprise credit default probability prediction method, device, equipment and storage medium |
| CN113792935B (en) * | 2021-09-27 | 2024-04-05 | 武汉众邦银行股份有限公司 | Method, device, equipment and storage medium for predicting credit default probability of small micro-enterprises |
| CN114119211A (en) * | 2021-12-09 | 2022-03-01 | 武汉众邦银行股份有限公司 | Method for screening high-latitude variable of credit variable data |
| CN114328668A (en) * | 2021-12-28 | 2022-04-12 | 浙江惠瀜网络科技有限公司 | Method and device for generating deposit risk control strategy, terminal and storage medium |
| CN115408499A (en) * | 2022-11-02 | 2022-11-29 | 思创数码科技股份有限公司 | Automatic analysis and interpretation method and system for government affair data analysis report chart |
| CN116364178A (en) * | 2023-04-18 | 2023-06-30 | 哈尔滨星云生物信息技术开发有限公司 | Somatic cell sequence data classification method and related equipment |
| CN116364178B (en) * | 2023-04-18 | 2024-01-30 | 哈尔滨星云生物信息技术开发有限公司 | Somatic cell sequence data classification method and related equipment |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| AU2019101189A4 (en) | A financial mining method for credit prediction | |
| AU2020100709A4 (en) | A method of prediction model based on random forest algorithm | |
| Tang et al. | Applying a nonparametric random forest algorithm to assess the credit risk of the energy industry in China | |
| Coşer et al. | PREDICTIVE MODELS FOR LOAN DEFAULT RISK ASSESSMENT. | |
| Sadatrasoul et al. | Credit scoring in banks and financial institutions via data mining techniques: A literature review | |
| CN109949152A (en) | A Personal Credit Default Prediction Method | |
| AU2020101475A4 (en) | A Financial Data Analysis Method Based on Machine Learning Models | |
| WO2012018968A1 (en) | Method and system for quantifying and rating default risk of business enterprises | |
| Chern et al. | A decision tree classifier for credit assessment problems in big data environments | |
| Valavan et al. | Predictive-Analysis-based Machine Learning Model for Fraud Detection with Boosting Classifiers. | |
| Barman et al. | A complete literature review on financial fraud detection applying data mining techniques | |
| Liashenko et al. | Machine learning and data balancing methods for bankruptcy prediction | |
| Naik | Predicting credit risk for unsecured lending: A machine learning approach | |
| Chen et al. | Mixed credit scoring model of logistic regression and evidence weight in the background of big data | |
| Yang et al. | An evidential reasoning rule-based ensemble learning approach for evaluating credit risks with customer heterogeneity | |
| Bambico et al. | Characterizing delinquency and understanding repayment patterns in Philippine microfinance loans | |
| Van Trung et al. | Development of a credit scoring model using machine learning for commercial banks in Vietnam | |
| CN112926989B (en) | A bank loan risk assessment method and equipment based on multi-view integrated learning | |
| Chen et al. | Financial distress prediction using data mining techniques | |
| Wang et al. | Credit Risk Assessment for Small and Microsized Enterprises Using Kernel Feature Selection‐Based Multiple Criteria Linear Optimization Classifier: Evidence from China | |
| Rodrigo et al. | Personal Loan Default Prediction and Impact Analysis of Debt-to-Income Ratio | |
| Jin et al. | Financial credit default forecast based on big data analysis | |
| Desta et al. | Data mining application in predicting bank loan defaulters | |
| Zurada | Rule Induction Methods for Credit Scoring | |
| Hu | Development of a Machine Learning-Based Financial Risk Control System |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| FGI | Letters patent sealed or granted (innovation patent) | ||
| MK22 | Patent ceased section 143a(d), or expired - non payment of renewal fee or expiry |