CN114238702A

CN114238702A - Credit card fraud preselection method, device and storage medium based on graph database

Info

Publication number: CN114238702A
Application number: CN202111599387.3A
Authority: CN
Inventors: 赵禹豪; 罗宏辉
Original assignee: Dalian Branch China Construction Bank Co ltd
Current assignee: Dalian Branch China Construction Bank Co ltd
Priority date: 2021-12-24
Filing date: 2021-12-24
Publication date: 2022-03-25
Anticipated expiration: 2041-12-24
Also published as: CN114238702B

Abstract

The invention provides a credit card fraud preselection method, device and storage medium based on a graph database. The method comprises the following steps: acquiring data to be processed; taking the credit card data, the merchant data and the payer data as node data, and taking the relationship data between credit cards and payers and the relationship data between credit cards and merchants as edge data, thereby constructing a full-scale graph; taking the credit card data and the payer data as node data, and taking the relationship data between credit cards and payers as edge data, thereby constructing a cluster graph; grouping the cluster graph based on connectivity, assigning all nodes connected by edges to one subgraph, thereby obtaining a connectivity grouping result; for each large group, connecting the credit cards repaid by each payer with that payer as the central point, thereby obtaining a subdivision result; and after evaluating each group in the full-scale graph based on evaluation indices, outputting a preselection result that meets a preset condition, wherein the evaluation indices are derived from the credit card consumption data.

Description

Credit card fraud preselection method, device and storage medium based on graph database

Technical Field

The invention relates to the technical field of information security monitoring, in particular to a credit card fraud preselection method and device based on a graph database and a storage medium.

Background

A bank credit card system handles a large volume of transaction data, and fraud cannot be detected effectively by manual work alone. Customers suspected of fraud are usually screened by an algorithm first, and the automatically screened risk customers are then reviewed manually. At present, massive credit card transaction data is in transition from the data storage stage to the data mining stage, and the means used by different institutions to monitor and mine credit card fraud vary. The mainstream approach builds a graph database in Neo4j, uses the Louvain algorithm to partition customer groups, and analyzes each customer group by indices such as the size of the customer group, its average ArticleRank score, the number of overdue customers and the number of suspicious customers. Finally, the large customer groups suspected of fraud are exported for manual investigation. However, Neo4j has poor write performance for graph data structures, cannot keep up with real-time reads and writes, and has difficulty importing large volumes of data. Secondly, Neo4j does not perform well on very large nodes: when a node has a large number of edges, operations on that node slow down greatly. In addition, graph computation in Neo4j is very slow, and its non-distributed storage places high demands on disk resources; since performance and capacity can only be improved by adding disks, memory and SSDs to a single machine, the hardware eventually becomes a bottleneck.

In summary, Neo4j is suitable for graph data with a small storage volume, few modifications, many queries and no oversized nodes, whereas the anti-fraud detection addressed here involves a graph database with a large data volume, complex node relationships and a large amount of graph computation.

Disclosure of Invention

In view of the technical problems of large transaction data volume, heavy model computation, long processing time and high hardware requirements, a credit card fraud preselection method, device and storage medium based on a graph database are provided. The invention is embedded in the bank's new-generation system, so that credit card risk early-warning staff can conveniently work through the relationships among customer groups, customers, accounts and merchants layer by layer, supplemented by joint checks of bills and external data, thereby identifying fraudulent customer groups and preventing, controlling and resolving risks in time.

The technical means adopted by the invention are as follows:

a graph database based credit card fraud preselection method comprising:

acquiring data to be processed, wherein the data to be processed comprises credit card data, merchant data and payer data, and constructing a graph database through Spark GraphX;

taking the credit card data, the merchant data and the payer data as node data, and taking the relationship data between credit cards and payers and the relationship data between credit cards and merchants as edge data, thereby constructing a full-scale graph;

taking the credit card data and the payer data as node data, and taking the relationship data between credit cards and payers as edge data, thereby constructing a cluster graph;

grouping the cluster graph based on connectivity, assigning all nodes connected by edges to one subgraph, thereby obtaining a connectivity grouping result, assigning a connectivity grouping id to each connectivity group, and adding the connectivity grouping ids to the full-scale graph by joining tables on id;

based on the connectivity grouping result, extracting large groups in which the number of nodes exceeds a preset threshold, connecting, within each large group, the credit cards repaid by each payer with that payer as the central point, thereby obtaining subdivision results, sorting the subdivision results by the number of cards connected to each payer, assigning a unique subdivision id to each subdivision group based on the sorted order, and adding the subdivision ids to the full-scale graph by joining tables on id;

and after evaluating each group in the full-scale graph based on evaluation indices, outputting a preselection result that meets a preset condition, wherein the evaluation indices are derived from the credit card consumption data.

Further, the credit card data includes a credit card contract number, a customer number, a lock code, an account status, a card type, a customer billing address, a work unit, an expected amount, and a credit limit;

the merchant data includes a merchant number and a merchant name;

the payer data includes a payer account number.

Further, the relationship data between credit cards and payers comprises the credit card contract number, the payer account number, the total number of repayments and the total repayment amount;

the relationship data between credit cards and merchants comprises the credit card contract number, the merchant number, the total number of consumption transactions and the total consumption amount.

Further, before constructing the full-scale graph or the cluster graph, the method comprises a step of recoding the credit card contract number; correspondingly,

before outputting the preselection result, the method comprises a step of decoding the credit card contract number.

Further, the evaluation indices comprise the number of customers, the average consumption amount of customers whose average consumption is above a threshold, the number of customers whose average consumption is above the threshold, the proportion of customers whose average consumption is above the threshold, and the average number of common repayment accounts.

The invention also discloses a credit card fraud preselection device based on a graph database, comprising:

an acquisition module, configured to acquire data to be processed, wherein the data to be processed comprises credit card data, merchant data and payer data, and to construct a graph database through Spark GraphX;

a full-scale graph building module, configured to build a full-scale graph by taking the credit card data, the merchant data and the payer data as node data, and taking the relationship data between credit cards and payers and the relationship data between credit cards and merchants as edge data;

a cluster graph building module, configured to build a cluster graph by taking the credit card data and the payer data as node data, and taking the relationship data between credit cards and payers as edge data;

a connectivity grouping module, configured to group the cluster graph based on connectivity, assigning all nodes connected by edges to one subgraph, thereby obtaining a connectivity grouping result, to assign a connectivity grouping id to each connectivity group, and to add the connectivity grouping ids to the full-scale graph by joining tables on id;

a subdivision module, configured to extract, based on the connectivity grouping result, large groups in which the number of nodes exceeds a preset threshold, to connect, within each large group, the credit cards repaid by each payer with that payer as the central point, thereby obtaining subdivision results, to sort the subdivision results by the number of cards connected to each payer, to assign a unique subdivision id to each subdivision group based on the sorted order, and to add the subdivision ids to the full-scale graph by joining tables on id;

and an output module, configured to output a preselection result that meets a preset condition after evaluating each group in the full-scale graph based on evaluation indices, wherein the evaluation indices are derived from the credit card consumption data.

The invention also discloses a storage medium which comprises a stored program, wherein when the program runs, the method of any one of the above is executed.

Compared with the prior art, the invention has the following advantages:

1. The invention uses Spark GraphX for graph computation, which improves computational efficiency, allows more data to be loaded into the graph database for computation, and yields results faster. For example, if Neo4j on a given machine can only include 3 months of data in the computation, Spark GraphX on the same machine may include more than 6 months of data. The graph database therefore reflects the actual situation better and screens for fraud more effectively.

2. The invention performs computation on the cluster graph (rather than the full-scale graph), which speeds up table joins, computation and output and improves program efficiency. The intra-group subdivision operation avoids deleting transaction edges in the graph, so the graph within a subdivision group better represents the relationship between a payer and all the credit cards that payer has repaid.

3. The index system of the invention identifies high-risk fraudulent customer groups more efficiently and reduces the verification workload of business staff. When a fraudulent customer group is found automatically, the most representative credit card data is output for verification by business staff instead of the full data, which improves verification efficiency and reduces their workload.

4. The invention has a dedicated processing method for each type of group and determines the output scheme according to the characteristics of each group, which makes the output more accurate and makes subsequent parameter tuning more convenient.

Drawings

In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings needed to be used in the description of the embodiments or the prior art will be briefly introduced below, and it is obvious that the drawings in the following description are some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to these drawings without creative efforts.

FIG. 1 is a flow chart of a method for graph database based preselection of credit card fraud in accordance with the present invention.

Detailed Description

In order to make the technical solutions of the present invention better understood, the technical solutions in the embodiments of the present invention will be clearly and completely described below with reference to the drawings in the embodiments of the present invention, and it is obvious that the described embodiments are only a part of the embodiments of the present invention, and not all of the embodiments. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.

It should be noted that the terms "first," "second," and the like in the description and claims of the present invention and in the drawings described above are used for distinguishing between similar elements and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used is interchangeable under appropriate circumstances such that the embodiments of the invention described herein are capable of operation in sequences other than those illustrated or described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed, but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

As shown in FIG. 1, the present invention provides a graph database based preselection method for credit card fraud, comprising:

S1, acquiring data to be processed, wherein the data to be processed comprises credit card data, merchant data and payer data, and constructing a graph database through Spark GraphX.

Specifically, a graph database is constructed through Spark GraphX. It contains three node types: credit card, merchant and payer; and two edge types: credit card-payer and credit card-merchant. The credit card data includes: credit card contract number, customer number, lock code, account status, card type, customer billing address, work unit, expected amount and credit limit. The merchant data includes: merchant number and merchant name. The payer data includes: payer account number. The credit card-payer edge data includes: credit card contract number, payer account number, total number of repayments and total repayment amount. The credit card-merchant edge data includes: credit card contract number, merchant number, total number of consumption transactions and total consumption amount.

S2, taking the credit card data, the merchant data and the payer data as node data, and taking the relationship data between credit cards and payers and the relationship data between credit cards and merchants as edge data, thereby constructing a full-scale graph.

S3, taking the credit card data and the payer data as node data, and taking the relationship data between credit cards and payers as edge data, thereby constructing a cluster graph.

Specifically, a graph built with Spark GraphX must be joined on a Long-type primary key. Credit card contract numbers contain letters and cannot be converted directly to Long, so the contract numbers are recoded before graph construction and decoded at output. The invention creates two graphs: a full-scale graph (three node types and two edge types) and a cluster graph (credit card nodes, payer nodes and credit card-payer edges). The full-scale graph is mainly used for queries and the cluster graph mainly for computation. The full-scale graph contains the information of all nodes and edges to facilitate queries, whereas the cluster graph contains only the node primary keys and the complete edge data needed for computation, which reduces the graph size and speeds up table joins, computation and output.
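The recoding step and the two graphs described above can be illustrated with a minimal single-machine sketch. It is not the Spark GraphX implementation of the invention: the function name `build_id_codec`, the type prefixes and the sample contract numbers are all hypothetical, and plain Python lists stand in for distributed vertex and edge tables.

```python
from typing import Dict, Iterable, Tuple


def build_id_codec(keys: Iterable[str]) -> Tuple[Dict[str, int], Dict[int, str]]:
    """Map each distinct string key to an integer id and back (hypothetical scheme)."""
    encode: Dict[str, int] = {}
    for key in sorted(set(keys)):
        encode[key] = len(encode) + 1  # ids start at 1
    decode = {value: key for key, value in encode.items()}
    return encode, decode


# Hypothetical sample rows.
repayments = [("AB001", "P1"), ("AB002", "P1"), ("CD003", "P2")]  # (contract no., payer account)
purchases = [("AB001", "M9"), ("CD003", "M9")]                    # (contract no., merchant no.)

# Prefix keys by node type so a card and a payer can never share an id.
all_keys = (
    [f"card:{card}" for card, _ in repayments + purchases]
    + [f"payer:{payer}" for _, payer in repayments]
    + [f"merchant:{merchant}" for _, merchant in purchases]
)
encode, decode = build_id_codec(all_keys)

# Cluster graph: card-payer edges only.
cluster_edges = [(encode[f"card:{c}"], encode[f"payer:{p}"]) for c, p in repayments]
# Full-scale graph: card-payer edges plus card-merchant edges.
full_edges = cluster_edges + [
    (encode[f"card:{c}"], encode[f"merchant:{m}"]) for c, m in purchases
]
```

At output time, `decode` recovers the original contract number from the integer id, which corresponds to the decoding step mentioned in the text.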

The Spark GraphX adopted in the invention is a distributed graph processing framework. It provides a simple, easy-to-use and rich interface for graph computation and graph mining on the Spark platform, which greatly facilitates distributed graph processing. By the time GraphX was designed, vertex-cut partitioning and the GAS (Gather-Apply-Scatter) model were both mature, and GraphX was optimized around them in design and implementation to balance functionality and performance. Because GraphX operates on RDDs cached by Spark, it saves a large amount of computation and I/O overhead and is therefore particularly suited to computing on massive data. In short, Neo4j leans towards being a graph database system and is better suited to graph storage and querying, whereas Spark GraphX leans towards being a graph computing system that is good at constructing graph-structured data sets and computing useful results from them. The credit card anti-fraud project involves a large number of operations on node-edge relationships, so Spark GraphX is the better choice.

S4, grouping the cluster graph based on connectivity, assigning all nodes connected by edges to one subgraph, thereby obtaining a connectivity grouping result, assigning a connectivity grouping id to each connectivity group, and adding the connectivity grouping ids to the full-scale graph by joining tables on id.

Specifically, connectivity grouping in the invention is connected-component clustering. Once the nodes are connected, a huge relationship graph is produced, which must be divided into small groups. The method partitions the large graph by connectivity, assigning all nodes connected by edges to one subgraph. Because merchants are excluded from the clustering process, the connectivity groups already exhibit some fraud characteristics. From a business perspective, a subgraph represents one or more credit cards, the payers who repaid them, and the related repayment data. The larger the subgraph, the more credit cards and payers are involved, the more complex the repayment relationships among them, and the greater the suspicion of fraud.
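As an illustration of connectivity grouping, the following sketch computes connected components over card-payer edges with a union-find structure. It is a single-machine stand-in for the distributed computation, not the invention's own code; the function name, the node labels and the rule of numbering groups by descending size are assumptions made for the example.

```python
from collections import defaultdict
from typing import Dict, Hashable, Iterable, Tuple


def connected_components(edges: Iterable[Tuple[Hashable, Hashable]]) -> Dict[Hashable, int]:
    """Return node -> group id, where nodes joined by any path share a group id."""
    parent: Dict[Hashable, Hashable] = {}

    def find(node: Hashable) -> Hashable:
        parent.setdefault(node, node)
        while parent[node] != node:
            parent[node] = parent[parent[node]]  # path halving
            node = parent[node]
        return node

    for left, right in edges:
        root_left, root_right = find(left), find(right)
        if root_left != root_right:
            parent[root_right] = root_left

    members = defaultdict(list)
    for node in list(parent):
        members[find(node)].append(node)

    # Assumption: number groups from 1, largest group first.
    ordered = sorted(members.values(), key=lambda g: (-len(g), min(map(str, g))))
    return {node: gid for gid, group in enumerate(ordered, start=1) for node in group}


# Hypothetical cluster-graph edges (merchants are deliberately absent).
group_ids = connected_components(
    [("card:A", "payer:P1"), ("card:B", "payer:P1"), ("card:C", "payer:P2")]
)
```

Here cards A and B fall into one group because they share payer P1, while card C and payer P2 form a second, smaller group.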

Further, this step first screens out isolated points (a credit card with no payer, or a payer with no credit card) and deletes them from both graphs by id. Each connectivity group then receives a unique group id, which is added from the cluster graph to the full-scale graph by joining tables on id. In general, one-to-one repayment carries little fraud risk and could optionally be excluded at the first step. However, since one-to-one repayment may involve a small amount of cash-out through self-owned merchants, the one-to-one relationships are retained and handled separately in a dedicated one-to-one repayment detection step later.
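The screening and id-join described above might look as follows on toy data. This is a sketch under stated assumptions: all node names are invented, `group_of` is assumed to be the output of the connectivity grouping step, and dictionaries stand in for the joined tables.

```python
from collections import Counter

# Hypothetical sample data: card D has no payer and payer P3 has no card.
card_nodes = ["A", "B", "C", "D"]
payer_nodes = ["P1", "P2", "P3"]
repay_edges = [("A", "P1"), ("B", "P1"), ("C", "P2")]

# Drop isolated points: keep only nodes that appear on at least one edge.
cards_with_edges = {card for card, _ in repay_edges}
payers_with_edges = {payer for _, payer in repay_edges}
kept_cards = [c for c in card_nodes if c in cards_with_edges]
kept_payers = [p for p in payer_nodes if p in payers_with_edges]

# Assumed output of the connectivity grouping step: node -> group id.
group_of = {"A": 1, "B": 1, "P1": 1, "C": 2, "P2": 2}

# Join the group id onto the full-scale graph's card rows.
full_card_rows = [{"card": c, "group_id": group_of[c]} for c in kept_cards]

# One-to-one groups: exactly one card and exactly one payer.
cards_per_group = Counter(group_of[c] for c in kept_cards)
payers_per_group = Counter(group_of[p] for p in kept_payers)
one_to_one_groups = sorted(
    g for g in cards_per_group
    if cards_per_group[g] == 1 and payers_per_group[g] == 1
)
```

The one-to-one groups are kept rather than discarded, so that they can be routed to the separate one-to-one detection step.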

Regarding the connectivity (connected component) algorithm, bringing merchant data in directly would make the connected graph too large, because most credit cards are associated with concentrated consumption locations such as shopping malls or gas stations, and fraudulent relationships are not reflected in that association. The method therefore excludes merchant nodes and performs connectivity grouping only on payers, credit cards and the edges between them; each group represents a set of credit cards and all of their payers.

Regarding the Louvain algorithm, it divides nodes into smaller groups (relative to connected components) according to in-degree and out-degree. However, it removes a large number of edges in order to split the groups. In practical tests, the weights of the two edge types in the anti-fraud graph were found to differ greatly; if Louvain clustering is applied to both, the edges between merchants and cards are almost entirely removed. Even when the Louvain algorithm is run only for common-payer identification (using only payers, credit cards and the edges between them), repayment edges are still removed, which may affect the subsequent step of identifying fraudulent customer groups based on common payers. The Louvain algorithm is therefore replaced by the secondary clustering method.

Furthermore, ranking the list by customer group size alone does not reliably indicate which customer groups carry higher fraud risk. Because frequent large transactions occur in credit card use, the focus is on monitoring abnormally large transactions. New evaluation indices are therefore added, such as the average payment amount per account in the customer group, the proportion of large payments in the customer group and the payment fluctuation degree of the customer group, which are used to rank risky customer groups so that business verification is more targeted.

S5, based on the connectivity grouping result, extracting large groups in which the number of nodes exceeds a preset threshold, connecting, within each large group, the credit cards repaid by each payer with that payer as the central point, thereby obtaining subdivision results, sorting the subdivision results by the number of cards connected to each payer, assigning a unique subdivision id to each subdivision group based on the sorted order, and adding the subdivision ids to the full-scale graph by joining tables on id.

Specifically, large groups (more than 30 cards in the group) may still occur after connectivity grouping. Reports output for such a group in subsequent detection are too large, which makes business verification very difficult; yet from an anti-fraud perspective such a group carries very high fraud risk, so it is subdivided further to facilitate data analysis. Within such a group, each payer is taken as a central point and connected to the credit cards that payer has repaid, and the resulting subgroups are sorted by the number of cards connected to the payer. Such a group has two group ids, a connectivity group id and a subdivision id; output is sorted first by connectivity group size and then by subdivision id within the group.
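The payer-centred subdivision can be sketched as below. The function name and sample edges are hypothetical, and ties in card count are broken by payer name purely to make the example deterministic. No edge is deleted, so a card repaid by two payers appears in both subdivision groups.

```python
from collections import defaultdict
from typing import Iterable, List, Tuple


def subdivide_large_group(
    repay_edges: Iterable[Tuple[str, str]],
) -> List[Tuple[int, str, List[str]]]:
    """Split one large connectivity group into payer-centred subgroups.

    repay_edges holds (card, payer) pairs from a single large group.
    Returns (subdivision_id, payer, cards), ordered by card count, largest first.
    """
    cards_by_payer = defaultdict(set)
    for card, payer in repay_edges:
        cards_by_payer[payer].add(card)

    # Sort by number of connected cards (descending); payer name breaks ties.
    ranked = sorted(cards_by_payer.items(), key=lambda item: (-len(item[1]), item[0]))
    return [
        (sub_id, payer, sorted(cards))
        for sub_id, (payer, cards) in enumerate(ranked, start=1)
    ]


# Hypothetical edges: card C was repaid by both P1 and P2.
subgroups = subdivide_large_group(
    [("A", "P1"), ("B", "P1"), ("C", "P1"), ("C", "P2"), ("D", "P2")]
)
```

Payer P1, with three cards, receives subdivision id 1, and payer P2, with two cards, receives subdivision id 2.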

S6, after evaluating each group in the full-scale graph based on evaluation indices, outputting a preselection result that meets a preset condition, wherein the evaluation indices are derived from the credit card consumption data.

Specifically, through the above steps, three types of groups can be obtained, which are respectively a normal group, a large group and a 1-to-1 group.
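A small sketch of how a group might be assigned to one of these three types is given below. The 30-card limit is the example value given for large groups in this embodiment; the function name, the labels and the choice to count cards rather than all nodes are assumptions.

```python
LARGE_GROUP_CARD_THRESHOLD = 30  # example value from the embodiment; adjustable


def classify_group(n_cards: int, n_payers: int,
                   large_threshold: int = LARGE_GROUP_CARD_THRESHOLD) -> str:
    """Label a connectivity group as 'one_to_one', 'large' or 'normal'."""
    if n_cards == 1 and n_payers == 1:
        return "one_to_one"
    if n_cards > large_threshold:
        return "large"
    return "normal"
```

For instance, `classify_group(31, 5)` yields `"large"`, whereas a group of exactly 30 cards is still treated as normal under this reading of the threshold.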

Analysis shows that the graph structure of a fraudulent customer group has the following two features:

a) Customers in a fraudulent customer group are related to other customers in the group through common payers and common consumption merchants, and the relationships between customers are usually many-to-many, so the graph presents a compact network structure.

b) The relationship graph of a fraudulent customer group generally contains multiple identical payers and identical consumption merchants, and the relationships between customers and payers, and between customers and merchants, are also many-to-many.

Thus, for each group, an evaluation is constructed according to the following indices:

1. Number of customers (the count of credit cards in the group): the larger the number of customers, the more complex the customer group and the higher the fraud risk.

2. Average consumption amount (total consumption amount in the group / total number of consumption transactions in the group): the higher the average amount, the higher the fraud risk.

3. Average consumption amount of customers with average consumption > 1000 (customers in the group whose average consumption exceeds 1000 yuan are selected and their average consumption amount is calculated): an important index. An average consumption above 1000 yuan per transaction suggests that the credit card is rarely used for daily spending; the larger this average amount, the greater the fraud risk.

4. Number of customers with average consumption > 1000 (customers in the group whose average consumption exceeds 1000 yuan are selected and counted): an important index. As with index 3, if the group contains many customers whose average consumption exceeds 1000 yuan per transaction, the group's fraud risk is higher.

5. Proportion of customers with average consumption > 1000 (number of customers with average consumption > 1000 / total number of customers in the group): expresses the same issue as index 4 more intuitively; a group is typically sent for manual review if the proportion is ≥ 1/2.

6. Average number of common repayment accounts (number of credit cards in the group / number of payers in the group): the average number of cards per payer in the group; the larger the number, the higher the fraud risk.

7. Payment fluctuation degree: for each customer in the group whose average consumption exceeds 1000, the absolute value of (consumption amount - average consumption amount) / average consumption amount is summed over that customer's transactions and divided by the number of transactions; these per-customer results are then summed and divided by the number of such customers in the group to give the group's payment fluctuation degree. (Note: each qualifying customer's consumption fluctuation is computed individually, and the average of these values is the group fluctuation.) Credit cards with only one consumption transaction must be excluded from the calculation, because a single transaction cannot support a fluctuation judgment: its fluctuation would be computed as 0 and would substantially lower the group value. A group value > 1 essentially rules out risk; in business terms it typically reflects a single high-value purchase with the rest being daily consumption. A group value < 0.5 indicates that consumption amounts in the group are high and stable, and the fraud risk is higher.
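Indices 1 to 6 above can be sketched for a single group as follows. The function name, the input layout (card id mapped to a list of purchase amounts in yuan) and the sample figures are hypothetical. Index 3 is computed here as the mean of the per-customer averages, which is one possible reading of the description.

```python
from statistics import mean
from typing import Dict, List


def group_indices(purchases: Dict[str, List[float]], n_payers: int,
                  threshold: float = 1000.0) -> Dict[str, float]:
    """Compute indices 1-6 for one group from per-card purchase amounts."""
    n_customers = len(purchases)                                    # index 1
    amounts = [a for card_amounts in purchases.values() for a in card_amounts]
    avg_amount = sum(amounts) / len(amounts) if amounts else 0.0    # index 2

    per_card_avg = {c: mean(a) for c, a in purchases.items() if a}
    high = {c: v for c, v in per_card_avg.items() if v > threshold}
    high_avg = mean(list(high.values())) if high else 0.0           # index 3 (assumed reading)
    n_high = len(high)                                              # index 4
    high_ratio = n_high / n_customers if n_customers else 0.0       # index 5
    cards_per_payer = n_customers / n_payers if n_payers else 0.0   # index 6

    return {
        "n_customers": n_customers,
        "avg_amount": avg_amount,
        "high_avg_amount": high_avg,
        "n_high": n_high,
        "high_ratio": high_ratio,
        "cards_per_payer": cards_per_payer,
    }


# Hypothetical group: three cards, two payers.
example = group_indices(
    {"A": [2000.0, 4000.0], "B": [100.0, 300.0], "C": [1500.0]}, n_payers=2
)
```

In this toy group, cards A and C exceed the 1000 yuan threshold, so two of the three customers are flagged and the group would meet the review ratio of 1/2.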

In this embodiment the preferred threshold is 1000 yuan; in practice the parameter may be adjusted according to each bank's data. Because fraudulent accounts frequently perform large cash-outs, their overall consumption amounts are higher; if the consumption amount of a whole group is high, the group is more likely to be a fraudulent group.
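The payment fluctuation degree (index 7) can be sketched with the same adjustable threshold. The function name and input layout are hypothetical; returning `None` when no customer qualifies is a choice made for this example and is not stated in the source.

```python
from typing import Dict, List, Optional


def payment_fluctuation(purchases: Dict[str, List[float]],
                        threshold: float = 1000.0) -> Optional[float]:
    """Mean relative absolute deviation of purchase amounts, averaged over
    customers whose average purchase exceeds the threshold."""
    per_customer = []
    for amounts in purchases.values():
        if len(amounts) < 2:
            continue  # a single purchase cannot show fluctuation, so skip it
        average = sum(amounts) / len(amounts)
        if average <= threshold:
            continue  # only customers above the threshold are considered
        deviation = sum(abs(a - average) / average for a in amounts) / len(amounts)
        per_customer.append(deviation)
    if not per_customer:
        return None  # assumption: undefined when no customer qualifies
    return sum(per_customer) / len(per_customer)


# Hypothetical group: A fluctuates, D is perfectly stable, B and C are excluded.
fluctuation = payment_fluctuation(
    {"A": [2000.0, 4000.0], "B": [100.0, 300.0], "C": [1500.0], "D": [5000.0, 5000.0]}
)
```

Card A contributes 1/3 and card D contributes 0, giving a group value of 1/6, which falls in the high and stable range below 0.5 described above.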

In combination with the above index system, this embodiment provides output methods for the different groups:

Normal group output method: the connectivity groups are ranked by index 3 (average consumption amount of customers with average consumption > 1000), with the other indices reviewed alongside. The ids of the customers counted in index 4 (average consumption > 1000) are extracted, their consumption and repayment information is queried in the full-scale graph by id, and the result is output as an xlsx file for screening by business staff. Because data such as actual card numbers are very long, they may be converted to scientific notation when business staff edit and save a csv file, which loses the card number data; to prevent this, the commonly used csv output format is replaced with Excel format. If fraud characteristics are found during screening, the group is queried by its connectivity group id, all credit card ids in the group are extracted, their consumption and repayment information is queried in the full-scale graph, and the risky fraud group is output as a report for review by business staff.
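The ranking and extraction part of the normal group output can be sketched as below. The function name and data layout are hypothetical, and writing the xlsx file is left out because it would depend on a spreadsheet library. Card ids are kept as strings so that long card numbers are never treated as numbers.

```python
from typing import Dict, List, Tuple


def rank_groups_for_review(
    groups: Dict[int, Dict[str, List[float]]], threshold: float = 1000.0
) -> List[Tuple[int, float, List[str]]]:
    """Rank groups by index 3 and list the customers counted in index 4.

    groups maps group id -> card id -> purchase amounts.
    Returns (group_id, index_3, flagged_card_ids), highest index 3 first.
    """
    report = []
    for group_id, purchases in groups.items():
        averages = {c: sum(a) / len(a) for c, a in purchases.items() if a}
        flagged = {c: v for c, v in averages.items() if v > threshold}
        index_3 = sum(flagged.values()) / len(flagged) if flagged else 0.0
        report.append((group_id, index_3, sorted(flagged)))
    report.sort(key=lambda row: (-row[1], row[0]))
    return report


# Hypothetical groups keyed by connectivity group id.
review_list = rank_groups_for_review({
    1: {"A": [1200.0, 1800.0], "B": [50.0]},
    2: {"C": [5000.0], "D": [3000.0, 3000.0]},
})
```

Group 2 is listed first because its flagged customers average 4000 yuan, against 1500 yuan for group 1.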

Large group output method: indices 1 to 5 are computed per subdivision id in the connectivity graph (index 6 is omitted because within a subdivision group it is the same as index 1); payer ids whose number of customers (index 1) is > 10 (adjustable) are extracted, joined with the full-scale table to output the related information of the credit cards they repaid, and passed to business staff for screening. The remaining subdivision groups are then processed in the same way as normal groups.

One-to-one group output method: a one-to-one group means that one card corresponds to one payer. The structure is relatively simple: the average consumption amount of the card is calculated and ranked directly, and one-to-one groups with a high average amount are screened out. The remaining operations are identical to those for normal groups.

The above-mentioned serial numbers of the embodiments of the present invention are merely for description and do not represent the merits of the embodiments.

In the above embodiments of the present invention, the descriptions of the respective embodiments have respective emphasis, and for parts that are not described in detail in a certain embodiment, reference may be made to related descriptions of other embodiments.

In the embodiments provided in the present application, it should be understood that the disclosed technology can be implemented in other ways. The above-described embodiments of the apparatus are merely illustrative, and for example, the division of the units may be a logical division, and in actual implementation, there may be another division, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, units or modules, and may be in an electrical or other form.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer readable storage medium. Based on such understanding, the technical solution of the present invention may be embodied in the form of a software product, which is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. And the aforementioned storage medium includes: a U-disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a removable hard disk, a magnetic or optical disk, and other various media capable of storing program codes.

Finally, it should be noted that: the above embodiments are only used to illustrate the technical solution of the present invention, and not to limit the same; while the invention has been described in detail and with reference to the foregoing embodiments, it will be understood by those skilled in the art that: the technical solutions described in the foregoing embodiments may still be modified, or some or all of the technical features may be equivalently replaced; and the modifications or the substitutions do not make the essence of the corresponding technical solutions depart from the scope of the technical solutions of the embodiments of the present invention.

Claims

1. A method for graph database based preselection of credit card fraud, comprising:

2. A graph database based credit card fraud preselection method as claimed in claim 1 wherein said credit card data includes credit card contract number, customer number, lock code, account status, card type, customer billing address, work unit, expected amount and credit line;

the merchant data includes: a merchant number and a merchant name;

the payer data includes a payer account number.

3. The graph database-based credit card fraud preselection method of claim 1, wherein the relationship data between credit cards and payers comprises a credit card contract number, a payer account number, a total number of repayments and a total repayment amount;

4. A graph database based credit card fraud preselection method as claimed in claim 2 or 3, comprising a step of recoding the credit card contract number before constructing the full-scale graph or the cluster graph; correspondingly, before outputting the preselection result, the method comprises a step of decoding the credit card contract number.

5. A graph database based credit card fraud preselection method as claimed in claim 1 wherein said evaluation metrics comprise number of customers, average amount of consumption of customers with average consumption above a threshold, number of customers with average consumption above a threshold, proportion of customers with average consumption above a threshold, and average number of common repayment accounts.

6. A graph database based credit card fraud preselection apparatus comprising:

7. A storage medium comprising a stored program, wherein the program when executed performs the method of any one of claims 1 to 5.