CN117971921B

CN117971921B - Method and system for detecting abnormal operation of client based on apriori algorithm

Info

Publication number: CN117971921B
Application number: CN202410055271.0A
Authority: CN
Inventors: 王锟; 任纪刚; 蔺心臣
Original assignee: Arms Equipment Group Finance Co ltd
Current assignee: Arms Equipment Group Finance Co ltd
Priority date: 2024-01-15
Filing date: 2024-01-15
Publication date: 2024-06-21
Anticipated expiration: 2044-01-15
Also published as: CN117971921A

Abstract

The invention provides a method and a system for detecting abnormal operation of a client based on an apriori algorithm, which relate to the technical field of abnormal operation detection and comprise the following steps: obtaining operation data, generating a bitmap, constructing a bitmap index based on the bitmap, generating a candidate item set, determining a condition mode base of each item in the candidate item set, traversing a transaction database to determine the support degree of the condition mode base, and repeating iteration to obtain a frequent item set; for the items in the frequent item set, calculating an importance score corresponding to each item, removing the items with the importance scores smaller than a score threshold value, and for the reserved items, determining corresponding sub-association rules, and combining to obtain association rules; and traversing the association rule for each operation data, determining whether a front item corresponding to the association rule exists in the operation data, returning to be free of abnormality if the front item does not exist, detecting whether a rear item corresponding to the front item does not exist or an item which does not accord with the association rule exists in the operation data if the front item does not exist, marking the front item as abnormal if the rear item does not exist or returning to be abnormal if the rear item does not accord with the association rule exists in the operation data.

Description

Method and system for detecting abnormal operation of client based on apriori algorithm

Technical Field

The invention relates to the technical field of abnormal operation detection, in particular to a method and a system for detecting abnormal operation of a client based on an apriori algorithm.

Background

In the financial field, particularly in banks and other financial institutions, abnormal operation of customers may imply potential risk or fraud. Such abnormal operations may include large transfers, irregular withdrawal practices, off-site logging, and the like. Detecting these abnormal operations is critical to maintaining the security of the financial system.

In the prior art, CN112148715A discloses a database security detection method and system based on user behavior rules. The method comprises the following steps: acquiring a normal SQL statement submitted to a database by a user; splitting a normal SQL sentence into user operation behavior vectors, wherein the user operation behavior vectors comprise user names, operation behaviors and operation objects; taking a set formed by each user operation behavior vector as a data set, taking the maximum confidence as a correlation measurement standard, adopting an Apriori algorithm to screen user operation behavior vectors larger than a correlation threshold in the data set, marking the user operation behavior vectors as normal operation behavior vectors, and marking the set formed by each normal operation behavior vector as a user normal behavior library; acquiring an SQL sentence to be detected; splitting the SQL sentence to be detected into operation behavior vectors of the user to be detected; and matching the operation behavior vector of the user to be detected with each vector in the normal behavior library of the user so as to detect abnormal behaviors.

In summary, although the prior art can detect the operation of the client through the apriori algorithm, the evaluation of the opposite credibility is simpler, and the operation habit difference of different users is not considered, so a method is needed to make up for the defects in the prior art.

Disclosure of Invention

The embodiment of the invention provides a method and a system for detecting abnormal operation of a client based on an apriori algorithm, which are used for analyzing the relevance of the behavior of the client and detecting the abnormal operation based on the operation data of the client.

In a first aspect of an embodiment of the present invention, a method for detecting a client abnormal operation based on an apriori algorithm is provided, including: acquiring operation data, generating a bitmap based on items in a transaction database, constructing a bitmap index based on the bitmap through bit operation, generating a candidate item set based on the bitmap index in combination with a preset comprehensive compression algorithm, determining a condition pattern base of each item in the candidate item set by constructing a frequent tree, determining the support degree of the condition pattern base by traversing the transaction database based on the bitmap index and the condition pattern base, and repeating iteration to obtain a frequent item set, wherein each bit in the bitmap represents one item of the transaction;

For the items in the frequent item set, determining an importance score corresponding to each item by calculating information gain, comparing the importance score with a preset score threshold, removing the items with the importance score smaller than the score threshold, and for the reserved items, determining a corresponding sub-association rule by a rule generation algorithm, and combining the sub-association rules to obtain an association rule;

And traversing the association rule for each operation data, determining whether a front item corresponding to the association rule exists in the operation data, returning to be free of abnormality if the front item does not exist, detecting whether a rear item corresponding to the front item does not exist or an item which does not accord with the association rule exists in the operation data if the front item does not exist, marking the operation data as abnormal if the rear item does not exist or the item which does not accord with the association rule exists in the operation data, and returning to be abnormal.

In an alternative embodiment of the present invention,

The obtaining operation data, preprocessing the operation data, generating a bitmap based on items in a transaction database, constructing a bitmap index based on the bitmap through bit operation, and generating a candidate item set based on the bitmap index and combining a preset comprehensive compression algorithm comprises:

representing each transaction of a transaction database in the operation data by using a bitmap;

Distributing a randomly generated unique identifier for each item in the operation data based on the bitmap, constructing a mapping table based on the unique identifier, traversing the operation data, and converting the identifier corresponding to each item into a bitmap index through the mapping table; and compressing continuous repeated data in each item of the operation data based on the bitmap index to generate a first compressed output, obtaining comprehensive compressed output by distributing and encoding discontinuous data in the first compressed output according to the occurrence frequency, and summarizing the comprehensive compressed output to obtain a candidate item set.

In an alternative embodiment of the present invention,

Determining a condition pattern base of each item in the candidate item set by constructing a frequent tree, determining the support degree of the condition pattern base by traversing the transaction database based on the bitmap index and the condition pattern base, and repeating the iteration to obtain the frequent item set, wherein the method comprises the following steps:

Based on the candidate item set, determining the number of transactions in the operation data, initializing a frequent tree according to the number of transactions, respectively constructing a head table for items in the frequent tree, combining the head table, determining the frequency of each transaction in the transaction database, and arranging the transactions in descending order according to the frequency;

Adding the transaction in the transaction database into the frequent tree, inserting or expanding the items in the transaction based on the structure of the frequent tree, updating a corresponding item header table, counting the frequent items in the transaction, for each frequent item, finding a path ending with the current frequent item in the frequent tree, removing the current frequent item, obtaining a condition pattern base corresponding to each item, repeating the operation to obtain an initial frequent item set, checking whether a subset of the initial frequent item set is frequent or not, and deleting the initial frequent item set if the subset is not frequent;

For the reserved initial frequent item set, constructing a corresponding bitmap, initializing a bitmap counter, recording a transaction containing the initial frequent item set, traversing the transaction database, checking the contained frequent item set through bit operation, determining the support degree of the corresponding frequent item set based on the data of the bitmap counter, repeating iteration until the support degree of the frequent item set is greater than a preset support degree threshold, and outputting the corresponding frequent item set.

In an alternative embodiment of the present invention,

For the items in the frequent item set, determining an importance score corresponding to each item by calculating an information gain, comparing the importance score with a preset score threshold, and removing the items with the importance score smaller than the score threshold comprises:

Determining the total transaction number in the transaction database, calculating database information entropy corresponding to the transaction database based on the total transaction number, calculating corresponding item information entropy for each item in the frequent item set, and calculating the information gain based on the database information entropy and the item information entropy;

And carrying out normalization processing on the information gain to obtain an importance score corresponding to each item, carrying out descending order on the items in the frequent item set based on the importance score, comparing the importance score with a preset score threshold, and removing the item corresponding to the importance score if the importance score is smaller than the score threshold.

In an alternative embodiment of the present invention,

The information gain is calculated based on the database information entropy and the item information entropy, and is calculated as shown in the following formula:

Wherein G represents information gain, C represents the number of categories in the transaction database, i represents the ith category, P (class _i) represents the frequency of the category i in the dataset, P (K) represents the probability of occurrence of the item under the current category, and K represents one category in the transaction database.

In an alternative embodiment of the present invention,

For the reserved item, determining a corresponding sub-association rule through a rule generation algorithm, and combining the sub-association rule to obtain an association rule comprises:

acquiring a frequent item set subjected to threshold screening, initializing a corresponding sub-association rule for each reserved item in the frequent item set, and setting the maximum length of the sub-association rule;

For the items ordered according to the importance scores, selecting a first item as a front item of the sub-association rule, calculating the support degree of the front item, for the unselected items, calculating the first confidence degree of each item combined with the front item, and selecting the item with the largest first confidence degree value as the next item of the sub-association rule;

And repeatedly selecting until the length of the sub-association rule reaches the preset maximum length, outputting the sub-association rule, calculating a second confidence coefficient corresponding to each sub-association rule for each sub-association rule, arranging the sub-association rules in a descending order according to the second confidence coefficient, and selecting the sub-association rule positioned in the first 50% to be combined to obtain the association rule.

In an alternative embodiment of the present invention,

Traversing the association rule for each operation data, determining whether a front item corresponding to the association rule exists in the operation data, returning to be free of abnormality if the front item does not exist, detecting whether a rear item corresponding to the front item does not exist or an item which does not conform to the association rule exists in the operation data if the front item does not exist, marking the operation data as abnormal if the rear item does not exist in the operation data or the rear item does not conform to the association rule exists, and returning to be abnormal comprises the following steps:

Checking the operation data based on the association rule, traversing the association rule, matching the front item in the association rule in the operation data, determining whether the front item exists, and if so, further judging whether the operation data has a rear item corresponding to the association rule;

if the postitem corresponding to the previous item does not exist, marking the operation data as abnormal, returning to the existence of the abnormality, and if the postitem corresponding to the previous item exists, returning to the nonessential abnormality;

If the previous item exists in the operation data and the item which does not exist in the association rule exists, marking the operation data as abnormal and returning to the abnormal state.

In a second aspect of the embodiment of the present invention, a system for detecting abnormal operation of a client based on apirori algorithm is provided, including: the first unit is used for acquiring operation data, generating a bitmap based on items in a transaction database, constructing a bitmap index through bit operation based on the bitmap, generating a candidate item set based on the bitmap index in combination with a preset comprehensive compression algorithm, determining a condition pattern base of each item in the candidate item set by constructing a frequent tree, determining the support degree of the condition pattern base based on the bitmap index and the condition pattern base, and repeatedly iterating to obtain a frequent item set, wherein each bit in the bitmap represents one item of the transaction;

A second unit, configured to determine, for the items in the frequent item set, an importance score corresponding to each item by calculating an information gain, compare the importance score with a preset score threshold, remove items with importance scores smaller than the score threshold, and determine, for the retained items, a corresponding sub-association rule by a rule generation algorithm, and combine the sub-association rules to obtain an association rule;

And a third unit, configured to traverse the association rule for each operation data, determine whether a front item corresponding to the association rule exists in the operation data, if not, return to no exception, if so, detect whether a rear item corresponding to the front item does not exist or an item that does not conform to the association rule exists in the operation data, if so, mark the operation data as exception, and return to the exception.

In a third aspect of an embodiment of the present invention,

There is provided an electronic device including:

A processor;

a memory for storing processor-executable instructions;

Wherein the processor is configured to invoke the instructions stored in the memory to perform the method described previously.

In a fourth aspect of an embodiment of the present invention,

There is provided a computer readable storage medium having stored thereon computer program instructions which, when executed by a processor, implement the method as described above.

In the invention, the bitmap is generated through the items in the transaction database, and then the bitmap index is constructed through bit operation based on the bitmap, thereby being beneficial to efficiently representing and indexing the item set information in the transaction database, improving the efficiency of data storage and retrieval, and generating the candidate item set based on the bitmap index and combining with a preset comprehensive compression algorithm. The comprehensive compression algorithm is used for reducing redundancy of data, improving efficiency of candidate item sets, determining importance scores corresponding to items in each frequent item set through calculation of information gain, helping to identify which items have greater contribution to analysis of the whole data, improving quality of association rules, traversing the association rules, determining whether front items corresponding to the association rules exist or not for each operation data, detecting whether rear items exist or not accord with the association rules, helping to identify operation data which do not accord with the association rules, marking the operation data as abnormal, and providing an abnormal detection function.

Drawings

FIG. 1 is a flowchart of a method for detecting a customer abnormal operation based on apirori algorithm according to an embodiment of the present invention;

FIG. 2 is a schematic diagram of a system for detecting abnormal operation of a client based on apirori algorithm according to an embodiment of the present invention.

Detailed Description

For the purpose of making the objects, technical solutions and advantages of the embodiments of the present invention more apparent, the technical solutions of the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention, and it is apparent that the described embodiments are only some embodiments of the present invention, not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.

The technical scheme of the invention is described in detail below by specific examples. The following embodiments may be combined with each other, and some embodiments may not be repeated for the same or similar concepts or processes.

FIG. 1 is a flowchart of a method for detecting a client abnormal operation based on apirori algorithm according to an embodiment of the present invention, as shown in FIG. 1, the method includes:

S1, acquiring operation data, generating a bitmap based on items in a transaction database, constructing a bitmap index through bit operation based on the bitmap, generating a candidate item set based on the bitmap index in combination with a preset comprehensive compression algorithm, determining a condition pattern base of each item in the candidate item set by constructing a frequent tree, determining the support degree of the condition pattern base by traversing the transaction database based on the bitmap index and the condition pattern base, and repeatedly iterating to obtain a frequent item set, wherein each bit in the bitmap represents one item of the transaction;

The transaction database refers to a database in which a plurality of transactions are recorded, each transaction generally represents a set of item sets or transactions, each transaction contains a set of items (items), the bitmap is a data structure and is used for representing a set of elements, each element corresponds to a bit in the bitmap, each bit of the bitmap represents whether an element exists in the set, the bitmap index is an index structure based on the bitmap and is used for accelerating database query, each value of a certain attribute has a corresponding bitmap, each bit represents whether the attribute value exists in the data set, the comprehensive compression algorithm is an algorithm for reducing the data storage space, the cost of storage and transmission is reduced by removing redundant information or using a more compact representation form, the frequent tree is a data structure used for finding frequent item sets in data mining, the frequent item sets can be found more effectively by representing the transaction database as a tree, the condition pattern is a data structure which is used for assisting in finding frequent item sets in the frequent item sets, and the frequent item sets are generated on the condition basis.

In an alternative embodiment of the present invention,

The mapping table is a data structure for establishing a mapping relationship of elements between different data sets, the identifier is a unique identifier generated for each item in the mapping table, by using the identifier, the original item can be represented in a more compact manner, the comprehensive compression output is a compression result generated in a compression algorithm, and the encoding generally refers to representing the original data in a shorter form to reduce the overhead of storage or transmission.

Traversing the transaction database, finding all the items which appear in the transaction database, constructing a full set of item sets, taking the full set as a column label of the bitmap, creating a bitmap with the length equal to the length of the full set for each transaction, setting each bit of the bitmap to 0, traversing the item set of each transaction, determining the position of the item in the full set of item sets, modifying the corresponding position to 1, repeating the operation, and representing the transaction by the bitmap for each transaction in the operation data;

Generating a random unique identifier for each item in a transaction database, establishing a mapping relation between each item and the corresponding unique identifier, generating a mapping table, traversing an item set in each item for each transaction, searching the corresponding unique identifier of each item in the item set through the mapping table, converting the unique identifier into a corresponding bitmap index by mapping the unique identifier into an integer, setting the position of the corresponding index in the bitmap to be 1 to indicate that the item exists in the current transaction, and repeating the operation until the identifier corresponding to each item in the operation data is converted into the bitmap index for each transaction in the database;

Illustratively, assume that there are transaction 1 and transaction 2 in the operational data, where transaction 1 is [ A, B, C ], transaction 2 is [ A, C, D ], each item in the two transactions is assigned a unique identifier, A: ID1, B: ID2, C: ID3, D: ID4, constructing a mapping table according to the unique identifiers, namely, A- & gt, ID1, B- & gt, ID2, C- & gt, ID3, D- & gt, ID4, traversing the operation data, converting the identifiers corresponding to each item into bitmap indexes through the mapping table, namely, transaction 1 can be represented as [ ID1, ID2, ID3], transaction 2 can be represented as [ ID1, ID3, ID4], updating the bitmap according to the bitmap indexes, transaction 1 is represented as 1100, and transaction 2 is represented as 1011; for each transaction, traversing the bitmap index of the transaction, compressing continuously repeated data to obtain a first compressed output, wherein the continuously repeated data is compressed, counting discontinuous data in the first compressed output to obtain the occurrence frequency of each discontinuous data, distributing codes for the discontinuous data according to the occurrence frequency, splicing the continuously repeated data and the codes of the discontinuous data in the first compressed output together to obtain a comprehensive compressed output, and for each transaction, taking the comprehensive compressed output as a candidate, and summarizing the candidates of all the transactions to obtain a candidate set;

illustratively, assuming there are transaction 1 and transaction 2 represented by a bitmap index, where transaction 1:11001100, transaction 2:0100110, by compressing the consecutively repeated data in the transaction, transaction 1 can be represented as: 1010, transaction 2 may be represented as: 01010, counting the occurrence frequency of discontinuous data, and distributing codes, wherein 0 codes are 0,1 codes are 1,10 codes are 01, the comprehensive compression output of the transaction 1 is 10, and the comprehensive compression output of the transaction 2 is 01010.

In this embodiment, by representing each transaction of the transaction database with a bitmap, the space requirement of data storage can be effectively reduced, the process of converting the unique identifier into the bitmap index further improves the data storage and processing efficiency, and based on the bitmap index, continuous repeated data in each item is compressed, so that the redundancy of the data is reduced, the consumption of storage space is reduced, the efficiency of data storage is improved, and in sum, the embodiment can obtain a more compact and efficient data representation form, provide a better basis for data mining tasks, help to accelerate the execution of tasks such as frequent item set mining, and reduce the burden of data storage and processing.

In an alternative embodiment of the present invention,

The candidate item set is a set which is generated in a frequent item set mining algorithm and possibly contains frequent items, the frequency represents the frequency or the number of times that an item set appears in a transaction database, the bitmap counter is a data structure for efficiently counting the number of transactions containing a specific frequent item set, each bitmap counter corresponds to one frequent item set, each bit represents one transaction, and the support threshold is a threshold which is set in advance in the process of mining the frequent item set and is used for determining the condition that the finally output frequent item set needs to meet.

Counting all the transactions in the candidate item set, determining the total number of the transactions, creating an empty frequent tree based on the obtained total number of the transactions, constructing an item header table for each item in the frequent tree, storing the occurrence position of the item in the frequent tree, traversing a transaction database, checking whether the item in each transaction is in the frequent tree, if so, updating the position of the corresponding item in the item header table, and arranging the transactions in a descending order according to the frequency of the item in the item header table;

Illustratively, assuming the number of transactions in the operational data is 4, there is a set of candidates [ A, B, C, D, AB, AC, AD, BC, BD, ABC, ABD, BCD, ABCD ] that is initialized with an empty frequent tree, the set of candidates is traversed, the frequent tree is constructed, e.g., for the set of candidates [ A, B, C ], the construction of the frequent tree increases nodes at corresponding locations, and for each of the entries in the frequent tree, a header table is constructed, e.g.: a: linked list nodes 1, b: linked list nodes 2, c: linked list nodes 3, d: the linked list node 4. Traversing the transaction database, for each transaction, checking if the item therein is in a frequent tree, e.g. transaction 1 contains items [ A, C ], thus updating the locations of A and C in the head list, ordering the transactions in descending order according to the frequency of the items in the head list, e.g. assuming the frequency ordering is: a > C > B > D, then the transaction ordering is: 1>3>2>4;

For each transaction, traversing the items in the transaction, starting from the root node of the frequent tree, checking whether the items exist in the child node of the current node or not, if the items exist, moving a pointer to the corresponding child node, if the pointers do not exist, creating a new child node, taking the items as labels, updating a linked list of the items in the item header table, for each transaction, repeatedly executing traversing operation and inserting or expanding operation until all transaction item sets are completely processed, traversing each item in the transaction, counting the appearance position of the items in the frequent tree, for each frequent item, finding a path ending with the item, removing the current frequent item, obtaining a condition mode base corresponding to each item, repeatedly executing the operation until a new frequent item cannot be generated, obtaining an initial frequent item set, for the initial frequent item set, checking whether a subset of each item set is frequent or not, and deleting the initial frequent item set if the subset exists.

Illustratively, a new transaction T is present in which the set of items is { A, B, D, F }, this transaction is traversed, for each item, processing in turn, for item A, checking if it is present in the child node of the root node, moving the pointer to the A node because A is present in the child node, for item B, checking if it is present in the child node of the A node, moving the pointer to the B node because B is present in the child node, checking if it is present in the child node of the B node, creating a new child node because D is not present in the child node, and updating the linked list of D in the head table, checking if it is present in the child node of the D node, creating a new child node because F is not present in the child node, and updating F as a label in the head table, adding each item to the linked list tree according to the updated position until all items are processed.

For example, assuming a transaction T, where the term set is { a, B, D, F }, counting the occurrence position of each term in the frequent tree, for each frequent term, finding a path ending with the term, removing the current frequent term to obtain a condition pattern base corresponding to each term, for example, for the term a, the condition pattern base is { B, D, F }, performing the same operation on other terms to obtain the condition pattern base corresponding to each term, merging the generated frequent term sets to obtain an initial frequent term set { a, B, D, F, AD, AF, BD, BF, DF, ADF }, checking whether a subset of each term set is frequent, for example, checking { a, B, D }, its subset { a, D } and { B, D } are frequent, and thus retaining { a, B, D }.

For each set of reserved initial frequent items, constructing a bitmap for each set of items to record whether the transaction contains the set of items, wherein each bit of the bitmap represents a transaction, initially set to 0, for each set of frequent items, initializing a bitmap counter for recording the number of transactions containing the set of items, for each transaction in the transaction database, traversing the initial set of frequent items, checking whether the transaction contains each set of frequent items by bit operations, if so, incrementing the corresponding bitmap counter by 1, and, illustratively, assuming an initial set of frequent items and a transaction database containing 5 transactions, wherein the initial set of frequent items: { A, B, D, F, AD, AF, BD, BF, DF, ADF }, transaction 1:101010000, transaction 2:1100100001, transaction 3:0100011000, transaction 4:1010010100, transaction 5:0110000101 for each frequent item set, initialize a counter, traverse each transaction, check if transaction contains a frequent item set, e.g., transaction 1 contains { A, D, F, DF }, repeatedly check other transactions, count the number of occurrences of each frequent item set and modify the count value of the bitmap counter, the final bitmap counter is shown as: { A:3, B:3, D:2, F:1, AD:1, AF:1, BD:1, BF:1, DF:1, ADF:0}, a frequent item set { A, B, D, F }, for each frequent item set, determining the number of transactions comprising the frequent item set, namely the support degree, according to the data of the bitmap counter, checking whether the support degree of each frequent item set is greater than a preset support degree threshold value, outputting the frequent item set if the support degree is greater than the threshold value, otherwise, rejecting the frequent item set with the support degree not meeting the condition.

In this embodiment, for each item in the frequent tree, a header table is constructed, which is used to quickly locate the position of the frequent item in the frequent tree, so that the frequent item set can be updated and queried more efficiently in the process of inserting and expanding the frequent tree, then an initial frequent item set is obtained by repeating the operation, whether a subset of each initial frequent item set is frequent or not is checked, if the subset is not frequent, the initial frequent item set is deleted, pruning operation is implemented, complexity of subsequent calculation is reduced, the bitmap counter is continuously updated through iteration, the support degree of the frequent item set is calculated, and the iteration is repeated until the support degree of the frequent item set is greater than a preset support degree threshold value, so that the finally output frequent item set meets a preset support degree condition.

S2, for the items in the frequent item set, determining an importance score corresponding to each item through calculating information gain, comparing the importance score with a preset score threshold value, removing the items with the importance score smaller than the score threshold value, and for the reserved items, determining corresponding sub-association rules through a rule generation algorithm, and combining the sub-association rules to obtain association rules;

The information gain is a measure for evaluating the contribution degree of an attribute to a classification task, the importance score is a comprehensive evaluation index for measuring the importance of an item set or rule in the whole data set, and the sub-association rule refers to a more specific rule in the association rule.

In an alternative embodiment of the present invention,

The database information entropy is used for measuring uncertainty in the whole transaction database, is a result of taking negative logarithm of probability of each category and adding, the lower the database information entropy is, the more transactions in the database tend to belong to the same category, the smaller the information uncertainty is, the item information entropy is used for measuring uncertainty of a certain item in the transaction database, the information gain is used for measuring the degree of uncertainty of the whole transaction database through the certain item, the result of subtracting the item information entropy from the database information entropy is obtained, the larger the information gain is, the more the information gain is contributed to explaining change of the whole transaction database, and the more information can be provided by the information gain.

Counting the total transaction number in a transaction database, calculating to obtain a database information entropy corresponding to the transaction database by taking negative logarithm of the occurrence probability of each category and adding the negative logarithm, counting the occurrence frequency of each item in the transaction database and calculating the probability, taking negative logarithm of the occurrence probability of each item and adding the negative logarithm, calculating to obtain an item information entropy corresponding to the item, and subtracting the item information entropy from the database information entropy to obtain the information gain; and mapping the numerical value into the range of [0,1] through normalization processing for the calculated information gain, taking the normalized information gain as the importance score of each item, arranging the items in the frequent item set in descending order according to the importance score, comparing the importance score with a preset score threshold, and removing the corresponding item if the importance score is smaller than the score threshold.

Illustratively, it is assumed that there is a frequent item set { A, B, C }, where the information gain corresponding to each item is processed to be A:0.8, b:0.5 and C:0.3, carrying out descending order, namely A > B > C, and assuming that the score threshold is 0.6, only the importance score of the A item is larger than the score threshold, reserving the A item, and removing the B item and the C item.

In this embodiment, the uncertainty of the whole transaction database and each frequent item in the transaction database can be quantified by calculating the database information entropy and the item information entropy, a basis is provided for the subsequent information gain calculation, the normalized information gain is mapped to a specific range, the normalized numerical value is used as an importance score, the importance score is calculated so that the importance scores among different items are comparable, the relative importance of the items can be better understood, the items are arranged in descending order of the importance score, the items with larger contribution to the overall association relationship can be more easily focused during analysis, and in conclusion, the embodiment can help an analyst to more effectively understand the relationship among data features, and select key features for further analysis or modeling.

In an alternative embodiment of the present invention,

In the function, after the existence or characteristics of the items are known, the information gain measures the reduction degree of the uncertainty of the categories, the information contribution of the items can be more comprehensively evaluated by comprehensively considering the probability of the categories and the occurrence probability of the items under the categories, the distribution of the categories in the whole transaction database is considered, so that the diversity of the categories is better captured, the distribution situation of the items under different categories is considered, the information about the relation between the items and the categories is further provided, the determination of whether the items have significance under the specific categories is facilitated, and the function integrates the category distribution and the distribution of the items under the categories to provide a comprehensive index for evaluating the contribution of the items to the categories.

In an alternative embodiment of the present invention,

Obtaining items subjected to score threshold screening, forming a frequent item set, initializing a sub-association rule for each reserved item, and setting the maximum length of the sub-association rule;

sorting the items in the frequent item set in descending order according to the importance score, selecting a first item as a front item of the sub-association rule, calculating the support degree of the front item, calculating a first confidence coefficient of each item combined with the front item for the unselected items, selecting an item with the largest first confidence coefficient value as a next item of the sub-association rule, and repeating the steps of selecting the front item and calculating the first confidence coefficient until the maximum length of the sub-association rule is reached or a new association rule cannot be generated;

Illustratively, assume that the frequent item set after threshold filtering is { A, B, C, D }, and their importance scores are arranged in descending order { A, B, C, D }. The preset maximum length of the sub-association rule is 2, initializing { A }, { B }, { C }, { D }, selecting A and calculating the support, calculating the first confidence values of { A, B }, { A, C } and { A, D }, selecting the item with the maximum first confidence value, assuming { A, B }, selecting { B } and calculating the support, repeatedly calculating the first confidence and selecting the value with the maximum first confidence, and obtaining the final sub-association rule.

And repeatedly selecting the next item and calculating the first confidence coefficient for each sub-association rule until the length of the sub-association rule reaches the preset maximum length, outputting the finally obtained sub-association rule, calculating the second confidence coefficient corresponding to each sub-association rule, arranging the sub-association rules in a descending order according to the second confidence coefficient, and selecting the sub-association rule with the top 50% as the final association rule.

Illustratively, assume that the maximum length of the sub-association rule is 2, and that the following two sub-association rules have been generated: first sub-association rule: { A } → { B } and a second sub-association rule: for each sub-association rule, calculating a corresponding second confidence coefficient, assuming that the confidence coefficient of the first sub-association rule is 0.8 and the confidence coefficient of the second sub-association rule is 0.6, sorting the two sub-association rules, and selecting the first sub-association rule as a final association rule.

The first confidence coefficient refers to the probability that the item set Y is simultaneously contained in the transaction containing the item set X, the first confidence coefficient reflects how likely the Y is to be when the X is to appear, the quasi-determination of the rule is measured, the larger the numerical value is, the more credible is represented, and the second confidence coefficient is used for measuring the degree of improvement of the rule relative to the basic probability by considering the basic occurrence probability of the item set Y on the basis of the first confidence coefficient.

The calculation formulas of the first confidence coefficient and the second confidence coefficient are as follows:

wherein C (X.fwdarw.Y) represents a first confidence, Z (X.u.Y) represents a support corresponding to a transaction containing both X and Y items, Z (X) represents a support corresponding to a transaction containing X items, L (X.fwdarw.Y) represents a second confidence, and Z (Y) represents a support corresponding to a transaction containing Y items.

The preceding term is the part of the association rule to the left of the arrow, is a set of terms, and includes terms of the association rule that are considered conditions, in which the preceding term represents a condition, i.e. when these sets of terms occur, the following term is also possible to occur, the following term is the part of the association rule to the right of the arrow, and includes terms of the association rule that are considered results or conclusions, the following term represents a result.

In this embodiment, the term with the highest importance score is selected as the front term of the rule, the term with more significant influence on the data can be considered first, the rule mining efficiency is improved, the first confidence coefficient is calculated to measure the accuracy of the rule, that is, the probability that the rear term appears under the condition that the front term appears, the rule which does not conform to expectations is eliminated, a more complex rule structure is gradually constructed through multiple selection and calculation of the confidence coefficient, the rule is made to be closer to the actual condition of the data, the second confidence coefficient considers the lifting of the rule relative to the base line, the descending order arrangement can lead the rule with high quality to be arranged in front, the rule with the higher importance can be screened out by selecting the rule with the first 50%, the rule mining effect is improved, and in conclusion, the embodiment is helpful for mining the association rule with the higher confidence coefficient and the lifting degree from the data, the relationship between the data can be understood, and the decision making and the predictive analysis are supported.

S3, traversing the association rule for each operation data, determining whether a front item corresponding to the association rule exists in the operation data, returning to be free of abnormality if the front item does not exist, detecting whether a rear item corresponding to the front item does not exist or an item which does not accord with the association rule exists in the operation data if the front item does not exist, marking the operation data as abnormal if the rear item does not exist in the operation data or the operation data does not accord with the association rule exists, and returning to be abnormal.

In an alternative embodiment of the present invention,

Traversing the association rule for each operation data, determining whether a front item corresponding to the association rule exists in the operation data, returning to be free of abnormality if the front item does not exist, detecting whether a rear item corresponding to the front item does not exist or an item which does not meet the association rule exists in the operation data if the front item does not exist, marking the operation data as abnormal if the rear item does not exist or the rear item does not meet the association rule exists in the operation data, and returning to be abnormal comprises the following steps:

For each association rule, traversing the operation data, checking whether the operation data has the front item of the association rule, and if the operation data finds the item matched with the front item of the association rule, continuing the next step; otherwise, returning to "no abnormality", checking whether or not there is a corresponding postitem in the association rule for each operation data for which a matching antecedent is found, if a antecedent is found in the operation data but a corresponding postitem is not found, marking the operation data as abnormal, and returning to "abnormality exists", if there is an item not included in the association rule in the operation data, marking the operation data as abnormal, and returning to "abnormality exists", and if there is both a antecedent and a postitem in the operation data, returning to "no abnormality".

Illustratively, assume that there is one association rule: { A, B } → { C }, meaning that if the item set { A, B } is found in the operation data, then the item set { C } should exist at the same time, for the operation data, operation 1: { A, B, C }, operation 2: { A, D, E }, operation 3: { B, F, C }, detection operation 1: there is { a, B }, continuing with the next step, in operation 1 there is { C }, the rule is satisfied, return "no anomaly", detect operation 2: no { a, B }, then return "no exception".

In this embodiment, if a preceding item is found in the operation data but a corresponding following item is not found, or an item not included in the association rule is found, the operation data is marked as abnormal, which is helpful for identifying those data which do not conform to the association rule, possibly indicating that there is an abnormal condition in the data.

FIG. 2 is a schematic structural diagram of a system for detecting abnormal operation of a client based on apirori algorithm according to an embodiment of the present invention, as shown in FIG. 2, the system includes:

The first unit is used for acquiring operation data, generating a bitmap based on items in a transaction database, constructing a bitmap index through bit operation based on the bitmap, generating a candidate item set based on the bitmap index in combination with a preset comprehensive compression algorithm, determining a condition pattern base of each item in the candidate item set by constructing a frequent tree, determining the support degree of the condition pattern base based on the bitmap index and the condition pattern base, and repeatedly iterating to obtain a frequent item set, wherein each bit in the bitmap represents one item of the transaction;

The present invention may be a method, apparatus, system, and/or computer program product. The computer program product may include a computer readable storage medium having computer readable program instructions embodied thereon for performing various aspects of the present invention.

Finally, it should be noted that: the above embodiments are only for illustrating the technical solution of the present invention, and not for limiting the same; although the invention has been described in detail with reference to the foregoing embodiments, it will be understood by those of ordinary skill in the art that: the technical scheme described in the foregoing embodiments can be modified or some or all of the technical features thereof can be replaced by equivalents; such modifications and substitutions do not depart from the spirit of the invention.

Claims

1. A method for detecting a customer abnormal operation based on an apriori algorithm, comprising:

acquiring operation data, generating a bitmap based on items in a transaction database, constructing a bitmap index based on the bitmap through bit operation, generating a candidate item set based on the bitmap index in combination with a preset comprehensive compression algorithm, determining a condition pattern base of each item in the candidate item set by constructing a frequent tree, determining the support degree of the condition pattern base by traversing the transaction database based on the bitmap index and the condition pattern base, and repeating iteration to obtain a frequent item set, wherein each bit in the bitmap represents one item of the transaction;

2. The method of claim 1, wherein the obtaining the operation data, by preprocessing the operation data, generating a bitmap based on items in a transaction database, constructing a bitmap index based on the bitmap by a bit operation, generating a candidate set based on the bitmap index in combination with a preset comprehensive compression algorithm comprises:

3. The method of claim 1, wherein the determining a conditional pattern base for each item in the candidate set of items by constructing a frequent tree, determining a support for the conditional pattern base by traversing the transaction database based on the bitmap index and the conditional pattern base, iterating, and obtaining a frequent set of items comprises:

4. The method of claim 1, wherein for the items in the frequent item set, determining an importance score for each item by calculating an information gain, comparing the importance score to a preset score threshold, and removing items having an importance score less than the score threshold comprises:

5. The method of claim 4, wherein the calculating the information gain based on the database information entropy and the item information entropy is as follows:

6. The method of claim 1, wherein the determining, for the retained term, a corresponding sub-association rule by a rule generation algorithm, combining the sub-association rules to obtain an association rule comprises:

7. The method of claim 1, wherein for each operation data, traversing an association rule, determining whether a front item corresponding to the association rule exists in the operation data, if not, returning no exception, if so, detecting whether a rear item corresponding to the front item does not exist or an item which does not conform to the association rule exists in the operation data, if so, marking the operation data as exception, and returning the exception includes:

8. A system for detecting a customer abnormal operation based on apirori algorithm for implementing the method for detecting a customer abnormal operation based on apirori algorithm according to any one of claims 1 to 7, comprising:

9. An electronic device, comprising:

A processor;

a memory for storing processor-executable instructions;

wherein the processor is configured to invoke the instructions stored in the memory to perform the method of any of claims 1 to 7.

10. A computer readable storage medium having stored thereon computer program instructions, which when executed by a processor, implement the method of any of claims 1 to 7.