Disclosure of Invention
In view of the above, the present disclosure provides a method and an apparatus for mining association rules between applications, which are used for solving the technical problems that the existing application association rules mining is inaccurate and the core application of the application community cannot be identified.
Based on the embodiment of the disclosure, the disclosure provides a mining method for association rules between applications, the method comprising:
In a plurality of time periods of analysis duration, grouping statistics is carried out according to application use information corresponding to the user identification code, and an application set used by each user in each time period is obtained;
calculating the use rate of each application in different time periods, and taking the average value of the use rate of each application in different time periods as the minimum support degree of each application;
For each time period, calculating the support degree of each application in the time period, and screening out the applications with the support degree greater than the minimum support degree per se in the time period to form a term set L1;
combining the items in the L1 in pairs to form a candidate item set C2 for expressing the application relation of the two applications, and calculating the credibility of each item in the C2 to obtain a credibility set L2 for expressing the application relation of the two applications;
And (3) selecting items from the L2, wherein the credibility of the previous application to the next application is larger than the product of the minimum support of the next application and a preset constant, and obtaining a set R2 for expressing the association rule between the applications.
Further, according to an embodiment of the present disclosure, the method further includes:
Respectively taking the front application and the rear application of each item in the association rule set R2 as points in the graph, taking the credibility of the front application and the rear application as the weight of the edge between the two points, and constructing an application association graph;
And carrying out community division on the application association graph by adopting a community division algorithm to obtain an application community.
Further, according to an embodiment of the present disclosure, the method further includes:
and calculating the weight sum of the edges of the nodes corresponding to each application in each application community, and determining the core application of the application community by the application with the maximum weight sum of the edges in each application community.
Further, before the calculating of the minimum support of each application, the method further includes filtering out application data with a usage rate greater than a preset threshold.
Further, based on the embodiment of the present disclosure, the community division algorithm is a Louvain algorithm, and the preset constant is a constant greater than 1.2.
Based on the embodiment of the disclosure, the disclosure further provides an excavating device for applying the association rule between applications, where the device includes:
the grouping statistics module is used for grouping statistics with application use information corresponding to the user identification codes in a plurality of time periods of analysis duration to obtain application sets used by each user in each time period;
the minimum support calculation module is used for calculating the use rate of each application in different time periods, and taking the average value of the use rate of each application in different time periods as the minimum support of each application;
the application screening module is used for calculating the support degree of each application in each time period according to each time period, and screening out the application with the support degree greater than the minimum support degree in the time period to form a term set L1;
the credibility calculation module is used for combining the items in the L1 in pairs to form a candidate item set C2 for expressing two application sequence application relations, and calculating the credibility of each item in the C2 to obtain a credibility set L2 for expressing two application sequence application relations;
And the association rule determining module is used for screening items from the L2, the credibility of which from the previous application to the next application is greater than the product of the minimum support of the next application and a preset constant, and obtaining a set R2 for expressing association rules between the applications.
Further, according to an embodiment of the present disclosure, the apparatus further includes:
The community dividing module is used for respectively taking the front application and the rear application of each item in the association rule set R2 as points in the graph, taking the credibility of the front application and the rear application as the weight of the edge between the two points, and constructing an application association graph;
The core application identification module is used for calculating the weight sum of the edges of the nodes corresponding to each application in each application community, and determining the core application of the application community by the application with the largest weight sum of the edges in each application community.
Further, according to an embodiment of the present disclosure, the apparatus further includes:
The minimum support calculation module is further configured to filter out application data with a usage rate greater than a preset threshold before the minimum support of each application is calculated.
The present disclosure also provides a storage medium, in which a computer program is stored, and when the computer program in the storage medium is read and executed by a processor, the step functions of the mining method for association rules between any application provided in the present disclosure are completed.
The method comprises the steps of calculating the evaluation utilization rate of an application based on time segmentation as the minimum support, screening out all applications larger than the minimum support to form a frequent item set L1, carrying out application pairwise combination based on the L1, calculating the credibility between the applications to obtain a credibility set L2, screening out items from the L2, from the former application to the latter application, of which the credibility is larger than the product of the minimum support of the latter application and a preset constant, and forming a set R2 for expressing association rules between the applications. And further constructing a graph based on the R2 and carrying out community division by using a community division algorithm. The method and the system can actively push the application to the users with more related applications according to the mined association rules and community division, can carry out key operation and maintenance guarantee on the core application in the identified application community, improve the intelligence and flexibility of association rule mining, and improve the efficiency of an application maintenance system.
Detailed Description
The terminology used in the embodiments of the disclosure is for the purpose of describing particular embodiments only and is not intended to be limiting of the embodiments of the disclosure. As used in the presently disclosed embodiments and in the claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. The term "and/or" as used in this disclosure refers to any or all possible combinations including one or more of the associated listed items.
It should be understood that although the terms first, second, third, etc. may be used in embodiments of the present disclosure to describe various information, these information should not be limited to these terms. These terms are only used to distinguish one type of information from another. For example, the first information may also be referred to as second information, and similarly, the second information may also be referred to as first information, without departing from the scope of embodiments of the present disclosure. Depending on the context, furthermore, the word "if" used may be interpreted as "at..once" or "when..once" or "in response to a determination".
It is one of the primary objects of embodiments of the present disclosure to provide a mining method for inter-application association rules, which is an improvement of the mining method for inter-application association rules based on Apriori algorithm. The Apriori algorithm is a classical data mining algorithm that mines frequent item sets and association rules. The algorithm uses a priori the nature of the frequent item set, i.e. all non-empty subsets of the frequent item set must also be frequent. The Apriori algorithm uses an iterative approach called a layer-by-layer search, in which a set of k terms is used to explore a set of (k+1) terms.
In the present disclosure, some basic terms used are explained as follows in the meaning represented in the present disclosure:
Item and item set let appset = { app 1,app2,...,appm } is the set of all applications, where app i is called the item and subscript i represents the ith application. The set of items is referred to as the item set appset, and the item set containing k items is referred to as the application k item set, denoted as k-appset.
Support-the proportion of records in the data set that contain the set of items to the total records is used to measure how often a collection appears in the original data. For example, the meaning of the support degree of the i-th application app i is that the proportion of the number of users using the application to the total number of users is expressed as follows:
Confidence measures the probability that a rule will occur, e.g., the confidence that the user uses app n immediately after app i is equal to the support of app i and app n, which are applied in succession, divided by the support of app i alone, as follows:
The embodiment of the disclosure firstly completes the preprocessing of the network traffic by using a traffic collector, wherein the preprocessing comprises the collection of the network traffic, the data matching and filtering, the network application data merging based on the user and the like. Based on preprocessing, the embodiment of the disclosure provides a minimum support degree determining method for calculating association rules between applications in a preset analysis duration range based on time period grouping, and designs an association analysis framework for a user to use the applications, wherein the association analysis framework comprises data processing, data analysis, data mining and application community division.
Fig. 1 is a step flowchart of a mining method for applying an association rule between applications, which is provided in an embodiment of the present disclosure, and the method includes:
And 101, in a plurality of time periods of analysis duration, grouping and counting application use information corresponding to the user identification codes to obtain an application set used by each user in each time period.
One of the purposes of the present disclosure is to analyze the relevance between two applications used by a user simultaneously within a certain time frame, and in order to more accurately determine the relevance rule between the applications, the present disclosure proposes a scheme of segmenting analysis duration and then segment statistics. The analysis duration is used to determine the time frame for performing the analysis of applying the association rule, and may be, for example, 1 day, 1 week, or the like. The time period may be divided based on user living habits or application usage habits in different geographical environments, or divided in fixed time units, etc. The division of the time period needs to consider the analysis precision and efficiency of the association rule, and if the time period is too short, the association between certain applications may be small, and if the time period is too long, the application set may be too large, so that the analysis efficiency is low.
Step 102, calculating the use rate of each application in different time periods, and taking the average value of the use rate of each application in different time periods as the minimum support degree of each application.
The method comprises the steps of calculating the use rate of each application in different time periods based on the obtained application set data used by each user in each time period in the analysis time period range, and taking the average value of the use rates of each application in different time periods as the minimum support degree of each application. The minimum support is a dynamic average value, is more flexible and intelligent relative to a fixed threshold, and can avoid that the application with low use ratio is not easy to find.
Step 103, calculating the support degree of each application in each time period, and screening out the application with the support degree greater than the minimum support degree in the time period to form a term set L1.
Step 104, combining the items in the L1 two by two to form a candidate item set C2 for expressing the two application precedence application relations, and calculating the credibility of each item in the C2 to obtain a credibility set L2 for expressing the two application precedence application relations.
Step 105, selecting items from the L2, wherein the credibility from the previous application to the next application is larger than the product of the minimum support of the next application and a preset constant, and obtaining a set R2 for expressing the association rule between the applications.
In the embodiment of the disclosure, when the association rule between two applications is mined, in order to ensure the reliability of the association rule, when judging whether the two applications are strong association rules, the relation between the reliability and the minimum support of the latter application is further judged on the basis of calculating the reliability of one application before and after the two applications, and only when the reliability of the two applications used successively is greater than the product of the minimum support of the latter application and a preset constant, the reliability of the association rule is further ensured.
In an embodiment of the present disclosure, to identify and partition application communities, mining community relationships between applications, and identifying core applications in an application community, the method further includes:
And 106, respectively taking the front application and the rear application of each item in the association rule set R2 as points in the graph, and taking the credibility of the front application and the rear application as the weight of the edge between the two points to construct an application association graph.
And 107, carrying out community division on the application association graph by adopting a community division algorithm to obtain an application community.
According to the method, the application association diagram is processed through a community division algorithm to obtain one or more application communities, the application communities represent community relations among a group of applications, and application shops can recommend applications in the same community to users based on the application communities, so that the intellectualization and user experience of similar application shops are improved.
And 108, calculating the weight sum of the edges of the nodes corresponding to each application in each application community, and determining the core application of the application community by using the application with the maximum weight sum of the edges in each application community.
By determining the core application in the application community, the core application in the application community can be found, key maintenance is carried out on the core application, the availability and stability of the operation and maintenance system can be improved, and the maintenance efficiency and the user experience are improved.
Based on the above embodiments of the present disclosure, the present disclosure improves the existing application association rule mining method, and by calculating the application usage rates in different time periods, the present disclosure automatically sets a support threshold for each application, thereby improving the intelligence and adaptability of the application association rule mining method, and solving the problem that the conventional Apriori algorithm cannot take all conditions into account. In addition, when judging whether the rule is a strong association rule, the method is compared with the minimum support degree of the latter application, so that the credibility is further ensured. The embodiment of the disclosure further takes association rules among mined applications as points and edges in a graph algorithm, obtains an application community graph by using a community partitioning algorithm, and applies the closely-related applications to the same community. The application association rule can predict the behavior of a user accessing the application, a certain application possibly needed is recommended to the user, and the application community graph can actively push the application in the same community to the user using the community application.
The following describes the steps of the mining method for association rules between applications provided in the present disclosure in detail in connection with specific embodiments.
Step 201, acquiring user network flow data by using a flow acquisition technology, and acquiring application use information from the user network flow data, wherein the application use information at least comprises application use information such as a user identification code, use time and the like.
The step first obtains network data by using a flow collector and stores the data in a database. The data is then processed using python, including filtering, grouping, integration, etc.
The network flow data is acquired by utilizing the deep packet detection technology of the flow collector, and the deep packet detection technology not only can acquire data such as source ip, destination ip, source port, destination port, protocol and the like in the flow packet, but also can perform content detection and deep decoding on application layer data. But the data redundancy obtained by deep packet inspection is too much, only the needed application usage information needs to be extracted from it. Such as user identification code/user identification, application name, time of use, traffic, number of packets, etc. These data are stored in a data repository to form a data table for applying the correlation analysis.
Correlation analysis data sheet
| Fields |
Meaning of field |
| user_id |
User identification code |
| apply_name |
Application name |
| log_time |
Service time |
| flow |
Flow rate |
| pack |
Number of packets |
| app_class |
Application class |
And 202, grouping and counting the acquired application use information by using the user identification code in a plurality of time periods of the analysis duration to obtain an application set used by each user in each time period.
An embodiment of the present disclosure sets the analysis duration to one day, dividing 24 hours a day into 7 periods. The time period from zero point to five points is late night time period, six points to eight points are early morning time period, nine points to eleven points are afternoon time period, twelve points and thirteen points are midday time period, fourteen points to seventeen points are afternoon time period, eighteen points and nineteen points are evening time period, and twenty points to twenty three points are night time period.
Time segment division
For some popular applications, such as WeChat, the independence of the applications is relatively high, the applications are often used independently, the flow data of the applications need to be filtered out from the flow data collected by the collector, otherwise, the mined frequent item sets are almost all the popular applications, so that the discovery of other application relations with low use amount is not facilitated, and the mined rule meanings are not great. Therefore, in an embodiment of the present disclosure, by calculating the usage rate of each application, application data with the usage rate greater than a preset threshold is filtered out, for example, application data with a quartile range of 1.5 times is filtered out, where the quartile range is the difference between the third quartile and the first quartile.
And carrying out grouping statistics on the processed data according to different user identification codes in different time periods, merging applications used by users, so as to obtain application sets used by each user in each time period, and carrying out association rule mining by taking the sets as input.
Step 203, calculating the use rate of each application in different time periods, and taking the average value of the use rate of each application in different time periods as the minimum support degree of each application.
The Apriori algorithm uses an iterative approach called a layer-by-layer search, in which a set of k terms is used to explore a set of (k+1) terms. The embodiment of the disclosure only searches 2 frequent item sets, firstly, the collected data sets are scanned, the count of each item is accumulated, the item meeting the minimum support degree is collected, and the set of 1 frequent item sets is found and recorded as L1. Then, find the set L2 of frequent 2 item sets using L1, and finally get the applied association rule R2.
The step is to calculate the usage rate of each application in different time periods based on the obtained application set data used by each user in each time period in the total analysis time period range, and the average value of the usage rate of each application in different time periods is taken as the minimum support degree of each application
Where user_num is the total number of users in the t period, user_app m _n is the number of users in the t period using app m.
Step 204, for each time period, obtaining the support degree of each application in the time period, and screening out the applications with the support degree greater than the minimum support degree in the time period, so as to form a frequent 1 item set L1.
In this step, a candidate set C1 is constructed for each time period, C1 is a 1-dimensional candidate set, the statistics and operation are performed on the number of times of application for each time period to obtain C1, and each element of C1 contains information of an application identifier and an application support degree, for example:
Then, C1 of each time period is scanned, whether each application in the time period is larger than the minimum support degree is judged, and the application which is larger than the minimum support degree in each time period is formed into a frequent 1 item set L1.
Step 205, combining the elements in the L1 in pairs to form a candidate item set C2 for expressing the two application precedence application relations, and calculating the credibility of each item in the C2 to obtain a credibility set L2 for expressing the two application precedence application relations.
In the step, elements in the L1 are combined two by two to form a candidate set C2, and each item of the candidate set C2 comprises application identifiers of two applications and support degree information of the combination of the two applications:
The set (app a,appb) calculates the order relationship between applications when computing the two sets L2, irrespective of the order of applications, for each item in C2, the following two trustworthiness is calculated:
And
The confidence level is calculated for each item in C2 to obtain a set L2.
Each item in L2 contains association rules and trust information between two applications.
Step 206, selecting items from L2, wherein the credibility from the previous application to the next application is larger than the product of the minimum support of the next application and a preset constant, and forming a set R2 for expressing the association regulation between the two applications.
The scan L2 judges whether the credibility of each item is larger than the product of the minimum support of the next application and a preset constant in the items, and the items larger than the product of the minimum support of the next application and the preset constant form a set R2 for expressing the association regulation between the two applications.
Where K is a preset constant, in an embodiment of the disclosure, the value of K is a constant value greater than 1.2, for example, 1.4.
Through the steps, the association rules of the application access by the user in different time periods can be obtained, the behavior of the application access by the user can be predicted according to the association rules, and meanwhile, the application with the association relationship is recommended to the user.
Step 207, respectively taking the front and back applications of each item in the association rule set R2 as points in the graph, and taking the credibility of the front and back applications as the weight of the edge between the two points to construct an application association graph.
This step takes each app a,appb in the association rule set R2 as a point,As the weight of the edges of the point app a to the point app b, an application association graph G is constructed for the purpose of application community division.
And step 208, taking the application association graph G as input of a community division algorithm to obtain different application communities divided based on the application association relationship, wherein the application in the communities is closely related, and the application in the communities is sparsely related.
The modularity is used to measure the quality of a community network division and can be simply understood as the sum of all edge weights inside the community minus the sum of edge weights connected with the community. The definition is as follows:
wherein A ij represents the weight of the edge between node i and node j, k i=∑jAij represents the sum (degree) of the weights of all the edges connected to node i, c i represents the community to which node i belongs; Representing the sum of the weights of all edges.
The Louvain algorithm is a modularity-based community discovery algorithm which is good in efficiency and effect, and can discover hierarchical community structures, and the optimization goal of the Louvain algorithm is to maximize the modularity of the whole community network. An embodiment of the disclosure utilizes a Louvain algorithm to achieve community partitioning between applications.
The community division flow between applications is realized by using Louvain algorithm as follows:
1) Each node in the graph is regarded as an independent community, and the number of communities is the same as the number of nodes;
2) For each node i, sequentially attempting to distribute the node i to communities where each neighbor node is located, calculating module degree change delta Q before and after distribution, recording the neighbor node with the largest delta Q, if max delta Q is more than 0, distributing the node i to communities where the neighbor node with the largest delta Q is located, otherwise, keeping unchanged;
3) Repeating the step 2) until communities to which all nodes belong are not changed;
4) Compressing the graph, compressing all nodes in the same community into a new node, converting the weight of edges between the nodes in the community into the weight of a new node ring, converting the weight of edges between the communities into the weight of edges between the new nodes, wherein the new node ring represents that after the graph is compressed, the points in the same community are taken as the new nodes, and the weight at the moment is the sum of the weights of the points in the same community.
5) Repeating 1) -4) until the modularity of the entire graph is no longer changed.
6) The node i with the largest k i (degrees) in each community is denoted as the community core, wherein k i=∑jAij represents the sum (degrees) of the weights of all the edges connected to the node i, and A ij represents the weight of the edge between the node i and the node j. The application community partitioning results are then returned as follows:
{ community 1 [ (community core, application a), (community member, application b), ];
community 2 [ (community core, application m), (community member, application n) ];
......}
the application community division result is obtained, key operation and maintenance guarantee can be carried out on core applications in the application communities, and the applications in the same community can be recommended to users using the community applications.
Fig. 2 is a schematic structural diagram of an excavating device applying association rules according to an embodiment of the present disclosure, where each functional module in the excavating device may be implemented in a software module manner or in a hardware unit manner. The functions of the modules of the device have corresponding relations with the steps in the mining method of the association rule between the applications provided by the implementation of the present disclosure. Each module in the apparatus may be executed on one hardware device, or one or more steps or module functions in the method provided in the disclosure may be respectively completed by different hardware devices. The device 200 comprises a grouping statistics module 201, a minimum support calculation module 202, an application screening module 203, a credibility calculation module 204 and an association rule determination module 205.
The grouping statistics module 201 is configured to perform grouping statistics with application usage information corresponding to a user identification code in a plurality of time periods of analysis duration, so as to obtain an application set used by each user in each time period;
A minimum support calculating module 202, configured to calculate a usage rate of each application in different time periods, and take an average value of the usage rates of each application in different time periods as a minimum support of each application;
The application screening module 203 is configured to calculate, for each time period, a support degree of each application in the time period, and screen out an application that is greater than a minimum support degree of the application in the time period, so as to form an item set L1;
the credibility calculation module 204 is configured to combine the items in L1 two by two to form a candidate item set C2 expressing two application precedence-order relationships, calculate the credibility of each item in C2, and obtain a credibility set L2 expressing two application precedence-order relationships;
The association rule determining module 205 is configured to screen out items from the L2, where the reliability from the previous application to the next application is greater than the product of the minimum support of the next application and a preset constant, to obtain a set R2 of association rules between expression applications.
To implement application community partitioning, in an embodiment of the present disclosure, the apparatus further includes:
the community division module 206 is configured to respectively use the front and rear applications of each item in the set R2 of association rules as points in the graph, and use the credibility of the front and rear applications as the weight of the edge between the two points to construct an application association graph;
in order to implement core application identification, in an embodiment of the present disclosure, the apparatus further includes a core application identification module 207, configured to calculate a sum of weights of edges of nodes corresponding to each application in each application community, and determine a core application of each application community according to the sum of weights of edges in each application community and a maximum application.
In order to filter out applications that are frequently used independently, in an embodiment of the present disclosure, the minimum support calculation module 202 is further configured to filter out application data with a usage rate greater than a preset threshold before calculating the minimum support of each application.
In another embodiment of the present disclosure, a storage medium is further provided, where the storage medium is located in a device having a processor and a bus structure, where the storage medium may be a volatile storage medium or a non-volatile storage medium, where a computer program is stored in the storage medium, where the computer program in the storage medium, when being read and executed by the processor, may be used to complete the step functions of the mining method for association rules between applications provided in the embodiments of the present disclosure.
The foregoing is merely exemplary of the present disclosure and is not intended to limit the present disclosure. Various modifications and variations of this disclosure will be apparent to those skilled in the art. Any modifications, equivalent substitutions, improvements, or the like, which are within the spirit and principles of the present disclosure, are intended to be included within the scope of the claims of the present disclosure.