
CN108446375A - A kind of multi-scale coupling rule and method based on Spark platforms - Google Patents


Info

Publication number
CN108446375A
CN108446375A (application number CN201810218838.6A)
Authority
CN
China
Prior art keywords
scale
data set
spark platform
association rule
partition
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201810218838.6A
Other languages
Chinese (zh)
Inventor
王灵矫
赵博文
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Xiangtan University
Original Assignee
Xiangtan University
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Xiangtan University filed Critical Xiangtan University
Priority to CN201810218838.6A priority Critical patent/CN108446375A/en
Publication of CN108446375A publication Critical patent/CN108446375A/en
Legal status: Pending

Links

Classifications

    • G — PHYSICS
    • G06 — COMPUTING OR CALCULATING; COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 — Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 — Details of database functions independent of the retrieved data types
    • G06F16/903 — Querying
    • G06F16/90335 — Query processing
    • G — PHYSICS
    • G06 — COMPUTING OR CALCULATING; COUNTING
    • G06F — ELECTRIC DIGITAL DATA PROCESSING
    • G06F17/00 — Digital computing or data processing equipment or methods, specially adapted for specific functions
    • G06F17/10 — Complex mathematical operations
    • G06F17/16 — Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization

Landscapes

  • Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Theoretical Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Mathematical Physics (AREA)
  • Databases & Information Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Pure & Applied Mathematics (AREA)
  • Mathematical Optimization (AREA)
  • Mathematical Analysis (AREA)
  • General Engineering & Computer Science (AREA)
  • Computational Mathematics (AREA)
  • Algebra (AREA)
  • Computing Systems (AREA)
  • Software Systems (AREA)
  • Computational Linguistics (AREA)
  • Information Retrieval, Db Structures And Fs Structures Therefor (AREA)

Abstract

The present invention provides a multi-scale association rule method based on the Spark platform. The method comprises: building an HDFS file system and a Spark platform with virtual machines on a physical server, and uploading a data set to the HDFS file system; the Spark platform reads the data from the HDFS file system, converts it into a resilient distributed dataset (RDD), and stores it in memory; multiple benchmark scales are selected to partition the data set for parallel computation on the Spark platform, yielding the frequent itemsets of the benchmark-scale data sets; the association rules of the target data set are then mined through a scale-conversion mechanism, and the accuracy of the algorithm is obtained. The invention combines a traditional association rule algorithm with the scale-conversion mechanism: the benchmark-scale data sets need to be mined only once to obtain the knowledge of the target-scale data set, which greatly improves the accuracy and efficiency of computation on multi-scale data sets, while the implementation on the Spark platform further increases data processing speed.

Description

A multi-scale association rule method based on the Spark platform

Technical Field

The present invention relates to the technical fields of data mining and data processing, and in particular to a multi-scale association rule method based on the Spark platform.

Background Art

The era of big data has arrived. Data has permeated individuals and organizations in every field, recording the entire course of their life cycles, and has become an indispensable factor of production. Faced with the massive data brought by the information explosion, scientific research, business, and government institutions all regard data mining as an essential analysis tool, and data mining research has received unprecedented attention. Data mining aims to discover, in large volumes of heterogeneous data, essential similarities in the nature and consistency in the behavior of the objects under study, and to distill rules and knowledge that can inform decision-making.

Association rule mining is a widely applied and highly practical research direction in data mining, aiming to discover frequent and interesting associations and correlations between data items. Because of its broad commercial applications, research interest in association rule mining has never waned. Current research is becoming increasingly specific: researchers tackle newly emerging problems and particular application domains, trying to solve problems at a more practical level. Today, simple "beer and diapers"-style rule mining can no longer satisfy the information needs of decision makers; multi-level, multi-angle analysis of association patterns is the key to solving real problems.

Multi-scale science is a newly emerging discipline. Because it describes the structural and hierarchical nature of the objects under study, it has triggered a wave of interdisciplinary research in mathematics, physics, biology, chemistry, geoscience, and other fields. Interdisciplinary research combined with multi-scale science is now the general trend, and data mining follows it: combining multi-scale science with data mining techniques, in both theory and method, to analyze the multi-level, multi-scale connotations of mining results can elevate ordinary mining results into multi-scale knowledge, which facilitates multi-scale decision-making in practice. Moreover, the multi-scale idea of processing and analyzing objects by level and by scale coincides with the idea of parallel computing, so studying multi-scale data mining methods helps handle practical problems efficiently in a big-data environment.

Spark was born in 2009 as a research project at UC Berkeley's AMPLab. It is developed in Scala, a language that combines object-oriented and functional programming, and its core originally consisted of 63 Scala files. The project was open-sourced in June 2013 and became an Apache Foundation project; in February 2014 it became a top-level project of the Apache Software Foundation. To date, developers from more than 200 companies have contributed to Spark, with more than 800 developers participating, making it one of the most active open-source projects in the big data field. It has been deeply adopted in production by many well-known enterprises at home and abroad, and Spark clusters have exceeded 1,000 nodes. In little more than two years, with strong support from enterprises and developers, Spark released nearly 15 versions.

Apache Spark, one of the most popular distributed computing frameworks today, is based on in-memory and parallel computation and is well suited to big-data mining and machine learning. In terms of speed, Spark computes in memory, whereas Hadoop writes intermediate results to the HDFS file system and must read and write HDFS at every step; Spark can therefore run up to 100 times faster than Hadoop in memory and about 10 times faster on disk. Spark is thus better suited to complex algorithms such as iterative computation and graph computation. Moreover, Spark supports many operations on data sets, such as map, filter, and flatMap, whereas MapReduce supports only the two operations map and reduce.
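The contrast between Spark's operator set and MapReduce can be made concrete with ordinary Python comprehensions standing in for the RDD operators (an illustrative analogy only; no Spark is involved):

```python
data = [1, 2, 3, 4]

# Analogue of RDD.map: apply a function to every element.
mapped = [x * 2 for x in data]

# Analogue of RDD.filter: keep only elements satisfying a predicate.
filtered = [x for x in data if x % 2 == 0]

# Analogue of RDD.flatMap: each input element may expand to several outputs.
flat = [y for x in data for y in (x, -x)]
```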

In short, implementing data mining algorithms combined with multi-scale science on the Spark platform both improves efficiency and makes effective use of resources.

Summary of the Invention

The purpose of the present invention is to provide a multi-scale association rule method based on the Spark platform, which combines multi-scale domain knowledge with association rule algorithms and implements them on the Spark platform. Against the backdrop of an era that must process massive data, the execution efficiency and accuracy of the invention are greatly improved compared with traditional association rule algorithms.

The present invention is realized through the following steps.

A multi-scale association rule method based on the Spark platform comprises:

Step 1: build an HDFS file system and a Spark platform with virtual machines on a physical server, and upload the data set to the HDFS file system.

Step 2: submit a job to the Spark platform through a client; the Spark platform reads the data from the HDFS file system, converts it into a resilient distributed dataset (RDD), and stores it in memory.

Step 3: select multiple benchmark scales BS and determine the data set ds_BS of each benchmark scale; since each RDD corresponds to multiple partitions, one benchmark-scale data set is run in each partition, and the partitions compute in parallel.

Step 4: in each partition, mine each benchmark-scale data set with minimum support min_sup to obtain the set FI_i of frequent itemsets of each benchmark-scale data set, and take the union FI_C of these frequent-itemset sets as the candidate itemset collection for the frequent itemsets of the target-scale data set ds_SO.

Step 5: determine, by the kriging method, the weight matrix λ of the benchmark-scale data sets with respect to the target-scale data set, which is used to estimate the support of frequent itemsets in the target-scale data set.

Step 6: screen the final frequent itemsets of the target-scale data set, generate association rules, compute the algorithm accuracy of each partition, and take the mean accuracy as the overall experimental result.

In step 3, the number of partitions in the RDD is set according to the number of benchmark scales selected by the user, and a corresponding number of concurrent threads is started to read the data.

In step 4, the multi-scale association rule algorithm performs correlation mining only on the benchmark-scale data sets: it obtains their sets of frequent itemsets and then derives the frequent itemsets of the target-scale data set, performing a multi-scale conversion of frequent itemsets.

In step 5, a linear estimator is first defined by the kriging method, and the kriging coefficients λ in the estimator are then computed; the support estimates in the target-scale data set and the supports in the benchmark-scale data sets correspond, respectively, to the value to be estimated and the sample-point data of the defined linear equation.

In step 6, the estimated support of every candidate itemset is compared with the minimum support min_sup; the frequent itemsets whose estimated support is not less than min_sup form the final frequent-itemset set FI of the target-scale data set, and association rules are generated according to the minimum confidence min_conf.

Compared with the prior art, the present invention has the following advantages.

The present invention combines the scale-conversion mechanism of the multi-scale field with the association rule algorithms of data mining, and realizes the multi-scale conversion of knowledge through a multi-scale data mining framework and concrete multi-scale association rule mining. From an algorithmic perspective, when the algorithm is applied to data sets with multi-scale characteristics, both accuracy and running speed are greatly improved, which is of considerable practical significance.

The present invention also runs the multi-scale association rule algorithm on the Spark platform; against the backdrop of massive data being generated at every moment, the parallelized mode based on the Spark platform further improves data processing efficiency.

Brief Description of the Drawings

Fig. 1 is a flowchart of the method of the present invention;

Fig. 2 is a flowchart of the implementation on the Spark platform.

Detailed Description of Embodiments

In order to make the objects, technical solutions, and advantages of the present invention clearer, the present invention is described in further detail below with reference to the accompanying drawings and embodiments. It should be understood that the specific embodiments described here serve only to explain the present invention, not to limit it.

With reference to Fig. 1 and Fig. 2, a multi-scale association rule method based on the Spark platform comprises the following steps.

Step 1: build an HDFS file system and a Spark platform with virtual machines on a physical server, and upload the data set to the HDFS file system.

Step 2: submit a job to the Spark platform through a client; the Spark platform reads the data from the HDFS file system, converts it into a resilient distributed dataset (RDD), and stores it in memory.

Step 3: select multiple benchmark scales BS, determine the data set ds_BS of each benchmark scale, and call SparkContext's parallelize method to parallelize the data set into a distributed RDD. Each RDD consists of many partitions, so one benchmark-scale data set is run in each partition, and the partitions compute in parallel.

This process sets the number of partitions in the RDD according to the number of benchmark scales selected by the user and starts a corresponding number of concurrent threads to read the data.

Step 4: in each partition, mine each benchmark-scale data set with minimum support min_sup to obtain the set FI_i of frequent itemsets of each benchmark-scale data set, and take the union FI_C of these sets as the candidate itemset collection for the frequent itemsets of the target-scale data set ds_SO; this candidate collection reflects, to the greatest possible extent, the frequent itemsets implicit in the target-scale data set.

Put simply, this process mines only the benchmark-scale data sets, obtains their sets of frequent itemsets, and then derives the frequent itemsets of the target-scale data set, performing a multi-scale conversion of frequent itemsets.
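Steps 3–4 can be sketched in plain Python, with a thread pool standing in for Spark partitions; the brute-force miner and the toy transaction data are illustrative assumptions, not the patent's implementation:

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from itertools import combinations

def frequent_itemsets(transactions, min_sup):
    """Return every itemset whose relative support is >= min_sup (brute force)."""
    n = len(transactions)
    counts = Counter()
    for t in transactions:
        items = sorted(t)
        for k in range(1, len(items) + 1):
            for combo in combinations(items, k):
                counts[frozenset(combo)] += 1
    return {iset for iset, c in counts.items() if c / n >= min_sup}

# One toy "benchmark-scale" transaction set per partition.
partitions = [
    [{"beer", "diaper"}, {"beer", "diaper", "milk"}, {"milk"}],
    [{"beer", "diaper"}, {"diaper"}, {"beer", "diaper"}],
]

min_sup = 0.5
# Mine each benchmark-scale data set concurrently, one worker per "partition".
with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
    fi_sets = list(pool.map(lambda ds: frequent_itemsets(ds, min_sup), partitions))

# Union of the per-scale frequent itemsets = candidate collection FI_C.
fi_c = set().union(*fi_sets)
```

In the real method each worker would be a Spark partition operating on an RDD slice, but the mine-then-union structure is the same.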

Step 5: determine, by the kriging method, the weight matrix λ of the benchmark-scale data sets with respect to the target-scale data set, used to estimate the support of frequent itemsets in the target-scale data set.

Step S500: the kriging method defines a linear estimator:

Z*(x0) = λ1·Z(x1) + λ2·Z(x2) + … + λn·Z(xn)

where Z*(x0) is the value to be estimated, Z(xi) are the sample-point data, and λ is the vector of kriging coefficients, expressed as:

λ = K⁻¹D

Step S501: determine the similarity matrix K between the benchmark-scale data sets; its elements M_ij are computed with the Jaccard similarity coefficient:

M_ij = |FI_i ∩ FI_j| / |FI_i ∪ FI_j|

where FI_i and FI_j denote the sets of frequent itemsets of different benchmark-scale data sets.
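The similarity matrix of step S501 can be computed directly from the per-scale frequent-itemset collections; the two collections below are hypothetical examples:

```python
def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two frequent-itemset collections."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

# Frequent-itemset collections FI_i of two hypothetical benchmark scales.
fi = [
    {frozenset({"beer"}), frozenset({"diaper"}), frozenset({"beer", "diaper"})},
    {frozenset({"beer"}), frozenset({"diaper"})},
]

# Similarity matrix K with elements M_ij = Jaccard(FI_i, FI_j).
K = [[jaccard(fi_i, fi_j) for fi_j in fi] for fi_i in fi]
```

K is symmetric with ones on the diagonal, since every collection is identical to itself.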

Step S502: determine the matrix D. In scale upscaling (benchmark scale BS ⊆ target scale SO), the elements of D are the proportion, by count, of the benchmark-scale data set within the upper-scale data set; in scale downscaling (target scale SO ⊆ benchmark scale BS), the elements of D are the Jaccard similarity coefficients of the two.

Step S503: apply the linear estimator defined by the kriging method to obtain the support estimates of the target-scale data set. With the benchmark-scale supports as the sample-point data, the estimate for a candidate itemset X is

sup*(X) = λ1·sup_1(X) + … + λn·sup_n(X)

where sup_i(X) is the support of X in the i-th benchmark-scale data set. All candidate itemsets of the target-scale data set from step 4 are processed in this way.
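For two benchmark scales, the weight computation λ = K⁻¹D and the resulting support estimate can be sketched as follows; all matrix entries and support values are hypothetical, and a hand-written 2×2 inverse avoids external dependencies:

```python
def inv2(m):
    """Inverse of a 2x2 matrix [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    det = a * d - b * c
    return [[d / det, -b / det], [-c / det, a / det]]

K = [[1.0, 0.5], [0.5, 1.0]]  # similarity matrix between the two benchmark scales
D = [0.8, 0.6]                # relation of each benchmark scale to the target scale

# Kriging coefficients: lambda = K^-1 * D.
Kinv = inv2(K)
lam = [sum(Kinv[i][j] * D[j] for j in range(2)) for i in range(2)]

# Support of one candidate itemset in each benchmark-scale data set.
sup_bs = [0.4, 0.6]

# Estimated support in the target-scale data set: sup* = sum_i lam_i * sup_i.
sup_est = sum(l * s for l, s in zip(lam, sup_bs))
```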

Step 6: screen the final frequent itemsets of the target-scale data set, generate association rules, and compute the algorithm accuracy and execution time of each partition; the means of the accuracy and running time are then taken as the overall experimental result.

The estimated supports of all candidate itemsets are compared with the minimum support min_sup; the frequent itemsets whose estimated support is not less than min_sup form the final frequent-itemset set FI of the target-scale data set, and association rules are generated according to the configured minimum confidence min_conf. The algorithm accuracy under each partition's data set is obtained through an Action operator on the Spark platform, and the mean value is then taken as the final result.

Step S600: compute the confidence of each generated association rule:

conf(A ⇒ B) = sup(A ∪ B) / sup(A)

Given the minimum confidence threshold min_conf, the confidence computed for each association rule is compared with min_conf, and only the rules whose confidence is greater than min_conf are retained.
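The confidence filter of step S600 amounts to a support lookup and a division; the support values below are illustrative stand-ins for the estimates produced in step S503:

```python
def confidence(sup, antecedent, consequent):
    """conf(A => B) = sup(A ∪ B) / sup(A), with supports looked up in `sup`."""
    return sup[antecedent | consequent] / sup[antecedent]

# Estimated supports of the final frequent itemsets (illustrative values).
sup = {
    frozenset({"beer"}): 0.6,
    frozenset({"diaper"}): 0.8,
    frozenset({"beer", "diaper"}): 0.5,
}

min_conf = 0.7
rules = []
for a, b in [({"beer"}, {"diaper"}), ({"diaper"}, {"beer"})]:
    c = confidence(sup, frozenset(a), frozenset(b))
    if c > min_conf:  # keep only rules above the confidence threshold
        rules.append((tuple(sorted(a)), tuple(sorted(b)), c))
```

With these numbers, beer ⇒ diaper (confidence 5/6) survives the threshold while diaper ⇒ beer (confidence 0.625) is discarded.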

Step S601: compute the final accuracy of the algorithm, where FI_m is the set of frequent itemsets mined by the algorithm, FI_o is the set of true frequent itemsets contained in the target-scale data set, fp and fn are the falsely judged frequent itemsets, and n is the number of partitions in the RDD (i.e., the number of benchmark scales that were set).

Claims (5)

1. A Spark platform-based multi-scale association rule method is characterized by comprising the following steps:
step 1: constructing an HDFS file system with a virtual machine and a Spark platform on a physical server, and uploading a data set to the HDFS file system;
step 2: submitting a job to a Spark platform through a client, reading data from an HDFS file system by the Spark platform, converting the data into an elastic distributed data set RDD, and storing the RDD in a memory;
step 3: selecting a plurality of benchmark scales BS and determining the data set ds_BS of each benchmark scale, wherein each RDD corresponds to a plurality of partitions, so that one benchmark-scale data set is operated in each partition and the partitions are operated in parallel;
step 4: mining each benchmark-scale data set with the minimum support min_sup in each partition to obtain the set FI_i of frequent itemsets of each benchmark-scale data set, and finding the union FI_C of the frequent-itemset sets as the candidate itemset collection of the frequent itemsets of the target-scale data set ds_SO;
step 5: determining a weight matrix λ of the benchmark-scale data sets with respect to the target-scale data set by the kriging method, for estimating the support of frequent itemsets in the target-scale data set;
step 6: screening the final frequent itemsets of the target-scale data set of each partition, generating association rules, calculating the algorithm accuracy of each partition, and taking the mean value of the accuracy as the final result.
2. The Spark-platform-based multi-scale association rule method according to claim 1, wherein in step 3, the number of partitions in the RDD is set according to the number of benchmark scales selected by the user, and a corresponding number of concurrent threads is started to read data.
3. The Spark-platform-based multi-scale association rule method according to claim 1, wherein in step 4, for the multi-scale association rule algorithm, correlation mining is performed only on the benchmark-scale data sets, the sets of frequent itemsets of the benchmark-scale data sets are obtained, the frequent itemsets of the target-scale data set are then deduced, and multi-scale conversion of the frequent itemsets is performed.
4. The Spark platform based multi-scale association rule method according to claim 1, wherein the specific process of step 5 is as follows:
step S500: the kriging method defines a linear estimator:

Z*(x0) = λ1·Z(x1) + λ2·Z(x2) + … + λn·Z(xn)

wherein Z*(x0) is the value to be estimated, Z(xi) are the sample-point data, and λ is the kriging coefficient, expressed as:

λ = K⁻¹D
step S501: determining a similarity matrix K between the benchmark-scale data sets, the matrix elements M_ij being calculated through the Jaccard similarity coefficient:

M_ij = |FI_i ∩ FI_j| / |FI_i ∪ FI_j|
step S502: determining a matrix D, wherein in scale upscaling, the elements of D are the proportion, by count, of the benchmark-scale data set within the upper-scale data set; in scale downscaling, the elements of D are the Jaccard similarity coefficients of the two;
step S503: applying the linear estimator defined by the kriging method to obtain the support estimates of the target-scale data set; all candidate itemsets of the target-scale data set in step 4 are processed in this manner.
5. The Spark-platform-based multi-scale association rule method according to claim 1, wherein in step 6, the algorithm accuracy under each partition's data set is obtained through an Action operator on the Spark platform, and the mean value thereof is then taken as the final result:
step S600: calculating a confidence for each generated association rule according to:

conf(A ⇒ B) = sup(A ∪ B) / sup(A)

given a minimum confidence threshold min_conf, the confidence calculated for each association rule is compared with min_conf, and only the association rules whose confidence is greater than min_conf are retained;
step S601: obtaining the final accuracy of the algorithm, wherein FI_m is the set of frequent itemsets mined by the algorithm, FI_o is the set of true frequent itemsets contained in the target-scale data set, fp and fn are wrongly judged frequent itemsets, and n is the number of partitions in the RDD (i.e., the number of benchmark scales set).
CN201810218838.6A 2018-03-16 2018-03-16 A kind of multi-scale coupling rule and method based on Spark platforms Pending CN108446375A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201810218838.6A CN108446375A (en) 2018-03-16 2018-03-16 A kind of multi-scale coupling rule and method based on Spark platforms

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201810218838.6A CN108446375A (en) 2018-03-16 2018-03-16 A kind of multi-scale coupling rule and method based on Spark platforms

Publications (1)

Publication Number Publication Date
CN108446375A true CN108446375A (en) 2018-08-24

Family

ID=63195522

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201810218838.6A Pending CN108446375A (en) 2018-03-16 2018-03-16 A kind of multi-scale coupling rule and method based on Spark platforms

Country Status (1)

Country Link
CN (1) CN108446375A (en)

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN109857997A (en) * 2019-02-02 2019-06-07 杭州费尔斯通科技有限公司 A kind of offline table correlating method
CN114238432A (en) * 2021-11-22 2022-03-25 国网浙江省电力有限公司营销服务中心 An auxiliary decision-making method and system for power marketing based on association rule mining

Citations (6)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20080209568A1 (en) * 2007-02-26 2008-08-28 International Business Machines Corporation Preserving privacy of data streams using dynamic correlations
CN101308544A (en) * 2008-07-11 2008-11-19 中国科学院地理科学与资源研究所 A Grid-Based Spatial Heterogeneous Pattern Recognition Method and Hierarchical Method
US20150066646A1 (en) * 2013-08-27 2015-03-05 Yahoo! Inc. Spark satellite clusters to hadoop data stores
CN105740424A (en) * 2016-01-29 2016-07-06 湖南大学 Spark platform based high efficiency text classification method
CN106294715A (en) * 2016-08-09 2017-01-04 中国地质大学(武汉) A kind of association rule mining method based on attribute reduction and device
CN107229751A (en) * 2017-06-28 2017-10-03 济南大学 A kind of concurrent incremental formula association rule mining method towards stream data


Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
RUILIN LIU ET AL.: "Spark-based Rare Assciation Rule Mining for Big Datasets", 《2016 IEEE INTERNATIONAL CONFERENCE ON BIG DATA》 *
柳萌萌 等: "多尺度数据挖掘方法", 《软件学报》 *



Legal Events

Code | Description
PB01 | Publication
SE01 | Entry into force of request for substantive examination
WD01 | Invention patent application deemed withdrawn after publication (application publication date: 20180824)