CN118643444A - Big data anomaly detection method, device, equipment, storage medium and product

Info

Publication number: CN118643444A
Application number: CN202410850922.5A
Authority: CN
Inventors: 秦月; 张燮阳; 王丽; 尼宏伟; 宋培
Original assignee: China Mobile Communications Group Co Ltd; China Mobile Information Technology Co Ltd
Current assignee: China Mobile Communications Group Co Ltd; China Mobile Information Technology Co Ltd
Priority date: 2024-06-27
Filing date: 2024-06-27
Publication date: 2024-09-13

Abstract

The invention discloses a big data anomaly detection method, device, equipment, storage medium and computer program product. The method improves the original isolation forest algorithm: first, a plurality of isolation trees are created according to the business data to be processed; next, the fitness value of each of the isolation trees is calculated; then, with the fitness value of each isolation tree as a screening index, target isolation trees are selected from the plurality of isolation trees based on a tabu search algorithm to form an isolation forest; finally, anomaly detection is performed on the business data to be processed according to the isolation forest to obtain an anomaly detection result. Because the fitness value is used as the screening index and the tabu search algorithm removes isolation trees that have lower fitness values or are redundant, better target isolation trees are retained, which reduces the space occupied by the isolation forest, lowers the computational cost of anomaly detection, and improves anomaly detection efficiency.

Description

Big data anomaly detection method, device, equipment, storage medium and product

Technical Field

The present invention relates to the field of big data processing technology, and in particular to a big data anomaly detection method, device, equipment, storage medium and computer program product.

Background Art

In the business production processes of traditional enterprises, data is mostly entered manually by staff, so the correctness of the data in the source database cannot be guaranteed, and abnormal data such as garbled characters, mismatched data units and missing values is common. If abnormal data is not removed during data conversion, it increases the load on the system, and for complex data conversion operations it can even cause the conversion task to crash. There are currently two main kinds of existing technical solutions: distance-based methods, such as the K-Means algorithm, and density-based methods, such as the local outlier factor (LOF) algorithm. The distance-based K-Means algorithm is simple and efficient, but it is sensitive to noise and outliers. The density-based LOF algorithm identifies abnormal data by calculating an outlier factor for each sample in the data set. It works well on medium- and high-dimensional data sets, but because it must calculate the distances between samples, its computational overhead is too high and it performs poorly on large data sets.

To address this, the isolation forest algorithm has been proposed. The original isolation forest algorithm performs well in anomaly detection, but its detection accuracy depends on the number of isolation trees. For anomaly detection on large volumes of data, a large number of isolation trees often has to be built, which consumes a lot of memory, incurs high computational overhead, and lowers anomaly detection efficiency.

Summary of the Invention

The main purpose of the present invention is to provide a big data anomaly detection method, device, equipment, storage medium and computer program product, aiming to solve the technical problems in the prior art that building a large number of isolation trees consumes a large amount of memory, incurs high computational overhead and results in low anomaly detection efficiency.

To achieve the above object, the present invention provides a big data anomaly detection method, the method comprising:

creating a plurality of isolation trees according to the business data to be processed;

calculating the fitness value of each of the plurality of isolation trees;

using the fitness value of each isolation tree as a screening index, selecting target isolation trees from the plurality of isolation trees based on a tabu search algorithm to form an isolation forest; and

performing anomaly detection on the business data to be processed according to the isolation forest to obtain an anomaly detection result.

Optionally, calculating the fitness value of each of the plurality of isolation trees comprises:

calculating, for each of the plurality of isolation trees, the accuracy value with which the tree correctly classifies samples, and calculating the degree of difference between the samples correctly detected by any two of the plurality of isolation trees; and

calculating the fitness value of each isolation tree according to the accuracy value and the degree of difference.

Optionally, calculating the fitness value of each isolation tree according to the accuracy value and the degree of difference comprises:

calculating the sum of the product of the accuracy value and a first weight and the product of the degree of difference and a second weight; and

taking the reciprocal of that sum as the fitness value of each isolation tree.
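As an illustration of the optional fitness calculation above, the following minimal Python sketch computes the reciprocal of the weighted sum. The function name `fitness_values`, the default weights of 0.5, and the assumption that the pairwise degree of difference has already been aggregated into one value per tree are illustrative choices that are not specified in the text.

```python
from typing import List, Sequence


def fitness_values(
    accuracy: Sequence[float],
    difference: Sequence[float],
    w1: float = 0.5,  # first weight (assumed value)
    w2: float = 0.5,  # second weight (assumed value)
) -> List[float]:
    """Return 1 / (w1 * accuracy + w2 * difference) for each isolation tree."""
    if len(accuracy) != len(difference):
        raise ValueError("accuracy and difference must have the same length")
    result = []
    for acc, diff in zip(accuracy, difference):
        weighted_sum = w1 * acc + w2 * diff
        if weighted_sum <= 0:
            raise ValueError("weighted sum must be positive to take its reciprocal")
        result.append(1.0 / weighted_sum)
    return result
```

For a tree with accuracy 0.8 and degree of difference 0.2, the weighted sum under these assumed weights is 0.5, so the fitness value is 2.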

Optionally, calculating, for each of the plurality of isolation trees, the accuracy value with which the tree correctly classifies samples comprises:

calculating, by a statistical method, the accuracy value with which each of the plurality of isolation trees correctly classifies samples.

Optionally, calculating the degree of difference between the samples correctly detected by any two of the plurality of isolation trees comprises:

calculating, by cross-validation, the degree of difference between the samples correctly detected by any two of the plurality of isolation trees.

Optionally, using the fitness value of each isolation tree as a screening index and selecting target isolation trees from the plurality of isolation trees based on a tabu search algorithm to form an isolation forest comprises:

selecting, from among the fitness values of the plurality of isolation trees, an arbitrary isolation tree as an initial solution;

selecting, in the neighborhood of the initial solution, a candidate set that meets a preset tabu requirement, the preset tabu requirement being that the fitness value of each isolation tree in the candidate set is greater than the fitness value of the isolation tree corresponding to the initial solution;

selecting the isolation tree with the largest fitness value from the candidate set and updating the tabu list with it;

using the isolation tree with the largest fitness value as the new solution, searching for a candidate set again and updating the tabu list again based on that candidate set, until a preset termination condition is met; and

when the preset termination condition is reached, using the isolation trees in the tabu list as the target isolation trees to form the isolation forest.
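The selection steps above can be sketched roughly as follows. This is only one possible reading: the text does not define the neighborhood or the termination condition, so the sketch assumes that the neighborhood of a tree is the set of trees whose index lies within `radius` of it, and that the search stops when the candidate set is empty or after `max_iter` iterations. The names `select_target_trees`, `radius` and `max_iter` are hypothetical.

```python
from typing import List, Sequence


def select_target_trees(
    fitness: Sequence[float],
    start: int = 0,       # index of the arbitrary initial solution
    radius: int = 2,      # assumed neighborhood: indices within +/- radius
    max_iter: int = 100,  # assumed termination condition
) -> List[int]:
    """Return the indices of the trees recorded in the tabu list."""
    tabu_list = [start]
    current = start
    for _ in range(max_iter):
        lo = max(0, current - radius)
        hi = min(len(fitness), current + radius + 1)
        # Candidate set: neighbors not yet in the tabu list whose fitness
        # is greater than that of the current solution.
        candidates = [
            i for i in range(lo, hi)
            if i not in tabu_list and fitness[i] > fitness[current]
        ]
        if not candidates:
            break
        best = max(candidates, key=lambda i: fitness[i])
        tabu_list.append(best)  # update the tabu list with the best candidate
        current = best          # and search again from it
    return tabu_list
```

With fitness values `[1.0, 2.0, 1.5, 3.0, 0.5]` and the defaults above, the sketch visits trees 0, 1 and 3 and returns `[0, 1, 3]` as the selected forest.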

Optionally, performing anomaly detection on the business data to be processed according to the isolation forest to obtain an anomaly detection result comprises:

calculating the expected value of the path length of each item of the business data to be processed in the target isolation trees of the isolation forest, and calculating the average value of the path length of each item of the business data to be processed in each of the target isolation trees;

calculating the anomaly score of each item of business data in the target isolation trees according to the expected value and the average value; and

determining the anomaly detection result according to the anomaly score, in the target isolation trees, of each item of business data in the business data to be processed.

In addition, to achieve the above object, the present invention also provides a big data anomaly detection device, the big data anomaly detection device comprising:

an isolation tree construction module, configured to create a plurality of isolation trees according to the business data to be processed;

a processing module, configured to calculate the fitness value of each of the plurality of isolation trees;

a tabu search module, configured to use the fitness value of each isolation tree as a screening index and to select target isolation trees from the plurality of isolation trees based on a tabu search algorithm to form an isolation forest; and

an anomaly detection module, configured to perform anomaly detection on the business data to be processed according to the isolation forest to obtain an anomaly detection result.

In addition, to achieve the above object, the present invention also provides an electronic device, the electronic device comprising: a memory, a processor, and a big data anomaly detection program stored in the memory and executable on the processor, the big data anomaly detection program being configured to implement the steps of the big data anomaly detection method described above.

In addition, to achieve the above object, the present invention also provides a storage medium on which a big data anomaly detection program is stored, the big data anomaly detection program implementing the steps of the big data anomaly detection method described above when executed by a processor.

In addition, to achieve the above object, the present invention also provides a computer program product, the computer program product comprising a big data anomaly detection program, the big data anomaly detection program implementing the steps of the big data anomaly detection method described above when executed by a processor.

The present invention provides a big data anomaly detection method that improves the original isolation forest algorithm. First, a plurality of isolation trees are created according to the business data to be processed; next, the fitness value of each of the isolation trees is calculated; then, with the fitness value of each isolation tree as a screening index, target isolation trees are selected from the plurality of isolation trees based on a tabu search algorithm to form an isolation forest; finally, anomaly detection is performed on the business data to be processed according to the isolation forest to obtain an anomaly detection result. In this embodiment, the isolation trees created from the business data to be processed are not used directly for anomaly detection. Instead, with the fitness value of each isolation tree as a screening index, the tabu search algorithm removes the isolation trees that have lower fitness values or are more redundant and retains the better target isolation trees. This not only reduces the space occupied by the isolation forest, but also lowers the computational overhead of anomaly detection and improves its efficiency.

Brief Description of the Drawings

FIG. 1 is a schematic structural diagram of an electronic device in the hardware operating environment involved in embodiments of the present invention;

FIG. 2 is a schematic diagram of the relationship between the anomaly score and the expected path length of a sample in an isolation tree, as involved in embodiments of the present invention;

FIG. 3 is a schematic structural diagram of an isolation tree built from samples A, B, C and D, as involved in embodiments of the present invention;

FIG. 4 is a schematic flowchart of the tabu search algorithm involved in embodiments of the present invention;

FIG. 5 is a schematic flowchart of a first embodiment of big data anomaly detection involved in embodiments of the present invention;

FIG. 6 is a schematic flowchart of a first embodiment of big data anomaly detection involved in embodiments of the present invention;

FIG. 7 is a schematic flowchart of a first embodiment of big data anomaly detection involved in embodiments of the present invention;

FIG. 8 is a schematic flowchart of a third embodiment of big data anomaly detection involved in embodiments of the present invention;

FIG. 9 is a schematic diagram of the test times of the improved isolation forest anomaly detection algorithm and the original isolation forest anomaly detection algorithm involved in embodiments of the present invention;

FIG. 10 is a structural block diagram of a first embodiment of the big data anomaly detection device involved in embodiments of the present invention.

The realization of the object, the functional features and the advantages of the present invention will be further described with reference to the embodiments and the accompanying drawings.

Detailed Description

It should be understood that the specific embodiments described herein are only intended to explain the present invention and are not intended to limit it.

In the business production processes of traditional enterprises, data is mostly entered manually by staff, so the correctness of the data in the source database cannot be guaranteed, and abnormal data such as garbled characters, mismatched data units and missing values is common. If abnormal data is not removed during data conversion, it increases the load on the system, and for complex data conversion operations it can even cause the conversion task to crash. There are currently two main kinds of existing technical solutions:

The first is the distance-based method, which assumes that normal samples always cluster together while abnormal samples are always isolated. It determines whether a point is abnormal by calculating the distance between each sample and the samples around it.

A representative algorithm is K-Means. K-Means partitions the data set into multiple clusters by calculating the Euclidean distances between samples, and samples that lie far from every cluster are identified as abnormal data.

The second is the density-based method. Similar to the distance-based method, it detects abnormal data by calculating an anomaly score from the density around a sample and the density around its neighboring points. The DBSCAN algorithm (Density-Based Spatial Clustering of Applications with Noise) is a classic density-based anomaly detection algorithm. It groups the data set using data density as the criterion, and samples in low-density regions are treated as outliers.

The distance-based K-Means algorithm is simple and efficient and has been widely applied, but it is sensitive to noise and outliers. The density-based Local Outlier Factor (LOF) algorithm identifies abnormal data by calculating an outlier factor for each sample in the data set. The LOF algorithm works well on medium- and high-dimensional data sets, but because it must calculate the distances between samples, its computational overhead is too high and it performs poorly on large data sets.

To address this, the isolation forest algorithm has been proposed. The original isolation forest algorithm performs well in anomaly detection, but its detection accuracy depends on the number of isolation trees. For anomaly detection on large volumes of data, a large number of isolation trees often has to be built, which consumes a lot of memory, incurs high computational overhead, and lowers anomaly detection efficiency.

Referring to FIG. 1, FIG. 1 is a schematic structural diagram of an electronic device in the hardware operating environment involved in embodiments of the present invention.

As shown in FIG. 1, the electronic device may include: a processor 1001, such as a central processing unit (CPU), a communication bus 1002, a user interface 1003, a network interface 1004, and a memory 1005. The communication bus 1002 is used to implement connection and communication between these components. The user interface 1003 may include a display and an input unit such as a keyboard, and may optionally also include a standard wired interface and a wireless interface. The network interface 1004 may optionally include a standard wired interface and a wireless interface (such as a Wireless-Fidelity (WI-FI) interface). The memory 1005 may be a high-speed random access memory (RAM) or a stable non-volatile memory (NVM), such as a disk memory. The memory 1005 may optionally also be a storage device independent of the aforementioned processor 1001.

Those skilled in the art will understand that the structure shown in FIG. 1 does not limit the electronic device, which may include more or fewer components than shown, combine certain components, or use a different arrangement of components.

As shown in FIG. 1, the memory 1005, as a storage medium, may include an operating system, a network communication module, a user interface module, and a big data anomaly detection program.

In the electronic device shown in FIG. 1, the network interface 1004 is mainly used for data communication with a network server, and the user interface 1003 is mainly used for data interaction with a user. The processor 1001 and the memory 1005 may be provided in the electronic device, and the electronic device calls, through the processor 1001, the big data anomaly detection program stored in the memory 1005 and executes the big data anomaly detection method provided by embodiments of the present invention.

The method of anomaly detection based on the original isolation forest is described first.

In large data sets, abnormal data generally has two characteristics: 1) the feature values of abnormal data always deviate substantially from those of normal data; 2) the probability of abnormal data appearing in the data set is very low.

Using these two characteristics, the isolation forest algorithm defines a binary search tree called an isolation tree and recursively partitions the data set until every sample is isolated. Under this recursive partitioning strategy, abnormal data ends up closer to the root node of the tree, that is, fewer recursive splits are needed to isolate it. The isolation forest algorithm introduces the definitions of isolation tree, path length and anomaly score.

First, the definitions are introduced briefly.

Definition 1 (Isolation Tree): Let N be a node of an isolation tree T. N can only be in one of two states: it has no child nodes, or it has exactly two child nodes. If N has no child nodes, N is called an external node of T; if N has two child nodes (Nl, Nr), N is called an internal node of T. For a given data set X = {x1, ..., xn} with n samples, a feature q and a split value p are selected at random. For a sample xi of X at node N, if the value of xi on feature q is less than the split value p, xi is placed in the left subtree; otherwise it is placed in the right subtree. The data set X is split recursively in this way until any one of the following conditions is met: 1) the isolation tree reaches the height limit; 2) only identical samples remain at the node; 3) only one sample remains at the node.
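To make Definition 1 concrete, the following is a minimal Python sketch of recursive isolation tree construction, not the implementation of the invention. The names `Node`, `build_tree` and `path_length` are hypothetical. The sketch assumes samples are tuples of floats, that the split value is drawn uniformly between the minimum and maximum of the chosen feature, and that a sample whose value is less than the split value goes to the left subtree. The helper `path_length` counts the edges from the root to the external node a sample falls into.

```python
import random
from dataclasses import dataclass
from typing import Optional, Sequence, Tuple

Sample = Tuple[float, ...]


@dataclass
class Node:
    size: int                      # number of samples that reached this node
    feature: Optional[int] = None  # q: index of the split feature (internal nodes)
    split: Optional[float] = None  # p: split value (internal nodes)
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def build_tree(samples: Sequence[Sample], height_limit: int,
               rng: random.Random, depth: int = 0) -> Node:
    # Stop: height limit reached, at most one sample left, or only identical samples left.
    if depth >= height_limit or len(samples) <= 1 or len(set(samples)) == 1:
        return Node(size=len(samples))
    # Only features whose values differ can separate the samples.
    usable = [q for q in range(len(samples[0]))
              if min(s[q] for s in samples) < max(s[q] for s in samples)]
    q = rng.choice(usable)
    lo = min(s[q] for s in samples)
    hi = max(s[q] for s in samples)
    p = rng.uniform(lo, hi)
    while p <= lo:  # extremely unlikely; keeps the left branch non-empty
        p = rng.uniform(lo, hi)
    left = [s for s in samples if s[q] < p]
    right = [s for s in samples if s[q] >= p]
    return Node(size=len(samples), feature=q, split=p,
                left=build_tree(left, height_limit, rng, depth + 1),
                right=build_tree(right, height_limit, rng, depth + 1))


def path_length(x: Sample, node: Node, depth: int = 0) -> int:
    """Number of edges from the root to the external node that x falls into."""
    if node.left is None or node.right is None:
        return depth
    child = node.left if x[node.feature] < node.split else node.right
    return path_length(x, child, depth + 1)
```

Because the feature and split value are random, the shape of the tree differs from run to run. Passing a seeded `random.Random` makes a run reproducible.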

Definition 2 (Path Length): The path length is the number of edges traversed from the root node of the isolation tree T to an external node, denoted h(x). For a given data set X = {x1, ..., xn} with n samples, the average path length of the constructed isolation tree T is equivalent to the path length of an unsuccessful search in a binary search tree, as given by formula (1):

$$C(n) = 2H(n-1) - \frac{2(n-1)}{n} \tag{1}$$

where H(i) is the harmonic number, which can be estimated as H(i) ≈ ln(i) + γ, with ln the natural logarithm and γ ≈ 0.5772 Euler's constant; n is the number of samples in the data set; and C(n) is the average value of the path length h(x) for a given n, which is used to normalize the path length h(x) of a sample x.
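Formula (1) can be evaluated with a few lines of Python. This sketch uses the natural logarithm in the harmonic-number approximation, as in the standard isolation forest formulation. The function names and the special-case values for n ≤ 2 (where the approximation is not meaningful) are assumed conventions, not taken from the text.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant


def harmonic(i: int) -> float:
    """Approximate the harmonic number H(i) as ln(i) + gamma."""
    return math.log(i) + EULER_GAMMA


def average_path_length(n: int) -> float:
    """C(n) from formula (1); the values for n <= 2 are an assumed convention."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    return 2.0 * harmonic(n - 1) - 2.0 * (n - 1) / n
```

For n = 256 this gives C(n) ≈ 10.24, so a sample isolated after only two or three splits is well below the average.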

Definition 3 (Anomaly Score): The anomaly score is used to determine whether a sample x is abnormal. Each isolation tree in the isolation forest is traversed, the path length of x in each tree is calculated, and the anomaly score S(x, n) of x is then obtained from these path lengths as shown in formula (2):

$$S(x, n) = 2^{-\frac{E(h(x))}{C(n)}} \tag{2}$$

where E(h(x)) is the expected path length of sample x in the isolation trees. The relationship between S(x, n) and E(h(x)) is shown in FIG. 2, from which it can be seen that: when E(h(x)) → n-1, S approaches 0 and the sample is regarded as normal data; when E(h(x)) → 0, S approaches 1 and the sample is regarded as abnormal data; when E(h(x)) → C(n), S approaches 0.5, that is, the average path length of the sample is close to the average path length of the isolation tree, and it cannot be determined whether the sample is abnormal data.
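The three limiting cases follow directly from formula (2). The numeric case n = 256 below is an arbitrary value chosen only for illustration, with C(256) ≈ 10.24 obtained from formula (1) using the natural logarithm.

```latex
\begin{aligned}
E(h(x)) = C(n) &\;\Rightarrow\; S = 2^{-1} = 0.5 \\
E(h(x)) \to 0 &\;\Rightarrow\; S \to 2^{0} = 1 \\
E(h(x)) \to n-1 &\;\Rightarrow\; S \to 2^{-(n-1)/C(n)}, \quad
\text{e.g. } n = 256:\; S \approx 2^{-255/10.24} \approx 2^{-24.9} \approx 3 \times 10^{-8}
\end{aligned}
```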

It should be noted that the construction of an isolation tree is essentially random. At each node, the algorithm randomly selects a feature and a split value to divide the data. Because of this randomness, the distribution of data points in the tree becomes relatively uniform. In addition, because an isolation tree is a binary tree in which each node has at most two child nodes, there is a relationship between the expected path length and the number of samples. The maximum depth of an isolation tree is usually defined as log2(n), where n is the number of samples. This means that the height (or depth) of the tree increases as the number of samples increases. Since the path length of a sample point is directly related to the height of the tree, the expected path length also increases with the number of samples.

The purpose of the isolation forest algorithm is to identify outliers by calculating the path lengths of sample points. Normal samples follow the overall distribution of the data, so their path lengths in an isolation tree are usually longer: more splits are needed before they are isolated in a leaf node. Outliers differ from the overall distribution, so they tend to be isolated at shallower levels and have shorter path lengths.

In probability theory and statistics, the expected value is the probability-weighted average of all possible outcomes of a random variable. For a sample point in an isolation tree, the expected path length is the weighted average of all possible path lengths, and for normal samples, because of how they are distributed in the tree, the expected path length will be close to n-1.

It should be noted that "close to n-1" is only an approximate description. The actual expected value may be affected by many factors, such as the distribution of the data, the height of the tree and the choice of features. Therefore, in practical applications, the performance and accuracy of the isolation forest algorithm are usually evaluated through extensive experiments and validation.

Next, the process of anomaly detection using the isolation forest algorithm is described.

Anomaly detection with the isolation forest algorithm consists mainly of two stages, training and testing, which are briefly introduced below.

Training stage: isolation trees are built from subsamples of the data set.

(1) For a data set X = {x1, ..., xn}, select ψ samples from X to form a subset M.

(2) Randomly select a feature q from the d features and randomly select a split point p, and use this split point to divide the samples into the left and right subtrees.

(3) Repeat step (2) until the termination condition of the isolation tree in Definition 1 is met.

It should be noted that step (1), selecting ψ samples from X to form a subset M, can be executed repeatedly to generate the target number of isolation trees. For example, if F isolation trees are required, samples are drawn from X F times to form F subsets, and each of the F subsets goes through steps (2) and (3) above to form one isolation tree, giving F isolation trees in total.
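Step (1) and its repetition for F trees can be sketched as below. The function name `draw_subsamples` is hypothetical, and drawing each subset without replacement is an assumption: the text only says that ψ samples are selected from X for each tree.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def draw_subsamples(data: Sequence[T], psi: int, num_trees: int,
                    rng: random.Random) -> List[List[T]]:
    """Draw num_trees subsets of up to psi samples each (no repeats within a subset)."""
    size = min(psi, len(data))  # cannot draw more samples than the data set holds
    return [rng.sample(list(data), size) for _ in range(num_trees)]
```

Each returned subset would then be passed to the tree-building steps (2) and (3), so the cost of building one tree depends on ψ rather than on the full size of X.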

测试阶段：用孤立树为每一个样本计算异常分数。Testing phase: Anomaly scores are calculated for each sample using an isolation tree.

对于数据集中每一个样本点xi，令其遍历每一棵孤立树，根据异常分数公式计算样本点xi的异常分数，最终根据样本点的异常分数判定该样本点是否为异常数据。如图3所示，为样本集(A，B，C，D)遍历一棵孤立树的示意图，可以看出其中样本点D最可能是异常，因为其最早达到孤立状态。For each sample point xi in the data set, let it traverse each isolated tree, calculate the anomaly score of the sample point xi according to the anomaly score formula, and finally determine whether the sample point is an abnormal data according to the anomaly score of the sample point. As shown in Figure 3, it is a schematic diagram of the sample set (A, B, C, D) traversing an isolated tree. It can be seen that the sample point D is most likely to be an anomaly because it reaches the isolated state the earliest.

具体地说，对于每个样本点，算法会计算其在每棵孤立树中的路径长度。路径长度是指从根节点到包含该样本点的叶子节点所经过的边的数量。这个长度可以被看作是被切分的次数，或者从根节点到该样本点的深度。Specifically, for each sample point, the algorithm calculates the path length in each isolated tree. The path length refers to the number of edges from the root node to the leaf node containing the sample point. This length can be regarded as the number of splits, or the depth from the root node to the sample point.

基于样本点在每棵树中的路径长度，算法会计算其异常得分。异常得分通常是根据路径长度的平均值以及某种规范化因子来计算的。具体来说，对于每个样本x，它的异常得分s(x)可能基于根节点到x的平均路径长度c(n)以及给定样本数量的平均路径长度的期望值E(h(x))来计算，参见上述公式(2)。异常得分越低，表示样本点越可能是异常值。Based on the path length of the sample point in each tree, the algorithm calculates its anomaly score. The anomaly score is usually calculated based on the average path length and some normalization factor. Specifically, for each sample x, its anomaly score s(x) may be calculated based on the average path length c(n) from the root node to x and the expected value E(h(x)) of the average path length for a given number of samples, see formula (2) above. The lower the anomaly score, the more likely the sample point is an outlier.

Anomalies are screened out according to the anomaly score and a predetermined threshold, which is usually determined from the statistical characteristics of the training data. A sample point whose anomaly score exceeds the threshold is marked as an anomaly.

Because there are multiple isolation trees, each tree may produce its own detection result. The final decision is obtained by combining the results of all isolation trees, for example by averaging the per-tree anomaly scores or by voting, to decide whether a sample point is an outlier.
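
The two ways of combining per-tree results mentioned above, averaging and voting, can be sketched as follows. This is a minimal illustration only: the function names and the threshold of 0.5 used in the example are assumptions, not values given in this application.

```python
from typing import Sequence


def is_anomaly_by_average(tree_scores: Sequence[float], threshold: float) -> bool:
    """Flag the sample when the mean of the per-tree scores exceeds the threshold."""
    return sum(tree_scores) / len(tree_scores) > threshold


def is_anomaly_by_vote(tree_scores: Sequence[float], threshold: float) -> bool:
    """Flag the sample when more than half of the trees exceed the threshold."""
    votes = sum(1 for score in tree_scores if score > threshold)
    return votes * 2 > len(tree_scores)


# Example: one tree scores the sample very high, the other two do not.
example_scores = [0.9, 0.4, 0.45]
```

For `example_scores` with a threshold of 0.5, averaging flags the sample while majority voting does not, which shows that the two combination rules can disagree.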

Finally, the algorithm outputs the anomaly detection result for each sample point, including which sample points are considered outliers.

It should be noted that the parameters of the isolation forest algorithm (such as the number of isolation trees, the maximum tree depth, and the sampling ratio) can be adjusted according to the specific application scenario and data to optimize anomaly detection performance. In addition, the isolation forest algorithm has linear time complexity, so it is efficient when processing large data sets.

In this application, the original isolation forest anomaly detection algorithm is improved: a tabu search algorithm is introduced to remove low-accuracy and redundant isolation trees, thereby improving the efficiency of anomaly detection.

The tabu search algorithm is a classic heuristic algorithm with strong global search capability. Its flow is shown in Figure 4. It uses a tabu list to record operations that have already been performed and repeatedly executes the cycle "move within the neighborhood → select candidate solutions → calculate fitness → check the aspiration criterion → update the current solution", iterating on the current solution and finally outputting the best solution found.
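
The generic tabu search cycle described above can be sketched on a toy problem. This is not the tree-selection procedure of this application; it only illustrates the tabu list and the aspiration criterion. `LANDSCAPE`, `line_neighbors`, the tabu tenure of 3, and the iteration limit are all illustrative assumptions.

```python
from collections import deque

# Toy fitness landscape: a local optimum at index 1 and the global optimum at index 6.
LANDSCAPE = [0, 3, 2, 1, 2, 5, 9, 4, 0, 0, 0]


def line_neighbors(i):
    """Neighborhood of index i: the adjacent indices on the line."""
    return [j for j in (i - 1, i + 1) if 0 <= j < len(LANDSCAPE)]


def tabu_search(start, neighbors, fitness, tenure=3, max_iters=50):
    current = best = start
    tabu = deque([start], maxlen=tenure)  # recently visited solutions
    for _ in range(max_iters):
        candidates = [
            cand for cand in neighbors(current)
            # Aspiration criterion: a tabu solution is allowed if it beats the best so far.
            if cand not in tabu or fitness(cand) > fitness(best)
        ]
        if not candidates:
            break
        current = max(candidates, key=fitness)  # may be worse than the previous solution
        tabu.append(current)
        if fitness(current) > fitness(best):
            best = current
    return best
```

Started at index 0, a pure hill climber would stop at the local optimum at index 1, because both of its neighbors are worse. The tabu list forbids stepping back, so the search continues through the worse solutions at indices 2 to 4 and reaches index 6.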

This application uses the tabu search algorithm to screen the original forest and retain the better target isolation trees to form the isolation forest, which reduces computational overhead and improves the accuracy of isolation forest anomaly detection.

Referring to FIG. 5, an embodiment of the present invention provides a big data anomaly detection method. FIG. 5 is a schematic flow chart of the first embodiment of the big data anomaly detection method of the present invention.

In this embodiment, the big data anomaly detection method includes the following steps:

Step S101: Create a plurality of isolation trees according to the business data to be processed.

It should be noted that the execution subject of this embodiment may be a computing device with data processing, network communication, and program execution functions, such as a desktop computer, tablet computer, personal computer, or mobile phone, or an electronic device or cloud server capable of realizing the above functions. This embodiment and the following embodiments are described below taking an electronic device as an example.

In this embodiment, the original isolation forest algorithm can be used to create a plurality of isolation trees from the business data to be processed. The process of constructing an isolation tree with the original isolation forest algorithm is as follows:

(1) For the business data set to be processed X = {x1, ..., xn}, select ψ samples from X to form a subset M.

(2) Randomly select a sample feature and a split value q between the minimum and maximum values of that feature at the current node, and divide the samples of the node into two child nodes according to whether their feature value is greater than q.

(3) Repeat step (2) until a termination condition of the isolation tree is met. The termination conditions are: 1. the isolation tree reaches the height limit; 2. only identical samples remain at the node; 3. only one sample remains at the node.

Assume that the business data to be processed contains N samples whose features include, for example, age and income; the specific value of a feature is its feature value. Suppose the feature age is selected at random with split value q = 70, so that samples with age > 70 are treated as abnormal and samples with age <= 70 as normal. This is equivalent to drawing a hyperplane in the sample space. Next, the feature income is selected with a split value of 10,000, so that samples with income > 10,000 are treated as abnormal and samples with income <= 10,000 as normal, which draws another hyperplane in the sample space. It can be seen that points in sparsely populated regions can be isolated with fewer hyperplane divisions, while points in densely populated regions require more hyperplane divisions to be isolated.

It should be noted that step (1), selecting ψ samples from X to form a subset M, can be repeated to generate the target number of isolation trees. For example, if F isolation trees are required, X is sampled F times to form F subsets, and each subset is turned into one isolation tree using steps (2) and (3) above, yielding F isolation trees.
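
The construction procedure above can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation of this application: samples are tuples of numeric feature values, the height limit is taken as ceil(log2 ψ) as is customary for isolation forests, and the names `build_tree`, `path_length`, and `build_forest` are hypothetical.

```python
import math
import random


def build_tree(samples, height, max_height, rng):
    """Split recursively on a random feature and split value q until termination."""
    n_features = len(samples[0]) if samples else 0
    varying = [f for f in range(n_features)
               if min(s[f] for s in samples) < max(s[f] for s in samples)]
    # Stop at the height limit, or when one sample or only identical samples remain.
    if height >= max_height or len(samples) <= 1 or not varying:
        return {"size": len(samples)}
    feature = rng.choice(varying)
    values = [s[feature] for s in samples]
    q = rng.uniform(min(values), max(values))  # random split value q
    left = [s for s in samples if s[feature] <= q]
    right = [s for s in samples if s[feature] > q]
    return {"feature": feature, "q": q,
            "left": build_tree(left, height + 1, max_height, rng),
            "right": build_tree(right, height + 1, max_height, rng)}


def path_length(x, node):
    """Number of edges from the root to the leaf that x falls into."""
    edges = 0
    while "feature" in node:
        node = node["left"] if x[node["feature"]] <= node["q"] else node["right"]
        edges += 1
    return edges


def build_forest(data, num_trees, psi, seed=0):
    """Draw psi samples from the data num_trees times and build one tree per subset."""
    rng = random.Random(seed)
    max_height = math.ceil(math.log2(psi))  # assumed height limit
    return [build_tree(rng.sample(data, psi), 0, max_height, rng)
            for _ in range(num_trees)]
```

With data consisting of a dense cluster and one distant point, the distant point is typically separated by the first split, so its average path length over the trees is much shorter than that of a point inside the cluster.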

In this embodiment, the business data to be processed is divided into k parts (as in k-fold cross-validation). For each fold, k-1 parts are used as training data to build a plurality of isolation trees, and the remaining part is used as test data for anomaly detection.
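
The k-fold division described above can be sketched as follows. The function name `k_fold_splits` is illustrative, and the interleaved assignment of samples to folds assumes that the data has already been shuffled.

```python
def k_fold_splits(data, k):
    """Yield (train, test) pairs; each of the k parts is the test part exactly once."""
    folds = [data[i::k] for i in range(k)]  # interleaved assignment to k parts
    for i in range(k):
        train = [x for j, fold in enumerate(folds) if j != i for x in fold]
        yield train, folds[i]
```

For each yielded pair, the isolation trees would be built from `train` and evaluated on the held-out part.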

Step S102: Calculate the fitness value of each of the plurality of isolation trees.

In this embodiment, a higher fitness value indicates that an isolation tree differs more from the other trees and has better accuracy. Once the fitness value of each isolation tree has been calculated, the isolation trees with higher fitness values can be selected.

Step S103: Using the fitness value of the isolation tree as a screening index, select target isolation trees from the plurality of isolation trees based on the tabu search algorithm to form an isolation forest.

In this embodiment, the fitness value is not used directly to screen the isolation trees; instead, it serves as the screening index for a tabu search that selects the target isolation trees. This is because screening directly by fitness value may cause the algorithm to fall into a local optimum. Tabu search with the fitness value as the screening index is a metaheuristic method: by maintaining a tabu list and an aspiration criterion, it introduces a degree of randomness and a "jump-out" mechanism into the search, so that the algorithm can escape local optima and explore the solution space more fully, increasing the likelihood of finding the global optimum or a solution close to it.

Step S104: Perform anomaly detection on the business data to be processed according to the isolation forest to obtain an anomaly detection result.

In this embodiment, the isolation trees created from the business data to be processed are not used directly for anomaly detection, nor is the fitness value used directly to screen the target isolation trees. Instead, with the fitness value of the isolation tree as the screening index, the tabu search algorithm removes the isolation trees with lower fitness values and higher redundancy and retains the better target isolation trees. This reduces the space occupied by the isolation forest, lowers the computational overhead of anomaly detection, and improves the efficiency of anomaly detection.

In some embodiments, the above step S102 includes step S1021 and step S1022.

Step S1021: Calculate the accuracy value with which each of the plurality of isolation trees correctly classifies samples, and calculate the degree of difference between the correctly detected samples of any two of the plurality of isolation trees.

In this embodiment, a statistical method may be used to calculate the accuracy value with which each of the plurality of isolation trees correctly classifies samples.

Specifically, for each tree, the number of correctly classified samples (that is, the true positives TP and the true negatives TN) is counted, and the accuracy value is then calculated using the following formula (3):

Accuracy = (TP + TN) / (TP + TN + FP + FN)    (3)

Here, TP is the number of true positives (abnormal samples correctly labeled as abnormal), TN is the number of true negatives (normal samples correctly labeled as normal), FP is the number of false positives (normal samples incorrectly labeled as abnormal), and FN is the number of false negatives (abnormal samples incorrectly labeled as normal).
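
Formula (3) can be illustrated with a short sketch. It assumes that "abnormal" is the positive class, consistent with the definitions of FP and FN, and that predictions and ground-truth labels are given as booleans with True meaning abnormal. The function names are illustrative.

```python
def confusion_counts(predicted, actual):
    """Return (TP, TN, FP, FN), with True meaning 'abnormal' (the positive class)."""
    tp = tn = fp = fn = 0
    for p, a in zip(predicted, actual):
        if p and a:
            tp += 1          # abnormal sample labeled abnormal
        elif not p and not a:
            tn += 1          # normal sample labeled normal
        elif p and not a:
            fp += 1          # normal sample labeled abnormal
        else:
            fn += 1          # abnormal sample labeled normal
    return tp, tn, fp, fn


def accuracy(predicted, actual):
    """Formula (3): (TP + TN) / (TP + TN + FP + FN)."""
    tp, tn, fp, fn = confusion_counts(predicted, actual)
    return (tp + tn) / (tp + tn + fp + fn)
```

For example, with predictions [True, False, True, False, False] and labels [True, False, False, True, False], the counts are TP = 1, TN = 2, FP = 1, FN = 1, so the accuracy is 3/5 = 0.6.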

In this embodiment, a cross-validation method may be used to calculate the degree of difference between the correctly detected samples of any two of the plurality of isolation trees.

Specifically, the degree of difference between isolation trees Ti and Tj can be calculated using formula (4):

Qi,j = (Ki + Kj - 2Kij) / (Ki + Kj)    (4)

Here, Qi,j denotes the degree of difference between isolation trees Ti and Tj, Ki and Kj denote the numbers of samples correctly detected by Ti and Tj respectively, and Kij denotes the number of samples correctly detected by both Ti and Tj.

It should be noted that a correctly detected sample can be understood as follows. For example, if the sample is the time 9:30, the detection is correct when the result obtained by inputting the sample into the isolation tree is "before 10 o'clock", and incorrect when the result is "after 12 o'clock". A large number of samples correctly detected by both trees indicates that isolation trees Ti and Tj behave consistently when detecting certain abnormal samples.
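
Formula (4) can be illustrated with sets of sample identifiers. In this sketch, each set is assumed to hold the identifiers of the samples that one tree detected correctly; the function name and the convention of returning 0.0 when neither tree detects anything correctly are assumptions.

```python
def difference_degree(correct_i: set, correct_j: set) -> float:
    """Formula (4): Q = (Ki + Kj - 2*Kij) / (Ki + Kj)."""
    k_i, k_j = len(correct_i), len(correct_j)
    k_ij = len(correct_i & correct_j)  # samples correctly detected by both trees
    if k_i + k_j == 0:
        return 0.0  # assumed convention for the degenerate case
    return (k_i + k_j - 2 * k_ij) / (k_i + k_j)
```

Two trees that detect exactly the same samples correctly have a degree of difference of 0, and two trees with no correctly detected sample in common have a degree of difference of 1. For the sets {1, 2, 3, 4} and {3, 4, 5, 6}, Ki = Kj = 4 and Kij = 2, so Q = (4 + 4 - 4) / 8 = 0.5.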

Step S1022: Calculate the fitness value of each isolation tree according to the accuracy value and the degree of difference.

Optionally, calculating the fitness value of each isolation tree according to the accuracy value and the degree of difference includes: calculating the sum of the product of the accuracy value and a first weight and the product of the degree of difference and a second weight; and taking the reciprocal of this sum as the fitness value of the isolation tree.

The formula for calculating the fitness value of each isolation tree from its degree of difference and accuracy value is shown in formula (5):

F(Ti) = 1 / (uXi + rQi,j)    (5)

Here, F(Ti) denotes the fitness value of isolation tree Ti, Xi denotes the accuracy value of Ti, Qi,j denotes the degree of difference between Ti and Tj, u denotes the first weight, applied to the accuracy value, and r denotes the second weight, applied to the degree of difference.
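
Formula (5) can be illustrated as follows. The formula writes Qi,j for a pair of trees while F(Ti) is a per-tree value, and the text does not state how the pairwise values are combined; the sketch assumes, purely for illustration, that the difference term for Ti is the mean of Qi,j over all other trees Tj. The weights u = r = 0.5 and the function names are likewise assumptions.

```python
def mean_difference(i, pairwise):
    """Mean of pairwise[i][j] over all j != i (assumed aggregation of Qi,j)."""
    others = [pairwise[i][j] for j in range(len(pairwise)) if j != i]
    return sum(others) / len(others)


def fitness(accuracy_i, difference_i, u=0.5, r=0.5):
    """Formula (5): F(Ti) = 1 / (u * Xi + r * Q)."""
    weighted_sum = u * accuracy_i + r * difference_i
    if weighted_sum == 0:
        raise ValueError("weighted sum is zero; fitness is undefined")
    return 1.0 / weighted_sum
```

With an accuracy value of 0.8 and a mean degree of difference of 0.4, the weighted sum is 0.6 and F(Ti) is about 1.667. Because formula (5) takes a reciprocal, a larger weighted sum gives a smaller F(Ti) under the formula as written.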

The above embodiment provides a specific implementation for calculating the fitness value.

In some embodiments, as shown in FIG. 6, the above step S103 includes steps S1031 to S1035.

Step S1031: Select an arbitrary isolation tree from the plurality of isolation trees as the initial solution.

Step S1032: Select, from the neighborhood of the initial solution, a candidate set that meets a preset tabu requirement, the preset tabu requirement being that the fitness value of each isolation tree in the candidate set is greater than the fitness value of the isolation tree corresponding to the initial solution.

In this embodiment, the neighborhood of the initial solution can be understood as the isolation trees adjacent to the isolation tree that serves as the initial solution.

Step S1033: Select the isolation tree with the largest fitness value from the candidate set and update the tabu list with it.

In this embodiment, the tabu list is initially empty and is updated step by step as step S1033 is executed; isolation trees already in the tabu list are no longer taken out for screening.

Step S1034: Using the isolation tree with the largest fitness value as the current solution, search for a new candidate set and update the tabu list based on it, repeating until a preset termination condition is met.

In this embodiment, the preset termination condition may be any of the following:

(1) The maximum number of iterations is reached: the algorithm presets a maximum number of iterations and stops iterating when this number is reached. This is the most common termination condition and ensures that the algorithm does not run indefinitely.

(2) Improvement in solution quality stagnates: if the quality of the solution does not improve noticeably (that is, the fitness value does not increase significantly) over a certain number of consecutive iterations, the algorithm may terminate. This indicates that the algorithm may be stuck in a local optimum and unable to find a better solution.

(3) A preset solution quality is reached: if the algorithm finds a solution that meets a preset condition during iteration (for example, reaching or exceeding a specific fitness value), it may terminate early. This usually applies to problems with a clear optimization target.

(4) Computing resources are limited: the algorithm may sometimes need to terminate early because of limits on computing resources (such as memory or time). In this case the algorithm may not complete all preset iterations, but it outputs a relatively good solution based on the results computed so far.

It should be noted that these termination conditions are not mutually exclusive; one or more of them can be used as the specific problem requires. The choice of termination condition should also take the performance and efficiency of the algorithm into account, so that iteration is terminated neither too early nor too late.

Step S1035: When the preset termination condition is met, use the isolation trees in the tabu list as the target isolation trees to form the isolation forest.
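
Steps S1031 to S1035 can be sketched as follows. Several details are assumptions made only for illustration: trees are identified by their index, "adjacent" trees are taken to be those within a fixed index distance `radius`, and the search also stops when the candidate set is empty. The function name `select_trees` is hypothetical.

```python
def select_trees(fitness_values, start=0, radius=2, max_iters=100):
    """Return the indices of the trees placed in the tabu list."""
    tabu = []            # initially empty tabu list
    current = start      # S1031: arbitrary initial solution
    for _ in range(max_iters):
        lo = max(0, current - radius)
        hi = min(len(fitness_values) - 1, current + radius)
        # S1032: neighbors not in the tabu list whose fitness exceeds the current one.
        candidates = [j for j in range(lo, hi + 1)
                      if j != current and j not in tabu
                      and fitness_values[j] > fitness_values[current]]
        if not candidates:
            break        # assumed termination: no admissible candidate remains
        # S1033: the fittest candidate enters the tabu list.
        best = max(candidates, key=lambda j: fitness_values[j])
        tabu.append(best)
        current = best   # S1034: continue the search from the new solution
    return tabu          # S1035: the trees in the tabu list form the forest
```

For fitness values [1.0, 1.2, 1.1, 1.5, 1.3, 1.4] with the search started at index 0, the first iteration selects index 1, the second selects index 3, and the third finds no better neighbor, so the sketch returns [1, 3].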

In some embodiments, as shown in FIG. 7, the above step S104 includes steps S1041 to S1043.

Step S1041: Calculate the expected value of the path length of each item of the business data to be processed in the target isolation trees of the isolation forest, and calculate the average value of the path length of each item of the business data to be processed in each of the target isolation trees.

Step S1042: Calculate the anomaly score of each item of business data in the target isolation trees according to the expected value and the average value.

Step S1043: Determine the anomaly detection result according to the anomaly score, in the target isolation trees, of each item of the business data to be processed.

In this embodiment, the expected value of the path length of each item of the business data to be processed in the target isolation trees is calculated, together with the average value of its path length in the target isolation trees; the anomaly score of each item of business data is calculated from the expected value and the average value; and the anomaly detection result is determined from the anomaly score of each item of the business data to be processed. For details, see formula (2) and the method described above for obtaining anomaly detection results with the original isolation forest. Obtaining anomaly detection results with the new isolation forest is essentially the same as with the original isolation forest and is not repeated here.

The steps of the isolation forest anomaly detection algorithm based on tabu search in this embodiment are described below, taking collected historical log data of a business system as the business data to be processed, as shown in FIG. 8.

Step S201: Collect historical log data of the business system.

Step S202: Initialize and construct the original isolation forest from the log data collected from the business system.

Step S203: Extract subsets of a certain size from the log data set and create isolation trees until the conditions are met.

Step S204: Calculate the accuracy value of each isolation tree according to the cross-validation method and calculate the degree of difference of each isolation tree according to the statistical method.

Step S205: Select an arbitrary isolation tree from the plurality of isolation trees as the initial solution.

Step S206: Select, from the neighborhood of the initial solution, a candidate set that meets a preset tabu requirement, the preset tabu requirement being that the fitness value of each isolation tree in the candidate set is greater than the fitness value of the isolation tree corresponding to the initial solution.

Step S207: Select the isolation tree with the largest fitness value from the candidate set and update the tabu list with it.

Step S208: Using the isolation tree with the largest fitness value as the current solution, search for a new candidate set and update the tabu list based on it, repeating until a preset termination condition is met.

Step S209: When the preset termination condition is met, use the isolation trees in the tabu list as the target isolation trees to form the isolation forest.

Step S210: Detect the data set according to the isolation forest, traversing the trees and calculating the anomaly scores, to obtain the abnormal log data set.

In the isolation forest algorithm, the detection ability of each isolation tree varies with the training data and the parameters, so this application uses the cross-validation method to calculate the accuracy value of each isolation tree.

Comparing the anomaly detection accuracy and execution time of the improved isolation forest algorithm with those of the original isolation forest algorithm shows that the improved algorithm is more effective than the original.

This application uses a common anomaly detection metric, the area under the ROC curve (Area Under Curve, AUC), to measure how accurately abnormal data is separated. The AUC takes values between 0.5 and 1, and the closer it is to 1, the more accurate the algorithm. In the test, the sampling size is 256 samples and 100 isolation trees are constructed; the results are shown in Table 1 below.

Table 1:

It can be seen that the improved isolation forest algorithm achieves better anomaly detection accuracy than the original isolation forest. In terms of execution efficiency, the improved and the original isolation forest algorithms were each run 10 times on the 6 data sets in the table above. The results are shown in Figure 9, where the improved isolation forest algorithm is significantly better than the original in execution time.
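
The AUC metric used in this comparison can be computed directly from anomaly scores and ground-truth labels. The sketch below uses the rank-based interpretation of AUC, the fraction of (abnormal, normal) pairs in which the abnormal sample receives the higher score, with ties counted as one half. The function name and the example data are illustrative and are not the data sets of Table 1.

```python
def auc(scores, labels):
    """Rank-based AUC; labels are truthy for abnormal samples."""
    pos = [s for s, y in zip(scores, labels) if y]
    neg = [s for s, y in zip(scores, labels) if not y]
    # Count pairs where the abnormal sample outranks the normal one (ties count 0.5).
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

For scores [0.9, 0.8, 0.4, 0.3, 0.6] with labels [1, 0, 0, 0, 1], five of the six (abnormal, normal) pairs are ranked correctly, so the AUC is 5/6.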

The improved isolation forest anomaly detection algorithm of this application effectively reduces computational overhead and has the advantages of linear complexity, good noise resistance, wide applicability, and high detection accuracy.

This proposal can be applied in the unified data management platform of a centralized construction management system. In the specific scenario, while the business systems synchronize data to the planning and construction system, the data platform must uniformly clean, verify, and audit the synchronized business data. During this processing, business field data is often inaccurate, garbled, or even missing, which can cause the data management platform to process abnormally or even crash. Introducing the algorithm of this proposal greatly improves the accuracy and stability of abnormal data detection in the data management system, so the proposal has high application value in the field of big data anomaly detection.

This proposal mainly addresses the dirty data problem that the system faces during implementation. Detecting, aggregating, and cleaning abnormal business data improves baseline data quality; standardizing all types of data and unifying internal and external definitions provides full-link, cross-service, and cross-system data exchange and analysis capabilities; establishing and optimizing data governance organizations and processes gradually makes data secure and controllable and enables data applications. The data anomaly detection method of this proposal greatly improves data quality.

In addition, an embodiment of the present invention further provides a storage medium on which a big data anomaly detection program is stored. When the big data anomaly detection program is executed by a processor, the steps of the big data anomaly detection method described above are implemented.

In addition, an embodiment of the present invention further provides a computer program product including a big data anomaly detection program which, when executed by a processor, implements the steps of the big data anomaly detection method described above.

The specific implementations of the computer program product of the present invention are essentially the same as the embodiments of the big data anomaly detection method above and are not repeated here.

Referring to FIG. 10, FIG. 10 is a structural block diagram of the first embodiment of the big data anomaly detection device of the present invention.

As shown in FIG. 10, the big data anomaly detection device provided in the embodiment of the present invention includes: an isolation tree construction module 301, used to create a plurality of isolation trees according to the business data to be processed; a processing module 302, used to calculate the fitness value of each of the plurality of isolation trees; a tabu search module 303, used to select, with the fitness value of the isolation tree as a screening index, target isolation trees from the plurality of isolation trees based on the tabu search algorithm to form an isolation forest; and an anomaly detection module 304, used to perform anomaly detection on the business data to be processed according to the isolation forest to obtain an anomaly detection result.

For the embodiments or specific implementations of the big data anomaly detection device of the present invention, reference may be made to the method embodiments above; they are not repeated here.

It should be noted that, in this document, the terms "include", "comprise", and any other variants thereof are intended to cover non-exclusive inclusion, so that a process, method, article, or system that includes a series of elements includes not only those elements but also other elements not explicitly listed, or elements inherent to such a process, method, article, or system. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the process, method, article, or system that includes the element.

The serial numbers of the above embodiments of the present invention are for description only and do not indicate the relative merits of the embodiments.

From the description of the above implementations, those skilled in the art can clearly understand that the methods of the above embodiments can be implemented by software plus a necessary general-purpose hardware platform, or by hardware, although in many cases the former is the better implementation. Based on this understanding, the technical solution of the present invention, in essence or in the part that contributes to the prior art, can be embodied in the form of a software product. The computer software product is stored in a storage medium (such as a read-only memory/random access memory, magnetic disk, or optical disk) and includes a number of instructions that cause a terminal device (which may be a mobile phone, computer, server, air conditioner, network device, or the like) to execute the methods described in the embodiments of the present invention.

The above are only preferred embodiments of the present invention and do not thereby limit the patent scope of the present invention. Any equivalent structure or equivalent process transformation made using the contents of the specification and drawings of the present invention, or any direct or indirect application in other related technical fields, is likewise included in the scope of patent protection of the present invention.

Claims

1. A big data anomaly detection method, the method comprising:

Creating a plurality of isolated trees according to the service data to be processed;

Calculating the fitness value of each isolated tree in the plurality of isolated trees;

Selecting a target isolated tree form from the plurality of isolated trees to form an isolated forest based on a tabu search algorithm by taking the fitness value of the isolated tree as a screening index;

and carrying out anomaly detection on the business data to be processed according to the isolated forest to obtain an anomaly detection result.

2. The big data anomaly detection method according to claim 1, wherein the calculating a fitness value of each isolated tree in the plurality of isolated trees comprises:

calculating a precision value of the correctly classified samples of each isolated tree in the plurality of isolated trees, and calculating a degree of difference between the correctly detected samples of any two isolated trees in the plurality of isolated trees;

and calculating the fitness value of each isolated tree according to the precision value and the degree of difference.

3. The big data anomaly detection method according to claim 2, wherein the calculating the fitness value of each isolated tree according to the precision value and the degree of difference comprises:

calculating the sum of the product of the precision value and a first weight and the product of the degree of difference and a second weight;

and taking the reciprocal of the sum of the products as the fitness value of each isolated tree.
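
For illustration only, the weighted-sum calculation of claim 3 can be sketched in Python as follows. The sketch reads the final step as the reciprocal 1 / (w1 * precision + w2 * difference), which is an interpretation of the translated claim. The function name `fitness_values`, the default weights of 0.5, and the use of one aggregated degree of difference per tree (for example, a tree's mean pairwise difference) are assumptions that the claims do not specify.

```python
from typing import List, Sequence


def fitness_values(
    precisions: Sequence[float],
    differences: Sequence[float],
    w1: float = 0.5,  # first weight (assumed value, not given in the claims)
    w2: float = 0.5,  # second weight (assumed value, not given in the claims)
) -> List[float]:
    """Return one fitness value per isolated tree.

    precisions[i]  : precision value of tree i on correctly classified samples
    differences[i] : degree of difference of tree i, assumed here to be
                     already aggregated to a single number per tree
    """
    if len(precisions) != len(differences):
        raise ValueError("precisions and differences must have equal length")

    values = []
    for p, d in zip(precisions, differences):
        weighted_sum = w1 * p + w2 * d
        if weighted_sum <= 0:
            # The reciprocal is undefined (or negative) for a non-positive sum.
            raise ValueError("weighted sum must be positive")
        values.append(1.0 / weighted_sum)
    return values
```

Under this reading, a tree with precision 0.8 and degree of difference 0.2 and equal weights of 0.5 has a weighted sum of 0.5. A larger weighted sum yields a smaller fitness value.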

4. The big data anomaly detection method according to claim 2, wherein the calculating a precision value of the correctly classified samples of each isolated tree in the plurality of isolated trees comprises:

calculating the precision value of the correctly classified samples of each isolated tree in the plurality of isolated trees according to a statistical method.

5. The big data anomaly detection method according to claim 2, wherein the calculating a degree of difference between the correctly detected samples of any two isolated trees in the plurality of isolated trees comprises:

calculating the degree of difference between the correctly detected samples of any two isolated trees in the plurality of isolated trees according to a cross-validation method.

6. The big data anomaly detection method according to claim 1, wherein the selecting, based on a tabu search algorithm and taking the fitness value of each isolated tree as a screening index, target isolated trees from the plurality of isolated trees to form an isolated forest comprises:

selecting, from the fitness values of the plurality of isolated trees, any isolated tree as an initial solution;

selecting, from the neighborhood of the initial solution, a candidate set that meets a preset tabu requirement, wherein the preset tabu requirement is that the fitness value of each isolated tree in the candidate set is greater than the fitness value of the isolated tree corresponding to the initial solution;

selecting, from the candidate set, the isolated tree with the maximum fitness value to update a tabu table;

using the isolated tree with the maximum fitness value as the solution to search for a candidate set again, and updating the tabu table again based on that candidate set, until a preset termination condition is met;

and, when the preset termination condition is met, forming the isolated forest by taking the isolated trees in the tabu table as the target isolated trees.
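
As a rough sketch of the selection loop in claim 6, the following Python function walks from an initial tree to successively fitter neighbors and records each chosen tree in a tabu table. Several details are assumptions made for illustration because the claims leave them open: the neighborhood is taken to be the trees within `radius` positions of the current tree in index order, the termination condition is an iteration cap or an empty candidate set, and the initial solution is placed in the tabu table so that the resulting forest is never empty. The name `tabu_select` is hypothetical.

```python
from typing import List, Sequence


def tabu_select(
    fitness: Sequence[float],
    start: int = 0,
    radius: int = 2,      # assumed neighborhood size (not given in the claims)
    max_iters: int = 50,  # assumed termination condition
) -> List[int]:
    """Return the indices of the target isolated trees (the tabu table)."""
    if not 0 <= start < len(fitness):
        raise IndexError("start must be a valid tree index")

    current = start
    tabu = [start]  # assumption: the initial solution is kept in the table

    for _ in range(max_iters):
        lo = max(0, current - radius)
        hi = min(len(fitness), current + radius + 1)
        # Candidate set: neighbors not yet in the tabu table whose fitness
        # exceeds that of the current solution (the preset tabu requirement).
        candidates = [
            i for i in range(lo, hi)
            if i not in tabu and fitness[i] > fitness[current]
        ]
        if not candidates:
            break  # no better neighbor: treat as termination
        best = max(candidates, key=lambda i: fitness[i])
        tabu.append(best)  # update the tabu table
        current = best     # search again from the new solution

    return tabu
```

With fitness values `[0.1, 0.5, 0.3, 0.9, 0.2, 0.95]`, a start index of 0, and a radius of 2, the function returns `[0, 1, 3, 5]`: each step moves to the fittest unvisited neighbor that improves on the current tree.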

7. The big data anomaly detection method according to claim 1, wherein the performing anomaly detection on the service data to be processed according to the isolated forest to obtain an anomaly detection result comprises:

calculating an expected value of the path length of each item of service data in the service data to be processed across the target isolated trees of the isolated forest, and calculating an average value of the path length of each item of service data in the service data to be processed in each target isolated tree;

calculating an anomaly score of each item of service data in the target isolated trees according to the expected value and the average value;

and determining the anomaly detection result according to the anomaly score, in the target isolated trees, of each item of service data in the service data to be processed.
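
One way to read the scoring step of claim 7 is through the standard isolation forest score, s(x, n) = 2^(-E[h(x)] / c(n)), where E[h(x)] is the expected path length of a sample over the trees and c(n) is the average path length of an unsuccessful search in a binary search tree built on n samples. The Python sketch below assumes that formulation; the claims do not state the exact formula, and the function names and the harmonic-number approximation H(i) ≈ ln(i) + 0.5772156649 are illustrative choices.

```python
import math
from typing import Sequence

EULER_GAMMA = 0.5772156649  # Euler-Mascheroni constant


def average_path_length(n: int) -> float:
    """c(n): average path length for a tree built on n samples."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n - 1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n


def anomaly_score(path_lengths: Sequence[float], n: int) -> float:
    """Score one sample from its path length in each target isolated tree.

    path_lengths : path length of the sample in every selected tree
    n            : number of samples used to build each tree (assumed equal)
    """
    if not path_lengths:
        raise ValueError("at least one path length is required")
    c = average_path_length(n)
    if c == 0.0:
        raise ValueError("n must be at least 2")
    expected = sum(path_lengths) / len(path_lengths)  # E[h(x)]
    return 2.0 ** (-expected / c)
```

Under this formulation, scores close to 1 indicate samples that are isolated after very few splits and are therefore likely anomalies, while a sample whose expected path length equals c(n) scores exactly 0.5. The threshold used to turn scores into a detection result is left open here, as it is in the claim.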

8. A big data anomaly detection device, characterized by comprising:

an isolated tree construction module, configured to create a plurality of isolated trees according to service data to be processed;

a processing module, configured to calculate a fitness value of each isolated tree in the plurality of isolated trees;

a tabu search module, configured to select, based on a tabu search algorithm and taking the fitness value of each isolated tree as a screening index, target isolated trees from the plurality of isolated trees to form an isolated forest;

and an anomaly detection module, configured to perform anomaly detection on the service data to be processed according to the isolated forest to obtain an anomaly detection result.

9. An electronic device, the device comprising: a memory, a processor, and a big data anomaly detection program stored on the memory and executable on the processor, the big data anomaly detection program being configured to implement the steps of the big data anomaly detection method according to any one of claims 1 to 7.

10. A storage medium having stored thereon a big data anomaly detection program which, when executed by a processor, implements the steps of the big data anomaly detection method according to any one of claims 1 to 7.

11. A computer program product, characterized in that the computer program product comprises a big data anomaly detection program which, when executed by a processor, implements the steps of the big data anomaly detection method of any one of claims 1 to 7.