
CN107222472A - A user behavior anomaly detection method under a Hadoop cluster - Google Patents


Info

Publication number
CN107222472A
Authority
CN
China
Prior art keywords
data
user
user behavior
behavior
feature vector
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN201710384599.7A
Other languages
Chinese (zh)
Inventor
郝玉洁
钟德建
王芷若
崔建鹏
陆文斌
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
University of Electronic Science and Technology of China
Original Assignee
University of Electronic Science and Technology of China
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by University of Electronic Science and Technology of China
Priority to CN201710384599.7A
Publication of CN107222472A
Legal status: Pending

Classifications

    • H: ELECTRICITY
    • H04: ELECTRIC COMMUNICATION TECHNIQUE
    • H04L: TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L63/00: Network architectures or network communication protocols for network security
    • H04L63/14: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic
    • H04L63/1408: Network architectures or network communication protocols for network security for detecting or protecting against malicious traffic by monitoring network traffic
    • H04L63/1425: Traffic logging, e.g. anomaly detection
    • G: PHYSICS
    • G06: COMPUTING OR CALCULATING; COUNTING
    • G06F: ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00: Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/10: File systems; File servers
    • G06F16/18: File system types
    • G06F16/182: Distributed file systems

Landscapes

  • Engineering & Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Computer Security & Cryptography (AREA)
  • Theoretical Computer Science (AREA)
  • Signal Processing (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Computing Systems (AREA)
  • Data Mining & Analysis (AREA)
  • Databases & Information Systems (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Hardware Design (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a method for detecting abnormal user behavior in a Hadoop cluster. The method collects and analyzes user behavior data, records it in logs, forms feature vectors from each user's behavioral features, and processes the feature vector set with a parallel principal component analysis algorithm to obtain the user's behavior pattern efficiently. By comparing current behavior with the historical pattern, it finds abnormal behavior that arises when a user accesses HDFS and, at the same time, uncovers security threats hidden in the Hadoop cluster, thereby safeguarding HDFS. The invention not only establishes effective monitoring of users' data access behavior, detects abnormal behavior promptly, and protects the data security of the Hadoop cluster, but also improves model training efficiency by parallelizing the principal component analysis algorithm, addressing the low training efficiency of the conventional model.

Description

A user behavior anomaly detection method under a Hadoop cluster

Technical field

The invention relates to a method for detecting abnormal user behavior, and in particular to a method for detecting abnormal user behavior in a Hadoop cluster.

Background

In recent years, the Hadoop platform, as an excellent distributed computing system, has played an increasingly important role in large-scale enterprise data processing. However, Hadoop was not designed with security in mind. Although some security mechanisms were added later, Hadoop's security audit, access control, and identity authentication mechanisms are all passive, static security technologies that cannot monitor user activity, which leaves the cluster open to hidden attacks. For example, an illegitimate user may steal a legitimate user's account and password and use the associated permissions to access data illegally; data leaks can easily occur during malicious intrusion, maintenance, or loss of storage media, so the data security of the cluster is hard to guarantee. Data is the carrier of information, and a data disaster may cause users immeasurable losses. It is therefore necessary to establish effective monitoring of users' data access behavior, detect abnormal behavior promptly, and protect the data security of the Hadoop cluster.

There has been relatively little domestic research on monitoring based on user activity. Ashish Kamra et al. proposed an anomaly detection method for relational database access patterns based on users' SQL query logs, but it targets only relational databases and is not suited to monitoring user behavior on big data platforms. Mohiuddin Solaimani et al. proposed a Spark-based framework for detecting virtual machine performance anomalies, which aims to identify users who occupy large amounts of resources and thereby cause unbalanced resource sharing that degrades cluster efficiency; however, Spark computes in memory and cannot cope when the data is very large or intermediate results exceed memory. Liu Peng proposed an abstract architecture for databases and a general approach to detecting abnormal behavior, but gave no specific algorithm. Fredrik Valeur et al. proposed a machine-learning-based method for detecting SQL attacks, but it addresses only network-based back-end databases.

Traditional methods for detecting abnormal user behavior focus mainly on databases and on cluster performance anomalies. The database methods generally target relational databases and do not carry over to the distributed environment of a Hadoop cluster, while cluster performance anomalies are not pronounced under the Hadoop platform's own mechanisms such as load balancing, so the resulting detections are not very accurate. In addition, data volumes in a Hadoop cluster are usually very large, and model training based on traditional principal component analysis is relatively inefficient.

The method of the present invention therefore establishes effective monitoring of users' data access behavior, detects abnormal behavior promptly, and protects the data security of the Hadoop cluster; it also improves model training efficiency by parallelizing the principal component analysis algorithm, addressing the low training efficiency of the traditional model.

Summary of the invention

The purpose of the present invention is to overcome the shortcomings of the prior art and provide a user behavior anomaly detection method under a Hadoop cluster that not only solves the problem of monitoring abnormal behavior when users access HDFS data in a Hadoop cluster, but also parallelizes the traditional principal component analysis algorithm to address low model training efficiency.

This purpose is achieved by the following technical solution: a user behavior anomaly detection method under a Hadoop cluster, comprising the following steps:

S1: User behavior data collection: obtain the HDFS audit log from the cluster's NameNode through the Hadoop log management service (Log4j) and store it in a database;

S2: Data preprocessing;

S3: Model training: take part of one user's feature vector set as training data and arrange it as a sample data matrix; reduce the dimensionality of the sample data with the parallel principal component analysis algorithm proposed by the invention to obtain the sample mean and the transformation matrix, and store them in that user's model library. Models for other users are trained in the same way. The transformation matrix maps a sample from the original space to the principal component subspace;

S4: User behavior anomaly detection: for a given user, match the user's current behavior pattern (feature vector) against the historical behavior pattern obtained by training that user's model; if they do not match, the behavior is abnormal.

The user behavior data collection uses the Hadoop log management service. Hadoop integrates Apache's open-source Log4j project by default; through the Log4j logging service, the HDFS audit log is obtained from the cluster's NameNode and stored in a database.

The user behavior data are the audit records produced when a user accesses HDFS. Each record includes the access date and time, the user identifier, the file operation command, and the client IP address.
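To make the four record fields concrete, the following is a minimal sketch of extracting them from one audit log line. The line layout (a Log4j timestamp prefix followed by `ugi=`, `ip=`, and `cmd=` key/value pairs) is an assumption about how the NameNode audit log is configured, and `AuditRecord`, `parse_audit_line`, and `SAMPLE` are hypothetical names that do not appear in the patent.

```python
import re
from datetime import datetime
from typing import NamedTuple, Optional


class AuditRecord(NamedTuple):
    """The four fields the method relies on (hypothetical container)."""
    timestamp: datetime
    user: str
    ip: str
    cmd: str


# Assumed layout: "<date> <time>,<ms> ... ugi=<user> ... ip=/<addr> ... cmd=<command> ..."
_AUDIT_RE = re.compile(
    r"^(?P<ts>\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3}\s+.*?"
    r"ugi=(?P<user>\S+).*?"
    r"ip=/(?P<ip>[0-9a-fA-F.:]+).*?"
    r"cmd=(?P<cmd>\S+)"
)


def parse_audit_line(line: str) -> Optional[AuditRecord]:
    """Return the parsed record, or None if the line does not match the assumed layout."""
    m = _AUDIT_RE.search(line)
    if m is None:
        return None
    return AuditRecord(
        timestamp=datetime.strptime(m.group("ts"), "%Y-%m-%d %H:%M:%S"),
        user=m.group("user"),
        ip=m.group("ip"),
        cmd=m.group("cmd"),
    )


# Illustrative line only; not taken from the patent or from a real cluster.
SAMPLE = (
    "2017-05-26 10:15:30,123 INFO FSNamesystem.audit: allowed=true\t"
    "ugi=alice (auth:SIMPLE)\tip=/10.0.0.5\tcmd=open\t"
    "src=/data/a.txt\tdst=null\tperm=null"
)
record = parse_audit_line(SAMPLE)
```

Lines that do not match are returned as `None` rather than raising, so a collector could skip unrelated log entries before writing records to the database.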

The data preprocessing comprises the following steps:

S21: Extract and count: read the audit records from the database and, for each user's audit records, count the number of occurrences of each file operation command within a time window;

S22: Form the feature vector.

The feature vector is constructed from frequency attributes (command counts) and is written $x = (x_1, x_2, \dots, x_{13})$. There are 13 kinds of file operation command, and the value of each dimension is the number of times one kind of command occurs within the time window; the 13 dimensions correspond to the number of HDFS file operation command types. Repeating this over successive windows yields a feature vector set, which can serve both as model training data and as test data.
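The counting in S21 and S22 can be sketched as follows. The patent states that there are 13 command types but does not list them, so the `COMMANDS` vocabulary below is a hypothetical stand-in, as are the 10-minute window length and the function name `build_feature_vectors`.

```python
from collections import Counter, defaultdict
from datetime import datetime, timedelta

# Hypothetical vocabulary: the source gives the count (13) but not the names.
COMMANDS = [
    "open", "create", "delete", "rename", "mkdirs", "listStatus",
    "getfileinfo", "append", "setPermission", "setOwner",
    "setReplication", "setTimes", "contentSummary",
]
INDEX = {c: i for i, c in enumerate(COMMANDS)}


def build_feature_vectors(records, window=timedelta(minutes=10)):
    """Count commands per user and per time window.

    records: iterable of (timestamp, user, cmd) tuples.
    Returns {user: [13-dimensional count vector, ...]} ordered by window start.
    """
    epoch = datetime(1970, 1, 1)
    counts = defaultdict(Counter)  # (user, window index) -> command counts
    for ts, user, cmd in records:
        if cmd in INDEX:  # commands outside the vocabulary are ignored
            bucket = (ts - epoch) // window
            counts[(user, bucket)][cmd] += 1
    vectors = defaultdict(list)
    for user, bucket in sorted(counts):
        c = counts[(user, bucket)]
        vectors[user].append([c[cmd] for cmd in COMMANDS])
    return dict(vectors)


demo = build_feature_vectors([
    (datetime(2017, 5, 26, 10, 1), "alice", "open"),
    (datetime(2017, 5, 26, 10, 2), "alice", "open"),
    (datetime(2017, 5, 26, 10, 3), "alice", "delete"),
    (datetime(2017, 5, 26, 10, 12), "alice", "mkdirs"),
    (datetime(2017, 5, 26, 10, 1), "bob", "create"),
])
```

Windows with no activity produce no vector in this sketch; whether empty windows should instead contribute an all-zero vector is a design choice the source does not address.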

The model training comprises the following sub-steps:

S31: Construct the sample data matrix from the extracted model training data;

S32: Compute the sample mean and covariance matrix with the parallelized principal component analysis: split the sample matrix horizontally into N blocks and obtain the sample mean and covariance matrix with the MapReduce computing model;

S33: Compute the eigenvalues of the covariance matrix and the corresponding eigenvectors, and determine the number of principal components k from the variance contribution rate;

S34: Construct the transformation matrix from the eigenvectors corresponding to the k largest eigenvalues; the product of the sample matrix and the transformation matrix is the principal component matrix;

S35: Store the resulting sample mean and transformation matrix in the model database for use in anomaly detection.
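The block-wise computation in S32 can be illustrated with a small in-process sketch. It is not a Hadoop job: `map_block` and `reduce_stats` only mimic what map and reduce tasks might emit and combine, all names are hypothetical, and the use of the unbiased divisor (n - 1) is an assumption because the source does not say which divisor it uses.

```python
from functools import reduce

import numpy as np


def map_block(block):
    """Map step: partial statistics for one horizontal block (rows are samples)."""
    return block.shape[0], block.sum(axis=0), block.T @ block


def reduce_stats(a, b):
    """Reduce step: partial counts, sums, and cross-product sums simply add up."""
    return a[0] + b[0], a[1] + b[1], a[2] + b[2]


def parallel_mean_cov(X, n_blocks):
    """Sample mean and covariance of X assembled from per-block statistics."""
    blocks = np.array_split(X, n_blocks, axis=0)
    n, total, cross = reduce(reduce_stats, map(map_block, blocks))
    mean = total / n
    # sum_i (x_i - mean)(x_i - mean)^T = sum_i x_i x_i^T - n * mean mean^T
    cov = (cross - n * np.outer(mean, mean)) / (n - 1)
    return mean, cov


rng = np.random.default_rng(0)
X_demo = rng.poisson(5, size=(200, 13)).astype(float)  # synthetic count data
mean_demo, cov_demo = parallel_mean_cov(X_demo, n_blocks=4)
```

Because each block contributes only a count, a 13-element sum, and a 13 x 13 matrix, the amount of data passed to the reduce step does not grow with the number of samples, which is the property that makes the horizontal split attractive.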

The user behavior anomaly detection comprises the following sub-steps:

S41: For a given user, take one feature vector from the test data and mean-adjust it (that is, subtract the stored sample mean);

S42: Compute the Euclidean distance between the mean-adjusted vector and its principal component reconstruction; if the distance exceeds a preset threshold, the behavior is abnormal; otherwise it is normal.

The principal component reconstruction of the mean-adjusted vector is obtained by mapping the mean-adjusted vector into the principal component subspace with the transformation matrix obtained in training, and then using the transpose of the transformation matrix to map the projected vector back to the original space.
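A minimal sketch of S41 and S42 follows, assuming the transformation matrix `W` has shape (d, k) with orthonormal columns and that samples are treated as row vectors, which matches the statement that the sample matrix multiplied by the transformation matrix gives the principal components. The function names and the toy model values are hypothetical.

```python
import numpy as np


def reconstruction_error(x, mean, W):
    """Euclidean distance between the mean-adjusted vector and its reconstruction."""
    z = np.asarray(x, dtype=float) - mean  # S41: mean adjustment
    y = z @ W                              # map into the principal component subspace
    z_hat = y @ W.T                        # map back with the transpose
    return float(np.linalg.norm(z - z_hat))


def is_anomalous(x, mean, W, threshold):
    """S42: flag the behavior when the distance exceeds the preset threshold."""
    return reconstruction_error(x, mean, W) > threshold


# Toy model: the first two axes of a 3-dimensional space are the principal components.
toy_mean = np.zeros(3)
toy_W = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
err_inside = reconstruction_error([1.0, 2.0, 0.0], toy_mean, toy_W)
err_outside = reconstruction_error([1.0, 2.0, 3.0], toy_mean, toy_W)
```

A vector lying in the subspace learned from the user's history reconstructs almost exactly, so its distance is near zero; behavior with a component outside that subspace leaves a residual, and the threshold decides how large a residual is tolerated. The source does not say how the threshold is chosen.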

The detection method is tested in two settings:

(1) To measure the false detection rate, part of one user's feature vector data is used as training data and the remainder as test data;

(2) To measure the detection rate, part of one user's feature vector data is used as training data and part of other users' data as test data.
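Both rates reduce to the same calculation on different test sets, as the sketch below shows. It assumes the reconstruction distances have already been computed for each test vector; `flag_rate` and the sample numbers are hypothetical and are not results from the patent.

```python
def flag_rate(distances, threshold):
    """Fraction of test vectors whose reconstruction distance exceeds the threshold."""
    distances = list(distances)
    if not distances:
        raise ValueError("at least one test vector is required")
    return sum(d > threshold for d in distances) / len(distances)


# Setting (1): held-out vectors from the same user; every flag is a false detection.
same_user_distances = [0.1, 0.2, 5.0, 0.3]
false_detection_rate = flag_rate(same_user_distances, threshold=1.0)

# Setting (2): vectors from other users scored against this user's model;
# every flag is a correct detection.
other_user_distances = [4.0, 6.0, 0.5, 9.0]
detection_rate = flag_rate(other_user_distances, threshold=1.0)
```

Raising the threshold lowers both rates together, so the two settings are best evaluated at the same threshold.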

The beneficial effects of the invention are as follows: it provides an effective and accurate method for detecting abnormal HDFS file data access behavior in a Hadoop cluster, overcomes the inapplicability of traditional anomaly detection methods in a Hadoop cluster environment, and parallelizes the principal component analysis algorithm it uses, which improves the efficiency of model training.

Description of the drawings

Fig. 1 is a flowchart of the invention;

Fig. 2 is a flowchart of the model training of the invention;

Fig. 3 is a flowchart of the user behavior anomaly detection of the invention;

Fig. 4 is a diagram of the parallelized principal component analysis process of the invention.

Detailed description

The technical solution of the invention is described in further detail below with reference to the drawings, but the scope of protection of the invention is not limited to what follows.

As shown in Fig. 1, a user behavior anomaly detection method under a Hadoop cluster comprises the following steps:

S1: User behavior data collection. Hadoop integrates Apache's open-source Log4j project by default; through the Log4j logging service, the HDFS audit log is obtained from the cluster's NameNode and stored in a database;

S2: Data preprocessing. Read the audit records from the database and, for each user's audit records, count the number of occurrences of each file operation command within a time window, then combine the counts into a feature vector written $x = (x_1, x_2, \dots, x_{13})$. There are 13 kinds of file operation command, and the value of each dimension is the number of times one kind of command occurs within the time window. Repeating this over successive windows yields a feature vector set, that is, the patterns to be detected. The feature vector set can serve as both model training data and test data;

S3: Model training: take part of one user's feature vector set as training data and arrange it as a sample data matrix; reduce the dimensionality of the sample data with the parallel principal component analysis algorithm proposed by the invention to obtain the sample mean and the transformation matrix, and store them in that user's model library. Models for other users are trained in the same way. The transformation matrix maps a sample from the original space to the principal component subspace;

S4: User behavior anomaly detection: for a given user, match the user's current behavior pattern (feature vector) against the historical behavior pattern obtained by training that user's model; if they do not match, the behavior is abnormal.

As shown in Fig. 2, the model training steps are:

S31: Construct the sample data matrix from the extracted model training data (the feature vector set);

S32: As shown in Fig. 4, compute the sample mean and covariance matrix with the parallelized principal component analysis: split the sample matrix horizontally into N blocks and obtain the sample mean and covariance matrix with the MapReduce computing model. Part of one user's feature vector set is taken as training data and arranged as a sample data matrix; the dimensionality of the sample data is reduced with the parallel principal component analysis algorithm proposed by the invention to obtain the sample mean and the transformation matrix, which are stored in that user's model library. Models for other users are trained in the same way. The transformation matrix maps a sample from the original space to the principal component subspace;

The parallelized principal component analysis is formulated as follows: the feature vector matrix $X_i = [X_1, X_2, \dots, X_{13}]^{T}$ is obtained, and the mean and covariance matrix of $X$ are denoted $\mu = E(X)$ and $\Sigma = D(X)$, respectively.

S33: Compute the eigenvalues of the covariance matrix and the corresponding eigenvectors, and determine the number of principal components k from the variance contribution rate;

S34: Construct the transformation matrix from the eigenvectors corresponding to the k largest eigenvalues; the product of the sample matrix and the transformation matrix is the principal component matrix;

S35: Obtain the principal component matrix from the transformation matrix, and store the resulting sample mean and transformation matrix in the model database for use in anomaly detection.
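Steps S33 and S34 can be sketched as below, assuming the covariance matrix is already available. The cumulative contribution target of 0.85 is an illustrative assumption because the source gives no value, and `fit_transform_matrix` is a hypothetical name.

```python
import numpy as np


def fit_transform_matrix(cov, contribution=0.85):
    """Pick the smallest k whose cumulative variance contribution reaches the target.

    Returns (W, k), where W has shape (d, k) and its columns are the
    eigenvectors belonging to the k largest eigenvalues.
    """
    eigvals, eigvecs = np.linalg.eigh(cov)        # symmetric input, ascending order
    order = np.argsort(eigvals)[::-1]             # re-sort in descending order
    eigvals = np.clip(eigvals[order], 0.0, None)  # guard against tiny negative values
    eigvecs = eigvecs[:, order]
    ratio = np.cumsum(eigvals) / eigvals.sum()    # cumulative contribution rate
    k = int(np.searchsorted(ratio, contribution)) + 1
    k = min(k, len(eigvals))                      # rounding may leave ratio[-1] just below 1
    return eigvecs[:, :k], k


cov_toy = np.diag([5.0, 3.0, 1.0, 1.0])  # cumulative contributions: 0.5, 0.8, 0.9, 1.0
W_toy, k_toy = fit_transform_matrix(cov_toy, contribution=0.85)
```

With a sample matrix `X` whose rows are mean-adjusted feature vectors, `X @ W_toy` would then be the principal component matrix mentioned in S34.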

As shown in Fig. 3, user behavior anomaly detection matches, for a given user, the user's current behavior pattern (feature vector) against the historical behavior pattern obtained by training that user's model; if they do not match, the behavior is abnormal. The specific steps are as follows:

S41: Take the current user's behavior feature vectors as test data;

S42: Under the MapReduce framework, mean-adjust the current user's behavior feature vector to obtain the data to be detected;

S43: Compute the distance between the feature vector to be detected and its principal component reconstruction;

S44: Threshold test: if the distance exceeds the threshold, classify the current user behavior as an abnormal behavior record; otherwise classify it as normal behavior;

S45: Check for remaining test data: if test data remain, repeat from the mean adjustment with the next vector; when none remain, end the test.

The above describes only preferred embodiments of the invention. It should be understood that the invention is not limited to the forms disclosed herein, which should not be regarded as excluding other embodiments; the invention can be used in various other combinations, modifications, and environments, and can be altered within the scope of the concept described herein through the above teachings or the skill or knowledge of the relevant art. Changes and variations made by those skilled in the art that do not depart from the spirit and scope of the invention shall all fall within the scope of protection of the appended claims.

Claims (10)

1. A user behavior anomaly detection method under a Hadoop cluster, characterized in that it comprises the following steps:
S1: user behavior data collection, the user behavior data comprising audit records of users accessing the HDFS of the Hadoop cluster;
S2: data preprocessing: for each user's audit records, count the user's behavior features within a time window to form a feature vector, and apply this in turn to different users and different time windows to obtain a feature vector set covering multiple users and their behavior features in different periods;
S3: model training: for each user, take part of the user's feature vector set as training data and arrange it as a sample data matrix, and reduce the dimensionality of the sample data to obtain a sample mean and a transformation matrix, the transformation matrix mapping samples from the original space to the principal component subspace;
S4: user behavior anomaly detection.

2. The user behavior anomaly detection method under a Hadoop cluster according to claim 1, characterized in that the audit records include the access date and time, the user identifier, the file operation command, and the client IP address, and are obtained from the cluster's NameNode through the Hadoop log management service.

3. The user behavior anomaly detection method under a Hadoop cluster according to claim 1, characterized in that the user behavior features include the number of occurrences of each kind of file operation command among the user's log records within the time window.

4. The user behavior anomaly detection method under a Hadoop cluster according to claim 4, characterized in that the feature vector formed from the user behavior features is expressed as $x = (x_1, x_2, \dots, x_n)$, where n is the total number of file operation commands, and the value of each dimension of the feature vector is the number of times one kind of file operation command occurs within the time window.

5. The user behavior anomaly detection method under a Hadoop cluster according to claim 1, characterized in that the dimensionality reduction of the sample data comprises the following sub-steps:
S21: extract and count: read the audit records from the database and, for each user's audit records, count the number of occurrences of each file operation command within a time window;
S22: form the feature vector.

6. The user behavior anomaly detection method under a Hadoop cluster according to claim 5, characterized in that the feature vector is constructed from frequency attributes; there are 13 kinds of file operation command, and the value of each dimension is the number of times one kind of file operation command occurs within the time window, the 13 dimensions corresponding to the number of HDFS file operation command types; the feature vector set can serve both as model training data and as test data.

7. The user behavior anomaly detection method under a Hadoop cluster according to claim 1, characterized in that the model training comprises the following sub-steps:
S31: construct the sample data matrix from the extracted model training data;
S32: compute the sample mean and covariance matrix with the parallelized principal component analysis: split the sample matrix horizontally into N blocks and obtain the sample mean and covariance matrix with the MapReduce computing model;
S33: compute the eigenvalues of the covariance matrix and the corresponding eigenvectors, and determine the number of principal components k from the variance contribution rate;
S34: construct the transformation matrix from the eigenvectors corresponding to the k largest eigenvalues, the product of the sample matrix and the transformation matrix being the principal component matrix;
S35: store the resulting sample mean and transformation matrix in the model database for use in anomaly detection.

8. The user behavior anomaly detection method under a Hadoop cluster according to claim 1, characterized in that the user behavior anomaly detection comprises the following sub-steps:
S41: for a given user, take one feature vector from the test data and mean-adjust it;
S42: compute the Euclidean distance between the mean-adjusted vector and its principal component reconstruction; if the distance exceeds a preset threshold, the behavior is abnormal; otherwise it is normal.

9. The user behavior anomaly detection method under a Hadoop cluster according to claim 6, characterized in that the principal component reconstruction of the mean-adjusted vector is the vector obtained by mapping the mean-adjusted vector into the principal component subspace with the transformation matrix obtained in training, and then using the transpose of the transformation matrix to map the projected vector back to the original space.

10. The user behavior anomaly detection method under a Hadoop cluster according to claim 1, characterized in that the anomaly detection of user behavior is tested in two settings:
(1) to measure the false detection rate of the detection method, part of one user's feature vector data is used as training data and the remainder as test data;
(2) to measure the detection rate of the detection method, part of one user's feature vector data is used as training data and part of other users' data as test data.
CN201710384599.7A 2017-05-26 2017-05-26 A kind of user behavior method for detecting abnormality under Hadoop clusters Pending CN107222472A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710384599.7A CN107222472A (en) 2017-05-26 2017-05-26 A kind of user behavior method for detecting abnormality under Hadoop clusters

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201710384599.7A CN107222472A (en) 2017-05-26 2017-05-26 A kind of user behavior method for detecting abnormality under Hadoop clusters

Publications (1)

Publication Number Publication Date
CN107222472A (en) 2017-09-29

Family

ID=59945516

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710384599.7A Pending CN107222472A (en) 2017-05-26 2017-05-26 A kind of user behavior method for detecting abnormality under Hadoop clusters

Country Status (1)

Country Link
CN (1) CN107222472A (en)

Cited By (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108040052A (en) * 2017-12-13 2018-05-15 北京明朝万达科技股份有限公司 A kind of network security threats analysis method and system based on Netflow daily record datas
CN108173818A (en) * 2017-12-13 2018-06-15 北京明朝万达科技股份有限公司 A kind of network security threats analysis method and system based on Proxy daily record datas
CN108399700A (en) * 2018-01-31 2018-08-14 上海乐愚智能科技有限公司 Theft preventing method and smart machine
CN108596738A (en) * 2018-05-08 2018-09-28 新华三信息安全技术有限公司 A kind of user behavior detection method and device
CN108881194A (en) * 2018-06-07 2018-11-23 郑州信大先进技术研究院 Enterprises user anomaly detection method and device
CN109033889A (en) * 2018-08-13 2018-12-18 杭州安恒信息技术股份有限公司 A kind of invasive biology method, apparatus and intelligent terminal based on space-time collision
CN109657803A (en) * 2018-03-23 2019-04-19 新华三大数据技术有限公司 The building of machine learning model
CN109688166A (en) * 2019-02-28 2019-04-26 新华三信息安全技术有限公司 A kind of exception outgoing behavioral value method and device
CN110427971A (en) * 2019-07-05 2019-11-08 五八有限公司 Recognition methods, device, server and the storage medium of user and IP
CN110830450A (en) * 2019-10-18 2020-02-21 平安科技(深圳)有限公司 Abnormal flow monitoring method, device and equipment based on statistics and storage medium
CN111163097A (en) * 2019-12-31 2020-05-15 新浪网技术(中国)有限公司 Web application firewall implementation system and method
CN112306835A (en) * 2020-11-02 2021-02-02 平安科技(深圳)有限公司 User data monitoring and analyzing method, device, equipment and medium
CN112579728A (en) * 2020-12-18 2021-03-30 成都民航西南凯亚有限责任公司 Behavior abnormity identification method and device based on mass data full-text retrieval
CN113011476A (en) * 2021-03-05 2021-06-22 桂林电子科技大学 User behavior safety detection method based on self-adaptive sliding window GAN
CN113821794A (en) * 2021-09-14 2021-12-21 北京八分量信息科技有限公司 Distributed trusted computing system and method
CN117834299A (en) * 2024-03-04 2024-04-05 福建银数信息技术有限公司 A network security intelligent supervision and management method and system
EP3918500B1 (en) * 2019-03-05 2024-04-24 Siemens Industry Software Inc. Machine learning-based anomaly detections for embedded software applications
CN119513929A (en) * 2024-11-07 2025-02-25 南京理工大学 User security behavior analysis method based on multi-data fusion

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20150227809A1 (en) * 2014-02-12 2015-08-13 International Business Machines Corporation Anomaly detection in medical imagery
CN105024877A (en) * 2015-06-01 2015-11-04 北京理工大学 A Hadoop Malicious Node Detection System Based on Network Behavior Analysis
CN106101116A (en) * 2016-06-29 2016-11-09 东北大学 A kind of user behavior abnormality detection system based on principal component analysis and method

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
侯咏佳等: "主成分分析算法的FPGA实现", 《机电工程》 *
贺婷: "面向Hadoop的云计算平台安全监测技术研究", 《中国优秀硕士学位论文全文数据库(电子期刊) 信息科技辑》 *

Cited By (27)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN108173818A (en) * 2017-12-13 2018-06-15 北京明朝万达科技股份有限公司 A kind of network security threats analysis method and system based on Proxy daily record datas
CN108040052A (en) * 2017-12-13 2018-05-15 北京明朝万达科技股份有限公司 A kind of network security threats analysis method and system based on Netflow daily record datas
CN108399700A (en) * 2018-01-31 2018-08-14 上海乐愚智能科技有限公司 Theft preventing method and smart machine
CN109657803A (en) * 2018-03-23 2019-04-19 新华三大数据技术有限公司 The building of machine learning model
CN109657803B (en) * 2018-03-23 2020-04-03 新华三大数据技术有限公司 Construction of machine learning models
CN108596738A (en) * 2018-05-08 2018-09-28 新华三信息安全技术有限公司 A kind of user behavior detection method and device
CN108881194B (en) * 2018-06-07 2020-12-11 中国人民解放军战略支援部队信息工程大学 Method and device for detecting abnormal behavior of users in enterprise
CN108881194A (en) * 2018-06-07 2018-11-23 郑州信大先进技术研究院 Enterprises user anomaly detection method and device
CN109033889B (en) * 2018-08-13 2020-12-18 杭州安恒信息技术股份有限公司 An intrusion identification method, device and intelligent terminal based on space-time collision
CN109033889A (en) * 2018-08-13 2018-12-18 杭州安恒信息技术股份有限公司 A kind of invasive biology method, apparatus and intelligent terminal based on space-time collision
CN109688166B (en) * 2019-02-28 2021-06-04 新华三信息安全技术有限公司 Abnormal outgoing behavior detection method and device
CN109688166A (en) * 2019-02-28 2019-04-26 新华三信息安全技术有限公司 A kind of exception outgoing behavioral value method and device
EP3918500B1 (en) * 2019-03-05 2024-04-24 Siemens Industry Software Inc. Machine learning-based anomaly detections for embedded software applications
CN110427971A (en) * 2019-07-05 2019-11-08 五八有限公司 Recognition methods, device, server and the storage medium of user and IP
CN110830450A (en) * 2019-10-18 2020-02-21 平安科技(深圳)有限公司 Abnormal flow monitoring method, device and equipment based on statistics and storage medium
CN111163097A (en) * 2019-12-31 2020-05-15 新浪网技术(中国)有限公司 Web application firewall implementation system and method
CN112306835A (en) * 2020-11-02 2021-02-02 平安科技(深圳)有限公司 User data monitoring and analyzing method, device, equipment and medium
CN112306835B (en) * 2020-11-02 2024-05-28 平安科技(深圳)有限公司 User data monitoring and analyzing method, device, equipment and medium
WO2022088632A1 (en) * 2020-11-02 2022-05-05 平安科技(深圳)有限公司 User data monitoring and analysis method, apparatus, device, and medium
CN112579728A (en) * 2020-12-18 2021-03-30 成都民航西南凯亚有限责任公司 Behavior abnormity identification method and device based on mass data full-text retrieval
CN113011476A (en) * 2021-03-05 2021-06-22 桂林电子科技大学 User behavior safety detection method based on self-adaptive sliding window GAN
CN113011476B (en) * 2021-03-05 2022-11-11 桂林电子科技大学 User behavior security detection method based on adaptive sliding window GAN
CN113821794B (en) * 2021-09-14 2023-08-18 北京八分量信息科技有限公司 Distributed trusted computing system and method
CN113821794A (en) * 2021-09-14 2021-12-21 北京八分量信息科技有限公司 Distributed trusted computing system and method
CN117834299A (en) * 2024-03-04 2024-04-05 福建银数信息技术有限公司 A network security intelligent supervision and management method and system
CN119513929A (en) * 2024-11-07 2025-02-25 南京理工大学 User security behavior analysis method based on multi-data fusion
CN119513929B (en) * 2024-11-07 2025-07-29 南京理工大学 User safety behavior analysis method based on multi-data fusion

Similar Documents

Publication Publication Date Title
CN107222472A (en) A kind of user behavior method for detecting abnormality under Hadoop clusters
Zou et al. A docker container anomaly monitoring system based on optimized isolation forest
CN119254489B (en) Information network security self-defense method and system based on trusted computing
JP2022512192A (en) Systems and methods for behavioral threat detection
JP2022512195A (en) Systems and methods for behavioral threat detection
US20200177633A1 (en) Cluster detection and elimination in security environments
CN106101116A9 (en) A kind of user behavior abnormality detection system and method based on principal component analysis
US11595416B2 (en) Method, product, and system for maintaining an ensemble of hierarchical machine learning models for detection of security risks and breaches in a network
CN113132311B (en) Abnormal access detection method, device and equipment
Pundir et al. RanStop: A hardware-assisted runtime crypto-ransomware detection technique
CN107403091A (en) A kind of combination is traced to the source path and the system for real-time intrusion detection of figure of tracing to the source
Yasarathna et al. Anomaly detection in cloud network data
Sallam et al. Result-based detection of insider threats to relational databases
Roschke et al. A flexible and efficient alert correlation platform for distributed ids
JP2022512194A (en) Systems and methods for behavioral threat detection
Sallam et al. Detection of temporal data Ex-filtration threats to relational databases
Ren et al. Application of network intrusion detection based on fuzzy c-means clustering algorithm
Sun et al. LogPal: A generic anomaly detection scheme of heterogeneous logs for network systems
CN117527376A (en) Method for identifying whether active account number in application has vertical override based on flow data
Sapegin et al. Evaluation of in‐memory storage engine for machine learning analysis of security events
CN117591477A (en) A log aggregation query method for massive data
CN117034285A (en) Method, device, equipment and medium for detecting security threat of power system
Liu et al. A web back-end database leakage incident reconstruction framework over unlabeled logs
Zhang Design of Network Intrusion Detection System Based on Data Mining
Hu et al. Anomaly Detection in Network Access-Using LSTM and Encoder-Enhanced Generative Adversarial Networks

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication

Application publication date: 20170929