CN106650446A - Identification method and system of malicious program behavior, based on system call

Info

Publication number: CN106650446A
Application number: CN201611221989.4A
Authority: CN
Inventors: 崔宝江; 王崇; 董国伟; 邵帅
Original assignee: Beijing University of Posts and Telecommunications; China Information Technology Security Evaluation Center
Current assignee: Beijing University of Posts and Telecommunications; China Information Technology Security Evaluation Center
Priority date: 2016-12-26
Filing date: 2016-12-26
Publication date: 2017-05-10

Abstract

The present invention provides a method and system for identifying malicious program behavior based on system calls, and relates to the technical field of software behavior identification and analysis. The method includes: acquiring feature samples and preprocessing them to obtain system call information; classifying the system call information and constructing a triple model; quantifying the elements of the triple model, obtaining the similarity of system calls from the quantified elements, and obtaining the similarity of system call sequences from the similarity of system calls; mining and clustering the feature samples to obtain mining results, comparing the mining results with detection results to obtain a comparison probability, and determining the security state of the feature samples according to the comparison probability. The invention uses a binary tool to analyze program behavior and extracts program features from multiple dimensions, which improves the accuracy and efficiency of the system and greatly reduces its security risk.

Description

Method and system for identifying malicious program behavior based on system calls

Technical field

The invention relates to the technical field of software behavior identification and analysis, and in particular to a method and system for identifying malicious program behavior based on system calls.

Background

In recent years, the number of malicious programs has continued to grow, a large number of malicious program samples worldwide await analysis, and information security incidents caused by malicious code have led to huge economic losses. Security researchers and software users urgently want to identify malicious programs accurately and effectively. The core of program behavior identification is extracting features from, and classifying, the types and functional characteristics of a program's behavioral semantic structure. By technical principle, current program behavior identification techniques are divided into static analysis and dynamic analysis.

Static analysis can cover all execution flows, which ensures that the analysis is comprehensive. However, because the statically analyzed code can differ from the code actually executed, and malicious code often adopts self-protection techniques, a large number of execution path branches are produced for analysis, most of them redundant, so the number of program branches that must be analyzed increases sharply. Dynamic analysis has made considerable progress in recent years but still has limitations, chiefly a lack of system call monitoring, which makes kernel-level malicious code difficult to analyze. In addition, dynamic analysis covers only a limited number of execution paths, so the completeness of malicious code analysis is hard to guarantee, which affects the accuracy of analysis and detection.

Summary of the invention

In view of this, the object of the present invention is to provide a method and system for identifying malicious program behavior based on system calls, which use a binary tool to analyze program behavior and extract program features from multiple dimensions, improving the accuracy and efficiency of the system and greatly reducing its security risk.

In a first aspect, an embodiment of the present invention provides a method for identifying malicious program behavior based on system calls, including:

acquiring feature samples, and preprocessing the feature samples to obtain system call information;

classifying the system call information to construct a triple model;

quantifying the elements in the triple model, obtaining the similarity of the system calls according to the quantified elements, and obtaining the similarity of system call sequences according to the similarity of the system calls;

mining and clustering the feature samples to obtain a mining result, and comparing the mining result with a detection result to obtain a comparison probability;

determining the security state of the feature samples according to the comparison probability.

With reference to the first aspect, an embodiment of the present invention provides a first possible implementation of the first aspect, wherein acquiring the feature samples and preprocessing the feature samples to obtain the system call information includes:

loading the feature samples;

performing binary instrumentation on the feature samples to obtain the system call information;

recording the system call information.

With reference to the first aspect, an embodiment of the present invention provides a second possible implementation of the first aspect, wherein classifying the system call information to construct the triple model includes:

dividing the system call information by function to obtain multiple pieces of category information;

extracting the parameter feature information corresponding to each piece of category information;

collecting the timestamps at which the system calls of each category are executed;

constructing the triple model from the category information, the parameter feature information, and the timestamps.

With reference to the second possible implementation of the first aspect, an embodiment of the present invention provides a third possible implementation of the first aspect, wherein quantifying the elements in the triple model includes:

extracting feature information of the semantic part of the system calls;

quantifying the category information, the parameter feature information, and the timestamps in the triple model by means of the feature information of the semantic part.

With reference to the first aspect, an embodiment of the present invention provides a fourth possible implementation of the first aspect, wherein obtaining the similarity of the system calls according to the quantified elements includes:

calculating the similarity of the system calls according to the following formula:

$$\mathrm{Sim}_{\mathrm{function}} = \mathrm{category} \cdot V_{\mathrm{category}} + \mathrm{parameter} \cdot V_{\mathrm{parameter}} + \mathrm{time} \cdot V_{\mathrm{time}}$$

where Sim_function is the similarity of the system calls, category is the category, parameter is the parameter feature, time is the timestamp, and V_category, V_parameter, and V_time are the weights assigned to category, parameter, and time, respectively.

With reference to the first aspect, an embodiment of the present invention provides a fifth possible implementation of the first aspect, wherein mining and clustering the feature samples to obtain the mining result, and comparing the mining result with the detection result to obtain the comparison probability, includes:

performing cluster mining on the number of system calls in each category and on the similarity of the system call sequences, using a clustering algorithm with a set threshold, to obtain the mining result;

selecting from a knowledge database the detection result related to the mining result, comparing the mining result with the detection result to obtain the comparison probability, and storing the mining result in the knowledge database as a detection result.

With reference to the first aspect, an embodiment of the present invention provides a sixth possible implementation of the first aspect, wherein the detection result includes normal samples and malicious samples, and determining the security state of the feature samples according to the comparison probability includes:

obtaining a normal probability when the mining result is compared with the normal samples;

obtaining a malicious probability when the mining result is compared with the malicious samples;

when the normal probability is greater than the malicious probability, the feature sample is in a normal program state;

when the normal probability is less than the malicious probability, the feature sample is in a malicious program state.

In a second aspect, an embodiment of the present invention further provides a system for identifying malicious program behavior based on system calls, including: a feature extraction module, a system call classifier, a comparison module, and a data mining module;

the feature extraction module is connected to the system call classifier and is configured to acquire feature samples and preprocess the feature samples to obtain system call information;

the system call classifier is connected to the comparison module and is configured to classify the system call information and construct a triple model;

the comparison module is connected to the data mining module and is configured to quantify the elements in the triple model, obtain the similarity of the system calls according to the quantified elements, and obtain the similarity of system call sequences according to the similarity of the system calls;

the data mining module is connected to the comparison module and is configured to mine and cluster the feature samples to obtain a mining result, compare the mining result with a detection result to obtain a comparison probability, and determine the security state of the feature samples according to the comparison probability.

With reference to the second aspect, an embodiment of the present invention provides a first possible implementation of the second aspect, wherein the detection result includes normal samples and malicious samples, and the data mining module is further configured to obtain a normal probability when the mining result is compared with the normal samples, and to obtain a malicious probability when the mining result is compared with the malicious samples.

With reference to the second aspect, an embodiment of the present invention provides a first possible implementation of the second aspect, wherein the data mining module is further configured to determine that the feature sample is in a normal program state when the normal probability is greater than the malicious probability, and to determine that the feature sample is in a malicious program state when the normal probability is less than the malicious probability.

Embodiments of the present invention provide a method and system for identifying malicious program behavior based on system calls, including: acquiring feature samples and preprocessing them to obtain system call information; classifying the system call information to construct a triple model; quantifying the elements in the triple model, obtaining the similarity of system calls according to the quantified elements, and obtaining the similarity of system call sequences according to the similarity of system calls; mining and clustering the feature samples to obtain a mining result, comparing the mining result with a detection result to obtain a comparison probability, and determining the security state of the feature samples according to the comparison probability. The invention uses a binary tool to analyze program behavior and extracts program features from multiple dimensions, which improves the accuracy and efficiency of the system and greatly reduces its security risk.

Other features and advantages of the invention will be set forth in the description that follows and will in part be apparent from the description, or may be learned by practicing the invention. The objects and other advantages of the invention are realized and attained by the structure particularly pointed out in the description, the claims, and the accompanying drawings.

To make the above objects, features, and advantages of the present invention easier to understand, preferred embodiments are described in detail below with reference to the accompanying drawings.

Description of drawings

To illustrate the specific embodiments of the present invention or the technical solutions of the prior art more clearly, the drawings needed for describing the specific embodiments or the prior art are briefly introduced below. The drawings described below show some embodiments of the present invention, and a person of ordinary skill in the art can obtain other drawings from them without creative effort.

FIG. 1 is a flowchart of the system call-based malicious program behavior identification method provided by Embodiment 1 of the present invention;

FIG. 2 is a flowchart of step S100 of the system call-based malicious program behavior identification method provided by Embodiment 1 of the present invention;

FIG. 3 is a flowchart of step S110 of the system call-based malicious program behavior identification method provided by Embodiment 1 of the present invention;

FIG. 4 is a flowchart of step S120 of the system call-based malicious program behavior identification method provided by Embodiment 1 of the present invention;

FIG. 5 is a flowchart of step S130 of the system call-based malicious program behavior identification method provided by Embodiment 1 of the present invention;

FIG. 6 is a flowchart of step S140 of the system call-based malicious program behavior identification method provided by Embodiment 1 of the present invention;

FIG. 7 is a schematic structural diagram of the system call-based malicious program behavior identification system provided by Embodiment 2 of the present invention.

Reference numerals:

10-feature extraction module; 20-system call classifier; 30-comparison module; 40-data mining module.

Detailed description

To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions of the present invention are described clearly and completely below with reference to the accompanying drawings. The described embodiments are some, not all, of the embodiments of the present invention. All other embodiments obtained by a person of ordinary skill in the art on the basis of the embodiments of the present invention without creative effort fall within the protection scope of the present invention.

At present, static analysis can cover all execution flows, which ensures that the analysis is comprehensive. However, because the statically analyzed code can differ from the code actually executed, and malicious code often adopts self-protection techniques, a large number of execution path branches are produced for analysis, most of them redundant, so the number of program branches that must be analyzed increases sharply. Dynamic analysis has made considerable progress in recent years but still has limitations, chiefly a lack of system call monitoring, which makes kernel-level malicious code difficult to analyze. In addition, dynamic analysis covers only a limited number of execution paths, so the completeness of malicious code analysis is hard to guarantee, which affects the accuracy of analysis and detection.

On this basis, embodiments of the present invention provide a method and system for identifying malicious program behavior based on system calls.

To facilitate understanding of this embodiment, the system call-based malicious program behavior identification method disclosed in the embodiment of the present invention is first described in detail.

Embodiment 1:

FIG. 1 is a flowchart of the system call-based malicious program behavior identification method provided by Embodiment 1 of the present invention.

Referring to FIG. 1, the system call-based malicious program behavior identification method includes the following steps:

Step S100, acquiring feature samples, and preprocessing the feature samples to obtain system call information;

Step S110, classifying the system call information to construct a triple model;

Step S120, quantifying the elements in the triple model, obtaining the similarity of system calls according to the quantified elements, and obtaining the similarity of system call sequences according to the similarity of system calls;

Step S130, mining and clustering the feature samples to obtain a mining result, and comparing the mining result with a detection result to obtain a comparison probability;

Step S140, determining the security state of the feature samples according to the comparison probability.

Specifically, as shown in FIG. 2, in the system call-based malicious program behavior identification method of the above embodiment, step S100 may be implemented by the following steps:

Step S201, loading the feature samples;

Step S202, performing binary instrumentation on the feature samples to obtain system call information;

Step S203, recording the system call information.

Here, a binary instrumentation tool is used to record the system calls made during dynamic execution of the program, together with their parameters.
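The recording step S203 can be pictured with a minimal Python sketch. The text does not name the instrumentation tool or its output format, so the one-call-per-line log layout (`Name key=value ...`), the `SyscallRecord` type, and the `parse_trace` helper below are all hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class SyscallRecord:
    index: int            # position of the call in the execution trace
    name: str             # system call name as logged by the tool
    args: Dict[str, str]  # argument name -> recorded value


def parse_trace(lines: List[str]) -> List[SyscallRecord]:
    """Parse log lines of the ASSUMED form 'NtCreateFile FileHandle=0x4c Buffer=0x10'."""
    records: List[SyscallRecord] = []
    for line in lines:
        parts = line.split()
        if not parts:
            continue  # skip blank lines in the log
        name, raw_args = parts[0], parts[1:]
        # Keep only well-formed key=value tokens.
        args = dict(item.split("=", 1) for item in raw_args if "=" in item)
        records.append(SyscallRecord(index=len(records), name=name, args=args))
    return records
```

Each record keeps the call's position in the trace, which a later step could reuse as the timestamp element of the triple.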

Specifically, as shown in FIG. 3, in the system call-based malicious program behavior identification method of the above embodiment, step S110 may be implemented by the following steps:

Step S301, dividing the system call information by function to obtain multiple pieces of category information;

Step S302, extracting the parameter feature information corresponding to each piece of category information;

Step S303, collecting the timestamps at which the system calls of each category are executed;

Step S304, constructing the triple model from the category information, the parameter feature information, and the timestamps.

Here, the structure and characteristics of each system call are analyzed, the frequency with which each category of system call occurs is counted, and the system call triple model is established.

The triple model Table<category, parameter, time> includes the following three elements:

category indicates the category to which the system call belongs. All system calls are divided by their function at execution time into five major categories: registry operation, file operation, process control, memory management, and network management. Each category is further divided into subcategories according to rules such as the nature of the operation, as shown in Table 1:

Table 1 Classification of system calls

parameter indicates the parameter information passed to the system call function. The relevant features to be extracted are recorded according to the category, as shown in Table 2:

Table 2 Parameter features extracted from system calls

Category | Extracted parameter features
Registry operation | KeyHandle, ObjectAttributes, SetData, etc.
File operation | FileHandle, Buffer, Routine, etc.
Process control | ProcessHandle, ThreadHandle, AttributeList, etc.
Memory management | BaseAddress, MemoryInformation, etc.
Network management | InputBuffer, OutputBuffer, IP, Port, etc.

time is the timestamp, indicating the position in the execution at which the system call was executed.
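As a rough illustration of the triple Table<category, parameter, time>, the sketch below builds one triple per recorded call. The parameter features follow Table 2; the name-to-category mapping `CATEGORY_OF` is an assumption made for this example (the contents of Table 1 are not reproduced in the text), and `Triple` and `build_triple` are hypothetical names.

```python
from typing import Dict, NamedTuple, Optional

# ASSUMED example mapping from system call name to major category.
CATEGORY_OF = {
    "NtCreateKey": "registry",
    "NtSetValueKey": "registry",
    "NtCreateFile": "file",
    "NtWriteFile": "file",
    "NtOpenProcess": "process",
    "NtAllocateVirtualMemory": "memory",
    "NtDeviceIoControlFile": "network",
}

# Parameter features to extract per category, taken from Table 2.
FEATURES_OF = {
    "registry": ("KeyHandle", "ObjectAttributes", "SetData"),
    "file": ("FileHandle", "Buffer", "Routine"),
    "process": ("ProcessHandle", "ThreadHandle", "AttributeList"),
    "memory": ("BaseAddress", "MemoryInformation"),
    "network": ("InputBuffer", "OutputBuffer", "IP", "Port"),
}


class Triple(NamedTuple):
    category: str
    parameter: Dict[str, str]
    time: int  # position of the call in the execution trace


def build_triple(name: str, args: Dict[str, str], position: int) -> Optional[Triple]:
    """Return the triple for one call, or None if the call is not categorized."""
    category = CATEGORY_OF.get(name)
    if category is None:
        return None
    wanted = FEATURES_OF[category]
    parameter = {k: v for k, v in args.items() if k in wanted}
    return Triple(category, parameter, position)
```

For instance, `build_triple("NtWriteFile", {"FileHandle": "0x4c", "Length": "16"}, 3)` keeps only `FileHandle`, because `Length` is not among the file-operation features listed in Table 2.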

Specifically, as shown in FIG. 4, in the system call-based malicious program behavior identification method of the above embodiment, quantifying the elements in the triple model in step S120 may be implemented by the following steps:

Step S401, extracting feature information of the semantic part of the system calls;

Step S402, quantifying the category information, the parameter feature information, and the timestamps in the triple model by means of the feature information of the semantic part.

Specifically, in the system call-based malicious program behavior identification method of the above embodiment, obtaining the similarity of the system calls according to the quantified elements in step S120 may be implemented as follows:

The similarity of the system calls is calculated according to formula (1):

$$\mathrm{Sim}_{\mathrm{function}} = \mathrm{category} \cdot V_{\mathrm{category}} + \mathrm{parameter} \cdot V_{\mathrm{parameter}} + \mathrm{time} \cdot V_{\mathrm{time}} \tag{1}$$

where Sim_function is the similarity of the system calls, category is the category, parameter is the parameter feature, time is the timestamp, and V_category, V_parameter, and V_time are the weights assigned to category, parameter, and time, respectively.

Here, the above result is normalized before being output.
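Formula (1) can be sketched in Python as a weighted sum. The text does not specify how each element is quantified, so the three per-element scores below (exact category match, Jaccard overlap of parameter features, and closeness of trace positions) are assumptions, as are the default weights and the `trace_len` scale; dividing by the sum of the weights is one possible normalization.

```python
def sim_function(a, b, weights=(0.5, 0.3, 0.2), trace_len=100):
    """Weighted similarity of two calls given as dicts with keys
    'category', 'parameter' (dict), and 'time' (trace position)."""
    v_category, v_parameter, v_time = weights

    # ASSUMED quantification: 1 if the categories match, else 0.
    category = 1.0 if a["category"] == b["category"] else 0.0

    # ASSUMED quantification: Jaccard overlap of (feature, value) pairs.
    pa, pb = set(a["parameter"].items()), set(b["parameter"].items())
    union = pa | pb
    parameter = len(pa & pb) / len(union) if union else 1.0

    # ASSUMED quantification: closer trace positions score higher.
    time = 1.0 - min(abs(a["time"] - b["time"]), trace_len) / trace_len

    score = category * v_category + parameter * v_parameter + time * v_time
    # Normalize to [0, 1] by the total weight.
    return score / (v_category + v_parameter + v_time)
```

With these choices, two identical calls score 1 and two calls that differ in every element score 0; other quantification schemes would fit formula (1) equally well.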

Further, obtaining the similarity of the system call sequences according to the similarity of the system calls includes:

The similarity of the system call sequences is calculated according to formula (2):

$$\mathrm{Sim}_{\mathrm{sequence}}[i][j] = \max\bigl(\mathrm{Sim}[i-1][j],\ \mathrm{Sim}[i][j-1],\ \mathrm{Sim}[i-1][j-1] + \mathrm{Sim}_{\mathrm{function}}[i][j]\bigr) \tag{2}$$

where Sim_sequence[i][j] denotes the similarity of the two system call sequences, the index i covering the first i system calls of one sequence and the index j covering the first j system calls of the other. The final output is Sim_sequence[len_a][len_b], where len_a and len_b are the lengths of the two system call sequences.
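The recurrence in formula (2) has the shape of a longest-common-subsequence dynamic program in which the match reward is the call similarity. The following Python sketch assumes that Sim on the right-hand side refers to the same table as Sim_sequence, that row 0 and column 0 are initialized to zero, and that the indices are 1-based; `sim_sequence` and its `sim_function` argument are illustrative names.

```python
def sim_sequence(seq_a, seq_b, sim_function):
    """Fill the table of formula (2) and return the entry for the full sequences."""
    len_a, len_b = len(seq_a), len(seq_b)
    # table[i][j]: similarity of the first i calls of seq_a and first j calls of seq_b.
    table = [[0.0] * (len_b + 1) for _ in range(len_a + 1)]
    for i in range(1, len_a + 1):
        for j in range(1, len_b + 1):
            table[i][j] = max(
                table[i - 1][j],      # skip the i-th call of seq_a
                table[i][j - 1],      # skip the j-th call of seq_b
                table[i - 1][j - 1] + sim_function(seq_a[i - 1], seq_b[j - 1]),
            )
    return table[len_a][len_b]
```

When `sim_function` returns 1 for equal calls and 0 otherwise, the result reduces to the length of the longest common subsequence, which is a convenient sanity check. The table needs time and memory proportional to len_a times len_b.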

Specifically, as shown in FIG. 5, in the system call-based malicious program behavior identification method of the above embodiment, step S130 may be implemented by the following steps:

Step S501, performing cluster mining on the number of system calls in each category and on the similarity of the system call sequences, using a clustering algorithm with a set threshold, to obtain the mining result;

Step S502, selecting from the knowledge database the detection result related to the mining result, comparing the mining result with the detection result to obtain the comparison probability, and storing the mining result in the knowledge database as a detection result.

Here, to improve the accuracy and mining speed of the system, a clustering algorithm is used: the one or several most representative items in each category of data are selected, a threshold is set, and multiple pieces of data are merged. This reduces the space complexity of the algorithm and improves the efficiency of the system, so the system can support malicious program behavior detection in a big-data setting.
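The text does not name the clustering algorithm. As one possible reading of "select representatives, set a threshold, and merge data", the sketch below uses simple leader-style threshold clustering, which is a technique swapped in for illustration and not necessarily the one intended. The feature vectors (for example, per-category call counts) and the `leader_cluster` name are assumptions.

```python
import math


def leader_cluster(vectors, threshold):
    """Merge each vector into the first cluster whose representative lies
    within `threshold` (Euclidean distance); otherwise start a new cluster."""
    representatives = []  # one representative vector per cluster
    members = []          # indices of the samples merged into each cluster
    for idx, vec in enumerate(vectors):
        for c, rep in enumerate(representatives):
            if math.dist(vec, rep) <= threshold:
                members[c].append(idx)
                break
        else:
            # No representative is close enough: this vector leads a new cluster.
            representatives.append(vec)
            members.append([idx])
    return representatives, members
```

Only the representatives need to be kept for later comparison, which is where the reduction in space mentioned above would come from. Note that `math.dist` requires Python 3.8 or later.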

Further, clustering and mining form a continuously iterative process: mining, clustering, mining, clustering. Data is mined first and the results are then clustered; the clustering results are used to keep improving mining accuracy, and mining results are continually compared with detection results so that the system keeps learning and improves the accuracy of its next judgment.

Specifically, an original mining process is performed first, so that the system initially holds some relevant mining conclusions. As program samples accumulate, an incremental mining process is performed: while the system is running, the detection results produced since the last incremental mining are read, sample knowledge about the programs is read from the prior knowledge base, and the detection results are then merged into the knowledge database, so that the centers in the prior knowledge base shift slightly to fit the latest detection results and the mining algorithm derived from prior knowledge is continually corrected.

The incremental mining algorithms invoked by the system include a Bayesian classification model, a decision tree classification model, and the like, which mine the data incrementally. Finally, the mining results are stored back into the mining library, replacing the original results to form a new incremental mining library, so that the system can identify the new malicious samples and the accuracy of data mining is improved.

For example: the program feature sample A to be tested is mined, and the previously mined result is stored as detection result B in the incremental-mining knowledge database. A detection result A' related to the feature sample is then selected from the database, and feature sample A is compared with detection result A' to obtain a comparison probability. For the next discrimination, A is stored in the database as a detection result.
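The idea that centers in the knowledge base "shift slightly" as new detection results arrive can be illustrated with a running-mean update. This is only a sketch under the assumption that each class (for example "normal" or "malicious") is summarized by one center vector; the `KnowledgeBase` class and its methods are hypothetical, and it does not implement the Bayesian or decision tree models mentioned above.

```python
class KnowledgeBase:
    """Keeps one running-mean center per label (an ASSUMED representation)."""

    def __init__(self):
        self.centers = {}  # label -> (center vector, number of samples seen)

    def add(self, label, vector):
        """Merge one new detection result into the center for `label`."""
        if label not in self.centers:
            self.centers[label] = (list(vector), 1)
            return
        center, n = self.centers[label]
        n += 1
        # Incremental mean: each new sample moves the center by 1/n of the gap.
        center = [c + (x - c) / n for c, x in zip(center, vector)]
        self.centers[label] = (center, n)

    def center(self, label):
        return self.centers[label][0]
```

Because each new sample moves the center by only 1/n of the remaining gap, later samples cause progressively smaller shifts, which matches the "small change" described in the text.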

具体地，如图6所示，上述实施例基于系统调用的恶意程序行为识别方法中，步骤S140可采用如下步骤实现，包括：Specifically, as shown in FIG. 6, in the method for identifying malicious program behavior based on system calls in the above embodiment, step S140 may be implemented by the following steps, including:

步骤S601，当挖掘结果与正常样本进行比对时，得到正常概率；Step S601, when the mining result is compared with the normal sample, the normal probability is obtained;

步骤S602，当挖掘结果与恶意样本进行比对时，得到恶意概率；Step S602, when the mining result is compared with the malicious sample, the malicious probability is obtained;

步骤S603，当正常概率大于恶意概率时，特征样本为正常程序状态；Step S603, when the normal probability is greater than the malicious probability, the feature sample is in a normal program state;

步骤S604，当正常概率小于恶意概率时，特征样本为恶意程序状态。Step S604, when the normal probability is less than the malicious probability, the feature sample is in the state of a malicious program.

The detection results include normal samples and malicious samples. The mining result is compared against each of the two kinds of samples to obtain the normal probability and the malicious probability, and the two probabilities are then compared to determine the security status of the feature sample.
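The decision in steps S603 and S604 can be sketched as a small function. This is an illustration only: the function name `classify_sample` and the returned labels are hypothetical, the two probabilities are assumed to have already been computed in steps S601 and S602, and the tie case is not covered by the description, so the "undetermined" outcome is an assumption.

```python
def classify_sample(normal_probability: float, malicious_probability: float) -> str:
    """Compare the two probabilities and return the security status of the sample."""
    for p in (normal_probability, malicious_probability):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must lie in [0, 1]")
    if normal_probability > malicious_probability:
        return "normal"        # step S603
    if normal_probability < malicious_probability:
        return "malicious"     # step S604
    # Equal probabilities are not addressed in the description (assumption).
    return "undetermined"


print(classify_sample(0.8, 0.3))
print(classify_sample(0.2, 0.7))
```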

The malicious program behavior identification method based on system call classification and sequence comparison provided by this embodiment of the present invention starts from the underlying system calls: it classifies the system calls, counts the number of occurrences of each category, generates the similarity of the system call sequences, and finally relies on the data mining process to secure the system's final result. The method describes a program's behavioral characteristics along multiple dimensions, so it can identify malware behavior more accurately and quickly and improve system security.

Embodiment 2:

Referring to FIG. 7, the system call-based malicious program behavior identification system includes a feature extraction module 10, a system call classifier 20, a comparison module 30, and a data mining module 40;

The feature extraction module 10 is connected to the system call classifier 20 and is configured to acquire feature samples and preprocess the feature samples to obtain system call information;

The system call classifier 20 is connected to the comparison module 30 and is configured to classify the system call information and construct a triple model;

The comparison module 30 is connected to the data mining module 40 and is configured to quantify the elements of the triple model, obtain the similarity of the system calls according to the quantified elements, and obtain the similarity of the system call sequences according to the similarity of the system calls;

The data mining module 40 is connected to the comparison module 30 and is configured to mine and cluster the feature samples to obtain a mining result, compare the mining result with the detection results to obtain a comparison probability, and determine the security status of the feature samples according to the comparison probability.

Further, the detection results include normal samples and malicious samples, and the data mining module 40 is further configured to obtain a normal probability when the mining result is compared with the normal samples, and to obtain a malicious probability when the mining result is compared with the malicious samples.

Further, the data mining module 40 is further configured to determine that the feature sample is in a normal program state when the normal probability is greater than the malicious probability, and to determine that the feature sample is in a malicious program state when the normal probability is less than the malicious probability.

The system call-based malicious program behavior identification system provided by this embodiment of the present invention has the same technical features as the system call-based malicious program behavior identification method provided by the above embodiment, and can therefore solve the same technical problem and achieve the same technical effect.

The computer program product of the system call-based malicious program behavior identification method and system provided by the embodiments of the present invention includes a computer-readable storage medium storing program code. The instructions contained in the program code can be used to execute the method described in the foregoing method embodiment; for the specific implementation, refer to the method embodiment, which is not repeated here.

Those skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the system and apparatus described above may be understood by reference to the corresponding processes in the foregoing method embodiment, which are not repeated here.

In addition, in the description of the embodiments of the present invention, unless otherwise expressly specified and limited, the terms "mounted", "coupled", and "connected" are to be understood broadly: a connection may, for example, be fixed, detachable, or integral; it may be mechanical or electrical; it may be direct or indirect through an intermediary; or it may be an internal communication between two elements. Those of ordinary skill in the art can understand the specific meanings of these terms in the present invention according to the specific circumstances.

If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of the present invention in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to execute all or some of the steps of the methods described in the embodiments of the present invention. The aforementioned storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.

In the description of the present invention, it should be noted that orientations or positional relationships indicated by terms such as "center", "upper", "lower", "left", "right", "vertical", "horizontal", "inner", and "outer" are based on the orientations or positional relationships shown in the drawings. They are used only to facilitate and simplify the description of the present invention, and do not indicate or imply that the apparatus or element referred to must have a specific orientation or be constructed and operated in a specific orientation; they therefore should not be construed as limiting the present invention. In addition, the terms "first", "second", and "third" are used for descriptive purposes only and should not be construed as indicating or implying relative importance.

Finally, it should be noted that the embodiments described above are only specific implementations of the present invention, intended to illustrate its technical solutions rather than to limit them, and the protection scope of the present invention is not limited to them. Although the present invention has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that anyone familiar with the technical field may, within the technical scope disclosed by the present invention, still modify the technical solutions described in the foregoing embodiments, readily conceive of variations, or make equivalent replacements of some of the technical features. Such modifications, variations, or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention, and shall all fall within the protection scope of the present invention. Therefore, the protection scope of the present invention shall be determined by the protection scope of the claims.

Claims

1. A malicious program behavior identification method based on system calls, characterized by comprising:

acquiring feature samples and preprocessing the feature samples to obtain system call information;

classifying the system call information and constructing a triple model;

quantifying the elements of the triple model, obtaining the similarity of the system calls according to the quantified elements, and obtaining the similarity of the system call sequences according to the similarity of the system calls;

mining and clustering the feature samples to obtain a mining result, and comparing the mining result with a detection result to obtain a comparison probability;

determining the security status of the feature samples according to the comparison probability.

2. The malicious program behavior identification method based on system calls according to claim 1, characterized in that acquiring the feature samples and preprocessing the feature samples to obtain the system call information comprises:

loading the feature samples;

performing binary instrumentation on the feature samples to obtain the system call information;

recording the system call information.

3. The malicious program behavior identification method based on system calls according to claim 1, characterized in that classifying the system call information and constructing the triple model comprises:

dividing the system call information by function to obtain multiple pieces of category information;

extracting the parameter feature information corresponding to each piece of category information;

recording the timestamps at which the system calls of each category are executed;

constructing the triple model according to the category information, the parameter feature information, and the timestamps.

4. The malicious program behavior identification method based on system calls according to claim 3, characterized in that quantifying the elements of the triple model comprises:

extracting feature information of the semantic components of the system calls;

quantifying the category information, the parameter feature information, and the timestamps in the triple model by means of the feature information of the semantic components.

5. The malicious program behavior identification method based on system calls according to claim 1, characterized in that obtaining the similarity of the system calls according to the quantified elements comprises:

calculating the similarity of the system calls according to the following formula:

$$Sim_{function} = category \times V_{category} + parameter \times V_{parameter} + time \times V_{time}$$

where $Sim_{function}$ is the similarity of the system calls, $category$ is the category, $parameter$ is the parameter feature, $time$ is the timestamp, and $V_{category}$, $V_{parameter}$, and $V_{time}$ are the weights assigned to $category$, $parameter$, and $time$, respectively.

6. The malicious program behavior identification method based on system calls according to claim 1, characterized in that mining and clustering the feature samples to obtain the mining result, and comparing the mining result with the detection result to obtain the comparison probability, comprises:

performing cluster mining on the number of system calls of each category and the similarity of the system call sequences, using a clustering algorithm and a set threshold, to obtain the mining result;

selecting the detection result related to the mining result from a knowledge database, comparing the mining result with the detection result to obtain the comparison probability, and storing the mining result in the knowledge database as a detection result.

7. The malicious program behavior identification method based on system calls according to claim 1, characterized in that the detection result includes normal samples and malicious samples, and determining the security status of the feature samples according to the comparison probability comprises:

obtaining a normal probability when the mining result is compared with the normal samples;

obtaining a malicious probability when the mining result is compared with the malicious samples;

when the normal probability is greater than the malicious probability, the feature sample is in a normal program state;

when the normal probability is less than the malicious probability, the feature sample is in a malicious program state.

8. A malicious program behavior identification system based on system calls, characterized by comprising a feature extraction module, a system call classifier, a comparison module, and a data mining module;

the feature extraction module is connected to the system call classifier and is configured to acquire feature samples and preprocess the feature samples to obtain system call information;

the system call classifier is connected to the comparison module and is configured to classify the system call information and construct a triple model;

the comparison module is connected to the data mining module and is configured to quantify the elements of the triple model, obtain the similarity of the system calls according to the quantified elements, and obtain the similarity of the system call sequences according to the similarity of the system calls;

the data mining module is connected to the comparison module and is configured to mine and cluster the feature samples to obtain a mining result, compare the mining result with a detection result to obtain a comparison probability, and determine the security status of the feature samples according to the comparison probability.

9. The malicious program behavior identification system based on system calls according to claim 8, characterized in that the detection result includes normal samples and malicious samples, and the data mining module is further configured to obtain a normal probability when the mining result is compared with the normal samples, and to obtain a malicious probability when the mining result is compared with the malicious samples.

10. The malicious program behavior identification system based on system calls according to claim 9, characterized in that the data mining module is further configured to determine that the feature sample is in a normal program state when the normal probability is greater than the malicious probability, and to determine that the feature sample is in a malicious program state when the normal probability is less than the malicious probability.
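For illustration only, the weighted sum recited in claim 5 can be sketched as follows. The sketch assumes that the quantified category, parameter, and time elements are already available as numeric scores; the `Weights` class, the function name `sim_function`, and the example values are hypothetical and do not appear in the claims.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    """Weights V_category, V_parameter, V_time; the values used below are assumed."""
    category: float
    parameter: float
    time: float


def sim_function(category: float, parameter: float, time: float, w: Weights) -> float:
    """Weighted sum of the three quantified elements, following the formula of claim 5."""
    return category * w.category + parameter * w.parameter + time * w.time


# Example with assumed scores and assumed weights.
weights = Weights(category=0.5, parameter=0.3, time=0.2)
print(sim_function(1.0, 0.5, 0.0, weights))
```

How the weights are chosen is not specified in the claims; fixing them so that they sum to 1 keeps the result on the same scale as the three input scores.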