CN118484814B - A slice-level feature-driven method for accurate software vulnerability detection

Info

Publication number: CN118484814B
Application number: CN202410948200.3A
Authority: CN
Inventors: 张晓东; 池剑磊; 郭皓羽; 张志为; 沈玉龙
Original assignee: Xidian University
Current assignee: Shanghai Daoreach Industry Development Co ltd
Priority date: 2024-07-16
Filing date: 2024-07-16
Publication date: 2024-10-11
Anticipated expiration: 2044-07-16
Also published as: CN118484814A

Abstract

The present invention discloses a slice-level feature-driven method for accurate software vulnerability detection, relating to the field of software security technology. The invention proposes a GRU-based statement embedding method and combines it with program slicing to obtain a more complete program feature representation. It further introduces the concept of a vulnerability dictionary and uses a vulnerability detection model to mine and exploit the vulnerability features in program slices, thereby achieving effective detection of function-level and statement-level vulnerabilities in the program under test.

Description

A slice-level feature-driven method for accurate software vulnerability detection

Technical Field

The present invention belongs to the field of software security technology, and specifically relates to a slice-level feature-driven method for accurate software vulnerability detection.

Background Art

In recent years, the global software market has continued to grow; by the end of 2023, global IT spending had reached 4.6 trillion US dollars. In the current wave of digitalization, software vulnerabilities have become a major challenge in the field of digital security. As the digital transformation of devices, systems, and services deepens, software plays an increasingly important role in personal life and business operations. However, the wide use of software also brings security risks that cannot be ignored, software vulnerabilities in particular. As software becomes more diverse and complex, vulnerabilities caused by design and implementation defects have become an unavoidable challenge in engineering practice, and they may pose serious security risks.

The number and complexity of vulnerabilities are growing significantly, which makes the detection and repair of software vulnerabilities both urgent and highly challenging. One effective strategy is to detect potential vulnerabilities comprehensively before application software is deployed. Taking preventive measures in the early stages of the software development life cycle effectively reduces the threat that potential vulnerabilities pose to system security and provides a more solid security guarantee for the healthy development of the digital ecosystem. Software vulnerability detection is not only a key means of resisting security threats but also an important part of ensuring software quality.

Existing software vulnerability detection methods can be divided into static analysis methods, dynamic analysis methods, and deep-learning-based methods. Static analysis methods rely on analysis of the source code but usually have a high false positive rate. Dynamic analysis methods detect vulnerabilities by executing the program and analyzing the dynamic information generated at run time; they usually suffer from low code coverage, which tends to cause a high false negative rate. In addition, both static and dynamic analysis methods depend on expert knowledge, and their detection performance is unsatisfactory when dealing with software vulnerabilities at scale. To solve these problems, researchers have proposed deep-learning-based vulnerability detection methods, which require no manual feature extraction and thus greatly reduce manual effort. Compared with traditional methods, deep-learning-based methods achieve significantly better performance. However, existing deep-learning-based methods still face several challenges: (1) when processing long program sequences, they often truncate the input directly to unify its length, which causes feature loss and poor detection performance; (2) most of them detect vulnerabilities by learning vulnerability features implicitly and fail to fully mine and exploit those features, which limits further improvement of model performance.

Summary of the Invention

In order to solve the above problems in the prior art, the present invention provides a slice-level feature-driven method for accurate software vulnerability detection. The technical problem to be solved by the present invention is addressed by the following technical solution:

The present invention provides a slice-level feature-driven method for accurate software vulnerability detection, comprising:

Step 1: Preprocess the training set to obtain the first statement embedding of each program slice, wherein the training set is an open-source vulnerability dataset with statement-level vulnerability labels;

Step 2: Construct a vulnerability dictionary from the training set;

Step 3: Concatenate the first statement embedding of the program slice with the matching cluster center in the vulnerability dictionary to obtain a first concatenated vector;

Step 4: Input the first concatenated vector into a vulnerability detection model and learn the correlation between the first statement embedding of the program slice and the cluster centers of the vulnerability dictionary, obtaining a trained vulnerability detection model;

Step 5: Preprocess the program under test to obtain the first statement embedding of each program slice under test; based on that first statement embedding and the vulnerability dictionary, use the trained vulnerability detection model to perform vulnerability detection on each program slice under test; and derive the vulnerability prediction result of the program under test from the vulnerability prediction results of its program slices.

Compared with the prior art, the present invention has the following beneficial effects:

The slice-level feature-driven method for accurate software vulnerability detection of the present invention combines statement embedding with program slicing. It performs slicing with vulnerability key points as starting points, extracts the code fragments closely related to a vulnerability, and uses statement embedding to aggregate statement information, obtaining a more complete feature representation. It introduces the concept of a vulnerability dictionary and uses a vulnerability detection model to mine and exploit the vulnerability features in program slices, achieving effective detection of function-level and statement-level vulnerabilities in the program under test.

The above is only an overview of the technical solution of the present invention. So that the technical means of the present invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features, and advantages of the present invention are more apparent, preferred embodiments are described in detail below with reference to the accompanying drawings.

Brief Description of the Drawings

FIG. 1 is a schematic diagram of a slice-level feature-driven method for accurate software vulnerability detection provided by an embodiment of the present invention;

FIG. 2 is a flowchart of a slice-level feature-driven method for accurate software vulnerability detection provided by an embodiment of the present invention;

FIG. 3 is a schematic diagram of GRU statement embedding provided by an embodiment of the present invention;

FIG. 4 is a flowchart of the construction of a vulnerability dictionary provided by an embodiment of the present invention;

FIG. 5 is a diagram of the training process of a vulnerability detection model provided by an embodiment of the present invention;

FIG. 6 is a schematic diagram of the structure of a classifier provided by an embodiment of the present invention.

Detailed Description

In order to further explain the technical means adopted by the present invention to achieve its intended purpose and their effects, a slice-level feature-driven method for accurate software vulnerability detection proposed by the present invention is described in detail below with reference to the accompanying drawings and specific embodiments.

The foregoing and other technical contents, features, and effects of the present invention are presented clearly in the following detailed description of the specific embodiments, taken together with the accompanying drawings. The description of the specific embodiments gives a deeper and more concrete understanding of the technical means adopted by the present invention and of their effects. The accompanying drawings, however, are provided for reference and illustration only and do not limit the technical solution of the present invention.

An embodiment of the present invention provides a slice-level feature-driven method for accurate software vulnerability detection. Please refer to FIG. 1 and FIG. 2: FIG. 1 is a schematic diagram of the method, and FIG. 2 is a flowchart of the method.

As shown in FIG. 2, the method of this embodiment first builds the vulnerability dictionary (vulDict): all vulnerable statements in the training set are passed through the embedding layer to obtain flat vectors, which are then clustered. The embedding layer comprises program slicing, vocabulary embedding (EMB), and two GRU statement embeddings (GRU_stmt and GRU_vulI). During training, after the training set has passed through the embedding layer, the statement embedding of each input program slice is concatenated with the cluster center to which the vulnerable statements of that slice belong in the vulnerability dictionary. The attention mechanism of the Transformer in the vulnerability detection model then learns the correlation between the program slice statement embedding and the cluster centers of the vulnerability dictionary, and a classifier finally produces the slice-level and statement-level predictions for the program slice, yielding the trained vulnerability detection model. In the detection stage, the test set, i.e. the program under test, is passed through the embedding layer to obtain the statement embedding of each program slice under test. This embedding is concatenated in turn with each cluster center in the vulnerability dictionary and input into the trained vulnerability detection model, which finally yields the function-level and statement-level predictions for the program under test.

Further, the method of this embodiment is described in detail with reference to FIG. 1. The method comprises:

Step 1: Preprocess the training set to obtain the first statement embedding of each program slice.

In this embodiment, the training set is an open-source vulnerability dataset with statement-level vulnerability labels, for example the Big-Vul open-source vulnerability dataset, which is currently one of the largest C/C++ vulnerability datasets with statement-level vulnerability labels.

In an optional embodiment, step 1 may include:

Step 1.1: Perform program slicing on the training set to obtain multiple program slices.

Optionally, step 1.1 includes:

Step 1.11: Analyze the programs in the training set with the open-source static analysis tool Joern to obtain the key points of each program.

In this embodiment, the vulnerability-prone syntactic features summarized in SySeVR (a deep-learning-based source code vulnerability detection framework) are used as the key points of a program; they cover sensitive APIs, array operations, integer usage, and pointer usage. Specifically, the key point of a sensitive-API-related vulnerability is the API itself, and the key points of array-related and pointer-related vulnerabilities are the array and pointer statements.

For example, the AST (Abstract Syntax Tree) of the target function is constructed with the open-source static analysis tool Joern and matched against the syntactic features of the above key points; any token that matches can serve as a starting point for constructing a program slice.

Step 1.12: Perform forward slicing and backward slicing with the key point as the starting point, extracting all statements in the program that are related to the starting point, to obtain the corresponding forward slice and backward slice.

In this embodiment, for each starting point, Joern can be used again to generate the PDG (Program Dependence Graph) of the function, and forward slicing and backward slicing are performed to extract all statements related to the starting point. Forward slicing starts from the starting point and follows the data flow and control flow of the program to identify all statements affected by the starting point. Backward slicing treats the starting point as the end point of the program and traces the data flow and control flow in reverse from that end point to find all statements that affect it.
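The forward and backward slicing just described can be sketched as graph reachability. This is a minimal illustration and not the Joern-based procedure of the embodiment: the toy `pdg` dictionary, the use of line numbers as node identifiers, and the merging of data and control dependence into one edge set are all assumptions made for the example.

```python
from collections import deque


def reachable(graph, start):
    """Return every node reachable from `start` by following edges in `graph`."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(nxt)
    return seen


def reverse_edges(graph):
    """Build the graph with every dependence edge flipped."""
    rev = {}
    for src, dsts in graph.items():
        for dst in dsts:
            rev.setdefault(dst, set()).add(src)
    return rev


def forward_slice(pdg, start):
    """Statements affected by `start` (follow dependence edges forward)."""
    return reachable(pdg, start)


def backward_slice(pdg, start):
    """Statements that affect `start` (follow dependence edges backward)."""
    return reachable(reverse_edges(pdg), start)


# Hypothetical PDG: key = statement line, value = lines that depend on it.
pdg = {1: {3}, 2: {3}, 3: {4, 5}, 4: {6}, 5: set(), 6: set()}
print(sorted(forward_slice(pdg, 3)))   # [3, 4, 5, 6]
print(sorted(backward_slice(pdg, 3)))  # [1, 2, 3]
```

Both slices contain the starting point itself, which is why the two results are merged and de-duplicated in the next step.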

Step 1.13: Merge the forward slice and the backward slice, remove duplicate statements, and attach the corresponding statement label to each statement to obtain a program slice. Statement labels comprise a vulnerable-statement label and a non-vulnerable-statement label.

In this embodiment, whether a program slice contains a vulnerability can be determined by checking its statement labels: a program slice that contains a vulnerable statement is given a vulnerable-slice label, and a program slice that contains no vulnerable statement is given a non-vulnerable-slice label.
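As a rough illustration of step 1.13, the sketch below merges two slices, removes duplicates, and derives the slice label from the statement labels. The function name `build_slice`, the 0/1 label encoding, and ordering the merged statements by line number are assumptions for the example; the source does not specify them.

```python
def build_slice(forward, backward, vulnerable_lines):
    """Merge two slices, drop duplicates, and label statements and the slice.

    Labels use 1 for vulnerable and 0 for non-vulnerable (assumed encoding).
    """
    merged = sorted(set(forward) | set(backward))  # assumed: source-line order
    stmt_labels = [1 if line in vulnerable_lines else 0 for line in merged]
    slice_label = 1 if any(stmt_labels) else 0
    return merged, stmt_labels, slice_label


stmts, stmt_labels, slice_label = build_slice(
    forward={3, 4, 5, 6}, backward={1, 2, 3}, vulnerable_lines={4}
)
print(stmts)        # [1, 2, 3, 4, 5, 6]
print(stmt_labels)  # [0, 0, 0, 1, 0, 0]
print(slice_label)  # 1
```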

It should be noted that every program slice contains the same number of statements. A program slice can be expressed as $P = \{s_1, s_2, \dots, s_n\}$, where $s_i$ denotes the statement on line $i$, $1 \le i \le n$, and $n$ is the preset maximum number of statements per program slice, i.e. each program slice has $n$ lines of statements. If a program slice has fewer than $n$ statements, the missing lines are filled with padding; if it has more than $n$ statements, the excess lines are truncated.

Step 1.2: Tokenize each statement of the program slice to obtain the token sequence of each statement in the program slice.

Optionally, Byte Pair Encoding (BPE) tokenization can be applied to each line statement $s_i$ of a program slice $P$ to obtain the token sequence corresponding to the statement, $s_i = \{t_{i,1}, t_{i,2}, \dots, t_{i,r}\}$, where $t_{i,j}$ denotes the $j$-th token of statement $s_i$, $1 \le j \le r$, and $r$ is the preset maximum number of tokens per statement.

It should be noted that every statement contains the same number of tokens. If a statement has fewer than $r$ tokens, the missing positions are filled with the special token <pad>; if it has more than $r$ tokens, the excess tokens are truncated.
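The two fixed-size rules (a preset number of statements per slice and a preset number of tokens per statement) can be sketched as follows. This is a minimal example, assuming that missing statement rows are filled entirely with the `<pad>` token; the helper names and the toy tokens are hypothetical, and real BPE tokenization is not performed here.

```python
PAD = "<pad>"


def fit_length(items, size, pad):
    """Truncate `items` to `size`, or right-pad it with `pad` up to `size`."""
    return list(items[:size]) + [pad] * max(0, size - len(items))


def normalize_slice(statements, n, r):
    """Return an n x r grid of tokens for one program slice."""
    rows = [fit_length(tokens, r, PAD) for tokens in statements[:n]]
    while len(rows) < n:
        rows.append([PAD] * r)  # assumed: a missing statement is all padding
    return rows


# Toy slice with two already-tokenized statements.
toy = [["int", "x", ";"], ["x", "=", "a", "+", "b", ";"]]
grid = normalize_slice(toy, n=3, r=4)
for row in grid:
    print(row)
```

With these toy sizes the first statement is padded, the second is truncated to four tokens, and a third all-padding row is appended, so every slice ends up with exactly n × r token positions.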

Step 1.3: Apply vocabulary embedding to every token of every statement in the program slice to obtain the token embedding of the program slice.

In this embodiment, a vocabulary embedding layer $E \in \mathbb{R}^{|V| \times d}$ can be used to obtain the embedded representation of each token $t_{i,j}$, where $|V|$ is the vocabulary size of the tokenizer, $d$ is the dimension of the token embedding vector, $\mathbb{R}$ denotes the set of real numbers, and $t_{i,j}$ is the $j$-th token of statement $s_i$, of which there are $r$ in total. The token embedding of a program slice $P$ can then be expressed as $X_P \in \mathbb{R}^{n \times r \times d}$.

For example, the embedding layer of the pre-trained CodeT5 model can be used to map each token to a 768-dimensional vector space, i.e. the token embedding dimension is $d = 768$.

Step 1.4: Use a GRU (Gated Recurrent Unit) to perform the first statement embedding on the token embedding of the program slice, obtaining the first statement embedding of the program slice.

In this embodiment, the GRU statement embedding method is used to strengthen the expressiveness of the features: the number of tokens $r$ in each statement is used as the number of time steps, and the GRU aggregates the information of the whole statement. In the token sequence of a statement, every token influences the network state, and the hidden state at the last time step combines these influences, so it effectively captures the state and key information of the whole statement.

Please refer to FIG. 3, a schematic diagram of GRU statement embedding provided by an embodiment of the present invention. As shown in FIG. 3, the token embedding $X_P$ of a program slice $P$ is summarized by the first GRU statement embedding, GRU_stmt, and the resulting first statement embedding $H_P$ is given by formula (1):

$$H_P = \mathrm{GRU}_{stmt}(X_P) \qquad (1)$$

GRU_stmt gives the model a learnable way to integrate the information of every token in a statement into the statement vector, retaining the features of key tokens and reducing the risk of information loss. When the preset number of statements per slice and the preset number of tokens per statement multiply to 3000, the method can process 3000 tokens in a single program slice, about six times 512 tokens. The GRU statement embedding proposed in this embodiment therefore provides a more complete information representation than ordinary token embedding.
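To make the idea of "one GRU pass per statement, keep the last hidden state" concrete, here is a small pure-Python sketch. It is not the patented GRU_stmt: the class `TinyGRU`, the bias-free update equations (one common GRU formulation), the random untrained weights, and the toy dimensions are all assumptions chosen to keep the example self-contained.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


class TinyGRU:
    """Bias-free GRU cell with random, untrained weights (illustration only)."""

    def __init__(self, input_dim, hidden_dim, seed=0):
        rng = random.Random(seed)

        def rand_matrix(rows, cols):
            return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

        self.hidden_dim = hidden_dim
        gates = ("update", "reset", "candidate")
        self.W = {g: rand_matrix(hidden_dim, input_dim) for g in gates}
        self.U = {g: rand_matrix(hidden_dim, hidden_dim) for g in gates}

    def _pre(self, gate, x, h):
        """Pre-activation W[gate] @ x + U[gate] @ h."""
        return [a + b for a, b in zip(matvec(self.W[gate], x), matvec(self.U[gate], h))]

    def step(self, x, h):
        update = [sigmoid(v) for v in self._pre("update", x, h)]
        reset = [sigmoid(v) for v in self._pre("reset", x, h)]
        gated_h = [r_i * h_i for r_i, h_i in zip(reset, h)]
        cand = [math.tanh(v) for v in self._pre("candidate", x, gated_h)]
        return [(1 - z) * c + z * h_i for z, c, h_i in zip(update, cand, h)]

    def embed_statement(self, token_vectors):
        """Run over the tokens of one statement; return the last hidden state."""
        h = [0.0] * self.hidden_dim
        for x in token_vectors:
            h = self.step(x, h)
        return h


def embed_slice(gru, slice_token_vectors):
    """One statement vector per statement of the slice."""
    return [gru.embed_statement(stmt) for stmt in slice_token_vectors]


# Toy slice: 2 statements, 3 tokens each, token dimension 4.
toy_slice = [[[0.1 * (i + j + k) for k in range(4)] for j in range(3)] for i in range(2)]
gru_stmt = TinyGRU(input_dim=4, hidden_dim=5)
first_stmt_embedding = embed_slice(gru_stmt, toy_slice)
print(len(first_stmt_embedding), len(first_stmt_embedding[0]))  # 2 5
```

The point of the sketch is the shape change: the token axis disappears, so a slice costs one vector per statement rather than one per token.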

Step 2: Construct a vulnerability dictionary from the training set.

Please refer to FIG. 4, a flowchart of the construction of a vulnerability dictionary provided by an embodiment of the present invention. As shown in FIG. 4, in an optional embodiment, step 2 includes:

Step 2.1: From the multiple program slices, select those that contain vulnerable statements to obtain the vulnerable program slices, and extract the vulnerable statements of each vulnerable program slice to obtain the vulnerable statement set corresponding to that slice.

Program slicing of the training set yields multiple program slices. Checking the statement labels shows whether a program slice contains a vulnerability, so the program slices that contain vulnerable statements can be selected as the vulnerable program slices. The vulnerable statements are then extracted according to the statement labels of each vulnerable program slice, giving the vulnerable statement set corresponding to that slice. In this embodiment, the vulnerable statement set is denoted Vulnerability Instance, abbreviated vulI.

In this embodiment, the vulI of a program slice $P$ can be expressed as $vulI = \{v_1, v_2, \dots, v_m\}$, where $m$ is the preset maximum number of vulnerable statements per program slice.

If a program slice has fewer than $m$ vulnerable statements, the missing lines are filled with padding.

Step 2.2: Tokenize each vulnerable statement in the vulnerable statement set to obtain the token sequence of each vulnerable statement in the set.

In this embodiment, as in the preprocessing of the training set, BPE tokenization can be applied to each statement of a vulI to obtain the corresponding token sequence of at most $r$ tokens, where $r$ is the preset maximum number of tokens per statement. If a statement has fewer than $r$ tokens, it is padded with the special token <pad>; sequences that are too long are truncated.

Step 2.3: Apply vocabulary embedding to every token of every vulnerable statement in the vulnerable statement set to obtain the token embedding of the vulnerable statement set.

In this embodiment, as in the preprocessing of the training set, each token is looked up in the vocabulary and mapped to its token ID. For example, the embedding layer $E \in \mathbb{R}^{|V| \times d}$ of the pre-trained CodeT5 model can be used to map each token to a 768-dimensional vector space, where $|V|$ is the vocabulary size of the tokenizer and $d = 768$ is the dimension of the token embedding vector. The token embedding of each vulI can therefore be expressed as $X_{vulI} \in \mathbb{R}^{m \times r \times d}$, where $m$ is the maximum number of vulnerable statements per slice and $r$ is the maximum number of tokens per statement.

Step 2.4: Use a GRU to perform the first statement embedding on the token embedding of the vulnerable statement set, obtaining the first statement embedding of the vulnerable statement set.

In this embodiment, as in the preprocessing of the training set, GRU_stmt performs the first statement embedding on the token embedding $X_{vulI}$ of each vulI to summarize the information of each statement in the vulI, finally giving the first statement embedding $H_{vulI}$ of the vulnerable statement set, as shown in formula (2):

$$H_{vulI} = \mathrm{GRU}_{stmt}(X_{vulI}) \qquad (2)$$

Step 2.5: Use a GRU to perform a second statement embedding on the first statement embedding of the vulnerable statement set, obtaining the second statement embedding of the vulnerable statement set.

To make it easier to construct the flat vector of the vulnerable statement set, a GRU performs a second statement embedding on the first statement embedding of the set. That is, GRU_vulI further summarizes the first statement embedding $H_{vulI}$ of each vulI to obtain the hidden state $h_{vulI}$, which is the second statement embedding of the vulnerable statement set, as shown in formula (3):

$$h_{vulI} = \mathrm{GRU}_{vulI}(H_{vulI}) \qquad (3)$$

In this embodiment, GRU_vulI has the same structure as GRU_stmt.

Step 2.6: Compress the first dimension of the second statement embedding of the vulnerable statement set and map the result to a flat vector, obtaining the flat vector of the vulnerable statement set.

In this embodiment, the first dimension of the second statement embedding $h_{vulI}$ is compressed and the result is mapped to a flat vector $f_{vulI}$, which is the flat vector of the vulnerable statement set.

Step 2.7: The flat vectors of the vulnerable statement sets of all vulnerable program slices form the vulnerability flat vector set. A clustering algorithm is applied to obtain the cluster centers of this set, and all the cluster centers together form the vulnerability dictionary.

In this embodiment, the vulnerability flat vector set is denoted Vulnerability Set, abbreviated vulSet, with $vulSet = \{f_1, f_2, \dots, f_N\}$, where $N$ is the total number of vulnerable slices extracted. In FIG. 4, pattern-filled squares represent vulnerable statements and unfilled squares represent non-vulnerable statements. The constructed vulSet may contain many duplicate or similar vulIs. In addition, as many as 33,647 vulIs were extracted from the training set in this embodiment; in the subsequent training and detection stages, such a large vulSet would consume substantial computing resources when each program slice is matched against the elements of vulSet.

This embodiment therefore constructs a compact set that still covers all vulIs, called vulDict (Vulnerability Dictionary). The vulnerability dictionary is the set of all vulCenters (Vulnerability Centers) and can represent every vulI.

In this embodiment, the vulnerability dictionary is expressed as $vulDict = \{c_1, c_2, \dots, c_k\}$, where each $c_j$ is an element of vulDict, also called a vulCenter, and $k$ is the number of elements in vulDict. vulDict merges vulIs with similar features, and each vulCenter represents a group or class of similar vulIs. To ensure that each vulCenter represents a group of similar vulIs, this embodiment uses the K-means clustering algorithm to condense and summarize vulSet, which yields $k$ cluster centers, i.e. the vulCenters. Each vulCenter captures the common features of its class and can be regarded as the typical vulI of that class.

In this embodiment, vulDict is built from the vulCenters and replaces the original vulSet, with the aim of compressing vulSet and improving the detection efficiency of the model. It should be noted that the data in vulSet can be normalized before clustering to reduce the influence of outliers on the clustering result.

The constructed vulDict retains the key information of the vulIs while reducing the size of vulSet, which lowers the computational burden of the subsequent training and detection stages. It also enables the vulnerability detection model to capture the commonalities of the vulIs more effectively, providing more representative features for the subsequent vulnerability detection and analysis.
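The construction of vulDict from vulSet (normalize, then cluster with K-means and keep the centers) can be sketched in plain Python. The sketch makes several assumptions not stated in the source: min-max scaling as the normalization, random initial centers drawn from the data, a fixed iteration cap, and tiny 2-dimensional toy vectors in place of real flat vectors.

```python
import math
import random


def normalize(vectors):
    """Min-max scale every dimension to [0, 1] (one possible normalization)."""
    dims = range(len(vectors[0]))
    lo = [min(v[d] for v in vectors) for d in dims]
    hi = [max(v[d] for v in vectors) for d in dims]
    return [
        [(v[d] - lo[d]) / (hi[d] - lo[d]) if hi[d] > lo[d] else 0.0 for d in dims]
        for v in vectors
    ]


def kmeans(vectors, k, iters=50, seed=0):
    """Plain K-means; returns the k cluster centers."""
    rng = random.Random(seed)
    centers = [list(v) for v in rng.sample(vectors, k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            nearest = min(range(k), key=lambda c: math.dist(v, centers[c]))
            clusters[nearest].append(v)
        new_centers = [
            [sum(col) / len(members) for col in zip(*members)] if members else centers[j]
            for j, members in enumerate(clusters)
        ]
        if new_centers == centers:  # assignments no longer change the centers
            break
        centers = new_centers
    return centers


# Toy vulSet: six flat vectors forming two obvious groups.
vul_set = [[0, 0], [0, 1], [1, 0], [10, 10], [10, 11], [11, 10]]
vul_dict = sorted(kmeans(normalize(vul_set), k=2))
print(len(vul_set), "vectors reduced to", len(vul_dict), "centers")
```

In practice a library implementation would normally be preferred; the sketch only shows why matching against k centers is cheaper than matching against every vulI.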

Step 3: Concatenate the first statement embedding of the program slice with the matching cluster center in the vulnerability dictionary to obtain a first concatenated vector.

In an optional embodiment, step 3 includes:

Step 3.1: Compress the first statement embedding of the program slice to obtain the compressed statement embedding of the program slice.

In this embodiment, for the first statement embedding $H_P$ of a program slice, the compressed statement embedding obtained after compression can be expressed as $\tilde{H}_P$.

步骤3.2：对于含有漏洞语句的程序切片，确定程序切片中漏洞语句在漏洞字典中所属的簇中心，将含有漏洞语句的程序切片的压缩语句嵌入与确定的所属簇中心进行拼接，得到第一拼接向量。Step 3.2: For a program slice containing a vulnerable statement, determine the cluster center to which the vulnerable statement in the program slice belongs in the vulnerability dictionary, and concatenate the compressed statement embedding of the program slice containing the vulnerable statement with the determined cluster center to obtain a first concatenated vector.

在本实施例中，对于漏洞语句在漏洞字典中所属的簇中心的确定，可以通过计算漏洞语句的平面向量与漏洞字典中各簇中心之间的欧式距离，选择欧式距离最小值对应的簇中心为该漏洞语句的在漏洞字典中所属的簇中心。具体地，漏洞语句的在漏洞字典中所属的簇中心的确定过程如公式（4）所示：In this embodiment, the cluster center to which the vulnerability statement belongs in the vulnerability dictionary can be determined by calculating the Euclidean distance between the plane vector of the vulnerability statement and the cluster centers in the vulnerability dictionary, and selecting the cluster center corresponding to the minimum Euclidean distance as the cluster center to which the vulnerability statement belongs in the vulnerability dictionary. Specifically, the process of determining the cluster center to which the vulnerability statement belongs in the vulnerability dictionary is shown in formula (4):

$$c^{*} = \underset{c_i,\ 1 \le i \le k}{\arg\min}\; \lVert v - c_i \rVert_2 \tag{4}$$

where $c^{*}$ is the cluster center to which the vulnerability statement belongs, $v$ is the plane vector of the vulnerability statement, $c_i$ is the $i$-th cluster center in the vulnerability dictionary, and $k$ is the number of cluster centers in the vulnerability dictionary.
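
The nearest-center lookup of formula (4) can be sketched as follows. This is only an illustration under the assumption that plane vectors and cluster centers are plain lists of floats; the function name `nearest_center` is hypothetical.

```python
import math

def nearest_center(plane_vector, vul_dict):
    """Return (index, center) of the cluster center with the smallest
    Euclidean distance to plane_vector."""
    best = min(range(len(vul_dict)),
               key=lambda i: math.dist(plane_vector, vul_dict[i]))
    return best, vul_dict[best]

# Toy vulnerability dictionary with three cluster centers.
vul_dict = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
index, center = nearest_center([0.9, 1.2], vul_dict)
```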

The plane vector of the vulnerability statement is computed as follows: the vulI in the program slice containing the vulnerability statement are processed with GRU_stmt and GRU_vulI, and the result is compressed along the first dimension and mapped to a plane vector.

After formula (4) identifies the cluster center most similar to the plane vector of the vulnerability statement, the compressed statement embedding of the program slice containing the vulnerability statement is concatenated with that cluster center to form the first concatenated vector, as shown in formula (5). The first concatenated vector serves as the input to the subsequent vulnerability detection model.

$$X_{1} = \mathrm{Concat}\left(E_{c},\, c^{*}\right) \tag{5}$$

where $X_{1}$ is the first concatenated vector, $E_{c}$ is the compressed statement embedding of the program slice, and $c^{*}$ is the most similar cluster center.

Step 3.3: For a program slice that does not contain a vulnerability statement, determine the cluster center in the vulnerability dictionary to which the preset special embedding belongs, and concatenate the compressed statement embedding of that program slice with the determined cluster center to obtain the first concatenated vector.

In this embodiment, for a program slice that does not contain a vulnerability statement, a preset special embedding matrix, filled with <pad> up to the maximum number of vulI, is used to represent the first statement embedding of the vulnerability statements. Likewise, GRU_vulI summarizes this first statement embedding into a plane vector.

In this embodiment, determining the cluster center and performing the concatenation are similar to step 3.2 and are not repeated here.
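
Steps 3.2 and 3.3 can be sketched together as follows. This is a hedged illustration: it assumes the compressed statement embedding is a list of per-statement vectors with the same width as the cluster centers, and that the cluster center is appended as one extra row. The concatenation axis is an assumption, since it is not recoverable from this section. The name `first_concat_vector` is hypothetical.

```python
import math

def first_concat_vector(compressed_embedding, vul_plane_vector, vul_dict):
    """Append the nearest cluster center to the compressed statement embedding.

    For a slice without vulnerability statements, vul_plane_vector is the plane
    vector obtained from the preset special (<pad>-filled) embedding.
    """
    center = min(vul_dict, key=lambda c: math.dist(vul_plane_vector, c))
    # Assumed layout: statements as rows, cluster center as the last row.
    return [list(row) for row in compressed_embedding] + [list(center)]

vul_dict = [[0.0, 0.0], [1.0, 1.0]]
slice_embedding = [[0.3, 0.4], [0.5, 0.6]]
with_vul = first_concat_vector(slice_embedding, [0.9, 0.8], vul_dict)
without_vul = first_concat_vector(slice_embedding, [0.1, 0.0], vul_dict)
```

Both kinds of slice go through the same code path; only the plane vector used for the lookup differs.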

Step 4: Input the first concatenated vector into the vulnerability detection model, learn the correlation between the first statement embedding of the program slice and the cluster centers of the vulnerability dictionary, and obtain a trained vulnerability detection model.

Refer to FIG. 5, which shows the training process of a vulnerability detection model provided by an embodiment of the present invention. As shown in FIG. 5, once the first concatenated vector is obtained, it is input into the vulnerability detection model for training.

Optionally, the vulnerability detection model includes a plurality of cascaded Transformer Encoder modules and a classifier connected to the last Transformer Encoder module.

In this embodiment, the cascaded Transformer Encoder modules perform feature learning on the first concatenated vector to obtain a correlation feature matrix; 12 cascaded Transformer Encoder modules are used.

Software vulnerability detection can be treated as a text classification task: the model learns features of sequence information and a classifier makes the prediction. The model most widely used for such tasks is the Transformer, whose self-attention mechanism performs well at understanding the semantic features of code. When processing long sequences in particular, it is more effective than traditional sequence models such as the RNN (Recurrent Neural Network) and the LSTM (Long Short-Term Memory) network, and it learns the relationship between every pair of tokens in the input sequence, thereby better capturing the connections between tokens.

Therefore, in this embodiment, the cascaded Transformer Encoder modules are used to fully learn the correlation between the code embedding and vulCenter.

In this embodiment, the classifier includes a statement-level prediction branch and a slice-level prediction branch connected in parallel. The statement-level prediction branch performs statement-level prediction on the correlation feature matrix to obtain the vulnerability prediction probability of each statement in the program slice; the slice-level prediction branch performs slice-level prediction on the correlation feature matrix to obtain the vulnerability prediction probability of the program slice.

In this embodiment, the output of the last Transformer Encoder module is the input to the classifier; its shape depends on the preset maximum number of statements in each program slice.

Refer to FIG. 6, which is a schematic structural diagram of a classifier provided by an embodiment of the present invention. As shown in FIG. 6, the statement-level prediction branch includes a cascaded first Dropout layer, first fully connected layer, first activation layer, second Dropout layer, and second fully connected layer. That is, the classifier input is processed by the first Dropout layer, the first fully connected layer, the first activation layer (with Tanh as the activation function), and the second Dropout layer, after which the second fully connected layer maps the features of the second dimension to scalar values for the final statement-level prediction, yielding the vulnerability prediction probability of each statement in the program slice.
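
A forward pass through the statement-level prediction branch at inference time can be sketched as follows. This is an illustration under stated assumptions rather than the embodiment's implementation: Dropout layers are omitted because they act as the identity at inference, each statement's feature row is mapped to one scalar, and a sigmoid (not named in this section) turns that scalar into a probability. The names `linear` and `statement_branch` and the toy weights are hypothetical.

```python
import math

def linear(x, weights, bias):
    """Fully connected layer: one output per weight row."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weights, bias)]

def statement_branch(features, w1, b1, w2, b2):
    """Statement-level branch at inference (Dropout layers omitted)."""
    probabilities = []
    for row in features:  # one feature row per statement (assumed layout)
        hidden = [math.tanh(h) for h in linear(row, w1, b1)]  # FC 1 + Tanh
        logit = linear(hidden, w2, b2)[0]                     # FC 2 -> scalar
        probabilities.append(1.0 / (1.0 + math.exp(-logit)))  # assumed sigmoid
    return probabilities

# Toy weights: 2 input features -> 2 hidden units -> 1 scalar per statement.
w1, b1 = [[1.0, 0.0], [0.0, 1.0]], [0.0, 0.0]
w2, b2 = [[1.0, -1.0]], [0.0]
probs = statement_branch([[0.0, 0.0], [1.0, 1.0], [2.0, 0.0]], w1, b1, w2, b2)
```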

As shown in FIG. 6, the slice-level prediction branch includes a cascaded GRU unit, vector compression layer, third Dropout layer, third fully connected layer, second activation layer, fourth Dropout layer, and fourth fully connected layer. In this embodiment, following the idea of GRU statement embedding, the GRU unit summarizes the classifier input; the hidden state of the last time step is taken and compressed by the vector compression layer. The result is then processed in sequence by the third Dropout layer, the third fully connected layer, the second activation layer (with Tanh as the activation function), and the fourth Dropout layer, and is finally mapped to a single scalar value by the fourth fully connected layer to perform slice-level prediction, yielding the vulnerability prediction probability of the program slice.

Step 5: Preprocess the program to be tested to obtain the first statement embedding of each program slice to be tested; based on that first statement embedding and the vulnerability dictionary, use the trained vulnerability detection model to perform vulnerability detection on each program slice to be tested; and obtain the vulnerability prediction result of the program to be tested from the vulnerability prediction results of its program slices.

In an optional embodiment, step 5 includes:

Step 5.1: Perform program slicing on the program to be tested to obtain multiple program slices to be tested, and perform word segmentation, vocabulary embedding, and the first statement embedding on each program slice to be tested in sequence to obtain the first statement embedding of the program slice to be tested.

In the detection stage, the program to be tested is preprocessed in the same way as the training set; the details are not repeated here.

Note that when a function F is the program to be tested and F is cut into z program slices to be tested, the line number in the source function of each statement in each program slice to be tested must be saved. When the program slices to be tested subsequently undergo BPE word segmentation, vocabulary mapping, unification of the number of statements to a fixed count, and vocabulary embedding, a program slice with fewer statements than that count has its line numbers padded with -1 so that the counts stay consistent.
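
The line-number bookkeeping described above can be sketched as follows. This is a minimal illustration; the name `pad_line_numbers` is hypothetical, and truncating slices that exceed the fixed statement count is an assumption that this section does not address.

```python
def pad_line_numbers(line_numbers, max_statements, pad_value=-1):
    """Pad a slice's source line numbers with -1 up to max_statements."""
    kept = list(line_numbers[:max_statements])  # assumed truncation of long slices
    return kept + [pad_value] * (max_statements - len(kept))

# A slice whose three statements come from lines 3, 7 and 8 of function F.
padded = pad_line_numbers([3, 7, 8], max_statements=5)
```

Keeping the padded line numbers aligned with the padded statements lets statement-level predictions be mapped back to the source function later, with -1 marking positions to ignore.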

Step 5.2: Concatenate the first statement embedding of the program slice to be tested with each cluster center in the vulnerability dictionary to obtain the corresponding second concatenated vectors.

In this embodiment, concatenating the first statement embedding of a program slice to be tested with each cluster center yields one second concatenated vector per cluster center.

Step 5.3: Input the second concatenated vectors into the trained vulnerability detection model to obtain the vulnerability prediction probability of each statement of the program slice to be tested and the vulnerability prediction probability of the program slice to be tested.

In this embodiment, for each program slice to be tested, the trained vulnerability detection model produces one slice-level vulnerability prediction probability and one set of statement-level vulnerability prediction probabilities per cluster center.

For slice-level prediction, the maximum of these vulnerability prediction probabilities is taken as the vulnerability prediction probability of the program slice to be tested. For statement-level prediction, the vulnerability prediction probabilities of each statement are averaged over the different cluster centers, and the average is taken as the vulnerability prediction probability of that statement.
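
The aggregation over cluster centers can be sketched as follows. This is a minimal illustration that assumes the model outputs for one program slice have already been collected into plain lists; the name `aggregate_over_centers` is hypothetical.

```python
def aggregate_over_centers(slice_probs, statement_probs):
    """Combine the predictions obtained with each cluster center.

    slice_probs: one slice-level probability per cluster center.
    statement_probs: one list of per-statement probabilities per cluster center.
    """
    slice_prob = max(slice_probs)  # slice level: maximum over centers
    k = len(statement_probs)
    per_statement = [sum(col) / k for col in zip(*statement_probs)]  # mean
    return slice_prob, per_statement

# Three cluster centers, a slice with two statements.
slice_prob, per_statement = aggregate_over_centers(
    [0.2, 0.9, 0.4],
    [[0.1, 0.8], [0.3, 0.6], [0.2, 0.7]],
)
```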

Step 5.4: Obtain the vulnerability prediction result of the program to be tested from the vulnerability prediction probability of each program slice to be tested.

Optionally, the maximum of the vulnerability prediction probabilities of the program slices to be tested is determined. If the maximum is greater than a preset threshold, the program to be tested contains a vulnerability; otherwise it does not.

In this embodiment, the preset threshold is 0.5. For function F, if any of its program slices contains a vulnerability, the function itself is also considered vulnerable, which gives the function-level vulnerability prediction.

Note that if function F is predicted to contain no vulnerability, the subsequent step 5.5 is not executed.
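
The function-level decision of step 5.4 can be sketched as follows. The name `function_is_vulnerable` is hypothetical, and treating a function with no slices as non-vulnerable is an assumption.

```python
def function_is_vulnerable(slice_probabilities, threshold=0.5):
    """Flag the function if its most suspicious slice exceeds the threshold."""
    return max(slice_probabilities, default=0.0) > threshold

flagged = function_is_vulnerable([0.12, 0.81, 0.33])
clean = function_is_vulnerable([0.12, 0.5, 0.33])
```

Because the comparison is strictly greater than the threshold, a maximum of exactly 0.5 does not flag the function.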

Step 5.5: For a program to be tested that contains a vulnerability, determine the vulnerability statement prediction result of the program to be tested from the vulnerability prediction probability of each statement of each program slice to be tested.

Optionally, for a program to be tested that contains a vulnerability, the K statements with the highest vulnerability prediction probabilities are selected, based on the vulnerability prediction probability of each statement of each program slice to be tested, as the vulnerability statement prediction result of the program to be tested.

In this embodiment, the saved original line numbers of the statements in each program slice are used to map the prediction probability of each statement back to its line number in function F, so that the K statements selected are the K statements in function F most likely to contain a vulnerability.
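
The mapping back to source lines and the top-K selection can be sketched as follows. This is a hedged illustration: the name `top_k_lines` is hypothetical, and keeping the highest probability when the same source line appears in several slices is an assumption, since this section does not say how such duplicates are combined.

```python
def top_k_lines(slice_line_numbers, slice_statement_probs, k):
    """Return the k source line numbers with the highest predicted probability."""
    best = {}
    for lines, probs in zip(slice_line_numbers, slice_statement_probs):
        for line, prob in zip(lines, probs):
            if line == -1:  # padding entry, not a real statement
                continue
            # Assumed rule: keep the highest probability seen for a line.
            best[line] = max(prob, best.get(line, 0.0))
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [line for line, _ in ranked[:k]]

# Two slices of function F; line 7 appears in both, -1 marks padding.
lines = [[3, 7, -1], [7, 9, 12]]
probs = [[0.2, 0.9, 0.99], [0.6, 0.4, 0.8]]
suspects = top_k_lines(lines, probs, k=2)
```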

The slice-level feature-driven method for accurate software vulnerability detection of the embodiments of the present invention combines statement embedding with program slicing. It performs slicing from vulnerability key points, extracts the code fragments closely related to vulnerabilities, and uses statement embedding to aggregate statement information, obtaining a more complete feature representation. It introduces the concept of a vulnerability dictionary and uses the vulnerability detection model to mine and exploit the vulnerability features in program slices, achieving effective detection of function-level and statement-level vulnerabilities in the program to be tested.

It should be noted that, herein, relational terms such as "first" and "second" are used only to distinguish one entity or operation from another and do not necessarily require or imply any actual relationship or order between those entities or operations. Moreover, the terms "include", "comprise", and any other variants are intended to cover non-exclusive inclusion, so that an article or device including a series of elements includes not only those elements but also other elements not explicitly listed. Without further limitation, an element defined by the phrase "including a ..." does not exclude the presence of additional identical elements in the article or device that includes the element. Words such as "connect" or "connected" are not limited to physical or mechanical connections and may include electrical connections, whether direct or indirect.

The above is a further detailed description of the present invention in combination with specific preferred embodiments, and the specific implementation of the present invention shall not be regarded as limited to this description. A person of ordinary skill in the technical field of the present invention may make several simple deductions or substitutions without departing from the concept of the present invention, and all of these shall be regarded as falling within the protection scope of the present invention.

Claims

1. A slice-level feature-driven software vulnerability accurate detection method, characterized by comprising:

Step 1: preprocessing the training set to obtain the first statement embedding of the program slice, wherein the training set is an open source vulnerability dataset with statement-level vulnerability labels; Step 1 includes:

Step 1.1: performing program slicing processing on the training set to obtain a plurality of program slices, wherein each program slice contains an equal number of statements;

Step 1.2: performing word segmentation processing on each statement in the program slice to obtain a token sequence of each statement in the program slice, wherein each statement contains an equal number of tokens;

Step 1.3: performing vocabulary embedding on each token of each statement in the program slice to obtain a token embedding of the program slice;

Step 1.4: performing a first statement embedding on the token embedding of the program slice using a GRU to obtain the first statement embedding of the program slice;

Step 2: constructing a vulnerability dictionary based on the training set;

Step 3: concatenating the first statement embedding of the program slice with the matching cluster center in the vulnerability dictionary to obtain a first concatenated vector;

Step 4: input the first concatenated vector into a vulnerability detection model, learn the correlation between the first statement embedding of the program slice and the cluster center of the vulnerability dictionary, and obtain a trained vulnerability detection model;

Step 5: Preprocess the program to be tested to obtain the first statement embedding of the program slice to be tested, perform vulnerability detection on the program slice to be tested using the trained vulnerability detection model based on the first statement embedding of the program slice to be tested and the vulnerability dictionary, and obtain the vulnerability prediction result of the program to be tested based on the vulnerability prediction result of the program slice to be tested.

2. The slice-level feature-driven software vulnerability accurate detection method according to claim 1, characterized in that step 1.1 comprises:

Step 1.11: Use the open source static analysis tool Joern to analyze the programs in the training set to obtain the key points of the programs, including sensitive APIs, array operations, integers, and pointer usage;

Step 1.12: Execute forward slicing and backward slicing operations with the key point as the starting point, extract all statements related to the starting point in the program, and obtain corresponding forward slicing and backward slicing;

Step 1.13: Integrate the forward slice and the backward slice to remove duplicate statements, and attach a corresponding statement label to each statement to obtain the program slice, where the statement label includes a vulnerability statement label and a non-vulnerability statement label.

3. The slice-level feature-driven software vulnerability accurate detection method according to claim 1, characterized in that step 2 comprises:

Step 2.1: Filter out program slices containing vulnerability statements from the multiple program slices to obtain vulnerable program slices, extract the vulnerability statements in each of the vulnerable program slices, and obtain a set of vulnerability statements corresponding to each of the vulnerable program slices;

Step 2.2: performing word segmentation processing on each vulnerability statement in the vulnerability statement set to obtain a token sequence of each vulnerability statement in the vulnerability statement set;

Step 2.3: performing vocabulary embedding on each token of each vulnerability statement in the vulnerability statement set to obtain a token embedding of the vulnerability statement set;

Step 2.4: performing a first statement embedding on the token embedding of the vulnerability statement set using a GRU to obtain the first statement embedding of the vulnerability statement set;

Step 2.5: performing a second statement embedding on the first statement embedding of the vulnerability statement set using a GRU to obtain a second statement embedding of the vulnerability statement set;

Step 2.6: compressing the second statement embedding of the vulnerability statement set along the first dimension and mapping it onto a plane vector to obtain the plane vector of the vulnerability statement set;

Step 2.7: The plane vectors of the vulnerability statement sets corresponding to all the vulnerable program slices form a vulnerability plane vector set, and the cluster centers of the vulnerability plane vector set are obtained by using a clustering algorithm. All cluster centers form the vulnerability dictionary.

4. The slice-level feature-driven software vulnerability accurate detection method according to claim 1, wherein step 3 comprises:

Step 3.1: compressing the first statement embedding of the program slice to obtain a compressed statement embedding of the program slice;

Step 3.2: for a program slice containing a vulnerability statement, determining the cluster center in the vulnerability dictionary to which the vulnerability statement in the program slice belongs, and concatenating the compressed statement embedding of the program slice containing the vulnerability statement with the determined cluster center to obtain the first concatenated vector;

Step 3.3: for a program slice that does not contain a vulnerability statement, determining the cluster center in the vulnerability dictionary to which the preset special embedding belongs, and concatenating the compressed statement embedding of the program slice that does not contain a vulnerability statement with the determined cluster center to obtain the first concatenated vector.

5. The slice-level feature-driven software vulnerability accurate detection method according to claim 1, characterized in that the vulnerability detection model comprises: a plurality of cascaded Transformer Encoder modules, and a classifier connected to the last Transformer Encoder module, the classifier comprising a statement-level prediction branch and a slice-level prediction branch connected in parallel, wherein:

The plurality of cascaded Transformer Encoder modules are used to perform feature learning on the first concatenated vector to obtain a correlation feature matrix;

The statement-level prediction branch is used to perform statement-level prediction according to the correlation feature matrix to obtain the vulnerability prediction probability of each statement of the program slice;

The slice-level prediction branch is used to perform slice-level prediction according to the correlation feature matrix to obtain the vulnerability prediction probability of the program slice.

6. The slice-level feature-driven software vulnerability accurate detection method according to claim 5, characterized in that the statement-level prediction branch comprises a cascaded first Dropout layer, a first fully connected layer, a first activation layer, a second Dropout layer, and a second fully connected layer;

The slice-level prediction branch includes a cascaded GRU unit, a vector compression layer, a third Dropout layer, a third fully connected layer, a second activation layer, a fourth Dropout layer and a fourth fully connected layer.

7. The slice-level feature-driven software vulnerability accurate detection method according to claim 1, wherein step 5 comprises:

Step 5.1: performing program slicing processing on the program to be tested to obtain multiple program slices to be tested, and performing word segmentation processing, vocabulary embedding and the first statement embedding on each program slice to be tested in sequence to obtain the first statement embedding of the program slice to be tested;

Step 5.2: concatenating the first statement embedding of the program slice to be tested with each cluster center in the vulnerability dictionary to obtain a corresponding second concatenated vector;

Step 5.3: input the second concatenated vector into the trained vulnerability detection model to obtain the vulnerability prediction probability of each statement of the program slice to be tested and the vulnerability prediction probability of the program slice to be tested;

Step 5.4: Obtaining a vulnerability prediction result of the program to be tested according to the vulnerability prediction probability of each slice of the program to be tested;

Step 5.5: For a program to be tested that has vulnerabilities, determine a vulnerability statement prediction result of the program to be tested according to the vulnerability prediction probability of each statement of each slice of the program to be tested.

8. The slice-level feature-driven software vulnerability accurate detection method according to claim 7, wherein step 5.4 comprises:

According to the vulnerability prediction probability of each slice of the program to be tested, a maximum value of the vulnerability prediction probability is determined. If the maximum value is greater than a preset threshold, the program to be tested has a vulnerability; otherwise, the program to be tested does not have a vulnerability.

9. The slice-level feature-driven software vulnerability accurate detection method according to claim 7, wherein step 5.5 comprises:

For a program under test with vulnerabilities, K statements with the highest vulnerability prediction probabilities are selected according to the vulnerability prediction probability of each statement in each slice of the program under test as the vulnerability statement prediction results of the program under test.