CN114679298A - A data screening method and device for application identification intelligence database

Info

Publication number: CN114679298A
Application number: CN202210173540.4A
Authority: CN
Inventors: 孙磊
Original assignee: Secworld Information Technology Beijing Co Ltd; Qax Technology Group Inc
Current assignee: Secworld Information Technology Beijing Co Ltd; Qax Technology Group Inc
Priority date: 2022-02-24
Filing date: 2022-02-24
Publication date: 2022-06-28

Abstract

The invention discloses a data screening method and device for an application identification intelligence database, and relates to the technical field of network security. Its main purpose is to refine and compress network traffic data while fully retaining the field information that can validly identify applications, thereby reducing redundant invalid content in the intelligence database and improving its identification efficiency. The main technical solution is as follows: parsing the collected target traffic into message data in a preset format; processing, according to preset rules, the data in the message data from which the traffic source cannot be identified, to obtain identifiable data; classifying the identifiable data using preset labels, and performing a clustering calculation on the identifiable data of the same category to obtain at least one data set; and extracting at least one piece of identifiable data from each of the data sets corresponding to all categories and adding it to the application identification intelligence database. The invention is used for screening data for an application identification intelligence database.

Description

A data screening method and device for application identification intelligence database

Technical Field

The present invention relates to the technical field of network security, and in particular to a data screening method and device for an application identification intelligence database.

Background

Identifying applications at the network entry point is very important. For both network security products and professional traffic analysis engines, accurate identification of application traffic not only gives insight into the operation of the entire network but also allows user behavior to be controlled precisely according to specific needs. To a certain extent, this both ensures the efficient operation of business flows and prevents network outages caused by malware infections on the intranet. An application identification intelligence database therefore needs to be established so that application-related network traffic can be identified quickly. However, if the database is built with overly redundant content, including much content that cannot be used to label an application, identification efficiency suffers when application-related network traffic is identified.

At present, when an application identification intelligence database is built, network traffic data is generally processed by parsing the traffic, extracting some of the field information, and deduplicating it to reduce redundancy. However, the position and name of the field information that can identify an application vary widely across network traffic, so extracting only some fields loses other valid application-identifying field information and easily leads to missed identifications. How to refine and compress network traffic data while fully retaining the valid application-identifying field information when the intelligence database is built has therefore become an urgent problem.

Summary of the Invention

In view of the above problems, the present invention provides a data screening method, system, and electronic device for an application identification intelligence database. The main purpose is to refine and compress network traffic data while fully retaining the field information that can validly identify applications, thereby reducing redundant invalid content in the intelligence database and improving its identification efficiency.

To solve the above technical problems, the present invention proposes the following solutions:

In a first aspect, the present invention provides a data screening method for an application identification intelligence database, the method comprising:

parsing the collected target traffic into message data in a preset format;

processing, according to preset rules, the data in the message data from which the traffic source cannot be identified, to obtain identifiable data;

classifying the identifiable data using preset labels, and performing a clustering calculation on the identifiable data of the same category to obtain at least one data set;

extracting at least one piece of the identifiable data from each of the data sets corresponding to all categories, and adding it to the application identification intelligence database.

In a second aspect, the present invention provides a data screening device for an application identification intelligence database, the device comprising:

a parsing unit, configured to parse the collected target traffic into message data in a preset format;

a processing unit, configured to process, according to preset rules, the data from which the traffic source cannot be identified in the message data obtained by the parsing unit, to obtain identifiable data;

a computing unit, configured to classify the identifiable data obtained by the processing unit using preset labels, and to perform a clustering calculation on the identifiable data of the same category to obtain at least one data set;

an extraction unit, configured to extract at least one piece of the identifiable data from each of the data sets corresponding to all categories in the computing unit;

an adding unit, configured to add the at least one piece of the identifiable data obtained by the extraction unit to the application identification intelligence database.

To achieve the above purpose, according to a third aspect of the present invention, a storage medium is provided. The storage medium includes a stored program, wherein, when the program runs, the device on which the storage medium resides is controlled to execute the data screening method for an application identification intelligence database of the first aspect.

To achieve the above purpose, according to a fourth aspect of the present invention, a processor is provided. The processor is configured to run a program, wherein the program, when running, executes the data screening method for an application identification intelligence database of the first aspect.

With the above technical solution, the data screening method and device for an application identification intelligence database provided by the present invention make it possible to screen traffic data when the application identification database is built. Once the traffic data has been parsed into message data, the data in the message data from which the traffic source cannot be identified is processed according to the preset rules and removed, which prunes redundant field content from the traffic data. The identifiable data is then classified using preset labels, so that identifiable data with the same preset label falls into one category, and the clustering calculation is performed only on the identifiable data within each category, which reduces the amount of computation and the computational burden of clustering. The different data sets produced by clustering can be understood as feature data that identifies the application from different dimensions. When the dimensions covered by the data sets are sufficiently comprehensive, the valid and complete field information needed to identify the application can be considered retained. Finally, at least one piece of identifiable data is extracted from each of the data sets corresponding to all categories and added to the application identification intelligence database. In this way, network traffic data is refined and compressed while the valid application-identifying field information is fully retained when the intelligence database is built, which reduces redundant invalid content in the intelligence database and improves its identification efficiency.

The above is only an overview of the technical solution of the present invention. So that the technical means of the invention can be understood more clearly and implemented in accordance with the specification, and so that the above and other objects, features, and advantages of the invention are more readily apparent, specific embodiments of the invention are set out below.

Brief Description of the Drawings

Various other advantages and benefits will become apparent to those of ordinary skill in the art upon reading the following detailed description of the preferred embodiments. The drawings are only for the purpose of illustrating the preferred embodiments and are not to be considered limiting of the invention. The same reference numerals denote the same components throughout the drawings. In the drawings:

FIG. 1 shows a flowchart of a data screening method for an application identification intelligence database provided by an embodiment of the present invention;

FIG. 2 shows a flowchart of another data screening method for an application identification intelligence database provided by an embodiment of the present invention;

FIG. 3 shows a block diagram of a data screening device for an application identification intelligence database provided by an embodiment of the present invention;

FIG. 4 shows a block diagram of another data screening device for an application identification intelligence database provided by an embodiment of the present invention.

Detailed Description

Exemplary embodiments of the present disclosure are described in more detail below with reference to the accompanying drawings. Although the drawings show exemplary embodiments of the present disclosure, it should be understood that the disclosure may be implemented in various forms and should not be limited by the embodiments set forth here. Rather, these embodiments are provided so that the disclosure will be understood more thoroughly and so that its scope is fully conveyed to those skilled in the art.

In the field of network security, identifying applications at the network entry point is very important. To manage network traffic, applications are often identified by building an application identification intelligence database, which is composed of traffic data samples with different characteristics. The more traffic data in it that is unrelated to application identification, the lower the efficiency with which the database identifies application traffic. Existing methods for processing network traffic data generally parse the traffic, then extract some of the field information and deduplicate it to reduce redundancy. However, the position and name of the field information that can identify an application vary widely across network traffic, so extracting only some fields loses other valid application-identifying field information and easily leads to missed identifications. If the missed-identification rate is too high, identification becomes inaccurate and traffic that should have been blocked is let through, which threatens network security. For example, some applications carry security vulnerabilities and create security risks, and some applications upload private personal information and cause personal information leaks, all of which makes application identification more challenging. How to refine and compress network traffic data while fully retaining the valid application-identifying field information when the intelligence database is built has therefore become an urgent problem. To this end, an embodiment of the present invention provides a data screening method for an application identification intelligence database. The method refines and compresses network traffic data while fully retaining the valid application-identifying field information, thereby reducing redundant invalid content in the intelligence database and improving its identification efficiency. The specific steps are shown in FIG. 1 and include:

101. Parse the collected target traffic into message data in a preset format.

In this step, the target traffic is the collected network traffic of applications, which includes both known application traffic samples and unknown application traffic samples. When the application identification intelligence database is built, the network traffic of all applications needs to be collected. The target traffic may be collected offline, in real time, by a web crawler, through a specific system interface, or by other related means; this embodiment places no specific limitation on this. A message is the unit of data exchanged and transmitted in a network and contains the complete data to be sent. Messages vary greatly in length, which is unrestricted and variable, and a message itself is string text made up of multiple lines of data. Different types of traffic are parsed in different ways. For example, for traffic whose application-layer protocol is HTTP, the corresponding messages have a fixed structure: they are HTTP messages, the information exchanged in the HTTP protocol. The HTTP message sent by the requesting side is called a request message, and the one sent by the responding side is called a response message. An HTTP message is divided into a message header and a message body, generally separated by a blank line. Fields such as host, Location, and User-Agent in the header of the HTTP message data can therefore be extracted and parsed directly as feature information representing HTTP traffic, so that the source of the traffic can be identified later.

102. Process, according to preset rules, the data in the message data from which the traffic source cannot be identified, to obtain identifiable data.

It should be noted that, in this step, the preset rules are processing rules set according to the characteristics of data from which the traffic source cannot be identified, that is, data that does not belong to and was not created by the application. Applying the preset rules to such data in the message data yields the identifiable data. Identifiable data is data from which the traffic source can be identified, that is, data that belongs to or was created by the application. It includes but is not limited to application content data, application cache data, and application configuration data; this embodiment places no specific limitation on this.

103. Classify the identifiable data using preset labels, and perform a clustering calculation on the identifiable data of the same category to obtain at least one data set.

In this step, the preset label serves as the basis for judging whether pieces of identifiable data originate from the same application. Specifically, it may be the domain name information in the field information corresponding to the identifiable data. When pieces of identifiable data share the same preset label, they can be regarded as belonging to or created by the same application and can therefore be placed in the same category. A data set is the result of the clustering calculation on the identifiable data of one category: a similarity calculation is performed on the identifiable data of that category, and similar identifiable data is grouped into the same data set, so that step 104 can be executed subsequently.
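Step 103 can be sketched as follows. The source does not name a similarity measure or a clustering algorithm, so the greedy threshold clustering below, the use of `difflib.SequenceMatcher`, the `threshold` value, and the record layout (a dict with a `host` key as the domain label) are all assumptions made for illustration only.

```python
from collections import defaultdict
from difflib import SequenceMatcher


def classify_by_label(records, label_key="host"):
    """Group records that share the same preset label (here: the domain)."""
    classes = defaultdict(list)
    for record in records:
        classes[record.get(label_key, "")].append(record)
    return dict(classes)


def signature(record, label_key="host"):
    """Flatten the non-label field values of a record into one comparable string."""
    return " ".join(str(record[k]) for k in sorted(record) if k != label_key)


def cluster_class(records, threshold=0.8, label_key="host"):
    """Greedy clustering (an assumed stand-in for the unspecified algorithm).

    A record joins the first data set whose first member is at least
    `threshold` similar to it; otherwise it starts a new data set.
    """
    data_sets = []
    for record in records:
        sig = signature(record, label_key)
        for members in data_sets:
            ratio = SequenceMatcher(None, sig, signature(members[0], label_key)).ratio()
            if ratio >= threshold:
                members.append(record)
                break
        else:
            data_sets.append([record])
    return data_sets
```

Clustering only within one label keeps the number of pairwise comparisons small, which matches the stated motivation of reducing the clustering workload.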

104. Extract at least one piece of identifiable data from each of the data sets corresponding to all categories, and add it to the application identification intelligence database.

It should be noted that the data sets of all categories are the multiple data sets, each belonging to a particular application, that are obtained by clustering the data belonging to or created by different applications. Because clustering groups identifiable data with similar features within one category into sets, it is enough to extract at least one piece of identifiable data from each data set as feature data and add it to the application identification intelligence database, which keeps the database from becoming overly redundant. In other words, the identifiable data in one category can be understood as traffic data originating from the same application, and clustering divides this data into multiple data sets. The different data sets can be understood as feature data that identifies the application from different dimensions, so when the dimensions covered by the data sets are sufficiently comprehensive, the valid and complete field information needed to identify the application can be considered retained. Since the feature data represented within one data set is the same, the number of pieces of identifiable data selected from a data set characterizes the degree of redundancy of the data in the intelligence database. The identifiable data may be selected at random from the data set or chosen manually after analysis; this embodiment places no specific limitation on this.
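A minimal sketch of step 104, using the random selection option mentioned above. The function name, the `per_set` parameter, and the nested `{label: [data_set, ...]}` layout are hypothetical and chosen only for this example.

```python
import random


def pick_representatives(data_sets_by_class, per_set=1, seed=None):
    """Pick up to `per_set` records at random from every data set of every class."""
    rng = random.Random(seed)
    selected = []
    for data_sets in data_sets_by_class.values():
        for members in data_sets:
            count = min(per_set, len(members))
            selected.extend(rng.sample(members, count))
    return selected
```

Raising `per_set` keeps more samples per data set and therefore more redundancy in the resulting database; the value 1 is the smallest choice that still covers every data set.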

As the implementation of FIG. 1 shows, the present invention provides a data screening method for an application identification intelligence database that makes it possible to screen traffic data when the application identification database is built. Once the traffic data has been parsed into message data, the data in the message data from which the traffic source cannot be identified is removed, which prunes redundant field content from the traffic data. The identifiable data is then classified before clustering, which reduces the amount of computation and the computational burden of clustering. The different data sets produced by clustering can be understood as feature data that identifies the application from different dimensions. When the dimensions covered by the data sets are sufficiently comprehensive, the valid and complete field information needed to identify the application can be considered retained. Finally, at least one piece of identifiable data is extracted from each of the data sets corresponding to all categories and added to the application identification intelligence database. In this way, network traffic data is refined and compressed while the valid application-identifying field information is fully retained when the intelligence database is built, which reduces redundant invalid content in the intelligence database and improves its identification efficiency.

Further, as a refinement and extension of the embodiment shown in FIG. 1, an embodiment of the present invention also provides another data screening method for an application identification intelligence database, as shown in FIG. 2. The specific steps are as follows:

201. Acquire the corresponding message data according to the request object in the target traffic.

It should be noted that one piece of target traffic may contain multiple pieces of message data, and one piece of message data generally corresponds to one request object. In practice, different pieces of message data may correspond to the same request object. To screen the message data in the target traffic and exclude message data unrelated to the request object, the message data can therefore be acquired on the basis of the request object. The specific process may be as follows: parse the message data in the target traffic to obtain the request object of each piece of message data; compare each parsed request object with the request object to be processed; and, where they match, extract the message data corresponding to the parsed request object as the message data corresponding to the request object to be processed. This screens the message data so that step 202 can be executed subsequently.

202. Parse the request header content of the message data in key-value pair format to obtain parsed data.

In the key-value pair format, the key is a field name in the request header content, and the value is the field information corresponding to that field name. In this step, because the target traffic is traffic whose application-layer protocol is HTTP, the corresponding messages have a fixed structure, namely that of HTTP messages, so the request header part of the message data can be extracted and parsed directly. After parsing, the request header part of the message data consists of multiple fields in key-value pair format, that is, key-value fields, where the key is the field name and each key has a unique corresponding value, which is the field information for that key.
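The key-value parsing in step 202 can be sketched as below. This is a simplified illustration and not a full HTTP parser: it assumes the message is already decoded to text with CRLF line endings, and it keeps only the last value when a header name repeats, in line with the one-value-per-key assumption stated above. The function name is hypothetical.

```python
def parse_request_headers(raw_message):
    """Split an HTTP request into its request line and a {field name: value} dict."""
    # The header part and the body are separated by a blank line.
    head, _, _body = raw_message.partition("\r\n\r\n")
    lines = head.split("\r\n")
    request_line = lines[0]
    headers = {}
    for line in lines[1:]:
        name, sep, value = line.partition(":")
        if not sep:
            continue  # not a "name: value" line, skip it
        headers[name.strip()] = value.strip()
    return request_line, headers
```

Only the header part is walked; the body is split off and ignored, which corresponds to the point made in the text that the other parts of the message do not need to be parsed.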

With steps 201 and 202, the corresponding message data is acquired according to the request object in the target traffic, so the message data is screened on the basis of the request object, and the request header content of the screened message data is parsed in key-value pair format. This ensures that the parsed data contains only content parsed from the request header part of the message data, with no need to parse the other parts of the message data, which further reduces the amount of message data to be parsed and therefore the memory used.

203. Acquire the field information corresponding to the specified fields in the parsed data.

In this step, the specified fields are multiple feature fields that the user has preset as needing processing. Specifically, the field name of each specified field can be matched against the field names in the parsed data. If the match succeeds and the field name has unique corresponding field information, the field information corresponding to the matched field name can be taken directly from the parsed data, so that step 204 can be executed subsequently.

204. Process the field information based on the preset rules for the specified fields to obtain identifiable data.

The preset rules are used to clear, from the field information of a specified field, the data information that cannot identify the traffic source. In this step, the preset rules are processing rules set according to the characteristics of data information that cannot identify the traffic source. After the field information corresponding to a specified field in the parsed data is acquired, the corresponding preset rule can be looked up by the specified field, and the field information is then processed according to that rule. This clears the data information in the field information that cannot identify the traffic source and leaves the data information that can, that is, the identifiable data.

It should be noted that, after the field information has been processed according to the preset rules for the specified fields, the field information may be empty. To keep a specified field with empty field information from occupying storage space, the processed field information of each specified field can be checked, and if it is empty, the specified field is deleted from the parsed data.
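Steps 203 and 204, including the removal of fields left empty, might be wired together as in the following sketch. The `rules` mapping from field name to a cleaning function is a hypothetical structure introduced for this example; the source only states that each specified field has a corresponding preset rule.

```python
def apply_field_rules(parsed, rules):
    """Apply each specified field's rule and drop fields whose result is empty.

    `parsed` maps field names to field information; `rules` maps the
    specified field names to functions that return the cleaned value.
    """
    result = dict(parsed)  # leave the caller's data untouched
    for field, rule in rules.items():
        if field not in result:
            continue  # the specified field is absent from this message
        cleaned = rule(result[field])
        if cleaned:
            result[field] = cleaned
        else:
            del result[field]  # an empty field would only waste storage
    return result
```

Fields that are not listed in `rules` pass through unchanged in this sketch; whether they should be kept or filtered is decided by the rules for other headers.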

Exemplarily, the preset rules corresponding to different specified fields are as follows:

(1) A field in url_path form is split on "/" and each segment is processed separately:

A segment consisting only of digits is replaced with a space;

A 16- or 32-character MD5 string is replaced with a space;

A segment containing MD5 data in the format xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx is replaced with a space;

A combination of 40 or more digits or letters is replaced with a space;

A MAC address is replaced with a space;

For a segment ending with an audio (mp3, wav, flc, etc.), video (mp4, avi, rm, rmvb, ts, etc.), image (jpg, jpeg, png, gif, bmp, etc.), or web resource (js, css, html) suffix, that part is replaced with a space;

Everything except segments with the suffix php, asp, aspx, jsp, jspx, ashx, or action and segments with no suffix is replaced with a space;

After all processing is complete, leading and trailing spaces and "/" are removed; if the result is empty, url_path is deleted.
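The url_path rules in (1) could be sketched roughly as below. The regular expressions for MD5 strings, MAC addresses, and the 8-4-4-4-12 format are assumptions, since the patent gives no patterns for them. The separate media-suffix rule is folded into the "any suffix not on the keep list" rule, which covers it. All names are hypothetical.

```python
import re
from typing import Optional

# Suffixes that are kept; every other suffix is treated as noise.
KEEP_SUFFIXES = {"php", "asp", "aspx", "jsp", "jspx", "ashx", "action"}

# Whole-segment noise patterns (assumed forms).
NOISE_PATTERNS = [
    re.compile(r"\d+"),                                      # digits only
    re.compile(r"[0-9a-fA-F]{16}|[0-9a-fA-F]{32}"),          # 16/32-character MD5
    re.compile(r"[0-9A-Za-z]{40,}"),                         # 40+ digits or letters
    re.compile(r"(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}"),  # MAC address
]
# 8-4-4-4-12 hexadecimal layout, matched anywhere in the segment.
DASHED_HEX_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)


def clean_segment(segment: str) -> str:
    """Return the segment unchanged, or a single space if it is noise."""
    if any(p.fullmatch(segment) for p in NOISE_PATTERNS):
        return " "
    if DASHED_HEX_RE.search(segment):
        return " "
    if "." in segment and segment.rsplit(".", 1)[1].lower() not in KEEP_SUFFIXES:
        return " "
    return segment


def clean_url_path(path: str) -> Optional[str]:
    """Clean every segment, then trim spaces and slashes; None means delete."""
    cleaned = "/".join(clean_segment(s) for s in path.split("/"))
    cleaned = cleaned.strip(" /")
    return cleaned or None
```

Under these assumptions, a path such as `/api/v2/12345/login.php` keeps its `api`, `v2`, and `login.php` segments, while the numeric segment becomes a space.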

(2) A field in url_query form is split on "&" and each part is processed separately:

If the value is a number or a character, the value is deleted;

If the value is a 16- or 32-character MD5 string, the value is deleted;

If the value contains MD5 data in the format xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx, the value is deleted;

If the value is a combination of 40 or more digits or letters, the value is deleted;

If the value is a MAC address, the value is deleted;

If the value is shorter than 4 characters, the value is deleted;

After the above steps, if there is no value and the key is a number or a character, the key is deleted;

After the above steps, if there is no value and the key is shorter than 4 or longer than 39 characters, the key is deleted;

If the source data has no value, no "=" is added when the parts are joined again;
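A rough sketch of the url_query rules in (2) follows. Several readings are assumptions: "number or character" is implemented as "digits only" (a single character is already caught by the length rule), the MD5 and MAC patterns are guessed forms, and a key whose value was deleted keeps its "=" while a key that never had a value is emitted bare. The names are hypothetical.

```python
import re
from typing import List, Optional

MD5_RE = re.compile(r"[0-9a-fA-F]{16}|[0-9a-fA-F]{32}")
DASHED_HEX_RE = re.compile(
    r"[0-9a-fA-F]{8}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{4}-[0-9a-fA-F]{12}"
)
LONG_ALNUM_RE = re.compile(r"[0-9A-Za-z]{40,}")
MAC_RE = re.compile(r"(?:[0-9A-Fa-f]{2}[:-]){5}[0-9A-Fa-f]{2}")


def is_noise_value(value: str) -> bool:
    """True when the value carries no application-identifying content."""
    return (
        len(value) < 4
        or value.isdigit()
        or MD5_RE.fullmatch(value) is not None
        or DASHED_HEX_RE.search(value) is not None
        or LONG_ALNUM_RE.fullmatch(value) is not None
        or MAC_RE.fullmatch(value) is not None
    )


def clean_url_query(query: str) -> Optional[str]:
    """Clean each key=value part; None means the whole field is deleted."""
    kept: List[str] = []
    for part in query.split("&"):
        if not part:
            continue
        key, sep, value = part.partition("=")
        if sep and is_noise_value(value):
            value = ""                      # delete the value, keep the key for now
        if not value and (key.isdigit() or len(key) < 4 or len(key) > 39):
            continue                        # delete the key as well
        kept.append(f"{key}{sep}{value}")   # sep is "" when the source had no "="
    return "&".join(kept) or None
```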

(3) For User-Agent (the User-Agent and UA fields), strings in the field information that do not represent the application are replaced with a space by regular-expression matching. These include the identifiers that tell the visited website the browser type, operating system and version, CPU type, browser rendering engine, browser language, browser plug-ins, and similar information;

After the replacement, leading and trailing spaces are removed; if the result is empty, the field is deleted.
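The patent does not list the regular expressions used for User-Agent cleaning, so the following is only an illustration of the idea. The pattern list is a small, assumed sample of generic browser and platform tokens; a real deployment would need a much longer list. The function name is hypothetical.

```python
import re
from typing import Optional

# Assumed examples of tokens that describe the browser or platform
# rather than the application itself.
GENERIC_UA_PATTERNS = [
    r"Mozilla/\d+(?:\.\d+)*",                               # legacy browser prefix
    r"\([^)]*\)",                                           # platform comments in parentheses
    r"AppleWebKit/[\d.]+",                                  # rendering engine
    r"(?:Chrome|Safari|Firefox|Gecko|Version)/[\w.]+",      # browser product tokens
    r"\bMobile(?:/\w+)?",                                   # device class marker
]
GENERIC_UA_RE = re.compile("|".join(GENERIC_UA_PATTERNS))


def clean_user_agent(user_agent: str) -> Optional[str]:
    """Replace generic tokens with a space, trim, and return None if empty."""
    cleaned = GENERIC_UA_RE.sub(" ", user_agent).strip()
    return cleaned or None
```

With this sample list, a plain desktop browser string reduces to nothing and the field is dropped, while an application-specific product token appended to the string survives.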

(4) For other headers, fields that carry no identifiable information are deleted, including:

If the key is accept, cache-control, connection, content-length, content-type, accept-encoding, cookie, content-encoding, transfer-encoding, accept-charset, charset, accept-language, date, applanguage, language, contenttype, qimei36, rakey, q-qimei, or empty, or if the value contains a time format (^.*[0-9]{4}[0-9]{2}:[0-9]{2}:[0-9]{2}.*$), the field is deleted;

For example, a value such as Sun, 25 Jul 2021 13:17:55 GMT contains a time format;

If the key contains token, date, sign, cookie, or md5, the field is deleted;

If the value is a combination of digits or symbols, the field is deleted;

If the value is in formdata format, that is, key1=value1&key2=value2, it is processed in the same way as the query;

After processing, if the value is empty, the key is deleted.
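The header rules in (4) might be sketched as below. One point is an assumption worth flagging: the time pattern as printed has no separator between the year and the hour, so it would not match the example date; the sketch allows optional whitespace there. The formdata branch is omitted for brevity, and the function name is hypothetical.

```python
import re
from typing import Dict

# Header names listed in the rule (compared in lower case).
DROP_KEYS = {
    "accept", "cache-control", "connection", "content-length", "content-type",
    "accept-encoding", "cookie", "content-encoding", "transfer-encoding",
    "accept-charset", "charset", "accept-language", "date", "applanguage",
    "language", "contenttype", "qimei36", "rakey", "q-qimei",
}
KEY_FRAGMENTS = ("token", "date", "sign", "cookie", "md5")
# Assumption: optional whitespace between the year and the hour.
TIME_RE = re.compile(r"[0-9]{4}\s*[0-9]{2}:[0-9]{2}:[0-9]{2}")
# A value made up only of digits and symbols (no letters).
DIGITS_SYMBOLS_RE = re.compile(r"[\W\d_]+")


def clean_headers(headers: Dict[str, str]) -> Dict[str, str]:
    """Keep only headers that may still identify the application."""
    kept: Dict[str, str] = {}
    for key, value in headers.items():
        name = key.strip().lower()
        if not name or name in DROP_KEYS:
            continue
        if any(fragment in name for fragment in KEY_FRAGMENTS):
            continue
        if not value or TIME_RE.search(value) or DIGITS_SYMBOLS_RE.fullmatch(value):
            continue
        kept[key] = value
    return kept
```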

According to the method of steps 203-204, the field information corresponding to the specified fields in the parsed data is acquired and processed according to the preset rules for those fields. Parsed data that cannot identify the traffic source is thus removed directly from the message data, which screens the identifiable data effectively, trims the redundant field content in the traffic data, and reduces the amount of computation later performed on the identifiable data.

Further, after the parsed data has been processed into identifiable data, the identifiable data may still contain duplicates. To further reduce the redundancy caused by duplicate data information, avoid repeated computation when data of the same classification is clustered later, and lighten the computational load of clustering, the following can be done before the identifiable data is classified with the preset labels: acquire the data information corresponding to each group of key-value pairs in the identifiable data, then detect whether the data information contains duplicate data information with identical keys and values; if so, one group of the duplicate data information is selected and kept in the identifiable data.
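The de-duplication described above can be illustrated with a short sketch. It assumes each piece of identifiable data is a flat dictionary of string keys and string values, and it keeps the first occurrence of each duplicate group; both points are assumptions, as the patent only says that one group is kept.

```python
from typing import Dict, List


def dedupe_records(records: List[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep one record from each group whose keys and values are all identical."""
    seen = set()
    unique: List[Dict[str, str]] = []
    for record in records:
        fingerprint = frozenset(record.items())  # order-independent identity
        if fingerprint not in seen:
            seen.add(fingerprint)
            unique.append(record)
    return unique
```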

205. Acquire the preset label of the identifiable data.

The preset label represents the domain name information in the field information of the identifiable data. In this step, the domain name information is the host field in that field information; the host field gives the domain name of the server and the port number the server listens on. For application-layer HTTP traffic, the host field is used for a domain name lookup to find the application directly associated with the domain name. If none is found, the main domain name is extracted from the host field and the application associated with the main domain name is looked up, thereby preliminarily determining the application name corresponding to the identifiable data.
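The two-stage lookup (full domain first, main domain as fallback) could look like the sketch below. The mapping `domain_to_app` is a hypothetical lookup table, and taking the last two labels as the main domain is a simplifying assumption that breaks for suffixes such as `.co.uk` and for IPv6 literals; a public suffix list would be needed for those cases.

```python
from typing import Dict, Optional


def lookup_app(host: str, domain_to_app: Dict[str, str]) -> Optional[str]:
    """Resolve a Host header value to an application name, if known."""
    domain = host.split(":", 1)[0].lower()      # drop the optional port
    if domain in domain_to_app:                 # direct association
        return domain_to_app[domain]
    main_domain = ".".join(domain.split(".")[-2:])
    return domain_to_app.get(main_domain)       # fallback: main domain
```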

206. Divide the identifiable data corresponding to the same preset label into the same classification.

In this step, after the application name has been preliminarily determined in step 205, the identifiable data can be grouped by application name: identifiable data belonging to the same application name falls into the same classification. Subsequent step 207 is then performed on the identifiable data of each classification, one classification at a time.

According to the method of steps 205-206, the preset labels of the identifiable data are acquired and identifiable data with the same preset label is placed in one classification, so that identifiable data belonging to the same application name is grouped and subsequent computation proceeds per classification (application). This prevents identifiable data of different classifications from interfering with the subsequent clustering, keeps the subsequent computation accurate, and reduces the amount of clustering computation, which provides a basis for processing the identifiable data more efficiently.
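Grouping by preset label amounts to bucketing records under a key. A minimal sketch, assuming records are dictionaries and that `label_of` is a caller-supplied function (hypothetical) that returns the application name or domain label for a record:

```python
from collections import defaultdict
from typing import Callable, Dict, List

Record = Dict[str, str]


def group_by_label(records: List[Record],
                   label_of: Callable[[Record], str]) -> Dict[str, List[Record]]:
    """Place records that share a label into the same classification."""
    groups: Dict[str, List[Record]] = defaultdict(list)
    for record in records:
        groups[label_of(record)].append(record)
    return dict(groups)
```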

207. Use a hierarchical clustering algorithm to calculate the similarity distance between any two pieces of identifiable data in the same classification.

The similarity distance is calculated from the key-value pair information of the identifiable data. It should be noted that the similarity distance represents the degree of similarity between any two pieces of identifiable data. The hierarchical clustering algorithm in this step is a top-down divisive clustering algorithm. Its principle is to treat all the data as one data set and then split it repeatedly until the identifiable data with the same similarity distance has been divided into multiple data sets; that is, similar identifiable data is grouped together to indicate that it shares the same data characteristics. The algorithm is as follows:

① Assume there are N pieces of data, N1, N2, N3, ..., Nn;

② Select one piece of data arbitrarily and divide the data into two sets, {N1} and {N2, N3, N4, ..., Nn};

③ Taking N1 as the reference point, calculate the distance between each of the remaining pieces of data and N1 and, on the principle that data at the same distance forms one data set, divide the data again into {N1}{N2,N5,N7}{N3}{N4,N6,N8, ..., Nn};

④ From the divided data sets, select a piece of data Ni for which no distance has yet been calculated and use it as the reference point. Calculate the distance between Ni and the data in the other data sets and, on the principle that data in the same set and at the same distance forms one data set, divide the data again; data sets that do not change are retained. The data is then divided into {N1}{N2,N5,N7}{N3}{N4,N6,N8}{N9,N10, ..., Nn}.

Steps ③ and ④ are repeated until all the data has gone through the distance calculation. The distance function of such a clustering algorithm is usually the Euclidean distance, the Manhattan distance, the cosine distance, or the like; this embodiment instead calculates the similarity distance with a custom distance function.
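Steps ① to ④ leave some details open, so the sketch below is one possible reading rather than the patented procedure. It assumes that every record serves once as the reference point, that the set containing the reference is left untouched in that round, and that every other set is split by the distance of its members to the reference. The function name and the index-based return value are illustrative choices.

```python
from typing import Callable, Dict, Hashable, List, Sequence, TypeVar

T = TypeVar("T")


def split_by_reference(records: Sequence[T],
                       distance: Callable[[T, T], Hashable]) -> List[List[int]]:
    """Return the final data sets as lists of record indices."""
    n = len(records)
    if n == 0:
        return []
    # Step 2: separate the first record from the rest.
    clusters: List[List[int]] = [[0]]
    if n > 1:
        clusters.append(list(range(1, n)))
    # Steps 3 and 4: use each record once as the reference point.
    for ref in range(n):
        refined: List[List[int]] = []
        for cluster in clusters:
            if ref in cluster:
                refined.append(cluster)   # the reference's own set is retained
                continue
            buckets: Dict[Hashable, List[int]] = {}
            for idx in cluster:           # same set and same distance stay together
                key = distance(records[ref], records[idx])
                buckets.setdefault(key, []).append(idx)
            refined.extend(buckets.values())
        clusters = refined
    return clusters
```

The function accepts any distance callable, so the custom key-value distance of this embodiment can be passed in directly.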

As a refinement and extension of the similarity distance in step 207, the calculation may specifically include:

Acquire the key-value pair information corresponding to the identifiable data, and calculate the similarity distance between any two pieces of identifiable data from the key-value pair information they contain. If a key exists in both pieces of identifiable data with the same value, a first calculation result is determined; if a key exists in both with different values, a second calculation result is determined; if a key exists in only one of the two, a third calculation result is obtained. The first, second, or third calculation result, as applicable, is the similarity distance of that key-value pair, and the similarity distance between the two pieces of identifiable data is the sum of the similarity distances of all their key-value pairs.

Exemplarily, the values of the keys of data A and data B are compared, and the similarity distance is calculated as follows:

If a key is in A but not in B, the distance is set to 100;

If a key is in B but not in A, the distance is set to 100;

If a key is in both A and B but the values differ, the distance is set to 1;

If a key is in both A and B and the values are equal, the distance is set to 0;

The distance between data A and data B is the sum of the above distances. The specific set values of the distance can be customized, and this embodiment places no specific limitation on them.

Corresponding to the above calculation, select data A and data B from the identifiable data. Assume data A contains the three key-value pairs x=1, y=2, and t=5, and data B contains the three key-value pairs x=1, z=3, and t=4, where x, y, z, and t are field names (keys) and the numbers 1 to 5 are field information (values). Comparison shows that:

The key-value pair x=1 exists in both data A and data B with the same key and value, and is recorded as 0;

The key-value pairs y=2 and z=3 exist only in data A and only in data B respectively, and are recorded as 100*2=200;

The key t exists in both data A (t=5) and data B (t=4) with different values, and is recorded as 1;

Therefore, the similarity distance between data A and data B is 0+200+1=201.
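The custom distance and the worked example above can be checked with a short sketch. The weights 100, 1, and 0 are the example values from the text; the function and parameter names are illustrative only.

```python
from typing import Dict, Hashable


def kv_distance(a: Dict[str, Hashable], b: Dict[str, Hashable],
                missing: int = 100, different: int = 1, same: int = 0) -> int:
    """Sum a per-key penalty over the union of keys of the two records."""
    total = 0
    for key in a.keys() | b.keys():
        if key not in a or key not in b:
            total += missing        # key present on one side only
        elif a[key] != b[key]:
            total += different      # same key, different value
        else:
            total += same           # same key, same value
    return total


data_a = {"x": 1, "y": 2, "t": 5}
data_b = {"x": 1, "z": 3, "t": 4}
print(kv_distance(data_a, data_b))  # 0 + 100 + 100 + 1 = 201
```

Because a key missing from one side costs far more than a differing value, records with the same set of keys stay close to each other even when their values vary.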

208. According to the similarity distance, divide the identifiable data with the same similarity distance into the same data set.

It should be noted that the similarity distance represents the degree of similarity between any two pieces of identifiable data, and identifiable data with the same similarity distance has the same characteristics. Identifiable data with the same characteristics can therefore be placed in the same data set.

According to the method of steps 207-208, the key-value pair information of the identifiable data is acquired, the similarity distance between any two pieces of identifiable data is calculated from that information, and identifiable data with the same similarity distance is placed in the same data set. The distance is thus calculated from a comparison of key-value pair information, and assigning custom values to the outcomes of comparing keys and values removes the influence that different values under the same key would otherwise have on the result. This improves the similarity-based partitioning of the data and reduces the amount of identifiable data the intelligence database later has to retain.

209. Extract at least one piece of identifiable data from each of the data sets corresponding to all classifications, and add it to the application identification intelligence database.

This step corresponds to the description of step 104 in the method above; the identical content is not repeated here.
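Step 209 can be sketched as picking representatives from every data set of every classification. The patent only requires "at least one" record per data set; taking the first `per_cluster` records is an assumption made here, as are the names and the list-of-tuples shape of the result.

```python
from typing import Dict, List, Tuple

Record = Dict[str, str]


def select_representatives(
    clusters_by_label: Dict[str, List[List[Record]]],
    per_cluster: int = 1,
) -> List[Tuple[str, Record]]:
    """Collect (label, record) entries destined for the intelligence database."""
    library: List[Tuple[str, Record]] = []
    for label, clusters in clusters_by_label.items():
        for cluster in clusters:
            for record in cluster[:per_cluster]:   # at least one per data set
                library.append((label, record))
    return library
```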

As can be seen from the implementation of FIG. 2 above, the present invention provides a data screening method for an application identification intelligence database. First, the corresponding message data is obtained according to the request object in the target traffic, so that the message data is screened by request object, and the request header content of the screened message data is parsed in key-value pair format. This ensures that the parsed data contains only content parsed from the request header of the message data, with no need to parse the other parts of the message data, which further reduces the amount of data to parse and the memory it occupies. Next, the field information corresponding to the specified fields in the parsed data is acquired and processed according to the preset rules for those fields, so that parsed data that cannot identify the traffic source is removed directly from the message data. This screens the identifiable data effectively and trims the redundant field content in the traffic data, reducing the amount of computation later performed on the identifiable data. Then the preset labels of the identifiable data are acquired and identifiable data with the same preset label is placed in one classification, so that identifiable data belonging to the same application name is grouped and subsequent computation proceeds per classification (application). This prevents identifiable data of different classifications from interfering with the subsequent clustering, keeps the subsequent computation accurate, and reduces the amount of clustering computation, which provides a basis for processing the identifiable data more efficiently. After that, the key-value pair information of the identifiable data is acquired, the similarity distance between any two pieces of identifiable data is calculated from that information, and identifiable data with the same similarity distance is placed in the same data set. The distance is thus calculated from a comparison of key-value pair information, and assigning custom values to the outcomes of comparing keys and values removes the influence of different values under the same key, improves the similarity-based partitioning of the data, and reduces the amount of identifiable data the intelligence database later has to retain. Finally, at least one piece of identifiable data is extracted from each of the data sets corresponding to all classifications and added to the application identification intelligence database, which meets the data screening requirement when the application identification intelligence database is built.

Further, as an implementation of the method embodiments shown in FIGS. 1-2 above, an embodiment of the present invention provides a data screening device for an application identification intelligence database. The device refines and compresses network traffic data while fully retaining the field information that effectively identifies applications, thereby reducing redundant invalid content in the intelligence database and improving its identification efficiency. This device embodiment corresponds to the foregoing method embodiments. For ease of reading, the details of the method embodiments are not repeated one by one here, but it should be clear that the device in this embodiment can implement everything in the method embodiments. As shown in FIG. 3, the device includes:

a parsing unit 31, configured to parse the collected target traffic into message data in a preset format;

a processing unit 32, configured to process, according to preset rules, the data that cannot identify the traffic source in the message data obtained by the parsing unit 31, to obtain identifiable data;

a calculation unit 33, configured to classify the identifiable data obtained by the processing unit 32 using preset labels, and to perform clustering calculation on the identifiable data of the same classification to obtain at least one data set;

an extraction unit 34, configured to extract at least one piece of identifiable data from each of the data sets corresponding to all classifications in the calculation unit 33;

an adding unit 35, configured to add the at least one piece of identifiable data obtained by the extraction unit 34 to the application identification intelligence database.

Further, as shown in FIG. 4, the parsing unit 31 includes:

a first obtaining module 311, configured to obtain the corresponding message data according to the request object in the target traffic;

a parsing module 312, configured to parse, in key-value pair format, the request header content of the message data obtained by the first obtaining module 311 to obtain parsed data, where in the key-value pair format the key is a field name in the request header content and the value is the field information corresponding to that field name.

Further, as shown in FIG. 4, the processing unit 32 includes:

a second obtaining module 321, configured to obtain the field information corresponding to the specified fields in the parsed data;

a processing module 322, configured to process the field information obtained by the second obtaining module 321 according to the preset rules for the specified fields to obtain identifiable data, where the preset rules are used to remove, from the field information of the specified fields, the data that does not identify the traffic source.

Further, as shown in FIG. 4, the processing unit 32 also includes:

a detection module 323, configured to detect whether the field information of a specified field is empty after processing by the processing module 322;

a deletion module 324, configured to delete the specified field from the parsed data if the detection module 323 detects that the processed field information of the specified field is empty.

Further, as shown in FIG. 4, the calculation unit 33 includes:

a third obtaining module 331, configured to obtain the preset label of the identifiable data, where the preset label represents the domain name information in the field information corresponding to the identifiable data;

a classification module 332, configured to divide the identifiable data corresponding to the same preset label obtained by the third obtaining module 331 into the same classification;

a calculation module 333, configured to use a hierarchical clustering algorithm to calculate the similarity distance between any two pieces of identifiable data in the same classification obtained by the classification module 332, where the similarity distance is calculated from the key-value pair information of the identifiable data;

a dividing module 334, configured to divide, according to the similarity distance obtained by the calculation module 333, the identifiable data with the same similarity distance into the same data set.

Further, as shown in FIG. 4, the calculation module 333 includes:

an acquisition submodule 3331, configured to acquire the key-value pair information corresponding to the identifiable data;

a calculation submodule 3332, configured to calculate the similarity distance between any two pieces of identifiable data from the key-value pair information obtained by the acquisition submodule 3331;

If a key exists in both pieces of identifiable data with the same value, a first calculation result is determined; if a key exists in both with different values, a second calculation result is determined; if a key exists in only one of the two, a third calculation result is obtained. The first, second, or third calculation result, as applicable, is the similarity distance of that key-value pair, and the similarity distance between the two pieces of identifiable data is the sum of the similarity distances of all their key-value pairs.

Further, as shown in FIG. 4, the device also includes:

an obtaining unit 36, configured to obtain the data information corresponding to each group of key-value pairs in the identifiable data;

a detection unit 37, configured to detect whether the data information obtained by the obtaining unit 36 contains duplicate data information with identical keys and values;

a retention unit 38, configured to select one group of the duplicate data information and keep it in the identifiable data if the detection unit 37 detects duplicate data information with identical keys and values.

Further, an embodiment of the present invention provides a storage medium for storing a computer program, where the computer program, when run, controls the device on which the storage medium resides to execute the data screening method for an application identification intelligence database described in FIGS. 1-2 above.

Further, an embodiment of the present invention provides a processor for running a program, where the program, when run, executes the data screening method for an application identification intelligence database described in FIGS. 1-2 above.

In the above embodiments, the description of each embodiment has its own emphasis. For parts not described in detail in one embodiment, refer to the related descriptions of the other embodiments.

It can be understood that related features of the above method, system, and electronic device may be cross-referenced. In addition, "first", "second", and the like in the above embodiments are used to distinguish the embodiments and do not indicate that one embodiment is better or worse than another.

Those skilled in the art can clearly understand that, for convenience and brevity of description, the specific working processes of the systems, devices, and units described above can be found in the corresponding processes of the foregoing method embodiments and are not repeated here.

在此提供的算法和显示不与任何特定计算机、虚拟系统或者其它设备固有相关。各种通用系统也可以与基于在此的示教一起使用。根据上面的描述，构造这类系统所要求的结构是显而易见的。此外，本发明也不针对任何特定编程语言。应当明白，可以利用各种编程语言实现在此描述的本发明的内容，并且上面对特定语言所做的描述是为了披露本发明的最佳实施方式。The algorithms and displays provided herein are not inherently related to any particular computer, virtual system, or other device. Various general-purpose systems can also be used with teaching based on this. The structure required to construct such a system is apparent from the above description. Furthermore, the present invention is not directed to any particular programming language. It should be understood that various programming languages may be used to implement the inventions described herein, and that the descriptions of specific languages above are intended to disclose the best mode for carrying out the invention.

此外，存储器可能包括计算机可读介质中的非永久性存储器，随机存取存储器(RAM)和/或非易失性内存等形式，如只读存储器(ROM)或闪存(flash RAM)，存储器包括至少一个存储芯片。In addition, memory may include non-persistent memory in computer readable media in the form of random access memory (RAM) and/or non-volatile memory, such as read only memory (ROM) or flash memory (flash RAM), including at least one memory chip.

本领域内的技术人员应明白，本申请的实施例可提供为方法、系统、或计算机程序产品。因此，本申请可采用完全硬件实施例、完全软件实施例、或结合软件和硬件方面的实施例的形式。而且，本申请可采用在一个或多个其中包含有计算机可用程序代码的计算机可用存储介质(包括但不限于磁盘存储器、CD-ROM、光学存储器等)上实施的计算机程序产品的形式。As will be appreciated by those skilled in the art, the embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.

本申请是参照根据本申请实施例的方法、设备(系统)、和计算机程序产品的流程图和/或方框图来描述的。应理解可由计算机程序指令实现流程图和/或方框图中的每一流程和/或方框、以及流程图和/或方框图中的流程和/或方框的结合。可提供这些计算机程序指令到通用计算机、专用计算机、嵌入式处理机或其他可编程数据处理设备的处理器以产生一个机器，使得通过计算机或其他可编程数据处理设备的处理器执行的指令产生用于实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能的装置。The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems), and computer program products according to embodiments of the present application. It will be understood that each flow and/or block in the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to the processor of a general purpose computer, special purpose computer, embedded processor or other programmable data processing device to produce a machine such that the instructions executed by the processor of the computer or other programmable data processing device produce Means for implementing the functions specified in a flow or flow of a flowchart and/or a block or blocks of a block diagram.

这些计算机程序指令也可存储在能引导计算机或其他可编程数据处理设备以特定方式工作的计算机可读存储器中，使得存储在该计算机可读存储器中的指令产生包括指令装置的制造品，该指令装置实现在流程图一个流程或多个流程和/或方框图一个方框或多个方框中指定的功能。These computer program instructions may also be stored in a computer-readable memory capable of directing a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory result in an article of manufacture comprising instruction means, the instructions The apparatus implements the functions specified in the flow or flow of the flowcharts and/or the block or blocks of the block diagrams.

These computer program instructions may also be loaded onto a computer or other programmable data processing device, causing a series of operational steps to be performed on the computer or other programmable device to produce a computer-implemented process, such that the instructions executed on the computer or other programmable device provide steps for implementing the functions specified in one or more flows of the flowcharts and/or one or more blocks of the block diagrams.

In a typical configuration, a computing device includes one or more processors (CPUs), input/output interfaces, network interfaces, and memory.

The memory may include volatile memory in computer-readable media, such as random access memory (RAM), and/or non-volatile memory, such as read-only memory (ROM) or flash memory (flash RAM). Memory is an example of a computer-readable medium.

Computer-readable media include permanent and non-permanent, removable and non-removable media, which may implement information storage by any method or technology. The information may be computer-readable instructions, data structures, program modules, or other data. Examples of computer storage media include, but are not limited to, phase-change memory (PRAM), static random access memory (SRAM), dynamic random access memory (DRAM), other types of random access memory (RAM), read-only memory (ROM), electrically erasable programmable read-only memory (EEPROM), flash memory or other memory technology, compact disc read-only memory (CD-ROM), digital versatile disc (DVD) or other optical storage, magnetic cassettes, magnetic tape or magnetic disk storage or other magnetic storage devices, or any other non-transmission medium that can be used to store information accessible by a computing device. As defined herein, computer-readable media do not include transitory computer-readable media, such as modulated data signals and carrier waves.

It should also be noted that the terms "comprising", "including", or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or device comprising a series of elements includes not only those elements but also other elements not expressly listed, or elements inherent to such a process, method, article, or device. Without further limitation, an element qualified by the phrase "comprising a..." does not preclude the presence of additional identical elements in the process, method, article, or device that includes the element.

It will be appreciated by those skilled in the art that embodiments of the present application may be provided as a method, a system, or a computer program product. Accordingly, the present application may take the form of an entirely hardware embodiment, an entirely software embodiment, or an embodiment combining software and hardware aspects. Furthermore, the present application may take the form of a computer program product embodied on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, and optical storage) containing computer-usable program code.

The above are merely embodiments of the present application and are not intended to limit it. Various modifications and variations of the present application will be apparent to those skilled in the art. Any modification, equivalent replacement, or improvement made within the spirit and principle of the present application shall fall within the scope of the claims of the present application.

Claims

1. A data screening method for an application identification intelligence library, characterized by comprising the following steps:

parsing the collected target traffic into message data in a preset format;

processing, according to a preset rule, data in the message data from which a traffic source cannot be identified, to obtain identifiable data;

classifying the identifiable data by using a preset label, and performing clustering calculation on the identifiable data of the same classification to obtain at least one data set;

and extracting at least one piece of identifiable data from each of the data sets corresponding to all the classifications, and adding the extracted data to an application identification intelligence library.
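
By way of illustration only, the final extraction step of claim 1 might be sketched in Python as follows. The function name `select_representatives`, the `per_set` parameter, and the rule of preferring the record with the most fields are assumptions made for this sketch; the claim only requires that at least one piece of identifiable data be taken from each data set.

```python
from typing import Dict, List

Record = Dict[str, str]  # one piece of identifiable data as key-value pairs


def select_representatives(
    data_sets_by_label: Dict[str, List[List[Record]]],
    per_set: int = 1,
) -> List[Record]:
    """Pick `per_set` records from every data set of every classification."""
    library: List[Record] = []
    for data_sets in data_sets_by_label.values():
        for data_set in data_sets:
            # Hypothetical selection rule: prefer records carrying the most
            # fields; any other rule that picks at least one would also do.
            ranked = sorted(data_set, key=len, reverse=True)
            library.extend(ranked[:per_set])
    return library


clustered = {
    "a.example": [
        [{"Host": "a.example"}, {"Host": "a.example", "User-Agent": "DemoApp/1.0"}],
        [{"Host": "a.example", "User-Agent": "OtherApp/3.2"}],
    ],
    "b.example": [[{"Host": "b.example"}]],
}
library = select_representatives(clustered)
```

In this sketch four records are reduced to three library entries, one per data set, which is the compression effect the method aims for.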

2. The method according to claim 1, wherein parsing the collected target traffic into message data in a preset format comprises:

acquiring corresponding message data according to a request object in the target traffic;

parsing the request header content of the message data into a key-value pair format to obtain parsed data, wherein in the key-value pair format, a key is a field name in the request header content, and a value is the field information corresponding to the field name.
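
The parsing step of claim 2 can be pictured with a minimal Python sketch, assuming the message data is a raw HTTP/1.1 request in text form. The function name `parse_request_headers` and the sample request are hypothetical and do not come from the application.

```python
from typing import Dict


def parse_request_headers(raw_request: str) -> Dict[str, str]:
    """Turn the header block of a raw HTTP request into key-value pairs."""
    # Everything before the first blank line is the request line plus headers.
    head, _, _ = raw_request.partition("\r\n\r\n")
    lines = head.split("\r\n")
    parsed: Dict[str, str] = {}
    for line in lines[1:]:  # lines[0] is the request line, not a header
        name, sep, value = line.partition(":")
        if not sep:
            continue  # skip malformed lines that contain no colon
        # Key = field name, value = field information for that name.
        # A repeated field name overwrites the earlier value in this sketch.
        parsed[name.strip()] = value.strip()
    return parsed


raw = (
    "GET /index.html HTTP/1.1\r\n"
    "Host: example.com\r\n"
    "User-Agent: DemoApp/1.0\r\n"
    "\r\n"
)
parsed = parse_request_headers(raw)
```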

3. The method according to claim 2, wherein processing, according to a preset rule, data in the message data from which a traffic source cannot be identified, to obtain identifiable data, comprises:

acquiring the field information corresponding to a specified field in the parsed data;

and processing the field information based on the preset rule for processing the specified field to obtain the identifiable data, wherein the preset rule is used for clearing, from the field information of the specified field, data information that does not identify a traffic source.

4. The method of claim 3, wherein processing the field information based on the preset rule for processing the specified field comprises:

detecting whether the processed field information of the specified field is empty;

and if it is empty, deleting the specified field from the parsed data.
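
Claims 3 and 4 together describe clearing uninformative content from specified fields and dropping any field left empty. A minimal Python sketch is given below. The rule table `PRESET_RULES` and its two regular expressions are purely hypothetical examples of a preset rule; the application does not specify them here.

```python
import re
from typing import Dict

# Hypothetical preset rules: for each specified field, patterns whose matches
# are assumed to say nothing about the traffic source and are cleared.
PRESET_RULES = {
    "User-Agent": [
        re.compile(r"Mozilla/\d+\.\d+"),  # generic compatibility token
        re.compile(r"\([^)]*\)"),         # parenthesised platform details
    ],
}


def clean_fields(parsed: Dict[str, str], rules=PRESET_RULES) -> Dict[str, str]:
    """Apply the preset rules and delete specified fields that end up empty."""
    cleaned = dict(parsed)  # leave the caller's data untouched
    for field, patterns in rules.items():
        if field not in cleaned:
            continue
        value = cleaned[field]
        for pattern in patterns:
            value = pattern.sub(" ", value)
        value = " ".join(value.split())  # collapse leftover whitespace
        if value:
            cleaned[field] = value
        else:
            del cleaned[field]  # claim 4: empty after processing, so remove
    return cleaned


kept = clean_fields(
    {"Host": "example.com", "User-Agent": "Mozilla/5.0 (Windows NT 10.0) DemoApp/1.0"}
)
dropped = clean_fields(
    {"Host": "example.com", "User-Agent": "Mozilla/5.0 (X11; Linux x86_64)"}
)
```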

5. The method of any of claims 1-4, wherein classifying the identifiable data by using a preset label and performing clustering calculation on the identifiable data of the same classification to obtain at least one data set comprises:

acquiring a preset label of the identifiable data, wherein the preset label is used for representing domain name information in the field information corresponding to the identifiable data;

dividing the identifiable data corresponding to the same preset label into the same category;

calculating a similarity distance between any two pieces of identifiable data in the same classification by using a hierarchical clustering algorithm, wherein the similarity distance is calculated based on key-value pair information of the identifiable data;

and dividing the identifiable data with the same similarity distance into the same data set according to the similarity distance.
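
The classification and clustering of claim 5 might look like the following Python sketch. Several points are assumptions rather than details from the application: the `Host` field is taken as the carrier of the domain-name label, the clustering is single-linkage agglomerative clustering cut at a `threshold`, and the distance used in the usage lines simply counts differing key-value pairs.

```python
from collections import defaultdict
from typing import Callable, Dict, List

Record = Dict[str, str]


def classify_by_label(records: List[Record], label_field: str = "Host") -> Dict[str, List[Record]]:
    """Group records that share the same preset label (assumed: the domain)."""
    groups = defaultdict(list)
    for record in records:
        groups[record.get(label_field, "")].append(record)
    return dict(groups)


def cluster(records: List[Record],
            distance: Callable[[Record, Record], float],
            threshold: float) -> List[List[Record]]:
    """Single-linkage agglomerative clustering, stopped at `threshold`."""
    clusters = [[r] for r in records]
    while len(clusters) > 1:
        best = None  # (distance, i, j) of the closest pair of clusters
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                d = min(distance(x, y) for x in clusters[i] for y in clusters[j])
                if best is None or d < best[0]:
                    best = (d, i, j)
        if best[0] > threshold:
            break  # remaining clusters are too far apart to merge
        _, i, j = best
        clusters[i].extend(clusters.pop(j))  # j > i, so index i stays valid
    return clusters


def count_diff(a: Record, b: Record) -> int:
    # Placeholder distance: number of key-value pairs not shared by both.
    return len(set(a.items()) ^ set(b.items()))


records = [
    {"Host": "a.example", "User-Agent": "DemoApp/1.0"},
    {"Host": "a.example", "User-Agent": "DemoApp/1.0", "Accept": "*/*"},
    {"Host": "a.example", "User-Agent": "OtherApp/3.2", "Cookie": "x"},
    {"Host": "b.example", "User-Agent": "DemoApp/1.0"},
]
groups = classify_by_label(records)
data_sets = {label: cluster(group, count_diff, threshold=1) for label, group in groups.items()}
```

For larger inputs, a library routine such as SciPy's `scipy.cluster.hierarchy.linkage` combined with `fcluster` could replace the quadratic loop used in this sketch.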

6. The method of claim 4, wherein calculating the similarity distance between any two of the identifiable data in the same category using a hierarchical clustering algorithm comprises:

acquiring the key-value pair information corresponding to the identifiable data;

calculating the similarity distance between any two pieces of identifiable data according to the key-value pair information in the two pieces of identifiable data;

if a key in the key-value pair information exists in both pieces of identifiable data and its values are the same, determining a first calculation result; if a key in the key-value pair information exists in both pieces of identifiable data and its values are different, determining a second calculation result; if a key in the key-value pair information exists in only one piece of identifiable data, determining a third calculation result; the sum of the first calculation result, the second calculation result, and the third calculation result is the similarity distance corresponding to the key-value pair information, and the similarity distance of the identifiable data is the sum of the similarity distances of all the key-value pair information.
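
The three-case distance of claim 6 can be sketched in Python as below. The numeric values assigned to the first, second, and third calculation results (`same_cost`, `diff_cost`, `missing_cost`) are assumptions chosen for illustration; the claim does not fix them.

```python
from typing import Dict


def similarity_distance(a: Dict[str, str], b: Dict[str, str],
                        same_cost: float = 0.0,
                        diff_cost: float = 0.5,
                        missing_cost: float = 1.0) -> float:
    """Sum a per-key cost over every key present in either record."""
    total = 0.0
    for key in a.keys() | b.keys():
        if key in a and key in b:
            # Key in both records: first result if the values match,
            # second result if they differ.
            total += same_cost if a[key] == b[key] else diff_cost
        else:
            # Key in only one record: third result.
            total += missing_cost
    return total


x = {"Host": "example.com", "User-Agent": "DemoApp/1.0", "Accept": "*/*"}
y = {"Host": "example.com", "User-Agent": "DemoApp/2.0"}
d = similarity_distance(x, y)  # Host same, User-Agent differs, Accept missing
```

With the assumed costs, identical records are at distance zero, and a missing field counts more heavily than a field whose value differs.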

7. The method of claim 2, wherein before classifying the identifiable data by using a preset label and performing clustering calculation on the identifiable data of the same classification to obtain at least one data set, the method further comprises:

acquiring data information corresponding to each group of key-value pairs in the identifiable data;

detecting whether repeated data information with the same key and value exists in the data information;

and if so, retaining only one set of the repeated data information in the identifiable data.
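
The deduplication of claim 7 might be sketched in Python as follows, assuming each piece of identifiable data is held as an ordered list of `(key, value)` tuples so that repeated pairs can occur. The function name `dedupe_pairs` is hypothetical.

```python
from typing import List, Tuple

Pair = Tuple[str, str]


def dedupe_pairs(pairs: List[Pair]) -> List[Pair]:
    """Keep the first occurrence of each identical (key, value) pair."""
    seen = set()
    unique: List[Pair] = []
    for pair in pairs:
        if pair in seen:
            continue  # same key and same value already retained once
        seen.add(pair)
        unique.append(pair)
    return unique


pairs = [
    ("Accept", "*/*"),
    ("Cookie", "a=1"),
    ("Accept", "*/*"),   # exact repeat, dropped
    ("Cookie", "b=2"),   # same key but different value, kept
]
unique_pairs = dedupe_pairs(pairs)
```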

8. A data screening apparatus for an application identification intelligence library, the apparatus comprising:

a parsing unit, configured to parse the collected target traffic into message data in a preset format;

a processing unit, configured to process, according to a preset rule, data from which a traffic source cannot be identified in the message data obtained by the parsing unit, to obtain identifiable data;

a calculation unit, configured to classify the identifiable data obtained by the processing unit by using a preset label, and to perform clustering calculation on the identifiable data of the same classification to obtain at least one data set;

an extracting unit, configured to extract at least one piece of the identifiable data from each of the data sets corresponding to all the classifications obtained by the calculation unit;

and an adding unit, configured to add the at least one piece of identifiable data extracted by the extracting unit to an application identification intelligence library.

9. A storage medium, characterized in that the storage medium comprises a stored program, wherein, when the program is run, a device on which the storage medium is located is controlled to perform the method according to any one of claims 1 to 7.

10. A processor, characterized in that the processor is configured to run a program, wherein the program, when run, performs the method of any one of claims 1 to 7.