CN115391568A - Entity classification method, system, terminal and storage medium based on knowledge graph - Google Patents
Entity classification method, system, terminal and storage medium based on knowledge graph
- Publication number
- CN115391568A (application CN202211160630.6A)
- Authority
- CN
- China
- Prior art keywords
- entity
- dns
- domain name
- knowledge graph
- triple
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/30—Information retrieval; Database structures therefor; File system structures therefor of unstructured textual data
- G06F16/36—Creation of semantic tools, e.g. ontology or thesauri
- G06F16/367—Ontology
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Life Sciences & Earth Sciences (AREA)
- Animal Behavior & Ethology (AREA)
- Computational Linguistics (AREA)
- Data Mining & Analysis (AREA)
- Databases & Information Systems (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Information Retrieval, Db Structures And Fs Structures Therefor (AREA)
Abstract
The present application provides a knowledge graph-based entity classification method, system, terminal and storage medium. The method includes: establishing a DNS knowledge graph based on DNS hierarchical relationships and query resolution relationships, wherein the DNS knowledge graph includes a preset label corresponding to at least one domain name; splitting the DNS knowledge graph into entities and relationships, and fusing the entities and relationships by entity attribute alignment to obtain a fused DNS knowledge graph; vectorizing the entities and relationships of the fused DNS knowledge graph to obtain DNS knowledge graph vectors, wherein the DNS knowledge graph vectors include a knowledge graph vector corresponding to each domain name; training a neural network model with the knowledge graph vector corresponding to a domain name as the input and the preset label corresponding to that domain name as the output, to obtain an entity classification model; and classifying and detecting domain names according to the entity classification model. The present application can improve the accuracy and speed of domain name classification and detection.
Description
Technical Field
The present application relates to the technical field of computer network security, and in particular to a knowledge graph-based entity classification method, system, terminal and storage medium.
Background
With the rapid development of network technology and the Internet, both the number of network users and the volume of network traffic are growing exponentially, and the user base is becoming broader. Network facilities of all kinds face great challenges in reliability and security. At the present stage the Internet security situation is severe, and it is likely to remain so for a long time.
The Domain Name System (DNS) is basic infrastructure for Internet communication. It resolves a domain name to obtain the IP address of the resource to be accessed; its main application scenarios include web page access and the sending and receiving of email. The report "China's Internet Network Security Situation in the First Half of 2019", released by the National Computer Network Emergency Response Technical Team/Coordination Center of China (CNCERT), states that in the first half of 2019 CNCERT's own monitoring found about 46,000 counterfeit pages targeting websites in China. Most malicious behavior is closely related to the DNS, and the demand for malicious domain name detection is growing accordingly.
At present, whether a domain name is malicious is generally determined by examining the structure of domain name logs. In conventional techniques, however, domain names are usually checked one at a time, which is time-consuming, and domain names that share the same attributes are not processed together, which makes detection inaccurate. A better way to distinguish malicious domain names is therefore urgently needed.
Summary of the Invention
The present application provides a knowledge graph-based entity classification method, system, terminal and storage medium, to solve the problem of inaccurate domain name classification and detection in the prior art.
In a first aspect, the present application provides a knowledge graph-based entity classification method, including: establishing a DNS knowledge graph based on DNS hierarchical relationships and query resolution relationships, wherein the DNS knowledge graph includes a preset label corresponding to at least one domain name, and the preset labels include malicious and non-malicious;
splitting the DNS knowledge graph into entities and relationships, and fusing the entities and the relationships by entity attribute alignment to obtain a fused DNS knowledge graph;
vectorizing the entities and relationships of the fused DNS knowledge graph to obtain DNS knowledge graph vectors, wherein the DNS knowledge graph vectors include a knowledge graph vector corresponding to each domain name;
training a neural network model with the knowledge graph vector corresponding to the domain name as the input and the preset label corresponding to the domain name as the output, to obtain an entity classification model;
classifying and detecting domain names according to the entity classification model.
In a possible implementation, establishing the DNS knowledge graph based on DNS hierarchical relationships and query resolution relationships includes:
establishing a DNS domain name hierarchy graph based on the DNS hierarchical relationships, and adding the preset labels to the DNS domain name hierarchy graph;
establishing a DNS query response graph and a passive DNS graph based on the DNS query resolution relationships;
combining the DNS query response graph and the passive DNS graph to establish a DNS flow graph, and adding the preset labels to the DNS flow graph;
combining the DNS domain name hierarchy graph and the DNS flow graph by rule alignment to establish the DNS knowledge graph.
In a possible implementation, the entities include head entities and tail entities;
splitting the DNS knowledge graph into entities and relationships, and fusing the entities and the relationships by entity attribute alignment to obtain the fused DNS knowledge graph, includes:
using a triple representation, taking the client IP address of the DNS knowledge graph as the head entity, the Qname attribute of the DNS knowledge graph as the relationship, and the domain name of the DNS knowledge graph as the tail entity;
fusing the head entity, the relationship and the tail entity by entity attribute alignment to obtain the fused DNS knowledge graph.
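The triple construction described above can be sketched as follows. This is only an illustration: the record field names (`client_ip`, `qname`, `domain`) and the sample values are hypothetical, since the application does not specify a log format, and removing duplicate triples is an assumption added for the example.

```python
from typing import Iterable, List, NamedTuple


class Triple(NamedTuple):
    head: str      # client IP address
    relation: str  # Qname attribute of the query
    tail: str      # domain name


def build_triples(records: Iterable[dict]) -> List[Triple]:
    """Convert DNS log records into triples, dropping exact duplicates."""
    seen = set()
    triples = []
    for rec in records:
        # Field names are assumptions made for this illustration.
        triple = Triple(rec["client_ip"], rec["qname"], rec["domain"])
        if triple not in seen:
            seen.add(triple)
            triples.append(triple)
    return triples


# Hypothetical sample records using documentation-reserved addresses.
records = [
    {"client_ip": "192.0.2.10", "qname": "www.example.com", "domain": "example.com"},
    {"client_ip": "192.0.2.10", "qname": "www.example.com", "domain": "example.com"},
    {"client_ip": "192.0.2.11", "qname": "mail.example.com", "domain": "example.com"},
]
triples = build_triples(records)
```

Keeping the triples as named tuples makes the head, relationship and tail roles explicit when the triples are later passed to an embedding step.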
In a possible implementation, the DNS knowledge graph includes triples corresponding to at least one class of domain names, and vectorizing the entities and relationships of the fused DNS knowledge graph to obtain the DNS knowledge graph vectors includes:
for each class of domain names, arbitrarily selecting the head entity of one triple in that class as a starting node, and calculating the entity distance between the tail entity of the current triple and the head entity of a first triple;
determining whether the entity distance between the tail entity of the current triple and the head entity of the first triple is not greater than a preset entity distance;
if the entity distance between the tail entity of the current triple and the head entity of the first triple is not greater than the preset entity distance, linking the head entity of the first triple to the tail entity of the current triple, and taking the first triple as the current triple;
repeating the current step until the head entity of the current triple is the starting node, to obtain the entities and relationships of the domain name;
vectorizing the entities and relationships corresponding to each domain name to obtain the knowledge graph vector corresponding to each domain name;
wherein the first triple is an unlinked triple among the triples corresponding to that class of domain names.
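The linking loop above can be illustrated with a simplified greedy sketch. Several points are assumptions rather than part of the application: the entity distance is taken to be the Euclidean distance between entity vectors, the nearest unlinked triple is chosen as the "first triple", and the loop simply stops when no unlinked triple is within the preset distance (the negative-sampling branch is omitted). All entity names and vectors are hypothetical.

```python
import math
from typing import Dict, List, Sequence, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def entity_distance(a: Sequence[float], b: Sequence[float]) -> float:
    """Euclidean distance between two entity vectors (an assumed metric)."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def link_triples(
    triples: List[Triple],
    vectors: Dict[str, Sequence[float]],
    max_distance: float,
) -> List[Tuple[Triple, Triple]]:
    """Greedily chain triples whose head lies close to the current tail."""
    if not triples:
        return []
    current = triples[0]          # its head entity is the starting node
    unlinked = list(triples[1:])  # candidates for the "first triple"
    links = []
    while unlinked:
        # Assumption: try the unlinked triple nearest to the current tail.
        candidate = min(
            unlinked,
            key=lambda t: entity_distance(vectors[current[2]], vectors[t[0]]),
        )
        if entity_distance(vectors[current[2]], vectors[candidate[0]]) > max_distance:
            break  # simplification: no negative sampling in this sketch
        links.append((current, candidate))
        unlinked.remove(candidate)
        current = candidate
    return links


# Hypothetical two-dimensional entity vectors.
vectors = {
    "ip1": (0.0, 0.0), "a.example": (1.0, 0.0),
    "ip2": (1.0, 0.5), "b.example": (2.0, 0.0),
    "ip3": (10.0, 10.0), "c.example": (11.0, 10.0),
}
triples = [
    ("ip1", "q1", "a.example"),
    ("ip2", "q2", "b.example"),
    ("ip3", "q3", "c.example"),
]
links = link_triples(triples, vectors, max_distance=1.0)
```

In this example only the first two triples are linked, because the head entity of the third lies far beyond the preset distance from every tail entity.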
In a possible implementation, after determining whether the entity distance between the tail entity of the current triple and the head entity of the first triple is not greater than the preset entity distance, the method includes:
Step 1: if the entity distance between the tail entity of the current triple and the head entity of the first triple is greater than the preset entity distance, performing Step 2;
Step 2: replacing the head entity of the first triple with the head entity of a second triple, replacing the tail entity of the first triple with the tail entity of the second triple, and taking the replaced first triple as a negative sampling triple, wherein the second triple is any already linked triple in that class of domain names;
Step 3: determining whether the entity distance between the tail entity of the current triple and the head entity of the negative sampling triple is not greater than the preset entity distance;
Step 4: if the entity distance between the tail entity of the current triple and the head entity of the negative sampling triple is not greater than the preset entity distance, linking the head entity of the negative sampling triple to the tail entity of the current triple, and taking the negative sampling triple as the current triple;
Step 5: if the entity distance between the tail entity of the current triple and the head entity of the negative sampling triple is greater than the preset entity distance, returning to Step 2 and repeating Steps 2 to 5 until the entity distance between the tail entity of the current triple and the head entity of the negative sampling triple is not greater than the preset entity distance.
In a possible implementation, vectorizing the entities and relationships of the fused DNS knowledge graph to obtain the DNS knowledge graph vectors includes:
vectorizing the entities and relationships of the fused DNS knowledge graph with the TransE algorithm to obtain the DNS knowledge graph vectors.
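For orientation, TransE models a relationship as a translation in the embedding space, so that head + relation is close to tail for a valid triple. The sketch below shows the scoring function and the margin-based ranking loss commonly used with TransE. The application names the algorithm but gives no hyperparameters, so the L2 norm, the margin of 1.0 and the toy vectors are assumptions.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]


def transe_score(h: Vector, r: Vector, t: Vector) -> float:
    """L2 norm of (h + r - t); a lower score means a more plausible triple."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def margin_loss(
    positive: Tuple[Vector, Vector, Vector],
    negative: Tuple[Vector, Vector, Vector],
    margin: float = 1.0,
) -> float:
    """Margin-based ranking loss between a valid and a corrupted triple."""
    return max(0.0, margin + transe_score(*positive) - transe_score(*negative))


# Toy embeddings: the positive triple satisfies h + r = t exactly.
h, r, t = (1.0, 0.0), (0.0, 1.0), (1.0, 1.0)
t_corrupt = (3.0, 1.0)  # corrupted tail entity for the negative triple

pos_score = transe_score(h, r, t)
neg_score = transe_score(h, r, t_corrupt)
loss = margin_loss((h, r, t), (h, r, t_corrupt))
```

Training would adjust the embeddings by gradient descent on this loss; that optimization loop is left out to keep the sketch short.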
In a possible implementation, training the neural network model with the knowledge graph vector corresponding to the domain name as the input and the preset label corresponding to the domain name as the output, to obtain the entity classification model, includes:
training a BiLSTM neural network with the knowledge graph vector corresponding to the domain name as the input and the preset label corresponding to the domain name as the output, and taking the trained BiLSTM neural network model as the entity classification model.
In a second aspect, the present application provides a knowledge graph-based entity classification system, the system including: an establishment module, a fusion module, a vectorization module, a training module and a detection module;
the establishment module is configured to establish a DNS knowledge graph based on DNS hierarchical relationships and query resolution relationships, wherein the DNS knowledge graph includes a preset label corresponding to at least one domain name, and the preset labels include malicious and non-malicious;
the fusion module is configured to split the DNS knowledge graph into entities and relationships, and to fuse the entities and the relationships by entity attribute alignment to obtain a fused DNS knowledge graph;
the vectorization module is configured to vectorize the entities and relationships of the fused DNS knowledge graph to obtain DNS knowledge graph vectors, wherein the DNS knowledge graph vectors include a knowledge graph vector corresponding to each domain name;
the training module is configured to train a neural network model with the knowledge graph vector corresponding to the domain name as the input and the preset label corresponding to the domain name as the output, to obtain an entity classification model;
the detection module is configured to classify and detect domain names according to the entity classification model.
In a third aspect, the present application provides a terminal, including a memory, a processor, and a computer program stored in the memory and executable on the processor, wherein the processor, when executing the computer program, implements the steps of the method according to any possible implementation of the first aspect.
In a fourth aspect, the present application provides a computer-readable storage medium storing a computer program, wherein the computer program, when executed by a processor, implements the steps of the method according to any possible implementation of the first aspect.
The present application provides a knowledge graph-based entity classification method, system, terminal and storage medium. A DNS knowledge graph is established based on DNS hierarchical relationships and query resolution relationships, wherein the DNS knowledge graph includes a preset label corresponding to at least one domain name and the preset labels include malicious and non-malicious, so that each domain name is associated with a preset label. The DNS knowledge graph is then split into entities and relationships, which are fused by entity attribute alignment to obtain a fused DNS knowledge graph. The entities and relationships of the fused DNS knowledge graph are vectorized to obtain DNS knowledge graph vectors, which include a knowledge graph vector corresponding to each domain name. A neural network model is trained with the knowledge graph vector corresponding to a domain name as the input and the preset label corresponding to that domain name as the output, to obtain an entity classification model, and domain names are classified and detected according to the entity classification model. In this way, domain name classification and detection become more accurate and detection is faster.
Brief Description of the Drawings
To describe the technical solutions in the embodiments of the present application more clearly, the drawings needed for describing the embodiments or the prior art are briefly introduced below. Obviously, the drawings described below are only some embodiments of the present application, and those of ordinary skill in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of an implementation of the knowledge graph-based entity classification method provided by an embodiment of the present application;
Fig. 2 is an embedding structure diagram of the knowledge graph-based entity classification method provided by an embodiment of the present application;
Fig. 3 is a detection structure diagram of the knowledge graph-based entity classification method provided by an embodiment of the present application;
Fig. 4 is a schematic structural diagram of the knowledge graph-based entity classification system provided by an embodiment of the present application;
Fig. 5 is a schematic diagram of the terminal provided by an embodiment of the present application.
Detailed Description
In the following description, specific details such as particular system structures and techniques are set forth for the purpose of illustration rather than limitation, to provide a thorough understanding of the embodiments of the present application. However, it will be apparent to those skilled in the art that the present application may also be practiced in other embodiments without these specific details. In other instances, detailed descriptions of well-known systems, devices, circuits and methods are omitted so that unnecessary detail does not obscure the description of the present application.
To make the objectives, technical solutions and advantages of the present application clearer, specific embodiments are described below with reference to the drawings.
Fig. 1 is a flowchart of an implementation of the knowledge graph-based entity classification method provided by an embodiment of the present application, described in detail as follows:
In S101, a DNS knowledge graph is established based on DNS hierarchical relationships and query resolution relationships, wherein the DNS knowledge graph includes a preset label corresponding to at least one domain name, and the preset labels include malicious and non-malicious.
DNS (Domain Name System) is an Internet service. As a distributed database that maps domain names and IP addresses to each other, it makes access to the Internet more convenient. DNS uses UDP port 53. Currently, each level (label) of a domain name is limited to 63 characters, and the total length of a domain name cannot exceed 253 characters.
The DNS hierarchical relationship is a hierarchical structure that can be viewed as a tree. The Domain Name System does not distinguish between interior nodes and leaf nodes of the tree; they are all simply called nodes, and different nodes may use the same label. A node label may consist of only three kinds of characters: the 26 English letters (a to z), the 10 Arabic numerals (0 to 9) and the hyphen (-), and the length of a label must not exceed 63 characters. The domain name of a node is formed by concatenating the labels of all nodes on the path from that node to the root, separated by dots. The domain name of a node at the top level is called a top-level domain (TLD), the domain name of a node at the second level is called a second-level domain, and so on.
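As a small sketch of these naming rules, the functions below check a domain name against the 63-character label limit and the 253-character total limit stated above, and list the domain names on the path from the top-level domain down to the full name. The function names are hypothetical, and accepting upper-case letters is an assumption (DNS names are compared case-insensitively).

```python
import re
from typing import List

# Letters, digits and hyphens only, 1 to 63 characters per label.
LABEL_RE = re.compile(r"[A-Za-z0-9-]{1,63}")


def is_valid_domain(name: str) -> bool:
    """Check label characters, label length and total name length."""
    name = name.rstrip(".")  # tolerate a trailing root dot
    if not name or len(name) > 253:
        return False
    return all(LABEL_RE.fullmatch(label) for label in name.split("."))


def hierarchy_levels(name: str) -> List[str]:
    """Return the domain names from the top-level domain down to `name`."""
    labels = name.rstrip(".").split(".")
    return [".".join(labels[i:]) for i in range(len(labels) - 1, -1, -1)]


levels = hierarchy_levels("www.fudan.edu.cn")
```

Each entry of `levels` corresponds to one node on the path from the top-level domain to the queried name, which is the information a domain name hierarchy graph would encode as parent-child edges.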
The DNS query resolution relationship is implemented as follows:
First, the domain name space is divided by subtree into different zones according to its hierarchical structure, and each zone can be regarded as an administrative authority responsible for that part of the nodes in the hierarchy. For example, the top-level zone of the whole name space is managed by ICANN, and certain country-code domains together with their subordinate nodes form their own zones; the .cn domain, for instance, is managed by CNNIC. The .cn domain is in turn divided into smaller zones; for example, .fudan.edu.cn is managed by the Fudan University Network Center.
Second, each zone must have corresponding name servers, and the information contained in each zone is stored on its name servers. A zone may have two or more name servers, so that if one of them fails another can still provide information. A name server may also be responsible for several zones at the same time. After receiving a request from a user, a name server looks up its own set of resource records and returns the final answer the user wants; if the answer cannot be found in its own resource records, it returns a pointer to another name server, and the user then sends the request to that name server. A name server therefore does not need to record information about all subordinate domain names and hosts; for a subdomain, if one exists, it only needs to know the name servers of that subdomain.
A resource record is a binding from a domain name to a value and includes the following fields: domain name, value, type, class and time to live. The domain name field and the value field represent the content to be resolved and the result returned by resolution, respectively. The type field indicates the kind of value. Type A means that the value field is an IP address, which is the final answer the user wants. Type NS means that the value field is the domain name of another name server that knows how to resolve the name given in the domain name field. Type CNAME means that the value field is an alias of the host specified by the domain name. Type MX means that the value field is the domain name of a mail server that accepts mail for the domain given in the domain name field. Type PTR is used for reverse resolution, among other purposes. The class field allows other kinds of records to be specified. The time-to-live field indicates how long the resource record remains valid. To reduce resolution time, a name server caches resource records that it has previously obtained from other name servers. Because these records may become invalid when the underlying data changes, each record carries a time to live, and expired records are removed from the cache.
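The resource record fields and the time-to-live based cache described above can be sketched as follows. The class and method names are hypothetical, and the current time is passed in explicitly as `now` (in seconds) so that the expiry behaviour is easy to follow; a real name server would read a clock instead.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ResourceRecord:
    name: str    # domain name being resolved
    value: str   # result of the resolution
    rtype: str   # record type, e.g. "A", "NS", "CNAME", "MX", "PTR"
    rclass: str  # record class, e.g. "IN"
    ttl: int     # time to live in seconds


class RecordCache:
    """Caches records and discards them once their time to live has passed."""

    def __init__(self) -> None:
        self._store: Dict[Tuple[str, str], List[Tuple[ResourceRecord, float]]] = {}

    def put(self, record: ResourceRecord, now: float) -> None:
        key = (record.name, record.rtype)
        self._store.setdefault(key, []).append((record, now + record.ttl))

    def get(self, name: str, rtype: str, now: float) -> List[ResourceRecord]:
        key = (name, rtype)
        live = [(rr, exp) for rr, exp in self._store.get(key, []) if exp > now]
        if live:
            self._store[key] = live
        else:
            self._store.pop(key, None)  # every cached record has expired
        return [rr for rr, _ in live]


cache = RecordCache()
record = ResourceRecord("www.example.com", "192.0.2.1", "A", "IN", ttl=300)
cache.put(record, now=0.0)
fresh = cache.get("www.example.com", "A", now=100.0)
expired = cache.get("www.example.com", "A", now=300.0)
```

Here the record is returned while it is still within its 300-second lifetime and is dropped from the cache on the first lookup after it expires.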
The root name servers know the name servers of all top-level domains. For each top-level domain they hold two resource records: an NS record, whose domain name field is the top-level domain and whose value field is the domain name of the name server that resolves that top-level domain, and an A record, which gives the IP address corresponding to the domain name of that name server. Using these two records together, a resolver knows which name server IP address to query next for a name under that domain. Second-level name servers similarly store pointers to the third-level name servers. Resource records of types such as A, CNAME and MX appear on the third-level name servers. Every name server holds address records for the root name servers.
Finally, a user who needs a domain name resolved first sends the resolution request to the local name server. If the local name server can resolve the name, the result is returned directly; otherwise the local name server sends a request to a root name server. Following the pointer returned by the root name server, it queries the name server at the next level, and so on, until it obtains the IP address of the domain name to be resolved.
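The referral-following process described above can be sketched with a small in-memory simulation. Everything here is hypothetical: the server identifiers, the zone contents and the address (taken from a documentation-reserved range) are made up, and a real resolver would send network queries and use both the NS record and its accompanying A record rather than a dictionary lookup.

```python
from typing import Dict, List, Optional, Tuple

# Each simulated server maps a name suffix to ("A", ip) for a final answer
# or ("NS", server_id) for a referral to another server.
SERVERS: Dict[str, Dict[str, Tuple[str, str]]] = {
    "root": {"com": ("NS", "com-server")},
    "com-server": {"example.com": ("NS", "example-server")},
    "example-server": {"www.example.com": ("A", "192.0.2.1")},
}


def query(server: str, name: str) -> Optional[Tuple[str, str]]:
    """Return the record for the longest suffix of `name` known to `server`."""
    labels = name.split(".")
    for i in range(len(labels)):
        suffix = ".".join(labels[i:])
        if suffix in SERVERS[server]:
            return SERVERS[server][suffix]
    return None


def resolve(name: str, start: str = "root", max_hops: int = 10) -> Tuple[Optional[str], List[str]]:
    """Follow referrals from `start` until an address record is found."""
    server = start
    path = [server]
    for _ in range(max_hops):
        answer = query(server, name)
        if answer is None:
            return None, path      # no server on this path knows the name
        kind, value = answer
        if kind == "A":
            return value, path     # final answer
        server = value             # referral: ask the next server
        path.append(server)
    return None, path


address, path = resolve("www.example.com")
```

The returned `path` records which servers were consulted, which is the kind of query and response information a DNS flow graph would capture.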
A knowledge graph, known in library and information science as knowledge domain visualization or knowledge domain mapping, is a family of graphical representations that show the development and structure of knowledge. It uses visualization techniques to describe knowledge resources and their carriers, and to mine, analyze, construct, draw and display knowledge and the interconnections within it.
Knowledge graphs combine the theories and methods of applied mathematics, graphics, information visualization and information science with bibliometric methods such as citation analysis and co-occurrence analysis, and use visual maps to present the core structure, development history, frontier areas and overall knowledge architecture of a discipline, serving the goal of multidisciplinary integration. They can provide practical and valuable references for research in a discipline.
The characteristics of a knowledge graph include: the more users search and the wider the range of their searches, the more information and content the search engine can obtain; strings are given meaning rather than being treated as plain strings; all disciplines are integrated so that user searches are coherent; more accurate information is found for users, with more comprehensive summaries and more deeply related information; the body of knowledge related to a keyword is presented to users systematically; and useful information is drawn from the entire Internet so that users can obtain more relevant public resources.
In this embodiment of the present application, a DNS knowledge graph is established according to the DNS hierarchical relationships and query resolution relationships, wherein the DNS knowledge graph includes at least one domain name and the preset label corresponding to that domain name.
In this embodiment, the preset labels include a malicious label and a non-malicious label. The non-malicious label is a benign label: a domain name carrying it is a benign domain name, that is, one that does not harm the network. The malicious label indicates that the domain name carrying it is a harmful domain name, that is, one that has an adverse effect on the network.
In a possible implementation, establishing the DNS knowledge graph based on DNS hierarchical relationships and query resolution relationships may include:
establishing a DNS domain name hierarchy graph based on the DNS hierarchical relationships, and adding the preset labels to the DNS domain name hierarchy graph;
establishing a DNS query response graph and a passive DNS graph based on the DNS query resolution relationships;
combining the DNS query response graph and the passive DNS graph to establish a DNS flow graph, and adding the preset labels to the DNS flow graph;
combining the DNS domain name hierarchy graph and the DNS flow graph by rule alignment to establish the DNS knowledge graph.
The DNS domain name hierarchy graph is built from the hierarchical relationship of DNS, using the structure and levels inherent in the domain name itself. In other words, the DNS domain name hierarchy graph is a hierarchical representation of domain names, and a preset label is added to every domain name in it.
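The hierarchy described above can be sketched as follows. This is a minimal illustration, not the applicant's implementation: the function name `hierarchy_edges`, the use of `"."` for the DNS root, and the example label dictionary are all assumptions made for the example.

```python
def hierarchy_edges(domain):
    """Return (parent, child) edges from the DNS root down to the full domain name."""
    labels = domain.rstrip(".").split(".")
    edges = []
    parent = "."  # assumed representation of the DNS root
    # Walk from the top-level label toward the leftmost label.
    for i in range(len(labels) - 1, -1, -1):
        child = ".".join(labels[i:])
        edges.append((parent, child))
        parent = child
    return edges


# Hypothetical preset label attached to a node of the hierarchy graph.
preset_labels = {"www.example.com": "non-malicious"}
example_edges = hierarchy_edges("www.example.com")
```

Each edge links a zone to the name one level below it, so the leaf node is the fully qualified name that carries the preset label.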
The DNS query response graph and the passive DNS graph are built from the query resolution relationship of DNS, that is, from the flow of a DNS request. The two graphs are then combined to obtain the DNS flow graph, and a preset label is added to every domain name in the DNS flow graph.
The main element shared by the DNS domain name hierarchy graph and the DNS flow graph is the domain name information: the Qname attribute of the query part of the DNS flow graph corresponds closely to the domain name.
The DNS domain name hierarchy graph and the DNS flow graph are combined by rule alignment to obtain the DNS knowledge graph.
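One way to picture rule alignment on the shared domain name is the sketch below. The record fields `client_ip` and `qname`, the relation name `"Qname"`, and the normalisation (lower-casing, stripping the trailing dot) are assumptions for illustration; the application does not specify the exact rule.

```python
def align_graphs(hierarchy_edges, flow_records):
    """Attach flow records to hierarchy nodes whose name equals the record's Qname."""
    nodes = {name for edge in hierarchy_edges for name in edge}
    graph = {"edges": list(hierarchy_edges), "queries": []}
    for record in flow_records:
        qname = record["qname"].rstrip(".").lower()
        if qname in nodes:  # assumed alignment rule: Qname matches a domain node
            graph["queries"].append((record["client_ip"], "Qname", qname))
    return graph


# Hypothetical data: one query aligns with the hierarchy, one does not.
demo_edges = [(".", "com"), ("com", "example.com")]
demo_records = [
    {"client_ip": "10.0.0.1", "qname": "Example.com."},
    {"client_ip": "10.0.0.2", "qname": "other.org"},
]
demo_graph = align_graphs(demo_edges, demo_records)
```

Records whose Qname has no counterpart in the hierarchy graph are simply left out in this sketch; a real system might instead add the missing nodes.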
In S102, the DNS knowledge graph is split into entities and relations, and the entities and relations are fused by entity attribute alignment to obtain a fused DNS knowledge graph.
In the embodiment of the present application, entity attribute alignment is used to fuse the DNS knowledge graph and obtain the fused DNS knowledge graph.
Entity attribute alignment is a method of connecting different subgraphs wherever an entity and an attribute value are identical.
In a possible implementation, the entities may include a head entity and a tail entity;
splitting the DNS knowledge graph into entities and relations, and fusing the entities and relations by entity attribute alignment to obtain the fused DNS knowledge graph, may specifically include:
using triples, taking the client IP address of the DNS knowledge graph as the head entity, the Qname attribute of the DNS knowledge graph as the relation, and the domain name of the DNS knowledge graph as the tail entity;
fusing the head entity, the relation and the tail entity by entity attribute alignment to obtain the fused DNS knowledge graph.
A triple is a set of the form ((x, y), z), usually abbreviated as (x, y, z). The term is commonly used for a compressed storage format for sparse matrices, also called a triple table: when the triple table is held in a sequential storage structure, the result is a compressed representation of the sparse matrix known as a sequential triple table, or triple table for short. Because the matrix is sparse, this compression greatly reduces its memory cost. Concretely, the row, the column and the value of each non-zero element form a triple (i, j, v), and the triples are stored in some fixed order, which saves storage space.
In the embodiment of the present application, following the description of triples above, x corresponds to the head entity, y to the relation, and z to the tail entity.
Using triples, the client IP address in the DNS flow graph of the DNS knowledge graph is taken as the head entity, the Qname attribute in the DNS flow graph is taken as the relation, and the corresponding domain name in the DNS domain name hierarchy graph is taken as the tail entity. By the triple convention head entity + relation = tail entity, this gives: client IP address in the DNS flow graph + Qname attribute = corresponding domain name in the DNS domain name hierarchy graph, which keeps the whole DNS knowledge graph consistent. The client IP address and the Qname attribute in the DNS flow graph are then fused with the corresponding domain name in the DNS domain name hierarchy graph by entity attribute alignment to obtain the fused DNS knowledge graph.
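The triple construction and fusion just described can be sketched as below. The record fields, the constant relation name `"Qname"`, and the adjacency-map representation are assumptions for illustration; the point is only that entities with the same value collapse into a single node.

```python
from collections import defaultdict


def build_triples(flow_records):
    """Turn flow records into (head, relation, tail) triples."""
    return [(r["client_ip"], "Qname", r["domain"]) for r in flow_records]


def fuse(triples):
    """Merge triples that share an entity value into one adjacency map."""
    graph = defaultdict(set)
    for head, relation, tail in triples:
        graph[head].add((relation, tail))
        graph.setdefault(tail, set())  # make sure tail entities appear as nodes
    return dict(graph)


# Hypothetical records: two clients querying the same domain name.
demo_triples = build_triples([
    {"client_ip": "10.0.0.1", "domain": "example.com"},
    {"client_ip": "10.0.0.2", "domain": "example.com"},
])
fused = fuse(demo_triples)
```

Because both triples end in the same domain name, the fused graph has three nodes rather than four, which is the effect alignment on identical entity values is meant to have.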
In a possible implementation, the DNS knowledge graph includes triples corresponding to at least one class of domain names, and vectorizing the entities and relations of the fused DNS knowledge graph to obtain the DNS knowledge graph vector includes:
for each class of domain names, arbitrarily selecting the head entity of one triple in that class as the starting node, and calculating the entity distance between the tail entity of the current triple and the head entity of a first triple;
judging whether the entity distance between the tail entity of the current triple and the head entity of the first triple is not greater than a preset entity distance;
if the entity distance between the tail entity of the current triple and the head entity of the first triple is not greater than the preset entity distance, linking the head entity of the first triple to the tail entity of the current triple, and taking the first triple as the current triple;
repeating the current step until the head entity of the current triple is the starting node, to obtain the entities and relations of the domain name;
vectorizing the entities and relations corresponding to each domain name to obtain the knowledge graph vector corresponding to each domain name;
wherein the first triple is a triple, among the triples corresponding to that class of domain names, that has not yet been linked.
In the embodiment of the present application, a notion of distance is applied to the entities of the DNS knowledge graph, and the properties of a directed graph are used. The process of obtaining the entities and relations of a domain name is as follows: for each class of domain names, the head entity of one triple in that class is arbitrarily selected as the starting node and a preset entity distance is set; the entity distance between the tail entity of the current triple and the head entity of the first triple is calculated; if this distance is not greater than the preset entity distance, the head entity of the first triple is linked to the tail entity of the current triple and the first triple becomes the current triple; these operations are repeated until the head entity of the current triple is the starting node, which yields the entities and relations of the domain name.
The entities and relations corresponding to each domain name are then vectorized to obtain the knowledge graph vector corresponding to each domain name.
For example, suppose a class of domain names contains domain names A, B, C and D, the preset entity distance is set to 1, and the head entity of domain name A is the starting node. The domain name following A is looked up; if it is domain name B, the entity distance between the tail entity of A and the head entity of B is calculated. If this distance is not greater than 1, the head entity of B is linked to the tail entity of A, B becomes the current domain name, and the next domain name is looked up. This continues until domain name D, the last one, is reached and the entity distance between the tail entity of D and the head entity of A is calculated; if that distance is not greater than 1, the class is considered fully processed and the entities and relations of the domain name have been obtained.
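The linking procedure can be sketched as a greedy chain over triples. The application does not define how the entity distance is computed, so `distance` is a caller-supplied function and the lookup table below is purely hypothetical. The sketch also simplifies the stopping rule: it stops when no unlinked triple is close enough, rather than checking for a return to the starting node.

```python
def link_triples(triples, distance, max_distance=1.0):
    """Greedily chain triples whose head lies within max_distance of the current tail."""
    if not triples:
        return []
    current = triples[0]          # arbitrary start; its head is the starting node
    unlinked = list(triples[1:])
    chain = [current]
    while unlinked:
        candidate = next(
            (t for t in unlinked if distance(current[2], t[0]) <= max_distance),
            None,
        )
        if candidate is None:
            break                 # no unlinked triple is close enough
        unlinked.remove(candidate)
        chain.append(candidate)
        current = candidate       # the newly linked triple becomes the current one
    return chain


# Hypothetical distances between a tail entity and a head entity.
table = {("a1", "b0"): 0.5, ("b1", "c0"): 0.9}
toy_distance = lambda x, y: table.get((x, y), float("inf"))
A, B, C = ("a0", "r", "a1"), ("b0", "r", "b1"), ("c0", "r", "c1")
chain = link_triples([A, C, B], toy_distance)
```

With a preset entity distance of 1, A links to B and B links to C regardless of the order in which the unlinked triples are listed.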
In a possible implementation, after judging whether the entity distance between the tail entity of the current triple and the head entity of the first triple is not greater than the preset entity distance, the method includes:
Step 1: if the entity distance between the tail entity of the current triple and the head entity of the first triple is greater than the preset entity distance, performing Step 2;
Step 2: replacing the head entity of the first triple with the head entity of a second triple, replacing the tail entity of the first triple with the tail entity of the second triple, and taking the first triple after replacement as a negative sampling triple, wherein the second triple is any already linked triple in that class of domain names;
Step 3: judging whether the entity distance between the tail entity of the current triple and the head entity of the negative sampling triple is not greater than the preset entity distance;
Step 4: if the entity distance between the tail entity of the current triple and the head entity of the negative sampling triple is not greater than the preset entity distance, linking the head entity of the negative sampling triple to the tail entity of the current triple, and taking the negative sampling triple as the current triple;
Step 5: if the entity distance between the tail entity of the current triple and the head entity of the negative sampling triple is greater than the preset entity distance, returning to Step 2 and repeating Steps 2 to 5 until the entity distance between the tail entity of the current triple and the head entity of the negative sampling triple is not greater than the preset entity distance.
In the embodiment of the present application, if some entities cannot be linked after the entity distance has been calculated, that is, the entity distance between the tail entity of the current triple and the head entity of the first triple is greater than the preset entity distance, the distance between the two is regarded as infinite.
In the embodiment of the present application, during triple training the head entity and tail entity of a randomly selected second triple replace the head entity and tail entity of the first triple, and the first triple after replacement is taken as the negative sampling triple, the second triple being any already linked triple in that class of domain names. It is then judged again whether the entity distance between the tail entity of the current triple and the head entity of the negative sampling triple is not greater than the preset entity distance. If it is not greater, the head entity of the negative sampling triple is linked to the tail entity of the current triple and the negative sampling triple becomes the current triple. If it is greater, another linked triple is selected to replace the first triple, until the entity distance between the tail entity of the current triple and the head entity of the negative sampling triple is not greater than the preset entity distance.
For example, suppose a class of domain names contains domain names A, B, C and D, the preset entity distance is set to 1, and the head entity of domain name A is the starting node. The domain name following A is looked up; if it is domain name B, and the entity distance between the tail entity of A and the head entity of B is calculated to be 15, which is greater than 1, the distance between the tail entity of A and the head entity of B is regarded as infinite. A domain name C that has already been linked is then chosen, the head entity of B is replaced with the head entity of C, the tail entity of B is replaced with the tail entity of C, and B after replacement is taken as the negative sampling triple. If the entity distance between the tail entity of A and the head entity of the negative sampling triple is calculated to be 0.7, which is less than 1, the head entity of the negative sampling triple is linked to the tail entity of A and the negative sampling triple becomes the current triple. If the negative sampling triple obtained by replacing B with the chosen C still does not satisfy the requirement, another domain name is chosen, and so on until the requirement is met.
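Steps 2 to 5 can be sketched as a bounded resampling loop. The `distance` function, the seeded random generator, and the `max_tries` cap (added so the sketch always terminates) are assumptions; the application itself repeats until a suitable replacement is found.

```python
import random


def negative_sample(current, first, linked, distance,
                    max_distance=1.0, rng=None, max_tries=100):
    """Replace head and tail of `first` with those of a linked triple until it fits."""
    if not linked:
        return None
    rng = rng or random.Random(0)
    for _ in range(max_tries):
        second = rng.choice(linked)                   # any already linked triple
        candidate = (second[0], first[1], second[2])  # keep the relation of `first`
        if distance(current[2], candidate[0]) <= max_distance:
            return candidate                          # becomes the new current triple
    return None                                       # no acceptable replacement found


# Hypothetical distances mirroring the worked example: 15 for B, 0.7 for C.
dist_table = {("a1", "b0"): 15.0, ("a1", "c0"): 0.7}
dist = lambda x, y: dist_table.get((x, y), float("inf"))
cur, unlinked_b, linked_c = ("a0", "r", "a1"), ("b0", "q", "b1"), ("c0", "r", "c1")
sample = negative_sample(cur, unlinked_b, [linked_c], dist)
```

Here the distance of 15 to B exceeds the preset distance of 1, so B's head and tail are swapped for those of C, whose distance of 0.7 is acceptable.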
In the embodiment of the present application, the fused DNS knowledge graph needs data processing: repeated and redundant data in the DNS knowledge graph are filtered out, identical entities are merged, and features such as frequency and duration may be added, so that the data in the fused DNS knowledge graph are cleaner. For example, if a host sends requests for the domain name www.google.com at different times with the same request parameters, the requests are considered identical and are merged. This operation does not change the data content of the fused DNS knowledge graph.
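The merging of identical requests can be sketched with a counter. The record fields `client_ip`, `qname`, `qtype` and `timestamp`, and the choice of which fields count as "request parameters", are assumptions for illustration.

```python
from collections import Counter


def merge_requests(requests):
    """Collapse requests with identical parameters and record how often each occurred."""
    counts = Counter((r["client_ip"], r["qname"], r["qtype"]) for r in requests)
    return [
        {"client_ip": ip, "qname": qname, "qtype": qtype, "frequency": n}
        for (ip, qname, qtype), n in counts.items()
    ]


# Hypothetical log: the same host asks for the same name at two different times.
log = [
    {"client_ip": "10.0.0.1", "qname": "www.google.com", "qtype": "A", "timestamp": 100},
    {"client_ip": "10.0.0.1", "qname": "www.google.com", "qtype": "A", "timestamp": 160},
    {"client_ip": "10.0.0.1", "qname": "www.google.com", "qtype": "AAAA", "timestamp": 161},
]
merged = merge_requests(log)
```

The timestamp is deliberately left out of the key, so repeated requests collapse into one entry whose `frequency` preserves the information that was merged away.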
In S103, the entities and relations of the fused DNS knowledge graph are vectorized to obtain the DNS knowledge graph vector, wherein the DNS knowledge graph vector includes the knowledge graph vector corresponding to each domain name.
In a possible implementation, vectorizing the entities and relations of the fused DNS knowledge graph to obtain the DNS knowledge graph vector includes:
vectorizing the entities and relations of the fused DNS knowledge graph with the TransE algorithm to obtain the DNS knowledge graph vector.
The TransE algorithm produces embedded representations of the nodes and relations in a graph structure. It is based on distributed vector representations of entities and relations; it was inspired by word2vec and exploits the translation invariance observed in word vectors.
In the embodiment of the present application, the TransE algorithm is used to vectorize the entities and relations of the fused DNS knowledge graph, and the resulting DNS knowledge graph vector serves as the basis for the subsequent training of the neural network model.
Specifically, referring to Fig. 2, since TransE is an embedding algorithm, the process consists of two parts: triple embedding and attribute embedding. Attribute embedding is implemented on the basis of triples. The triple embedding part is trained with the TransE algorithm, and the training objective is to make head entity + relation = tail entity hold as closely as possible. Because the head entity or tail entity is replaced at random, in the DNS knowledge graph of this embodiment the randomly chosen replacement entity may lie very close to the entity being trained. During training it was found that the replacement entity may be separated from the entity being trained by only one entity. The main reason is that graph fusion by rule alignment shortens the distance between a large number of entities, and may connect entities that were originally distinct.
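The training objective head entity + relation = tail entity can be made concrete with the usual TransE scoring function and margin ranking loss. The application does not state its loss, so the L2 norm, the margin value, and the function names below are assumptions drawn from the standard formulation of TransE.

```python
import math


def transe_distance(head, relation, tail):
    """L2 norm of head + relation - tail; smaller means a more plausible triple."""
    return math.sqrt(sum((h + r - t) ** 2 for h, r, t in zip(head, relation, tail)))


def margin_loss(positive, negative, margin=1.0):
    """Penalise a corrupted triple that scores within `margin` of the true triple."""
    return max(0.0, margin + transe_distance(*positive) - transe_distance(*negative))


# Hypothetical 2-dimensional embeddings.
h, r, t = [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]
t_corrupt = [4.0, 5.0]
loss = margin_loss((h, r, t), (h, r, t_corrupt))
```

A corrupted triple whose replacement entity sits close to the true one has nearly the same distance as the true triple, so it contributes a loss near the margin instead of zero; this is the difficulty with nearby replacement entities that the paragraph above points out.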
In S104, the knowledge graph vector corresponding to the domain name is taken as the input and the preset label corresponding to the domain name as the output to train a neural network model and obtain the entity classification model.
In a possible implementation, taking the knowledge graph vector corresponding to the domain name as the input and the preset label corresponding to the domain name as the output to train a neural network model and obtain the entity classification model includes:
taking the knowledge graph vector corresponding to the domain name as the input and the preset label corresponding to the domain name as the output, training a BiLSTM neural network, and taking the trained BiLSTM neural network model as the entity classification model.
In the embodiment of the present application, the model obtained by training the BiLSTM neural network is used as the entity classification model.
BiLSTM (Bi-directional Long Short-Term Memory) is composed of a forward LSTM and a backward LSTM and is used to process context information. A plain LSTM cannot encode information from back to front because of its serial structure; as a result, it learns interactions less well in some fine-grained classification tasks. In a BiLSTM, the forward LSTM uses past information and the backward LSTM uses future information. Each node in the BiLSTM is an LSTM unit. During training, each training sequence is processed by two independent recurrent neural networks, one forward and one backward, which are finally connected to the same output layer; the detection module uses the BiLSTM for feature extraction. Because information from both directions is available at each time step, the prediction can be more accurate than that of a unidirectional LSTM.
The input of the BiLSTM neural network is in vector form, so in the embodiment of the present application the knowledge graph vector corresponding to the domain name is used as the input of the BiLSTM neural network.
In the embodiment of the present application, the knowledge graph vector corresponding to the domain name is taken as the input of the BiLSTM neural network and the preset label corresponding to the domain name as its output; the BiLSTM neural network is trained, and the trained BiLSTM neural network model is used as the entity classification model.
Specifically, the knowledge graph vector corresponding to the domain name is taken as the input of the BiLSTM neural network and the malicious or non-malicious label corresponding to the domain name as its output, and the BiLSTM neural network is trained to obtain the entity classification model.
Specifically, referring to Fig. 3, the entity classification model passes the input through a BiLSTM layer, an Attention layer, a Dropout layer, a Flatten layer, a Dense layer and a Softmax layer in turn, and finally outputs the detection result. The layers are explained as follows:
The main function of the BiLSTM layer is to learn the contextual relationships between the vectors in a sequence, covering both the preceding and the following vectors, so that features can be extracted more effectively for classification.
The Attention layer was first used in image processing, where the goal is to direct the computer's focus to the parts of an image that need attention; that is, the attention distribution differs across the scenes within the image, and some pixels can be regarded as carrying greater weight than others. Training on text sequences raises certain problems. First, when the input sequence is extremely long, it is hard for the model to form a good vector representation. Second, as the sequence is fed in, all context is compressed into a fixed length, which limits the capacity of the model. The Attention mechanism retains the encoding results of the BiLSTM layer, learns selectively from these inputs, and associates the output sequence with them.
The Dropout layer is equivalent to randomly generating small models within the overall network, which simulates ensemble learning. Its direct effect is to reduce the number of intermediate features and their redundancy, thereby increasing the orthogonality between the features of each layer. During model training, the outputs of some nodes are randomly set to zero and their weights are not updated. The main parameter of this layer is a probability, with which each node is deactivated. Adding the Dropout layer helps prevent the model from overfitting.
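The zeroing behaviour can be sketched in a few lines. The rescaling of surviving values by 1/(1 - rate) follows the convention of common deep learning frameworks and is an assumption here; the application only states that node outputs are zeroed with a given probability.

```python
import random


def dropout(values, rate, rng):
    """Zero each value with probability `rate`; rescale the survivors (assumed convention)."""
    if not 0.0 <= rate < 1.0:
        raise ValueError("rate must be in [0, 1)")
    keep = 1.0 - rate
    return [v / keep if rng.random() >= rate else 0.0 for v in values]


# Hypothetical activations passed through dropout during training.
activations = [1.0, 2.0, 3.0, 4.0]
dropped = dropout(activations, 0.5, random.Random(0))
```

At inference time the layer is normally disabled, which corresponds to calling `dropout` with a rate of 0.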
The Flatten layer converts a multi-dimensional vector into a one-dimensional vector without affecting the batch size.
The Dense layer is the most commonly used fully connected layer; its purpose is to apply a nonlinear transformation to the output of the previous layer.
The Softmax layer finally maps multiple scalars to a probability distribution, so that each output lies in (0, 1). The output of the Softmax function completes the detection performed by the whole module.
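The mapping performed by the Softmax layer can be written out directly. The two-class score vector and the label order below are hypothetical; subtracting the maximum score is a standard numerical-stability step and is not taken from the application.

```python
import math


def softmax(scores):
    """Map raw scores to probabilities that are each in (0, 1) and sum to 1."""
    top = max(scores)                              # subtract the max for stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical Dense-layer outputs for the classes (non-malicious, malicious).
class_names = ["non-malicious", "malicious"]
probabilities = softmax([2.0, 0.5])
predicted = class_names[probabilities.index(max(probabilities))]
```

Taking the class with the highest probability gives the final malicious or non-malicious decision for the domain name.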
In S105, domain names are classified and detected according to the entity classification model.
Using the entity classification model trained in S104, the knowledge graph vector corresponding to the domain name to be detected is input into the entity classification model for domain name classification detection. The final output is malicious or non-malicious, that is, the domain name to be detected is either a malicious domain name or a non-malicious domain name.
The present application provides a knowledge-graph-based entity classification method. A DNS knowledge graph is established based on the hierarchical relationship and the query resolution relationship of DNS, wherein the DNS knowledge graph includes the preset label corresponding to at least one domain name and the preset labels include malicious and non-malicious, so that each domain name corresponds to a preset label. The DNS knowledge graph is then split into entities and relations, and the entities and relations are fused by entity attribute alignment to obtain the fused DNS knowledge graph. Next, the entities and relations of the fused DNS knowledge graph are vectorized to obtain the DNS knowledge graph vector, which includes the knowledge graph vector corresponding to each domain name. The knowledge graph vector corresponding to the domain name is taken as the input and the preset label corresponding to the domain name as the output to train a neural network model and obtain the entity classification model, and domain names are classified and detected according to the entity classification model. In this way, domain name classification detection becomes more accurate and faster.
It should be understood that the sequence numbers of the steps in the above embodiments do not imply an order of execution; the order of execution of each process is determined by its function and internal logic, and the numbering does not limit the implementation of the embodiments of the present application in any way.
The following are system embodiments of the present application. For details not described exhaustively here, refer to the corresponding method embodiments above.
Fig. 4 shows a schematic structural diagram of the knowledge-graph-based entity classification system provided by the embodiment of the present application. For ease of description, only the parts related to the embodiment of the present application are shown, as detailed below:
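As shown in Fig. 4, the knowledge-graph-based entity classification system 4 includes: an establishment module 41, a fusion module 42, a vectorization module 43, a training module 44 and a detection module 45;
the establishment module 41 is configured to establish a DNS knowledge graph based on the hierarchical relationship and the query resolution relationship of DNS, wherein the DNS knowledge graph includes the preset label corresponding to at least one domain name, and the preset labels include malicious and non-malicious;
the fusion module 42 is configured to split the DNS knowledge graph into entities and relations, and to fuse the entities and relations by entity attribute alignment to obtain the fused DNS knowledge graph;
the vectorization module 43 is configured to vectorize the entities and relations of the fused DNS knowledge graph to obtain the DNS knowledge graph vector, wherein the DNS knowledge graph vector includes the knowledge graph vector corresponding to each domain name;
the training module 44 is configured to take the knowledge graph vector corresponding to the domain name as the input and the preset label corresponding to the domain name as the output, and to train a neural network model to obtain the entity classification model;
the detection module 45 is configured to classify and detect domain names according to the entity classification model.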
The present application provides a knowledge-graph-based entity classification system. A DNS knowledge graph is established based on the hierarchical relationship and the query resolution relationship of DNS, wherein the DNS knowledge graph includes the preset label corresponding to at least one domain name and the preset labels include malicious and non-malicious, so that each domain name corresponds to a preset label. The DNS knowledge graph is then split into entities and relations, and the entities and relations are fused by entity attribute alignment to obtain the fused DNS knowledge graph. Next, the entities and relations of the fused DNS knowledge graph are vectorized to obtain the DNS knowledge graph vector, which includes the knowledge graph vector corresponding to each domain name. The knowledge graph vector corresponding to the domain name is taken as the input and the preset label corresponding to the domain name as the output to train a neural network model and obtain the entity classification model, and domain names are classified and detected according to the entity classification model. In this way, domain name classification detection becomes more accurate and faster.
In a possible implementation, the establishing module 41 may specifically be configured to:
build a DNS domain name hierarchy graph based on the hierarchical relationships of DNS, and add preset labels to the DNS domain name hierarchy graph;
build a DNS query-response graph and a passive DNS graph based on the query-resolution relationships of DNS;
combine the DNS query-response graph with the passive DNS graph to build a DNS flow graph, and add preset labels to the DNS flow graph;
combine the DNS domain name hierarchy graph and the DNS flow graph by rule alignment to build the DNS knowledge graph.
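As a rough illustration of the graph construction described above, the following Python sketch derives hierarchy edges from a domain name and flow edges from one query/response observation, then merges both into a single edge set. The relation names (`subdomain_of`, `queries`, `resolves_to`), the observation layout, and the use of name normalization as a stand-in for "rule alignment" are all assumptions made for this example; the application does not specify them, and preset labels are omitted for brevity.

```python
from typing import Iterable, List, Set, Tuple

Edge = Tuple[str, str, str]  # (head, relation, tail); relation names are hypothetical


def normalize(name: str) -> str:
    """Lower-case a domain name and drop a trailing dot (assumed alignment rule)."""
    return name.strip(".").lower()


def hierarchy_edges(domain: str) -> List[Edge]:
    """Link each level of a domain name to its parent level."""
    labels = normalize(domain).split(".")
    edges = []
    for i in range(len(labels) - 1):
        child = ".".join(labels[i:])
        parent = ".".join(labels[i + 1:])
        edges.append((child, "subdomain_of", parent))
    return edges


def flow_edges(client_ip: str, qname: str, answer_ip: str) -> List[Edge]:
    """Edges for one observed query and its response (hypothetical record layout)."""
    q = normalize(qname)
    return [(client_ip, "queries", q), (q, "resolves_to", answer_ip)]


def build_graph(domains: Iterable[str],
                observations: Iterable[Tuple[str, str, str]]) -> Set[Edge]:
    """Merge hierarchy and flow edges; shared normalized names join the two graphs."""
    graph: Set[Edge] = set()
    for d in domains:
        graph.update(hierarchy_edges(d))
    for client_ip, qname, answer_ip in observations:
        graph.update(flow_edges(client_ip, qname, answer_ip))
    return graph


graph = build_graph(
    ["www.example.com"],
    [("192.0.2.1", "WWW.Example.com.", "198.51.100.7")],
)
```

Because both edge builders normalize names the same way, the node `www.example.com` appears once and connects the hierarchy part of the graph to the flow part.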
In a possible implementation, the entities may include head entities and tail entities.
In a possible implementation, the fusion module 42 may specifically be configured to:
using a triple representation, take the client IP address in the DNS knowledge graph as the head entity, the Qname attribute in the DNS knowledge graph as the relation, and the domain name in the DNS knowledge graph as the tail entity;
fuse the head entities, relations and tail entities by entity-attribute alignment to obtain the fused DNS knowledge graph.
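The triple construction above can be sketched in Python as follows. The record field names (`client_ip`, `domain`), the literal relation label `"Qname"`, and the interpretation of entity-attribute alignment as canonicalizing names and merging duplicates are assumptions for illustration only; the application does not define the alignment procedure in this section.

```python
from typing import Dict, Iterable, List, NamedTuple


class Triple(NamedTuple):
    head: str      # client IP address
    relation: str  # Qname attribute (represented here by a fixed label)
    tail: str      # domain name


def canonical(name: str) -> str:
    """Assumed alignment rule: trim, drop a trailing dot, lower-case."""
    return name.strip().rstrip(".").lower()


def to_triple(record: Dict[str, str]) -> Triple:
    """Build a (head, relation, tail) triple from one hypothetical DNS record."""
    return Triple(record["client_ip"], "Qname", record["domain"])


def fuse(triples: Iterable[Triple]) -> List[Triple]:
    """Merge triples whose canonical head and tail coincide, keeping first-seen order."""
    seen = set()
    fused = []
    for t in triples:
        aligned = Triple(canonical(t.head), t.relation, canonical(t.tail))
        if aligned not in seen:
            seen.add(aligned)
            fused.append(aligned)
    return fused


records = [
    {"client_ip": "192.0.2.1", "domain": "Example.COM."},
    {"client_ip": "192.0.2.1", "domain": "example.com"},
    {"client_ip": "192.0.2.2", "domain": "example.com"},
]
fused = fuse(to_triple(r) for r in records)
```

In this toy input, the first two records refer to the same client and domain under different spellings, so they collapse into one triple after alignment.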
In a possible implementation, the DNS knowledge graph includes triples corresponding to at least one class of domain names, and the vectorization module 43 may specifically be configured to:
for each class of domain names, arbitrarily select the head entity of one triple in that class as the starting node, and calculate the entity distance between the tail entity of the current triple and the head entity of a first triple;
determine whether the entity distance between the tail entity of the current triple and the head entity of the first triple is not greater than a preset entity distance;
if the entity distance between the tail entity of the current triple and the head entity of the first triple is not greater than the preset entity distance, link the head entity of the first triple to the tail entity of the current triple, and take the first triple as the current triple;
repeat the current step until the head entity of the current triple is the starting node, thereby obtaining the entities and relations of the domain name;
vectorize the entities and relations corresponding to each domain name to obtain the knowledge graph vector corresponding to each domain name;
where the first triple is a triple, among the triples corresponding to that class of domain names, that has not yet been linked.
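The linking loop above might be sketched as below. The application does not say how the entity distance is computed, so this example assumes each entity has a hypothetical 2-D position and uses Euclidean distance; the choice of the first list element as the "arbitrary" starting triple and the early exit when no unlinked triple is close enough are also simplifications (the source handles the latter case with negative sampling, which is not covered here).

```python
import math
from typing import Callable, Dict, List, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def link_triples(triples: List[Triple],
                 distance: Callable[[str, str], float],
                 max_distance: float) -> List[Triple]:
    """Chain triples while the next head lies within max_distance of the current tail."""
    if not triples:
        return []
    current = triples[0]        # arbitrary starting triple
    start_node = current[0]     # its head entity is the starting node
    chain = [current]
    unlinked = list(triples[1:])
    while unlinked:
        candidate = next(
            (t for t in unlinked if distance(current[2], t[0]) <= max_distance),
            None,
        )
        if candidate is None:
            break               # no unlinked triple is close enough
        unlinked.remove(candidate)
        chain.append(candidate)
        current = candidate
        if current[0] == start_node:
            break               # the chain has returned to the starting node
    return chain


# Hypothetical entity positions used only to make the distance computable.
positions: Dict[str, Tuple[float, float]] = {
    "a": (0.0, 0.0), "b": (1.0, 0.0), "c": (2.0, 0.0),
    "y": (11.0, 10.0), "z": (10.0, 10.0),
}


def euclid(u: str, v: str) -> float:
    return math.dist(positions[u], positions[v])


chain = link_triples(
    [("a", "q", "b"), ("b", "q", "c"), ("z", "q", "y")],
    euclid,
    max_distance=1.0,
)
```

Here the second triple is linked because its head coincides with the tail of the first, while the third triple stays unlinked because its head is far from entity `c`.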
In a possible implementation, the vectorization module 43 may further be configured to perform the following steps:
Step 1: if the entity distance between the tail entity of the current triple and the head entity of the first triple is greater than the preset entity distance, perform Step 2;
Step 2: replace the head entity of the first triple with the head entity of a second triple, replace the tail entity of the first triple with the tail entity of the second triple, and take the first triple after replacement as a negative-sample triple, where the second triple is any already-linked triple in that class of domain names;
Step 3: determine whether the entity distance between the tail entity of the current triple and the head entity of the negative-sample triple is not greater than the preset entity distance;
Step 4: if the entity distance between the tail entity of the current triple and the head entity of the negative-sample triple is not greater than the preset entity distance, link the head entity of the negative-sample triple to the tail entity of the current triple, and take the negative-sample triple as the current triple;
Step 5: if the entity distance between the tail entity of the current triple and the head entity of the negative-sample triple is greater than the preset entity distance, return to Step 2 and repeat Steps 2 to 5 until the entity distance between the tail entity of the current triple and the head entity of the negative-sample triple satisfies the preset entity distance.
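Steps 2 to 5 can be illustrated with the short Python sketch below. It assumes a hypothetical 1-D position per entity so that a distance can be computed, and it bounds the retry loop by iterating once over the already-linked triples and returning `None` if none qualifies; the source text instead repeats until the distance condition is met and does not say what happens if it never is.

```python
from typing import Callable, Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (head, relation, tail)


def negative_sample(first: Triple, second: Triple) -> Triple:
    """Step 2: keep the first triple's relation, take head and tail from the second."""
    return (second[0], first[1], second[2])


def pick_negative(current: Triple,
                  first: Triple,
                  linked: List[Triple],
                  distance: Callable[[str, str], float],
                  max_distance: float) -> Optional[Triple]:
    """Steps 3 to 5: try linked triples in turn until a negative sample is close enough."""
    for second in linked:
        candidate = negative_sample(first, second)
        if distance(current[2], candidate[0]) <= max_distance:
            return candidate    # Step 4: this becomes the new current triple
    return None                 # assumption: give up once all linked triples are tried


# Hypothetical 1-D entity positions, for illustration only.
positions: Dict[str, float] = {"a": 0.0, "b": 1.0, "c": 2.0, "x": 9.0, "y": 9.5}


def gap(u: str, v: str) -> float:
    return abs(positions[u] - positions[v])


linked = [("a", "q", "b"), ("b", "q", "c")]
current = ("b", "q", "c")
first = ("x", "r", "y")          # its head is too far from the current tail
result = pick_negative(current, first, linked, gap, max_distance=1.0)
```

In this toy case the first candidate, built from `("a", "q", "b")`, is rejected because `a` is two units from `c`, and the second candidate `("b", "r", "c")` is accepted.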
In a possible implementation, the vectorization module 43 may further be configured to:
vectorize the entities and relations of the fused DNS knowledge graph using the TransE algorithm to obtain the DNS knowledge graph vectors.
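For readers unfamiliar with TransE, the sketch below shows its core idea in plain Python: a triple (h, r, t) is considered plausible when the embedding of the head plus the embedding of the relation lies close to the embedding of the tail, and training pushes positive triples to score lower than corrupted ones by a margin. The L2 norm, the margin value, and the toy 2-D vectors are illustrative choices, not details taken from the application, and the gradient update step is omitted.

```python
import math
from typing import Sequence, Tuple

Vector = Sequence[float]
Embedded = Tuple[Vector, Vector, Vector]  # (head, relation, tail) embeddings


def transe_score(h: Vector, r: Vector, t: Vector) -> float:
    """L2 distance between h + r and t; smaller means more plausible."""
    return math.sqrt(sum((hi + ri - ti) ** 2 for hi, ri, ti in zip(h, r, t)))


def margin_loss(positive: Embedded, negative: Embedded, margin: float = 1.0) -> float:
    """Hinge loss: zero once the negative scores worse than the positive by the margin."""
    return max(0.0, margin + transe_score(*positive) - transe_score(*negative))


head = [0.0, 0.0]
relation = [1.0, 0.0]
true_tail = [1.0, 0.0]     # head + relation lands exactly on this tail
wrong_tail = [4.0, 4.0]    # a corrupted tail far from head + relation

pos_score = transe_score(head, relation, true_tail)
neg_score = transe_score(head, relation, wrong_tail)
loss = margin_loss((head, relation, true_tail), (head, relation, wrong_tail))
```

With these toy vectors the positive triple scores 0, the corrupted triple scores 5, and the loss is already 0 because the gap exceeds the margin.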
In a possible implementation, the training module 44 may specifically be configured to:
train a BiLSTM neural network with the knowledge graph vector corresponding to a domain name as the input and the preset label corresponding to that domain name as the output, and use the trained BiLSTM neural network model as the entity classification model.
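FIG. 5 is a schematic diagram of a terminal provided by an embodiment of the present application. As shown in FIG. 5, the terminal 5 of this embodiment includes a processor 50, a memory 51, and a computer program 52 stored in the memory 51 and executable on the processor 50. When executing the computer program 52, the processor 50 implements the steps of the foregoing embodiments of the knowledge-graph-based entity classification method, for example S101 to S105 shown in FIG. 1. Alternatively, when executing the computer program 52, the processor 50 implements the functions of the modules/units in the foregoing system embodiments, for example the functions of modules 41 to 45 shown in FIG. 4.
By way of example, the computer program 52 may be divided into one or more modules/units, which are stored in the memory 51 and executed by the processor 50 to carry out the present application. The one or more modules/units may be a series of computer program instruction segments capable of performing specific functions, and the instruction segments describe the execution of the computer program 52 in the terminal 5. For example, the computer program 52 may be divided into the modules 41 to 45 shown in FIG. 4.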
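The terminal 5 may be a computing device such as a desktop computer, a notebook computer, a palmtop computer, or a cloud server. The terminal 5 may include, but is not limited to, the processor 50 and the memory 51. Those skilled in the art will understand that FIG. 5 is merely an example of the terminal 5 and does not limit it; the terminal may include more or fewer components than shown, combine certain components, or use different components. For example, the terminal may further include input/output devices, network access devices, buses, and the like.
The processor 50 may be a central processing unit (CPU), or another general-purpose processor, a digital signal processor (DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, a discrete hardware component, or the like. The general-purpose processor may be a microprocessor or any conventional processor.
The memory 51 may be an internal storage unit of the terminal 5, such as a hard disk or internal memory of the terminal 5. The memory 51 may also be an external storage device of the terminal 5, such as a plug-in hard disk, a Smart Media Card (SMC), a Secure Digital (SD) card, or a flash card fitted to the terminal 5. Further, the memory 51 may include both an internal storage unit and an external storage device of the terminal 5. The memory 51 is used to store the computer program and other programs and data required by the terminal. The memory 51 may also be used to temporarily store data that has been output or is about to be output.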
Those skilled in the art will clearly understand that, for convenience and brevity of description, the division into the functional units and modules described above is given only as an example. In practical applications, the above functions may be assigned to different functional units or modules as needed; that is, the internal structure of the apparatus may be divided into different functional units or modules to perform all or part of the functions described above. The functional units and modules in the embodiments may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit; the integrated unit may be implemented in the form of hardware or in the form of a software functional unit. In addition, the specific names of the functional units and modules serve only to distinguish them from one another and are not intended to limit the scope of protection of the present application. For the specific working processes of the units and modules in the above system, reference may be made to the corresponding processes in the foregoing method embodiments, which are not repeated here.
In the above embodiments, the description of each embodiment has its own emphasis. For parts that are not detailed or recorded in one embodiment, refer to the relevant descriptions of other embodiments.
Those of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled practitioners may use different methods to implement the described functions for each specific application, but such implementations should not be considered to go beyond the scope of the present application.
In the embodiments provided in the present application, it should be understood that the disclosed apparatus/terminal and method may be implemented in other ways. For example, the apparatus/terminal embodiments described above are merely illustrative. The division into modules or units is only a division by logical function, and other divisions are possible in an actual implementation; for instance, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Furthermore, the mutual coupling, direct coupling, or communication connection shown or discussed may be an indirect coupling or communication connection through some interfaces, apparatuses, or units, and may be electrical, mechanical, or of another form.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed over multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of the present application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit. The integrated unit may be implemented in the form of hardware or in the form of a software functional unit.
If the integrated module/unit is implemented in the form of a software functional unit and sold or used as an independent product, it may be stored in a computer-readable storage medium. Based on this understanding, all or part of the processes in the methods of the above embodiments may also be completed by a computer program instructing the relevant hardware. The computer program may be stored in a computer-readable storage medium and, when executed by a processor, may implement the steps of the foregoing embodiments of the knowledge-graph-based entity classification method. The computer program includes computer program code, which may be in source code form, object code form, an executable file, or some intermediate form. The computer-readable medium may include any entity or apparatus capable of carrying the computer program code, a recording medium, a USB flash drive, a removable hard disk, a magnetic disk, an optical disc, a computer memory, a read-only memory (ROM), a random access memory (RAM), an electrical carrier signal, a telecommunication signal, a software distribution medium, and the like. It should be noted that the content covered by the computer-readable medium may be appropriately expanded or reduced according to the requirements of legislation and patent practice in a given jurisdiction. For example, in some jurisdictions, according to legislation and patent practice, computer-readable media do not include electrical carrier signals and telecommunication signals.
The embodiments described above are intended only to illustrate the technical solutions of the present application, not to limit them. Although the present application has been described in detail with reference to the foregoing embodiments, those of ordinary skill in the art should understand that the technical solutions recorded in the foregoing embodiments may still be modified, or some of their technical features may be replaced by equivalents; such modifications or replacements do not cause the essence of the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present application, and shall all fall within the scope of protection of the present application.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202211160630.6A CN115391568A (en) | 2022-09-22 | 2022-09-22 | Entity classification method, system, terminal and storage medium based on knowledge graph |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN115391568A (en) | 2022-11-25 |
Family
ID=84125819
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202211160630.6A Pending CN115391568A (en) | 2022-09-22 | 2022-09-22 | Entity classification method, system, terminal and storage medium based on knowledge graph |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN115391568A (en) |
Citations (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN107071084A (en) * | 2017-04-01 | 2017-08-18 | 北京神州绿盟信息安全科技股份有限公司 | A kind of DNS evaluation method and device |
| CN110290116A (en) * | 2019-06-04 | 2019-09-27 | 中山大学 | A Malicious Domain Name Detection Method Based on Knowledge Graph |
| CN114328962A (en) * | 2021-12-29 | 2022-04-12 | 北京信息科技大学 | A method for identifying abnormal behavior of web logs based on knowledge graph |
Non-Patent Citations (1)
| Title |
|---|
| 张奕等: "基于知识图谱的恶意域名检测方法", 通信技术, vol. 53, no. 1, 10 January 2020 (2020-01-10), pages 168 - 173 * |
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN115659985A (en) * | 2022-12-09 | 2023-01-31 | 南方电网数字电网研究院有限公司 | Electric power knowledge graph entity alignment method and device and computer equipment |
| CN115659985B (en) * | 2022-12-09 | 2023-03-31 | 南方电网数字电网研究院有限公司 | Electric power knowledge graph entity alignment method and device and computer equipment |
| WO2025081580A1 (en) * | 2023-10-17 | 2025-04-24 | 中国互联网络信息中心 | Domain name detection method and system based on graph neural network model, and storage medium |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||