CN113746738A

CN113746738A - Data forwarding method, device and related equipment

Info

Publication number: CN113746738A
Application number: CN202010478343.4A
Authority: CN
Inventors: 杨昌; 李飞
Original assignee: Huawei Technologies Co Ltd
Current assignee: Huawei Technologies Co Ltd
Priority date: 2020-05-29
Filing date: 2020-05-29
Publication date: 2021-12-03

Abstract

The present application provides a data forwarding method for identifying the application to which data in a network belongs and forwarding the data according to the forwarding rule corresponding to that application. The method includes: a network device receives a message; if the network device obtains the target application corresponding to the message according to the destination address of the message, it forwards the message according to the forwarding rule corresponding to the target application; if the network device cannot obtain the target application corresponding to the message according to the destination address, it obtains the domain name corresponding to the destination address of the message, obtains the target application corresponding to the message according to that domain name, and then forwards the message according to the forwarding rule corresponding to the target application. When the application corresponding to the message cannot be obtained from the destination address of the message, the application to which the data belongs is identified by the domain name. Identifying the application in this way generalizes better and is more stable, does not require the data packet to be unpacked and analyzed, and is faster.

Description

Data forwarding method, device and related equipment

Technical Field

The present application relates to the field of computer technologies, and in particular, to a data forwarding method and apparatus, and a related device.

Background

With the rapid development of computer network technology and network applications, network users have higher and higher requirements on the speed and quality of network connection. Application identification is the key to application visualization and realization of high-quality network service bearer. The application identification technology is deployed in a network and can be used for analyzing the application flow of enterprises and individuals, assisting in identifying the application quality, and further performing intelligent routing and QoS guarantee on different applications.

Deep Packet Inspection (DPI) is a network packet filtering technology used to identify the application or application type to which a packet belongs, so that the packet can then be processed according to that application or application type. DPI reads the content of the packet payload to obtain application-layer data, checks that data against an existing feature library, and identifies the specific application to which the packet belongs by matching multi-level rules, thereby identifying protocols, viruses, junk mail and the like that do not conform to a specification. It can also perform intelligent routing and QoS guarantee on packets of different applications by using predetermined rules. However, when DPI is used to identify the application to which a packet belongs, the packet needs to be unpacked and analyzed, so the identification efficiency is low. In addition, application software is continuously being developed and the characteristic information of application data keeps changing. Whenever the characteristics change, the features in the feature library must change as well, so the feature library needs to be continuously updated and maintained as applications are updated, which increases the difficulty of identifying data based on DPI.

Disclosure of Invention

The embodiment of the application discloses a data forwarding method, a data forwarding device and related equipment.

In a first aspect, an embodiment of the present application provides a data forwarding method, where the method includes:

the network equipment receives a message, wherein the message comprises a destination address;

if the network equipment acquires the target application corresponding to the message according to the destination address, forwarding the message according to a forwarding rule corresponding to the target application;

if the network equipment cannot obtain the target application corresponding to the message according to the destination address, acquiring a domain name corresponding to the destination address, acquiring a target application corresponding to the message according to the domain name corresponding to the destination address, and forwarding the message according to the forwarding rule corresponding to the target application.

After the network equipment acquires the message, it identifies the application to which the message belongs according to the correspondence between the destination address of the message and the application. The payload of the message does not need to be acquired, which improves identification efficiency. When the application to which the message belongs cannot be identified according to the correspondence between the destination address and the application, the domain name to be identified corresponding to the destination address is obtained, the application to which the message belongs is identified through that domain name, and the message is forwarded according to the application to which it belongs.
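The two-stage lookup described in the first aspect can be sketched as follows. This is a minimal illustration, not the implementation of the application: the table contents, the rule names and the `identify_by_domain` stand-in are all hypothetical, and the behavior for a message that neither stage can identify is an assumption, since the text does not specify it.

```python
from typing import Dict, Optional

# Hypothetical tables, for illustration only.
ADDRESS_DB: Dict[str, str] = {"203.0.113.10": "app_video"}        # address -> application
DOMAIN_DB: Dict[str, str] = {"198.51.100.7": "mail.example.com"}  # address -> domain name
FORWARDING_RULES: Dict[str, str] = {
    "app_video": "low_latency_link",
    "app_mail": "default_link",
}


def identify_by_domain(domain: str) -> Optional[str]:
    """Hypothetical stand-in for the domain-name-based identification step."""
    return "app_mail" if domain.endswith("example.com") else None


def forward(dst_addr: str) -> Optional[str]:
    """Return the forwarding rule selected for a message with this destination address."""
    app = ADDRESS_DB.get(dst_addr)          # stage 1: destination address -> application
    if app is None:
        domain = DOMAIN_DB.get(dst_addr)    # stage 2: destination address -> domain name
        if domain is not None:
            app = identify_by_domain(domain)
    if app is None:
        return None                         # assumption: unidentified traffic gets no rule
    return FORWARDING_RULES.get(app)
```

Note that neither stage reads the message payload; only the destination address and the tables are consulted.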

In a specific implementation manner, before the network device obtains the target application corresponding to the packet according to the destination address, and forwards the packet according to a forwarding rule corresponding to the target application, the method further includes:

the network device identifies a target application corresponding to the destination address according to the destination address and an address database, wherein the address database comprises a corresponding relation between a plurality of addresses and a plurality of applications, each application corresponds to one or a plurality of addresses, and one address corresponds to one application.

The network equipment stores corresponding relations between a plurality of addresses and a plurality of applications, if the address database comprises the target application corresponding to the destination address of the message, the target application corresponding to the message is rapidly identified according to the address database, the identification is simple, the identification efficiency is higher, and the identification is more accurate.

In a specific implementation manner, the obtaining of the domain name corresponding to the destination address includes: the network device obtains a domain name corresponding to a destination address of the packet from a domain name database, where the domain name database includes a correspondence between a plurality of domain names and a plurality of addresses, where one domain name corresponds to one address, one address corresponds to one or more domain names, and the correspondence between the plurality of domain names and the plurality of addresses is obtained through a domain name system DNS server.
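One way to picture the domain name database is as a reverse index built from (domain name, address) pairs. The sketch below assumes such pairs have already been collected from a DNS server; how they are collected is outside the sketch, and the function name and the "latest answer wins" policy are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Tuple


def build_domain_db(dns_answers: Iterable[Tuple[str, str]]) -> Dict[str, List[str]]:
    """Map each address to the domain names that resolved to it."""
    forward_map: Dict[str, str] = {}
    for domain, address in dns_answers:
        # One domain name corresponds to one address (assumption: latest answer wins).
        forward_map[domain] = address
    reverse: Dict[str, List[str]] = defaultdict(list)
    for domain, address in forward_map.items():
        # One address corresponds to one or more domain names.
        reverse[address].append(domain)
    return dict(reverse)
```

Looking up a destination address in the returned dictionary yields the candidate domain names for that address.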

In a specific implementation manner, the obtaining of the target application corresponding to the packet according to the domain name corresponding to the destination address includes: acquiring the target application corresponding to the message according to the domain name corresponding to the destination address and the target domain name identification model.

The network equipment acquires the destination address of the message, acquires, according to the destination address, the domain name to be recognized that the domain name system returned for that address, inputs the domain name to be recognized into a trained target domain name recognition model, and recognizes the target application to which the domain name to be recognized belongs, so as to determine the data to be recognized that belongs to the target application. The target domain name recognition model is a machine learning model trained on a data set comprising a plurality of domain names and the application identifiers of the applications to which those domain names belong.

A domain name recognition model is obtained by training with a machine learning method. The application to which the domain name to be recognized belongs is recognized according to the domain name and the model, which in turn determines the application to which the data accessing that domain name belongs. This avoids the problem that arises when deep packet inspection is used to identify the application to which data belongs, where the identification result becomes inaccurate because the characteristic information of application data changes continuously. Identifying the application through the domain name therefore generalizes better and is more stable, does not require the data packet to be unpacked and analyzed, and is more efficient.

In a specific implementation manner, before the obtaining, according to the domain name and the target domain name recognition model, the target application corresponding to the packet, the method further includes:

acquiring a plurality of initial domain name data sets, wherein each initial domain name data set in the plurality of initial domain name data sets comprises a plurality of domain names accessed by an application and the traffic corresponding to each domain name;

clustering a plurality of domain names in a target initial domain name data set to obtain a plurality of domain name subsets corresponding to the target initial domain name data set, wherein each domain name subset in the plurality of domain name subsets comprises one or more domain names, and the target initial domain name data set is any one of the plurality of initial domain name data sets;

acquiring a first traffic corresponding to a target domain name subset and a second traffic corresponding to the target initial domain name data set, wherein the first traffic comprises a sum of the traffic corresponding to one or more domain names in the target domain name subset, the second traffic comprises a sum of the traffic corresponding to a plurality of domain names in the target initial domain name data set, and the target domain name subset is any one of the domain name subsets;

determining a ratio of the first traffic to the second traffic, and when the ratio is greater than or equal to a first threshold, adding the domain name in the target domain name subset to a target data set, where the target data set includes domain names corresponding to a plurality of applications and application identifiers of the applications corresponding to each domain name, each domain name in the target data set corresponds to one application identifier, and each application in the plurality of applications corresponds to one or more domain names;

and training an initial domain name recognition model according to the target data set to obtain the target domain name recognition model.

After an initial domain name data set composed of the domain names accessed by an application is obtained, those domain names are clustered to obtain a plurality of domain name subsets. Based on the clustering result, the first traffic generated when the application accesses the domain names in each domain name subset and the second traffic generated when the application accesses all of its domain names are calculated. The domain names in each subset whose ratio of first traffic to second traffic is greater than or equal to a preset value are taken as the domain names mainly accessed by the application, so that the main domain names corresponding to the application are extracted. This improves the identification accuracy of the target domain name recognition model trained on the target data set, reduces the amount of data used to train the initial domain name recognition model, and improves training efficiency.
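The traffic-ratio selection can be sketched as below for a single application. The text does not name a clustering algorithm, so `cluster_by_suffix` is a deliberately simple stand-in (grouping by the last two labels of the domain name) and is not the clustering of the application; the function names and the example threshold are likewise hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def cluster_by_suffix(domains: List[str]) -> List[List[str]]:
    """Stand-in clustering (assumption): group domain names by their last two labels."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for d in domains:
        groups[".".join(d.split(".")[-2:])].append(d)
    return list(groups.values())


def main_domains(traffic: Dict[str, float], first_threshold: float) -> List[str]:
    """Keep the domain names of every subset whose traffic share meets the threshold.

    traffic maps each domain name accessed by one application to its traffic.
    """
    second_traffic = sum(traffic.values())      # traffic of the whole initial data set
    if second_traffic == 0:
        return []
    kept: List[str] = []
    for subset in cluster_by_suffix(list(traffic)):
        first_traffic = sum(traffic[d] for d in subset)   # traffic of this subset
        if first_traffic / second_traffic >= first_threshold:
            kept.extend(subset)
    return kept
```

The domain names returned for each application, paired with that application's identifier, would then form the entries of the target data set.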

In a specific implementation manner, before clustering the plurality of domain names in the target initial domain name data set to obtain a plurality of domain name subsets corresponding to the target initial domain name data set, the method further includes:

determining the frequency of occurrence of a target domain name in the plurality of initial domain name data sets, and deleting the target domain name from the plurality of initial domain name data sets when the frequency is greater than or equal to a second threshold, wherein the target domain name is any domain name in the plurality of initial domain name data sets.

A domain name that occurs with high frequency across the plurality of initial domain name data sets is determined to be an interference domain name and is deleted from the plurality of initial domain name data sets. This improves the identification accuracy of the target domain name recognition model trained on the target data set, reduces the amount of data used to train the initial domain name recognition model, and improves training efficiency.
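The interference filter can be sketched as follows. The sketch assumes that "frequency of occurrence" means the number of initial domain name data sets in which a domain name appears; the text does not define it more precisely, and the function name and data layout are hypothetical.

```python
from collections import Counter
from typing import Dict, Set


def remove_interference(
    datasets: Dict[str, Set[str]], second_threshold: int
) -> Dict[str, Set[str]]:
    """Drop domain names that appear in at least second_threshold data sets.

    datasets maps an application identifier to the set of domain names it accessed.
    """
    counts = Counter(d for domains in datasets.values() for d in domains)
    interference = {d for d, n in counts.items() if n >= second_threshold}
    return {app: domains - interference for app, domains in datasets.items()}
```

A domain name shared by many applications (for example a common statistics or advertising host) says little about which application generated the traffic, which is why removing it can help the later training step.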

In a specific implementation, the method further includes:

extracting the domain names in the target data set that share a first-level domain name, and dividing the target data set into a first target data set and a second target data set, wherein the first target data set contains no domain name that shares a first-level domain name with a first domain name in the first target data set, and the second target data set contains a domain name that shares a first-level domain name with a second domain name in the second target data set;

the training of the initial domain name recognition model according to the target data set to obtain the target domain name recognition model comprises the following steps:

training a first-level domain name recognition model according to the first target data set to obtain a trained first-level domain name recognition model;

and training the multi-stage domain name recognition model according to the second target data set to obtain the trained multi-stage domain name recognition model.

The domain names in the target data set are divided into a first target data set and a second target data set. The first-level domain name recognition model is trained with the first target data set, and the multi-level domain name recognition model is trained with the second target data set. Compared with training a single initial domain name recognition model on the whole target data set, each model is trained on less data, trains faster, and is smaller, so both training and recognition efficiency can be improved.
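The split into two target data sets can be sketched as below. The sketch assumes that the "first-level domain name" is the last two labels of a domain name; the text does not define the term, so `first_level` is a hypothetical helper, as are the other names.

```python
from collections import Counter
from typing import List, Tuple

Sample = Tuple[str, str]  # (domain name, application identifier)


def first_level(domain: str) -> str:
    """Assumed definition: the last two labels, e.g. 'shared.example'."""
    return ".".join(domain.split(".")[-2:])


def split_dataset(target: List[Sample]) -> Tuple[List[Sample], List[Sample]]:
    """Split samples by whether their first-level domain name is shared.

    Returns (first, second): 'first' holds samples whose first-level domain
    name is unique in the data set, 'second' holds samples that share one.
    """
    counts = Counter(first_level(domain) for domain, _ in target)
    first = [s for s in target if counts[first_level(s[0])] == 1]
    second = [s for s in target if counts[first_level(s[0])] > 1]
    return first, second
```

Under this reading, the first data set can be separated by the first-level domain name alone, while the second needs the remaining labels to tell applications apart.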

In a specific implementation manner, the obtaining a target application corresponding to the packet according to the domain name corresponding to the destination address and a target domain name recognition model includes: extracting domain name keywords of the domain name corresponding to the destination address, inputting the domain name corresponding to the destination address into the trained primary domain name recognition model under the condition that keywords with the similarity smaller than a third threshold value with the domain name keywords are inquired in a domain name keyword set, and acquiring target application corresponding to the message according to the domain name corresponding to the destination address and the primary domain name recognition model, wherein the domain name keyword set comprises primary domain name information of different applications with the same primary domain name.

In a specific implementation manner, the obtaining a target application corresponding to the packet according to the domain name corresponding to the destination address and the target domain name recognition model includes: extracting domain name keywords of a domain name to be identified, inputting the domain name to be identified into a first-level domain name identification model under the condition that keywords with the similarity smaller than a third threshold value with the domain name keywords of the domain name to be identified are inquired in a domain name keyword set, and determining target application corresponding to the domain name to be identified according to the domain name to be identified and the first-level domain name identification model; and under the condition that keywords with the similarity greater than or equal to a third threshold value with the domain name keywords are inquired in the domain name keyword set, inputting the domain name to be identified into a multi-stage domain name identification model, and determining the target application corresponding to the domain name to be identified according to the domain name to be identified and the multi-stage domain name identification model, wherein the domain name keyword set comprises first-stage domain name information of different applications with the same first-stage domain name.
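The choice between the two models can be sketched as follows. Several pieces are assumptions rather than details from the text: the keyword is taken to be the second-to-last label of the domain name, similarity is measured with `difflib.SequenceMatcher` from the Python standard library as a stand-in metric, and the two models are passed in as plain callables. The sketch reads the two conditions as a partition: if any keyword in the set reaches the third threshold, the multi-level model is used; otherwise the first-level model is used.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable

Model = Callable[[str], str]  # domain name -> application identifier


def keyword(domain: str) -> str:
    """Assumed keyword extraction: the second-to-last label of the domain name."""
    labels = domain.split(".")
    return labels[-2] if len(labels) >= 2 else labels[0]


def similarity(a: str, b: str) -> float:
    """Stand-in similarity in [0, 1]; the text does not name a metric."""
    return SequenceMatcher(None, a, b).ratio()


def identify(
    domain: str,
    keyword_set: Iterable[str],
    third_threshold: float,
    first_level_model: Model,
    multi_level_model: Model,
) -> str:
    """Route the domain name to one of the two trained models."""
    kw = keyword(domain)
    shared = any(similarity(kw, k) >= third_threshold for k in keyword_set)
    model = multi_level_model if shared else first_level_model
    return model(domain)
```

With an empty keyword set no keyword can reach the threshold, so every domain name is sent to the first-level model.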

In a specific implementation manner, after obtaining the target application corresponding to the packet according to the domain name corresponding to the destination address, the method further includes: updating the address database according to the destination address and the target application.

After the corresponding relation between the domain name to be identified and the target application is identified through the target domain name identification model, the corresponding relation between the destination address and the target application can be obtained, the destination address and the target application are sent to the address database, and the equipment updates the address database according to the destination address and the target application. Therefore, when the data with the address as the destination address is received again, the application to which the data belongs can be identified only through the destination address.
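The update of the address database amounts to caching the result of the domain-name-based identification. A minimal sketch, assuming in-memory dictionaries for both databases and a callable for the recognition model (all names hypothetical):

```python
from typing import Callable, Dict, Optional


def identify_and_cache(
    dst_addr: str,
    address_db: Dict[str, str],
    domain_db: Dict[str, str],
    identify_by_domain: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Identify the application for dst_addr and remember the result."""
    app = address_db.get(dst_addr)
    if app is not None:
        return app                        # already known from the address alone
    domain = domain_db.get(dst_addr)
    if domain is None:
        return None
    app = identify_by_domain(domain)
    if app is not None:
        address_db[dst_addr] = app        # update the address database
    return app
```

After the first successful identification, later messages with the same destination address are resolved from the address database without invoking the model again.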

In a second aspect, an embodiment of the present application provides a message forwarding system, which includes a training device and a forwarding device, wherein,

the training device is used for training an initial domain name recognition model through a target data set to obtain a target domain name recognition model, and sending the target domain name recognition model to the forwarding device, wherein the target data set comprises a plurality of domain names and application identifications corresponding to the domain names, and each domain name corresponds to one application identification;

the forwarding device is configured to: receiving a message, wherein the message comprises a destination address;

when a target application corresponding to the message is obtained according to the destination address, forwarding the message according to a forwarding rule corresponding to the target application;

when the target application corresponding to the message cannot be acquired according to the destination address, acquiring a domain name corresponding to the destination address, acquiring a target application corresponding to the message according to the domain name and a target domain name identification model, and forwarding the message according to the forwarding rule corresponding to the target application.

In a specific implementation manner, the forwarding device is further configured to: identify the target application corresponding to the destination address according to the destination address and an address database, wherein the address database comprises corresponding relations between a plurality of addresses and a plurality of applications, each application corresponds to one or a plurality of addresses, and one address corresponds to one application.

In a specific implementation manner, the forwarding device is specifically configured to: acquire a domain name corresponding to the destination address from a domain name database, wherein the domain name database comprises a corresponding relation between a plurality of domain names and a plurality of addresses, one domain name corresponds to one address, one address corresponds to one or more domain names, and the corresponding relation between the plurality of domain names and the plurality of addresses is acquired through a Domain Name System (DNS) server.

In a specific implementation, the training apparatus is specifically configured to:

determining a first traffic corresponding to a target domain name subset and a second traffic corresponding to the target initial domain name dataset, where the first traffic includes a sum of the traffic corresponding to one or more domain names in the target domain name subset, the second traffic includes a sum of the traffic corresponding to a plurality of domain names in the target initial domain name dataset, and the target domain name subset is any one of the plurality of domain name subsets;

In a specific implementation, the training device is specifically configured to: determine the frequency of occurrence of a target domain name in the plurality of initial domain name data sets, and delete the target domain name from the plurality of initial domain name data sets when the frequency is greater than or equal to a second threshold, wherein the target domain name is any domain name in the plurality of initial domain name data sets.

training a multi-stage domain name recognition model according to the second target data set to obtain a trained multi-stage domain name recognition model;

and sending the trained first-level domain name recognition model and the trained multi-stage domain name recognition model to the forwarding device.

In a specific implementation manner, the forwarding device is specifically configured to: extracting domain name keywords of the domain name corresponding to the destination address, inputting the domain name corresponding to the destination address into the trained primary domain name recognition model under the condition that keywords with the similarity smaller than a third threshold value with the domain name keywords are inquired in a domain name keyword set, and acquiring target application corresponding to the message according to the domain name corresponding to the destination address and the primary domain name recognition model, wherein the domain name keyword set comprises primary domain name information of different applications with the same primary domain name.

In a specific implementation manner, the forwarding device is further configured to: under the condition that keywords with similarity greater than or equal to a third threshold value with the domain name keywords are inquired in the domain name keyword set, input the domain name corresponding to the destination address into the trained multistage domain name recognition model, and acquire the target application corresponding to the message according to the domain name corresponding to the destination address and the trained multistage domain name recognition model.

In a specific implementation manner, the forwarding device is further configured to: update the address database according to the destination address and the target application.

In a third aspect, an embodiment of the present application provides a data forwarding apparatus, including:

a receiving unit, configured to receive a message, where the message includes a destination address;

a processing unit, configured to: acquire a target application corresponding to the message according to the destination address; or, when the target application corresponding to the message cannot be acquired according to the destination address, acquire a domain name corresponding to the destination address and acquire the target application corresponding to the message according to the domain name;

and the sending unit is used for forwarding the message according to the forwarding rule corresponding to the target application.

In a specific implementation manner, the processing unit is specifically configured to: identify the target application corresponding to the destination address according to the destination address and an address database, wherein the address database comprises the corresponding relation between a plurality of addresses and a plurality of applications, each application corresponds to one or more addresses, and one address corresponds to one application.

In a specific implementation manner, the processing unit is specifically configured to: acquire the domain name corresponding to the destination address from a domain name database, wherein the domain name database comprises the corresponding relations between a plurality of domain names and a plurality of addresses, one domain name corresponds to one address, one address corresponds to one or a plurality of domain names, and the corresponding relations between the plurality of domain names and the plurality of addresses are obtained through a Domain Name System (DNS) server.

In a specific implementation manner, the processing unit is specifically configured to: acquire the target application corresponding to the message according to the domain name corresponding to the destination address and a target domain name identification model.

In a specific implementation manner, the apparatus further includes:

a training unit, configured to: acquire a plurality of initial domain name data sets, wherein each initial domain name data set in the plurality of initial domain name data sets comprises a plurality of domain names accessed by an application and the traffic corresponding to each domain name;

determining a ratio of the first traffic to the second traffic, and when the ratio is greater than or equal to a first threshold, adding the domain name in the target domain name subset to a target data set, where the target data set includes a plurality of domain names and application identifiers of applications corresponding to each domain name, each domain name in the target data set corresponds to one application identifier, and each application in the plurality of applications corresponds to one or more domain names;

In a specific implementation, the training unit is further configured to: determine the frequency of occurrence of a target domain name in the plurality of initial domain name data sets, and delete the target domain name from the plurality of initial domain name data sets when the frequency is greater than or equal to a second threshold, wherein the target domain name is any domain name in the plurality of initial domain name data sets.

In a specific implementation, the training unit is specifically configured to:

In a specific implementation manner, the processing unit is specifically configured to: extracting domain name keywords of the domain name corresponding to the destination address, inputting the domain name corresponding to the destination address into the trained primary domain name recognition model under the condition that keywords with the similarity smaller than a third threshold value with the domain name keywords are inquired in a domain name keyword set, and acquiring target application corresponding to the message according to the domain name corresponding to the destination address and the primary domain name recognition model, wherein the domain name keyword set comprises primary domain name information of different applications with the same primary domain name.

In a specific implementation, the processing unit is further configured to: under the condition that keywords with similarity greater than or equal to a third threshold value with the domain name keywords are inquired in the domain name keyword set, input the domain name corresponding to the destination address into the trained multistage domain name recognition model, and acquire the target application corresponding to the message according to the domain name corresponding to the destination address and the trained multistage domain name recognition model.

In a specific implementation, the processing unit is further configured to: update the address database according to the destination address and the target application.

In a fourth aspect, an embodiment of the present application provides a network device, including a processor and a memory; the memory is configured to store instructions, and the processor is configured to execute the instructions, and when the processor executes the instructions, the network device performs the data forwarding method according to the first aspect or any one of the specific implementation manners of the first aspect.

In a fifth aspect, an embodiment of the present application provides a computer-readable storage medium, where instructions are stored in the computer-readable storage medium, and when the instructions are executed on a device, the instructions cause the device to execute the data forwarding method in the first aspect or any implementation manner of the first aspect.

In a sixth aspect, an embodiment of the present application provides a computer program product, which when running on a device, causes the device to execute the data forwarding method in the first aspect or any implementation manner of the first aspect.

Drawings

In order to illustrate the technical solutions of the embodiments of the present application more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings in the following description show some embodiments of the present application, and those skilled in the art can obtain other drawings based on these drawings without creative effort.

FIG. 1 is an architecture diagram of a data recognition system according to an embodiment of the present application;

FIG. 2 is an architecture diagram of another data recognition system provided by an embodiment of the present application;

FIG. 3 is an architecture diagram of another data recognition system provided by an embodiment of the present application;

FIG. 4 is a schematic structural diagram of a network device according to an embodiment of the present application;

FIG. 5 is a schematic flowchart of a data identification method according to an embodiment of the present application;

FIG. 6 is a schematic flowchart of a method for training a machine learning model according to an embodiment of the present application;

FIG. 7 is a schematic flowchart of another method for training a machine learning model according to an embodiment of the present application;

FIG. 8 is a schematic structural diagram of a data recognition apparatus according to an embodiment of the present application;

FIG. 9 is a schematic structural diagram of a network device according to an embodiment of the present application;

FIG. 10 is a schematic diagram of a computing device system according to an embodiment of the present application.

Detailed Description

The technical solution in the present application will be described below with reference to the accompanying drawings.

The terms "first" and "second" in the embodiments of the present application are used for descriptive purposes only and are not to be construed as indicating or implying relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defined as "first" or "second" may explicitly or implicitly include one or more of that feature.

In the embodiments of the present application, "and/or" describes an association relationship between associated objects and indicates that three relationships may exist. For example, "A and/or B" may indicate: A exists alone, both A and B exist, or B exists alone, where A and B may be singular or plural. The character "/" generally indicates that the former and latter associated objects are in an "or" relationship. Any embodiment or design described herein as "exemplary" or "e.g." is not necessarily to be construed as preferred or advantageous over other embodiments or designs. Rather, use of the word "exemplary" or "such as" is intended to present related concepts in a concrete fashion.

In order to solve the above problem, an embodiment of the present application provides a data forwarding method, configured to identify an application to which a data packet belongs according to a destination address carried in a packet in a network or a domain name corresponding to the destination address, and then forward the data packet according to a forwarding rule corresponding to the application to which the data packet belongs. As shown in fig. 1, fig. 1 is an architecture diagram of a data forwarding system provided in an embodiment of the present application, and the system includes a network device 100 and a training device 200, where the network device 100 includes an address identification module 110 and a domain name identification module 120. The training device 200 is configured to train the initial domain name recognition model according to the target data set to obtain a trained target domain name recognition model, and then send the target domain name recognition model to the network device 100, where the network device 100 deploys the target domain name recognition model in the domain name recognition module 120 of the network device 100.

After the network device 100 receives a message, the address identification module 110 first obtains the destination address in the message and identifies the application to which the message belongs according to the correspondence between addresses and applications stored in the address database. If the address identification module 110 identifies the target application to which the message belongs, the network device 100 forwards the message according to the forwarding rule preconfigured for data of the target application. If the address identification module 110 cannot identify the application to which the message belongs, it sends the destination address in the message to the domain name identification module 120. After receiving the destination address, the domain name identification module 120 looks up the domain name to be identified that corresponds to the destination address according to the correspondence between domain names and addresses stored in the domain name database. It then inputs the domain name to be identified into the target domain name recognition model to identify the target application corresponding to that domain name. Finally, the network device 100 forwards the message according to the forwarding rule preconfigured for data of the target application. In addition, once the domain name identification module 120 has identified the target application from the domain name to be identified, the correspondence between the destination address and the target application is known; the domain name identification module 120 sends the destination address and the target application to the address database, and the network device 100 updates the address database accordingly. When the network device 100 again receives a message addressed to that destination address, the application to which the message belongs can be identified by the address identification module 110 alone. The network device 100 may be a data transmission device such as a switch or a router, and the address may be an Internet Protocol (IP) address.
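The two-stage lookup described above can be illustrated with a minimal Python sketch. It is not the claimed implementation: the function name `identify_application`, the dictionary-based databases, and the stand-in `toy_model` are all hypothetical, and the sketch simply takes the first domain name for which the model returns an application.

```python
from typing import Callable, Dict, List, Optional


def identify_application(
    dest_ip: str,
    address_db: Dict[str, str],
    domain_db: Dict[str, List[str]],
    domain_model: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the application for dest_ip, or None if it cannot be identified."""
    # Stage 1: address identification (direct lookup in the address database).
    app = address_db.get(dest_ip)
    if app is not None:
        return app
    # Stage 2: domain name identification via the domain name database.
    for domain in domain_db.get(dest_ip, []):
        app = domain_model(domain)
        if app is not None:
            # Write the result back so later messages are resolved in stage 1.
            address_db[dest_ip] = app
            return app
    return None


# Hypothetical stand-in for the trained target domain name recognition model.
def toy_model(domain: str) -> Optional[str]:
    return "application A" if domain.endswith("abc.com") else None


address_db = {"192.168.100.125": "application B"}
domain_db = {"192.168.135.166": ["www.abc.com"]}

first = identify_application("192.168.135.166", address_db, domain_db, toy_model)
cached = address_db.get("192.168.135.166")
```

After the first call, the destination address is present in `address_db`, so a second message to the same address no longer reaches the domain name stage.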

Optionally, as shown in fig. 2, the network device 100 may further include a DPI identification module 130. After the address identification module 110 fails to identify the application to which the packet belongs, the DPI identification module 130 obtains the packet and identifies it using a deep packet inspection (DPI) identification technology. If the DPI identification module 130 identifies a target application corresponding to the packet, the network device 100 forwards the packet according to a pre-configured forwarding rule for the data of the target application. In addition, the DPI identification module 130 sends the destination address and the target application to the address database, and the network device 100 updates the address database according to the destination address and the target application. When the network device 100 again receives a message addressed to the destination address, the application to which the data belongs can be identified by the address identification module 110 alone. If the DPI identification module 130 cannot identify the application to which the data belongs, the DPI identification module 130 sends the destination address in the packet to the domain name identification module 120, and the domain name identification module 120 performs identification based on the destination address.

In one possible embodiment, as shown in fig. 3, fig. 3 is an architecture diagram of another data forwarding system provided in the embodiment of the present application, the system includes a network device 300, a controller 400, and a training device 200, the network device 300 includes an address identification module 310, and the controller includes a domain name identification module 410. The function of the address identification module 310 in the network device 300 is the same as the function of the address identification module 110 in the network device 100 in fig. 1, and the function of the domain name identification module 410 in the controller 400 is the same as the function of the domain name identification module 120 in the network device 100, which is not described herein again. The domain name database can be deployed in the network device 300 and can also be deployed in the controller 400, and fig. 3 illustrates an example in which the domain name database is located in the network device. Optionally, network device 300 further includes a DPI identification module 320, and a function of DPI identification module 320 in network device 300 is the same as a function of DPI identification module 130 in network device 100, which is not described herein again. The network device 300 may be a data transmission device such as a switch or a router. The controller 400 may be any electronic device that can be configured to perform domain name recognition, such as a server, a terminal, etc.

Alternatively, the training apparatus 200 may be a single module integrated in the controller 400, and the embodiment of the present application is not limited in particular.

In another possible embodiment, as shown in fig. 4, fig. 4 is a schematic structural diagram of a network device provided in this embodiment, where the network device 500 includes an address recognition module 510, a domain name recognition module 520, and a training module 530. The function of the address identification module 510 in the network device 500 is the same as that of the address identification module 110 in the network device 100 in fig. 1, the function of the domain name identification module 520 is the same as that of the domain name identification module 120 in the network device 100 in fig. 1, and the function of the training module 530 is the same as that of the training device 200 in fig. 1, which is not repeated herein. Optionally, network device 500 may further include DPI identification module 540, where DPI identification module 540 in network device 500 and DPI identification module 130 in network device 100 have the same function, and are not described herein again. The network device 500 may be a data transmission device such as a switch or a router, or may be a device such as a server. If the network device 500 is a server, the network device 500 is communicatively connected to a data transmission device such as a switch. After receiving a message, the data transmission device sends the message to the network device 500; the network device 500 identifies the application to which the message belongs and returns the identification result to the data transmission device; and the data transmission device forwards the message according to the identification result.

The data forwarding method provided by the present application is described in detail below with reference to the data identification system or the network device shown in fig. 1 to 4. As shown in fig. 5, fig. 5 is a schematic flowchart of a data forwarding method provided in an embodiment of the present application, where the method includes:

S502, receiving the message and obtaining the destination address of the message.

After a user terminal sends a message, the message needs to be forwarded by a network device (such as a switch or a router) to reach the device indicated by the destination address. After receiving the message, the network device needs to identify the application to which the message belongs, and then forwards the message according to a preset forwarding rule. For example, if the message belongs to a video conferencing application, the message is forwarded preferentially. After the network device receives the message, the address identification module of the network device obtains the destination address of the message.

It should be noted that the network device in the embodiment of the present application may be any one of the network device 100, the network device 300, or the network device 500, and the embodiment of the present application is not particularly limited. For brevity, the network device 500 is used as an example in the following description.

S504, identifying the target application corresponding to the destination address according to the address database.

The address database stores a correspondence relationship between an address (for example, an IP address) and a target application, and for example, the IP address "192.168.135.166" corresponds to an address of a server in which the application a is located, and the IP address "192.168.100.125" corresponds to an address of a server in which the application B is located. One address corresponds to one application, and one application may correspond to one or more addresses.

After obtaining the destination address, the address identification module 510 identifies the application to which the packet belongs by matching the destination address against the addresses in the address database. For example, if the destination address is "192.168.100.125" and the address identification module 510 finds in the address database that this address corresponds to application B, the message is determined to be data of application B.

S506, under the condition that the target application corresponding to the destination address is not identified according to the address database, the domain name to be identified corresponding to the destination address is obtained according to the destination address.

In this embodiment, the network device 500 further includes a domain name database, where a correspondence between domain names and addresses is stored in the domain name database. For example, the IP address corresponding to the domain name "www.abc.com" is "192.168.135.166", where one domain name corresponds to one IP address, and one IP address may correspond to one or more domain names. It can be understood that, before an application accesses a server, it needs to query a Domain Name System (DNS) server for the IP address corresponding to the domain name of the application, and the DNS server returns the IP address corresponding to the domain name, so that the correspondence between the address and the domain name is available in the network device.

In this embodiment, if the address identification module 510 can identify the target application to which the packet belongs according to the data in the address database, the network device 500 forwards the packet according to a pre-configured forwarding rule for the data of the target application.

If the address recognition module 510 does not recognize the target application corresponding to the destination address according to the data in the address database, that is, the address recognition module 510 does not recognize the target application to which the packet belongs, the destination address is sent to the domain name recognition module 520, and the domain name recognition module 520 matches the destination address against the addresses in the domain name database to obtain the domain name to be recognized corresponding to the destination address.
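As a rough illustration of the domain name database used in S506, the following Python sketch keeps a forward map (one domain name to one IP address) and a reverse index (one IP address to one or more domain names). The class name `DomainDatabase`, its methods, and the choice to overwrite an older mapping when a domain name is seen with a new address are assumptions made for illustration only.

```python
from collections import defaultdict
from typing import Dict, List


class DomainDatabase:
    """Hypothetical store of domain-to-address pairs learned from DNS responses."""

    def __init__(self) -> None:
        self._ip_by_domain: Dict[str, str] = {}
        self._domains_by_ip: Dict[str, List[str]] = defaultdict(list)

    def record(self, domain: str, ip: str) -> None:
        # One domain name maps to one IP address; a newer answer replaces the old one.
        old_ip = self._ip_by_domain.get(domain)
        if old_ip == ip:
            return
        if old_ip is not None:
            self._domains_by_ip[old_ip].remove(domain)
        self._ip_by_domain[domain] = ip
        self._domains_by_ip[ip].append(domain)

    def domains_for(self, ip: str) -> List[str]:
        # One IP address may map to one or more domain names.
        return list(self._domains_by_ip.get(ip, []))


db = DomainDatabase()
db.record("www.abc.com", "192.168.135.166")
db.record("www.video.abc.com", "192.168.135.166")
candidates = db.domains_for("192.168.135.166")
```

Here `candidates` holds the domain names to be recognized for the destination address "192.168.135.166".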

S508, according to the domain name to be recognized and the target domain name recognition model, the target application corresponding to the domain name to be recognized is obtained, and the message is forwarded according to the forwarding rule corresponding to the target application.

In this embodiment of the present application, the target domain name recognition model is obtained by training an initial domain name recognition model according to a machine learning algorithm based on a target data set, where the target data set includes a plurality of domain names and application identifiers corresponding to the domain names, the application identifiers indicate applications to which the domain names belong, each domain name in the target data set corresponds to one application identifier, and one application identifier (i.e., one application) corresponds to a plurality of domain names. The initial domain name recognition model may be a support vector machine model, a neural network model, or other machine learning models, and embodiments of the present application are not particularly limited.

The target domain name recognition model is deployed in the domain name recognition module 520. After the domain name to be recognized corresponding to the destination address is obtained, the domain name recognition module 520 inputs it into the target domain name recognition model, which outputs the application identifier corresponding to the domain name to be recognized. This yields the target application corresponding to the domain name to be recognized and therefore the target application corresponding to the destination address, that is, the application to which the message belongs. The network device 500 forwards the packet according to the preconfigured forwarding rule for the data of the target application. It should be noted that the domain name identifying module 520 may obtain one or more domain names to be identified from the domain name database according to the destination address. When the domain name identifying module 520 obtains one domain name to be identified, the application identified from that domain name is used as the target application. When the domain name identifying module 520 obtains a plurality of domain names to be identified, it may identify one or more applications from them; when a plurality of applications are identified, the application identified for the majority of the domain names is used as the target application.
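The selection among several candidate domain names can be sketched as a simple vote. This is only an illustration: `vote_target_application` and the lookup-table model are hypothetical, and ties are resolved in favor of the application seen first, which the text above does not specify.

```python
from collections import Counter
from typing import Callable, Iterable, Optional


def vote_target_application(
    domains: Iterable[str],
    model: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Pick the application predicted for the largest number of domain names."""
    votes: Counter = Counter()
    for domain in domains:
        app = model(domain)
        if app is not None:
            votes[app] += 1
    if not votes:
        return None
    # most_common keeps first-seen order among equal counts (an assumed tie rule).
    return votes.most_common(1)[0][0]


# Hypothetical predictions standing in for the target domain name recognition model.
predictions = {
    "www.game.abc.com": "application A",
    "www.video.abc.com": "application A",
    "cdn.xy.com": "application B",
}
target = vote_target_application(predictions.keys(), predictions.get)
```

With two of the three domain names predicted as application A, `target` is "application A".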

After the domain name identification module 520 obtains the target application corresponding to the destination address, the destination address and the target application are sent to the address database, the network device 500 updates the address database according to the destination address and the target application, and when the network device 500 receives the data addressed to the destination address again, the application to which the data belongs can be identified only through the address identification module 510.

In a possible embodiment, the target domain name recognition model includes a trained first-level domain name recognition model and a trained multi-level domain name recognition model. The trained first-level domain name recognition model is obtained by training according to a first target data set, and the trained multi-level domain name recognition model is obtained by training according to a second target data set. The first target data set includes first-level domain names and multi-level domain names (e.g., second-level domain names, third-level domain names, etc.), and no two domain names in the first target data set share the same first-level domain name, although the first target data set may include a plurality of domain names corresponding to the same application. The second target data set comprises a plurality of subsets, each subset includes first-level domain names and multi-level domain names, and the domain names in each subset have the same first-level domain name; that is, for each domain name in the second target data set, another domain name with the same first-level domain name can be found in the second target data set. For example, subset A includes the domain names "www.abc.com", "www.game.abc.com", "www.video.abc.com", and "www.zone.abc.com", which all include the first-level domain name "abc.com". Subset B includes the domain names "www.game.xy.com", "www.map.xy.com", and "www.cloud.xy.com", which all include the first-level domain name "xy.com".

After receiving the domain name to be identified, the domain name identification module 520 extracts the first-level domain name information of the domain name to be identified and matches it against the domain name keywords in the domain name keyword set. For example, the similarity between the extracted first-level domain name information and each keyword in the domain name keyword set is calculated, and when the similarity is greater than or equal to a similarity threshold, the first-level domain name information is considered to match that keyword. When the first-level domain name information corresponding to the domain name to be recognized matches a keyword in the domain name keyword set, the domain name recognition module 520 inputs the domain name to be recognized into the multi-level domain name recognition model and uses the multi-level domain name recognition model to recognize it. When the first-level domain name information corresponding to the domain name to be recognized does not match any keyword in the domain name keyword set, the domain name recognition module 520 inputs the domain name to be recognized into the first-level domain name recognition model, and the first-level domain name recognition model is used for recognizing the domain name to be recognized.
The domain name keyword set includes keywords of a first-level domain name extracted from domain names having the same first-level domain name in the second target data set, for example, four domain names of "www.abc.com", "www.game.abc.com", "www.video.abc.com" and "www.zone.abc.com" in the second target data set include a first-level domain name "abc.com", then the first-level domain name keyword "abc" is extracted and added to the domain name keyword set, and three domain names of "www.game.xy.com", "www.map.xy.com" and "www.cloud.xy.com" include a first-level domain name "xy.com", then the first-level domain name keyword "xy" is extracted and added to the domain name keyword set.
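The choice between the two models can be sketched as follows. The similarity measure is not specified in the text; this sketch assumes `difflib.SequenceMatcher` from the Python standard library and a hypothetical threshold of 0.8. The keyword extraction naively takes the second-to-last label, which only holds for simple suffixes such as ".com"; multi-part suffixes would need a public suffix list.

```python
from difflib import SequenceMatcher
from typing import Iterable


def first_level_keyword(domain: str) -> str:
    """Naive keyword extraction: 'www.game.abc.com' -> 'abc' (assumes a one-label suffix)."""
    labels = domain.lower().rstrip(".").split(".")
    return labels[-2] if len(labels) >= 2 else labels[0]


def select_model(domain: str, keyword_set: Iterable[str], threshold: float = 0.8) -> str:
    """Return which model should handle the domain: 'multi-level' or 'first-level'."""
    key = first_level_keyword(domain)
    for keyword in keyword_set:
        # Similarity in [0, 1]; a value at or above the threshold counts as a match.
        if SequenceMatcher(None, key, keyword).ratio() >= threshold:
            return "multi-level"
    return "first-level"


keywords = ["abc", "xy"]  # keyword set taken from the example above
choice_known = select_model("www.game.abc.com", keywords)
choice_new = select_model("www.example.org", keywords)
```

A domain name under "abc.com" is routed to the multi-level model, while a domain name whose first-level keyword resembles none of the keywords is routed to the first-level model.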

In a possible embodiment, the network device 500 further includes a DPI identification module 540. In S506 above, when the target application corresponding to the destination address is not identified according to the address database, the DPI identification module 540 acquires the packet and identifies it using a DPI identification technology. If the DPI identification module 540 identifies the target application corresponding to the packet, the network device 500 forwards the packet according to a pre-configured forwarding rule for data of the target application. In addition, the DPI identification module 540 sends the destination address and the target application to the address database, and the network device 500 updates the address database according to the destination address and the target application. When the network device 500 again receives data addressed to the destination address, the application to which the data belongs can be identified by the address identification module 510 alone. If the DPI identification module 540 cannot identify the application to which the data belongs, it sends the destination address in the packet to the domain name identifying module 520, and the domain name identifying module 520 performs identification based on the destination address.

A domain name recognition model is obtained by training with a machine learning method. The application to which a domain name to be recognized belongs is identified according to that domain name and the domain name recognition model, and the application to which a message accessing that domain name belongs is thereby determined. This way of identifying the application to which data belongs avoids a problem that arises with deep packet inspection, where identification results become inaccurate as the characteristic information of the application's data keeps changing. Identifying the application by domain name therefore generalizes better and is more stable. In addition, the data packet does not need to be unpacked and analyzed during identification, so identification is more efficient.

In the foregoing embodiment, the network device identifies a domain name through a target domain name identification model, which is obtained by training an initial domain name identification model on a target data set. An embodiment of the present application provides a training method for a machine learning model. As shown in fig. 6, fig. 6 is a schematic flow diagram of the training method for the machine learning model provided in the embodiment of the present application. The method includes the following steps:

S602, acquiring a plurality of initial domain name data sets.

Before training the initial domain name recognition model according to a target data set, one or more domain names accessed by each application need to be acquired as domain names corresponding to each application, and the target data set comprises a set of domain names accessed by a plurality of applications. In the embodiment of the application, the initial domain name data set corresponding to each application is obtained separately, yielding a plurality of initial domain name data sets corresponding to a plurality of applications. Each application corresponds to an initial domain name data set, and each initial domain name data set comprises one or more domain names accessed by one application. The domain name corresponding to the application may be obtained by capturing traffic generated by the application accessing the network, or may be obtained by using a Really Simple Syndication (RSS) method, which is not specifically limited in the embodiment of the present application.

S604, preprocessing the domain name in each initial domain name data set in the plurality of initial domain name data sets respectively to obtain the main cause domain name set corresponding to each application.

After an initial domain name data set corresponding to each application is obtained, domain names in each initial domain name data set need to be screened, unknown application domain names corresponding to unknown applications in the initial domain name data sets are determined, a main cause domain name set corresponding to each application is extracted, and the main cause domain name set corresponding to each application comprises a main cause domain name corresponding to the application. The main cause domain name is a domain name mainly accessed by each application, for example, the application a includes 10 functions, when a user uses each function, the user terminal accesses the corresponding server by accessing the domain name corresponding to the function, and the main cause domain name of the application a includes the domain names corresponding to the 10 functions.

In the embodiment of the present application, for any target application among the multiple applications, the main cause domain names in the initial domain name data set corresponding to that application may be extracted by the following method, so as to obtain the main cause domain name set corresponding to each application. First, the domain names in the target initial domain name data set corresponding to the target application are clustered through a clustering algorithm to obtain one or more domain name subsets. Then, the first traffic corresponding to each domain name subset and the second traffic corresponding to the target initial domain name data set are counted. The first traffic is the sum of the traffic generated when the target application accesses all domain names in a domain name subset within a preset time, and the second traffic is the sum of the traffic generated when the target application accesses all domain names in the target initial domain name data set within the preset time. Finally, the traffic ratio of the first traffic to the second traffic is calculated for each domain name subset. If the traffic ratio for a domain name subset is smaller than a preset traffic ratio, the domain names in that subset are taken as unknown application domain names. If the traffic ratio is greater than or equal to the preset traffic ratio, the domain names in that subset are determined to be main cause domain names corresponding to the target application. In other words, the domain names in the target initial domain name data set other than the unknown application domain names are taken as the main cause domain names of the target application.

It should be noted that, after a domain name in a domain name subset is determined as an unknown application domain name, the application identifier corresponding to that domain name needs to be changed to the label corresponding to the unknown application. For example, 100 domain names in the target initial domain name data set corresponding to the target application B are clustered to obtain 3 domain name subsets, numbered B1, B2, and B3, where the subset B1 includes 40 domain names, the subset B2 includes 25 domain names, and the subset B3 includes 35 domain names. If, among the three domain name subsets, only the ratio of the first traffic corresponding to the subset B2 to the second traffic corresponding to the target application is smaller than the preset traffic ratio, the 25 domain names in the subset B2 are used as unknown application domain names, the application identifier of the application B originally corresponding to the domain names in the subset B2 is changed into the identifier of the unknown application, and the domain names in the subset B1 and the subset B3 are used as the main cause domain names of the target application B.
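The traffic-ratio screening can be sketched in a few lines of Python. The clustering step is assumed to have already produced the subsets; the function name `split_by_traffic_ratio`, the traffic figures, and the preset ratio of 0.1 are hypothetical values chosen so that only subset B2 falls below the threshold.

```python
from typing import Dict, List, Tuple


def split_by_traffic_ratio(
    subset_traffic: Dict[str, float],
    min_ratio: float,
) -> Tuple[List[str], List[str]]:
    """Split clustered subsets into (main cause subsets, unknown application subsets)."""
    # Second traffic: total traffic of the whole initial domain name data set.
    total = sum(subset_traffic.values())
    main, unknown = [], []
    for name, first_traffic in subset_traffic.items():
        ratio = first_traffic / total if total else 0.0
        if ratio >= min_ratio:
            main.append(name)
        else:
            unknown.append(name)
    return main, unknown


# Hypothetical per-subset traffic for target application B within the preset time.
traffic = {"B1": 600.0, "B2": 50.0, "B3": 350.0}
main_subsets, unknown_subsets = split_by_traffic_ratio(traffic, min_ratio=0.1)
```

With these figures the ratios are 0.60, 0.05, and 0.35, so the domain names in B2 would be relabeled as belonging to an unknown application.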

Optionally, the foregoing determines, by counting the traffic corresponding to each domain name subset, whether the domain names in each domain name subset are main cause domain names corresponding to the target application. In the embodiment of the present application, this may also be determined by counting the ratio of the number of times that all domain names in each domain name subset are accessed to the number of times that all domain names in the target initial domain name data set are accessed. Illustratively, after clustering a plurality of domain names in a target initial domain name data set to obtain one or more domain name subsets, a first access amount corresponding to each domain name subset and a second access amount corresponding to the target initial domain name data set are counted. The first access amount refers to the number of times that the target application accesses all domain names in one domain name subset within a preset time length, and the second access amount refers to the number of times that the target application accesses all domain names in the target initial domain name data set within the preset time length. Finally, the number ratio of the first access amount to the second access amount is calculated for each domain name subset. If the number ratio corresponding to a domain name subset is smaller than a preset number ratio, the domain names in that subset are taken as unknown application domain names; if the number ratio is greater than or equal to the preset number ratio, the domain names in that subset are determined to be main cause domain names corresponding to the target application.

It can be understood that whether the domain name in a domain name subset is the main cause domain name can also be determined according to the traffic and the access amount corresponding to the domain name subset. For example, when the flow ratio corresponding to one domain name subset is smaller than the preset flow ratio and the number ratio is smaller than the preset number ratio, the domain name in the domain name subset is used as the unknown application domain name. And when the flow ratio corresponding to one domain name subset is larger than or equal to a preset flow ratio and/or the quantity ratio is larger than or equal to a preset quantity ratio, determining the domain name in the domain name subset as the main cause domain name.
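Read together, the two rules above are complements of each other: a subset is unknown only when both ratios fall below their thresholds, and is a main cause subset when at least one ratio reaches its threshold. A minimal sketch, with a hypothetical function name and hypothetical threshold values:

```python
def is_main_cause_subset(
    flow_ratio: float,
    count_ratio: float,
    min_flow_ratio: float,
    min_count_ratio: float,
) -> bool:
    """True if the subset qualifies by traffic share, by access share, or by both."""
    return flow_ratio >= min_flow_ratio or count_ratio >= min_count_ratio


# Hypothetical thresholds: 10% of traffic, 20% of accesses.
low_both = is_main_cause_subset(0.05, 0.10, 0.1, 0.2)     # both below: unknown
high_flow = is_main_cause_subset(0.30, 0.10, 0.1, 0.2)    # traffic share suffices
high_count = is_main_cause_subset(0.05, 0.25, 0.1, 0.2)   # access share suffices
```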

In a specific implementation manner, the main cause domain names corresponding to each target application are first determined according to the ratio of the first traffic corresponding to each domain name subset to the second traffic corresponding to the target initial domain name data set, and/or according to the ratio of the number of times all domain names in each domain name subset are accessed to the number of times all domain names in the target initial domain name data set are accessed. Then, for the domain names determined to be unknown application domain names, the first-level domain name information of each unknown application domain name is extracted. If the first-level domain name information of an unknown application domain name includes the name of the target application, that domain name is taken as a main cause domain name of the target application and is added to the main cause domain names corresponding to the target application.
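This recovery step can be sketched as a substring check on the first-level domain name. The function name `reclaim_unknown_domains` is hypothetical, and the first-level domain name is approximated here as the last two labels of the domain name, which is an assumption that holds only for simple suffixes such as ".com".

```python
from typing import Iterable, List, Tuple


def reclaim_unknown_domains(
    unknown_domains: Iterable[str],
    app_name: str,
) -> Tuple[List[str], List[str]]:
    """Return (domains moved back to the main cause set, domains that stay unknown)."""
    name = app_name.lower()
    reclaimed, still_unknown = [], []
    for domain in unknown_domains:
        labels = domain.lower().rstrip(".").split(".")
        first_level = ".".join(labels[-2:])  # e.g. 'cdn.abc.com' -> 'abc.com'
        if name in first_level:
            reclaimed.append(domain)
        else:
            still_unknown.append(domain)
    return reclaimed, still_unknown


# Hypothetical unknown application domain names for an application named "abc".
reclaimed, still_unknown = reclaim_unknown_domains(
    ["cdn.abc.com", "www.shurufa.com"], "abc"
)
```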

After an initial domain name data set composed of the domain names accessed by an application is obtained, the domain names are clustered to obtain a plurality of domain name subsets. According to the clustering result, the first flow generated when the application accesses the domain names in each domain name subset and the second flow generated when the application accesses all of its domain names are calculated. The domain names in the subsets whose ratio of the first flow to the second flow is greater than or equal to a preset value are taken as the domain names mainly accessed by the application, so that the main cause domain names corresponding to the application are extracted. This improves the identification accuracy of the target domain name recognition model trained on the main cause domain names, reduces the amount of data used to train the initial domain name recognition model, and improves training efficiency.

Optionally, before the plurality of domain names in the target initial domain name data set are clustered by calculating similarity to obtain one or more domain name subsets, the frequency with which each domain name, or each domain name keyword, occurs across the plurality of initial domain name data sets may be counted, and a domain name whose frequency exceeds a preset frequency threshold is taken as an interference domain name. For example, if the domain name "www.shurufa.com" appears in 150 of 200 acquired initial domain name data sets, its frequency of occurrence is 75%, which exceeds a preset frequency threshold of 65%; the domain name is therefore considered an interference domain name and is deleted from the 150 initial domain name data sets in which it appears. The plurality of domain names in the target initial domain name data set are then clustered by calculating similarity to obtain one or more domain name subsets.
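The frequency count in this example can be sketched as follows. This is a minimal illustration that counts whole domain names only (not domain name keywords) and treats each initial domain name data set as a list of strings; the function name is hypothetical.

```python
from collections import Counter


def remove_interference_domains(datasets, frequency_threshold):
    """Drop domains that occur in too large a share of the data sets.

    Returns (cleaned_datasets, interference_domains).
    """
    if not datasets:
        return [], set()
    occurrences = Counter()
    for dataset in datasets:
        # Count each domain at most once per data set.
        occurrences.update(set(dataset))
    total = len(datasets)
    interference = {
        domain
        for domain, count in occurrences.items()
        if count / total > frequency_threshold
    }
    cleaned = [[d for d in ds if d not in interference] for ds in datasets]
    return cleaned, interference
```

For the figures in the example, a domain name present in 150 of 200 data sets has a frequency of 0.75, which exceeds a threshold of 0.65, so it is removed from every data set that contains it.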

Domain names that occur with high frequency across the plurality of initial domain name data sets are determined to be interference domain names and are deleted from those data sets. This improves the identification accuracy of the target domain name recognition model trained on the target data set, reduces the amount of data used to train the initial domain name recognition model, and improves training efficiency.

S606, balancing the number of the main cause domain names corresponding to the plurality of applications.

It can be understood that each domain name in the initial domain name data set has a corresponding application identifier to indicate an application to which each domain name belongs, after the domain names in each initial domain name data set are screened to obtain a main cause domain name corresponding to each application, the number of the main cause domain names corresponding to each application is counted according to the application identifiers, and the number of the main cause domain names of each application is balanced according to the number of the main cause domain names corresponding to each application. When the number of the primary cause domain names corresponding to one application is greater than the balance threshold, the primary cause domain names corresponding to the application need to be reduced, and the number of the primary cause domain names corresponding to the application is reduced to the balance threshold. When the number of the primary cause domain names corresponding to one application is smaller than the balance threshold, the primary cause domain names corresponding to the application need to be added, so that the number of the primary cause domain names corresponding to the application reaches the balance threshold. The balance threshold refers to the number of the primary cause domain names corresponding to each application in a target data set finally used for training the initial domain name recognition model, and the balance threshold may be a median or an average of the number of the primary cause domain names corresponding to the multiple applications, which is not specifically limited in the embodiments of the present application.

When the number of main cause domain names corresponding to an application is greater than the balance threshold, some of the main cause domain names that are highly similar to one another are deleted according to the similarity between the main cause domain names of the application, so that the number of main cause domain names of the application is reduced to the balance threshold. For example, if the number of main cause domain names corresponding to application A is 105 and the balance threshold is 100, the number of main cause domain names corresponding to application A needs to be reduced by 5. The similarity between any two of the 105 main cause domain names may be calculated, and one of two main cause domain names whose similarity is above the similarity threshold may be deleted. When the number of main cause domain names corresponding to an application is smaller than the balance threshold, the number can be increased to the balance threshold by adding interference characters to the main cause domain names of the application, by replacing characters in the main cause domain names, or by repeating main cause domain names. For example, if the number of main cause domain names corresponding to application B is 96 and the balance threshold is 100, the number of main cause domain names corresponding to application B needs to be increased by 4.
If one of the main cause domain names corresponding to application B is "www.abc.com", the characters "xy" can be added to the domain name to obtain a new domain name "www.xy.abc.com", which is added to the main cause domain name set corresponding to application B; or "com" in "www.abc.com" can be replaced with "cn" to obtain a new domain name "www.abc.cn", which is added to the main cause domain name set corresponding to application B; or "www.abc.com" can be repeated one or more times in the main cause domain name set corresponding to application B. Each of these increases the number of main cause domain names corresponding to application B.
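A minimal sketch of the balancing step is given below. The embodiments do not prescribe a similarity measure, so the sketch uses the ratio from Python's `difflib.SequenceMatcher` as a stand-in. Of the three augmentation options it implements only repetition of existing domain names. The function name and the default similarity threshold of 0.8 are assumptions.

```python
from difflib import SequenceMatcher
from itertools import cycle, islice


def balance_domains(domains, balance_threshold, similarity_threshold=0.8):
    """Return a copy of `domains` moved toward `balance_threshold` entries."""
    result = list(domains)

    # Too many domains: delete one domain of each highly similar pair
    # until the balance threshold is reached or no similar pair remains.
    i = 0
    while len(result) > balance_threshold and i < len(result):
        j = i + 1
        while len(result) > balance_threshold and j < len(result):
            similarity = SequenceMatcher(None, result[i], result[j]).ratio()
            if similarity >= similarity_threshold:
                del result[j]
            else:
                j += 1
        i += 1

    # Too few domains: repeat existing domains in order.
    if result and len(result) < balance_threshold:
        missing = balance_threshold - len(result)
        result.extend(islice(cycle(list(result)), missing))
    return result
```

If too few similar pairs exist, this sketch leaves the list above the balance threshold rather than deleting dissimilar domain names; how to handle that case is not specified in the description.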

The number of the main cause domain names corresponding to each application is balanced, so that the number of the main cause domain names corresponding to each application is the same or similar, and the domain names can be more accurately identified according to a target domain name identification model obtained by the main cause domain name training.

And S608, determining an unknown application domain name set corresponding to the unknown application, and obtaining a target data set according to the unknown application domain name set and the main cause domain name sets corresponding to the plurality of applications.

In the above S604, the domain names in the initial domain name data set corresponding to each application are clustered to obtain a plurality of domain name subsets, and the unknown application domain names in each initial domain name data set are determined by calculating the flow ratio and/or the quantity ratio corresponding to each domain name subset. The set of unknown application domain names determined across the plurality of initial domain name data sets is the initial unknown application domain name set. After the initial unknown application domain name set is obtained, the unknown application domain names in it need to be screened to obtain the unknown application domain name set. Specifically, each unknown application domain name in the initial unknown application domain name set may be classified to obtain the probability that it belongs to each of the plurality of applications, and whether the unknown application domain name may be added to the unknown application domain name set is then determined according to the maximum probability and the second-highest probability among those probabilities.
When the ratio of the maximum probability to the second-highest probability is smaller than a preset value, it is determined that the unknown application domain name can be added to the unknown application domain name set; when the ratio is greater than or equal to the preset value, it is determined that the unknown application domain name cannot be added to the unknown application domain name set. After the unknown application domain names in the unknown application domain name set are determined in this way, the number of unknown application domain names in the set is increased or reduced to the balance threshold according to the method for balancing the number of main cause domain names corresponding to the plurality of applications.
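The screening rule can be sketched as follows, assuming the per-application probabilities for each candidate domain name have already been produced by some classifier. The function names are hypothetical, and treating a second-highest probability of zero as "not ambiguous" is an assumption of this sketch.

```python
def is_ambiguous(probabilities, ratio_threshold):
    """True when the top two probabilities are too close to separate."""
    ranked = sorted(probabilities, reverse=True)
    if len(ranked) < 2:
        raise ValueError("need probabilities for at least two applications")
    top, second = ranked[0], ranked[1]
    if second == 0:
        # Assumption: an unbounded ratio means one application clearly wins.
        return False
    return top / second < ratio_threshold


def select_unknown_domains(candidates, ratio_threshold):
    """Keep the candidate domains that no single application clearly owns.

    `candidates` maps each domain name to its list of probabilities.
    """
    return [
        domain
        for domain, probabilities in candidates.items()
        if is_ambiguous(probabilities, ratio_threshold)
    ]
```

The intuition is that a domain name whose best and second-best application scores are close carries little evidence for any one application, which makes it a reasonable training sample for the unknown application class.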

Obtaining the primary cause domain name sets and the unknown application domain name sets corresponding to the plurality of applications according to the methods in the above S604, S606, and S608, thereby obtaining a target data set for training an initial domain name recognition model, that is, the target data set includes the primary cause domain names and the unknown application domain names corresponding to the plurality of applications and the application identifier corresponding to each domain name.

S610, training an initial domain name recognition model according to a target data set to obtain the target domain name recognition model.

After the target data set is obtained, inputting the target data set to an initial domain name recognition model, training the initial domain name recognition model according to the target data set by using a machine learning algorithm to obtain a trained target domain name recognition model, and deploying the target domain name recognition model to network equipment or a controller.

The initial domain name recognition model may be a linear regression model, a support vector machine, or a neural network model such as a recurrent neural network, a convolutional neural network, a deep convolutional network, or a deep residual network, which is not specifically limited in the embodiments of the present application.

The above S602 to S610 describe the method for obtaining the target data set when the target domain name recognition model includes only one domain name recognition model. The following describes, with reference to fig. 7, the method for obtaining a first target data set for training the first-level domain name recognition model and a second target data set for training the multi-level domain name recognition model when the target domain name recognition model includes a first-level domain name recognition model and a multi-level domain name recognition model.

S702, acquiring a plurality of initial domain name data sets.

Before training the initial domain name recognition model according to a target data set, one or more domain names accessed by each application need to be acquired as domain names corresponding to each application, and the target data set comprises a set of domain names accessed by a plurality of applications. Specifically, the method for the network device to obtain the multiple initial domain name data sets may refer to the related description in S602, and is not described herein again.

S704, preprocessing the domain name of each initial domain name data set in the multiple initial domain name data sets respectively to obtain a main cause domain name set corresponding to each application.

The method for processing the domain names in the obtained multiple initial domain name data sets to obtain the unknown application domain names and the main cause domain name set corresponding to each application may refer to the description in S604, and details are not repeated here.

S706, dividing the domain names in the plurality of main cause domain name sets into a first target data set and a second target data set.

In the embodiment of the application, after the main cause domain name sets corresponding to the multiple applications are obtained, domain names that share the same first-level domain name are extracted from those sets. For example, the four domain names "www.abc.com", "www.game.abc.com", "www.video.abc.com" and "www.zone.abc.com" all include the first-level domain name "abc.com", and the three domain names "www.game.xy.com", "www.map.xy.com" and "www.cloud.xy.com" all include the first-level domain name "xy.com". The domain names that share a first-level domain name with at least one other domain name are extracted from the main cause domain names corresponding to the plurality of applications to obtain the second target data set. The set formed by the remaining main cause domain names, that is, those not included in the second target data set, is taken as the first target data set. In other words, no domain name in the first target data set shares its first-level domain name with another domain name in the first target data set, although the first target data set may include a plurality of domain names corresponding to the same application, and every domain name in the second target data set shares its first-level domain name with at least one other domain name in the second target data set.
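The division in S706 can be sketched as below. As a simplifying assumption, the first-level domain name is approximated by the last two labels of the domain name, which is adequate for the "abc.com" and "xy.com" examples but not for multi-label public suffixes. The function names are hypothetical.

```python
from collections import defaultdict


def first_level_domain(domain):
    """Approximate the first-level domain name as the last two labels."""
    labels = domain.lower().rstrip(".").split(".")
    return ".".join(labels[-2:])


def split_by_first_level(domains):
    """Split domains into (first_set, second_set).

    `second_set` holds domains sharing a first-level domain name with at
    least one other distinct domain; `first_set` holds the rest.
    """
    groups = defaultdict(set)
    for domain in domains:
        groups[first_level_domain(domain)].add(domain)
    first_set, second_set = [], []
    for domain in domains:
        shared = len(groups[first_level_domain(domain)]) > 1
        (second_set if shared else first_set).append(domain)
    return first_set, second_set
```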

After the second target data set is obtained, the first-level domain name keywords included in each domain name in the second target data set are extracted to obtain a domain name keyword set. For example, the four domain names "www.abc.com", "www.game.abc.com", "www.video.abc.com" and "www.zone.abc.com" in the second target data set each include the first-level domain name "abc.com".

S708, balancing the number of the main cause domain names corresponding to each application in the first target data set and the second target data set.

In this embodiment of the application, the number of the primary cause domain names corresponding to each application in the first target data set and the second target data set is balanced, and the method for balancing the number of the primary cause domain names corresponding to each application may refer to the description in S606, which is not described herein again.

S710, determining a first unknown application domain name set and a second unknown application domain name set corresponding to unknown applications.

After the unknown application domain names are obtained according to the method in S704, they are divided into a first subset and a second subset in the same manner as in S706, where no domain name in the first subset shares its first-level domain name with another domain name in the first subset, and every domain name in the second subset shares its first-level domain name with at least one other domain name in the second subset. The number of unknown application domain names in each subset is then balanced according to the method in S608 above, that is, increased or reduced to the balance threshold. The domain names in the balanced first subset are added to the first target data set, and the domain names in the balanced second subset are added to the second target data set, thereby obtaining the first target data set for training the first-level domain name recognition model and the second target data set for training the multi-level domain name recognition model.

And S712, training the first-level domain name recognition model through the first target data set to obtain a trained first-level domain name recognition model, and training the multi-level domain name recognition model through the second target data set to obtain a trained multi-level domain name recognition model.

Inputting the first target data set into a first-stage domain name recognition model to train the first-stage domain name recognition model to obtain a trained first-stage domain name recognition model, inputting the second target data set into a multi-stage domain name recognition model to train the multi-stage domain name recognition model to obtain the trained multi-stage domain name recognition model.

It can be understood that the above-mentioned first-level domain name recognition model and the multi-level domain name recognition model may be the same machine learning model or different machine learning models, and the embodiment of the present application is not particularly limited.

It should be noted that, for simplicity of description, the above method embodiments are described as a series of action combinations. Those skilled in the art should understand, however, that the present invention is not limited by the described order of actions, that the embodiments described in the specification are preferred embodiments, and that the actions involved are not necessarily required by the present invention.

Other reasonable combinations of steps that can be conceived by one skilled in the art from the above description are also within the scope of the invention.

The data forwarding method provided in the embodiment of the present application is described in detail above with reference to fig. 1 to 7. The following describes the related apparatuses and devices that cooperate to implement the embodiment of the present application. As shown in fig. 1 to 4, the present application provides a data forwarding system for performing steps S502 to S508, S602 to S610, or S702 to S712 in the above method embodiments.

Fig. 8 is a schematic structural diagram of a data forwarding apparatus provided in this embodiment of the present application. The apparatus is configured to execute the data forwarding method in the foregoing method embodiments. The division of the functional modules of the apparatus is not limited in this embodiment of the present application; one division is provided below by way of example. The data forwarding apparatus 700 includes: a receiving unit 710, a processing unit 720, a sending unit 730, and a training unit 740, wherein,

the receiving unit 710 is configured to perform the receiving of the message in S502.

The processing unit 720 is configured to execute the method for identifying the target application to which the packet belongs according to the destination address or the domain name to be identified in S504 to S508, which may specifically refer to the detailed descriptions in S504 to S508 in the foregoing method embodiment, and is not described herein again.

The sending unit 730 is configured to forward the packet according to a forwarding rule corresponding to the target application after the processing unit 720 determines the target application of the packet.

The training unit 740 is configured to execute the method in S602 to S610 in fig. 6 of the above method embodiment for preprocessing the initial domain name data sets and training the initial domain name recognition model, or the method in S702 to S712 in fig. 7 of the above method embodiment for preprocessing the initial domain name data sets and training the first-level domain name recognition model and the multi-level domain name recognition model, which is not described herein again.

The four units can perform data transmission through a communication channel, and it should be understood that each module included in the apparatus 700 may be a software module, a hardware module, or a part of the software module and a part of the hardware module.

Referring to fig. 9, fig. 9 is a schematic structural diagram of a network device according to an embodiment of the present application. The network device 800 includes a processor 810, a communication interface 820, and a memory 830. The processor 810, the communication interface 820, and the memory 830 are connected to each other by a bus 840, wherein,

the processor 810 is configured to implement the operations performed by the processing unit 720 and the training unit 740. For the specific implementation of the operations performed by the processor 810, reference may be made to the specific operations in the above method embodiments, which are not described in detail herein.

The processor 810 may be implemented in various ways. For example, the processor 810 may be a Central Processing Unit (CPU), or a combination of a CPU and a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a Programmable Logic Device (PLD), or a combination thereof. The PLD may be a Complex Programmable Logic Device (CPLD), a field-programmable gate array (FPGA), a Generic Array Logic (GAL) device, or any combination thereof. The processor 810 may also be implemented as a single logic device with built-in processing logic, such as an FPGA or a Digital Signal Processor (DSP).

The communication interface 820 may be used for communication with other modules or devices, and may be an ethernet interface, a Local Interconnect Network (LIN), or the like.

In the embodiment of the present application, the communication interface 820 performs the operations implemented by the receiving unit 710 and the sending unit 730, for example, receiving a message and forwarding the message. Specifically, for the actions performed by the communication interface 820, reference may be made to the receiving and sending actions in the above method embodiments, which are not described herein again.

The memory 830 may be a non-volatile memory, such as a read-only memory (ROM), a Programmable ROM (PROM), an Erasable PROM (EPROM), an Electrically Erasable PROM (EEPROM), or a flash memory. The memory 830 may also be volatile memory, which may be Random Access Memory (RAM), that acts as external cache memory.

Memory 830 may also be used to store instructions and data that facilitate processor 810 to invoke the instructions stored in memory 830 to implement the functionality of the various modules of apparatus 700 described above, and in addition, network device 800 may include more or fewer components than those illustrated in fig. 9, or have a different arrangement of components.

The bus 840 may be a Peripheral Component Interconnect (PCI) bus, an Extended Industry Standard Architecture (EISA) bus, or the like. The bus 840 may be divided into an address bus, a data bus, a control bus, etc. For ease of illustration, only one thick line is shown in FIG. 9, but this does not indicate only one bus or one type of bus.

Specifically, the processor 810 is configured to execute the program in the memory, and cause the network device to perform the following processes:

the network device receives a message; if the network device obtains the target application corresponding to the message according to the destination address of the message, it forwards the message according to the forwarding rule corresponding to the target application; if the network device cannot obtain the target application corresponding to the message according to the destination address, it acquires the domain name corresponding to the destination address of the message, obtains the target application corresponding to the message according to that domain name, and then forwards the message according to the forwarding rule corresponding to the target application.

For a specific implementation process of the processor 810 to implement the forwarding operation, please refer to detailed descriptions of steps S502 to S508 in the embodiment shown in fig. 5, which is not described herein again.
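The forwarding process above can be summarized in a short, non-limiting sketch. All names are hypothetical: `addr_to_app` stands for whatever address-to-application mapping the network device holds, `addr_to_domains` for the domain name database, and `identify_app` for the target domain name recognition model. The `default_rule` fallback for messages whose application cannot be identified is an assumption of this sketch, not something stated in the embodiments.

```python
def select_forwarding_rule(dst_addr, addr_to_app, addr_to_domains,
                           identify_app, rules, default_rule="default"):
    """Pick the forwarding rule for a message from its destination address."""
    # Step 1: try to identify the application directly from the address.
    app = addr_to_app.get(dst_addr)

    # Step 2: otherwise fall back to the domain name(s) of the address.
    if app is None:
        for domain in addr_to_domains.get(dst_addr, []):
            app = identify_app(domain)
            if app is not None:
                break

    # Step 3: forward according to the rule of the identified application.
    return rules.get(app, default_rule)
```

Because only the destination address and a domain name lookup are involved, the payload of the message is never parsed in this flow.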

The processor 810 is also capable of executing programs in the memory that cause the network device to perform the training operation on the domain name recognition model.

For a specific implementation process of the processor 810 to implement the above-mentioned training operation, please refer to detailed descriptions of steps S602 to S610 in the embodiment shown in fig. 6 or detailed descriptions of steps S702 to S712 in the embodiment shown in fig. 7, which is not described herein again.

In addition, in the case that the units in the embodiment shown in fig. 8, such as the receiving unit 710, the processing unit 720, the sending unit 730 and the training unit 740, are software modules, the memory 830 stores these software modules, and the processor 810 executes these software modules to implement the functions and steps of the network device in the embodiments shown in fig. 5 to 7.

As shown in fig. 1-3, portions of the data forwarding system may run on multiple computing devices in different environments. Accordingly, the present application also provides a computing device system. As shown in fig. 10, the computing device system includes a plurality of computing devices 900, which respectively perform the operations performed by the network device 100 and the training device 200 in fig. 1, or the operations performed by the training device 200, the network device 300, and the controller 400 in fig. 3. Each computing device 900 includes a bus 910, a processor 920, a communication interface 930, and a memory 940. The processor 920, the communication interface 930, and the memory 940 communicate via the bus 910. Communication paths are established between the computing devices 900 over a communication network. The processor 920 may be a CPU. The memory 940 may include volatile memory, such as RAM. The memory 940 may also include non-volatile memory, such as ROM. The memory 940 stores executable code that the processor 920 executes to perform the part of the method that its portion of the data forwarding system is responsible for.

The embodiments of the present application further provide a non-transitory computer-readable storage medium in which instructions are stored. When the instructions are run on a processor, the method steps in the foregoing method embodiments can be implemented. For the specific implementation of the method steps by the processor, reference may be made to the specific operations in the foregoing method embodiments, and details are not described herein again.

Those of ordinary skill in the art will appreciate that the units and steps of the various examples described in connection with the embodiments disclosed herein may be implemented in electronic hardware, computer software, or a combination of both. To clearly illustrate the interchangeability of hardware and software, the components and steps of the various examples have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the implementation. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present invention.

In the several embodiments provided in the present application, it should be understood that the above-described embodiments of the apparatus are merely illustrative, for example, the division of the units is only one logical function division, and there may be other division manners in actual implementation, for example, a plurality of units or components may be combined or may be integrated into another system, or some features may be omitted, or not executed. In addition, the shown or discussed mutual coupling or direct coupling or communication connection may be an indirect coupling or communication connection through some interfaces, devices or units, and may also be an electric, mechanical or other form of connection.

The units described as separate parts may or may not be physically separate, and parts displayed as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units can be selected according to actual needs to achieve the purpose of the solution of the embodiment of the present invention.

In addition, functional units in the embodiments of the present invention may be integrated into one processing unit, or each unit may exist alone physically, or two or more units are integrated into one unit. The integrated unit can be realized in a form of hardware, and can also be realized in a form of a software functional unit.

The integrated unit, if implemented in the form of a software functional unit and sold or used as a stand-alone product, may be stored in a computer-readable storage medium. Based on such an understanding, the technical solution of the present invention, in essence, or the part of it that contributes to the prior art, or all or part of the technical solution, can be embodied in the form of a software product that is stored in a storage medium and includes instructions for causing a computer device (which may be a personal computer, a server, or a network device) to execute all or part of the steps of the method according to the embodiments of the present invention. The aforementioned storage medium includes various media capable of storing program code, such as a USB flash drive, a removable hard disk, a ROM, a RAM, a magnetic disk, or an optical disk.

While the invention has been described with reference to specific embodiments, the invention is not limited thereto, and various equivalent modifications and substitutions can be easily made by those skilled in the art within the technical scope of the invention. Therefore, the protection scope of the present invention shall be subject to the protection scope of the claims.

Claims

1. A method for forwarding data, comprising:

if the network device cannot acquire the target application corresponding to the message according to the destination address, acquiring a domain name corresponding to the destination address, and acquiring the target application corresponding to the message according to the domain name corresponding to the destination address; and forwarding the message according to the forwarding rule corresponding to the target application.

2. The method according to claim 1, wherein before the network device obtains the target application corresponding to the packet according to the destination address, according to a forwarding rule corresponding to the target application, the method further includes:

3. The method according to claim 1 or 2, wherein the obtaining the domain name corresponding to the destination address comprises:

the network device obtains a domain name corresponding to the destination address from a domain name database, wherein the domain name database comprises a corresponding relation between a plurality of domain names and a plurality of addresses, one domain name corresponds to one address, one address corresponds to one or a plurality of domain names, and the corresponding relation between the plurality of domain names and the plurality of addresses is obtained through a Domain Name System (DNS) server.

4. The method according to any one of claims 1 to 3, wherein the obtaining of the target application corresponding to the packet according to the domain name corresponding to the destination address includes:

and acquiring the target application corresponding to the message according to the domain name corresponding to the destination address and a target domain name identification model.

5. The method according to claim 4, wherein before the obtaining of the target application corresponding to the packet according to the domain name and the target domain name recognition model, the method further comprises:

6. The method of claim 5, wherein before clustering the plurality of domain names in the target initial domain name data set to obtain the plurality of domain name subsets corresponding to the target initial domain name data set, the method further comprises:

7. The method of claim 5 or 6, further comprising:

8. The method according to claim 7, wherein the obtaining of the target application corresponding to the packet according to the domain name corresponding to the destination address and a target domain name recognition model comprises:

extracting a domain name keyword from the domain name corresponding to the destination address; and, when a query of a domain name keyword set finds a keyword whose similarity to the domain name keyword is less than a third threshold, inputting the domain name corresponding to the destination address into the trained primary domain name recognition model, and acquiring the target application corresponding to the message according to the domain name corresponding to the destination address and the primary domain name recognition model, wherein the domain name keyword set comprises primary domain name information of different applications that share the same primary domain name.

9. The method of claim 8, further comprising:

when a query of the domain name keyword set finds a keyword whose similarity to the domain name keyword is greater than or equal to the third threshold, inputting the domain name corresponding to the destination address into the trained multistage domain name recognition model, and acquiring the target application corresponding to the message according to the domain name corresponding to the destination address and the trained multistage domain name recognition model.
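For illustration only, and forming no part of the claims, the model selection recited in claims 8 and 9 might be sketched as follows. The sketch rests on several assumptions that the source does not specify: the domain name keyword is taken naively as the second-to-last label of the domain name, similarity is measured with `difflib.SequenceMatcher`, the best match in the keyword set is the value compared against the third threshold, and the two recognition models are represented by hypothetical callables `primary_model` and `multistage_model`.

```python
from difflib import SequenceMatcher
from typing import Callable, Iterable


def extract_keyword(domain: str) -> str:
    """Naive keyword (assumption): the second-to-last label of the domain name."""
    labels = domain.lower().rstrip(".").split(".")
    return labels[-2] if len(labels) >= 2 else labels[0]


def max_similarity(keyword: str, keyword_set: Iterable[str]) -> float:
    """Highest similarity between the keyword and any entry of the keyword set."""
    return max(
        (SequenceMatcher(None, keyword, k).ratio() for k in keyword_set),
        default=0.0,
    )


def identify_application(
    domain: str,
    keyword_set: Iterable[str],
    primary_model: Callable[[str], str],
    multistage_model: Callable[[str], str],
    third_threshold: float = 0.8,  # hypothetical value, not given in the source
) -> str:
    """Choose the recognition model from the keyword similarity, then apply it."""
    keyword = extract_keyword(domain)
    if max_similarity(keyword, keyword_set) >= third_threshold:
        # The primary domain is shared by several applications, so the
        # multistage (multi-level) domain name model is used.
        return multistage_model(domain)
    # Otherwise the primary domain name alone is taken to be sufficient.
    return primary_model(domain)
```

Under these assumptions, a domain name whose primary domain appears in the keyword set is passed to the multistage model, and any other domain name is passed to the primary model.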

10. The method according to claim 2, wherein after the obtaining of the target application corresponding to the packet according to the domain name corresponding to the destination address, the method further comprises:

and updating the address database according to the destination address and the target application.
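For illustration only, and forming no part of the claims, the lookup order recited in the method claims above (address database first, domain name fallback second, then an update of the address database) might be sketched as follows. The dictionary layouts, the function names, the `identify_by_domain` callable, and the `"default"` rule are all hypothetical stand-ins and are not taken from the source.

```python
from typing import Callable, Dict, List, Optional


def resolve_application(
    dest_addr: str,
    address_db: Dict[str, str],        # address -> application
    domain_db: Dict[str, List[str]],   # address -> one or more domain names
    identify_by_domain: Callable[[str], Optional[str]],
) -> Optional[str]:
    """Return the application for a destination address, or None if unknown."""
    # Step 1: try the address database directly.
    app = address_db.get(dest_addr)
    if app is not None:
        return app
    # Step 2: fall back to the domain names recorded for this address.
    for domain in domain_db.get(dest_addr, []):
        app = identify_by_domain(domain)
        if app is not None:
            # Step 3: update the address database so that later messages
            # to the same address are resolved in step 1.
            address_db[dest_addr] = app
            return app
    return None


def select_rule(
    app: Optional[str],
    forwarding_rules: Dict[str, str],
    default_rule: str = "default",     # hypothetical fallback rule
) -> str:
    """Pick the forwarding rule configured for the identified application."""
    if app is None:
        return default_rule
    return forwarding_rules.get(app, default_rule)
```

In this sketch, a message whose destination address is already in `address_db` never triggers the domain name lookup, which reflects the motivation for the address database update in claim 10.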

11. A message forwarding system, characterized in that the system comprises a training device and a forwarding device, wherein,

12. The system of claim 11, wherein the forwarding device is further configured to:

and identifying the target application corresponding to the destination address according to the destination address and an address database, wherein the address database comprises corresponding relations between a plurality of addresses and a plurality of applications, each application corresponds to one or a plurality of addresses, and one address corresponds to one application.

13. The system according to claim 11 or 12, wherein the forwarding device is specifically configured to:

and acquiring a domain name corresponding to the destination address from a domain name database, wherein the domain name database comprises a corresponding relation between a plurality of domain names and a plurality of addresses, one domain name corresponds to one address, one address corresponds to one or more domain names, and the corresponding relation between the plurality of domain names and the plurality of addresses is acquired through a Domain Name System (DNS) server.

14. The system of any one of claims 11 to 13, wherein the training device is specifically configured to:

15. The system of claim 14, wherein the training device is specifically configured to:

16. The system according to claim 14 or 15, wherein the training device is specifically configured to:

17. The system of claim 16, wherein the forwarding device is specifically configured to:

18. The system of claim 17, wherein the forwarding device is further configured to:

19. The system of claim 12, wherein the forwarding device is further configured to:

20. A data forwarding apparatus, comprising:

a receiving unit, configured to receive a packet, where the packet includes a destination address;

a processing unit, configured to: acquire a target application corresponding to the message according to the destination address; or, when the target application corresponding to the message cannot be acquired according to the destination address, acquire a domain name corresponding to the destination address, and acquire the target application corresponding to the message according to the domain name;

21. The apparatus according to claim 20, wherein the processing unit is specifically configured to:

22. The apparatus according to claim 20 or 21, wherein the processing unit is specifically configured to:

23. The apparatus according to any one of claims 20 to 22, wherein the processing unit is specifically configured to:

24. The apparatus of claim 23, further comprising:

25. The apparatus of claim 24, wherein the training unit is further configured to:

26. The apparatus according to claim 24 or 25, wherein the training unit is specifically configured to:

27. The apparatus according to claim 26, wherein the processing unit is specifically configured to:

28. The apparatus of claim 27, wherein the processing unit is specifically configured to:

29. The apparatus of claim 21, wherein the processing unit is further configured to:

30. A network device comprising a processor and a memory; the memory is for storing instructions, the processor is for executing the instructions, and the method of any one of claims 1 to 10 is performed when the processor executes the instructions.

31. A computer-readable storage medium having stored thereon instructions for performing the method of any one of claims 1 to 10 when the instructions are run on a device.