Disclosure of Invention
The invention aims to provide an API encrypted traffic collection and labeling method based on a man-in-the-middle proxy, in order to solve the technical problem that conventional traffic collection and labeling methods either cannot provide the specific parameter information of encrypted API traffic or provide it only incompletely, so that encrypted traffic labeling is inaccurate and the range of application is narrow.
In order to achieve the above purpose, the invention adopts the following technical scheme:
An API encrypted traffic collection and labeling method based on a man-in-the-middle proxy, applied in a system comprising a cloud server and an application terminal on which a man-in-the-middle proxy client is installed, the method comprising the following steps:
The man-in-the-middle proxy client detects and intercepts a network access request (carrying the real parameters of a user) sent by an application program of the application terminal to an application server (such as a web server), and parses the network access request to generate a corresponding API interface document, wherein the API interface document comprises the network request links and request parameters of the different APIs of the target application server;
The cloud server generates corresponding script code for each API interface document, randomly selects parameters from an interface parameter dictionary to replace the parameters in each API interface document, generates a simulated network access request, and sends it to the application server; that is, the cloud server generates the communication traffic of the corresponding API according to the API interface document and the parameter dictionary. While sending the simulated network access requests to the application server, the cloud server collects the encrypted communication traffic and the communication private key generated for the same API interface, generates a traffic log file used for traffic labeling, and records the collected communication private key for decrypting the encrypted communication traffic.
Further, the request parameters in the API interface document comprise an encryption algorithm and key information used by each handshake.
Further, the application program of the application terminal contains user preference settings and cookie caching (data stored on the user's local terminal for session tracking in order to discern the user's identity) and is able to access different web applications.
Further, the man-in-the-middle proxy client parsing the network access request, generating the corresponding API interface document, and storing the API interface document specifically comprises:
Screening and filtering the traffic protocol types in the network access request, and reserving http and https protocol traffic;
Analyzing the network data packet of the network access request according to the corresponding protocol format, extracting url links and parameter fields in the network data packet, and storing the url links and the parameter fields in an xml file format.
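The two steps above can be illustrated with a minimal Python sketch. It assumes the request has already been recovered as a method and a full URL. The element names (api, url, params, param) and the function name build_api_document are hypothetical, since the original does not specify the schema of the xml file.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlsplit, parse_qsl


def build_api_document(method, url):
    """Build an XML element describing one intercepted request (hypothetical schema)."""
    parts = urlsplit(url)
    # Protocol filtering: only http and https traffic is retained.
    if parts.scheme not in ("http", "https"):
        raise ValueError(f"unsupported protocol: {parts.scheme!r}")
    api = ET.Element("api", {"method": method})
    # The link is stored without its query string so that it identifies the interface.
    ET.SubElement(api, "url").text = f"{parts.scheme}://{parts.netloc}{parts.path}"
    params = ET.SubElement(api, "params")
    for name, value in parse_qsl(parts.query, keep_blank_values=True):
        ET.SubElement(params, "param", {"name": name}).text = value
    return api


doc = build_api_document("GET", "https://example.com/search?q=shoes&page=2")
xml_text = ET.tostring(doc, encoding="unicode")
```

Storing the link and the parameters separately is a design choice of this sketch; it keeps the link stable when only parameter values are later replaced.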
Further, the man-in-the-middle proxy client obtains the process corresponding to the current application program by reading the process ID (PID) in the operating system of the application terminal, and obtains the traffic corresponding to the current application program through the source port number of the process. The invention maps the PID to a specific system application through the package management application programming interface of the operating system, thereby obtaining the mapping between traffic and application. That is, the labeling is a fine-grained correspondence from application, process, and source port to API traffic.
Further, the man-in-the-middle agent client comprises an API request positioning and intercepting module, an API request forwarding module and an API request parameter recording module;
The API request positioning and intercepting module is used for detecting and intercepting a network access request sent by an application program to the application server, parsing the network access request, and generating a corresponding API interface document;
The API request forwarding module is used for generating a simulated network access request, simulating the application program to establish connection and communication with the application server, and sending the content returned by the application server to the corresponding application program;
the API request parameter recording module is used for recording the communication log file and the API interface document.
Further, the cloud server is deployed at a physical network location from which a stable connection with the application server can be established, and comprises a packet sender and a traffic collector;
The packet sender is used for generating corresponding script code for each API interface document, randomly selecting parameters from the interface parameter dictionary to replace the parameters in each API interface document, generating a simulated network access request, and sending the simulated network access request to the application server;
the traffic collector monitors and collects encrypted communication traffic and a communication private key generated by the same API interface based on a preset traffic capture tool (such as tcpdump).
Further, the packet sender of the cloud server generating a simulated network access request and sending it to the application server specifically comprises:
Generating the communication traffic of the corresponding API according to the API interface document and a preset parameter dictionary;
Parsing the API interface document, reading the URL links and request parameters in it, loading the parameter dictionary on the cloud server, replacing the relevant request parameters, generating the corresponding Python code, and executing the Python code to generate a simulated network access request and send it to the application server.
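The parameter replacement described above might look like the following Python sketch. The function name build_simulated_request, the dictionary contents, and the example URL are assumptions made for illustration; the request object is only constructed here, and sending it (for example with urllib.request.urlopen) is left out.

```python
import random
from urllib.parse import urlencode
from urllib.request import Request


def build_simulated_request(url, params, headers, dictionary, rng=random):
    """Replace every field that has dictionary candidates, then build the request."""
    def pick(name, original):
        candidates = dictionary.get(name)
        # Fields without candidates keep the value recorded from the real user.
        return rng.choice(candidates) if candidates else original

    new_params = {name: pick(name, value) for name, value in params.items()}
    new_headers = {name: pick(name, value) for name, value in headers.items()}
    full_url = f"{url}?{urlencode(new_params)}" if new_params else url
    return Request(full_url, headers=new_headers, method="GET")


# Hypothetical parameter dictionary: candidate values per parameter or header name.
PARAM_DICTIONARY = {
    "page": ["1", "2", "3"],
    "User-Agent": ["agent-a/1.0", "agent-b/2.0"],
}

request = build_simulated_request(
    "https://example.com/search",
    {"q": "shoes", "page": "1"},
    {"User-Agent": "agent-a/1.0"},
    PARAM_DICTIONARY,
    random.Random(0),
)
```

Only the parameter values change; the link itself stays the same, which is what later allows labeling by link comparison.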
Further, the traffic collector stores the collected encrypted traffic according to a designated traffic storage format.
Further, the traffic storage format is one in which each API request of a web application corresponds to one pcap traffic record.
Further, the traffic collector labels the collected encrypted communication traffic by comparing url links against the API log document uploaded by the man-in-the-middle proxy client. Because only the parameters are modified during simulation while the links remain unchanged, only the url links need to be compared.
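Labeling by link comparison can be sketched in Python as follows. The log layout (application name mapped to a set of links) and the names normalize and label_requests are assumptions for illustration; the query string is dropped before comparison because only the parameters differ between the recorded and the simulated requests.

```python
from urllib.parse import urlsplit


def normalize(url):
    """Reduce a link to scheme, host, and path so that parameters are ignored."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc.lower()}{parts.path}"


def label_requests(log, replayed_urls):
    """Attach an application label to each replayed link, or 'unknown' if unmatched."""
    index = {normalize(u): app for app, urls in log.items() for u in urls}
    return [(u, index.get(normalize(u), "unknown")) for u in replayed_urls]


# Hypothetical log uploaded by the proxy client.
LOG = {
    "application 1": {"https://request1.com/api/items"},
    "application 2": {"https://request2.com/login"},
}

labels = label_requests(LOG, [
    "https://request1.com/api/items?id=42",
    "https://request2.com/login?lang=en",
    "https://other.example/ping",
])
```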
The technical scheme provided by the invention has at least the following beneficial effects:
(1) The invention can collect a complete, decryptable traffic data set together with the plaintext of the API requests. Through the man-in-the-middle proxy, the private key of the encrypted communication between the application terminal (user side) and the application server can be obtained; this private key is the private key of the user side, with which the response messages of the server can be decrypted. The man-in-the-middle proxy client also monitors and intercepts the plaintext messages from the application. Holding both the plaintext messages and the encrypted messages facilitates network traffic security research;
(2) The invention can customize simulated API requests with different parameters. An API interface document can be generated for the client application and the server; it describes the different API interfaces of the server, including the interface addresses and request parameters. By continuously changing the request parameters, the encrypted traffic of the same interface under different devices and network environments is simulated and then captured with a traffic collection tool, the parameter changes being determined by a pre-stored parameter dictionary.
Detailed Description
For the purpose of making the objects, technical solutions and advantages of the present invention more apparent, the embodiments of the present invention will be described in further detail with reference to the accompanying drawings.
The embodiment of the invention discloses an API encrypted traffic collection and labeling method based on a man-in-the-middle proxy. The method uses a man-in-the-middle proxy client to detect all network access requests of the applications on an application terminal (such as a PC computer), parses the http/https traffic of interest, and extracts the url links, parameters, and other information in it to form an API interface document. It uses the mapping among the target application process, the process number (PID) of the target application, and the source port of the API traffic to match TLS (Transport Layer Security) encrypted network traffic to a specific application, thereby labeling the API encrypted traffic and generating a corresponding log document. Based on the client-server mode, the invention improves the efficiency and scalability of traffic collection and labeling: it supports a distributed structure, collects the encrypted API traffic of applications together with the corresponding keys, and obtains application-level and API-level traffic classification labels on the basis of traffic collection. The invention can automatically simulate a variety of user API request parameters and directionally generate and collect API request traffic. Storing the keys of the encrypted traffic helps to explore the factors that influence encrypted traffic, which is still a gap in the field of encrypted traffic analysis.
As shown in fig. 1, the API encrypted traffic collection and labeling method based on a man-in-the-middle proxy according to the embodiment of the present invention is applied in a system comprising a cloud server and a PC computer on which a man-in-the-middle proxy client is installed; various applications (e.g., a browser) and the forged certificate substituted by the man-in-the-middle proxy client are installed on the PC computer. The man-in-the-middle proxy client relies on session acquisition and certificate replacement to implement its functions. Session acquisition means that the proxy client simulates the application software to establish connection and communication with the remote web server, and at the same time simulates the web server to forward messages to the PC computer. The man-in-the-middle proxy client thus holds both the session key and certificate of the (PC computer - man-in-the-middle proxy client) connection and the session key and certificate of the (man-in-the-middle proxy client - web server) connection, so that the encrypted traffic of both parties can be decrypted and parsed, and the url links and parameters of the http/https traffic of interest can be extracted to form an API interface document. The man-in-the-middle proxy client aims to make the PC computer believe that it is connected to the web server, and establishes the connections through two sets of protocol certificates. However, if the signature of the certificate presented to the client does not match or comes from an untrusted party, a secure client will simply disconnect and refuse to continue.
Certificate replacement is used to solve the above problem: a complete certificate is generated manually and added to the trusted root certificate directory so that the PC computer trusts the man-in-the-middle proxy server. A certificate is the mechanism used for verifying identity in the TLS protocol on which HTTPS relies; it is a digitally signed file that includes the public key of the certificate owner and the certificate information of the third party that issued it. Certificates are classified into two types, self-signed certificates and CA certificates. Typically, self-signed certificates cannot be used for identity authentication. The basic principle of CA-certificate-based identity authentication between an application client and an application server during the key negotiation of the TLS protocol is as follows. The application client must trust the CA that issued the certificate of the application server (for example, the CA certificate is in the trusted certificate list of the operating system, or the user adds the CA certificate, which contains the public key of the CA, to the trusted list by "installing a root certificate" or a similar operation). The CA signs the original certificate of the application server with its private key to generate the final certificate. After obtaining the final certificate, the application client verifies the signature with the public key of the CA. Taking RSA as an example, if the verification with the public key of the CA succeeds, the certificate was indeed signed with the private key of the CA, and the application server may be considered trusted.
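The RSA example can be made concrete with a toy Python sketch of signing and verification. It uses textbook RSA with a deliberately tiny, hypothetical CA key (p = 61, q = 53) and no padding, so it only illustrates the principle; real certificates use keys of 2048 bits or more together with a padding scheme, and the certificate body below is an invented placeholder.

```python
import hashlib

# Toy CA key pair (textbook RSA, p = 61, q = 53). Far too small for real use.
N, E, D = 3233, 17, 2753


def digest(cert_body):
    """Hash the certificate body and reduce it into the toy modulus."""
    return int.from_bytes(hashlib.sha256(cert_body).digest(), "big") % N


def ca_sign(cert_body):
    """CA side: sign the digest with the private exponent D."""
    return pow(digest(cert_body), D, N)


def client_verify(cert_body, signature):
    """Client side: recover the digest with the public exponent E and compare."""
    return pow(signature, E, N) == digest(cert_body)


SERVER_CERT = b"CN=app.example.com, public key = (placeholder)"
signature = ca_sign(SERVER_CERT)
trusted = client_verify(SERVER_CERT, signature)
```

A man-in-the-middle proxy passes this check only because its own CA certificate was added to the trusted list, so that the client verifies with the proxy CA's public key.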
In addition, the invention is organized in a distributed client-cloud server mode: it monitors the network traffic of the applications on the PC device, parses the http/https traffic of interest, extracts API interface documents, and uploads them to the server for traffic replay, thereby obtaining traffic classification label data at application and API interface granularity on the basis of traffic collection.
As a possible implementation manner, the API encrypted traffic collection and labeling method based on a man-in-the-middle proxy according to the embodiment of the present invention is applied in a system comprising a PC computer 1 and a cloud server 3. A browser 5 and a man-in-the-middle proxy client 2 are installed on the PC computer 1. The browser 5 contains user preference settings and a cookie cache and can access different web applications. The man-in-the-middle proxy client 2 comprises an API request positioning and intercepting module, an API request forwarding module, and an API request parameter recording module (i.e., the API request parameter recording device in fig. 3). As shown in fig. 3, the API request positioning and intercepting module intercepts the requests sent by the browser to the web server (the detected network requests use the parameters of a real user); the API request forwarding module then simulates the browser in communicating with the web server and returns the content received from the web server to the browser, while the API request parameter recording device records the request parameters. Specifically, the man-in-the-middle proxy client 2 replaces the certificate on the PC computer 1 side with the man-in-the-middle CA certificate; the API request positioning and intercepting module intercepts all network requests with which the browser installed on the PC computer 1 accesses web applications; the API request forwarding module communicates with the web server in place of the PC computer 1 and forwards the network access requests of the PC computer 1; and the API request parameter recording device records the encryption algorithm and key information used in each handshake, so that the man-in-the-middle proxy client 2 can decrypt the traffic and obtain the complete request data and request parameters of the application. The cloud server 3 is deployed at a physical network location (shown in fig. 3) from which a stable connection with the web server can be established. The cloud server 3 comprises a packet sender 6 and a traffic collector 4. The packet sender 6 generates simulated user traffic based on the real user request data, the parameter dictionary, and the time delay recorded by the man-in-the-middle proxy client 2. The traffic collector 4 uses a traffic capture tool such as tcpdump to collect the traffic generated by the packet sender 6 and at the same time generates log files so that the collected traffic can be labeled conveniently; the session keys are also recorded so that the collected encrypted traffic can be decrypted and analyzed afterwards.
In the API encrypted traffic collection and labeling method based on a man-in-the-middle proxy, on the one hand, the man-in-the-middle proxy client 2 detects all network access requests of user applications, parses the http and https traffic of interest, extracts the url links and parameters of the network access requests, and generates the corresponding interface documents, so that parameters such as User-Agent can be adjusted and the traffic generated by different network devices in different network environments can be simulated. While detecting http/https traffic, the man-in-the-middle proxy client 2 also generates log files that record the application names and the url links requested by each application, in a format such as {'application 1': {'https://request1.com'}, 'application 2': {'https://request2.com', 'https://request3.com'}}, so that the subsequent traffic can be simulated rapidly. On the other hand, the client-server design framework greatly improves the efficiency and scalability of collection: the API interface documents are extracted on the PC computer side, and the traffic is then simulated from these interface documents.
As a possible implementation manner, as shown in fig. 2, the specific implementation steps of the API encrypted traffic collection and labeling method based on the broker agent provided in the embodiment of the present invention include:
Step S1, the man-in-the-middle proxy client 2 is used to monitor the network access requests of the applications installed on the PC computer 1.
In order to collect clean application API traffic, the application traffic must be forwarded through a proxy; that is, all traffic of the target application process is proxied to a designated port, and the man-in-the-middle proxy client 2 listens on this port in real time to parse and forward the traffic.
The proxy forwarding method for application traffic varies with the application type. It is relatively simple for common web applications: because browsers generally implement http/https proxy functionality, all traffic of the browser can be proxied to a specified IP address and port. The forwarding of API requests is completed by using a proxy plug-in to construct a routing table, and the traffic of different web applications is forwarded to different ports through configured rules, which is clearly beneficial for traffic collection and labeling.
For general user applications and system applications, a traffic proxy tool can be used to configure corresponding rules so that the traffic of the target application is forwarded to the listening port of the man-in-the-middle proxy client 2.
In order for the man-in-the-middle proxy client 2 to acquire the session between the target application and the target application server, the target application needs to trust and use the certificate of the man-in-the-middle proxy client 2. In this way, both the session key between the target application and the man-in-the-middle proxy client 2 and the session key between the man-in-the-middle proxy client 2 and the target application server (web server) are held, and the encrypted traffic can be decrypted.
Step S2, the man-in-the-middle proxy client 2 exports and stores the API interface document generated by parsing the network access requests and the log information produced by comparing PIDs.
The man-in-the-middle proxy client 2 parses all network request data packets according to the TCP/IP protocol stack format and first determines the protocol type of each data packet; since this embodiment mainly collects http/https traffic, traffic of other protocol types is ignored. For an http/https data packet, the man-in-the-middle proxy client 2 determines whether the data packet is encrypted with a TLS/SSL cipher suite. If it is not encrypted, the application layer of the data packet is parsed according to the http protocol format to obtain the required URL link, request method, request header fields, and other contents. If it is encrypted, the man-in-the-middle proxy client 2, having obtained the communication key through certificate replacement and session acquisition, uses the communication key to decrypt the encrypted application-layer payload and then parses the decrypted application-layer plaintext in the same way as http.
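The parsing of the application-layer plaintext can be sketched in Python as follows. The sketch assumes the payload is already available as plaintext bytes (either unencrypted http, or https after decryption, which is outside its scope); the function names and the returned field names are hypothetical.

```python
def parse_http_request(raw):
    """Split a plaintext HTTP/1.x request into request line, headers, and body."""
    head, _, body = raw.partition(b"\r\n\r\n")
    lines = head.decode("iso-8859-1").split("\r\n")
    method, target, version = lines[0].split(" ", 2)
    headers = {}
    for line in lines[1:]:
        name, _, value = line.partition(":")
        headers[name.strip().lower()] = value.strip()
    return {"method": method, "target": target, "version": version,
            "headers": headers, "body": body}


def extract_api_fields(raw, encrypted):
    """Apply the same parsing to both cases; only the scheme of the link differs."""
    request = parse_http_request(raw)
    scheme = "https" if encrypted else "http"
    host = request["headers"].get("host", "")
    request["url"] = f"{scheme}://{host}{request['target']}"
    return request


SAMPLE = (b"GET /api/items?id=7 HTTP/1.1\r\n"
          b"Host: example.com\r\n"
          b"User-Agent: demo-client\r\n\r\n")
fields = extract_api_fields(SAMPLE, encrypted=True)
```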
From the URL links, request header fields, and other contents obtained by parsing the http/https data packets, the man-in-the-middle proxy client 2 forms the corresponding API interface documents, which are stored in the format {URL link of the API request, parameter form, traffic network delay} and record all information of the data packet request. Based on this record format, the packet sender randomly varies the parameter form and the traffic network delay, which provides the ability to generate API requests. The parameter form comprises the request method, the request headers, and the request data. The request method covers the request methods of the http protocol, such as GET, POST, and HEAD; the request headers comprise header fields such as Cookie, User-Agent, and Host; and the request data comprise the important data that need to be encrypted and transmitted by POST.
The man-in-the-middle proxy client 2 generates a specific log file by comparing the mapping between the application process PID and the source port of the data packet. For example, the command 'netstat -aon | findstr "<source port>"' is used to find the PID of the process that occupies a given network port, and the command 'tasklist | findstr "<PID>"' is used to find the application process that owns this PID. A one-to-one mapping between the API interface traffic data packets and the application process PID is thereby formed (the mapping is stored only locally on the PC computer 1), which enables accurate traffic collection.
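The port-to-process lookup can be sketched in Python by parsing the text output of the two Windows commands. The sample outputs below are invented for illustration, and the sketch assumes tasklist is run as 'tasklist /FO CSV /NH' rather than piped through findstr, because CSV output is easier to parse; the function names are hypothetical.

```python
import csv
import io


def ports_to_pids(netstat_output):
    """Map (protocol, local port) to the owning PID from `netstat -aon` text."""
    mapping = {}
    for line in netstat_output.splitlines():
        fields = line.split()
        # Data rows start with TCP or UDP; UDP rows have no State column.
        if len(fields) < 4 or fields[0] not in ("TCP", "UDP"):
            continue
        port = int(fields[1].rsplit(":", 1)[1])
        mapping[(fields[0], port)] = int(fields[-1])
    return mapping


def pids_to_names(tasklist_csv):
    """Map PID to image name from `tasklist /FO CSV /NH` text."""
    names = {}
    for row in csv.reader(io.StringIO(tasklist_csv)):
        if len(row) >= 2 and row[1].isdigit():
            names[int(row[1])] = row[0]
    return names


NETSTAT_SAMPLE = """
  Proto  Local Address          Foreign Address        State           PID
  TCP    192.168.1.5:52344      93.184.216.34:443      ESTABLISHED     4321
  UDP    0.0.0.0:5353           *:*                                    1234
"""
TASKLIST_SAMPLE = ('"chrome.exe","4321","Console","1","150,000 K"\n'
                   '"svchost.exe","1234","Services","0","12,000 K"\n')

pid = ports_to_pids(NETSTAT_SAMPLE)[("TCP", 52344)]
application = pids_to_names(TASKLIST_SAMPLE)[pid]
```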
Step S3, the man-in-the-middle agent client 2 uploads the saved API interface document and log file to the cloud server 3.
The man-in-the-middle proxy client 2 writes the relation between API interfaces and applications acquired in real time into a log file in the form "application name - flow five-tuple (source address, destination address, source port, destination port, protocol type) list - API interface information", and uploads it to the cloud server 3 together with the API interface documents generated during the process.
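One possible shape of such a log record is sketched below in Python. The original only names the three parts of the record, so the class names, field names, and the JSON line encoding are assumptions made for illustration.

```python
import json
from dataclasses import dataclass, asdict
from typing import List


@dataclass(frozen=True)
class FiveTuple:
    src_addr: str
    dst_addr: str
    src_port: int
    dst_port: int
    protocol: str


@dataclass
class LogEntry:
    application: str        # application name
    flows: List[FiveTuple]  # five-tuple list of the flows carrying the API traffic
    api: str                # API interface information, here simply the link

    def to_line(self):
        """Serialize the entry as one JSON line of the log file."""
        return json.dumps(asdict(self), sort_keys=True)


entry = LogEntry(
    application="application 1",
    flows=[FiveTuple("192.168.1.5", "93.184.216.34", 52344, 443, "TCP")],
    api="https://request1.com/api/items",
)
line = entry.to_line()
```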
Step S4, the cloud server 3 receives the plurality of API interface documents sent by the man-in-the-middle proxy client 2, stores all the API records, and integrates the parameter forms and network delays in the records to obtain a parameter dictionary and a network delay variation range. It generates corresponding Python script code for each API interface document, generates new API requests by combining the stored API records with the parameter dictionary and the delay variation range, executes the Python code to send the corresponding network requests, and then captures the encrypted communication traffic generated for the same interface by using tcpdump, together with the corresponding communication private key.
That is, in step S4, the cloud server 3 receives the API interface documents and log files sent by a plurality of man-in-the-middle proxy clients 2, and the packet sender 6 on the cloud server 3 generates Python script code from the information provided by the API interface documents. Since an API interface document contains all the information needed for the network request, the Python packet-sending script can be generated automatically with the Postman tool. The packet sender 6 then loads a parameter dictionary library and a delay factor library, continuously replaces the relevant parameter information of the corresponding API, and executes the Python script to send out different data packets. The parameter dictionary mainly comprises fields such as the GET parameters in the url link of the API and the User-Agent in the http/https headers, which are used to simulate different users and different devices initiating network requests in different environments. The TLS/SSL certificates used by the Python script of the packet sender 6 are controllable and are replaced by the present invention, so that the present invention can decrypt the encrypted traffic collected by the traffic collector, which facilitates the analysis of the encrypted traffic.
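The integration of stored API records into a parameter dictionary and a delay variation range, as described for step S4, might be sketched in Python as follows. The record layout (url, params, delay_ms) and the function name integrate_records are assumptions; the original specifies only that parameter forms and network delays are integrated.

```python
import random
from collections import defaultdict


def integrate_records(records):
    """Merge observed parameter values and delays across all stored API records."""
    values = defaultdict(set)
    delays = []
    for record in records:
        for name, value in record["params"].items():
            values[name].add(value)
        delays.append(record["delay_ms"])
    dictionary = {name: sorted(seen) for name, seen in values.items()}
    delay_range = (min(delays), max(delays)) if delays else None
    return dictionary, delay_range


# Hypothetical records uploaded by proxy clients.
RECORDS = [
    {"url": "https://example.com/search",
     "params": {"q": "shoes", "page": "1"}, "delay_ms": 40},
    {"url": "https://example.com/search",
     "params": {"q": "hats", "page": "1"}, "delay_ms": 95},
]

dictionary, delay_range = integrate_records(RECORDS)
# A delay for the next simulated request, drawn from the observed range.
next_delay_ms = random.uniform(*delay_range)
```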
The traffic collector 4 of the cloud server 3 is mainly implemented with tcpdump. It collects the network requests sent by the Python script to generate an original traffic (pcap) file, compares the original traffic file with the log file corresponding to the API interface document, and, through statistics, obtains the communication traffic of the corresponding API interface of the target application; at the same time it obtains the private key in the certificate used by the packet sender 6. The files of the different API interfaces of the same application are stored in the same folder, which contains the pcap files collected from the traffic of the different API interfaces and the private keys corresponding to these pcap files.
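The storage layout and the capture invocation can be sketched in Python as follows. The folder and file naming (one folder per application, one .pcap and one .key file per API interface), the interface name eth0, and the port 443 filter are assumptions; the sketch only builds the tcpdump argument list and does not start a capture.

```python
from pathlib import PurePosixPath


def capture_paths(root, application, api_name):
    """Return the pcap path and key path for one API interface of one application."""
    folder = PurePosixPath(root) / application
    return folder / f"{api_name}.pcap", folder / f"{api_name}.key"


def tcpdump_command(interface, server_host, pcap_path):
    """Build a tcpdump argument list that writes HTTPS traffic with one host to a file."""
    return ["tcpdump", "-i", interface, "-w", str(pcap_path),
            "host", server_host, "and", "tcp", "port", "443"]


pcap_path, key_path = capture_paths("captures", "application_1", "search")
command = tcpdump_command("eth0", "example.com", pcap_path)
```

The argument list could be passed to subprocess.Popen before the packet-sending script runs and terminated afterwards, so that each pcap file corresponds to one API interface.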
The method and the system make it possible to analyze the encrypted traffic of the API interfaces belonging to a certain type of application and to further mine traffic feature information and behavior patterns of the application. In particular, the collected encrypted traffic can be decrypted and parsed to analyze the factors that influence encrypted traffic, mine potential user behavior pattern features, and help judge whether the behavior of a certain type of application leaks user privacy.
It should be noted that the above-mentioned embodiments are merely for illustrating the technical solution of the present invention, and not for limiting the same, and although the present invention has been described in detail with reference to the above-mentioned embodiments, it should be understood by those skilled in the art that the technical solution described in the above-mentioned embodiments may be modified or some technical features may be equivalently replaced, and these modifications or substitutions do not make the essence of the corresponding technical solution deviate from the spirit and scope of the technical solution of the embodiments of the present invention.
What has been described above is merely some embodiments of the present invention. It will be apparent to those skilled in the art that various modifications and improvements can be made without departing from the spirit of the invention.