
CN113946735A - Method and system for crawling and restoring WEB site by traffic recording - Google Patents

Method and system for crawling and restoring WEB site by traffic recording

Info

Publication number
CN113946735A
CN113946735A CN202111167366.4A CN202111167366A CN113946735A CN 113946735 A CN113946735 A CN 113946735A CN 202111167366 A CN202111167366 A CN 202111167366A CN 113946735 A CN113946735 A CN 113946735A
Authority
CN
China
Prior art keywords
website
recording
restoring
data
crawling
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202111167366.4A
Other languages
Chinese (zh)
Inventor
林旭滨
刘梓乐
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Guangzhou Fanfanfang Information Security Technology Co ltd
Original Assignee
Guangzhou Fanfanfang Information Security Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Guangzhou Fanfanfang Information Security Technology Co ltd
Priority to CN202111167366.4A
Publication of CN113946735A

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/951 Indexing; Web crawling techniques
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/903 Querying
    • G06F16/90335 Query processing
    • G06F16/90344 Query processing by using string matching techniques
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/955 Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/957 Browsing optimisation, e.g. caching or content distillation
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F16/00 Information retrieval; Database structures therefor; File system structures therefor
    • G06F16/90 Details of database functions independent of the retrieved data types
    • G06F16/95 Retrieval from the web
    • G06F16/958 Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking

Landscapes

  • Engineering & Computer Science (AREA)
  • Databases & Information Systems (AREA)
  • Theoretical Computer Science (AREA)
  • Data Mining & Analysis (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computational Linguistics (AREA)
  • Information Transfer Between Computers (AREA)

Abstract

The invention discloses a method and a system for crawling and restoring a WEB site by traffic recording. The system comprises a website traffic recording system, a website data structure generated by recording, a website restoring system, and a restored website application. The method comprises the following steps: S1: entering the URL of the website to be crawled into the website traffic recording system; S2: the website traffic recording system starts recording; S3: in a browser, accessing the recorded website through the reverse proxy of the website recording system as intended, and marking weights; S4: importing the website data generated by the website recording system, which has the website data structure, into the website restoring system; S5: the website restoring system generates the restored website application; S6: running the restored website application and accessing the restored website in a browser. The method crawls, restores, and displays the data and files of a WEB site, and has the advantages of applying to all website types, achieving a high degree of restoration, and being safe and free of vulnerabilities.

Description

Method and system for crawling and restoring WEB site by traffic recording
Technical Field
The invention relates to the technical field of network design, in particular to a method and a system for crawling and restoring a WEB site by recording traffic.
Background Art
In conventional WEB site technology, switching between pages loads an entirely new page directly from the server. With the development of WEB technology, however, modern WEB sites increasingly adopt the Single Page WEB Application (SPA) development model, which interacts with users by dynamically rewriting parts of the current page. As a result, existing website crawling technologies, which search for, parse, and then re-request the links in a WEB site, can crawl only traditional WEB sites and not modern SPA-type websites.
Furthermore, in business practice we need to restore not only the display pages of a website but also its specific interactions, including but not limited to login, registration, search, and verification code change interactions. The interaction data includes but is not limited to verification codes, input-error prompts, and the content returned for correct input. Such interactions usually involve querying, checking, and processing data in the website back end, yet existing website crawling technologies have no function for recording state and data, so they fall clearly short in degree of restoration.
Disclosure of Invention
In view of the above deficiencies of the prior art, the present invention aims to provide a method and a system for crawling and restoring WEB sites that cover all WEB site types and achieve a high degree of restoration.
To solve the above technical problem, the invention adopts the following technical solution:
A method and a system for crawling and restoring a WEB site by traffic recording, comprising a website traffic recording system, a website data structure generated by recording, a website restoring system, and a restored website application, and specifically comprising the following steps:
S1: entering the URL of the website to be crawled into the website traffic recording system;
S2: the website traffic recording system starts recording;
S3: in a browser, accessing the recorded website through the reverse proxy of the website recording system as intended, and marking weights;
S4: once every page of the website that needs to be crawled has been clicked through and weight marking is finished, importing the website data generated by the website recording system into the website restoring system;
S5: the website restoring system generates the restored website application;
S6: running the restored website application and accessing the restored website in a browser.
Further, in step S2, the website data of the website to be crawled is generated by a recording module, which includes a reverse proxy module, a route analysis module, a weight mark identification module, and a response data processing module.
The reverse proxy module performs reverse proxying and data processing for the website to be crawled, including:
a. reverse-proxying the website to be crawled to a local port;
b. normalizing the protocol traffic (HTTP/HTTPS) of the website to be crawled into unencrypted HTTP traffic;
c. filtering out data about the original website that is invalid after restoration, such as the original IP, domain name, and port.
The route analysis module records the HTTP request method, route URI, and HTTP request data used to access the recorded website.
The weight mark identification module records the custom weight of the current request when the recorded website is accessed, where the current request is uniquely identified by its URI and HTTP request data.
The response data processing module records the response status code, response data type, response data ID, and response data returned when the recorded website is accessed, and combines them with the data from the route analysis module and the weight mark identification module into the website request data of the current request.
The website data generated by the recording module identifies and records the structure and logic of the current website. It is a double-retrieval data structure whose first key is the HTTP key information and whose second key is the response data ID. The HTTP key information comprises the HTTP request method and the HTTP request URI. The response data ID is computed from the HTTP request URI combined with the HTTP request parameters. The retrieved data includes the HTTP request method, the URI, the HTTP response status code, the HTTP response data type, and multiple combined entries of response data ID and weight mark.
Further, in step S3, a custom request header is added to the HTTP request headers to identify the weight value of the response result returned for the current URI and request parameters.
Further, the website restoring system in step S5 includes a route analysis module, a response data matching module, and an application generation module.
The route analysis module parses the request method, URI, response status code, and response data type from the website data and combines them into valid routing code for the HTTP website application by string matching and splicing.
The response data matching module parses the groups of response data IDs and weight marks from the website data and combines them into valid response content code for the HTTP website application by string matching and splicing. Response data matching is either exact or sequential. In exact matching, the current request data always maps to the same response data ID, and the specified response data is returned. In sequential matching, the response data recorded under the request data ID is returned in order, according to the number of requests made by the current requesting IP and the weight marks. In sequential matching, a response with a smaller weight mark is returned fewer times, and a response with a larger weight mark is returned more times.
The application generation module combines the existing HTTP website application template code, the routing code generated by the route analysis module, and the response content code generated by the response data matching module into valid HTTP website application code, and then compiles it into the WEB application of the restored website.
The WEB application of the restored website is then run, and the restored website is accessed in a browser.
The beneficial effects of the invention are as follows: the invention can crawl websites of all existing website types and restore them with a high degree of restoration, including high-interaction restoration such as login, registration, search, and verification code change interactions. In the business practice of deception defense, specific websites frequently need to be crawled and restored to support further deception defense. The method and system restore websites with an extremely high degree of restoration across a comprehensive range of uses, and because response data is returned in a fixed, ordered manner, the restored website contains no web vulnerabilities, which gives the method and system a more advantageous position in deception defense.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings used in the description of the embodiments or the prior art will be briefly described below, it is obvious that the drawings in the following description are only some embodiments of the present invention, and for those skilled in the art, other drawings can be obtained according to the drawings without creative efforts.
Fig. 1 is a flowchart of a method and a system for crawling and restoring a WEB site by traffic recording according to the present invention.
Fig. 2 is a schematic diagram of a working architecture of a WEB crawling and restoring system implemented by traffic recording according to the present invention.
Fig. 3 is a schematic diagram of a website data structure recorded by a website traffic recording system in the method for crawling and restoring a WEB website by traffic recording according to the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in fig. 1, a method for crawling and restoring a WEB site by traffic recording involves a website traffic recording system, a website data structure generated by recording, a website restoring system, and a restored website application, and specifically includes the following steps:
S1: entering the URL of the website to be crawled into the website traffic recording system;
S2: the website traffic recording system starts recording;
S3: in a browser, accessing the recorded website through the reverse proxy of the website recording system as intended, and marking weights;
S4: once every page of the website that needs to be crawled has been clicked through and weight marking is finished, importing the website data generated by the website recording system into the website restoring system;
S5: the website restoring system generates the restored website application;
S6: running the restored website application and accessing the restored website in a browser.
In step S2, the website data of the website to be crawled is generated by a recording module which, as shown in fig. 2, includes a reverse proxy module, a route analysis module, a weight mark identification module, and a response data processing module.
The reverse proxy module performs reverse proxying and data processing for the website to be crawled, including: a) reverse-proxying the website to be crawled to a local port; b) normalizing the protocol traffic (HTTP/HTTPS) of the website to be crawled into unencrypted HTTP traffic; c) filtering out data about the original website that is invalid after restoration, such as the original IP, domain name, and port. A reverse proxy is chosen over a simple proxy for intercepting the crawl because this invalid data must be filtered out of the crawled website: once the website has been crawled, such data could cause the restored website to jump back to the original website, or cause data requests to target a port that does not exist, so that parts of the restored website fail.
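The three duties of the reverse proxy module can be sketched in Go with the standard library's net/http/httputil package. This is a minimal illustration and not the patent's implementation: the function names rewriteOrigin and newRecorderProxy, the origin https://example.com:8443, and the local address 127.0.0.1:8080 are all assumptions, and only the Location header is filtered here, whereas a real recorder would presumably also rewrite response bodies and cookies.

```go
package main

import (
	"fmt"
	"log"
	"net/http"
	"net/http/httputil"
	"net/url"
	"strings"
)

// rewriteOrigin replaces every reference to the origin base URL
// (scheme, host, and port) with the plain-HTTP local recorder address,
// so a replayed response never points back at the original website.
func rewriteOrigin(value, originBase, localAddr string) string {
	return strings.ReplaceAll(value, originBase, "http://"+localAddr)
}

// newRecorderProxy builds a reverse proxy that forwards requests to origin
// (HTTP or HTTPS) while the browser only speaks unencrypted HTTP to localAddr.
func newRecorderProxy(origin *url.URL, localAddr string) *httputil.ReverseProxy {
	proxy := httputil.NewSingleHostReverseProxy(origin)
	director := proxy.Director
	proxy.Director = func(r *http.Request) {
		director(r)
		r.Host = origin.Host // present the origin's own host name upstream
	}
	originBase := origin.Scheme + "://" + origin.Host
	proxy.ModifyResponse = func(resp *http.Response) error {
		// Filter origin-specific data out of redirects (assumed minimal case).
		if loc := resp.Header.Get("Location"); loc != "" {
			resp.Header.Set("Location", rewriteOrigin(loc, originBase, localAddr))
		}
		return nil
	}
	return proxy
}

func main() {
	// Hypothetical origin and local address, used only for illustration.
	origin, err := url.Parse("https://example.com:8443")
	if err != nil {
		log.Fatal(err)
	}
	const localAddr = "127.0.0.1:8080"
	// Pass the proxy to http.ListenAndServe(localAddr, proxy) to start recording.
	_ = newRecorderProxy(origin, localAddr)
	fmt.Println(rewriteOrigin("https://example.com:8443/login", "https://example.com:8443", localAddr))
}
```

Because the proxy terminates the browser's connection locally and opens its own connection upstream, HTTPS origins are seen by the recorder as plain HTTP, which is the normalization described in item b).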
The route analysis module is used for recording an HTTP request method, a route URI and HTTP request data for accessing the recorded website.
The weight mark identification module records the custom weight of the current request when the recorded website is accessed. The custom weight is read from the HTTP request header 'WEIGHT', and the current request is uniquely identified by its URI and HTTP request data.
The response data processing module records the response status code, response data type, response data ID, and response data returned when the recorded website is accessed, and combines them with the data from the route analysis module and the weight mark identification module into the website request data of the current request, as shown in fig. 3.
The website data generated by the recording module identifies and records the structure and logic of the current website, as shown in fig. 3. It is a double-retrieval data structure whose first key is the HTTP key information and whose second key is the response data ID. The HTTP key information comprises the HTTP request method and the HTTP request URI. The response data ID is computed from the HTTP request URI combined with the HTTP request parameters. The retrieved data includes the HTTP request method, the URI, the HTTP response status code, the HTTP response data type, and multiple combined entries of response data ID and weight mark.
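One way to picture the double-retrieval structure is as a map keyed by method and URI whose values hold a second map keyed by response data ID. The sketch below is an assumption-laden illustration: the type names (RouteKey, Route, Response, Site) are invented here, and the patent does not say how the response data ID is computed, so a truncated SHA-256 over the URI and request parameters stands in for that step.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// RouteKey is the first-level key: the HTTP key information.
type RouteKey struct {
	Method string
	URI    string
}

// Response is one recorded response body together with its weight mark.
type Response struct {
	Weight int
	Body   []byte
}

// Route holds everything recorded for one method and URI.
type Route struct {
	StatusCode  int
	ContentType string
	Responses   map[string][]Response // second-level key: response data ID
}

// Site is the double-retrieval structure for a whole recorded website.
type Site map[RouteKey]*Route

// responseID derives the second-level key from the URI and request parameters.
// The hash choice is an assumption; the source only says the ID is computed
// from these two inputs.
func responseID(uri, params string) string {
	sum := sha256.Sum256([]byte(uri + "\x00" + params))
	return hex.EncodeToString(sum[:8])
}

// Record stores one observed exchange and returns its response data ID.
func (s Site) Record(method, uri, params string, status int, contentType string, weight int, body []byte) string {
	key := RouteKey{Method: method, URI: uri}
	route, ok := s[key]
	if !ok {
		route = &Route{StatusCode: status, ContentType: contentType, Responses: map[string][]Response{}}
		s[key] = route
	}
	id := responseID(uri, params)
	route.Responses[id] = append(route.Responses[id], Response{Weight: weight, Body: body})
	return id
}

func main() {
	site := Site{}
	id := site.Record("POST", "/login", "user=a&pass=wrong", 200, "application/json", 1, []byte(`{"ok":false}`))
	site.Record("POST", "/login", "user=a&pass=wrong", 200, "application/json", 3, []byte(`{"ok":true}`))
	route := site[RouteKey{Method: "POST", URI: "/login"}]
	fmt.Println(len(route.Responses[id]), "responses recorded under one ID")
}
```

Identical requests share an ID and accumulate several weighted responses, which is what later allows the restoring system to replay them in sequence.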
Further, in step S3, a custom request header "WEIGHT" is added to the HTTP request headers to identify the weight value of the response result returned for the current URI and request parameters.
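Reading the "WEIGHT" header on the recorder side might look like the following Go sketch. The header name comes from the description; everything else is assumed for illustration, in particular that the weight is a positive integer and that a missing or malformed header falls back to a default of 1 (the names parseWeight, weightFromHeader, and defaultWeight are hypothetical).

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
)

// defaultWeight is an assumed fallback for requests without a usable header.
const defaultWeight = 1

// parseWeight converts the raw header value into a positive integer weight,
// returning defaultWeight when the value is empty, malformed, or below 1.
func parseWeight(raw string) int {
	w, err := strconv.Atoi(raw)
	if err != nil || w < 1 {
		return defaultWeight
	}
	return w
}

// weightFromHeader extracts the custom weight mark from the request headers.
// Header lookup in net/http is case-insensitive, so "WEIGHT" and "Weight"
// refer to the same header.
func weightFromHeader(h http.Header) int {
	return parseWeight(h.Get("WEIGHT"))
}

func main() {
	h := http.Header{}
	h.Set("WEIGHT", "3")
	fmt.Println(weightFromHeader(h))
	fmt.Println(weightFromHeader(http.Header{}))
}
```

In a browser, such a header would typically be injected by an extension or a developer tool while the operator clicks through the site; the source does not specify the mechanism.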
Further, the website restoring system in step S5, shown in fig. 2, includes a route analysis module, a response data matching module, and an application generation module.
The route analysis module parses the request method, URI, response status code, and response data type from the website data and combines them into valid routing code for the HTTP website application by string matching and splicing. The routing code for one route URI is in practice a call to a custom template function: only the string-typed request method, URI, response status code, and response data type need to be filled in at the call's parameter positions.
The response data matching module parses the groups of response data IDs and weight marks from the website data and combines them into valid response content code for the HTTP website application by string matching and splicing. One piece of response content code is in practice a call to a custom template function: only the string-typed response data ID and weight mark need to be filled in at the call's parameter positions.
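Generating code by splicing strings into template function calls can be sketched as below. The template function names registerRoute and registerResponse are hypothetical, as are the RouteSpec and ResponseSpec types; the source only states that each generated line is a call whose string-typed arguments are filled in from the recorded data.

```go
package main

import (
	"fmt"
	"strings"
)

// RouteSpec is the per-route data parsed from the recorded website data.
type RouteSpec struct {
	Method      string
	URI         string
	StatusCode  int
	ContentType string
}

// ResponseSpec is one response data ID with its weight mark.
type ResponseSpec struct {
	ID     string
	Weight int
}

// routeCall renders one route as a call to the (hypothetical) template
// function registerRoute, with every argument passed as a string literal.
func routeCall(r RouteSpec) string {
	return fmt.Sprintf("registerRoute(%q, %q, \"%d\", %q)", r.Method, r.URI, r.StatusCode, r.ContentType)
}

// responseCall renders one response entry as a call to the (hypothetical)
// template function registerResponse.
func responseCall(r ResponseSpec) string {
	return fmt.Sprintf("registerResponse(%q, \"%d\")", r.ID, r.Weight)
}

// generate splices all calls into one block of source text, one call per line.
func generate(routes []RouteSpec, responses []ResponseSpec) string {
	var b strings.Builder
	for _, r := range routes {
		b.WriteString(routeCall(r))
		b.WriteByte('\n')
	}
	for _, r := range responses {
		b.WriteString(responseCall(r))
		b.WriteByte('\n')
	}
	return b.String()
}

func main() {
	routes := []RouteSpec{{Method: "POST", URI: "/login", StatusCode: 200, ContentType: "application/json"}}
	responses := []ResponseSpec{{ID: "a1b2", Weight: 3}}
	fmt.Print(generate(routes, responses))
}
```

Using the %q verb keeps the spliced arguments valid Go string literals even when a recorded URI contains quotes or backslashes.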
Response data matching is either exact or sequential. In exact matching, the current request data always maps to the same response data ID, and the specified response data is returned. In sequential matching, the response data recorded under the request data ID is returned in order, according to the number of requests made by the current requesting IP and the weight marks. In sequential matching, a response with a smaller weight mark is returned fewer times, and a response with a larger weight mark is returned more times.
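The description leaves the exact ordering rule open, so the following Go sketch shows only one plausible reading: each recorded response occupies as many consecutive slots in a repeating cycle as its weight mark, and a per-IP request counter selects the slot. The Matcher and Candidate types and the per-IP counter key are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"sync"
)

// Candidate is one recorded response for a response data ID.
type Candidate struct {
	Weight int
	Body   string
}

// Matcher counts requests per client IP and response data ID.
type Matcher struct {
	mu     sync.Mutex
	counts map[string]int
}

func NewMatcher() *Matcher { return &Matcher{counts: map[string]int{}} }

// weightOf treats weights below 1 as 1 so every candidate gets a slot.
func weightOf(c Candidate) int {
	if c.Weight < 1 {
		return 1
	}
	return c.Weight
}

// pick returns the candidate for the n-th request (0-based): each candidate
// occupies weightOf(c) consecutive slots in a repeating cycle.
func pick(cands []Candidate, n int) Candidate {
	total := 0
	for _, c := range cands {
		total += weightOf(c)
	}
	slot := n % total
	for _, c := range cands {
		if slot < weightOf(c) {
			return c
		}
		slot -= weightOf(c)
	}
	return cands[len(cands)-1] // not reached for non-empty cands
}

// Match returns the response for a request from ip with the given ID.
// One candidate means exact matching; several mean sequential matching.
func (m *Matcher) Match(ip, id string, cands []Candidate) (Candidate, bool) {
	if len(cands) == 0 {
		return Candidate{}, false
	}
	if len(cands) == 1 {
		return cands[0], true
	}
	m.mu.Lock()
	defer m.mu.Unlock()
	key := ip + "|" + id
	n := m.counts[key]
	m.counts[key] = n + 1
	return pick(cands, n), true
}

func main() {
	m := NewMatcher()
	cands := []Candidate{
		{Weight: 1, Body: "wrong verification code"},
		{Weight: 3, Body: "login ok"},
	}
	for i := 0; i < 5; i++ {
		c, _ := m.Match("10.0.0.1", "id-login", cands)
		fmt.Println(c.Body)
	}
}
```

Under this reading, a response with weight 1 is returned once per cycle and a response with weight 3 three times, matching the statement that a smaller weight mark means fewer returns. Each client IP advances through the cycle independently.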
The application generation module combines the existing HTTP website application template code, the routing code generated by the route analysis module, and the response content code generated by the response data matching module into valid HTTP website application code, and then compiles it into the WEB application of the restored website. The application template code includes template code for route handling and response data handling, code for handling requests to undefined routes, and so on. Golang, a cross-platform-friendly language, is chosen as the programming language so that the compiled WEB application of the restored website can run and be displayed on all platforms.
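A minimal sketch of what the application template might contain is shown below: a lookup of recorded responses plus a fixed answer for any route that was never recorded. The Recorded type, the resolve and handler functions, the "METHOD URI" key format, and the 404 fallback body are assumptions for illustration and are not taken from the source.

```go
package main

import (
	"fmt"
	"net/http"
)

// Recorded is one replayable response.
type Recorded struct {
	Status      int
	ContentType string
	Body        string
}

// resolve returns the recorded response for method and URI, or a fixed 404
// for any route that was never recorded (the undefined-route case).
func resolve(routes map[string]Recorded, method, uri string) Recorded {
	if r, ok := routes[method+" "+uri]; ok {
		return r
	}
	return Recorded{Status: http.StatusNotFound, ContentType: "text/plain; charset=utf-8", Body: "not found"}
}

// handler replays recorded responses over HTTP.
func handler(routes map[string]Recorded) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		rec := resolve(routes, r.Method, r.URL.Path)
		w.Header().Set("Content-Type", rec.ContentType)
		w.WriteHeader(rec.Status)
		fmt.Fprint(w, rec.Body)
	})
}

func main() {
	routes := map[string]Recorded{
		"GET /": {Status: 200, ContentType: "text/html", Body: "<h1>restored</h1>"},
	}
	// To serve the restored site: http.ListenAndServe("127.0.0.1:8081", handler(routes))
	_ = handler(routes)
	fmt.Println(resolve(routes, "GET", "/").Status, resolve(routes, "GET", "/admin").Status)
}
```

Because the handler only ever replays stored data and never executes request-dependent logic, it illustrates why the description regards the restored site as returning fixed, ordered responses.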
The WEB application of the restored website is then run, and the restored website is accessed in a browser.
The beneficial effects of the invention are as follows: the invention can crawl websites of all existing website types and restore them with a high degree of restoration, including high-interaction restoration such as login, registration, search, and verification code change interactions. In the business practice of deception defense, specific websites frequently need to be crawled and restored to support further deception defense. The method and system restore websites with an extremely high degree of restoration across a comprehensive range of uses, and because response data is returned in a fixed, ordered manner, the restored website contains no web vulnerabilities, which gives the method and system a more advantageous position in deception defense. The above description only illustrates the preferred embodiments of the present invention and is not to be construed as limiting it; any modifications, equivalents, and improvements that fall within the spirit and principle of the present invention are intended to be included in its scope of protection.

Claims (7)

1. A system for crawling and restoring WEB pages by traffic recording, characterized by comprising a website traffic recording system and a website restoring system, wherein the website traffic recording system records and generates the website data of a website to be crawled, and the website restoring system generates a restored website application.
2. The system for crawling and restoring a WEB site by traffic recording according to claim 1, further comprising a website data structure generated by recording and a restored website application.
3. A method for crawling and restoring a WEB site by traffic recording, characterized by comprising the following steps:
S1: entering the URL of the website to be crawled into the website traffic recording system;
S2: the website traffic recording system starts recording;
S3: in a browser, accessing the recorded website through the reverse proxy of the website recording system as intended, and marking weights;
S4: importing the website data generated by the website recording system, which has the website data structure, into the website restoring system;
S5: the website restoring system generates the restored website application;
S6: running the restored website application and accessing the restored website in a browser.
4. The method for crawling and restoring a WEB site by traffic recording according to claim 3, wherein the website data structure generated by recording is a double-retrieval data structure whose first key is HTTP key information and whose second key is a response data ID; the HTTP key information comprises the HTTP request method and the HTTP request URI; the response data ID is computed from the HTTP request URI combined with the HTTP request parameters; and the retrieved data includes the HTTP request method, the URI, the HTTP response status code, the HTTP response data type, and multiple combined entries of response data ID and weight mark.
5. The method for crawling and restoring a WEB site by traffic recording according to claim 3, wherein the recording module of the website recording system in step S2 comprises a reverse proxy module, a route analysis module, a weight mark identification module, and a response data processing module.
6. The method for crawling and restoring a WEB site by traffic recording according to claim 3, wherein in step S3 a custom request header is added to the HTTP request headers to identify the weight value of the response result for the current URI and request parameters.
7. The method for crawling and restoring a WEB site by traffic recording according to claim 3, wherein the website restoring system in step S5 comprises a route analysis module, a response data matching module, and an application generation module.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202111167366.4A CN113946735A (en) 2021-10-05 2021-10-05 Method and system for crawling and restoring WEB site by traffic recording

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202111167366.4A CN113946735A (en) 2021-10-05 2021-10-05 Method and system for crawling and restoring WEB site by traffic recording

Publications (1)

Publication Number Publication Date
CN113946735A true CN113946735A (en) 2022-01-18

Family

ID=79330016

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202111167366.4A Pending CN113946735A (en) 2021-10-05 2021-10-05 Method and system for crawling and restoring WEB site by traffic recording

Country Status (1)

Country Link
CN (1) CN113946735A (en)

Citations (4)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101887463A (en) * 2010-07-22 2010-11-17 北京天融信科技有限公司 Virtual domain-based HTTP reduction display method
CN102098331A (en) * 2010-12-29 2011-06-15 北京锐安科技有限公司 Method and system for reducing WEB type application contents
US20130136253A1 (en) * 2011-11-28 2013-05-30 Hadas Liberman Ben-Ami System and method for tracking web interactions with real time analytics
CN106598991A (en) * 2015-10-19 2017-04-26 上海引跑信息科技有限公司 Web crawler system capable of realizing website interaction and automatic form extraction by conversational mode

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
RJ01 Rejection of invention patent application after publication (Application publication date: 20220118)