CN113946735A - Method and system for crawling and restoring WEB site by traffic recording - Google Patents
- Publication number
- CN113946735A
- Authority
- CN
- China
- Prior art keywords
- website
- recording
- restoring
- data
- crawling
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Pending
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/951—Indexing; Web crawling techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/903—Querying
- G06F16/90335—Query processing
- G06F16/90344—Query processing by using string matching techniques
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/955—Retrieval from the web using information identifiers, e.g. uniform resource locators [URL]
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/957—Browsing optimisation, e.g. caching or content distillation
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F16/00—Information retrieval; Database structures therefor; File system structures therefor
- G06F16/90—Details of database functions independent of the retrieved data types
- G06F16/95—Retrieval from the web
- G06F16/958—Organisation or management of web site content, e.g. publishing, maintaining pages or automatic linking
Landscapes
- Engineering & Computer Science (AREA)
- Databases & Information Systems (AREA)
- Theoretical Computer Science (AREA)
- Data Mining & Analysis (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Computational Linguistics (AREA)
- Information Transfer Between Computers (AREA)
Abstract
The invention discloses a method and a system for crawling and restoring a WEB site by traffic recording. The system comprises a website traffic recording system, a website data structure generated by recording, a website restoring system, and a restored website application. The method comprises the following steps: S1: entering the URL of the website to be crawled into the website traffic recording system; S2: the website traffic recording system starts recording; S3: in a browser, accessing the recorded website through the reverse proxy of the website recording system as intended, and marking weights; S4: importing the website data generated by the website recording system, which has the website data structure characteristics, into the website restoring system; S5: the website restoring system generates the restored website application; S6: running the restored website application and accessing the restored website in a browser. The method crawls, restores, and displays the data and files of a WEB site, and has the advantages of being applicable to all website types, achieving a high degree of restoration, and being secure and free of vulnerabilities.
Description
Technical Field
The invention relates to the technical field of network design, in particular to a method and a system for crawling and restoring a WEB site by recording traffic.
Background Art
In conventional WEB site technology, switching between pages loads an entirely new page directly from the server. With the development of WEB technologies, however, modern WEB sites increasingly adopt the Single Page Application (SPA) development model, which interacts with users by dynamically rewriting portions of the page. As a result, existing website crawling technologies, which search for, parse, and then re-request the links in a WEB site, can crawl only traditional WEB sites and not modern SPA-type websites.
Furthermore, in business practice, it is necessary to restore not only the display pages of a website but also its specific interactions, including but not limited to: login interaction, registration interaction, search interaction, and verification code change interaction. The interaction data includes, but is not limited to: verification codes, input-error prompts, and the content returned for correct input. Such interactions usually involve querying, checking, and processing of data by the website backend, yet existing website crawling technology has no ability to record state and data, so existing website crawling falls clearly short in degree of restoration.
Disclosure of Invention
In view of the above deficiencies of the prior art, the present invention aims to provide a method and a system for crawling and restoring WEB sites that apply to all WEB site types and achieve a high degree of restoration.
To solve the above technical problems, the invention adopts the following technical solution:
A method and a system for crawling and restoring a WEB site by traffic recording, characterized by comprising a website traffic recording system, a website data structure generated by recording, a website restoring system, and a restored website application, and specifically comprising the following steps:
S1: entering the URL of the website to be crawled into the website traffic recording system;
S2: the website traffic recording system starts recording;
S3: in a browser, accessing the recorded website through the reverse proxy of the website recording system as intended, and marking weights;
S4: once every page of the website that needs to be crawled has been clicked through and weight marking is complete, importing the website data generated by the website recording system into the website restoring system;
S5: the website restoring system generates the restored website application;
S6: running the restored website application and accessing the restored website in a browser.
Further, in step S2, the website data of the website to be crawled is generated by a recording module, which includes a reverse proxy module, a route parsing module, a weight mark identification module, and a response data processing module.
The reverse proxy module performs reverse proxying and data processing for the website to be crawled, including:
a. reverse-proxying the website to be crawled to a local port;
b. normalizing the protocol traffic (HTTP/HTTPS) of the website to be crawled into unencrypted HTTP traffic;
c. filtering out invalid data about the original website, such as the original IP, domain name, and port.
The route parsing module records the HTTP request method, the route URI, and the HTTP request data used to access the recorded website.
The weight mark identification module records the custom weight of the current request when the recorded website is accessed, where the current request is uniquely identified by its URI and HTTP request data.
The response data processing module records the response status code, response data type, response data ID, and response data returned when the recorded website is accessed, and combines them with the data from the route parsing module and the weight mark identification module into the website request data of the current request.
The website data generated by the recording module for the website to be crawled identifies and records the structure and logic of the current website. It is characterized as a double-retrieval (two-level lookup) data structure with the HTTP key information as the first element and the response data ID as the second element. The HTTP key information comprises the HTTP request method and the HTTP request URI. The response data ID is computed from a combined operation on the HTTP request URI and the HTTP request parameters. The retrieved data includes: the HTTP request method, the URI, the HTTP response status code, the HTTP response data type, and multiple sets of combined response data ID and weight mark data.
Further, in step S3, a custom request header is added to the HTTP request headers to identify the weight value of the response result returned for the current URI and request parameters.
Further, the website restoring system in step S5 includes a route parsing module, a response data matching module, and an application generation module.
The route parsing module parses the following data from the website data: the request method, URI, response status code, and response data type; and combines them, by string matching and splicing, into valid routing code for the HTTP website application.
The response data matching module parses the multiple sets of response data IDs and weight marks in the website data and combines them, by string matching and splicing, into valid response content code for the HTTP website application. The response data matching method comprises exact matching and sequential matching. In exact matching, the current request data always matches the same response data ID, and the specified response data is returned. In sequential matching, responses are returned in order for the requesting IP, according to the number of requests that IP has made and the weight marks. In sequential matching, a response with a smaller weight mark is returned fewer times, and vice versa.
The application generation module combines the existing HTTP website application template code, the routing code generated by the route parsing module, and the response content code generated by the response data matching module into valid HTTP website application code, which is then compiled into the WEB application of the restored website.
The WEB application of the restored website is then run, and the restored website is accessed in a browser.
The beneficial effects of the invention are as follows: the invention can crawl websites of all existing website types and restore them with a high degree of restoration, where the high-interaction restoration includes login interaction, registration interaction, search interaction, verification code change interaction, and the like. In the business practice of deception defense, specific websites frequently need to be crawled and restored to enable further deception defense. The invention achieves website restoration with an extremely high degree of restoration and a comprehensive scope of use, and because response data is returned in a fixed and ordered manner, no webpage vulnerabilities exist, so the invention can occupy a more advantageous position in deception defense.
Drawings
To illustrate the embodiments of the present invention or the technical solutions in the prior art more clearly, the drawings used in the description of the embodiments or the prior art are briefly described below. The drawings described below show only some embodiments of the present invention; those skilled in the art can obtain other drawings from them without creative effort.
Fig. 1 is a flowchart of a method and a system for crawling and restoring a WEB site by traffic recording according to the present invention.
Fig. 2 is a schematic diagram of a working architecture of a WEB crawling and restoring system implemented by traffic recording according to the present invention.
Fig. 3 is a schematic diagram of a website data structure recorded by a website traffic recording system in the method for crawling and restoring a WEB website by traffic recording according to the present invention.
Detailed Description
The technical solution in the embodiments of the present invention will be clearly and completely described below with reference to the accompanying drawings in the embodiments of the present invention. It is to be understood that the described embodiments are merely exemplary of the invention, and not restrictive of the full scope of the invention. All other embodiments, which can be derived by a person skilled in the art from the embodiments given herein without making any creative effort, shall fall within the protection scope of the present invention.
As shown in Fig. 1, a method for crawling and restoring a WEB site by traffic recording is characterized by comprising a website traffic recording system, a website data structure generated by recording, a website restoring system, and a restored website application, and specifically comprises the following steps:
S1: entering the URL of the website to be crawled into the website traffic recording system;
S2: the website traffic recording system starts recording;
S3: in a browser, accessing the recorded website through the reverse proxy of the website recording system as intended, and marking weights;
S4: once every page of the website that needs to be crawled has been clicked through and weight marking is complete, importing the website data generated by the website recording system into the website restoring system;
S5: the website restoring system generates the restored website application;
S6: running the restored website application and accessing the restored website in a browser.
In step S2, the website data of the website to be crawled is generated by a recording module which, as shown in Fig. 2, includes a reverse proxy module, a route parsing module, a weight mark identification module, and a response data processing module.
The reverse proxy module performs reverse proxying and data processing for the website to be crawled, including: a) reverse-proxying the website to be crawled to a local port; b) normalizing the protocol traffic (HTTP/HTTPS) of the website to be crawled into unencrypted HTTP traffic; c) filtering out invalid data about the original website, such as the original IP, domain name, and port. A reverse proxy is chosen over a plain proxy for intervening in the crawl because invalid data about the crawled website must be filtered out. Otherwise, once the website has been crawled, such data may cause the restored website to jump back to the original website when accessed, or cause data requests to target a port that does not exist, so that parts of the restored website fail.
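The reverse proxy behaviour described above can be sketched in Go with the standard library's `net/http/httputil` package. This is a minimal illustration under stated assumptions, not the patent's implementation: the helper names `stripOrigin` and `newRecordingProxy`, the origin `https://example.com`, and the choice to rewrite only the `Location` header are all hypothetical.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httputil"
	"net/url"
)

// stripOrigin (hypothetical helper) removes the original site's scheme, host
// and port from a URL value such as a Location header, leaving a site-relative
// path so that the restored site never jumps back to the original.
func stripOrigin(value, originHost string) string {
	u, err := url.Parse(value)
	if err != nil || u.Host != originHost {
		return value // relative URL or a different site: leave unchanged
	}
	u.Scheme, u.Host = "", ""
	if u.Path == "" {
		u.Path = "/"
	}
	return u.String()
}

// newRecordingProxy (hypothetical helper) reverse-proxies the origin so that a
// browser can reach it over plain HTTP on a local port, and filters
// origin-specific redirect targets out of the responses.
func newRecordingProxy(origin *url.URL) *httputil.ReverseProxy {
	proxy := httputil.NewSingleHostReverseProxy(origin)
	director := proxy.Director
	proxy.Director = func(r *http.Request) {
		director(r)
		r.Host = origin.Host // present the original host name upstream
	}
	proxy.ModifyResponse = func(resp *http.Response) error {
		if loc := resp.Header.Get("Location"); loc != "" {
			resp.Header.Set("Location", stripOrigin(loc, origin.Host))
		}
		return nil
	}
	return proxy
}

func main() {
	origin, err := url.Parse("https://example.com") // assumed origin
	if err != nil {
		panic(err)
	}
	// In a real run the proxy would be served on a local port, for example
	// http.ListenAndServe("127.0.0.1:8080", proxy).
	_ = newRecordingProxy(origin)
	fmt.Println(stripOrigin("https://example.com/login?next=%2Fhome", origin.Host))
}
```

A fuller recorder would presumably also rewrite absolute URLs inside response bodies and cookies; the sketch covers only redirects.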
The route parsing module records the HTTP request method, the route URI, and the HTTP request data used to access the recorded website.
The weight mark identification module records the custom weight of the current request when the recorded website is accessed. The custom weight is read from the HTTP request header "WEIGHT", and the current request is uniquely identified by its URI and HTTP request data.
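As a rough Go sketch of reading the weight mark from the "WEIGHT" request header: the header name comes from the text, while the helper names `parseWeight` and `weightOf`, the integer encoding of the weight, and the fallback value of 1 for a missing or malformed header are assumptions made for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"strconv"
	"strings"
)

const weightHeader = "WEIGHT" // header name taken from the description

const defaultWeight = 1 // assumption: weight used when the header is absent or invalid

// parseWeight converts the raw header value into a positive integer weight.
func parseWeight(raw string) int {
	w, err := strconv.Atoi(strings.TrimSpace(raw))
	if err != nil || w < 1 {
		return defaultWeight
	}
	return w
}

// weightOf reads the custom weight attached to the current request.
func weightOf(r *http.Request) int {
	return parseWeight(r.Header.Get(weightHeader))
}

func main() {
	// A browser extension or test client would add the header while recording.
	req, err := http.NewRequest(http.MethodPost, "http://127.0.0.1:8080/login", nil)
	if err != nil {
		panic(err)
	}
	req.Header.Set(weightHeader, "3")
	fmt.Println(weightOf(req))
}
```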
The response data processing module records the response status code, response data type, response data ID, and response data returned when the recorded website is accessed, and combines them with the data from the route parsing module and the weight mark identification module into the website request data of the current request, as shown in Fig. 3.
The website data generated by the recording module for the website to be crawled identifies and records the structure and logic of the current website, as shown in Fig. 3. It is characterized as a double-retrieval (two-level lookup) data structure with the HTTP key information as the first element and the response data ID as the second element. The HTTP key information comprises the HTTP request method and the HTTP request URI. The response data ID is computed from a combined operation on the HTTP request URI and the HTTP request parameters. The retrieved data includes: the HTTP request method, the URI, the HTTP response status code, the HTTP response data type, and multiple sets of combined response data ID and weight mark data.
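One way to picture this two-level structure is a Go map keyed by request method and URI, whose values hold a second map keyed by response data ID. The sketch below is illustrative only: the type and function names are hypothetical, and because the text does not specify the "combined operation" that produces the response data ID, a truncated SHA-256 hash of the URI and parameters is used as a stand-in.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// RouteKey is the first lookup level: the HTTP key information.
type RouteKey struct {
	Method string
	URI    string
}

// ResponseEntry combines a weight mark with the recorded response data.
type ResponseEntry struct {
	Weight int
	Body   []byte
}

// Route holds the per-route data plus the second lookup level, keyed by
// response data ID.
type Route struct {
	StatusCode  int
	ContentType string
	Responses   map[string]ResponseEntry
}

// SiteData is the recorded website data.
type SiteData map[RouteKey]*Route

// responseID derives an ID from the URI and request parameters.
// Assumption: a truncated SHA-256 hash stands in for the unspecified operation.
func responseID(uri, params string) string {
	sum := sha256.Sum256([]byte(uri + "\x00" + params))
	return hex.EncodeToString(sum[:8])
}

// Record stores one observed request/response pair and returns its ID.
func (s SiteData) Record(method, uri, params string, status int, contentType string, weight int, body []byte) string {
	key := RouteKey{Method: method, URI: uri}
	route, ok := s[key]
	if !ok {
		route = &Route{StatusCode: status, ContentType: contentType, Responses: map[string]ResponseEntry{}}
		s[key] = route
	}
	id := responseID(uri, params)
	route.Responses[id] = ResponseEntry{Weight: weight, Body: body}
	return id
}

// Lookup performs the double retrieval: first by method and URI, then by ID.
func (s SiteData) Lookup(method, uri, params string) (ResponseEntry, bool) {
	route, ok := s[RouteKey{Method: method, URI: uri}]
	if !ok {
		return ResponseEntry{}, false
	}
	entry, ok := route.Responses[responseID(uri, params)]
	return entry, ok
}

func main() {
	site := SiteData{}
	site.Record("POST", "/login", "user=alice&pass=wrong", 200, "text/html", 1, []byte("bad password"))
	entry, ok := site.Lookup("POST", "/login", "user=alice&pass=wrong")
	fmt.Println(ok, string(entry.Body))
}
```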
Further, in step S3, a custom request header "WEIGHT" is added to the HTTP request headers to identify the weight value of the response result returned for the current URI and request parameters.
Further, the website restoring system in step S5 is shown in Fig. 2 and includes a route parsing module, a response data matching module, and an application generation module.
The route parsing module parses the following data from the website data: the request method, URI, response status code, and response data type; and combines them, by string matching and splicing, into valid routing code for the HTTP website application. The routing code for one route URI is in fact a call to a custom template function; only the string-typed request method, URI, response status code, and response data type need to be filled in as the call arguments.
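The splicing step can be illustrated with a short Go sketch that turns recorded route data into source text calling a template function. The function name `route`, its argument order, and emitting the status code as an integer literal are assumptions; the text only states that the recorded values are filled in as call arguments.

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// RouteRecord is the per-route data parsed from the recorded website data.
type RouteRecord struct {
	Method      string
	URI         string
	StatusCode  int
	ContentType string
}

// routeCode splices one record into a call to a hypothetical template
// function named route. strconv.Quote keeps the generated code valid even if
// a URI contains quotes or backslashes.
func routeCode(r RouteRecord) string {
	return fmt.Sprintf("route(%s, %s, %d, %s)",
		strconv.Quote(r.Method), strconv.Quote(r.URI), r.StatusCode, strconv.Quote(r.ContentType))
}

// routesCode joins the generated calls, one per line, ready to be inserted
// into the application template.
func routesCode(records []RouteRecord) string {
	lines := make([]string, 0, len(records))
	for _, r := range records {
		lines = append(lines, routeCode(r))
	}
	return strings.Join(lines, "\n")
}

func main() {
	fmt.Println(routesCode([]RouteRecord{
		{Method: "GET", URI: "/login", StatusCode: 200, ContentType: "text/html"},
		{Method: "POST", URI: "/login", StatusCode: 302, ContentType: "text/html"},
	}))
}
```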
The response data matching module parses the multiple sets of response data IDs and weight marks in the website data and combines them, by string matching and splicing, into valid response content code for the HTTP website application. A piece of response content code is in fact a call to a custom template function; only the string-typed response data ID and weight mark need to be filled in as the call arguments.
The response data matching method comprises exact matching and sequential matching. In exact matching, the current request data always matches the same response data ID, and the specified response data is returned. In sequential matching, responses are returned in order for the requesting IP, according to the number of requests that IP has made and the weight marks. In sequential matching, a response with a smaller weight mark is returned fewer times, and vice versa.
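The two matching modes might look like the following in Go. The text does not define exactly how weights map to return counts, so this sketch assumes one plausible reading: within each round, a response with weight w is returned w times in recorded order, and each client IP keeps its own position in the round. All names are hypothetical.

```go
package main

import "fmt"

// Weighted pairs a response data ID with its weight mark.
type Weighted struct {
	ID     string
	Weight int
}

// buildRound lists each response ID Weight times, in recorded order, so a
// smaller weight means fewer returns per round.
func buildRound(responses []Weighted) []string {
	var round []string
	for _, r := range responses {
		for i := 0; i < r.Weight; i++ {
			round = append(round, r.ID)
		}
	}
	return round
}

// SequentialMatcher returns response IDs in order, tracking how many requests
// each client IP has made. It is not safe for concurrent use.
type SequentialMatcher struct {
	round []string
	count map[string]int
}

// NewSequentialMatcher prepares a matcher for one route's recorded responses.
func NewSequentialMatcher(responses []Weighted) *SequentialMatcher {
	return &SequentialMatcher{round: buildRound(responses), count: map[string]int{}}
}

// Next returns the response ID for the client's next request, or "" when no
// responses were recorded.
func (m *SequentialMatcher) Next(clientIP string) string {
	if len(m.round) == 0 {
		return ""
	}
	n := m.count[clientIP]
	m.count[clientIP] = n + 1
	return m.round[n%len(m.round)]
}

// ExactMatch returns the fixed response recorded under a response data ID.
func ExactMatch(recorded map[string]string, id string) (string, bool) {
	body, ok := recorded[id]
	return body, ok
}

func main() {
	m := NewSequentialMatcher([]Weighted{{ID: "captcha-1", Weight: 1}, {ID: "captcha-2", Weight: 2}})
	for i := 0; i < 4; i++ {
		fmt.Println(m.Next("203.0.113.7"))
	}
}
```

Under this reading, a changing verification code can be replayed by recording several responses for the same route and letting each visitor cycle through them.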
The application generation module combines the existing HTTP website application template code, the routing code generated by the route parsing module, and the response content code generated by the response data matching module into valid HTTP website application code, which is then compiled into the WEB application of the restored website. The application template code includes template code for route handling, for response data handling, and for responding to undefined routes, among others. The programming language chosen is Golang, a cross-platform-friendly language, so that the compiled WEB application of the restored website can run and be displayed on all platforms.
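To make the role of the template code concrete, here is a small Go sketch of what a generated application could contain: a template function that registers a recorded route, and a fallback for undefined routes. The names `app`, `route`, and `probe`, the use of a plain 404 as the undefined-route response, and the inline response body are assumptions for illustration.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
)

// app is a hypothetical restored-site application: a table of fixed handlers
// keyed by "METHOD URI".
type app struct {
	routes map[string]http.HandlerFunc
}

func newApp() *app {
	return &app{routes: map[string]http.HandlerFunc{}}
}

// route is the template function that generated routing code would call once
// per recorded route. It always replies with the recorded status, type, and body.
func (a *app) route(method, uri string, status int, contentType, body string) {
	a.routes[method+" "+uri] = func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", contentType)
		w.WriteHeader(status)
		fmt.Fprint(w, body)
	}
}

// ServeHTTP dispatches recorded routes and answers undefined ones with 404.
func (a *app) ServeHTTP(w http.ResponseWriter, r *http.Request) {
	if h, ok := a.routes[r.Method+" "+r.URL.Path]; ok {
		h(w, r)
		return
	}
	http.NotFound(w, r)
}

// probe sends one in-memory request to a handler and returns status and body.
func probe(h http.Handler, method, uri string) (int, string) {
	rec := httptest.NewRecorder()
	h.ServeHTTP(rec, httptest.NewRequest(method, uri, nil))
	return rec.Code, rec.Body.String()
}

func main() {
	a := newApp()
	a.route("GET", "/login", 200, "text/html", "<h1>Login</h1>")
	fmt.Println(probe(a, "GET", "/login"))
	code, _ := probe(a, "GET", "/missing")
	fmt.Println(code)
	// A deployed build would instead call http.ListenAndServe with a as the handler.
}
```

Because every handler only replays recorded bytes and never evaluates request input, the restored site exposes no backend logic of its own.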
The WEB application of the restored website is then run, and the restored website is accessed in a browser.
The beneficial effects of the invention are as follows: the invention can crawl websites of all existing website types and restore them with a high degree of restoration, where the high-interaction restoration includes login interaction, registration interaction, search interaction, verification code change interaction, and the like. In the business practice of deception defense, specific websites frequently need to be crawled and restored to enable further deception defense. The invention achieves website restoration with an extremely high degree of restoration and a comprehensive scope of use, and because response data is returned in a fixed and ordered manner, no webpage vulnerabilities exist, so the invention can occupy a more advantageous position in deception defense. The above description only illustrates the preferred embodiments of the present invention and is not to be construed as limiting the invention; any modifications, equivalents, improvements, and the like that fall within the spirit and principle of the present invention are intended to be included within its scope of protection.
Claims (7)
1. A system for crawling and restoring WEB pages by traffic recording, characterized by comprising: a website traffic recording system and a website restoring system, wherein the website traffic recording system is used for recording and generating the website data of a website to be crawled, and the website restoring system is used for generating a restored website application.
2. The system for crawling and restoring a WEB site by traffic recording according to claim 1, further comprising a website data structure generated by recording and a restored website application.
3. A method for crawling and restoring a WEB site by traffic recording, characterized by comprising the following specific steps:
S1: entering the URL of the website to be crawled into the website traffic recording system;
S2: the website traffic recording system starts recording;
S3: in a browser, accessing the recorded website through the reverse proxy of the website recording system as intended, and marking weights;
S4: importing the website data generated by the website recording system, which has the website data structure characteristics, into the website restoring system;
S5: the website restoring system generates the restored website application;
S6: running the restored website application and accessing the restored website in a browser.
4. The method for crawling and restoring a WEB site by traffic recording according to claim 3, wherein the website data structure generated by recording comprises: a double-retrieval data structure with HTTP key information as the first element and a response data ID as the second element; the HTTP key information comprises the HTTP request method and the HTTP request URI; the response data ID is computed from a combined operation on the HTTP request URI and the HTTP request parameters; and the retrieved data includes: the HTTP request method, the URI, the HTTP response status code, the HTTP response data type, and multiple sets of combined response data ID and weight mark data.
5. The method for crawling and restoring a WEB site by traffic recording according to claim 3, wherein the recording module of the website recording system in step S2 comprises a reverse proxy module, a route parsing module, a weight mark identification module, and a response data processing module.
6. The method for crawling and restoring a WEB site by traffic recording according to claim 3, wherein in step S3 a custom request header is added to the HTTP request headers to identify the weight value of the response result returned for the current URI and request parameters.
7. The method for crawling and restoring a WEB site by traffic recording according to claim 3, wherein the modules of the website restoring system in step S5 include: a route parsing module, a response data matching module, and an application generation module.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111167366.4A CN113946735A (en) | 2021-10-05 | 2021-10-05 | Method and system for crawling and restoring WEB site by traffic recording |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202111167366.4A CN113946735A (en) | 2021-10-05 | 2021-10-05 | Method and system for crawling and restoring WEB site by traffic recording |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| CN113946735A (en) | 2022-01-18 |
Family
ID=79330016
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202111167366.4A Pending CN113946735A (en) | 2021-10-05 | 2021-10-05 | Method and system for crawling and restoring WEB site by traffic recording |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN113946735A (en) |
Citations (4)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN101887463A (en) * | 2010-07-22 | 2010-11-17 | 北京天融信科技有限公司 | Virtual domain-based HTTP reduction display method |
| CN102098331A (en) * | 2010-12-29 | 2011-06-15 | 北京锐安科技有限公司 | Method and system for reducing WEB type application contents |
| US20130136253A1 (en) * | 2011-11-28 | 2013-05-30 | Hadas Liberman Ben-Ami | System and method for tracking web interactions with real time analytics |
| CN106598991A (en) * | 2015-10-19 | 2017-04-26 | 上海引跑信息科技有限公司 | Web crawler system capable of realizing website interaction and automatic form extraction by conversational mode |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | RJ01 | Rejection of invention patent application after publication | Application publication date: 20220118 |