KR101650316B1

KR101650316B1 - Apparatus and method for collecting and analysing HTML5 documents based on distributed parallel processing

Info

Publication number: KR101650316B1
Application number: KR1020150009712A
Authority: KR
Inventors: 김환국; 정종훈; 배한철; 추현록; 장웅; 오상환; 윤수진
Original assignee: 한국인터넷진흥원
Priority date: 2015-01-21
Filing date: 2015-01-21
Publication date: 2016-08-23
Anticipated expiration: 2035-01-21
Also published as: KR20160089995A

Abstract

An apparatus and method for collecting and analyzing HTML5 documents based on distributed parallel processing are provided. The apparatus includes: an injector module that stores Root URL information in a first database; a generator module that receives the Root URL information from the first database, generates a collection target URL list, and stores the collection target URL list in a second database; a fetcher module that receives the collection target URL list from the second database, extracts content from the web page corresponding to the collection target URL list, and stores the content in the second database; a parsing module that receives the content from the second database, parses the content to generate parsing result information, and stores the parsing result information in the second database; a filter module that receives the parsing result information from the parsing module and determines whether the document type of the web page is HTML5; and a vulnerability analysis module that analyzes vulnerabilities in the HTML code included in the content only when the document type of the web page is HTML5. The vulnerability analysis module splits the content into a plurality of sub-contents, extracts keywords and attributes from the sub-contents, and computes the frequency of the keywords and the attributes to analyze the vulnerabilities of the content.

Description

Technical Field

The present invention relates to an apparatus and method for collecting and analyzing HTML5 documents based on distributed parallel processing.

With the advent of ubiquitous computing environments and the rapid growth of the user-centered Internet service market, the volume of data streams to be processed is increasing quickly, and the types of data streams are becoming more diverse. Accordingly, research on distributed parallel processing of data streams, aimed at providing real-time analysis and processing services for large-volume data streams, is actively under way.

Korean Patent Laid-Open Publication No. 2013-0095910 discloses an apparatus and method for managing a data stream distributed parallel processing service.

An object of the present invention is to provide an apparatus for collecting and analyzing HTML5 documents based on distributed parallel processing that collects large volumes of HTML5 web documents and analyzes HTML5 security-vulnerable tags and attributes using distributed parallel processing, and that in particular can analyze HTML5 security-vulnerable tags and attributes targeted by jacking or cross-site scripting (XSS) attacks.

Another object of the present invention is to provide a method for collecting and analyzing HTML5 documents based on distributed parallel processing that collects large volumes of HTML5 web documents and analyzes HTML5 security-vulnerable tags and attributes using distributed parallel processing, and that in particular can analyze HTML5 security-vulnerable tags and attributes targeted by jacking or cross-site scripting attacks.

The objects of the present invention are not limited to those mentioned above, and other objects not mentioned will be clearly understood by those skilled in the art from the following description.

One embodiment of the apparatus for collecting and analyzing HTML5 documents based on distributed parallel processing according to the present invention includes: an injector module that stores Root URL information in a first database; a generator module that receives the Root URL information from the first database, generates a collection target URL list, and stores the collection target URL list in a second database; a fetcher module that receives the collection target URL list from the second database, extracts content from the web page corresponding to the collection target URL list, and stores the content in the second database; a parsing module that receives the content from the second database, parses the content to generate parsing result information, and stores the parsing result information in the second database; a filter module that receives the parsing result information from the parsing module and determines whether the document type of the web page is HTML5; and a vulnerability analysis module that analyzes vulnerabilities in the HTML code included in the content only when the document type of the web page is HTML5. The vulnerability analysis module splits the content into a plurality of sub-contents, extracts keywords and attributes from the sub-contents, and computes the frequency of the keywords and the attributes to analyze the vulnerabilities of the content.

In some embodiments of the present invention, the vulnerability analysis module may extract the keywords and the attributes by arranging the tags included in the sub-contents in a tree structure.

In some embodiments of the present invention, the apparatus may further include an updater module that receives the parsing result information from the second database and updates the information stored in the first database.

In some embodiments of the present invention, the fetcher module may generate content collection information about the content and further store the content collection information in the second database.

In some embodiments of the present invention, the updater module may receive the content collection information from the second database and update the information stored in the first database.

In some embodiments of the present invention, the first database converts the Root URL information into a first format and stores it, and the first format may include information on a URL, a collection status, a collection time, the number of retries after collection, and a document format.

In some embodiments of the present invention, the second database converts the content into a second format and stores it, and the second format may include the information included in the first format and the HTML content of the web page.

In some embodiments of the present invention, the second database may further store outlink addresses in the form obtained by parsing the content, and the outlinks stored in units of text lines.

In some embodiments of the present invention, the apparatus may further include a third database that stores information about the vulnerabilities.

In some embodiments of the present invention, the Root URL information may be the main URL information of a web page that contains the collection target URLs.

Another embodiment of the apparatus for collecting and analyzing HTML5 documents based on distributed parallel processing according to the present invention includes: a database; an injector module that extracts Root URL information of a first web page and stores it in the database; a generator module that receives the Root URL information, generates a collection target URL list, and stores the collection target URL list in the database; a fetcher module that receives the collection target URL list, extracts content from the corresponding second web page, and stores the content in the database; a parsing module that receives and parses the content, generates parsing result information, and stores the parsing result information in the database; a filter module that receives the parsing result information and determines whether the document type of the second web page is HTML5; and a vulnerability analysis module that analyzes vulnerabilities in the HTML code included in the content only when the document type of the second web page is HTML5. The vulnerability analysis module splits the content into a plurality of sub-contents, extracts keywords and attributes from the sub-contents, and computes the frequency of the keywords and the attributes to analyze the vulnerabilities of the content.

In some embodiments of the present invention, the fetcher module may generate content collection information about the content and further store the content collection information in the database.

In some embodiments of the present invention, the database converts the Root URL information into a first format and stores it, and the first format may include information on a URL, a collection status, a collection time, the number of retries after collection, and a document format.

In some embodiments of the present invention, the database converts the content into a second format and stores it, and the second format may include the information included in the first format and the HTML content of the second web page.

In some embodiments of the present invention, the database may further store outlink addresses in the form obtained by parsing the content, and the outlinks stored in units of text lines.

In some embodiments of the present invention, the Root URL information may be the main URL information of the first web page that contains the collection target URLs.

One embodiment of the method for collecting and analyzing HTML5 documents based on distributed parallel processing according to the present invention includes: generating a collection target URL list based on Root URL information; extracting content from the web page corresponding to the collection target URL list; parsing the content to generate parsing result information; determining, based on the parsing result information, whether the document type of the web page is HTML5; and analyzing vulnerabilities in the HTML code included in the content only when the document type of the web page is HTML5. The analysis splits the content into a plurality of sub-contents, extracts keywords and attributes from the sub-contents, and computes the frequency of the keywords and the attributes to analyze the vulnerabilities of the content.

In some embodiments of the present invention, analyzing the vulnerabilities may include extracting the keywords and the attributes by arranging the tags included in the sub-contents in a tree structure.

In some embodiments of the present invention, the method may further include storing the Root URL information in a database.

In some embodiments of the present invention, the method may further include storing the collection target URL list and the content in the database.

In some embodiments of the present invention, the Root URL information may be the main URL information of a web page that contains the collection target URLs.

Other specific details of the invention are included in the detailed description and drawings.

According to the apparatus and method for collecting and analyzing HTML5 documents based on distributed parallel processing of the present invention, large volumes of HTML5 web documents can be collected and analyzed using distributed parallel processing, and in particular HTML5 security-vulnerable tags and attributes can be analyzed. According to the present invention, HTML5 security-vulnerable tags and attributes can be analyzed with jacking or cross-site scripting (XSS) attacks as the target.

FIG. 1 is a block diagram of an apparatus for collecting and analyzing HTML5 documents based on distributed parallel processing according to an embodiment of the present invention.
FIG. 2 shows a document format for collecting HTML5 documents.
FIG. 3 is a diagram for explaining an analysis method based on distributed parallel processing.
FIG. 4 is a diagram showing a tree structure for analyzing HTML5 security-vulnerable tags and attributes.
FIG. 5 is a diagram showing, by way of example, a tree structure in which attributes can be looked up for the input tag among the HTML5 security-vulnerable tags.
FIGS. 6 and 7 are tables showing HTML5 security-vulnerable tags and attributes.
FIG. 8 is a diagram for explaining a segment, which is a logical storage unit.
FIG. 9 is a block diagram of an apparatus for collecting and analyzing HTML5 documents based on distributed parallel processing according to another embodiment of the present invention.
FIG. 10 is a block diagram of an apparatus for collecting and analyzing HTML5 documents based on distributed parallel processing according to yet another embodiment of the present invention.
FIG. 11 is a block diagram of an apparatus for collecting and analyzing HTML5 documents based on distributed parallel processing according to yet another embodiment of the present invention.
FIG. 12 is a flowchart sequentially illustrating a method for collecting and analyzing HTML5 documents based on distributed parallel processing according to an embodiment of the present invention.

The advantages and features of the present invention, and the methods of achieving them, will become apparent with reference to the embodiments described in detail below together with the accompanying drawings. However, the present invention is not limited to the embodiments disclosed below and may be implemented in various different forms. These embodiments are provided only so that the disclosure of the present invention is complete and so that those of ordinary skill in the art to which the invention pertains are fully informed of the scope of the invention, which is defined only by the scope of the claims. Like reference numerals refer to like elements throughout the specification.

Each block may represent a module, a segment, or a portion of code that includes one or more executable instructions for performing the specified logical function(s). It should also be noted that in some alternative implementations the functions mentioned in the blocks may occur out of order. For example, two blocks shown in succession may in fact be executed substantially concurrently, or the blocks may sometimes be executed in reverse order, depending on the functions involved.

Although the terms first, second, and so on are used to describe various elements, components, and/or sections, these elements, components, and/or sections are not limited by these terms. The terms are used only to distinguish one element, component, or section from another. Therefore, a first element, first component, or first section mentioned below may also be a second element, second component, or second section within the technical spirit of the present invention.

The terminology used herein is for the purpose of describing the embodiments and is not intended to limit the present invention. In this specification, the singular form also includes the plural form unless the text specifically states otherwise. As used herein, "comprises" and/or "comprising" do not exclude the presence or addition of one or more components, steps, operations, and/or elements other than those mentioned.

Unless otherwise defined, all terms used herein (including technical and scientific terms) have the meaning commonly understood by those of ordinary skill in the art to which the present invention pertains. Terms defined in commonly used dictionaries are not to be interpreted in an idealized or excessive manner unless they are explicitly and specifically defined.

FIG. 1 is a block diagram of an apparatus for collecting and analyzing HTML5 documents based on distributed parallel processing according to an embodiment of the present invention. FIG. 2 shows a document format for collecting HTML5 documents. FIG. 3 is a diagram for explaining an analysis method based on distributed parallel processing. FIG. 4 is a diagram showing a tree structure for analyzing HTML5 security-vulnerable tags and attributes. FIG. 5 is a diagram showing, by way of example, a tree structure in which attributes can be looked up for the input tag among the HTML5 security-vulnerable tags. FIGS. 6 and 7 are tables showing HTML5 security-vulnerable tags and attributes. FIG. 8 is a diagram for explaining a segment, which is a logical storage unit.

Referring to FIG. 1, an apparatus 1 for collecting and analyzing HTML5 documents based on distributed parallel processing according to an embodiment of the present invention includes an injector module 100, a first database DB1, a generator module 110, a second database DB2, a fetcher module 120, a parsing module 130, a filter module 140, and a vulnerability analysis module 150.

A jacking attack is an attack technique that induces the user to click on another web document planted by the attacker, regardless of the user's intention. The attacker lures the user into clicking an iFrame that contains a malicious web document. In earlier HTML documents, the main client-side defense was frame busting, which prevents web content from being loaded inside an iFrame. However, frame busting is a defense that depends on scripts, and in HTML5 documents the addition of the iFrame sandbox attribute made it possible to disable such frame-busting scripts inside the iFrame.

A cross-site scripting (XSS) attack is an attack technique in which an attacker inserts a malicious script into a web document to steal the user's information or to make the web document behave abnormally. In earlier HTML documents, inserting a malicious script into an input tag or the like required the user to first enter an input value. In HTML5 documents, however, attributes such as autofocus were added, so the attacker's malicious script can run automatically without any input action by the user.

The apparatus 1 collects the HTML5 documents from among many HTML documents and analyzes their security vulnerabilities using distributed parallel processing, which can improve the analysis processing speed. In particular, the apparatus 1 may target jacking attacks and tag-based cross-site scripting attacks, which are security vulnerabilities of HTML5 documents. However, the present invention is not limited thereto.

The injector module 100 stores Root URL information (RUI) in the first database DB1. Specifically, the injector module 100 may convert the Root URL information (RUI) into a first format F1 and store it in the first database DB1. The first format F1 may include information on the URL of the web document to be collected, the collection status, the collection time, the number of retries after collection, and the document format. The form stored in the first database DB1 may be expressed as <URL, F1>. The Root URL is the main URL of a web page that contains collection target URLs (CTU). The collection target URLs (CTU) are the multiple URLs contained in a single web page, that is, the URLs of the web pages it hyperlinks to.
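As a rough illustration of the <URL, F1> record described above, the first format could be modeled as a small data class. This is a minimal sketch under stated assumptions, not the patent's implementation: the field names, the "unfetched" status value, and the `inject` helper are all hypothetical, and a plain dictionary stands in for the first database DB1.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class F1Record:
    """Hypothetical model of the first format (F1); field names are assumed."""
    url: str
    status: str = "unfetched"            # collection status (assumed value set)
    fetch_time: Optional[float] = None   # collection time, seconds since epoch
    retries: int = 0                     # number of retries
    doc_type: Optional[str] = None       # document format, e.g. "text/html"


def inject(db1: Dict[str, F1Record], root_url: str) -> F1Record:
    """Store a Root URL in DB1 as a <URL, F1> pair, keeping any existing entry."""
    return db1.setdefault(root_url, F1Record(url=root_url))


# A plain dict stands in for DB1 in this sketch.
db1: Dict[str, F1Record] = {}
record = inject(db1, "http://example.com/")
```

Keying the store by URL means injecting the same Root URL twice leaves the existing record untouched, which is one simple way to avoid resetting its collection status.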

The generator module 110 receives the Root URL information (RUI) from the first database DB1, generates a collection target URL list (CTUL), and stores the collection target URL list (CTUL) in the second database DB2. Specifically, the generator module 110 may generate the collection target URL list (CTUL) grouped by host and store it in the second database DB2. That is, when the generator module 110 is given a single piece of Root URL information (RUI), it may generate a single collection target URL list (CTUL).
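The host-based grouping can be sketched as follows. This is only one reading of the step, with assumptions labeled: `generate_fetch_lists` is a hypothetical name, the host is taken from `urllib.parse.urlsplit`, and strings without a host are simply skipped.

```python
from collections import defaultdict
from typing import Dict, Iterable, List
from urllib.parse import urlsplit


def generate_fetch_lists(urls: Iterable[str]) -> Dict[str, List[str]]:
    """Group candidate URLs by host, dropping duplicates and host-less strings."""
    by_host: Dict[str, List[str]] = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname
        if host is None:
            continue  # not an absolute URL; ignored in this sketch
        if url not in by_host[host]:
            by_host[host].append(url)
    return dict(by_host)


lists = generate_fetch_lists([
    "http://a.example/1",
    "http://a.example/2",
    "http://b.example/x",
    "http://a.example/1",   # duplicate
    "not a url",            # no host
])
```

Grouping by host is a common crawler design choice because it lets each worker apply per-host politeness limits; the patent text itself only states that the list is divided by host.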

The fetcher module 120 receives the collection target URL list (CTUL) from the second database DB2, extracts content C from the web pages corresponding to the collection target URL list (CTUL), and stores the content C in the second database DB2. Specifically, the fetcher module 120 may visit the web pages corresponding to the collection target URL list (CTUL), collect the contents of the documents, and store them in the second database DB2 in the form <URL, F1>, <URL, C>. The content C is the HTML content of a document collected by visiting a web page corresponding to the collection target URL list (CTUL).
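A fetch loop that records both the collection state and the page content might look like the sketch below. Everything here is an assumption made for illustration: the dictionary layout with "F1" and "C" keys, the status strings, and the injected `fetch` callable (used so the example runs without network access) are hypothetical and not taken from the patent.

```python
import time
from typing import Callable, Dict, Iterable


def fetch_all(urls: Iterable[str],
              fetch: Callable[[str], str],
              db2: Dict[str, dict]) -> None:
    """Visit each URL and store <URL, F1> and <URL, C> entries in db2."""
    for url in urls:
        entry = db2.setdefault(
            url, {"F1": {"status": "unfetched", "retries": 0, "fetch_time": None}}
        )
        f1 = entry["F1"]
        try:
            entry["C"] = fetch(url)       # C: raw HTML of the page
        except OSError:
            f1["status"] = "failed"
            f1["retries"] += 1            # count the failed attempt
            continue
        f1["status"] = "fetched"
        f1["fetch_time"] = time.time()


def fake_fetch(url: str) -> str:
    """Stand-in for a real HTTP client so the sketch is self-contained."""
    if "down" in url:
        raise OSError("unreachable")
    return "<!DOCTYPE html><title>ok</title>"


db2: Dict[str, dict] = {}
fetch_all(["http://example.com/", "http://down.example/"], fake_fetch, db2)
```

In a real crawler the `fetch` argument could wrap `urllib.request.urlopen`, whose `URLError` is a subclass of `OSError`, so the same error handling would apply.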

The parsing module 130 receives the content C from the second database DB2, parses the content C to generate parsing result information (PRI), and stores the parsing result information (PRI) in the second database DB2. Specifically, the parsing module 130 parses the <URL, C> stored in the second database DB2 to extract outlinks. The parsing module 130 may store the extracted outlinks in the second database DB2 in the form <URL, F1>, <URL, PD>, <URL, PT>. PD (ParseData) is the outlink addresses in parsed form, and PT (ParseText) is the outlinks stored in units of text lines.
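Outlink extraction can be sketched with Python's standard `html.parser` module. The mapping used here is an assumption: PD is represented as a list of absolute outlink URLs and PT as the same links joined one per line. The names `OutlinkParser` and `parse_outlinks` are hypothetical.

```python
from html.parser import HTMLParser
from typing import List, Tuple
from urllib.parse import urljoin


class OutlinkParser(HTMLParser):
    """Collect the href of every <a> tag, resolved against the page URL."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.outlinks: List[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        for name, value in attrs:
            if name == "href" and value:
                self.outlinks.append(urljoin(self.base_url, value))


def parse_outlinks(base_url: str, html: str) -> Tuple[List[str], str]:
    """Return (PD, PT): outlink addresses and the same links one per line."""
    parser = OutlinkParser(base_url)
    parser.feed(html)
    parser.close()
    parse_data = parser.outlinks
    parse_text = "\n".join(parse_data)
    return parse_data, parse_text


pd, pt = parse_outlinks(
    "http://example.com/index.html",
    '<a href="/about">About</a> <a href="http://other.example/">Other</a>',
)
```

Resolving relative links with `urljoin` keeps every stored outlink absolute, so later stages can fetch them without knowing which page they came from.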

The filter module 140 receives the parsing result information (PRI) from the parsing module 130 and determines whether the document type of the web page corresponding to the collection target URL list (CTUL) is HTML5. Specifically, referring to FIG. 2, earlier HTML versions were identified by the dtd file they reference, whereas HTML5 has no dtd file and is declared only with <!DOCTYPE html>. Accordingly, the filter module 140 checks whether the HTML document is declared with <!DOCTYPE html> to determine whether the document type of the web page corresponding to the collection target URL list (CTUL) is HTML5. When the document type of that web page is HTML5, the filter module 140 sets the document format of the parsing result information (PRI) to content-type:text/html5 and transmits it to the vulnerability analysis module 150.
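The doctype check described above can be approximated with a regular expression. This is a simplified sketch: the function name `is_html5` is hypothetical, the match is case-insensitive and tolerates leading whitespace, and it deliberately ignores edge cases such as comments before the doctype or the legacy-compat form of the HTML5 doctype.

```python
import re

# Matches a bare "<!DOCTYPE html>" at the start of the document.
_HTML5_DOCTYPE = re.compile(r"\A\s*<!DOCTYPE\s+html\s*>", re.IGNORECASE)


def is_html5(html: str) -> bool:
    """Return True when the document opens with the bare HTML5 doctype."""
    return _HTML5_DOCTYPE.match(html) is not None


html5_page = "<!DOCTYPE html>\n<html><body></body></html>"
html4_page = (
    '<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" '
    '"http://www.w3.org/TR/html4/strict.dtd">\n<html></html>'
)
```

Because the HTML 4.01 doctype carries a public identifier and a dtd URL before the closing bracket, it fails the pattern, which mirrors the distinction the filter module relies on.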

취약점 분석 모듈(150)은 파싱 결과 정보(PRI)의 문서 형식이 content-type:text/html5인 경우에, 컨텐츠(C)에 포함된 HTML 코드의 취약점을 분석한다. 구체적으로, 취약점 분석 모듈(150)은 컨텐츠(C)를 복수의 서브 컨텐츠(C_S)로 스플릿(split)하고, 서브 컨텐츠(C_S)에 대해 키워드(keyword; key)와 속성(attribute; att)을 추출하고, 키워드(key)와 속성(att)의 빈도수를 연산할 수 있다. The vulnerability analysis module 150 analyzes the vulnerability of the HTML code included in the content C when the document format of the parsing result information PRI is content-type: text / html5. Specifically, the vulnerability analysis module 150 splits the content C into a plurality of sub-content C_S and stores a keyword key and an attribute att in the sub-content C_S , And the frequency of the keyword (key) and the attribute (att) can be calculated.

FIG. 3 illustrates how the vulnerability analysis module 150 analyzes the vulnerability of the HTML code included in the content C. In FIG. 3, a shows <URL, C>, an HTML5 document containing the content C to be analyzed; a contains the HTML5 code of the document under analysis as is. This is split into a plurality of sub-contents C_S, and b is generated using a mapper. b contains the URL of the document in which a vulnerability was found, the vulnerability name, and the vulnerability location. c is a reducer, which sorts b by a specific keyword (key). d is an output file containing the sorted URLs, the sorted vulnerability names, and the sorted vulnerability locations.
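The a-to-d flow of FIG. 3 can be sketched in plain Python under some simplifying assumptions: the content is split into one sub-content per line, the vulnerability location is a line number, and a vulnerability is detected by a simple substring match. The watch list `SUSPECT_ATTRIBUTES` and the function names are hypothetical, and a real deployment would run the mapper and reducer on a distributed framework rather than in a single process.

```python
from typing import Iterable, Iterator

# Hypothetical watch list for illustration only.
SUSPECT_ATTRIBUTES = ("autofocus", "onfocus")


def split_content(content: str) -> list:
    """Split content C into sub-contents C_S: (line number, line text) pairs."""
    return list(enumerate(content.splitlines(), start=1))


def mapper(url: str, sub_contents: Iterable) -> Iterator[tuple]:
    """Emit (URL, vulnerability name, location) for each suspect attribute."""
    for line_no, text in sub_contents:
        lowered = text.lower()
        for name in SUSPECT_ATTRIBUTES:
            if name in lowered:
                yield (url, name, line_no)


def reducer(records: Iterable, key_index: int = 1) -> list:
    """Sort mapper output by the chosen key (vulnerability name by default)."""
    return sorted(records, key=lambda record: (record[key_index], record))


page = '<form>\n<input onfocus="x()" autofocus>\n</form>'
output = reducer(mapper("http://example.com/", split_content(page)))
```

Here `output` plays the role of the output file d: a list of records ordered by vulnerability name.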

In particular, the vulnerability analysis module 150 may arrange the tags included in the sub-contents C_S into a tree structure to extract the keywords (key) and attributes (att). FIG. 4 shows an example of arranging the tags included in the sub-contents C_S into a tree structure. For vulnerability analysis, the tags included in the sub-contents C_S may be divided into HTML5 vulnerable tags and attributes and JavaScript vulnerable tags before analysis is performed. FIG. 5 shows, by way of example, the extraction of the autofocus and onfocus attributes of an input tag from the tree structure.
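One way to picture the tree-based extraction is the following sketch, which uses Python's standard `html.parser.HTMLParser` to arrange tags into a tree and then walks the tree to collect the autofocus and onfocus attributes of input tags. The `Node`, `TreeBuilder`, `build_tree`, and `extract` names are assumptions for illustration, and the handling of malformed markup is deliberately simplistic.

```python
from html.parser import HTMLParser

# Void elements have no end tag, so they never become parents in the tree.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}


class Node:
    def __init__(self, tag, attrs=None):
        self.tag = tag
        self.attrs = dict(attrs or [])
        self.children = []


class TreeBuilder(HTMLParser):
    def __init__(self):
        super().__init__()
        self.root = Node("#root")
        self._stack = [self.root]

    def handle_starttag(self, tag, attrs):
        node = Node(tag, attrs)
        self._stack[-1].children.append(node)
        if tag not in VOID_TAGS:
            self._stack.append(node)

    def handle_endtag(self, tag):
        # Pop back to the nearest matching open tag; ignore stray end tags.
        for i in range(len(self._stack) - 1, 0, -1):
            if self._stack[i].tag == tag:
                del self._stack[i:]
                break


def build_tree(html):
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    return builder.root


def extract(node, tag, wanted):
    """Walk the tree and collect (tag, attribute) pairs of interest."""
    found = []
    if node.tag == tag:
        found += [(tag, name) for name in node.attrs if name in wanted]
    for child in node.children:
        found += extract(child, tag, wanted)
    return found


tree = build_tree('<form><input type="text" autofocus onfocus="alert(1)"><p>hi</p></form>')
pairs = extract(tree, "input", {"autofocus", "onfocus"})
```

The resulting `pairs` list is the kind of keyword and attribute data on which a frequency count could then be computed.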

FIG. 6 shows attributes and tags that may be used in jacking attacks, and FIG. 7 shows attributes and tags that may be used in cross-site scripting attacks. By extracting such attributes and tags from the tree structure, potentially exploitable tags and attributes contained in a web page can be found, and the distributed parallel processing scheme described above can improve search accuracy while increasing processing speed.

When the HTML code of web pages is collected and parsed to identify the links to other web pages present in each page, it is difficult for a single node to process both the page contents and the link information. Accordingly, web pages are stored and parsed on the basis of distributed parallel processing, and the second database DB2 may store the content C in logical storage units called segments. A segment may store the contents of a web page and the link information within the web page separately (see FIG. 8).

Hereinafter, HTML5 document collection and analysis apparatuses based on distributed parallel processing according to other embodiments of the present invention will be described.

FIG. 9 is a block diagram of an HTML5 document collection and analysis apparatus based on distributed parallel processing according to another embodiment of the present invention. For convenience of description, descriptions of parts substantially identical to those of the HTML5 document collection and analysis apparatus based on distributed parallel processing according to an embodiment of the present invention are omitted.

Referring to FIG. 9, the HTML5 document collection and analysis apparatus 2 based on distributed parallel processing according to another embodiment of the present invention includes an injector module 100, a first database DB1, a generator module 110, a second database DB2, a fetcher module 120, a parsing module 130, a filter module 140, a vulnerability analysis module 150, and an updater module 160.

The injector module 100, the first database DB1, the generator module 110, the second database DB2, the fetcher module 120, the parsing module 130, the filter module 140, and the vulnerability analysis module 150 are substantially the same as described above.

The updater module 160 receives the parsing result information PRI from the second database DB2 and updates the information stored in the first database DB1. Referring to the parsing result information PRI stored in the second database DB2, the updater module 160 updates the information included in the first format F1 stored in the first database DB1.

In addition, the fetcher module 120 may generate content collection information CCI about the content C and store the content collection information CCI in the second database DB2. The content collection information CCI may include the time at which the content C was collected, location information, and the like.

The updater module 160 may receive the content collection information CCI from the second database DB2 and update the information included in the first format F1 stored in the first database DB1.
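The update performed by the updater module can be pictured as merging fields from the content collection information and the parsing result into a first-format record. The sketch below is illustrative only: the `FirstFormat` field names, the status values, and the dictionary keys are assumptions, loosely following the fields the claims list for the first format (URL, collection status, collection time, number of retries, document format).

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class FirstFormat:
    """Hypothetical layout of a first-format (F1) record."""
    url: str
    status: str = "unfetched"
    fetch_time: Optional[str] = None
    retries: int = 0
    document_format: Optional[str] = None


def update_record(record: FirstFormat, collection_info: dict,
                  parse_result: dict) -> FirstFormat:
    """Return a new record refreshed from collection and parsing information."""
    return replace(
        record,
        status="fetched",
        fetch_time=collection_info.get("fetch_time", record.fetch_time),
        document_format=parse_result.get("content-type", record.document_format),
    )


old = FirstFormat(url="http://example.com/")
new = update_record(old, {"fetch_time": "2015-01-21T00:00:00Z"},
                    {"content-type": "text/html5"})
```

Returning a new record rather than mutating the old one keeps the update easy to reason about when many workers process records in parallel.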

FIG. 10 is a block diagram of an HTML5 document collection and analysis apparatus based on distributed parallel processing according to yet another embodiment of the present invention. For convenience of description, descriptions of parts substantially identical to those of the HTML5 document collection and analysis apparatus based on distributed parallel processing according to an embodiment of the present invention are omitted.

Referring to FIG. 10, the HTML5 document collection and analysis apparatus 3 based on distributed parallel processing according to yet another embodiment of the present invention includes an injector module 100, a first database DB1, a generator module 110, a second database DB2, a fetcher module 120, a parsing module 130, a filter module 140, a vulnerability analysis module 150, and a third database DB3.

The third database DB3 stores vulnerability information VI. Specifically, the vulnerability analysis module 150 may generate the result of analyzing the vulnerability of the HTML code included in the content C as vulnerability information VI and store it in the third database DB3.

The vulnerability information VI may include information on HTML5 security-vulnerable tags and attributes, and when new HTML5 security-vulnerable tags and attributes are detected, the vulnerability information VI stored in the third database DB3 may be updated.

FIG. 11 is a block diagram of an HTML5 document collection and analysis apparatus based on distributed parallel processing according to yet another embodiment of the present invention. For convenience of description, descriptions of parts substantially identical to those of the HTML5 document collection and analysis apparatus based on distributed parallel processing according to an embodiment of the present invention are omitted.

Referring to FIG. 11, the HTML5 document collection and analysis apparatus 4 based on distributed parallel processing according to yet another embodiment of the present invention includes an injector module 100, a first database DB1, a generator module 110, a second database DB2, a fetcher module 120, a parsing module 130, a filter module 140, a vulnerability analysis module 150, an updater module 160, and a third database DB3.

The injector module 100, the first database DB1, the generator module 110, the second database DB2, the fetcher module 120, the parsing module 130, the filter module 140, the vulnerability analysis module 150, the updater module 160, and the third database DB3 are substantially the same as described above.

Hereinafter, an HTML5 document collection and analysis method based on distributed parallel processing according to an embodiment of the present invention will be described.

FIG. 12 is a flowchart sequentially illustrating an HTML5 document collection and analysis method based on distributed parallel processing according to an embodiment of the present invention.

Referring to FIG. 12, in the HTML5 document collection and analysis method based on distributed parallel processing according to an embodiment of the present invention, a collection target URL list CTUL is first generated based on Root URL information RUI (S100). The collection target URL list CTUL may be generated grouped by host and stored in a database. When a single piece of Root URL information RUI is provided, a single collection target URL list CTUL may be generated.
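Grouping the collection targets by host in step S100 might look like the following sketch, which uses the standard `urllib.parse.urlsplit` function to read the host of each URL. The function name `generate_collection_lists` and the choice to drop duplicate URLs are assumptions made for illustration.

```python
from collections import defaultdict
from urllib.parse import urlsplit


def generate_collection_lists(urls):
    """Group candidate URLs by host, dropping duplicates but keeping order."""
    by_host = defaultdict(list)
    for url in urls:
        host = urlsplit(url).hostname or ""
        if url not in by_host[host]:
            by_host[host].append(url)
    return dict(by_host)


lists = generate_collection_lists([
    "http://a.example/x",
    "http://b.example/",
    "http://a.example/y",
    "http://a.example/x",
])
```

Keeping one list per host makes it straightforward to hand each host's URLs to a separate worker.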

Next, content C is extracted from the web pages corresponding to the collection target URL list CTUL (S110). The web pages corresponding to the collection target URL list CTUL are visited to collect the contents of the documents, which may be stored in the database in the form <URL, F1>, <URL, C>.

Next, the content C is parsed to generate parsing result information PRI (S120). The content C refers to the HTML content of a document collected by visiting a web page corresponding to the collection target URL list CTUL.

Next, based on the parsing result information PRI, it is determined whether the document type of the web page is HTML5 (S130). The <URL, C> stored in the database is parsed to extract outlinks. The extracted outlinks may be stored in the database in the form <URL, F1>, <URL, PD>, <URL, PT>. PD (ParseData) denotes the outlink addresses in parsed form, and PT (ParseText) denotes the outlinks stored in units of text lines.
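A minimal sketch of the outlink extraction, assuming outlinks are the href values of anchor tags and that relative links are resolved against the page URL, could be written with the standard `html.parser` and `urllib.parse` modules as follows. The `OutlinkParser` class, the `parse_outlinks` function, and the dictionary keys used for PD and PT are hypothetical.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class OutlinkParser(HTMLParser):
    """Collect absolute outlink addresses from <a href="..."> tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.outlinks = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.outlinks.append(urljoin(self.base_url, href))


def parse_outlinks(url, content):
    """Return PD (parsed outlink addresses) and PT (one outlink per text line)."""
    parser = OutlinkParser(url)
    parser.feed(content)
    parser.close()
    return {"PD": parser.outlinks, "PT": "\n".join(parser.outlinks)}


result = parse_outlinks(
    "http://example.com/a/",
    '<a href="b.html">b</a> <a href="http://other.example/">o</a><a name="x">n</a>',
)
```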

Next, when the document type of the web page is HTML5, the vulnerability of the HTML code included in the content C is analyzed (S140). In analyzing the vulnerability of the HTML code included in the content C, the content C may be split into a plurality of sub-contents C_S, keywords (key) and attributes (att) may be extracted from the sub-contents C_S, and the frequency of the keywords (key) and attributes (att) may be calculated to analyze the vulnerability of the content C. Here, the tags included in the sub-contents C_S may be arranged in a tree structure to extract the keywords (key) and attributes (att).
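The frequency calculation in step S140 can be illustrated with Python's `collections.Counter`, assuming each sub-content has already yielded a list of (keyword, attribute) pairs. Counting each sub-content separately and then merging the partial counts mirrors how the work could be divided among parallel workers; the function names are hypothetical.

```python
from collections import Counter


def count_pairs(pairs):
    """Count (keyword, attribute) pairs extracted from one sub-content."""
    return Counter(pairs)


def merge_counts(partials):
    """Combine the per-sub-content counts into totals for the whole content."""
    total = Counter()
    for partial in partials:
        total.update(partial)
    return total


partials = [
    count_pairs([("input", "autofocus"), ("input", "onfocus")]),
    count_pairs([("input", "onfocus")]),
]
total = merge_counts(partials)
```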

The steps of a method or algorithm described in connection with the embodiments of the present invention may be implemented directly in a hardware module, in a software module executed by a processor, or in a combination of the two. A software module may reside in RAM memory, flash memory, ROM memory, EPROM memory, EEPROM memory, registers, a hard disk, a removable disk, a CD-ROM, or any other form of computer-readable recording medium well known in the art. An exemplary recording medium is coupled to the processor such that the processor can read information from, and write information to, the recording medium. Alternatively, the recording medium may be integral to the processor. The processor and the recording medium may reside in an application-specific integrated circuit (ASIC). The ASIC may reside in a user terminal. Alternatively, the processor and the recording medium may reside as discrete components in a user terminal.

Although embodiments of the present invention have been described above with reference to the accompanying drawings, those of ordinary skill in the art to which the present invention pertains will understand that the present invention may be embodied in other specific forms without changing its technical spirit or essential features. Therefore, the embodiments described above should be understood to be illustrative in all respects and not restrictive.

100: Injector module 110: Generator module
120: Fetcher module 130: Parsing module
140: Filter module 150: Vulnerability analysis module

Claims

An HTML5 document collection and analysis apparatus based on distributed parallel processing, comprising:
an injector module for storing Root URL information in a first database;
a generator module for receiving the Root URL information from the first database to generate a collection target URL list and storing the collection target URL list in a second database;
a fetcher module for receiving the collection target URL list from the second database, extracting content from a web page corresponding to the collection target URL list, and storing the content in the second database;
a parsing module for receiving the content from the second database, parsing the content to generate parsing result information, and storing the parsing result information in the second database;
a filter module for receiving the parsing result information from the parsing module and determining whether the document type of the web page is HTML5; and
a vulnerability analysis module for analyzing a vulnerability of the HTML code included in the content only when the document type of the web page is HTML5,
wherein the vulnerability analysis module splits the content into a plurality of sub-contents, extracts keywords and attributes by arranging the tags included in the sub-contents in a tree structure, and calculates the frequency of the keywords and the attributes to analyze the vulnerability of the content.

(deleted)

The apparatus of claim 1,
further comprising an updater module for receiving the parsing result information from the second database and updating the information stored in the first database.

The apparatus of claim 3,
wherein the fetcher module generates content collection information about the content and further stores the content collection information in the second database.

The apparatus of claim 4,
wherein the updater module receives the content collection information from the second database and updates the information stored in the first database.

The apparatus of claim 1,
wherein the first database converts the Root URL information into a first format and stores it, and
wherein the first format includes information on a URL, a collection status, a collection time, a number of retries after collection, and a document format.

The apparatus of claim 6,
wherein the second database converts the content into a second format and stores it, and
wherein the second format includes the information included in the first format and the HTML content of the web page.

The apparatus of claim 7,
wherein the second database further stores outlink addresses in a form obtained by parsing the content and a form in which the outlinks are stored in units of text lines.

The apparatus of claim 1,
further comprising a third database for storing information on the vulnerability.

The apparatus of claim 1,
wherein the Root URL information is main URL information of a web page including a collection target URL.

An HTML5 document collection and analysis apparatus based on distributed parallel processing, comprising:
a database;
an injector module for extracting Root URL information of a first web page and storing the extracted Root URL information in the database;
a generator module for receiving the Root URL information to generate a collection target URL list and storing the collection target URL list in the database;
a fetcher module for receiving the collection target URL list, extracting content from a corresponding second web page, and storing the content in the database;
a parsing module for receiving and parsing the content to generate parsing result information and storing the parsing result information in the database;
a filter module for receiving the parsing result information and determining whether the document type of the second web page is HTML5; and
a vulnerability analysis module for analyzing a vulnerability of the HTML code included in the content only when the document type of the second web page is HTML5,
wherein the vulnerability analysis module splits the content into a plurality of sub-contents, extracts keywords and attributes by arranging the tags included in the sub-contents in a tree structure, and calculates the frequency of the keywords and the attributes to analyze the vulnerability of the content.

(deleted)

The apparatus of claim 11,
wherein the fetcher module generates content collection information about the content and further stores the content collection information in the database.

The apparatus of claim 11,
wherein the database converts the Root URL information into a first format and stores it, and
wherein the first format includes information on a URL, a collection status, a collection time, a number of retries after collection, and a document format.

The apparatus of claim 14,
wherein the database converts the content into a second format and stores it, and
wherein the second format includes the information included in the first format and the HTML content of the second web page.

The apparatus of claim 15,
wherein the database further stores outlink addresses in a form obtained by parsing the content and a form in which the outlinks are stored in units of text lines.

The apparatus of claim 11,
wherein the Root URL information is main URL information of the first web page including a collection target URL.

An HTML5 document collection and analysis method based on distributed parallel processing, comprising:
generating a collection target URL list based on Root URL information;
extracting content from a web page corresponding to the collection target URL list;
parsing the content to generate parsing result information;
determining whether the document type of the web page is HTML5 based on the parsing result information; and
analyzing a vulnerability of the HTML code included in the content only when the document type of the web page is HTML5,
wherein analyzing the vulnerability comprises splitting the content into a plurality of sub-contents, extracting keywords and attributes by arranging the tags included in the sub-contents in a tree structure, and calculating the frequency of the keywords and the attributes to analyze the vulnerability of the content.

(deleted)

The method of claim 18,
further comprising storing the Root URL information in a database.

The method of claim 20,
further comprising storing the collection target URL list and the content in the database.

The method of claim 18,
wherein the Root URL information is main URL information of a web page including a collection target URL.