KR101269428B1

KR101269428B1 - System and method for data distribution

Info

Publication number: KR101269428B1
Application number: KR1020120083209A
Authority: KR
Inventors: 김태홍; 최성필; 정창후; 엄정호; 정성재; 정한민
Original assignee: 한국과학기술정보연구원
Priority date: 2012-07-30
Filing date: 2012-07-30
Publication date: 2013-05-30
Anticipated expiration: 2032-07-30

Abstract

PURPOSE: A data distribution system and method are provided that minimize network IO time and join operations between nodes in a distributed parallel system, thereby shortening the response time of the whole system. CONSTITUTION: Data nodes (200a-200c) store data. A management node (100) analyzes the data to identify a type pattern. The management node determines the data node in which to store the data based on state information of the data nodes for which the type pattern is set, and distributes the data to that data node. When the data includes a plurality of type patterns, the management node assigns the data to a single data node. [Reference numerals] (100) Management node; (200a) Data node 1; (200b) Data node 2; (200c) Data node n; (AA) Data

Description

System and Method for data distribution

The present invention relates to a data distribution system and method, and more particularly to a data distribution system and method that analyzes input data to identify a type pattern, determines the data node in which to store the data based on state information of the data nodes for which that type pattern is set, and distributes and stores the data accordingly.

As the Internet has developed, netizens generate and circulate enormous amounts of data every day. Recently, for many companies, especially search engine companies and web portals, collecting and accumulating as much of this data as possible and extracting meaningful information from it as quickly as possible has become a source of competitiveness.

For this reason, many companies are building large clusters at low cost and actively researching large-scale distributed management and distributed parallel job processing technologies.

In other words, the value of large-volume data that is difficult to process on an existing single-machine system has come to the fore, and distributed parallel systems are being introduced and used in various fields as an alternative for processing it.

However, in a distributed parallel system that stores and processes data on many nodes, processing a single task incurs network IO between nodes and a large number of Join operations. The resulting load inevitably slows the overall system, which is an inherent obstacle to processing large volumes of data quickly.

The present invention has been devised to solve the above problem, and its object is to provide a data distribution system and method that can shorten the response time of the overall system by minimizing the network IO time and Join operations between the nodes of a distributed parallel system.

Another object of the present invention is to provide a data distribution system and method that can improve query processing speed by storing data in a distributed manner across data nodes, and that can ensure fault tolerance by generating data replicas.

Still another object of the present invention is to provide a data distribution system and method that can shorten the time taken by an entire task by minimizing network IO between data nodes.

According to one aspect of the present invention for achieving the above objects, there is provided a data distribution system including a plurality of data nodes that store data, and a management node that analyzes input data to identify a type pattern, determines the data node in which to store the data based on state information of the data nodes for which the type pattern is set, and distributes the data.

The state information of the data nodes may include overlapping storage information, the number of data nodes, and the storable capacity and type pattern of each data node.

The management node may allocate data that includes a plurality of type patterns to a single data node, and allocate data that includes a type pattern not yet distributed to an empty data node. In doing so, it may generate a replica on an adjacent data node according to preset overlapping storage information and repeat replica generation until the replica setting is satisfied, thereby distributing the data to adjacent data nodes.

According to another aspect of the present invention, there is provided a management node including: a data node information database in which information about connected data nodes is stored; a data analyzer that analyzes input data to identify its type pattern and size; a data node selector that searches the data node information database to identify the data nodes for which the identified type pattern is set, and selects the data node in which to store the data based on state information of the identified data nodes; and a data distributor that distributes the data to the selected data nodes.

The data node information database may store at least one of overlapping storage information, the number of data nodes, the type pattern of each data node, and the storable capacity of each data node.

The data node selector may select, from among the identified data nodes, data nodes whose storable capacity is greater than or equal to the size of the data. If no such data node exists, it may split the data into pieces of a predetermined size and select, from among the identified data nodes, data nodes whose storable capacity is greater than or equal to the size of the split data.

In addition, the data node selector may allocate data that includes a plurality of type patterns to a single data node, and allocate data that includes a type pattern not yet distributed to an empty data node. In doing so, it may generate a replica on an adjacent data node according to preset overlapping storage information and repeat replica generation until the replica setting is satisfied, thereby distributing the data to adjacent data nodes.

The management node may further include an updater that checks the state information of each data node in real time and updates the state information of each data node stored in the data node information database.

According to still another aspect of the present invention, there is provided a data distribution method in which a management node distributes data to, and stores data in, a plurality of data nodes, the method including: analyzing input data to identify its type pattern and size; searching a provided data node information database to identify the data nodes for which the identified type pattern is set; selecting the data node in which to store the data based on state information of the identified data nodes; and distributing the data to the selected data nodes.

The step of selecting the data node in which to store the data based on the state information of the identified data nodes may include selecting, from among the identified data nodes, data nodes whose storable capacity is greater than or equal to the size of the data or, if no such data node exists, splitting the data into pieces of a predetermined size and selecting, from among the identified data nodes, data nodes whose storable capacity is greater than or equal to the size of the split data.

In addition, the step of selecting the data node may allocate data that includes a plurality of type patterns to a single data node, and allocate data that includes a type pattern not yet distributed to an empty data node. In doing so, a replica may be generated on an adjacent data node according to preset overlapping storage information, and replica generation may be repeated until the replica setting is satisfied, thereby distributing the data to adjacent data nodes.

According to the present invention, the network IO time and Join operations between the nodes of a distributed parallel system can be minimized, shortening the response time of the overall system.

In addition, storing data in a distributed manner across data nodes can improve query processing speed, and generating data replicas can ensure fault tolerance.

In addition, minimizing network IO between data nodes can shorten the time taken by an entire task.

FIG. 1 is a diagram illustrating a data distribution system according to the present invention.
FIG. 2 is a block diagram schematically illustrating the configuration of a management node according to the present invention.
FIG. 3 is a flowchart illustrating a method by which a management node distributes data to a plurality of data nodes according to the present invention.

The above objects, the technical configuration, and the resulting operational effects of the present invention will be more clearly understood from the following detailed description, which refers to the drawings attached to this specification.

FIG. 1 is a diagram illustrating a data distribution system according to the present invention.

Referring to FIG. 1, the data distribution system includes a plurality of data nodes (200a, 200b, ..., 200n; hereinafter referred to as 200) that store data, and a management node 100 that distributes data to, and stores data in, each data node 200.

For each data node 200, the type pattern of the data to be stored is set in advance according to a preset distribution rule. Each data node 200 therefore stores the data that corresponds to the type pattern set for it.

The management node 100 analyzes the input data to identify a type pattern, determines the data node 200 in which to store the data based on state information of the data nodes for which the same type pattern is set, and distributes the data to it. Here, the state information of the data nodes may include overlapping storage information, the number of data nodes, and the storable capacity and type pattern of each data node.

When the input data includes a plurality of type patterns, the management node 100 allocates it to a single data node; when the data includes a type pattern not yet distributed, the management node 100 allocates it to an empty data node. In doing so, the management node 100 generates a replica on an adjacent data node according to preset overlapping storage information and repeats replica generation until the replica setting is satisfied, thereby storing the data redundantly on adjacent data nodes.

The management node 100 is described in detail below with reference to FIG. 2.

FIG. 2 is a block diagram schematically illustrating the configuration of a management node according to the present invention.

Referring to FIG. 2, the management node 100 includes a data analyzer 110, a data node selector 120, a data node information database 130, a distribution rule database 140, and a data distributor 150.

The data node information database 130 stores the state information of each data node. That is, the data node information database 130 stores overlapping storage information, the number of data nodes, and the type pattern and storable capacity of each data node. Here, the overlapping storage information may denote the number of times data is stored redundantly. For example, when the overlapping storage information is set to 3, the management node 100 causes the input data to be stored redundantly on three data nodes.

The distribution rule database 140 stores a distribution rule predetermined for each service. When distributed parallel data nodes are used, the distribution rule may be a rule for storing related data together on an independent node.

The distribution rule stored in the distribution rule database 140 is obtained by analyzing the query sets contained in each service's query list to extract patterns, generating a per-query pattern set configuration file based on those patterns, and adopting that per-query pattern set configuration file as the distribution rule of the service.

The data analyzer 110 analyzes the input data to identify its type pattern and size. That is, the data analyzer 110 may read the input data line by line to identify the type pattern. Alternatively, the data analyzer 110 may store the data in a buffer of a predetermined size and then read the buffered data to identify the type pattern. In this case, the buffer size can be changed freely.
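As a rough illustration of the line-by-line analysis described above, the following sketch accumulates the set of type patterns and the total size of the input. The patent does not specify how a type pattern is recognized, so the `analyze` function, its `classify` callback, and the example rule (the first whitespace-separated field names the pattern) are all assumptions made only for this example.

```python
def analyze(lines, classify):
    """Read input line by line; return (set of type patterns, total size in bytes).

    `classify` is a hypothetical callback mapping one line to a pattern
    identifier, or None when the line carries no recognizable pattern.
    """
    patterns = set()
    total_bytes = 0
    for line in lines:
        total_bytes += len(line.encode("utf-8"))
        pattern = classify(line)
        if pattern is not None:
            patterns.add(pattern)
    return patterns, total_bytes


def first_field(line):
    """Assumed rule for illustration: the first field is the pattern identifier."""
    fields = line.split()
    return fields[0] if fields else None


sample = ["p1 alpha\n", "p2 beta\n", "p1 gamma\n"]
found, size = analyze(sample, first_field)
```

Passing the classifier in as a parameter keeps the reading loop independent of whatever pattern definition a given service's distribution rule uses.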

The data node selector 120 searches the data node information database 130 to identify the data nodes for which the type pattern identified by the data analyzer 110 is set, and selects the data node in which to store the data based on the state information of the identified data nodes. To do so, the data node selector 120 obtains the data nodes for which the same type pattern as that of the data is set and determines whether any of them has a storable capacity greater than or equal to the size of the data. If such a data node exists, the data node selector 120 selects, from among the obtained data nodes, a data node whose storable capacity is greater than or equal to the size of the data.

If no data node has a storable capacity greater than or equal to the size of the data, the data node selector 120 splits the data into pieces of a predetermined size and selects, from among the obtained data nodes, data nodes whose storable capacity is greater than or equal to the size of the split data.
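The selection logic in the two paragraphs above can be sketched as follows. This is a minimal illustration, not the patented implementation: `DataNode`, `select_nodes`, and `chunk_size` are hypothetical names, and the first-fit choice among candidate nodes is an assumption, since the text does not say which qualifying node is preferred.

```python
from dataclasses import dataclass


@dataclass
class DataNode:
    name: str
    type_patterns: frozenset  # type patterns preset for this node
    free_capacity: int        # storable capacity, in bytes


def select_nodes(nodes, pattern, size, chunk_size):
    """Return a list of (node name, bytes) assignments for one piece of data."""
    # Keep only the nodes for which the data's type pattern is set.
    candidates = [n for n in nodes if pattern in n.type_patterns]

    # Case 1: some candidate can hold the data whole (first fit is an assumption).
    for node in candidates:
        if node.free_capacity >= size:
            return [(node.name, size)]

    # Case 2: split into pieces of a predetermined size and place each piece.
    remaining = {n.name: n.free_capacity for n in candidates}
    assignments = []
    offset = 0
    while offset < size:
        chunk = min(chunk_size, size - offset)
        target = next((name for name, free in remaining.items() if free >= chunk), None)
        if target is None:
            raise RuntimeError("no candidate node can hold a chunk")
        remaining[target] -= chunk
        assignments.append((target, chunk))
        offset += chunk
    return assignments


nodes = [
    DataNode("n1", frozenset({1}), 100),
    DataNode("n2", frozenset({1}), 60),
    DataNode("n3", frozenset({2}), 1000),
]
whole = select_nodes(nodes, 1, 80, 50)   # fits on n1 without splitting
split = select_nodes(nodes, 1, 150, 50)  # no single node fits, so it is split
```

The sketch only plans the placement; actually transferring the data would be the job of the data distributor 150.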

In addition, to reduce the number of joins, the data node selector 120 may allocate data that includes a plurality of type patterns to a single data node, and allocate data that includes a type pattern not yet distributed to an empty data node.

In addition, when overlapping storage information is set in the data node information database 130, the data node selector 120 may generate a replica on an adjacent data node according to that overlapping storage information and repeat replica generation until the replica setting is satisfied, thereby distributing the data redundantly to adjacent data nodes. That is, the management node 100 stores the same data on multiple data nodes to prevent data loss and service disruption when a data node fails. To this end, the management node 100 may set the overlapping storage information and store the input data redundantly according to the set overlapping storage information.

The data distributor 150 distributes the data to the data nodes selected by the data node selector 120 and stores it there.

Although not shown in the drawings, the management node 100 may further include an updater (not shown) that checks the state information of each data node in real time and updates the state information of each data node stored in the data node information database.

The management node 100 configured as described above analyzes the input data and distributes it to one or more data nodes for storage. Because the data is stored in query units, unnecessary joins are reduced and the associated I/O cost is eliminated, enabling fast query processing. In addition, by storing data redundantly according to user settings, the system retains the fault tolerance that is an advantage of distributed parallel systems.

FIG. 3 is a flowchart illustrating a method by which a management node distributes data to a plurality of data nodes according to the present invention.

Referring to FIG. 3, when data is input (S302), the management node analyzes the input data to identify its type pattern and size (S304). That is, the management node analyzes the input data line by line or in units of a predetermined size to identify the type pattern and size.

After S304, the management node searches its data node information database to identify the data nodes for which the identified type pattern is set (S306). That is, the management node searches the data node information database and identifies the data nodes for which the same type pattern as that of the data is set.

After S306, the management node selects the data node in which to store the data based on the state information of the identified data nodes (S308).

At this time, the management node compares the size of the data with the storable capacity of the identified data nodes and selects data nodes whose storable capacity is greater than or equal to the size of the data. The management node may then allocate data that includes a plurality of type patterns to a single data node, and allocate data that includes a type pattern not yet distributed to an empty data node.

In addition, when overlapping storage information is set in the data node information database, the management node may generate a replica on an adjacent data node according to that overlapping storage information and repeat replica generation until the replica setting is satisfied, thereby distributing the data redundantly to adjacent data nodes.

After S308, the management node distributes the data to the selected data nodes and stores it there (S310).

The management node determines whether all of the input data has been stored (S312) and, if storage is not complete, repeats the procedure from S302.
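The flow S302 to S312 can be summarized as a simple loop. The sketch below is illustrative only: `distribute_all` and its arguments are hypothetical, the type pattern and size of each record are assumed to be already known (the result of S304), data splitting and replication are omitted, and the first fitting node is chosen by assumption.

```python
def distribute_all(records, node_patterns, node_free):
    """Place every (pattern, size) record on a node; return {node: [records]}.

    node_patterns maps node name -> set of type patterns set for that node.
    node_free maps node name -> storable capacity; it is updated in place.
    """
    stored = {name: [] for name in node_patterns}
    pending = list(records)                          # S302: data is input
    while pending:                                   # S312: repeat until all stored
        pattern, size = pending.pop(0)               # S304: pattern and size known
        matching = [n for n, pats in node_patterns.items()
                    if pattern in pats]              # S306: same pattern is set
        fitting = [n for n in matching
                   if node_free[n] >= size]          # S308: enough capacity
        if not fitting:
            raise RuntimeError("no node can store this record")
        target = fitting[0]                          # first fit (assumption)
        node_free[target] -= size                    # S310: distribute and store
        stored[target].append((pattern, size))
    return stored


node_patterns = {"n1": {"a"}, "n2": {"a", "b"}}
node_free = {"n1": 10, "n2": 100}
result = distribute_all([("a", 8), ("a", 5), ("b", 50)], node_patterns, node_free)
```

Updating `node_free` in place plays the role of the state information kept in the data node information database.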

A method by which the management node distributes data including a plurality of type patterns to the data nodes for storage is described below by way of example.

For example, suppose that analyzing the type patterns of the input data yields the following:

ID #1 => Typepattern#1,2

ID #2 = (ID#1 + ID#3 + Typepattern#5,6) = Typepattern#1,2 + Typepattern#8,9 + Typepattern#5,6

ID #3 => Typepattern#8,9

ID #4 => Typepattern#10,11

ID #5 => Typepattern#12,13,14

ID #6 => Typepattern#15

ID #7 => Typepattern#16

The case in which there are five data nodes and the replica setting is 3 is described with reference to Table 1.

Table 1

| Step | Node 1 | Node 2 | Node 3 | Node 4 | Node 5 |
|---|---|---|---|---|---|
| ① Assign groups containing multiple patterns to one node to reduce the number of joins | 1,2,8,9,5,6 | 10,11 | 12,13,14 | | |
| ② Assign undistributed patterns to empty nodes | 1,2,8,9,5,6 | 10,11 | 12,13,14 | 15 | 16 |
| ③ Create replicas on adjacent nodes | 1,2,8,9,5,6; 10,11 | 10,11; 12,13,14 | 12,13,14; 15 | 15; 16 | 16; 1,2,8,9,5,6 |
| ④ Repeat creation until the replica setting is satisfied | 1,2,8,9,5,6; 10,11; 15 | 10,11; 12,13,14; 16 | 12,13,14; 15; 10,11 | 15; 16; 1,2,8,9,5,6 | 16; 1,2,8,9,5,6; 12,13,14 |

Referring to Table 1, to reduce the number of joins, the management node distributes data that includes a plurality of type patterns to a single data node. That is, Typepattern#1,2,8,9,5,6 of ID #2 is distributed to data node 1, Typepattern#10,11 of ID #4 to data node 2, and Typepattern#12,13,14 of ID #5 to data node 3.

In addition, the management node distributes data that includes a type pattern not yet distributed to an empty data node.

That is, the management node distributes the not-yet-distributed Typepattern#15 of ID #6 to data node 4 and Typepattern#16 of ID #7 to data node 5.

In addition, because the replica setting is 3, the management node generates a replica on an adjacent data node and repeats replica generation until the replica setting is satisfied, distributing the data to adjacent data nodes.

That is, the management node redundantly stores Typepattern#10,11 on data node 1, Typepattern#12,13,14 on data node 2, Typepattern#15 on data node 3, Typepattern#16 on data node 4, and Typepattern#1,2,8,9,5,6 on data node 5.

The management node then redundantly stores Typepattern#15 on data node 1, Typepattern#16 on data node 2, Typepattern#10,11 on data node 3, Typepattern#1,2,8,9,5,6 on data node 4, and Typepattern#12,13,14 on data node 5.
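The replica rounds in this example can be approximated with a simple ring placement. The sketch below is one possible reading, not the rule used in the patent: the function name `place_with_replicas` and the neighbor rule (round r places each group r nodes before its primary node, wrapping around) are assumptions. With this rule the first replica round matches row ③ of Table 1; the second round picks different neighbors than row ④, but it still leaves every group on three distinct nodes.

```python
def place_with_replicas(groups, node_count, replicas):
    """Return, for each node, the list of groups it stores.

    groups[i] is first stored on node i (0-based). Each further round r
    stores a copy on node (i - r) mod node_count, an assumed neighbor rule.
    """
    if replicas > node_count:
        raise ValueError("replica setting exceeds the number of data nodes")
    nodes = [[] for _ in range(node_count)]
    for i, group in enumerate(groups):
        nodes[i % node_count].append(group)           # primary copy
    for r in range(1, replicas):                      # repeat until satisfied
        for i, group in enumerate(groups):
            nodes[(i - r) % node_count].append(group) # replica on a neighbor
    return nodes


groups = ["1,2,8,9,5,6", "10,11", "12,13,14", "15", "16"]
after_first_round = place_with_replicas(groups, 5, 2)
final = place_with_replicas(groups, 5, 3)
```

Because the offsets 0, 1, and 2 are distinct modulo the node count, no node receives the same group twice as long as the replica setting does not exceed the number of nodes.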

The method for data distribution can be written as a program, and the codes and code segments constituting the program can be readily inferred by programmers skilled in the art.

As such, those skilled in the art to which the present invention pertains will understand that the present invention may be embodied in other specific forms without changing its technical spirit or essential features. The embodiments described above should therefore be understood as illustrative in all respects and not restrictive. The scope of the present invention is indicated by the claims below rather than by the detailed description above, and all changes or modifications derived from the meaning and scope of the claims and their equivalents should be construed as falling within the scope of the present invention.

100: management node 110: data analyzer
120: data node selector 130: data node information DB
140: distribution rule DB 150: data distributor
200: data node

Claims

1. A data distribution system comprising:
a plurality of data nodes that store data; and
a management node that analyzes input data to identify a type pattern, determines a data node in which to store the data based on state information of data nodes for which the same type pattern as the identified type pattern is set, and distributes the data to the determined data node,
wherein the management node allocates the data to a single data node when the data includes a plurality of type patterns, and allocates the data to an empty data node when the data includes a type pattern that has not been distributed, and
wherein the management node generates a replica on an adjacent data node according to preset overlapping storage information and repeats replica generation until the replica setting is satisfied, thereby distributing the data to adjacent data nodes.

2. The data distribution system of claim 1,
wherein the state information of the data nodes includes overlapping storage information, the number of data nodes, and the storable capacity and type pattern of each data node.

3. (Deleted)

4. A management node comprising:
a data node information database in which information about connected data nodes is stored;
a data analyzer that analyzes input data to identify a type pattern and a size;
a data node selector that searches the data node information database to identify data nodes for which the same type pattern as the identified type pattern is set, and selects a data node in which to store the data based on state information of the identified data nodes; and
a data distributor that distributes the data to the selected data nodes,
wherein the data node selector allocates the data to a single data node when the data includes a plurality of type patterns, and allocates the data to an empty data node when the data includes a type pattern that has not been distributed, and
wherein the data node selector generates a replica on an adjacent data node according to preset overlapping storage information and repeats replica generation until the replica setting is satisfied, thereby distributing the data to adjacent data nodes.

5. The management node of claim 4,
wherein the data node information database stores at least one of overlapping storage information, the number of data nodes, the type pattern of each data node, and the storable capacity of each data node.

6. The management node of claim 4,
wherein the data node selector selects, from among the identified data nodes, data nodes whose storable capacity is greater than or equal to the size of the data, or,
when no data node has such a capacity, splits the data into pieces of a predetermined size and selects, from among the identified data nodes, data nodes whose storable capacity is greater than or equal to the size of the split data.

7. (Deleted)

8. The management node of claim 4,
further comprising an updater that checks the state information of each data node in real time and updates the state information of each data node stored in the data node information database.

9. A method by which a management node distributes data to a plurality of data nodes, the method comprising:
(a) analyzing input data to identify a type pattern and a size;
(b) searching a provided data node information database to identify data nodes for which the same type pattern as the identified type pattern is set; and
(c) selecting a data node in which to store the data based on state information of the identified data nodes, and distributing the data to the selected data nodes,
wherein in step (c), the data is allocated to a single data node when the data includes a plurality of type patterns, and is allocated to an empty data node when the data includes a type pattern that has not been distributed, and
wherein a replica is generated on an adjacent data node according to preset overlapping storage information, and replica generation is repeated until the replica setting is satisfied, thereby distributing the data to adjacent data nodes.

10. The method of claim 9,
wherein the data node information database stores at least one of overlapping storage information, the number of data nodes, the type pattern of each data node, and the storable capacity of each data node.

11. The method of claim 9,
wherein step (c) comprises:
selecting, from among the identified data nodes, data nodes whose storable capacity is greater than or equal to the size of the data, or, when no data node has such a capacity, splitting the data into pieces of a predetermined size and selecting, from among the identified data nodes, data nodes whose storable capacity is greater than or equal to the size of the split data; and
distributing the data to the selected data nodes.

12. (Deleted)