CN102541858B

CN102541858B - Data balancing processing method, apparatus, and system based on map and reduce (MapReduce)

Info

Publication number: CN102541858B
Application number: CN201010585613.8A
Authority: CN
Inventors: 蔡斌; 田万鹏; 万乐; 史晓峰; 邱翔虎; 刘奕慧; 肖桂菊; 宫振飞; 张文郁; 韩欣; 崔小丰
Original assignee: Tencent Technology Shenzhen Co Ltd
Current assignee: Tencent Cloud Computing Beijing Co Ltd
Priority date: 2010-12-07
Filing date: 2010-12-07
Publication date: 2016-06-15
Anticipated expiration: 2030-12-07
Also published as: CN102541858A

Abstract

The invention discloses a data balancing processing method, apparatus, and system based on map and reduce (MapReduce). The method includes: obtaining the data submitted by a client and dividing the obtained data into primary partitions according to a preset number of mappers; performing map processing on the data in each primary partition to obtain intermediate result data; calling a partitioner function to divide the intermediate result data into fine-grained partitions according to a preset number of fine-grained partitions, the preset number of fine-grained partitions being greater than the number of reduce partitions; outputting the intermediate result data amount information of each fine-grained partition to a job server, receiving the correspondence between fine-grained partitions and reduce partitions returned by the job server, merging the intermediate result data of the fine-grained partitions that belong to the same reduce partition, and outputting it to the corresponding reduce partition; and performing reduce processing on the intermediate result data in each reduce partition to obtain the corresponding data processing result. Applying the invention balances the data load and improves data processing efficiency.

Description

Data balancing processing method, apparatus, and system based on map and reduce

Technical field

The present invention relates to distributed data computing technology, and in particular to a data balancing processing method, apparatus, and system based on map and reduce (MapReduce).

Background art

MapReduce is an existing system architecture for large-scale data processing. As a programming framework, it is widely used for parallel computation over large data sets (for example, data sets larger than 1 TB), such as large-scale distributed filtering, large-scale distributed sorting, web link-graph reversal, web access log analysis, inverted index construction, document clustering, machine learning, and statistical machine translation. In the MapReduce architecture, data processing is divided into two stages. The first stage is the map (Map) stage: the data to be processed is divided into primary partitions, each element in a primary partition is computed on, and the results are output to reduce partitions, with results that have the same key stored in the same reduce partition. The second stage is the reduce (Reduce) stage: the map results that have the same key are merged into a list, and the elements of the list are then combined appropriately to obtain the final result.

MapReduce makes it much easier for programmers who have no experience with distributed parallel programming to run their own programs on a distributed system.

Fig. 1 is a schematic structural diagram of an existing MapReduce-based data balancing processing system. Taking Hadoop, an application framework that runs on large clusters of inexpensive hardware, as an example, and referring to Fig. 1, the system includes a client (Client), a job server (JobTracker), and one or more task servers (TaskTracker).

In this application, one MapReduce computation request is called a job. The client submits a job to the JobTracker through a client program (ClientProgram). The JobTracker coordinates the job: the Map stage is executed first (M1, M2, and M3 in Fig. 1), followed by the Reduce stage (R1 and R2 in Fig. 1). The data processing operations of the Map stage and the Reduce stage are monitored by the TaskTracker and run in processes separate from the TaskTracker.

Specifically, the client submits a job, that is, the input data (InputData), to the JobTracker through the ClientProgram. In the Map stage, the TaskTracker first divides the input data into primary partitions. In this example, the input data is divided into 5 non-overlapping primary partitions, input primary partition 1 (InputSplit1) through input primary partition 5 (InputSplit5). Map is invoked through the input format (InputFormat) to read the input data, which is then processed by 5 mappers (Mapper). TaskTracker1 and TaskTracker3 each contain two Mappers, and TaskTracker2 contains one Mapper: the data in primary partitions 1 and 4 is input to the Mappers in TaskTracker1, the data in primary partitions 3 and 5 is input to the Mappers in TaskTracker3, and the data in primary partition 2 is input to the Mapper in TaskTracker2. The data input to a Mapper has the form <key, value>, referred to below as key1 and value1.

After a Mapper processes key1 and value1, it produces intermediate results (intermediate data) in <key, value> form, referred to below as key2 and value2, and stores them in random access memory (RAM). The TaskTracker can merge (Combine) the intermediate data stored in RAM. A partitioner (Partitioner) function is called for each intermediate result to specify its Reduce partition (Partition), and the intermediate result is output to the buffer corresponding to that Reduce partition (Region1 and Region2 in Fig. 1).

The definition of Partitioner function is as follows:

int getPartition(K2 key, V2 value, int numPartitions)

The function takes the parameters key and value, which are the Mapper outputs key2 and value2 respectively, together with the number of partitions (numPartitions), that is, the specified number of Reduce partitions. The example in Fig. 1 includes two reducers (Reducer), so the specified number of Reduce partitions is 2, with one Reducer per Reduce partition.
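For illustration only, a partitioner with this signature could be sketched in Python as follows. The hash-based rule is an assumption (the description does not say how the partition index is computed), and `get_partition` is a hypothetical Python counterpart of getPartition.

```python
import zlib


def get_partition(key: str, value: object, num_partitions: int) -> int:
    """Return the 0-based index of the Reduce partition for one <key2, value2> record."""
    # Assumption: partition by a stable hash of the key, so that records with
    # the same key always land in the same Reduce partition.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


# With two Reducers, every record maps to partition 0 or partition 1.
index = get_partition("example-key", 1, 2)
```

Because only the key is hashed, the value does not influence the partition index in this sketch.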

After the Map stage finishes, the Reduce stage begins. The Reduce stage consists of 3 steps: shuffle (Shuffle), sort (Sort), and Reduce. In the shuffle step, the Hadoop MapReduce system uses the keys in the Map results to transfer the related results, that is, the merged intermediate data that the Mappers output into one partition buffer (Region1 or Region2), to a particular Reducer task. In other words, the intermediate results with the same key that are produced by multiple Mappers on different TaskTrackers are transmitted to the TaskTracker of the Reducer that handles that key. For example, all intermediate data belonging to Region1 in Fig. 1 is transferred to one Reducer task, and all intermediate data belonging to Region2 in Fig. 1 is transferred to another Reducer task. Sorting runs at the same time as shuffling: the <key2, value2> pairs from different Mappers that have the same key are combined to form <key2, <list of value2>>, which serves as the input of the Reducer in the TaskTracker. The Reducer processes the received <key2, <list of value2>> to form the final result <key, value>, referred to below as <key3, value3>, and outputs it as output data (OutputData).
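The grouping performed during shuffle and sort can be illustrated with a small single-process sketch. It is not Hadoop code: `shuffle_and_sort` and `reduce_sum` are hypothetical names, and summing the values is an assumed Reduce operation (as in word counting).

```python
from collections import defaultdict


def shuffle_and_sort(mapper_outputs):
    """Group <key2, value2> pairs from several Mappers into <key2, list of value2>."""
    grouped = defaultdict(list)
    for output in mapper_outputs:          # one list of pairs per Mapper
        for key2, value2 in output:
            grouped[key2].append(value2)
    return sorted(grouped.items())         # sorted by key, as the Reducer input


def reduce_sum(key2, values):
    """Assumed Reduce operation: produce <key3, value3> by summing the list."""
    return key2, sum(values)


mapper_outputs = [[("a", 1), ("b", 1)], [("a", 1)]]
grouped = shuffle_and_sort(mapper_outputs)
results = [reduce_sum(key2, values) for key2, values in grouped]
```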

In the above example, the TaskTracker that performs the Map stage and the TaskTracker that performs the Reduce stage may be the same TaskTracker or different TaskTrackers.

As can be seen from the above, the existing MapReduce-based data balancing processing method has no means of effectively counting the amount of data at fine-grained partition level when intermediate data is merged. As a result, when data is output to the buffers of the specified Reduce partitions, the amounts of data in those buffers can be unbalanced; that is, the amount of data in buffer Region1 may differ greatly from the amount in buffer Region2. When the intermediate data in each buffer is then transferred to the corresponding Reducer task, the intermediate data input to the Reducers in the TaskTrackers may be uneven; that is, some Reducers receive more intermediate data than others.

Fig. 2 is a schematic diagram of the amount of data contained in the Reduce partitions after processing by the existing MapReduce-based data balancing method. Referring to Fig. 2, there are two Reduce partitions. Assume that in Fig. 1 the data in Region1 of TaskTracker1 through TaskTracker3 is transmitted to Reduce partition 1 (Reducer1) and the data in Region2 of TaskTracker1 through TaskTracker3 is transmitted to Reduce partition 2 (Reducer2); the amount of data input to each Reduce partition is the area under its curve. As the figure shows, because the amount of data in Region1 of TaskTracker1 through TaskTracker3 far exceeds the amount in Region2, the amount of data reduced in Reduce partition 1 (by Reducer1) greatly exceeds the amount reduced in Reduce partition 2 (by Reducer2), so processing the data in Reduce partition 1 takes much longer than processing the data in Reduce partition 2. In practice, the time required for data processing (the time at which the task ends) is determined by the longer processing time of Reducer1. Reducer load imbalance therefore lengthens the system's data processing time. For example, while Reducer1 is still processing data, Reducer2 is idle, so TaskTracker resources are not used effectively and data processing efficiency drops.
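The effect of the imbalance on completion time can be made concrete with a small calculation. The partition sizes and the processing rate below are hypothetical numbers chosen for illustration, and the model (time proportional to data amount, Reducers running in parallel) is a simplifying assumption.

```python
def job_reduce_time(partition_sizes, records_per_second):
    """Reduce-stage time when Reducers run in parallel: the slowest one decides."""
    return max(size / records_per_second for size in partition_sizes)


# Same total amount of data (1000 records), two different distributions.
skewed = job_reduce_time([900, 100], 100.0)     # Reducer1 holds most of the data
balanced = job_reduce_time([500, 500], 100.0)   # data spread evenly
```

Under these assumptions the skewed distribution takes 9.0 seconds and the balanced one 5.0 seconds, even though the total amount of data is identical.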

Summary of the invention

In view of this, the main object of the present invention is to provide a data balancing processing method based on map and reduce that balances the data load and improves data processing efficiency.

Another object of the present invention is to provide a data balancing processing apparatus based on map and reduce that balances the data load and improves data processing efficiency.

A further object of the present invention is to provide a data balancing processing system based on map and reduce that balances the data load and improves data processing efficiency.

To achieve the above objects, the invention provides a data balancing processing method based on map and reduce, the method including:

A. obtaining the data submitted by a client, and dividing the obtained data into primary partitions according to a preset number of mappers;

B. performing map processing on the data in each primary partition to obtain intermediate result data;

C. calling a partitioner function to divide the intermediate result data into fine-grained partitions according to a preset number of fine-grained partitions, the preset number of fine-grained partitions being greater than the number of reduce partitions;

D. outputting the intermediate result data amount information of each fine-grained partition to a job server, receiving the correspondence between fine-grained partitions and reduce partitions returned by the job server, merging the intermediate result data of the fine-grained partitions that belong to the same reduce partition, and outputting it to the corresponding reduce partition;

E. performing reduce processing on the intermediate result data in each reduce partition to obtain the corresponding data processing result.

Between step C and step D, the method further includes:

determining whether the progress of map processing has reached a preset progress threshold, and if so, performing step D.

The preset progress threshold is a preset percentage of the data in a primary partition for which the mapper has completed map processing.

In step D, outputting the intermediate result data amount information of each fine-grained partition to the job server and receiving the correspondence between fine-grained partitions and reduce partitions returned by the job server specifically includes:

The job server aggregates the intermediate result data amount information of the fine-grained partitions reported by each mapper to obtain the total amount of intermediate result data; calculates, according to the number of reduce partitions, the amount of intermediate result data that each reduce partition needs to process; determines, according to that calculated amount, the reduce partition to which each fine-grained partition corresponds, so that the sum of the intermediate result data amounts of the fine-grained partitions selected for a reduce partition is equal or approximately equal to the amount that the reduce partition needs to process; and then outputs the correspondence information between fine-grained partitions and reduce partitions to the task server.

The method further includes: presetting a fine-grained partition sequence flag that identifies the order of each fine-grained partition within the group of fine-grained partitions. Selecting the fine-grained partitions that correspond to a reduce partition, so that the sum of the intermediate result data amounts of the selected fine-grained partitions is equal or approximately equal to the amount that the reduce partition needs to process, then specifically includes:

selecting, in order, the fine-grained partitions that correspond to each reduce partition, so that the sum of the intermediate result data amounts of the fine-grained partitions selected in order is equal or approximately equal to the amount of intermediate result data that the reduce partition needs to process.

A data balancing processing apparatus based on map and reduce, the apparatus including a receiving unit, a balance policy calculation unit, and a sending unit, wherein:

the receiving unit is configured to receive the data submitted by a client and output it to the sending unit, and to receive the intermediate result data amount information of each fine-grained partition sent by an external task server and output it to the balance policy calculation unit;

the balance policy calculation unit is configured to calculate, according to the intermediate result data amount information of each fine-grained partition output by the receiving unit and a preset balance policy, the number of fine-grained partitions assigned to each of the preset number of reduce partitions, and to output the correspondence information between fine-grained partitions and reduce partitions to the sending unit;

the sending unit is configured to output the client-submitted data output by the receiving unit, and the correspondence information between fine-grained partitions and reduce partitions output by the balance policy calculation unit, to the external task server.

A data balancing processing apparatus based on map and reduce, the apparatus including a receiving unit, a primary partition unit, a map processing unit, a fine-grained partition unit, a reduce partition unit, and a reduce processing unit, wherein:

the receiving unit is configured to receive data from an external job server and output it to the primary partition unit, and to receive the correspondence information between fine-grained partitions and reduce partitions output by the external job server and output it to the reduce partition unit;

the primary partition unit is configured to divide the received data into primary partitions according to a preset number of mappers and output them to the corresponding map processing units;

the map processing unit is configured to perform map processing on the data output by the primary partition unit to obtain intermediate result data, and output it to the fine-grained partition unit;

the fine-grained partition unit is configured to divide the intermediate result data into fine-grained partitions according to a preset number of fine-grained partitions, the preset number being greater than the number of reduce partitions, and output the intermediate result data amount information to the external job server; and to receive the correspondence information between fine-grained partitions and reduce partitions output by the receiving unit and, according to the correspondence, output the intermediate result data of the fine-grained partitions that belong to the same reduce partition to the corresponding reduce partition;

the reduce partition unit is configured to output the received intermediate result data to the corresponding reduce processing unit;

the reduce processing unit is configured to perform reduce processing on the input merged intermediate result data to obtain the corresponding data processing result.

The apparatus further includes a judging unit and a sending unit, wherein:

the judging unit is configured to determine whether the progress of the map processing unit has reached a preset progress threshold and, if so, to trigger output of the intermediate result data amount information in the fine-grained partition unit to the sending unit;

the sending unit is configured to output the received intermediate result data amount information to the external job server.

A data balancing processing system based on map and reduce, the system including a job server and one or more task servers, wherein:

the job server is configured to receive the data submitted by a client and output it to the task server; and to calculate, according to the received intermediate result data amount information of each fine-grained partition, the preset number of reduce partitions, and a preset balance policy, the number of fine-grained partitions assigned to each reduce partition, and output the correspondence information between fine-grained partitions and reduce partitions to the task server;

the task server is configured to divide the received data into primary partitions according to a preset number of mappers; perform map processing on the data in each primary partition to obtain intermediate result data; call a partitioner function to divide the intermediate result data into fine-grained partitions according to a preset number of fine-grained partitions, the preset number being greater than the number of reduce partitions; output the intermediate result data amount information of each fine-grained partition to the job server; receive the correspondence between fine-grained partitions and reduce partitions returned by the job server; merge the intermediate result data of the fine-grained partitions that belong to the same reduce partition and output it to the reduce partition given by the correspondence; and perform reduce processing on the intermediate result data input to each reduce partition to obtain the corresponding data processing result.

The job server includes a receiving unit, a balance policy calculation unit, and a sending unit, wherein:

the receiving unit is configured to receive the data submitted by a client and output it to the sending unit, and to receive the intermediate result data amount information of each fine-grained partition sent by an external task server and output it to the balance policy calculation unit.

The task server includes a receiving unit, a primary partition unit, a map processing unit, a fine-grained partition unit, a reduce partition unit, and a reduce processing unit, wherein:

the receiving unit is configured to receive data from an external job server and output it to the primary partition unit, and to receive the correspondence information between fine-grained partitions and reduce partitions output by the external job server and output it to the fine-grained partition unit;

the fine-grained partition unit is configured to divide the input merged intermediate result data into fine-grained partitions according to a preset number of fine-grained partitions, the preset number being greater than the number of reduce partitions, and output the intermediate result data amount information to the external job server; and to receive the correspondence information between fine-grained partitions and reduce partitions output by the receiving unit and, according to the correspondence, output the intermediate result data of the fine-grained partitions that belong to the same reduce partition to the corresponding reduce partition;

the reduce partition unit is configured to output the received merged intermediate result data to the corresponding reduce processing unit.

The task server further includes a judging unit and a sending unit.

As can be seen from the above technical solutions, in the data balancing processing method, apparatus, and system based on map and reduce provided by the invention, the data submitted by a client is obtained and divided into primary partitions according to a preset number of mappers; map processing is performed on the data in each primary partition to obtain intermediate result data; a partitioner function is called to divide the intermediate result data into fine-grained partitions according to a preset number of fine-grained partitions, the preset number being greater than the number of reduce partitions; the intermediate result data amount information of each fine-grained partition is output to the job server, the correspondence between fine-grained partitions and reduce partitions returned by the job server is received, and the intermediate result data of the fine-grained partitions that belong to the same reduce partition is merged and output to the corresponding reduce partition; and reduce processing is performed on the intermediate result data in each reduce partition to obtain the corresponding data processing result. In this way, the output of the Mappers is divided into a large number of fine-grained partitions, which are then merged to form relatively uniform Reduce partitions. This balances the data load across the Reduce partitions, which reduces Reducer load imbalance, makes effective use of TaskTracker resources, shortens the total time needed to complete a job, and improves data processing efficiency.

Brief description of the drawings

Fig. 1 is a schematic structural diagram of an existing MapReduce-based data balancing processing system.

Fig. 2 is a schematic diagram of the amount of data contained in the Reduce partitions after processing by the existing MapReduce-based data balancing method.

Fig. 3 is a schematic flowchart of the MapReduce-based data balancing processing method according to an embodiment of the present invention.

Fig. 4 is a schematic diagram of the structure of the intermediate result data after fine-grained partitioning according to an embodiment of the present invention.

Fig. 5 is a schematic structural diagram of the MapReduce-based data balancing processing system according to an embodiment of the present invention.

Fig. 6 is another schematic structural diagram of the MapReduce-based data balancing processing system according to an embodiment of the present invention.

Fig. 7 is a schematic structural diagram of the job server according to an embodiment of the present invention.

Fig. 8 is a schematic structural diagram of the task server according to an embodiment of the present invention.

Detailed description of the invention

To make the objects, technical solutions, and advantages of the present invention clearer, the invention is described in further detail below with reference to the accompanying drawings and specific embodiments.

In the prior art, the intermediate results produced by the Mappers are stored in RAM, and when the intermediate result data stored in RAM is merged and partitioned, it is processed according to the specified Reduce partitions and output to the corresponding buffers, so the intermediate result data load across the buffers is unbalanced.

In the embodiments of the present invention, before the intermediate result data stored in RAM is divided into Reduce partitions, it is preprocessed by fine-grained partitioning to obtain the distribution of the intermediate result data across the fine-grained partitions. According to the obtained distribution and a preset balance policy, the intermediate result data of the fine-grained partitions is merged into the specified Reduce partitions and output to the corresponding buffers, so that the data load across the buffers is balanced.

Specifically, in the MapReduce framework, the intermediate result data of Map is grouped in two stages. The first stage performs fine-grained partitioning, which can use the original Partitioner function. The difference from the existing scheme is that the number of fine-grained partitions in this first stage (N_f) is much larger than the existing number of reduce partitions (N_r); that is, the number of partitions passed to the Partitioner function is much larger than the existing number of Reducers (N_r), and each Mapper divides its intermediate result data into fine-grained partitions based on N_f.
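The first stage can be sketched as calling the same partitioner with N_f in place of N_r and counting what lands in each fine-grained partition. This is an illustrative sketch only: the hash-based partitioner, the values of `N_R` and `N_F`, the sample records, and the choice to measure data amount as a record count are all assumptions.

```python
import zlib
from collections import Counter

N_R = 2    # number of Reducers (reduce partitions)
N_F = 16   # hypothetical number of fine-grained partitions, with N_F > N_R


def get_partition(key: str, num_partitions: int) -> int:
    """Assumed hash-based partitioner, reused unchanged for both stages."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def fine_partition_sizes(records, n_f=N_F):
    """Count the records that fall into each of the n_f fine-grained partitions."""
    counts = Counter(get_partition(key2, n_f) for key2, _ in records)
    return [counts[i] for i in range(n_f)]


# Hypothetical Mapper output: 100 records spread over 7 distinct keys.
records = [(f"key{i % 7}", 1) for i in range(100)]
sizes = fine_partition_sizes(records)
```

The list `sizes` is the kind of per-partition data amount information that a Mapper would later report.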

When a Mapper reaches the preset progress, it counts the amount of data in each first-stage fine-grained partition and reports it to the JobTracker. According to the fine-grained partition data amount information reported by the Mappers and a preset balance policy, the JobTracker merges the intermediate result data of the fine-grained partitions into the specified Reduce partitions, which are output to the corresponding buffers. This produces Reduce partitions with relatively uniform data amounts; the number of specified Reduce partitions equals the number of Reducers. Each Reducer then fetches its Reduce partition and performs the reduce operation, as in the prior art. In this way, two-stage partitioning with fine-grained partitions avoids the situation in the MapReduce framework where some Reducers must process far more data than others, which reduces the execution time of the task.

Fig. 3 is a schematic flowchart of the MapReduce-based data balancing processing method according to an embodiment of the present invention. Referring to Fig. 3, the flow includes:

Step 301: obtain the data submitted by the client.

In this step, the client submits the user's input data to the JobTracker through the client program; the JobTracker coordinates the job and outputs the data to the TaskTracker, so that the TaskTracker obtains the data submitted by the client.

Step 302: divide the obtained data into primary partitions according to the preset number of mappers.

In this step, the preset number of mappers is the number of mappers used to perform Map processing on the data. For example, if the number of mappers is 5, the obtained data is divided into 5 primary partitions.
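A minimal sketch of primary partitioning follows. Splitting an in-memory list into contiguous, nearly equal slices is an assumption made for illustration; the description only requires one non-overlapping primary partition per mapper, and `primary_partition` is a hypothetical name.

```python
def primary_partition(records, num_mappers):
    """Split records into num_mappers contiguous, non-overlapping primary partitions."""
    base, extra = divmod(len(records), num_mappers)
    splits, start = [], 0
    for i in range(num_mappers):
        # The first `extra` splits receive one additional record.
        end = start + base + (1 if i < extra else 0)
        splits.append(records[start:end])
        start = end
    return splits


splits = primary_partition(list(range(12)), 5)
```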

Step 303: perform Map processing on the data in each primary partition to obtain intermediate result data.

In this step, each Mapper reads the data in its primary partition. The input data has the form <key, value>, referred to below as key1 and value1.

After the Mapper processes key1 and value1, it produces intermediate results in <key, value> form, referred to below as key2 and value2, and stores them in RAM.

Steps 301 through 303 are the same as in the prior art.

Step 304: call the Partitioner function to divide the intermediate result data into fine-grained partitions according to the preset number of fine-grained partitions, the preset number of fine-grained partitions being greater than the number of Reduce partitions.

In this step, to make the data load of the Reduce partitions that the Reducers process more balanced, the embodiment of the present invention further divides the intermediate result data into fine-grained partitions, so that the intermediate result data in each fine-grained partition has a finer granularity, which facilitates subsequent load balancing.

Step 305: merge (combine) the intermediate result data.

In this step, merging the intermediate result data is an optimization strategy used in MapReduce applications: identical intermediate result data can be merged, so that a reduce is performed on the intermediate result data output by the Map task, which helps reduce the amount of data transmitted from the Map stage to the Reduce stage. This step is optional.
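The optional merging step can be pictured as a local reduce over one Mapper's output. The sketch below assumes the Reduce operation is a sum (as in word counting), which the description does not specify; `combine` is a hypothetical name.

```python
from collections import Counter


def combine(records):
    """Merge records that share a key so fewer records leave the Map stage."""
    combined = Counter()
    for key2, value2 in records:
        combined[key2] += value2       # assumed Reduce operation: sum
    return sorted(combined.items())


mapper_output = [("a", 1), ("b", 1), ("a", 1)]
merged = combine(mapper_output)
```

Here three intermediate records shrink to two before transfer, without changing the final sums.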

Fig. 4 is a schematic diagram of the structure of the intermediate result data after fine-grained partitioning according to an embodiment of the present invention. Referring to Fig. 4, in this embodiment the preset number of fine-grained partitions is 8, so the intermediate result data is divided into 8 fine-grained partitions, and the intermediate result data in each fine-grained partition has a finer granularity.

Step 306: determine whether the progress of Map processing has reached the preset progress threshold; if so, perform step 307; otherwise, continue Map processing.

In this step, the Mapper calculates the progress of Map processing from the amount of data in its primary partition and the amount of data that has already undergone Map processing (the amount of intermediate result data stored in RAM).

The preset progress threshold can be a preset percentage, such as 80% or 100%, of the data in the corresponding primary partition for which the Mapper has completed Map processing. It can be set according to actual needs, provided that the amount of data that has completed Map processing is large enough to reflect the data distribution.
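The check in step 306 amounts to comparing a progress ratio with the threshold. The sketch below is an assumption-laden illustration: `should_report` is a hypothetical name, progress is measured in records, and an empty primary partition is treated as complete.

```python
def should_report(processed: int, split_size: int, threshold: float = 0.8) -> bool:
    """Return True once the Mapper's progress reaches the preset threshold."""
    # Assumption: an empty primary partition counts as fully processed.
    progress = processed / split_size if split_size else 1.0
    return progress >= threshold


ready = should_report(80, 100)       # 80% of the primary partition processed
not_ready = should_report(79, 100)   # just below the 80% threshold
```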

Step 307: output the intermediate result data amount information of each fine-grained partition to the JobTracker, receive the correspondence between fine-grained partitions and Reduce partitions returned by the JobTracker, merge the intermediate result data of the fine-grained partitions that belong to the same Reduce partition, and output it to the corresponding Reduce partition.

In this step, the JobTracker receives the intermediate result data amount information of the fine-grained partitions reported by each Mapper and, according to this information, the number of Reduce partitions (that is, the number of Reducers), and the preset balance policy, calculates the number of fine-grained partitions assigned to each Reduce partition. Specifically, the JobTracker aggregates the intermediate result data amount information of the fine-grained partitions reported by each Mapper to obtain the total amount of intermediate result data; calculates, from the number of Reduce partitions, the amount of intermediate result data that each Reduce partition needs to process; and, according to that amount, determines the Reduce partition to which each fine-grained partition corresponds, so that the sum of the intermediate result data amounts of the fine-grained partitions selected for a Reduce partition is equal or approximately equal to the amount that the Reduce partition needs to process. It then outputs the correspondence information between fine-grained partitions and Reduce partitions to the task server, so that the data load assigned to each Reduce partition is relatively balanced.
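The aggregation and target calculation described for the JobTracker can be sketched as follows. The report layout (one list of per-partition amounts per Mapper), the function names, and the sample numbers are assumptions; dividing the total evenly by the number of Reduce partitions is one plausible reading of the amount each Reduce partition needs to process.

```python
def aggregate_reports(reports):
    """Sum, per fine-grained partition, the data amounts reported by all Mappers."""
    return [sum(amounts) for amounts in zip(*reports)]


def target_per_reduce_partition(totals, n_r):
    """Amount each Reduce partition should process for an even split."""
    return sum(totals) / n_r


# Hypothetical reports from two Mappers over four fine-grained partitions.
reports = [
    [10, 0, 5, 5],
    [20, 5, 5, 0],
]
totals = aggregate_reports(reports)
target = target_per_reduce_partition(totals, 2)
```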

In the embodiments of the present invention, the preset balance policy can be determined according to actual needs, as long as it follows the general principle that the data load assigned to each Reduce partition is relatively balanced. For example, for the intermediate result data after fine-grained partitioning shown in Fig. 4, assume that the number of Reduce partitions is 2. Based on the reported intermediate result data volume information for fine-grained partitions 1 to 8 and the Reduce partition count of 2, the JobTracker performs a balance calculation using the preset balance policy and arranges for the intermediate result data in fine-grained partitions 1, 3, and 4 to be output to Reduce partition 1, and for the intermediate result data in the other fine-grained partitions 2, 5, 6, 7, and 8 to be output to Reduce partition 2. In this way, the data loads of Reduce partition 1 and Reduce partition 2 are relatively balanced.
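The text leaves the balance policy open. One possible policy, shown here purely as an assumption, is the well-known greedy "largest first" heuristic: fine-grained partitions are taken in decreasing size and each is assigned to the currently least-loaded Reduce partition. The function name and the sizes in the usage note are hypothetical and are not the values of Fig. 4.

```python
import heapq

def balance_unordered(sizes, num_reduce):
    """Map each fine-grained partition id to a Reduce partition index.

    `sizes` is a dict {fine_partition_id: data_volume}. Greedy heuristic
    (an assumed policy, not prescribed by the source): largest partition
    first, always into the least-loaded Reduce partition."""
    # Heap of (current_load, reduce_index); the smallest load is popped first.
    heap = [(0, r) for r in range(num_reduce)]
    heapq.heapify(heap)
    mapping = {}
    for fine_id, size in sorted(sizes.items(), key=lambda kv: (-kv[1], kv[0])):
        load, reduce_idx = heapq.heappop(heap)
        mapping[fine_id] = reduce_idx
        heapq.heappush(heap, (load + size, reduce_idx))
    return mapping
```

For example, with hypothetical sizes `{1: 6, 2: 1, 3: 2, 4: 3}` and two Reduce partitions, partition 1 goes to one Reduce partition and partitions 2, 3, and 4 to the other, giving a load of 6 on each side.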

Of course, in practical applications, before the fine-grained partitions are merged, a fine-grained partition order flag may also be set to identify the position of each fine-grained partition within the group of fine-grained partitions. For example, for the intermediate result data after fine-grained partitioning shown in Fig. 4, the order flags of fine-grained partitions 1 to 8 may be set to 1 to 8 respectively, where an order flag of 1 indicates the first position. This guarantees the order of the fine-grained partitions. For MapReduce applications that need sorted data, and therefore rely on the fine-grained partition order to run correctly, setting this order flag allows the fine-grained partitions to be merged sequentially according to their order flags. Taking Fig. 4 as an example, according to the intermediate result data volume information of each fine-grained partition, the number of Reduce partitions, and the preset balance policy, the intermediate result data in fine-grained partitions 1, 2, and 3 may be output to Reduce partition 1, and the intermediate result data in fine-grained partitions 4, 5, 6, 7, and 8 may be output to Reduce partition 2. Although the result may be less balanced than merging the fine-grained partitions out of order, this approach preserves the fine-grained partition order, satisfies application scenarios that require sorted data, and still keeps the data loads of Reduce partition 1 and Reduce partition 2 relatively balanced.
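An order-preserving merge of this kind can be sketched as a contiguous split of the ordered fine-grained partitions. The rule used below (assigning each partition by the position of its midpoint in the cumulative volume) is an assumption chosen for simplicity, and the sizes in the usage note are hypothetical rather than taken from Fig. 4.

```python
def balance_ordered(sizes_in_order, num_reduce):
    """Assign fine-grained partitions to Reduce partitions without reordering.

    `sizes_in_order` is a list of (fine_partition_id, data_volume) pairs
    already sorted by order flag. Each Reduce partition receives one
    contiguous run of fine-grained partitions."""
    total = sum(size for _, size in sizes_in_order)
    mapping = {}
    cumulative = 0
    for fine_id, size in sizes_in_order:
        if total == 0:
            reduce_idx = 0
        else:
            # The midpoint of this partition in the cumulative volume decides
            # which of the num_reduce equal-width slices it falls into.
            midpoint = cumulative + size / 2
            reduce_idx = min(int(midpoint * num_reduce / total), num_reduce - 1)
        mapping[fine_id] = reduce_idx
        cumulative += size
    return mapping
```

With hypothetical volumes of 40, 30, 30 for partitions 1 to 3 and 20 each for partitions 4 to 8, and two Reduce partitions, partitions 1 to 3 are assigned to the first Reduce partition and partitions 4 to 8 to the second, each side carrying a volume of 100. Because the midpoints never decrease, the assigned Reduce index never decreases either, so the order is preserved.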

Step 308: perform Reduce processing on the intermediate result data input to each Reduce partition to obtain the corresponding data processing result.

In this step, the Reduce stage begins. Each Reducer follows the original flow: it obtains its merged partition and performs the corresponding computation. This is the same as in the prior art and is not described again here.

Fig. 5 is a schematic structural diagram of the MapReduce-based data balancing processing system according to an embodiment of the present invention. Referring to Fig. 5, the system includes a job server and one or more task servers, wherein,

The job server is configured to receive the data submitted by the client and output it to the task server; and, according to the received intermediate result data volume information of each fine-grained partition and the preset number of Reduce partitions, to calculate the number of fine-grained partitions assigned to each Reduce partition according to the preset balance policy and output the correspondence between fine-grained partitions and Reduce partitions to the task server;

The task server is configured to: divide the received data into primary partitions according to the preset number of mappers; perform Map processing on the data in each primary partition to obtain intermediate result data; call the Partitioner function to partition the intermediate result data into fine-grained partitions according to the preset number of fine-grained partitions, the preset number of fine-grained partitions being greater than the number of Reducers; determine whether the progress of Map processing has reached the preset progress threshold and, if so, output the intermediate result data volume information of each fine-grained partition to the job server; receive the correspondence between fine-grained partitions and Reduce partitions returned by the job server; merge the intermediate result data of the fine-grained partitions that belong to the same Reduce partition and output it to the Reduce partition indicated by the correspondence; and perform Reduce processing on the intermediate result data input to each Reduce partition to obtain the corresponding data processing result.

Fig. 6 is another schematic structural diagram of the MapReduce-based data balancing processing system according to an embodiment of the present invention. Referring to Fig. 6, the system includes a client, a job server, and one or more task servers. The main flow of the system includes task submission, Mapper execution, and Reduce processing, wherein,

Task submission: the MapReduce developer writes the Mapper, Reducer, and Combiner according to the requirements of the new framework, and provides the number of fine-grained partitions N_f and the number of Reduce partitions N_r, where N_f > N_r.

Mapper execution: the system calls the Mapper written by the user to perform the computation. The Mapper output <key2, value2> and the number of fine-grained partitions N_f are passed as parameters to the Partitioner function to obtain the fine-grained partition of that output, and the system stores the output in the corresponding fine-grained partition. When the Mapper reaches a certain point, for example when 80% of its input has been processed, it can report to the JobTracker the count of Mapper output data in each current fine-grained partition (the intermediate result data volume information). This information reflects, to a certain degree, the distribution of the Reducer input data. After collecting the reports of all or most Mappers, the JobTracker generates a relatively uniform Reduce partition scheme, taking the partition order flag into account, and notifies the Mappers.
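The fine-grained partitioning and counting performed during Mapper execution can be sketched as follows. The hash-based partitioner (CRC32 of the key modulo N_f) is an assumed stand-in for whatever Partitioner function an application supplies, and the function names are hypothetical.

```python
import zlib
from collections import defaultdict

def fine_partition(key2: str, n_f: int) -> int:
    """Assumed partitioner: a stable hash of the key, modulo N_f."""
    return zlib.crc32(key2.encode("utf-8")) % n_f

def partition_output(pairs, n_f):
    """Store each <key2, value2> pair in its fine-grained partition and
    return the partitions together with the per-partition record counts
    that a Mapper would report to the JobTracker."""
    regions = defaultdict(list)
    for key2, value2 in pairs:
        regions[fine_partition(key2, n_f)].append((key2, value2))
    counts = {region: len(items) for region, items in regions.items()}
    return regions, counts
```

Because the partition is a function of the key alone, all pairs with the same key land in the same fine-grained partition and therefore later in the same Reduce partition.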

According to the received Reduce partition scheme, the Mapper merges the N_f fine-grained partitions (Region.1 to Region.n) into N_r Reduce partitions.
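The merge of N_f fine-grained partitions into N_r Reduce partitions can be sketched as below. The data layout (a dictionary of lists for the fine-grained partitions and a dictionary for the partition scheme) is an assumption for illustration, and the numeric partition id is used here as a stand-in for the order flag.

```python
def merge_regions(regions, mapping, n_r):
    """Merge fine-grained partitions into n_r Reduce partition inputs.

    `regions` maps fine_partition_id -> list of (key2, value2) pairs;
    `mapping` maps fine_partition_id -> Reduce partition index, as
    returned by the job server (hypothetical representation)."""
    reduce_inputs = [[] for _ in range(n_r)]
    # Visit fine-grained partitions in ascending id order so that the
    # relative order of partitions is kept within each Reduce partition.
    for fine_id in sorted(regions):
        reduce_inputs[mapping[fine_id]].extend(regions[fine_id])
    return reduce_inputs
```

Each resulting list corresponds to the input buffer of one Reduce partition.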

Reduce processing: each Reduce partition obtains its corresponding input data and performs Reduce processing.

The above flow can be described in more detail as follows:

The client submits a job, that is, the input data (InputData), to the JobTracker through the client program (ClientProgram). In the Map stage, the TaskTracker first divides the input data into primary partitions. In this example, the input data is divided into 5 non-overlapping primary partitions, input split 1 (InputSplit1) to input split 5 (InputSplit5). The Mapper is called through the input format (InputFormat) to read the corresponding data. In this embodiment of the present invention, the data is processed by 5 mappers (Mappers): TaskTracker1 and TaskTracker3 each include two Mappers, and TaskTracker2 includes one Mapper. The data in input splits 1 and 4 is input to the Mappers in TaskTracker1, the data in input splits 3 and 5 is input to the Mappers in TaskTracker3, and the data in input split 2 is input to the Mapper in TaskTracker2. The data input to a Mapper has the form <key, value>, referred to below as key1 and value1.

After the Mapper processes key1 and value1, it produces intermediate results in <key, value> form, which are stored in random access memory (RAM) and referred to below as key2 and value2. The TaskTracker merges the intermediate data stored in RAM for each primary partition and calls the partitioner (Partitioner) function for each output. This part of the processing flow is the same as in Fig. 1.

After this merge, the difference from Fig. 1 lies in the subsequent processing flow, namely in the Map output partitioning and cache management part of the TaskTracker and in the balancing part of the JobTracker. In this embodiment of the present invention, the partitioning of the original <key2, value2> output is enhanced: the original <key2, value2> pairs are partitioned according to the preset number of fine-grained partitions N_f, so that they are divided into N_f fine-grained partitions. At the same time, the TaskTracker communicates with the external JobTracker: after determining that the progress of Map processing has reached the preset progress threshold, it reports the intermediate result data volume information of each fine-grained partition to the JobTracker. The JobTracker collects the Mappers' statistics on the fine-grained partition output (the intermediate result data volume information), may take the partition order flag into account, generates a relatively uniform Reduce partition scheme according to the balance policy, and notifies the TaskTracker. The TaskTracker receives the JobTracker's instruction, that is, the correspondence between fine-grained partitions and Reduce partitions. According to the number of Reduce partitions (the number of Reducers), it merges the <key2, value2> pairs in the fine-grained partitions according to the correspondence and outputs them, by designated Reduce partition, to the buffer of the corresponding Reduce partition, shown as Region1 and Region2 in Fig. 6. Each Reduce partition then performs Reduce processing on the <key2, value2> pairs input from its buffer to obtain the corresponding data processing result.

Fig. 7 is a schematic structural diagram of the job server according to an embodiment of the present invention. Referring to Fig. 7, the job server includes a receiving unit, a balance policy calculation unit, and a sending unit, wherein,

The balance policy calculation unit is configured to calculate, according to the intermediate result data volume information of each fine-grained partition output by the receiving unit and according to the preset balance policy, the number of fine-grained partitions assigned to each of the preset number of Reduce partitions, and to output the correspondence between fine-grained partitions and Reduce partitions to the sending unit;

The sending unit is configured to output, to the external task server, the client-submitted data output by the receiving unit and the correspondence between fine-grained partitions and Reduce partitions output by the balance policy calculation unit.

Fig. 8 is a schematic structural diagram of the task server according to an embodiment of the present invention. Referring to Fig. 8, the task server includes a receiving unit, a primary partition unit, a Map processing unit, a fine-grained partitioning unit, a judging unit, a sending unit, a Reduce partitioning unit, and a Reduce processing unit, wherein,

The receiving unit is configured to receive data from the external job server and output it to the primary partition unit; and to receive the correspondence between fine-grained partitions and Reduce partitions output by the external job server and output it to the fine-grained partitioning unit;

The Map processing unit is configured to perform Map processing on the data output by the primary partition unit to obtain intermediate result data, and to output it to the fine-grained partitioning unit;

The fine-grained partitioning unit is configured to partition the input merged intermediate result data into fine-grained partitions according to the preset number of fine-grained partitions, the preset number of fine-grained partitions being greater than the number of Reduce partitions; and to receive the correspondence between fine-grained partitions and Reduce partitions output by the receiving unit and, according to the correspondence, output the intermediate result data of the fine-grained partitions that belong to the same Reduce partition to the corresponding Reduce partition;

The judging unit is configured to determine whether the progress of the Map processing unit has reached the preset progress threshold and, if so, to output the intermediate result data volume information in the fine-grained partitioning unit to the sending unit;

The sending unit is configured to output the received intermediate result data volume information to the external job server;

The Reduce partitioning unit is configured to output the received intermediate result data to the corresponding Reduce processing unit;

The Reduce processing unit is configured to perform Reduce processing on the input intermediate result data to obtain the corresponding data processing result.

As can be seen from the above, in the MapReduce-based data balancing processing method, device, and system of the embodiments of the present invention, after the task server performs Map processing on the data to obtain intermediate result data, it further partitions the intermediate result data into fine-grained partitions according to the preset number of fine-grained partitions and outputs the intermediate result data volume information of each fine-grained partition to the JobTracker. The JobTracker, according to this information and the number of Reduce partitions, calculates the number of fine-grained partitions assigned to each Reduce partition according to the preset balance policy and outputs the correspondence between fine-grained partitions and Reduce partitions to the task server. The task server, according to the received correspondence, outputs the intermediate result data of the fine-grained partitions that belong to the same Reduce partition to the corresponding Reduce partition for Reduce processing. In this way, the Mapper output is divided into a large number of fine-grained partitions, which are then merged into relatively uniform Reduce partitions, balancing the data load across the Reduce partitions. This reduces Reducer load imbalance, makes effective use of TaskTracker resources, shortens the total time needed to complete the job, and improves data processing efficiency. Furthermore, the output of the intermediate result data volume information of each fine-grained partition to the JobTracker can be triggered when the progress of Map processing is determined to have reached the preset progress threshold. Reporting once the information can already reflect the data distribution effectively reduces the volume of data exchanged between the TaskTracker and the JobTracker and the time required for data balancing. In addition, setting the fine-grained partition order flag supports application scenarios that require sorted data.

The above are only preferred embodiments of the present invention and are not intended to limit its scope of protection. Any modification, equivalent replacement, or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims

1. A data balancing processing method based on Map and Reduce, characterized in that the method includes:

E. performing Reduce processing on the intermediate result data in the Reduce partitions to obtain the corresponding data processing result;

2. The method of claim 1, characterized in that, between step C and step D, the method further includes:

3. The method of claim 2, characterized in that the preset progress threshold is a preset percentage of the data volume in the primary partition for which the mapper has completed Map processing.

4. The method of claim 1, characterized by further including: presetting a fine-grained partition order flag that identifies the position of a fine-grained partition within the group of fine-grained partitions, wherein determining the Reduce partition corresponding to each fine-grained partition, so that the sum of the intermediate result data volumes of the selected fine-grained partitions is equal or approximately equal to the intermediate result data volume that the corresponding Reduce partition needs to process, specifically includes:

5. A data balancing processing device based on Map and Reduce, characterized in that the device includes a receiving unit, a balance policy calculation unit, and a sending unit, wherein,

The balance policy calculation unit is configured to: obtain the total intermediate result data volume according to the intermediate result data volume information of each fine-grained partition output by the receiving unit; calculate, according to the number of Reduce partitions, the intermediate result data volume that each Reduce partition needs to process; determine, according to the calculated volumes, the Reduce partition corresponding to each fine-grained partition, so that the sum of the intermediate result data volumes of the selected fine-grained partitions is equal or approximately equal to the volume that the corresponding Reduce partition needs to process; and output the correspondence between fine-grained partitions and Reduce partitions to the sending unit;

6. A data balancing processing device based on Map and Reduce, characterized in that the device includes a receiving unit, a primary partition unit, a Map processing unit, a fine-grained partitioning unit, a Reduce partitioning unit, and a Reduce processing unit, wherein,

The fine-grained partitioning unit is configured to partition the intermediate result data into fine-grained partitions according to the preset number of fine-grained partitions, the preset number of fine-grained partitions being greater than the number of Reduce partitions, and to output the intermediate result data volume information to the external job server, wherein the external job server is configured to: aggregate the received intermediate result data volume information that each mapper reports for its own fine-grained partitions to obtain the total intermediate result data volume; calculate, according to the number of Reduce partitions, the intermediate result data volume that each Reduce partition needs to process; determine, according to the calculated volumes, the Reduce partition corresponding to each fine-grained partition, so that the sum of the intermediate result data volumes of the selected fine-grained partitions is equal or approximately equal to the volume that the corresponding Reduce partition needs to process; and then output the correspondence between fine-grained partitions and Reduce partitions; the fine-grained partitioning unit is further configured to receive the correspondence between fine-grained partitions and Reduce partitions output by the receiving unit and, according to the correspondence, to output the intermediate result data of the fine-grained partitions that belong to the same Reduce partition to the corresponding Reduce partition;

7. The device of claim 6, characterized by further including a judging unit and a sending unit, wherein,

8. A data balancing processing system based on Map and Reduce, characterized in that the system includes a job server and one or more task servers, wherein,

The job server is configured to: receive the data submitted by the client and output it to the task server; aggregate the received intermediate result data volume information that each mapper reports for its own fine-grained partitions to obtain the total intermediate result data volume; calculate, according to the number of Reduce partitions, the intermediate result data volume that each Reduce partition needs to process; determine, according to the calculated volumes, the Reduce partition corresponding to each fine-grained partition, so that the sum of the intermediate result data volumes of the selected fine-grained partitions is equal or approximately equal to the volume that the corresponding Reduce partition needs to process; and output the correspondence between fine-grained partitions and Reduce partitions to the task server;

9. The system of claim 8, characterized in that the job server includes a receiving unit, a balance policy calculation unit, and a sending unit, wherein,

10. The system of claim 8, characterized in that the task server includes a receiving unit, a primary partition unit, a Map processing unit, a fine-grained partitioning unit, a Reduce partitioning unit, and a Reduce processing unit, wherein,

The receiving unit is configured to receive data from the external job server and output it to the primary partition unit; and to receive the correspondence between fine-grained partitions and Reduce partitions output by the external job server and output it to the fine-grained partitioning unit;

11. The system of claim 10, characterized in that the task server further includes a judging unit and a sending unit, wherein,