
CN101894051A - CPU-GPU Cooperative Computing Method Based on Primary and Secondary Data Structure - Google Patents


Info

Publication number
CN101894051A
Authority
CN
China
Prior art keywords
data
cpu
gpu
data structure
major
Prior art date
Legal status
Pending
Application number
CN 201010244535
Other languages
Chinese (zh)
Inventor
安虹
姚平
刘谷
徐光�
许牧
李小强
韩文廷
张倩
徐恒阳
Current Assignee
University of Science and Technology of China USTC
Original Assignee
University of Science and Technology of China USTC
Priority date
Filing date
Publication date
Application filed by University of Science and Technology of China USTC filed Critical University of Science and Technology of China USTC
Priority to CN 201010244535
Publication of CN101894051A

Abstract

An embodiment of the present invention provides a CPU-GPU cooperative computing method based on a major-minor (primary-secondary) data structure, comprising the following steps: determining the content of the major and minor data according to the objects to be processed and initializing it; starting a CPU compute thread and a GPU compute thread; and reading in the data to be processed, preprocessing it, and storing it into the major-minor data structure, while the CPU compute thread and the GPU compute thread process the data in that structure until no data remains. The proposed scheme manages parallel data effectively, so that when a GPGPU platform processes a database whose effective amount of computation is unevenly distributed, load balance across the threads on the GPU is ensured. By designing a simple, reusable thread-partitioning method, the scheme allows the CPU and the GPU to compute fully in parallel while maintaining high utilization.

Description

CPU-GPU cooperative computing method based on major-minor data structure
Technical field
The present invention relates to the field of computers, and in particular to a CPU-GPU cooperative computing method based on a major-minor data structure.
Background art
To reach extreme levels of performance, the HPC field usually has to link together large numbers of CPUs. The CPU (Central Processing Unit) is the core that controls the operation of a computer, and such systems compute by parallel, distributed processing; however, not only is this architecture difficult to program for, its hardware is bulky and its power consumption is especially startling. The concept of GPGPU (General-Purpose computing on Graphics Processing Units) arose precisely to remedy these weaknesses of the traditional CPU architecture.
A single GPU (Graphics Processing Unit) usually contains tens to hundreds of built-in programmable processing units. If these units, which specialize in parallel computation, are exploited by the right methods, very large gains in computing efficiency can be obtained in certain applications. Because of this characteristic, GPGPU is also regarded as a possible solution for cloud computing and even artificial intelligence.
Up to now, GPGPU has been adopted more readily in server applications than in general consumer computing. In applications such as biomedicine, meteorological simulation, the film industry, and professional graphics processing, GPGPU computing can save a great deal of computation time; on the consumer side, however, the benefit GPGPU brings is less obvious than in professional applications.
GPGPU is characterized as follows: the CPU acts as the master, running the operating system, handling input and output, and controlling the program flow; the GPU acts as a coprocessor, running the kernel functions that need large amounts of computation.
GPGPU faces two problems. 1) Thread load balance on the GPU. Because every thread uses the same code, every thread is treated as if its actual workload were the same, namely the maximum effective amount of computation; in reality, the effective workload of each thread may differ, which unbalances the load on the GPU. 2) Utilization of the CPU and the GPU. The mode of cooperative computation between CPU and GPU directly affects their utilization. In the synchronous-call mode, after calling the GPU the CPU must wait for the GPU computation to finish before it can do further work, so CPU utilization is low. In the asynchronous-call mode, the CPU returns immediately after launching the GPU and can compute in parallel while the GPU computes, but the size of this parallel workload is hard to determine: if the CPU's parallel workload is too small, CPU utilization remains low; if it is too large, the GPU must wait for the CPU to assign it a new computing task after its own computation finishes, so GPU utilization becomes low. Only when the time required by the CPU's parallel workload exactly matches the GPU's computation time are high CPU and GPU utilization obtained simultaneously, and determining that workload precisely is very difficult.
An effective technical scheme is therefore needed to solve the problem of CPU-GPU cooperative computation.
Summary of the invention
The purpose of the present invention is to remedy at least one of the above technical deficiencies, and in particular to propose an effective CPU-GPU cooperative computing scheme that improves the high-performance computing capability of a computer.
To achieve this object, an embodiment of the invention proposes a CPU-GPU cooperative computing method based on a major-minor data structure, comprising the following steps:
according to the objects to be processed, determining the major-minor data structure and initializing it;
reading in the data to be processed until no data remains, and sending a read-finished signal RF to the CPU compute thread and the GPU compute thread;
the CPU compute thread and the GPU compute thread processing the data that has been read in.
According to an embodiment of the invention, reading in the data to be processed comprises:
reading in one unit of data, preprocessing it into master data and auxiliary data, storing these respectively into the corresponding master data management interval and the secondary data structure, and keeping the mapping relation between them.
According to an embodiment of the invention, the master data is the entity content of a unit of data of the object being processed, and the auxiliary data is information describing the master data.
According to an embodiment of the invention, the CPU compute thread processes the data that has been read in through the following steps:
Step A: judge whether the RF signal has been received; if so, set the flag FL to true, otherwise set it to false;
Step B: scan the master data management intervals in turn; for each interval satisfying the CPU processing condition, call the CPU to process it, maintaining the secondary data structure at the same time;
Step C: judge the value of the flag FL; if it is true, finish, otherwise continue from Step A.
According to an embodiment of the invention, the GPU compute thread processes the data that has been read in through the following steps:
Step D: judge whether the RF signal has been received; if so, set the flag FL to true, otherwise set it to false;
Step E: scan the master data management intervals in turn; for each interval satisfying the GPU processing condition, call the GPU to process it, maintaining the secondary data structure at the same time;
Step F: judge the value of the flag FL; if it is true, finish, otherwise continue from Step D.
The scheme proposed by the invention manages parallel data effectively, so that when the GPGPU platform processes a database whose effective amount of computation is unevenly distributed, load balance across the threads on the GPU can be guaranteed. By designing a simple, reusable thread-partitioning method, the scheme allows the CPU and the GPU to compute fully in parallel while maintaining high utilization.
Additional aspects and advantages of the invention are given in part in the description that follows; in part they will become apparent from the description, or may be learned through practice of the invention.
Description of the drawings
The above and/or additional aspects and advantages of the invention will become apparent and easy to understand from the following description of embodiments taken in conjunction with the accompanying drawings, in which:
Fig. 1 is a flow chart of the CPU-GPU cooperative computing method based on the major-minor data structure according to an embodiment of the invention;
Fig. 2 is a schematic diagram of the major-minor data structure;
Fig. 3 is a flow chart of the data read-in thread;
Fig. 4 is a flow chart of the CPU compute thread;
Fig. 5 is a flow chart of the master GPU compute thread.
Embodiments
Embodiments of the invention are described in detail below; examples of them are shown in the drawings, where identical or similar reference numbers denote, throughout, identical or similar elements or elements with identical or similar functions. The embodiments described below with reference to the drawings are exemplary, are intended only to explain the invention, and must not be construed as limiting it.
To achieve the object of the invention, the invention discloses a CPU-GPU cooperative computing method based on a major-minor data structure, comprising the following steps: according to the objects to be processed, determining the major-minor data structure and initializing it; reading in the data to be processed until no data remains, and sending a read-finished signal RF to the CPU compute thread and the GPU compute thread; and the CPU compute thread and the GPU compute thread processing the data that has been read in.
As shown in Fig. 1, the flow of the CPU-GPU cooperative computing method based on the major-minor data structure of the embodiment of the invention comprises the following steps:
S110: according to the objects to be processed, determine the major-minor data structure and initialize it.
In step S110, the major-minor data structure is determined and initialized. Usually, the master data is the entity content of a unit of data of the object being processed, and the auxiliary data is information describing the master data.
S120: read in all the data to be processed, and send the read-finished signal RF to the CPU compute thread and the GPU compute thread.
In step S120, the data to be processed is read in until no data remains, and the read-finished signal RF is sent to the CPU compute thread and the GPU compute thread.
Specifically, reading in the data to be processed comprises:
reading in one unit of data, preprocessing it into master data and auxiliary data, storing these respectively into the corresponding master data management interval and the secondary data structure, and keeping the mapping relation between them.
S130: the CPU compute thread and the GPU compute thread process the data that has been read in.
In step S130, the CPU compute thread and the GPU compute thread process the data that has been read in. Specifically, the CPU compute thread processes the data through the following steps:
Step A: judge whether the RF signal has been received; if so, set the flag FL to true, otherwise set it to false;
Step B: scan the master data management intervals in turn; for each interval satisfying the CPU processing condition, call the CPU to process it, maintaining the secondary data structure at the same time;
Step C: judge the value of the flag FL; if it is true, finish, otherwise continue from Step A.
The GPU compute thread processes the data that has been read in through the following steps:
Step D: judge whether the RF signal has been received; if so, set the flag FL to true, otherwise set it to false;
Step E: scan the master data management intervals in turn; for each interval satisfying the GPU processing condition, call the GPU to process it, maintaining the secondary data structure at the same time;
Step F: judge the value of the flag FL; if it is true, finish, otherwise continue from Step D.
To facilitate understanding of the invention, the scheme disclosed above is now described in more detail.
The whole computing task of a program can be divided into a main processing procedure and an auxiliary processing procedure. The main processing procedure is the part of the task in which the computation is concentrated; the auxiliary processing procedure is the computation outside the main processing procedure, and its amount of computation is small. Nor is the main processing procedure always suitable to run entirely on the GPU; this is determined by the character of the program itself and by the performance it must achieve. The computing task therefore has to be divided between GPU and CPU, and the data processed in a cooperative-computation manner.
First, two concepts are defined:
Master data: the entity content in a unit of data; this part is handled by the main processing procedure.
Auxiliary data: the part of a unit of data other than the master data, used in the auxiliary processing procedure. This part may be empty, i.e. the unit of data may consist entirely of master data and need no auxiliary processing. If it is not empty, it is handled in the auxiliary processing procedure.
The definitions and features of the master data structure and the secondary data structure are as follows:
Master data structure: used to manage and store the master data.
The master data is managed in intervals, with the effective amount of computation it requires as the classification criterion. The effective amount of computation of a piece of master data is determined by its size, length, or other features, and characterizes how much computation the main processing procedure needs in order to process it.
The division of the master data management intervals is decided according to the statistical distribution of the effective computation of the master data. The purpose of partitioning is to decide, according to how densely the data falls on each interval, whether that interval is computed on the GPU or on the CPU. In general, data-dense short intervals are best suited to GPU processing, and data-sparse long intervals to CPU processing.
The buffer of each master data management interval stores the master data assigned to that interval; the size of the buffer is preset by the programmer. When a predetermined condition is satisfied, for example when the buffer is full, the master data management interval submits the buffered master data to the CPU or the GPU for processing, and afterwards performs the follow-up operations, for example emptying the buffer.
Secondary data structure: used to manage and store the auxiliary data.
The secondary data structure must keep the mapping relation between each piece of master data and its corresponding auxiliary data.
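As a minimal illustration of this mapping relation, consider the following C sketch; the type and field names (MasterRecord, AuxRecord) are illustrative assumptions, not taken from the patent. Each master record keeps, next to its entity payload, the subscript of its auxiliary record in the secondary array, so the auxiliary data of any master datum can be reached directly:

    /* Hypothetical auxiliary record: information that describes one master datum. */
    typedef struct {
        char  name[64];          /* e.g. a sequence name                  */
        int   length;            /* e.g. the sequence length              */
        void *extra;             /* further descriptive data, may be NULL */
    } AuxRecord;

    /* Hypothetical master record: the entity content handled by the main
     * processing procedure, plus the subscript of its AuxRecord in the
     * secondary array, which is the mapping relation mentioned above. */
    typedef struct {
        const char *payload;     /* entity content                        */
        int         aux_index;   /* subscript of the matching AuxRecord   */
    } MasterRecord;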
Using the major-minor data structure, the invention designs a CPU-GPU cooperative computing method. As an embodiment of the invention, for example, three threads are set up on the CPU: a data read-in thread, a CPU compute thread, and a master GPU compute thread, which execute asynchronously with respect to one another.
The work of each thread is described below.
The data read-in thread runs on the CPU. Its responsibility is to handle all human-machine interaction, to read data from the data source and write it into the master/auxiliary data structure arrays, and to supervise the execution of the other two threads. The data read-in thread operates as follows:
1) initialize the major-minor data structure;
2) start the CPU compute thread and the master GPU compute thread;
3) read in one unit of data, preprocess it into master data and auxiliary data, store these respectively into the corresponding master data management interval and the secondary data structure, and keep the mapping relation;
4) continue reading in data until no data remains;
5) send the read-finished signal RF to the CPU compute thread and the GPU compute thread, and wait for them to finish;
6) perform any necessary post-processing.
The responsibility of the CPU compute thread is to process the master data suited to CPU processing. As a complement to the GPU computation, it serves two functions:
1) it handles the data that the GPU handles poorly, data for which GPU processing would bring no acceleration benefit;
2) it raises the utilization of the CPU, which no longer waits idly for GPU tasks to finish.
The CPU compute thread operates as follows:
1) judge whether the RF signal has been received; if so, set the flag FL to true, otherwise set it to false;
2) scan the master data management intervals in turn; for each interval satisfying the CPU processing condition, call the CPU to process it, maintaining the secondary data structure at the same time;
3) judge the value of the flag FL; if it is true, finish, otherwise return to 1).
The responsibility of the master GPU compute thread is to process the master data suited to GPU processing. It operates as follows:
1) judge whether the RF signal has been received; if so, set the flag FL to true, otherwise set it to false;
2) scan the master data management intervals in turn; for each interval satisfying the GPU processing condition, call the GPU to process it, maintaining the secondary data structure at the same time;
3) judge the value of the flag FL; if it is true, finish, otherwise return to 1).
As the above shows, "cooperative computation" means that the CPU and the GPU process data simultaneously and independently; the invention simply divides the tasks according to conditions given by the programmer. For example, the programmer can stipulate a threshold so that master data intervals holding fewer data items than the threshold are handled by the CPU and those holding more are given to the GPU, and so on; a sketch of such a dispatch rule follows.
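Under the assumption that the condition is the simple count threshold of the example above (THRESHOLD and all names here are illustrative, not fixed by the patent), the rule can be rendered in C as:

    /* Hypothetical tuning parameter chosen by the programmer. */
    #define THRESHOLD 1024

    typedef enum { RUN_ON_CPU, RUN_ON_GPU } Target;

    /* Intervals holding fewer data items than the threshold are processed
     * by the CPU; the rest are given to the GPU. */
    static Target choose_target(int items_in_interval)
    {
        return (items_in_interval < THRESHOLD) ? RUN_ON_CPU : RUN_ON_GPU;
    }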
The scheme proposed by the invention manages parallel data effectively, so that when the GPGPU platform processes a database whose effective amount of computation is unevenly distributed, load balance across the threads on the GPU can be guaranteed. By designing a simple, reusable thread-partitioning method, the scheme allows the CPU and the GPU to compute fully in parallel while maintaining high utilization.
For a better understanding of the technical scheme of the invention, the invention is further described below through an additional embodiment.
The implementation of Hmmsearch, from bioinformatics, on the CUDA platform is now taken as an example to describe a specific embodiment of the invention in detail. Hmmsearch queries a protein sequence database to obtain certain properties of target protein sequences.
The core function of Hmmsearch is P7_vitebi; the implementation of this function on the CUDA platform is called P7_vitebi_kernel.
The CUDA code of Hmmsearch is written according to the invention, with the major-minor data structure realized as follows.
The master data is the actual content of a protein; the auxiliary data is the protein's name, length, calibration information, and so on. The index value is the subscript, within the secondary data structure array, of the auxiliary data corresponding to a given piece of master data.
The master/auxiliary data structure is divided into two parts, the master data structure array and the secondary data structure array, shown schematically in Fig. 2.
The length of the master data structure array is set to 64, i.e. there are 64 effective-computation intervals.
The master data management intervals are realized by master data structures, whose elements have the following meanings:
(1) effective-computation interval: describes the interval of protein sequence lengths that this structure manages. The first 60 intervals each have length 32 and manage protein sequences of length between 0 and 1920; the 61st, 62nd, 63rd, and 64th intervals manage protein sequences of length in [1920, 2320), [2320, 2720), [2720, 3120), and [3120, 37000) respectively;
(2) max_num: the maximum amount of master data the structure can manage, set to 4096;
(3) current_num: the amount of master data the structure currently manages;
(4) pbuffer[2]: an array of two pointers, pointing to two arrays each able to store MAX_NUM pieces of master data (these arrays are managed by dynamic memory allocation);
(5) pindex[2]: an array of two pointers, pointing to two arrays each able to store MAX_NUM index values (also dynamically managed); for example, the i-th element of the array pointed to by pindex[0] is the index value of the i-th piece of master data in the array pointed to by pbuffer[0];
(6) full[2]: an array of two integers indicating whether the array pointed to by the corresponding pbuffer entry has stored MAX_NUM pieces of master data; once filled, a buffer can be submitted to the GPU for computation. For example, full[0] = 0 means the array pointed to by pbuffer[0] is not yet full, and full[1] = 1 means the array pointed to by pbuffer[1] is full;
(7) current_index: 0 or 1, indicating into which of the two arrays master data is currently being stored; for example, current_index = 1 means master data is currently stored into the array pointed to by pbuffer[1].
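Collecting fields (1) to (7), one plausible C rendering of the master data structure is the sketch below; the field names follow the patent, while the interval-bound fields and the element type of the buffers (a protein sequence stored as a character string) are assumptions:

    #define MAX_NUM 4096                /* maximum master data per buffer       */

    typedef struct {
        int    min_len, max_len;        /* (1) managed sequence-length interval */
        int    max_num;                 /* (2) capacity, set to 4096            */
        int    current_num;             /* (3) master data currently managed    */
        char **pbuffer[2];              /* (4) two arrays of MAX_NUM sequences  */
        int   *pindex[2];               /* (5) two arrays of MAX_NUM indexes    */
        int    full[2];                 /* (6) 1 when the matching buffer is full */
        int    current_index;           /* (7) buffer currently being filled    */
    } MaindataManagement;

    MaindataManagement MMA[64];         /* 64 effective-computation intervals   */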
The elements of the secondary data structure have the following meanings:
(1) No: the structure's number, recording this structure's own subscript in the array;
(2) next: records the number of the next idle structure;
(3) paiddata: points to the memory that stores one piece of auxiliary data.
At the same time, the idle structures in the whole array are organized into an idle-structure linked list, whose head is marked by freelist_head.
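The secondary data structure and its free-list maintenance can be sketched in C as follows; AMA_SIZE is an assumed capacity, and the two helper functions are illustrative names for the list operations used in the thread flows below:

    #define AMA_SIZE 8192               /* assumed capacity of the array         */

    typedef struct {
        int   No;                       /* (1) own subscript in the array        */
        int   next;                     /* (2) number of the next idle structure */
        void *paiddata;                 /* (3) storage for one auxiliary datum   */
    } AiddataManagement;

    AiddataManagement AMA[AMA_SIZE];
    int freelist_head;                  /* head of the idle-structure list       */

    /* Take the node at the head of the idle list, store one auxiliary datum
     * in it, and return its No value (step 4 of the read-in thread below). */
    int alloc_aux_node(void *aux)
    {
        int no = freelist_head;
        freelist_head = AMA[no].next;   /* new head: the node numbered next      */
        AMA[no].paiddata = aux;
        return no;
    }

    /* Return the node with subscript idx to the idle list (step 5 of the
     * compute threads below). */
    void free_aux_node(int idx)
    {
        AMA[idx].paiddata = NULL;
        AMA[idx].next = freelist_head;
        freelist_head = idx;
    }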
The flow chart of the data read-in thread is shown in Fig. 3; it operates as follows:
1) Initialize each structure in the master data structure array (Maindata Management Array, MMA) as follows:
(a) current_num: 0;
(b) pbuffer[0], pbuffer[1]: NULL;
(c) pindex[0], pindex[1]: NULL;
(d) full[0], full[1]: 0;
(e) current_index: 0.
Then initialize each structure in the secondary data structure array (Aiddata Management Array, AMA) as follows:
(a) No: i (its subscript in the array);
(b) next: i+1;
(c) paiddata: NULL.
Then initialize the head of the idle-structure linked list:
(a) freelist_head: 0.
Then start the CPU compute thread and the GPU compute thread.
2) Judge whether all the data has been read in; if so, go to 6), otherwise go to 3).
3) Read in a protein sequence; take the actual content of the protein as the master data, and the protein's name, length, calibration information, and so on as the auxiliary data.
4) Store the auxiliary data in the node at the head of the idle list, record that node's No value, and set the head of the idle list to the node whose subscript is next.
5) Obtain the length of the protein sequence, determine which length interval it falls into, store the master data into the corresponding member of the MMA, and record the No value obtained in 4) in the corresponding index array. Then go to 2).
6) All the data has been read in; send the read-finished signal RF to the CPU compute thread and the GPU compute thread.
7) Wait for the CPU compute thread and the GPU compute thread to finish.
8) Post-process the data and terminate the program.
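Steps 3) to 5) might look as follows in C. This is a sketch building on the MMA and AMA definitions above; the length-to-interval mapping implements the 60 intervals of width 32 plus the four wide intervals listed earlier, and the double-buffer switch on a full buffer is an assumption about details the patent leaves open:

    /* Map a protein sequence length to one of the 64 interval subscripts. */
    static int interval_index(int len)
    {
        if (len < 1920) return len / 32;   /* first 60 intervals, width 32 */
        if (len < 2320) return 60;
        if (len < 2720) return 61;
        if (len < 3120) return 62;
        return 63;                         /* [3120, 37000)                */
    }

    /* Store one protein: auxiliary data into the AMA (step 4), master data
     * and its No value side by side into the MMA (step 5). */
    void store_protein(char *content, int len, void *aux)
    {
        int no = alloc_aux_node(aux);
        MaindataManagement *m = &MMA[interval_index(len)];
        int c = m->current_index;
        m->pbuffer[c][m->current_num] = content;
        m->pindex[c][m->current_num]  = no;
        if (++m->current_num == m->max_num) {
            m->full[c] = 1;                /* publish the full buffer      */
            m->current_index = 1 - c;      /* keep filling the other one   */
            m->current_num = 0;
        }
    }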
The flow chart of the CPU compute thread is shown in Fig. 4; it operates as follows:
1) Judge whether the RF signal has been received; if so, set the flag FL to true, otherwise set it to false.
2) Set the variable i to 0.
3) If the pending master data storage array of the i-th member of the MMA (MMA[i]) is full (its full flag is 1), and the master data in this management structure meets the preset CPU processing condition, go to 4), otherwise go to 6).
4) Call P7_vitebi to process the master data on the CPU, and set the full flag to 0 when finished.
5) Traverse the index array corresponding to the master data array processed in 4); for each index value IDX obtained, return the node with subscript IDX in the AMA to the idle-node linked list.
6) Increase i by 1.
7) If i is greater than 63, go to 8), otherwise go to 3).
8) Judge the value of the flag FL; if it is true, go to 9), otherwise go to 1).
9) Send an end signal to the main thread and finish.
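A sketch of this loop in C, building on the earlier sketches, follows; rf_received(), cpu_condition() and p7_vitebi_cpu() are assumed stand-ins for the RF-signal check, the preset CPU processing condition and the CPU call to P7_vitebi, and the synchronization of the full flags between threads is omitted:

    extern int  rf_received(void);                   /* RF signal arrived?   */
    extern int  cpu_condition(int interval);         /* preset CPU condition */
    extern void p7_vitebi_cpu(char **seqs, int n);   /* CPU processing call  */

    void cpu_compute_thread(void)
    {
        for (;;) {
            int fl = rf_received();                          /* step 1)     */
            for (int i = 0; i <= 63; i++) {                  /* steps 2)-7) */
                MaindataManagement *m = &MMA[i];
                for (int b = 0; b < 2; b++) {
                    if (m->full[b] && cpu_condition(i)) {
                        p7_vitebi_cpu(m->pbuffer[b], m->max_num); /* 4)     */
                        for (int k = 0; k < m->max_num; k++)      /* 5)     */
                            free_aux_node(m->pindex[b][k]);
                        m->full[b] = 0;
                    }
                }
            }
            if (fl) break;                                   /* step 8)     */
        }
        /* step 9): signal the main (read-in) thread and finish */
    }

The master GPU compute thread, described next, runs the same loop but checks the preset GPU processing condition and calls P7_vitebi_kernel instead.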
The flow chart of the master GPU compute thread is shown in Fig. 5; it operates as follows:
1) Judge whether the RF signal has been received; if so, set the flag FL to true, otherwise set it to false.
2) Set the variable i to 0.
3) If the pending master data storage array of the i-th member of the MMA (MMA[i]) is full (its full flag is 1), and the master data in this management structure meets the preset GPU processing condition, go to 4), otherwise go to 6).
4) Call P7_vitebi_kernel to process the master data on the GPU, and set the full flag to 0 when finished.
5) Traverse the index array corresponding to the master data array processed in 4); for each index value IDX obtained, return the node with subscript IDX in the AMA to the idle-node linked list.
6) Increase i by 1.
7) If i is greater than 63, go to 8), otherwise go to 3).
8) Judge the value of the flag FL; if it is true, go to 9), otherwise go to 1).
9) Send an end signal to the main thread and finish.
Those of ordinary skill in the art will appreciate that all or part of the steps of the methods of the above embodiments can be carried out by a program instructing the relevant hardware; the program can be stored in a computer-readable storage medium and, when executed, performs one of the steps of the method embodiments or a combination of them.
In addition, the functional units in the embodiments of the invention may be integrated into one processing module, may each exist separately and physically, or two or more units may be integrated into one module. The integrated module may be realized in the form of hardware or in the form of a software functional module. If the integrated module is realized in the form of a software functional module and is sold or used as an independent product, it may also be stored in a computer-readable storage medium.
The storage medium mentioned above may be a read-only memory, a magnetic disk, an optical disk, or the like.
The above is only a preferred implementation of the invention. It should be pointed out that those skilled in the art can make further improvements and modifications without departing from the principle of the invention, and such improvements and modifications should also be regarded as falling within the protection scope of the invention.

Claims (5)

1. A CPU-GPU cooperative computing method based on a major-minor data structure, characterized by comprising the following steps:
according to the objects to be processed, determining the content of the major and minor data and initializing it;
starting a CPU compute thread and a GPU compute thread;
reading in the data to be processed and storing it, after preprocessing, in the major-minor data structure, the CPU compute thread and the GPU compute thread simultaneously processing the data in the major-minor data structure until no data remains.
2. The CPU-GPU cooperative computing method based on a major-minor data structure of claim 1, characterized in that reading in the data to be processed comprises:
reading in one unit of data, preprocessing it into master data and auxiliary data, storing these respectively into the corresponding master data management interval and the secondary data structure, and keeping the mapping relation between them.
3. The CPU-GPU cooperative computing method based on a major-minor data structure of claim 2, characterized in that the master data is the entity content of a unit of data of the object being processed, and the auxiliary data is information describing the master data.
4. The CPU-GPU cooperative computing method based on a major-minor data structure of claim 3, characterized in that the CPU compute thread processes the data that has been read in through the following steps:
Step A: judge whether the RF signal has been received; if so, set the flag FL to true, otherwise set it to false;
Step B: scan the master data management intervals in turn; for each interval satisfying the CPU processing condition, call the CPU to process it, maintaining the secondary data structure at the same time;
Step C: judge the value of the flag FL; if it is true, finish, otherwise continue from Step A.
5. The CPU-GPU cooperative computing method based on a major-minor data structure of claim 3, characterized in that the GPU compute thread processes the data that has been read in through the following steps:
Step D: judge whether the RF signal has been received; if so, set the flag FL to true, otherwise set it to false;
Step E: scan the master data management intervals in turn; for each interval satisfying the GPU processing condition, call the GPU to process it, maintaining the secondary data structure at the same time;
Step F: judge the value of the flag FL; if it is true, finish, otherwise continue from Step D.
CN 201010244535 2010-07-29 2010-07-29 CPU-GPU Cooperative Computing Method Based on Primary and Secondary Data Structure Pending CN101894051A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN 201010244535 CN101894051A (en) 2010-07-29 2010-07-29 CPU-GPU Cooperative Computing Method Based on Primary and Secondary Data Structure

Publications (1)

Publication Number Publication Date
CN101894051A true CN101894051A (en) 2010-11-24

Family

ID=43103247

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101091175A (en) * 2004-09-16 2007-12-19 Nvidia Corporation Load balancing
US20100118041A1 (en) * 2008-11-13 2010-05-13 Hu Chen Shared virtual memory
CN101526934A (en) * 2009-04-21 2009-09-09 浪潮电子信息产业股份有限公司 Construction method of GPU and CPU combined processor

Cited By (19)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102591418B (en) * 2010-12-16 2015-07-01 微软公司 Scalable multimedia computer system architecture with QOS guarantees
CN102591418A (en) * 2010-12-16 2012-07-18 微软公司 Scalable multimedia computer system architecture with qos guarantees
WO2014139140A1 (en) * 2013-03-15 2014-09-18 Hewlett-Packard Development Company, L.P. Co-processor-based array-oriented database processing
WO2014206233A1 (en) * 2013-06-25 2014-12-31 华为技术有限公司 Data processing method and device
CN103888771A (en) * 2013-12-30 2014-06-25 中山大学深圳研究院 Parallel video image processing method based on GPGPU technology
CN104102546A (en) * 2014-07-23 2014-10-15 浪潮(北京)电子信息产业有限公司 Method and system for realizing CPU (central processing unit) and GPU (graphics processing unit) load balance
CN104102546B (en) * 2014-07-23 2018-02-02 浪潮(北京)电子信息产业有限公司 A kind of method and system for realizing CPU and GPU load balancing
US12190404B2 (en) 2017-01-06 2025-01-07 Google Llc Executing computational graphs on graphics processing units
CN108460458A (en) * 2017-01-06 2018-08-28 谷歌有限责任公司 Execute computational graphs on graphics processing units
CN109753134A (en) * 2018-12-24 2019-05-14 四川大学 A GPU internal energy consumption control system and method based on global decoupling
CN109753134B (en) * 2018-12-24 2022-04-15 四川大学 A GPU internal energy consumption control system and method based on global decoupling
CN111160551B (en) * 2019-12-04 2023-09-29 上海寒武纪信息科技有限公司 Calculation map execution method, computer device, and storage medium
CN111160551A (en) * 2019-12-04 2020-05-15 上海寒武纪信息科技有限公司 Computational graph execution method, computer device and storage medium
CN112989082A (en) * 2021-05-20 2021-06-18 南京甄视智能科技有限公司 CPU and GPU mixed self-adaptive face searching method and system
CN112989082B (en) * 2021-05-20 2021-07-23 南京甄视智能科技有限公司 CPU and GPU mixed self-adaptive face searching method and system
CN114170696A (en) * 2021-12-16 2022-03-11 华南理工大学 Real-time toll calculation system and method for differential charging of expressway
CN114254357A (en) * 2021-12-22 2022-03-29 上海阵方科技有限公司 Data processing method and device based on privacy protection and server
CN114254357B (en) * 2021-12-22 2025-04-29 上海阵方科技有限公司 Data processing method, device and server based on privacy protection
CN114924876A (en) * 2022-05-11 2022-08-19 平安科技(深圳)有限公司 Voiceprint recognition method and device based on distributed heterogeneous operation and storage medium

Similar Documents

Publication Publication Date Title
CN101894051A (en) CPU-GPU Cooperative Computing Method Based on Primary and Secondary Data Structure
EP3698293B1 (en) Neural network processing system having multiple processors and a neural network accelerator
CN101802874B (en) Fragment shader bypass in a graphics processing unit, and apparatus and method thereof
US8676874B2 (en) Data structure for tiling and packetizing a sparse matrix
US8762655B2 (en) Optimizing output vector data generation using a formatted matrix data structure
US11687242B1 (en) FPGA board memory data reading method and apparatus, and medium
JP2010033561A (en) Method and apparatus for partitioning and sorting data set on multiprocessor system
US20210406209A1 (en) Allreduce enhanced direct memory access functionality
CN110333946A (en) One kind being based on artificial intelligence cpu data processing system and method
US9170836B2 (en) System and method for re-factorizing a square matrix into lower and upper triangular matrices on a parallel processor
CN105653204A (en) Distributed graph calculation method based on disk
CN107229995A (en) Realize method, device and computer-readable recording medium that game service amount is estimated
CN115357554A (en) A graph neural network compression method, device, electronic equipment and storage medium
CN102591709A (en) Shapefile master-slave type parallel writing method based on OGR (open geospatial rule)
CN104657111A (en) Parallel computing method and device
CN110516316A (en) A GPU Accelerated Method for Solving Euler's Equation by Discontinuous Galerkin Method
CN113886080A (en) High-performance cluster task scheduling method, device, electronic device and storage medium
CN106295670A (en) Data processing method and data processing equipment
Jiang et al. GLARE: Accelerating Sparse DNN Inference Kernels with Global Memory Access Reduction
CN115202848A (en) Task scheduling method, system, device and storage medium for convolutional neural network
CN115237599B (en) Rendering task processing method and device
CN116860999A (en) Ultra-large language model distributed pre-training method, device, equipment and medium
CN102819454A (en) Finite element explicit parallel solving and simulating method based on graphic processing unit (GPU)
CN110473593A (en) A kind of Smith-Waterman algorithm implementation method and device based on FPGA
CN112051981B (en) Data pipeline calculation path structure and single-thread data pipeline system

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C02 Deemed withdrawal of patent application after publication (patent law 2001)
WD01 Invention patent application deemed withdrawn after publication

Application publication date: 20101124