Disclosure of Invention
The following presents a simplified summary in order to provide a basic understanding of some aspects of the disclosed embodiments. This summary is not an extensive overview, and is intended neither to identify key or critical elements nor to delineate the scope of such embodiments; rather, it is intended as a prelude to the more detailed description that follows.
Embodiments of the disclosure provide a method, a system, a device, and a medium for out-of-order scheduling based on GPGPU instruction pre-analysis, which aim to solve the problem that conventional methods and techniques still have limitations in handling GPGPU pipeline stalls.
In some embodiments, the method comprises:
The instruction fetch and decode stage, namely fetching the instruction to be executed from the instruction cache under the direction of the instruction fetch scheduler and storing it into the instruction buffer; when a branch instruction is encountered, fetching according to the to-be-executed branch PC value popped by the SIMT stack management unit; when the program counter (PC) jumps, fetching the instruction according to the jump address calculated by the execution unit;
The pre-analysis stage, in which a dependency check is performed between instructions of the same thread bundle entering in order from the instruction buffer and instructions of the same thread bundle already in the send buffer; new instructions with no data dependency are stored in the send buffer; at the same time, the issued-but-not-yet-written-back instruction information delivered by the scoreboard module is passed to the send buffer module, and a dependency check is performed between the instructions in the send buffer and that information; for dependent instructions or special instructions, the corresponding conflict bits in the send buffer are set as control signals to gate their sending; and instructions with no dependency are selected for out-of-order sending;
The issue stage, in which the thread bundle scheduler schedules and switches thread bundles, issues the instruction to the execution unit for execution, and sends the issued instruction information to the scoreboard module;
and the execution stage, in which the issued instruction is executed, the execution result is written back to the register file, and the result is fed back to the SIMT stack and the instruction fetch scheduler.
Preferably, the special instructions in the pre-analysis stage include a synchronization instruction, a memory access instruction, and a branch instruction.
Preferably, the scoreboard module is used for recording information on instructions that have been issued but not yet written back, and for supporting the dependency check of the pre-analysis stage.
Preferably, the dependency check of the pre-analysis stage is a register dependency check performed on each new Warp instruction received from the instruction buffer, in the following manner:
Comparing the source-1, source-2, and source-3 register index values and the destination register index value of the new instruction against the destination register index values of all stored instructions of the same thread bundle in the current send buffer; if any are equal, the new instruction is judged to be a data conflict instruction;
Comparing the destination register index value of the new instruction against the source-1, source-2, and source-3 register index values of all stored instructions in the current send buffer; if any are equal, the new instruction is judged to be a data conflict instruction;
An instruction judged to be a data conflict instruction is prevented from entering the send buffer.
Preferably, a data conflict instruction is prevented from being sent from the send buffer in the following specific manner:
When an instruction is found to have no scoreboard conflict, that is, the instruction in the send buffer has no dependency on any currently executing instruction, the Ready bit of the corresponding instruction entry is set high; if a scoreboard conflict exists, the conflict bit of the corresponding instruction entry is set high and the Ready bit is not set high.
Preferably, out-of-order sending is performed in the following specific manner: an instruction whose Ready bit and Valid bit are both high is selected for out-of-order sending, followed by out-of-order execution and out-of-order write-back.
Preferably, load instructions are executed in adherence to the memory consistency model;
for load instructions and store instructions, the load instruction bit and the store instruction bit are set high to control their sending;
for a synchronization instruction, before it enters the pre-analysis stage in order from the instruction buffer, the synchronization instruction bit in the send-check entry is set high and the instruction buffer is stalled; instructions are sent to the pre-analysis stage again only after all instructions other than the synchronization instruction in the send buffer have finished executing, and instruction pre-analysis continues after the synchronization instruction has executed;
for a branch instruction, after it enters the pre-analysis stage, the branch instruction bit of the corresponding instruction is set high; instructions after the branch are not fed into the pre-analysis stage before the branch result is determined; the instructions after the branch wait for the final branch result, and the instructions of the corresponding branch path are then selected to enter the pre-analysis stage for dependency checking.
In some embodiments, the system comprises:
The instruction fetch and decode module is configured to fetch the instruction to be executed from the instruction cache and store it into the instruction buffer; when a branch instruction is encountered, fetch according to the to-be-executed branch PC value popped by the SIMT stack management unit; when the program counter (PC) jumps, fetch the instruction according to the jump address calculated by the execution unit;
The issue module is configured to schedule and switch thread bundles via the thread bundle scheduler, issue the instruction to the execution unit for execution, and send the issued instruction information to the scoreboard module;
The execution module is configured to execute the issued instruction, write the execution result back to the register file, and feed the result back to the SIMT stack and the instruction fetch scheduler;
The pre-analysis module is configured to perform dependency checking and out-of-order sending on the instructions in the instruction buffer, and comprises:
the instruction receiving module, used for receiving an instruction buffer containing a plurality of Warp instructions;
the dependency checking module, connected to the instruction receiving module and used for performing a register dependency check on each newly received Warp instruction;
the send buffer module, used for storing instructions that pass the dependency check with no data conflict and setting a corresponding Valid bit for each stored instruction;
the send checking module, used for monitoring scoreboard conflicts between instructions in the send buffer and currently executing instructions, and setting the corresponding Ready bit high for instructions with no scoreboard conflict;
the out-of-order sending module, connected to the send checking module and used for selecting instructions whose Ready bit and Valid bit are both high for out-of-order sending;
and the execution and write-back module, used for executing the out-of-order sent instructions and writing the results back to the corresponding registers out of order.
In some embodiments, the apparatus includes a processor and a memory storing program instructions, the processor configured to perform the out-of-order scheduling method of GPGPU instruction pre-analysis when the program instructions are executed.
In some embodiments, the storage medium stores program instructions that, when executed, perform the out-of-order scheduling method of GPGPU instruction pre-analysis.
The out-of-order scheduling method for GPGPU instruction pre-analysis provided by the embodiment of the disclosure can realize the following technical effects:
By adding a pre-analysis stage at the front end of the pipeline, a dependency check is performed on instructions and instructions with no data dependency are issued out of order, which effectively avoids pipeline stalls caused by long-latency instructions (such as memory access instructions) and improves GPGPU execution efficiency.
When special instructions (such as memory access instructions, synchronization instructions, and branch instructions) are encountered, the method adopts dedicated handling strategies, such as forbidding out-of-order sending of load and store instructions to maintain memory consistency, and suspending instruction sending via a control bit until the synchronization instruction has executed, which further ensures the correctness of instruction execution and the stability of the pipeline.
The method of the invention avoids complex and resource-intensive mechanisms such as load/store reordering and register renaming, thereby reducing GPGPU hardware resource overhead. Specifically, the dependency check in the pre-analysis stage ensures that there is no data dependency among sent instructions, which simplifies the hardware design, reduces pipeline stall time, and improves overall execution efficiency.
In summary, the out-of-order scheduling method for GPGPU instruction pre-analysis provided by the invention yields clear benefits in reducing hardware resource overhead, improving execution efficiency, and reducing pipeline stalls, and provides a new and more efficient solution for GPGPU instruction scheduling and execution.
The foregoing general description and the following description are exemplary and explanatory only and are not restrictive of the application.
Detailed Description
So that the manner in which the features and techniques of the disclosed embodiments can be understood in more detail, a more particular description of the embodiments of the disclosure, briefly summarized above, may be had by reference to the appended drawings, which are not intended to limit the embodiments of the disclosure. In the following description of the technology, for purposes of explanation, numerous details are set forth in order to provide a thorough understanding of the disclosed embodiments. However, one or more embodiments may still be practiced without these details. In other instances, well-known structures and devices may be shown in simplified form in order to simplify the drawings.
The terms first, second and the like in the description and in the claims of the embodiments of the disclosure and in the above-described figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate in order to describe embodiments of the present disclosure. Furthermore, the terms "comprise" and "have," as well as any variations thereof, are intended to cover a non-exclusive inclusion.
The term "plurality" means two or more, unless otherwise indicated.
As shown in fig. 1, a method for out-of-order scheduling in GPGPU instruction pre-analysis includes:
The instruction fetch and decode stage, namely fetching the instruction to be executed from the instruction cache under the direction of the instruction fetch scheduler and storing it into the instruction buffer; when a branch instruction is encountered, fetching according to the to-be-executed branch PC value popped by the SIMT stack management unit; when the program counter (PC) jumps, fetching the instruction according to the jump address calculated by the execution unit;
The pre-analysis stage, in which a dependency check is performed between instructions of the same thread bundle entering in order from the instruction buffer and instructions of the same thread bundle already in the send buffer; new instructions with no data dependency are stored in the send buffer; at the same time, the issued-but-not-yet-written-back instruction information delivered by the scoreboard module is passed to the send buffer module, and a dependency check is performed between the instructions in the send buffer and that information; for dependent instructions or special instructions, the corresponding conflict bits in the send buffer are set as control signals to gate their sending; and instructions with no dependency are selected for out-of-order sending;
The issue stage is to schedule and switch thread bundles through the thread bundle scheduler, issue instructions to the execution unit for execution, and send the issued instruction information to the scoreboard module;
and the execution stage, in which the issued instruction is executed, the execution result is written back to the register file, and the result is fed back to the SIMT stack and the instruction fetch scheduler.
As a refinement of the above embodiment, the special instructions in the pre-analysis stage include a synchronization instruction, a memory access instruction, and a branch instruction.
As a refinement of the above embodiment, the scoreboard module is configured to record information on instructions that have been issued but not yet written back, and to support the dependency check in the pre-analysis stage.
As a refinement of the above embodiment, the dependency check of the pre-analysis stage is a register dependency check performed on each new Warp instruction received from the instruction buffer, in the following manner:
Comparing the source-1, source-2, and source-3 register index values and the destination register index value of the new instruction against the destination register index values of all stored instructions of the same thread bundle in the current send buffer; if any are equal, the new instruction is judged to be a data conflict instruction;
Comparing the destination register index value of the new instruction against the source-1, source-2, and source-3 register index values of all stored instructions in the current send buffer; if any are equal, the new instruction is judged to be a data conflict instruction;
An instruction judged to be a data conflict instruction is prevented from entering the send buffer.
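The register dependency check described above can be expressed as a small software sketch. This is an illustrative model only, not the hardware implementation; the dictionary fields (`warp`, `dst`, `srcs`) and function names are assumptions made for the sketch.

```python
def has_data_conflict(new_instr, send_buffer):
    """new_instr and buffered entries are dicts with 'warp', 'dst', 'srcs'.
    A dst of None means the instruction writes no register."""
    for old in send_buffer:
        if old["warp"] != new_instr["warp"]:
            continue  # only instructions of the same thread bundle are checked
        # Rule 1: an older destination matches a new source or the new
        # destination (read-after-write or write-after-write hazard).
        if old["dst"] is not None:
            if old["dst"] in new_instr["srcs"] or old["dst"] == new_instr["dst"]:
                return True
        # Rule 2: the new destination matches an older source
        # (write-after-read hazard).
        if new_instr["dst"] is not None and new_instr["dst"] in old["srcs"]:
            return True
    return False


def try_enter_send_buffer(new_instr, send_buffer):
    """Only conflict-free instructions are allowed into the send buffer."""
    if has_data_conflict(new_instr, send_buffer):
        return False  # the conflicting instruction is held back
    send_buffer.append(new_instr)
    return True
```

For example, an instruction reading register r1 is rejected while an older buffered instruction of the same Warp still has r1 as its destination, whereas an instruction from a different Warp is admitted unconditionally.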
By adding a pre-analysis stage at the front end of the pipeline, a dependency check is performed on instructions and instructions with no data dependency are issued out of order, which effectively avoids pipeline stalls caused by long-latency instructions (such as memory access instructions) and improves GPGPU execution efficiency.
As a refinement of the above embodiment, a data conflict instruction is prevented from being sent from the send buffer in the following specific manner:
When an instruction is found to have no scoreboard conflict, that is, the instruction in the send buffer has no dependency on any currently executing instruction, the Ready bit of the corresponding instruction entry is set high; if a scoreboard conflict exists, the conflict bit of the corresponding instruction entry is set high and the Ready bit is not set high.
As a refinement of the above embodiment, out-of-order sending is performed in the following specific manner: an instruction whose Ready bit and Valid bit are both high is selected for out-of-order sending, followed by out-of-order execution and out-of-order write-back. Because the sent instructions have no dependencies on one another, neither a reordering operation nor a register renaming operation is needed, which greatly reduces resource consumption and reduces pipeline stalls.
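The scoreboard check and the Ready/Valid selection rule above can be sketched in software as follows. This is a minimal model under stated assumptions: the scoreboard is represented as a set of (warp, register) pairs for issued-but-unwritten instructions, and the field names are illustrative, not the real hardware signal names.

```python
def update_ready_bits(send_buffer, scoreboard):
    """scoreboard: set of (warp, reg) pairs that are issued but not yet
    written back. An entry becomes Ready only with no scoreboard conflict."""
    for entry in send_buffer:
        regs = set(entry["srcs"])
        if entry["dst"] is not None:
            regs.add(entry["dst"])
        conflict = any((entry["warp"], r) in scoreboard for r in regs)
        entry["conflict"] = conflict   # conflict bit set high on a hazard
        entry["ready"] = not conflict  # Ready bit set high only when clear


def pick_for_issue(send_buffer):
    """Select every entry whose Ready and Valid bits are both high,
    regardless of program order (out-of-order send)."""
    return [e for e in send_buffer if e.get("valid") and e.get("ready")]
```

In this sketch an entry whose registers overlap the scoreboard keeps its conflict bit high and is skipped, while independent entries are eligible for issue immediately, with no reordering or renaming structures involved.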
As a refinement of the above embodiment, load instructions are executed in adherence to the memory consistency model;
for load instructions and store instructions, the load instruction bit and the store instruction bit are set high to control their sending;
for a synchronization instruction, before it enters the pre-analysis stage in order from the instruction buffer, the synchronization instruction bit in the send-check entry is set high and the instruction buffer is stalled; instructions are sent to the pre-analysis stage again only after all instructions other than the synchronization instruction in the send buffer have finished executing, and instruction pre-analysis continues after the synchronization instruction has executed;
for a branch instruction, after it enters the pre-analysis stage, the branch instruction bit of the corresponding instruction is set high; instructions after the branch are not fed into the pre-analysis stage before the branch result is determined; the instructions after the branch wait for the final branch result, and the instructions of the corresponding branch path are then selected to enter the pre-analysis stage for dependency checking.
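The special-instruction gating described above can be summarized in one illustrative predicate. This is a simplified sketch, not the hardware logic: the flags (`is_mem`, `is_sync`, `after_branch`), the `seq` program-order number, and the two status inputs are assumptions introduced for the example.

```python
def may_send(entry, send_buffer, branch_resolved, sync_drained):
    """Return True if the entry's control bits permit sending this cycle."""
    # Load/store bit high: memory instructions are kept in program order
    # relative to each other, respecting the memory consistency model.
    if entry.get("is_mem"):
        older_mem = [e for e in send_buffer
                     if e.get("is_mem") and e["seq"] < entry["seq"]]
        return not older_mem
    # Synchronization bit high: wait until all other instructions in the
    # send buffer have drained before the sync instruction proceeds.
    if entry.get("is_sync"):
        return sync_drained
    # Branch bit high: instructions after the branch are held until the
    # branch outcome is known and the correct path has been selected.
    if entry.get("after_branch"):
        return branch_resolved
    return True
```

For instance, a store is held while an older load is still pending, a synchronization instruction waits for the buffer to drain, and post-branch instructions wait for the branch result.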
When special instructions are encountered, the method adopts dedicated handling strategies, such as forbidding out-of-order sending of load and store instructions to maintain memory consistency, and suspending instruction sending via a control bit until the synchronization instruction has executed, which further ensures the correctness of instruction execution and the stability of the pipeline.
An out-of-order scheduling system for GPGPU instruction pre-analysis, comprising:
The instruction fetch and decode module is configured to fetch the instruction to be executed from the instruction cache and store it into the instruction buffer; when a branch instruction is encountered, fetch according to the to-be-executed branch PC value popped by the SIMT stack management unit; when the program counter (PC) jumps, fetch the instruction according to the jump address calculated by the execution unit;
The issue module is configured to schedule and switch thread bundles via the thread bundle scheduler, issue the instruction to the execution unit for execution, and send the issued instruction information to the scoreboard module;
The execution module is configured to execute the issued instruction, write the execution result back to the register file, and feed the result back to the SIMT stack and the instruction fetch scheduler;
and the pre-analysis module is configured to perform dependency checking and out-of-order sending on the instructions in the instruction buffer.
As a refinement of the above embodiment, the pre-analysis module includes:
the instruction receiving module, used for receiving an instruction buffer containing a plurality of Warp instructions;
the dependency checking module, connected to the instruction receiving module and used for performing a register dependency check on each newly received Warp instruction;
the send buffer module, used for storing instructions that pass the dependency check with no data conflict and setting a corresponding Valid bit for each stored instruction;
the send checking module, used for monitoring scoreboard conflicts between instructions in the send buffer and currently executing instructions, and setting the corresponding Ready bit high for instructions with no scoreboard conflict;
the out-of-order sending module, connected to the send checking module and used for selecting instructions whose Ready bit and Valid bit are both high for out-of-order sending;
and the execution and write-back module, used for executing the out-of-order sent instructions and writing the results back to the corresponding registers out of order.
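The intended effect of the module chain above, namely that an independent instruction is not blocked behind a long-latency load, can be shown with a toy sketch. All structures here are simplified stand-ins introduced for illustration; the instruction names and register numbers are invented for the example.

```python
def issue_order(instructions, in_flight_regs):
    """instructions: program-order list of dicts with 'dst' and 'srcs'.
    in_flight_regs: destination registers of issued-but-unwritten
    instructions (the scoreboard contents). Returns the instructions
    that may issue this cycle, independent of program order."""
    pending = set(in_flight_regs)
    issuable = []
    for ins in instructions:
        if set(ins["srcs"]) & pending or ins["dst"] in pending:
            continue  # scoreboard conflict: held in the send buffer
        issuable.append(ins)
    return issuable


# A load into r1 is still in flight (r1 is on the scoreboard). The
# dependent add (r2 = r1 + r0) must wait, but the independent mul
# (r5 = r3 * r4) is selected and issued out of order ahead of it.
scoreboard = {1}
program = [
    {"name": "add", "dst": 2, "srcs": [1, 0]},
    {"name": "mul", "dst": 5, "srcs": [3, 4]},
]
ready = issue_order(program, scoreboard)
```

Under an in-order scheme both instructions would stall behind the load; here the `mul` proceeds, which is the pipeline-stall reduction claimed for the pre-analysis design.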
As shown in connection with fig. 4, an embodiment of the present disclosure provides an out-of-order scheduler 300 for GPGPU instruction pre-analysis, comprising a processor 304 and a memory 301. Optionally, the apparatus may further include a communication interface 302 and a bus 303. The processor 304, the communication interface 302, and the memory 301 may communicate with each other through the bus 303. The communication interface 302 may be used for information transfer. The processor 304 may call logic instructions in the memory 301 to perform the out-of-order scheduling method of GPGPU instruction pre-analysis of the above embodiments.
Further, the logic instructions in the memory 301 may be implemented in the form of software functional units and, when sold or used as a stand-alone product, stored in a computer-readable storage medium.
The memory 301, as a computer-readable storage medium, is used for storing a software program and a computer-executable program, such as the program instructions/modules corresponding to the methods in the embodiments of the present disclosure. The processor 304 performs functional applications and data processing by running the program instructions/modules stored in the memory 301, i.e., implements the out-of-order scheduling method of GPGPU instruction pre-analysis in the above embodiments.
The memory 301 may include a storage program area that may store an operating system, application programs required for at least one function, and a storage data area that may store data created according to the use of the terminal device, etc. In addition, the memory 301 may include a high-speed random access memory, and may also include a nonvolatile memory.
Embodiments of the present disclosure provide a computer readable storage medium storing computer executable instructions configured to perform the above-described out-of-order scheduling method of GPGPU instruction pre-analysis.
The computer readable storage medium may be a transitory computer readable storage medium or a non-transitory computer readable storage medium.
Embodiments of the present disclosure may be embodied in a software product stored on a storage medium, including one or more instructions for causing a computer device (which may be a personal computer, a server, or a network device, etc.) to perform all or part of the steps of a method according to embodiments of the present disclosure. The storage medium may be a non-transitory storage medium, including a USB flash drive, a removable hard disk, a Read-Only Memory (ROM), a Random Access Memory (RAM), a magnetic disk, or an optical disk, or may be a transitory storage medium.
The above description and the drawings illustrate embodiments of the disclosure sufficiently to enable those skilled in the art to practice them. Other embodiments may involve structural, logical, electrical, process, and other changes. The embodiments represent only possible variations. Individual components and functions are optional unless explicitly required, and the sequence of operations may vary. Portions and features of some embodiments may be included in, or substituted for, those of others. Moreover, the terminology used in the present application is for the purpose of describing embodiments only and is not intended to limit the claims. As used in the description of the embodiments and the claims, the singular forms "a," "an," and "the" are intended to include the plural forms as well, unless the context clearly indicates otherwise. Similarly, the term "and/or" as used in this disclosure is meant to encompass any and all possible combinations of one or more of the associated listed items. Furthermore, when used in the present disclosure, the terms "comprises," "comprising," and/or variations thereof mean that the recited features, integers, steps, operations, elements, and/or components are present, but do not preclude the presence or addition of one or more other features, integers, steps, operations, elements, components, and/or groups thereof. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of additional identical elements in a process, method, or apparatus comprising said element. In this context, each embodiment may be described with emphasis on its differences from the other embodiments, and for the same or similar parts the various embodiments may be referred to one another. For the methods, products, etc. disclosed in the embodiments, where they correspond to the method sections disclosed in the embodiments, the description of the method sections may be consulted for the relevant details.
Those of skill in the art will appreciate that the various illustrative elements and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, or combinations of computer software and electronic hardware. Whether such functionality is implemented as hardware or software depends upon the particular application and design constraints imposed on the solution. The skilled artisan may use different methods for each particular application to achieve the described functionality, but such implementation should not be considered to be beyond the scope of the embodiments of the present disclosure. It will be clearly understood by those skilled in the art that, for convenience and brevity of description, specific working procedures of the above-described systems, apparatuses and units may refer to corresponding procedures in the foregoing method embodiments, which are not repeated herein.
In the embodiments disclosed herein, the disclosed methods, articles of manufacture (including but not limited to devices, apparatuses, etc.) may be practiced in other ways. For example, the apparatus embodiments described above are merely illustrative, and for example, the division of the units may be merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted, or not performed. In addition, the coupling or direct coupling or communication connection shown or discussed with each other may be through some interface, device or unit indirect coupling or communication connection, which may be in electrical, mechanical or other form. The units described as separate units may or may not be physically separate, and units shown as units may or may not be physical units, may be located in one place, or may be distributed on a plurality of network units. Some or all of the units may be selected according to actual needs to implement the present embodiment. In addition, each functional unit in the embodiments of the present disclosure may be integrated in one processing unit, or each unit may exist alone physically, or two or more units may be integrated in one unit.
The flowcharts and block diagrams in the figures illustrate the architecture, functionality, and operation of possible implementations of systems, methods and computer program products according to embodiments of the present disclosure. In this regard, each block in the flowchart or block diagrams may represent a module, segment, or portion of code, which comprises one or more executable instructions for implementing the specified logical function(s). In some alternative implementations, the functions noted in the block may occur out of the order noted in the figures. For example, two blocks shown in succession may, in fact, be executed substantially concurrently, or the blocks may sometimes be executed in the reverse order, depending upon the functionality involved. In the description corresponding to the flowcharts and block diagrams in the figures, operations or steps corresponding to different blocks may also occur in different orders than that disclosed in the description, and sometimes no specific order exists between different operations or steps. For example, two consecutive operations or steps may actually be performed substantially in parallel, they may sometimes be performed in reverse order, which may be dependent on the functions involved. Each block of the block diagrams and/or flowchart illustration, and combinations of blocks in the block diagrams and/or flowchart illustration, can be implemented by special purpose hardware-based systems which perform the specified functions or acts, or combinations of special purpose hardware and computer instructions.