CN112579164B - SIMT conditional branch processing device and method - Google Patents
- Publication number
- CN112579164B (application CN202011404147.9A)
- Authority
- CN
- China
- Prior art keywords
- instruction
- warp
- component
- stack
- current
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/3005—Arrangements for executing specific machine instructions to perform operations for flow control
- G06F9/30058—Conditional branch instructions
 
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/3004—Arrangements for executing specific machine instructions to perform operations on memory
 
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/3005—Arrangements for executing specific machine instructions to perform operations for flow control
- G06F9/30069—Instruction skipping instructions, e.g. SKIP
 
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
- G06F9/3851—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution from multiple instruction streams, e.g. multistreaming
 
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3885—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units
- G06F9/3887—Concurrent instruction execution, e.g. pipeline or look ahead using a plurality of independent parallel functional units controlled by a single instruction for multiple data lanes [SIMD]
 
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06T—IMAGE DATA PROCESSING OR GENERATION, IN GENERAL
- G06T1/00—General purpose image data processing
- G06T1/20—Processor architectures; Processor configuration, e.g. pipelining
 
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Multimedia (AREA)
- Executing Machine-Instructions (AREA)
Abstract
The present invention relates to an SIMT conditional branch processing apparatus and method. The apparatus comprises branch processing device hardware, instructions and a programming model, each of which is connected with the other two. The branch processing device hardware is the hardware carrier on which external user code executes. The instructions provide external user code with the interface for using the apparatus. The programming model provides external user code with the constraints and limits on how instruction sequences may be used. The invention provides a low-cost, easy-to-implement scheme for conditional branching in an SIMT processor and is suitable for both fine-grained and coarse-grained SIMT multithreaded processors.
    Description
Technical Field
      The invention belongs to the technical field of design of processors and graphics processors, and particularly relates to an SIMT conditional branch processing device and method.
    Background
      In a coarse-grained multithreaded processor, when the current thread encounters a long-latency event, the processor switches to another thread so that its resources stay in use and the stall caused by the event is hidden. However, coarse-grained multithreading switches threads only when a long-latency event occurs, so the switching frequency is low, and the thread context is generally saved by the operating system. Saving and loading the context on each switch therefore incurs overhead, and the processor's computing resources sit idle during that time.
      A fine-grained multithreaded processor also hides the stalls caused by long-latency events by switching threads. Unlike coarse-grained multithreading, it switches threads at a fixed period, so the switching frequency is high. To reduce the cost of saving and loading thread contexts on each switch, fine-grained multithreading keeps multiple thread contexts in hardware, so that thread switching can have zero overhead.
      A warp is a set consisting of the data to be processed, the program that processes the data, and the result data produced by the processing. Multiple warps can time-share the same hardware without interfering with one another. Each warp needs its own context (site) recording its state, which means the warps must be backed by corresponding hardware.
      Single Instruction, Multiple Threads (SIMT) uses a single instruction to control the execution of multiple threads; that is, multiple threads execute the same instruction at the same time. Applied to processor design, SIMT saves instruction-fetch logic, so that more transistors can be devoted to computation and the processor's computing capability improves. In graphics workloads, large numbers of vertices and pixels undergo the same operations, data parallelism is extremely high, and SIMT fits well. In non-graphics workloads, different threads follow different execution paths, and SIMT becomes inefficient.
      In an SIMT processor, when all threads in a thread group follow the same execution path, the processor achieves its full efficiency and performance. If differing thread data cause the threads of a group to take different paths at a conditional branch, the branch paths must be executed one after another, each with only the threads that took it.
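      The serialization described above can be illustrated with a small Python model. This is a minimal sketch and not the mechanism of the invention: the function name `run_if_else` and the list standing in for a thread group are assumptions made for illustration. Each pass applies one branch path to the threads selected for it, so a divergent group needs two passes while a uniform group needs one.

```python
def run_if_else(data, cond, then_op, else_op):
    """Model an if/else over one thread group, executing one branch path per pass.

    Hypothetical helper: each list element stands for one thread's data.
    Returns the per-thread results and the number of passes that were needed.
    """
    taken = [bool(cond(x)) for x in data]  # one mask bit per thread
    result = list(data)
    passes = 0
    if any(taken):  # "then" path: only threads whose condition holds are active
        passes += 1
        result = [then_op(x) if t else r for x, t, r in zip(data, taken, result)]
    if not all(taken):  # "else" path: only the remaining threads are active
        passes += 1
        result = [r if t else else_op(x) for x, t, r in zip(data, taken, result)]
    return result, passes


values, passes = run_if_else([1, 2, 3, 4], lambda x: x % 2 == 0,
                             lambda x: x * 10, lambda x: x + 1)
print(values, passes)  # prints [2, 20, 4, 40] 2
```

      With uniform input such as `[2, 4]` the second pass is skipped, which is the case in which the text says the processor reaches full efficiency.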
      Most existing SIMT processors use a branch synchronization stack to manage the divergence and reconvergence of threads at each branch, which is complex to implement and has a high hardware cost.
    Disclosure of Invention
      In order to solve the above problems in the background art, the present invention provides an SIMT conditional branch processing apparatus and method, which provides a low-cost and easy-to-implement solution for conditional branch of an SIMT processor, and is suitable for both fine-grained SIMT multithreaded processors and coarse-grained SIMT multithreaded processors.
      The technical solution of the invention is as follows. The invention relates to an SIMT conditional branch processing device, characterized in that it comprises branch processing device hardware, instructions and a programming model. The branch processing device hardware is connected with the instructions and with the programming model, and the instructions and the programming model are connected with each other, wherein:
      the branch processing device hardware is the hardware carrier on which external user code executes, and is thereby related to the external user code; the instructions provide the external user code with the interface for using the device; the programming model provides the external user code with the constraints and limits on how instruction sequences may be used;
      the branch processing device hardware operates according to the instructions and is implemented on the premise of the programming model, so it depends on both; the instructions are executed by the branch processing device hardware, and their use must adhere to the programming model, so they depend on both; the programming model is realized through sequences composed of the instructions and is embodied in code executing on the branch processing device hardware, so it depends on both.
      Preferably, the branch processing device hardware comprises a conditional branch component, a Mask stack control component, an absolute jump component, a multi-site (warp) ThreadMask stack, a multi-site (warp) PC stack, a current ThreadMask, a warp scheduling component, an instruction fetch warp scheduling component, a PC updating component and a current PC, wherein:
      the conditional branch component executes conditional branch instructions and, from the execution result, generates the assertion Mask information of all threads, an indication of whether the threads' results diverge, and the branch address information. The assertion Mask is generated as follows: if the condition of the branch instruction is met for a given thread in the warp, the assertion Mask bit of that thread is valid; otherwise it is invalid;
      the Mask stack control component executes Mask stack control instructions and generates the corresponding control actions, namely push, pop and negate. The negate operation inverts the assertion Masks of all threads in the warp bit by bit, so that valid bits become invalid and invalid bits become valid;
      the absolute jump component executes absolute jump instructions, generates the control actions (push and pop) and data for the multi-site (warp) PC stack, and generates the current PC information;
      the multi-site (warp) ThreadMask stack has a plurality of hardware sites, each of which stores the Mask information of all threads of one warp, and outputs the Mask information of the warp indicated by the warp scheduling component;
      the multi-site (warp) PC stack has a plurality of hardware sites, each of which stores the PC information of one warp, and outputs the PC information of the warp indicated by the instruction fetch warp scheduling component;
      the current ThreadMask holds the Mask information of all threads of the current warp;
      the warp scheduling component selects, from a plurality of warps, one warp that is ready to run, according to information from the external scoreboard unit;
      the instruction fetch warp scheduling component selects, from a plurality of warps, one warp whose instructions are to be fetched, according to the state information of the cache in the external IF component;
      the PC updating component updates the PC of the current warp according to the condition test results of all threads given by the conditional branch component, the data of the absolute jump instruction given by the absolute jump component, and the current PC. The update rule is: if the condition test results of all threads do not diverge and the test condition holds, the PC is updated to the jump address sent by the conditional branch component; if the results diverge, or the test condition does not hold, the PC is updated to PC + 1; if the absolute jump signalled by the absolute jump component is valid, the PC is updated, according to the type of the absolute jump, either to the jump address sent by the absolute jump component or to the top-of-stack PC of the multi-site (warp) PC stack for that warp; otherwise, the PC is updated to PC + 1;
      the current PC holds the PC information of the current warp.
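      The Mask generation rule of the conditional branch component can be sketched with integers used as bit masks, one bit per thread. This is an illustrative model only: the function name `conditional_branch` is hypothetical, and restricting the result to the currently active threads by ANDing with the current ThreadMask is an assumption, since the text states only that the current Mask is an input to the component.

```python
def conditional_branch(cond_bits, current_mask):
    """Return (new_mask, diverged) for one warp.

    cond_bits:    bit i is 1 if thread i's branch condition is met.
    current_mask: bit i is 1 if thread i is currently active (assumed input).
    """
    met = cond_bits & current_mask       # active threads whose condition is met
    unmet = ~cond_bits & current_mask    # active threads whose condition is not met
    diverged = met != 0 and unmet != 0   # both outcomes occur among active threads
    return met, diverged


print(conditional_branch(0b0101, 0b1111))  # prints (5, True)
print(conditional_branch(0b0011, 0b0011))  # prints (3, False)
```

      The `diverged` flag is the input the PC updating component needs in order to choose between the jump address and PC + 1.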
      Preferably, the hardware is connected with the external components as follows: the conditional branch component is connected with the external ID component, from which it receives the control signal and data of the conditional branch instruction; the Mask stack control component is connected with the external ID component, from which it receives the control signal of the Mask stack control instruction; the absolute jump component is connected with the external ID component, from which it receives the control signal and data of the absolute jump instruction; the current PC is connected with the external IF component, to which it outputs the PC information of the selected warp; the instruction fetch warp scheduling component is connected with the external IF component, from which it receives the empty/full information of the multi-warp instruction cache; the warp scheduling component is connected with the external scoreboard unit, from which it receives the data, control, functional-unit and write-back path dependency information of the multi-warp instructions and the like, as well as the execution status of instructions on the functional units; the current ThreadMask is connected with the external Ex component, to which it outputs the Mask information for the instruction of the current warp; the warp scheduling component is connected with the external ID component, to which it outputs the id of the winning current warp.
      Preferably, the internal connections are as follows: the conditional branch component is connected with the PC updating component, to which it outputs the condition test results of all threads of the current warp and the branch address information; the Mask stack control component is connected with the multi-site (warp) ThreadMask stack, to which it outputs control action information; the absolute jump component is connected with the PC updating component, to which it outputs the data of the absolute jump instruction; the absolute jump component is connected with the multi-site (warp) PC stack, to which it outputs the stack control actions and data; the multi-site (warp) ThreadMask stack is connected with the warp scheduling component, from which it receives the id of the winning current warp; the current ThreadMask is connected with the multi-site (warp) ThreadMask stack, from which it receives the Mask information of all threads of the current warp; the multi-site (warp) PC stack is connected with the PC updating component, from which it receives the updated next PC of the current warp; the multi-site (warp) PC stack is connected with the instruction fetch warp scheduling component, from which it receives the id of the winning fetch warp; the multi-site (warp) PC stack is connected with the current PC, to which it outputs the PC information of the fetch warp; the current ThreadMask is connected with the conditional branch component, to which it outputs the Mask information of all threads of the current warp; the warp scheduling component is connected with the multi-site (warp) PC stack, to which it outputs the id of the winning current warp; the current PC is connected with the PC updating component, to which it outputs the PC information of the current warp; the current ThreadMask is connected with the Mask stack control component, to which it outputs the Mask information of all threads of the current warp.
      Preferably, the instructions include conditional branch instructions, absolute jump instructions, and assertion (predicate) operation instructions.
      Preferably, the conditional branch instruction is subdivided into two types, namely a jump-if-satisfied instruction and a jump-if-not-satisfied instruction, which jump to the address specified by the immediate when the condition is satisfied and when the condition is not satisfied, respectively; the instruction format of the conditional branch instruction comprises a mnemonic, a test condition and an immediate jump address;
      the absolute jump instruction is subdivided into a jump instruction, a push-and-jump instruction and a pop-and-jump instruction; the jump instruction jumps to the address specified by the immediate; the push-and-jump instruction stores the address of the next instruction on the stack and jumps to the address specified by the immediate; the pop-and-jump instruction pops the address at the top of the stack and jumps to it; the instruction formats of the jump instruction and the push-and-jump instruction comprise a mnemonic and an immediate jump address; the instruction format of the pop-and-jump instruction contains only a mnemonic;
      the assertion operation instruction is subdivided into an assertion push instruction, an assertion pop instruction and an assertion negation instruction; the assertion push instruction pushes the current assertion Mask onto the assertion stack; the assertion pop instruction pops the assertion Mask at the top of the assertion stack and makes it the current assertion Mask; the assertion negation instruction negates the current assertion Mask, performs a bitwise AND of the result with the assertion Mask at the top of the assertion stack, and makes the result the current assertion Mask.
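      The three assertion operations above can be modelled in a few lines of Python. This is a sketch under stated assumptions: the class name `AssertionStack` and its method names are hypothetical, and the push, branch, negate, pop ordering in the usage lines is one plausible way to lower a nested if/else, not a sequence taken from the text. It shows why negation is ANDed with the top of the stack: threads that were already inactive in the enclosing branch stay inactive on the "else" path.

```python
class AssertionStack:
    """Hypothetical model of one warp's assertion (predicate) Mask stack."""

    def __init__(self, mask):
        self.current = mask  # current assertion Mask, one bit per thread
        self.stack = []

    def push(self):
        # Assertion push: save the current Mask on the stack.
        self.stack.append(self.current)

    def pop(self):
        # Assertion pop: the Mask at the top of the stack becomes current.
        self.current = self.stack.pop()

    def negate(self):
        # Assertion negation: NOT of the current Mask, ANDed with the stack top.
        self.current = ~self.current & self.stack[-1]


s = AssertionStack(0b0011)  # only threads 0 and 1 are active in an enclosing branch
s.push()                    # remember the enclosing Mask
s.current = 0b0001          # assumed result of a conditional branch: thread 0 takes the "if" path
s.negate()                  # switch to the "else" path
print(bin(s.current))       # prints 0b10
s.pop()                     # restore the enclosing Mask
print(bin(s.current))       # prints 0b11
```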
      Preferably, the programming model includes assertion rules, flow control rules, and PC update rules.
      Preferably, the assertion rule includes: every instruction has an attribute stating whether it is controlled by the assertion Mask, wherein: the absolute jump instructions and the assertion operation instructions are not controlled by the assertion Mask and execute whether or not the assertion Mask is valid; the conditional branch instruction is controlled by the assertion Mask, executing when the assertion Mask is valid and not executing when it is invalid;
      the flow control rules include: the jump instruction is not allowed inside a conditional branch except to implement loop control such as while and for; the push-and-jump and pop-and-jump instructions may appear inside a conditional branch; the push-and-jump and pop-and-jump instructions are used as a pair to implement function calls, and function calls may be nested; of the two path addresses of a conditional branch, one is always PC + 1 and the other must be greater than the current PC, that is, the immediate carried in the conditional branch instruction must be greater than the address of the current instruction; the absolute jump instruction supports immediate addressing; the absolute jump instruction may also support register-indirect addressing, but the programmer must ensure that the register holds the same value in all threads; the user program must comply with the flow control rules, otherwise the execution flow of the program may not be as expected;
      the PC update rule includes: in the designed instruction set, only absolute jump instructions and conditional branch instructions can change the program execution flow; the update strategy of the PC is as follows: when an absolute jump instruction is encountered, the PC always jumps, regardless of whether the instruction lies inside a conditional branch; when a conditional branch instruction is encountered, if the active threads do not diverge and the test condition holds, the PC jumps; when a conditional branch instruction is encountered, if the active threads diverge, or do not diverge but the test condition does not hold, the PC is updated to PC + 1; when any other instruction in the instruction set is encountered, the PC is updated to PC + 1.
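      The PC update strategy and the forward-only rule for conditional branch targets can be written down as two small functions. This is a minimal sketch: the instruction kind names (`"jump"`, `"push_jump"`, `"pop_jump"`, `"cbranch"`) and both function names are hypothetical, and a program is assumed to be a list of `(kind, target)` pairs indexed by address.

```python
def next_pc(pc, kind, *, diverged=False, condition_holds=False,
            target=None, stack_top=None):
    """Next PC of the current warp for one instruction (kind names are hypothetical)."""
    if kind in ("jump", "push_jump"):  # absolute jump to an immediate address
        return target
    if kind == "pop_jump":             # absolute jump to the address on top of the PC stack
        return stack_top
    if kind == "cbranch" and not diverged and condition_holds:
        return target                  # no divergence and the test condition holds
    return pc + 1                      # divergence, failed test, or any other instruction


def backward_branches(program):
    """Addresses of conditional branches whose target is not ahead of them."""
    return [pc for pc, (kind, target) in enumerate(program)
            if kind == "cbranch" and target <= pc]


print(next_pc(5, "cbranch", diverged=True, condition_holds=True, target=9))  # prints 6
print(backward_branches([("cbranch", 3), ("add", None), ("cbranch", 1), ("add", None)]))  # prints [2]
```

      A check like `backward_branches` could run in an assembler or compiler, since the text leaves compliance with the flow control rules to the user program.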
      A method for realizing the SIMT conditional branch processing apparatus described above, characterized in that: the method comprises the following steps:
      1) The branch processing device hardware determines the type of the externally input instruction and, according to the type, executes the conditional branch flow, the Mask stack control flow or the absolute jump flow;
      2) The conditional branch flow is as follows:
      2.1) The conditional branch component performs the condition test according to the input operation code and operands and the input current Mask information;
      2.2) According to the outcome of the condition test in step 2.1), the conditional branch component informs the PC updating component of the condition test results of all threads and of the PC value to update to;
      2.3) According to the outcome of the condition test in step 2.1), the conditional branch component generates new Mask information and informs the multi-site (warp) ThreadMask stack; the multi-site (warp) ThreadMask stack selects one site from the plurality of warp sites according to the current warp information input by the warp scheduling component and outputs the Mask information of all threads of that warp to the current ThreadMask; the current ThreadMask outputs the received Mask information to the external Ex component, the conditional branch component and the Mask stack control component;
      3) The Mask stack control flow is as follows: according to the operation code input by the ID component and the input Mask information of all threads of the current warp, the Mask stack control component informs the multi-site (warp) ThreadMask stack of the action to be executed (push, pop or negate) and, for a push, of the Mask information of all threads;
      4) The absolute jump flow is as follows: according to the operation code and operand input by the ID component, the absolute jump component informs the PC updating component of the operand carried by the absolute jump instruction and informs the multi-site (warp) PC stack of the operation to be performed (push or pop); the multi-site (warp) PC stack obtains the updated PC value from the PC updating component and performs the operation indicated by the absolute jump component.
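      The steps above can be sketched as one dispatch function operating on per-warp sites. This is a simplified software model, not the hardware: `WarpSite`, `step`, the operation names and the dictionary-based instruction encoding are all hypothetical. The conditional branch flow applies the Mask rule literally (new Mask = active threads whose condition is met), and the `jump_if_met` flag is an assumed encoding of the two conditional branch types.

```python
from dataclasses import dataclass, field


@dataclass
class WarpSite:
    """Hypothetical per-warp hardware site: current PC and Mask plus the two stacks."""
    pc: int = 0
    mask: int = 0
    mask_stack: list = field(default_factory=list)
    pc_stack: list = field(default_factory=list)


def step(sites, warp_id, instr):
    """Step 1: pick the flow from the instruction type, for the selected warp only."""
    site = sites[warp_id]
    op = instr["op"]
    new_pc = site.pc + 1
    if op == "cbranch":                      # step 2: conditional branch flow
        met = instr["met"] & site.mask       # 2.1: condition test under the current Mask
        diverged = met not in (0, site.mask)
        holds = (met != 0) == instr["jump_if_met"]
        if not diverged and holds:           # 2.2: outcome handed to the PC update
            new_pc = instr["target"]
        site.mask = met                      # 2.3: new Mask for this warp
    elif op == "mask_push":                  # step 3: Mask stack control flow
        site.mask_stack.append(site.mask)
    elif op == "mask_pop":
        site.mask = site.mask_stack.pop()
    elif op == "mask_not":
        site.mask = ~site.mask & site.mask_stack[-1]
    elif op == "jump":                       # step 4: absolute jump flow
        new_pc = instr["target"]
    elif op == "push_jump":
        site.pc_stack.append(site.pc + 1)
        new_pc = instr["target"]
    elif op == "pop_jump":
        new_pc = site.pc_stack.pop()
    else:
        raise ValueError(f"not a branch-related instruction: {op}")
    site.pc = new_pc
```

      Because every access goes through `sites[warp_id]`, instructions issued for one warp leave the PC, Mask and stacks of the other warps untouched, which is the property the multi-site stacks are meant to provide.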
      The invention provides an SIMT conditional branch processing device and method that handle the conditional branches of an SIMT processor in a simplified manner and, through hardware multi-site support, handle branches for both fine-grained and coarse-grained multithreading. The invention therefore has the following advantages:
      1) It implements branch processing for an SIMT processor;
      2) It supports branch processing for fine-grained SIMT multithreaded processors;
      3) It likewise supports branch processing for coarse-grained SIMT multithreaded processors;
      4) It supports hardware multi-site operation, giving zero-overhead fast switching for both fine-grained and coarse-grained multithreading;
      5) It has a simple structure, is easy to implement and is highly extensible.
    Drawings
      FIG. 1 is a composition and relationship diagram of the present invention;
      FIG. 2 is a diagram of the hardware components and connections of the apparatus of the present invention;
      FIG. 3 is a flow chart of the method of the present invention.
    Detailed Description
      The invention provides an SIMT conditional branch processing device comprising branch processing device hardware, instructions and a programming model. The branch processing device hardware is connected with the instructions and with the programming model, and the instructions and the programming model are connected with each other, wherein:
      the branch processing device hardware is the hardware carrier on which external user code executes, and is thereby related to the external user code; the instructions provide the external user code with the interface for using the device; the programming model provides the external user code with the constraints and limits on how instruction sequences may be used;
      the branch processing device hardware operates according to the instructions and is implemented on the premise of the programming model, so it depends on both; the instructions are executed by the branch processing device hardware, and their use must adhere to the programming model, so they depend on both; the programming model is realized through sequences composed of the instructions and is embodied in code executing on the branch processing device hardware, so it depends on both.
      The branch processing device hardware comprises a conditional branch component, a Mask stack control component, an absolute jump component, a multi-site (warp) ThreadMask stack, a multi-site (warp) PC stack, a current ThreadMask, a warp scheduling component, an instruction fetch warp scheduling component, a PC updating component and a current PC, wherein:
      the conditional branch component executes conditional branch instructions and, from the execution result, generates the assertion Mask information of all threads, an indication of whether the threads' results diverge, and the branch address information. The assertion Mask is generated as follows: if the condition of the branch instruction is met for a given thread in the warp, the assertion Mask bit of that thread is valid; otherwise it is invalid;
      the Mask stack control component executes Mask stack control instructions and generates the corresponding control actions, namely push, pop and negate. The negate operation inverts the assertion Masks of all threads in the warp bit by bit, so that valid bits become invalid and invalid bits become valid;
      the absolute jump component executes absolute jump instructions, generates the control actions (push and pop) and data for the multi-site (warp) PC stack, and generates the current PC information;
      the multi-site (warp) ThreadMask stack has a plurality of hardware sites, each of which stores the Mask information of all threads of one warp, and outputs the Mask information of the warp indicated by the warp scheduling component;
      the multi-site (warp) PC stack has a plurality of hardware sites, each of which stores the PC information of one warp, and outputs the PC information of the warp indicated by the instruction fetch warp scheduling component;
      the current ThreadMask holds the Mask information of all threads of the current warp;
      the warp scheduling component selects, from a plurality of warps, one warp that is ready to run, according to information from the external scoreboard unit;
      the instruction fetch warp scheduling component selects, from a plurality of warps, one warp whose instructions are to be fetched, according to the state information of the cache in the external IF component;
      the PC updating component updates the PC of the current warp according to the condition test results of all threads given by the conditional branch component, the data of the absolute jump instruction given by the absolute jump component, and the current PC. The update rule is: if the condition test results of all threads do not diverge and the test condition holds, the PC is updated to the jump address sent by the conditional branch component; if the results diverge, or the test condition does not hold, the PC is updated to PC + 1; if the absolute jump signalled by the absolute jump component is valid, the PC is updated, according to the type of the absolute jump, either to the jump address sent by the absolute jump component or to the top-of-stack PC of the multi-site (warp) PC stack for that warp; otherwise, the PC is updated to PC + 1;
      the current PC holds the PC information of the current warp.
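      The interplay of the absolute jump component and the multi-site (warp) PC stack described above can be pictured with the following Python sketch. The names `MultiSitePcStack` and `absolute_jump`, and the jump kind strings, are hypothetical; the model only illustrates that each warp has its own stack site and that push-and-jump and pop-and-jump pair up as call and return, including nested calls.

```python
class MultiSitePcStack:
    """Hypothetical model: one PC stack per warp site, selected by warp id."""

    def __init__(self, n_warps):
        self.sites = [[] for _ in range(n_warps)]

    def push(self, warp_id, return_pc):
        self.sites[warp_id].append(return_pc)

    def pop(self, warp_id):
        return self.sites[warp_id].pop()


def absolute_jump(kind, warp_id, pc, stacks, target=None):
    """Return the next PC for an absolute jump of the given (hypothetical) kind."""
    if kind == "jump":
        return target
    if kind == "push_jump":   # call: save the next instruction address, then jump
        stacks.push(warp_id, pc + 1)
        return target
    if kind == "pop_jump":    # return: resume at the saved address
        return stacks.pop(warp_id)
    raise ValueError(f"unknown absolute jump kind: {kind}")


stacks = MultiSitePcStack(n_warps=2)
absolute_jump("push_jump", 0, 10, stacks, target=100)      # warp 0 calls a function at 100
absolute_jump("push_jump", 0, 100, stacks, target=200)     # nested call from inside it
print(absolute_jump("pop_jump", 0, 200, stacks))           # prints 101
print(absolute_jump("pop_jump", 0, 101, stacks))           # prints 11
```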
      The conditional branch unit has a connection from the external ID unit, over which it receives the control signals and data of conditional branch instructions; the Mask stack control component has a connection from the external ID component, over which it receives the control signals of Mask stack control instructions; the absolute jump unit has a connection from the external ID unit, over which it receives the control signals and data of absolute jump instructions; the current PC has a connection to the external IF component, to which it outputs the PC information of the selected warp; the instruction fetch warp scheduling component has a connection from the external IF component, over which it receives the empty/full information of the multi-warp instruction cache; the warp scheduling component has a connection from the external scoreboard unit, over which it receives information on the data dependencies, control dependencies, functional-component dependencies, write-back path dependencies and the like of the multi-warp instructions, together with state information on instruction execution in the functional components; the current ThreadMask has a connection to the external Ex component, to which it outputs the Mask information for the instruction of the current warp; the warp scheduling component has a connection to the external ID component, to which it outputs the id information of the winning current warp.
      The conditional branch component has a connection to the PC updating component, to which it outputs the condition test results and branch address information of all threads of the current warp; the Mask stack control component has a connection to the multi-site (warp) ThreadMask stack, to which it outputs control action information; the absolute jump unit has a connection to the PC updating unit, to which it outputs the data related to the absolute jump instruction; the absolute jump component has a connection to the multi-site (warp) PC stack, to which it outputs the stack control actions and data; the multi-site (warp) ThreadMask stack has a connection from the warp scheduling component, over which it receives the id information of the winning current warp; the current ThreadMask has a connection from the multi-site (warp) ThreadMask stack, over which it receives the Mask information of all threads of the current warp; the multi-site (warp) PC stack has a connection from the PC updating component, over which it receives the updated next PC of the current warp; the multi-site (warp) PC stack has a connection from the instruction fetch warp scheduling component, over which it receives the id information of the winning fetch warp; the multi-site (warp) PC stack has a connection to the current PC, to which it outputs the PC information of the fetch warp; the current ThreadMask has a connection to the conditional branch component, to which it outputs the Mask information of all threads of the current warp; the warp scheduling component has a connection to the multi-site (warp) PC stack, to which it outputs the id information of the winning current warp; the current PC has a connection to the PC updating component, to which it outputs the PC information of the current warp; the current ThreadMask has a connection to the Mask stack control component, to which it outputs the Mask information of all threads of the current warp.
      The instructions include conditional branch instructions, absolute jump instructions, and assertion (predicate) operation instructions.
      The conditional branch instruction is subdivided into two types, a jump-if-condition-satisfied instruction and a jump-if-condition-not-satisfied instruction, which jump to the address specified by the immediate when the condition is satisfied and when the condition is not satisfied, respectively; the instruction format of the conditional branch instruction comprises a mnemonic, a test condition and an immediate jump address;
      the absolute jump instruction is subdivided into a jump instruction, a push-and-jump instruction and a pop-and-jump instruction; the jump instruction jumps to the address specified by the immediate; the push-and-jump instruction saves the address of the next instruction to the stack and jumps to the address specified by the immediate; the pop-and-jump instruction pops an address from the stack and jumps to that address; the instruction formats of the jump instruction and the push-and-jump instruction comprise a mnemonic and an immediate jump address; the instruction format of the pop-and-jump instruction contains only a mnemonic;
      the assertion operation instruction is subdivided into an assertion push instruction, an assertion pop instruction and an assertion negation instruction; the assertion push instruction is used for pushing the current assertion Mask into the assertion stack; the assertion stack popping instruction is used for popping an assertion Mask at the top of the assertion stack and taking the assertion Mask as a current assertion Mask; and the assertion negation instruction is used for performing bitwise AND operation on the current assertion Mask after negation and the assertion Mask at the top of the assertion stack to serve as the current assertion Mask.
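The three assertion operations above can be sketched in software as follows. This is only an illustrative model under stated assumptions, not the hardware of the invention: the class name `PredicateMaskStack`, the 8-thread warp width, the use of an integer as a bit mask (bit set = valid), and the choice that negation leaves the stack top in place are all assumptions made for the example.

```python
class PredicateMaskStack:
    """Illustrative model of one warp's assertion Mask stack (hypothetical names)."""

    def __init__(self, warp_size=8):
        # Assumption: one bit per thread, 1 = valid, 0 = invalid.
        self.full = (1 << warp_size) - 1
        self.current = self.full   # current assertion Mask
        self.stack = []            # assertion stack

    def push(self):
        # Assertion push: save the current assertion Mask on the stack.
        self.stack.append(self.current)

    def pop(self):
        # Assertion pop: the stack-top Mask becomes the current Mask.
        self.current = self.stack.pop()

    def inv(self):
        # Assertion negation: negate the current Mask, then AND it with
        # the stack-top Mask. Assumption: the stack top is not removed.
        self.current = (~self.current & self.full) & self.stack[-1]


m = PredicateMaskStack(warp_size=8)
m.current = 0b00111100      # threads 2..5 active in the enclosing region
m.push()
m.current = 0b00001100      # e.g. set after a conditional branch
m.inv()                     # remaining threads of the enclosing region
inverted = m.current
m.pop()                     # restore the enclosing region's Mask
restored = m.current
```

ANDing with the stack top keeps the negated Mask confined to the threads that were active when the Mask was pushed, so threads outside the enclosing region are never re-enabled by a negation.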
      The programming model includes assertion rules, flow control rules, and PC update rules.
      The assertion rules include: each instruction has an attribute indicating whether it is controlled by the assertion Mask, wherein: the absolute jump instruction and the assertion operation instruction are not controlled by the assertion Mask and are executed regardless of whether the assertion Mask is valid; the conditional branch instruction is controlled by the assertion Mask, and is executed when the assertion Mask is valid and not executed when the assertion Mask is invalid;
      the flow control rules include: except for implementing loop control such as while and for, the jump instruction is not allowed to appear inside a conditional branch; the push-and-jump instruction and the pop-and-jump instruction may appear inside a conditional branch; the push-and-jump instruction and the pop-and-jump instruction are used as a pair to implement function calls, and function calls may be nested; of the two path addresses of a conditional branch, one is fixed at PC + 1 and the other must be larger than the current PC and is not allowed to be smaller than it, that is, the immediate carried in the conditional branch instruction must be larger than the address of the current instruction; the absolute jump instruction supports immediate addressing; the absolute jump instruction may also support register-indirect addressing, but the programmer must ensure that the value in the register is the same in all threads; the user program must comply with the flow control rules, otherwise the execution flow of the program may not be as expected;
      the PC update rule includes: in a designed instruction set, the instructions capable of changing the program execution flow only comprise absolute jump instructions and conditional branch instructions; the update strategy of the PC is as follows: when an absolute jump instruction is encountered, the PC jumps certainly, and is irrelevant to whether the instruction is positioned in the conditional branch or not; when a conditional branch instruction is encountered, if all active thread conditional branches are not diverged and the test condition is established, the PC jumps; when a conditional branch instruction is encountered, if all the active thread conditional branches have a divergence or no divergence but the test condition is not satisfied, the PC is updated to PC +1; when other instructions in the instruction set are encountered, PC is updated to PC +1.
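The PC update strategy above can be written as a small decision function. The following is a minimal sketch, not the PC updating component itself; the function name `next_pc` and the instruction-kind strings (`"cond_branch"`, `"jump"`, `"push_jump"`, `"pop_jump"`) are labels chosen for this example only.

```python
def next_pc(pc, kind, *, diverged=False, condition_holds=False,
            target=None, stack_top=None):
    """Return the next PC for one warp (illustrative sketch).

    kind is a hypothetical label: "jump", "push_jump", "pop_jump",
    "cond_branch", or anything else for ordinary instructions.
    """
    if kind in ("jump", "push_jump"):
        # Absolute jump with an immediate: the PC always jumps.
        return target
    if kind == "pop_jump":
        # Absolute jump to the address popped from this warp's PC stack.
        return stack_top
    if kind == "cond_branch":
        # Jump only when the active threads agree and the condition holds.
        if not diverged and condition_holds:
            return target
        return pc + 1
    # Every other instruction falls through.
    return pc + 1


uniform_taken = next_pc(10, "cond_branch", condition_holds=True, target=20)
uniform_not_taken = next_pc(10, "cond_branch", target=20)
divergent = next_pc(10, "cond_branch", diverged=True, target=20)
call = next_pc(10, "push_jump", target=100)
ret = next_pc(105, "pop_jump", stack_top=11)
plain = next_pc(10, "add")
```

In the divergent case the warp falls through to PC + 1; which threads actually execute the following instructions is then decided by the assertion Mask rather than by the PC.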
      The invention also provides a SIMT conditional branching processing method, which comprises the following steps:
      1) The branch processing device hardware judges the type of the instruction input from the outside; respectively executing a conditional branch flow, a Mask stack control flow and an absolute jump flow according to the corresponding types;
      2) Conditional branch flow:
      2.1) The conditional branch unit performs the condition test according to the input operation code and operand and the input current Mask information;
      2.2) The conditional branch component informs the PC updating component of the condition test results of all threads and of the updated PC value, according to the outcome of the condition test in step 2.1);
      2.3) The conditional branch component generates new Mask information according to the outcome of the condition test in step 2.1) and informs the multi-site (warp) ThreadMask stack; the multi-site (warp) ThreadMask stack selects one of the plurality of warp sites according to the current warp information input by the warp scheduling component and outputs the Mask information of all threads of that warp to the current ThreadMask; the current ThreadMask outputs the received Mask information of all threads of the current warp to the external Ex component, the conditional branch component and the Mask stack control component, respectively;
      3) The Mask stack control flow is as follows: according to the operation code input by the ID component and the input Mask information of all threads of the current warp, the Mask stack control component informs the multi-site (warp) ThreadMask stack of the action to be executed (push, pop or negation) and, for a push, of the Mask information of all threads;
      4) The absolute jump flow is as follows: the absolute jump unit informs the PC updating unit of the operand carried by the absolute jump instruction according to the operation code and operand input by the ID unit, and informs a multi-site (warp) PC stack of the operation to be carried out: pushing and popping; and the multi-site (warp) PC stack acquires the updated PC value from the PC updating component and executes corresponding operation according to the operation input by the absolute jump component.
      The technical solution of the present invention is further described in detail with reference to the accompanying drawings and specific embodiments.
      Referring to fig. 1, a SIMT conditional branch processing apparatus of an embodiment of the present invention includes branch processing apparatus hardware 101, instructions 102 and a programming model  103.
      The branch processing device hardware 101 is a hardware carrier of the SIMT branch processing device, and is represented in the form of hardware; the instruction 102 is not a hardware device, but is a device use interface provided for a user, and is represented in the form of an instruction; the programming model  103 is not a hardware device, but rather provides the user with interface usage rules in the form of constraints and limitations on the use of instruction sequences.
      The branch processing apparatus hardware 101 is a hardware carrier of external user code execution, having a relationship  132 from the external user code; the instructions 102 provide a usage interface of the apparatus for an external user code, having a relationship  130 to the external user code; the programming model  103 provides external user code with constraints and limits on the instruction sequence in use, with a relationship  131 to the external user code;
      the branch processing apparatus hardware 101 operates with instructions 102, i.e. has dependencies  134 from the instructions 102; the implementation of the branch processing apparatus hardware 101 is premised on the programming model  103, i.e. has a dependency  137 on the programming model  103; the instructions 102 are executed by the branch processing apparatus hardware 101, i.e. have dependencies  136 from the branch processing apparatus hardware 101; the use of the instructions 102 must comply with the programming model  103, i.e. have dependencies  135 from the programming model  103; the implementation of the programming model  103 depends on the sequence of instructions 102, i.e. has dependencies  133 from the instructions 102; the programming model  103 is embodied in code executing on the branch processing apparatus hardware 101, i.e. the programming model  103 has dependencies  138 from the branch processing apparatus hardware 101.
      Referring to fig. 2, the branch processing apparatus hardware 101 of the present invention is composed of a conditional branch unit 201, a Mask stack control unit  202, an absolute jump unit 203, a multi-site (Warp) ThreadMask stack 204, a multi-site (Warp) PC stack 205, a current ThreadMask206, a Warp scheduling unit 207, an instruction fetch Warp scheduling unit  208, a PC updating unit 209, and a current PC 210;
      the conditional branch unit 201 is configured to execute conditional branch instructions and, from the execution result, to generate the assertion Mask information of all threads, an indication of whether the results of the threads diverge, and the branch address information, where the assertion Mask is generated as follows: if the branch condition is met for a given thread in the warp, the assertion Mask bit of that thread is valid, otherwise it is invalid; the Mask stack control component 202 is used for executing Mask stack control instructions and generating the corresponding control actions, namely push, pop and negation, where the negation operation inverts the assertion Masks of all threads in the warp bit by bit, so that valid bits become invalid and invalid bits become valid; the absolute jump component 203 is used for executing absolute jump instructions, generating the control actions (push and pop) and data for the multi-site (warp) PC stack 205, and generating the current PC information; the multi-site (warp) ThreadMask stack 204 has a plurality of hardware sites, each site stores the Mask information of all threads of one warp, and the stack outputs the Mask information of a warp according to the warp information output by the warp scheduling component 207; the multi-site (warp) PC stack 205 has a plurality of hardware sites, each site stores the PC information of one warp, and the stack outputs the PC information of a warp according to the warp information output by the instruction fetch warp scheduling unit 208; the current ThreadMask 206 is used for holding the Mask information of all threads of the current warp; the warp scheduling component 207 is used for selecting, from a plurality of warps, one warp that is ready to run according to the information of the external scoreboard unit; the instruction fetch warp scheduling unit 208 is configured to select, from a plurality of warps, one warp to be fetched according to the state information of the cache in the external IF component; the PC updating unit 209 is configured to update the PC of the current warp according to the condition test results of all threads given by the conditional branch unit 201, the data related to the absolute jump instruction given by the absolute jump unit 203, and the current PC value given by the current PC 210, where the update rule is: if the condition test results of all threads given by the conditional branch unit 201 do not diverge and the test condition is satisfied, the PC is updated to the jump address sent by the conditional branch unit 201; if the condition test results given by the conditional branch unit 201 diverge, or the test condition is not satisfied, the PC is updated to PC + 1 as sent by the current PC 210; if the absolute jump sent by the absolute jump unit 203 is valid, the PC is updated, according to the type of the absolute jump, either to the jump address sent by the absolute jump unit 203 or to the stack-top PC of the multi-site (warp) PC stack 205 for that warp; in all other cases, the PC is updated to PC + 1 as given by the current PC 210; the current PC 210 is used for holding the PC information of the current warp;
      the conditional branch component 201 has a connection 230 from the external ID component for receiving control signals and data of conditional branch instructions therefrom; the Mask stack control component 202 has a connection 231 from the external ID component for receiving control signals of Mask stack control instructions therefrom; the absolute jump unit 203 has a connection 232 from the external ID component for receiving control signals and data of absolute jump instructions therefrom; the current PC 210 has a connection 233 to the external IF component for outputting the PC information of the selected warp thereto; the instruction fetch warp scheduling unit 208 has a connection 234 from the external IF component for receiving the empty/full information of the multi-warp instruction cache therefrom; the warp scheduling component 207 has a connection 235 from the external scoreboard unit for receiving therefrom information on the data dependencies, control dependencies, functional-component dependencies, write-back path dependencies and the like of the multi-warp instructions, together with state information on instruction execution in the functional components; the current ThreadMask 206 has a connection 236 to the external Ex component for outputting thereto the Mask information for the instruction of the current warp; the warp scheduling component 207 has a connection 237 to the ID component for outputting the id information of the winning current warp thereto;
      the conditional branch component 201 has a connection 248 to the PC updating unit 209 for outputting thereto the condition test results and branch address information of all threads of the current warp; the Mask stack control component 202 has a connection 249 to the multi-site (warp) ThreadMask stack 204 for outputting control action information thereto; the absolute jump unit 203 has a connection 250 to the PC updating unit 209 for outputting thereto the data related to the absolute jump instruction; the absolute jump unit 203 has a connection 251 to the multi-site (warp) PC stack 205 for outputting stack control actions and data thereto; the multi-site (warp) ThreadMask stack 204 has a connection 252 from the warp scheduling component 207 for receiving the id information of the winning current warp therefrom; the current ThreadMask 206 has a connection 253 from the multi-site (warp) ThreadMask stack 204 for receiving the Mask information of all threads of the current warp therefrom; the multi-site (warp) PC stack 205 has a connection 254 from the PC updating unit 209 for receiving the updated next PC of the current warp therefrom; the multi-site (warp) PC stack 205 has a connection 255 from the instruction fetch warp scheduling unit 208 for receiving the id information of the winning fetch warp therefrom; the multi-site (warp) PC stack 205 has a connection 256 to the current PC 210 for outputting the PC information of the fetch warp thereto; the current ThreadMask 206 has a connection 257 to the conditional branch unit 201 for outputting the Mask information of all threads of the current warp thereto; the warp scheduling component 207 has a connection 258 to the multi-site (warp) PC stack 205 for outputting the id information of the winning current warp thereto; the current PC 210 has a connection 259 to the PC updating unit 209 for outputting the PC information of the current warp thereto; the current ThreadMask 206 has a connection 260 to the Mask stack control component 202 for outputting the Mask information of all threads of the current warp thereto.
      Wherein: PC stands for Program Counter.
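The behaviour described for the conditional branch unit 201 (generating the assertion Mask and reporting divergence) can be illustrated with a short software sketch. It is a model under assumptions, not the circuit: the function name `evaluate_branch`, the integer bit-mask encoding, and the choice that threads whose current Mask bit is invalid keep an invalid bit are assumptions made for the example.

```python
def evaluate_branch(conditions, active_mask):
    """Model one conditional branch for one warp (illustrative sketch).

    conditions  : list of bool, per-thread result of the branch test
    active_mask : int, current assertion Mask (bit t set = thread t valid)

    Returns (new_mask, diverged, condition_holds).
    """
    new_mask = 0
    for t, cond in enumerate(conditions):
        # Assumption: only currently valid threads take part in the test.
        if (active_mask >> t) & 1 and cond:
            new_mask |= 1 << t
    # Divergence: some, but not all, of the active threads meet the condition.
    diverged = new_mask != 0 and new_mask != active_mask
    # Uniformly satisfied: every active thread meets the condition.
    condition_holds = active_mask != 0 and new_mask == active_mask
    return new_mask, diverged, condition_holds


mixed = evaluate_branch([True, False, True, True], 0b1111)
uniform = evaluate_branch([True, True, True, True], 0b0110)
none_met = evaluate_branch([False, False, False, False], 0b1111)
```

The pair `(diverged, condition_holds)` corresponds to the information the PC updating unit 209 needs, while `new_mask` corresponds to the Mask information passed on toward the ThreadMask stack 204.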
      Referring to fig. 3, the steps of the SIMT conditional branch processing method according to the embodiment of the present invention are as follows:
      the branch processing device hardware 101 has the following working procedures: determining a type of an externally input instruction; respectively executing a conditional branch flow, a Mask stack control flow and an absolute jump flow according to the corresponding types;
      the conditional branch flow is as follows: the conditional branch unit 201 performs the condition test based on the opcode and operand input via connection 230 and the current Mask information input via connection 257; the conditional branch unit 201 informs the PC updating unit 209, via connection 248, of the condition test results of all threads and of the updated PC value, according to the outcome of the condition test; the conditional branch unit 201 generates new Mask information according to the outcome of the condition test and informs the multi-site (warp) ThreadMask stack 204 through connection 247; the multi-site (warp) ThreadMask stack 204 selects one of the plurality of warp sites according to the current warp information input via connection 252 and outputs the Mask information of all threads of that warp via connection 253; the current ThreadMask 206 holds the Mask information of all threads of the current warp received over connection 253 and outputs it to the external Ex component, the conditional branch unit 201 and the Mask stack control component 202 over connection 236, connection 257 and connection 260, respectively;
      the Mask stack control flow is as follows: according to the operation code input through connection 231 and the Mask information of all threads of the current warp input through connection 260, the Mask stack control component 202 informs the multi-site (warp) ThreadMask stack 204, through connection 249, of the action to be executed (push, pop or negation) and, for a push, of the Mask information of all threads;
      the absolute jump flow is as follows: according to the operation code and operand input via connection 232, the absolute jump unit 203 informs the PC updating unit 209, via connection 250, of the operand carried by the absolute jump instruction, and informs the multi-site (warp) PC stack 205, via connection 251, of the operation to be performed (push or pop); the multi-site (warp) PC stack 205 obtains the updated PC value over connection 254 and performs the corresponding operation according to the operation input over connection 251.
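The interaction between the absolute jump flow and the multi-site (warp) PC stack 205 can be sketched as a per-warp return-address stack. The class below is a software illustration only; the name `MultiWarpPCStack`, its method names, and the use of Python lists for the hardware sites are assumptions of the example.

```python
class MultiWarpPCStack:
    """Illustrative model: one PC and one PC stack per warp site."""

    def __init__(self, num_warps):
        self.pc = [0] * num_warps                      # current PC per warp
        self.stacks = [[] for _ in range(num_warps)]   # one stack per site

    def jump(self, warp, target):
        # Jump: set the PC to the immediate address.
        self.pc[warp] = target

    def push_and_jump(self, warp, target):
        # Push-and-jump: save the next instruction address, then jump.
        self.stacks[warp].append(self.pc[warp] + 1)
        self.pc[warp] = target

    def pop_and_jump(self, warp):
        # Pop-and-jump: return to the most recently saved address.
        self.pc[warp] = self.stacks[warp].pop()


s = MultiWarpPCStack(num_warps=2)
s.pc[0] = 10
s.push_and_jump(0, 100)     # outer call from address 10
s.pc[0] = 105               # callee has advanced to address 105
s.push_and_jump(0, 200)     # nested call
s.pop_and_jump(0)
after_inner_return = s.pc[0]
s.pop_and_jump(0)
after_outer_return = s.pc[0]
```

Because each warp has its own site, a call in warp 0 leaves the PC and stack of warp 1 untouched, which is what allows warps to be switched between instructions.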
      The instructions 102 include: a conditional branch instruction 102C, an absolute jump instruction 102J, and an assertion (predicate) operation instruction 102M;
      the conditional branch instruction 102C can be subdivided into two types, a jump-if-condition-satisfied instruction BEQZ and a jump-if-condition-not-satisfied instruction BNEZ, which jump to the address specified by the immediate when the condition is satisfied and when the condition is not satisfied, respectively; the instruction format of the conditional branch instruction 102C is Opcode RS, #imm, where Opcode is a mnemonic, RS is a source register, and #imm is an immediate;
      the absolute jump instruction 102J is subdivided into a jump instruction J, a push-and-jump instruction JS and a pop-and-jump instruction JR; the jump instruction J jumps to the address specified by the immediate; the push-and-jump instruction JS saves the address of the next instruction to the stack and jumps to the address specified by the immediate; the pop-and-jump instruction JR pops an address from the stack and jumps to that address; the instruction format of the jump instruction J and the push-and-jump instruction JS is Opcode #imm, and the instruction format of the pop-and-jump instruction JR contains only Opcode, where Opcode is a mnemonic and #imm is an immediate;
      the assertion operation instruction 102M is subdivided into an assertion push instruction PUSH, an assertion pop instruction POP and an assertion negation instruction INV; the assertion push instruction PUSH pushes the current assertion Mask onto the assertion stack; the assertion pop instruction POP pops the assertion Mask at the top of the assertion stack and takes it as the current assertion Mask; the assertion negation instruction INV negates the current assertion Mask, performs a bitwise AND of the result with the assertion Mask at the top of the assertion stack, and takes the outcome as the current assertion Mask.
      The programming model  103 comprises an assertion rule 103M, a flow control rule 103F and a PC update rule 103P;
      the assertion rule 103M includes: each instruction in the instructions 102 has an attribute of whether the attribute is controlled by a predicate Mask or not, wherein: the absolute jump instruction 102J and the predicate operation instruction 102M are not controlled by a predicate Mask, and the instructions are executed no matter whether the predicate Mask is effective or not; the conditional branch instruction 102C is controlled by a predicate Mask, and when the predicate Mask is valid, the instruction is executed, and when the predicate Mask is invalid, the instruction is not executed;
      the flow control rule 103F includes: the jump instruction J is not allowed to occur inside a conditional branch, except for implementing loop control such as while and for; the push-and-jump instruction JS and the pop-and-jump instruction JR may appear inside a conditional branch; function calls are implemented by using the push-and-jump instruction JS and the pop-and-jump instruction JR as a pair, and function calls may be nested; of the two path addresses of a conditional branch, one is fixed at PC + 1 and the other must be larger than the current PC and is not allowed to be smaller than it, that is, the immediate carried in the conditional branch instruction 102C must be larger than the address of the current instruction; the absolute jump instruction 102J supports immediate addressing; the absolute jump instruction 102J may also support register-indirect addressing, but the programmer must ensure that the value in the register is the same in all threads; the user program must comply with the flow control rule 103F, otherwise the execution flow of the program may not be as expected;
      the PC update rule 103P includes: in the designed instruction set, the only instructions capable of changing the program execution flow are the absolute jump instruction 102J and the conditional branch instruction 102C; the update strategy of the PC is as follows: when an absolute jump instruction 102J is encountered, the PC always jumps, regardless of whether the instruction is located inside a conditional branch; when a conditional branch instruction 102C is encountered, if the conditional branches of all active threads do not diverge and the test condition is satisfied, the PC jumps; when a conditional branch instruction 102C is encountered, if the conditional branches of the active threads diverge, or do not diverge but the test condition is not satisfied, the PC is updated to PC + 1; when any other instruction of the instruction set, outside the instructions 102, is encountered, the PC is updated to PC + 1.
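Since the flow control rule 103F requires the immediate of a conditional branch instruction 102C to be larger than the address of the instruction itself, a user program can be checked for this constraint before it is run. The checker below is a hypothetical aid, not part of the invention; the function name `check_forward_branches` and the program encoding as a list of `(mnemonic, immediate)` pairs indexed by address (with the source register omitted) are assumptions of the example.

```python
def check_forward_branches(program):
    """Return the addresses of conditional branches that violate rule 103F.

    program: list of (mnemonic, immediate) pairs; the list index is the
    instruction address. Non-branch instructions may carry None.
    """
    violations = []
    for addr, (mnemonic, imm) in enumerate(program):
        # Only BEQZ / BNEZ are subject to the forward-only constraint.
        if mnemonic in ("BEQZ", "BNEZ") and imm <= addr:
            violations.append(addr)
    return violations


program = [
    ("BEQZ", 3),     # address 0: forward target, allowed
    ("ADD", None),   # address 1
    ("BNEZ", 1),     # address 2: backward target, not allowed
    ("JR", None),    # address 3
]
bad_addresses = check_forward_branches(program)
```

Backward control transfers, such as loop back-edges, are left to the jump instruction J, which the same rule permits inside a conditional branch only for loop control.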
      For the purposes of the present invention: threads diverge when, at a conditional branch, the condition assertions of the threads are not all the same; threads do not diverge when the condition assertions of all threads are the same; a PC jump refers to updating the PC to a value other than PC + 1.
      Finally, it should be noted that the above embodiments are only used to illustrate the technical solutions of the present invention and not to limit them; although the present invention has been described in detail with reference to the foregoing embodiments, those skilled in the art will understand that the technical solutions described in the foregoing embodiments may still be modified, or some of their technical features may be equivalently replaced, and that such modifications or substitutions do not cause the corresponding technical solutions to depart from the spirit and scope of the technical solutions of the embodiments of the present invention.
    Claims (8)
1. A SIMT conditional branch processing apparatus, characterized in that: the SIMT conditional branch processing apparatus comprises branch processing apparatus hardware, instructions and a programming model; the branch processing apparatus hardware is connected with the instructions and with the programming model, respectively, and the instructions and the programming model are connected with each other, wherein:
      the branch processing device hardware is a hardware carrier for executing external user codes and has a relationship from the external user codes; the instructions provide an external user code with a use interface of the apparatus, having a relationship to the external user code; the programming model provides external user code with constraints and limitations of instruction sequences in use, with a relationship to the external user code;
      the branch processing apparatus hardware operates by instructions, having dependencies from the instructions; the hardware of the branch processing device is realized on the premise of a programming model and has a dependency relationship on the programming model; the instructions are executed by branch processing apparatus hardware, having dependencies from the branch processing apparatus hardware; the use of the instructions must adhere to the programming model, with dependencies from the programming model; the implementation of the programming model depends on the sequence of instruction components, with dependencies from the instructions; the programming model is embodied in code executing on the branch processing device hardware, the programming model having dependencies from the branch processing device hardware;
      the branch processing device hardware comprises a conditional branch component, a Mask stack control component, an absolute jump component, a multi-field ThreadMask stack, a multi-field PC stack, a current ThreadMask, a Warp scheduling component, an instruction fetch Warp scheduling component, a PC updating component and a current PC; wherein
      The conditional branch component is used for executing the conditional branch instruction, generating assertion Mask information for all threads according to the execution result, determining whether the results of all threads diverge, and generating branch address information, wherein the assertion Mask is generated according to the following rule: if the condition of the branch instruction is met for a given thread in the warp, the assertion Mask bit of that thread is valid; otherwise, the assertion Mask bit of that thread is invalid;
      the Mask stack control component is used for executing a Mask stack control instruction and generating the corresponding control action, the control actions comprising push, pop and negate, wherein the negate operation inverts the assertion Masks of all threads in the warp bit by bit, so that valid bits become invalid and invalid bits become valid;
      the absolute jump component is used for executing an absolute jump instruction, generating the control actions and data for the multi-site PC stack, the actions comprising push and pop, and generating current PC information;
      the multi-site ThreadMask stack has a plurality of hardware sites, each site being used for storing the Mask information of all threads of one warp, and outputs the Mask information of a warp according to the warp information output by the warp scheduling component;
      the multi-site PC stack has a plurality of hardware sites, each site being used for storing the PC information of one warp, and outputs the PC information of a warp according to the warp information output by the instruction fetch warp scheduling component;
      the current ThreadMask is used for holding the Mask information of all threads of the current warp;
      the warp scheduling component is used for selecting, from a plurality of warps, one warp that is ready to run according to the information of the external scoreboard unit;
      the instruction fetch warp scheduling component is used for selecting, from a plurality of warps, one warp whose instruction is to be fetched according to the cache state information of the external IF component;
      the PC updating component is used for updating the PC of the current warp according to the condition test results of all threads given by the conditional branch component, the data information related to the absolute jump instruction given by the absolute jump component, and the current PC value given by the current PC, the specific updating rule being: if the condition test results of all threads given by the conditional branch component do not diverge and the test condition holds, the PC is updated to the jump address sent by the conditional branch component; if the condition test results of all threads given by the conditional branch component diverge, or the test condition does not hold, the PC is updated to the PC + 1 value sent by the current PC; if the absolute jump sent by the absolute jump component is valid, either the jump address sent by the absolute jump component or the stack-top PC of the corresponding warp in the multi-site PC stack is selected according to the type of the absolute jump; otherwise, the PC is updated to the PC + 1 value given by the current PC;
      the current PC is used for holding the PC information of the current warp.
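The mask-generation rule and the divergence check described in this claim can be illustrated with a short Python sketch. It is only an illustration under stated assumptions, not the claimed hardware: the function names, the list-of-booleans representation of a warp, and the choice to exclude threads that are inactive in the current ThreadMask from the check are all assumptions made for the example.

```python
from typing import List, Tuple


def evaluate_conditional_branch(
    current_mask: List[bool], condition: List[bool]
) -> Tuple[List[bool], bool, bool]:
    """Return (assertion_mask, diverged, condition_holds) for one warp.

    Assumption: threads that are inactive in current_mask never become
    valid and do not take part in the divergence check.
    """
    # A thread's assertion Mask bit is valid only if its condition is met.
    assertion_mask = [act and cond for act, cond in zip(current_mask, condition)]
    # Test results of the active threads only.
    results = [cond for act, cond in zip(current_mask, condition) if act]
    # Divergence: some active threads pass the test and some fail it.
    diverged = any(results) and not all(results)
    # The test condition holds when every active thread passes it.
    condition_holds = bool(results) and all(results)
    return assertion_mask, diverged, condition_holds


def negate_mask(mask: List[bool]) -> List[bool]:
    """Bitwise negation: valid bits become invalid and vice versa."""
    return [not bit for bit in mask]
```

With four active threads and per-thread results `[True, False, True, False]`, the sketch reports divergence, so under the updating rule above the PC would advance to PC + 1 rather than jump.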
    2. The SIMT conditional branch processing apparatus according to claim 1, wherein: the conditional branch component is connected with an external ID component and is used for receiving the control signals and data of the conditional branch instruction from the ID component; the Mask stack control component is connected with the external ID component and is used for receiving the control signal of a Mask stack control instruction from the ID component; the absolute jump component is connected with the external ID component and is used for receiving the control signals and data of absolute jump instructions from the ID component; the current PC is connected with an external IF component and is used for outputting the PC information of the selected warp to the IF component; the instruction fetch warp scheduling component is connected with the external IF component and is used for receiving the empty/full information of the multi-warp instruction cache from the IF component; the warp scheduling component is connected with an external scoreboard unit and is used for receiving information on data dependences, control dependences, functional-unit dependences, write-back path dependences and the like among the multi-warp instructions, together with state information on instruction execution in the functional units; the current ThreadMask is connected with an external Ex component and is used for outputting the Mask information for the instruction of the current warp to the Ex component; the warp scheduling component is connected with the external ID component and is used for outputting the ID information of the winning current warp to the ID component.
    3. The SIMT conditional branch processing apparatus according to claim 2, wherein: the conditional branch component is connected with the PC updating component and is used for outputting the condition test results and the branch address information of all threads of the current warp to the PC updating component; the Mask stack control component is connected with the multi-site ThreadMask stack and is used for outputting control action information to the multi-site ThreadMask stack; the absolute jump component is connected with the PC updating component and is used for outputting data information related to the absolute jump instruction to the PC updating component; the absolute jump component is connected with the multi-site PC stack and is used for outputting stack control actions and data information to the multi-site PC stack; the multi-site ThreadMask stack is connected with the warp scheduling component and is used for receiving the id information of the winning current warp; the current ThreadMask is connected with the multi-site ThreadMask stack and is used for receiving the Mask information of all threads of the current warp from the multi-site ThreadMask stack; the multi-site PC stack is connected with the PC updating component and is used for receiving the updated next PC of the current warp; the multi-site PC stack is connected with the instruction fetch warp scheduling component and is used for receiving the id information of the winning instruction fetch warp from the instruction fetch warp scheduling component; the multi-site PC stack is connected with the current PC and is used for outputting the PC information of the instruction fetch warp to the current PC; the current ThreadMask is connected with the conditional branch component and is used for outputting the Mask information of all threads of the current warp to the conditional branch component; the warp scheduling component is connected with the multi-site PC stack and is used for outputting the id information of the winning current warp to the multi-site PC stack; the current PC is connected with the PC updating component and is used for outputting the PC information of the current warp to the PC updating component; and the current ThreadMask is connected with the Mask stack control component and is used for outputting the Mask information of all threads of the current warp to the Mask stack control component.
    4. The SIMT conditional branch processing apparatus according to claim 3, wherein: the instructions include conditional branch instructions, absolute jump instructions, and assertion operation instructions.
    5. The SIMT conditional branch processing apparatus according to claim 4, wherein: the conditional branch instruction is subdivided into two types, a jump-if-condition-met instruction and a jump-if-condition-not-met instruction, which are used respectively for jumping to the address specified by an immediate when the condition is satisfied and for jumping to the address specified by the immediate when the condition is not satisfied; the instruction format of the conditional branch instruction comprises a mnemonic, a test condition and an immediate jump address;
      the absolute jump instruction is subdivided into a jump instruction, a push and jump instruction, and a pop and jump instruction; the jump instruction is used for jumping to the address specified by an immediate; the push and jump instruction is used for storing the address of the next instruction on the stack and jumping to the address specified by the immediate; the pop and jump instruction is used for popping an address from the stack and jumping to that address; the instruction formats of the jump instruction and the push and jump instruction comprise a mnemonic and an immediate jump address; the instruction format of the pop and jump instruction comprises only a mnemonic;
      the assertion operation instruction is subdivided into an assertion push instruction, an assertion pop instruction and an assertion negation instruction; the assertion push instruction is used for pushing the current assertion Mask onto the assertion stack; the assertion pop instruction is used for popping the assertion Mask at the top of the assertion stack and taking it as the current assertion Mask; and the assertion negation instruction is used for negating the current assertion Mask, performing a bitwise AND of the negated Mask with the assertion Mask at the top of the assertion stack, and taking the result as the current assertion Mask.
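The three assertion operation instructions can be sketched as operations on a per-warp stack of bitmasks. The Python below is a minimal sketch under assumptions: the class and method names are hypothetical, a Mask is modelled as an integer with thread 0 in the least significant bit, and writing `current` directly stands in for the new Mask that a conditional branch would produce.

```python
from typing import List


class AssertionMaskStack:
    """One warp's assertion stack plus its current assertion Mask (illustrative)."""

    def __init__(self, warp_size: int = 32) -> None:
        self.full = (1 << warp_size) - 1  # every thread valid
        self.current = self.full
        self.stack: List[int] = []

    def push(self) -> None:
        # Assertion push: save the current Mask on the assertion stack.
        self.stack.append(self.current)

    def pop(self) -> None:
        # Assertion pop: the stack-top Mask becomes the current Mask.
        self.current = self.stack.pop()

    def negate(self) -> None:
        # Assertion negation: invert the current Mask, then AND it with the
        # stack top so that only threads valid on entry can become valid.
        self.current = (~self.current & self.full) & self.stack[-1]


# Hypothetical if/else on a 4-thread warp whose outer Mask is 0b0011.
s = AssertionMaskStack(warp_size=4)
s.current = 0b0011
s.push()            # save the outer Mask
s.current = 0b0001  # stand-in for the Mask set by a conditional branch
s.negate()          # else path: 0b1110 & 0b0011 -> 0b0010
else_mask = s.current
s.pop()             # restore the outer Mask 0b0011
```

The AND with the stack top is what keeps threads 2 and 3, which were already invalid in the outer Mask, from being switched on by the negation inside a nested branch.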
    6. The SIMT conditional branch processing apparatus according to claim 5, wherein: the programming model includes assertion rules, flow control rules, and PC update rules.
    7. The SIMT conditional branch processing apparatus according to claim 6, wherein: the assertion rule includes: each instruction has an attribute indicating whether it is controlled by the assertion Mask, wherein: the absolute jump instructions and the assertion operation instructions are not controlled by the assertion Mask and are executed regardless of whether the assertion Mask is valid; the conditional branch instruction is controlled by the assertion Mask, being executed when the assertion Mask is valid and not executed when it is invalid;
      the flow control rule comprises: the jump instruction is not allowed inside a conditional branch, except to implement loop control such as while and for; the push and jump instruction and the pop and jump instruction may appear inside a conditional branch; the push and jump instruction and the pop and jump instruction are used as a pair to implement function calls, and function calls may be nested; of the two path addresses of a conditional branch, one is fixed at PC + 1 and the other must be greater than the current PC and is never allowed to be smaller, that is, the immediate carried in the conditional branch instruction is larger than the address of the current instruction; the absolute jump instruction supports immediate addressing; absolute jump instructions may also support register indirect addressing, but the programmer must ensure that the register holds the same value in all threads; the user program must comply with the flow control rules, otherwise the execution flow of the program may differ from what is expected;
      the PC update rule includes: in the designed instruction set, the only instructions capable of changing the program execution flow are absolute jump instructions and conditional branch instructions; the PC update strategy is as follows: when an absolute jump instruction is encountered, the PC always jumps, regardless of whether the instruction is inside a conditional branch; when a conditional branch instruction is encountered, if the conditional branches of all active threads do not diverge and the test condition holds, the PC jumps; when a conditional branch instruction is encountered, if the conditional branches of the active threads diverge, or do not diverge but the test condition does not hold, the PC is updated to PC + 1; when any other instruction in the instruction set is encountered, the PC is updated to PC + 1.
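The PC update rule of this claim reduces to a small decision function. The following Python is a sketch under assumptions: the enum and function names are hypothetical, and for an absolute jump the caller is assumed to pass in the already resolved target (the immediate, or the address popped from the PC stack). The forward-only check reflects the flow control rule that a conditional branch immediate must be larger than the address of the current instruction.

```python
from enum import Enum, auto
from typing import Optional


class Kind(Enum):
    ABSOLUTE_JUMP = auto()
    CONDITIONAL_BRANCH = auto()
    OTHER = auto()


def next_pc(
    kind: Kind,
    pc: int,
    target: Optional[int] = None,
    diverged: bool = False,
    condition_holds: bool = False,
) -> int:
    """Return the next PC of a warp according to the PC update rule."""
    if kind is Kind.ABSOLUTE_JUMP:
        if target is None:
            raise ValueError("an absolute jump needs a resolved target")
        # An absolute jump always jumps, even inside a conditional branch.
        return target
    if kind is Kind.CONDITIONAL_BRANCH:
        if target is None or target <= pc:
            raise ValueError("a conditional branch target must be greater than pc")
        # Jump only when the active threads agree and the test condition holds.
        if not diverged and condition_holds:
            return target
        return pc + 1
    # Every other instruction simply advances the PC.
    return pc + 1
```

For example, a conditional branch at address 5 with target 9 yields 9 when all active threads pass the test, and 6 when the threads diverge or all fail it.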
    8. An implementation method for the SIMT conditional branch processing apparatus according to claim 1, comprising the following steps:
      1) The hardware of the branch processing device determines the type of the externally input instruction and, according to that type, executes the conditional branch flow, the Mask stack control flow or the absolute jump flow;
      2) Conditional branch flow:
      2.1) The conditional branch component performs the condition test according to the input operation code and operands and the input current Mask information;
      2.2) The conditional branch component informs the PC updating component of the condition test results of all threads obtained in step 2.1) and of the updated PC value;
      2.3) The conditional branch component generates new Mask information according to the condition test results of step 2.1) and informs the multi-site ThreadMask stack; the multi-site ThreadMask stack selects one site from its plurality of warp sites according to the current warp information input by the warp scheduling component and outputs the Mask information of all threads of that warp to the current ThreadMask; the current ThreadMask outputs the received Mask information of all threads of the current warp to the external Ex component, the conditional branch component and the Mask stack control component respectively;
      3) The Mask stack control flow: the Mask stack control component, according to the operation code input by the ID component and the input Mask information of all threads of the current warp, informs the multi-site ThreadMask stack of the action to be executed (push, pop or negate) and, for a push, of the Mask information of all threads;
      4) The absolute jump flow: the absolute jump component, according to the operation code and operand input by the ID component, informs the PC updating component of the operand carried by the absolute jump instruction and informs the multi-site PC stack of the operation to be performed (push or pop); the multi-site PC stack obtains the updated PC value from the PC updating component and executes the corresponding operation according to the operation input by the absolute jump component.
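The absolute jump flow of step 4) can be sketched with one PC stack per warp site. The Python below is an illustration under assumptions: the class name and the operation strings `"jump"`, `"push_jump"` and `"pop_jump"` are hypothetical stand-ins for the three absolute jump instructions, and a site is modelled as a plain list selected by warp id.

```python
from typing import List, Optional


class MultiSitePCStack:
    """One return-address stack per warp site (illustrative model)."""

    def __init__(self, num_warps: int) -> None:
        self.sites: List[List[int]] = [[] for _ in range(num_warps)]

    def absolute_jump(
        self, warp_id: int, op: str, pc: int, target: Optional[int] = None
    ) -> int:
        """Apply one absolute jump for the given warp and return its next PC."""
        site = self.sites[warp_id]  # select the site of this warp
        if op == "pop_jump":
            # Pop the saved address and jump to it (function return).
            return site.pop()
        if target is None:
            raise ValueError(f"{op} needs an immediate target")
        if op == "jump":
            return target
        if op == "push_jump":
            # Save the address of the next instruction, then jump (call).
            site.append(pc + 1)
            return target
        raise ValueError(f"unknown absolute jump type: {op}")


# Nested calls on warp 0; warp 1 keeps its own, independent site.
stack = MultiSitePCStack(num_warps=2)
pc0 = stack.absolute_jump(0, "push_jump", pc=10, target=100)   # call f
pc0 = stack.absolute_jump(0, "push_jump", pc=pc0 + 5, target=200)  # f calls g
pc1 = stack.absolute_jump(1, "jump", pc=3, target=50)
ret_g = stack.absolute_jump(0, "pop_jump", pc=210)  # back into f
ret_f = stack.absolute_jump(0, "pop_jump", pc=107)  # back to the caller
```

Keeping a separate site per warp means that interleaved warps can be inside different call depths at the same time without disturbing one another's return addresses.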
    Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN202011404147.9A CN112579164B (en) | 2020-12-05 | 2020-12-05 | SIMT conditional branch processing device and method | 
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| CN202011404147.9A CN112579164B (en) | 2020-12-05 | 2020-12-05 | SIMT conditional branch processing device and method | 
Publications (2)
| Publication Number | Publication Date | 
|---|---|
| CN112579164A (en) | 2021-03-30 |
| CN112579164B (en) | 2022-10-25 |
Family
ID=75127105
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202011404147.9A Active CN112579164B (en) | 2020-12-05 | 2020-12-05 | SIMT conditional branch processing device and method | 
Country Status (1)
| Country | Link | 
|---|---|
| CN (1) | CN112579164B (en) | 
Families Citing this family (4)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| CN113778529B (en) * | 2021-09-14 | 2024-09-20 | Xidian University | Multi-path parallel execution branch merging system and branch merging method |
| CN114253821B (en) * | 2022-03-01 | 2022-05-27 | Xi'an Xintong Semiconductor Technology Co., Ltd. | Method and device for analyzing GPU performance and computer storage medium |
| CN114911528B (en) * | 2022-05-31 | 2024-09-13 | Shanghai Zhenliang Intelligent Technology Co., Ltd. | Branch instruction processing method, processor, chip, board card, equipment and medium |
| CN118349281B (en) * | 2024-06-18 | 2024-10-29 | Beijing Huixi Intelligent Technology Co., Ltd. | Thread divergence processing method, device and equipment |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| CN102109979A (en) * | 2009-12-28 | 2011-06-29 | Sony Corporation | Processor, co-processor, information processing system, and method thereof |
| CN103809964A (en) * | 2012-11-05 | 2014-05-21 | NVIDIA Corporation | System and method for executing sequential code using a group of threads and single-instruction, multiple-thread processor incorporating the same |
| CN105045564A (en) * | 2015-06-26 | 2015-11-11 | Ji Jincheng | Front end dynamic sharing method in graphics processor |
| CN106648545A (en) * | 2016-01-18 | 2017-05-10 | Tianjin University | Register file structure used for branch processing in GPU |
| CN106708780A (en) * | 2016-12-12 | 2017-05-24 | AVIC Xi'an Aeronautics Computing Technique Research Institute | Low-complexity branch processing circuit of a unified shader array for the SIMT architecture |
| CN108133452A (en) * | 2017-12-06 | 2018-06-08 | AVIC Xi'an Aeronautics Computing Technique Research Institute | Instruction issue processing circuit of a unified shader array |
Family Cites Families (1)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| US8615646B2 (en) * | 2009-09-24 | 2013-12-24 | Nvidia Corporation | Unanimous branch instructions in a parallel thread processor | 
Non-Patent Citations (4)
| Title | 
|---|
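| Warped-Shield: Tolerating Hard Faults in GPGPUs; W. Dweik et al.; 2014 44th Annual IEEE/IFIP International Conference on Dependable Systems and Networks; 20140922; full text * |
| Research on memory mapping methods for reconfigurable and SIMT processor system architectures; Han Feng; China Doctoral Dissertations Full-text Database, Information Science and Technology; 20180615; Vol. 2018 (No. 6); full text * |
| CUDA-based optimization strategy for aggregating divergent GPU conditional branches; Liu Suqin et al.; Journal of China University of Petroleum (Natural Science Edition); 20140609; Vol. 38 (No. 3); full text * |
| Design and implementation of the instruction fetch and decode unit in a unified shader array; Wei Yanyan et al.; Aeronautical Computing Technique; 20200525 (No. 03); full text * |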
Also Published As
| Publication number | Publication date | 
|---|---|
| CN112579164A (en) | 2021-03-30 | 
Similar Documents
| Publication | Title | Publication Date |
|---|---|---|
| CN112579164B (en) | SIMT conditional branch processing device and method | |
| US7178011B2 (en) | Predication instruction within a data processing system | |
| CN101373427B (en) | Program Execution Control Device | |
| US9286091B2 (en) | Semiconductor device improving performance of virtual machines | |
| US20160291982A1 (en) | Parallelized execution of instruction sequences based on pre-monitoring | |
| CN101957744B (en) | Hardware multithreading control method for microprocessor and device thereof | |
| US20130080749A1 (en) | Processor and control method of processor | |
| US20070174592A1 (en) | Early conditional selection of an operand | |
| JP5316407B2 (en) | Arithmetic processing device and control method of arithmetic processing device | |
| US7065636B2 (en) | Hardware loops and pipeline system using advanced generation of loop parameters | |
| GB2429084A (en) | Operating system coprocessor support module | |
| US7831979B2 (en) | Processor with instruction-based interrupt handling | |
| JP2002366351A (en) | Super-scalar processor | |
| US7543135B2 (en) | Processor and method for selectively processing instruction to be read using instruction code already in pipeline or already stored in prefetch buffer | |
| JP2005234968A (en) | Arithmetic processing unit | |
| US5713012A (en) | Microprocessor | |
| US7472264B2 (en) | Predicting a jump target based on a program counter and state information for a process | |
| US7269720B2 (en) | Dynamically controlling execution of operations within a multi-operation instruction | |
| US6453412B1 (en) | Method and apparatus for reissuing paired MMX instructions singly during exception handling | |
| WO2016156955A1 (en) | Parallelized execution of instruction sequences based on premonitoring | |
| JP5573038B2 (en) | Multi-thread processor and program | |
| Stefan et al. | A processor network without interconnection path | |
| CN111124494B (en) | Method and circuit for accelerating unconditional jump in CPU | |
| Nguyen | Multithreading in Application Specific Instruction Set Processor | |
| US20160291979A1 (en) | Parallelized execution of instruction sequences | 
Legal Events
| Date | Code | Title | Description | 
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||