HK1112982A - System and method of correcting a branch misprediction - Google Patents
- Publication number
- HK1112982A (application HK08108337.7A)
- Authority
- HK
- Hong Kong
- Prior art keywords
- instruction
- branch
- pipeline
- instructions
- branch instruction
- Prior art date
Abstract
When a branch misprediction in a pipelined processor is discovered, if the mispredicted branch instruction is not the last uncommitted instruction in the pipelines, older uncommitted instructions are checked for dependency on a long latency operation. If one is discovered, all uncommitted instructions are flushed from the pipelines without waiting for the dependency to be resolved. The branch prediction is corrected, and the branch instruction and all flushed instructions older than the branch instruction are re-fetched and executed.
Description
Technical Field
The present invention relates generally to the field of processors, and in particular, to a method of flushing uncommitted instructions from a processor execution pipeline in response to a branch misprediction.
Background
Microprocessors perform computational tasks in a wide variety of applications. There is almost always a need for improved processor performance to allow faster operation and/or increased functionality through software changes. In many embedded applications, such as portable electronic devices, conserving power is also an important consideration in the design and implementation of processors.
Most modern processors use a pipeline architecture in which successive instructions, each having multiple execution steps, overlap in execution. To achieve maximum performance, instructions should constantly flow through the pipeline. However, instructions often become stalled in the pipeline for various reasons, such as data dependencies between instructions, delays associated with memory accesses, the inability to allocate sufficient pipeline resources to the instructions, and so forth. Minimizing pipeline stalls and effectively addressing the stalls are important factors in achieving improved processor performance.
Real-world programs include conditional branch instructions, whose actual branching behavior is typically not known until the instruction is evaluated deep in the pipeline. Modern processors therefore typically employ various forms of branch prediction, whereby the branching behavior of a conditional branch instruction is predicted early in the pipeline, and the processor speculatively allocates pipeline resources and/or fetches and speculatively executes instructions based on the prediction. When the actual branch behavior is determined, if the branch was mispredicted, the speculatively fetched instructions must be flushed from the pipeline and new instructions fetched from the correct branch target address. Mispredicted branches adversely affect both the performance and the power consumption of the processor.
Typically, in handling a mispredicted branch instruction, all instructions earlier than the branch instruction (i.e., instructions that entered the pipeline before the branch instruction) are allowed to complete execution before the speculatively fetched instructions are flushed. If one or more of the earlier instructions stall in the pipeline due to a long latency operation, waiting for the dependency to resolve before flushing the pipeline exacerbates the performance loss caused by the mispredicted branch.
Disclosure of Invention
The present invention relates to a method of handling branch mispredictions in a pipelined processor. A branch misprediction is detected, and at least one instruction older than the branch instruction is flushed from the pipeline in response to detecting the misprediction.
The invention also relates to a processor. The processor includes an instruction execution pipeline, and a branch predictor that predicts evaluation of conditional branch instructions in the pipeline. The processor also includes an instruction order manager that tracks the order of instructions in the pipeline and dependencies between the instructions. The processor additionally includes a pipeline controller that flushes at least one instruction older than the branch instruction from the pipeline in response to detecting that the branch instruction was mispredicted.
In addition, the present invention relates to a method of correcting branch mispredictions in a pipelined processor. A branch instruction misprediction is detected. It is determined whether the branch instruction is the last uncommitted instruction in the pipeline. If the branch instruction is the last uncommitted instruction in the pipeline, the branch instruction is committed and all uncommitted instructions are flushed from the pipeline. If the branch instruction is not the last uncommitted instruction in the pipeline, a determination is made as to whether an instruction older than the branch instruction is stalled in the pipeline due to a longer latency operation. If an instruction older than the branch instruction is stalled in the pipeline due to a longer latency operation, the branch instruction and all other uncommitted instructions are flushed from the pipeline.
Drawings
FIG. 1 is a functional block diagram of a processor.
FIG. 2 is a functional block diagram of an instruction cache and portions of two pipelines.
FIG. 3 is a flow diagram of a method of handling branch mispredictions.
Detailed Description
Fig. 1 depicts a functional block diagram of a processor 10. The processor 10 executes instructions in an instruction execution pipeline 12 in accordance with control logic 14. The pipeline 12 may be a superscalar design, with multiple parallel pipelines such as 12a and 12b. The pipeline control logic 14 may include a branch predictor 13 and an instruction order manager 15. The pipelines 12a, 12b include various registers or latches, organized in pipe stages, and one or more Arithmetic Logic Units (ALU) 18. A General Purpose Register (GPR) file 20 provides registers comprising the top of the memory hierarchy. The pipelines 12a, 12b fetch instructions from an instruction cache 22, with memory addressing and permissions managed by an instruction-side translation lookaside buffer (ITLB) 24. Data is accessed from a data cache 26, with memory addressing and permissions managed by a main Translation Lookaside Buffer (TLB) 28. In various embodiments, the ITLB may comprise a copy of part of the TLB. Alternatively, the ITLB and TLB may be integrated. Similarly, in various embodiments of the processor 10, the I-cache 22 and D-cache 26 may be integrated or merged. Misses in the I-cache 22 and/or the D-cache 26 cause an access to main (off-chip) memory 32, under the control of a memory interface 30. The processor 10 may include an input/output (I/O) interface 34 that controls access to various peripheral devices 36. Those skilled in the art will recognize that many modifications to the processor 10 are possible. For example, the processor 10 may include a level two (L2) cache for one or both of the I and D caches. Moreover, one or more of the functional blocks depicted in the processor 10 may be omitted from a particular embodiment.
Pipelining is a well-known processor implementation technique whereby multiple instructions overlap in execution at the same time. Each instruction in the exemplary architecture is executed in multiple execution steps (e.g., fetch, decode, execute, memory access, and writeback). The processor pipeline 12 is made up of a plurality of "pipe stages," each pipe stage including a logic and storage element 16 that completes an execution step or a portion of an execution step of an instruction. The pipe stages are connected together to form a pipeline 12. Instructions enter the pipeline 12 and are processed sequentially through the stages. New instructions enter the pipeline 12 before previous instructions complete, so multiple instructions may be processed in the pipeline 12 at any given time. This ability to exploit parallelism among instructions in a continuous instruction stream contributes significantly to improved processor performance. In a processor 10 that completes one pipe stage in one cycle, under ideal conditions and after a brief initial process of filling the pipeline 12, the execution of one instruction may complete in each cycle.
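The throughput benefit described above can be sketched in a few lines of Python. This is purely illustrative and not part of the patent (the patent describes hardware, not software); the five step names are the ones listed in the text.

```python
# Illustrative model of an idealized in-order pipeline: each instruction
# passes through five execution steps, one pipe stage per clock cycle.

STAGES = ["fetch", "decode", "execute", "mem", "writeback"]

def ideal_pipeline_cycles(num_instructions, num_stages=len(STAGES)):
    """Cycles to run `num_instructions` with perfect overlap and no stalls.

    After the pipeline fills (num_stages cycles for the first instruction),
    one instruction completes in each subsequent cycle.
    """
    if num_instructions == 0:
        return 0
    return num_stages + (num_instructions - 1)

# With overlap, 100 instructions take 104 cycles rather than 500 sequentially.
print(ideal_pipeline_cycles(100))  # 104
```

The formula captures the "brief initial process of filling the pipeline" followed by one completion per cycle under ideal conditions.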
Such ideal conditions are never achieved in practice, due to factors including data dependencies among instructions (data hazards), control dependencies such as branches (control hazards), processor resource allocation conflicts (structural hazards), interrupts, cache misses, page faults, and the like. A typical data hazard is encountered when an instruction performs an arithmetic or logical operation on two operands, where one or more of the operands is the result of a previous instruction that has not completed execution and thus has not produced the required operand. The earlier instruction may be another arithmetic or logical operation, or it may be a memory access (e.g., one that misses in the cache 22, 26), forcing the memory interface 30 to perform an off-chip memory access operation. The data hazard forces the pipeline 12 to stall.
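As a toy illustration of the data hazard just described, the following sketch tests whether a consumer instruction reads a register that an earlier producer writes (the register names and dictionary layout are assumptions for illustration, not anything specified by the patent):

```python
# Illustrative read-after-write (RAW) hazard check between two instructions.

def has_raw_hazard(producer, consumer):
    """True if `consumer` reads a register that `producer` writes."""
    return bool(set(producer["writes"]) & set(consumer["reads"]))

load = {"op": "load", "writes": ["r1"], "reads": ["r2"]}        # may miss in cache
add  = {"op": "add",  "writes": ["r3"], "reads": ["r1", "r4"]}  # consumes r1

print(has_raw_hazard(load, add))  # True: add must stall until the load completes
```

If the load misses in the D-cache 26, the dependent add stalls in the pipeline until the off-chip access completes, exactly the situation the text describes.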
A typical control hazard encountered in the pipelined processor 10 is a mispredicted branch instruction. A conditional branch instruction may be "taken," directing control flow to a different point in the program, or "not taken," in which case instruction execution proceeds sequentially. The branch condition is evaluated deep in the pipeline 12, during an execute pipe stage. Until the branch instruction is evaluated, the processor 10 does not know which instruction to fetch and execute next (i.e., the next sequential instruction or the instruction at the branch target address), and the delay until the branch condition is evaluated causes a stall in the pipeline 12. Accordingly, many processors predict how a branch condition will evaluate, e.g., based on previous executions of the conditional branch instruction. The processor 10 fetches instructions into the pipeline 12 beginning at the predicted address, executing them speculatively. When the prediction is correct, a pipeline stall is avoided.
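The patent does not mandate any particular prediction scheme for the branch predictor 13, so as a hedged illustration only, the following sketches one widely used mechanism: a two-bit saturating counter per branch address, which predicts based on previous executions of the branch.

```python
# Hypothetical sketch of a common branch prediction scheme (two-bit
# saturating counters); the patent does not specify this mechanism.

class TwoBitPredictor:
    STRONG_NOT, WEAK_NOT, WEAK_TAKEN, STRONG_TAKEN = 0, 1, 2, 3

    def __init__(self):
        self.counters = {}  # branch address -> 2-bit state

    def predict(self, addr):
        # Predict "taken" when the counter is in either taken state.
        return self.counters.get(addr, self.WEAK_NOT) >= self.WEAK_TAKEN

    def update(self, addr, taken):
        # Saturating increment on taken, decrement on not-taken.
        state = self.counters.get(addr, self.WEAK_NOT)
        state = min(state + 1, 3) if taken else max(state - 1, 0)
        self.counters[addr] = state

bp = TwoBitPredictor()
for _ in range(3):
    bp.update(0x1000, True)   # branch at 0x1000 repeatedly taken
print(bp.predict(0x1000))     # True
```

The two-bit hysteresis means a single anomalous outcome does not immediately flip a strongly established prediction.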
Some branch conditions will evaluate opposite to the way they were predicted. This is referred to herein as a "branch misprediction" or a "mispredicted branch." When a branch misprediction is detected, all instructions newer than the branch instruction (i.e., all instructions fetched based on the branch prediction) must be flushed from the pipeline 12. In a single pipeline, determining which instructions are newer than the mispredicted branch is straightforward: all pipe stages "behind" the branch must be flushed.
FIG. 2 depicts a superscalar pipeline architecture having two parallel execution pipelines 12a and 12b. In the scenario depicted in FIG. 2, instruction A in pipeline 12a is stalled due to a dependency on instruction X (e.g., for operand generation, a memory access, or some other long latency operation). The data hazard on instruction A also causes instruction B to stall. Meanwhile, instructions C, D, and E have been fetched from the instruction cache 22 and loaded into the pipeline 12b. In a superscalar processor 10, a mechanism is needed to track the order in which instructions execute, as well as the dependencies between instructions.
Most superscalar processors 10 include an order manager 15 as part of the pipeline control logic 14. The order manager 15 tracks the order in which instructions execute through the pipelines, i.e., which instructions are older or newer than a given instruction. The order manager 15 additionally tracks instruction dependencies and facilitates exception handling.
An exception or interrupt occurs whenever a pipe stage cannot complete its execution of an instruction step. For example, a store instruction that writes data to memory may cause an exception if a TLB 28 lookup indicates that the memory page is read-only. Other types of exceptions are well known in the art. Upon encountering an exception, the processor 10 must complete execution of all instructions in the pipeline 12 (or pipelines 12a and 12b in a superscalar processor) that are earlier than the exception-causing instruction; flush the exception-causing instruction and all newer instructions from the pipelines 12a and 12b; and then fetch and execute interrupt handling code. The order manager 15 assists this process by tracking which instructions are "confirmed" and which are "committed."
An instruction is confirmed when it is determined that no pipeline hazard will prevent its execution (i.e., the instruction will not stall). For example, an instruction that performs an arithmetic or logical operation may be confirmed when both of its operands are known to have been generated by previous instructions, fetched from memory, or to be otherwise available.
An instruction is committed when it and all earlier instructions are confirmed. A committed instruction is known to be able to complete execution, because no pipeline hazard impedes it (the instruction itself is confirmed) or any instruction ahead of it (all earlier instructions are confirmed). Referring to FIG. 2, instruction A is not confirmed, due to its dependency on the result of instruction X. Instruction B is unlikely to be confirmed at such an early stage in the pipeline 12a. Instruction C in the pipeline 12b may be confirmed, meaning that no hazard prevents instruction C from completing execution. However, instruction C cannot commit until all instructions older than it (i.e., instructions A and B) are confirmed.
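The confirm/commit bookkeeping described above can be sketched as follows. This is an assumed software model of what the order manager 15 might track, not the patented hardware design; the instruction names follow the FIG. 2 scenario.

```python
# Minimal sketch (assumed structure) of order-manager confirm/commit state.

class OrderManager:
    def __init__(self):
        self.instructions = []  # program order, oldest first
        self.confirmed = {}     # name -> True if no hazard can stall it

    def dispatch(self, name, confirmed=False):
        self.instructions.append(name)
        self.confirmed[name] = confirmed

    def confirm(self, name):
        self.confirmed[name] = True

    def is_committed(self, name):
        # Committed = this instruction AND every older one is confirmed.
        idx = self.instructions.index(name)
        return all(self.confirmed[i] for i in self.instructions[:idx + 1])

# FIG. 2 scenario: A stalls on X, so C cannot commit even though C is confirmed.
om = OrderManager()
om.dispatch("A")                   # stalled on instruction X, unconfirmed
om.dispatch("B")                   # too early in the pipeline to confirm
om.dispatch("C", confirmed=True)   # no hazard, but older A and B block commit
print(om.is_committed("C"))        # False
om.confirm("A"); om.confirm("B")
print(om.is_committed("C"))        # True
```

The commit rule is deliberately conservative: an instruction's fate is settled only once everything ahead of it is also known to complete.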
The conventional rule during exception handling is to flush the pipelines 12a and 12b when the instruction causing the exception is the last uncommitted instruction. For example, if instruction D generates an exception, the dependency of instruction A on instruction X must first be resolved, allowing A to be confirmed. Once A is confirmed, it will also be committed, since (instruction X having completed) there is no unconfirmed instruction ahead of it. If instruction B is likewise confirmed and committed as it moves through the pipeline 12a, then instruction C will commit, because instructions A, B, and C are all confirmed. At that point D is the last uncommitted instruction, and it is flushed from the pipelines 12a and 12b along with all newer instructions (e.g., E). Then, as the committed instructions A, B, and C move through the pipelines and complete execution, exception handling instructions are fetched and fed into the pipelines 12a and 12b. Forcing the exception-causing instruction to be the last uncommitted instruction in the pipelines 12a and 12b ensures a clean break in program execution: once the interrupt handling instructions resolve the error and restore the state of the processor 10, program execution can resume from instruction D and will produce correct results.
A similar procedure could be applied to handling mispredicted branches in the superscalar processor 10. For example, assume instruction C in FIG. 2 is a conditional branch instruction whose branch condition has been evaluated and found to have been mispredicted. Instructions D and E were fetched based on the wrong branch prediction and must be flushed from the pipeline 12b and replaced with instructions fetched from the correct branch target address. Under the exception handling rule, mispredicted branch C would wait until it is the last uncommitted instruction, i.e., until A's dependency on X is resolved and A and B are confirmed and committed, before flushing D and E. However, resolving A's dependency on X may take considerable time, delaying the point at which the correct instructions following mispredicted branch C are fetched and executed. Moreover, if A and B are flushed and re-fetched along with D and E, the dependency may well have resolved by the time A travels through the pipeline 12a again, allowing A to be confirmed immediately.
According to one embodiment of the present invention, described with reference to FIG. 3, when a mispredicted branch is detected (block 40), if the mispredicted branch is not the oldest uncommitted instruction (block 42), a check is made as to whether any earlier uncommitted instruction is stalled (block 44). If a stalled instruction is detected (e.g., due to a pipeline hazard, a memory access, or another long latency operation), the pipeline controller 14 immediately flushes all uncommitted instructions from the pipelines 12a, 12b (block 46). This includes the mispredicted branch, all uncommitted instructions older than the mispredicted branch, and all instructions newer than the branch (i.e., instructions speculatively fetched based on the branch misprediction). The branch prediction is corrected (block 48), and the flushed uncommitted instructions are then re-fetched and executed in order (block 50). The long latency operation that caused the stall may well have resolved by the time the previously stalled instruction is re-fetched and re-executed. Even if it has not, the processor fetches instructions from the correct branch target without waiting for the stall to resolve, improving processor performance.
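The decision flow of FIG. 3 can be sketched as a small function. The helper name `recovery_plan`, the list/set inputs, and the return convention are all assumptions for illustration (the real pipeline controller 14 is hardware); the block numbers from the figure are noted in comments.

```python
# Hedged software sketch of the FIG. 3 recovery decision (blocks 40-52).

def recovery_plan(branch, uncommitted, stalled):
    """Return (action, instructions_to_flush) for a detected misprediction.

    `uncommitted`: uncommitted instructions in program order (oldest first),
    including `branch`.  `stalled`: instructions stalled on long latency
    operations.  Names and types are illustrative assumptions.
    """
    idx = uncommitted.index(branch)
    older = uncommitted[:idx]
    if not older:                            # block 42: branch is last uncommitted
        # Commit the branch; flush only the speculatively fetched newer path.
        return "commit_branch", uncommitted[idx + 1:]          # block 52
    if any(i in stalled for i in older):     # block 44: older instruction stalled
        # Flush everything uncommitted, branch included, without waiting.
        return "flush_branch", uncommitted                     # block 46
    # Otherwise wait for older instructions to commit, then act as in block 52.
    return "commit_branch_after_wait", uncommitted[idx + 1:]

# FIG. 2 scenario: branch C mispredicted while A is stalled on X.
action, flushed = recovery_plan("C", ["A", "B", "C", "D", "E"], stalled={"A"})
print(action, flushed)  # flush_branch ['A', 'B', 'C', 'D', 'E']
```

After any of the three actions, the prediction is corrected (block 48) and fetching resumes (block 50), as the text describes.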
If the mispredicted branch instruction is the oldest uncommitted instruction (block 42), the processor commits the mispredicted branch instruction (so that it is not flushed) and flushes all uncommitted instructions from the pipelines 12a, 12b (block 52). This flushes all instructions newer than the mispredicted branch instruction, i.e., the instructions on the mispredicted branch path. The branch prediction is then corrected (block 48) so that the branch prediction mechanism accurately reflects the branch evaluation, and fetching and execution of instructions continues at the correct branch target address (block 50).
As indicated in FIG. 3, if the mispredicted branch (block 40) is not the oldest uncommitted instruction (block 42) and no earlier uncommitted instruction is stalled due to a long latency operation (block 44), the processor may simply wait until all earlier instructions have committed (looping at block 42), then commit the mispredicted branch and flush all newer instructions (block 52). As described above, this process may utilize the existing control logic for handling exceptions (the difference being that the branch instruction is committed rather than flushed).
Alternatively, the processor may simply flush all uncommitted instructions (block 46), including the mispredicted branch, and proceed as if an earlier uncommitted instruction were stalled (blocks 48, 50). This latter option (not shown in FIG. 3, but in which the "yes" path would be the only control flow exiting block 44) may trade some performance for simpler control logic: halting the committing of instructions whenever a stall is detected (block 44) simplifies synchronization between the task that commits instructions and the task that flushes uncommitted instructions. One of ordinary skill in the art will readily recognize that either option is possible, and that both yield correct results.
Conventional processor design practice is to complete execution of all instructions earlier than the instruction causing an exception, a mispredicted branch, or another event that forces a pipeline flush. According to an exemplary embodiment of the present invention, one or more instructions older than the mispredicted branch instruction are flushed from the pipeline and then re-fetched and executed. This can improve processor performance and power consumption by quickly terminating the fetching of instructions from the incorrect (mispredicted) address and by constructively using the latency of the pipeline hazard to correct the misprediction. When the time to resolve the pipeline hazard is equal to or greater than the time required to flush and re-fetch the stalled instruction, recovering from the misprediction causes no performance penalty at all.
Although the present invention has been described herein with respect to particular features, aspects and embodiments thereof, it will be apparent that numerous variations, modifications, and other embodiments are possible within the broad scope of the present invention, and accordingly, all variations, modifications and embodiments are to be regarded as being within the scope of the invention. The present embodiments are, therefore, to be considered in all respects as illustrative and not restrictive, and all changes coming within the meaning and equivalency range of the appended claims are intended to be embraced therein.
Claims (17)
1. A method of handling branch mispredictions in a pipelined processor, comprising:
detecting that a branch instruction is mispredicted; and
in response to detecting the misprediction, at least one instruction earlier than the branch instruction is flushed from the pipeline.
2. The method of claim 1 wherein the at least one instruction earlier than the branch instruction is not committed.
3. The method of claim 2, wherein the at least one uncommitted instruction is stalled in a pipeline.
4. The method of claim 1, further comprising:
correcting the branch prediction; and
the branch instruction is flushed from the pipeline.
5. The method of claim 4, further comprising fetching the branch instruction and all flushed instructions earlier than the branch instruction in program order.
6. The method of claim 1, wherein flushing at least one instruction earlier than the branch instruction from the pipeline comprises flushing all uncommitted instructions from the pipeline.
7. A processor, comprising:
at least one instruction execution pipeline;
a branch predictor that predicts evaluation of conditional branch instructions in the pipeline;
an instruction order manager that tracks the order of instructions in the pipeline; and
a pipeline controller to flush at least one instruction older than a branch instruction from the pipeline in response to detecting that the branch instruction was mispredicted.
8. The processor of claim 7, wherein the branch predictor cancels branch prediction in response to detecting that the branch instruction was mispredicted.
9. The processor of claim 7 wherein flushing at least one instruction earlier than a branch instruction from the pipeline comprises flushing all uncommitted instructions from the pipeline.
10. The processor of claim 7, further comprising: in response to detecting that the branch instruction was mispredicted, the branch instruction is flushed from the pipeline.
11. The processor of claim 7, further comprising: the branch instruction and all flushed instructions earlier than the branch instruction are fetched in program order.
12. A method of correcting branch mispredictions in a pipelined processor, comprising:
detecting that a branch instruction is mispredicted;
detecting a dependency of a first instruction earlier than the branch instruction on a long latency operation; and
all uncommitted instructions are flushed from the pipeline.
13. The method of claim 12, further comprising: correcting the branch misprediction.
14. The method of claim 13, further comprising: the branch instruction and all flushed instructions earlier than the branch instruction are fetched in program order.
15. A method of correcting branch mispredictions in a pipelined processor, comprising:
detecting that a branch instruction is mispredicted;
determining whether the branch instruction is the last uncommitted instruction in the pipeline;
if the branch instruction is the last uncommitted instruction in the pipeline, then the branch instruction is committed and all uncommitted instructions are flushed from the pipeline;
if the branch instruction is not the last uncommitted instruction in the pipeline, determining whether an instruction earlier than the branch instruction is stalled in the pipeline due to a long latency operation;
if an instruction earlier than the branch instruction is stalled in the pipeline due to a long latency operation, the branch instruction and all other uncommitted instructions are flushed from the pipeline.
16. The method of claim 15, further comprising: correcting the branch misprediction.
17. The method of claim 15, further comprising: the branch instruction and all flushed instructions earlier than the branch instruction are fetched in program order.
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| US11/061,981 | 2005-02-18 | | System and method of correcting a branch misprediction |
Publications (1)
| Publication Number | Publication Date |
|---|---|
| HK1112982A | 2008-09-19 |