[go: up one dir, main page]

WO1998006040A1 - Support architectural pour le chevauchement par logiciel de boucles imbriquees - Google Patents

Support architectural pour le chevauchement par logiciel de boucles imbriquees Download PDF

Info

Publication number
WO1998006040A1
WO1998006040A1 PCT/RU1996/000216 RU9600216W WO9806040A1 WO 1998006040 A1 WO1998006040 A1 WO 1998006040A1 RU 9600216 W RU9600216 W RU 9600216W WO 9806040 A1 WO9806040 A1 WO 9806040A1
Authority
WO
WIPO (PCT)
Prior art keywords
loop
control transfer
inner loop
register
code
Prior art date
Application number
PCT/RU1996/000216
Other languages
English (en)
Inventor
Boris Artashesovich Babayan
Fedor Anatolievich Gruzdov
Juli Khanaanovich Sakhin
Vladimir Sergeevich Volin
Vladimir Jurievich Volkonsky
Original Assignee
Sun Microsystems, Inc.
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sun Microsystems, Inc. filed Critical Sun Microsystems, Inc.
Priority to PCT/RU1996/000216 priority Critical patent/WO1998006040A1/fr
Priority to US08/733,479 priority patent/US5958048A/en
Publication of WO1998006040A1 publication Critical patent/WO1998006040A1/fr

Links

Classifications

    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F8/00Arrangements for software engineering
    • G06F8/40Transformation of program code
    • G06F8/41Compilation
    • G06F8/44Encoding
    • G06F8/445Exploiting fine grain parallelism, i.e. parallelism at instruction level
    • G06F8/4452Software pipelining
    • GPHYSICS
    • G06COMPUTING OR CALCULATING; COUNTING
    • G06FELECTRIC DIGITAL DATA PROCESSING
    • G06F9/00Arrangements for program control, e.g. control units
    • G06F9/06Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
    • G06F9/30Arrangements for executing machine instructions, e.g. instruction decode
    • G06F9/32Address formation of the next instruction, e.g. by incrementing the instruction counter
    • G06F9/322Address formation of the next instruction, e.g. by incrementing the instruction counter for non-sequential address
    • G06F9/325Address formation of the next instruction, e.g. by incrementing the instruction counter for non-sequential address for loops, e.g. loop detection or loop counter

Definitions

  • the present invention relates to processor architectures, and more particularly to processor architectures amenable to software pipelining techniques.
  • VLIW Very Long Instruction Word
  • processor architectures represent a design approach for exploiting instruction level parallelism in which the bulk of instruction scheduling and parallel dispatch is relegated to a compiler.
  • VLIW is a natural successor to RISC, furthering a trend toward moving complexity from the hardware to a compiler so as to enable simpler, faster processors. See Gwennap, VLIW: The Wave of the Future, Microprocessor Report, February 14, 1994, pp. 18-21 (discussing VLIW architectures).
  • a VLIW processor design eliminates the need for complex instruction scheduling logic on the chip by shifting scheduling responsibilities to the compiler.
  • Parallelism is determined explicitly by a compiler at a program code level. The compiler generates code which is suitable for parallel execution at multiple execution units and which allows for data and resource dependencies. When data dependencies are dynamic, they are explicitly mapped in the compiled code.
  • Such compiled code is organized as VLIW instructions encoding a number of independent operations that can be executed by execution units of the VLIW processor, in parallel.
  • a typical VLIW instruction horizontally encodes multiple operations as a single, very long instruction; hence the name. These operations can flow directly to associated functional units with a minimum of decoding.
  • NOPs are provided to functional units and a pure VLIW processor has no hardware scheduling interlocks, relying instead on the compiler for scheduling and dependency handling.
  • the Cydra 5 was an early VLIW implementation. See generally, Beck et al., The Cydra 5 Minisupercomputer: Architecture and Implementation, Journal of Siipercomputing, 7, 143-180 (1993) (providing a retrospective on the Cydra 5 architecture).
  • the Cydra 5 was capable of initiating one instruction every clock cycle at each of six functional units.
  • the Cydra 5 supported two instruction formats: MultiOp and UniOp, as shown in FIGURE 1.
  • a MultiOp instruction, e.g., MultiOp instruction 110 included seven "containers," one for each of six functional units (e.g., containers 111, 112, 113, 114, 115, and 116) and a seventh (e.g., container 117) for instruction sequencing control.
  • Each of the containers provided storage for an operation encoded in a format similar to that of a conventional load- store RISC instruction.
  • containers 111, 112, 113, 114, 115, and 116 included respective predicate specifiers 121, 122, 123, 124, 125, and 126.
  • the effectiveness of the MultiOp format was highly dependent on the program, and on how effectively the compiler could extract operations for parallel execution. In particular, performance and capacity of the instruction cache was adversely affected if there was little inherent parallelism to encode, i.e., if the containers of a MultiOp instruction encoded mainly null operations (or NOOPs).
  • the UniOp format as exemplified by UniOp instruction 180, was provided for such cases and included six containers 181, 182, 183, 184, 185, and 186 for encoding operations (six per UniOp instruction).
  • Cydra 5 architecture A major objective of the Cydra 5 architecture was to allow the overlapping of loop iterations without requiring multiple copies of the loop body or complex compensation code. See generally, Dehert et al., Overlapped Loop Support in the Cydra 5, Proc. 2nd Internat. Conf. on Architectural Support for Programming Languages and Operating Systems (Boston, Massachusetts, April 3-6, 1989), pp. 26-38.
  • the approach taken by the Cydra 5 was to execute a compiled, overlapped loop schedule of TL cycles, organized as Stage Count (SC) stages where:
  • the first stage of iteration 1 executed.
  • the first stage of iteration 2 and the second stage of iteration 1 executed, and so on until SC different iterations were executing in different stages.
  • the first SC-1 iterations of a loop i.e., when not all stages were yet executing, was collectively known as the prologue.
  • the opposite process occurred until the last stage of the last iteration was executed.
  • the Cydra 5 provided a single mechanism to deal with prologue and epilogue control and with conditional code in loop bodies.
  • the mechanism was based on a file of single-bit Iteration Control Registers (ICRs).
  • ICRs Iteration Control Registers
  • the Cydra 5 mechanism relied on a loop counter (LC) register, which kept track of the number of prologue and kernel iterations yet to be executed, an epilogue stage counter (ESC), and a BRanch to TOP of loop (brtop) operation which specified the first instruction of the loop body as its branch target address.
  • LC loop counter
  • ESC epilogue stage counter
  • brtop BRanch to TOP of loop
  • ICRO was set before loop entry and all other ICRs were cleared.
  • ICP Iteration Control Pointer
  • the brtop operation set a new logical ICRO until LC reached zero. Thereafter, the logical ICRO was cleared during each iteration until the ESC reached zero, indicating the end of the loop body.
  • the compiler made first stage operations conditional on ICRO, second stage operations conditional on ICR1 , etc. In this way, only first stage operations executed during the first iteration through the loop, only the first two stages executed during the second iteration, etc.
  • ICRO was set to zero and first stage operations no longer executed. On each successive iteration, one less stage executed until the ESC reached zero and the loop was complete.
  • a number of elements of the ICR file equal to the stage count (SC) were used to provide prologue/epilogue control.
  • Conditional code (including conditional code in the loop body) was handled similarly, using additional elements of the ICR file. See supra, Dehert et al., Overlapped Loop Support in the Cydra 5, for a more detailed description of Cydra 5 loop control, conditional execution, and brtop operation semantics.
  • an apparatus in one embodiment, includes first and second register complexes and multiway control transfer logic.
  • the first register complex is responsive to physical iterations of inner loop body code and the state of the first register complex advances toward a beginning of last iteration state in correspondence with the physical iterations of the inner loop body code.
  • the second register complex is responsive to physical iterations of inner loop body code and the state of the second register complex advances toward an end of last iteration state in correspondence with the physical iterations of the inner loop body code.
  • the multiway control transfer logic is coupled to the first and second register complexes and is selective for a control transfer address, wherein the multiway control transfer logic selects a first control transfer address in response to an asserted beginning of last iteration state and selects a second control transfer address in response to an asserted end of last iteration state.
  • a method of controlling execution of software pipelined inner loop body code includes the steps of initializing a first register complex with an indication corresponding to a number of logical iterations in the inner loop body code; initializing a second register complex with an indication corresponding to a number of overlapped logical iterations minus one (NOVL-1) in the inner loop body code; advancing the state of the first register complex toward a beginning of last iteration state in correspondence with physical iterations of the inner loop body code; and advancing the state of the second register complex toward an end of last iteration state in correspondence with physical iterations of the inner loop body code.
  • NOVL-1 overlapped logical iterations minus one
  • the method further includes the steps of when the first register complex reaches the beginning of last iteration state, transferring control, during a next physical iteration thereafter, to a start patch; and when the second register complex reaches the end of last iteration state, transferring control, during a next physical iteration thereafter, to a finish patch.
  • FIGURE 1 is a pictorial illustration of the VLIW instruction format of the Cydra 5 processor architecture.
  • FIGURE 2 is a system block diagram for a VLIW processor constructed in accordance with the teachings of the present invention.
  • FIGURE 3 is a pictorial illustration of loop scheduling of a simple inner loop in accordance with the teachings of the present invention.
  • FIGURE 4 is a pictorial illustration of the structure of nested loop code compiled for execution in accordance with the teachings of the present invention.
  • FIGURES 5 A and 5B are pictorial illustrations of nested loop scheduling in accordance with the teachings of the present invention.
  • FIGURE 6 is a block diagram of loop control logic constructed in accordance with the teachings of the present invention.
  • FIGURES 7A, 7B, and 7c are register structure diagrams for loop parameter and loop state storage in accordance with the teachings of the present invention.
  • FIGURE 8 is a pictorial illustration of control transfer logic constructed in accordance with the teachings of the present invention.
  • FIGURES 9A, 9B, and 9c are register structure diagrams for control transfer preparation and execution registers in accordance with the teachings of the present invention.
  • a processor in accordance with the present invention provides, for certain classes of loops, loop control for software pipelined outer loops as well as inner loops. Loop control in accordance with the present invention provides overlapped execution of respective epilogue and prologue periods of adjacent inner loops.
  • FIGURE 2 depicts the architecture of a Very Long Instruction Word (VLIW) processor 200 in accordance with an exemplary VLIW embodiment of the present invention.
  • VLIW processor 200 includes an instruction buffer (IB) 210, a control unit (CU) 220, a multiport register file (RF) 230, 4 arithmetic logic channels (ALCO 241, ALCl 242, ALC2 243, and ALC3 244), each of which includes multiple execution units (EUs) 248, array access channels (AACO, AACl, AAC2, and AAC3) 250, a memory management unit (MMU) 260, a memory access unit (MAU) 270, an array prefetch buffer (APB) 235, and a data cache (DCACHE) 280.
  • IB instruction buffer
  • CU control unit
  • RF multiport register file
  • ALC3 244 4 arithmetic logic channels
  • EUs execution units
  • AACl array access channels
  • AAC2 array prefetch buffer
  • VLIW processor 200 has a long instruction word architecture and exploits Instruction Level Parallelism (ILP) among operations of a long instruction word.
  • ILP Instruction Level Parallelism
  • a compiler is used to schedule operations to be executed by VLIW processor 200 during each cycle.
  • the design of VLIW processor 200 allows concurrent execution of multiple independent operations (e.g., load, store, add, multiply, divide, shift, logical, and branch operations) that make up a long instruction.
  • Long instructions are stored in a memory 211 and an instruction cache (IC) 282 of VLIW processor 200 in packed form.
  • Instruction buffer 210 fetches long instructions from memory 211, or from an included instruction cache (IC) 282 if cached.
  • instruction buffer 210 includes instruction cache (IC) 282, instruction alignment logic, a program counter register (PC) 216, and control transfer preparation registers (CTPRI 213, CTPR2 214, and CTPR3 215).
  • Instruction cache (IC) 282 is filled in response to both linear program path pre-fetches and control transfer preparation instructions.
  • Control Unit (CU) 220 issues wide instruction operations for execution and performs several tasks including:
  • Control unit (CU) 120 also executes Control Transfer OPerations (CTOPs) and includes storage, collectively shown as special registers 224, which includes:
  • loop parameter and status registers e.g., LPR, LSHRI, LSHR2, and LSHR3 used for loop control
  • base registers BP
  • Both the loop parameter and status registers and the base registers are software accessible for read and write.
  • a compiler overlaps portions of the loop code corresponding to several subsequent iterations of the loop. Operations from the several iterations are represented, or overlapped, in a single stage. Designs for compilers providing overlapped iteration code are well known to persons of ordinary skill the art. See e.g., Dehnert et al., Compiling for the Cydra 5, Journal ofSupercomputing, 7, 181-227 (1993). Such compilers implement variations on a technique known as software pipelining.
  • logical iterations of the initial loop code contrast with the physical iterations of a software pipelined loop.
  • Multiple logical iterations are overlapped in a given physical iteration.
  • NOVL physical iterations must be executed to complete a logical iteration. In other words, each logical iteration is executed in NOVL stages. If the initial loop code has NLI logical iterations, then the overlapped, pipelined loop should have NPI physical iterations where:
  • NPI * NLI + (NOVL-l).
  • FIGURE 3 depicts a loop schedule 300 including iterations of loop body code compiled for execution on VLIW processor 200.
  • Loop schedule 300 is illustrative of a simple loop, i.e., single level, unnested loop, and also illustrative of an inner loop, i.e., a innermost loop nested within one or more levels of outer loops.
  • Logical iterations e.g., first logical iteration 370 and second logical iteration 380
  • physical iterations e.g., first physical iteration 350 and second physical iteration 360
  • Five logical iterations are overlapped in each physical iteration and each logical iteration is executed in five stages.
  • stages of logical iterations 3, 4, 5,6, and 7 are executed.
  • a single physical iteration can require the evaluation of more than one long instruction word, i.e., "n" long instruction words evaluated in "n" cycles such as 316.1, 316.2, and 316.3.
  • n long instruction word
  • not every very long instruction required for a physical iteration will contribute an operation to the set of operations evaluated for a stage of a logical iteration, i.e., some cycles will not contribute an operation to some stages.
  • physical iterations of prologue 330 and epilogue 340 portions of the loop body do not include a full set of stages.
  • prologue portion 330 i.e., during the first NOVL-1 physical iterations of loop body 300
  • certain stages include garbage operations 310 which are associated with non-existent logical iterations.
  • garbage operations 320 are associated with other non-existent logical iterations.
  • garbage operations arise because each physical iteration of loop body 300 includes the same set of operations, encoded by the one or more VLIW instruction cycles which make up a physical iteration.
  • each physical iteration of loop body 300 includes the same set of operations, encoded by the one or more VLIW instruction cycles which make up a physical iteration.
  • the full set of operations encoded for a physical iteration of loop body code only one valid stage exists in the first physical iteration 350, only two valid stages exist in the second physical iteration 360, etc., until all five stages are valid in the initial physical iteration of kernel portion 390 (i.e., physical iteration NOVL).
  • Garbage operations 310 are the invalid operations.
  • Garbage operations 320 are similar, but result from increasing numbers of stages containing invalid operations during the epilogue portion 340 of loop body 300.
  • the prologue/epilogue control technique implemented by control logic 220 of VLIW processor 200 selectively enables and disables the execution of categories of operations.
  • the prologue/epilogue control technique is not a general solution for all simple or inner loop body code, the technique can be applied to a large class of loop programs.
  • the technique and its implementation lay a foundation for addtional architectural support for nested loops as described herein.
  • Prologue/epilogue control in accordance with the present invention requires that loop body code conform to two reasonable constraints on the structure of the pipelined logical iterations.
  • the constraints are as follows:
  • memory read operations e.g., loads
  • operations with side-effects e.g., memory write operations or stores, loop breaks, etc.
  • Suitable compiler techniques to provide loop body code in accordance with these constraints are well known to those of ordinary skill in the art and loop body code 300 is compiled using any such suitable techniques.
  • FIGURE 3 the restriction of memory read operations to memory read stages 312 and of operations having side-effects to side-effects stages 314 is illustrative of loop body code structured in accordance with the above constraints.
  • memory read operations associated with logical iteration 370 are constrained to the first stage 371 of the logical iteration.
  • side-effects operations associated with logical iteration 370 are constrained to the last stage 372 of the logical iteration.
  • FIGURE 4 illustrates the structure of nested loop code 400 compiled for execution on a processor such as VLIW processor 200.
  • Nested loop code 400 corresponds to source code, illustratively of the form shown below.
  • Nested loop code 400 (which is a compiled software pipelined representation of the illustrative nested loop source, above) includes initialization code 410, a start patch 420, inner loop body code 430, and a finish patch 440.
  • Initialization code 410 includes code for initializing loop control registers and control transfer preparation registers (i.e., CTPR2 214, and CTPR3 215).
  • Initialization code 410 also includes code from an upper portion of the outer loop and startup code for loading array address registers with initial values for the first iteration of the inner loop.
  • Inner loop body code 430 corresponds to the inner loop source code shown above, compiled in accordance with the previously described loop constraints, i.e., that memory read operations must be located in the first stage of a logical iteration and operations with side-effects must be located in the last stage of a logical iteration.
  • the upper and lower portions of the above source code are represented as compiled code in start patch 420 and finish patch 440, respectively. Nonetheless, persons of ordinary skill in the art will recognize that suitable compilers will typically perform data dependency analysis. Such compilers may identify and exploit optimizations which may result in an execution ordering of operations which differs from that in the source code. As a result, it is possible that some operations corresponding to statements in the upper portion of the above source code may be represented in a lower portion of the compiled outer loop body code. Similarly, some operations corresponding to statements in the lower portion of the source code may be represented in an upper portion of the compiled outer loop body code. The upper and lower portions of the compiled outer loop code are discussed below.
  • Start patch 420 includes code corresponding to an upper portion, if any, of the compiled outer loop and also includes code for initializing a next inner loop, i.e., loading loop control registers and array address registers with initial values for a next series of iterations through the inner loop.
  • Finish patch 440 includes code for saving the results of a completed inner loop and code corresponding to a lower portion, if any, of the compiled outer loop. If the inner loop simply calculates and stores elements of an array, there are no results for finish patch 440 to save.
  • Finish patch 440 also includes code for checking an outer loop counter and for loading a subset of the array address registers. If the outer loop counter is not exhausted, finish patch 440 code initializes a subset of the array address registers dedicated to storing results to memory from within the next inner loop (i.e., from within the inner loop included within the next iteration of the outer loop).
  • array address registers some supply address values for inner loop read and write operations, while others supply address values for outer loop operations. Those supplying address values for inner loop operations must be reloaded between the last and first logical iterations of adjacent inner loops.
  • Array address registers supplying read addresses for inner loop operations are ⁇ eloaded by start patch 420, while those supplying write addresses for inner loop operationsare reloaded in finish patch 440.
  • Outer loop status must be separately maintained for upper and lower portions of the outer loop.
  • an outer loop counter variable must be duplicated: one for successive instances of start patch 420 and one for successive instances of finish patch 440.
  • Control transfers in nested loop code 400 compiled for execution on VLIW processor 200 are also illustrated in FIGURE 4.
  • control is transfered to inner loop body code 430 (control transfer 471), which is identified using the contents of CTPR3 450.
  • control is transferred to finish patch 440 (control transfer 477), to start patch 420 (control transfer 475), or back to inner loop body code 430 (control transfer 474), depending on loop state.
  • Start patch 420 code is identified using the contents of CTPR2 460.
  • control is transfered to inner loop body code 430 (control transfer 472).
  • control is transferred to subsequent code (control transfer 478), to start patch 420 (control transfer 476), or back to inner loop body code 430 (control transfer 473), depending on loop state.
  • Nested loop schedule 500 corresponds to the above source code having a six (6) iteration inner loop within a three (3) iteration outer loop. Three passes through the outer loop are shown in FIGURE 5A as overlapped portions 520, 530, and 540 of loop schedule 500. Note that, as will be further illustrated below, the execution of operations associated with successive passes through the outer loop, including inner loop operations associated with successive passes through the outer loop, are overlapped.
  • Nested loop schedule 500 begins with initialization code 410, which in an embodiment for execution on VLIW processor 200 includes one or more long instruction words encoding one or more initialization operations. Nested loop schedule 500 continues with physical iterations of inner loop body code. In the loop schedule of FIGURE 5A, six (6) physical iterations of inner loop body code follow initialization code 410. Successive, 5-stage logical iterations of the inner loop begin during each of the first six physical iterations. As with the simple inner loop schedule of FIGURE 3, the first NOVL-1 physical iterations make up a prologue period 510 of the outer loop and include garbage operations.
  • start patch instance 420a initiates the second pass through the outer loop, including operations from an upper portion, if any, of the compiled outer loop and including operations for initializing a next inner loop, i.e., loading loop control registers and array address registers with initial values for a next series of iterations through the inner loop.
  • Implicit in the loop schedule of FIGURE 5A is a control transfer 475 from loop body code 430 to start patch instance 420a and a return control transfer 472 to loop body code 430.
  • Physical iterations of inner loop body 430 continue at physical iteration 552.
  • physical iteration 552 includes the remaining stages of inner loop logical iterations 3, 4, 5, and 6 from the first outer loop iteration (i.e., stages of logical iterations 1.3, 1.4, 1.5, and 1.6 executed as part of physical iteration 552) and includes the first stage of the first logical iteration of the inner loop from the second outer loop iteration (i.e., the first stage of logical iteration 2.1).
  • finish patch instance 440a saves recurrent variable instances for the six completed iterations the inner loop, executes code corresponding to a lower portion, if any, of the compiled outer loop, and checks an outer loop counter. Since the outer loop counter is yet not exhausted (completion of two more passes through outer loop is required), finish patch instance 440a loads the subset of the array address registers with values for writing results of next inner loop (i.e., of the inner loop included within the next iteration of the outer loop).
  • a typical data structure for processing in a nested loop is a 2-dimensional matrix, A[n,m].
  • A[n,m] a 2-dimensional matrix
  • finish patch 440 reloads the array address register with the address of the A[i,0] element.
  • Implicit in the loop schedule of FIGURE 5 A is a control transfer 477 (fall through) from loop body code 430 to finish patch instance 440a and a return control transfer 473 to loop body code 430.
  • Physical iterations of inner loop body 430 continue at physical iteration 560 with the remaining stages of inner loop logical iterations 1, 2, 3, 4, and 5 from the second pass through the outer loop (i.e., continue with stages of logical iterations 2.1, 2.2, 2.3, 2.4, and 2.5 executed as part of physical iteration 560).
  • loop schedule 500 includes control transfers to and from start patch 420 after physical iteration 555 (control transfer to/from start patch instance 420b) and after physical iteration 557 (control transfer to/from start patch instance 420c).
  • loop schedule 500 includes control transfers to and from finish patch 440 after physical iteration 556 (control transfer to/from finish patch instance 440b) and after physical iteration 561 (control transfer to/from finish patch instance 440c).
  • start patch operations are unnecessary for start patch instance 420c.
  • the outer loop "upper portion” operations, array address register loads, and certain loop control register updates can be skipped.
  • there is no harm if, for example, reloading of array address registers is skipped because there is no next inner loop to use them.
  • Alternative embodiments may skip or execute the outer loop "upper portion” operations, array address register loads, and most loop control register updates as desired for code efficiency. In the exemplary loop control register embodiment described herein, only the reloading of an inner loop counter register in start patch instance 420c is constrained.
  • the compiler ensures that the zero value is provided.
  • the compiler may include a condition in start patch 420 such that start patch instance 420c skips reloading of current loop counter field (clc) 645, or alternatively, the compiler may include a code in start patch 420 reload the zero value in start patch instance 420c, etc.
  • epilogue period 550 is the epilogue period 550 of the outer loop.
  • epilogue period 550 includes garbage operations.
  • prologue period 510 and epilogue period 550 are amortized across a total of 18 inner loop logical iterations (i.e., 3 outer loop passes x 6 inner loop iterations) together with upper and lower portions of each pass through the outer loop, all of which are software pipelined.
  • FIGURE 5B depicts a loop schedule 500a which is analogous to that shown in FIGURE 5A.
  • loop schedule 500a corresponds to loop code compiled from loop source code such as that described above, but for which IN ER_LOOP_LIMIT is equal to 2.
  • the overlapping of inner and outer loop code is more dramatically shown in FIGURE 5B than previously in FIGURE 5A. For example,
  • VLIW processor 200 introduces several architectural features for controlling the execution of nested loop code in accordance with the loop schedules depicted in FIGURES 5 A and 5B.
  • VLIW processor 200 includes loop control registers and logic for controlling prologue and epilogue portions of loop code and also includes control transfer preparation registers and logic for implementing multiway control transfers, each of which are described below. Operations in the stream of long instructions initialize and modify loop control registers and trigger control transfers.
  • FIGURE 6 depicts loop control logic 600 of VLIW processor 200 which provide prologue and epilogue control.
  • Loop control logic 600 is coupled to receive values for loop control variables from instruction decoder 623. These values are used to initialize fields of various loop parameters and loop control registers (collectively shown as loop parameter and status registers 640). In particular, these values initialize an epilogue counter field (ecnt) 641, a shift register (sh) 647, a side-effects enabled flag (seen) 648, a current loop counter field (clc) 645, a loop mode flag (lm) 644, and side-effects manual control (seme) and loads manual control (l d mc) flags (642 and 646).
  • ecnt epilogue counter field
  • sh shift register
  • clc current loop counter field
  • lm loop mode flag
  • l d mc load manual control
  • Side-effects enabling logic 610 and load enabling logic 620 respectively issue the side-effects enabled predicate (ls_se_enbl) and the loads enabled predicate (ls_ld_enbl) to respective subsets of execution units illustratively grouped as 630.
  • STU 0 633 through STU m 634 are illustrative of a first group of execution units 248 which implement operations with side-effects and which are distributed among ALCl 242 and ALC3 244 as described above with reference to FIGURE 2.
  • STU 0 633 through STU m 634 are also illustrative of array access channels (AACl and AAC3) 250.
  • STU 0 633 through STU m 634 are each responsive to the ls_se_enbl predicate, enabling side-effects operations when ls_se_enbl is asserted and disabling side-effects operations when ls_se_enbl is de-asserted.
  • LDU Q 635 through LDU n 636 are similarly illustrative of a second group of execution units 248 which implement load operations and which are distributed among ALCl 242 and ALC3 244 as described above with reference to FIGURE 2.
  • LDU 0 635 through LDU n 636 are also illustrative of array access channels (AACO, AACl, AAC2, and AAC3) 250.
  • LDU 0 635 through LDU n 636 are each responsive to the ls_ld_enbl predicate, enabling load operations when ls_ld_enbl is asserted and disabling side-effects operations when ls_ld_enbl is de-asserted.
  • Array access channels (AACO, AACl, AAC2, and AAC3) 250 are also described in the co- pending patent application entitled "Array Prefetch Algorithm," serial no. XX/xxx,xxx ⁇ atty. docket no.: M-3793 PCT>, naming Babaian et al. as inventors and filed on even date herewith, the detailed description of which is incorporated herein by reference.
  • ALU 0 631 through ALU k 632 are illustrative of a third group of execution units which implement arithmetic and logic operations (i.e., non-load and non-side-effects operations) and which are distributed among ALCO 241, ALCl 242, ALC2 243, and ALC3 244 as described above with reference to FIGURE 2.
  • the operation of ALU 0 631 through ALU k 632 is unaffected by the state of either the ls_se_enbl predicate or the ls_ld_enbl predicate.
  • load enabling logic 620 implements:
  • side-effects enabling logic 610 and load enabling logic 620 may be implemented in positive or negative logic, using AND, OR, NAND, or NOR gates, etc. Suitable transformations of the respective logic equations will be appreciated by those of ordinary skill in the art. Additionally, the initialization and transition sequencing of register fields may be alternately defined with suitable modifications to the logic equations. Similarly, many suitable designs for comparing register values to trigger values will be appreciated by those of ordinary skill in the art. Side-effects enabling logic 610 and load enabling logic 620 are of any such suitable designs.
  • FIGURES 7A, 7B, and 7c depict exemplary embodiments of loop parameter and status registers 640 for maintaining loop state information to control nested loop execution. Field values from loop parameter and status registers 640 are used by loop control logic 600 and by multi-way control transfer logic 800 (described below).
  • FIGURE 7A depicts an illustrative organization of Loop Parameters Register (LPR) 710 which resides in control unit 220. Loop Parameters Register (LPR) 710 provides storage for both statically known and dynamically calculated loop attributes. Loop Parameters Register (LPR) 710 includes the side effects manual control (seme) and loop loads manual control (ldmc) fields (642 and 646) described above for selectively enabling and disabling the prologue and epilogue control features described above.
  • Loop Parameters Register (LPR) 710 also includes storage for a decrement loop counter flag (die) 712, a number of overlapped logical iterations minus one field (novl) 711 and an initial loop counter value field (lc) 713.
  • FIGURE 7B depicts an illustrative organization of Loop State Register 1 (LSRI) 720 which also resides in control unit 220.
  • Loop State Register 1 (LSRI) 720 includes storage for the loop mode flag (lm) 644, the epilogue counter field (ecnt) 641, the side-effects enabled flag (seen) 648, and the current loop counter field (clc) 645 described above.
  • the loop mode flag (lm) 644 is set for executing inner loop body code and cleared on inner loop body exit.
  • epilogue counter field (ecnt) 641 counts down to zero (0) from a value equal to that stored in novl 711.
  • Side-effects enabled flag (seen) 648 acts as a sticky bit representation of bit zero (sh [o] ) of shift register, sh 647 which is described below.
  • Current loop counter field (clc) 645 is decremented during each non-epilogue physical iteration of loop schedule 500 and reloaded with a value from initial loop counter value field (lc) 713 at the beginning of set of iterations through the inner loop body code.
  • Deferred control transfer flags dcto 714, dcti 715, and dc 2 716 save control transfer conditions until transfer has been implemented. For example, the transfer along a lower priority branch is deferred if a branch having a higher priority is also triggered.
  • FIGURE 7c depicts an illustrative implementation of shift register, sh 647, which includes two concatenated registers, a 64-bit Loop State Register 2 (LSR2) 732 and Loop State Register 3 (LSR3) 731.
  • LSR2 732 and LSR3 731 respectively provide storage for the lower and upper parts of the shift register, sh 647.
  • Shift register, sh 647 marks the last physical iterations of multiple overlapped inner loop bodies (i.e., shift register, sh 647, marks, for each overlapped pass through the outer loop, the final physical iteration of the last logical iteration of inner loop body code).
  • shift register, sh 647 are set during loop initialization code 410 and start patches (e.g., start patches instances 420a, 420b, and 420c) and shift right in correspondence with successive physical iterations of inner loop body code.
  • start patches e.g., start patches instances 420a, 420b, and 420c
  • shift register, sh 647 indicates the last physical iteration of the last logical iteration in a corresponding pass though the outer loop.
  • the aggregate length of the sh 647 depends on the size (MAXAPB) of an array prefetch buffer, and is equal to
  • bits 0 to 63 of sh 647 are represented in LSR2 732, while bits 64 to (32 + AXAPB/2) are represented in LSR3 731.
  • LSR3 731 and LSR2 732 may be of alternate lengths.
  • LSR3 731 may be omitted.
  • the array prefetch buffer is described in greater detail in the co-pending patent application entitled "Array Prefetch Algorithm," serial no. XX/xxx,xxx ⁇ atty. docket no.: M-3793 PCT>, naming Babaian et al. as inventors and filed on even date herewith, the detailed description of which is incorporated herein by reference.
  • FIGURE 8 depicts an exemplary embodiment of multi-way control transfer logic 800.
  • Multi-way control transfer logic 800 includes multiplexer 810 which supplies a next address selected from an incremented next instruction address, a start patch address, and a loop body address.
  • the next instruction address is supplied to multiplexer 810 by adder 830 based on a program counter value supplied from Program Counter register (PC) register 832 and an instruction length supplied from instruction decoder 623.
  • PC Program Counter register
  • the start patch and loop body addresses are respectively supplied from Control Transfer Preparation Registers (CTPR2a 840 and CTPR3a 850).
  • CTR2a 840 and CTPR3a 850 Control Transfer Preparation Registers
  • Branch target coder 820 provides multiplexer 810 with an address selection signal based on an outer loop exit predicate (ls_exit) represented in predicate file 231, based on a last iteration begin predicate (ls_lst_itr_bgn), and based on a last iteration end predicate (ls_lst_itr_end). Branch target coder 820 also receives Control Transfer Operations (CTOPs) from instruction decoder 623.
  • COPs Control Transfer Operations
  • Last iteration begin and end loop predicates are respectively supplied by last iteration begin logic 870 and last iteration end logic 860 based on values stored in fields of various of loop parameter and loop status registers, which are collectively shown as loop parameter and status registers 640.
  • last iteration begin logic 870 compares current loop counter field (clc) 645 to the value one (1), supplying a true predicate if current loop counter field (clc) 645 indicates that the current physical iteration begins the last logical iteration.
  • the ls_lst_itr_bgn predicate is used by branch target coder 820 to identify points in a nested loop schedule for transfering control to start patch 420 (i.e., in the context of loop schedule 500, for transfering control to start patch instances 420a, 420b, and 420c).
  • branch target coder 820 Upon receiving an appropriate Control Transfer OP (CTOP) from instruction decoder 623 and a true ls_lst_itr_bgn predicate from last iteration begin logic 870, branch target coder 820 supplies a address selection signal selective for the start patch address stored in Control Transfer Preparation Register (CTPR2a) 840.
  • CTOP Control Transfer OP
  • values of current loop counter field (clc) 645 from PR . lc to one (1) indicate valid logical iterations.
  • the zero (0) value indicates the epilogue period.
  • Alternate encodings for identifying the beginning of the last logical iteration of an inner loop body will be appreciated by those of ordinary skill in the art.
  • a shift register configuration, a count up (rather than count down) configuration, alternate counter base points, etc. are all suitable alternatives.
  • Suitable corresponding design modifications to the predicate implemented by last iteration begin logic 870 will also be appreciated by those of ordinary skill in the art.
  • Last iteration begin logic 870 is of any such suitable design.
  • last iteration end logic 860 supplies the last iteration end predicate (ls_lst_itr_end) if both the side-effects enabled flag (seen) 648 and bit one
  • sh 647 are set.
  • Side-effects enabled flag (seen) 648 is set at the end of prologue period 510 and marks the non-prologue portion of loop schedule 500.
  • the last inner loop body code physical iterations associated with each of multiple overlapped outer loop passes are encoded by set bits of shift register, sh 647.
  • the ls_lst_i r_end predicate is used by branch target coder 820 to identify points in an overlapped loop schedule for transfering control to finish patch 440 (i.e., in the context of loop schedule 500, for transfering control to finish patch instances 440a, 440b, and 440c).
  • CTOP Control Transfer OP
  • branch target coder 820 Upon receiving an appropriate Control Transfer OP (CTOP) from instruction decoder 623 and a true ls_lst_itr_end predicate from last iteration end logic 860, branch target coder 820 supplies a address selection signal selective for the finish patch address stored in Control Transfer Preparation Register (CTPR3a) 850.
  • CTPR3a Control Transfer Preparation Register
  • Last iteration end logic 860 is of any such suitable design.
  • branch target coder 820 supplies an address selection signal based on the particular Control Transfer OPeration (CTOP) received from instruction decoder 623 and based on the states of loop predicates such as ls_exit, ls_lst_itr_bgn, and ls_lst_itr_end.
  • CTOP Control Transfer OPeration
  • CTOP Control Transfer OP
  • branch target coder 820 selects one of four program paths ⁇ see colum 3) as indicated by the address from the corresponding Control Transfer Preparation Register (CTPR/ ' a).
  • Loop control transfer semantics encode a "fall through,” i.e., continuation with the next long instruction (according to the program counter value from Program Counter register (PC) register 832) as path 0, i.e., the address associated with "CTPROa.”
  • Loop control condition expressions are prioritized from 0 to 3, with path 0 (fall through) having the highest priority.
  • branch target coder 820 provides multiplexer 810 with an address selection signal selective for the next instruction address input from adder 830 if condition expression 0 evalutes to true.
  • branch target coder 820 provides multiplexer 810 with a address selection signal selective for the the start patch address from Control Transfer Preparation Register (CTPR2a) 840 if condition expressions 0 and 1 evaluate to false and condition expression 2 evaluates to true.
  • branch target coder 820 provides multiplexer 810 with a address selection signal selective for the the loop body address from Control Transfer Preparation Register (CTPR3a) 850 if condition expressions 0, 1, and 2 evaluate to false and condition expression 3 evaluates to true.
  • CPR2a Control Transfer Preparation Register
  • CPR3a Control Transfer Preparation Register
  • Condition expression 1 and Control Transfer Preparation Register CTPRla can be configured to control transfers to a middle patch for supporting nested loops with a prefetch buffer.
  • a middle patch appears in a nested loop structure when an array prefetch algorithm (APA) is used.
  • array prefetch logic inserts additional stages into inner loop logical iterations.
  • a logical iteration is divided into two portions.
  • a dynamic portion of the divided logical iteration includes a first stage with loads (and only loads) together with additional "waiting for memory” stages.
  • a static portion of the divided logical iteration (enabled by the ls_stat_enbl predicate) includes all statically compiled stages (with all operations but loads).
  • a start patch such as start patch 420:
  • condition expression 0 for Control Transfer OPs (CTOPs) 1, 2, and 3 (i.e., ls_lst_itr_end
  • the ls_break term provides for inner loop exit on a condition from predicate file and the (Is jprlg && LPR .
  • ext term provides for handling of inner loop body code with an extension fragment for handling nested loops with vector invariants (where ls_prlg indicates the prologue period of each inner loop and ext is a flag in loop parameters register (LPR) 710).
  • a nested loop may include a variable which is constant in the inner loop but which is modified in the outer loop, so that each next inner loop must see a new value. During overlapped epilogue and prologue periods of adjacent inner loops, these adjacent loops must see different values for the same variable.
  • Vectorizing the storage register for the inner loop invariant allows a particular vector element to correspond to each successive adjacent inner loop.
  • a start patch such as start patch 420
  • the vector element corresponding to the first stage of inner loop body code 430 is updated with the inner loop invariant.
  • Inner loop body code such as inner loop body code 430, can include operations to copy the inner loop invariant to the next element stage by stage.
  • VLIW processor 200 allows for an extension fragment which collects inner loop invariant servicing operations.
  • the extension fragment is appended to inner loop body code 430 and is executed only when necessary, i.e., as the prologue of a next in turn inner loop, on a fall through control transfer on the condition (ls_jprlg && LPR . ext).
  • a second rotatable area in predicate file 231 is included in VLIW processor 200 to support vectorized storage for the inner loop invariant independent of other vectorized loop variables.
  • the ls_ldovl_limit predicate in condition expression 0 for Control Transfer OPs (CTOPs) 1, 2, and 3 ⁇ see TABLE 1) is also not closely related to the implementation of loop schedule 500. Instead, the ls_ldovl_limit predicate provides support for branches in response to a maximum load overlap condition for array prefetch operations.
  • Array prefetch operations are described in greater detail in the co-pending patent application entitled "Array Prefetch Algorithm," serial no. XX/xxx,xxx ⁇ atty. docket no.: M-3793 PCT>, naming Babaian et al. as inventors and filed on even date herewith, the detailed description of which is incorporated herein by reference.
  • CTOP computed control transfer addresses
  • the first step includes conditions calculation and encoding of a control transfer, if any, to take.
  • the possible control transfers are described in Control Transfer Preparation Registers (CTPRS) which contain control transfer attributes.
  • CPRS Control Transfer Preparation Registers
  • the particular CPTR transfer to take is encoded in a control transfer execution register (CTER) 930 ⁇ see FIGURE 9) of branch target coder 820.
  • This first step is executed along with the wide instruction that contains the control transfer operation and results in a CTPR number encoded in control transfer execution register (CTER) 930.
  • the second step the actual control transfer as indicated by CTER 930 encoding — is executed with the next long instruction.
  • control transfer logic 800 may provide single step branches, in which case control transfer execution register (CTER) 930 may be eliminated and branch target coder 820 selections may be coupled directly to multiplexer 810. Both delayed and non-delayed branch implementations are suitable.
  • CTER control transfer execution register
  • CTPRia provides storage for the virtual address target of an associated control transfer
  • CTPRie provides storage for control transfer attributes, including encodings for conditional control transfers based on Boolean Predicate Register (BPR) values and for unconditional control transfers.
  • CONDFN fields i.e., CONDFN .
  • CTPRc 920 respectively encode the sense of the predicate (i.e., which binary value is the true condition), the type of predicate addressing (as a address or as a modulo displacement in the predicate file), and the address/displaceent of the predicate.
  • the exemplary embodiment of CTPR and CTER registers described herein is generally applicable to both scalar and loop control transfers, the description focuses on loop control transfers, i.e., on the role of CTPR and CTER registers as architectural support for software pipelined nested loops.
  • the remaining fields of CTPRc 920 are primarily associated with scalar control transfers.
  • Outer loop exit predicate supports outer loop termination by a condition encoded in predicate file 231, wherein:
  • pred2 xor CTPR2c. CONDFN.neg a CTPR2c .
  • TYPE of 1 indicates loop control transfer mode and wherein pred2 represents a logic predicate accessed from predicate file (BPR) 231 in the following way:
  • the stored logic predicate corresponding to pred2 is computed and written to predicate file 231 by operations in the long instruction stream. Suitable predicate file designs and access method are well known in the art.
  • predicate file 231 is accessed using information stored in the fields of Control Transfer Preparation Register (cTPR2c) 852.
  • cTPR2c Control Transfer Preparation Register
  • Control Transfer Preparation Register Number field (CTPR#) 931 of Control Transfer Execution Register (CTER) 930 encodes the direction of program execution as determined in the first step (conditions calculation) in accordance with Table 1.
  • the second step, control transfer itself, is performed along with the next long instruction. If CTER . CTPR# is non-zero, branch target coder 820 supplies multiplexer 810 with an address selection signal selective for an address from the corresponding Control Transfer Preparation Register, i.e., CTPRla (not shown), CTPR2a 840, or CTPR3a 850.
  • a zero value of CTER. CTPR# is selective for the next instruction address from adder 830.
  • nested loop code 400 for execution on VLIW processor 200 includes Control Transfer Preparation (CTP) operations, operations encoding modifications and updates to loop parameter and status registers 640, and operations encoding multiway Control Transfer OPs (CTOPs).
  • initialization code 410 includes Control Transfer Preparation (CTP) operations which initialize Control Transfer Preparation Registers CTPR2a 840 and CTPR3a 850 with addresses of start patch 420 and inner loop body code 430, respectively.
  • CTP operations trigger cache fills of control transfer target code to instruction cache (IC) 282.
  • CTP operations are performed once, i.e., in initialization code 410, for multiple Control Transfer OPs (CTOPs) in initialization code 410, start patch 420, inner loop body code 430, and finish patch 440 portions of nested loop code 400.
  • CTP operations encode the target and nature of an associated control transfer
  • Control Transfer OPs encode the control transfer itself.
  • CTPs can be encoded limit impact on the size of inner loop body code 430.
  • a Control Transfer OP is encoded in 4 bits.
  • Initialization code 410 also initializes loop parameter and status registers 640 with a register write operation to an aliased loop register (LR).
  • Loop register (LR) is an alias for a variety of underlying fields of loop parameters register (LPR) 710, loop state register 1 (LSRI) 720, and shift register, sh 647.
  • a register write of rw_data to LR provides a single operation (OP) method for initializing loop parameter and status registers 640.
  • OPs methods for initializing loop parameter and status registers 640 are also suitable, though less efficient of long instruction word space and execution time.
  • LSR2.sh 1 ⁇ (rw data. novl + rw data.nxi) ; where the (rw_data . novl + rw_data . nxi) term in the initialization expressions is selective for a particular bit of shift register, sh 647.
  • the nxi term provides support for auxiliary logical iterations for reduction of common subexpressions, maintaining of recurrent dependencies, etc. using extension code interposed between inner loop body code 430 and finish patch 440.
  • initialization code 410 includes a Control Transfer OP (CTOP) encoding control transfer 471.
  • CTOP Control Transfer OP
  • Deferred control transfer flags dcto 714, dcti 715, and dct2 716 are necessarily unset in accordance with the above register write operation to LR. The higher priority conditions evaluate to false and control transfers associated with the deferred control transfer flags are not taken.
  • Inner loop body code 430 includes a decrement loop register (DLR) operation, which like the register write to loop register (LR) operation described above operates on a variety of underlying fields of loop parameters register (LPR) 710, loop state register I (LSRI) 720, and shift register, sh 647.
  • the decrement loop register (DLR) operation updates a base register (BR) for addressing into predicate file 231.
  • the decrement loop register (DLR) operation performs the following actions: if (LPR. die &-- ! (ls_eplg
  • ls_ldovl_limit) ) LSRI. Clc LSRI.
  • bpcur : (BR . bpcur - 1 ) mod (BR . bpsz + 1 ) ⁇
  • current loop counter field (clc) 645 is decremented during non-epilogue physical iterations of inner loop body code 430
  • epilogue counter field (ecnt) 641 is set to the value stored in novl 711 at the beginning of the last logical iteration and thereafter decremented
  • shift register, sh 647 is right shifted during each physical iteration of inner loop body code 430.
  • the ls_ldovl_limit and ls_stat_enbl predicates are not closely related to the implementation of nested loop control, but rather, relate to an implementation of array prefetch operations.
  • Array prefetch operations are described in greater detail in the co- pending patent application entitled "Array Prefetch Algorithm," serial no. XX/xxx,xxx ⁇ atty. docket no.: M-3793 PCT>, naming Babaian et al. as inventors and filed on even date herewith, the detailed description of which is inco ⁇ orated herein by reference.
  • the decrement loop register (DLR) operation should be encoded in the last long instruction word of inner loop body code 430.
  • inner loop body code 430 includes a Control Transfer OP (CTOP) encoding control transfers 474, 475, and 477, i.e., encoding a three-way control transfer.
  • CTOP Control Transfer OP
  • Finish patch 440 includes a Control Transfer OP (CTOP) encoding control transfers 473, 476, and 478, i.e., encoding a three-way control transfer.
  • CTOP Control Transfer OP
  • Control Transfer OPs are implemented as two-step, delayed branches. Control transfer conditions calculation and encoding of a CTPR number, is included in the first step, which in the exemplary VLIW processor 200 embodiment described herein is encoded in a first long instruction. The actual control transfer switch (CTS) is then performed in the next long instruction, i.e., in a delay slot.
  • CTS control transfer switch
  • each of the above Control Transfer OPs are encoded in a second-to-last long instruction word of a respective portion of nested loop code 400, i.e., in the second-to-last long instruction words of initialization code 410, of start patch 420, of inner loop body code 430, or of finish patch 440.
  • CTOPs Control Transfer OPs
  • alternative embodiments which implement in Control Transfer OPs (CTOPs) in a single step are also suitable and, in such embodiments, CTOPs are encoded in the last long instruction of respective portions of nested loop code 400.
  • modifications to certain fields of loop state register 1 (LSRI) 720 are implicit in the control transfer switch (CTS) step of Control Transfer OP (CTOP) evaluation, i.e., certain loop state modifications are automatically triggered by an actual control transfer.
  • CTS control transfer switch
  • CTOP Control Transfer OP
  • loop mode flag (lm) 644 and side-effects enabled flag (seen) 648 are modified as described below.
  • CTPR# control transfer preparation register number
  • VLIW Very Long Instruction Word
  • a compiler is used to schedule operations to be executed by VLIW processor 200 during each cycle.
  • the design of VLIW processor 200 allows concurrent execution of multiple independent operations (e.g., load, store, add, multiply, divide, shift, logical, and branch operations) that make up a long instruction.
  • Long instructions are stored in a memory 211 and an instruction cache (IC) 282 of VLIW processor 200 in packed form as sets of 16- and 32-bit syllables.
  • Operation execution time at execution units 248 is one cycle for integer and logic operations, two cycles for floating point addition, three or four cycles for floating point multiplication, seven cycles for word format division, and ten to eleven cycles for two-word format, normalized operands. All operations except division can be executed in every cycle; division can be run every other cycle.
  • VLIW processor 200 includes an instruction buffer (IB) 210, a control unit (CU) 220, a multiport register file (RF) 230, 4 arithmetic logic channels (ALCO 241, ALCl 242, ALC2 243, and ALC3 244), each of which includes multiple execution units 248, array access channels (AACO, AAC l, AAC2, and AAC3) 250, a memory management unit (MMU) 260, a memory access unit (MAU) 270, an array prefetch buffer (APB) 235, and a data cache (DCACHE) 280.
  • IB instruction buffer
  • CU control unit
  • RF multiport register file
  • ALC3 244 4 arithmetic logic channels
  • ALCl 242, ALC2 243 4 arithmetic logic channels
  • ALC3 244 each of which includes multiple execution units 248, array access channels (AACO, AAC l, AAC2, and AAC3) 250
  • MMU memory management unit
  • MAU memory access unit
  • Instruction buffer 210 fetches long instructions from memory 211, or from an included instruction cache (IC) 282 if cached.
  • instruction buffer 210 includes instruction cache (IC) 282, instruction alignment logic, a program counter register (PC) 216, and control transfer preparation registers (CTPRl 213, CTPR2 214, and CTPR3 215).
  • Instruction cache (IC) 282 is filled in response to both linear program path pre-fetches and control transfer preparation operations.
  • Control unit (CU) 220 issues operations from a long instruction to execution units (EUs)
  • control unit (CU) 220 controls control unit (CU) 220:
  • predicate file PF 231 as condition codes for Control Transfer OPerations (CTOPs);
  • ACO 241, ALCl 242, ALC2 243, and ALC3 244 issues literal values to arithmetic logic channels (ALCO 241, ALCl 242, ALC2 243, and ALC3 244) and array access channels (AACO, AACl, AAC2, and AAC3) 250;
  • ACO 241, ALCl 242, ALC2 243, and ALC3 244) issues up to four operations to array access channels (AACO, AACl, AAC2, and AAC3)
  • Control unit (CU) 120 also executes Control Transfer OPerations (CTOPs) and includes an instruction register (IR) 221, an unpacked instruction register, scattering logic, and special registers 224.
  • CTOPs Control Transfer OPerations
  • IR instruction register
  • IR unpacked instruction register
  • scattering logic scattering logic
  • special registers 224 include:
  • loop parameter and status registers 640 (e.g., LPR, LSHRi, LSHR2, and LSHR3) used for loop control and
  • Both the loop parameter and status registers and the base registers are software accessible for read and write.
  • the design and operation of instruction register 221, the unpacked instruction register, and scattering logic are described in greater detail in a co-pending patent application entitled "Wide Instruction Unpack,” serial no. XX/xxx,xxx ⁇ atty. docket no.: M-3795 PCT>, naming Sakhin et al. as inventors and filed on even date herewith, the detailed description of which is inco ⁇ orated herein by reference.
  • Predicate file (PF) 231 includes storage for predicate values generated by integer and floating point compare operations. Predicate values are used to control the conditional (or predicated) execution of operations.
  • predicate file (PF) 231 includes 32 two-bit registers.
  • Calculate condition unit (CCU) 233 generates a mask for conditional execution of operations at execution units 248 of arithmetic logic channels (ALCO 241, ALCl 242, ALC2 243, and ALC3 244) and for operations in array access channels (AACO, AAC 1 , AAC2, and AAC3) 250.
  • register file 230 includes 18-port memory that enables each of 4 execution units to read 2 arguments (or 3 arguments in the case of store operations), to write 4 results (one from each ALU) and to write 4 values read from memory in each cycle.
  • Register file 230 includes 256 66-bit registers, that are accessed with 4 bases (CWP, CWPAR, BRl, BR2) defined in special registers 224 of control unit 220. Each base allows the addressing of up to 64 registers from register file 230.
  • VLIW processor 200 provides register addressing which is relative to a base register.
  • the loop base registers, BRl and BR2 allow decrementing and cycling to provide a rotating set of physical registers from register file 230 to represent vector elements in software pipelined inner loops. In this way, a compiler can allocate a consecutive set of registers which is only as long as the lifetime of a vector element.
  • Execution units of VLIW processor 200 are combined in 4 pipelined ALU channels (ALCO 241, ALCl 242, ALC2 243, and ALC3 244). Each ALU channel has 2 data multiplexers MUX0 and MUX1 , unpack circuits, and 2 input registers (Data RegO and Data Regl ).
  • the design and operation of execution units and ALU channels of VLIW processor 200 are described in greater detail in a co-pending patent application entitled "Multifunctional Execution Unit, Executing Combined Operations and Supporting Continuing Instruction Flow," serial no. XX/xxx,xxx ⁇ atty. docket no.: M-3731 PCT>, naming Gorshtein et al. as inventors and filed on even date herewith, the detailed description of which is inco ⁇ orated herein by reference.
  • ALCO 241 includes execution units for executing integer arithmetic, division, and floating point addition operations.
  • ALCl 242 includes execution units for executing memory access operations, integer operations and floating point addition operations.
  • ALC2 243 includes execution units for executing integer, logic and shift operations, as well as floating point addition and multiplication operations.
  • ALC3 244 includes execution units for executing integer and logic operations, floating point multiplication operations, and memory access operations.
  • ALCs assignment of operation sets to ALCs is driven by a desire to provide even ALU channel loading for integer as well as floating-point computations.
  • alternate execution unit configurations would also be suitable, including larger or smaller numbers of ALCs, alternate mappings of operations to ALCs, and segregated integer and floating point execution unit configurations. Indeed alternative embodiments need not group execution units in ALCs. Suitable designs for such alternate configurations will be appreciated by those of ordinary skill in the art. Execution unit and ALC configurations are of any such suitable designs.
  • array access channels AACO, AACl, AAC2, and AAC3 250.
  • array access units (AAUs) of the array access channels 250 issue addresses for the loading (and storing) of array elements from (and to) main memory to (and from) register file 230.
  • AAC0-AAC3 each of 4 independent array access channels 250 (i.e., AAC0-AAC3) corresponds to a DTLB 237 port.
  • Each array access channel includes 8 pairs of address registers, which include a current address register (CAR) and an increment register (INCR). For memory accesses, one pair of address registers is used in every cycle.
  • CAR current address register
  • ICR increment register
  • the current address from the CAR register is delivered to the memory and is modified by an increment from the INCR register.
  • AACO and AAC2 are used for load memory accesses
  • AACl and AAC3 are used for both load and store memory accesses.
  • array prefetch buffer (APB) 235 is used to prefetch array elements for loops from memory.
  • array prefetch buffer (APB) 235 includes a four-channel FIFO buffer. Each channel includes forty-eight (48) 66-bit registers. Data are transferred from array prefetch buffer (APB) 235 to register file (RF) 230 when the data are ready.
  • Suitable array prefetch buffer designs such as for array prefetch buffer (APB) 235
  • suitable array access unit designs such as for array access channels (AACO, AACl, AAC2, and AAC3) 250 are described in greater detail in a co-pending patent application entitled "Array Prefetch Algorithm," serial no.
  • Memory management Unit (MMU) 260 includes a four-port Data Translation Lookaside
  • Buffer (DTLB) 237 with 64 entries and hardware for searching in a page table in the case of a DTLB 237 miss.
  • Memory management unit (MMU) 260 also contains disambiguation memory 294 for checking rearrangement correctness of load and store operations, performed by an optimizing compiler.
  • Memory access unit (MAU) 270 provides an interface for communicating between VLIW processor 200 and memory 211 at an exchange rate of up to four information words transferred per cycle.
  • Memory access unit (MAU) 270 includes an entry buffer for memory requests and a crossbar of five memory access channels (i.e., four data access channels and one instruction access channel for instruction fetches from instruction buffer 210) to four physical memory channels.
  • the two least significant bits of a physical address correspond to physical memory channel number each memory access channel includes a 64-bit data path.
  • Data cache (DCACHE) 280 caches data for scalar memory accesses and, in the exemplary embodiment of FIGURE 2, is organized as a write-through, 32 Kbyte, four-way set associative with 64-byte blocks, although alternative organizations are also suitable.
  • VLIW processor 200 is merely illustrative of a Very Long Instruction Word (VLIW) embodiment of the present invention.
  • VLIW Very Long Instruction Word
  • Alternate embodiments of the present invention provide loop control for a pipelined processor having single operation instructions, for a pipelined processor having an N-wide instruction pipeline, and for a pipelined execution unit of a multi-execution processor. Suitable modifications for each will be appreciated by persons of ordinary skill in the art.
  • Loop control logic 600 is similarly illustrative.
  • Alternative embodiments may inco ⁇ orate other structures and/or methods for distinguishing the prologue and epilogue portions of a loop body.
  • Alternative embodiments may also inco ⁇ orate other structures and/or methods and for inhibiting the operation of side-effects operations during the prologue and of load operations during the epilogue.
  • alternative processor embodiments may define analogous sets of operation classes in accordance with the operation semantics implemented by a particular processor architecture without departing from the spirit and scope of the invention.
  • Multi-way control transfer logic 800 is similarly illustrative.
  • Alternative embodiments may inco ⁇ orate other structures and/or methods for distinguishing the beginning and end of a last logical iteration of an inner loop body.
  • Alternative embodiments may also inco ⁇ orate other structures and/or methods for selecting a control transfer address from a plurality of branch addresses based on loop status dependent predicates.
  • Alternative embodiments may implement multi-way control transfers in fewer or greater steps and multi-way control transfer logic and registers may support larger or smaller numbers of control transfer targets.
  • alternative processor embodiments may define analogous sets of loop state predicates selective for control transfer targets and analogous portions or patches in accordance with the nested loop code 400 structures of the particular processor architecture without departing from the spirit and scope of the invention.
  • loop parameter and status registers 640 is merely illustrative and wide variety of suitable alternate groupings, field widths, state mappings, state transition pathways, etc. will be appreciated by persons of ordinary skill in the art.
  • control transfer registers including CTPRia/ CTPRie pairings and the CTPR / CTER distinction
  • suitable alternate groupings field widths, CTOP mappings, etc. will be appreciated by persons of ordinary skill in the art.

Landscapes

  • Engineering & Computer Science (AREA)
  • Software Systems (AREA)
  • Theoretical Computer Science (AREA)
  • General Engineering & Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • General Physics & Mathematics (AREA)
  • Devices For Executing Special Programs (AREA)

Abstract

Pour certaines classes de boucles à chevauchement par logiciel, il est possible de faire se chevaucher les parties prologue et épilogue de boucles internes adjacentes dans une boucle imbriquée. Il est ainsi possible de réaliser le chevauchement par logiciel du code de boucle externe et du code de boucle interne. Le support architectural des boucles imbriquées et à chevauchement par logiciel, est fourni par un ensemble de registres (640) de paramètres et de statuts de boucles, ainsi que par la mise en oeuvre de transferts de commande à voies multiples et dépendants de l'état de la boucle. En ce qui concerne le code de corps de boucle compatible avec deux contraintes simples, cette invention ne fait appel à aucun élément de code complémentaire afin de désactiver les opérations inutiles lors de périodes de boucle de prologue et d'épilogue de boucles internes adjacentes. La commande de boucle imbriquée assure le chevauchement entre la période d'épilogue d'une boucle interne précédente et la période de prologue de la boucle interne suivante. Le code (400) de boucle imbriquée peut ainsi être programmé plus efficacement par un compilateur en vue d'une exécution sur un processeur tel qu'un processeur à très long mot d'instruction (TLMI) (200). Ce dernier va fournir le support architectural destiné aux boucles imbriquées et à chevauchement par logiciel, ce qui permet d'améliorer les performances des boucles. Les transferts de commande à voies multiples et dépendants de l'état de boucle sont fournis par un circuit logique (800) de transfert de commande à voies multiples. Ce circuit comprend les registres (640) de paramètres et de statuts de boucles, ainsi qu'un sélecteur de ciblage de branche (820, 810) permettant de sélectionner les adresses de transfert de commande qui correspondent au code de corps de boucle interne (430). Ce circuit logique fait également appel à une correction de départ (420) ainsi qu'à une correction d'arrivée (440) issue des registres d'adresse de transfert de commande en fonction de l'état de la boucle.
PCT/RU1996/000216 1996-08-07 1996-08-07 Support architectural pour le chevauchement par logiciel de boucles imbriquees WO1998006040A1 (fr)

Priority Applications (2)

Application Number Priority Date Filing Date Title
PCT/RU1996/000216 WO1998006040A1 (fr) 1996-08-07 1996-08-07 Support architectural pour le chevauchement par logiciel de boucles imbriquees
US08/733,479 US5958048A (en) 1996-08-07 1996-10-18 Architectural support for software pipelining of nested loops

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
PCT/RU1996/000216 WO1998006040A1 (fr) 1996-08-07 1996-08-07 Support architectural pour le chevauchement par logiciel de boucles imbriquees

Related Child Applications (1)

Application Number Title Priority Date Filing Date
US08/733,479 Continuation US5958048A (en) 1996-08-07 1996-10-18 Architectural support for software pipelining of nested loops

Publications (1)

Publication Number Publication Date
WO1998006040A1 true WO1998006040A1 (fr) 1998-02-12

Family

ID=20130023

Family Applications (1)

Application Number Title Priority Date Filing Date
PCT/RU1996/000216 WO1998006040A1 (fr) 1996-08-07 1996-08-07 Support architectural pour le chevauchement par logiciel de boucles imbriquees

Country Status (1)

Country Link
WO (1) WO1998006040A1 (fr)

Cited By (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1117031A1 (fr) * 2000-01-14 2001-07-18 Texas Instruments France Un microprocesseur

Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
SU741269A1 (ru) * 1978-01-04 1980-06-15 Специальное Конструкторское Бюро Вычислительных Машин Микропрограммный процессор
SU742942A1 (ru) * 1977-09-21 1980-06-25 Предприятие П/Я А-3162 Устройство дл обработки информации
US4236227A (en) * 1979-01-02 1980-11-25 Honeywell Information Systems Inc. Data storage system
EP0293851A2 (fr) * 1987-06-05 1988-12-07 Mitsubishi Denki Kabushiki Kaisha Processeur de traitement numérique de signaux
US5210827A (en) * 1989-12-05 1993-05-11 Nec Corporation Nest level judging device for judging the starting and the end addresses
US5471189A (en) * 1994-12-14 1995-11-28 International Business Machines Corp. Comparator circuitry and method of operation
US5530665A (en) * 1993-07-22 1996-06-25 Kawasaki Steel Corporation Method of using associative memories and an associative memory

Patent Citations (7)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
SU742942A1 (ru) * 1977-09-21 1980-06-25 Предприятие П/Я А-3162 Устройство дл обработки информации
SU741269A1 (ru) * 1978-01-04 1980-06-15 Специальное Конструкторское Бюро Вычислительных Машин Микропрограммный процессор
US4236227A (en) * 1979-01-02 1980-11-25 Honeywell Information Systems Inc. Data storage system
EP0293851A2 (fr) * 1987-06-05 1988-12-07 Mitsubishi Denki Kabushiki Kaisha Processeur de traitement numérique de signaux
US5210827A (en) * 1989-12-05 1993-05-11 Nec Corporation Nest level judging device for judging the starting and the end addresses
US5530665A (en) * 1993-07-22 1996-06-25 Kawasaki Steel Corporation Method of using associative memories and an associative memory
US5471189A (en) * 1994-12-14 1995-11-28 International Business Machines Corp. Comparator circuitry and method of operation

Cited By (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP1117031A1 (fr) * 2000-01-14 2001-07-18 Texas Instruments France Un microprocesseur
US6795930B1 (en) 2000-01-14 2004-09-21 Texas Instruments Incorporated Microprocessor with selected partitions disabled during block repeat

Similar Documents

Publication Publication Date Title
US5958048A (en) Architectural support for software pipelining of nested loops
US5794029A (en) Architectural support for execution control of prologue and eplogue periods of loops in a VLIW processor
US11340908B2 (en) Reducing data hazards in pipelined processors to provide high processor utilization
US7594102B2 (en) Method and apparatus for vector execution on a scalar machine
US8161266B2 (en) Replicating opcode to other lanes and modifying argument register to others in vector portion for parallel operation
Ditzel et al. Branch folding in the CRISP microprocessor: Reducing branch delay to zero
US5941983A (en) Out-of-order execution using encoded dependencies between instructions in queues to determine stall values that control issurance of instructions from the queues
EP1010065B1 (fr) Controle d'acces de donnees dans un coprocesseur
US6282633B1 (en) High data density RISC processor
EP0996057B1 (fr) Processeur de données avec une unité d'instructions comprenant une mémoire tampon et une mémoire morte
US5983336A (en) Method and apparatus for packing and unpacking wide instruction word using pointers and masks to shift word syllables to designated execution units groups
US5889985A (en) Array prefetch apparatus and method
KR100705507B1 (ko) 확장가능한 프로세서 아키텍처에 진보된 명령어들을부가하는 방법 및 장치
US9329866B2 (en) Methods and apparatus for adapting pipeline stage latency based on instruction type
KR100316078B1 (ko) 파이프라인방식프로세서
US20010042187A1 (en) Variable issue-width vliw processor
US6292845B1 (en) Processing unit having independent execution units for parallel execution of instructions of different category with instructions having specific bits indicating instruction size and category respectively
KR100267089B1 (ko) 스칼라/벡터연산이조합된단일명령복수데이터처리
JP2001501755A (ja) データ処理条件コード・フラグ
US20040210886A1 (en) Optimized switch statement code employing predicates
EP0924603A2 (fr) Planification dynamique d'instructions de programme sur commande de compilateur
WO1998006040A1 (fr) Support architectural pour le chevauchement par logiciel de boucles imbriquees
Song Demystifying epic and ia-64
WO2005036384A2 (fr) Codage d'instructions pour processeurs de type vliw
JPH06290057A (ja) ループ最適化方法

Legal Events

Date Code Title Description
WWE Wipo information: entry into national phase

Ref document number: 08733479

Country of ref document: US

AK Designated states

Kind code of ref document: A1

Designated state(s): RU US