US20020002664A1 - System and method for power optimization in parallel units - Google Patents
- Publication number
- US20020002664A1 (application US 09/832,064)
- Authority
- US
- United States
- Prior art keywords
- execution unit
- power
- voltage
- execution
- instruction
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Granted
Classifications
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3867—Concurrent instruction execution, e.g. pipeline or look ahead using instruction pipelines
- G06F9/3871—Asynchronous instruction pipeline, e.g. using handshake signals between stages
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F1/00—Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
- G06F1/26—Power supply means, e.g. regulation thereof
- G06F1/32—Means for saving power
- G06F1/3203—Power management, i.e. event-based initiation of a power-saving mode
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/30003—Arrangements for executing specific machine instructions
- G06F9/30007—Arrangements for executing specific machine instructions to perform operations on data operands
- G06F9/3001—Arithmetic instructions
- G06F9/30014—Arithmetic instructions with variable precision
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3836—Instruction issuing, e.g. dynamic instruction scheduling or out of order instruction execution
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3854—Instruction completion, e.g. retiring, committing or graduating
-
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F9/00—Arrangements for program control, e.g. control units
- G06F9/06—Arrangements for program control, e.g. control units using stored programs, i.e. using an internal store of processing equipment to receive or retain programs
- G06F9/30—Arrangements for executing machine instructions, e.g. instruction decode
- G06F9/38—Concurrent instruction execution, e.g. pipeline or look ahead
- G06F9/3854—Instruction completion, e.g. retiring, committing or graduating
- G06F9/3858—Result writeback, i.e. updating the architectural state or memory
Definitions
- This invention pertains to circuit chip powering. More specifically, it relates to providing to individual functional units a selectable power supply, thereby increasing processing speed while reducing power consumption.
- An advantage of asynchronous logic in a microprocessor is its low power consumption when compared with similar synchronous logic.
- Asynchronous logic, operating in an unclocked domain, consumes much less power because no logic transitions are driven by a clock. Consequently, the speed and power consumption of any piece of asynchronous logic depend on the data and the voltage supplied to the logic gates.
- A common method of improving the performance of asynchronous logic has been to increase the power supply voltage. This increased voltage makes logic gates function much more rapidly.
- The power consumed in the transition of a circuit, which is due to a brief short circuit current, is proportional to the supply voltage.
- When asynchronous logic designers increase the speed of their chips by increasing the voltage of the power supply, they also increase the power needed to drive the logic.
- In most cases the asynchronous chip is powered by the same voltage source throughout. This source even supplies units that are executing functions whose results cannot be applied until after some critical function has completed, as would be the case of an in-order chip executing two instructions in parallel.
- This increased voltage supply throughout the chip executes the critical instruction more rapidly, but also causes other units executing non-critical instructions to use more power, thus dissipating more heat and draining batteries faster.
- If the adder were to use the same power supply as the multiplier in this example, the speed advantage gained by the adder from the extra voltage would be useless, since the adder is held from completion until after the multiply instruction is finished. If the adder were given a lower supply voltage during the execution of the add instruction than that given to the multiplier, the adder would save a considerable amount of power without affecting the overall performance of the chip.
- a system and method for selectively powering execution units from a plurality of power sources, the power to each execution unit being selected based upon expected time to completion of processing within the execution unit.
- FIG. 1 is a timing chart illustrating asynchronous data transfer.
- FIG. 2 is a timing chart illustrating synchronous and asynchronous operation.
- FIG. 3 is a timing diagram illustrating asynchronous data transfer.
- FIG. 4 is a flow chart illustrating parallel execution of adds and multiplies in the execution stage of a pipeline.
- FIG. 5 is a high level system diagram of a preferred embodiment of the invention.
- FIG. 6 is a high level circuit diagram illustrating a selectable voltage supply.
- FIG. 7 is a high level circuit diagram illustrating a step up buffer used for communication between two units powered at different voltage levels.
- FIG. 8 is a high level circuit diagram of a “careful” system that uses timing information from decode to determine the power supplied to parallel units.
- FIG. 9 is a high level circuit diagram of a “not careful” system which switches between two power supplies based on a stall signal.
- a typical pipelined unit including a plurality of combinational logic segments 100 , 104 , 108 , 112 followed, respectively, by storage segments 102 , 106 , 110 , and 114 .
- Such a pipelined unit may operate asynchronously or synchronously. If it is asynchronous, each of the storage segments 102 , 106 , 110 and 114 is activated when the combinational logic before it has completed. If it is synchronous, the storage segments are activated by the clock (not shown) every clock cycle.
- In FIG. 2, the difference between synchronous and asynchronous operation is further depicted.
- Operations 141 - 144 and 151 - 154 have time durations A, B, C, and D, respectively.
- Referring to FIG. 2 in connection with FIG. 1, if a pipeline is operating asynchronously, a storage segment 102 will be activated when the preceding combinational logic stage 100 has completed. It is not clocked. If the pipeline is operating synchronously, a storage segment 102 will be activated every clock cycle 146 - 149 by the clock.
- a time lag such as is illustrated between operations 142 and 143 , and between operations 143 and 144 , occurs in synchronous operations, for such operations must complete before a clock 146 - 149 change—a change which is required to trigger a subsequent operation.
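The time lag described above can be illustrated with a minimal sketch. It assumes, hypothetically, that the synchronous clock period is sized for the slowest operation and that asynchronous stages hand off with no overhead; the function names and the durations chosen for A, B, C, and D are illustrative and do not come from the patent.

```python
def asynchronous_total(durations):
    """Each operation starts as soon as the preceding one completes."""
    return sum(durations)


def synchronous_total(durations, clock_period=None):
    """Each operation occupies a whole clock period, however short it is."""
    if clock_period is None:
        # Assumption: the clock is sized for the slowest operation.
        clock_period = max(durations)
    return clock_period * len(durations)


# Hypothetical durations for operations A, B, C, and D.
durations = [3, 1, 2, 4]
async_time = asynchronous_total(durations)
sync_time = synchronous_total(durations)
```

Under these assumptions the synchronous total exceeds the asynchronous total by the idle time spent waiting for each clock change.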
- each component 280 is implemented with a series of requests 285 and acknowledges 287 surrounding any data transfer operations 289 , such as DATA_R 286 .
- a request 285 may be referred to as a “go” or “req”, while an acknowledge 287 is referred to as a “done” or “ack”.
- a floating point unit (FPU) 170 may be provided which supports add, subtract, multiply, divide, negate, absolute value, and compare operations.
- This FPU 170 operates in a pipelined fashion, with separate execution pipelines 162 , 164 for simple operations (add, subtract, negate, absolute value, and compare) and for complex operations (multiply and divide), respectively.
- FPU pipeline 162 includes four execute stages 172 .
- FPU pipeline 164 includes three execute stages 174 , with feedback loops on stages 1 and 2, which in effect lengthen pipeline 164 beyond the length of simple pipeline 162 .
- Prior pipelining for pre-fetch and decode is provided by a load/store unit 160 outside of FPU 170 .
- the simple pipeline 162 includes four major stages 172 necessary for completion of addition and subtraction, as follows:
- Stage 1 Subtract exponents, shift fraction of minimum exponent right by the resultant difference. Return maximum exponent, shifted fraction and non-shifted fraction.
- Stage 2 Add fractions. Return result and maximum exponent.
- Stage 3 Count leading zeroes, shift left result by this count. Return count, result, and maximum exponent.
- Stage 4 Subtract count from maximum. Return result in IEEE standard numbers.
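The four stages listed above can be sketched as follows. This is a simplified model under stated assumptions, not the patent's circuit: operands are non-negative pairs of (exponent, integer fraction) with value fraction × 2^exponent, the register width `WIDTH` is hypothetical, the sum is assumed to fit in the register, and the final packing into IEEE standard format is omitted.

```python
WIDTH = 8  # hypothetical fraction register width in bits


def stage1(a, b):
    """Subtract exponents; shift the fraction of the smaller exponent right."""
    (ea, fa), (eb, fb) = a, b
    if ea >= eb:
        return ea, fb >> (ea - eb), fa
    return eb, fa >> (eb - ea), fb


def stage2(max_exp, shifted, unshifted):
    """Add fractions; return the result and the maximum exponent."""
    return shifted + unshifted, max_exp


def stage3(result, max_exp):
    """Count leading zeroes and shift the result left by that count."""
    # Assumption: the sum fits in WIDTH bits, so the count is never negative.
    count = WIDTH - result.bit_length() if result else 0
    return count, result << count, max_exp


def stage4(count, result, max_exp):
    """Subtract the count from the maximum exponent."""
    return max_exp - count, result


def fp_add(a, b):
    max_exp, shifted, unshifted = stage1(a, b)
    result, max_exp = stage2(max_exp, shifted, unshifted)
    count, result, max_exp = stage3(result, max_exp)
    return stage4(count, result, max_exp)


# 6 * 2**3 + 4 * 2**1 = 48 + 8 = 56
exponent, fraction = fp_add((3, 6), (1, 4))
```

In this example the normalized result is fraction 224 with exponent -2, which is again 56.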
- the long, or complex, pipeline 164 includes three major stages 174, including an add/subtract fraction stage 1, a shift fraction stage 2, and an exponent add/subtract stage 3.
- the add/subtract stage 1 and shift stage 2 are accessed multiple times, possibly causing long stalls if other operations are waiting in pipeline 174 .
- a multiply operation is accomplished in pipeline 164 in a three stage no feedback manner, similar to the adder, but is considered a long pipeline due to the time required to complete the multiply accumulate (MAC) operation.
- the operations performed in these three stages 174 are:
- Stage 1 Calculate sign of result, calculate partial fraction for low 23 bits, calculate exponent (exa+exb).
- Stage 2 Calculate partial fraction for high 32 bits.
- Pipeline latency and conflicts characterize those circuits or systems which will most benefit from the use of this invention. While a system which does not exhibit these characteristics, such as an in-order execution system, also benefits from power conservation, when applied to an out-of-order system, conflicts result in much greater power conservation, since instructions will only be deemed non-critical (and, consequently, powered at a lower Vdd) if conflicts are found with their successors.
- Operation d is held out due to dependencies.
- operations b and c can be powered down (use less power, and consequently take longer to execute), resulting in the timing diagram of Table 2: TABLE 2 Execution Timing With Variable Vdd Mult Pipe: doiaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaaa.
- Add Pipe .
- The variable Vdd approach illustrated in Table 2 results in significantly less power consumption and completes the four instructions a, b, c, and d in exactly the same amount of time as illustrated in Table 1. If the system is an out-of-order system, and no conflict detection is included, instructions b and c will not be switched to lower voltage, since they will both be seen as critical.
- This feature of separation of pipelines into simple and complex pipelines facilitates optimization of the power and performance attributes of a chip through implementation of the asynchronous execution pipeline power selection method of the preferred embodiment of the invention described hereafter.
- pipeline selectable supply voltage (Vdd) for parallel instruction execution makes it possible to balance the performance boost given to critical asynchronous functions by increased voltage supply while also conserving power in other units of the chip performing faster operations in parallel.
- higher voltage is provided to time critical instructions, and lower voltage to non-critical instructions.
- By supplying lower voltages to the non-time-critical pipelines, overall chip power consumption is lessened.
- The high power unit, or pipeline, will perform very fast, but will consume switching power proportional to Vdd squared. This unit also has transitions based on steep inputs, causing greater short circuit current, and therefore more power consumption.
- The low power non-critical unit, or pipeline, will consume much less power than it would at a high voltage, since Vdd squared will be a less significant factor. Also, there is less short circuit current due to slower input transitions.
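As a rough illustration of the Vdd-squared relationship, the sketch below uses the standard CMOS switching-power approximation P = a·C·Vdd²·f. The capacitance, activity factor, and frequency are placeholder values held equal for both units, short circuit current is ignored, and the function names are hypothetical.

```python
def switching_power(vdd, capacitance=1.0, activity=1.0, frequency=1.0):
    """Approximate switching power: activity * C * Vdd^2 * f."""
    return activity * capacitance * vdd ** 2 * frequency


def power_ratio(v_low, v_high):
    """Switching power at v_low relative to v_high, all else being equal."""
    return switching_power(v_low) / switching_power(v_high)


# Halving the supply voltage quarters the switching power in this model.
ratio = power_ratio(1.0, 2.0)
```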
- a simple implementation of a system involves an arithmetic logic unit (ALU) 158 comprising a multiplier 164 and an adder 162 , with the power for adder 162 selected to be lower whenever multiplier 164 is working.
- Such an embodiment has the advantage that it involves little overhead while conserving power for each add in the pipeline.
- This embodiment is based on an in order execution system, where the adds cannot be put away (stored) before a preceding multiply.
- ALU 158 comprises, in this embodiment, two asynchronous pipelines 172 , 174 working in parallel.
- the design of the pipelines involves a normal asynchronous handshaking protocol but also includes the generation of a power signal 165 from the adder that instructs the adder power supply 161 to reduce the voltage 167 supplying the adder if a multiply operation is currently executing in multiplier 164 .
- the multiply operation is the most critical instruction, and the power supplied on line 169 to multiplier 164 from supply 163 will be the maximum available.
- this system may be modified to optimize power in the case of hazards which would cause pipeline stalls.
- This embodiment may be modified to work with more than two parallel units 172 , 174 by generating power control signals for each unit in the system.
- Each unit 162 , 164 is supplied by a simple switch (not shown), implemented either in CMOS on chip or externally. This switch is connected to the power control signal generated by the unit it is supplying. Since the multiplier 164 , in this embodiment, will always be deemed most critical, its power control signal (not shown) will always set the power control 163 to full voltage. Adder 162 will generate its power control signal 165 by monitoring the activity of multiplier 164 , which can be done in the protocol outlined above by connecting the power control signal 165 to the request signal (not shown) going into the multiplier. When the request signal for the multiplier is high, the power 161 to the adder will be set to low voltage on line 167 . When the request signal returns to zero, the power 167 to adder 162 is switched back to high voltage.
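The switching rule for the adder supply can be modeled in a few lines. This is a behavioral sketch only; the voltage values and function names are hypothetical, and the real mechanism is a switch driven by the multiplier request signal.

```python
V_HIGH = 1.8  # hypothetical full supply voltage
V_LOW = 1.2   # hypothetical reduced supply voltage


def multiplier_supply():
    """The multiplier is always deemed most critical: full voltage."""
    return V_HIGH


def adder_supply(multiplier_request):
    """Low voltage while the multiplier request is high, else high voltage."""
    return V_LOW if multiplier_request else V_HIGH


# Adder supply as the multiplier request goes low, high, high, low.
trace = [adder_supply(req) for req in (False, True, True, False)]
```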
- the adder 162 output signals are stepped up to normal voltage by resizing the transistors on the last few gates in the logic or by inserting step up buffers.
- These gates and buffers step up the low input voltage 167 to the high output voltage by being constructed with very small p-transistors. This smaller size transistor creates a lower transition voltage for the p-transistor(s) in the logic gate, thus allowing the higher voltage to drive the output when a lower voltage is applied at the input.
- Several stages of these logic books may be needed to supply appropriate amplification of the signal without violating noise margins.
- In FIG. 6, an example of a selectable voltage supply is illustrated which uses a plurality of voltage supply rails VDD 1 121 , VDD 2 122 , VDD 3 123 , . . . Wide transistors 127 - 129 are used to power a common voltage rail VDD common 125 under control of control lines 131 - 133 .
- Alternative switching mechanisms may be employed, and additional components (not shown), such as diodes, may be used to smooth voltage transitions.
- Pipeline hazard prediction logic may be used to control inputs 131 - 133 so that switching is made intelligently.
- power usage due to increased Vdd in an asynchronous chip is reduced by giving individual functional units (any unit that is performing computations resulting from an instruction) a selectable power supply.
- This is a power supply including a plurality of power supplies independently selectable by the instruction pipeline unit depending on, for example, the status of the functional units and/or the estimated completion times for various instructions.
- a plurality of different voltages 121 - 123 is provided to the chip, with all, or some selection of less than all, of the voltage lines distributed to each of the functional units.
- the power supplied to each individual unit is then selected from among the voltages by means of a set of very wide p-channel MOS transistors 127 - 129 in order to reduce resistance and supply enough current to power the functional unit. Due to the increased current required by the circuit at a higher voltage the p-channel MOS transistors would typically range in size, with wider transistors gating the larger voltages and smaller transistors the smaller voltages.
- the voltage supplies applied to the chip are, in this preferred embodiment, controlled through selector logic in the instruction pipeline which activates control lines 131 - 133 .
- this selector logic automatically switches the voltage of the most critical instruction, that is, the earliest instruction to enter execution that has not yet completed, to the maximum available.
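A minimal sketch of this selection rule follows. It assumes one control line per rail with exactly one line active per unit, and it simplifies by giving every non-critical instruction the lowest rail; the data layout and function names are hypothetical.

```python
def control_lines(selected_rail, rail_count):
    """One-hot control word: exactly one pass transistor is turned on."""
    return [i == selected_rail for i in range(rail_count)]


def select_rails(instructions, rail_count):
    """Map each uncompleted instruction to its control word.

    instructions: list of (name, entry_order, completed) tuples.
    Rails are indexed in ascending voltage order, so the last is the maximum.
    """
    pending = [ins for ins in instructions if not ins[2]]
    if not pending:
        return {}
    # Most critical: earliest instruction to enter execution, not yet completed.
    critical = min(pending, key=lambda ins: ins[1])[0]
    top = rail_count - 1
    return {
        name: control_lines(top if name == critical else 0, rail_count)
        for name, _, _ in pending
    }


selection = select_rails(
    [("mul", 0, False), ("add", 1, False), ("load", 2, True)], rail_count=3
)
```

Here the multiply, being the earliest uncompleted instruction, is switched to the highest rail, and the completed load is not assigned a rail at all.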
- the power supply logic high voltage in the pipeline unit equals the maximum supply voltage available to the unit. Also, the interconnecting logic between units uses that maximum voltage available to the destination unit.
- cross unit communication is performed at a uniform power level.
- Signals being transmitted from a low voltage unit to a high voltage unit are driven close to the high voltage unit Vdd.
- several amplifying gates are synthesized on the outputs of each power selectable unit. These gates may consist of logic gates, or be non-inverting buffers. Regardless of their function, the Vdd supplied to these gates is the maximum available in the chip and the p-transistors in the gates sized such that they amplify low voltage signals.
- buffer 181 with input at line 200 and output at line 202 , is connected between maximum Vdd 186 and ground 188 across transistors 192 , 194 , 196 and 198 , and is used as a step up buffer for communication between two units 182 and 184 powered at different voltage levels Vdd 1 and Vdd 2 .
- the width of P transistor 194 equals the width of N transistor 192 so that the P transistor reaches threshold more quickly.
- power switching may be performed during idle times in the unit or during operation.
- the power applied to an execution unit can be decreased by following a simple heuristic: decrease power for all non-critical units in the queue with completion time less than the next executing operation with lower criticality. This heuristic also reduces power for all operations in units prior to a target operation, where the target operation is data dependent on operations in other units (within execution time parameters). In an in-order system there can be only one critical operation at a time.
- Table 3 represents two pipes in which instructions execute and retire in rank order, meaning the operation with rank 1 will execute and complete first, and the operation with rank 6 will execute and complete last.
- execution order is read from bottom to top.
- the multiply with rank 1 causes four subsequent adds to proceed at a reduced power level, provided the sum of their execution times is less than that of the multiply. That is, the multiply must take at least 4 times the execution time of a slowed add. Data dependency can also cause reduced power. Since the Add R 1 in the pipe can't proceed until the Mult R 1 , R 2 is finished, the two previous adds are slowed since the multiply will finish much later than the previous operations would. Since the data dependency and retire order will already be available, very little additional circuitry is required to set the power levels. By retire order is meant the order in which the results are used to update registers; it is the same as rank, and is sometimes referred to as execution order.
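The condition on the adds can be checked with a short sketch. It assumes the adds run back to back in one pipe and that each slowed add takes the same time; the function name and cycle counts are hypothetical.

```python
def adds_to_slow(multiply_time, slowed_add_time, add_count):
    """Count how many following adds can run at reduced power.

    An add is slowed only while the running sum of slowed add times
    stays within the execution time of the preceding multiply.
    """
    slowed = 0
    elapsed = 0
    for _ in range(add_count):
        if elapsed + slowed_add_time > multiply_time:
            break
        elapsed += slowed_add_time
        slowed += 1
    return slowed


# A multiply taking 4 times as long as a slowed add covers all four adds.
all_four = adds_to_slow(multiply_time=16, slowed_add_time=4, add_count=4)
# If a slowed add takes longer, only three adds fit.
only_three = adds_to_slow(multiply_time=16, slowed_add_time=5, add_count=4)
```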
- each gate is viewed as a simple inverter, which is a reasonable generalization inasmuch as in CMOS technology all gates are essentially specialized inverters.
- a careful system which is not dangerous in terms of impeding system performance.
- this system can be more careful in determining voltage levels for the non-critical operations. This is so that when an instruction has a chance of making a transition from non-critical to critical, it is not powered at a lower voltage, thus impeding its time of completion when it becomes critical.
- the most critical instruction will generally be given the highest voltage if it is likely to complete after the execution of the less time consuming instructions in the pipeline.
- the non-critical instructions, those following the critical instruction in the pipeline, must also have their voltage determined.
- determining the voltage of non-critical instructions requires additional information about instruction times. In one embodiment of the invention, this additional information about instruction times is included in the instruction decode, and includes a representation of the average time required for the execution of the instruction.
- In FIG. 8, a high level diagram of a “careful” system is illustrated that uses timing information from decoder 212 to determine the power to be supplied to parallel units 222 , 224 .
- Instruction decode generally takes the binary instruction text and breaks that into commands for the machine, and in this embodiment of the invention also decodes an estimate of the instruction time. These instruction times are then used to compare the execution times of two or more parallel executions to determine which instruction is the most critical.
- Instructions 231 are decoded in instruction decode block 212 , the output of which is fed on line 235 to decrementer 214 along with oscillator 210 , to timer block 216 on line 237 , and to parallel execution pipelines 222 and 224 on lines 251 and 253 , respectively.
- the output of decrementer is fed on line 239 to order of magnitude subtractor 218 , along with a signal on line 241 representing the time required for the next non-critical instruction.
- the output of subtractor is fed on line 243 to power control 220 , which provides power on lines 245 and 247 to execution pipelines 222 and 224 , respectively, of execution unit 226 .
- this system adapts to instructions 231 in the asynchronous pipelines 222 , 224 which may execute at different times by using the time estimates from decode block 212 on lines 235 and 237 to determine voltages on lines 245 and 247 .
- the determination of power for a synchronous pipeline machine is provided by a decrementer 214 and comparator 218 for each execution unit.
- the decrementer 214 is loaded with the estimated execution time from instruction decode 212 .
- the voltage of a non-critical unit 222 is adjusted by comparing its decrementer 214 minus a margin to ensure the slowed-down unit does not finish after critical unit 224 . If the difference detected by subtractor 218 is an order of two, then the voltage of the execution unit 222 is scaled to provide half performance, and so on based on the magnitude of the difference.
- a simple subtractor 218 is used to find the order of magnitude difference (in powers of 2) between the longest operation and the shortest operation executing at the same time. The resulting difference is used to determine the power supply going to individual units 222 , 224 .
- the multiply instruction in pipeline 224 is given the highest (4 times performance) power, with an estimated run time of 16 cycles; the add instruction in pipeline 222 is powered with the lowest voltage; and the load instruction is powered with performance equivalent 2 (2 times lowest power speed), making it complete in roughly 15 cycles.
- the decrementers 214 are decremented by the performance level each cycle to give a true representation of the estimated cycles left in execution.
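The decrementer behavior and the choice of performance level can be sketched as below. This is an illustration under assumptions: work is counted in cycles at the lowest performance level, the levels are the powers of two 1, 2, and 4, the margin is one cycle, and a direct search over levels stands in for the order of magnitude subtractor. The work values are chosen only to resemble the 16 cycle multiply and roughly 15 cycle load mentioned above.

```python
LEVELS = (1, 2, 4)  # hypothetical performance levels; 1 is the lowest voltage


def run_decrementer(work, level):
    """Count cycles until a decrementer loaded with `work` reaches zero
    when it is decremented by the performance level each cycle."""
    cycles = 0
    while work > 0:
        work -= level
        cycles += 1
    return cycles


def choose_level(work, critical_cycles, margin=1):
    """Pick the lowest level that still finishes, with margin,
    no later than the critical unit."""
    for level in LEVELS:
        if run_decrementer(work, level) + margin <= critical_cycles:
            return level
    return LEVELS[-1]


multiply_cycles = run_decrementer(64, 4)          # critical unit at full power
load_level = choose_level(30, multiply_cycles)    # load runs at level 2
add_level = choose_level(8, multiply_cycles)      # add runs at the lowest level
load_cycles = run_decrementer(30, load_level)
```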
- the power consumption of the comparators, simple subtractors, and all other additional logic added to the pipeline is kept lower than the power saved in the rest of the chip by designing the areas of the comparators and decrementers to a minimum.
- variable cycle math instructions are provided.
- the same adder may take 1 or 2 cycles, depending upon the voltage supply selected to drive the adder.
- the voltages supplied on lines 245 and 247 to supply non-critical instructions are determined without the use of decrementer 214 and comparator 218 . While not as “careful” at making sure critical (or soon to be critical) instructions are powered at high voltages, such a system (deemed “not-careful”) still reduces overhead significantly.
- In FIG. 9, such an alternative embodiment is illustrated as a simple power control circuit switching between two power supplies 260 , 262 based on a stall signal 268 generated elsewhere.
- This circuit represents a system that is not “careful” but still saves considerable power by slowing non-critical instructions.
- high voltage rail 260 and low voltage rail 262 are switched to power rail 272 of pipeline stage 270 under control of stall signal 268 as applied directly to transistor 264 and through diode 271 via line 267 to transistor 265 .
- Pipeline stage 270 includes logic 274 for executing the instruction applied to stage 270 , the output of which is fed on line 275 to step up circuit 276 , such as buffer 181 illustrated in FIG. 7, and thence to output 277 .
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Software Systems (AREA)
- Physics & Mathematics (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Computational Mathematics (AREA)
- Mathematical Analysis (AREA)
- Mathematical Optimization (AREA)
- Pure & Applied Mathematics (AREA)
- Power Sources (AREA)
Abstract
A plurality of parallel execution units are selectively powered from a plurality of power sources, the power to each execution unit being selected based upon expected time to completion of processing within the execution unit. Maximum power is gated to execution units executing complex instructions, or time-critical instructions. Less than maximum power is gated to execution units executing simple instructions, or instructions which are not time-critical, or in response to pipeline hazards or stalls. When less than maximum power is gated to an execution unit, a step up circuit may be employed to raise the output of that execution unit to maximum power.
Description
- 1. Technical Field of the Invention
- This invention pertains to circuit chip powering. More specifically, it relates to providing to individual functional units a selectable power supply, thereby increasing processing speed while reducing power consumption.
- 2. Background Art
- Computers are becoming faster with each passing development cycle. Much of this increase in speed is due to ever smaller component and interconnection technologies used in manufacturing the microprocessors. With transistor densities increasing by an estimated 50% per year, technology is expected to reach a point where it becomes prohibitively expensive to further reduce the size of a transistor.
- The possibility of these technological limits will soon make it more and more difficult for computer designers to gain the yearly speedup necessary to perpetuate Moore's law. Thus, in order to continue to achieve performance gains, new methods must be used to push computer design to new limits. One of these methods is asynchronous design.
- An advantage of asynchronous logic in a microprocessor is its low power consumption when compared with similar synchronous logic. Asynchronous logic, operating in an unclocked domain, consumes much less power due to the fact that there are no logic transitions based on a clock. Consequently, the speed and power consumption of any piece of asynchronous logic will be dependent on the data and the voltage supplied to the logic gates.
- A common method of improving the performance of asynchronous logic has been to increase the power supply voltage. This increased voltage makes logic gates function much more rapidly. Unfortunately, the power consumed in the transition of a circuit, which is due to a brief short circuit current, is proportional to the supply voltage. Thus, when asynchronous logic designers increase the speed of their chips by increasing the voltage of the power supply, they also increase the power needed to drive the logic. In most cases the asynchronous chip is powered by the same voltage source throughout. This source even supplies units that are executing functions that cannot be applied until after some critical function has completed, as would be the case of an in-order chip executing two instructions in parallel. This increased voltage supply throughout the chip executes the critical instruction more rapidly, but also causes other units executing non-critical instructions to use more power, thus dissipating more heat and draining batteries faster.
- As an example of this problem, consider the case of an asynchronous fixed point unit with the ability to execute multiply and add instructions simultaneously. An instruction pipeline would typically have a multiply instruction in the execute stage and an add instruction in the decode stage simultaneously. It would execute the multiply instruction, while adding any referenced registers to its list of dependencies. Shortly afterward it might find that the add instruction used no registers with data dependency. It would then use the adder in the fixed point unit to execute the add instruction. An in-order chip would most likely wait for the multiply instruction to complete, update the appropriate registers, then update registers for the add instruction. If the add instruction took less time than the multiply instruction, then the adder would be held until its results could be applied to the machine registers, doing no useful work while it is being held. If the adder were to use the same power supply as the multiplier in this example, the speed advantage gained by the adder from the extra voltage would be useless, since the adder is held from completion until after the multiply instruction is finished. If the adder were given a lower supply voltage during the execution of the add instruction than that given to the multiplier, the adder would save a considerable amount of power without affecting the overall performance of the chip.
- It is an object of the invention to provide an improved system and method of chip powering.
- It is a further object of the invention to provide an improved system and method of chip powering which increases processing speed while reducing power consumption.
- It is a further object of the invention to provide an improved system and method of chip powering where individual execution units are provided selectable power to increase processing speed in critical units while reducing power in less critical units without delaying completion of processing in the critical units.
- In accordance with the invention, a system and method is provided for selectively powering execution units from a plurality of power sources, the power to each execution unit being selected based upon expected time to completion of processing within the execution unit.
- Other features and advantages of this invention will become apparent from the following detailed description of the presently preferred embodiment of the invention, taken in conjunction with the accompanying drawings.
- FIG. 1 is a timing chart illustrating asynchronous data transfer.
- FIG. 2 is a timing chart illustrating synchronous and asynchronous operation.
- FIG. 3 is a timing diagram illustrating asynchronous data transfer.
- FIG. 4 is a flow chart illustrating parallel execution of adds and multiplies in the execution stage of a pipeline.
- FIG. 5 is a high level system diagram of a preferred embodiment of the invention.
- FIG. 6 is a high level circuit diagram illustrating a selectable voltage supply.
- FIG. 7 is a high level circuit diagram illustrating a step up buffer used for communication between two units powered at different voltage levels.
- FIG. 8 is a high level circuit diagram of a “careful” system that uses timing information from decode to determine the power supplied to parallel units.
- FIG. 9 is a high level circuit diagram of a “not careful” system which switches between two power supplies based on a stall signal.
- Referring to FIG. 1, a typical pipelined unit is depicted including a plurality of combinational logic segments and storage segments.
- Referring to FIG. 2, the difference between synchronous and asynchronous operation is further depicted. Operations 141-144 and 151-154 have time durations A, B, C, and D, respectively. Referring to FIG. 2 in connection with FIG. 1, if a pipeline is operating asynchronously, a storage segment 102 will be activated when the preceding combinational logic stage 100 has completed. It is not clocked. If the pipeline is operating synchronously, a storage segment 102 will be activated every clock cycle 146-149 by the clock. A time lag, such as is illustrated between operations operations
- Referring to FIG. 3, asynchrony of a circuit chip, which is implemented without the use of a clock, requires that a method be used to guarantee that an operation has completed. As shown, in order to assure an asynchronous design, each component 280 is implemented with a series of requests 285 and acknowledges 287 surrounding any data transfer operations 289, such as DATA_R 286. As used herein, a request 285 may be referred to as a “go” or “req”, while an acknowledge 287 is referred to as a “done” or “ack”.
- In addition to the request/acknowledge protocol used to assure completion of data transfers, standard logic components, such as a full adder, latch, or shifter, which contain no hardware to indicate their completion status and assume completion at the end of the clock cycle given to them, must have some additional logic included to detect completion. In some cases, as in most combinational logic, this can be done by simultaneously computing the inverse function for each bit and waiting for the two to differ. In other cases, as in latches, one dummy signal is propagated into the device, and then checked on the output for correctness before the outputs can be guaranteed stable. In either case, these protocols must be used in even the smallest of components to assure that the unit as a whole operates in a consistent fashion.
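- The completion detection described above for combinational logic can be illustrated in software. The following is a minimal sketch, assuming that each output bit is carried as a (bit, inverse) pair and that both members rest at 0 until the result is ready. The names `Rail`, `is_complete` and `done_signal` and the list-of-pairs representation are hypothetical and are used for illustration only.

```python
from typing import List, Tuple

# Hypothetical representation: (bit, computed inverse of bit).
# (0, 0) is assumed to mean "this bit has not been computed yet".
Rail = Tuple[int, int]


def is_complete(outputs: List[Rail]) -> bool:
    """Return True once every bit and its computed inverse differ."""
    return all(bit != inv for bit, inv in outputs)


def done_signal(outputs: List[Rail]) -> int:
    """Model the "done"/"ack" line: 1 only when all outputs are stable."""
    return 1 if is_complete(outputs) else 0


# One bit is still at rest, so the unit must not acknowledge yet.
pending = done_signal([(0, 0), (1, 0)])
# Every bit now differs from its inverse, so the result can be acknowledged.
finished = done_signal([(0, 1), (1, 0)])
```

In this sketch the acknowledge is raised only when every pair differs, which mirrors the idea of waiting for a function and its inverse to disagree on every bit.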
- Referring to FIG. 4, the parallel execution of add instructions and multiply instructions in the execute stage of a pipeline is depicted. By way of example, a floating point unit (FPU) 170 may be provided which supports add, subtract, multiply, divide, negate, absolute value, and compare operations. This FPU 170 operates in a pipelined fashion, with separate execution pipelines. FPU pipeline 172 includes four execute stages 172, and FPU pipeline 174 includes three execute stages 174, with feedback loops on stages 1 and 2, which in effect lengthen pipeline 164 beyond the length of simple pipeline 162. Prior pipelining for pre-fetch and decode is provided by a load/store unit 160 outside of FPU 170.
- The simple pipeline 162 includes four major stages 172 necessary for completion of addition and subtraction, as follows:
- Stage 1: Subtract exponents, shift fraction of minimum exponent right by the resultant difference. Return maximum exponent, shifted fraction and non-shifted fraction.
- Stage 2: Add fractions. Return result and maximum exponent.
- Stage 3: Count leading zeroes, shift left result by this count. Return count, result, and maximum exponent.
- Stage 4: Subtract count from maximum. Return result in IEEE standard numbers.
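- The four stages listed above can be traced with a small worked example. The following is a simplified sketch, assuming unsigned operands represented as an integer fraction and an integer exponent (value = fraction × 2^exponent) and an 8 bit fraction datapath. Sign handling, rounding and packing into IEEE format are omitted, and the function names and the `WIDTH` constant are hypothetical.

```python
WIDTH = 8  # assumed fraction datapath width, kept small for readability


def stage1_align(exp_a, frac_a, exp_b, frac_b):
    """Subtract exponents; shift the fraction of the smaller exponent right."""
    if exp_a >= exp_b:
        return exp_a, frac_b >> (exp_a - exp_b), frac_a
    return exp_b, frac_a >> (exp_b - exp_a), frac_b


def stage2_add(max_exp, shifted, unshifted):
    """Add the aligned fractions; pass the maximum exponent along."""
    return shifted + unshifted, max_exp


def stage3_normalize(result, max_exp):
    """Count leading zeroes in the WIDTH-bit result and shift them out."""
    count = max(WIDTH - result.bit_length(), 0) if result else 0
    return count, result << count, max_exp


def stage4_exponent(count, result, max_exp):
    """Subtract the shift count from the maximum exponent."""
    return max_exp - count, result


def fp_add(exp_a, frac_a, exp_b, frac_b):
    """Run the four stages in order and return (exponent, fraction)."""
    aligned = stage1_align(exp_a, frac_a, exp_b, frac_b)
    summed = stage2_add(*aligned)
    normalized = stage3_normalize(*summed)
    return stage4_exponent(*normalized)


# 0b1010 * 2**3 = 80 and 0b1000 * 2**1 = 16, so the sum should represent 96.
exp, frac = fp_add(3, 0b1010, 1, 0b1000)
value = frac * 2.0 ** exp
```

Each function corresponds to one pipeline stage, so an operation that needs only the first stage could be passed through the remaining functions unchanged.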
- Since other operations in simple pipeline 162 do not require all four stages 172, they are performed in the first pipeline stage 1, and sent in a pass-through mode through the other stages 2-4 of pipeline 172 in order that they will not complete out of order with other operations in simple pipe 162.
- The long, or complex, pipeline 164 includes three major stages 174, including an add/subtract fraction stage 1, a shift fraction stage 2, and an exponent add/subtract stage 3. When executing a divide operation, the add/subtract stage 1 and shift stage 2 are accessed multiple times, possibly causing long stalls if other operations are waiting in pipeline 174.
- A multiply operation is accomplished in pipeline 164 in a three stage no feedback manner, similar to the adder, but is considered a long pipeline due to the time required to complete the multiply accumulate (MAC) operation. The operations performed in these three stages 174 are:
- Stage 1: Calculate sign of result, calculate partial fraction for low 23 bits, calculate exponent (exa+exb).
- Stage 2: Calculate partial fraction for high 32 bits.
- Stage 3: Round fraction.
- Pipeline latency and conflicts characterize those circuits or systems which will most benefit from the use of this invention. While a system which does not exhibit these characteristics, such as an in-order execution system, also benefits from power conservation, when applied to an out-of-order system, conflicts result in much greater power conservation, since instructions will only be deemed non-critical (and, consequently, powered at a lower Vdd) if conflicts are found with their successors. For example, consider the following instruction sequence:
a) Mult r3,r4 ;r3=r3*r4
b) Add r1,r2 ;r1=r1+r2
c) Add r6,r7 ;r6=r6+r7
d) Add r3,r5 ;r3=r3+r5, which conflicts with a).
- These instructions execute as illustrated in Table 1:
TABLE 1 Execution Timing Without Variable Vdd
Mult Pipe: .....aaaaaaaaaaaaaaaaaaaaaaaaaaaaa.........
Add Pipe:  .......bbbbbccccc.................ddddd....
- Operation d is held out due to dependencies. In accordance with the invention, by using this collision information operations b and c can be powered down (use less power, and consequently take longer to execute), resulting in the timing diagram of Table 2:
TABLE 2 Execution Timing With Variable Vdd
Mult Pipe: .....aaaaaaaaaaaaaaaaaaaaaaaaaaaaa.........
Add Pipe:  .......bbbbbbbbbbcccccccccccc.....ddddd....
- The variable Vdd approach illustrated in Table 2 results in significantly less power consumption and completes the four instructions a, b, c, and d in exactly the same amount of time as illustrated in Table 1. If the system is an out-of-order system, and no conflict detection is included, instructions b and c will not be switched to lower voltage, since they will both be seen as critical.
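- The slack exploited in Table 2 can be estimated with a short calculation. The following is a minimal sketch, assuming that one character in the timing diagrams is one time unit, that the multiply occupies units 5 through 33, and that adds b and c start at unit 7 and each take 5 units at full voltage. These numbers are read loosely from the tables, and the function name `max_stretch` is hypothetical.

```python
def max_stretch(window_start, window_end, nominal_durations):
    """Largest uniform slowdown that still fits the operations in the window.

    Returns 1.0 (no slowdown) when there is no slack to use.
    """
    window = window_end - window_start
    total = sum(nominal_durations)
    if total <= 0 or window < total:
        return 1.0
    return window / total


# Assumed values: adds b and c may run anywhere from unit 7 up to unit 34,
# the point at which the multiply finishes and operation d can proceed.
stretch = max_stretch(window_start=7, window_end=34, nominal_durations=[5, 5])
```

Under these assumptions the two adds could take up to about 2.7 times their full voltage duration without delaying operation d. Table 2 stays inside that bound, since b and c together finish before the multiply does.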
- This feature of separation of pipelines into simple and complex pipelines facilitates optimization of the power and performance attributes of a chip through implementation of the asynchronous execution pipeline power selection method of the preferred embodiment of the invention described hereafter.
- In accordance with the preferred embodiment of the invention, pipeline selectable supply voltage (Vdd) for parallel instruction execution makes it possible to balance the performance boost given to critical asynchronous functions by increased voltage supply while also conserving power in other units of the chip performing faster operations in parallel.
- In accordance with the invention, higher voltage is provided to time critical instructions, and lower voltage to non-critical instructions. By supplying lower voltages to the non-time-critical pipelines, overall chip power consumption is lessened. The high power unit, or pipeline, will perform very fast, but will consume switching power levels proportional to Vdd squared. This unit also has transitions based on steep inputs, causing greater short circuit current, and therefore more power consumption. The low power non-critical unit, or pipeline, will consume much less power than it would at a high power since Vdd squared will be a less significant factor. Also, there is less short circuit current due to slower input transitions.
- Referring to FIG. 5, in accordance with a preferred embodiment of the invention, a simple implementation of a system involves an arithmetic logic unit (ALU) 158 comprising a multiplier 164 and an adder 162, with the power for adder 162 selected to be lower whenever multiplier 164 is working. Such an embodiment has the advantage that it involves little overhead while conserving power for each add in the pipeline. This embodiment is based on an in order execution system, where the adds cannot be put away (stored) before a preceding multiply. ALU 158 comprises, in this embodiment, two asynchronous pipelines power signal 165 from the adder that instructs the adder power supply 161 to reduce the voltage 165 supplying the adder if a multiply operation is currently executing in multiplier 164. In this case, the multiply operation is the most critical instruction, and the power supplied on line 169 to multiplier 164 from supply 163 will be the maximum available. With the addition of some additional power supply control logic, this system may be modified to optimize power in the case of hazards which would cause pipeline stalls. This embodiment may be modified to work with more than two parallel units
- The power supplied to each unit multiplier 164, in this embodiment, will always be deemed most critical its power control signal (not shown) will always set the power control 163 to full voltage. Adder 162 will generate its power control signal 165 by monitoring the activity of multiplier 164, which can be done in the protocol outlined above by connecting the power control signal 165 to the request signal (not shown) going into the multiplier. When the request signal for the multiplier is high, the power 161 to the adder will be set to low voltage on line 167. When the request signal returns to zero, the power 167 to adder 162 is switched back to high voltage.
- As will be described more fully hereafter in connection with FIG. 7, in order to communicate the results of a low power addition to the rest of the chip the adder 162 output signals are stepped up to normal voltage by resizing the transistors on the last few gates in the logic or by inserting step up buffers. These gates and buffers step up the low input voltage 167 to the high output voltage by being constructed with very small p-transistors. This smaller size transistor creates a lower transition voltage for the p-transistor(s) in the logic gate, thus allowing the higher voltage to drive the output when a lower voltage is applied at the input. Several stages of these logic books may be needed to supply appropriate amplification of the signal without violating noise margins.
- Referring to FIG. 6, an example of a selectable voltage supply is illustrated which uses a plurality of voltage supply rails VDD1 121, VDD2 122, VDD3 123, . . . Wide transistors 127-129 are used to power a common voltage rail VDD common 125 under control of control lines 131-133. Alternative switching mechanisms may be employed, and additional components (not shown), such as diodes, may be used to smooth voltage transitions. Pipeline hazard prediction logic may be used to control inputs 131-133 so that switching is made intelligently.
- Thus, in accordance with the present invention, power usage due to increased Vdd in an asynchronous chip is reduced by giving individual functional units (any unit that is performing computations resulting from an instruction) a selectable power supply. This is a power supply including a plurality of power supplies independently selectable by the instruction pipeline unit depending on, for example, the status of the functional units and/or the estimated completion times for various instructions.
- Referring further to FIG. 6, a plurality of different voltages 121-123 is provided to the chip, with all, or some selection of less than all, of the voltage lines distributed to each of the functional units. The power supplied to each individual unit is then selected from among the voltages by means of a set of very wide p-channel MOS transistors 127-129 in order to reduce resistance and supply enough current to power the functional unit. Due to the increased current required by the circuit at a higher voltage the p-channel MOS transistors would typically range in size, with wider transistors gating the larger voltages and smaller transistors the smaller voltages.
- The voltage supplies applied to the chip are, in this preferred embodiment, controlled through selector logic in the instruction pipeline which activates control lines 131-133. Where the chip is capable of detecting data dependencies of instructions for parallel execution, this selector logic automatically switches the voltage of the most critical instruction, that is, the earliest instruction to enter execution that has not yet completed, to the maximum available.
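- The selection rule described above can be sketched in software. The following is a minimal illustration, assuming that each in-flight instruction records the unit executing it and the order in which it entered execution. The class `InFlight`, the function `select_rails` and the example voltages are hypothetical, and giving every other pending unit one common lower voltage is a simplifying assumption rather than a requirement of the description.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class InFlight:
    unit: str          # execution unit name, e.g. "multiplier"
    entry_order: int   # lower value entered execution earlier
    completed: bool = False


def select_rails(ops: List[InFlight], v_max: float, v_low: float) -> Dict[str, float]:
    """Give the unit of the earliest uncompleted instruction v_max.

    All other units with pending work are assigned v_low (an assumption).
    """
    pending = [op for op in ops if not op.completed]
    if not pending:
        return {}
    critical = min(pending, key=lambda op: op.entry_order)
    return {op.unit: (v_max if op is critical else v_low) for op in pending}


# Example voltages are placeholders, not values from the description.
ops = [InFlight("multiplier", 0), InFlight("adder", 1)]
while_multiplying = select_rails(ops, v_max=3.3, v_low=1.8)

ops[0].completed = True
after_multiply = select_rails(ops, v_max=3.3, v_low=1.8)
```

Once the multiply completes, the add becomes the earliest uncompleted instruction and its unit is raised to the maximum voltage.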
- Logic high inputs of a voltage lower than the Vdd for a functional unit could result in the logic in that unit not functioning correctly. Therefore, in order to deselect the power supply provided to a particular execution unit, the power supply logic high voltage in the pipeline unit equals the maximum supply voltage available to the unit. Also, the interconnecting logic between units uses that maximum voltage available to the destination unit.
- In order to provide different voltages to different units on a chip, care is taken with respect to cross unit communication and power switching of queued units.
- In accordance with the preferred embodiment of the invention, cross unit communication is performed at a uniform power level. Signals being transmitted from a low voltage unit to a high voltage unit are driven close to the high voltage unit Vdd. In order to accomplish this, several amplifying gates are synthesized on the outputs of each power selectable unit. These gates may consist of logic gates, or be non-inverting buffers. Regardless of their function, the Vdd supplied to these gates is the maximum available in the chip and the p-transistors in the gates sized such that they amplify low voltage signals.
- Referring to FIG. 7, buffer 181, with input at line 200 and output at line 202, is connected between maximum Vdd 186 and ground 188 across transistors units P transistor 194 equals the width of N transistor 192 so that the P transistor reaches threshold more quickly.
- In accordance with various embodiments of the invention, power switching may be performed during idle times in the unit or during operation.
- The safest way to switch execution unit power supplies is during idle times. While this prevents possible glitches due to power selection, it does not provide the flexibility usually required.
- Currently, most commercial microprocessors and DSPs rely on some kind of pipeline which queues operations so as to perform sequential operations in a pipelined manner. Superscalar processor designs typically use holding queues in addition to the pipeline for operations that cannot enter the pipeline due to structural hazards. These provide both in order execution and out of order execution, and pipeline Vdd control of the present invention is flexible enough to be useful in both cases.
- In order to perform pipeline power controls without significant performance degradation, in accordance with the preferred embodiment of the invention, power is switched on a unit only when it has been made non-critical by data or control hazards in the queue. The system implemented on a chip is viewed as a set of queues with voltage regulated depending on the number of operations waiting to be processed in the queue. A rough estimation of time required for a given operation, called a “time unit”, is used to determine when a unit has been made non-critical. The set of queues and associated time units is used to implement the hazard dependent power control methods of the preferred embodiments of the invention.
- In an in-order system, the power applied to an execution unit can be decreased by following a simple heuristic: decrease power for all non-critical units in the queue with completion time less than the next executing operation with lower criticality. This heuristic also reduces power for all operations in units prior to a target operation, where the target operation is data dependent on operations in other units (within execution time parameters). In an in-order system there can be only one critical operation at a time.
- An example of in-order execution is illustrated in the following Table 3, which represents two pipes in which instructions execute and retire in rank order, meaning the operation with rank 1 will execute and complete first, and the operation with rank 6 will execute and complete last. In Table 3, execution order is read from bottom to top.
TABLE 3 IN-ORDER SYSTEM EXECUTION
Simple Pipe         Complex Pipe        Reduced   Data
Op           Rank   Op           Rank   Power     Dependency
Add R1, R2   6                          Yes       Yes
Add R5, R6   5                          Yes       No
                    Mult R3, R4  3
Add R7, R8   4                          Yes       No
                    Mult R1, R2  1
Add R4, R9   2                          Yes       Yes
- In the example of Table 3, the multiply with rank 1 causes four subsequent adds to proceed at a reduced power level, provided the sum of their execution times is less than that of the multiply. That is, the multiply must take at least 4 times the execution time of a slowed add. Data dependency can also cause reduced power. Since the Add R1 in the pipe can't proceed until the Mult R1, R2 is finished, the two previous adds are slowed since the multiply will finish much later than the previous operations would. Since the data dependency and retire order will already be available, very little additional circuitry is required to set the power levels. By retire order is meant the order in which the results are used to update registers; it is the same as rank, and is sometimes referred to as execution order.
- In an out of order system, the same queue system as is illustrated in Table 3 can be used, but data dependency will provide the only power reduction cases.
- In order to optimize Vdd switching of a unit during its operation, the switch occurs from one power level to the other without ever deviating lower than the lowest supply power. This is accomplished by selecting the desired voltage rail before the prior voltage rails are switched off. Power glitches occur in a unit due to two causes: transition from a high voltage to a lower voltage and transition from a low voltage to a higher voltage. At the gate level, each gate is viewed as a simple inverter, which is a reasonable generalization inasmuch as in CMOS technology all gates are essentially specialized inverters. Extreme conditions occur with an input voltage of 0 (ground) and an input voltage of the original Vdd, and analysis at these extremes, with the gate input voltage being the original power selection and Vdd viewed as the newest power selection, will yield worst case results. These input and rail voltage selections are realistic since in an asynchronous circuit, the voltage supply increases before signals can propagate through the circuit at the new voltage level.
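- The switching order described above, in which the desired rail is selected before the prior rails are switched off, can be sketched as a sequence of control states. The following is a minimal illustration, assuming each rail is enabled by one control line and that a rail set models which lines are active. The function name `switch_rail` and the string rail names are hypothetical.

```python
from typing import List, Set


def switch_rail(enabled: Set[str], new_rail: str) -> List[Set[str]]:
    """Return the sequence of enabled-rail sets while moving to new_rail.

    The new rail is selected first and only then are the prior rails
    switched off, so the unit is never left without a supply.
    """
    steps = [set(enabled)]                     # starting state
    steps.append(set(enabled) | {new_rail})    # select the desired rail
    steps.append({new_rail})                   # switch off the prior rails
    return steps


# Move a unit from a rail named "VDD1" to a rail named "VDD2".
steps = switch_rail({"VDD1"}, "VDD2")
```

At every step in the returned sequence at least one rail is enabled, which is the property the switching order is meant to guarantee.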
- As a result of analysis of all four possible cases on the transistor level (Vhi→Vlo, Vlo→Vhi, Vin = old Vdd, Vin = gnd), the following parameters for unit design are established.
- Dynamic power consumption can be approximated by the following equation:
- $P_d = C_L \cdot V_{dd}^2 \cdot f_p$
- where $f_p$ is the switching frequency of the circuit and $C_L$ is the capacitive load on the device. This equation assumes a step input and that
- $i_n(t) = C_L \, dV_{out}/dt$,
- where $i_n$ is the n transistor input current. This equation supports experimental data showing a marked increase of the rate of power consumption when a test device is running at a high voltage versus low.
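- The quadratic dependence on supply voltage in the dynamic power approximation can be checked numerically. The following is a minimal sketch of that approximation only. The function names and the example values for load capacitance, voltage and frequency are hypothetical, and short circuit current and control circuitry overhead are not modeled.

```python
def dynamic_power(c_load, vdd, f_switch):
    """Dynamic power C_L * Vdd**2 * f_p (watts for farads, volts, hertz)."""
    return c_load * vdd ** 2 * f_switch


def power_ratio(vdd_low, vdd_high):
    """Ratio of dynamic power at two supply voltages, same C_L and f_p."""
    return (vdd_low / vdd_high) ** 2


# Placeholder values: 1 pF load switching at 1 MHz from a 2 V supply.
p_example = dynamic_power(c_load=1e-12, vdd=2.0, f_switch=1e6)

# Halving the supply voltage at the same switching frequency.
ratio = power_ratio(vdd_low=1.0, vdd_high=2.0)
```

Under this approximation, halving the supply voltage at a fixed switching frequency reduces dynamic power to one quarter. A slowed unit also switches less often, which the fixed frequency comparison does not capture.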
- 1. By selectively decreasing the power for non-time-critical units, overall chip power consumption is saved. However, applying this approximation onto all of the circuits for the design would yield incorrect results, since while it takes into account the power consumed by individual units at given supply voltages, it does not take into account the added power consumed by the power control circuitry of the chip.
- 2. By placing a modified set of logic gates right before the outputs of a unit, they can supply the high voltage required for system communication given low internal voltage in the unit. If the rail voltage is increased before the input voltage can step up to the new Vdd, and if the transistors are sized correctly, output voltages will be well below the noise margin of the following transistor. This approach is used only when very wide noise margins are available.
- Referring to FIG. 8, in accordance with a further embodiment of the invention, a careful system is provided which is not dangerous in terms of impeding system performance. By including timing information, this system can be more careful in determining voltage levels for the non-critical operations. This is so that when an instruction has a chance of making a transition from non-critical to critical, it is not powered at a lower voltage, thus impeding its time of completion when it becomes critical.
- The most critical instruction will generally be given the highest voltage if it is likely to complete after the execution of the less time consuming instructions in the pipeline. The non-critical instructions, those following the critical instruction in the pipeline, must also have their voltage determined. In a safe system, determining the voltage of non-critical instructions requires additional information about instruction times. In one embodiment of the invention, this additional information about instruction times is included in the instruction decode, and includes a representation of the average time required for the execution of the instruction.
- Referring to FIG. 8, a high level diagram of a “careful” system is illustrated that uses timing information from decoder 212 to determine the power to be supplied to parallel units.
- Instructions 231 are decoded in instruction decode block 212, the output of which is fed on line 235 to decrementer 214 along with oscillator 210, to timer block 216 on line 237, and to parallel execution pipelines lines line 239 to order of magnitude subtractor 218, along with a signal on line 241 representing the time required for the next non-critical instruction. The output of subtractor is fed on line 243 to power control 220, which provides power on lines execution pipelines execution unit 226.
- In operation, this system adapts to instructions 231 in the asynchronous pipelines decode block 212 on lines lines decrementer 214 and comparator 218 for each execution unit. At the beginning of an execution stage 226, the decrementer 214 is loaded with the estimated execution time from instruction decode 212. The voltage of a non-critical unit 222 is adjusted by comparing its decrementer 214 minus a margin to ensure the slowed-down unit does not finish after critical unit 224. If the difference detected by subtractor 218 is an order of two, then the voltage of the execution unit 222 is scaled to provide half performance, and so on based on the magnitude of the difference.
- For example, consider the instruction stream 231 of Table 4, all entering the execute stage 226 at the same time:
TABLE 4 Instruction Stream Execution Time Estimates
Instruction               Estimated Time
1. multiply r1, r2, r3    64 cycles (1000000): critical instruction
2. add r4, r5, r6         1 cycle (0000001): 64 times faster than the critical instruction
3. load r7, r8            32 cycles (0100000): 2 times faster than the critical instruction
- A simple subtractor 218 is used to find the order of magnitude difference (in powers of 2) between the longest operation and the shortest operation executing at the same time. The resulting difference is used to determine the power supply going to individual units pipeline 224 is given the highest (4 times performance) power, with an estimated run time of 16 cycles; the add instruction in pipeline 222 is powered with the lowest voltage; and the load instruction is powered with performance equivalent 2 (2 times lowest power speed), making it complete in roughly 15 cycles. The decrementers 214 are decremented by the performance level each cycle to give a true representation of the estimated cycles left in execution.
- A similar procedure is followed for an asynchronous pipeline. In this embodiment of the invention, an oscillator 210 with a certain time period is used in the time estimate comparisons. This approach may be adapted to situations where instructions are not executed at the same time, since decrementer 214 shows in every cycle an approximation of the critical remaining time of the critical instruction.
- In accordance with the invention, better performance is achieved for an asynchronous chip, or for a chip with asynchronous units, for a given amount of power. It does have the disadvantage that asynchronous operation execution times are often data dependent. In this case, the estimates of remaining time could be off, and voltages adjusted to slow performance of operations deemed to be much faster than the critical instruction. This may be the case in the above example with the multiply instruction executing in 2 cycles. After the multiply completes, however, the voltages are readjusted to assist the next most critical instruction. Overall, the chip performance is boosted, given its power consumption.
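- The Table 4 example can be worked through with a short sketch of the order of magnitude comparison. The following is a minimal illustration, assuming three performance levels (1, 2 and 4 times the lowest speed), that the estimated times are given at the lowest level, and that each power of two by which an instruction is shorter than the critical one halves its level. The function names, the dictionary layout and the omission of the safety margin are assumptions made for illustration.

```python
def order_of_magnitude(cycles: int) -> int:
    """Position of the highest set bit, i.e. floor(log2(cycles))."""
    return cycles.bit_length() - 1


def performance_levels(estimates, max_level=4):
    """Map each instruction's estimated cycles to a performance multiplier.

    The longest (critical) instruction gets max_level. Every power of two
    by which another instruction is shorter halves its level, down to 1.
    """
    longest = max(order_of_magnitude(c) for c in estimates.values())
    levels = {}
    for name, cycles in estimates.items():
        diff = longest - order_of_magnitude(cycles)
        levels[name] = max(max_level >> diff, 1)
    return levels


# Estimated times from Table 4, in cycles.
table4 = {"multiply": 64, "add": 1, "load": 32}
levels = performance_levels(table4)

# Estimated run time at the assigned level (no margin applied).
run_times = {name: table4[name] // levels[name] for name in table4}
```

With these assumptions the multiply and the load both finish in about 16 cycles while the add runs at the lowest level. The roughly 15 cycle figure for the load in the text would correspond to applying a margin, which this sketch leaves out.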
- In accordance with a further embodiment of the invention, the power consumption of the comparators, simple subtractors, and all other additional logic added to the pipeline is kept lower than the power saved in the rest of the chip by designing the areas of the comparators and decrementers to a minimum.
- As a result of implementing the selectable variable voltage supply technique of the preferred embodiment of the invention, variable cycle math instructions are provided. For example, the same adder may take 1 or 2 cycles, depending upon the voltage supply selected to drive the adder.
- In alternative embodiments, the voltages supplied on lines decrementer 214 and comparator 218. While not as “careful” at making sure critical (or soon to be critical) instructions are powered at high voltages, such a system (deemed “not-careful”) still reduces overhead significantly.
- Referring to FIG. 9, such an alternative embodiment is illustrated as a simple power control circuit switching between two power supplies stall signal 268 generated elsewhere. This circuit represents a system that is not “careful” but still saves considerable power by slowing non-critical instructions. In FIG. 9, high voltage rail 260 and low voltage rail 262 are switched to power rail 272 of pipeline stage 270 under control of stall signal 268 as applied directly to transistor 264 and through diode 271 via line 267 to transistor 265. Pipeline stage 270 includes logic 274 for executing the instruction applied to stage 270, the output of which is fed on line 275 to step up circuit 276, such as buffer 181 illustrated in FIG. 7, and thence to output 277.
- It is an advantage of the invention that there is provided an improved system and method of chip powering.
- It is a further advantage of the invention that there is provided an improved system and method of chip powering which increases processing speed while reducing power consumption.
- It is a further advantage of the invention that there is provided an improved system and method of chip powering where individual execution units are provided selectable power to increase processing speed in critical units while reducing power in less critical units without delaying completion of processing in the critical units.
- It is a further advantage of the invention that there is provided a system and method for adjusting the voltage of a system for non-critical operations, slowing only instructions that have extra time to complete due to stalls or other hazards, thereby providing power savings without degrading system performance.
- It will be appreciated that, although specific embodiments of the invention have been described herein for purposes of illustration, various modifications may be made without departing from the spirit and scope of the invention. In particular, it is within the scope of the invention to provide a program storage or memory device such as a solid or fluid transmission medium, magnetic or optical wire, tape or disc, or the like, for storing signals readable by a machine for controlling the operation of a computer according to the method of the invention and/or to structure its components in accordance with the system of the invention.
- Accordingly, the scope of protection of this invention is limited only by the following claims and their equivalents.
Claims (23)
1. A method for selectively powering a plurality of execution units, comprising the steps of:
selectively operating said plurality of execution units in parallel; and
selectively powering a first execution unit from a first one of a plurality of power supplies.
2. The method of claim 1 , said selectively powering step being based upon expected time to completion of processing within said first execution unit.
3. The method of claim 2 , further comprising the step of decoding an instruction to be processed to determine said expected time to completion.
4. The method of claim 1 , further comprising the step of selectively powering a second execution unit from a second one of said plurality of power supplies.
5. The method of claim 4 , wherein said second one of said plurality of power supplies provides maximum available voltage, said first one of said plurality of power supplies provides less than said maximum available voltage, and further comprising the step of:
selecting said first execution unit for powering by said first one of said plurality of power supplies when the expected time to completion of processing within said first execution unit is substantially less than the expected time to completion of processing within said second execution unit.
6. The method of claim 4 , wherein said second one of said plurality of power supplies provides maximum available voltage, said first one of said plurality of power supplies provides less than said maximum available voltage, and further comprising the step of:
selecting said first one of said plurality of power supplies for powering said first execution unit responsive to said second execution unit executing an operation.
7. The method of claim 6, wherein said first execution unit executes simple operations and said second execution unit executes complex operations.
8. The method of claim 7, wherein said first execution unit executes add instructions and wherein said second execution unit executes multiply instructions.
9. The method of claim 1, comprising the further step of:
stepping up the voltage from a lesser voltage applied to the input of said first execution unit to a higher voltage at the output of said first execution unit.
10. The method of claim 1, comprising the further step of:
selectively powering said first execution unit responsive to a stall signal.
11. The method of claim 4, comprising the further step of operating said first execution unit and said second execution unit asynchronously.
12. The method of claim 4, comprising the further step of operating said first execution unit and said second execution unit synchronously.
13. The method of claim 1, comprising the further step of switching power to said first execution unit during idle time.
14. The method of claim 1, comprising the further step of switching power to said first execution unit during operation.
15. The method of claim 1, said selectively powering step including the step of predicting pipeline hazards to intelligently control power switching.
16. System for selectively powering a plurality of execution units, comprising:
a plurality of power sources;
means for selectively operating said plurality of execution units in parallel; and
means for selectively powering a first execution unit from a first one of said plurality of power sources.
17. A selective power system, comprising:
a plurality of power sources;
a plurality of execution units;
a first switch for gating power to a first execution unit selectively from a first power source and a second power source; and
a circuit for providing power to a second execution unit from said second power source.
18. The selective power system of claim 17, wherein said second power source provides maximum voltage and said first power source provides less than maximum voltage, said selective power system further comprising:
an instruction decoder for determining the expected time to completion of a plurality of instructions for parallel execution;
said instruction decoder being operable for gating a first instruction to said first execution unit and a second instruction to said second execution unit; and
said first switch being operable to gate power to said first execution unit from said first power source when said first execution unit is executing a first instruction which takes substantially less time to completion than said second instruction.
19. The selective power system of claim 17, wherein said second one of said plurality of power supplies provides maximum available voltage, said first one of said plurality of power supplies provides less than said maximum available voltage, and wherein said first switch is operable to gate power to said first execution unit when said second execution unit is executing an instruction.
20. The selective power system of claim 16, further comprising:
a voltage step-up circuit for stepping up the voltage supplied to said first execution unit from said first power source to the voltage level of said second power source at the output of said first execution unit.
21. The selective power system of claim 17, wherein said first execution unit executes simple operations and said second execution unit executes complex operations.
22. A program storage device readable by a machine, tangibly embodying a program of instructions executable by a machine to perform method steps for selectively powering a plurality of execution units, said method steps comprising:
selectively operating said plurality of execution units in parallel; and
selectively powering a first execution unit from a first one of a plurality of power supplies.
23. An article of manufacture comprising:
a computer useable medium having computer readable program code means embodied therein for selectively powering a plurality of execution units, the computer readable program means in said article of manufacture comprising:
computer readable program code means for causing a computer to effect selectively operating said plurality of execution units in parallel; and
computer readable program code means for causing a computer to effect selectively powering a first execution unit from a first one of a plurality of power supplies.
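As an informal illustration only (not part of the claims and not the patentee's implementation), the selection recited in claims 2, 3 and 5 can be sketched in Python. The latency table, the supply labels `LOW` and `MAX`, the function names, and the `ratio` threshold used to stand in for "substantially less" are all assumptions introduced for this sketch.

```python
from dataclasses import dataclass

# Hypothetical expected times to completion, in cycles (illustrative values only).
EXPECTED_CYCLES = {"add": 1, "mul": 4}

LOW = "first supply (less than maximum voltage)"
MAX = "second supply (maximum available voltage)"


@dataclass
class Instruction:
    opcode: str


def expected_time(instr: Instruction) -> int:
    """Decode step (cf. claim 3): look up the expected time to completion."""
    return EXPECTED_CYCLES[instr.opcode]


def select_supply_for_first_unit(first: Instruction,
                                 second: Instruction,
                                 ratio: float = 2.0) -> str:
    """Selection step (cf. claim 5), under an assumed threshold.

    The first unit is powered from the lower-voltage supply only when its
    instruction is expected to finish substantially sooner than the one in
    the second unit; 'substantially' is modeled here as a factor of `ratio`.
    """
    if expected_time(first) * ratio <= expected_time(second):
        return LOW
    return MAX
```

With these assumed latencies, an add issued alongside a multiply would run on the reduced supply, while two instructions of equal latency would both stay on the maximum supply.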
Priority Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/832,064 US6457131B2 (en) | 1999-01-11 | 2001-04-10 | System and method for power optimization in parallel units |
Applications Claiming Priority (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/228,884 US6289465B1 (en) | 1999-01-11 | 1999-01-11 | System and method for power optimization in parallel units |
US09/832,064 US6457131B2 (en) | 1999-01-11 | 2001-04-10 | System and method for power optimization in parallel units |
Related Parent Applications (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/228,884 Division US6289465B1 (en) | 1999-01-11 | 1999-01-11 | System and method for power optimization in parallel units |
Publications (2)
Publication Number | Publication Date |
---|---|
US20020002664A1 (en) | 2002-01-03 |
US6457131B2 (en) | 2002-09-24 |
Family ID: 22858934
Family Applications (2)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/228,884 Expired - Lifetime US6289465B1 (en) | 1999-01-11 | 1999-01-11 | System and method for power optimization in parallel units |
US09/832,064 Expired - Lifetime US6457131B2 (en) | 1999-01-11 | 2001-04-10 | System and method for power optimization in parallel units |
Family Applications Before (1)
Application Number | Priority Date | Filing Date | Title |
---|---|---|---|
US09/228,884 Expired - Lifetime US6289465B1 (en) | 1999-01-11 | 1999-01-11 | System and method for power optimization in parallel units |
Country Status (1)
Country | Link |
---|---|
US (2) | US6289465B1 (en) |
Families Citing this family (29)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US7509486B1 (en) * | 1999-07-08 | 2009-03-24 | Broadcom Corporation | Encryption processor for performing accelerated computations to establish secure network sessions connections |
US6738915B1 (en) * | 2000-05-12 | 2004-05-18 | Sun Microsystems, Inc. | System for supplying multiple voltages to devices on circuit board through a sequencing in a predictable sequence |
JP2002009864A (en) * | 2000-06-20 | 2002-01-11 | Sony Corp | Control method and communication equipment |
US20070245165A1 (en) | 2000-09-27 | 2007-10-18 | Amphus, Inc. | System and method for activity or event based dynamic energy conserving server reconfiguration |
US7032119B2 (en) | 2000-09-27 | 2006-04-18 | Amphus, Inc. | Dynamic power and workload management for multi-server system |
US20030196126A1 (en) | 2002-04-11 | 2003-10-16 | Fung Henry T. | System, method, and architecture for dynamic server power management and dynamic workload management for multi-server environment |
DE10162309A1 (en) * | 2001-12-19 | 2003-07-03 | Philips Intellectual Property | Method and arrangement for increasing the security of circuits against unauthorized access |
US6992405B2 (en) * | 2002-03-11 | 2006-01-31 | Intel Corporation | Dynamic voltage scaling scheme for an on-die voltage differentiator design |
US7100060B2 (en) * | 2002-06-26 | 2006-08-29 | Intel Corporation | Techniques for utilization of asymmetric secondary processing resources |
US7197629B2 (en) * | 2002-11-22 | 2007-03-27 | Sun Microsystems, Inc. | Computing overhead for out-of-order processors by the difference in relative retirement times of instructions |
US6996665B2 (en) * | 2002-12-30 | 2006-02-07 | International Business Machines Corporation | Hazard queue for transaction pipeline |
US7012459B2 (en) * | 2003-04-02 | 2006-03-14 | Sun Microsystems, Inc. | Method and apparatus for regulating heat in an asynchronous system |
IL163757A0 (en) * | 2003-04-15 | 2005-12-18 | Nds Ltd | Secure time element |
US20040225868A1 (en) * | 2003-05-07 | 2004-11-11 | International Business Machines Corporation | An integrated circuit having parallel execution units with differing execution latencies |
ATE554443T1 (en) * | 2003-06-25 | 2012-05-15 | Koninkl Philips Electronics Nv | INSTRUCTION-DRIVEN DATA PROCESSING DEVICE AND METHOD |
US20050102114A1 (en) * | 2003-11-12 | 2005-05-12 | Intel Corporation | System and method for determining processor utilization |
JP4246141B2 (en) * | 2004-03-22 | 2009-04-02 | シャープ株式会社 | Data processing device |
US20060200651A1 (en) * | 2005-03-03 | 2006-09-07 | Collopy Thomas K | Method and apparatus for power reduction utilizing heterogeneously-multi-pipelined processor |
US8108863B2 (en) * | 2005-12-30 | 2012-01-31 | Intel Corporation | Load balancing for multi-threaded applications via asymmetric power throttling |
US7725745B2 (en) * | 2006-12-19 | 2010-05-25 | Intel Corporation | Power aware software pipelining for hardware accelerators |
FR2916288B1 (en) * | 2007-05-18 | 2009-08-21 | Commissariat Energie Atomique | DEVICE FOR SUPPLYING AN ELECTRONIC CIRCUIT AND ELECTRONIC CIRCUIT |
JP2009048264A (en) * | 2007-08-14 | 2009-03-05 | Oki Electric Ind Co Ltd | Semiconductor integrated circuit device |
US20090055636A1 (en) * | 2007-08-22 | 2009-02-26 | Heisig Stephen J | Method for generating and applying a model to predict hardware performance hazards in a machine instruction sequence |
KR100934215B1 (en) * | 2007-10-29 | 2009-12-29 | 한국전자통신연구원 | Microprocessor based on event handling instruction set and event processing method using the same |
KR20090085944A (en) * | 2008-02-05 | 2009-08-10 | 삼성전자주식회사 | Processors and Semiconductor Devices Reduce Power Consumption |
US9117511B2 (en) * | 2013-03-08 | 2015-08-25 | Advanced Micro Devices, Inc. | Control circuits for asynchronous circuits |
US9372520B2 (en) | 2013-08-09 | 2016-06-21 | Globalfoundries Inc. | Reverse performance binning |
US11093401B2 (en) | 2014-03-11 | 2021-08-17 | Ampere Computing Llc | Hazard prediction for a group of memory access instructions using a buffer associated with branch prediction |
US20170090957A1 (en) * | 2015-09-25 | 2017-03-30 | Greg Sadowski | Performance and energy efficient compute unit |
Family Cites Families (12)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
EP0368144B1 (en) | 1988-11-10 | 1996-02-07 | Motorola, Inc. | Digital computing system with low power mode |
DE69033149T2 (en) | 1989-06-30 | 1999-11-18 | Fujitsu Personal Systems, Inc. | Clock system |
US5220671A (en) | 1990-08-13 | 1993-06-15 | Matsushita Electric Industrial Co., Ltd. | Low-power consuming information processing apparatus |
JP3237926B2 (en) | 1991-12-04 | 2001-12-10 | シャープ株式会社 | Power control device for digital electronic device, processing device provided with the power control device, and power management system for digital electronic device provided with the processing device |
US5339445A (en) | 1992-11-16 | 1994-08-16 | Harris Corporation | Method of autonomously reducing power consumption in a computer sytem by compiling a history of power consumption |
EP0722584B1 (en) * | 1993-10-04 | 2001-06-13 | Elonex I.P. Holdings Limited | Method and apparatus for an optimized computer power supply system |
US5481733A (en) * | 1994-06-15 | 1996-01-02 | Panasonic Technologies, Inc. | Method for managing the power distributed to a disk drive in a laptop computer |
JP3109387B2 (en) * | 1994-09-26 | 2000-11-13 | キヤノン株式会社 | Magnetic head drive circuit |
US5613130A (en) * | 1994-11-10 | 1997-03-18 | Vadem Corporation | Card voltage switching and protection |
US5745375A (en) | 1995-09-29 | 1998-04-28 | Intel Corporation | Apparatus and method for controlling power usage |
US5958041A (en) * | 1997-06-26 | 1999-09-28 | Sun Microsystems, Inc. | Latency prediction in a pipelined microarchitecture |
US6304978B1 (en) * | 1998-11-24 | 2001-10-16 | Intel Corporation | Method and apparatus for control of the rate of change of current consumption of an electronic component |
Cited By (11)
Publication number | Priority date | Publication date | Assignee | Title |
---|---|---|---|---|
US20030233593A1 (en) * | 2002-06-18 | 2003-12-18 | Atul Kalambur | Reduced verification complexity and power saving features in a pipelined integated circuit |
WO2003107161A3 (en) * | 2002-06-18 | 2004-08-05 | Sun Microsystems Inc | Reduced verification complexity and power saving features in a pipelined integrated circuit |
US6954865B2 (en) | 2002-06-18 | 2005-10-11 | Sun Microsystems, Inc. | Reduced verification complexity and power saving features in a pipelined integrated circuit |
US20060206737A1 (en) * | 2005-03-14 | 2006-09-14 | Samsung Electronics Co., Ltd. | Processor with variable wake-up and sleep latency and method for managing power therein |
GB2424499A (en) * | 2005-03-14 | 2006-09-27 | Samsung Electronics Co Ltd | Method for managing the power consumption of a processor with variable wake-up and sleep latencies |
GB2424499B (en) * | 2005-03-14 | 2007-05-16 | Samsung Electronics Co Ltd | Processor with variable wake-up and sleep latency and method for managing power therein |
US7571335B2 (en) | 2005-03-14 | 2009-08-04 | Samsung Electronics Co., Ltd. | Processor with variable wake-up and sleep latency and method for managing power therein |
US10950299B1 (en) | 2014-03-11 | 2021-03-16 | SeeQC, Inc. | System and method for cryogenic hybrid technology computing and memory |
US11406583B1 (en) | 2014-03-11 | 2022-08-09 | SeeQC, Inc. | System and method for cryogenic hybrid technology computing and memory |
US11717475B1 (en) | 2014-03-11 | 2023-08-08 | SeeQC, Inc. | System and method for cryogenic hybrid technology computing and memory |
JP2019521454A (en) * | 2016-07-21 | 2019-07-25 | Advanced Micro Devices Incorporated | Control the operating speed of asynchronous pipeline stages |
Also Published As
Publication number | Publication date |
---|---|
US6457131B2 (en) | 2002-09-24 |
US6289465B1 (en) | 2001-09-11 |
Similar Documents
Publication | Title | Publication Date |
---|---|---|
US6289465B1 (en) | System and method for power optimization in parallel units | |
US7076681B2 (en) | Processor with demand-driven clock throttling power reduction | |
US5420808A (en) | Circuitry and method for reducing power consumption within an electronic circuit | |
US7334143B2 (en) | Computer power conservation apparatus and method that enables less speculative execution during light processor load based on a branch confidence threshold value | |
US7571342B2 (en) | Processor system, instruction sequence optimization device, and instruction sequence optimization program | |
Li et al. | Deterministic clock gating for microprocessor power reduction | |
US7657766B2 (en) | Apparatus for an energy efficient clustered micro-architecture | |
US6728866B1 (en) | Partitioned issue queue and allocation strategy | |
US6946869B2 (en) | Method and structure for short range leakage control in pipelined circuits | |
Li et al. | DCG: Deterministic clock-gating for low-power microprocessor design | |
JP2004334872A (en) | Integrated circuit device comprising unit power adjustment mechanism | |
Efthymiou et al. | Adaptive pipeline depth control for processor power-management | |
Ratković et al. | An overview of architecture-level power-and energy-efficient design techniques | |
JP4129345B2 (en) | Control of multiple equivalent functional units for power reduction | |
US6907534B2 (en) | Minimizing power consumption in pipelined circuit by shutting down pipelined circuit in response to predetermined period of time having expired | |
CN100590592C (en) | Processor and its instruction issuing method | |
Nielsen et al. | A low-power asynchronous data-path for a FIR filter bank | |
US7487374B2 (en) | Dynamic power and clock-gating method and circuitry with sleep mode based on estimated time for receipt of next wake-up signal | |
Buyuktosunoglu et al. | An oldest-first selection logic implementation for non-compacting issue queues [microprocessor power reduction] | |
EP1658560B1 (en) | Processor with demand-driven clock throttling for power reduction | |
JP2000353188A (en) | Low power consumption calculation circuit for common case for high performance/low power vlsi design | |
Ravi et al. | Recycling data slack in out-of-order cores | |
US20040225868A1 (en) | An integrated circuit having parallel execution units with differing execution latencies | |
Huang et al. | On signal-gating schemes for low-power adders | |
Lotfi-Kamran et al. | Stall power reduction in pipelined architecture processors |
Legal Events
Code | Title | Description |
---|---|---|
STCF | Information on status: patent grant | Free format text: PATENTED CASE |
FPAY | Fee payment | Year of fee payment: 4 |
FPAY | Fee payment | Year of fee payment: 8 |
FEPP | Fee payment procedure | Free format text: PAYOR NUMBER ASSIGNED (ORIGINAL EVENT CODE: ASPN); ENTITY STATUS OF PATENT OWNER: LARGE ENTITY |
AS | Assignment | Owner name: INTEL CORPORATION, CALIFORNIA; Free format text: ASSIGNMENT OF ASSIGNORS INTEREST; ASSIGNOR: INTERNATIONAL BUSINESS MACHINES CORPORATION; REEL/FRAME: 030228/0415; Effective date: 2013-04-08 |
FPAY | Fee payment | Year of fee payment: 12 |