WO2022223881A1 - A method for increase of energy efficiency through leveraging fault tolerant algorithms into undervolted digital systems
- Publication number
- WO2022223881A1 (PCT/FI2022/050258)
- Authority
- WO
- WIPO (PCT)
- Prior art keywords
- voltage
- matrix accelerator
- matrix
- processing system
- abft
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Ceased
Classifications
- 
        - G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0766—Error or fault reporting or storing
- G06F11/0772—Means for error signaling, e.g. using interrupts, exception flags, dedicated error registers
 
- 
        - G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F1/00—Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
- G06F1/26—Power supply means, e.g. regulation thereof
- G06F1/32—Means for saving power
- G06F1/3203—Power management, i.e. event-based initiation of a power-saving mode
- G06F1/3234—Power saving characterised by the action undertaken
- G06F1/3296—Power saving characterised by the action undertaken by lowering the supply or operating voltage
 
- 
        - G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F1/00—Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
- G06F1/26—Power supply means, e.g. regulation thereof
- G06F1/28—Supervision thereof, e.g. detecting power-supply failure by out of limits supervision
 
- 
        - G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F1/00—Details not covered by groups G06F3/00 - G06F13/00 and G06F21/00
- G06F1/26—Power supply means, e.g. regulation thereof
- G06F1/32—Means for saving power
- G06F1/3203—Power management, i.e. event-based initiation of a power-saving mode
- G06F1/3234—Power saving characterised by the action undertaken
- G06F1/324—Power saving characterised by the action undertaken by lowering clock frequency
 
- 
        - G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/004—Error avoidance
 
- 
        - G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/30—Monitoring
- G06F11/3003—Monitoring arrangements specially adapted to the computing system or computing system component being monitored
 
- 
        - G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F17/00—Digital computing or data processing equipment or methods, specially adapted for specific functions
- G06F17/10—Complex mathematical operations
- G06F17/16—Matrix or vector computation, e.g. matrix-matrix or matrix-vector multiplication, matrix factorization
 
- 
        - G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06N—COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
- G06N3/00—Computing arrangements based on biological models
- G06N3/02—Neural networks
- G06N3/06—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons
- G06N3/063—Physical realisation, i.e. hardware implementation of neural networks, neurons or parts of neurons using electronic means
 
- 
        - H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03K—PULSE TECHNIQUE
- H03K19/00—Logic circuits, i.e. having at least two inputs acting on one output; Inverting circuits
- H03K19/0008—Arrangements for reducing power consumption
 
- 
        - H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03K—PULSE TECHNIQUE
- H03K19/00—Logic circuits, i.e. having at least two inputs acting on one output; Inverting circuits
- H03K19/003—Modifications for increasing the reliability for protection
 
- 
        - H—ELECTRICITY
- H03—ELECTRONIC CIRCUITRY
- H03K—PULSE TECHNIQUE
- H03K3/00—Circuits for generating electric pulses; Monostable, bistable or multistable circuits
- H03K3/02—Generators characterised by the type of circuit or by the means used for producing pulses
- H03K3/027—Generators characterised by the type of circuit or by the means used for producing pulses by the use of logic circuits, with internal or external positive feedback
- H03K3/037—Bistable circuits
- H03K3/0375—Bistable circuits provided with means for increasing reliability; for protection; for ensuring a predetermined initial state when the supply voltage has been applied; for storing the actual state when the supply voltage fails
 
Definitions
- ABFT: Algorithm-Based Fault Tolerance
- TED: Timing Error Detection
- US 2012/0221884 discusses an error management method, taking both HW and some SW into account.
- this disclosure provides error management across hardware and software layers to enable hardware and software to deliver reliable operation in the face of errors and hardware variation due to aging, manufacturing tolerances, etc.
- an error management module is provided that gathers information from the hardware and software layers, and detects and diagnoses errors.
- a hardware or software recovery technique may be selected to provide efficient operation, and, in some embodiments, the hardware device may be reconfigured to prevent future errors and to permit the hardware device to operate despite a permanent error.
- a cost- and energy-efficient, yet properly working accelerator device is thus desired to be produced.
- the present invention introduces an integrated circuit for reliable low-voltage operation of a matrix accelerator processing system (10), which matrix accelerator processing system (10) comprises a matrix accelerator (11), wherein the matrix accelerator processing system (10) is configured to operate systolic arrays.
- the integrated circuit is characterized in that the matrix accelerator (11) is configured to output data to an Algorithm-Based Fault Tolerance (ABFT) based error detection module (12), which is applied to compute checksums and detect errors online within the systolic array; the ABFT-based error detection module (12) is configured to forward possible detected errors to a dynamic voltage and frequency controlling module (13), which is configured to reduce an operational voltage of the matrix accelerator (11) in case of no detected errors, and respectively, to increase an operational voltage of the matrix accelerator (11) in case of at least one detected error; the matrix accelerator processing system (10) is configured to reperform the ABFT based error detection for the matrix accelerator (11) output data with an adjusted operational voltage; and the matrix accelerator processing system (10) is configured to find a lowest operational voltage where the number of the detected errors is zero.
- the matrix accelerator processing system (10) is configured to keep the clock (14) frequency unchanged throughout the adjustment process by the dynamic voltage and frequency controlling module (13).
- the matrix accelerator processing system (10) is implemented in form of an ASIC or an FPGA.
- the matrix accelerator processing system (10) is configured to either lower or increase the operational voltage by the dynamic voltage and frequency controlling module (13) by a predetermined voltage step.
- a voltage controller (15) is configured to perform the voltage adjustments for the matrix accelerator (11).
- the input data and output data of the matrix accelerator (11) are augmented for Algorithm-Based Fault Tolerance error detection.
- the matrix accelerator processing system (10) does not require roundtrips for the data from the matrix accelerator (11) to a memory and back to the matrix accelerator (11).
- a second aspect of the present invention introduces a method for reliable low-voltage operation of a matrix accelerator processing system (10), which matrix accelerator processing system (10) comprises a matrix accelerator (11), wherein the method comprises the step of operating systolic arrays in the matrix accelerator processing system (10).
- the method is characterized in that it further comprises the steps of: outputting data from the matrix accelerator (11) to an Algorithm-Based Fault Tolerance (ABFT) based error detection module (12), which is applied to compute checksums and detect errors online within the systolic array; forwarding possible detected errors from the ABFT-based error detection module (12) to a dynamic voltage and frequency controlling module (13), which is configured to reduce an operational voltage of the matrix accelerator (11) in case of no detected errors, and respectively, to increase an operational voltage of the matrix accelerator (11) in case of at least one detected error; reperforming the ABFT based error detection for the matrix accelerator (11) output data with an adjusted operational voltage by the matrix accelerator processing system (10); and finding, by the matrix accelerator processing system (10), a lowest operational voltage where the number of the detected errors is zero.
- the matrix accelerator processing system (10) is configured to keep the clock (14) frequency unchanged throughout the adjustment process by the dynamic voltage and frequency controlling module (13).
- the matrix accelerator processing system (10) is implemented in form of an ASIC or an FPGA.
- the matrix accelerator processing system (10) is configured to either lower or increase the operational voltage by the dynamic voltage and frequency controlling module (13) by a predetermined voltage step.
- a voltage controller (15) is configured to perform the voltage adjustments for the matrix accelerator (11).
- the input data and output data of the matrix accelerator (11) are augmented for Algorithm-Based Fault Tolerance error detection.
- the matrix accelerator processing system (10) does not require roundtrips for the data from the matrix accelerator (11) to a memory and back to the matrix accelerator (11).
- a third aspect of the present invention introduces a computer program product for reliable low-voltage operation of a matrix accelerator processing system (10), which matrix accelerator processing system (10) comprises a matrix accelerator (11), wherein the matrix accelerator processing system (10) is configured to operate systolic arrays.
- the computer program product is characterized in that the computer program product comprises program code which is executable by a processor, wherein the computer program product is configured to execute the steps of: outputting data from the matrix accelerator (11) to an Algorithm-Based Fault Tolerance (ABFT) based error detection module (12), which is applied to compute checksums and detect errors online within the systolic array; forwarding possible detected errors from the ABFT-based error detection module (12) to a dynamic voltage and frequency controlling module (13), which is configured to reduce an operational voltage of the matrix accelerator (11) in case of no detected errors, and respectively, to increase an operational voltage of the matrix accelerator (11) in case of at least one detected error; reperforming the ABFT based error detection for the matrix accelerator (11) output data with an adjusted operational voltage by the matrix accelerator processing system (10); and finding, by the matrix accelerator processing system (10), a lowest operational voltage where the number of the detected errors is zero.
- FIG. 1 illustrates a simplified concept of Timing-Error-Detection circuit in the prior art. Shadow register is clocked by a delayed version of original clock.
- FIG. 2 illustrates a checksum row and column which help to detect the errors.
- FIG. 3 illustrates a checksum row and column which help to detect the errors and correct up to one error per row/column.
- FIG. 4 illustrates a co-simulation paradigm of reduced voltage processing platform.
- FIG. 5 illustrates a processing element of the systolic array.
- FIG. 6 illustrates an energy efficient point which can be achieved using error detections.
- FIG. 7 illustrates in the top: Error rates of PE output bits. In the bottom: Total error and silent error rates for 32 point row-column and 32-by-32 matrix multiplication.
- the horizontal axis represents clock period variations (OT, nanoseconds).
- FIG. 8 illustrates that ABFT detects errors in different operation points and temperatures.
- FIG. 9 discloses a main block diagram according to an embodiment of the present invention.
- the present invention introduces a solution which achieves substantial energy benefits from reduced voltage operation in a matrix accelerator.
- the matrix accelerator may be a matrix multiplier designed as a systolic array.
- a main embodiment of the present invention is discussed at first.
- in this embodiment, it is proposed to adjust the voltage and frequency according to errors detected at the data output of the computing logic.
- an ABFT scheme [3] is integrated into the design. The objective is to enable pushing the supply voltage down, while maintaining reliability with minimal overheads.
- Our contribution in the embodiments of the present invention is demonstrating an ABFT solution for error detection in matrix multiplication targeted systolic array and verifying its performance.
- the objective is to support sustained operation at reduced voltage compensating, e.g., for temperature dependency of the threshold voltage.
- ABFT was originally conceived by Huang and Abraham in [3]. They initially described a low-overhead technique to detect and correct computational errors striking multiplication operations. Subsequently, ABFT was extended to other linear algebra operations, including transposition, QR decomposition, FFT, and 2-D convolutions [3].
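- a row checksum matrix $A_r$ is defined as an $N \times (N+1)$ matrix whose last column holds the row sums: $A_r = [\, A \mid A\mathbf{e} \,]$, where $\mathbf{e}$ is the all-ones column vector of length $N$.
- the column checksum matrix $A_c$ and the full checksum matrix $A_f$ are defined analogously. The checksums can be used to detect errors: instead of multiplying the plain matrices $A_{N \times N}$ and $B_{N \times N}$, their checksum-augmented versions are multiplied.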
- the array consists of a grid of identical Processing Elements (PE) that each perform multiply-accumulate (MAC) operation on data received from the adjacent top and left PEs and pass the result or the input data to the next neighbouring PEs on the right and below.
- PE Processing Elements
- MAC multiply-accumulate
- the architecture exhibits a high degree of parallelism and reduces memory bandwidth requirements. It should be noticed that the size of the array doesn't need to match the one of the matrices to be multiplied, as the calculations can be broken down to sub-matrices.
- ABFT is merged with an N x N systolic array structure by adding a column of N PEs for checksums and another column consisting of N digital integrators and comparators for error detection.
- ABFT has an overhead rate of O(1/N) compared to the O(N). Moreover, ABFT can be adapted to operate on-the-fly.
- the co-simulation model was used to investigate the potential of ABFT in operating a systolic array at reduced voltages.
- the scheme in our simulations was to keep the clock frequency and reduce the operating voltage until errors start appearing. Then, the clock frequency was reduced by predetermined steps until the errors disappear.
- Matrix size 32-by-32 was selected due to its application relevance.
- the dots in the upper part of Fig. 6 show the power dissipation of the ABFT logic augmented systolic array when the operating voltage is gradually reduced by 0.01 V steps, and the frequency is adjusted after error detection.
- the recalculations can be observed as momentary power dissipation peaks taking place after frequency drop.
- Fig. 6 shows the correct throughput rates ("goodput") for the 32-by-32 matrix multiplication systolic array with ABFT logic.
- the silent error rate or the share of errors that can find their way into the output without being detected is a relevant parameter for applications. It can be impacted by the design of error detection logic that in our case is fairly sensitive.
- Fig. 7 shows the error rate for different variations and the error rate in the final output of a row of PEs.
- V_dd: 0.7 V
- clock frequency f_clk: 280 MHz
- a matrix accelerator processing system 10 is illustrated, and it comprises a matrix accelerator 11.
- the matrix accelerator processing system 10 is configured to operate systolic arrays (see the Input and Output).
- the matrix accelerator 11 is configured to output data to an Algorithm-Based Fault Tolerance (ABFT) based error detection module 12, which is applied to compute checksums and detect errors online within the systolic array. Thereafter, the ABFT-based error detection module 12 is configured to forward possible detected errors (i.e.
- a dynamic voltage and frequency controlling module 13 which is configured to reduce an operational voltage of the matrix accelerator 11 in case of no detected errors, and respectively, to increase an operational voltage of the matrix accelerator 11 in case of at least one detected error.
- the matrix accelerator processing system 10 is configured to reperform the ABFT based error detection for the matrix accelerator 11 output data with an adjusted operational voltage. In this way, the matrix accelerator processing system 10 is configured to find a lowest operational voltage where the number of the detected errors is zero.
- the dynamic voltage and frequency controlling module 13 is capable of controlling the clock 14 frequency, and the voltage V_dd (i.e. the operational voltage of the matrix accelerator 11) via the voltage controller 15.
- the matrix accelerator processing system 10 is configured to keep the clock 14 frequency unchanged throughout the adjustment process by the dynamic voltage and frequency controlling module 13.
- the proposed ABFT approach enables removing the extra voltage guard bands determined by the circuit manufacturers, without performance compromises. Furthermore, efficient near-threshold operation points can be reached when the clock rate is adjusted as well. Another interesting use case is approximate computing applications, where the solution can be used to assure the error probabilities are confined within certain bounds.
- the invented concept of applying the ABFT in systolic arrays for low voltage has been implemented in a real device (an FPGA) for demonstration purposes, instead of being simulated.
- the focus is on the energy savings achievable through voltage reductions.
- Matrix multiplication has been selected as an example case due to being the core of key operations for wireless communications and deep neural networks.
- the matrix multiplier design used on the FPGA is a run-of-the-mill scheme generated using a vendor provided HLS tool.
- the disclosed ABFT based voltage control scheme is independent of architectural optimizations, and can therefore be used with any matrix multiplier design. Implementations of other algorithms can be foreseen to benefit from matching ABFT schemes. Designing chips to operate at optimum near-threshold points is notoriously difficult, due to exacerbated process parameter and temperature dependencies that need to be modeled. The demonstrated ABFT based error feedback mechanism, however, is an alternative adaptive approach that fits applications where occasional errors can be tolerated.
- the invented concept comprises an integrated circuit, a method and a computer program product for reliable low-voltage operation of a matrix accelerator processing system.
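The voltage adjustment described above (lower the voltage while no errors are detected, raise it after a detection, repeat the check at the adjusted voltage) can be sketched as a simple search loop. This is a minimal illustration, not the claimed implementation: `find_lowest_error_free_voltage`, the `run_and_count_errors` callback and the toy detector with its 0.80 V threshold are all hypothetical, and the clock frequency is assumed to stay fixed.

```python
def find_lowest_error_free_voltage(run_and_count_errors, v_start, v_min, v_step):
    """Step the supply voltage down while the detector reports zero errors.

    run_and_count_errors(v) is a hypothetical callback: it runs the workload at
    voltage v (clock frequency unchanged) and returns the number of errors
    reported by the ABFT-based error detection.
    """
    # Count in integer steps so repeated subtraction cannot accumulate float drift.
    max_steps = int(round((v_start - v_min) / v_step))

    def voltage_at(k):
        return round(v_start - k * v_step, 6)

    if run_and_count_errors(voltage_at(0)) > 0:
        raise RuntimeError("errors detected at the starting voltage")

    lowest_ok = 0
    for k in range(1, max_steps + 1):
        if run_and_count_errors(voltage_at(k)) > 0:
            break  # at least one error: go back up to the last error-free step
        lowest_ok = k
    return voltage_at(lowest_ok)


def toy_detector(v):
    # Assumed toy behaviour for illustration only, not measured data:
    # errors appear as soon as the voltage drops below 0.80 V.
    return 0 if v >= 0.80 else 3


v_opt = find_lowest_error_free_voltage(toy_detector, v_start=1.00, v_min=0.60, v_step=0.01)
```

In a real system the callback would also have to trigger recalculation of the affected results, and the search would typically be repeated as temperature changes.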
Abstract
The present invention relates to an approach, which is proposed to achieve energy savings from reduced voltage operation. The solution detects timing-errors by integrating Algorithm Based Fault Tolerance (ABFT) into a digital architecture. The approach has been studied with a systolic array matrix multiplier operating at reduced voltage, detecting errors on-the-fly avoiding energy demanding memory round-trips. The analysis of the solution has been done using analog-digital co-simulation to extract the transient behavior under different voltages and clock frequencies. HSPICE simulations using 90nm CMOS transistor models, and experiments by reducing operation voltage of a Xilinx Zynq FPGA device were carried out. Scaling down the voltage in FPGA experiments from nominal by 22% was registered without incurring one single error translating into 1.8x increase in energy-efficiency. HSPICE simulations showed possibility of 10x increase in energy-efficiency by approaching near-threshold region.
  Description
A METHOD FOR INCREASE OF ENERGY EFFICIENCY THROUGH LEVERAGING FAULT TOLERANT ALGORITHMS INTO UNDERVOLTED DIGITAL SYSTEMS 
    Technical field 
    The present invention is related to digital circuits and systems, systolic arrays and error detection in such circuits and systems. 
    Background 
    With the ultra-densification of wireless infrastructure through 5G technologies, with aim already at 6G solutions, much of the computing is expected to take place in the "edge", at short latency from the sensor and actuator nodes. This development is fueled by the increasing reliance on machine learning applications. With disappearing data communications bandwidth and latency constraints "intelligence" can be implemented and coordinated in the edge computing resources. Neural inference and massive mu-MIMO (multi-user Multiple-Input-Multiple-Output) radios are among the technological enablers that both are computationally demanding and a challenge for energy-efficient digital design. 
    Power consumption of digital circuits has a quadratic relation to the supply voltage (Eq. 1). Therefore, a straightforward approach to gain in efficiency is to reduce the voltage. 
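    As a reference for the quadratic relation, a commonly used first-order model of dynamic (switching) power in CMOS logic is given below. It is assumed here to be the kind of relation meant; the symbols $\alpha$ (activity factor) and $C$ (switched capacitance) are not taken from the source.

```latex
P_{\mathrm{dyn}} = \alpha \, C \, V_{dd}^{2} \, f_{clk}

% At a fixed clock frequency, scaling V_dd by a factor k scales P_dyn by k^2:
\frac{P_{\mathrm{dyn}}(k\,V_{dd})}{P_{\mathrm{dyn}}(V_{dd})} = k^{2},
\qquad k = 0.78 \;\Rightarrow\; k^{2} \approx 0.61
```

    Under this model alone, a 22% voltage reduction at an unchanged clock cuts dynamic power to roughly 61% of nominal, i.e. about a 1.6x gain in energy per operation. Measured gains can differ because the model ignores, for example, leakage power.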
    
    The first proposal to operate digital logic near and below the threshold voltage of the transistors was in 1972. In theory, these operating regions offer potential to improve the energy efficiency by 10x to 20x, while in experiments improvements up to 8x have been demonstrated by adopting Near-Threshold voltage regimes. Although the nominal operating voltages have been reducing, recent research [1] has shown that the vendor specifications for off-the-shelf components such as Field-Programmable-Gate-Arrays (FPGA) and Graphics-Processing-Units (GPU) are pessimistic to accommodate for process variations, such as inconsistencies in transistor geometry, oxide thickness and doping. It has been demonstrated that the supply voltage can be scaled down by around 12%, 20%, and even 30% in case of CPUs, GPUs, and FPGAs, respectively. In most cases the voltage margin to where errors start appearing is large; up to 60% energy savings has been reported with FPGAs without observing one single fault and without performance loss. 
    Unfortunately, the impact of variations cannot be modeled deterministically in reduced voltage settings. This uncertainty discourages the manufacturers from adopting aggressive voltage reduction schemes to utilize near-threshold and sub-threshold regions. In reduced voltage settings the effects of process variations are exacerbated, even up to 100x differences between Fast-Fast (FF) and Slow-Slow (SS) process corners. A challenge is that lowering the supply voltage without setting the clock frequency to the optimum results in either loss of performance or loss of reliability. Too slow clocking results in the loss of energy efficiency and performance, while too fast clocking results in timing errors, hence loss of reliability. Reliable low-voltage implementations often require significant investments into development time and may reduce the fabrication yield [1,2]. 
    In the present invention, we propose an algorithm-dependent technique for the detection of errors from aggressive voltage reductions. Similar approaches, e.g., Error-Correction-Code (ECC) algorithms, are practical in detection and correction of memory errors. However, they are not applicable for error detection in data and control paths. Traditional fault-tolerance approaches such as Triple-Module-Redundancy (TMR) or Double-Module-Redundancy (DMR) and their different flavors, while providing error-resiliency, increase the gate-count and gate-activity substantially. This translates into fault-resiliency at the cost of energy-efficiency, while our approach is to achieve energy-efficiency through leveraging fault-tolerance. The proposed Algorithm Based Fault Tolerance (ABFT) approach is demonstrated through transistor and system level co-simulation and FPGA implementation of a systolic matrix multiplier. The solution enables utilizing reduced voltage operating regions at minor development overheads. The targets are low-power high-performance applications that allow for error correction by recalculation after voltage/frequency adjustment. 
    Prior Approaches 
    A simple technique to optimize the operation voltage is to embed the design with a delay chain, e.g., a sequence of inverters that mimics the timing behavior of the longest delay path in the circuit. While this enables taking into account global variations, supply voltage drops, and temperature fluctuations, local variations and cross-coupled noise cannot be effectively modeled. In practice, the longest delay path is stimulated less frequently than the shorter ones, rendering delay chain based voltage tuning either too optimistic or too pessimistic. This translates to loss of reliability or reduced efficiency gains, respectively. 
    Another state-of-the-art solution that is commercially used is to arm the critical paths with Timing Error Detection (TED) circuits [2], shown in Fig. 1. The TED method was originally proposed to mitigate susceptibility to ambient and internal variations to increase manufacturing yield, while in reduced voltage cases TED circuits aid in setting an improved operation point to maximise energy efficiency. Based on detections by TED circuits the voltage and clock frequency can be adjusted in an adaptive manner. Nonetheless, TED schemes are not readily applicable with off-the-shelf platforms and require significant development investment when adopted in custom ASIC designs. Being clock synchronized digital circuits, TEDs also add non-trivial power consumption overheads. 
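    The TED principle, a main register compared against a shadow register clocked by a delayed version of the same clock, can be illustrated with a toy timing model. This is a behavioural sketch only; the class name, the parameters and the numeric values are assumptions chosen for illustration and do not describe the circuit of [2].

```python
from dataclasses import dataclass


@dataclass
class TedFlipFlop:
    """Toy model of a main register plus a shadow register on a delayed clock."""
    clock_period_ns: float
    shadow_delay_ns: float  # hypothetical delay of the shadow clock

    def capture(self, old_value, new_value, arrival_ns):
        """Return (main, shadow, error_flag) for one clock cycle.

        arrival_ns is the time, measured from the launching clock edge,
        at which new_value becomes stable at the register input.
        """
        # The main register samples at the clock edge; late data leaves it with old_value.
        main = new_value if arrival_ns <= self.clock_period_ns else old_value
        # The shadow register samples later, so it tolerates data late by up to shadow_delay_ns.
        shadow_edge = self.clock_period_ns + self.shadow_delay_ns
        shadow = new_value if arrival_ns <= shadow_edge else old_value
        # A timing error is flagged when the two registers disagree.
        return main, shadow, main != shadow


ff = TedFlipFlop(clock_period_ns=3.5, shadow_delay_ns=0.5)
on_time = ff.capture(old_value=0, new_value=1, arrival_ns=3.0)   # data meets timing
late = ff.capture(old_value=0, new_value=1, arrival_ns=3.8)      # late, but inside the shadow window
too_late = ff.capture(old_value=0, new_value=1, arrival_ns=4.5)  # misses both registers
```

    In this model a path that is late by more than the shadow delay goes undetected, which shows why the detection window has to be dimensioned for the expected timing variation.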
    Due to increased impacts of process and ambient variations at lower voltages, deterministic operation may be possible only with performance or energy losses, that is, by lowering the clock frequency, or using error correction. Solutions such as shortening the critical paths or TED circuits are either costly in silicon real-estate or overly conservative. 
    Borrowing concepts from the supercomputing community, we propose a simple yet effective algorithm-level solution. It enables low-voltage integrated circuit design, manifesting sub-linear overhead from error detection in terms of power and circuit complexity. 
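    The algorithm-level idea can be illustrated in software with the checksum scheme of Huang and Abraham [3]: the inputs of a matrix product are extended with checksum rows and columns, and a corrupted output element breaks the checksum relations. The sketch below uses small integer matrices and plain Python lists; the function names are hypothetical, and it does not model the hardware implementation.

```python
def matmul(a, b):
    """Plain list-of-lists matrix product."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def with_column_checksum(a):
    """Append a row holding the column sums of a."""
    return a + [[sum(col) for col in zip(*a)]]


def with_row_checksum(b):
    """Append a column holding the row sums of b."""
    return [row + [sum(row)] for row in b]


def find_mismatches(c_full):
    """Return (bad_rows, bad_cols): data rows and columns that disagree with their checksum."""
    n_rows, n_cols = len(c_full) - 1, len(c_full[0]) - 1
    bad_rows = [i for i in range(n_rows)
                if sum(c_full[i][:n_cols]) != c_full[i][n_cols]]
    bad_cols = [j for j in range(n_cols)
                if sum(c_full[i][j] for i in range(n_rows)) != c_full[n_rows][j]]
    return bad_rows, bad_cols


a = [[1, 2], [3, 4]]
b = [[5, 6], [7, 8]]
# The product of the augmented inputs carries checksums for the product of a and b.
c_full = matmul(with_column_checksum(a), with_row_checksum(b))
clean = find_mismatches(c_full)
c_full[0][1] += 4  # corrupt one output element, standing in for a timing error
faulty = find_mismatches(c_full)
```

    A single corrupted element shows up as one mismatching row and one mismatching column, so their intersection locates it. With floating-point data the equality tests would need a tolerance.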
    Earlier mentioned references are listed in the following. 
    [1] G. Papadimitriou, A. Chatzidimitriou, D. Gizopoulos, V. J. Reddi, J. Leng, B. Salami, O. S. Unsal, and A. C. Kestelman, “Exceeding conservative limits: A consolidated analysis on modern hardware margins,” IEEE Transactions on Device and Materials Reliability, 2020. 
    [2] Mudge, Trevor Nigel, Todd Michael Austin, David Theodore Blaauw, and Krisztian Flautner. "Systematic and random error detection and recovery within processing stages of an integrated circuit." U.S. Patent 7,162,661, issued January 9, 2007. 
    [3] K.-H. Huang and J. A. Abraham, “Algorithm-based fault tolerance for matrix operations,” IEEE Transactions on Computers, vol. C-33, no. 6, pp. 518-528, 1984. 
    [4] H.-T. Kung, “Why systolic architectures?” Computer, vol. 15, no. 1, pp. 37-46, 1982. 
    Furthermore, concerning patent publications, the following prior-art documents are briefly discussed.
    CN 108733628 (“Dai”) discloses a reinforcement method of a parallel matrix multiplication algorithm. The method is used for lowering the ABFT reinforcement overhead of the matrix algorithm. The method comprises the following steps: (1) encoding the input and output of the matrix multiplication, checking a calculation result according to an encoding value and storing all possible error lists; (2) preprocessing the error lists, eliminating some misjudgment errors and avoiding unnecessary correction, wherein the method of eliminating errors is a relative error law, error detection is performed before error correction, and the remaining errors are then corrected; if one or more errors are corrected, the error list is updated, and most errors are corrected after many iterations; and (3) adopting a re-calculation policy for remaining errors that cannot be corrected by the algorithm. The reinforcement method improves both system reliability and execution efficiency. The application area of this disclosure is GPUs (Graphics Processing Units).
    US 3,604,619 (“Abbiati”) from the late 1960s discloses a biquinary calculating machine, i.e. a machine for calculating with decimal numbers in which each decimal order to two operands is stored in accordance with a biquinary code, and it also concerns a store for use in such a machine. This disclosure does not apply the above presented ABFT from the 1980s.
    CN 101369241 (“Huo”) is about a “cluster fault-tolerance system, apparatus and method”, but it does not discuss arrays which would be systolic. Huo mentions ABFT only in its background as one known option, but the actual description seems to disclose a totally different concept of fault-tolerant processes.
    US 2012/0221884 (“Carter”) discusses an error management method, taking both HW and some SW into account. In other words, this disclosure provides error management across hardware and software layers to enable hardware and software to deliver reliable operation in the face of errors and hardware variation due to aging, manufacturing tolerances, etc. In one embodiment, an error management module is provided that gathers information from the hardware and software layers, and detects and diagnoses errors. A hardware or software recovery technique may be selected to provide efficient operation, and, in some embodiments, the hardware device may be reconfigured to prevent future errors and to permit the hardware device to operate despite a permanent error.
    One problem in the prior art is that TED circuits require redesign or re-fabrication of the circuit. If the TED circuits are used as additional circuit elements, they act as an extra source of power consumption. Thus, design times will get longer, and manufacturing costs will also increase.
    Generally, high voltages in accelerator devices lead directly to high energies and thus, high costs. 
    A cost- and energy-efficient, yet properly working accelerator device is thus desired to be produced. 
    Summary 
    The present invention introduces an integrated circuit for reliable low-voltage operation of a matrix accelerator processing system (10), which matrix accelerator processing system (10) comprises a matrix accelerator (11), wherein the matrix accelerator processing system (10) is configured to operate systolic arrays. The integrated circuit is characterized in that
        the matrix accelerator (11) is configured to output data to Algorithm-Based Fault Tolerance (ABFT) based error detection module (12), which is applied to compute checksums and detect errors online within the systolic array,
        the ABFT-based error detection module (12) is configured to forward possible detected errors to a dynamic voltage and frequency controlling module (13), which is configured to reduce an operational voltage of the matrix accelerator (11) in case of no detected errors, and respectively, to increase an operational voltage of the matrix accelerator (11) in case of at least one detected error,
        the matrix accelerator processing system (10) is configured to reperform the ABFT based error detection for the matrix accelerator (11) output data with an adjusted operational voltage, and
        the matrix accelerator processing system (10) is configured to find a lowest operational voltage where the number of the detected errors is zero.
    In an embodiment of the integrated circuit, the matrix accelerator processing system (10) is configured to keep the clock (14) frequency unchanged throughout the adjustment process by the dynamic voltage and frequency controlling module (13).
    In an embodiment of the integrated circuit, the matrix accelerator processing system (10) is implemented in form of an ASIC or an FPGA. 
    In an embodiment of the integrated circuit, the matrix accelerator processing system (10) is configured to either lower or increase the operational voltage by the dynamic voltage and frequency controlling module (13) by a predetermined voltage step. 
    In an embodiment of the integrated circuit, a voltage controller (15) is configured to perform the voltage adjustments for the matrix accelerator (11).
    In an embodiment of the integrated circuit, the input data and output data of the matrix accelerator (11) are augmented for Algorithm-Based Fault Tolerance error detection.
    In an embodiment of the integrated circuit, the matrix accelerator processing system (10) does not require roundtrips for the data from the matrix accelerator (11) to a memory and back to the matrix accelerator (11).
    According to a second aspect of the present invention, it introduces a method for reliable low-voltage operation of a matrix accelerator processing system (10), which matrix accelerator processing system (10) comprises a matrix accelerator (11), wherein the method comprises the step of operating systolic arrays in the matrix accelerator processing system (10). The method is characterized in that it further comprises the steps of:
        outputting data from the matrix accelerator (11) to Algorithm-Based Fault Tolerance (ABFT) based error detection module (12), which is applied to compute checksums and detect errors online within the systolic array,
        forwarding possible detected errors from the ABFT-based error detection module (12) to a dynamic voltage and frequency controlling module (13), which is configured to reduce an operational voltage of the matrix accelerator (11) in case of no detected errors, and respectively, to increase an operational voltage of the matrix accelerator (11) in case of at least one detected error,
        reperforming the ABFT based error detection for the matrix accelerator (11) output data with an adjusted operational voltage by the matrix accelerator processing system (10), and
        finding, by the matrix accelerator processing system (10), a lowest operational voltage where the number of the detected errors is zero.
    In an embodiment of the method, the matrix accelerator processing system (10) is configured to keep the clock (14) frequency unchanged throughout the adjustment process by the dynamic voltage and frequency controlling module (13). 
    In an embodiment of the method, the matrix accelerator processing system (10) is implemented in form of an ASIC or an FPGA. 
    In an embodiment of the method, the matrix accelerator processing system (10) is configured to either lower or increase the operational voltage by the dynamic voltage and frequency controlling module (13) by a predetermined voltage step. 
    In an embodiment of the method, a voltage controller (15) is configured to perform the voltage adjustments for the matrix accelerator (11).
    In an embodiment of the method, the input data and output data of the matrix accelerator (11) are augmented for Algorithm-Based Fault Tolerance error detection.
    In an embodiment of the method, the matrix accelerator processing system (10) does not require roundtrips for the data from the matrix accelerator (11) to a memory and back to the matrix accelerator (11).
    According to a third aspect of the present invention, it introduces a computer program product for reliable low-voltage operation of a matrix accelerator processing system (10), which matrix accelerator processing system (10) comprises a matrix accelerator (11), wherein the matrix accelerator processing system (10) is configured to operate systolic arrays. The computer program product is characterized in that the computer program product comprises program code which is executable by a processor, wherein the computer program product is configured to execute the steps of:
        outputting data from the matrix accelerator (11) to Algorithm-Based Fault Tolerance (ABFT) based error detection module (12), which is applied to compute checksums and detect errors online within the systolic array,
        forwarding possible detected errors from the ABFT-based error detection module (12) to a dynamic voltage and frequency controlling module (13), which is configured to reduce an operational voltage of the matrix accelerator (11) in case of no detected errors, and respectively, to increase an operational voltage of the matrix accelerator (11) in case of at least one detected error,
        reperforming the ABFT based error detection for the matrix accelerator (11) output data with an adjusted operational voltage by the matrix accelerator processing system (10), and
        finding, by the matrix accelerator processing system (10), a lowest operational voltage where the number of the detected errors is zero.
    The embodiments concerning the method are applicable to the computer program product as well. 
    Brief description of the drawings 
    FIG. 1 illustrates a simplified concept of Timing-Error-Detection circuit in the prior art. Shadow register is clocked by a delayed version of original clock. 
    FIG. 2 illustrates a checksum row and column which help to detect the errors. 
    FIG. 3 illustrates a checksum row and column which help to detect the errors and correct up to one error per row/column. 
    FIG. 4 illustrates a co-simulation paradigm of reduced voltage processing platform.
    FIG. 5 illustrates a processing element of the systolic array.
    FIG. 6 illustrates an energy efficient point which can be achieved using error detections.
    FIG. 7 illustrates, at the top, error rates of PE output bits and, at the bottom, total error and silent error rates for 32 point row-column and 32-by-32 matrix multiplication. The horizontal axis represents clock period variations (OT, nanoseconds).
    FIG. 8 illustrates that ABFT detects errors in different operation points and temperatures.
    FIG. 9 discloses a main block diagram according to an embodiment of the present invention. 
Detailed description 
    The present invention introduces a solution which achieves substantial energy benefits from reduced voltage operation in a matrix accelerator. The matrix accelerator may be a matrix multiplier designed as a systolic array.
    A main embodiment of the present invention is discussed first. In this embodiment, it is proposed to adjust the voltage and frequency according to errors detected at the data output of the computing logic. For detection, an ABFT scheme [3] is integrated into the design. The objective is to enable pushing the supply voltage down while maintaining reliability with minimal overheads.
    The restriction of the approach is that ABFT methods are algorithm specific. In our study they are for matrix operations, which are a worthwhile target because they represent the most energy-hungry computations in many applications. For instance, more than 98% of the computations in a neural network inference episode, and the majority of the computations of a 5G radio, involve matrix operations.
    One of the most energy-efficient and high-performance architectures for matrix operations is the systolic array introduced by Kung in the 1980s [4]. Systolic designs have recently regained attention, being utilized in Google Tensor Processing Units (TPUs) for acceleration of neural network computations.
    Our contribution in the embodiments of the present invention is demonstrating an ABFT solution for error detection in a systolic array targeted at matrix multiplication, and verifying its performance. The objective is to support sustained operation at reduced voltage, compensating, e.g., for the temperature dependency of the threshold voltage.
    Algorithm Based Fault Tolerance is discussed first. 
    The idea of ABFT was originally conceived by Huang and Abraham in [3]. They initially described a low-overhead technique to detect and correct computational errors striking matrix multiplication operations. Subsequently, ABFT was extended to other linear algebra operations, including transposition, QR decomposition, FFT, and 2-D convolutions [3].
    The fundamental idea of ABFT for matrix operations is to augment the input matrices with a checksum property. This enables detecting computational errors by inspecting the final result, while correction can be done in limited cases provided that hardware support has been included in the design. In our proposed approach according to an embodiment of the invention, correction is done by recalculating after a clock frequency or voltage change.
    Assuming A is an N × N matrix, the row checksum matrix $A_r$ is defined as an N × (N + 1) matrix, as below:
    $A_r = \begin{bmatrix} A & A e^T \end{bmatrix}$    Eq. 2
    where $e = [1, 1, \ldots, 1]$ is the all-ones vector of length N, so that $e^T$ is a column vector and the nth element of the appended column $A e^T$ is the sum of the nth row of A. Similarly, the column checksum matrix $A_c$ and the full checksum matrix $A_f$ are defined as
    $A_c = \begin{bmatrix} A \\ e A \end{bmatrix}, \qquad A_f = \begin{bmatrix} A & A e^T \\ e A & e A e^T \end{bmatrix}$    Eq. 3
    respectively. The checksums can be used to detect errors in the product of the matrices $A_{N \times N}$ and $B_{N \times N}$:
    $C = A \times B$    Eq. 4
    Thus, we multiply the column checksum matrix $A_c$ and the row checksum matrix $B_r$ to obtain the full checksum matrix $C_f$. This outcome is depicted in Fig. 2. Provided that there are no errors in the checksum calculations, the location of a single row or column error in the result matrix C can be detected.
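    As an illustration only, the checksum construction above can be sketched in a few lines of plain Python. The helper names (matmul, column_checksum, row_checksum, locate_error), the 2-by-2 operands and the injected fault are assumptions made for this sketch and are not taken from the disclosed design; integer data is assumed so that checksums can be compared exactly.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def column_checksum(a):
    """Append a row holding the column sums of a (column checksum matrix)."""
    return [row[:] for row in a] + [[sum(col) for col in zip(*a)]]

def row_checksum(b):
    """Append a column holding the row sums of b (row checksum matrix)."""
    return [row[:] + [sum(row)] for row in b]

def locate_error(cf):
    """Return (bad_rows, bad_cols) of a full checksum matrix cf."""
    n = len(cf) - 1
    bad_rows = [i for i in range(n) if sum(cf[i][:n]) != cf[i][n]]
    bad_cols = [j for j in range(n)
                if sum(cf[i][j] for i in range(n)) != cf[n][j]]
    return bad_rows, bad_cols

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
Cf = matmul(column_checksum(A), row_checksum(B))  # full checksum matrix
clean = locate_error(Cf)    # no row or column is flagged
Cf[0][1] += 4               # hypothetical single fault in C[0][1]
faulty = locate_error(Cf)   # the failing row and column pinpoint the element
```

    In this sketch the intersection of the flagged row and the flagged column gives the position of the single faulty element, which mirrors the role of the checksum row and column in Fig. 2.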
    In the current study the interests are in controlling the operating voltage and frequency, and understanding the energy costs of the checksum logic. We assume that the occasionally added latencies from recalculations, and the silent error rate, fit the application constraints.
    Next, systolic array for matrix multiplication with ABFT is discussed. 
    We limit our treatment to a 2-D structured systolic array [4], shown in Fig. 3, for matrix multiplication, equipped with ABFT logic, in this embodiment of the invention. The array consists of a grid of identical Processing Elements (PEs) that each perform a multiply-accumulate (MAC) operation on data received from the adjacent top and left PEs and pass the result or the input data to the next neighbouring PEs on the right and below. The architecture exhibits a high degree of parallelism and reduces memory bandwidth requirements. It should be noticed that the size of the array need not match that of the matrices to be multiplied, as the calculations can be broken down into sub-matrices.
    In the current design according to an embodiment of the invention, ABFT is merged with an N × N systolic array structure by adding a column of N PEs for checksums and another column consisting of N digital integrators and comparators for error detection. We recognize that by not including column checksum logic, the silent error rate increases; however, it stays low, as we show later. In a similar manner, we could check just a fraction of the bits in the checksums.
    Except for the added checksum and error detection columns, the system design is the same as proposed in [4] in this embodiment of the invention. An advantage of the scheme is that errors are detected on-the-fly as the result matrix is clocked out from the array. No memory round-trips are needed. The added overhead from the columns for checksum and error checking is only O(1/N), targeting transient errors in reduced voltage settings rather than tackling permanent faults.
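    A minimal behavioural sketch of the row-checksum-only arrangement described above is given below. It is not a model of the actual hardware: the function name stream_with_row_check, the faults dictionary used to inject errors and the small operands are all hypothetical, and the per-row integrator, checksum PE and comparator are reduced to plain integer arithmetic.

```python
def stream_with_row_check(a, b, faults=None):
    """Compute C = A x B row by row and flag rows whose checksum fails.

    faults maps (row, col) to an additive error, standing in for a
    timing error at reduced voltage (hypothetical injection mechanism).
    """
    faults = faults or {}
    n = len(a)
    b_row_sums = [sum(row) for row in b]   # extra operand column for checksum PEs
    result, flags = [], []
    for i in range(n):
        integrator = 0                     # digital integrator for output row i
        out_row = []
        for j in range(n):
            value = sum(a[i][k] * b[k][j] for k in range(n))
            value += faults.get((i, j), 0)
            integrator += value            # accumulate as the row is clocked out
            out_row.append(value)
        checksum = sum(a[i][k] * b_row_sums[k] for k in range(n))  # checksum PE
        flags.append(integrator != checksum)                        # comparator
        result.append(out_row)
    return result, flags

A = [[1, 2], [3, 4]]
B = [[5, 6], [7, 8]]
ok_result, ok_flags = stream_with_row_check(A, B)
bad_result, bad_flags = stream_with_row_check(A, B, faults={(1, 0): 1})
```

    Because the comparison uses only values that are already leaving the array, the check in this sketch needs no second pass over stored results, which is the property the text refers to as avoiding memory round-trips.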
    Next, co-simulation and experimentation are discussed. 
    Simulation models for the design according to the above embodiment were created for MATLAB and HSPICE to investigate the functionality and implementation characteristics. The analog-digital co-simulation setting is depicted in Fig. 4. As no standard cell libraries exist that are characterized for reduced-voltage operation, the main focus was on HSPICE analyses of transient behavior under different operation voltages, while the MATLAB model was used for error detection.
    As a partial confirmation of the scheme in the real world, reduced-voltage experiments were carried out on an FPGA.
    In the HSPICE model of the processing elements, an 8-bit Wallace-tree multiplier is followed by a 24-bit ripple-carry adder. The structure is shown in Fig. 5. While the HSPICE simulations are time consuming, detailed results were desired rather than probabilistic models to estimate the energy impacts. The digital behavior was modeled using MATLAB, closing the loop from analog simulations.
    HSPICE was used to model variations in W/L, oxide thickness, temperature etc. Signal transitions were obtained from system level simulations with random data inputs to compile sets of test vectors.
    Overhead analysis is discussed next. The number of arithmetic operations in an N-by-N matrix multiplication is $2N^3 - N^2$. Extending the input matrices with checksums according to Fig. 2 increases the operation count to $2N^3 + 3N^2$. Detection of errors through only the row checksum vector requires an additional $N^2$ summation operations plus N comparisons. The outcome is $2N^3 + 4N^2 + 3N$ operations.
    With large enough matrices the added overhead is small: 8.0% for N=32 and 2.5% for N=100. Compared with similar methods such as Result Checking (RC), whose overhead is O(N), ABFT has an overhead rate of O(1/N). Moreover, ABFT can be adapted to operate on-the-fly.
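    The quoted overhead figures can be checked with a short calculation. The sketch below simply evaluates the two operation counts stated above; the function names are chosen for illustration only.

```python
def baseline_ops(n):
    """Arithmetic operations of a plain N-by-N matrix multiplication."""
    return 2 * n**3 - n**2

def abft_ops(n):
    """Operation count with row-checksum detection, as stated in the text."""
    return 2 * n**3 + 4 * n**2 + 3 * n

def overhead(n):
    """Relative overhead of the ABFT variant over the baseline."""
    return abft_ops(n) / baseline_ops(n) - 1.0

for n in (32, 100, 1000):
    print(f"N={n}: overhead {100 * overhead(n):.2f}%")
```

    The ratio simplifies to (5N + 3) / (2N^2 - N), which is roughly 8% for N=32 and 2.5% for N=100 and falls off as 1/N for larger matrices.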
    The results according to the above disclosed embodiment are discussed next. Figures 6, 7 and 8 are referred to in that regard.
    The co-simulation model was used to investigate the potential of ABFT in operating a systolic array at reduced voltages. The scheme in our simulations was to keep the clock frequency and reduce the operating voltage until errors start appearing. Then, the clock frequency was reduced by predetermined steps until the errors disappear. 
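    The sweep described above can be sketched as follows. The error model has_errors is a made-up linear timing limit that merely stands in for the ABFT error flag of the co-simulation; its constants, the integer millivolt and MHz units and the default step sizes are assumptions for illustration, not values from the simulations.

```python
def has_errors(v_mv, f_mhz):
    """Hypothetical stand-in for the ABFT error flag.

    Errors appear when the clock exceeds a toy limit that shrinks
    linearly with the supply voltage (400 MHz at 700 mV).
    """
    f_max = 400 * (v_mv - 300) / 400
    return f_mhz > f_max

def sweep(v_start_mv=900, v_stop_mv=400, v_step_mv=10,
          f_start_mhz=400, f_step_mhz=20):
    """Lower the voltage step by step; back off the clock when errors appear."""
    f = f_start_mhz
    points = []
    for v in range(v_start_mv, v_stop_mv - 1, -v_step_mv):
        while f > 0 and has_errors(v, f):
            f -= f_step_mhz        # reduce frequency, then recalculate
        points.append((v, f))      # error-free operating point at this voltage
    return points

operating_points = dict(sweep())
```

    In this toy model the clock stays at its initial value down to 700 mV and is then stepped down as the voltage falls further, which is the qualitative behaviour the scheme relies on.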
    Matrix size 32-by-32 was selected due to its application relevance. The dots in the upper part of Fig. 6 show the power dissipation of the ABFT logic augmented systolic array when the operating voltage is gradually reduced in 0.01 V steps and the frequency is adjusted after error detection. On average, the highest power consumption over all types of PEs is around 74 mW at Vdd = 0.9 V and fclk = 400 MHz. The recalculations can be observed as momentary power dissipation peaks taking place after a frequency drop.
    The lower part of Fig. 6 shows the correct throughput rates ("goodput") for the 32-by-32 matrix multiplication systolic array with ABFT logic. These results are for long runs at each voltage after frequency adjustment. The impacts of process, voltage and temperature variations were simulated by adding timing variations, following the approach presented in and with a fixed relative variance of clock period.
    Voltage reduction from 1 V to 0.7 V saves half of the energy for a matrix multiplication without compromising the throughput. When the near-threshold region is approached at 0.5 V, the energy use is reduced by a further 70%, while the goodput drops to half. Aiming at higher energy efficiency sacrifices the goodput: in the near-threshold region in the vicinity of 0.4 V the goodput is only 9% of that at 0.7 V, while the energy per 32-by-32 matrix multiplication is only 8%.
    The silent error rate or the share of errors that can find their way into the output without being detected is a relevant parameter for applications. It can be impacted by the design of error detection logic that in our case is fairly sensitive. 
    The worst case takes place when only 1 bit in the output of PEs is affected, and the erroneous results "neutralize" each other. In this arguably rare scenario, the probability of a silent error in the final matrix multiplication result is 50%. This is the upper bound with having only one erroneous row. However, when multiplications are carried out one after another by the PEs, such errors are most likely to become detected, triggering an operating point change. At application level, such as in telecommunications, a silent error might result in a re-transmission due to packet errors.
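    The "neutralizing" case can be made concrete with a small example. In the sketch below, the row values, the chosen bit position and the helper names are hypothetical; the 50% figure is reproduced only under the added assumption that the affected bit is equally likely to be 0 or 1 in each of the two struck outputs.

```python
def flip_bit(value, bit):
    """Flip one bit of a non-negative integer PE output."""
    return value ^ (1 << bit)

def detected(row, checksum):
    """Row-checksum comparison: True when the error is caught."""
    return sum(row) != checksum

clean = [18, 23, 40, 7]
checksum = sum(clean)

# Bit 2 is 0 in 18 and 40 (flip adds 4) but 1 in 23 (flip subtracts 4).
cancelling = [flip_bit(18, 2), flip_bit(23, 2), 40, 7]   # +4 and -4
additive = [flip_bit(18, 2), 23, flip_bit(40, 2), 7]     # +4 and +4

silent = not detected(cancelling, checksum)   # errors cancel, nothing flagged
caught = detected(additive, checksum)         # errors add up, flagged

# Enumerate the original bit values (b0, b1) of the two struck outputs:
# a flip changes the value by +2^k when the bit was 0 and by -2^k when it was 1.
cases = [(b0, b1) for b0 in (0, 1) for b1 in (0, 1)]
silent_fraction = sum(1 for b0, b1 in cases
                      if (1 - 2 * b0) + (1 - 2 * b1) == 0) / len(cases)
```

    Two of the four equally weighted bit combinations cancel in the row sum, which matches the stated 50% upper bound for a single erroneous row.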
    Fig. 7 shows the error rate for different variations and the error rate in the final output of a row of PEs. To estimate the silent error rate, we first modeled the bit error rate of a PE using a fixed voltage (Vdd = 0.7 V) and clock frequency (fclk = 280 MHz) under different random variations in HSPICE. The variations were introduced as clock jitter and were considered to represent sources for timing errors collectively. The resulting error probability density was then used within PEs of the systolic array to flip their output bits.
    In the worst case for ABFT error detection, only a single row is struck by errors, whereas in a more realistic case the PEs in all rows undergo similar variations. This means lowering chances of a silent error passing through to the output. 
    Experimentation using an FPGA design was carried out to demonstrate aggressive elimination of guard band voltages without incurring erroneous results. Using a 32-by-32 matrix multiplier synthesized on a Xilinx Zynq-7000 SoC ZC702 (XC7Z020) evaluation board, different operational temperature/voltage points were explored as depicted in Fig. 8.
    The voltage rails of the programmable logic, BRAM and auxiliary circuits were adjusted by sending Power Management Bus (PMBUS) commands to the voltage regulator (UCD9248). According to the timing analysis tool of the vendor, the maximum clocking for the programmable logic design was 120 MHz, while we overclocked to 250 MHz to force the FPGA to produce errors to be detected in experimentation. To investigate the sensitivity of ABFT, a few thousand trials per operation point were carried out. In the experiments, the error rate in the low-voltage setting decreased at higher temperatures, which is explained by Inverse Temperature Dependence. The ABFT approach can utilize this opportunity to increase the clock frequency and to improve energy efficiency.
    Now discussing the main invented concept in the form of a main block diagram according to an embodiment of the present invention, we refer to Fig. 9.
    A matrix accelerator processing system 10 is illustrated, and it comprises a matrix accelerator 11. The matrix accelerator processing system 10 is configured to operate systolic arrays (see the Input and Output). The matrix accelerator 11 is configured to output data to Algorithm-Based Fault Tolerance (ABFT) based error detection module 12, which is applied to compute checksums and detect errors online within the systolic array. Thereafter, the ABFT-based error detection module 12 is configured to forward possible detected errors (i.e. error feedback) to a dynamic voltage and frequency controlling module 13, which is configured to reduce an operational voltage of the matrix accelerator 11 in case of no detected errors, and respectively, to increase an operational voltage of the matrix accelerator 11 in case of at least one detected error. Thereafter, the matrix accelerator processing system 10 is configured to reperform the ABFT based error detection for the matrix accelerator 11 output data with an adjusted operational voltage. In this way, the matrix accelerator processing system 10 is configured to find a lowest operational voltage where the number of the detected errors is zero.
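    The feedback loop of Fig. 9 can be outlined in software form as below. This is a sketch under stated assumptions, not the disclosed controller: count_errors is a hypothetical callback that runs the accelerator at a given voltage and returns the number of errors reported by the ABFT module, the voltage is expressed in integer millivolts, the clock is assumed fixed, and the toy error model used in the example is invented.

```python
def find_lowest_voltage(count_errors, v_start_mv, v_min_mv, v_max_mv, step_mv):
    """Search for the lowest supply voltage with zero detected errors.

    The voltage is lowered by one step after an error-free run and raised
    by one step after a run with at least one detected error; the run is
    repeated at every adjusted voltage.
    """
    v = v_start_mv
    lowest_clean = None
    while v_min_mv <= v <= v_max_mv:
        if count_errors(v) == 0:
            lowest_clean = v       # error-free: remember and try lower
            v -= step_mv
        elif lowest_clean is not None:
            return lowest_clean    # first failing step below a clean point
        else:
            v += step_mv           # errors: raise voltage and recompute
    return lowest_clean            # None if no clean point was found

# Hypothetical error model: errors vanish at 650 mV and above.
toy_model = lambda v_mv: 0 if v_mv >= 650 else 3

from_above = find_lowest_voltage(toy_model, 900, 400, 1000, 10)
from_below = find_lowest_voltage(toy_model, 600, 400, 1000, 10)
```

    The sketch assumes that errors are reproducible at a given voltage; with transient errors a practical controller would presumably keep monitoring and readjusting rather than stop at the first clean point.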
    As shown in Fig. 9, the dynamic voltage and frequency controlling module 13 is capable of controlling the clock 14 frequency, and the voltage Vdd (i.e. the operational voltage of the matrix accelerator 11) via the voltage controller 15. In the present invention, we can select whether we adjust the frequency or the voltage, or even both the frequency and the voltage, based on the error feedback.
    In an embodiment of the present invention, the matrix accelerator processing system 10 is configured to keep the clock 14 frequency unchanged throughout the adjustment process by the dynamic voltage and frequency controlling module 13.
    In brief, for the above presented embodiments of the present invention and their advantages, the proposed ABFT approach enables removing the extra voltage guard bands determined by the circuit manufacturers, without performance compromises. Furthermore, efficient near-threshold operation points can be reached when the clock rate is adjusted as well. Another interesting use case is approximate computing applications, where the solution can be used to assure that the error probabilities are confined within certain bounds.
    Summarizing the above, the energy savings potential indicated by transistor level HSPICE simulations was experimentally confirmed with an FPGA. Integration of the proposed ABFT scheme into a systolic array according to the embodiment of the invention allows on-the-fly error detection, and can be employed to find low-power operation points. With its low silent error rate, the scheme fits a wide field of applications from wireless communications to artificial intelligence. These are notable advantages for the invented concept.
    Furthermore, the invented concept of applying the ABFT in systolic arrays for low voltage has been implemented in a real device (an FPGA) for demonstration purposes, instead of being simulated. As in the above, the focus is on the energy savings achievable through voltage reductions. Matrix multiplication has been selected as an example case due to being the core of key operations for wireless communications and deep neural networks. The matrix multiplier design used on the FPGA is a run-of-the-mill scheme generated using a vendor-provided HLS tool.
    Further summarizing the invented concept verified both by simulations and by an FPGA implementation, and when comparing it to the earlier cited prior art, we note the following. 
    To our best knowledge, this seems to be the first contribution in which a low-overhead algorithmic error detection technique is employed to realize a low-voltage processing solution. The disclosed ABFT based voltage control scheme is independent of architectural optimizations, and can therefore be used with any matrix multiplier design. Implementations of other algorithms can be foreseen to benefit from matching ABFT schemes.
Designing chips to operate at optimum near-threshold points is notoriously difficult, due to exacerbated process parameter and temperature dependencies that need to be modeled. The demonstrated ABFT based error feedback mechanism, however, is an alternative adaptive approach that fits applications where occasional errors can be tolerated. 
    While its utility was now demonstrated with an FPGA, it is possible to envision applications with ASIC designs. Furthermore, the objective of the present invention was to present and demonstrate safe voltage reduction of a matrix multiplication accelerator using ABFT. The results demonstrate that the energy efficiency of FPGA based matrix multiplication can be improved substantially by cutting the voltage margins. Moreover, the utility of an ABFT based low-overhead feedback mechanism was shown in controlling the voltages based on errors originating from the internal logic, Block RAM, and auxiliary circuit.
    The invented concept comprises an integrated circuit, a method and a computer program product for reliable low-voltage operation of a matrix accelerator processing system.
    The present invention is not restricted merely to the embodiments discussed above, but the present invention may vary within the scope of the claims.
  Claims
    1. An integrated circuit for reliable low-voltage operation of a matrix accelerator processing system (10), which matrix accelerator processing system (10) comprises a matrix accelerator (11), wherein the matrix accelerator processing system (10) is configured to operate systolic arrays, characterized in that
        the matrix accelerator (11) is configured to output data to Algorithm-Based Fault Tolerance (ABFT) based error detection module (12), which is applied to compute checksums and detect errors online within the systolic array,
        the ABFT-based error detection module (12) is configured to forward possible detected errors to a dynamic voltage and frequency controlling module (13), which is configured to reduce an operational voltage of the matrix accelerator (11) in case of no detected errors, and respectively, to increase an operational voltage of the matrix accelerator (11) in case of at least one detected error,
        the matrix accelerator processing system (10) is configured to reperform the ABFT based error detection for the matrix accelerator (11) output data with an adjusted operational voltage, and
        the matrix accelerator processing system (10) is configured to find a lowest operational voltage where the number of the detected errors is zero.
    2. The integrated circuit according to claim 1, characterized in that the matrix accelerator processing system (10) is configured to keep the clock (14) frequency unchanged throughout the adjustment process by the dynamic voltage and frequency controlling module (13).
    3. The integrated circuit according to claim 1, characterized in that the matrix accelerator processing system (10) is implemented in form of an ASIC or an FPGA.
    4. The integrated circuit according to claim 1, characterized in that the matrix accelerator processing system (10) is configured to either lower or increase the operational voltage by the dynamic voltage and frequency controlling module (13) by a predetermined voltage step.
    5. The integrated circuit according to any of claims 1-4, characterized in that a voltage controller (15) is configured to perform the voltage adjustments for the matrix accelerator (11).
    6. The integrated circuit according to any of claims 1-5, characterized in that the input matrices of the matrix accelerator (11) are augmented with checksum property in the Algorithm-Based Fault Tolerance based error detection.
    7. A method for reliable low-voltage operation of a matrix accelerator processing system (10), which matrix accelerator processing system (10) comprises a matrix accelerator (11 ), wherein the method comprises the step of: operating systolic arrays in the matrix accelerator processing system (10), characterized in that the method further comprises the steps of: outputting data from the matrix accelerator (11 ) to Algorithm-Based Fault Tol erance (ABFT) based error detection module (12), which is applied to compute checksums and detect errors online within the systolic array, forwarding possible detected errors from the ABFT-based error detection module (12) to a dynamic voltage and frequency controlling module (13), which is configured to reduce an operational voltage of the matrix accelerator (11 ) in case of no detected errors, and respectively, to increase an operational voltage of the matrix accelerator (11 ) in case of at least one detected error, reperforming the ABFT based error detection for the matrix accelerator (11 ) output data with an adjusted operational voltage by the matrix accelerator pro cessing system (10), and finding, by the matrix accelerator processing system (10), a lowest opera tional voltage where the number of the detected errors is zero. 
    8. The method according to claim 7, characterized in that the matrix accelerator processing system (10) is configured to keep the clock (14) frequency unchanged throughout the adjustment process by the dynamic voltage and frequency control ling module (13). 
    9. The method according to claim 7, characterized in that the matrix accelerator processing system (10) is implemented in form of an ASIC or an FPGA. 
    10. The method according to claim 7, characterized in that the matrix accelerator processing system (10) is configured to either lower or increase the operational voltage by the dynamic voltage and frequency controlling module (13) by a predetermined voltage step. 
    11. The method according to any of claims 7-10, characterized in that a voltage controller (15) is configured to perform the voltage adjustments for the matrix accelerator (11). 
    12. The method according to any of claims 7-11, characterized in that the input matrices of the matrix accelerator (11) are augmented with checksum property in the Algorithm-Based Fault Tolerance based error detection. 
    13. A computer program product for reliable low-voltage operation of a matrix accelerator processing system (10), which matrix accelerator processing system (10) comprises a matrix accelerator (11), wherein the matrix accelerator processing system (10) is configured to operate systolic arrays, characterized in that the computer program product comprises program code which is executable by a processor, wherein the computer program product is configured to execute the steps of: outputting data from the matrix accelerator (11) to Algorithm-Based Fault Tolerance (ABFT) based error detection module (12), which is applied to compute checksums and detect errors online within the systolic array, forwarding possible detected errors from the ABFT-based error detection module (12) to a dynamic voltage and frequency controlling module (13), which is configured to reduce an operational voltage of the matrix accelerator (11) in case of no detected errors, and respectively, to increase an operational voltage of the matrix accelerator (11) in case of at least one detected error, reperforming the ABFT based error detection for the matrix accelerator (11) output data with an adjusted operational voltage by the matrix accelerator processing system (10), and finding, by the matrix accelerator processing system (10), a lowest operational voltage where the number of the detected errors is zero. 
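The checksum augmentation of claims 6 and 12 and the voltage search of claims 7 and 13 can be illustrated with a minimal Python sketch. It is not taken from the patent: the function names, the millivolt values, the 25 mV step and the `simulated_accelerator` fault model (one corrupted product entry below a made-up 700 mV threshold) are all assumptions chosen for illustration. The checksum layout follows the classical row and column checksum scheme for matrix products.

```python
def matmul(a, b):
    """Plain matrix product of two lists of lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add_column_checksum(a):
    """Append a row holding the sum of every column of a."""
    return a + [[sum(col) for col in zip(*a)]]


def add_row_checksum(b):
    """Append to every row of b the sum of that row."""
    return [row + [sum(row)] for row in b]


def count_checksum_errors(c_full):
    """Count data rows and data columns whose checksum entry does not match."""
    n_rows, n_cols = len(c_full) - 1, len(c_full[0]) - 1
    errors = 0
    for i in range(n_rows):
        if sum(c_full[i][:n_cols]) != c_full[i][n_cols]:
            errors += 1
    for j in range(n_cols):
        if sum(c_full[i][j] for i in range(n_rows)) != c_full[n_rows][j]:
            errors += 1
    return errors


def simulated_accelerator(a_c, b_r, voltage_mv, min_safe_mv=700):
    """Hypothetical stand-in for the undervolted matrix accelerator.

    Assumption: below min_safe_mv exactly one data entry of the product is wrong.
    """
    c = matmul(a_c, b_r)
    if voltage_mv < min_safe_mv:
        c[0][0] += 1  # injected fault in the data region
    return c


def find_lowest_error_free_voltage(a, b, start_mv=1000, step_mv=25,
                                   floor_mv=500, ceiling_mv=1200,
                                   accelerator=simulated_accelerator):
    """Lower the voltage while no errors are detected, raise it after an error."""
    a_c, b_r = add_column_checksum(a), add_row_checksum(b)
    voltage = start_mv
    seen_error = False
    while floor_mv <= voltage <= ceiling_mv:
        errors = count_checksum_errors(accelerator(a_c, b_r, voltage))
        if errors == 0 and (seen_error or voltage - step_mv < floor_mv):
            return voltage  # error-free, and one step lower is unsafe or out of range
        if errors == 0:
            voltage -= step_mv  # no detected errors: try one step lower
        else:
            seen_error = True
            voltage += step_mv  # detected errors: step back up and re-check
    raise RuntimeError("no error-free voltage found in the allowed range")
```

With the assumed default values, `find_lowest_error_free_voltage([[1, 2], [3, 4]], [[5, 6], [7, 8]])` steps down from 1000 mV, sees checksum mismatches at 675 mV, steps back up and returns 700. Integer matrices are used so that checksums can be compared exactly; floating-point data would need a comparison tolerance.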
    Applications Claiming Priority (2)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| FI20215475A FI130137B (en) | 2021-04-22 | 2021-04-22 | A method for increase of energy efficiency through leveraging fault tolerant algorithms into undervolted digital systems | 
| FI20215475 | 2021-04-22 | | |
Publications (1)
| Publication Number | Publication Date | 
|---|---|
| WO2022223881A1 (en) | 2022-10-27 | 
Family
ID=81850667
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title | 
|---|---|---|---|
| PCT/FI2022/050258 Ceased WO2022223881A1 (en) | 2021-04-22 | 2022-04-20 | A method for increase of energy efficiency through leveraging fault tolerant algorithms into undervolted digital systems | 
Country Status (2)
| Country | Link | 
|---|---|
| FI (1) | FI130137B (en) | 
| WO (1) | WO2022223881A1 (en) | 
Citations (7)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| US3604619A (en) | 1967-10-30 | 1971-09-14 | Olivetti & Co Spa | Biquinary calculating machine | 
| US7162661B2 (en) | 2003-03-20 | 2007-01-09 | Arm Limited | Systematic and random error detection and recovery within processing stages of an integrated circuit | 
| CN101369241A (en) | 2007-09-21 | 2009-02-18 | 中国科学院计算技术研究所 | A cluster fault-tolerant system, device and method | 
| US20120221884A1 (en) | 2011-02-28 | 2012-08-30 | Carter Nicholas P | Error management across hardware and software layers | 
| CN108733628A (en) | 2018-05-23 | 2018-11-02 | 河海大学常州校区 | A kind of reinforcement means of parallel matrix multiplication algorithm | 
| US20190041942A1 (en) * | 2017-12-29 | 2019-02-07 | Intel Corporation | Voltage droop mitigation technology in array processor cores | 
| CN110932713A (en) * | 2019-11-11 | 2020-03-27 | 东南大学 | Timing elastic circuit for convolutional neural network hardware accelerator | 
- 2021
    - 2021-04-22: FI application FI20215475A, patent FI130137B (en), status: active
- 2022
    - 2022-04-20: WO application PCT/FI2022/050258, patent WO2022223881A1 (en), status: ceased
Non-Patent Citations (6)
| Title | 
|---|
| CAVELAN, AURÉLIEN ET AL: "Voltage Overscaling Algorithms for Energy-Efficient Workflow Computations With Timing Errors", PROCEEDINGS OF THE 2015 ACM INTERNATIONAL JOINT CONFERENCE ON PERVASIVE AND UBIQUITOUS COMPUTING, ACMPUB27, NEW YORK, NY, USA, 15 June 2015 (2015-06-15), pages 27 - 34, XP058511836, ISBN: 978-1-4503-3572-0, DOI: 10.1145/2751504.2751508 * | 
| G. PAPADIMITRIOU, A. CHATZIDIMITRIOU, D. GIZOPOULOS, V. J. REDDI, J. LENG, B. SALAMI, O. S. UNSAL, A. C. KESTELMAN: "Exceeding conservative limits: A consolidated analysis on modern hardware margins", IEEE TRANSACTIONS ON DEVICE AND MATERIALS RELIABILITY, 2020 | 
| H.-T. KUNG: "Why systolic architectures", COMPUTER, vol. 15, no. 1, 1982, pages 37 - 46, XP000743032, DOI: 10.1109/MC.1982.1653825 | 
| K.-H. HUANG, J. A. ABRAHAM: "Algorithm-based fault tolerance for matrix operations", IEEE TRANSACTIONS ON COMPUTERS, vol. 100, no. 6, 1984, pages 518 - 528 | 
| MUDGE, TREVOR NIGEL; TODD MICHAEL AUSTIN; DAVID THEODORE BLAAUW; KRISZTIAN FLAUTNER, SYSTEMATIC AND RANDOM ERROR DETECTION AND RECOVERY WITHIN PROCESSING STAGES OF AN INTEGRATED CIRCUIT | 
| ZHANG JEFF ET AL: "ThUnderVolt: Enabling Aggressive Voltage Underscaling and Timing Error Resilience for Energy Efficient Deep Learning Accelerators", 2018 55TH ACM/ESDA/IEEE DESIGN AUTOMATION CONFERENCE (DAC), IEEE, 24 June 2018 (2018-06-24), pages 1 - 6, XP033406006, DOI: 10.1109/DAC.2018.8465918 * | 
Cited By (3)
| Publication number | Priority date | Publication date | Assignee | Title | 
|---|---|---|---|---|
| US20200074287A1 (en) * | 2018-08-31 | 2020-03-05 | Texas Instruments Incorporated | Fault detectable and tolerant neural network | 
| US11710030B2 (en) * | 2018-08-31 | 2023-07-25 | Texas Instruments Incorporated | Fault detectable and tolerant neural network | 
| WO2024146011A1 (en) * | 2023-01-06 | 2024-07-11 | 上海科技大学 | Automatic overclocking controller based on circuit delay measurement | 
Also Published As
| Publication number | Publication date | 
|---|---|
| FI130137B (en) | 2023-03-09 | 
| FI20215475A1 (en) | 2022-10-23 | 
Similar Documents
| Publication | Title | 
|---|---|
| Bhardwaj et al. | Power-and area-efficient approximate wallace tree multiplier for error-resilient systems | 
| Cai et al. | Rapid robust principal component analysis: CUR accelerated inexact low rank estimation | 
| US8495440B2 (en) | Fully programmable parallel PRBS generator | 
| Safarpour et al. | Algorithm level error detection in low voltage systolic array | 
| WO2022223881A1 (en) | A method for increase of energy efficiency through leveraging fault tolerant algorithms into undervolted digital systems | 
| Immareddy et al. | A survey paper on design and implementation of multipliers for digital system applications | 
| US11409527B2 (en) | Parallel processor in associative content addressable memory | 
| Fougstedt et al. | Energy-efficient high-throughput VLSI architectures for product-like codes | 
| Ge et al. | Reliable and secure memories based on algebraic manipulation detection codes and robust error correction | 
| Neugebauer et al. | On the limits of stochastic computing | 
| US11228432B2 (en) | Quantum-resistant cryptoprocessing | 
| Kim et al. | Simplified ordered statistic decoding for short-length linear block codes | 
| Japa et al. | Processor based Intrinsic PUF Design for Approximate Computing: Faith or Reality? | 
| Sajadimanesh et al. | Inexact Quantum Square Root Circuit for NISQ Devices | 
| Mohan et al. | An improved implementation of hierarchy array multiplier using CslA adder and full swing GDI logic | 
| Juliato et al. | SEU-resistant SHA-256 design for security in satellites | 
| Ikezoe et al. | Recovering faulty non-volatile flip flops for coarse-grained reconfigurable architectures | 
| Kim et al. | Error resilient and energy efficient mrf message-passing-based stereo matching | 
| Bukkapatnam et al. | VLSI implementation of low-power and area efficient parallel memory allocation with EC-TCAM | 
| Cenova | Exploring HLS Coding Techniques to Achieve Desired Turbo Decoder Architectures | 
| Aiswarya et al. | An investigation on recoverable and fault tolarence method-based on real-time FPGA | 
| GB2544814B (en) | Modulo hardware generator | 
| KR102727496B1 (en) | Decoding method and apparatus based on polar code algorithm suitable for extremely low-resolution channels | 
| Guo et al. | A Low Latency Decoding Algorithm for Grouping Variable Nodes on TLC NAND Flash Devices | 
| Wu et al. | Parallel balanced-bit-serial design technique for ultra-low-voltage circuits with energy saving and area efficiency enhancement | 
Legal Events
| Code | Title | Description | 
|---|---|---|
| 121 | Ep: the epo has been informed by wipo that ep was designated in this application | Ref document number: 22726149 Country of ref document: EP Kind code of ref document: A1 | 
| NENP | Non-entry into the national phase | Ref country code: DE | 
| 122 | Ep: pct application non-entry in european phase | Ref document number: 22726149 Country of ref document: EP Kind code of ref document: A1 |