
CN107122256B - Dynamically patched high-performance on-chip cache fault-tolerant architecture - Google Patents

Dynamically patched high-performance on-chip cache fault-tolerant architecture

Info

Publication number
CN107122256B
Authority
CN
China
Prior art keywords
block
cache
sub
fault
field
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Expired - Fee Related
Application number
CN201710298651.7A
Other languages
Chinese (zh)
Other versions
CN107122256A (en)
Inventor
黄智濒
刘欣
许翰元
王珏
满柯宇
周锋
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Beijing University of Posts and Telecommunications
Original Assignee
Beijing University of Posts and Telecommunications
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Beijing University of Posts and Telecommunications
Priority to CN201710298651.7A
Publication of CN107122256A
Application granted
Publication of CN107122256B
Legal status: Expired - Fee Related
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation, the processing taking place on a specific hardware platform or in a specific software environment, in a memory management context, e.g. virtual memory or cache management
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Memory System Of A Hierarchy Structure (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)

Abstract

The invention proposes a dynamically patched, high-performance, fault-tolerant on-chip cache structure that handles intermittent bit failures and permanent faults promptly, efficiently, and with low overhead. A fault-sensitive replacement mechanism tolerates newly occurring faults at any time and keeps the cache working correctly; faulty sub-blocks are then patched dynamically, depending on how heavily the affected cache block is accessed, to reduce the performance impact of the faults on the cache.

Description

Dynamically patched high-performance on-chip cache fault-tolerant architecture

Technical field

The invention relates to a cache fault-tolerant architecture for repairing intermittent and permanent errors, and in particular to a high-performance on-chip cache fault-tolerant architecture that can be patched dynamically.

Background

The reliability of a chip after shipment is affected not only by permanent faults, which result from test escapes and from wear-out and aging, but also, and seriously, by intermittent bit failures. Intermittent bit failures are hard to detect before shipment, occur at random times, appear at fixed locations, persist for a certain period, and are recoverable. Error detection and correction codes, which are effective against transient faults, incur a large latency overhead when used to handle intermittent bit failures; using them this way also weakens their handling of transient faults and can lead to multi-bit faults. The cache architectures proposed so far for permanent faults, such as sub-block disabling schemes and conflict-free patching schemes, all rely on static bypass or substitution and must process faults at system start-up or when voltage and frequency are adjusted. Because they do not act in time, these methods cannot cope effectively with intermittent bit failures that occur at random times.

Summary of the invention

This patent proposes a dynamically patched, high-performance, fault-tolerant on-chip cache architecture that handles intermittent bit failures and permanent faults promptly, efficiently, and with low overhead. A fault-sensitive replacement mechanism tolerates newly occurring faults at any time and keeps the cache working correctly; faulty sub-blocks are then patched dynamically, depending on how heavily the affected cache block is accessed, to reduce the performance impact of the faults on the cache.

The cache structure, shown in Figure 1, has the following characteristics:

In this cache line structure, a cache block contains several sub-blocks. The cache block structure mainly comprises a sub-block fault field, a cache-block-disabled field, a sub-block offset adjustment counter field, a pointer field pointing to the patch array, a field indicating whether the data has been modified, a field maintaining the replacement state, and a tag identification field. The actual mapped sub-block offset is obtained by adding the sub-block offset adjustment (the counter value) to the requested sub-block offset within the block, as in the expression (k + C(v)) MOD SN used in the algorithm steps.

1. Tag bits: identify the cache block.

2. Sub-block fault bitmap (SFM) field: indicates the fault state of the sub-blocks of each cache block. It contains several bits, one per sub-block; the bit of a faulty sub-block is set to distinguish it from a normal sub-block. The field is filled at system initialization, updated whenever voltage and frequency are dynamically adjusted, and can also be updated promptly from parity-check results: if a bit fault is found several times in succession, an intermittent bit failure is assumed to have occurred and the SFM field is modified accordingly.

3. Cache-block-disabled field: indicates whether the cache block is disabled. If a cache block contains no contiguous fault-free sub-blocks, the block is disabled and this field is set to distinguish it from a normally usable block.

4. Data-modified field: indicates whether the data has been modified. The field is set when the data is modified and is otherwise left unchanged.

5. Sub-block offset adjustment field: adjusts the actual mapped offset of a sub-block. It only needs to be wide enough to represent the number of sub-blocks.

6. Pointer field to the patch array: points to the patch array.

7. Replacement flag bits: record the replacement state.
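The seven fields listed above can be pictured as one record per cache block. The following Python sketch is only an illustration: the class name, the attribute names, the choice of 8 sub-blocks, and the 0-based indexing are assumptions made here, not part of the patent text.

```python
from dataclasses import dataclass, field
from typing import List, Optional

SN = 8  # assumed number of sub-blocks per cache block


@dataclass
class CacheBlockTag:
    """Hypothetical model of the per-block tag fields (names are illustrative)."""
    tag: int = 0                                   # 1. tag bits
    sfm: List[bool] = field(                       # 2. sub-block fault bitmap,
        default_factory=lambda: [False] * SN)      #    True marks a faulty sub-block
    disabled: bool = False                         # 3. cache-block-disabled field
    dirty: bool = False                            # 4. data-modified field
    counter: int = 0                               # 5. sub-block offset adjustment
    patch_ptr: Optional[int] = None                # 6. pointer into the patch array
    lru_rank: int = 0                              # 7. replacement state

    def mapped_offset(self, k: int) -> int:
        """Actual mapped sub-block offset: (k + counter) MOD SN."""
        return (k + self.counter) % SN

    def faulty_count(self) -> int:
        """Number of sub-blocks currently marked faulty in the bitmap."""
        return sum(self.sfm)


block = CacheBlockTag(counter=2)
block.sfm[2] = True
print(block.mapped_offset(7), block.faulty_count())
```

A Boolean list is used for the bitmap to keep the sketch readable; a hardware implementation would hold the same information in a few tag bits.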

The invention proposes a fault-sensitive replacement algorithm that distinguishes two cases, true hit and false hit, illustrated in Figures 2 and 3 (the description of drawings lists Figure 2 as the false-hit flowchart and Figure 3 as the true-hit flowchart):

When a true hit occurs:

Step 1: Select the cache block v at the bottom of the replacement stack as the victim block. Let the cache request be R and let its sub-block offset Sblock(R) be k.

Step 2: Check whether sub-block ((k + C(v)) MOD SN) is fault-free, where C(v) is the counter of block v, by examining the value of B(v)[(k + C(v)) MOD SN]. If it is fault-free, go to step 3; if it is faulty, go to step 4.

Step 3: The value of counter C is left unchanged. Skip to step 5.

Step 4: Adjust the counter C of the cache block, stopping when SFM(v)[(k + NewC(v)) MOD SN] indicates no fault, and set the counter value to NewC.

Step 5: The requested data can now be placed in a fault-free sub-block of cache block v. The replacement state of the cache blocks in the stack is adjusted with an LRU-like promotion policy: cache block v is promoted to the top of the stack.
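Steps 2 to 4 above amount to a small search over counter values. The sketch below is one possible reading, not the patented circuit: the function name `adjust_counter`, the 0-based indices, and the search order (incrementing the counter by one each time) are assumptions, since the text only says the counter is adjusted until a fault-free sub-block is reached.

```python
from typing import List, Optional


def adjust_counter(sfm: List[bool], k: int, counter: int) -> Optional[int]:
    """Return a counter value c for which sub-block (k + c) MOD SN is fault-free.

    sfm[i] is True when sub-block i is faulty (0-based indices, an assumption).
    The current counter is kept when it already maps to a fault-free sub-block
    (step 3); otherwise it is incremented modulo SN until one is found (step 4).
    Returns None when every sub-block of the block is faulty.
    """
    sn = len(sfm)
    for step in range(sn):
        candidate = (counter + step) % sn
        if not sfm[(k + candidate) % sn]:
            return candidate
    return None


# Faulty sub-blocks at 0-based indices 2, 6 and 7; request offset 6; counter 1.
sfm = [False, False, True, False, False, False, True, True]
print(adjust_counter(sfm, k=6, counter=1))
```

With these inputs the counter moves from 1 to 2, because (6 + 1) MOD 8 lands on a faulty sub-block while (6 + 2) MOD 8 does not.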

When a false hit occurs:

Step 1: Select the cache block fv at the bottom of the replacement stack as the victim block.

Step 2: Check whether fv contains dirty data; if it does, write it back. Then go to step 3.

Step 3: Adjust the counter value of fv so that B(fv)[RSblock_fv(R)] indicates a fault-free state, where RSblock_fv(R) is the actual mapped sub-block of request R in fv.

Step 4: Handle the request as a miss.

Step 5: Following the promotion policy, promote fv to the top of the stack; the other cache lines in the stack update their order in turn.
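The false-hit steps can be traced with a toy model of one cache set. Everything in this sketch is hypothetical scaffolding: the dictionary layout, the function name `handle_false_hit`, the action log, and the incrementing counter search are assumptions chosen to make the order of operations visible, and the block with the false hit is passed in directly as the victim.

```python
from typing import Dict, List


def handle_false_hit(stack: List[str], lines: Dict[str, dict],
                     fv: str, k: int) -> List[str]:
    """Apply the false-hit steps to victim fv and return a log of actions.

    stack[0] is the top of the replacement stack, stack[-1] the bottom.
    lines[name] holds {"sfm": [bool, ...], "counter": int, "dirty": bool}.
    """
    line = lines[fv]
    log = []
    if line["dirty"]:                      # step 2: write back dirty data
        log.append("write back " + fv)
        line["dirty"] = False
    sn = len(line["sfm"])
    for step in range(sn):                 # step 3: adjust the counter
        candidate = (line["counter"] + step) % sn
        if not line["sfm"][(k + candidate) % sn]:
            line["counter"] = candidate
            break
    else:
        raise ValueError("no fault-free sub-block in " + fv)
    log.append("reload " + fv)             # step 4: handle as a miss
    stack.remove(fv)                       # step 5: promote to the stack top
    stack.insert(0, fv)
    return log


stack = ["L1", "L2", "L3", "L4"]
lines = {"L3": {"sfm": [False, False, True, False], "counter": 1, "dirty": True}}
print(handle_false_hit(stack, lines, "L3", k=1), stack, lines["L3"]["counter"])
```

The other lines keep their relative order after the promotion, which is what an LRU-like stack update would do.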

The overall fault-tolerant cache architecture of this cache structure is shown in Figure 7:

The reference numerals in the figure denote: 1. tag bits; 2. sub-block fault bitmap field; 3. cache-block-disabled field; 4. data-modified field; 5. sub-block offset adjustment counter field, currently 000; 6. pointer field to the patch array, currently 000101; 7. replacement state bits; 8. partition flag, where 00 marks a free unit, 01 a start unit, 10 an intermediate unit, and 11 an end unit.

In the tag portion of a cache block, the pointer field to the patch array stores the address of the first patch unit. Patch units are provided by a dedicated array, an independently addressed stand-alone module that contains a number of units. Each unit consists mainly of two parts, a flag field and a data field. The flag field contains a few bits whose combinations indicate that (1) the unit is free and has not been allocated to patch a faulty sub-block; (2) the unit is the start unit of a patch section; (3) the unit is the end unit of a patch section; or (4) the unit is an intermediate unit of a patch section. All faulty sub-blocks of a cache block are patched together, because the cache blocks considered worth patching are those with relatively high reuse. The patch units that patch one cache block are allocated contiguously in the array in a single step and form a patch section. The flag field of the section's start unit is set to case (2), that of its end unit to case (3), and that of each intermediate unit to case (4).
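The flag encoding and the contiguous allocation described above can be sketched as follows. The 2-bit values follow the partition flag listed for Figure 7 (00 free, 01 start, 10 intermediate, 11 end); the function name `allocate_section`, the first-fit scan, and the treatment of a one-unit section (marked as an end unit) are assumptions, since the text does not say how a single-unit section is flagged.

```python
from typing import List, Optional

FREE, START, MIDDLE, END = 0b00, 0b01, 0b10, 0b11


def allocate_section(flags: List[int], k: int) -> Optional[int]:
    """Find k contiguous FREE units, mark them as one patch section.

    Returns the index of the section's first unit (the value that would be
    stored in the block's pointer field), or None if no such run exists.
    """
    if k <= 0:
        return None
    run = 0
    for i, flag in enumerate(flags):
        run = run + 1 if flag == FREE else 0
        if run == k:
            start = i - k + 1
            for j in range(start, i + 1):
                flags[j] = MIDDLE
            flags[start] = START
            flags[i] = END  # for k == 1 the single unit ends up as END (assumption)
            return start
    return None


flags = [FREE] * 6
first = allocate_section(flags, 3)
second = allocate_section(flags, 2)
print(first, second, flags)
```

Marking the last unit explicitly lets a reader of the array find the whole section from the start address alone, without storing its length in the cache block.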

The invention provides a dynamic patching procedure that distinguishes three cases, false hit, true hit, and miss, shown in Figures 4, 5, and 6 respectively:

On a false hit:

Assume the cache block with the false hit is fv and that it has k faulty sub-blocks.

Step 1: Check whether the special (patch) array contains k contiguous free units. If not, the procedure ends; otherwise go to step 2.

Step 2: Adjust the counter so that the newly requested data is placed in a fault-free sub-block of cache block fv. The k contiguous free units that were found form a patch section; store the start address of the section in a field of cache block fv (call it field A) and set the flag field of each unit in the section.

Step 3: Reload the data from the backing store. Part of the data is placed in the fault-free sub-blocks of cache block fv; the rest is placed, via the pointer in field A, into the patch units one after another, ordered by ascending offset of the faulty sub-blocks.
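Step 3 above splits a reloaded block between the cache block's fault-free sub-blocks and its patch section. The sketch below shows one way to do that split; the function name `store_block`, the list-based storage, and the rule that logical sub-block j lands on physical sub-block (j + counter) MOD SN for every j are assumptions made for illustration.

```python
from typing import List, Optional


def store_block(data: List[str], sfm: List[bool], counter: int,
                patch_data: List[Optional[str]], ptr: int) -> List[Optional[str]]:
    """Place a reloaded block into fault-free sub-blocks and patch units.

    data[j] is logical sub-block j. Values that map to a faulty physical
    sub-block go into the patch section starting at ptr, with units assigned
    in ascending order of the faulty sub-block offsets.
    Returns the contents of the physical sub-blocks (None where faulty).
    """
    sn = len(sfm)
    faulty = [p for p in range(sn) if sfm[p]]            # ascending offsets
    unit_of = {p: ptr + n for n, p in enumerate(faulty)}
    physical: List[Optional[str]] = [None] * sn
    for j, value in enumerate(data):
        p = (j + counter) % sn                           # actual mapped offset
        if sfm[p]:
            patch_data[unit_of[p]] = value               # goes to a patch unit
        else:
            physical[p] = value                          # goes to the cache block
    return physical


patch_data = [None] * 4
physical = store_block(["a", "b", "c", "d"],
                       [False, False, True, False], 1, patch_data, 2)
print(physical, patch_data)
```

In this run the value "b" maps to the faulty physical sub-block 2 and is therefore written to patch unit 2, the first unit of the section.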

On a true hit:

Assume the cache block that is hit is v.

Step 1: Assemble the requested data according to the value of a field (call it field A). Check whether the cache block has corresponding patch units. If it has none, the requested data is held in the fault-free sub-blocks of cache block v. If it has, the start address of the patch units is recorded in field A; the units are accessed by direct addressing with the value of field A, and the marks in their flag fields allow all patch units to be found quickly in one pass.

Step 2: Part of the data is read from the fault-free sub-blocks of cache block v, and part is read directly from the patch units into built-in registers. The pieces are then assembled, according to the value of a field (call it field B) and the value of the counter, into a complete data block.
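The assembly on a true hit can be sketched as the inverse of the placement on reload. This is an illustrative model only: the function names, the list-based storage, the flag values (taken from the partition flag of Figure 7), and the pairing of the n-th patch unit with the n-th faulty sub-block in ascending order are assumptions.

```python
from typing import List, Optional

FREE, START, MIDDLE, END = 0b00, 0b01, 0b10, 0b11


def section_units(flags: List[int], ptr: int) -> List[int]:
    """Walk the flag fields from ptr up to and including the END unit."""
    units = []
    for i in range(ptr, len(flags)):
        units.append(i)
        if flags[i] == END:
            return units
    raise ValueError("patch section has no end unit")


def load_block(physical: List[Optional[str]], sfm: List[bool], counter: int,
               flags: List[int], patch_data: List[Optional[str]],
               ptr: Optional[int]) -> List[Optional[str]]:
    """Assemble a complete block from fault-free sub-blocks and patch units."""
    sn = len(sfm)
    units = section_units(flags, ptr) if ptr is not None else []
    faulty = [p for p in range(sn) if sfm[p]]
    unit_of = dict(zip(faulty, units))      # n-th faulty offset -> n-th unit
    block = []
    for j in range(sn):
        p = (j + counter) % sn              # actual mapped offset of sub-block j
        if sfm[p]:
            unit = unit_of.get(p)           # None when the block is unpatched
            block.append(patch_data[unit] if unit is not None else None)
        else:
            block.append(physical[p])
    return block


print(load_block(["d", "a", None, "c"], [False, False, True, False], 1,
                 [FREE, FREE, END, FREE], [None, None, "b", None], 2))
```

An unpatched faulty sub-block yields None here, which corresponds to the situation the text treats as a false hit.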

On a miss:

Step 1: Use the value of a field (call it field B) and the value of the counter so that the requested data is placed in a fault-free sub-block of the cache block.

Step 2: When a cache block is selected as the victim by the fault-sensitive replacement algorithm, it must first be written back if it has been modified. Check the value of a field (call it field A): if it indicates patch units, then after the write-back the corresponding patch units must be reclaimed, that is, the flag fields of these units are set to the free state.
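Reclaiming a patch section in step 2 only requires resetting flag fields. A minimal sketch under stated assumptions: the flag values follow the partition flag listed for Figure 7, while `reclaim_section`, `evict`, the dictionary keys, and the action strings are hypothetical names introduced here.

```python
from typing import List

FREE, START, MIDDLE, END = 0b00, 0b01, 0b10, 0b11


def reclaim_section(flags: List[int], ptr: int) -> int:
    """Set every unit of the section starting at ptr to FREE; return the count."""
    freed = 0
    i = ptr
    while i < len(flags) and flags[i] != FREE:
        is_end = flags[i] == END
        flags[i] = FREE
        freed += 1
        if is_end:
            break
        i += 1
    return freed


def evict(block: dict, flags: List[int]) -> List[str]:
    """Write back a modified victim, then release its patch units."""
    actions = []
    if block["dirty"]:
        actions.append("write back")
        block["dirty"] = False
    if block["patch_ptr"] is not None:      # field A marks patch units
        freed = reclaim_section(flags, block["patch_ptr"])
        actions.append(f"reclaimed {freed} patch units")
        block["patch_ptr"] = None
    return actions


flags = [START, MIDDLE, END, START, END, FREE]
victim = {"dirty": True, "patch_ptr": 0}
print(evict(victim, flags), flags)
```

The neighbouring section at units 3 and 4 is untouched, because the walk stops at the first end unit it meets.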

The beneficial effects of the invention are a low latency overhead and only a small impact on average hit access time and cache performance. Performance fluctuates little across different bit failure rates, and the cache works efficiently and stably even at a high failure probability. The special array needs only 0.26 KB to 0.5 KB of storage to patch a 32 KB L1 cache effectively at a bit failure rate of 0.01, with a performance drop within 1% compared with a fault-free cache, so that the L1 cache can work efficiently at supply voltages as low as 400 mV.

Description of drawings

Figure 1 is a structural diagram of the cache block tag field of this cache structure.

Figure 2 is the flowchart of the fault-sensitive replacement algorithm when a false hit occurs.

Figure 3 is the flowchart of the fault-sensitive replacement algorithm when a true hit occurs.

Figure 4 is the flowchart of the dynamic patching procedure when a false hit occurs.

Figure 5 is the flowchart of the dynamic patching procedure when a true hit occurs.

Figure 6 is the flowchart of the dynamic patching procedure when a miss occurs.

Figure 7 is a schematic diagram of the fault-tolerant cache architecture of this cache structure.

Figure 8 is a schematic diagram of the initial cache line contents of this cache structure, the cache lines' reliability state identifier Line-Status, and the corresponding counter values.

Detailed description

The fault-sensitive replacement algorithm is illustrated below with an example:

Consider an ordinary cache set of associativity 4 in the L1 cache, that is, a set containing four cache lines. The initial cache line contents, the reliability state identifier Line-Status of each cache line, and the corresponding counter values are shown in Figure 8. Each cache line consists of a tag part and a data part. The tag part contains the tag bits, the sub-block fault bitmap field, the cache-block-disabled field, the data-modified field, the sub-block offset adjustment field, the pointer field to the patch array, and the replacement flag bits. The data part treats every 8 contiguous bytes as one sub-block, and a cache line contains 8 sub-blocks.

Initially, sub-blocks 3 and 5 of the first cache line are faulty, sub-blocks 5 and 6 of the second cache line are faulty, sub-block 3 of the third cache line is faulty, and sub-blocks 3, 7, and 8 of the fourth cache line are faulty. The corresponding counter values are 001 for the first line, 010 for the second, 000 for the third, and 001 for the fourth. Let a be the third sub-block of the fourth cache line, b the fifth sub-block of the second cache line, c the sixth sub-block of the third cache line, d the third sub-block of the first cache line, and e the fourth sub-block of the second cache line. The SFM field values of the four cache lines are 00101000, 00001100, 00100000, and 00100011 respectively.

When cache request R1 arrives, requesting data g, the request misses. Assume the request address maps to sub-block offset 7 of the cache line. The algorithm first looks for a candidate victim block and selects the cache line at the bottom of the replacement stack, L4 (Line4); if Line4 contains dirty data, it is written back. Because B(L4)[7] and B(L4)[8] both equal 1, those sub-blocks are faulty, so the counter of cache line L4 is set to 2. By the formula, RSblock(R) = (7 + 2) MOD 8 = 1, so the requested data g is actually stored in the sub-block at offset 1. Then, following the promotion policy, Line4 is promoted to the top of the stack and the other cache lines in the stack update their order in turn.

Next, cache request R2 arrives, requesting data e. Data e was already placed in Line3 together with the requested data b, but it sits in a faulty sub-block, so requesting e produces a false hit. The algorithm treats a false hit as a miss and selects the cache line with the false hit as the candidate victim. Reloading the data block directly into that cache line, however, would again put the requested data into a faulty sub-block, so the line's counter must be adjusted first. For cache line Line3, Sblock(e) equals 4 and SFM(L3)[4] equals 0, so the counter can be set to 0, which makes RSblock(e) equal to 4 and places data e in a fault-free sub-block. As for the replacement state after the false hit, under the promotion policy the replacement state of all cache lines remains unchanged.
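The arithmetic for request R1 in the example can be checked in a few lines. The sketch uses the example's 1-based sub-block numbering and its statement that sub-blocks 3, 7, and 8 of the fourth line are faulty; treating a remainder of 0 as the eighth sub-block is an assumption made here, and the function names are illustrative.

```python
SN = 8                      # sub-blocks per cache line in the example
FAULTY_LINE4 = {3, 7, 8}    # faulty sub-blocks of the fourth line (1-based)


def rsblock(sblock: int, counter: int) -> int:
    """Actual mapped sub-block offset: (Sblock + counter) MOD SN."""
    return (sblock + counter) % SN


def is_faulty(offset: int) -> bool:
    """True if the mapped offset is a faulty sub-block of the fourth line.

    A remainder of 0 is read as sub-block 8 (an assumption of this sketch).
    """
    position = offset if offset != 0 else SN
    return position in FAULTY_LINE4


for counter in (1, 2):
    offset = rsblock(7, counter)
    print(counter, offset, "faulty" if is_faulty(offset) else "fault-free")
```

With the initial counter value 1 the request at offset 7 maps onto a faulty sub-block, while counter value 2 gives (7 + 2) MOD 8 = 1, a fault-free sub-block, which agrees with the counter value chosen in the example.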

When cache request R3 arrives, data b is requested again. The false hit caused by cache request R2, however, reloaded cache line Line3 and adjusted its counter so that data b now lies in a faulty sub-block; the request for b therefore also produces a false hit. Handling request R3 sets the counter to 1 again, so that data b can be accessed while data e is once more placed in a faulty sub-block. Accesses to e and b thus cause frequent false hits on cache block Line3, and the data block is loaded into the L1 cache again and again, producing a thrashing of false hits.

Claims (3)

1.一种动态修补的高性能片上缓存容错架构,其特征在于:一个缓存块包含若干个子块,缓存块的结构新增加了域,所述子块及所述域包括:1. a high-performance on-chip cache fault-tolerant framework of dynamic patching, is characterized in that: a cache block comprises several sub-blocks, and the structure of the cache block has newly increased domain, and described sub-block and described domain comprise: (1)标志标示位,用于表示所述缓存块的标号;(1) a flag marking bit, used to represent the label of the cache block; (2)子块故障位图域,用于指示了每个缓存块的子块故障状态,包含若干位,对应若干个子块,若子块是故障子块则对该位进行设置以与正常子块进行区分,该域在系统初始时进行填充,电压频率动态调整时更新,可根据奇偶校验的结果及时更新,如果多次连续发现位故障,可以认为发生间歇性位失效,从而及时修改所述子块故障位图域;(2) Sub-block failure bitmap domain, used to indicate the sub-block failure status of each cache block, including several bits, corresponding to several sub-blocks, if the sub-block is a faulty sub-block, then this bit is set to be consistent with the normal sub-block To distinguish, this field is filled at the initial stage of the system, updated when the voltage frequency is dynamically adjusted, and can be updated in time according to the result of the parity check. If bit failures are found continuously for many times, it can be considered that intermittent bit failures have occurred, so as to modify the described subblock fault bitmap field; (3)缓存是否禁用域,用于指示该缓存块是否被禁用,如果一个缓存块中不存在连续的无故障子块,则该缓存块被禁用,该域的值进行设置,与正常可用进行区分;(3) Whether the cache is disabled field is used to indicate whether the cache block is disabled. If there is no continuous non-faulty sub-block in a cache block, the cache block is disabled, and the value of this field is set to be normal and available. 
distinguish; (4)数据是否修改域,用于指示数据是否被修改,若修改则对该域进行修改,否则不变;(4) Whether the data is modified field is used to indicate whether the data is modified, if modified, the field is modified, otherwise it remains unchanged; (5)子块偏移调整域,用于调整子块的实际映射偏移位置,其大小可以表示出子块个数即可;(5) The sub-block offset adjustment field is used to adjust the actual mapping offset position of the sub-block, and its size can indicate the number of sub-blocks; (6)指向修补阵列的指针域,用于指向修补阵列;以及(6) A pointer field pointing to the patching array, used to point to the patching array; and (7)替换标志位,用于记录替换的情况;(7) Replacement flag, used to record the situation of replacement; 其中实际映射的子块偏移由子块偏移与块内偏移相加得到。The actual mapped sub-block offset is obtained by adding the sub-block offset and the intra-block offset. 2.根据权利要求1所述的一种动态修补的高性能片上缓存容错架构,其特征在于:所述架构通过一个故障敏感的替换机制,随时容忍新发生的故障,保证缓存正常工作,所述故障敏感的替换机制基于最近最少使用(Least RecentlyUsed:LRU)替换算法,考虑数据重用以及子块的故障状态,根据新增的缓存容错架构来对缓存块的计数器进行调整,使得请求的数据可以放入无故障的子块中,所述替换算法分为真命中和假命中两种情况:2. the high-performance on-chip cache fault-tolerant architecture of a kind of dynamic repair according to claim 1, is characterized in that: described architecture is through a fault-sensitive replacement mechanism, tolerates the fault that takes place newly at any time, guarantees cache normal operation, and described The fault-sensitive replacement mechanism is based on the least recently used (Least Recently Used: LRU) replacement algorithm, considering data reuse and sub-block failure status, and adjusting the counter of the cache block according to the new cache fault-tolerant architecture, so that the requested data can be placed In the fault-free sub-block, the replacement algorithm is divided into two cases of true hit and false hit: (1)当发生真命中时,所述替换算法步骤包括:(1) When a true hit occurs, the steps of the replacement algorithm include: 步骤1:选择替换堆栈底部的缓存块v作为移除块,缓存请求为R,某块Sblock(R)为k;Step 1: Select the cache block v at the bottom of the replacement stack as the removal block, the cache request is R, and a block Sblock(R) is k; 
步骤2:判断((k+计数器C(v))MOD SN)的子块是无故障的,即B(v)[(k+C(v))MOD SN]数值检查,若无故障则转到步骤3,若有故障则转到步骤4;Step 2: Judging that the sub-block of ((k+counter C(v))MOD SN) is fault-free, that is, check the value of B(v)[(k+C(v))MOD SN], if there is no fault, go to Step 3, if there is a fault, go to step 4; 步骤3:计数器C的值不变,跳至步骤5;Step 3: The value of counter C remains unchanged, skip to step 5; 步骤4:调整缓存块的计数器C,当SFM(v)[(k+NewC(v))MOD SN]值表示无故障时停止调整,计数器值设为NewC;Step 4: Adjust the counter C of the cache block, stop the adjustment when the value of SFM(v)[(k+NewC(v))MOD SN] indicates no fault, and set the counter value to NewC; 步骤5:请求的数据放置到缓存块v的无故障子块中,与LRU类似的提升策略进行堆栈内缓存块的替换状态的调整,缓存块v被提升到堆栈的顶部;Step 5: The requested data is placed in the fault-free sub-block of the cache block v, and the promotion strategy similar to LRU is used to adjust the replacement state of the cache block in the stack, and the cache block v is promoted to the top of the stack; (2)当发生假命中时,所述替换算法步骤包括:(2) When a false hit occurs, the replacement algorithm steps include: 步骤1:选择替换堆栈底部的缓存块fv作为移除块;Step 1: Select the cache block fv at the bottom of the replacement stack as the removal block; 步骤2:判断fv是否包含脏数据,若包含则进行回写,完成后转到步骤3;Step 2: Determine whether fv contains dirty data, if so, write back, and go to step 3 after completion; 步骤3:调整fv的计数器值,使得B(fv)[RSblockfv(R)]值表示为无故障状态;Step 3: adjust the counter value of fv, so that the value of B(fv)[RSblockfv(R)] represents a no-fault state; 步骤4:进行缺失情况处理;Step 4: Carry out missing situation processing; 步骤5:依据提升策略,将fv提升到栈顶,栈内其它缓存行依次更新其次序。Step 5: According to the promotion policy, promote fv to the top of the stack, and update the order of other cache lines in the stack in turn. 3.根据权利要求2所述的一种动态修补的高性能片上缓存容错架构,其特征在于:所述动态的修补故障子块是对于出现故障子块的缓存块,在修补域中找到连续的空间通过本缓存结构的缓存容错架构某个域A和某个域B来对应替换故障子块,并在缓存块使用后对修补空间的替换块进行释放,所述动态的修补故障子块的过程分为假命中,真命中和缺失三种情况,分别是:3. 
The high-performance on-chip cache fault-tolerant architecture with dynamic repair according to claim 2, characterized in that: dynamically repairing faulty sub-blocks means that, for a cache block containing faulty sub-blocks, a contiguous space is found and, using field A and field B of the cache fault-tolerant architecture for this cache structure, substituted for the faulty sub-blocks, and the replacement blocks of the repair space are released after the cache block has been used. The process of dynamically repairing faulty sub-blocks is divided into three cases, namely false hit, true hit and miss:
(1) When a false hit occurs, the step of dynamically repairing the faulty sub-blocks comprises:
Assume that the cache block in which the false miss occurs is fv and that it has k faulty sub-blocks;
Step 1: Check whether the special array contains k consecutive free units; if not, the procedure ends, otherwise go to step 2;
Step 2: Adjust the counter so that the newly requested data is placed in a fault-free sub-block of cache block fv. If k consecutive free units are found, they form a repair section; the start address of the repair section is stored in field A of cache block fv, and the flag field of each unit in the repair section is set;
Step 3: Reload the data from the backing store. Part of the data is placed in the fault-free sub-blocks of cache block fv; the rest is placed, through the pointer in field A, into the repair units one after another, ordered by ascending offset of the faulty sub-blocks;
(2) When a true hit occurs, the step of dynamically repairing the faulty sub-blocks comprises:
Assume that the hit cache block is v;
Step 1: Assemble the requested data according to the value of field A by checking whether the cache block has corresponding repair units. If so, the requested data is stored in the fault-free sub-blocks of cache block v; if not, the cache block has corresponding repair units and their start address is recorded in field A. According to the value of field A, the corresponding repair units are addressed directly and, according to the marks in the flag fields, all repair units are found quickly in a single pass;
Step 2: Read part of the data from the fault-free sub-blocks of cache block v and read the rest directly from the repair units into the built-in registers, then assemble the parts according to the value of field B and the value of the counter to form a complete data block;
(3) When a miss occurs, the step of dynamically repairing the faulty sub-blocks comprises:
Step 1: According to the value of field B and the value of the counter, place the requested data in a fault-free sub-block of the cache block;
Step 2: When a cache block is selected as the victim cache block by the replacement algorithm of the fault-sensitive replacement mechanism, it must first be written back if it has been modified. The value of field A is then checked; if it indicates repair units, the corresponding repair units are reclaimed after the write-back, that is, the flag fields of these units are set to the free state.
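The three cases in the claim can be illustrated with a small software model. What follows is a minimal Python sketch under stated assumptions, not the patented hardware: the names `RepairArray`, `CacheBlock`, `fill`, `read` and `evict` are hypothetical, the search for k consecutive free units is assumed to be first-fit, an empty field A (`None`) is assumed to mean that the block has no repair section, and field B, the counter and the write-back of modified blocks are left out.

```python
from dataclasses import dataclass
from typing import List, Optional

FREE, USED = 0, 1  # hypothetical encodings of the per-unit flag field


class RepairArray:
    """Model of the special array that supplies repair units."""

    def __init__(self, size: int) -> None:
        self.flags = [FREE] * size
        self.cells: List[Optional[int]] = [None] * size

    def allocate(self, k: int) -> Optional[int]:
        """Find k consecutive free units (first-fit, an assumption) and mark them used."""
        run = 0
        for i, flag in enumerate(self.flags):
            run = run + 1 if flag == FREE else 0
            if run == k:
                start = i - k + 1
                for j in range(start, start + k):
                    self.flags[j] = USED
                return start
        return None

    def release(self, start: int, k: int) -> None:
        """Reclaim a repair section by setting its flag fields back to free."""
        for j in range(start, start + k):
            self.flags[j] = FREE
            self.cells[j] = None


@dataclass
class CacheBlock:
    faulty: List[bool]                        # fault map, one entry per sub-block
    data: Optional[List[Optional[int]]] = None
    field_a: Optional[int] = None             # start address of the repair section

    def faulty_offsets(self) -> List[int]:
        return [i for i, bad in enumerate(self.faulty) if bad]


def fill(block: CacheBlock, repair: RepairArray, words: List[int]) -> bool:
    """False-hit case: reserve a repair section, then reload the data."""
    bad = block.faulty_offsets()
    if bad:
        start = repair.allocate(len(bad))
        if start is None:
            return False                      # step 1: no k consecutive free units
        block.field_a = start                 # step 2: record the start address
    # Step 3: fault-free sub-blocks hold their own words ...
    block.data = [None if f else w for w, f in zip(words, block.faulty)]
    # ... and the remaining words go to the repair units in ascending offset order.
    for n, off in enumerate(bad):
        repair.cells[block.field_a + n] = words[off]
    return True


def read(block: CacheBlock, repair: RepairArray) -> List[Optional[int]]:
    """True-hit case: merge fault-free sub-blocks with the repair units."""
    out = list(block.data)
    if block.field_a is not None:
        for n, off in enumerate(block.faulty_offsets()):
            out[off] = repair.cells[block.field_a + n]
    return out


def evict(block: CacheBlock, repair: RepairArray) -> None:
    """Victim selection on a miss: release the repair section, if any."""
    if block.field_a is not None:
        repair.release(block.field_a, len(block.faulty_offsets()))
        block.field_a = None
    block.data = None
```

For example, with a four-unit repair array whose unit 1 is already in use, a block with faulty sub-blocks 1 and 3 receives the repair section starting at unit 2, `read` returns the full four-word block, and `evict` marks units 2 and 3 free again.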
CN201710298651.7A 2017-06-13 2017-06-13 The high-performance on piece caching fault-tolerant architecture of dynamic repairing Expired - Fee Related CN107122256B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201710298651.7A CN107122256B (en) 2017-06-13 2017-06-13 The high-performance on piece caching fault-tolerant architecture of dynamic repairing

Publications (2)

Publication Number Publication Date
CN107122256A (en) 2017-09-01
CN107122256B (en) 2018-06-19

Family

ID=59726193

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201710298651.7A Expired - Fee Related CN107122256B (en) 2017-06-13 2017-06-13 The high-performance on piece caching fault-tolerant architecture of dynamic repairing

Country Status (1)

Country Link
CN (1) CN107122256B (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6810489B1 (en) * 2000-09-06 2004-10-26 Hewlett-Packard Development Company, L.P. Checkpoint computer system utilizing a FIFO buffer to re-synchronize and recover the system on the detection of an error
WO2012005938A3 (en) * 2010-07-05 2012-03-08 Intel Corporation Fault tolerance of multi-processor system with distributed cache
CN103870353A (en) * 2014-03-18 2014-06-18 北京控制工程研究所 Multicore-oriented reconfigurable fault tolerance system and multicore-oriented reconfigurable fault tolerance method

Family Cites Families (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US7168010B2 (en) * 2002-08-12 2007-01-23 Intel Corporation Various methods and apparatuses to track failing memory locations to enable implementations for invalidating repeatedly failing memory locations
CN105005513B (en) * 2015-08-19 2017-12-05 首都师范大学 The detection of cache long numeric data upset mistake and fault tolerance facility and method
CN105335247B (en) * 2015-09-24 2018-04-20 中国航天科技集团公司第九研究院第七七一研究所 The fault-tolerant architecture and its fault-tolerance approach of Cache in highly reliable System on Chip/SoC

Non-Patent Citations (5)

* Cited by examiner, † Cited by third party
Title
"A fault-tolerant directory-based cache coherence protocol for CMP architecture"; Ricardo Fernandez-Pascual et al.; "International Conference on Dependable Systems & Networks"; 20080630; full text *
"Analyzing the Optimal Voltage/Frequency Pair in Fault-Tolerant Caches"; Vicente Lorente; "IEEE International Conference on High Performance Computing and Communications"; 20141031; full text *
"CacheFI: an on-chip cache fault-tolerance evaluation tool based on architecture-level fault injection"; 黄智濒 et al.; "Journal of National University of Defense Technology"; 20161031; Vol. 38, No. 5; full text *
"Fault-Tolerant Features in the Hal Memory Management Unit"; Nirmal R. Saxena et al.; "IEEE TRANSACTIONS ON COMPUTERS"; 19950228; Vol. 44, No. 2; full text *
"Research on a cache replacement model for multi-request patterns under a hybrid architecture"; 曹旻 et al.; "Computer Science"; 20150630; Vol. 42, No. 6; full text *

Also Published As

Publication number Publication date
CN107122256A (en) 2017-09-01

Similar Documents

Publication Publication Date Title
US9389954B2 (en) Memory redundancy to replace addresses with multiple errors
KR101495049B1 (en) Method and apparatus for using cache memory in a system that supports a low power state
Yoon et al. FREE-p: Protecting non-volatile memory against both hard and soft errors
US10860495B2 (en) Storage circuitry responsive to a tag-matching command
CN103455441B (en) Disabling cache portions during low voltage operating
US8560881B2 (en) FLASH-based memory system with static or variable length page stripes including data protection information and auxiliary protection stripes
US7840848B2 (en) Self-healing cache operations
US9003247B2 (en) Remapping data with pointer
CN101782871B (en) Information processing device, processor and memory management method
US7430145B2 (en) System and method for avoiding attempts to access a defective portion of memory
US8977820B2 (en) Handling of hard errors in a cache of a data processing apparatus
US7856576B2 (en) Method and system for managing memory transactions for memory repair
US9710378B2 (en) Writing an address conversion table for nonvolatile memory wear leveling
CN101379566B (en) Apparatus, system, and method for repairing a location in a cache array
JP2008186460A (en) Method and system for dynamically recoverable memory
US9092357B2 (en) Remapping of inoperable memory blocks
US9424195B2 (en) Dynamic remapping of cache lines
US9477548B2 (en) Error repair location cache
US20110296082A1 (en) Method for Improving Service Life of Flash
CN117331508A (en) Headlamp data storage method based on NAND Flash
CN107122256B (en) The high-performance on piece caching fault-tolerant architecture of dynamic repairing
JP4703673B2 (en) Memory system
CN107329906B (en) A kind of multiple dimensioned failure bitmap buffer structure of high bandwidth
US20140281254A1 (en) Semiconductor Chip With Adaptive BIST Cache Testing During Runtime
HK1110431A (en) Various methods and apparatuses to track failing memory locations to enable implementations for invalidating repeatedly failing memory locations

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant
CF01 Termination of patent right due to non-payment of annual fee

Granted publication date: 20180619