
CN103218275B - Data error repair method, apparatus, and device - Google Patents

Data error repair method, apparatus, and device

Info

Publication number
CN103218275B
Authority
CN
China
Prior art keywords
memory
data error
physical address
failure
data
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201310105316.2A
Other languages
Chinese (zh)
Other versions
CN103218275A (en
Inventor
傅汝丹
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
XFusion Digital Technologies Co Ltd
Original Assignee
Huawei Technologies Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Huawei Technologies Co Ltd
Priority to CN201310105316.2A
Publication of CN103218275A
Application granted
Publication of CN103218275B
Status: Active
Anticipated expiration

Landscapes

  • Techniques For Improving Reliability Of Storages (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The invention discloses a data error repair method, apparatus, and device, belonging to the field of terminal devices. The method includes: determining whether a preset counter in memory has overflowed, where the preset counter counts data errors occurring in the memory; and, if the preset counter has overflowed, determining the failure type of the memory according to the physical addresses of data errors stored in the memory, so that the corresponding repair can subsequently be performed. By using the physical addresses stored in the preset storage space to distinguish memory failure types, and by repairing according to the failure type, the invention avoids system hangs and boot failures caused by accumulated data errors and keeps services running normally.

Description

Data error repair method, apparatus, and device

Technical field

The invention relates to the field of computer technology, and in particular to a data error repair method, apparatus, and device.

Background

Memory is an essential component of a computer system and is usually installed, in the form of memory modules, in systems of various architectures. While the system is running, the memory may suffer hard failures or soft failures. A hard failure is an unrecoverable data error caused by a hardware fault. A soft failure is a data error caused by a bit flip that can be cleared by power cycling or restarting the system. To keep the system running normally, the data errors introduced by both hard and soft failures need to be repaired.

Prior-art repair methods generally add an ECC (Error Checking and Correction) chip to the memory module. When a data error occurs in the memory, the ECC detects the error and outputs the correct data to the user.

In the course of implementing the invention, the inventor found that the prior art has at least the following problems:

ECC only outputs correct data to the user when a data error occurs; it takes no action to repair the erroneous data in the memory. ECC cannot distinguish hard failures from soft failures and therefore cannot repair the erroneous data. As erroneous data accumulates, the system can easily hang or fail to boot, which disrupts normal services.
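The behavior described above, correcting the data on output while leaving the stored word untouched, can be illustrated with a toy single-error-correcting code. The sketch below uses Hamming(7,4) as a stand-in; the patent does not say which code the ECC chip uses, and the function names are illustrative assumptions.

```python
def hamming74_encode(data_bits):
    """Encode 4 data bits into a 7-bit codeword (bit positions 1..7)."""
    d1, d2, d3, d4 = data_bits
    p1 = d1 ^ d2 ^ d4  # parity over positions 1, 3, 5, 7
    p2 = d1 ^ d3 ^ d4  # parity over positions 2, 3, 6, 7
    p4 = d2 ^ d3 ^ d4  # parity over positions 4, 5, 6, 7
    return [p1, p2, d1, p4, d2, d3, d4]


def hamming74_decode(stored_word):
    """Return (corrected data bits, error position); position 0 means no error.

    The correction is applied to a local copy, so the stored word keeps
    its flipped bit, which mirrors the limitation described in the text.
    """
    c = list(stored_word)
    s1 = c[0] ^ c[2] ^ c[4] ^ c[6]
    s2 = c[1] ^ c[2] ^ c[5] ^ c[6]
    s4 = c[3] ^ c[4] ^ c[5] ^ c[6]
    position = s1 + 2 * s2 + 4 * s4
    if position:
        c[position - 1] ^= 1  # correct the single flipped bit in the copy
    return [c[2], c[4], c[5], c[6]], position
```

Decoding a word with one flipped bit returns the original data and the position of the flip, but the caller's stored word is unchanged until something writes the corrected value back.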

Summary of the invention

To solve the problem of distinguishing and handling soft and hard failures, embodiments of the invention provide a data error repair method, apparatus, and device. The technical solution is as follows:

In a first aspect, a data error repair method is provided, the method comprising:

determining whether a preset counter in memory has overflowed, the preset counter being used to count data errors occurring in the memory;

if the preset counter has overflowed, determining the failure type of the memory according to the physical addresses of data errors stored in the memory, so that the corresponding repair can subsequently be performed.

With reference to the first aspect, in a first possible implementation of the embodiments of the invention, the step of determining, if the preset counter has overflowed, the failure type of the memory according to the physical addresses of data errors stored in the memory, so that the corresponding repair can subsequently be performed, comprises:

if the physical addresses of data errors stored in the memory include identical physical addresses, determining that the failure type of the data error corresponding to the identical physical address is a hard failure.

With reference to the first aspect, in a second possible implementation of the embodiments of the invention, the step of determining, if the preset counter has overflowed, the failure type of the memory according to the physical addresses of data errors stored in the memory, so that the corresponding repair can subsequently be performed, comprises:

if the physical addresses of data errors stored in the memory include no identical physical addresses, performing a memory patrol;

after the patrol ends, determining whether the data error in the memory has been repaired;

if the data error has not been repaired, determining that the failure type of the data error is a hard failure;

if the data error has been repaired, determining that the failure type of the data error is a soft failure.

With reference to the first aspect, in a third possible implementation of the embodiments of the invention, the step of performing a memory patrol if the physical addresses of data errors stored in the memory include no identical physical addresses comprises:

if the physical addresses of data errors stored in the memory include no identical physical addresses, converting the physical address in the preset counter into a patrol address;

patrolling the data in the memory that corresponds to the patrol address.

With reference to the first aspect, in a fourth possible implementation of the embodiments of the invention, after the step of determining, if the preset counter has overflowed, the failure type of the memory according to the physical addresses of data errors stored in the memory, the method further comprises:

when the failure type of the memory is determined to be a hard failure, obtaining the physical address of the data error whose failure type is a hard failure;

triggering an alarm to prompt the user to replace the memory corresponding to that physical address.

With reference to the first aspect, in a fifth possible implementation of the embodiments of the invention, before determining whether the preset counter in memory has overflowed, the method further comprises:

when a data error occurs in the memory, obtaining the physical address at which the data error occurred;

storing that physical address in the memory, and writing data back to that physical address.

In a second aspect, a data error repair apparatus is provided, the apparatus comprising:

a judging module, configured to determine whether a preset counter in memory has overflowed, the preset counter being used to count data errors occurring in the memory;

a determining module, configured to determine, if the preset counter has overflowed, the failure type of the memory according to the physical addresses of data errors stored in the memory, so that the corresponding repair can subsequently be performed.

With reference to the second aspect, in a first possible implementation of the embodiments of the invention, the determining module is configured to determine, if the physical addresses of data errors stored in the memory include identical physical addresses, that the failure type of the data error corresponding to the identical physical address is a hard failure.

With reference to the second aspect, in a second possible implementation of the embodiments of the invention, the determining module comprises:

a patrol unit, configured to perform a memory patrol if the physical addresses of data errors stored in the memory include no identical physical addresses;

a judging unit, configured to determine, after the patrol ends, whether the data error in the memory has been repaired;

a determining unit, configured to determine that the failure type of the data error is a hard failure if the data error has not been repaired;

the determining unit being further configured to determine that the failure type of the data error is a soft failure if the data error has been repaired.

With reference to the second aspect, in a third possible implementation of the embodiments of the invention, the patrol unit comprises:

a patrol address conversion subunit, configured to convert the physical address in the preset counter into a patrol address if the physical addresses of data errors stored in the memory include no identical physical addresses;

a patrol subunit, configured to patrol the data in the memory that corresponds to the patrol address.

With reference to the second aspect, in a fourth possible implementation of the embodiments of the invention, the apparatus further comprises:

a hard-failure physical address acquisition module, configured to obtain, when the failure type of the memory is determined to be a hard failure, the physical address of the data error whose failure type is a hard failure;

a triggering module, configured to trigger an alarm to prompt the user to replace the memory corresponding to that physical address.

With reference to the second aspect, in a fifth possible implementation of the embodiments of the invention, the apparatus further comprises:

a data-error physical address acquisition module, configured to obtain, when a data error occurs in the memory, the physical address at which the data error occurred;

a storage module, configured to store that physical address in the memory;

a write-back module, configured to write data back to that physical address.

In a third aspect, a data error repair device is provided, the device comprising:

a memory, configured to store data and the physical addresses at which data errors occurred;

a processor, configured to determine whether a preset counter in the memory has overflowed, the preset counter being used to count data errors occurring in the memory;

the processor being further configured to determine, if the preset counter has overflowed, the failure type of the memory according to the physical addresses of data errors stored in the memory, so that the corresponding repair can subsequently be performed.

The technical solution provided by the embodiments of the invention has the following beneficial effects:

The data error repair method, apparatus, and device provided by the embodiments of the invention determine whether a preset counter in memory has overflowed, the preset counter being used to count data errors occurring in the memory; if the preset counter has overflowed, the failure type of the memory is determined according to the physical addresses of data errors stored in the memory, so that the corresponding repair can subsequently be performed. With this technical solution, memory failure types can be distinguished and repaired according to their type, which avoids system hangs and boot failures caused by accumulated data errors and keeps services running normally.

Brief description of the drawings

To describe the technical solutions in the embodiments of the invention more clearly, the drawings used in the description of the embodiments are briefly introduced below. The drawings described below are only some embodiments of the invention; those of ordinary skill in the art can derive other drawings from them without creative effort.

Fig. 1 is a flowchart of a data error repair method provided in an embodiment of the invention;

Fig. 2 is a flowchart of a data error repair method provided in an embodiment of the invention;

Fig. 3 is a schematic structural diagram of a data error repair apparatus provided in an embodiment of the invention;

Fig. 4 is a schematic structural diagram of a data error repair device provided in an embodiment of the invention.

Detailed description

To make the objectives, technical solutions, and advantages of the invention clearer, the embodiments of the invention are described in further detail below with reference to the drawings.

In the embodiments of the invention, a terminal device is a device that provides users with data processing, voice, and/or data connectivity, and may be a wireless terminal or a wired terminal. A wireless terminal may be a handheld device with a wireless connection function, another processing device connected to a wireless modem, or a mobile terminal that communicates with one or more core networks via a radio access network. For example, a wireless terminal may be a mobile phone (also called a "cellular" phone) or a computer with a mobile terminal. A wireless terminal may also be a portable, pocket-sized, handheld, computer-built-in, or vehicle-mounted mobile device. As a further example, a wireless terminal may be a mobile station, an access point, or user equipment (UE).

Fig. 1 is a flowchart of a data error repair method provided in an embodiment of the invention. The method is executed by a terminal device. Referring to Fig. 1, the method includes:

101: Determine whether a preset counter in memory has overflowed, the preset counter being used to store the physical addresses at which data errors occurred in the memory;

The preset counter is a space set up in advance in the memory. Its size is set by technicians during design and is not specifically limited in the embodiments of the invention.

Preferably, the preset counter reads the ECC register at regular intervals. When the flag bit read from the ECC register indicates that data in the memory contains an error, the value of the preset counter is incremented by 1. Further, the value of the preset counter is decremented by 1 every preset duration, where the preset duration is longer than the read interval. When the value of the preset counter exceeds the overflow threshold, the preset counter overflows. Parameters such as the read interval, the preset duration, and the overflow threshold can be set by technicians and are not specifically limited in the embodiments of the invention.
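The counting rule described above behaves like a leaky counter: errors push the value up, elapsed time pulls it down, and only a sustained error rate reaches the threshold. The following is a minimal sketch under stated assumptions; the class name, field names, and the choice to floor the value at zero are illustrative and are not specified by the patent.

```python
from dataclasses import dataclass


@dataclass
class LeakyErrorCounter:
    """Counts ECC error observations and decays over time."""

    overflow_threshold: int  # the counter overflows when value exceeds this
    poll_interval: float     # seconds between ECC register reads
    decay_interval: float    # seconds between decrements
    value: int = 0

    def __post_init__(self) -> None:
        # The text requires the decay period to be longer than the read interval.
        if self.decay_interval <= self.poll_interval:
            raise ValueError("decay_interval must be longer than poll_interval")

    def on_poll(self, ecc_error_flag: bool) -> None:
        """Call once per poll_interval with the error flag read from the ECC register."""
        if ecc_error_flag:
            self.value += 1

    def on_decay_tick(self) -> None:
        """Call once per decay_interval. Flooring at zero is an assumption."""
        if self.value > 0:
            self.value -= 1

    def overflowed(self) -> bool:
        return self.value > self.overflow_threshold
```

Because the decay period is longer than the read interval, occasional isolated errors drain away before the threshold is reached, while a burst of errors triggers the overflow check.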

When the terminal device determines whether the preset counter in memory has overflowed, a corresponding instruction can be triggered when the value of the preset counter exceeds the overflow threshold. If the terminal device receives the instruction triggered on overflow, it determines that the preset counter has overflowed; otherwise, it determines that the preset counter has not overflowed.

Preferably, the data error is a single-bit error. When the preset counter overflows, the failure type of the data at each stored physical address where a single-bit data error occurred must be determined and handled, to prevent multi-bit data errors from occurring. When the preset counter has not overflowed, few such physical addresses are stored in the memory, and the data at those addresses need not be processed.

102: If the preset counter has overflowed, determine the failure type of the memory according to the physical addresses of data errors stored in the memory, so that the corresponding repair can subsequently be performed.

Memory failure types are divided into soft failures and hard failures. A data error caused by a soft failure can be repaired by write-back, that is, by writing the correct data to the physical address of the soft failure. A data error caused by a hard failure cannot be repaired by write-back; the affected memory can only be replaced manually.

If the preset counter has overflowed, the terminal device needs to determine the failure type of the memory corresponding to each physical address stored in the memory. The terminal device can repeatedly read the data at a stored physical address and check whether that data has been repaired. If the data has been repaired, the failure type of the memory at that physical address is a soft failure; if not, it is a hard failure. For example, if the data at a physical address has been read several times and detection shows that it still contains an error, the failure type of the memory at that physical address is a hard failure.

Preferably, whether the data at a stored physical address has been repaired can be detected through the ECC register.

When the failure type of a stored physical address is determined to be a soft failure, the correct data is written back to that physical address. When the failure type is determined to be a hard failure, the correct data cannot be written back to that physical address; instead, the user is notified that the memory error is a hard failure and that the memory corresponding to the physical address must be replaced manually, to prevent accumulated multi-bit errors from causing problems such as system hangs.
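The classification and handling rule in step 102 can be sketched as follows. The error check is abstracted as a caller-supplied `still_faulty` callback (in the patent, this information comes from the ECC register); that callback, the `reads` parameter, and the `write_back` and `raise_alarm` hooks are hypothetical names introduced only for illustration.

```python
from enum import Enum
from typing import Callable, Dict, Iterable


class FailureType(Enum):
    SOFT = "soft"
    HARD = "hard"


def classify_addresses(
    addresses: Iterable[int],
    still_faulty: Callable[[int], bool],
    reads: int = 3,
) -> Dict[int, FailureType]:
    """Re-read each logged address; an error that persists on every read is hard."""
    result: Dict[int, FailureType] = {}
    for addr in addresses:
        persists = all(still_faulty(addr) for _ in range(reads))
        result[addr] = FailureType.HARD if persists else FailureType.SOFT
    return result


def handle_failure(
    addr: int,
    kind: FailureType,
    write_back: Callable[[int], None],
    raise_alarm: Callable[[int], None],
) -> None:
    """Soft failures are repaired by write-back; hard failures raise an alarm."""
    if kind is FailureType.SOFT:
        write_back(addr)
    else:
        raise_alarm(addr)
```

Keeping the hardware access behind callbacks lets the decision logic be exercised without real ECC registers; how many reads are enough is a tuning choice that the text leaves open.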

The data error repair method provided by this embodiment determines whether a preset counter in memory has overflowed, the preset counter being used to count data errors occurring in the memory; if the preset counter has overflowed, the failure type of the memory is determined according to the physical addresses of data errors stored in the memory, so that the corresponding repair can subsequently be performed. With this technical solution, memory failure types can be distinguished and repaired according to their type, which avoids system hangs and boot failures caused by accumulated data errors and keeps services running normally.

Fig. 2 is a flowchart of a data error repair method provided in an embodiment of the invention. The method is executed by a terminal device, and the description takes a single-bit data error as an example. Referring to Fig. 2, the method includes:

201: When a data error occurs in the memory, obtain the physical address at which the data error occurred;

When a single-bit data error occurs in the memory, the terminal device obtains the physical address of the single-bit data error detected by the ECC.

202: Store the physical address at which the data error occurred in the memory, and write data back to that physical address;

Specifically, after obtaining the physical address at which the data error occurred, the terminal device stores the address in the memory and, at the same time, starts the Demand Scrubbing function, which writes the correct data back to the physical address of the single-bit data error in order to repair it.

When the single-bit data error is caused by a soft failure, Demand Scrubbing can, ideally, write the correct data back. When it is caused by a hard failure, Demand Scrubbing cannot write the correct data back. The terminal device therefore needs to determine the failure type of the memory in which the single-bit data error occurred: if it is a soft failure, the error has already been repaired; if it is a hard failure, the error has not been repaired and further handling is required.
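Steps 201 and 202 can be sketched against a toy memory model. Everything here is an illustrative assumption: `SimulatedMemory` models a hard fault as bits stuck at 1, and `demand_scrub` stands in for the platform's Demand Scrubbing function, whose real interface the patent does not describe.

```python
class SimulatedMemory:
    """Toy memory. `stuck` maps an address to a mask of bits forced to 1 (a hard fault)."""

    def __init__(self, stuck=None):
        self.cells = {}
        self.stuck = dict(stuck or {})

    def write(self, addr, value):
        # Stuck bits override whatever value is written.
        self.cells[addr] = value | self.stuck.get(addr, 0)

    def read(self, addr):
        return self.cells[addr]

    def flip_bit(self, addr, bit):
        """Inject a transient single-bit upset (a soft fault)."""
        self.cells[addr] ^= 1 << bit


def demand_scrub(mem, addr, corrected_value, error_log):
    """Log the failing address, write the corrected data back, report whether it held."""
    error_log.append(addr)
    mem.write(addr, corrected_value)
    return mem.read(addr) == corrected_value
```

For a flipped bit the write-back sticks and the function returns True; for an address with stuck bits the read-back still differs, which is why the address is logged again the next time the error is seen.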

203: Determine whether the preset counter in memory has overflowed, the preset counter being used to count data errors occurring in the memory. If yes, perform step 204; if no, continue to perform step 203;

The preset counter overflows when its value is greater than the overflow threshold.

Note that when the preset counter has not overflowed, the subsequent steps are not performed, and the terminal device continues to check whether the preset counter has overflowed.

Steps 201-202 are optional. Storing the physical address at which a data error occurred in the memory can serve as the trigger condition for step 203: whenever such an address is stored, step 203 is executed. In another embodiment provided by the invention, step 203 may instead be executed at preset intervals, without using the occurrence of a single-bit error and the storing of its physical address as the trigger condition.

204: Determine whether the physical addresses of data errors stored in the memory include identical physical addresses. If yes, perform step 205; if no, perform step 206;

If the preset counter has overflowed, the terminal device reads each physical address of a data error stored in the memory and checks whether any of these addresses are identical.

If two or more identical physical addresses are found, identical physical addresses are determined to exist in the memory; otherwise, no identical physical addresses are determined to exist in the memory.

205: If identical physical addresses exist in the memory, determine that the failure type of the single-bit data error corresponding to the identical physical address is a hard failure, and perform step 211;

As step 202 shows, once the physical address of a single-bit data error is obtained, it is stored in the memory and, at the same time, the Demand Scrubbing function is started to write the correct data to that physical address. Therefore, if the failure type of the single-bit data error is a soft failure, the correct data is written to the physical address; when the address is checked again, its data is correct, so the address is not written to the memory again. In other words, the physical address of a soft failure is stored in the memory only once. If the failure type is a hard failure, write-back cannot place the correct data at the physical address, so the erroneous data at that address is not repaired; when the address is checked again, it is written to the memory again. Consequently, when the failure type of a single-bit data error is a hard failure, its physical address may be stored in the memory multiple times.

Because the physical address of a single-bit data error caused by a hard failure may be stored in the memory multiple times, if two or more identical physical addresses exist in the memory, the failure type of the single-bit data error corresponding to that address is determined to be a hard failure.
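The duplicate-address test in steps 204 and 205 amounts to counting how often each address appears in the stored log. The following is a minimal sketch; the function name and the list-based log are illustrative assumptions.

```python
from collections import Counter
from typing import Iterable, List


def repeated_addresses(error_log: Iterable[int]) -> List[int]:
    """Return addresses logged two or more times, in order of first appearance."""
    counts = Counter(error_log)
    return [addr for addr, n in counts.items() if n >= 2]
```

A non-empty result corresponds to the "yes" branch of step 204 (those addresses are treated as hard failures); an empty result corresponds to the "no" branch, where a patrol is needed to tell the two failure types apart.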

206: If no identical physical addresses exist in the memory, convert the physical addresses in the memory at which data errors occurred into patrol addresses;

When the terminal device starts the patrol scrubbing (PatrolScrubbing) function, it patrols the memory data at each converted patrol address. The terminal device converts the physical addresses in the memory at which data errors occurred into patrol addresses so that the patrol can be directed at those locations.

Specifically, converting a physical address in the memory at which a data error occurred into a patrol address includes the following. The terminal device judges whether the physical address is a memory address. If it is, the device reads the DRAM_RULE register to determine the socket on which the memory resides, queries the TAD0-TAD11 registers to determine the ChannelID, and uses the RIRWAYNESS register and ririlvXoffset to determine the faulty DIMM, the RankID and the address within the rank. The patrol address is then obtained from the socketID, ChannelID, DIMM, RankID and rank address. The process by which a terminal device obtains a patrol address from a physical address is well known to those skilled in the art and is not described further in this embodiment of the present invention.
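The shape of this conversion can be sketched as a bit-field decode. Everything about the layout below is an assumption: real hardware derives the socket, channel, DIMM and rank from the contents of the DRAM_RULE, TAD0-TAD11 and RIR registers, whereas this sketch uses fixed, hypothetical field widths purely to show the output of the conversion.

```python
# Hypothetical field layout, listed from the least significant bit upward.
# The names and widths are illustrative assumptions, not register definitions.
ASSUMED_LAYOUT = [
    ("line_offset", 6),
    ("channel_id", 2),
    ("dimm", 1),
    ("rank_id", 2),
    ("rank_address", 20),
    ("socket_id", 1),
]


def decode_physical_address(physical_address, layout=ASSUMED_LAYOUT):
    """Split a physical address into named bit fields according to `layout`."""
    fields = {}
    shift = 0
    for name, width in layout:
        fields[name] = (physical_address >> shift) & ((1 << width) - 1)
        shift += width
    return fields


def to_patrol_address(physical_address):
    """Return (socket, channel, DIMM, rank, rank address) for a physical address."""
    f = decode_physical_address(physical_address)
    return (f["socket_id"], f["channel_id"], f["dimm"], f["rank_id"], f["rank_address"])


example = (1 << 31) | (0x12345 << 11) | (2 << 9) | (1 << 8) | (3 << 6)
print(to_patrol_address(example))
```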

If no identical physical addresses exist in the memory, the single-bit data error at each physical address stored in the memory may be either a soft failure or a hard failure.

207: Patrol the data in the memory corresponding to the patrol address;

Specifically, the terminal device stops the system's automatic patrol scrubbing (PatrolScrubbing), writes the converted patrol address into the SCRUBADDRESSLO and SCRUBADDRESSHI registers, enables patrol, and patrols the data in the memory at the converted patrol address as specified by those two registers.
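The register sequence above can be sketched as follows. This sketch assumes that each of the two address registers holds 32 bits and models the register file as a plain dictionary; the `PATROL_ENABLE` name is a hypothetical stand-in for whatever control bit stops and enables patrol on real hardware.

```python
def program_patrol(registers, patrol_address):
    """Stop patrol, load the target address into the LO/HI registers, re-enable patrol."""
    if not 0 <= patrol_address < (1 << 64):
        raise ValueError("patrol address must fit in 64 bits")
    registers["PATROL_ENABLE"] = 0                              # hypothetical control bit
    registers["SCRUBADDRESSLO"] = patrol_address & 0xFFFFFFFF   # assumed low 32 bits
    registers["SCRUBADDRESSHI"] = patrol_address >> 32          # assumed high 32 bits
    registers["PATROL_ENABLE"] = 1
    return registers


regs = program_patrol({}, 0x1234567000)
print({name: hex(value) for name, value in regs.items()})
```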

During the patrol, if the memory data at the patrol address contains a data error, the data at that address is written back; if the data at the patrol address is correct, no processing is performed on it.

Steps 206-207 are the process of performing a memory patrol when no identical physical addresses exist in the memory.

208: After the patrol ends, judge whether the single-bit data error in the memory has been repaired; if yes, perform step 209; if no, perform step 210;

After the terminal device finishes patrolling the data in the memory at the patrol address, it reads the flag bit in the ECC register that indicates whether a single-bit data error exists. If the flag bit shows that the data at the patrol address contains an error, the single-bit data error in the memory has not been repaired; if the flag bit shows that the data at the patrol address contains no error, the single-bit data error in the memory has been repaired.

209: If the single-bit data error has been repaired, determine that the failure type of the single-bit data error is a soft failure, and end;

If the flag bit in the ECC register shows that the single-bit data error in the memory has been repaired, the error was corrected while the terminal device ran the demand scrubbing (DemandScrubbing) function, and the failure type of the single-bit data error is determined to be a soft failure.

210: If the single-bit data error has not been repaired, determine that the failure type of the single-bit data error is a hard failure;

If the flag bit in the ECC register shows that the single-bit data error in the memory has not been repaired, the error was not corrected while the terminal device ran the demand scrubbing (DemandScrubbing) function, and the failure type of the single-bit data error is determined to be a hard failure.
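Steps 207-210 can be sketched as a patrol followed by a re-check. This is an illustrative model only: `SimulatedMemory`, its stuck-at-1 fault model, and the comparison against a known `corrected` value (standing in for the ECC flag bit) are assumptions, not part of the described hardware.

```python
class SimulatedMemory:
    """Toy memory; `stuck_ones` maps an address to bits permanently stuck at 1 (assumed model)."""

    def __init__(self):
        self.words = {}
        self.stuck_ones = {}

    def read(self, address):
        return self.words[address] | self.stuck_ones.get(address, 0)

    def write(self, address, value):
        self.words[address] = value


def patrol_and_classify(memory, patrol_address, corrected):
    """Patrol one address, then classify the failure from the post-patrol error state."""
    if memory.read(patrol_address) != corrected:
        memory.write(patrol_address, corrected)               # write back during the patrol
    error_flag = memory.read(patrol_address) != corrected     # stands in for the ECC flag bit
    return "hard" if error_flag else "soft"


mem = SimulatedMemory()
mem.words[0x40] = 0b1011            # transient flip of bit 0; correct value is 0b1010
mem.words[0x80] = 0b1010
mem.stuck_ones[0x80] = 0b0100       # bit 2 stuck at 1

print(patrol_and_classify(mem, 0x40, 0b1010))
print(patrol_and_classify(mem, 0x80, 0b1010))
```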

211: When the failure type of the memory is determined to be a hard failure, obtain the physical address of the single-bit data error whose failure type is a hard failure;

When the failure type of the memory is determined to be a hard failure, the terminal device may obtain the physical address of the hard-failure single-bit data error in either of the following ways:

(1) When the terminal device has determined the hard failure by detecting identical physical addresses in the memory, it directly reads that repeated physical address;

(2) When the terminal device has patrolled the data at the patrol address and determined from the flag bit in the ECC register that the failure type of the memory is a hard failure, it obtains the physical address of the data affected by the hard failure from the mcelog file of the operating system (OS).

212: Trigger an alarm to prompt the user to replace the memory corresponding to the physical address of the single-bit data error whose failure type is a hard failure.

Preferably, after obtaining the physical address of the single-bit data error whose failure type is a hard failure, the terminal device displays that physical address on the display screen and triggers an alarm, so that the user, once informed, replaces the memory corresponding to that address. This avoids hard-failure single-bit data errors accumulating until the system hangs, and prevents memory problems from erupting in large numbers when boards are reset or upgraded in bulk.

Steps 204-212 are the process of, if the preset counter overflows, determining the failure type of the memory according to the physical addresses stored in the memory at which data errors occurred, so that the corresponding repair can be carried out subsequently.
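The overall decision flow of these steps can be sketched in a few lines. The function and parameter names are hypothetical: `patrol_repairs` is an assumed callback that patrols one address and reports whether the error was repaired, and `alarm` is an assumed callback that notifies the user.

```python
from collections import Counter


def handle_counter_overflow(logged_addresses, patrol_repairs, alarm):
    """Classify logged error addresses after the preset counter has overflowed."""
    counts = Counter(logged_addresses)
    repeated = [addr for addr, n in counts.items() if n >= 2]
    if repeated:
        # Identical addresses exist: those addresses are hard failures.
        result = {addr: "hard" for addr in repeated}
    else:
        # No identical addresses: patrol each one and check whether it was repaired.
        result = {addr: ("soft" if patrol_repairs(addr) else "hard") for addr in counts}
    for addr, kind in result.items():
        if kind == "hard":
            alarm(addr)   # prompt the user to replace the affected memory
    return result


alarms = []
print(handle_counter_overflow([0x10, 0x20, 0x10], lambda addr: True, alarms.append))
print([hex(a) for a in alarms])
```

When repeated addresses are present, this sketch leaves the non-repeated addresses unclassified, because the text describes the patrol only for the case where no identical addresses exist.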

It should be noted that this embodiment of the present invention may also be executed by the memory controller in the terminal device.

The data error repair method provided by this embodiment of the present invention judges whether a preset counter in memory has overflowed, the preset counter being used to count data errors occurring in the memory; if the preset counter has overflowed, the method determines the failure type of the memory according to the physical addresses stored in the memory at which data errors occurred, so that the corresponding repair can be carried out subsequently. With the technical solution provided by this embodiment, the failure types of the memory can be effectively distinguished and repaired according to type, which avoids situations such as system hangs or boot failures caused by accumulated data errors and keeps services running normally.

Fig. 3 is a schematic structural diagram of a data error repair apparatus provided in an embodiment of the present invention. Referring to Fig. 3, the apparatus includes:

a judging module 301, configured to judge whether a preset counter in memory has overflowed, the preset counter being used to count data errors occurring in the memory;

a determining module 302, configured to, if the preset counter has overflowed, determine the failure type of the memory according to the physical addresses stored in the memory at which data errors occurred, so that the corresponding repair can be carried out subsequently.

The determining module 302 is configured to, if identical physical addresses exist among the physical addresses stored in the memory at which data errors occurred, determine that the failure type of the data error at the identical physical address is a hard failure.

The determining module 302 includes:

a patrol unit, configured to perform a memory patrol if no identical physical addresses exist among the physical addresses stored in the memory at which data errors occurred;

a judging unit, configured to judge, after the patrol ends, whether the data error in the memory has been repaired;

a determining unit, configured to determine that the failure type of the data error is a hard failure if the data error has not been repaired;

the determining unit being further configured to determine that the failure type of the data error is a soft failure if the data error has been repaired.

The patrol unit includes:

a patrol address conversion subunit, configured to convert the physical address in the preset counter into a patrol address if no identical physical addresses exist among the physical addresses stored in the memory at which data errors occurred;

a patrol subunit, configured to patrol the data in the memory corresponding to the patrol address.

The apparatus further includes:

a hard-failure physical address obtaining module 303, configured to obtain, when the failure type of the memory is determined to be a hard failure, the physical address of the data error whose failure type is a hard failure;

a trigger module 304, configured to trigger an alarm to prompt the user to replace the memory corresponding to the physical address of the data error whose failure type is a hard failure.

The apparatus further includes:

a data error physical address obtaining module 305, configured to obtain, when a data error occurs in the memory, the physical address at which the data error occurred;

a storage module 306, configured to store the physical address at which the data error occurred in the memory;

a write-back module 307, configured to write data back to the physical address at which the data error occurred.
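How the recording-side modules cooperate can be sketched as a single class. This is only an illustration under stated assumptions: the class name, the `counter_limit` parameter, and treating "overflow" as the count exceeding that limit are all hypothetical, and the module numbers in the comments merely indicate which responsibility each line mirrors.

```python
class ErrorRecorder:
    """Toy model of the address obtaining, storage, write-back and judging modules."""

    def __init__(self, counter_limit):
        self.counter_limit = counter_limit   # assumed capacity of the preset counter
        self.error_count = 0
        self.logged_addresses = []

    def on_data_error(self, physical_address, write_back):
        """Record one data error at `physical_address` and scrub it."""
        self.logged_addresses.append(physical_address)   # storage module 306
        write_back(physical_address)                      # write-back module 307
        self.error_count += 1                             # preset counter counts the error

    def counter_overflowed(self):
        """Judging module 301: has the preset counter overflowed?"""
        return self.error_count > self.counter_limit


recorder = ErrorRecorder(counter_limit=2)
scrubbed = []
for address in (0x1000, 0x2000, 0x1000):
    recorder.on_data_error(address, scrubbed.append)
print(recorder.counter_overflowed(), [hex(a) for a in recorder.logged_addresses])
```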

The data error repair apparatus provided by this embodiment of the present invention judges whether a preset counter in memory has overflowed, the preset counter being used to count data errors occurring in the memory; if the preset counter has overflowed, the apparatus determines the failure type of the memory according to the physical addresses stored in the memory at which data errors occurred, so that the corresponding repair can be carried out subsequently. With the technical solution provided by this embodiment, the failure types of the memory can be effectively distinguished and repaired according to type, which avoids situations such as system hangs or boot failures caused by accumulated data errors and keeps services running normally.

It should be noted that the division into functional modules described above is only an example of how the data error repair apparatus of the above embodiment repairs data errors. In practical applications, the functions may be allocated to different functional modules as required; that is, the internal structure of the apparatus may be divided into different functional modules to perform all or part of the functions described above. In addition, the data error repair apparatus provided by the above embodiment and the data error repair method embodiment are based on the same concept; for the specific implementation process, see the method embodiment, which is not repeated here.

Fig. 4 is a schematic structural diagram of a data error repair device provided in an embodiment of the present invention. Referring to Fig. 4, the data error repair device includes a processor 401 and a memory 402.

The processor 401 is configured to judge whether a preset counter in the memory has overflowed, the preset counter being used to count data errors occurring in the memory 402.

The processor 401 is configured to, if the preset counter has overflowed, determine the failure type of the memory 402 according to the physical addresses stored in the memory at which data errors occurred, so that the corresponding repair can be carried out subsequently.

The memory 402 is configured to store data and the physical addresses at which data errors occurred.

The processor 401 is configured to, if identical physical addresses exist among the physical addresses stored in the memory 402 at which data errors occurred, determine that the failure type of the data error at the identical physical address is a hard failure.

The processor 401 is configured to perform a memory patrol if no identical physical addresses exist among the physical addresses stored in the memory 402 at which data errors occurred.

The processor 401 is configured to judge, after the patrol ends, whether the data error in the memory 402 has been repaired.

The processor 401 is configured to determine that the failure type of the data error is a hard failure if the data error has not been repaired.

The processor 401 is configured to determine that the failure type of the data error is a soft failure if the data error has been repaired.

The processor 401 is configured to convert the physical address in the preset counter into a patrol address if no identical physical addresses exist among the physical addresses stored in the memory 402 at which data errors occurred.

The processor 401 is configured to patrol the data in the memory 402 corresponding to the patrol address.

The processor 401 is configured to obtain, when the failure type of the memory 402 is determined to be a hard failure, the physical address of the data error whose failure type is a hard failure.

The processor 401 is configured to trigger an alarm to prompt the user to replace the memory 402 corresponding to the physical address of the data error whose failure type is a hard failure.

The processor 401 is configured to obtain, when a data error occurs in the memory 402, the physical address at which the data error occurred.

The processor 401 is configured to store the physical address at which the data error occurred in the memory 402, and to write data back to that physical address.

The data error repair device provided by this embodiment of the present invention judges whether a preset counter in memory has overflowed, the preset counter being used to count data errors occurring in the memory; if the preset counter has overflowed, the device determines the failure type of the memory according to the physical addresses stored in the memory at which data errors occurred, so that the corresponding repair can be carried out subsequently. With the technical solution provided by this embodiment, the failure types of the memory can be effectively distinguished and repaired according to type, which avoids situations such as system hangs or boot failures caused by accumulated data errors and keeps services running normally.

Those of ordinary skill in the art will understand that all or some of the steps of the above embodiments may be implemented by hardware, or by a program instructing the relevant hardware. The program may be stored in a computer-readable storage medium, and the storage medium may be a read-only memory, a magnetic disk, an optical disc, or the like.

The above are only preferred embodiments of the present invention and are not intended to limit it. Any modification, equivalent replacement or improvement made within the spirit and principles of the present invention shall fall within the scope of protection of the present invention.

Claims (9)

1. A data error repair method, characterized in that the method comprises:
judging whether a preset counter in memory has overflowed, the preset counter being used to count data errors occurring in the memory;
if the preset counter has overflowed, determining the failure type of the memory according to the physical addresses stored in the memory at which data errors occurred, so that the corresponding repair can be carried out subsequently;
wherein, if the preset counter has overflowed, determining the failure type of the memory according to the physical addresses stored in the memory at which data errors occurred, so that the corresponding repair can be carried out subsequently, comprises:
if no identical physical addresses exist among the physical addresses stored in the memory at which data errors occurred, converting the physical address in the preset counter into a patrol address;
patrolling the data in the memory corresponding to the patrol address;
after the patrol ends, judging whether the data error in the memory has been repaired;
if the data error has not been repaired, determining that the failure type of the data error is a hard failure;
if the data error has been repaired, determining that the failure type of the data error is a soft failure.

2. The method according to claim 1, characterized in that, if the preset counter has overflowed, determining the failure type of the memory according to the physical addresses stored in the memory at which data errors occurred, so that the corresponding repair can be carried out subsequently, comprises:
if identical physical addresses exist among the physical addresses stored in the memory at which data errors occurred, determining that the failure type of the data error at the identical physical address is a hard failure.

3. The method according to claim 1, characterized in that, after determining, if the preset counter has overflowed, the failure type of the memory according to the physical addresses stored in the memory at which data errors occurred, so that the corresponding repair can be carried out subsequently, the method further comprises:
when the failure type of the memory is determined to be a hard failure, obtaining the physical address of the data error whose failure type is a hard failure;
triggering an alarm to prompt the user to replace the memory corresponding to the physical address of the data error whose failure type is a hard failure.

4. The method according to claim 1, characterized in that, before judging whether the preset counter in memory has overflowed, the method further comprises:
when a data error occurs in the memory, obtaining the physical address at which the data error occurred;
storing the physical address at which the data error occurred in the memory, and writing data back to that physical address.

5. A data error repair apparatus, characterized in that the apparatus comprises:
a judging module, configured to judge whether a preset counter in memory has overflowed, the preset counter being used to count data errors occurring in the memory;
a determining module, configured to, if the preset counter has overflowed, determine the failure type of the memory according to the physical addresses stored in the memory at which data errors occurred, so that the corresponding repair can be carried out subsequently;
the determining module comprising:
a patrol unit, configured to perform a memory patrol if no identical physical addresses exist among the physical addresses stored in the memory at which data errors occurred;
a judging unit, configured to judge, after the patrol ends, whether the data error in the memory has been repaired;
a determining unit, configured to determine that the failure type of the data error is a hard failure if the data error has not been repaired;
the determining unit being further configured to determine that the failure type of the data error is a soft failure if the data error has been repaired;
the patrol unit comprising:
a patrol address conversion subunit, configured to convert the physical address in the preset counter into a patrol address if no identical physical addresses exist among the physical addresses stored in the memory at which data errors occurred;
a patrol subunit, configured to patrol the data in the memory corresponding to the patrol address.

6. The apparatus according to claim 5, characterized in that the determining module is configured to, if identical physical addresses exist among the physical addresses stored in the memory at which data errors occurred, determine that the failure type of the data error at the identical physical address is a hard failure.

7. The apparatus according to claim 5, characterized in that the apparatus further comprises:
a hard-failure physical address obtaining module, configured to obtain, when the failure type of the memory is determined to be a hard failure, the physical address of the data error whose failure type is a hard failure;
a trigger module, configured to trigger an alarm to prompt the user to replace the memory corresponding to the physical address of the data error whose failure type is a hard failure.

8. The apparatus according to claim 5, characterized in that the apparatus further comprises:
a data error physical address obtaining module, configured to obtain, when a data error occurs in the memory, the physical address at which the data error occurred;
a storage module, configured to store the physical address at which the data error occurred in the memory;
a write-back module, configured to write data back to the physical address at which the data error occurred.

9. A data error repair device, characterized in that the device comprises:
a memory, configured to store data and the physical addresses at which data errors occurred;
a processor, configured to judge whether a preset counter in the memory has overflowed, the preset counter being used to count data errors occurring in the memory;
the processor being further configured to, if the preset counter has overflowed, determine the failure type of the memory according to the physical addresses stored in the memory at which data errors occurred, so that the corresponding repair can be carried out subsequently;
the processor being further configured to: if no identical physical addresses exist among the physical addresses stored in the memory at which data errors occurred, convert the physical address in the preset counter into a patrol address; patrol the data in the memory corresponding to the patrol address; after the patrol ends, judge whether the data error in the memory has been repaired; if the data error has not been repaired, determine that the failure type of the data error is a hard failure; and if the data error has been repaired, determine that the failure type of the data error is a soft failure.
CN201310105316.2A 2013-03-28 2013-03-28 Error in data restorative procedure, device and equipment Active CN103218275B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201310105316.2A CN103218275B (en) 2013-03-28 2013-03-28 Error in data restorative procedure, device and equipment

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN201310105316.2A CN103218275B (en) 2013-03-28 2013-03-28 Error in data restorative procedure, device and equipment

Publications (2)

Publication Number Publication Date
CN103218275A CN103218275A (en) 2013-07-24
CN103218275B true CN103218275B (en) 2015-11-25

Family

ID=48816095

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201310105316.2A Active CN103218275B (en) 2013-03-28 2013-03-28 Error in data restorative procedure, device and equipment

Country Status (1)

Country Link
CN (1) CN103218275B (en)

Families Citing this family (9)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN104750591A (en) * 2013-12-30 2015-07-01 上海威亿实业有限公司 Evidence-taking device and method for computer
CN104077375B (en) 2014-06-24 2017-09-12 华为技术有限公司 The processing method and node of a kind of wrong catalogue of CC NUMA systems interior joint
US9817714B2 (en) * 2015-08-28 2017-11-14 Intel Corporation Memory device on-die error checking and correcting code
CN106569734B (en) * 2015-10-12 2019-04-09 北京国双科技有限公司 The restorative procedure and device that memory overflows when data are shuffled
CN105426288A (en) * 2015-11-10 2016-03-23 浪潮电子信息产业股份有限公司 Optimization method of memory alarm
CN106445720A (en) * 2016-10-11 2017-02-22 郑州云海信息技术有限公司 Memory error recovery method and device
WO2019061517A1 (en) * 2017-09-30 2019-04-04 华为技术有限公司 Memory fault detection method and device, and server
CN112948160B (en) * 2021-02-26 2023-02-28 山东英信计算机技术有限公司 Method and device for locating and repairing memory ECC problems
CN118260119B (en) * 2024-05-29 2024-08-09 江原芯科技(上海)有限公司 Memory fault processing method and device, electronic equipment, medium and chip

Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102968353A (en) * 2012-10-26 2013-03-13 华为技术有限公司 Fail address processing method and fail address processing device

Family Cites Families (2)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US6715116B2 (en) * 2000-01-26 2004-03-30 Hewlett-Packard Company, L.P. Memory data verify operation
US7987407B2 (en) * 2009-08-20 2011-07-26 Arm Limited Handling of hard errors in a cache of a data processing apparatus

Patent Citations (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN102968353A (en) * 2012-10-26 2013-03-13 华为技术有限公司 Fail address processing method and fail address processing device

Also Published As

Publication number Publication date
CN103218275A (en) 2013-07-24

Similar Documents

Publication Publication Date Title
CN103218275B (en) Error in data restorative procedure, device and equipment
CN104346265B (en) Terminal device, and method and device for acquiring its log information
CN106910528B (en) Optimization method and device for data inspection of solid state disk
CN102760090B (en) Debugging method and computer system
CN102904685B (en) Method and device for processing hardware table entry check errors
US20150067437A1 (en) Apparatus, method and system for reporting dynamic random access memory error information
CN110727597B (en) A method for troubleshooting invalid code completion use cases based on logs
CN102467440A (en) Internal memory error detection system and method
US20120246384A1 (en) Flash memory and flash memory accessing method
CN103605591A (en) Method and device for controlling memory initialization of terminal system
CN106547653B (en) Computer system fault state detection method, device and system
CN116521429B (en) Asset information reporting method and device, storage medium and electronic equipment
WO2016127600A1 (en) Exception handling method and apparatus
CN117667467A (en) A method for dealing with memory failures and related equipment
TW201543496A (en) Data managing method, memory control circuit unit and memory storage apparatus
CN104102563A (en) Method and device for finding MCA (machine check architecture) errors of server system
CN102053874B (en) Ways to protect backup data
EP3125251A1 (en) Hamming code-based data access method and integrated random access memory
CN106030544B (en) Method for detecting memory of computer equipment and computer equipment
CN110008105B (en) BMC time retention method and device, electronic device and storage medium
CN114327258B (en) Solid state disk processing method, system, device and computer storage medium
JP4940757B2 (en) Mobile radio communication terminal, information collecting method and program when abnormality occurs
US8151176B2 (en) CPU instruction RAM parity error procedure
CN111506460B (en) Memory fault processing method and device, mobile terminal and storage medium
CN115827376A (en) Server memory detection method and device, electronic equipment and storage medium

Legal Events

Date Code Title Description
C06 Publication
PB01 Publication
C10 Entry into substantive examination
SE01 Entry into force of request for substantive examination
C14 Grant of patent or utility model
GR01 Patent grant
TR01 Transfer of patent right
TR01 Transfer of patent right

Effective date of registration: 20220118

Address after: 450046 Floor 9, building 1, Zhengshang Boya Plaza, Longzihu wisdom Island, Zhengdong New Area, Zhengzhou City, Henan Province

Patentee after: xFusion Digital Technologies Co., Ltd.

Address before: 518129 Huawei Headquarters Office Building, Bantian, Longgang District, Shenzhen, Guangdong

Patentee before: HUAWEI TECHNOLOGIES Co.,Ltd.