
CN118503004A - Memory fault processing method, electronic device, storage medium and program product - Google Patents

Publication number
CN118503004A
Authority
CN
China
Prior art keywords
memory
fault
error
basic input
error information
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Pending
Application number
CN202410961857.3A
Other languages
Chinese (zh)
Inventor
贾帅帅
李道童
艾山彬
陈衍东
李盛新
韩红瑞
孙秀强
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Suzhou Metabrain Intelligent Technology Co Ltd
Original Assignee
Suzhou Metabrain Intelligent Technology Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Suzhou Metabrain Intelligent Technology Co Ltd
Priority to CN202410961857.3A
Publication of CN118503004A

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0706 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
    • G06F11/073 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a memory management context, e.g. virtual memory or cache management
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/0793 Remedial or corrective actions

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Quality & Reliability (AREA)
  • Physics & Mathematics (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • Biomedical Technology (AREA)
  • Techniques For Improving Reliability Of Storages (AREA)

Abstract

The embodiments of the invention provide a memory fault processing method, an electronic device, a storage medium and a program product, relating to the technical field of computer systems and memory. The electronic device comprises a baseboard management controller, a basic input/output system, an operating system and a memory connected based on the same bus. The baseboard management controller acquires current memory error information when the memory controller of the memory performs a patrol inspection; the baseboard management controller determines the error type according to the current memory error information; the operating system generates a fault processing instruction according to the error type; and the basic input/output system executes the fault processing instruction to complete the memory fault repair. The embodiments detect the error data of each memory without affecting system performance, so that memory errors can be handled in time and the probability of server downtime is reduced.

Description

Memory fault processing method, electronic device, storage medium and program product
Technical Field
The present invention relates to the field of computer systems and memory technologies, and in particular, to a memory failure processing method, an electronic device, a storage medium, and a program product.
Background
Memory errors can occur while an electronic device such as a server is running; if a memory error is not handled in time, the device may go down. Techniques for repairing memory faults therefore exist, but current schemes repair only after errors have accumulated, so faults are not discovered in time and the downtime rate remains high.
Disclosure of Invention
In view of the above, embodiments of the present invention have been made to provide a memory failure processing method, an electronic device, a storage medium, and a program product that overcome or at least partially solve the above problems.
In order to solve the above-mentioned problems, in a first aspect of the present invention, an embodiment of the present invention discloses a memory failure processing method, which is applied to an electronic device, where the electronic device includes a baseboard management controller, a basic input/output system, an operating system and a memory, which are connected based on the same bus, and the method includes:
The baseboard management controller acquires current memory error information when the memory controller of the memory performs a patrol inspection;
The baseboard management controller determines the error type according to the current memory error information;
The operating system generates a fault processing instruction according to the error type;
and the basic input and output system executes the fault processing instruction to complete memory fault repair.
Optionally, the step of determining, by the baseboard management controller, an error type according to the current memory error information includes:
when the current memory error information contains a preset storage unit fault identifier, the baseboard management controller determines that the error type is a persistent storage unit error;
when the current memory error information does not contain the storage unit fault identifier, the baseboard management controller acquires historical memory error information;
the baseboard management controller compares the current memory error information with the historical memory error information;
when the absolute value of the fault unit row difference between the current memory error information and the historical memory error information is greater than a preset burst length, the baseboard management controller determines that the error type is a memory row error;
when the fault data input/output bus pin identifiers of the current memory error information and the historical memory error information are the same, and the faulty memory chips of the current memory error information and the historical memory error information are different, the baseboard management controller determines that the error type is a memory bus error;
and when the sum of the number of block faults of the current memory error information and the number of block faults of the historical memory error information is greater than a preset threshold, the baseboard management controller determines that the error type is a memory block fault.
Optionally, the step of generating the fault handling instruction by the operating system according to the error type includes:
When the error type is the error of the persistent storage unit, the operating system analyzes the error information and determines a first fault memory page;
and the operating system generates a fault processing instruction for isolating the first fault memory page.
Optionally, the step of executing the fault handling instruction by the basic input/output system includes:
the basic input/output system, in response to the fault processing instruction for isolating the first fault memory page, writes the address of the first fault memory page into a preset hardware error source table and sets a Common Platform Error Record identifier.
Optionally, the step of executing the fault handling instruction by the basic input/output system further includes:
the basic input/output system performs memory page offline processing on the first fault memory page.
Optionally, the step of generating the fault handling instruction by the operating system according to the error type includes:
when the error type is the memory row error, the operating system analyzes the current memory error information and determines a fault memory row;
the operating system determines a second fault memory page associated with the fault memory row;
and the operating system generates a fault processing instruction for isolating the second fault memory page.
Optionally, the step of executing the fault handling instruction by the basic input/output system includes:
the basic input/output system, in response to the fault processing instruction for isolating the second fault memory page, writes the address of the second fault memory page into a preset hardware error source table and sets a Common Platform Error Record identifier.
Optionally, the step of executing the fault handling instruction by the basic input/output system further includes:
the basic input/output system performs memory page offline processing on the second fault memory page.
Optionally, the step of generating the fault handling instruction by the operating system according to the error type includes:
when the error type is the memory bus error, the operating system analyzes the current memory error information and determines a fault memory bus;
the operating system generates a fault handling instruction to replace the faulty memory bus.
Optionally, the step of executing the fault handling instruction by the basic input/output system includes:
the basic input/output system, in response to the fault processing instruction for replacing the fault memory bus, determines an error correction pin associated with the fault memory bus based on a preset bus switching rule;
the basic input/output system switches a fault pin in the fault memory bus to the error correction pin.
Optionally, the step of executing the fault handling instruction by the basic input/output system further includes:
the basic input/output system reduces the error correction information length of the memory.
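The bus repair described above can be pictured with a small model. This is only a sketch under stated assumptions: the switching rule is modeled as a dictionary from data pins to error correction pins, and the error correction information length is assumed to shrink by one bit for each pin that is repurposed. Neither detail is specified in the text, and the names `SWITCH_RULE` and `replace_faulty_pin` are hypothetical.

```python
# Hypothetical preset bus switching rule: faulty data pin -> error correction pin.
SWITCH_RULE = {0: 64, 1: 65, 2: 66, 3: 67}


def replace_faulty_pin(routing, faulty_pin, ecc_bits, rule=SWITCH_RULE):
    """Route the faulty data pin through its associated error correction pin.

    routing  : dict mapping each logical data pin to the physical pin in use
    ecc_bits : current error correction information length, in bits
    Returns the new error correction length (assumed here: one bit fewer).
    """
    if faulty_pin not in rule:
        raise KeyError(f"no error correction pin configured for pin {faulty_pin}")
    if ecc_bits <= 0:
        raise ValueError("no error correction bits left to give up")
    routing[faulty_pin] = rule[faulty_pin]  # switch the fault pin to the ECC pin
    return ecc_bits - 1                     # that pin no longer carries check bits


routing = {pin: pin for pin in range(4)}    # initially every pin maps to itself
new_length = replace_faulty_pin(routing, faulty_pin=2, ecc_bits=8)
print(routing[2], new_length)
```

The sketch makes the trade-off visible: the data path is restored, but at the cost of weaker error correction afterwards.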
Optionally, the step of generating the fault handling instruction by the operating system according to the error type includes:
when the error type is a memory block fault, the operating system analyzes the current memory error information and determines a fault memory bank to which the fault memory block belongs;
the operating system generates a fault handling instruction for locking the fault memory bank.
Optionally, the step of executing the fault handling instruction by the basic input/output system includes:
the basic input/output system, in response to the fault processing instruction for locking the fault memory bank, triggers an adaptive dual-device data correction mechanism to lock the fault memory bank.
Optionally, the method further comprises:
when the memory controller of the memory completes a single patrol inspection, the basic input/output system reads the fault row of the memory.
Optionally, the step of the basic input/output system reading the fault row of the memory when the memory controller of the memory completes a single patrol inspection includes:
when the memory controller of the memory completes a single patrol inspection, the basic input/output system receives an interrupt sent by the memory and reads the fault row of the memory while handling the interrupt.
Optionally, the method further comprises:
the basic input/output system performs fault processing on the fault row.
Optionally, the step of performing fault processing on the fault row by the basic input/output system includes:
the basic input/output system judges whether the fault row has undergone memory page offline processing;
when the fault row has undergone memory page offline processing, the basic input/output system replaces the fault row with a spare memory row;
when the fault row has not undergone memory page offline processing, the basic input/output system determines the storage unit data of the fault row, inverts the storage unit data, and rewrites the inverted storage unit data to the memory.
Optionally, the step of performing fault processing on the fault row by the basic input/output system further includes:
the basic input/output system determines the storage unit address associated with the fault row;
and the basic input/output system performs memory page isolation on the memory page corresponding to the storage unit address.
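The two branches of the fault row handling above can be sketched as follows. This is a minimal illustration, assuming memory is modeled as a dictionary from row number to a list of byte values and spare rows as a simple list; the function name `handle_fault_row`, the `remap` table and the `"no_spare"` result are hypothetical and are not taken from the text.

```python
def handle_fault_row(row, offlined_rows, spare_rows, memory, remap):
    """Handle one fault row reported after a patrol inspection.

    offlined_rows : set of rows whose memory pages were already taken offline
    spare_rows    : list of unused spare row numbers
    memory        : dict mapping row number -> list of byte values
    remap         : dict recording fault row -> spare row replacements
    """
    if row in offlined_rows:
        # The page is already offline: replace the fault row with a spare row.
        if not spare_rows:
            return "no_spare"            # assumed behaviour when spares run out
        remap[row] = spare_rows.pop()
        return "replaced"
    # The page is not offline: invert the stored data and write it back.
    memory[row] = [byte ^ 0xFF for byte in memory[row]]
    return "rewritten"


memory = {5: [0x0F, 0xF0]}
remap = {}
print(handle_fault_row(5, set(), [9], memory, remap), memory[5])
print(handle_fault_row(5, {5}, [9], memory, remap), remap)
```

In the first call the row has not been taken offline, so its data is inverted and rewritten; in the second call the row is treated as already offline and is mapped to the spare row.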
In a second aspect of the present invention, an embodiment of the present invention discloses an electronic device, including a baseboard management controller, a basic input output system, an operating system and a memory connected based on the same bus,
The baseboard management controller is used for acquiring current memory error information when the memory controller of the memory performs a patrol inspection;
the baseboard management controller is used for determining the error type according to the current memory error information;
the operating system is used for generating a fault processing instruction according to the error type;
The basic input/output system is used for executing the fault processing instruction to complete memory fault repair.
In a third aspect of the present invention, embodiments of the present invention disclose a computer readable storage medium having stored thereon a computer program which, when executed by a processor, implements the steps of a memory failure handling method as described above.
In a fourth aspect of the invention, embodiments of the invention disclose a computer program product comprising a computer program which, when executed by a processor, implements the steps of a memory failure handling method as described above.
The embodiment of the invention has the following advantages:
According to the embodiments of the invention, the baseboard management controller acquires current memory error information when the memory controller of the memory performs a patrol inspection; the baseboard management controller determines the error type according to the current memory error information; the operating system generates a fault processing instruction according to the error type; and the basic input/output system executes the fault processing instruction to complete the memory fault repair. Because the baseboard management controller acquires the current memory error information each time the memory controller completes a single patrol inspection, the error data of each memory can be detected and memory errors can be handled in time, which improves the timeliness of error detection. Because the operating system selects the repair strategy that corresponds to each error type when generating the fault processing instruction, memory errors are repaired more accurately and effectively, which reduces the likelihood of downtime.
Drawings
FIG. 1 is a flowchart illustrating steps of an embodiment of a memory failure processing method according to the present invention;
FIG. 2 is a flowchart illustrating steps of another embodiment of a memory fault handling method according to the present invention;
FIG. 3 is a schematic diagram of an electronic device architecture to which another embodiment of the memory failure processing method of the present invention is applied;
FIG. 4 is a flowchart illustrating exemplary steps for persistent storage unit error handling in accordance with the present invention;
FIG. 5 is a flowchart illustrating exemplary steps for processing a memory row error according to the present invention;
FIG. 6 is a flowchart illustrating exemplary steps for memory bus error handling in accordance with the present invention;
FIG. 7 is a flowchart illustrating exemplary memory block error handling steps according to the present invention;
FIG. 8 is a flowchart illustrating exemplary steps for fault row processing according to the present invention;
FIG. 9 is a flowchart illustrating exemplary steps of a memory failure handling method according to the present invention;
FIG. 10 is a block diagram of an electronic device according to an embodiment of the present invention;
FIG. 11 is a block diagram of a storage medium according to an embodiment of the present invention;
FIG. 12 is a block diagram of a computer program product according to an embodiment of the present invention.
Detailed Description
In order that the above-recited objects, features and advantages of the present invention will become more readily apparent, a more particular description of the invention will be rendered by reference to the appended drawings and appended detailed description.
Referring to FIG. 1, a flowchart of the steps of an embodiment of a memory fault processing method according to the present invention is shown. The electronic device includes a baseboard management controller, a basic input/output system, an operating system and a memory connected based on the same bus, and the memory fault processing method may specifically include the following steps:
Step 101, the baseboard management controller acquires current memory error information when the memory controller of the memory performs a patrol inspection;
When a memory error occurs, the memory controller collects and records information such as the location of the error and the error content during its patrol inspection, and generates the current memory error information. The current memory error information may be stored in a register associated with the memory controller.
The baseboard management controller obtains the current memory error information from the register associated with the memory controller.
Step 102, the baseboard management controller determines the error type according to the current memory error information;
The baseboard management controller determines the object with fault according to the error content recorded in the current memory error information, and further determines the error type.
Step 103, the operating system generates a fault processing instruction according to the error type;
The operating system receives the information of the error type sent by the baseboard management controller, determines a corresponding repairing strategy according to the error type and the object with the fault, and generates a fault processing instruction.
Step 104, the basic input/output system executes the fault processing instruction to complete the memory fault repair.
The basic input/output system executes the fault processing instruction and carries out the processing that corresponds to the repair strategy, repairing the fault. This reduces the chance that an electronic device such as a server goes on to suffer more serious memory errors, and so lowers the probability of downtime.
According to the embodiments of the invention, the baseboard management controller acquires current memory error information when the memory controller of the memory performs a patrol inspection; the baseboard management controller determines the error type according to the current memory error information; the operating system generates a fault processing instruction according to the error type; and the basic input/output system executes the fault processing instruction to complete the memory fault repair. Because the baseboard management controller acquires the current memory error information each time the memory controller completes a single patrol inspection, the error data of each memory can be detected and memory errors can be handled in time, which improves the timeliness of error detection. Because the operating system selects the repair strategy that corresponds to each error type when generating the fault processing instruction, memory errors are repaired more accurately and effectively, which reduces the likelihood of downtime.
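Steps 101 to 104 can be sketched as a small pipeline in which each function stands for one component. This is an illustrative model only: the `ErrorInfo` fields, the `fault_object` encoding, the `REPAIR_STRATEGY` table and every function name are hypothetical, and the real components communicate through hardware registers and firmware interfaces rather than Python calls.

```python
from dataclasses import dataclass


@dataclass
class ErrorInfo:
    location: int      # where the error occurred (hypothetical field)
    fault_object: str  # "cell", "row", "bus" or "bank" (hypothetical encoding)


# Hypothetical mapping from the faulty object to a repair strategy.
REPAIR_STRATEGY = {
    "cell": "isolate_page",
    "row": "isolate_page",
    "bus": "replace_bus",
    "bank": "lock_bank",
}


def bmc_acquire(registers):
    """Step 101: read the error record left in the memory controller registers."""
    return registers.get("error")


def bmc_classify(info):
    """Step 102: derive the error type from the recorded error content."""
    return info.fault_object


def os_generate_instruction(error_type, info):
    """Step 103: choose the repair strategy for this error type."""
    return (REPAIR_STRATEGY[error_type], info.location)


def bios_execute(instruction, repair_log):
    """Step 104: carry out the repair (here it is only recorded)."""
    repair_log.append(instruction)


def handle_patrol(registers, repair_log):
    """Run the four steps once, after a single patrol inspection."""
    info = bmc_acquire(registers)
    if info is None:
        return None  # nothing was recorded during this inspection
    instruction = os_generate_instruction(bmc_classify(info), info)
    bios_execute(instruction, repair_log)
    return instruction


log = []
print(handle_patrol({"error": ErrorInfo(0x1000, "bus")}, log))
```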
Referring to FIG. 2, a flowchart of the steps of another embodiment of a memory fault processing method according to the present invention is shown. Referring also to FIG. 3, the system to which the memory fault processing method is applied includes a memory 101, a processor 102, a BMC (baseboard management controller) 103, a BIOS (basic input/output system) 104 and an operating system 105. Data is read from the memory, and the memory controller in the processor checks the validity of the data; when the check fails, the memory controller updates its associated register and records the error information. The BMC monitors the register associated with the memory controller out of band and in real time.
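The interaction between the memory controller and the BMC described above can be modeled roughly as follows. This is a sketch under loud assumptions: a single parity bit stands in for the real data validity check, the register is a plain attribute, and the class and method names are hypothetical.

```python
def parity(word):
    """Return the parity (0 or 1) of the set bits in an integer word."""
    return bin(word).count("1") % 2


class MemoryController:
    def __init__(self):
        self.error_register = None  # stands in for the associated register

    def read(self, address, word, stored_parity):
        """Check the word on read; record the error if validation fails."""
        if parity(word) != stored_parity:
            self.error_register = {"address": address, "word": word}
        return word


class Bmc:
    def poll(self, controller):
        """Read and clear the error register without involving the host OS."""
        info = controller.error_register
        controller.error_register = None
        return info


controller = MemoryController()
bmc = Bmc()
controller.read(0x40, 0b1011, stored_parity=0)  # three set bits: check fails
print(bmc.poll(controller))
```

Because the BMC only reads a register that the memory controller already maintains, the monitoring adds no work to the host data path, which is consistent with the claim that detection does not affect system performance.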
The memory fault processing method specifically comprises the following steps:
Step 201, the baseboard management controller acquires current memory error information when the memory controller of the memory performs a patrol inspection;
The baseboard management controller monitors the register associated with the memory controller out of band and in real time, and acquires the current memory error information each time the memory controller completes a single patrol inspection.
Step 202, when the current memory error information has a preset storage unit fault identifier, the baseboard management controller determines that the error type is a persistent storage unit error;
The baseboard management controller can identify whether the preset storage unit fault identifier is present in the current memory error information. When it is present, the error that has occurred in the memory can be determined to be a persistent storage unit error.
The preset storage unit fault identifier is identification information that an electronic device such as a server generates and records when a persistent storage unit fault occurs.
Step 203, when the current memory error information does not have a storage unit fault identifier, the baseboard management controller acquires historical memory error information;
When the current memory error information does not contain the preset storage unit fault identifier, no persistent storage unit error has occurred, and the baseboard management controller can acquire the historical memory error information to further determine the error type. The historical memory error information is memory error information recorded before the current time, excluding the current memory error information. The embodiments of the invention do not limit how far back the historical memory error information extends.
Step 204, the baseboard management controller compares the current memory error information with the history error information;
the baseboard management controller compares each content of the current memory error information with the historical error information.
Step 205, when the absolute value of the fault unit row difference between the current memory error information and the historical error information is greater than a preset burst length, the baseboard management controller determines that the error type is a memory row error;
If the absolute value of the fault unit row difference between the current memory error information and the historical error information is greater than the preset burst length, the fault unit row in the current memory error information has moved too far from the fault unit row in the historical error information, and the baseboard management controller can determine that the error type is a memory row error.
The preset burst length may be determined according to the type of the memory, and the embodiment of the present invention is not limited to a specific preset burst length.
Step 206, when the fault data input/output bus pin identifiers of the current memory error information and the historical error information are the same, and the faulty memory chips of the current memory error information and the historical error information are different, the baseboard management controller determines that the error type is a memory bus error;
If the fault data input/output bus pin identifiers are the same while the faulty memory chips are different, the same data input/output bus pin keeps failing but each failure involves a different memory chip, so the baseboard management controller can determine that the error type is a memory bus error.
Step 207, when the sum of the number of block faults of the current memory error information and the number of block faults of the historical error information is greater than a preset threshold, the baseboard management controller determines that the error type is a memory block fault;
When the sum of the number of block faults of the current memory error information and the number of block faults of the historical error information is larger than a preset threshold value, the baseboard management controller can determine that the error type is the memory block fault so as to repair the memory block.
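The decision rules of steps 202 to 207 can be collected into one function. This is a minimal sketch, not the patented implementation: the record fields, the default `burst_length` and `block_threshold` values, and the order in which the three history-based rules are tested are all assumptions, since the text leaves the thresholds open and does not state a precedence among the rules.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ErrorType(Enum):
    PERSISTENT_CELL = "persistent storage unit error"
    ROW = "memory row error"
    BUS = "memory bus error"
    BLOCK = "memory block fault"


@dataclass
class ErrorRecord:
    cell_fault_flag: bool  # preset storage unit fault identifier present?
    row: int               # fault unit row
    dq_pin: int            # fault data input/output bus pin identifier
    chip: int              # faulty memory chip
    block_faults: int      # number of block faults in this record


def classify(current: ErrorRecord, history: Optional[ErrorRecord],
             burst_length: int = 8, block_threshold: int = 4) -> Optional[ErrorType]:
    """Return the error type, or None if no rule matches."""
    if current.cell_fault_flag:                                  # step 202
        return ErrorType.PERSISTENT_CELL
    if history is None:                                          # nothing to compare with
        return None
    if abs(current.row - history.row) > burst_length:            # step 205
        return ErrorType.ROW
    if current.dq_pin == history.dq_pin and current.chip != history.chip:  # step 206
        return ErrorType.BUS
    if current.block_faults + history.block_faults > block_threshold:      # step 207
        return ErrorType.BLOCK
    return None


previous = ErrorRecord(False, row=10, dq_pin=3, chip=1, block_faults=1)
latest = ErrorRecord(False, row=12, dq_pin=3, chip=2, block_faults=1)
print(classify(latest, previous))
```

In the usage example the row difference is 2, which is within the assumed burst length, so the row rule does not fire; the same pin failing on two different chips then selects the memory bus error.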
Step 208, the operating system generates a fault processing instruction according to the error type;
According to the error type, the operating system determines the faulty object in the current memory error information, selects the corresponding repair strategy, and generates a fault processing instruction, which it sends to the basic input/output system.
In an alternative embodiment of the present invention, the step of generating, by the operating system, a fault handling instruction according to the error type includes: when the error type is the error of the persistent storage unit, the operating system analyzes the error information and determines a first fault memory page; and the operating system generates a fault processing instruction for isolating the first fault memory page.
When the error type is a persistent storage unit error, the operating system may read the memory page that has failed, i.e., the first failed memory page, from the error information. And the operating system determines memory page isolation of the first fault memory page as a repairing strategy and generates a fault processing instruction for isolating the first fault memory page.
In an alternative embodiment of the present invention, the step of generating, by the operating system, a fault handling instruction according to the error type includes: when the error type is the memory row error, the operating system analyzes the current memory error information and determines a fault memory row; the operating system determines a second fault memory page associated with the fault memory row; and the operating system generates a fault processing instruction for isolating the second fault memory page.
When the error type is a memory row error, the operating system identifies the memory row that has failed according to the current memory error information, namely the fault memory row. It then determines the fault memory page associated with the fault memory row, namely the second fault memory page, takes memory page isolation of the second fault memory page as the repair strategy, and generates a fault processing instruction for isolating the second fault memory page.
In an alternative embodiment of the present invention, the step of generating, by the operating system, a fault handling instruction according to the error type includes: when the error type is the memory bus error, the operating system analyzes the current memory error information and determines a fault memory bus; the operating system generates a fault handling instruction to replace the faulty memory bus.
When the error type is a memory bus error, the operating system may identify the current failed memory bus from the current memory error information. And determining the repair strategy as replacing the fault memory bus, and generating a fault processing instruction for replacing the fault memory bus.
In an alternative embodiment of the present invention, the step of generating, by the operating system, a fault handling instruction according to the error type includes: when the error type is a memory block fault, the operating system analyzes the current memory error information and determines a fault memory bank to which the fault memory block belongs; the operating system generates a fault handling instruction for locking the fault memory bank.
When the error type is a memory block fault, the operating system identifies the fault memory bank to which the fault memory block recorded in the current memory error information belongs, takes lock-step processing of the fault memory bank as the repair strategy, and generates a fault processing instruction for locking the fault memory bank.
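The four alternative embodiments of step 208 amount to a dispatch from error type to repair strategy, which can be sketched as follows. The instruction format, the string error type names, the dictionary keys and the 4096-byte page size are assumptions made for illustration; in particular, mapping a fault row to its page by dividing a row address by the page size is a simplification of "the memory page associated with the fault memory row".

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page size in bytes


@dataclass(frozen=True)
class FaultInstruction:
    action: str  # "isolate_page", "replace_bus" or "lock_bank"
    target: int  # page number, pin identifier or bank number


def generate_instruction(error_type: str, info: dict) -> FaultInstruction:
    """Pick the repair strategy that corresponds to the error type."""
    if error_type == "persistent_cell":
        # First fault memory page: the page containing the faulty address.
        return FaultInstruction("isolate_page", info["address"] // PAGE_SIZE)
    if error_type == "row":
        # Second fault memory page: the page associated with the fault row.
        return FaultInstruction("isolate_page", info["row_address"] // PAGE_SIZE)
    if error_type == "bus":
        return FaultInstruction("replace_bus", info["dq_pin"])
    if error_type == "block":
        return FaultInstruction("lock_bank", info["bank"])
    raise ValueError(f"unknown error type: {error_type}")


print(generate_instruction("row", {"row_address": 0x5000}))
```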
Step 209, the basic input/output system executes the fault handling instruction;
The basic input/output system executes the corresponding fault processing instruction to repair each error type.
In an alternative embodiment of the present invention, the step of the basic input/output system executing the fault handling instruction includes: the basic input/output system, in response to the fault processing instruction for isolating the first fault memory page, writes the address of the first fault memory page into a preset hardware error source table and sets a Common Platform Error Record identifier.
The basic input/output system responds to the fault processing instruction for isolating the first fault memory page by writing the address of the first fault memory page into the preset hardware error source table, so that the operating system layer can determine which memory page has failed. It then sets the Common Platform Error Record identifier, so that after reading the identifier the operating system can isolate the page at the address of the first fault memory page recorded in the preset hardware error source table.
In addition, in order to avoid the downtime risk caused by the continuous operation of the operating system, the step of executing the fault processing instruction by the basic input output system further includes: and the basic input and output system performs memory page offline processing on the first fault memory page.
For example, referring to fig. 4: 1) When a CE (correctable error) occurs in the memory and is collected and analyzed by the BMC, the error type is determined to be a persistent cell (memory cell) fault. 2) The BMC notifies the OSPM module of the OS (operating system) of the memory fault through an SCI (System Control Interrupt); the OSPM (Operating System-directed configuration and Power Management) calls an ACPI ASL (ACPI Source Language) method; the BIOS ASL method obtains the address of the faulty cell's page (memory page) from the BMC, fills it into the HEST (Hardware Error Source Table), and sets the CPER flag (Common Platform Error Record flag, i.e. the universal platform error record identifier) of APEI (ACPI Platform Error Interface). Once the CPER flag of APEI is set, the operating system automatically detects the flag and calls kernel code to perform memory page isolation. 3) The OS uses APEI to take the faulty memory page offline, preventing subsequent accesses to the faulty address from escalating the error.
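The handshake in this example can be illustrated with a minimal sketch. It is not firmware code: the class `MockFirmwareTables`, both function names and the 4 KiB page size are assumptions introduced here, standing in for the hardware error source table, the universal platform error record identifier and the kernel's page isolation.

```python
from dataclasses import dataclass, field

PAGE_SIZE = 4096  # assumed page size; the source does not state one


@dataclass
class MockFirmwareTables:
    """Hypothetical stand-in for the HEST entries and the CPER flag."""
    hest_addresses: list = field(default_factory=list)
    cper_flag: bool = False


def bios_report_faulty_page(tables, fault_address):
    """BIOS side: record the page-aligned address, then raise the flag."""
    page_address = fault_address & ~(PAGE_SIZE - 1)
    if page_address not in tables.hest_addresses:
        tables.hest_addresses.append(page_address)
    tables.cper_flag = True
    return page_address


def os_isolate_reported_pages(tables, offlined_pages):
    """OS side: if the flag is set, isolate every reported page and clear it."""
    if not tables.cper_flag:
        return offlined_pages
    offlined_pages.update(tables.hest_addresses)
    tables.hest_addresses.clear()
    tables.cper_flag = False
    return offlined_pages
```

In this sketch the flag is the only signal between the two sides, which mirrors the description: the operating system acts only after it observes that the identifier has been set.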
In an alternative embodiment of the present invention, the step of the basic input/output system executing the fault handling instruction includes: in response to a fault handling instruction for isolating the second faulty memory page, the basic input/output system writes the address of the second faulty memory page into a preset hardware error source table and sets a universal platform error record identifier.
In response to a fault handling instruction for isolating the second faulty memory page, the basic input/output system writes the address of the second faulty memory page into the preset hardware error source table, so that the operating system layer can determine that the memory page is faulty. It then sets the universal platform error record identifier, so that once the operating system reads the identifier it isolates the page at the address recorded in the preset hardware error source table.
In addition, to avoid the downtime risk caused by the operating system continuing to run, the step of the basic input/output system executing the fault handling instruction further includes: the basic input/output system performs memory page offline processing on the second faulty memory page.
For example, referring to fig. 5: 1) When a CE error occurs in the memory and is collected and analyzed by the BMC, the fault information is compared with the historical information. The memory error information on the memory module consists of DIMM, rank, device, bank, row and column information (dual in-line memory module physical location information, memory block information, memory device granule information, logical storage array information, row information and column information). When the DIMM, rank, device, bank and column information of the CE error is completely consistent with the history, and the row information satisfies |Row(new) - Row(old)| > Burst Length (that is, the absolute value of the difference between the row numbers of the faulty cells is greater than the burst length of the memory, which is 8 for DDR4 and 16 for DDR5), a row fault is determined. 2) The BMC calculates the memory page addresses of all memory cells associated with the faulty row. 3) The BMC notifies the OSPM module of the OS of the memory fault through the SCI, the OSPM calls an ACPI ASL method, and the BIOS ASL method obtains the memory page addresses of the faulty cells from the BMC, fills them into the HEST table, and sets the CPER flag of APEI. 4) The OS uses APEI to take the faulty memory pages offline, preventing subsequent accesses to the faulty addresses from escalating the error.
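The row fault criterion in step 1) can be written as a short predicate. The following is only a sketch of that comparison; the record type `CeRecord`, the function name and the `generation` parameter are assumptions introduced for illustration, while the burst lengths are those stated above.

```python
from dataclasses import dataclass

# Burst lengths as stated in the description.
BURST_LENGTH = {"DDR4": 8, "DDR5": 16}


@dataclass(frozen=True)
class CeRecord:
    """Hypothetical record of one correctable error."""
    dimm: int
    rank: int
    device: int
    bank: int
    row: int
    column: int


def is_row_fault(new, old, generation):
    """True when all fields except row match and the row gap exceeds the burst length."""
    same_location = (
        (new.dimm, new.rank, new.device, new.bank, new.column)
        == (old.dimm, old.rank, old.device, old.bank, old.column)
    )
    return same_location and abs(new.row - old.row) > BURST_LENGTH[generation]
```

For instance, two errors at rows 100 and 90 of the same DIMM, rank, device, bank and column satisfy the criterion for DDR4 (a gap of 10 exceeds 8) but not for DDR5 (a gap of 10 does not exceed 16).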
In an alternative embodiment of the present invention, the step of the basic input/output system executing the fault handling instruction includes: in response to a fault handling instruction for replacing the faulty memory bus, the basic input/output system determines an error correction pin associated with the faulty memory bus based on a preset bus switching rule; the basic input/output system switches the faulty pin in the faulty memory bus to the error correction pin.
In response to a fault handling instruction for replacing the faulty memory bus, the basic input/output system first determines the error correction pin associated with the faulty memory bus based on the preset bus switching rule, and then switches the faulty pin in the faulty memory bus to that error correction pin, thereby repairing the faulty memory bus.
In addition, to avoid the downtime risk caused by the operating system continuing to run, the step of the basic input/output system executing the fault handling instruction further includes: the basic input/output system reduces the error correction information length of the memory.
Because some error correction pins are reassigned to carry data, the basic input/output system shortens the error correction information of the memory accordingly, so that the system can keep running without the downtime risk.
For example, referring to fig. 6: 1) When a CE error occurs in the memory and is collected by the BMC, the BMC records the details of the error, including the DIMM, rank, device, bank, row and column information of the memory. 2) The BMC compares the fault information with the historical fault information; if the historical faults consistently occur on different granules while the DQ information reported by the faults is always the same, a bus fault is determined. 3) If a bus fault is determined, the BMC notifies the OSPM module of the OS of the memory fault through the SCI, the OSPM calls an ACPI ASL method, the BIOS ASL method obtains the faulty bus information from the BMC, and the BIOS triggers the lane replacement flow to replace the faulty data pin with an ECC pin.
4) When the IMC (memory controller) checks data, the 64-bit ECC (error checking and correcting) data is reduced to 56-bit ECC data. Error correction and detection capability is slightly reduced, but the downtime risk of continuing to run the system with an unhandled bus fault is avoided.
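The bus fault judgment in step 2) can be sketched as follows. This is one possible reading of the criterion (every recorded fault shares the same DQ pin while more than one granule is involved); the type `FaultRecord` and the function name are assumptions introduced for illustration.

```python
from typing import NamedTuple


class FaultRecord(NamedTuple):
    """Hypothetical record: the granule (device) and DQ pin of one CE."""
    device: int
    dq: int


def is_bus_fault(current, history):
    """True when all faults share one DQ pin but occur on different granules."""
    if not history:
        return False  # nothing to compare against
    records = [current, *history]
    same_dq = all(record.dq == current.dq for record in records)
    different_devices = len({record.device for record in records}) > 1
    return same_dq and different_devices
```

A fault that repeats on the same DQ pin of a single granule is not classified as a bus fault under this reading, since the granule itself remains the more likely cause.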
In an alternative embodiment of the present invention, the step of the basic input/output system executing the fault handling instruction includes: in response to a fault handling instruction for locking the faulty memory bank, the basic input/output system triggers an adaptive dual-device data correction mechanism to lock the faulty memory bank.
To perform lockstep processing on the faulty memory bank, the basic input/output system triggers the adaptive dual-device data correction (ADDDC) mechanism and uses it to process the data of the memory.
For example, referring to fig. 7: 1) When a CE error occurs in the memory and is collected by the BMC, the BMC records the details of the error, including the DIMM, rank, device, bank, row and column information of the memory (dual in-line memory module physical location information, memory block information, memory device granule information, logical storage array information, row information and column information). 2) The BMC compares this fault information with the historical fault information; when any bank or column appears in the historical information more than 3 times, a bank or column fault is determined, for example a bank fault when one granule has had 3 bank faults. 3) If a bank fault is determined, the BMC notifies the OSPM module of the OS of the memory fault through the SCI, the OSPM calls an ACPI ASL method, the BIOS ASL method obtains the faulty memory bank information from the BMC, and the BIOS triggers the ADDDC flow to perform lockstep processing.
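The counting rule in step 2) can be sketched as below. The sketch follows the claim wording (the current fault plus the historical faults on the same bank exceed a preset threshold); the function name, the tuple layout of a bank key and the default threshold of 3 are assumptions introduced for illustration.

```python
from collections import Counter

BANK_FAULT_THRESHOLD = 3  # assumed preset threshold, taken from the example


def is_bank_fault(current_bank, history_banks, threshold=BANK_FAULT_THRESHOLD):
    """current_bank and each history entry are (dimm, rank, device, bank) tuples.

    Returns True when the current fault plus the historical faults on the
    same bank exceed the threshold.
    """
    counts = Counter(history_banks)
    return counts[current_bank] + 1 > threshold
```

The same counter could be keyed on column instead of bank to detect column faults, which the example treats in the same way.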
Step 210: when the memory controller of the memory completes a single patrol, the basic input/output system reads the faulty row of the memory.
When the memory controller of the memory completes a single patrol, that is, when one patrol pass ends, the basic input/output system reads the row with the most faults, which is the faulty row of the memory.
Specifically, the step of the basic input/output system reading the faulty row of the memory when the memory controller of the memory completes a single patrol includes: when the memory controller of the memory completes a single patrol, the basic input/output system receives an interrupt sent by the memory and reads the faulty row of the memory while handling that interrupt.
In practical applications, completion of a single patrol by the memory controller triggers an interrupt; the basic input/output system receives the interrupt and reads the faulty row of the memory in its interrupt handler.
Step 211: the basic input/output system performs fault processing on the faulty row.
Fault processing is performed on the faulty row to avoid downtime caused by that row.
Specifically, the step of the basic input/output system performing fault processing on the faulty row includes: the basic input/output system judges whether the faulty row has undergone memory page offline processing; when the faulty row has undergone memory page offline processing, the basic input/output system replaces the faulty row with a spare memory row; when the faulty row has not undergone memory page offline processing, the basic input/output system determines the storage unit data of the faulty row, inverts the storage unit data, and rewrites the inverted storage unit data into the memory.
That is, the basic input/output system first checks whether the faulty row has already been handled by memory page offline processing. If it has, the faulty row is replaced with a spare memory row. If it has not, the storage unit data of the faulty row is determined, inverted, and rewritten into the memory.
Further, the step of the basic input/output system performing fault processing on the faulty row further includes: the basic input/output system determines the storage unit addresses associated with the faulty row; the basic input/output system performs memory page isolation on the memory pages corresponding to the storage unit addresses.
The basic input/output system determines the storage unit addresses associated with the faulty row and isolates the memory pages corresponding to those addresses, which prevents the memory fault from spreading further and affecting the operation of the server.
For example, referring to fig. 8: 1) A system management interrupt is generated when the memory controller completes one patrol pass over the memory. 2) The BIOS receives the interrupt and checks the ECS-related registers, from which the faulty row can be extracted. 3) The BIOS determines whether the faulty row was previously handled by page offline processing. 4) If memory page offline processing has already been performed, runtime PPR (post-package repair) processing is performed: the BIOS calls runtime PPR to replace the faulty memory row with a spare row of the corresponding bank group (a set of memory storage arrays). 5) If page offline processing has not been performed, all cell data related to the faulty row must be inverted and then rewritten into the memory. 6) After the write succeeds, the system addresses associated with all cells of the faulty row are translated, all of these system addresses are written into the HEST table, and the CPER flag of APEI is set (once the CPER flag of APEI is set, the operating system automatically detects the flag bit and calls kernel code to perform memory page isolation). The OS uses APEI to take the faulty memory pages offline, preventing subsequent accesses to the faulty addresses from escalating the error.
The memory row with the most single-bit errors is obtained through the register. Because PPR resources are limited, the faulty memory row is first handled with the memory page offline function. Because on-die ECC operates independently of the system, it keeps running after the memory page is taken offline; this design therefore inverts the data of the faulty row and rewrites it into the memory, so as to eliminate, as far as possible, the errors reported when on-die ECC scans the data. Runtime PPR is used to replace the memory row only when the row continues to fail. In this way the limited PPR resources are reserved for critical faults.
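The two-stage policy described above (page offline first, runtime PPR only for a row that fails again) can be sketched as a single handler. This is a simplified model, not BIOS code: rows are dictionary keys, row contents are lists of bytes, and the function name, parameters and return strings are assumptions introduced for illustration.

```python
def handle_patrol_complete(fault_row, offlined_rows, row_data, spare_rows):
    """Decide how to treat the faulty row reported after one patrol pass.

    fault_row     -- identifier of the row read from the (modelled) registers
    offlined_rows -- set of rows already handled by page offline processing
    row_data      -- dict mapping row identifier to a list of byte values
    spare_rows    -- list of spare rows still available for replacement
    """
    if fault_row in offlined_rows:
        # Second occurrence: spend a scarce spare row on it.
        if not spare_rows:
            return "no_spare_row"
        spare = spare_rows.pop()
        return f"ppr:{spare}"
    # First occurrence: invert every byte and write it back, then mark the
    # row's pages as isolated so the operating system stops using them.
    row_data[fault_row] = [(~byte) & 0xFF for byte in row_data[fault_row]]
    offlined_rows.add(fault_row)
    return "rewrite_and_offline"
```

Calling the handler twice for the same row shows the intended escalation: the first call returns `"rewrite_and_offline"`, and the second consumes a spare row.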
In order that the embodiments of the present invention may be apparent to those skilled in the art, the following description is given by way of example:
Referring to fig. 9, the BMC screens faults and applies a different repair scheme to each fault type. When a persistent memory cell fault is screened out, the BMC cooperates with the BIOS to isolate the faulty memory address under the operating system. When a row fault is screened out, the BIOS is notified to call PPR for replacement. When a column, bank or chip fault is screened out, the BIOS is notified to call ADDDC for error repair. When a bus fault is screened out, the BIOS is notified to operate the memory controller so that the faulty lane is directly replaced with some of the lanes in the ECC devices, preserving the complete cache line data. In addition, after the memory controller completes one memory error patrol, it actively notifies the BIOS to check for on-die ECC faults. The BIOS may read the ECS-related registers to obtain the row with the most on-die ECC corrected errors, and then either isolates the memory addresses of this row or repairs the row using runtime PPR.
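The mapping from screened fault type to repair scheme in this overview can be summarised as a lookup table. The enumeration and the action strings below are assumptions introduced for illustration; only the pairing of fault types with repair schemes is taken from the description.

```python
from enum import Enum


class FaultType(Enum):
    PERSISTENT_CELL = "persistent_cell"
    ROW = "row"
    COLUMN = "column"
    BANK = "bank"
    CHIP = "chip"
    BUS = "bus"


# Repair scheme per fault type, following the overview of fig. 9.
REPAIR_ACTION = {
    FaultType.PERSISTENT_CELL: "page_isolation",
    FaultType.ROW: "ppr",
    FaultType.COLUMN: "adddc",
    FaultType.BANK: "adddc",
    FaultType.CHIP: "adddc",
    FaultType.BUS: "lane_replacement",
}


def select_repair(fault_type):
    """Return the repair scheme the BIOS is asked to apply for a fault type."""
    return REPAIR_ACTION[fault_type]
```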
It should be noted that, for simplicity of description, the method embodiments are shown as a series of acts, but it should be understood by those skilled in the art that the embodiments are not limited by the order of acts, as some steps may occur in other orders or concurrently in accordance with the embodiments. Further, those skilled in the art will appreciate that the embodiments described in the specification are presently preferred embodiments, and that the acts are not necessarily required by the embodiments of the invention.
Referring to fig. 10, there is shown a block diagram of an embodiment of a memory fault handling device of the present invention, comprising a baseboard management controller 1001, a basic input output system 1002, an operating system 1003 and a memory 1004.
The baseboard management controller 1001 is configured to acquire current memory error information when the memory controller of the memory 1004 performs a patrol;
The baseboard management controller 1001 is configured to determine an error type according to the current memory error information;
the operating system 1003 is configured to generate a fault handling instruction according to the error type;
The bios 1002 is configured to execute the fault handling instruction to complete the repair of the memory 1004 fault.
In an alternative embodiment of the present invention, when the current memory error information has a preset storage unit fault identifier, the baseboard management controller 1001 is configured to determine that the error type is a persistent storage unit error;
when the current memory error information does not have a storage unit fault identifier, the baseboard management controller 1001 is configured to obtain historical memory error information;
The baseboard management controller 1001 is configured to compare the current memory error information with the historical error information;
when the absolute value of the fault unit row difference between the current memory error information and the historical error information is greater than a preset burst length, the baseboard management controller 1001 is configured to determine that the error type is a memory row error;
when the pin identifiers of the fault data input/output bus between the current memory error information and the historical error information are the same, and the granule faults between the current memory error information and the historical error information are different, the baseboard management controller 1001 is configured to determine that the error type is a memory bus error;
When the sum of the number of block failures of the current memory error information and the number of block failures of the historical error information is greater than a preset threshold, the baseboard management controller 1001 is configured to determine that the error type is a memory block failure.
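The determinations listed above can be chained into one classification routine. The sketch below is one possible ordering under stated assumptions: records are plain dictionaries, the keys `cell_fault_flag` and `dq`, the default burst length of 16 and the block threshold of 3 are all hypothetical, and the row check additionally requires the other location fields to match, as in the example of fig. 5.

```python
def classify_error(current, history, burst_length=16, block_threshold=3):
    """Return the error type for the current record given historical records."""
    if current.get("cell_fault_flag"):
        return "persistent_cell_error"

    location = ("dimm", "rank", "device", "bank", "column")
    for old in history:
        same_location = all(current[key] == old[key] for key in location)
        if same_location and abs(current["row"] - old["row"]) > burst_length:
            return "memory_row_error"

    # Same DQ pin reported by a different granule points at the bus.
    if any(old["dq"] == current["dq"] and old["device"] != current["device"]
           for old in history):
        return "memory_bus_error"

    bank_key = ("dimm", "rank", "device", "bank")
    hits = sum(all(current[key] == old[key] for key in bank_key)
               for old in history)
    if hits + 1 > block_threshold:
        return "memory_block_fault"
    return "unclassified"
```

The final `"unclassified"` result is an addition of this sketch for records that match none of the listed conditions; the description does not name such a case.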
In an alternative embodiment of the present invention, when the error type is the persistent storage unit error, the operating system 1003 is configured to parse the error information to determine a first failed memory page;
the operating system 1003 is configured to generate a fault handling instruction for isolating the first faulty memory page.
In an alternative embodiment of the present invention, the bios 1002 is configured to respond to a fault handling instruction for isolating the first faulty memory page, write an address of the first faulty memory page into a preset hardware error source table, and set a universal platform error record identifier.
In an alternative embodiment of the present invention, the bios 1002 is configured to perform a memory page offline processing on the first failed memory page.
In an alternative embodiment of the present invention, when the error type is the memory line error, the operating system 1003 is configured to parse the current memory error information to determine a failed memory line;
the operating system 1003 is configured to determine a second failed memory page associated with the failed memory row;
The operating system 1003 is configured to generate a fault handling instruction for isolating the second faulty memory page.
In an alternative embodiment of the present invention, the bios 1002 is configured to respond to a fault handling instruction for isolating the second faulty memory page, write the address of the second faulty memory page into a preset hardware error source table, and set a universal platform error record identifier.
In an alternative embodiment of the present invention, the bios 1002 is configured to perform a memory page offline processing on the second failed memory page.
In an alternative embodiment of the present invention, when the error type is the memory bus error, the operating system 1003 is configured to parse the current memory error information to determine a failed memory bus;
The operating system 1003 is configured to generate a fault handling instruction that replaces the faulty memory bus.
In an alternative embodiment of the present invention, the bios 1002 is configured to determine an error correction pin associated with the failed memory bus based on a preset bus switching rule in response to a failure handling instruction for replacing the failed memory bus;
the bios 1002 is configured to switch a failed pin in the failed memory bus to the error correction pin.
In an alternative embodiment of the present invention, the bios 1002 is configured to reduce the length of the error correction information in the memory 1004.
In an alternative embodiment of the present invention, when the error type is a memory block failure, the operating system 1003 is configured to parse the current memory error information to determine a failure memory bank to which the failed memory block belongs;
the operating system 1003 is configured to generate a fault handling instruction for locking the faulty memory bank.
In an alternative embodiment of the present invention, the bios 1002 is configured to trigger an adaptive dual device data correction mechanism to lock the failed memory bank in response to a failure handling instruction for locking the failed memory bank.
In an alternative embodiment of the present invention, further comprising:
the bios 1002 is further configured to read a failed row of the memory 1004 when the memory controller of the memory 1004 completes a single patrol.
In an alternative embodiment of the present invention, when the memory controller of the memory 1004 completes a single patrol, the basic input/output system 1002 is configured to receive an interrupt sent by the memory 1004 and to read the faulty row of the memory 1004 while handling that interrupt.
In an alternative embodiment of the present invention, further comprising:
the basic input/output system 1002 is further configured to perform fault processing on the faulty row.
In an alternative embodiment of the present invention, the bios 1002 is configured to determine whether the failed row has undergone a memory page offline processing;
When the failed line has undergone memory page offline processing, the bios 1002 is configured to replace the failed line with a spare memory line;
When the failed line does not perform the offline processing of the memory page, the bios 1002 is configured to determine the storage unit data of the failed line; and inverting the storage unit data, and rewriting the inverted storage unit data into the memory 1004.
In an alternative embodiment of the present invention,
The bios 1002 is configured to determine a memory location address associated with the failed row;
the bios 1002 is configured to perform memory page isolation on a memory page corresponding to the memory cell address.
Because the embodiments of the electronic device are substantially similar to the method embodiments, they are described relatively briefly; for relevant details, refer to the corresponding description of the method embodiments.
Referring to fig. 11, an embodiment of the present invention further provides a computer readable storage medium 1101, where the storage medium 1101 stores a computer program, and the computer program when executed by a processor performs a memory failure processing method according to any one of the embodiments of the present invention.
The memory fault processing method is applied to electronic equipment, and the electronic equipment comprises a baseboard management controller, a basic input and output system, an operating system and a memory which are connected based on the same bus.
The memory fault processing method comprises the following steps:
The baseboard management controller acquires current memory error information when a memory controller of the memory is patrolled;
The baseboard management controller determines the error type according to the current memory error information;
The operating system generates a fault processing instruction according to the error type;
and the basic input and output system executes the fault processing instruction to complete memory fault repair.
Optionally, the step of determining, by the baseboard management controller, an error type according to the current memory error information includes:
When the current memory error information has a preset storage unit fault identifier, the baseboard management controller determines that the error type is a persistent storage unit error;
when the current memory error information does not have a storage unit fault identifier, the baseboard management controller acquires historical memory error information;
the baseboard management controller compares the current memory error information with the historical error information;
When the absolute value of the fault unit row difference value between the current memory error information and the historical error information is larger than a preset burst length, the baseboard management controller determines that the error type is a memory row error;
When the pin identifiers of the fault data input/output buses between the current memory error information and the historical error information are the same, and the particle faults between the current memory error information and the historical error information are different, the baseboard management controller determines that the error type is a memory bus error;
And when the sum of the block fault number of the current memory error information and the block fault number of the historical error information is larger than a preset threshold value, the baseboard management controller determines that the error type is a memory block fault.
Optionally, the step of generating the fault handling instruction by the operating system according to the error type includes:
When the error type is the error of the persistent storage unit, the operating system analyzes the error information and determines a first fault memory page;
and the operating system generates a fault processing instruction for isolating the first fault memory page.
Optionally, the step of executing the fault handling instruction by the bios includes:
and the basic input/output system responds to a fault processing instruction for isolating the first fault memory page, writes the address of the first fault memory page into a preset hardware error source table, and sets a universal platform error record identifier.
Optionally, the step of executing the fault handling instruction by the bios further includes:
and the basic input and output system performs memory page offline processing on the first fault memory page.
Optionally, the step of generating the fault handling instruction by the operating system according to the error type includes:
when the error type is the memory line error, the operating system analyzes the current memory error information and determines a fault memory line;
The operating system determines a second failed memory page associated with the failed memory row;
And the operating system generates a fault processing instruction for isolating the second fault memory page.
Optionally, the step of executing the fault handling instruction by the bios includes:
and the basic input/output system responds to a fault processing instruction for isolating the second fault memory page, writes the address of the second fault memory page into a preset hardware error source table, and sets a universal platform error record identifier.
Optionally, the step of executing the fault handling instruction by the bios further includes:
And the basic input and output system performs memory page offline processing on the second fault memory page.
Optionally, the step of generating the fault handling instruction by the operating system according to the error type includes:
when the error type is the memory bus error, the operating system analyzes the current memory error information and determines a fault memory bus;
the operating system generates a fault handling instruction to replace the faulty memory bus.
Optionally, the step of executing the fault handling instruction by the bios includes:
The basic input/output system responds to a fault processing instruction for replacing the fault memory bus and determines an error correction pin associated with the fault memory bus based on a preset bus switching rule;
the basic input/output system switches a fault pin in the fault memory bus to the error correction pin.
Optionally, the step of executing the fault handling instruction by the bios further includes:
the basic input and output system reduces the error correction information length of the memory.
Optionally, the step of generating the fault handling instruction by the operating system according to the error type includes:
when the error type is a memory block fault, the operating system analyzes the current memory error information and determines a fault memory bank to which the fault memory block belongs;
the operating system generates a fault handling instruction for locking the faulty memory bank.
Optionally, the step of executing the fault handling instruction by the bios includes:
And the basic input/output system responds to a fault processing instruction for locking the fault memory bank, and triggers an adaptive dual-device data correction mechanism to lock the fault memory bank.
Optionally, the method further comprises:
And when the memory controller of the memory completes single inspection, the basic input/output system reads the fault line of the memory.
Optionally, the step of the bios reading the failed row of the memory when the memory controller of the memory completes a single patrol includes:
And when the memory controller of the memory completes single inspection, the basic input/output system receives interrupt processing sent by the memory, and in the interrupt processing, the fault line of the memory is read.
Optionally, the method further comprises:
And the basic input and output system performs fault processing on the fault line.
Optionally, the step of performing fault handling on the fault line by the basic input output system includes:
the basic input and output system judges whether the fault line has undergone memory page offline processing;
When the fault line has been subjected to memory page offline processing, the basic input/output system adopts a standby memory line to replace the fault line;
when the fault line does not perform memory page offline processing, the basic input output system determines storage unit data of the fault line; and inverting the storage unit data, and performing memory rewriting on the inverted storage unit data.
Optionally, the step of performing fault processing on the fault line by the basic input output system further includes:
the basic input and output system determines the storage unit address associated with the fault line;
and the basic input and output system performs memory page isolation on the memory pages corresponding to the memory unit addresses.
The memory may include a random access memory (Random Access Memory, abbreviated as RAM) or a non-volatile memory (non-volatile memory), such as at least one disk memory. Optionally, the memory may also be at least one memory device located remotely from the aforementioned processor.
The processor may be a general-purpose processor, including a central processing unit (Central Processing Unit, abbreviated as CPU), a network processor (Network Processor, abbreviated as NP), etc.; it may also be a digital signal processor (Digital Signal Processor, DSP), an application-specific integrated circuit (ASIC), a field-programmable gate array (FPGA) or other programmable logic device, a discrete gate or transistor logic device, or discrete hardware components.
Referring to FIG. 12, a block diagram of a computer program product is shown, provided in accordance with an embodiment of the present invention. The computer program product comprises a computer program 1201, which computer program 1201 when executed by a processor implements the steps of the memory failure handling method as described above.
The memory fault processing method is applied to electronic equipment, the electronic equipment comprises a baseboard management controller, a basic input and output system, an operating system and a memory, which are connected based on the same bus, and the memory fault processing method comprises the following steps:
The baseboard management controller acquires current memory error information when a memory controller of the memory is patrolled;
The baseboard management controller determines the error type according to the current memory error information;
The operating system generates a fault processing instruction according to the error type;
and the basic input and output system executes the fault processing instruction to complete memory fault repair.
Optionally, the step of the baseboard management controller determining the error type according to the current memory error information comprises:
when the current memory error information carries a preset storage unit fault identifier, the baseboard management controller determines that the error type is a persistent storage unit error;
when the current memory error information does not carry the storage unit fault identifier, the baseboard management controller acquires historical memory error information;
the baseboard management controller compares the current memory error information with the historical memory error information;
when the absolute value of the difference between the faulty-unit rows of the current memory error information and the historical memory error information is greater than a preset burst length, the baseboard management controller determines that the error type is a memory row error;
when the faulty data input/output bus pin identifiers of the current memory error information and the historical memory error information are the same, and the faulty particles (memory chips) of the two differ, the baseboard management controller determines that the error type is a memory bus error; and
when the sum of the block fault count of the current memory error information and the block fault count of the historical memory error information is greater than a preset threshold, the baseboard management controller determines that the error type is a memory block fault.
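The classification rules above can be illustrated with a minimal Python sketch. It is not the implementation of the embodiment: the field names, the default burst length and threshold, and the order in which the three history-based rules are tested (here, the order in which they are listed) are all assumptions made for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Optional


class ErrorType(Enum):
    PERSISTENT_CELL = "persistent storage unit error"
    ROW = "memory row error"
    BUS = "memory bus error"
    BLOCK = "memory block fault"


@dataclass
class MemoryErrorInfo:
    # All field names are hypothetical; the text does not define a record layout.
    cell_fault_flag: bool  # preset storage unit fault identifier present?
    row: int               # row of the faulty unit
    dq_pin: int            # faulty data input/output bus pin identifier
    device: int            # faulty particle (memory chip)
    block_faults: int      # number of block faults recorded


def classify(current: MemoryErrorInfo,
             history: Optional[MemoryErrorInfo],
             burst_length: int = 8,          # assumed preset burst length
             block_threshold: int = 4        # assumed preset threshold
             ) -> Optional[ErrorType]:
    """Return the error type, or None when no rule matches."""
    if current.cell_fault_flag:
        return ErrorType.PERSISTENT_CELL
    if history is None:
        return None  # nothing to compare against (assumed behaviour)
    if abs(current.row - history.row) > burst_length:
        return ErrorType.ROW
    if current.dq_pin == history.dq_pin and current.device != history.device:
        return ErrorType.BUS
    if current.block_faults + history.block_faults > block_threshold:
        return ErrorType.BLOCK
    return None
```

Returning `None` when no rule applies is a choice of this sketch; the text does not state what happens in that case.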
Optionally, the step of the operating system generating the fault processing instruction according to the error type comprises:
when the error type is the persistent storage unit error, the operating system parses the error information and determines a first faulty memory page; and
the operating system generates a fault processing instruction for isolating the first faulty memory page.
Optionally, the step of the basic input/output system executing the fault processing instruction comprises:
in response to the fault processing instruction for isolating the first faulty memory page, the basic input/output system writes the address of the first faulty memory page into a preset hardware error source table and sets a common platform error record identifier.
Optionally, the step of the basic input/output system executing the fault processing instruction further comprises:
the basic input/output system performs memory page offline processing on the first faulty memory page.
Optionally, the step of the operating system generating the fault processing instruction according to the error type comprises:
when the error type is the memory row error, the operating system parses the current memory error information and determines a faulty memory row;
the operating system determines a second faulty memory page associated with the faulty memory row; and
the operating system generates a fault processing instruction for isolating the second faulty memory page.
Optionally, the step of the basic input/output system executing the fault processing instruction comprises:
in response to the fault processing instruction for isolating the second faulty memory page, the basic input/output system writes the address of the second faulty memory page into a preset hardware error source table and sets a common platform error record identifier.
Optionally, the step of the basic input/output system executing the fault processing instruction further comprises:
the basic input/output system performs memory page offline processing on the second faulty memory page.
Optionally, the step of the operating system generating the fault processing instruction according to the error type comprises:
when the error type is the memory bus error, the operating system parses the current memory error information and determines a faulty memory bus; and
the operating system generates a fault processing instruction for replacing the faulty memory bus.
Optionally, the step of the basic input/output system executing the fault processing instruction comprises:
in response to the fault processing instruction for replacing the faulty memory bus, the basic input/output system determines an error correction pin associated with the faulty memory bus based on a preset bus switching rule; and
the basic input/output system switches the faulty pin in the faulty memory bus to the error correction pin.
Optionally, the step of the basic input/output system executing the fault processing instruction further comprises:
the basic input/output system reduces the error correction information length of the memory.
Optionally, the step of the operating system generating the fault processing instruction according to the error type comprises:
when the error type is the memory block fault, the operating system parses the current memory error information and determines the faulty memory bank to which the faulty memory block belongs; and
the operating system generates a fault processing instruction for locking the faulty memory bank.
Optionally, the step of the basic input/output system executing the fault processing instruction comprises:
in response to the fault processing instruction for locking the faulty memory bank, the basic input/output system triggers an adaptive dual-device data correction mechanism to lock the faulty memory bank.
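The pairing of each error type with the instruction generated by the operating system and the actions taken by the basic input/output system can be summarized as a lookup table. The following Python sketch only restates the optional steps above as data; the action names, the `FaultInstruction` type, and the function names are hypothetical and do not correspond to any real firmware interface.

```python
from dataclasses import dataclass
from enum import Enum
from typing import List


class ErrorType(Enum):
    PERSISTENT_CELL = 1
    ROW = 2
    BUS = 3
    BLOCK = 4


@dataclass(frozen=True)
class FaultInstruction:
    action: str  # hypothetical action name understood by the BIOS side
    target: str  # hypothetical label of the faulty page, bus, or bank


# Instruction the operating system generates for each error type.
ACTION_BY_TYPE = {
    ErrorType.PERSISTENT_CELL: "isolate_page",
    ErrorType.ROW: "isolate_page",
    ErrorType.BUS: "replace_bus",
    ErrorType.BLOCK: "lock_bank",
}

# Steps the basic input/output system performs for each instruction,
# paraphrased from the optional steps in the text.
BIOS_STEPS = {
    "isolate_page": [
        "write page address to the hardware error source table",
        "set the platform error record identifier",
        "take the memory page offline",
    ],
    "replace_bus": [
        "determine the error correction pin from the bus switching rule",
        "switch the faulty pin to the error correction pin",
        "reduce the error correction information length",
    ],
    "lock_bank": [
        "trigger adaptive dual-device data correction to lock the bank",
    ],
}


def generate_instruction(error_type: ErrorType, target: str) -> FaultInstruction:
    """Operating-system side: map an error type to a fault processing instruction."""
    return FaultInstruction(ACTION_BY_TYPE[error_type], target)


def execute_instruction(instruction: FaultInstruction) -> List[str]:
    """BIOS side: return the ordered steps for the instruction (no hardware access)."""
    return list(BIOS_STEPS[instruction.action])
```

Keeping the mapping in data rather than in branching code makes it easy to see that the persistent storage unit error and the memory row error share the same page-isolation path and differ only in how the faulty page is located.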
Optionally, the method further comprises:
when the memory controller of the memory completes a single patrol inspection, the basic input/output system reads the faulty row of the memory.
Optionally, the step of the basic input/output system reading the faulty row of the memory when the memory controller of the memory completes a single patrol inspection comprises:
when the memory controller of the memory completes a single patrol inspection, the basic input/output system receives an interrupt sent by the memory and, while handling the interrupt, reads the faulty row of the memory.
Optionally, the method further comprises:
the basic input/output system performs fault processing on the faulty row.
Optionally, the step of the basic input/output system performing fault processing on the faulty row comprises:
the basic input/output system judges whether the faulty row has undergone memory page offline processing;
when the faulty row has undergone memory page offline processing, the basic input/output system replaces the faulty row with a spare memory row; and
when the faulty row has not undergone memory page offline processing, the basic input/output system determines the storage unit data of the faulty row, inverts the storage unit data, and rewrites the inverted storage unit data to the memory.
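The two branches of the faulty-row handling above can be sketched as follows. This is a toy model under stated assumptions: memory is a dictionary from row index to an integer word, the word width is 8 bits, spare rows are kept in a simple list, and the function and return-value names are hypothetical. The behaviour when no spare row remains is not specified in the text and is an assumption of this sketch.

```python
from typing import Dict, List, Set


def handle_faulty_row(row: int,
                      offlined_rows: Set[int],
                      spare_rows: List[int],
                      remap: Dict[int, int],
                      memory: Dict[int, int],
                      width: int = 8) -> str:
    """Apply the two-branch faulty-row handling to a toy memory model."""
    if row in offlined_rows:
        # Row already went through memory page offline processing:
        # replace it with a spare memory row.
        if not spare_rows:
            return "no_spare_row"  # assumed outcome, not described in the text
        remap[row] = spare_rows.pop(0)
        return "replaced_with_spare"
    # Row not yet offlined: read the stored data, invert every bit,
    # and write the inverted value back.
    mask = (1 << width) - 1
    memory[row] = ~memory[row] & mask
    return "inverted_and_rewritten"
```

For example, with `memory = {5: 0b10100101}` and no offlined rows, calling `handle_faulty_row(5, set(), [100], {}, memory)` leaves `memory[5] == 0b01011010`.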
Optionally, the step of the basic input/output system performing fault processing on the faulty row further comprises:
the basic input/output system determines the storage unit addresses associated with the faulty row; and
the basic input/output system performs memory page isolation on the memory pages corresponding to the storage unit addresses.
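Mapping storage unit addresses to the memory pages that contain them is a simple integer division by the page size. The sketch below assumes a 4096-byte page and flat physical addresses; both are assumptions, since the text does not state the page size or the address format, and the function names are hypothetical.

```python
from typing import Iterable, List, Set

PAGE_SIZE = 4096  # assumed page size in bytes


def pages_for_addresses(addresses: Iterable[int],
                        page_size: int = PAGE_SIZE) -> List[int]:
    """Return the sorted, de-duplicated page numbers covering the addresses."""
    return sorted({addr // page_size for addr in addresses})


def isolate_pages(addresses: Iterable[int],
                  isolated: Set[int],
                  page_size: int = PAGE_SIZE) -> List[int]:
    """Mark pages as isolated and return only the newly isolated ones."""
    new_pages = [p for p in pages_for_addresses(addresses, page_size)
                 if p not in isolated]
    isolated.update(new_pages)
    return new_pages
```

Several addresses in one faulty row may fall within the same page, so de-duplicating before isolation avoids isolating a page twice.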
In this specification, the embodiments are described in a progressive manner. Each embodiment focuses on its differences from the other embodiments; for identical or similar parts, the embodiments may be referred to one another.
It will be apparent to those skilled in the art that embodiments of the present invention may be provided as a method, apparatus, or computer program product. Accordingly, embodiments of the present invention may take the form of an entirely hardware embodiment, an entirely software embodiment or an embodiment combining software and hardware aspects. Furthermore, embodiments of the invention may take the form of a computer program product on one or more computer-usable storage media (including, but not limited to, disk storage, CD-ROM, optical storage, etc.) having computer-usable program code embodied therein.
Embodiments of the present invention are described with reference to flowchart illustrations and/or block diagrams of methods, terminal devices (systems), and computer program products according to embodiments of the invention. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing terminal device to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing terminal device, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be stored in a computer-readable memory that can direct a computer or other programmable data processing apparatus to function in a particular manner, such that the instructions stored in the computer-readable memory produce an article of manufacture including instruction means which implement the function specified in the flowchart flow or flows and/or block diagram block or blocks.
These computer program instructions may also be loaded onto a computer or other programmable data processing apparatus to cause a series of operational steps to be performed on the computer or other programmable apparatus to produce a computer implemented process such that the instructions which execute on the computer or other programmable apparatus provide steps for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
While preferred embodiments of the present invention have been described, additional variations and modifications in those embodiments may occur to those skilled in the art once they learn of the basic inventive concepts. It is therefore intended that the following claims be interpreted as including the preferred embodiment and all such alterations and modifications as fall within the scope of the embodiments of the invention.
Finally, it is further noted that relational terms such as first and second, and the like are used solely to distinguish one entity or action from another entity or action without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof, are intended to cover a non-exclusive inclusion, such that a process, method, article, or terminal device that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or terminal device. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in a process, method, article, or terminal device that comprises the element.
The memory fault processing method, electronic device, storage medium, and program product provided by the present invention have been described in detail above. The principles and embodiments of the present invention are set out herein, and the description of the embodiments is intended only to aid understanding of the invention. Meanwhile, those skilled in the art may, in accordance with the ideas of the present invention, vary the specific embodiments and the scope of application. In view of the above, this description should not be construed as limiting the present invention.

Claims (21)

1. A memory fault processing method, characterized in that it is applied to an electronic device, wherein the electronic device comprises a baseboard management controller, a basic input/output system, an operating system, and a memory which are connected via the same bus, and the method comprises the following steps:
the baseboard management controller acquires current memory error information during patrol inspection by a memory controller of the memory;
the baseboard management controller determines an error type according to the current memory error information;
the operating system generates a fault processing instruction according to the error type; and
the basic input/output system executes the fault processing instruction to complete memory fault repair.
2. The method of claim 1, wherein the step of the baseboard management controller determining the error type according to the current memory error information comprises:
when the current memory error information carries a preset storage unit fault identifier, the baseboard management controller determines that the error type is a persistent storage unit error;
when the current memory error information does not carry the storage unit fault identifier, the baseboard management controller acquires historical memory error information;
the baseboard management controller compares the current memory error information with the historical memory error information;
when the absolute value of the difference between the faulty-unit rows of the current memory error information and the historical memory error information is greater than a preset burst length, the baseboard management controller determines that the error type is a memory row error;
when the faulty data input/output bus pin identifiers of the current memory error information and the historical memory error information are the same, and the faulty particles (memory chips) of the two differ, the baseboard management controller determines that the error type is a memory bus error; and
when the sum of the block fault count of the current memory error information and the block fault count of the historical memory error information is greater than a preset threshold, the baseboard management controller determines that the error type is a memory block fault.
3. The method of claim 2, wherein the step of the operating system generating the fault processing instruction according to the error type comprises:
when the error type is the persistent storage unit error, the operating system parses the error information and determines a first faulty memory page; and
the operating system generates a fault processing instruction for isolating the first faulty memory page.
4. The method of claim 3, wherein the step of the basic input/output system executing the fault processing instruction comprises:
in response to the fault processing instruction for isolating the first faulty memory page, the basic input/output system writes the address of the first faulty memory page into a preset hardware error source table and sets a common platform error record identifier.
5. The method of claim 4, wherein the step of the basic input/output system executing the fault processing instruction further comprises:
the basic input/output system performs memory page offline processing on the first faulty memory page.
6. The method of claim 2, wherein the step of the operating system generating the fault processing instruction according to the error type comprises:
when the error type is the memory row error, the operating system parses the current memory error information and determines a faulty memory row;
the operating system determines a second faulty memory page associated with the faulty memory row; and
the operating system generates a fault processing instruction for isolating the second faulty memory page.
7. The method of claim 6, wherein the step of the basic input/output system executing the fault processing instruction comprises:
in response to the fault processing instruction for isolating the second faulty memory page, the basic input/output system writes the address of the second faulty memory page into a preset hardware error source table and sets a common platform error record identifier.
8. The method of claim 7, wherein the step of the basic input/output system executing the fault processing instruction further comprises:
the basic input/output system performs memory page offline processing on the second faulty memory page.
9. The method of claim 2, wherein the step of the operating system generating the fault processing instruction according to the error type comprises:
when the error type is the memory bus error, the operating system parses the current memory error information and determines a faulty memory bus; and
the operating system generates a fault processing instruction for replacing the faulty memory bus.
10. The method of claim 9, wherein the step of the basic input/output system executing the fault processing instruction comprises:
in response to the fault processing instruction for replacing the faulty memory bus, the basic input/output system determines an error correction pin associated with the faulty memory bus based on a preset bus switching rule; and
the basic input/output system switches the faulty pin in the faulty memory bus to the error correction pin.
11. The method of claim 10, wherein the step of the basic input/output system executing the fault processing instruction further comprises:
the basic input/output system reduces the error correction information length of the memory.
12. The method of claim 2, wherein the step of the operating system generating the fault processing instruction according to the error type comprises:
when the error type is the memory block fault, the operating system parses the current memory error information and determines the faulty memory bank to which the faulty memory block belongs; and
the operating system generates a fault processing instruction for locking the faulty memory bank.
13. The method of claim 12, wherein the step of the basic input/output system executing the fault processing instruction comprises:
in response to the fault processing instruction for locking the faulty memory bank, the basic input/output system triggers an adaptive dual-device data correction mechanism to lock the faulty memory bank.
14. The method of claim 1, wherein the method further comprises:
when the memory controller of the memory completes a single patrol inspection, the basic input/output system reads the faulty row of the memory.
15. The method of claim 14, wherein the step of the basic input/output system reading the faulty row of the memory when the memory controller of the memory completes a single patrol inspection comprises:
when the memory controller of the memory completes a single patrol inspection, the basic input/output system receives an interrupt sent by the memory and, while handling the interrupt, reads the faulty row of the memory.
16. The method of claim 14, wherein the method further comprises:
the basic input/output system performs fault processing on the faulty row.
17. The method of claim 16, wherein the step of the basic input/output system performing fault processing on the faulty row comprises:
the basic input/output system judges whether the faulty row has undergone memory page offline processing;
when the faulty row has undergone memory page offline processing, the basic input/output system replaces the faulty row with a spare memory row; and
when the faulty row has not undergone memory page offline processing, the basic input/output system determines the storage unit data of the faulty row, inverts the storage unit data, and rewrites the inverted storage unit data to the memory.
18. The method of claim 16, wherein the step of the basic input/output system performing fault processing on the faulty row further comprises:
the basic input/output system determines the storage unit addresses associated with the faulty row; and
the basic input/output system performs memory page isolation on the memory pages corresponding to the storage unit addresses.
19. An electronic device, characterized by comprising a baseboard management controller, a basic input/output system, an operating system, and a memory which are connected via the same bus, wherein:
the baseboard management controller is configured to acquire current memory error information during patrol inspection by a memory controller of the memory;
the baseboard management controller is configured to determine an error type according to the current memory error information;
the operating system is configured to generate a fault processing instruction according to the error type; and
the basic input/output system is configured to execute the fault processing instruction to complete memory fault repair.
20. A computer-readable storage medium, wherein a computer program is stored on the computer-readable storage medium, and the computer program, when executed by a processor, implements the steps of the memory fault processing method according to any one of claims 1 to 18.
21. A computer program product comprising a computer program which, when executed by a processor, implements the steps of the memory fault processing method according to any one of claims 1 to 18.
CN202410961857.3A 2024-07-17 2024-07-17 Memory fault processing method, electronic device, storage medium and program product Pending CN118503004A (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410961857.3A CN118503004A (en) 2024-07-17 2024-07-17 Memory fault processing method, electronic device, storage medium and program product

Publications (1)

Publication Number Publication Date
CN118503004A true CN118503004A (en) 2024-08-16

Family

ID=92239218

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410961857.3A Pending CN118503004A (en) 2024-07-17 2024-07-17 Memory fault processing method, electronic device, storage medium and program product

Country Status (1)

Country Link
CN (1) CN118503004A (en)

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US20170364404A1 (en) * 2016-06-21 2017-12-21 EMC IP Holding Company LLC Fault processing method, system, and computer program product
CN115328684A (en) * 2022-06-30 2022-11-11 超聚变数字技术有限公司 Memory fault reporting method, BMC and electronic equipment
CN117909109A (en) * 2023-12-12 2024-04-19 超聚变数字技术有限公司 A memory error information processing method and computing device

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination