CN109343986B - Method and computer system for handling memory failures - Google Patents
- Publication number
- CN109343986B (application CN201810942648.9A)
- Authority
- CN
- China
- Prior art keywords
- storage space
- control unit
- address
- storage
- faulty memory
- Prior art date
- Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
- Active
Classifications
- G—PHYSICS
- G06—COMPUTING OR CALCULATING; COUNTING
- G06F—ELECTRIC DIGITAL DATA PROCESSING
- G06F11/00—Error detection; Error correction; Monitoring
- G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance
- G06F11/0703—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
- G06F11/0706—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment
- G06F11/073—Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation the processing taking place on a specific hardware platform or in a specific software environment in a memory management context, e.g. virtual memory or cache management
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Quality & Reliability (AREA)
- Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- General Physics & Mathematics (AREA)
- Hardware Redundancy (AREA)
- Memory System Of A Hierarchy Structure (AREA)
- Techniques For Improving Reliability Of Storages (AREA)
Abstract
The present application provides a method for handling memory failures, comprising: a first control unit determines the address of faulty memory in a first storage space, where the first storage space can store data that a second control unit backs up in the first control unit by direct memory access (DMA); the first control unit synchronizes the address of the faulty memory in the first storage space to the second control unit; the second control unit obtains the address of the faulty memory in the first storage space; and the second control unit stores cached data to the first storage space by DMA and isolates the faulty memory in the first storage space when storing the data, where the data is data stored by a processing unit through the second control unit. The method can reduce the probability of out-of-memory (OOM) conditions caused by memory failures during mirrored access.
Description
Technical field
The present application relates to the field of information technology, and more particularly to a method and a computer system for handling memory failures.
Background
Mirroring means that the data on one disk has an exact copy on another disk; it is a type of redundancy. On a storage control device, mirroring means that one controller copies the data it receives to another controller through a mirroring channel.
When mirrored access is performed by direct memory access (DMA), the physical address space of the memory is accessed directly, so the data in one controller is copied from that controller's physical address space directly to the identical address space in the other controller.
If the memory of the other controller is faulty, writing data to the faulty memory in that controller causes an out-of-memory (OOM) condition.
Summary of the invention
The present application provides a method for handling memory failures that can reduce the probability of OOM caused by memory failures during mirrored access.
In a first aspect, a method for handling memory failures is provided, applied to a computer system that includes a processing unit, a first control unit, and a second control unit, where the first control unit and the second control unit are both connected to the processing unit. The method includes: the first control unit determines the address of faulty memory in a first storage space, where the first storage space is storage space in the first control unit and can store data that the second control unit backs up in the first control unit by direct memory access (DMA); the first control unit synchronizes the address of the faulty memory in the first storage space to the second control unit; the second control unit obtains the address of the faulty memory in the first storage space; and the second control unit stores the data cached in the second control unit to the first storage space by DMA and isolates the faulty memory in the first storage space when storing the data, where the data is data stored by the processing unit through the second control unit.
Optionally, the second control unit storing the data cached in the second control unit to the first storage space by DMA, and isolating the faulty memory in the first storage space when storing the data, includes: the second control unit sends a second write-data instruction to a second DMA controller (DMAC) in the second control unit, where the second write-data instruction instructs the second DMAC to store the data cached in the second control unit to the first storage space, and the storage space indicated by the destination address carried in the second write-data instruction is storage space in the first storage space other than the faulty memory.
The local controller (for example, the first control unit) performs fault detection on the DMA-capable storage space before or after the OS starts and notifies the peer controller (for example, the second control unit) of the address of the faulty memory in that storage space. When the peer controller backs up data in that storage space, it isolates the faulty memory, that is, it backs up data only in storage space of the local controller where no memory failure has occurred, which reduces the probability of OOM.
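The destination-address constraint on the second write-data instruction can be illustrated with a minimal Python sketch. It is not taken from the patent: the 4 KiB page-frame size, the `WriteDataInstruction` type, and the helper names are all assumptions made for illustration, and the faulty memory is assumed to be tracked as a set of page-frame first addresses.

```python
from dataclasses import dataclass

PAGE_SIZE = 4096  # assumed page-frame size; the source does not specify one


@dataclass(frozen=True)
class WriteDataInstruction:
    """Hypothetical model of the second write-data instruction."""
    src_addr: int   # address of the cached data in the second control unit
    dest_addr: int  # destination address in the first storage space
    length: int     # number of bytes to copy


def frames_covered(addr, length):
    """First addresses of every page frame overlapped by [addr, addr + length)."""
    first = addr - addr % PAGE_SIZE
    last_byte = addr + length - 1
    last = last_byte - last_byte % PAGE_SIZE
    return range(first, last + PAGE_SIZE, PAGE_SIZE)


def build_write_instruction(src_addr, dest_addr, length, faulty_frames):
    """Build the instruction only if the destination avoids all faulty frames."""
    if length <= 0:
        raise ValueError("length must be positive")
    hit = [f for f in frames_covered(dest_addr, length) if f in faulty_frames]
    if hit:
        raise ValueError(f"destination overlaps faulty frames: {[hex(f) for f in hit]}")
    return WriteDataInstruction(src_addr, dest_addr, length)
```

In this sketch the check runs in software before the instruction reaches the DMAC; a real controller could equally enforce the constraint when it allocates destination buffers.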
With reference to the first aspect, in some implementations of the first aspect, before the second control unit stores the data cached in the second control unit to the first storage space by DMA, the method further includes: the second control unit migrates the data cached in the address space corresponding to the address of the faulty memory in the first storage space to other storage space in a second storage space, where the second storage space is storage space in the second control unit.
Before the second control unit stores the data cached in the second control unit to the first storage space by DMA, it migrates the data cached in the address space corresponding to the address of the faulty memory in the first storage space to other storage space in the second storage space. As a result, the second control unit can isolate the faulty memory in the first storage space when it stores the data there, which reduces the probability of OOM.
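The migration step can be sketched as follows. This is only an illustration under stated assumptions: the second storage space is modeled as a dictionary from page-frame first address to cached bytes, and `migrate_away_from_faulty` and its parameters are hypothetical names. Because DMA mirroring copies data to the identical address in the peer, any data cached at an address that is faulty in the first storage space is moved to a frame that is neither in use nor faulty there.

```python
def migrate_away_from_faulty(cache, faulty_frames, free_frames):
    """Move cached data off addresses that are faulty in the peer's mirror space.

    cache:         dict mapping frame address -> bytes (second storage space)
    faulty_frames: set of frame addresses reported faulty in the first storage space
    free_frames:   iterable of candidate frame addresses in the second storage space
    Returns a dict mapping each old frame address to its new frame address.
    """
    # A target must be unused locally and must not itself be faulty in the peer.
    candidates = [f for f in free_frames if f not in faulty_frames and f not in cache]
    moves = {}
    for frame in sorted(faulty_frames):
        if frame not in cache:
            continue  # nothing is cached at this address, so nothing to move
        if not candidates:
            raise MemoryError("no healthy frame available for migration")
        target = candidates.pop(0)
        cache[target] = cache.pop(frame)
        moves[frame] = target
    return moves
```

After migration, a mirror copy to identical addresses no longer touches the frames listed in `faulty_frames`.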
With reference to the first aspect, in some implementations of the first aspect, the first control unit synchronizing the address of the faulty memory in the first storage space to the second control unit includes: the first control unit counts the number of addresses of faulty memory in the first storage space; and if the number is less than or equal to a preset threshold, the first control unit synchronizes the address of the faulty memory in the first storage space to the second control unit.
Before the address of the faulty memory in the first storage space is synchronized to the peer controller, the number of such addresses is counted. The addresses are synchronized to the peer controller only when the number is less than the preset threshold; otherwise, storage space for DMA is reallocated within the local controller. This avoids the waste of resources that would result from still synchronizing the faulty memory addresses to the peer controller when the failures in the first storage space are severe.
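A minimal sketch of the threshold decision follows. The function name and the string return values are assumptions; the comparison uses "less than or equal to", following the wording of the implementation paragraph, while the explanatory paragraph says "less than".

```python
def decide_fault_handling(faulty_addrs, threshold):
    """Decide what the local controller does with its list of faulty addresses.

    Returns "sync" when the count is within the preset threshold, meaning the
    addresses are worth sending to the peer controller, and "reallocate" when
    the damage is severe enough that a new DMA storage space should be
    allocated locally instead.
    """
    count = len(set(faulty_addrs))  # count distinct faulty addresses
    return "sync" if count <= threshold else "reallocate"
```

For example, with a threshold of 2, two distinct faulty addresses give "sync" and three give "reallocate".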
With reference to the first aspect, in some implementations of the first aspect, the first control unit includes a first processor and a first baseboard management controller (BMC), the second control unit includes a second BMC, and synchronizing the address of the faulty memory in the first storage space to the second control unit includes: the first processor writes the address of the faulty memory in the first storage space into pre-allocated storage space in the first control unit; the first BMC obtains the address of the faulty memory in the first storage space from the pre-allocated storage space in the first control unit; and the first BMC sends a first message to the second BMC, where the first message carries the address of the faulty memory in the first storage space.
With reference to the first aspect, in some implementations of the first aspect, the first control unit further includes a first platform controller hub (PCH) and a first complex programmable logic device (CPLD), and the pre-allocated storage space in the first control unit is storage space in the first CPLD. The first processor writing the address of the faulty memory in the first storage space into the pre-allocated storage space in the first control unit includes: the first processor writes the address of the faulty memory in the first storage space into the pre-allocated storage space in the first CPLD through the first PCH. The first BMC obtaining the address of the faulty memory in the first storage space from the pre-allocated storage space in the first control unit includes: the first BMC obtains the address of the faulty memory in the first storage space from the pre-allocated storage space in the first CPLD.
With reference to the first aspect, in some implementations of the first aspect, the first control unit includes a first processor and a first direct memory access controller (DMAC), the second control unit includes a second processor, and synchronizing the address of the faulty memory in the first storage space to the second control unit includes: the first processor sends a first write-data instruction to the first DMAC, where the first write-data instruction carries the storage address, in a first storage subspace of a second storage space, of the address of the faulty memory in the first storage space; the second storage space is storage space in the second control unit; the first storage subspace is used to store the address of the faulty memory in the first storage space; and the second storage space further includes a second storage subspace used to store data that the first control unit backs up in the second control unit by DMA; the first DMAC, according to the first write-data instruction, stores the address of the faulty memory in the first storage space to the storage space in the first storage subspace that corresponds to the storage address; and the first processor sends a second message to the second processor, where the second message carries the storage address, in the first storage subspace, of the address of the faulty memory in the first storage space.
Synchronizing the address of the faulty memory in the first storage space to the peer controller by DMA speeds up fault synchronization, so the peer controller can promptly learn the address of the faulty memory in the local controller and promptly isolate the faulty memory when it accesses data in the local controller's storage space. This avoids the case where late fault synchronization lets the peer controller access faulty storage space in the local controller and thereby causes OOM.
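The DMA-based synchronization path can be simulated in a few lines of Python. Everything concrete here is an assumption for illustration: the peer's second storage space is a `bytearray`, the fault list is laid out as a 4-byte little-endian count followed by 8-byte addresses, the second message is a plain dictionary, and `dma_write` stands in for the first DMAC.

```python
import struct


def encode_fault_list(addrs):
    """Assumed layout: uint32 count, then one uint64 per faulty address."""
    return struct.pack(f"<I{len(addrs)}Q", len(addrs), *addrs)


def decode_fault_list(memory, offset):
    """Read a fault list written with encode_fault_list at the given offset."""
    (count,) = struct.unpack_from("<I", memory, offset)
    return list(struct.unpack_from(f"<{count}Q", memory, offset + 4))


def dma_write(peer_memory, storage_addr, payload):
    """Stand-in for the first DMAC: copy payload into the peer's memory."""
    end = storage_addr + len(payload)
    if storage_addr < 0 or end > len(peer_memory):
        raise ValueError("write falls outside the first storage subspace")
    peer_memory[storage_addr:end] = payload


def sync_faults_by_dma(peer_memory, storage_addr, faulty_addrs):
    """First processor side: write the list by DMA, then build the second message."""
    dma_write(peer_memory, storage_addr, encode_fault_list(faulty_addrs))
    return {"storage_addr": storage_addr}  # second message carries only the location


def read_synced_faults(peer_memory, second_message):
    """Second processor side: fetch the list from the location named in the message."""
    return decode_fault_list(peer_memory, second_message["storage_addr"])
```

The point of the design, as the paragraph above describes it, is that the message itself stays small: it names where the list was written rather than carrying the list.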
With reference to the first aspect, in some implementations of the first aspect, the second control unit includes a second processor and a second BMC, and the second control unit obtaining the address of the faulty memory in the first storage space includes: the second BMC receives the first message sent by the first BMC, where the first message carries the address of the faulty memory in the first storage space; the second BMC parses the first message to obtain the address of the faulty memory in the first storage space; the second BMC writes the address of the faulty memory in the first storage space into pre-allocated storage space in the second control unit; and the second processor obtains the address of the faulty memory in the first storage space from the pre-allocated storage space in the second control unit.
With reference to the first aspect, in some implementations of the first aspect, the second control unit further includes a second PCH and a second CPLD, and the pre-allocated storage space in the second control unit is storage space in the second CPLD. The second BMC writing the address of the faulty memory in the first storage space into the pre-allocated storage space in the second control unit includes: the second BMC writes the address of the faulty memory in the first storage space into the pre-allocated storage space in the second CPLD. The second processor obtaining the address of the faulty memory in the first storage space from the pre-allocated storage space in the second control unit includes: the second processor obtains, through the second PCH, the address of the faulty memory in the first storage space from the pre-allocated storage space in the second CPLD.
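The BMC-based path, from the first processor to the second processor, can be sketched end to end. This is a simulation only: the `Mailbox` class stands in for the pre-allocated storage space in each CPLD, and the JSON message format is an assumption chosen for readability, since the source does not describe the layout of the first message or the transport between the two BMCs.

```python
import json


class Mailbox:
    """Stand-in for the pre-allocated storage space in a CPLD."""

    def __init__(self):
        self._addrs = []

    def write(self, addrs):
        self._addrs = list(addrs)

    def read(self):
        return list(self._addrs)


def first_bmc_build_message(first_mailbox):
    """First BMC: read the addresses from the first CPLD and build the first message."""
    body = {"type": "fault_addrs", "addrs": first_mailbox.read()}
    return json.dumps(body).encode("utf-8")


def second_bmc_handle_message(raw, second_mailbox):
    """Second BMC: parse the first message and store the addresses in the second CPLD."""
    body = json.loads(raw.decode("utf-8"))
    if body.get("type") != "fault_addrs":
        raise ValueError("unexpected message type")
    second_mailbox.write(body["addrs"])


# Usage: the first processor writes (through the first PCH in the source),
# and the second processor reads (through the second PCH).
first_cpld, second_cpld = Mailbox(), Mailbox()
first_cpld.write([0x2000, 0x5000])
second_bmc_handle_message(first_bmc_build_message(first_cpld), second_cpld)
synced = second_cpld.read()
```

Here `synced` holds the same address list the first processor wrote, which is the property the two implementation paragraphs above rely on.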
With reference to the first aspect, in some implementations of the first aspect, the second control unit includes a second processor, and the second control unit obtaining the address of the faulty memory in the first storage space includes: the second processor receives the second message sent by the first processor, where the second message carries the storage address, in the first storage subspace, of the address of the faulty memory in the first storage space; and the second processor, according to the second message, obtains the address of the faulty memory in the first storage space from the storage space in the first storage subspace that corresponds to the storage address.
With reference to the first aspect, in some implementations of the first aspect, the first control unit determining the address of the faulty memory in the first storage space includes: the first processor performs correctable error (ECC) detection on the first storage space; and the first processor determines the address of the faulty memory in the first storage space according to the result of the ECC detection on the first storage space.
With reference to the first aspect, in some implementations of the first aspect, the address of the faulty memory in the first storage space includes the first address of the page frame in which the faulty memory in the first storage space is located.
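Reducing error addresses to page-frame first addresses is simple arithmetic. The sketch below assumes a 4 KiB page frame (the source does not state a size) and hypothetical function names; the input is assumed to be a list of physical addresses flagged by ECC detection.

```python
PAGE_SIZE = 4096  # assumed page-frame size; must be a power of two for the mask


def page_frame_base(addr):
    """First address of the page frame that contains addr."""
    return addr & ~(PAGE_SIZE - 1)


def faulty_page_frames(error_addrs):
    """Collapse error addresses into a sorted list of distinct frame first addresses."""
    return sorted({page_frame_base(a) for a in error_addrs})
```

For example, errors at 0x2010 and 0x2FF8 both fall in the frame starting at 0x2000, so only one address needs to be synchronized for that frame.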
With reference to the first aspect, in some implementations of the first aspect, the computer system further includes a storage unit, the first control unit and the second control unit are both connected to the storage unit, and the method further includes: the processing unit stores the data to the storage unit through the second control unit.
In a second aspect, an apparatus for handling memory failures is provided, configured in a computer system that further includes a processing unit, where the apparatus is connected to the processing unit and includes a first control unit and a second control unit. The first control unit is configured to perform the operation steps of the method performed by the first control unit in the first aspect or any possible implementation of the first aspect, and the second control unit is configured to perform the operation steps of the method performed by the second control unit in the first aspect or any possible implementation of the first aspect.
In a third aspect, a computer system is provided that includes a processing unit and a storage control unit connected to the processing unit. The storage control unit includes: a first control unit, configured to determine the address of faulty memory in a first storage space, where the first storage space is storage space in the first control unit and can store data that a second control unit backs up in the first control unit by direct memory access (DMA), the first control unit being further configured to synchronize the address of the faulty memory in the first storage space to the second control unit; and the second control unit, configured to obtain the address of the faulty memory in the first storage space, and further configured to store the data cached in the second control unit to the first storage space by DMA and to isolate the faulty memory in the first storage space when storing the data, where the data is data stored by the processing unit through the second control unit.
In the computer system provided by the present application, the local controller (for example, the first control unit) performs fault detection on the DMA-capable storage space before or after the OS starts and notifies the peer controller (for example, the second control unit) of the address of the faulty memory in that storage space. When the peer controller backs up data in that storage space, it isolates the faulty memory, that is, it backs up data only in storage space of the local controller where no memory failure has occurred, which reduces the probability of OOM.
With reference to the third aspect, in some implementations of the third aspect, the second control unit is further configured to, before storing the data cached in the second control unit to the first storage space by DMA, migrate the data cached in the address space corresponding to the address of the faulty memory in the first storage space to other storage space in a second storage space, where the second storage space is storage space in the second control unit.
Before the second control unit stores the data cached in the second control unit to the first storage space by DMA, it migrates the data cached in the address space corresponding to the address of the faulty memory in the first storage space to other storage space in the second storage space. As a result, the second control unit can isolate the faulty memory in the first storage space when it stores the data there, which reduces the probability of OOM.
With reference to the third aspect, in some implementations of the third aspect, the first control unit is further configured to count the number of addresses of faulty memory in the first storage space and, if the number is less than or equal to a preset threshold, to synchronize the address of the faulty memory in the first storage space to the second control unit.
Before the address of the faulty memory in the first storage space is synchronized to the peer controller, the number of such addresses is counted. The addresses are synchronized to the peer controller only when the number is less than the preset threshold; otherwise, storage space for DMA is reallocated within the local controller. This avoids the waste of resources that would result from still synchronizing the faulty memory addresses to the peer controller when the failures in the first storage space are severe.
With reference to the third aspect, in some implementations of the third aspect, the first control unit includes a first processor and a first baseboard management controller (BMC), and the second control unit includes a second BMC. The first processor is configured to write the address of the faulty memory in the first storage space into pre-allocated storage space in the first control unit; the first BMC is configured to obtain the address of the faulty memory in the first storage space from the pre-allocated storage space in the first control unit; and the first BMC is further configured to send a first message to the second BMC, where the first message carries the address of the faulty memory in the first storage space.
With reference to the third aspect, in some implementations of the third aspect, the first control unit further includes a first platform controller hub (PCH) and a first complex programmable logic device (CPLD), and the pre-allocated storage space in the first control unit is storage space in the first CPLD. The first processor is further configured to write the address of the faulty memory in the first storage space into the pre-allocated storage space in the first CPLD through the first PCH, and the first BMC is further configured to obtain the address of the faulty memory in the first storage space from the pre-allocated storage space in the first CPLD.
With reference to the third aspect, in some implementations of the third aspect, the first control unit includes a first processor and a first direct memory access controller (DMAC), and the second control unit includes a second processor. The first processor is further configured to send a first write-data instruction to the first DMAC, where the first write-data instruction carries the storage address, in a first storage subspace of a second storage space, of the address of the faulty memory in the first storage space; the second storage space is storage space in the second control unit; the first storage subspace is used to store the address of the faulty memory in the first storage space; and the second storage space further includes a second storage subspace used to store data that the first control unit backs up in the second control unit by DMA. The first DMAC is configured to store, according to the first write-data instruction, the address of the faulty memory in the first storage space to the storage space in the first storage subspace that corresponds to the storage address. The first processor is further configured to send a second message to the second processor, where the second message carries the storage address, in the first storage subspace, of the address of the faulty memory in the first storage space.
Synchronizing the address of the faulty memory in the first storage space to the peer controller by DMA speeds up fault synchronization, so the peer controller can promptly learn the address of the faulty memory in the local controller and promptly isolate the faulty memory when it accesses data in the local controller's storage space. This avoids the case where late fault synchronization lets the peer controller access faulty storage space in the local controller and thereby causes OOM.
With reference to the third aspect, in some implementations of the third aspect, the second control unit includes a second processor and a second BMC. The second BMC is configured to receive the first message sent by the first BMC, where the first message carries the address of the faulty memory in the first storage space; the second BMC is further configured to parse the first message to obtain the address of the faulty memory in the first storage space, and to write that address into pre-allocated storage space in the second control unit; and the second processor is configured to obtain the address of the faulty memory in the first storage space from the pre-allocated storage space in the second control unit.
With reference to the third aspect, in some implementations of the third aspect, the second control unit further includes a second PCH and a second CPLD, and the pre-allocated storage space in the second control unit is storage space in the second CPLD. The second BMC is further configured to write the address of the faulty memory in the first storage space into the pre-allocated storage space in the second CPLD, and the second processor is further configured to obtain, through the second PCH, the address of the faulty memory in the first storage space from the pre-allocated storage space in the second CPLD.
With reference to the third aspect, in some implementations of the third aspect, the second control unit includes a second processor. The second processor is further configured to receive the second message sent by the first processor, where the second message carries the storage address, in the first storage subspace, of the address of the faulty memory in the first storage space; and the second processor is further configured to obtain, according to the second message, the address of the faulty memory in the first storage space from the storage space in the first storage subspace that corresponds to the storage address.
结合第三方面,在第三方面的某些实现方式中,该第一存储空间中存在故障的内存的地址包括该第一存储空间存在故障的内存所在的页帧的首地址。With reference to the third aspect, in some implementations of the third aspect, the address of the faulty memory in the first storage space includes the first address of the page frame where the faulty memory in the first storage space is located.
结合第三方面,在第三方面的某些实现方式中,该第一处理器,还用于对第一存储空间进行可纠正错误ECC检测;根据对该第一存储空间进行ECC检测的检测结果,确定该第一存储空间中存在故障的内存的地址。With reference to the third aspect, in some implementations of the third aspect, the first processor is further configured to perform correctable error ECC detection on the first storage space; according to a detection result of performing ECC detection on the first storage space , and determine the address of the faulty memory in the first storage space.
结合第三方面,在第三方面的某些实现方式中,第二控制单元,还用于向该第二控制单元中的第二DMAC发送第二写数据指令,该第二写数据指令用于指示该第二DMAC将缓存在该第二控制单元中的该数据存储至该第一存储空间中,该第二写数据指令携带的目的地址所指示的存储空间为该第一存储空间中存在故障的内存之外的存储空间。With reference to the third aspect, in some implementations of the third aspect, the second control unit is further configured to send a second write data command to the second DMAC in the second control unit, where the second write data command is used for Instruct the second DMAC to store the data buffered in the second control unit into the first storage space, and the storage space indicated by the destination address carried by the second write data instruction is faulty in the first storage space storage space outside of memory.
结合第三方面,在第三方面的某些实现方式中,计算机系统还包括存储单元,第一控制单元与第二控制单元均与该存储单元连接,处理单元,还用于通过第二控制单元将该数据存储至存储单元。With reference to the third aspect, in some implementations of the third aspect, the computer system further includes a storage unit, both the first control unit and the second control unit are connected to the storage unit, the processing unit is further configured to pass the second control unit This data is stored in the storage unit.
It should be noted that, in this computer system, the processing unit, the storage control unit, and the storage unit may be implemented by the same computing device. Alternatively, they may be implemented by three independent devices: for example, the processing unit by an independent server, the storage control unit by an independent storage control device, and the storage unit by an independent storage device, with the server, the storage control device, and the storage device connected through a network. Alternatively, the processing unit and the storage control unit may be implemented by one independent computing device and the storage unit by another. Alternatively, the storage control unit and the storage unit may be implemented by one independent computing device and the processing unit by another.
According to a fourth aspect, a computer-readable storage medium is provided. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to perform the method in the first aspect or in any possible implementation of the first aspect.
According to a fifth aspect, a computer program product containing instructions is provided. When the instructions are run on a computer, they cause the computer to perform the method in the first aspect or in any possible implementation of the first aspect.
Description of the drawings
FIG. 1 is a schematic block diagram of a computer system provided by the present application.
FIG. 2 is another schematic block diagram of the computer system provided by the present application.
FIG. 3 is another schematic block diagram of the computer system provided by the present application.
FIG. 4 is a schematic flowchart of a DMA transfer process provided by the present application.
FIG. 5 is a schematic interaction flowchart of the method for handling memory faults provided by the present application.
FIG. 6 is a schematic diagram of how the address of the faulty memory flows through the computer system when that address is synchronized, as provided by the present application.
FIG. 7 is another schematic block diagram of the computer system provided by the present application.
Detailed description
The technical solutions in the present application are described below with reference to the accompanying drawings.
First, the computer system 100 is introduced with reference to FIG. 1.
As shown in FIG. 1, the computer system 100 includes a processing unit 101, a storage control unit 102, and a storage unit 103.
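A first control unit 1021 and a second control unit 1022 are deployed inside the storage control unit 102. A central processing unit (CPU) 10211 and a memory 10212 are deployed inside the first control unit, and a CPU 10221 and a memory 10222 are deployed inside the second control unit.
The memory 10212 and the memory 10222 may be third-generation double data rate synchronous dynamic random access memory (double data rate SDRAM, DDR SDRAM).
The first control unit 1021 communicates with the processing unit 101 through a front-end interface 10213 (for example, an interface between the processor 10211 and the processing unit 101). The second control unit 1022 communicates with the processing unit 101 through a front-end interface 10223 (for example, an interface between the processor 10221 and the processing unit 101). The first control unit 1021 and the second control unit 1022 communicate with each other through a mirror interface 10214 and a mirror interface 10224. The first control unit 1021 communicates with the storage unit 103 through a back-end interface 10215 (for example, an interface between the processor 10211 and the storage unit 103), and the second control unit 1022 communicates with the storage unit 103 through a back-end interface 10225 (for example, an interface between the processor 10221 and the storage unit 103).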
By way of example and not limitation, in the computer system 100, the processing unit 101, the storage control unit 102, and the storage unit 103 may be implemented by the same computing device. Alternatively, they may be implemented by three independent devices: for example, the processing unit 101 by an independent server, the storage control unit 102 by an independent storage control device, and the storage unit 103 by an independent storage device, with the server, the storage control device, and the storage device connected through a network. Alternatively, the processing unit 101 and the storage control unit 102 may be implemented by one independent computing device and the storage unit 103 by another. Alternatively, the storage control unit 102 and the storage unit 103 may be implemented by one independent computing device and the processing unit 101 by another.
In a specific implementation, the processing unit 101 may be a processor or a computer device that contains a processor, and the storage unit 103 may be a solid state drive (SSD) or a storage device that contains an SSD.
The general method of mirrored storage is introduced below, taking as an example the case in which the processing unit 101 is located in a server, the storage control unit 102 is located in a storage control device, and the storage unit 103 is located in a storage device.
When the processing unit 101 is located in a server, the storage control unit 102 in a storage control device, and the storage unit 103 in a storage device, the corresponding schematic block diagram of the computer system 100 is as shown in FIG. 2.
The processor 10211 in the first control unit 1021 stores the data received from the processing unit 101 into the memory 10212, and then stores the data into the memory 10222 through the processor 10221 in the second control unit 1022. The data is stored in exactly the same address space in the memory 10212 and the memory 10222. In other words, the data is backed up in the second control unit.
The data is stored in two identical address spaces so that, if one control unit fails and the data stored in it is lost, an identical copy of the data is still held in the other control unit.
Finally, the processor 10211 saves the data cached in the memory 10212 into the storage unit 103; or, the processor 10221 saves the data cached in the memory 10222, which is identical to the data in the memory 10212, into the storage unit 103.
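As a rough illustration of the mirroring described above, the following Python sketch writes the same data to the same offset in two buffers that stand in for the memory 10212 and the memory 10222. The function name `mirrored_write` and the use of `bytearray` as a model of controller memory are assumptions made for illustration only; they do not come from the application.

```python
def mirrored_write(local_mem: bytearray, peer_mem: bytearray,
                   offset: int, data: bytes) -> None:
    """Store `data` at the same offset in both buffers (hypothetical model)."""
    end = offset + len(data)
    if offset < 0 or end > len(local_mem) or end > len(peer_mem):
        raise ValueError("write does not fit in both memories")
    # Cache the data in the local controller's memory first.
    local_mem[offset:end] = data
    # Then back it up at the identical address in the peer's memory.
    peer_mem[offset:end] = data


local = bytearray(8)
peer = bytearray(8)
mirrored_write(local, peer, 2, b"ab")
```

After the call, both buffers hold the same bytes at offsets 2 and 3, so either copy can later be flushed to the storage unit.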
Direct memory access (DMA) is introduced below.
DMA is a hardware mechanism that allows input/output (I/O) data to be transferred directly between peripheral devices and main memory without the involvement of the processor. Using this mechanism can greatly increase the throughput of communication with devices.
For example, when the DMA mechanism is used for data access, a direct memory access controller (DMAC) 10216 and a DMAC 10226 need to be added to the first control unit 1021 and the second control unit 1022 respectively, as shown in FIG. 3.
In this case, the processor 10211 in the first control unit 1021 no longer needs to store the data into the memory 10222 through the processor 10221 in the second control unit 1022. Instead, the DMAC 10216 in the first control unit 1021 directly accesses the memory 10222 in the second control unit 1022 and stores the data into it.
A complete DMA transfer mainly includes the following steps. FIG. 4 shows a schematic flowchart of the DMA transfer process, and each step is described below.
(1) DMA request
The processor initializes the DMAC and issues an operation command to the I/O interface, and the I/O interface raises a DMA request to the DMAC. After receiving the request, the DMAC sends a request to the processor by applying the request signal to the hold (HOLD) request input of the processor.
(2) DMA response
After receiving the request, the processor responds to the DMAC by applying its response signal to the hold acknowledge (HLDA) response terminal of the DMAC. At the same time, it presets in the DMAC the start address of the storage space to be accessed, the number of data items to exchange, and the read or write command, and it gives up control of the system bus. The DMAC then obtains control of the system bus.
(3) DMA transfer
After the DMAC obtains control of the system bus, the processor immediately suspends or performs only internal operations, and the DMAC carries out the data transfer.
(4) DMA end
When the specified batch data transfer is complete, the DMAC releases control of the bus and sends an end signal to the I/O interface.
It can be seen that, in the DMA transfer mode, the processor does not need to control the transfer directly, which greatly improves processor efficiency.
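The four phases above can be sketched as a small software model. This is only an illustrative simulation under stated assumptions: the function `dma_transfer`, the event names, and the `bytearray` that stands in for memory are hypothetical, and real DMA is performed by hardware signals rather than function calls.

```python
def dma_transfer(memory: bytearray, src: int, dst: int, count: int):
    """Model the request, response, transfer and end phases of one DMA copy."""
    if min(src, dst) < 0 or max(src, dst) + count > len(memory):
        raise ValueError("transfer is out of bounds")
    events = []
    bus_owner = "processor"

    # (1) DMA request: the DMAC raises HOLD toward the processor.
    events.append("HOLD")

    # (2) DMA response: the processor answers with HLDA, presets the
    # start address, count and command, and gives up the system bus.
    events.append("HLDA")
    bus_owner = "dmac"

    # (3) DMA transfer: the DMAC moves the data while it owns the bus.
    memory[dst:dst + count] = memory[src:src + count]
    events.append("TRANSFER")

    # (4) DMA end: the DMAC releases the bus and signals completion.
    bus_owner = "processor"
    events.append("END")
    return events, bus_owner


mem = bytearray(b"abcdefgh")
events, owner = dma_transfer(mem, src=0, dst=4, count=3)
```

In this model the processor only appears in phases (1) and (2); the copy itself happens while the bus belongs to the DMAC, which is the point the text makes about processor efficiency.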
However, for the computer system shown in FIG. 3, when mirrored storage is performed by DMA transfer, the data must be stored in exactly the same physical address space in the memory 10212 and the memory 10222. For example, the DMAC 10216 copies the data directly from a physical address space in the memory 10212 to the identical physical address space in the memory 10222.
If the memory space in the memory 10222 is faulty, the DMAC 10216 is not aware of the fault when it copies data directly from the physical address space of the memory 10212 to the identical physical address space in the memory 10222. When the data from the memory 10212 is written into the faulty memory space in the memory 10222, memory overflow (out of memory, OOM) results.
Memory overflow usually refers to a situation in which the memory is insufficient to store the data that needs to be stored. For example, when large software or a game is run, the memory it requires may far exceed the memory installed in the host; this is called memory overflow. In that case, the software or game cannot run, and the system reports a memory overflow.
To address this problem, an embodiment of the present invention proposes a method for handling memory faults. The local controller (for example, the first control unit) performs fault detection, before or after the OS starts, on the storage space that supports DMA, and notifies the peer controller (for example, the second control unit) of the address of the faulty memory in that storage space. When the peer controller backs up data in that storage space, it can then isolate the faulty memory, which reduces the probability of OOM.
The method 200 for handling memory faults provided by an embodiment of the present invention is introduced below. FIG. 5 shows a schematic interaction flowchart of the method. The method 200 may be applied to the computer system 100.
201. The first control unit determines the address of the faulty memory in a first storage space. The first storage space is a storage space in the first control unit and can store data that the second control unit backs up in the first control unit by direct memory access (DMA).
Specifically, the first control unit (for example, the first control unit 1021) determines the address of the faulty memory in the storage space (for example, the first storage space) of the memory 10212 that is configured to store data backed up in the first control unit 1021 by the second control unit (for example, the second control unit 1022) by DMA.
By way of example and not limitation, the first control unit determining the address of the faulty memory in the first storage space includes: the first processor performs error-correcting code (ECC) detection on the first storage space; and the first processor determines the address of the faulty memory in the first storage space according to the result of the ECC detection on the first storage space.
Specifically, the processor 10211 (for example, the first processor) in the first control unit 1021 performs ECC detection on the first storage space in the memory 10212 when the computer system is powered on, that is, while the basic input output system (BIOS) is starting. In other words, the ECC detection takes place before the operating system (OS) starts. The address of the faulty memory in the first storage space is determined according to the result of the ECC detection on the first storage space.
In addition, the processor 10211 may also check the first storage space after the OS has started; that is, the ECC detection may also take place after the OS starts.
The processor 10211 may check the first storage space with the byte as the smallest unit, or with a certain number of bits as the smallest unit.
For example, the processor 10211 performs ECC detection on the first storage space in units of bytes and, as a result, determines that the faulty memory in the first storage space is located at the bytes with index numbers 4, 7, and 9.
When the first storage space is managed by paging, paged ECC detection may be performed on it. Paging management means treating a number of bytes in the first storage space as one page; for example, each page contains 4 kilobytes (KB). The first storage space then becomes a sequence of contiguous pages, that is, a page array. Each page of physical memory is called a page frame. The first storage space is numbered with the page as the smallest unit, and this number can serve as the index into the page array, also called the page frame number.
That is, when the first storage space is managed by paging, ECC detection may be performed on the first storage space in units of pages.
For example, the processor 10211 performs ECC detection on the first storage space in units of pages and, as a result, determines that the faulty memory in the first storage space is located in the pages with page frame numbers 3, 8, 10, and 11.
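The relationship between a byte address, its page frame number, and the start address of its page can be sketched as follows. The sketch assumes 4 KB pages and addresses expressed as byte offsets from the beginning of the first storage space; the function names and the sample offsets are hypothetical.

```python
PAGE_SIZE = 4 * 1024  # assumed page size: 4 KB


def page_frame_number(offset: int) -> int:
    """Index of the page that contains the given byte offset."""
    return offset // PAGE_SIZE


def page_start_address(pfn: int) -> int:
    """Byte offset of the first byte of page frame `pfn`."""
    return pfn * PAGE_SIZE


def faulty_page_frames(faulty_offsets) -> list:
    """Distinct page frame numbers that contain at least one faulty byte."""
    return sorted({page_frame_number(o) for o in faulty_offsets})


# Hypothetical faulty byte offsets; two of them fall in the same page.
offsets = [12288, 12300, 32768, 40960, 45061]
frames = faulty_page_frames(offsets)
starts = [page_start_address(f) for f in frames]
```

With these sample offsets, `frames` is `[3, 8, 10, 11]`, so a page-granular check reports four faulty page frames even though five faulty bytes were found.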
It should be noted that determining the location of the faulty memory in the first storage space by performing ECC detection on it is only an example and does not limit the embodiments of the present invention in any way. Solutions that determine the location of the faulty memory in the first storage space by other detection methods all fall within the protection scope of the embodiments of the present invention.
202. The first control unit synchronizes the address of the faulty memory in the first storage space to the second control unit.
Specifically, after determining the address of the faulty memory in the first storage space, the processor 10211 synchronizes that address to the processor 10221 (for example, the second processor) in the second control unit 1022.
For example, if the processor 10211 detects that the faulty memory in the first storage space is one contiguous segment of storage space, the processor 10211 may synchronize the start address and the end address of that segment to the processor 10221.
For another example, if the processor 10211 detects that the bytes with index numbers 6 through 8 in the first storage space are all faulty, the processor 10211 synchronizes to the processor 10221 the start address and the end address of the storage space corresponding to the bytes with index numbers 6 through 8.
For another example, if the processor 10211 detects that the faulty memory in the first storage space consists of several discrete bytes, the processor 10211 may synchronize to the processor 10221 the memory addresses corresponding to those bytes.
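One possible way to prepare the addresses for synchronization is to merge adjacent faulty bytes into (start, end) pairs, so that a contiguous segment is reported by its two end points and an isolated byte is reported as a range of length one. The helper `coalesce` below is an illustrative assumption, not a procedure stated in the application.

```python
def coalesce(faulty_indices) -> list:
    """Merge faulty byte indices into inclusive (start, end) ranges."""
    ranges = []
    for index in sorted(set(faulty_indices)):
        if ranges and index == ranges[-1][1] + 1:
            # Extends the current contiguous segment by one byte.
            ranges[-1] = (ranges[-1][0], index)
        else:
            # Starts a new segment.
            ranges.append((index, index))
    return ranges


contiguous = coalesce([6, 7, 8])
discrete = coalesce([4, 7, 9])
```

Here `contiguous` is `[(6, 8)]`, matching the start and end address form, while `discrete` is `[(4, 4), (7, 7), (9, 9)]`, one entry per faulty byte.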
In addition, the address of the faulty memory in the first storage space that the processor 10211 synchronizes to the processor 10221 may also be the start address of the page frame in which the faulty memory in the first storage space is located.
Specifically, when the first storage space is managed by paging, the processor 10211 synchronizes to the processor 10221 the start address of each page in which detected faulty memory in the first storage space is located.
For example, the processor 10211 detects that the faulty memory in the first storage space is located in the pages with page frame numbers 1, 2, and 4, and synchronizes the start addresses of those pages to the processor 10221.
It should be noted that, when the ECC detection performed by the processor 10211 on the first storage space takes place during BIOS startup, the processor 10211 may, upon determining that memory in the first storage space is faulty, immediately synchronize the address of the faulty memory to the processor 10221.
In addition, before synchronizing the address of the faulty memory in the first storage space to the processor 10221, the processor 10211 may first count the number of faulty memory addresses in the first storage space and compare that number with a preset threshold:
When the number is less than or equal to the preset threshold, the processor 10211 synchronizes the address of the faulty memory in the first storage space to the processor 10221;
When the number is greater than the preset threshold, the processor 10211 reallocates, in the memory 10212, the storage space used for DMA.
For example, the processor 10211 checks the first storage space with the byte as the smallest unit, and the preset threshold is 10 bytes. After the check, the processor 10211 determines that the faulty memory addresses in the first storage space amount to 12 bytes. Because this number is greater than the preset threshold, the processor 10211 reallocates, in the memory 10212, the storage space that can store data backed up in the first control unit 1021 by the second control unit 1022 by DMA.
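The threshold comparison can be sketched as a small decision function. The name `plan_fault_handling` and the string labels it returns are hypothetical; the rule itself (synchronize when the count is at most the threshold, otherwise reallocate) follows the text above.

```python
def plan_fault_handling(faulty_addresses, threshold: int):
    """Decide between syncing the fault list and reallocating the DMA space."""
    distinct = sorted(set(faulty_addresses))
    if len(distinct) <= threshold:
        # Few faults: tell the peer controller which addresses to isolate.
        return "sync", distinct
    # Too many faults: allocate a different DMA region instead.
    return "reallocate", []


few = plan_fault_handling([4, 7, 9], threshold=10)
many = plan_fault_handling(range(100, 112), threshold=10)
```

With a threshold of 10, the three faulty bytes in `few` are synchronized, while the 12 faulty bytes in `many` trigger reallocation, as in the example above.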
It should be noted that, in the embodiments of the present invention, the storage space that the processor 10211 reallocates in the memory 10212 is usually fault-free storage space.
However, the reallocated storage space may develop faults during subsequent operation. The processor 10211 may therefore check the reallocated storage space and, when it detects faulty memory in the reallocated storage space, synchronize the address of that faulty memory to the processor 10221.
Before the address of the faulty memory in the first storage space is synchronized to the peer controller, the number of faulty memory addresses in the first storage space is counted. Only when that number is less than or equal to the preset threshold is the address of the faulty memory synchronized to the peer controller; otherwise, storage space that can store data backed up in the local controller by the peer controller by DMA is reallocated in the local controller. This avoids the waste of resources that would result from still synchronizing the faulty memory addresses to the peer controller when the memory faults in the first storage space are severe.
203. The second control unit obtains the address of the faulty memory in the first storage space.
Specifically, the processor 10221 obtains the address of the faulty memory in the first storage space that the processor 10211 has synchronized to it, and writes that address into an isolation address table. The isolation address table may be stored in the memory 10222.
For example, the address of the faulty memory in the first storage space obtained by the processor 10221 is the start address and the end address of a contiguous segment of storage space, namely the storage space corresponding to the bytes with index numbers 6 through 8. The processor 10221 writes the start address and the end address of this segment into the isolation address table; that is, the processor 10221 marks the faulty memory in the first storage space.
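A minimal sketch of an isolation address table is shown below, assuming it is kept as a list of inclusive (start, end) ranges. The class name `IsolationTable` and its methods are hypothetical; the application does not specify the table layout.

```python
class IsolationTable:
    """Records faulty address ranges reported by the peer controller."""

    def __init__(self):
        self._ranges = []  # inclusive (start, end) pairs

    def add_range(self, start: int, end: int) -> None:
        if start > end:
            raise ValueError("start must not exceed end")
        self._ranges.append((start, end))

    def is_isolated(self, address: int) -> bool:
        return any(start <= address <= end for start, end in self._ranges)

    def usable(self, capacity: int) -> list:
        """Addresses below `capacity` that are still available for DMA."""
        return [a for a in range(capacity) if not self.is_isolated(a)]


table = IsolationTable()
table.add_range(6, 8)  # bytes 6 through 8 were reported as faulty
```

Looking up an address before each DMA write then amounts to a call such as `table.is_isolated(7)`, which returns `True` for this table.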
205. The second control unit stores the data cached in the second control unit into the first storage space by DMA, and isolates the faulty memory in the first storage space when storing the data, where the data is data that the processing unit stores through the second control unit.
Optionally, the second control unit storing, by DMA, the data cached in the second control unit into storage space in the first storage space other than the faulty memory includes: the second control unit sends a second write data instruction to the second DMAC in the second control unit, where the second write data instruction instructs the second DMAC to store the data cached in the second control unit into the first storage space, and the storage space indicated by the destination address carried in the second write data instruction is storage space in the first storage space other than the faulty memory.
Specifically, if the processor 10221 needs to back up data in the memory 10212, the processor 10221 first consults the isolation address table and, according to the addresses of the faulty memory in the first storage space recorded in the table, determines the addresses of the memory in the first storage space that can be used for DMA.
The processor 10221 may send a write data access instruction (for example, the second write data instruction) to the DMAC 10226 (for example, the second DMAC). The destination address carried in this instruction does not include any address of the faulty memory in the first storage space; that is, the processor 10221 isolates the faulty memory in the first storage space.
After the DMAC 10226 obtains control of the bus, it writes the data carried in the second write data instruction, by DMA, to the storage location in the first storage space indicated by the destination address carried in the instruction.
For example, the processor 10221 determines that the faulty memory space in the first storage space is a contiguous segment of storage space, namely the storage space corresponding to the bytes with index numbers 6 through 8.
Assume that the processor 10221 caches the data received from the server in the storage space corresponding to the byte with index number 4 in the memory 10222. The processor 10221 may then send the second write data instruction to the DMAC 10226, and the destination address carried in the second write data instruction may correspond to the storage space of the byte with index number 4 in the first storage space.
After the DMAC 10226 obtains control of the bus, it writes the data by DMA, according to the destination address carried in the second write data instruction, into the storage space corresponding to the byte with index number 4 in the first storage space.
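The check that a write instruction never targets faulty memory can be sketched as follows. The function `build_write_command`, the dictionary that stands in for the second write data instruction, and the use of a plain set of faulty byte indices are all assumptions made for illustration.

```python
def build_write_command(faulty: set, destination: int, data: bytes) -> dict:
    """Return a write command only if its destination avoids faulty bytes."""
    span = range(destination, destination + len(data))
    blocked = [address for address in span if address in faulty]
    if blocked:
        raise ValueError(f"destination overlaps faulty addresses {blocked}")
    return {"destination": destination, "data": bytes(data)}


faulty_bytes = {6, 7, 8}  # reported by the peer controller
command = build_write_command(faulty_bytes, destination=4, data=b"x")
```

A one-byte write to index 4 passes the check because it does not touch bytes 6 through 8, whereas the same call with `destination=6` would raise `ValueError`.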
Optionally, before the second control unit stores the data cached in the second control unit into the first storage space by DMA, the method 200 further includes:
204. The second control unit migrates the data cached in the address space that corresponds to the address of the faulty memory in the first storage space to other storage space in a second storage space, where the second storage space is a storage space in the second control unit.
Specifically, before the processor 10221 stores the data cached in the second control unit into the first storage space by DMA, the processor 10221 also determines whether the storage address of the data in the memory 10222 is the same as the address of the faulty memory in the first storage space:
If the storage address of the data in the memory 10222 is entirely different from the address of the faulty memory in the first storage space, the processor 10221 may store the data into the storage space in the first storage space that has the same storage address as the data has in the second control unit;
The storage address of the data in the memory 10222 may instead correspond to the address of the faulty memory in the first storage space. Such correspondence covers the case in which some or all of the storage addresses of the data in the memory 10222 correspond to addresses of the faulty memory in the first storage space.
When some of the storage addresses of the data in the memory 10222 correspond to addresses of the faulty memory in the first storage space, the processor 10221 may migrate the data stored at those storage addresses to storage space in the second control unit whose storage addresses differ from the addresses of the faulty memory in the first storage space, for example, to other storage space in the second storage space in the second control unit. Alternatively, the processor 10221 may migrate the data stored at all of the storage addresses. This is not specifically limited in the embodiments of the present invention.
The cases in which some or all of the storage addresses of the data in the memory 10222 correspond to addresses of the faulty memory in the first storage space are illustrated below with examples.
For example, the processor determines that the faulty memory in the first storage space consists of several discrete bytes, whose index numbers may be 2, 6, 8, and 9.
Assume that the processor 10221 caches the data received from the server in the storage space corresponding to the byte with index number 6 in the memory 10222. The storage address of the data in the memory 10222 (the byte with index number 6) then corresponds to the address of the faulty memory in the first storage space (the byte with index number 6). The data needs to be stored in exactly the same address space in the first control unit and the second control unit, but the storage space corresponding to the byte with index number 6 in the first storage space is faulty.
In this case, the processor 10221 needs to migrate the data within the memory 10222. For example, the processor 10221 migrates the data from the storage space corresponding to the byte with index number 6 to the storage space corresponding to the byte with index number 7, and the destination address carried in the second write data instruction that the processor 10221 sends to the DMAC 10226 corresponds to the storage space of the byte with index number 7 in the first storage space. In this way, the faulty memory in the first storage space is isolated.
According to the destination address carried in the second write data instruction, the DMAC 10226 writes the data by DMA into the storage space corresponding to the byte with index number 7 in the first storage space.
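As another example, the processor determines that the faulty memory in the first storage space consists of several discrete bytes, whose index numbers may be 4, 5, 6, and 9.
Assume that the processor 10221 caches the data received from the server in the storage space corresponding to the bytes with index numbers 6, 7, and 8 in the memory 10222, that is, each of these bytes holds part of the data. Some of the data's storage addresses in the memory 10222 (the byte with index number 6) correspond to the address of the faulty memory in the first storage space (the byte with index number 6). The data needs to be stored in exactly the same address space in the first control unit and the second control unit, but the storage space corresponding to the byte with index number 6 in the first storage space is faulty.
In this case, the processor 10221 needs to migrate the data within the memory 10222. For example, the processor 10221 migrates the data stored at the byte with index number 6 to the storage space corresponding to the byte with index number 3, and the destination address carried in the second write data instruction that the processor 10221 sends to the DMAC 10226 corresponds to the storage space of the bytes with index numbers 3, 7, and 8 in the first storage space. The faulty memory in the first storage space is thereby isolated. Note that when migrating the data, the processor 10221 may migrate only the data stored at the byte with index number 6.
In another implementation, to keep the data's storage addresses contiguous, the data stored at the bytes with index numbers 6, 7, and 8 can all be migrated; this embodiment of the present invention does not specifically limit this. For example, the processor 10221 migrates the data stored at the bytes with index numbers 6, 7, and 8 to the storage space corresponding to the bytes with index numbers 1 through 3. Because the storage space corresponding to the bytes with index numbers 1 through 3 is not faulty, the destination address carried in the second write data instruction that the processor 10221 sends to the DMAC 10226 corresponds to the storage space of the bytes with index numbers 1 through 3 in the first storage space, which isolates the faulty memory in the first storage space.
According to the destination address carried in the second write data instruction, the DMAC 10226 writes the data by DMA into the storage space corresponding to the bytes with index numbers 1 through 3 in the first storage space.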
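When address continuity matters, the whole run of bytes is moved, as in the second variant of this example. The sketch below is illustrative only: the helper name `find_clean_run`, the region size of 16, and the assumption that byte indexes start at 1 do not come from the source.

```python
def find_clean_run(length, faulty, size, first_index=1):
    """Return the lowest run of `length` consecutive fault-free indexes, or None.

    Indexes are assumed to run from `first_index` to `size - 1`.
    """
    for start in range(first_index, size - length + 1):
        run = range(start, start + length)
        if not any(i in faulty for i in run):
            return list(run)
    return None


faulty_bytes = {4, 5, 6, 9}
print(find_clean_run(3, faulty_bytes, size=16))  # [1, 2, 3]
```

Moving only the byte that collides with a fault needs less copying, while moving the whole run keeps the destination contiguous so that a single write data instruction can describe it.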
As another example, when the storage space is managed in pages, the processor 10221 determines that the first addresses of the pages containing the faulty memory in the first storage space are d1, d3, and d5.
Assume that the processor 10221 caches the data received from the server in the page whose first address is d5 in the memory 10222. The data needs to be stored in exactly the same address space in the first control unit and the second control unit, but the page whose first address is d5 in the first storage space is faulty.
In this case, the processor 10221 needs to migrate the data within the memory 10222. For example, the processor 10221 migrates the data from the page whose first address is d5 to the storage space corresponding to the byte with index number 8 in the page whose first address is d4, and the destination address carried in the second write data instruction that the processor 10221 sends to the DMAC 10226 corresponds to the storage space of the byte with index number 8 in the page whose first address is d4 in the first storage space. The faulty memory in the first storage space is thereby isolated.
According to the destination address carried in the second write data instruction, the DMAC 10226 writes the data by DMA into the storage space corresponding to the byte with index number 8 in the page whose first address is d4 in the first storage space.
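The paged case can be sketched as follows. Everything named here is hypothetical: the 4096-byte page size, the concrete values chosen for d1 through d5, the `spare_pages` preference list, and the choice to keep the byte offset within the page when redirecting. The source only states that data bound for a faulty page is moved to a page that is not faulty.

```python
PAGE_SIZE = 4096  # assumed page size; the source does not state one


def page_base(address):
    """First address of the page that contains `address`."""
    return address & ~(PAGE_SIZE - 1)


def redirect(address, faulty_pages, spare_pages):
    """Return a destination outside every faulty page.

    `spare_pages` lists candidate page base addresses in order of preference.
    """
    if page_base(address) not in faulty_pages:
        return address
    offset = address - page_base(address)
    for base in spare_pages:
        if base not in faulty_pages:
            return base + offset
    raise MemoryError("no fault-free page available")


d1, d2, d3, d4, d5 = (n * PAGE_SIZE for n in range(1, 6))  # hypothetical bases
target = redirect(d5 + 8, {d1, d3, d5}, spare_pages=[d4, d2])
print(target == d4 + 8)  # True
```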
Note that the storage addresses of the data in the memory 10222 before or after migration listed above, for example, the storage space corresponding to the byte with index number 8 in the page d4, are merely illustrative and do not limit the embodiments of the present invention in any way.
In addition, in this embodiment of the present invention, the processor 10211 can also repeatedly test the faulty memory in the first storage space. If memory that was previously faulty is found to have recovered, that memory is determined to have suffered a soft failure, and the address of the recovered memory in the first storage space can then be synchronized to the processor 10221.
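The re-test step can be pictured as a split of the current fault list. This is a toy sketch: the function name `retest` and the `still_fails` probe are assumptions, and a real probe would be an ECC check rather than a lambda.

```python
def retest(faulty, still_fails):
    """Split a fault list into (still faulty, recovered) after a re-test."""
    recovered = {addr for addr in faulty if not still_fails(addr)}
    return faulty - recovered, recovered


# Toy probe: pretend only address 6 still fails on the second pass.
remaining, recovered = retest({2, 6, 8, 9}, still_fails=lambda addr: addr == 6)
print(remaining)          # {6}
print(sorted(recovered))  # [2, 8, 9]
```

Addresses in `recovered` correspond to soft failures, and they are the ones whose recovery would be synchronized to the peer.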
The local controller (for example, the first control unit) performs fault detection, before or after the OS starts, on the storage space that supports DMA, and notifies the peer controller (for example, the second control unit) of the address of the faulty memory in that storage space. The peer controller can therefore isolate the faulty memory when it backs up data in that storage space, which reduces the probability of OOM.
The following describes, separately for the case where ECC detection occurs when the computer system is powered on and the case where ECC detection occurs after the OS starts, the method by which the processor 10211 synchronizes the address of the faulty memory in the first storage space to the processor 10221 and the method by which the processor 10221 obtains that address.
Scenario 1: ECC detection occurs when the computer system is powered on, and the number of addresses of faulty memory in the first storage space is less than or equal to a preset threshold.
The first control unit includes a first processor and a first baseboard management controller (BMC), and the second control unit includes a second BMC. Synchronizing the address of the faulty memory in the first storage space to the second control unit includes: the first processor writes the address of the faulty memory in the first storage space into pre-allocated storage space in the first control unit; the first BMC obtains the address of the faulty memory in the first storage space from the pre-allocated storage space in the first control unit; and the first BMC sends a first message to the second BMC, the first message carrying the address of the faulty memory in the first storage space.
The second control unit includes a second processor and the second BMC. The second control unit obtaining the address of the faulty memory in the first storage space includes: the second BMC receives the first message sent by the first BMC, the first message carrying the address of the faulty memory in the first storage space; the second BMC parses the first message to obtain the address of the faulty memory in the first storage space; the second BMC writes the address of the faulty memory in the first storage space into pre-allocated storage space in the second control unit; and the second processor obtains the address of the faulty memory in the first storage space from the pre-allocated storage space in the second control unit.
Specifically, when the processor 10211 performs ECC detection on the first storage space and detects faulty memory in it, the processor 10211 writes the address of the detected faulty memory into pre-allocated storage space in the first control unit 1021. This pre-allocated storage space may be storage space in the memory 10212, or it may be storage space in the complex programmable logic device (CPLD) 10218 (for example, a first CPLD) in the first control unit 1021.
The BMC 10219 (for example, the first BMC) in the first control unit 1021 obtains the address of the faulty memory in the first storage space from the pre-allocated storage space in the first control unit 1021, encapsulates the address in a message (for example, the first message), and transmits the message over the heartbeat channel to the BMC 10229 (for example, the second BMC) in the second control unit 1022. The message may be an Ethernet message.
The BMC 10229 parses the received message, obtains the address of the faulty memory in the first storage space carried in it, and writes that address into pre-allocated storage space in the second control unit 1022. This pre-allocated storage space may be storage space in the memory 10222, or it may be storage space in the CPLD 10228 (for example, a second CPLD) in the second control unit 1022.
The processor 10221 obtains the address of the faulty memory in the first storage space from the pre-allocated storage space in the second control unit 1022.
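The encapsulate-and-parse step between the two BMCs can be sketched with Python's `struct` module. The payload layout used here (a 16-bit count followed by 64-bit addresses in network byte order) is purely hypothetical. The source says only that the message carries the fault addresses and may be an Ethernet message.

```python
import struct


def pack_fault_report(addresses):
    """Sender side (first BMC): encode the fault addresses into a payload."""
    return struct.pack(f"!H{len(addresses)}Q", len(addresses), *addresses)


def unpack_fault_report(payload):
    """Receiver side (second BMC): recover the fault addresses from a payload."""
    (count,) = struct.unpack_from("!H", payload, 0)
    return list(struct.unpack_from(f"!{count}Q", payload, 2))


payload = pack_fault_report([0x2000, 0x6000])
print(len(payload))                                        # 18
print([hex(a) for a in unpack_fault_report(payload)])      # ['0x2000', '0x6000']
```

After decoding, the receiving BMC would place the addresses in its own pre-allocated storage space for the processor to read.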
The following takes as an example the case where the pre-allocated storage space in the first control unit 1021 is storage space in the CPLD 10218 and the pre-allocated storage space in the second control unit 1022 is storage space in the CPLD 10228, and describes in detail the method by which the processor 10211 synchronizes the address of the faulty memory in the first storage space to the processor 10221 and the method by which the processor 10221 obtains that address.
When the pre-allocated storage space in the first control unit 1021 is storage space in the CPLD 10218, the first control unit 1021 further includes a platform controller hub (PCH) 10217 (for example, a first PCH), where the PCH 10217 is the interface through which the processor 10211 and the CPLD 10218 communicate. When the pre-allocated storage space in the second control unit 1022 is storage space in the CPLD 10228, the second control unit 1022 further includes a PCH 10227 (for example, a second PCH), where the PCH 10227 is the interface through which the processor 10221 and the CPLD 10228 communicate.
As shown in FIG. 6, the PCH 10217 is connected to the processor 10211 shown in FIG. 1 to FIG. 3, the CPLD 10218 is connected to both the PCH 10217 and the BMC 10219, the PCH 10227 is connected to the processor 10221 shown in FIG. 1 to FIG. 3, and the CPLD 10228 is connected to both the PCH 10227 and the BMC 10229. A PCH may also be called a bridge chip.
When the processor 10211 performs ECC detection on the first storage space and detects faulty memory in it, the processor 10211 writes the address of the detected faulty memory, through the PCH 10217, into the pre-allocated storage space in the CPLD 10218. This storage space is dedicated to storing the address of the faulty memory in the first storage space, and both the PCH 10217 and the BMC 10219 know its address.
The CPLD 10218 reports an interrupt to the BMC 10219, whereupon the BMC 10219 obtains the address of the faulty memory in the first storage space from the pre-allocated storage space in the CPLD 10218.
After the BMC 10219 has successfully obtained the address of the faulty memory in the first storage space from the pre-allocated storage space in the CPLD 10218, the BMC 10219 writes a default field (for example, all 0s or all 1s) into that storage space. That is, it erases the address of the faulty memory stored there, so that the PCH 10217 can write subsequently detected addresses of faulty memory in the first storage space into the pre-allocated storage space in the CPLD 10218.
The BMC 10219 can also obtain the address of the faulty memory in the first storage space from the pre-allocated storage space in the CPLD 10218 by polling. For example, the BMC 10219 can periodically monitor the pre-allocated storage space in the CPLD 10218 and, when new information is written into it, obtain the address of the faulty memory in the first storage space from it.
Likewise, after the BMC 10219 has successfully obtained the address of the faulty memory in the first storage space from the pre-allocated storage space in the CPLD 10218, it erases the address stored there.
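The write, read, and erase handshake on the pre-allocated CPLD storage space behaves like a single-slot mailbox. The sketch below models it in plain Python. The class name `Mailbox`, the all-zero erase pattern, and the rule that a write is refused while the slot is occupied are assumptions made for illustration, not details given in the source.

```python
ERASED = 0x0  # default pattern; the source mentions all 0s or all 1s


class Mailbox:
    """Single-slot register standing in for the pre-allocated CPLD space."""

    def __init__(self):
        self.slot = ERASED

    def write(self, address):
        """Producer side (processor via PCH). Returns False if the slot is busy."""
        if self.slot != ERASED:
            return False
        self.slot = address
        return True

    def take(self):
        """Consumer side (BMC): read the pending address, then erase the slot."""
        if self.slot == ERASED:
            return None
        address, self.slot = self.slot, ERASED
        return address


def poll(mailbox, attempts):
    """Polling variant: check the slot a bounded number of times."""
    for _ in range(attempts):
        address = mailbox.take()
        if address is not None:
            return address
    return None


box = Mailbox()
box.write(0x1F40)
print(hex(poll(box, attempts=3)))  # 0x1f40
print(box.take())                  # None, because the slot was erased
```

One consequence of using a default pattern as the "empty" marker is that an address equal to that pattern cannot be distinguished from an erased slot. A real design would need to reserve the pattern or add a valid flag.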
The BMC 10219 encapsulates the obtained address of the faulty memory in the first storage space in a message (for example, the first message) and transmits the message over the heartbeat channel to the BMC 10229. The message may be an Ethernet message.
The BMC 10229 parses the received message, obtains the address of the faulty memory in the first storage space carried in it, and writes that address into the pre-allocated storage space in the CPLD 10228. The CPLD 10228 reports an interrupt to the PCH 10227, and the PCH 10227 obtains the address of the faulty memory in the first storage space from the pre-allocated storage space in the CPLD 10228 and sends it to the processor 10221.
After the PCH 10227 has successfully obtained the address of the faulty memory in the first storage space from the pre-allocated storage space in the CPLD 10228, the PCH 10227 writes a default field (for example, all 0s or all 1s) into that storage space. That is, it erases the address of the faulty memory stored there, so that the BMC 10229 can write subsequently detected addresses of faulty memory in the first storage space into the pre-allocated storage space in the CPLD 10228.
The PCH 10227 can also obtain the address of the faulty memory in the first storage space from the pre-allocated storage space in the CPLD 10228 by polling. For example, the PCH 10227 can periodically monitor the pre-allocated storage space in the CPLD 10228 and, when new information is written into it, obtain the address of the faulty memory in the first storage space from it.
Likewise, after the PCH 10227 has successfully obtained the address of the faulty memory in the first storage space from the pre-allocated storage space in the CPLD 10228, it erases the address stored there.
Note that, besides the manner described above, the address of the faulty memory in the first storage space can also be synchronized to the processor 10221 as follows: the PCH 10217 encapsulates the address in a message and sends the message to the PCH 10227; after parsing the message, the PCH 10227 sends the obtained address of the faulty memory in the first storage space to the processor 10221.
Scenario 2: ECC detection occurs after the OS starts.
The first control unit includes a first processor and a first direct memory access controller (DMAC), and the second control unit includes a second processor. Synchronizing the address of the faulty memory in the first storage space to the second control unit includes: the first processor sends a first write data instruction to the first DMAC, where the first write data instruction carries the storage address, in a first storage subspace of a second storage space, of the address information of the faulty memory in the first storage space; the second storage space is storage space in the second control unit; the first storage subspace is used to store the address of the faulty memory in the first storage space; and the second storage space further includes a second storage subspace used to store data that the first control unit backs up in the second control unit by DMA. The first DMAC stores the address of the faulty memory in the first storage space to the first storage subspace according to the first write data instruction. The first processor sends a second message to the second processor, the second message carrying the storage address, in the first storage subspace, of the address information of the faulty memory in the first storage space.
The second control unit includes a second processor. The second control unit obtaining the address of the faulty memory in the first storage space includes: the second processor receives the second message sent by the first processor, the second message carrying the storage address, in the first storage subspace, of the address information of the faulty memory in the first storage space; and the second processor obtains the address of the faulty memory in the first storage space from the first storage subspace according to the storage address in the second message.
Specifically, after the OS starts, the processor 10211 can continue to perform ECC detection on the first storage space. If the processor 10211 finds that the first storage space contains faulty memory, it can synchronize the address of the faulty memory by DMA to the storage space in the memory 10222 that is configured to store the address of the faulty memory in the first storage space (for example, the first storage subspace in the second storage space).
For example, the processor 10211 can synchronize the address of the faulty memory in the first storage space by DMA to the first storage subspace in the memory 10222. That is, the DMAC 10216 (for example, the first DMAC) in the processor 10211 accesses the first storage subspace in the memory 10222 by DMA and writes the address of the faulty memory in the first storage space into it.
The first storage subspace may be storage space partitioned in the memory 10222 during the BIOS stage. A second storage subspace can also be partitioned in the memory 10222; it is used to store data that the first control unit backs up in the second control unit by DMA.
That is, the processor 10211 sends a write data instruction (for example, the first write data instruction) to the DMAC 10216. The first write data instruction carries the storage address, in the first storage subspace, at which the address of the faulty memory in the first storage space is to be stored. After obtaining control of the bus, the DMAC 10216 stores the address of the faulty memory in the first storage space to the first storage subspace in the memory 10222 according to the first write data instruction.
After regaining control of the bus, the processor 10211 sends a message (for example, the second message) to the processor 10221. The message carries the address of the storage space, within the first storage subspace, that stores the address of the faulty memory in the first storage space.
The processor 10221 parses the message, obtains the address of the storage space within the first storage subspace that stores the address of the faulty memory in the first storage space, and reads the address of the faulty memory in the first storage space from the corresponding location in the memory 10222.
At this point, the processor 10211 has synchronized the address of the faulty memory in the first storage space to the processor 10221.
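The flow in Scenario 2 has two parts: a DMA write of the fault list into the peer's first storage subspace, then a short message that tells the peer where to look. The simulation below models the peer memory as a `bytearray`. The offset `FIRST_SUBSPACE`, the helper names, and the on-memory layout (a 32-bit count followed by 64-bit addresses, little-endian) are assumptions. The source does not specify how the fault list is laid out or how the reader learns its length.

```python
import struct

peer_memory = bytearray(256)   # stands in for memory 10222
FIRST_SUBSPACE = 0x40          # hypothetical offset of the first storage subspace


def dma_write(memory, dest, payload):
    """Model of the DMAC copying `payload` to `dest` in the peer's memory."""
    if dest + len(payload) > len(memory):
        raise ValueError("write would run past the end of memory")
    memory[dest:dest + len(payload)] = payload


def publish_faults(memory, dest, addresses):
    """Write the fault list by 'DMA' and return the content of the second message."""
    payload = struct.pack(f"<I{len(addresses)}Q", len(addresses), *addresses)
    dma_write(memory, dest, payload)
    return dest  # the message carries only the storage address


def read_faults(memory, storage_address):
    """Peer side: follow the storage address from the message to the fault list."""
    (count,) = struct.unpack_from("<I", memory, storage_address)
    return list(struct.unpack_from(f"<{count}Q", memory, storage_address + 4))


message = publish_faults(peer_memory, FIRST_SUBSPACE, [0x2000, 0x6000])
print([hex(a) for a in read_faults(peer_memory, message)])  # ['0x2000', '0x6000']
```

Because the bulk of the information travels by DMA and the message holds only a pointer, the message stays small regardless of how many fault addresses are reported.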
Note that the use of the second storage subspace described above, namely storing data that the first control unit backs up in the second control unit by DMA, is merely illustrative, and this embodiment of the present invention does not specifically limit it. For example, the second storage subspace can also be used to cache data that the second control unit receives from the server and that needs to be backed up in the first control unit by DMA.
Note also that, in this embodiment of the present invention, the address of the faulty memory in the first storage space may also be referred to as the address information of the faulty memory in the first storage space; this embodiment does not specifically limit this. Synchronizing the address of the faulty memory in the first storage space to the peer controller by DMA speeds up fault synchronization, so that the peer controller learns the address of the faulty memory in the local controller promptly and can isolate the faulty memory in time when it accesses data in the local controller's storage space. This avoids the situation in which late fault synchronization leads the peer controller to access faulty storage space in the local controller and thereby causes OOM.
Note that, for the case where ECC detection occurs after the OS starts, the method in Scenario 2 above is used only as an example of how the processor 10211 synchronizes the address of the faulty memory in the first storage space to the processor 10221, and the embodiments of the present invention are not limited to it. For example, when ECC detection occurs after the OS starts, the processor 10211 can also synchronize the address of the faulty memory in the first storage space to the processor 10221 by the method described in Scenario 1.
Note also that, for the method by which the processor 10211 synchronizes the address of recovered memory in the first storage space to the processor 10221, reference can be made to the descriptions of Scenario 1 and Scenario 2 above; for brevity, details are not repeated here.
Note also that the embodiments of the present invention are described using only the method by which the processor 10211 synchronizes the address of the faulty memory in the first storage space to the processor 10221 as an example. The method applies equally to the scenario in which the processor 10221 synchronizes, to the processor 10211, the address of faulty memory in the storage space of the memory 10222 that supports DMA. For that method, reference can be made to the related descriptions above; for brevity, details are not repeated here.
An embodiment of the present invention further provides an apparatus for handling memory faults. The apparatus is configured in the computer system 100 and includes a first control unit and a second control unit. The first control unit is configured to perform the operation steps of the method 200 that are performed by the first control unit 1021, and the second control unit is configured to perform the operation steps of the method 200 that are performed by the second control unit 1022.
In the apparatus for handling memory faults provided by this embodiment of the present invention, the local controller (for example, the first control unit) performs fault detection, before or after the OS starts, on the storage space that supports DMA, and notifies the peer controller (for example, the second control unit) of the address of the faulty memory in that storage space. When the peer controller backs up data in that storage space, it isolates the faulty memory, that is, it backs up data only in storage space of the local controller where no memory fault has occurred, which reduces the probability of OOM.
An embodiment of the present invention further provides a computer system 300. As shown in FIG. 7, the computer system 300 includes a processing unit 301 and a storage control unit 302 connected to the processing unit 301. The storage control unit 302 includes:
a first control unit 3021, configured to determine the address of faulty memory in a first storage space, where the first storage space is storage space in the first control unit 3021 and can store data that a second control unit 3022 backs up in the first control unit 3021 by direct memory access (DMA);
the first control unit 3021 being further configured to synchronize the address of the faulty memory in the first storage space to the second control unit 3022;
the second control unit 3022, configured to obtain the address of the faulty memory in the first storage space;
the second control unit 3022 being further configured to store the data buffered in the second control unit 3022 to the first storage space by DMA and to isolate the faulty memory in the first storage space when storing the data, where the data is data that the processing unit stores through the second control unit 3022.
可选地,第二控制单元3022,还用于在以DMA的方式将缓存在第二控制单元3022中的数据存储至第一存储空间之前,将缓存在与第一存储空间中存在故障的内存的地址对应的地址空间的数据,迁移到第二存储空间其它的存储空间中;其中,第二存储空间是第二控制单元3022中的存储空间。Optionally, the second control unit 3022 is further configured to, before storing the data cached in the second control unit 3022 to the first storage space by DMA, store the data cached in the faulty memory with the first storage space. The data in the address space corresponding to the address of , is migrated to other storage spaces in the second storage space; wherein, the second storage space is the storage space in the second control unit 3022 .
可选地,该第一控制单元3021包括第一处理器与第一主板管理控制器BMC,该第二控制单元3022包括第二BMC,第一处理器,用于将该第一存储空间中存在故障的内存的地址写入该第一控制单元3021中预先分配的存储空间中;第一BMC,用于从该第一控制单元3021中预先分配的存储空间中获取该第一存储空间中存在故障的内存的地址;第一BMC,还用于向该第二BMC发送第一报文,该第一报文中携带有该第一存储空间中存在故障的内存的地址。Optionally, the first control unit 3021 includes a first processor and a first motherboard management controller BMC, and the second control unit 3022 includes a second BMC, the first processor, for storing in the first storage space The address of the faulty memory is written into the storage space pre-allocated in the first control unit 3021; the first BMC is used to obtain the fault in the first storage space from the storage space pre-allocated in the first control unit 3021 The address of the memory; the first BMC is further configured to send a first message to the second BMC, where the first message carries the address of the faulty memory in the first storage space.
可选地,第一控制单元3021还包括第一平台控制单元PCH与第一复杂可编程逻辑器件CPLD,该第一控制单元3021中预先分配的存储空间是该第一CPLD中的存储空间,第一处理器,还用于通过该第一PCH将该第一存储空间中存在故障的内存的地址写入该第一CPLD中的预先分配的存储空间中;第一BMC,还用于从该第一CPLD中的预先分配的存储空间中获取该第一存储空间中存在故障的内存的地址。Optionally, the first control unit 3021 further includes a first platform control unit PCH and a first complex programmable logic device CPLD, and the pre-allocated storage space in the first control unit 3021 is the storage space in the first CPLD, and the first A processor is further configured to write the address of the faulty memory in the first storage space into the pre-allocated storage space in the first CPLD through the first PCH; the first BMC is further configured to retrieve the address from the first storage space The address of the faulty memory in the first storage space is obtained from the pre-allocated storage space in a CPLD.
可选地,第一控制单元3021包括第一处理器与第一直接内存访问控制器DMAC,第二控制单元3022包括第二处理器,第一处理器,还用于向该第一DMAC发送第一写数据指令,该第一写数据指令中携带有该第一存储空间中存在故障的内存的地址在第二存储空间中的第一存储子空间中的存储地址,该第二存储空间为该第二控制单元3022中的存储空间,该第一存储子空间用于存储该第一存储空间中存在故障的内存的地址,该第二存储空间还包括第二存储子空间,该第二存储子空间用于存储该第一控制单元3021以DMA的方式在该第二控制单元3022中备份的数据;第一DMAC,用于根据该第一写数据指令,将该第一存储空间中存在故障的内存的地址存储至该存储地址在该第一存储子空间中对应的存储空间;Optionally, the first control unit 3021 includes a first processor and a first direct memory access controller DMAC, and the second control unit 3022 includes a second processor, the first processor, and is further configured to send the first DMAC to the first DMAC. A write data command, the first write data command carries the storage address of the address of the faulty memory in the first storage space in the first storage subspace in the second storage space, and the second storage space is the The storage space in the second control unit 3022, the first storage subspace is used to store the address of the faulty memory in the first storage space, the second storage space further includes a second storage subspace, the second storage subspace The space is used to store the data backed up by the first control unit 3021 in the second control unit 3022 by means of DMA; the first DMAC is used to store the faulty data in the first storage space according to the first write data instruction. The address of the memory is stored in the storage space corresponding to the storage address in the first storage subspace;
第一处理器,还用于向该第二处理器发送第二报文,该第二报文中携带有该第一存储空间中存在故障的内存的地址在该第一存储子空间中的存储地址。The first processor is further configured to send a second message to the second processor, where the second message carries the storage of the address of the faulty memory in the first storage space in the first storage subspace address.
可选地,第二控制单元3022包括第二处理器与第二BMC,第二BMC,用于接收该第一BMC发送的该第一报文,该第一报文中携带有该第一存储空间中存在故障的内存的地址;第二BMC,还用于对该第一报文进行解析,获取该第一存储空间中存在故障的内存的地址;第二BMC,还用于将该第一存储空间中存在故障的内存的地址写入该第二控制单元3022中预先分配的存储空间中;第二处理器,用于从该第二控制单元3022中预先分配的存储空间中获取该第一存储空间中存在故障的内存的地址。Optionally, the second control unit 3022 includes a second processor and a second BMC, and the second BMC is used to receive the first message sent by the first BMC, where the first message carries the first storage The address of the faulty memory in the space; the second BMC is also used to parse the first message to obtain the address of the faulty memory in the first storage space; the second BMC is also used to The address of the faulty memory in the storage space is written into the storage space pre-allocated in the second control unit 3022; the second processor is configured to obtain the first The address of the faulty memory in the storage space.
可选地,第二控制单元3022还包括第二PCH与第二CPLD,第二控制单元3022中预先分配的存储空间是第二CPLD中的存储空间,第二BMC,还用于将该第一存储空间中存在故障的内存的地址写入该第二CPLD中的预先分配的存储空间中;第二处理器,还用于从该第二CPLD中的预先分配的存储空间中获取该第一存储空间中存在故障的内存的地址,包括:第二处理器,还用于通过该第二PCH从该第二CPLD中的预先分配的存储空间中获取该第一存储空间中存在故障的内存的地址。Optionally, the second control unit 3022 further includes a second PCH and a second CPLD, and the pre-allocated storage space in the second control unit 3022 is the storage space in the second CPLD, and the second BMC is also used to store the first The address of the faulty memory in the storage space is written into the pre-allocated storage space in the second CPLD; the second processor is further configured to obtain the first storage space from the pre-allocated storage space in the second CPLD The address of the faulty memory in the space, including: a second processor, further configured to obtain the address of the faulty memory in the first storage space from the pre-allocated storage space in the second CPLD through the second PCH .
Optionally, the second control unit 3022 includes a second processor. The second processor is further configured to receive the second message sent by the first processor, where the second message carries the storage address, within the first storage subspace, of the address of the faulty memory in the first storage space. The second processor is further configured to obtain, according to the second message, the address of the faulty memory in the first storage space from the storage space in the first storage subspace that corresponds to the storage address.
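The indirection in this processor-to-processor path (the message carries where the fault addresses are stored, not the fault addresses themselves) might look like the sketch below. The `bytearray` standing in for the first storage subspace, the dictionary message layout, and the extra `count` field are assumptions made for illustration; the patent text only states that the second message carries the storage address.

```python
import struct

def publish_fault_addresses(subspace, offset, fault_addrs):
    """First processor: write the fault addresses into the shared subspace
    and return the second message describing where they were stored."""
    struct.pack_into(f"<{len(fault_addrs)}Q", subspace, offset, *fault_addrs)
    # 'count' is a hypothetical field so the reader knows how much to read.
    return {"offset": offset, "count": len(fault_addrs)}

def fetch_fault_addresses(subspace, message):
    """Second processor: follow the storage address carried in the message."""
    fmt = f"<{message['count']}Q"
    return list(struct.unpack_from(fmt, subspace, message["offset"]))

first_storage_subspace = bytearray(64)  # stand-in for the shared region
second_message = publish_fault_addresses(
    first_storage_subspace, 16, [0x2000, 0x9000]
)
fetched = fetch_fault_addresses(first_storage_subspace, second_message)
```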
Optionally, the first processor is further configured to perform correctable error (ECC) detection on the first storage space, and to determine the address of the faulty memory in the first storage space according to the result of the ECC detection performed on the first storage space.
Optionally, the first control unit 3021 is further configured to count the number of addresses of faulty memory in the first storage space and, if the number is less than or equal to a preset threshold, to synchronize the addresses of the faulty memory in the first storage space to the second control unit 3022.
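The threshold check can be illustrated with a few lines of code. The function name, the `send` callback, and the de-duplication of addresses before counting are assumptions for this sketch; what happens when the threshold is exceeded is not modeled here.

```python
def sync_if_within_threshold(fault_addrs, threshold, send):
    """Count the distinct fault addresses and synchronize them to the peer
    only if the count does not exceed the preset threshold."""
    unique = sorted(set(fault_addrs))  # de-duplication is an assumption
    if len(unique) <= threshold:
        send(unique)
        return True
    return False

sent = []  # stand-in for the channel to the second control unit
ok = sync_if_within_threshold([0x1000, 0x1000, 0x3000], 2, sent.append)
too_many = sync_if_within_threshold([0x1000, 0x2000, 0x3000], 2, sent.append)
```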
Optionally, the address of the faulty memory in the first storage space includes the start address of the page frame in which the faulty memory in the first storage space is located.
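As a small worked example, the page-frame start address of a faulty location can be derived by masking off the in-page offset. The 4 KiB page size is an assumption (the text does not fix one), and the masking trick requires the page size to be a power of two.

```python
PAGE_SIZE = 4096  # assumed page size; must be a power of two

def page_frame_base(addr, page_size=PAGE_SIZE):
    """Return the start address of the page frame containing addr."""
    return addr & ~(page_size - 1)

def faulty_frames(fault_addrs, page_size=PAGE_SIZE):
    """Collapse individual faulty addresses into distinct frame bases."""
    return sorted({page_frame_base(a, page_size) for a in fault_addrs})

frames = faulty_frames([0x12345, 0x12FFF, 0x13000])
```

Reporting frame bases rather than individual byte addresses means several faults within one page are reported as a single entry.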
Optionally, the second control unit 3022 is further configured to send a second write data instruction to the second DMAC in the second control unit 3022. The second write data instruction instructs the second DMAC to store the data cached in the second control unit into the first storage space, and the storage space indicated by the destination address carried in the second write data instruction is storage space in the first storage space other than the faulty memory.
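One way to choose destination addresses that avoid the faulty memory is sketched below. It is a hypothetical illustration: the function name, the 4 KiB frame size, and the assumption that the data may be scattered across non-contiguous healthy frames (one destination per frame) are not taken from the patent.

```python
def pick_destination_frames(region_start, region_size, faulty_frame_bases,
                            nbytes, page_size=4096):
    """Return the base addresses of healthy page frames, in address order,
    that together can hold nbytes of data; faulty frames are skipped."""
    faulty = set(faulty_frame_bases)
    needed = -(-nbytes // page_size)  # ceiling division
    healthy = [
        base
        for base in range(region_start, region_start + region_size, page_size)
        if base not in faulty
    ]
    if len(healthy) < needed:
        raise MemoryError("not enough healthy frames in the storage space")
    return healthy[:needed]

# Four frames starting at 0x10000, with the frame at 0x11000 marked faulty.
destinations = pick_destination_frames(0x10000, 4 * 4096, [0x11000], 2 * 4096)
```

Each returned base would then serve as the destination address of one write data instruction, so no backup data lands in an isolated frame.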
Optionally, the computer system 300 further includes a storage unit 303. The first control unit 3021 and the second control unit 3022 are both connected to the storage unit 303, and the processing unit 301 is further configured to store the data in the storage unit 303 through the second control unit 3022.
In the computer system provided by this embodiment of the present invention, the local controller (for example, the first control unit) performs fault detection, before or after the OS starts, on the storage space that supports DMA, and notifies the peer controller (for example, the second control unit) of the addresses of the faulty memory in that storage space. When the peer controller backs up data in that storage space, the faulty memory in the storage space is isolated; that is, data is backed up only in the storage space of the local controller in which no memory fault has occurred, thereby reducing the probability of an out-of-memory (OOM) condition.
An embodiment of the present invention provides a computer-readable storage medium. The computer-readable storage medium stores instructions that, when run on a computer, cause the computer to perform the steps of the foregoing method 200.
An embodiment of the present invention provides a computer program product containing instructions that, when run on a computer, cause the computer to perform the steps of the foregoing method 200.
A person of ordinary skill in the art will appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, or in a combination of computer software and electronic hardware. Whether these functions are performed in hardware or in software depends on the specific application and the design constraints of the technical solution. A skilled person may use different methods to implement the described functions for each specific application, but such implementations should not be considered beyond the scope of this application.
A person skilled in the art will clearly understand that, for convenience and brevity of description, the specific working processes of the system, apparatus, and units described above may be found in the corresponding processes of the foregoing method embodiments, and are not repeated here.
In the several embodiments provided in this application, it should be understood that the disclosed system, apparatus, and method may be implemented in other ways. For example, the apparatus embodiments described above are merely illustrative. The division into units is only a division by logical function, and other divisions are possible in an actual implementation; for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. In addition, the mutual couplings, direct couplings, or communication connections shown or discussed may be indirect couplings or communication connections through some interfaces, apparatuses, or units, and may be electrical, mechanical, or in other forms.
The units described as separate components may or may not be physically separate, and the components shown as units may or may not be physical units; that is, they may be located in one place or distributed across multiple network units. Some or all of the units may be selected according to actual needs to achieve the purpose of the solution of this embodiment.
In addition, the functional units in the embodiments of this application may be integrated into one processing unit, each unit may exist physically on its own, or two or more units may be integrated into one unit.
If the functions are implemented in the form of software functional units and sold or used as an independent product, they may be stored in a computer-readable storage medium. Based on this understanding, the technical solution of this application in essence, or the part that contributes to the prior art, or a part of the technical solution, may be embodied in the form of a software product. The computer software product is stored in a storage medium and includes several instructions that cause a computer device (which may be a personal computer, a server, a network device, or the like) to perform all or some of the steps of the methods described in the embodiments of this application. The foregoing storage medium includes any medium that can store program code, such as a USB flash drive, a removable hard disk, a read-only memory (ROM), a random access memory (RAM), a magnetic disk, or an optical disc.
The foregoing is merely a description of specific embodiments of this application, but the protection scope of this application is not limited thereto. Any change or substitution that a person skilled in the art could readily conceive of within the technical scope disclosed in this application shall fall within the protection scope of this application. Therefore, the protection scope of this application shall be subject to the protection scope of the claims.
Claims (20)
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810942648.9A CN109343986B (en) | 2018-08-17 | 2018-08-17 | Method and computer system for handling memory failures |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810942648.9A CN109343986B (en) | 2018-08-17 | 2018-08-17 | Method and computer system for handling memory failures |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN109343986A (en) | 2019-02-15 |
| CN109343986B (en) | 2020-12-22 |
Family
ID=65291698
Family Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN201810942648.9A Active CN109343986B (en) | 2018-08-17 | 2018-08-17 | Method and computer system for handling memory failures |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN109343986B (en) |
Families Citing this family (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114816271B (en) * | 2022-06-02 | 2025-07-01 | 上海磐启微电子有限公司 | Data storage method and system applied to flash memory |
| CN115421948A (en) * | 2022-07-30 | 2022-12-02 | 超聚变数字技术有限公司 | Method for detecting memory data fault and related equipment thereof |
Citations (6)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US5588112A (en) * | 1992-12-30 | 1996-12-24 | Digital Equipment Corporation | DMA controller for memory scrubbing |
| CN101040268A (en) * | 2004-10-05 | 2007-09-19 | 索尼计算机娱乐公司 | External data interface in a computer architecture for broadband networks |
| CN101876925A (en) * | 2009-11-27 | 2010-11-03 | 成都市华为赛门铁克科技有限公司 | Internal storage mirroring method, device and system |
| CN104519516A (en) * | 2013-09-29 | 2015-04-15 | 华为技术有限公司 | Method and device for testing memory |
| CN105976868A (en) * | 2016-05-05 | 2016-09-28 | 浪潮电子信息产业股份有限公司 | Method for improving reliability of memory through fault isolation technology |
| CN106021014A (en) * | 2016-05-12 | 2016-10-12 | 浪潮电子信息产业股份有限公司 | Memory management method and device |
Family Cites Families (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN103049220B (en) * | 2012-12-19 | 2016-05-25 | 华为技术有限公司 | Storage controlling method, memory control device and solid-state memory system |
| CN103942013B (en) * | 2014-04-21 | 2016-09-07 | 北京网视通联科技有限公司 | High-speed read-write and mass-storage system and method for work thereof under a kind of ARM platform |
Also Published As
| Publication number | Publication date |
|---|---|
| CN109343986A (en) | 2019-02-15 |
Similar Documents
| Publication | Publication Date | Title |
|---|---|---|
| US12399782B2 (en) | | System and device for data recovery for ephemeral storage |
| JP6009095B2 (en) | | Storage system and storage control method |
| US9983935B2 (en) | | Storage checkpointing in a mirrored virtual machine system |
| US20130254457A1 (en) | | Methods and structure for rapid offloading of cached data in a volatile cache memory of a storage controller to a nonvolatile memory |
| US9239797B2 (en) | | Implementing enhanced data caching and takeover of non-owned storage devices in dual storage device controller configuration with data in write cache |
| JP5392594B2 (en) | | Virtual machine redundancy system, computer system, virtual machine redundancy method, and program |
| WO2015010327A1 (en) | | Data sending method, data receiving method and storage device |
| JPH07117903B2 (en) | | Disaster recovery method |
| US8762771B2 (en) | | Method for completing write operations to a RAID drive pool with an abnormally slow drive in a timely fashion |
| JP2017091456A (en) | | Control device, control program, and control method |
| US11068337B2 (en) | | Data processing apparatus that disconnects control circuit from error detection circuit and diagnosis method |
| JP2007149085A (en) | | Run initialization code to configure connected devices |
| US10235255B2 (en) | | Information processing system and control apparatus |
| CN109343986B (en) | | Method and computer system for handling memory failures |
| JP6540334B2 (en) | | SYSTEM, INFORMATION PROCESSING DEVICE, AND INFORMATION PROCESSING METHOD |
| US20090177916A1 (en) | | Storage system, controller of storage system, control method of storage system |
| JP4394533B2 (en) | | Disk array system |
| US8041850B2 (en) | | Storage apparatus and data integrity assurance method |
| CN116414616A (en) | | SSD (solid state disk) fault recovery method, SSD and SSD system |
| JP2004102395A (en) | | Method for obtaining memory dump data, information processing apparatus, and program therefor |
| US20150019822A1 (en) | | System for Maintaining Dirty Cache Coherency Across Reboot of a Node |
| US20160026537A1 (en) | | Storage system |
| JP6356822B2 (en) | | Computer system and memory dump method |
| US10656867B2 (en) | | Computer system, data management method, and data management program |
| WO2025113322A1 (en) | | Storage system, data access method, and storage subsystem |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| | PB01 | Publication | |
| | SE01 | Entry into force of request for substantive examination | |
| | GR01 | Patent grant | |