
CN111104283B - A fault detection method, device, equipment and medium for a distributed storage system - Google Patents

Info

Publication number
CN111104283B
Authority
CN
China
Prior art keywords
storage system
distributed storage
fault
threshold value
node
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN201911207102.XA
Other languages
Chinese (zh)
Other versions
CN111104283A (en)
Inventor
甄天桥
孟祥瑞
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
IEIT Systems Co Ltd
Original Assignee
Inspur Electronic Information Industry Co Ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Electronic Information Industry Co Ltd
Priority to CN201911207102.XA
Publication of CN111104283A
Application granted
Publication of CN111104283B
Legal status: Active
Anticipated expiration

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3003 Monitoring arrangements specially adapted to the computing system or computing system component being monitored
    • G06F11/3034 Monitoring arrangements specially adapted to the computing system or computing system component being monitored where the computing system component is a storage system, e.g. DASD based or network based
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3055 Monitoring arrangements for monitoring the status of the computing system or of the computing system component, e.g. monitoring if the computing system is on, off, available, not available
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/30 Monitoring
    • G06F11/3065 Monitoring arrangements determined by the means or processing involved in reporting the monitored data
    • G06F11/3072 Monitoring arrangements determined by the means or processing involved in reporting the monitored data where the reporting involves data filtering, e.g. pattern matching, time or event triggered, adaptive or policy-based reporting

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Computing Systems (AREA)
  • Quality & Reliability (AREA)
  • General Engineering & Computer Science (AREA)
  • General Physics & Mathematics (AREA)
  • Computer Vision & Pattern Recognition (AREA)
  • Mathematical Physics (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The present application discloses a fault detection method, apparatus, device and computer-readable storage medium for a distributed storage system. The method includes: determining a fault threshold by using the calculation rule corresponding to the storage pool type of the distributed storage system; obtaining, for each node in the distributed storage system, the number of times the node has been reported as abnormal; and determining the fault condition of the distributed storage system according to the report counts and the fault threshold. Because the fault threshold in this method is determined from the storage pool type using the corresponding calculation rule, determining the fault condition from the report counts and this threshold avoids the misjudgment that arises when nodes whose back-end network has failed falsely report other nodes as abnormal, improves the accuracy of fault detection for the distributed storage system, and helps safeguard the normal use of the entire distributed storage system.

Figure 201911207102

Description

A fault detection method, apparatus, device and medium for a distributed storage system

Technical Field

The present invention relates to the field of distributed storage systems, and in particular to a fault detection method, apparatus, device and computer-readable storage medium for a distributed storage system.

Background

In a distributed storage system, a daemon process (or service) runs on each node to provide access to and monitoring of the hard disks in the storage pool. The daemons (or services) on different nodes exchange heartbeat messages to detect whether the peer daemon (or service) is working normally.

Each node has a front-end network and a back-end network. The front-end network carries client traffic, and the back-end network carries message communication and data exchange within the cluster. To probe network connectivity, the daemons on different nodes perform heartbeat detection over both the front-end and the back-end network, and each node exchanges messages with the cluster management process over the front-end network. In this situation, if the back-end network of individual nodes fails (the actual faulty nodes) so that they cannot communicate with other nodes over the back-end network, the other nodes report these actual faulty nodes to the cluster management process as abnormal. The actual faulty nodes, being unable to communicate with the other nodes, also report the other nodes as abnormal through their own front-end networks.

In the prior art, a fixed fault threshold is set in advance, and a node is judged to be a faulty node when the number of times it has been reported as abnormal exceeds that threshold. This approach has a problem. Suppose the distributed storage system currently has two actual faulty nodes. The other nodes report these two nodes as abnormal, and these two nodes in turn report all other nodes as abnormal. Every other node is therefore reported as abnormal by at least these two actual faulty nodes, and because the number of reports against every other node exceeds the preset threshold, the cluster management process marks all nodes as faulty, making the whole cluster unavailable. In fact only two nodes have actually failed, and the cluster may still be usable. Thus, with the prior-art fault detection method for a distributed storage system, a back-end network failure on a node leads to false abnormality reports, which affects the normal use of the entire distributed storage system.
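The failure mode described above can be reproduced with a small simulation. This is only a sketch under stated assumptions: the five node names, the fixed threshold of 1, and the rule that a heartbeat fails whenever either end has a broken back-end network are all illustrative choices, not values taken from the patent.

```python
from collections import Counter

def simulate_reports(nodes, backend_down):
    """Count how many times each node is reported as abnormal.

    Assumption: a back-end heartbeat between two nodes fails if either
    node's back-end network is down, and the reporter then reports the
    target to the cluster management process over the front-end network.
    """
    counts = Counter({node: 0 for node in nodes})
    for reporter in nodes:
        for target in nodes:
            if reporter == target:
                continue
            if reporter in backend_down or target in backend_down:
                counts[target] += 1
    return counts

nodes = ["n1", "n2", "n3", "n4", "n5"]   # hypothetical cluster
backend_down = {"n1", "n2"}              # the two actual faulty nodes
FIXED_THRESHOLD = 1                      # hypothetical prior-art fixed value

counts = simulate_reports(nodes, backend_down)
flagged = {node for node, c in counts.items() if c > FIXED_THRESHOLD}
print(dict(counts))
print(sorted(flagged))
```

In this sketch each healthy node is reported twice (once by each actual faulty node), so with the fixed threshold all five nodes end up flagged even though only two have failed.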

Therefore, how to improve the accuracy of fault detection for a distributed storage system, and thereby help safeguard its normal use, is a technical problem that those skilled in the art currently need to solve.

Summary of the Invention

In view of this, an object of the present invention is to provide a fault detection method for a distributed storage system that improves the accuracy of fault detection and helps safeguard the normal use of the distributed storage system. Another object of the present invention is to provide a fault detection apparatus, device and computer-readable storage medium for a distributed storage system, all of which have the same beneficial effects.

To solve the above technical problem, the present invention provides a fault detection method for a distributed storage system, including:

determining a fault threshold by using the calculation rule corresponding to the storage pool type of the distributed storage system;

obtaining, for each node in the distributed storage system, the number of times the node has been reported as abnormal;

determining the fault condition of the distributed storage system according to the report counts and the fault threshold.

Preferably, determining the fault threshold by using the calculation rule corresponding to the storage pool type of the distributed storage system specifically includes:

if the storage pool of the distributed storage system is of the replica type, obtaining a first number, which is the number of hard disks in the storage pool that belong to the same placement group;

setting a value greater than half of the first number as the fault threshold.

Preferably, determining the fault threshold by using the calculation rule corresponding to the storage pool type of the distributed storage system specifically includes:

if the storage pool of the distributed storage system is of the erasure-coded type, obtaining a second number, which is the number of redundant data blocks computed from the data blocks in the storage pool;

setting a value greater than the second number as the fault threshold.

Preferably, when there are multiple storage pools on a first node, the method further includes:

obtaining the fault threshold corresponding to each of the multiple storage pools on the first node;

setting the maximum of these fault thresholds as the fault threshold of the first node.

Preferably, obtaining, for each node in the distributed storage system, the number of times the node has been reported as abnormal is specifically:

obtaining, at a preset time interval, the number of times each node in the distributed storage system has been reported as abnormal.

Preferably, after determining the fault condition of the distributed storage system according to the report counts and the fault threshold, the method further includes:

displaying, on a display device, the number of nodes currently judged to be faulty nodes.

Preferably, after determining the fault condition of the distributed storage system according to the report counts and the fault threshold, the method further includes:

issuing corresponding prompt information.

To solve the above technical problem, the present invention further provides a fault detection apparatus for a distributed storage system, including:

a threshold determination module, configured to determine a fault threshold by using the calculation rule corresponding to the storage pool type of the distributed storage system;

an obtaining module, configured to obtain, for each node in the distributed storage system, the number of times the node has been reported as abnormal;

a fault determination module, configured to determine the fault condition of the distributed storage system according to the report counts and the fault threshold.

Preferably, when there are multiple storage pools on a first node, the apparatus further includes:

a first obtaining module, configured to obtain the fault threshold corresponding to each of the multiple storage pools on the first node;

a setting module, configured to set the maximum of these fault thresholds as the fault threshold of the first node.

Preferably, the apparatus further includes:

a display module, configured to display, on a display device, the number of nodes currently judged to be faulty nodes.

Preferably, the apparatus further includes:

a prompt module, configured to issue corresponding prompt information.

To solve the above technical problem, the present invention further provides a fault detection device for a distributed storage system, including:

a memory, configured to store a computer program;

a processor, configured to implement the steps of any of the above fault detection methods for a distributed storage system when executing the computer program.

To solve the above technical problem, the present invention further provides a computer-readable storage medium on which a computer program is stored. When the computer program is executed by a processor, the steps of any of the above fault detection methods for a distributed storage system are implemented.

Compared with the prior-art approach of presetting a fixed fault threshold and determining the fault condition of the distributed storage system from that fixed value, the fault detection method provided by the present invention determines the fault threshold from the storage pool type of the storage system using the corresponding calculation rule. Determining the fault condition from the report counts and this threshold therefore avoids the misjudgment that arises when nodes whose back-end network has failed falsely report other nodes as abnormal, improves the accuracy of fault detection for the distributed storage system, and helps safeguard the normal use of the entire distributed storage system.

To solve the above technical problem, the present invention further provides a fault detection apparatus, device and computer-readable storage medium for a distributed storage system, all of which have the same beneficial effects.

Brief Description of the Drawings

To describe the technical solutions of the embodiments of the present invention or of the prior art more clearly, the drawings needed for that description are briefly introduced below. The drawings described below show only some embodiments of the present invention, and those of ordinary skill in the art can derive other drawings from them without creative effort.

FIG. 1 is a flowchart of a fault detection method for a distributed storage system according to an embodiment of the present invention;

FIG. 2 is a structural diagram of a fault detection apparatus for a distributed storage system according to an embodiment of the present invention;

FIG. 3 is a structural diagram of a fault detection device for a distributed storage system according to an embodiment of the present invention.

Detailed Description

The technical solutions in the embodiments of the present invention are described clearly and completely below with reference to the drawings. The described embodiments are only some, not all, of the embodiments of the present invention. All other embodiments obtained by those of ordinary skill in the art from these embodiments without creative effort fall within the protection scope of the present invention.

The core of the embodiments of the present invention is to provide a fault detection method for a distributed storage system that improves the accuracy of fault detection and helps safeguard the normal use of the distributed storage system. Another core of the present invention is to provide a fault detection apparatus, device and computer-readable storage medium for a distributed storage system, all of which have the same beneficial effects.

To help those skilled in the art better understand the solution of the present invention, the invention is described in further detail below with reference to the drawings and specific embodiments.

FIG. 1 is a flowchart of a fault detection method for a distributed storage system according to an embodiment of the present invention. As shown in FIG. 1, the fault detection method includes:

S10: Determine a fault threshold by using the calculation rule corresponding to the storage pool type of the distributed storage system.

Specifically, in this embodiment, the storage pool type of the distributed storage system is determined first. Classified by fault-tolerance mechanism, the storage pool types include the erasure-coded type and the replica type. The fault threshold is then calculated using the calculation rule that corresponds to the storage pool type.

S20: Obtain, for each node in the distributed storage system, the number of times the node has been reported as abnormal.

Specifically, when there are actual faulty nodes in the distributed storage system and the failure is in the front-end network, the other, normal nodes report these actual faulty nodes to the cluster management process as abnormal, while the actual faulty nodes, whose front-end network is down, cannot report other normal nodes as abnormal. In this case the cluster management server can accurately determine the fault condition of the distributed storage system from the report counts.

This embodiment is mainly concerned with the case in which the actual faulty node has a back-end network failure. The actual faulty node then cannot communicate with the other nodes, so the other nodes report it to the cluster management process as abnormal, and at the same time the actual faulty node reports the other nodes to the cluster management process as abnormal. In this situation, the number of times each node in the distributed storage system has been reported as abnormal is obtained.
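One way to picture step S20 is as a tally over the reports received by the cluster management process during one collection period. The sketch below is hypothetical: the `(timestamp, reporter, target)` tuple layout, the half-open time window, and the choice to count each reporter at most once per target are assumptions for illustration and are not specified in the patent.

```python
from collections import Counter

def tally_reports(reports, window_start, window_end):
    """Count, per target node, the distinct reporters seen in [start, end).

    `reports` is an iterable of (timestamp, reporter, target) tuples; this
    record layout is an assumption for the sketch.
    """
    seen = set()
    for timestamp, reporter, target in reports:
        if window_start <= timestamp < window_end and reporter != target:
            seen.add((reporter, target))   # repeated reports collapse here
    return Counter(target for _, target in seen)

reports = [
    (1, "n2", "n1"),
    (2, "n3", "n1"),
    (3, "n2", "n1"),    # same reporter again, counted once
    (4, "n1", "n3"),
    (12, "n4", "n1"),   # outside the window below
]
counts = tally_reports(reports, window_start=0, window_end=10)
print(dict(counts))
```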

S30: Determine the fault condition of the distributed storage system according to the report counts and the fault threshold.

After the fault threshold has been determined and the number of times each node in the distributed storage system has been reported as abnormal has been obtained, the fault threshold that serves as the reference value is selected according to the storage pool type corresponding to the node, and the report count is compared with it. If the report count of a target node is greater than the fault threshold, the target node is judged to be a faulty node; otherwise, the target node remains in the normal state.

In practice, after the nodes whose report counts exceed the corresponding fault threshold have been judged to be faulty nodes, the number of faulty nodes is determined. A preset node threshold, which corresponds to the storage pool type and is used to decide whether a cluster failure has occurred, is then obtained, and the number of faulty nodes is compared with it. If the number is greater than the node threshold, the distributed storage system is judged to have a cluster failure. If not, then although the storage pool currently contains faulty nodes, it can still be used normally, that is, the cluster is normal.
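The two comparisons in step S30 can be sketched as two small functions. This is an illustrative sketch rather than the patented implementation: the function names and the sample counts and thresholds are hypothetical, and the strict "greater than" comparisons follow the wording of this step.

```python
def find_faulty_nodes(report_counts, fault_threshold):
    """Nodes whose report count is greater than the fault threshold."""
    return {node for node, count in report_counts.items()
            if count > fault_threshold}

def cluster_failed(faulty_nodes, node_threshold):
    """A cluster failure is declared when faulty nodes exceed the node threshold."""
    return len(faulty_nodes) > node_threshold

# Hypothetical inputs: five nodes, two of which lost their back-end network.
report_counts = {"n1": 4, "n2": 4, "n3": 2, "n4": 2, "n5": 2}
faulty = find_faulty_nodes(report_counts, fault_threshold=3)
failed = cluster_failed(faulty, node_threshold=2)
print(sorted(faulty), failed)
```

With these sample values only `n1` and `n2` are judged faulty, and because two faulty nodes do not exceed the node threshold of 2, the cluster is still treated as normal.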

It should be noted that different types of storage pools tolerate different numbers of faulty nodes, so their node thresholds also differ. The node threshold is set as follows:

Specifically, in an erasure-coded storage pool, the specified erasure coding algorithm divides the original data equally into K data blocks and then computes M redundant blocks from these K data blocks, giving K+M blocks in total. The K+M blocks are written to K+M hard disks, one distinct block per disk, and at any time the original data can be restored by reading any K of the K+M blocks. In other words, an erasure-coded storage pool needs at least K intact blocks to restore the original data; otherwise data is lost. The node threshold of an erasure-coded storage pool is therefore (K+M-K), that is, the number M of redundant blocks.

That is, as long as the number of faulty nodes does not exceed the node threshold M, the cluster can be used normally. If the number of faulty nodes exceeds M, for example (M+1), the number of normal nodes is less than K and the original data cannot be restored, so the cluster is judged to have failed.

Specifically, in a replica storage pool, a replica is data identical to the original data, and N replicas means that the original data is written N times to N hard disks, each disk holding a copy identical to those on the other disks. For a replica storage pool, the original data can therefore be obtained as long as one hard disk is normal and holds one intact copy. The node threshold of a replica storage pool is therefore (N-1).

In other words, as long as the number of faulty nodes does not exceed the node threshold (N-1), the cluster can be used normally. If the number of faulty nodes exceeds (N-1), that is, it equals N, the number of normal nodes is less than 1 and the original data cannot be obtained, so the cluster has failed.
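The node thresholds derived above (M for a K+M erasure-coded pool, N-1 for an N-replica pool) can be written down directly. The following is a minimal sketch; the function names, the `"erasure"` and `"replica"` type strings, and the sample pool sizes are hypothetical.

```python
def node_threshold(pool_type, n_replicas=None, k=None, m=None):
    """Largest number of faulty nodes the pool tolerates."""
    if pool_type == "replica":
        return n_replicas - 1      # one intact copy is enough
    if pool_type == "erasure":
        return (k + m) - k         # equals m: at least k blocks must survive
    raise ValueError(f"unknown pool type: {pool_type}")

def cluster_usable(faulty_count, threshold):
    """The cluster stays usable while faulty nodes do not exceed the threshold."""
    return faulty_count <= threshold

ec_threshold = node_threshold("erasure", k=4, m=2)
rep_threshold = node_threshold("replica", n_replicas=3)
print(ec_threshold, cluster_usable(2, ec_threshold), cluster_usable(3, ec_threshold))
print(rep_threshold, cluster_usable(2, rep_threshold), cluster_usable(3, rep_threshold))
```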

Compared with the prior-art approach of presetting a fixed fault threshold and determining the fault condition of the distributed storage system from that fixed value, the fault detection method provided by this embodiment of the present invention determines the fault threshold from the storage pool type of the storage system using the corresponding calculation rule. Determining the fault condition from the report counts and this threshold therefore avoids the misjudgment that arises when nodes whose back-end network has failed falsely report other nodes as abnormal, improves the accuracy of fault detection for the distributed storage system, and helps safeguard the normal use of the entire distributed storage system.

Building on the above embodiment, this embodiment further explains and optimizes the technical solution. Specifically, in this embodiment, determining the fault threshold by using the calculation rule corresponding to the storage pool type of the distributed storage system includes:

if the storage pool of the distributed storage system is of the replica type, obtaining a first number, which is the number of hard disks in the storage pool that belong to the same placement group;

setting a value greater than half of the first number as the fault threshold.

Specifically, the first number, i.e. the number of hard disks in the storage pool that belong to the same placement group, is obtained first. Hard disks belonging to the same placement group are hard disks on which identical data has been written, so the first number is the number of replicas N. A value greater than half of the first number is then set as the fault threshold, that is, the fault threshold is greater than N/2. In a preferred implementation, the fault threshold for a replica storage pool may be (N/2+1).

It can be understood that when the back-end network of a node in the distributed storage system becomes abnormal (an actual faulty node), the nodes in the cluster report one another as abnormal. Following the majority rule, the target node is judged to be a faulty node only when (N/2+1) nodes report it as abnormal, that is, when the number of times it has been reported as abnormal is (N/2+1); otherwise the target node is kept in the normal state.
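As a worked illustration of the replica rule, the sketch below computes (N/2+1) for a few replica counts. Reading N/2 as integer (floor) division is an assumption made here; the text does not say how odd N is rounded. The function name is hypothetical.

```python
def replica_fault_threshold(n_replicas):
    """Majority-style threshold for a replica pool: floor(N/2) + 1.

    Assumption: N/2 in the text is read as integer division.
    """
    return n_replicas // 2 + 1

thresholds = {n: replica_fault_threshold(n) for n in (2, 3, 4, 5)}
for n, t in thresholds.items():
    # Under this reading the threshold is always more than half of N.
    assert t > n / 2
print(thresholds)
```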

If the storage pool of the distributed storage system is of the erasure-coded type, a second number is obtained, which is the number of redundant data blocks computed from the data blocks in the storage pool;

a value greater than the second number is set as the fault threshold.

Specifically, a piece of original data is divided equally into K data blocks, M redundant blocks are computed from the K data blocks using the erasure coding algorithm, and the number M of redundant blocks is taken as the second number. A value greater than the second number is then set as the fault threshold. In a preferred implementation, this embodiment sets the fault threshold of an erasure-coded storage pool to (M+1).

Specifically, for a cluster whose storage pool is a K+M erasure-coded pool, a target node is judged to be a faulty node if the number of times it has been reported as abnormal is greater than (M+1). Because at least (M+1) nodes must report the target node as abnormal, when M or fewer nodes are actual faulty nodes, the other nodes report these actual faulty nodes as abnormal and the actual faulty nodes also report the other nodes as abnormal, but since the number of actual faulty nodes is at most M, the cluster management process judges only the actual faulty nodes to be abnormal and does not regard the other, normal nodes as abnormal. The normal nodes thus stay in the normal state, the number of actual faulty nodes is within M, the number of normal nodes is at least K, and the cluster can still work normally.
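The argument above can be checked with simple arithmetic. The sketch below assumes a hypothetical 4+2 pool spread over K+M = 6 nodes, and assumes that every node reports each peer it cannot reach exactly once; the function names are illustrative and do not come from the patent.

```python
def erasure_fault_threshold(m):
    """Preferred fault threshold for a K+M erasure-coded pool: M + 1."""
    return m + 1

def reports_against_healthy_node(actual_faulty):
    """Only the actual faulty nodes report a healthy node, once each."""
    return actual_faulty

def reports_against_faulty_node(total_nodes):
    """Every other node fails to reach a faulty node and reports it."""
    return total_nodes - 1

K, M = 4, 2                      # hypothetical pool layout
total_nodes = K + M
threshold = erasure_fault_threshold(M)

for actual_faulty in range(1, M + 1):
    # A healthy node collects at most M reports, below the threshold.
    assert reports_against_healthy_node(actual_faulty) < threshold
    # An actual faulty node collects reports from all five peers.
    assert reports_against_faulty_node(total_nodes) > threshold
print(threshold)
```

With up to M actual faulty nodes, healthy nodes never reach the threshold, while the actual faulty nodes exceed it, which is the separation the paragraph above relies on.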

Thus, setting the fault threshold as described in this embodiment avoids falsely reporting a cluster fault while the distributed storage system can still work normally.

Building on the above embodiment, this embodiment further describes and optimizes the technical solution. Specifically, in this embodiment, when there are multiple storage pools on a first node, the method further includes:

obtaining the fault threshold corresponding to each of the storage pools on the first node;

setting the maximum of these fault thresholds as the fault threshold of the first node.

It can be understood that, in practice, a node in the distributed storage system (referred to as the first node) may host multiple storage pools. In that case, the maximum of the pools' fault thresholds is set as the fault threshold of the first node, and the first node is determined to be a faulty node only when the number of times it is reported as abnormal exceeds that maximum fault threshold.

Take erasure-coded storage pools as an example. If two erasure-coded pools exist at the same time, the first with 2+1 redundancy (K=2, M=1) and the second with 3+2 redundancy (K=3, M=2), the redundancy of the second pool prevails. Therefore, the first node is determined to be a faulty node only when M+1=2+1=3 nodes report it as abnormal at the same time.
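
The two-pool example can be reproduced as follows. The function name `node_fault_threshold` and the `(K, M)` tuple layout are illustrative assumptions; the per-pool threshold M+1 is the preferred value stated in the description.

```python
def node_fault_threshold(pool_thresholds):
    """Return the fault threshold of a node that hosts several storage pools."""
    thresholds = list(pool_thresholds)
    if not thresholds:
        raise ValueError("the node hosts no storage pool")
    return max(thresholds)


# Two erasure-coded pools on the same node, written as (K, M).
pools = [(2, 1), (3, 2)]
per_pool = [m + 1 for _, m in pools]   # [2, 3]
print(node_fault_threshold(per_pool))  # 3
```

Taking the maximum means the node is judged by the strictest requirement among the pools it hosts.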

Thus, this embodiment further considers how to set the fault threshold when a node hosts multiple storage pools, which further improves the accuracy of fault detection in the distributed storage system.

Building on the above embodiment, this embodiment further describes and optimizes the technical solution. Specifically, in this embodiment, the process of obtaining the number of times each node in the distributed storage system is reported as abnormal is:

obtaining, at a preset time period, the number of times each node in the distributed storage system is reported as abnormal.

Specifically, in this embodiment, the number of times each node in the distributed storage system is reported as abnormal is obtained at a preset time period. It can be understood that report counts are generally obtained for a certain time span; in this embodiment, the count for each node within a preset time span is further obtained once per preset time period. This embodiment does not limit the length of the preset time period, which is set according to the actual situation.
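
One way to picture the periodic acquisition is to count only the reports whose timestamps fall inside the current period. The sketch below is an assumption-laden illustration: the function name, the `(timestamp, target)` record layout, and the half-open window `[start, start + length)` are not specified by the patent.

```python
from collections import Counter


def count_reports_in_period(reports, period_start, period_length):
    """Count, per target node, the reports that fall inside one period.

    `reports` is an iterable of (timestamp, target) pairs; a report is
    counted when period_start <= timestamp < period_start + period_length.
    """
    period_end = period_start + period_length
    return Counter(
        target for timestamp, target in reports
        if period_start <= timestamp < period_end
    )


reports = [(1.0, "n1"), (2.5, "n1"), (9.9, "n2"), (10.0, "n1"), (12.0, "n2")]
print(dict(count_reports_in_period(reports, 0.0, 10.0)))   # {'n1': 2, 'n2': 1}
print(dict(count_reports_in_period(reports, 10.0, 10.0)))  # {'n1': 1, 'n2': 1}
```

Recomputing the counts for each period lets a node that has recovered drop back below the threshold in the next period.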

Thus, obtaining the report counts at a preset time period keeps the acquired state of each node up to date, so that faults in the distributed storage system can be identified more promptly.

Building on the above embodiment, this embodiment further describes and optimizes the technical solution. Specifically, after determining the fault condition of the distributed storage system according to the report counts and the fault threshold, this embodiment further includes:

displaying, on a display device, the number of nodes currently determined to be faulty nodes.

Specifically, in this embodiment, because the report counts are obtained at a preset time period, the number of nodes determined to be faulty may change over time. This embodiment therefore further uses a display device to display that number.

Specifically, the display device may be a TFT (Thin Film Transistor) liquid crystal display, a UFB (Ultra Fine & Bright) liquid crystal display, an OLED (Organic Light-Emitting Diode) display, or the like; this embodiment does not limit this.

In addition, this embodiment does not limit how the number of nodes is displayed; for example, it may be shown as text, an image, or an animation, as set according to actual requirements.

Thus, by further displaying the number of nodes currently determined to be faulty, this embodiment lets the user see directly how many nodes in the distributed storage system are currently judged faulty, which further improves the user experience.

Building on the above embodiment, this embodiment further describes and optimizes the technical solution. Specifically, after determining the fault condition of the distributed storage system according to the report counts and the fault threshold, this embodiment further includes:

issuing corresponding prompt information.

Specifically, in this embodiment, after the fault condition of the distributed storage system is determined according to the report counts and the fault threshold, if a cluster fault has occurred in the distributed storage system, a prompt device is further used to issue corresponding prompt information, alerting the operator that the cluster is currently unavailable, cannot be read or written, or may lose data.

It should be noted that this embodiment does not limit the specific type of prompt device used to issue the prompt information. In a preferred implementation, the prompt device may be a buzzer and/or an indicator light, with the corresponding prompt information defined by the different working states of the prompt device.

Thus, by issuing corresponding prompt information after determining that a cluster fault has occurred, this embodiment alerts the operator to the fault in the distributed storage system promptly and effectively, which further improves the user experience.

The embodiments of the fault detection method for a distributed storage system provided by the present invention have been described in detail above. The present invention also provides a corresponding fault detection apparatus, device, and computer-readable storage medium for a distributed storage system. Because the embodiments of the apparatus, device, and computer-readable storage medium correspond to the embodiments of the method, refer to the description of the method embodiments for details, which are not repeated here.

FIG. 2 is a structural diagram of a fault detection apparatus for a distributed storage system provided by an embodiment of the present invention. As shown in FIG. 2, the fault detection apparatus includes:

a threshold determination module 21, configured to determine the fault threshold by applying the calculation rule that corresponds to the storage pool type of the distributed storage system;

an acquisition module 22, configured to obtain the number of times each node in the distributed storage system is reported as abnormal;

a fault determination module 23, configured to determine the fault condition of the distributed storage system according to the report counts and the fault threshold.
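
The three modules can be wired together roughly as below. This is a minimal sketch, not the apparatus itself: all class, field, and method names are hypothetical, the replica rule picks the smallest integer greater than half of the first number, the erasure rule uses the preferred value M+1, and a node is assumed to be judged faulty once its report count reaches the threshold.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class Pool:
    kind: str                # "replica" or "erasure"
    first_number: int = 0    # replica: hard disks in the same placement group
    second_number: int = 0   # erasure: number of redundant blocks M


class ThresholdModule:
    def determine(self, pool: Pool) -> int:
        if pool.kind == "replica":
            # Assumption: smallest integer greater than half of the first number.
            return pool.first_number // 2 + 1
        if pool.kind == "erasure":
            # Preferred value from the description: M + 1.
            return pool.second_number + 1
        raise ValueError(f"unknown pool type: {pool.kind}")


class AcquisitionModule:
    def acquire(self, reported_targets) -> Counter:
        """Count how many times each node was reported as abnormal."""
        return Counter(reported_targets)


class FaultDeterminationModule:
    def determine(self, counts, threshold):
        """Assumption: a node is faulty once its count reaches the threshold."""
        return sorted(node for node, n in counts.items() if n >= threshold)


threshold = ThresholdModule().determine(Pool("erasure", second_number=2))
counts = AcquisitionModule().acquire(["n4", "n4", "n4", "n1", "n5", "n5"])
print(threshold, FaultDeterminationModule().determine(counts, threshold))  # 3 ['n4']
```

Keeping the threshold rule in its own module means a new pool type only changes `ThresholdModule`, while acquisition and determination stay the same.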

The fault detection apparatus for a distributed storage system provided by the embodiment of the present invention has the beneficial effects of the fault detection method described above.

In a preferred implementation, when there are multiple storage pools on the first node, the apparatus further includes:

a first acquisition module, configured to obtain the fault threshold corresponding to each of the storage pools on the first node;

a setting module, configured to set the maximum of these fault thresholds as the fault threshold of the first node.

In a preferred implementation, the apparatus further includes:

a display module, configured to display, on a display device, the number of nodes currently determined to be faulty nodes.

In a preferred implementation, the apparatus further includes:

a prompt module, configured to issue corresponding prompt information.

FIG. 3 is a structural diagram of a fault detection device for a distributed storage system provided by an embodiment of the present invention. As shown in FIG. 3, the fault detection device includes:

a memory 31, configured to store a computer program;

a processor 32, configured to implement the steps of the above fault detection method for a distributed storage system when executing the computer program.

The fault detection device for a distributed storage system provided by the embodiment of the present invention has the beneficial effects of the fault detection method described above.

To solve the above technical problem, the present invention also provides a computer-readable storage medium on which a computer program is stored; when executed by a processor, the computer program implements the steps of the above fault detection method for a distributed storage system.

The computer-readable storage medium provided by the embodiment of the present invention has the beneficial effects of the fault detection method described above.

The fault detection method, apparatus, device, and computer-readable storage medium for a distributed storage system provided by the present invention have been described in detail above. Specific embodiments are used herein to explain the principles and implementations of the present invention, and the description of the above embodiments is intended only to help in understanding the method of the present invention and its core idea. It should be noted that those of ordinary skill in the art may make improvements and modifications to the present invention without departing from its principles, and such improvements and modifications also fall within the protection scope of the claims of the present invention.

The embodiments in this specification are described in a progressive manner. Each embodiment focuses on its differences from the others, and the same or similar parts of the embodiments may be understood by reference to one another. Because the apparatus disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief; refer to the description of the method for the relevant details.

Those skilled in the art will further appreciate that the units and algorithm steps of the examples described in connection with the embodiments disclosed herein can be implemented in electronic hardware, computer software, or a combination of the two. To illustrate the interchangeability of hardware and software clearly, the composition and steps of each example have been described above in general terms of their functions. Whether these functions are performed in hardware or software depends on the specific application and the design constraints of the technical solution. Skilled persons may use different methods to implement the described functions for each particular application, but such implementations should not be considered beyond the scope of the present invention.

Claims (8)

1. A fault detection method for a distributed storage system, characterized by comprising: determining a fault threshold by applying the calculation rule that corresponds to the storage pool type of the distributed storage system; obtaining the number of times each node in the distributed storage system is reported as abnormal; and determining the fault condition of the distributed storage system according to the report counts and the fault threshold; wherein the process of determining the fault threshold by applying the calculation rule that corresponds to the storage pool type specifically comprises: if the storage pool of the distributed storage system is of the replica type, obtaining a first number, namely the number of hard disks in the storage pool that belong to the same placement group, and setting a value greater than half of the first number as the fault threshold; and wherein the process of determining the fault threshold by applying the calculation rule that corresponds to the storage pool type further specifically comprises: if the storage pool of the distributed storage system is of the erasure-coded type, obtaining a second number, namely the number of redundant data blocks in the storage pool computed from the data blocks, and setting a value greater than the second number as the fault threshold.

2. The method according to claim 1, characterized in that, when there are multiple storage pools on a first node, the method further comprises: obtaining the fault threshold corresponding to each of the storage pools on the first node; and setting the maximum of these fault thresholds as the fault threshold of the first node.

3. The method according to claim 1, characterized in that the process of obtaining the number of times each node in the distributed storage system is reported as abnormal is specifically: obtaining, at a preset time period, the number of times each node in the distributed storage system is reported as abnormal.

4. The method according to claim 3, characterized in that, after determining the fault condition of the distributed storage system according to the report counts and the fault threshold, the method further comprises: displaying, on a display device, the number of nodes currently determined to be faulty nodes.

5. The method according to any one of claims 1 to 4, characterized in that, after determining the fault condition of the distributed storage system according to the report counts and the fault threshold, the method further comprises: issuing corresponding prompt information.

6. A fault detection apparatus for a distributed storage system, characterized by comprising: a threshold determination module, configured to determine a fault threshold by applying the calculation rule that corresponds to the storage pool type of the distributed storage system; an acquisition module, configured to obtain the number of times each node in the distributed storage system is reported as abnormal; and a fault determination module, configured to determine the fault condition of the distributed storage system according to the report counts and the fault threshold; wherein the threshold determination module is specifically configured to: if the storage pool of the distributed storage system is of the replica type, obtain a first number, namely the number of hard disks in the storage pool that belong to the same placement group, and set a value greater than half of the first number as the fault threshold; and if the storage pool of the distributed storage system is of the erasure-coded type, obtain a second number, namely the number of redundant data blocks in the storage pool computed from the data blocks, and set a value greater than the second number as the fault threshold.

7. A fault detection device for a distributed storage system, characterized by comprising: a memory, configured to store a computer program; and a processor, configured to implement the steps of the fault detection method for a distributed storage system according to any one of claims 1 to 5 when executing the computer program.

8. A computer-readable storage medium, characterized in that a computer program is stored on the computer-readable storage medium, and when executed by a processor, the computer program implements the steps of the fault detection method for a distributed storage system according to any one of claims 1 to 5.

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN201911207102.XA CN111104283B (en) 2019-11-29 2019-11-29 A fault detection method, device, equipment and medium for a distributed storage system

Publications (2)

Publication Number Publication Date
CN111104283A CN111104283A (en) 2020-05-05
CN111104283B true CN111104283B (en) 2022-04-22

Family

ID=70420956

Family Applications (1)

Application Number Title Priority Date Filing Date
CN201911207102.XA Active CN111104283B (en) 2019-11-29 2019-11-29 A fault detection method, device, equipment and medium for a distributed storage system

Country Status (1)

Country Link
CN (1) CN111104283B (en)

Families Citing this family (5)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112162699B (en) * 2020-09-18 2023-12-22 北京浪潮数据技术有限公司 Data reading and writing method, device, equipment and computer readable storage medium
TWI789075B (en) * 2021-10-26 2023-01-01 中華電信股份有限公司 Electronic device and method for detecting abnormal execution of application program
CN114443332B (en) * 2021-12-24 2024-01-09 苏州浪潮智能科技有限公司 Storage pool detection method and device, electronic equipment and storage medium
CN114780442A (en) * 2022-06-22 2022-07-22 杭州悦数科技有限公司 Testing method and device for distributed system
CN115470061A (en) * 2022-10-10 2022-12-13 中电云数智科技有限公司 A distributed storage system I/O sub-health intelligent detection and recovery method

Citations (11)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN101827120A (en) * 2010-02-25 2010-09-08 浪潮(北京)电子信息产业有限公司 Cluster storage method and system
CN103514068A (en) * 2012-06-28 2014-01-15 北京百度网讯科技有限公司 Method for automatically locating internal storage faults
CN104735107A (en) * 2013-12-20 2015-06-24 中国移动通信集团公司 Recovery method and device for data copies in distributed storage system
CN107346273A (en) * 2017-06-14 2017-11-14 北京奇艺世纪科技有限公司 A kind of data reconstruction method, device and electronic equipment
CN107783857A (en) * 2017-10-31 2018-03-09 珠海市魅族科技有限公司 A kind of abnormal restorative procedure and device, computer installation, readable storage medium storing program for executing
CN109144835A (en) * 2018-08-02 2019-01-04 广东浪潮大数据研究有限公司 A kind of automatic prediction method, device, equipment and the medium of application service failure
CN109189352A (en) * 2018-09-11 2019-01-11 宜春小马快印科技有限公司 Printer fault monitoring method, device, system and readable storage medium storing program for executing
CN109213637A (en) * 2018-11-09 2019-01-15 浪潮电子信息产业股份有限公司 Data recovery method, device and medium for cluster nodes of distributed file system
CN109557994A (en) * 2018-11-29 2019-04-02 努比亚技术有限公司 A kind of charge fault monitoring method, equipment and computer can storage mediums
CN109726048A (en) * 2018-12-13 2019-05-07 中国银联股份有限公司 Data recovery method and device in a transaction system
CN110457194A (en) * 2019-08-02 2019-11-15 广东小天才科技有限公司 Electronic equipment stability early warning method, system, device, equipment and storage medium

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US9584160B2 (en) * 2014-02-20 2017-02-28 Quantum Corporation Dynamically configuring erasure code redundancy and distribution

Non-Patent Citations (2)

* Cited by examiner, † Cited by third party
Title
A Low-Power High-Performance Concurrent Fault Detection Approach for the Composite Field S-Box and Inverse S-Box;Mehran Mozaffari-Kermani 等;《IEEE Transactions on Computers》;20110729;第1327-1340页 *
分布式存储系统的数据冗余策略;官斌;《武汉大学学报(工学版)》;20150430;第279-283页 *

Also Published As

Publication number Publication date
CN111104283A (en) 2020-05-05

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant