CN115242820B - A cluster node failure processing method, device, equipment and medium - Google Patents

Info

Publication number
CN115242820B
Authority
CN
China
Prior art keywords
node
heartbeat
target node
data
cluster
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210879795.2A
Other languages
Chinese (zh)
Other versions
CN115242820A (en
Inventor
位风杰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Inspur Jinan data Technology Co ltd
Original Assignee
Inspur Jinan data Technology Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Inspur Jinan data Technology Co ltd filed Critical Inspur Jinan data Technology Co ltd
Priority to CN202210879795.2A priority Critical patent/CN115242820B/en
Publication of CN115242820A publication Critical patent/CN115242820A/en
Application granted granted Critical
Publication of CN115242820B publication Critical patent/CN115242820B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L67/00Network arrangements or protocols for supporting network services or applications
    • H04L67/01Protocols
    • H04L67/10Protocols in which an application is distributed across nodes in the network
    • H04L67/1097Protocols in which an application is distributed across nodes in the network for distributed storage of data in networks, e.g. transport arrangements for network file system [NFS], storage area networks [SAN] or network attached storage [NAS]
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06Management of faults, events, alarms or notifications
    • H04L41/0654Management of faults, events, alarms or notifications using network fault recovery
    • H04L41/0663Performing the actions predefined by failover planning, e.g. switching to standby network elements
    • HELECTRICITY
    • H04ELECTRIC COMMUNICATION TECHNIQUE
    • H04LTRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L43/00Arrangements for monitoring or testing data switching networks
    • H04L43/10Active monitoring, e.g. heartbeat, ping or trace-route

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Health & Medical Sciences (AREA)
  • Cardiology (AREA)
  • General Health & Medical Sciences (AREA)
  • Hardware Redundancy (AREA)

Abstract

The present application discloses a cluster node fault processing method, apparatus, device, and medium, relating to the field of distributed storage. The method is applied to the master node in a cluster trivial database (CTDB) and includes: if neither heartbeat data nor inter-node heartbeat interaction data sent by a first target node in the cluster trivial database is received within a first preset time, querying the inter-node heartbeat interaction data of a second target node from the local node interaction record; determining whether that data contains a heartbeat record with the first target node within the first preset time; if so, setting the current node to a suspended state; if not, determining the first target node to be a faulty node and removing it from the cluster. The proposed heartbeat sending mechanism avoids the prior-art situation in which a master node fault causes a normal node to be misjudged as faulty and removed from the cluster, thereby keeping the cluster highly available.

Description

Cluster node fault processing method, device, equipment and medium
Technical Field
The present invention relates to the field of distributed storage, and in particular, to a method, an apparatus, a device, and a medium for processing a cluster node failure.
Background
CTDB (Cluster Trivial Database) is a lightweight clustered database implementation. It is the cluster database component of clustered Samba and is mainly used to handle Samba's cross-node messages and to implement a distributed TDB database across all cluster nodes. The functions of CTDB include providing a clustered version of the TDB database, automatically rebuilding or recovering the database when a node fails, monitoring the nodes in the cluster and the services running on each node, and managing the pool of public IP addresses used to serve clients.
In the current CTDB high-availability scheme, each node uses heartbeats to judge whether a peer is abnormal: in a CTDB cluster, every node, including the master node, periodically sends heartbeat data to the other nodes. When a node does not receive the heartbeat sent by another node, it considers that peer to be faulty, and only the master node can perform fault handling. For example, if master node A does not receive heartbeat data from node B within a preset time, master node A determines node B to be a faulty node and removes it from the cluster. In some cases, however, master node A fails to receive node B's heartbeat data because master node A itself is faulty. Node B is then removed from the cluster on the strength of a misjudgment, which wastes resources unnecessarily. Moreover, the master node continues to run in a faulty state, which may cause cluster operation faults and service interruption, with serious consequences.
Therefore, how to prevent a master node fault from causing misjudgment, and possibly cluster operation faults, during CTDB cluster operation is a problem to be solved in this field.
Disclosure of Invention
In view of this, the present invention aims to provide a cluster node fault processing method, apparatus, device, and medium which, by modifying the processing logic in CTDB, prevent a normal node from being removed when the master node makes a misjudgment and keep the cluster in a normal state so that it continues to provide service. The specific scheme is as follows:
In a first aspect, the present application discloses a method for processing a cluster node failure, which is applied to a master node in a cluster trivial database, and includes:
if heartbeat data and inter-node heartbeat interaction data sent by a first target node in the cluster trivial database are not received within a first preset time, querying the inter-node heartbeat interaction data of a second target node from a node interaction record in which the inter-node interaction data of each node is stored locally, wherein the node interaction record includes the heartbeat interaction records between the first target node and the second target node in the cluster trivial database;
determining whether a heartbeat record between the first target node and the second target node within the first preset time exists in the inter-node heartbeat interaction data of the second target node;
if such a heartbeat record exists, setting the state of the current node to a suspended state;
and if no such heartbeat record exists, determining the first target node to be a faulty node and removing the faulty node from the cluster.
Optionally, before the query of the inter-node heartbeat interaction data of the second target node from the inter-node interaction record locally stored with the inter-node interaction data of each node, the method further includes:
receiving heartbeat data sent by other nodes in the cluster trivial database and heartbeat interaction data among nodes;
and storing the received heartbeat data and the heartbeat interaction data among the nodes into node interaction records corresponding to the local preset storage positions.
Optionally, the storing the received heartbeat data and the heartbeat interaction data between nodes in a node interaction record corresponding to a local preset storage position includes:
storing, in the node interaction record at the local preset storage location, those items of the received heartbeat data and inter-node heartbeat interaction data whose transmission time falls within a preset time period.
Optionally, the querying the inter-node heartbeat interaction data of the second target node from the node interaction record locally stored with the inter-node interaction data of each node includes:
determining a second target node from the clustered trivial database;
and inquiring the inter-node heartbeat interaction data of the second target node from the node interaction records locally stored with the inter-node interaction data of each node.
Optionally, the determining the second target node from the cluster trivial database includes:
Determining a target node from the cluster trivial database;
judging whether heartbeat data sent by the target node and heartbeat interaction data between nodes are received in preset heartbeat data sending time;
and if the heartbeat data sent by the target node and the heartbeat interaction data between the nodes are received within the preset heartbeat data sending time, determining the target node as a second target node.
Optionally, after determining whether the heartbeat data sent by the target node and the inter-node heartbeat interaction data are received within the preset heartbeat data sending time, the method further includes:
And if the heartbeat data transmitted by the target node and the heartbeat interaction data between nodes are not received within the preset heartbeat data transmission time, the target node is redetermined from the cluster trivial database, and then the step of judging whether the heartbeat data transmitted by the target node and the heartbeat interaction data between nodes are received within the preset heartbeat data transmission time is executed.
Optionally, after the determining the target node, the method further includes:
recording the number of current historical target nodes;
judging whether the number of the historical target nodes meets a preset threshold value or not;
and if the number of historical target nodes meets the preset threshold, setting the state of the current node to a suspended state.
In a second aspect, the present application discloses a cluster node fault handling device, applied to a master node in a cluster trivial database, comprising:
a record query module, configured to query the inter-node heartbeat interaction data of a second target node from a node interaction record in which the inter-node interaction data of each node is stored locally, if heartbeat data and inter-node heartbeat interaction data sent by a first target node in the cluster trivial database are not received within a first preset time;
The record judging module is used for judging whether heartbeat records of the first target node and the second target node exist in the inter-node heartbeat interaction data of the second target node in the first preset time;
The record existence module is used for setting the current node state as a suspension state if the heartbeat record of the first target node and the second target node exists in the first preset time in the inter-node heartbeat interaction data of the second target node;
And the record absence module is used for determining the first target node as a fault node and removing the fault node from the cluster if the heartbeat record of the first target node and the second target node does not exist in the inter-node heartbeat interaction data of the second target node within the first preset time.
In a third aspect, the present application discloses an electronic device, comprising:
A memory for storing a computer program;
And the processor is used for executing the computer program to realize the cluster node fault processing method.
In a fourth aspect, the present application discloses a computer storage medium for storing a computer program, where the computer program when executed by a processor implements the steps of the foregoing disclosed cluster node fault handling method.
In the present application, if heartbeat data and inter-node heartbeat interaction data sent by a first target node in the cluster trivial database are not received within a first preset time, the inter-node heartbeat interaction data of a second target node is queried from a node interaction record in which the inter-node interaction data of each node is stored locally. It is then determined whether a heartbeat record between the first target node and the second target node within the first preset time exists in that data. If it exists, the state of the current node is set to a suspended state; if it does not exist, the first target node is determined to be a faulty node and is removed from the cluster. The application thus provides a novel heartbeat sending mechanism: when a node sends heartbeat data, it also sends the heartbeat interaction data between itself and the other nodes, so that every node stores the heartbeat interaction data among the other nodes. When the master node cannot receive the heartbeat data of a certain node, it can judge from the stored heartbeat interaction data which node is faulty and then remove the faulty node. This avoids the prior-art situation in which a normal node is misjudged as faulty because of a master node fault and is removed from the cluster, and thereby keeps the cluster highly available.
Drawings
In order to more clearly illustrate the embodiments of the present invention or the technical solutions in the prior art, the drawings that are required to be used in the embodiments or the description of the prior art will be briefly described below, and it is obvious that the drawings in the following description are only embodiments of the present invention, and that other drawings can be obtained according to the provided drawings without inventive effort for a person skilled in the art.
FIG. 1 is a flowchart of a cluster node fault processing method provided by the present application;
FIG. 2 is a flowchart of a specific cluster node fault processing method provided by the present application;
FIG. 3 is a schematic structural diagram of a cluster node fault processing apparatus provided by the present application;
FIG. 4 is a structural diagram of an electronic device provided by the present application.
Detailed Description
The following description of the embodiments of the present invention will be made clearly and completely with reference to the accompanying drawings, in which it is apparent that the embodiments described are only some embodiments of the present invention, but not all embodiments. All other embodiments, which can be made by those skilled in the art based on the embodiments of the invention without making any inventive effort, are intended to be within the scope of the invention.
In the prior art, every node in a CTDB cluster, including the master node, periodically sends heartbeat data to the other nodes. When master node A does not receive heartbeat data from a node B within a preset time, node B is determined to be a faulty node and is removed from the cluster. However, master node A may have failed to receive node B's heartbeat data because master node A itself is faulty. To avoid such misjudgment caused by the master node's own fault, the present invention provides a cluster node fault processing method that prevents a normal node from being removed when the master node misjudges, keeps the cluster in a normal state so that it continues to provide service, and quickly removes a node from the cluster once that node is determined to be faulty, thereby ensuring high availability and improving the reliability of the cluster.
The embodiment of the invention discloses a cluster node fault processing method which is applied to a master node in a cluster trivial database, and is described with reference to fig. 1, and the method comprises the following steps:
Step S11, if the heartbeat data and the inter-node heartbeat interaction data sent by the first target node in the cluster trivial database are not received within a first preset time, inquiring the inter-node heartbeat interaction data of the second target node from the inter-node interaction records locally stored with the inter-node interaction data of each node, wherein the inter-node interaction data comprises the heartbeat interaction records of the first target node and the second target node in the cluster trivial database.
In this embodiment, the cluster trivial database is CTDB (Cluster Trivial Database). In the present application, a novel heartbeat data sending mechanism is preset for every node: when a node sends heartbeat data to another node, it also sends the heartbeat interaction data between itself and the other nodes. For example, let A be the master node in a CTDB cluster and B, C, and D be ordinary nodes. In the existing heartbeat mechanism, each node sends heartbeat data to every other node in the cluster: node A sends heartbeat data to nodes B, C, and D; node B to nodes A, C, and D; node C to nodes A, B, and D; and node D to nodes A, B, and C. In the novel mechanism provided by the application, a node sending heartbeat data to the other nodes in the cluster simultaneously sends its heartbeat interaction data with the remaining nodes. That is, when node B sends heartbeat data to node A, it also sends to node A the heartbeat interaction data between node B and node C and between node B and node D; when node A receives the heartbeat data sent by node B, it also records the heartbeat interaction data between node B and node C and between node B and node D.
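The piggybacked heartbeat described above can be pictured with a minimal Python sketch. It is not taken from CTDB or from the patent: the `Heartbeat` class, its field names, and `build_heartbeat` are hypothetical, and representing the heartbeat interaction data as a "last seen" timestamp per peer is an assumption made only for illustration.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Heartbeat:
    """Hypothetical heartbeat message; all field names are illustrative."""
    sender: str
    sent_at: float
    # peer id -> time at which the sender last exchanged a heartbeat with that peer
    peer_last_seen: Dict[str, float] = field(default_factory=dict)


def build_heartbeat(sender: str, recipient: str, now: float,
                    last_seen: Dict[str, float]) -> Heartbeat:
    """Build the message that `sender` sends to `recipient`.

    The sender's interaction data with the remaining nodes is attached.
    Its record about the recipient is left out, mirroring the example in
    which B reports only B-C and B-D data to A.
    """
    peers = {peer: t for peer, t in last_seen.items()
             if peer not in (sender, recipient)}
    return Heartbeat(sender=sender, sent_at=now, peer_last_seen=peers)


# Node B sends a heartbeat to master node A and reports its data for C and D.
hb = build_heartbeat("B", "A", now=10.0,
                     last_seen={"A": 9.5, "C": 9.8, "D": 9.9})
print(hb.peer_last_seen)
```

On receipt, node A would store both the fact that B is alive and B's reported interaction data with C and D.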
In this embodiment, before the inter-node heartbeat interaction data of the second target node is queried from the node interaction record in which the inter-node interaction data of each node is stored locally, the method may further include: receiving heartbeat data and inter-node heartbeat interaction data sent by the other nodes in the cluster trivial database, and storing the received heartbeat data and inter-node heartbeat interaction data in the node interaction record at the local preset storage location.
It can be appreciated that the first preset time is greater than the interval at which heartbeat data is sent between nodes. In this embodiment, the value of the first preset time may be adjusted to the specific application scenario. For example, when heartbeat data is sent once per second, the first preset time may be set to three seconds: each node sends data once per second, and if the master node receives no data from a certain node within three seconds, either that node or the master node is faulty.
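The timing relationship in this example (one heartbeat per second, a three-second first preset time) can be written as a small check. This is a sketch under assumptions: the constant and function names are hypothetical, and timestamps are taken to be seconds on a single monotonic clock.

```python
from typing import Optional

HEARTBEAT_INTERVAL_S = 1.0   # example value: one heartbeat per second
FIRST_PRESET_TIME_S = 3.0    # example value: must exceed the interval


def is_silent(last_received_at: Optional[float], now: float,
              timeout: float = FIRST_PRESET_TIME_S) -> bool:
    """Return True if no heartbeat has arrived within `timeout` seconds.

    `last_received_at` is None when nothing was ever received from the node.
    """
    if last_received_at is None:
        return True
    return now - last_received_at > timeout


print(is_silent(10.0, now=12.5))  # heartbeat 2.5 s ago: not yet silent
print(is_silent(10.0, now=13.5))  # heartbeat 3.5 s ago: silent
```

A silent node is only a trigger for further checks: as the text notes, the cause may lie with that node or with the master node itself.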
In this embodiment, when each node sends heartbeat interaction data between nodes, the data in a preset time period may be selected for sending, or all the historical heartbeat interaction data may be sent. In some preferred embodiments, historical heartbeat interaction data with other nodes over the past 32 seconds may be selected for transmission.
In this embodiment, the storing the received heartbeat data and the heartbeat interaction data between nodes in the node interaction record corresponding to the local preset storage position may include storing the data in the preset time period in the node interaction record corresponding to the local preset storage position in the received heartbeat data and the heartbeat interaction data between nodes.
In a specific embodiment, after the node a receives the heartbeat data sent by the node B and the heartbeat interaction data between the node B and the node C, and between the node B and the node D, the heartbeat interaction data are stored in the node interaction record stored in the local preset storage location. In this embodiment, when each node stores inter-node heartbeat interaction data sent by the peer node, data within a preset time period may be selected for storage, and in some preferred embodiments, the preset time period may be historical heartbeat interaction data within the past 32 seconds.
In a specific embodiment, when each node receives the heartbeat data sent by the opposite end node, the heartbeat record data corresponding to the opposite end node recorded in the node is updated, so that the data recorded in the node is always data meeting the preset time period.
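One way to keep only records from the preset time period is a sliding window that is pruned on every update. The sketch below assumes the 32-second window mentioned as a preferred embodiment. The `InteractionRecord` class and its methods are hypothetical, and timestamps for a given node pair are assumed to arrive in non-decreasing order.

```python
from collections import defaultdict, deque
from typing import Deque, Dict, List, Tuple

WINDOW_S = 32.0  # preset time period from the preferred embodiment


class InteractionRecord:
    """Hypothetical local store of inter-node heartbeat interaction data."""

    def __init__(self, window: float = WINDOW_S) -> None:
        self.window = window
        # (reporter, peer) -> times at which reporter exchanged heartbeats with peer
        self._records: Dict[Tuple[str, str], Deque[float]] = defaultdict(deque)

    def store(self, reporter: str, peer: str, timestamp: float, now: float) -> None:
        """Store one interaction, ignoring data older than the window."""
        if now - timestamp > self.window:
            return
        self._records[(reporter, peer)].append(timestamp)
        self.prune(now)

    def prune(self, now: float) -> None:
        """Drop entries that have aged out (assumes ordered timestamps per pair)."""
        for stamps in self._records.values():
            while stamps and now - stamps[0] > self.window:
                stamps.popleft()

    def timestamps(self, reporter: str, peer: str) -> List[float]:
        return list(self._records.get((reporter, peer), ()))


rec = InteractionRecord()
rec.store("B", "C", 100.0, now=100.5)
rec.store("B", "C", 140.0, now=140.0)  # the 100.0 entry is now 40 s old
print(rec.timestamps("B", "C"))
```

Pruning on each update keeps the stored data within the window without a separate cleanup task, at the cost of a scan over all node pairs. A real implementation might prune only the pair being updated.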
In this embodiment, the querying the inter-node heartbeat interaction data of the second target node from the node interaction record locally stored with the inter-node interaction data of each node may include determining the second target node from the cluster trivial database, and querying the inter-node heartbeat interaction data of the second target node from the node interaction record locally stored with the inter-node interaction data of each node.
In this embodiment, because the interaction records between nodes are already stored in the node interaction record, when master node A does not receive the heartbeat data sent by node B within the first preset time, it checks the locally stored node interaction record. If C is determined to be the second target node at this time, whether node B is operating normally is determined from the interaction data between node B and the second target node C in the node interaction record.
And step S12, judging whether heartbeat records of the first target node and the second target node exist in the inter-node heartbeat interaction data of the second target node in the first preset time.
In this embodiment, since the master node does not receive the heartbeat data of the first target node within the first preset time, when inquiring the node interaction record, it is determined whether the first target node has heartbeat interaction data within the first preset time, so as to determine whether the first target node is abnormal.
And step S13, if the heartbeat record of the first target node and the second target node exists in the inter-node heartbeat interaction data of the second target node within the first preset time, setting the current node state as a suspension state.
In this embodiment, if heartbeat interaction data between the first target node and the second target node within the first preset time exists in the node interaction record, normal heartbeat interaction took place between the two nodes within the first preset time and the first target node is in a normal state. It can then be determined that the master node is in a sub-healthy state. The master node sets its own state to suspended, after which other modules in the cluster bring down the master node's network port to isolate the node, ensuring normal operation of the cluster.
And step S14, if the heartbeat record of the first target node and the heartbeat record of the second target node do not exist in the inter-node heartbeat interaction data of the second target node within the first preset time, determining the first target node as a fault node, and removing the fault node from the cluster.
In this embodiment, if no heartbeat interaction data between the first target node and the second target node within the first preset time exists in the node interaction record, no normal heartbeat interaction took place between the two nodes within the first preset time and the first target node is faulty. The master node then removes the faulty node from the cluster to ensure normal operation of the cluster.
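The decision in steps S12 to S14 reduces to a single lookup in the locally stored record. The following sketch illustrates it under assumptions: the `Verdict` enum and `judge` function are hypothetical names, and the node interaction record is modelled as a plain dictionary from `(reporter, peer)` to a list of timestamps.

```python
from enum import Enum
from typing import Dict, Iterable, Tuple


class Verdict(Enum):
    SUSPEND_SELF = "suspend_self"                # master node is sub-healthy (S13)
    REMOVE_FIRST_TARGET = "remove_first_target"  # first target node is faulty (S14)


def judge(first_target: str, second_target: str,
          record: Dict[Tuple[str, str], Iterable[float]],
          now: float, first_preset_time: float = 3.0) -> Verdict:
    """Decide who is at fault once the first target node has gone silent.

    If the second target node reported a heartbeat exchange with the first
    target node inside the window, the first target node is alive and the
    master node suspends itself; otherwise the first target node is removed.
    """
    stamps = record.get((second_target, first_target), ())
    seen_recently = any(now - first_preset_time <= t <= now for t in stamps)
    return Verdict.SUSPEND_SELF if seen_recently else Verdict.REMOVE_FIRST_TARGET


# C reported exchanging heartbeats with B at t=98 and t=99.
record = {("C", "B"): [98.0, 99.0]}
print(judge("B", "C", record, now=100.0))  # B was seen by C: master suspends
print(judge("B", "C", record, now=110.0))  # no recent record: B is removed
```

The sketch returns a verdict and leaves its execution to the caller, since suspending the master node and removing a node from the cluster are side effects handled elsewhere in the cluster.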
In this embodiment, if heartbeat data and inter-node heartbeat interaction data sent by a first target node in the cluster trivial database are not received within a first preset time, the inter-node heartbeat interaction data of a second target node is queried from the node interaction record in which the inter-node interaction data of each node is stored locally, the record including the heartbeat interaction records between the first target node and the second target node in the cluster trivial database. It is then determined whether a heartbeat record between the first target node and the second target node within the first preset time exists in that data. If it exists, the state of the current node is set to a suspended state; if it does not exist, the first target node is determined to be a faulty node and is removed from the cluster. The invention thus provides a novel heartbeat sending mechanism: when a node sends heartbeat data, it also sends the heartbeat interaction data between itself and the other nodes, so that every node stores the heartbeat interaction data among the other nodes. When the master node cannot receive the heartbeat data of a certain node, it can judge from the stored heartbeat interaction data which node is faulty and then remove the faulty node. This avoids the prior-art situation in which a normal node is misjudged as faulty because of a master node fault and is removed from the cluster, and thereby keeps the cluster highly available.
Fig. 2 is a flowchart of a specific cluster node fault handling method according to an embodiment of the present application. Referring to fig. 2, the method includes:
Step S21, if the heartbeat data and the inter-node heartbeat interaction data sent by the first target node in the cluster trivial database are not received within the first preset time, determining a second target node from the cluster trivial database, and inquiring the inter-node heartbeat interaction data of the second target node from the inter-node interaction records locally stored with the inter-node interaction data of each node, wherein the inter-node interaction data comprises the heartbeat interaction records of the first target node and the second target node in the cluster trivial database.
In this embodiment, determining the second target node from the cluster trivial database may include: determining a target node from the cluster trivial database; determining whether the heartbeat data and inter-node heartbeat interaction data sent by the target node were received within a preset heartbeat data sending time; and, if they were received within the preset heartbeat data sending time, determining the target node to be the second target node.
In this embodiment, after determining whether the heartbeat data and the inter-node heartbeat interaction data sent by the target node are received within the preset heartbeat data sending time, the method may further include the steps of, if the heartbeat data and the inter-node heartbeat interaction data sent by the target node are not received within the preset heartbeat data sending time, re-determining the target node from the cluster trivial database, and then executing the step of determining whether the heartbeat data and the inter-node heartbeat interaction data sent by the target node are received within the preset heartbeat data sending time.
In this embodiment, to determine the second target node, a target node is first determined from the cluster. For example, in some specific embodiments, heartbeat data is sent once per second in the cluster, so the preset heartbeat data sending time is one second, and the first preset time is three seconds. When master node A does not receive the heartbeat data and inter-node heartbeat interaction data sent by first target node B within three seconds, the master node determines a target node from the remaining nodes in the cluster. If the target node is determined to be C, the master node determines whether the heartbeat data and inter-node heartbeat interaction data sent by target node C were received within the second preceding the current time. If they were not, the target node is redetermined to be D, and the master node determines whether the heartbeat data and inter-node heartbeat interaction data sent by target node D were received within the second preceding the current time. If they were, node D is in a normal state and is determined to be the second target node.
In this embodiment, after the target node is determined, the method may further include: recording the current number of historical target nodes; determining whether the number of historical target nodes meets a preset threshold; and, if the number of historical target nodes meets the preset threshold, setting the state of the current node to a suspended state.
In this embodiment, if the heartbeat data and inter-node heartbeat interaction data sent by target node D were not received within the second preceding the current time, the target node is redetermined to be E, and it is determined whether the heartbeat data and inter-node heartbeat interaction data sent by target node E were received within the second preceding the current time, and finally whether the target node needs to be redetermined once more. It should be noted that, while target nodes are repeatedly determined, the number of determinations is recorded each time a target node is determined. If this number meets the preset threshold, the master node is set to a suspended state; in some preferred embodiments, after the master node is suspended, a cluster operation error message is returned to the corresponding display interface. In other words, if during the determination of the second target node the preset threshold number of nodes in the cluster have all failed to send heartbeat data and inter-node heartbeat interaction data within the preset heartbeat data sending time, a large-scale node fault or a master node fault may have occurred in the cluster; the master node is then suspended and returns a cluster operation error message.
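The search for a second target node, including the threshold on the number of candidates tried, might be sketched as follows. The function name, its parameters, and the default threshold of three are hypothetical. The text does not fix the exact point at which the count is compared with the threshold, so checking it after each failed candidate is an assumption, as is returning `None` when the candidates run out.

```python
from typing import Dict, Iterable, Optional


def pick_second_target(candidates: Iterable[str],
                       last_received: Dict[str, float],
                       now: float,
                       send_time: float = 1.0,
                       max_attempts: int = 3) -> Optional[str]:
    """Return a node whose heartbeat arrived within `send_time`, else None.

    None means the threshold was reached (or no candidate was left); the
    caller is then expected to suspend the master node and report a
    cluster operation error.
    """
    attempts = 0
    for node in candidates:
        attempts += 1  # number of historical target nodes so far
        received_at = last_received.get(node)
        if received_at is not None and now - received_at <= send_time:
            return node  # node is sending heartbeats normally
        if attempts >= max_attempts:
            return None  # threshold met: likely large-scale or master fault
    return None


# Master node A lost B; C is stale, D answered 0.4 s ago.
last = {"C": 95.0, "D": 99.6, "E": 99.9}
print(pick_second_target(["C", "D", "E"], last, now=100.0))
```

With the values above the sketch skips C and selects D, matching the worked example in which C fails the check and D passes it.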
Step S22, judging whether a heartbeat record of the first target node and the second target node exists in the inter-node heartbeat interaction data of the second target node within the first preset time.
In this embodiment, if the target node is in a normal heartbeat sending state, it is determined to be the second target node, and the running state of the first target node is then judged from the interaction data between the second target node and the first target node.
Step S23, if a heartbeat record of the first target node and the second target node exists in the inter-node heartbeat interaction data of the second target node within the first preset time, setting the current node state to a suspended state.
For more specific processing in step S23, reference may be made to the corresponding content disclosed in the foregoing embodiment, and a detailed description is omitted herein.
Step S24, if no heartbeat record of the first target node and the second target node exists in the inter-node heartbeat interaction data of the second target node within the first preset time, determining the first target node as a fault node, and removing the fault node from the cluster.
For more specific processing in step S24, reference may be made to the corresponding content disclosed in the foregoing embodiment, and a detailed description is omitted herein.
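Steps S22 to S24 amount to a single lookup in the second target node's interaction data. The sketch below illustrates that decision under stated assumptions: the record layout (a list of `(peer, time)` pairs per node), the function name `judge_first_target` and the returned labels are all hypothetical and are not taken from the original text.

```python
from typing import Dict, List, Tuple

# Hypothetical layout: (peer node, time the heartbeat with that peer was recorded)
PeerRecord = Tuple[str, float]


def judge_first_target(
    interaction_records: Dict[str, List[PeerRecord]],
    first: str,
    second: str,
    now: float,
    first_preset_time: float = 3.0,
) -> str:
    """Decide what the master does after losing contact with `first`.

    Returns "suspend_master" if `second` recorded a heartbeat with `first`
    within the first preset time (S23), else "remove_first_target" (S24).
    """
    window_start = now - first_preset_time
    for peer, seen_at in interaction_records.get(second, []):
        if peer == first and window_start <= seen_at <= now:
            # The first target node is still talking to its peers, so the
            # problem is more likely on the master's side.
            return "suspend_master"
    return "remove_first_target"
```

For instance, if node D recorded a heartbeat with B one second ago, the master suspends itself; if D's last record of B is older than the three-second window, B is treated as the fault node.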
This embodiment provides a detailed process for determining the second target node: a target node is determined first; if the data sent by the target node is not received within the preset heartbeat data sending time, the target node is redetermined; and if the number of determined target nodes meets the preset threshold, a large-scale node fault or a master node fault may have occurred in the cluster, in which case the master node is suspended and returns cluster operation error information. This ensures that a target node in a normal running state is ultimately determined as the second target node, and the running state of the first target node is then judged from the interaction data between the second target node and the first target node, which improves the security of the cluster and ensures its stable operation.
Referring to Fig. 3, an embodiment of the present application discloses a cluster node fault processing device, which is applied to a master node in a cluster trivial database and may specifically include:
a record inquiring module 11, configured to query the inter-node heartbeat interaction data of a second target node from node interaction records that locally store the inter-node interaction data of each node, if the heartbeat data and inter-node heartbeat interaction data sent by a first target node in the cluster trivial database are not received within a first preset time;
a record judging module 12, configured to judge whether a heartbeat record of the first target node and the second target node exists in the inter-node heartbeat interaction data of the second target node within the first preset time;
a record existence module 13, configured to set the current node state to a suspended state if a heartbeat record of the first target node and the second target node exists in the inter-node heartbeat interaction data of the second target node within the first preset time;
and a record absence module 14, configured to determine the first target node as a fault node and remove the fault node from the cluster if no heartbeat record of the first target node and the second target node exists in the inter-node heartbeat interaction data of the second target node within the first preset time.
In this embodiment, if the heartbeat data and inter-node heartbeat interaction data sent by a first target node in the cluster trivial database are not received within a first preset time, the inter-node heartbeat interaction data of a second target node is queried from node interaction records that locally store the inter-node interaction data of each node. It is then judged whether a heartbeat record of the first target node and the second target node exists in the inter-node heartbeat interaction data of the second target node within the first preset time. If such a record exists, the current node state is set to a suspended state; if no such record exists, the first target node is determined as a fault node and the fault node is removed from the cluster. The present invention thus provides a new heartbeat sending mechanism: when sending heartbeat data, each node also sends the heartbeat interaction data between itself and the other nodes, so that every node stores the heartbeat interaction data of the other nodes. When the master node cannot receive the heartbeat data of a certain node, it can judge which node is the fault node from the stored heartbeat interaction data and, once the fault node is determined, remove it. This avoids the prior-art situation in which a fault of the master node causes a normal node to be misjudged as a fault node and removed from the cluster, and thereby further ensures the high availability of the cluster.
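One way to picture the heartbeat sending mechanism is a heartbeat message that carries the sender's view of its peers, together with a local record that stores what each sender reported. The following is a minimal sketch under assumptions: the `Heartbeat` and `NodeInteractionRecord` types and their fields are hypothetical names chosen for illustration; the original text does not specify a message format.

```python
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class Heartbeat:
    """Hypothetical heartbeat message: liveness plus inter-node interaction data."""
    sender: str
    sent_at: float
    # peer node -> time the sender last exchanged a heartbeat with that peer
    peers_seen: Dict[str, float] = field(default_factory=dict)


class NodeInteractionRecord:
    """Hypothetical local store kept by every node, including the master."""

    def __init__(self) -> None:
        self.last_heartbeat: Dict[str, float] = {}
        self.interactions: Dict[str, Dict[str, float]] = {}

    def store(self, hb: Heartbeat) -> None:
        # Remember when the sender was last heard from ...
        self.last_heartbeat[hb.sender] = hb.sent_at
        # ... and which peers the sender itself has recently heard from.
        self.interactions.setdefault(hb.sender, {}).update(hb.peers_seen)
```

Because node D's heartbeats include its latest contact with node B, a master that has lost contact with B can still consult `interactions["D"]["B"]` to see whether B is alive from D's point of view.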
In some specific embodiments, the cluster node fault handling apparatus further includes:
the data receiving module is used for receiving heartbeat data sent by other nodes in the cluster trivial database and heartbeat interaction data among the nodes;
and the data storage module is used for storing the received heartbeat data and the heartbeat interaction data among the nodes into node interaction records corresponding to the local preset storage positions.
In some embodiments, the data storage module specifically includes:
a data storage unit, configured to store, among the received heartbeat data and inter-node heartbeat interaction data, the data whose sending time falls within a preset time period into the node interaction record corresponding to the local preset storage location.
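The time-window filtering performed by the data storage unit could look roughly like the sketch below. It is an assumption-laden illustration: the entry layout `(sender, sending time)`, the function name `store_recent` and the parameter `keep_period` are hypothetical, and the preset time period is interpreted here as "the most recent `keep_period` seconds".

```python
from typing import List, Tuple

# Hypothetical entry layout: (sender node, sending time of the data)
Entry = Tuple[str, float]


def store_recent(
    record: List[Entry],
    received: List[Entry],
    now: float,
    keep_period: float = 10.0,
) -> List[Entry]:
    """Append to `record` only the received entries sent within the last `keep_period` seconds."""
    cutoff = now - keep_period
    # Entries sent before the cutoff are discarded instead of being stored.
    record.extend(entry for entry in received if entry[1] >= cutoff)
    return record
```

Keeping only recent data bounds the size of the node interaction record, which is sufficient here because the fault judgment only looks back over the first preset time.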
In some embodiments, the record query module 11 includes:
A second target node determining unit, configured to determine a second target node from the cluster trivial database;
and the data query unit is used for querying the inter-node heartbeat interaction data of the second target node from the inter-node interaction records locally stored with the inter-node interaction data of each node.
In some specific embodiments, the second target node determining unit specifically includes:
A target node determining unit, configured to determine a target node from the cluster trivial database;
The data judging unit is used for judging whether heartbeat data sent by the target node and heartbeat interaction data between nodes are received in preset heartbeat data sending time;
And the node determining unit is used for determining the target node as a second target node if the heartbeat data sent by the target node and the heartbeat interaction data between the nodes are received within the preset heartbeat data sending time.
In some specific embodiments, the second target node determining unit further includes:
and the node redetermining unit is used for redetermining the target node from the cluster trivial database if the heartbeat data transmitted by the target node and the heartbeat interaction data between nodes are not received within the preset heartbeat data transmission time, and then executing the step of judging whether the heartbeat data transmitted by the target node and the heartbeat interaction data between nodes are received within the preset heartbeat data transmission time.
In some specific embodiments, the cluster node fault processing apparatus specifically further includes:
the number recording unit is used for recording the number of the current historical target nodes;
The number judging unit is used for judging whether the number of the historical target nodes meets a preset threshold value or not;
and the node abnormality unit is used for setting the current node state to a suspended state if the number of historical target nodes meets the preset threshold.
Further, an embodiment of the present application also discloses an electronic device. Fig. 4 is a block diagram of an electronic device 20 according to an exemplary embodiment; the content of the figure is not to be considered as any limitation on the scope of use of the present application.
Fig. 4 is a schematic structural diagram of an electronic device 20 according to an embodiment of the present application. The electronic device 20 may include, in particular, at least one processor 21, at least one memory 22, a power supply 23, a display screen 24, an input-output interface 25, a communication interface 26, and a communication bus 27. The memory 22 is configured to store a computer program, where the computer program is loaded and executed by the processor 21 to implement relevant steps in the cluster node fault handling method disclosed in any of the foregoing embodiments. In addition, the electronic device 20 in the present embodiment may be specifically an electronic computer.
In this embodiment, the power supply 23 is configured to provide working voltages for the hardware devices on the electronic device 20. The communication interface 26 is configured to create a data transmission channel between the electronic device 20 and external devices; the communication protocol it follows may be any communication protocol applicable to the technical solution of the present application and is not specifically limited here. The input/output interface 25 is configured to obtain external input data or to output data to external devices; its specific interface type may be selected according to application needs and is not specifically limited here.
The memory 22 may be a carrier for storing resources, such as a read-only memory, a random access memory, a magnetic disk, or an optical disk, and the resources stored thereon may include an operating system 221, a computer program 222, virtual machine data 223, and the virtual machine data 223 may include various data. The storage means may be a temporary storage or a permanent storage.
The operating system 221 is used for managing and controlling the hardware devices on the electronic device 20 and the computer program 222, and may be Windows Server, NetWare, Unix, Linux, or the like. In addition to the computer program used to perform the cluster node fault processing method executed by the electronic device 20 as disclosed in any of the foregoing embodiments, the computer program 222 may further include computer programs used to perform other specific tasks.
Further, the present application also discloses a computer-readable storage medium for storing a computer program, wherein the computer program, when executed by a processor, implements the cluster node fault processing method disclosed above. The computer-readable storage medium referred to here includes random access memory (RAM), internal memory, read-only memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, a hard disk, a magnetic disk, an optical disk, or any other form of storage medium known in the art. For the specific steps of the method, reference may be made to the corresponding content disclosed in the foregoing embodiments, which is not repeated here.
In this specification, the embodiments are described in a progressive manner: each embodiment focuses on its differences from the other embodiments, and for the same or similar parts the embodiments may be referred to one another. Because the device disclosed in the embodiments corresponds to the method disclosed in the embodiments, its description is relatively brief, and the relevant points can be found in the description of the method. Those skilled in the art will further appreciate that the illustrative units and algorithm steps described in connection with the embodiments disclosed herein may be implemented as electronic hardware, computer software, or a combination of both; to clearly illustrate this interchangeability of hardware and software, the illustrative units and steps have been described above generally in terms of their functionality. Whether such functionality is implemented as hardware or software depends upon the particular application and the design constraints imposed on the solution. Skilled artisans may implement the described functionality in varying ways for each particular application, but such implementation decisions should not be interpreted as causing a departure from the scope of the present application.
The steps of a method or algorithm described in connection with the embodiments disclosed herein may be embodied directly in hardware, in a software module executed by a processor, or in a combination of the two. The software modules may be disposed in Random Access Memory (RAM), memory, read Only Memory (ROM), electrically programmable ROM, electrically erasable programmable ROM, registers, hard disk, a removable disk, a CD-ROM, or any other form of storage medium known in the art.
Finally, it is further noted that relational terms such as first and second are used herein solely to distinguish one entity or action from another entity or action, without necessarily requiring or implying any actual such relationship or order between such entities or actions. Moreover, the terms "comprises," "comprising," or any other variation thereof are intended to cover a non-exclusive inclusion, such that a process, method, article, or apparatus that comprises a list of elements does not include only those elements but may include other elements not expressly listed or inherent to such process, method, article, or apparatus. Without further limitation, an element defined by the phrase "comprising a ..." does not exclude the presence of other like elements in the process, method, article, or apparatus that comprises the element.
The cluster node fault processing method, apparatus, device and storage medium provided by the present invention have been described in detail above. Specific examples are used herein to illustrate the principles and embodiments of the present invention, and the description of the above embodiments is only intended to assist in understanding the method and core ideas of the present invention. Meanwhile, those skilled in the art may, according to the ideas of the present invention, make variations in the specific embodiments and the application scope. In summary, the content of this specification should not be construed as limiting the present invention.

Claims (6)

1. A cluster node fault processing method, characterized in that the method is applied to a master node in a cluster trivial database and comprises the following steps:
If the heartbeat data and the inter-node heartbeat interaction data sent by the first target node in the cluster trivial database are not received within a first preset time, inquiring the inter-node heartbeat interaction data of a second target node from node interaction records locally storing the inter-node interaction data of each node, wherein the node interaction records comprise the heartbeat interaction records of the first target node and the second target node in the cluster trivial database, and the first preset time is longer than the sending interval corresponding to the inter-node heartbeat data sending frequency;
Judging whether heartbeat records of the first target node and the second target node exist in the inter-node heartbeat interaction data of the second target node in the first preset time;
If the heartbeat records of the first target node and the second target node exist in the inter-node heartbeat interaction data of the second target node within the first preset time, setting the current node state to be a suspended state so that other modules in the cluster can take down the network port of the master node;
If the heartbeat record of the first target node and the heartbeat record of the second target node do not exist in the inter-node heartbeat interaction data of the second target node within the first preset time, determining the first target node as a fault node, and removing the fault node from a cluster;
The query of the inter-node heartbeat interaction data of the second target node from the node interaction records locally stored with the inter-node interaction data of each node includes:
determining a second target node from the cluster trivial database;
Inquiring the inter-node heartbeat interaction data of the second target node from the node interaction records locally stored with the inter-node interaction data of each node;
Wherein said determining a second target node from said cluster trivial database comprises:
Determining a target node from the cluster trivial database;
judging whether heartbeat data sent by the target node and heartbeat interaction data between nodes are received in preset heartbeat data sending time;
if heartbeat data sent by the target node and heartbeat interaction data between nodes are received within preset heartbeat data sending time, determining the target node as a second target node;
Wherein, the judging whether the heartbeat data sent by the target node and the heartbeat interaction data between nodes are received in the preset heartbeat data sending time further comprises:
if the heartbeat data sent by the target node and the heartbeat interaction data between nodes are not received within the preset heartbeat data sending time, the target node is redetermined from the cluster trivial database, and then the step of judging whether the heartbeat data sent by the target node and the heartbeat interaction data between nodes are received within the preset heartbeat data sending time is executed;
wherein after the determining the target node, the method further comprises:
recording the number of current historical target nodes;
judging whether the number of the historical target nodes meets a preset threshold value or not;
if the number of the historical target nodes meets the preset threshold, the current node state is set to be a suspended state, and information of a cluster operation error is returned to a display interface so as to indicate that a large-scale node fault or a master node fault occurs in the cluster.
2. The method for processing a cluster node fault according to claim 1, wherein before querying inter-node heartbeat interaction data of the second target node from the inter-node interaction record locally stored with inter-node interaction data of each node, the method further comprises:
receiving heartbeat data sent by other nodes in the cluster trivial database and heartbeat interaction data among nodes;
and storing the received heartbeat data and the heartbeat interaction data among the nodes into node interaction records corresponding to the local preset storage positions.
3. The method for processing the cluster node fault according to claim 2, wherein the storing the received heartbeat data and the heartbeat interaction data between nodes in the node interaction record corresponding to the local preset storage location includes:
And storing, among the received heartbeat data and inter-node heartbeat interaction data, the data whose sending time falls within a preset time period into the node interaction record corresponding to the local preset storage position.
4. A cluster node failure handling device, applied to a master node in a cluster trivial database, comprising:
The record inquiring module is used for inquiring the inter-node heartbeat interaction data of a second target node from the node interaction records locally stored with the inter-node interaction data of each node if the heartbeat data and the inter-node heartbeat interaction data sent by a first target node in the cluster trivial database are not received within a first preset time;
The record judging module is used for judging whether heartbeat records of the first target node and the second target node exist in the inter-node heartbeat interaction data of the second target node in the first preset time;
the record existence module is used for setting the current node state to be a suspended state if the heartbeat record of the first target node and the second target node exists in the inter-node heartbeat interaction data of the second target node within the first preset time, so that other modules in the cluster take down the network port of the master node;
The record absence module is used for determining the first target node as a fault node and removing the fault node from a cluster if the heartbeat record of the first target node and the second target node does not exist in the inter-node heartbeat interaction data of the second target node within the first preset time;
Wherein, the record inquiry module includes:
A second target node determining unit, configured to determine a second target node from the cluster trivial database;
The data query unit is used for querying the inter-node heartbeat interaction data of the second target node from the node interaction records locally stored with the inter-node interaction data of each node;
wherein the second target node determining unit includes:
A target node determining unit, configured to determine a target node from the cluster trivial database;
The data judging unit is used for judging whether heartbeat data sent by the target node and heartbeat interaction data between nodes are received in preset heartbeat data sending time;
The node determining unit is used for determining the target node as a second target node if heartbeat data sent by the target node and heartbeat interaction data between nodes are received within preset heartbeat data sending time;
wherein the second target node determining unit further includes:
A node redetermining unit, configured to redetermine the target node from the cluster trivial database if heartbeat data sent by the target node and heartbeat interaction data between nodes are not received within the preset heartbeat data sending time, and then execute the step of judging whether heartbeat data sent by the target node and heartbeat interaction data between nodes are received within the preset heartbeat data sending time;
Wherein the device further comprises:
the number recording unit is used for recording the number of the current historical target nodes;
The number judging unit is used for judging whether the number of the historical target nodes meets a preset threshold value or not;
And the node abnormality unit is used for setting the current node state to be a suspended state if the number of the historical target nodes meets the preset threshold, and returning information of a cluster operation error to a display interface so as to indicate that a large-scale node fault or a master node fault occurs in the cluster.
5. An electronic device, comprising a processor and a memory, wherein the processor implements the cluster node failure handling method according to any one of claims 1 to 3 when executing a computer program stored in the memory.
6. A computer readable storage medium for storing a computer program, wherein the computer program when executed by a processor implements a cluster node failure handling method according to any of claims 1 to 3.
CN202210879795.2A 2022-07-25 2022-07-25 A cluster node failure processing method, device, equipment and medium Active CN115242820B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210879795.2A CN115242820B (en) 2022-07-25 2022-07-25 A cluster node failure processing method, device, equipment and medium

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210879795.2A CN115242820B (en) 2022-07-25 2022-07-25 A cluster node failure processing method, device, equipment and medium

Publications (2)

Publication Number Publication Date
CN115242820A CN115242820A (en) 2022-10-25
CN115242820B true CN115242820B (en) 2025-05-13

Family

ID=83674743

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210879795.2A Active CN115242820B (en) 2022-07-25 2022-07-25 A cluster node failure processing method, device, equipment and medium

Country Status (1)

Country Link
CN (1) CN115242820B (en)

Also Published As

Publication number Publication date
CN115242820A (en) 2022-10-25

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant