Disclosure of Invention
In view of the above, the present invention provides a fault alerting method, apparatus, computer device, storage medium, and program product, so as to solve the problem of low fault detection efficiency.
In a first aspect, the present invention provides a fault alerting method, the method comprising:
in a current period, acquiring identification information corresponding to each service in at least one service run by a master node, and a start-up dependency sequence corresponding to each service;
acquiring a service state corresponding to each service according to the identification information and the start-up dependency sequence corresponding to each service;
when it is determined that the service state corresponding to a target service is abnormal, or the service state corresponding to the target service is not acquired, acquiring a configuration file of the target service in the current period and a file content index value of the target service in the last period (the period immediately preceding the current period) according to the identification information corresponding to the target service, wherein the target service is any one of the at least one service;
determining a fault type of the target service according to the configuration file of the target service in the current period and the file content index value of the target service in the last period;
and generating an alarm notification according to the identification information and the fault type of the target service, and sending the alarm notification to a client.
The fault alerting method provided by the present invention has the following advantages:
In this scheme, the identification information and start-up dependency sequence of the services run by the master node are acquired periodically, and the service state of each service is acquired. When a service state is abnormal, or a service state is not acquired, the configuration file of the service in the current period and the file content index value of the service in the last period are further acquired according to the identification information of the service, and the fault type is determined. Finally, an alarm notification is generated based on the identification information of the service and the fault type, and sent to the client. In this way, a technician does not need to actively query and analyze the system to learn which specific component failed and the specific cause of the fault (i.e., the fault type). Fault troubleshooting is therefore more efficient, technicians can resolve the related faults in time, and the impact on the services run by the master node is reduced.
In an optional embodiment, the obtaining the service state corresponding to each service according to the identification information and the start-up dependency sequence corresponding to each service includes:
determining a service state acquisition order corresponding to each service according to the start-up dependency sequence corresponding to each service;
traversing the identification information corresponding to each service according to the service state acquisition order;
each traversal visits the identification information of one service, and the service state corresponding to the traversed service is acquired according to that identification information;
and after the identification information corresponding to every service in the at least one service has been traversed, completing the operation of acquiring the service state corresponding to each service in the current period.
In particular, acquiring service states in start-up dependency order ensures that all services a given service depends on have been checked before that service itself is checked. This helps identify problems caused by failures of dependency services, so that the service state can be judged more accurately.
In an alternative embodiment, the determining the fault type of the target service according to the configuration file of the target service in the current period and the file content index value of the target service in the last period includes:
processing the configuration file of the target service in the current period according to a pre-constructed target algorithm to obtain a file content index value of the target service in the current period;
and determining the fault type of the target service according to the file content index values respectively corresponding to the last period and the current period of the target service.
Specifically, the related art generally traverses all configuration parameters in a configuration file and checks whether each parameter has changed, in order to determine whether a fault may have been caused by a configuration change. However, a configuration file contains other code in addition to configuration parameters, and changes to that code can also lead to failures. The present method therefore determines the fault type directly by judging whether the file content index values corresponding to the service in two adjacent periods are the same. In this way, the accuracy of the fault type is ensured, and the efficiency of determining the fault type is improved.
In an optional implementation, the determining the fault type of the target service according to the file content index values respectively corresponding to the last period and the current period of the target service includes:
determining whether the file content index value corresponding to the last period of the target service is consistent with the file content index value corresponding to the current period;
when the two file content index values are inconsistent, determining the fault type of the target service to be a first fault type, wherein the first fault type indicates that the fault of the target service was caused by a modification of the configuration file of the target service;
or
when the two file content index values are consistent, determining the fault type of the target service to be a second fault type, wherein the second fault type indicates that the cause of the fault of the target service is not a modification of the configuration file of the target service.
Specifically, when the file content index values of the target service in two adjacent periods are consistent, it can be concluded that the configuration file has not changed and that the fault of the target service was caused not by the configuration file but by the master node itself, for example a hardware failure, or a change to the operating system or to other configuration files. When the file content index values of the target service in two adjacent periods are inconsistent, it can be concluded that the fault of the target service was caused by a change to its configuration file.
In an alternative embodiment, when the services running on the master node include a data synchronization service, the method further includes:
detecting a network communication link between the master node and a standby node corresponding to the master node;
when a fault of the network communication link is detected, determining whether the master node has mounted the data synchronization service;
when it is determined that the master node has mounted the data synchronization service, acquiring a target time at which the data synchronization service was mounted;
and adding the target time, as a target identifier, to a target log corresponding to the data synchronization service, wherein the target identifier indicates that the master node mounted the data synchronization service during the network communication link fault.
Specifically, by periodically detecting the network communication link between the master node and the standby node, a response can be made as soon as a network failure occurs, maintaining a degree of service continuity. When a network fault occurs, whether the data synchronization service is mounted is checked and the mount time is recorded; the target time, added to the log as an identifier, provides a clear time clue for subsequent troubleshooting and makes it easy to analyze the data state during the fault, so that the root cause can be located more quickly.
In an alternative embodiment, after the adding the target time as the target identifier to the target log corresponding to the data synchronization service, the method further includes:
when it is detected that the network communication link has returned to normal and a network fault check instruction sent by the client is received, determining whether the target log includes the target identifier;
and when it is determined that the target log includes the target identifier, sending the target identifier to the client, so that the client can determine whether the data of the master node and the data of the standby node became inconsistent during the network communication link failure.
Specifically, by checking for the target identifier in the target log, it can be known clearly whether the master node mounted the data synchronization service during the communication link failure, so that the client can evaluate and repair any data inconsistency according to the received target identifier and ensure data synchronization between the master node and the standby node.
In an alternative embodiment, after the target identifier is sent to the client when it is determined that the target identifier is included in the target log, the method further includes:
deleting the target identifier from the target log after receiving a target data processing instruction sent by the client, wherein the target data processing instruction is an instruction obtained when the client determines that the data of the master node and the data of the standby node are inconsistent;
analyzing the target data processing instruction to obtain an operation type;
when the operation type is determined to be a delete operation, deleting the data generated while the data synchronization service was mounted;
or
when the operation type is determined to be a retain operation, retaining the data generated while the data synchronization service was mounted.
Specifically, after receiving the target data processing instruction, the master node deletes the target identifier from the target log, so that when a fault of the network communication link is detected again later, a new time can be added directly; this avoids multiple identifiers existing at the same time, which would make later data inconsistencies impossible to identify accurately. In addition, because data is deleted or retained based on the target data processing instruction sent by the client, the data inconsistency itself can be resolved.
In a second aspect, the present invention provides a fault alerting device, the device comprising:
an acquisition module, configured to acquire, in the current period, identification information corresponding to each service in at least one service run by a master node and a start-up dependency sequence corresponding to each service; acquire a service state corresponding to each service according to the identification information and the start-up dependency sequence; and, when it is determined that the service state corresponding to a target service is abnormal or the service state corresponding to the target service is not acquired, acquire a configuration file of the target service in the current period and a file content index value of the target service in the last period according to the identification information corresponding to the target service, wherein the target service is any one of the at least one service;
a determining module, configured to determine a fault type of the target service according to the configuration file of the target service in the current period and the file content index value of the target service in the last period;
and a generating module, configured to generate an alarm notification according to the identification information and the fault type of the target service, and to send the alarm notification to the client.
In a third aspect, the present invention provides a computer device including a memory and a processor communicatively connected to each other, wherein the memory stores computer instructions and the processor executes those instructions so as to perform the fault alerting method of the first aspect or of any implementation corresponding to the first aspect.
In a fourth aspect, the present invention provides a computer readable storage medium having stored thereon computer instructions for causing a computer to execute the fault alerting method of the first aspect or any one of its corresponding embodiments.
In a fifth aspect, the present invention provides a computer program product comprising computer instructions for causing a computer to perform the fault alerting method of the first aspect or any of its corresponding embodiments.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present invention clearer, the technical solutions in the embodiments are described below clearly and completely with reference to the accompanying drawings. It is apparent that the described embodiments are some, but not all, embodiments of the present invention. All other embodiments obtained by those skilled in the art based on the embodiments of the present invention without inventive effort fall within the protection scope of the present invention.
The present invention may be implemented by a target system, as shown in fig. 1, where the target system may include a plurality of nodes and clients, and the plurality of nodes includes a master node and a standby node. The master node and the standby node can be servers, and the client can be a computer, a mobile phone and the like.
The embodiment of the invention provides a fault alerting method which automatically generates alarms together with fault types, improving troubleshooting efficiency.
In accordance with an embodiment of the present invention, a fault alerting method embodiment is provided. It should be noted that the steps shown in the flowcharts of the figures may be performed in a computer system, such as by a set of computer-executable instructions, and that, although a logical order is shown in the flowcharts, in some cases the steps shown or described may be performed in a different order.
In this embodiment, a fault warning method is provided, which may be executed by a master node, and fig. 2 is a flowchart of the fault warning method according to an embodiment of the present invention, as shown in fig. 2, where the flowchart includes the following steps:
Step S201, in the current period, acquiring identification information corresponding to each service in at least one service operated by the master node, and a start-up dependency sequence corresponding to each service.
The identification information of the service may be a name, a number, etc. of the service. For example, the services may include database services, in-memory data structure services, request handling services, and the like. The start-up dependency order of the database service may be 1, the start-up dependency order of the memory data structure service may be 2, and the start-up dependency order of the request processing service may be 3.
Specifically, after any node in the target system becomes the master node, it can periodically invoke a detection script and, following the execution steps in the script, first obtain the identification information and the start-up dependency sequence of the currently running services, so as to check whether any service has failed.
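Purely as an illustration, the periodic detection described above might be organized as follows. This is a minimal sketch, not the claimed implementation: the registry contents, configuration file paths, and the detection period are assumptions introduced for the example.

```python
import time

# Hypothetical service registry: identification information mapped to a
# start-up dependency order and a configuration file path. All values are
# illustrative; the method does not fix a concrete data layout.
SERVICE_REGISTRY = {
    "mariadb": {"order": 1, "config": "/etc/my.cnf.d/mariadb-server.conf"},
    "redis":   {"order": 2, "config": "/etc/redis/redis.conf"},
    "httpd":   {"order": 3, "config": "/etc/httpd/conf/httpd.conf"},
}

PERIOD_SECONDS = 60  # assumed detection period


def services_in_dependency_order():
    """Return (identification, metadata) pairs following the start-up dependency sequence."""
    return sorted(SERVICE_REGISTRY.items(), key=lambda kv: kv[1]["order"])


if __name__ == "__main__":
    while True:  # the detection script is invoked once per period
        for name, meta in services_in_dependency_order():
            print(f"checking {name} (config: {meta['config']})")
        time.sleep(PERIOD_SECONDS)
```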
Step S202, obtaining the service state corresponding to each service according to the identification information corresponding to each service and the starting dependency sequence.
Wherein the service status may include normal or abnormal.
Specifically, the master node may first determine a service state acquisition order corresponding to each service according to the start-up dependency sequence corresponding to each service, and then traverse the identification information corresponding to each service in that order. Each traversal visits the identification information of one service, and the service state corresponding to the traversed service is acquired according to that identification information. Thus, after the identification information corresponding to each service in the at least one service has been traversed, the operation of acquiring the service state corresponding to each service in the current period is complete.
In some optional embodiments, according to the identification information corresponding to the traversed service, the obtaining the service state corresponding to the traversed service may specifically be:
First, a pre-constructed acquisition instruction template is obtained; then, according to the template and the identification information of the traversed service, an acquisition instruction corresponding to that identification information is generated and executed. After execution completes, the service state can be obtained. For example, the acquisition instruction template may be "systemctl status " + service identification information.
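A minimal sketch of how the acquisition instruction might be built from the template and executed, assuming systemctl is available on the master node; the helper name and the three-way return convention (normal / abnormal / not acquired) are our own:

```python
import subprocess
from typing import Optional


def get_service_state(identification: str) -> Optional[str]:
    """Run 'systemctl status <identification>' and classify the result.

    Returns "normal", "abnormal", or None when the state could not be
    acquired at all (the 'service state not acquired' branch of step S203).
    """
    try:
        result = subprocess.run(
            ["systemctl", "status", identification],
            capture_output=True, text=True, timeout=10,
        )
    except (OSError, subprocess.TimeoutExpired):
        return None  # the acquisition instruction itself failed
    # systemctl exits with 0 for an active unit, non-zero otherwise.
    if result.returncode == 0 and "active (running)" in result.stdout:
        return "normal"
    return "abnormal"
```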
In step S203, when it is determined that the service state corresponding to the target service is abnormal, or the service state corresponding to the target service is not acquired, a configuration file of the target service in the current period and a file content index value of the target service in the last period corresponding to the current period are acquired according to the identification information corresponding to the target service.
Wherein the target service may be any one of the at least one service.
Specifically, for the target service, when the master node acquires the corresponding service state it can determine whether that state is normal or abnormal: if normal, no processing is required; if abnormal, the fault type can be determined further. Likewise, when the acquisition instruction corresponding to the identification information of the target service is executed and the returned result is that the state acquisition failed (i.e., the service state corresponding to the target service is not acquired), the fault type can be determined further. Before determining the fault type, the master node may first obtain, according to the identification information of the target service, the configuration file corresponding to the target service and the file content index value of the target service in the last period from its own storage. For example, the configuration file of the database service may be "mariadb-server.conf", and that of the request processing service may be "httpd.conf".
Step S204, determining the fault type of the target service according to the configuration file of the target service in the current period and the file content index value of the target service in the last period.
Wherein the file content indicator value may be an MD5 check value.
Specifically, the master node may determine the failure type of the target service according to the following steps:
step one, processing a configuration file of the target service in the current period according to a pre-constructed target algorithm to obtain a file content index value of the target service in the current period.
Wherein the target algorithm may be an MD5 algorithm.
And step two, determining the fault type of the target service according to the file content index values respectively corresponding to the last period and the current period of the target service.
Specifically, it is determined whether the file content index value corresponding to the last period of the target service is consistent with the file content index value corresponding to the current period. When the master node determines that the two values are inconsistent, the fault type of the target service can be determined to be the first fault type, which indicates that the cause of the fault is that the configuration file of the target service was modified. Or, when the master node determines that the two values are consistent, the fault type of the target service is determined to be the second fault type, which indicates that the cause of the fault is not a modification of the configuration file of the target service.
After the file content index value corresponding to the current period of the target service has been calculated, the identification information of the target service and that index value can be recorded together, to be used for determining the fault type in the next period.
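For illustration, computing and recording the file content index value of the current period could look like the sketch below; hashlib supplies the MD5 algorithm named above, while the record file path is a hypothetical choice:

```python
import hashlib
import json
import os

INDEX_RECORD = "/var/lib/fault-alert/index_values.json"  # assumed record path


def file_content_index_value(config_path: str) -> str:
    """MD5 check value computed over the full configuration file content."""
    with open(config_path, "rb") as f:
        return hashlib.md5(f.read()).hexdigest()


def record_index_value(identification: str, value: str) -> None:
    """Store this period's value so the next period can compare against it."""
    values = {}
    if os.path.exists(INDEX_RECORD):
        with open(INDEX_RECORD) as f:
            values = json.load(f)
    values[identification] = value
    with open(INDEX_RECORD, "w") as f:
        json.dump(values, f)
```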
In some optional embodiments, after the master node determines that the file content index value corresponding to the last period of the target service is consistent with the one corresponding to the current period, the following operations may further be performed:
determining whether the fault type of the service immediately preceding the target service in the start-up dependency sequence is the first fault type; if so, the fault type of the target service is determined to be a first sub-fault type of the second fault type, which indicates that the fault of the target service was caused by a modification of the configuration file of a service it depends on. If not, the cause of the fault of the target service can be determined to be a second sub-fault type of the second fault type, which indicates an operating system fault or a hardware fault of the master node.
In this way, a more specific fault cause can be indicated to the technician, speeding up fault resolution.
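Combining the index value comparison with the dependency check above might look like this sketch; the fault-type labels and the upstream_fault_type bookkeeping are illustrative names, not terms fixed by the method:

```python
from typing import Optional


def classify_fault(last_value: str, current_value: str,
                   upstream_fault_type: Optional[str] = None) -> str:
    """Derive a fault type from two file content index values.

    upstream_fault_type is the fault type (if any) of the service that
    immediately precedes this one in the start-up dependency sequence.
    """
    if last_value != current_value:
        # First fault type: the service's own configuration file was modified.
        return "FIRST_TYPE_CONFIG_MODIFIED"
    if upstream_fault_type == "FIRST_TYPE_CONFIG_MODIFIED":
        # First sub-fault type of the second type: a configuration file of a
        # service this one depends on was modified.
        return "SECOND_TYPE_DEPENDENCY_CONFIG_MODIFIED"
    # Second sub-fault type: an operating system or hardware fault of the
    # master node.
    return "SECOND_TYPE_OS_OR_HARDWARE"
```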
Step S205, according to the identification information of the target service and the fault type, generating an alarm notification and sending the alarm notification to the client.
Specifically, after determining the fault type of the target service, the master node may generate an alarm notification corresponding to the target service according to the identification information and the fault type of the target service, and send it to the client. After receiving the alarm notification, the client can display it directly, so that a technician learns which service failed and the specific fault type and can take steps to resolve the fault. In this way, the fault duration of the target service can be reduced.
In some optional embodiments, after determining whether each service fails and determining the failure type of each failed service, the master node may generate an alarm notification according to the identification information and the failure types corresponding to all the failed services, and send the alarm notification to the client. In this way, waste of network resources can be reduced.
In some alternative embodiments, after determining whether each service has failed and the fault type of each failed service, the master node may first count the number of failed services and obtain the level corresponding to each failed service (a higher level indicates a greater impact on the operation of the master node when that service fails). It then determines an alarm level from the number of failed services and their levels, generates an alarm notification from the alarm level, the identification information of each failed service, and the fault types, and sends the notification to the client. In addition, when the master node determines that the alarm level is greater than a preset alarm level, it may shut down its network. The standby node then stops receiving heartbeat signals from the master node and becomes the master node itself, continuing to provide service externally.
Wherein the alarm level is determined according to the number of faulty services and the level corresponding to each faulty service by a preset expression, in which L is the alarm level, N is the number of faulty services, k is a preset value greater than 0 and less than 1, and G_i is the level of the i-th faulty service.
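The expression itself is not preserved in this text. Purely as an illustration of a form consistent with the stated variables, and not the expression from the source, a weighted combination of the fault count and the fault levels could look like:

```latex
% Illustrative reconstruction only; the original expression is not preserved.
% L: alarm level; N: number of faulty services; 0 < k < 1 a preset weight;
% G_i: level of the i-th faulty service.
L = k \cdot N + (1 - k) \cdot \sum_{i=1}^{N} G_i
```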
Therefore, when a serious fault occurs on the master node (the alarm level is greater than the preset alarm level), remedial measures are taken automatically and an alarm notification is sent to the client in time, so that technicians can resolve the fault promptly and restore normal operation of the master node.
According to the fault alerting method provided by this embodiment, the identification information and start-up dependency sequence of the services run by the master node are acquired periodically, and the service state of each service is acquired. When a service state is abnormal, or a service state is not acquired, the configuration file of the service in the current period and the file content index value of the service in the last period are further acquired according to the identification information of the service, and the fault type is determined. Finally, an alarm notification is generated based on the identification information of the service and the fault type, and sent to the client. In this way, a technician does not need to actively query and analyze the system to learn which specific component failed and the specific cause of the fault (i.e., the fault type). Fault troubleshooting is therefore more efficient, technicians can resolve the related faults in time, and the impact on the running services is reduced.
In some alternative embodiments, the above detection script may further include a step of detecting the network communication link, so that when the services run by the master node include a data synchronization service, the master node may further perform the following steps:
Step one, detecting the network communication link between the master node and the standby node corresponding to the master node.
And step two, after detecting that the network communication link fails, determining whether the master node has mounted the data synchronization service.
Wherein the data synchronization service may be a Distributed Replicated Block Device (DRBD) service.
And step three, when it is determined that the master node has mounted the data synchronization service, acquiring the target time at which the data synchronization service was mounted.
And step four, adding the target time as a target identifier into a target log corresponding to the data synchronization service.
Wherein the target identifier is used to instruct the master node to mount the data synchronization service during a network communication link failure.
Specifically, the master node may periodically detect the network communication link between itself and the standby node. The detection may use a ping command, for example "ping <standby node communication address> -c 100 -i 0.001", which sends 100 pings at 1 ms intervals (completing within one second); the result is then examined to determine whether the network communication link has failed. Alternatively, a grep check can be applied to the output to look for a packet loss event; if packets were lost, the network communication link is considered to have failed.
When the master node detects that the network communication link has failed, it can determine whether it has itself mounted the data synchronization service. If so, the target time at which it mounted the service can be obtained and added to the target field of the target log (e.g., /var/log/drbd_state.log). If not, no action needs to be taken. Similarly, the standby node may also periodically detect the network communication link between itself and the master node, and may likewise execute steps one to four after detecting a failure.
Thus, by periodically detecting the network communication link between the master node and the standby node, a response can be made as soon as a network failure occurs, maintaining a degree of service continuity. When a network fault occurs, whether a data synchronization service (such as DRBD) is mounted is checked and the mount time recorded; that time, added to the target log as an identifier, provides a clear time clue for subsequent troubleshooting and makes it easy to analyze the data state during the fault, so the root cause can be located quickly. From the identifier a technician can also tell which node recorded more data, and so choose which node's data to keep.
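A sketch of steps one to four under stated assumptions: the standby node address and the DRBD device path are placeholders, /proc/mounts is one possible way to detect the mount, and the identifier format in the target log is invented for the example.

```python
import subprocess
from datetime import datetime

STANDBY_ADDR = "192.168.1.2"            # assumed standby node address
TARGET_LOG = "/var/log/drbd_state.log"  # target log named in the text
DRBD_DEVICE = "/dev/drbd0"              # assumed DRBD block device


def link_is_down() -> bool:
    """Step one: probe the standby node with 100 pings at 1 ms intervals
    (sub-0.2 s intervals typically require root privileges)."""
    result = subprocess.run(
        ["ping", STANDBY_ADDR, "-c", "100", "-i", "0.001"],
        capture_output=True, text=True,
    )
    # Treat any packet loss in the summary line as a link fault.
    return result.returncode != 0 or ", 0% packet loss" not in result.stdout


def drbd_is_mounted() -> bool:
    """Step two: check whether the data synchronization device is mounted."""
    with open("/proc/mounts") as f:
        return any(line.startswith(DRBD_DEVICE) for line in f)


def record_mount_identifier() -> None:
    """Steps three and four: write the target time into the target log."""
    if link_is_down() and drbd_is_mounted():
        target_time = datetime.now().isoformat()
        with open(TARGET_LOG, "a") as log:
            log.write(f"MOUNTED_DURING_LINK_FAULT {target_time}\n")
```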
In some alternative embodiments, after adding the target time as the target identifier to the target log corresponding to the data synchronization service, the master node may further perform the steps of:
Step one, when it is detected that the network communication link has returned to normal and a network fault check instruction sent by the client is received, determining whether the target log includes the target identifier.
And step two, when it is determined that the target log includes the target identifier, sending the target identifier to the client, so that the client can determine whether the data of the master node and the data of the standby node became inconsistent during the network communication link failure.
Specifically, referring to fig. 3, after detecting that the network communication link has returned to normal, the master node and the standby node may each send a link-recovery notification to the client. After receiving the notifications from both nodes, the client may send a network fault check instruction (for example, a grep instruction) to the two nodes. On receiving the instruction, each node analyzes the target log of its own data synchronization service to determine whether the target field of the target log includes the identifier; if so, the identifier is sent to the client as the returned result, and if not, an indication that the check is complete is sent instead. After receiving the returned results fed back by the master node and the standby node, the client can determine whether both results include identifiers. If they do, then during the network communication link failure each node considered itself the master node and responded to external requests, so the data of the two nodes has diverged (a condition commonly referred to as split-brain). The client can then display target alarm information indicating that split-brain has occurred in the target system. If only one node's returned result includes an identifier, data inconsistency during the network communication link failure is ruled out and no processing is needed. If neither result includes an identifier, neither node is considered to have responded to external requests during the failure and the data remains consistent, but the running state of the data synchronization service may have a problem; in that case the client may send a restart instruction to each node. After receiving the restart instruction, the master node and the standby node can perform the restart operation and rerun the data synchronization service.
In some optional embodiments, if neither node's returned result includes an identifier, the client may first check the mount state of the data synchronization service on both nodes; if the service is unmounted on both, the mount states are wrong, and a restart instruction can be sent to each node. After restarting, the normal mount state of the data synchronization service on both nodes can be restored.
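On the client side, evaluating the two returned results might look like the following sketch; treating each result as either an identifier string or None (for "check complete, no identifier") is an assumption made for the example:

```python
from typing import Optional


def evaluate_network_fault_check(master_result: Optional[str],
                                 standby_result: Optional[str]) -> str:
    """Classify the returned results of the network fault check instruction."""
    if master_result and standby_result:
        # Both nodes mounted the service during the fault, i.e. both acted
        # as master and served requests: the data has diverged (split-brain).
        return "split-brain: display target alarm information"
    if master_result or standby_result:
        # Only one node served requests; the data remained consistent.
        return "consistent: no processing required"
    # Neither node holds an identifier: data is consistent, but the data
    # synchronization service itself may be in a bad running state.
    return "no identifier: send restart instruction to both nodes"
```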
In some optional embodiments, after a technician sees the target alarm information displayed by the client, a drbd recovery command may be used to select which node's data to retain and which node's data to delete; accordingly, the master node and the standby node may each perform the following steps:
Step one, after receiving a target data processing instruction sent by the client, deleting the target identifier from the target log.
The target data processing instruction is an instruction obtained when the client determines that the data of the master node and the data of the standby node are inconsistent.
And step two, analyzing the target data processing instruction to obtain the operation type.
And step three, deleting the data generated during the mounting of the data synchronization service when the operation type is determined to be a deleting operation.
Or
And step four, when the operation type is determined to be a reservation operation, reserving the data generated during the mounting of the data synchronization service.
For example, when the target data processing instruction is "drbdadm connect r0", the operation type is a retain operation; when the instruction is "drbdadm -- --discard-my-data connect r0", the operation type is a delete operation.
Thus, after receiving the target data processing instruction, the master node can delete the target identifier from the target log, so that when a fault of the network communication link is detected again later, a new time can be added directly; this avoids multiple identifiers existing at the same time, which would make later data inconsistencies impossible to identify accurately. In addition, because data is deleted or retained based on the target data processing instruction sent by the client, the data inconsistency itself can be resolved.
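A node-side sketch of steps one to four; the instruction strings follow the drbdadm examples above, while the identifier prefix and log path reuse the assumptions of the earlier sketch:

```python
import subprocess

TARGET_LOG = "/var/log/drbd_state.log"  # same assumed target log as above


def handle_data_processing_instruction(instruction: str) -> None:
    # Step one: remove the target identifier so that a later link fault can
    # record a fresh, unambiguous target time.
    with open(TARGET_LOG) as f:
        kept = [line for line in f
                if not line.startswith("MOUNTED_DURING_LINK_FAULT")]
    with open(TARGET_LOG, "w") as f:
        f.writelines(kept)

    # Step two: parse the operation type out of the instruction.
    if "--discard-my-data" in instruction:
        # Step three (delete operation): discard the data generated while the
        # service was mounted and resynchronize from the peer.
        subprocess.run(
            ["drbdadm", "--", "--discard-my-data", "connect", "r0"], check=True)
    else:
        # Step four (retain operation): keep the locally generated data.
        subprocess.run(["drbdadm", "connect", "r0"], check=True)
```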
In some optional embodiments, after receiving the alarm notification sent by the master node, the client may parse it to obtain the fault type and, on determining that the fault type is an operating system fault or a hardware fault, send an operating system reload instruction to the master node. After receiving the instruction, the master node can first reload its operating system and then obtain the latest configuration file of each service from the standby node by means of the "drbdadm invalidate r0" command, completing the configuration. In addition, before sending the reload instruction to the master node, the client may send a configuration file check instruction to the standby node (for example, "icenter recovery check", which mainly checks whether the dual-machine configuration files such as ha.cf, haservice.xml, and r0.res exist), obtain the result fed back by the standby node, and, when the result indicates that the configuration files on the standby node are complete, send the operating system reload instruction (for example, "icenter recovery") to the master node. Or, when the result indicates that the configuration files on the standby node are incomplete, a complete set of configuration files can first be obtained and sent to the standby node, after which the reload instruction is sent to the master node. The standby node can perform the operating system reload in a similar manner, which is not repeated here.
Therefore, after receiving the alarm notification, the client can quickly parse it and determine the fault type, and corresponding measures can be taken in time, shortening the fault response time.
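For illustration, the standby-side configuration file check could be as simple as the sketch below; the file names come from the example in the text, while the directory is an assumption:

```python
import os

HA_CONFIG_DIR = "/etc/ha.d"  # assumed location of the dual-machine configuration
REQUIRED_FILES = ("ha.cf", "haservice.xml", "r0.res")


def config_files_complete() -> bool:
    """Return True when every required dual-machine configuration file exists."""
    return all(
        os.path.exists(os.path.join(HA_CONFIG_DIR, name))
        for name in REQUIRED_FILES
    )
```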
In this embodiment, a fault alerting device is further provided. The device is used to implement the foregoing embodiments and preferred implementations, and what has already been described is not repeated. As used below, the term "module" may be a combination of software and/or hardware that implements a predetermined function. Although the device described in the following embodiments is preferably implemented in software, an implementation in hardware, or in a combination of software and hardware, is also possible and contemplated.
The present embodiment provides a fault alerting device, as shown in fig. 4, including:
an obtaining module 401, configured to acquire, in the current period, identification information corresponding to each service in at least one service run by a master node and a start-up dependency sequence corresponding to each service; acquire a service state corresponding to each service according to the identification information and the start-up dependency sequence corresponding to each service; and, when it is determined that the service state corresponding to a target service is abnormal or the service state corresponding to the target service is not acquired, acquire a configuration file of the target service in the current period and a file content index value of the target service in the last period according to the identification information corresponding to the target service, wherein the target service is any one of the at least one service;
A determining module 402, configured to determine a fault type of the target service according to a configuration file of the target service in a current period and a file content index value of the target service in a previous period;
And the generating module 403 is configured to generate an alarm notification according to the identification information of the target service and the fault type, and send the alarm notification to the client.
In some alternative embodiments, the obtaining module 401 is specifically configured to:
determine a service state acquisition order corresponding to each service according to the start-up dependency sequence corresponding to each service;
traverse the identification information corresponding to each service according to the service state acquisition order;
in each traversal, visit the identification information of one service and acquire the service state corresponding to the traversed service according to that identification information;
and after the identification information corresponding to every service in the at least one service has been traversed, complete the operation of acquiring the service state corresponding to each service in the current period.
In some alternative embodiments, the determining module 402 is specifically configured to:
processing a configuration file of the target service in the current period according to a pre-constructed target algorithm to obtain a file content index value of the target service in the current period;
And determining the fault type of the target service according to the file content index values respectively corresponding to the last period and the current period of the target service.
In some alternative embodiments, the determining module 402 is specifically configured to:
determine whether the file content index value corresponding to the last period of the target service is consistent with the file content index value corresponding to the current period;
when the two file content index values are inconsistent, determine the fault type of the target service to be a first fault type, wherein the first fault type indicates that the cause of the fault is that the configuration file of the target service was modified;
or
when the two file content index values are consistent, determine the fault type of the target service to be a second fault type, wherein the second fault type indicates that the cause of the fault is not a modification of the configuration file of the target service.
In some alternative embodiments, the apparatus further comprises a detection module 404, the detection module 404 configured to:
when the services running on the master node include a data synchronization service, detect the network communication link between the master node and the standby node corresponding to the master node;
when a fault of the network communication link is detected, determine whether the master node has mounted the data synchronization service;
when it is determined that the master node has mounted the data synchronization service, acquire the target time at which the data synchronization service was mounted;
and add the target time, as a target identifier, to the target log corresponding to the data synchronization service, wherein the target identifier indicates that the master node mounted the data synchronization service during the network communication link fault.
In some alternative embodiments, the detection module 404 is further configured to:
when it is detected that the network communication link has returned to normal and a network fault check instruction sent by the client is received, determine whether the target log includes the target identifier;
and when it is determined that the target log includes the target identifier, send the target identifier to the client, so that the client can determine whether the data of the master node and the data of the standby node became inconsistent during the network communication link failure.
In some alternative embodiments, the apparatus further comprises a receiving module 405, the receiving module 405 further configured to:
after receiving a target data processing instruction sent by the client, delete the target identifier from the target log, wherein the target data processing instruction is an instruction obtained when the client determines that the data of the master node and the data of the standby node are inconsistent;
analyze the target data processing instruction to obtain an operation type;
when the operation type is determined to be a delete operation, delete the data generated while the data synchronization service was mounted;
or
when the operation type is determined to be a retain operation, retain the data generated while the data synchronization service was mounted.
Further functional descriptions of the above respective modules and units are the same as those of the above corresponding embodiments, and are not repeated here.
The fault alerting device in this embodiment is presented in the form of functional units, where a unit may be an ASIC (Application Specific Integrated Circuit), a processor and memory executing one or more software or firmware programs, and/or another device that can provide the functions described above.
An embodiment of the invention further provides a computer device provided with the fault alerting device shown in fig. 4.
Referring to fig. 5, fig. 5 is a schematic structural diagram of a computer device according to an alternative embodiment of the present invention. As shown in fig. 5, the computer device includes one or more processors 10, a memory 20, and interfaces for connecting the components, including a high-speed interface and a low-speed interface. The components are communicatively coupled to one another using different buses and may be mounted on a common motherboard or in other manners as required. The processor can process instructions executed within the computer device, including instructions stored in or on the memory for displaying graphical information of a GUI on an external input/output device, such as a display device coupled to an interface. In some alternative embodiments, multiple processors and/or multiple buses may be used together with multiple memories, if desired. Likewise, multiple computer devices may be connected, each providing a portion of the necessary operations (for example, as a server array, a set of blade servers, or a multiprocessor system). One processor 10 is illustrated in fig. 5.
The processor 10 may be a central processing unit, a network processor, or a combination thereof. The processor 10 may further include a hardware integrated circuit, which may be an application-specific integrated circuit, a programmable logic device, or a combination thereof. The programmable logic device may be a complex programmable logic device, a field-programmable gate array, generic array logic, or any combination thereof.
The memory 20 stores instructions executable by the at least one processor 10, so as to cause the at least one processor 10 to perform the method shown in the above embodiments.
The memory 20 may include a storage program area that may store an operating system, application programs required for at least one function, and a storage data area that may store data created according to the use of the computer device, etc. In addition, the memory 20 may include high-speed random access memory, and may also include non-transitory memory, such as at least one magnetic disk storage device, flash memory device, or other non-transitory solid-state storage device. In some alternative embodiments, memory 20 may optionally include memory located remotely from processor 10, which may be connected to the computer device via a network. Examples of such networks include, but are not limited to, the internet, intranets, local area networks, mobile communication networks, and combinations thereof.
The memory 20 may comprise volatile memory, such as random access memory, or nonvolatile memory, such as flash memory, hard disk or solid state disk, or the memory 20 may comprise a combination of the above types of memory.
The computer device also includes a communication interface 30 for the computer device to communicate with other devices or communication networks.
Embodiments of the present invention also provide a computer-readable storage medium. The method according to the embodiments described above may be implemented in hardware or firmware, or as computer code that can be recorded on a storage medium, or as computer code originally stored on a remote storage medium or a non-transitory machine-readable storage medium and downloaded over a network to be stored on a local storage medium, so that the method described herein can be processed by software stored on a storage medium using a general-purpose computer, a special-purpose processor, or programmable or special-purpose hardware. The storage medium may be a magnetic disk, an optical disc, a read-only memory, a random-access memory, a flash memory, a hard disk, a solid-state disk, or the like; furthermore, the storage medium may include a combination of the above types of memory. It will be appreciated that a computer, processor, microprocessor controller, or programmable hardware includes a storage element that can store or receive software or computer code which, when accessed and executed by the computer, processor, or hardware, implements the methods illustrated by the above embodiments.
Portions of the present invention may be implemented as a computer program product, such as computer program instructions which, when executed by a computer, can invoke or provide the method and/or technical solution according to the present invention through the operation of the computer. Those skilled in the art will appreciate that the forms in which computer program instructions exist in a computer-readable medium include, but are not limited to, source files, executable files, installation package files, and the like; accordingly, the manners in which computer program instructions are executed by a computer include, but are not limited to: the computer directly executing the instructions, the computer compiling the instructions and then executing the corresponding compiled program, the computer reading and executing the instructions, or the computer reading and installing the instructions and then executing the corresponding installed program. Herein, a computer-readable medium may be any available computer-readable storage medium or communication medium that can be accessed by a computer.
Although embodiments of the present invention have been described in connection with the accompanying drawings, various modifications and variations may be made by those skilled in the art without departing from the spirit and scope of the invention, and such modifications and variations fall within the scope of the invention as defined by the appended claims.