CN118394559B - Server fault prediction method and device

Info

Publication number: CN118394559B
Application number: CN202410852079.4A
Authority: CN
Inventors: 王帅; 曾治富; 刘燚
Original assignee: Sichuan Chuanxi Data Industry Co ltd
Current assignee: Sichuan Chuanxi Data Industry Co ltd
Priority date: 2024-06-28
Filing date: 2024-06-28
Publication date: 2024-10-15
Anticipated expiration: 2044-06-28
Also published as: CN118394559A

Abstract

The application discloses a server fault prediction method and device, and relates to the technical field of server monitoring. The method comprises: acquiring a historical fault record in a first preset time period before prediction from a fault database of a server, the fault database storing first time sequence data corresponding to a plurality of fault records; acquiring second time sequence data corresponding to the running state of the server in a second preset time period before prediction; determining a node association relationship based on the first time sequence data and the second time sequence data; constructing a heterogeneous graph comprising fault record nodes and running state nodes according to the node association relationship; updating the weight of each edge between the fault record nodes and the running state nodes in the heterogeneous graph based on the graph embedding vector of the server; and predicting the future fault condition of the server according to the updated heterogeneous graph. The application thereby achieves the technical effect of predicting the future fault condition of the server.

Description

Server fault prediction method and device

Technical Field

The application relates to the technical field of server monitoring, in particular to a server fault prediction method and device.

Background

As data centers continue to grow in scale, the demand for their security and stability keeps increasing. The machine room serves as an important data center, and if a server housed in the machine room fails, related business operations are often disrupted, resulting in data damage, economic loss and the like.

Currently, the prior art typically monitors server faults by manually testing the servers at regular intervals. On the one hand, because a data center houses a huge number of servers, manual monitoring consumes a great deal of manpower and increases the maintenance cost of the servers. On the other hand, because server faults cannot be predicted, it is difficult to discover and handle a fault promptly after it occurs.

In view of the above problems, no effective solution has been proposed at present.

Disclosure of Invention

The embodiment of the application provides a server fault prediction method and device, which aim to solve the above technical problems.

According to an aspect of the embodiment of the present application, there is provided a server fault prediction method, including: acquiring a historical fault record in a first preset time period before prediction from a fault database of a server; wherein the fault database stores first time sequence data corresponding to a plurality of fault records, and each fault record comprises fault type data; acquiring second time sequence data corresponding to the running state of the server in a second preset time period before prediction; the running state of the server comprises the running state of a disk of the server, the utilization rate of a central processing unit of the server, the read-write state of a database deployed on the server and the response time of the server; determining a node association relationship based on the first time sequence data and the second time sequence data; the node association relationship comprises the interrelation between a fault record node and a running state node, wherein the fault record node and the running state node each comprise a plurality of attributes; the plurality of attributes of the running state node are used for representing a plurality of characteristics of the running state of the server; the plurality of attributes of the fault record node are used for representing a plurality of characteristics of the plurality of fault records; constructing a heterogeneous graph comprising the fault record node and the running state node according to the node association relationship; in the heterogeneous graph, each edge between the fault record node and the running state node is determined based on the node association relationship; updating the weight of each edge between the fault record node and the running state node in the heterogeneous graph based on the graph embedding vector of the server; the graph embedding vector of the server is constructed according to a first graph neural network; and predicting the future fault condition of the server according to the updated heterogeneous graph.

According to another aspect of the embodiment of the present application, there is provided a server fault prediction apparatus, including: a historical fault record obtaining unit, configured to obtain a historical fault record in a first preset time period before prediction from a fault database of a server; wherein the fault database stores first time sequence data corresponding to a plurality of fault records, and each fault record comprises fault type data; a historical running state obtaining unit, configured to obtain second time sequence data corresponding to the running state of the server in a second preset time period before prediction is performed; the running state of the server comprises the running state of a disk of the server, the utilization rate of a central processing unit of the server, the read-write state of a database deployed on the server and the response time of the server; a node association relationship determining unit, configured to determine a node association relationship based on the first time sequence data and the second time sequence data; the node association relationship comprises the interrelation between a fault record node and a running state node, wherein the fault record node and the running state node each comprise a plurality of attributes; the plurality of attributes of the running state node are used for representing a plurality of characteristics of the running state of the server; the plurality of attributes of the fault record node are used for representing a plurality of characteristics of the plurality of fault records; a heterogeneous graph construction unit, configured to construct a heterogeneous graph comprising the fault record node and the running state node according to the node association relationship; in the heterogeneous graph, each edge between the fault record node and the running state node is determined based on the node association relationship; an edge weight updating unit, configured to update the weight of each edge between the fault record node and the running state node in the heterogeneous graph based on a graph embedding vector of the server; the graph embedding vector of the server is constructed according to a first graph neural network; and a fault prediction unit, configured to predict the future fault condition of the server according to the updated heterogeneous graph.

According to still another aspect of the embodiments of the present application, there is also provided an electronic device including a memory in which a computer program is stored, and a processor configured to execute the server failure prediction method described above by the computer program.

Based on the server fault prediction method and device provided by the application, a historical fault record in a first preset time period before prediction is obtained from a fault database of a server; acquiring second time sequence data corresponding to the running state of the server in a second preset time period before prediction; determining a node association relationship based on the first time sequence data and the second time sequence data; constructing a heterogeneous graph comprising fault record nodes and running state nodes according to the node association relation; updating the weight of each edge between the fault record node and the running state node in the heterogeneous graph based on the graph embedding vector of the server; and predicting the future fault condition of the server according to the updated heterogeneous graph. That is, through the prediction of the faults of the servers stored in the machine room, on one hand, the faults can be found out in time and solved, so that economic loss can be avoided to a certain extent, and the degree of data loss can be reduced; on the other hand, manpower and material resources can be saved, the maintenance cost of the server is saved, the running stability of the server is ensured, and the running stability of related businesses can be further ensured.

Drawings

The accompanying drawings, which are included to provide a further understanding of the embodiments of the application and constitute a part of this specification, illustrate embodiments of the application and, together with the description, serve to explain the application without limiting it. In the drawings:

FIG. 1 is a flow chart of an alternative server failure prediction method according to an embodiment of the present application;

FIG. 2 is a block diagram of an alternative server failure prediction apparatus according to an embodiment of the present application.

The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.

Detailed Description

In order to enable those skilled in the art to better understand the present application, the following description will make clear and complete descriptions of the technical solutions according to the embodiments of the present application with reference to the accompanying drawings.

It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.

As an optional implementation, as shown in FIG. 1, an embodiment of the present application provides a server fault prediction method, including:

S101, acquiring a historical fault record in a first preset time period before prediction from a fault database of a server; the fault database stores first time sequence data corresponding to a plurality of fault records, and each fault record comprises fault type data;

It should be understood that time sequence data refers to time-series data, that is, a sequence of values of one consistent indicator recorded in chronological order. The individual values in the same sequence are comparable with one another. Time-series data may be period data or point-in-time data. The aim of time-series analysis is to build a time-series model by identifying the statistical characteristics and development patterns of the series within the sample, and to use that model for out-of-sample prediction.

In the embodiment of the present application, the server may be any one or more servers in the machine room. The fault type data may include, but is not limited to, a disk fault type, a database fault type, a network connection fault type, and a central processing unit fault type. Each fault record may include fault feature data. Taking fault type data that includes the disk fault type as an example, the fault feature data may include, but is not limited to, the number of unsafe power-downs, the number of bad disk sectors, the number of hardware restarts, the abnormal temperature of the hard disk, and the abnormal start time of the hard disk.

S102, acquiring second time sequence data corresponding to the running state of the server in a second preset time period before prediction; the running state of the server comprises the running state of a disk of the server, the utilization rate of a central processing unit of the server, the read-write state of a database deployed on the server and the response time of the server;

In some embodiments of the application, the first preset time period and the second preset time period may be the same or different. The running state of the server may also include, but is not limited to, the temperature of the central processing unit, the motherboard temperature, etc.

As an alternative, the response time of the server may be obtained based on the following steps:

sending a detection signal to the server at a first time point, the sending being repeated according to a preset period, wherein the server returns a response signal after receiving the detection signal;

determining a second time point for receiving a response signal returned by the server;

the time period between the second time point and the first time point is taken as the response time of the server.
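As an illustration only, the three steps above can be sketched in Python. The names `measure_response_time`, `collect_response_times` and `send_probe` are hypothetical, and the sketch assumes that `send_probe` blocks until the response signal has been received; the application does not prescribe any particular transport or probe format.

```python
import time
from typing import Callable, List


def measure_response_time(send_probe: Callable[[], None],
                          clock: Callable[[], float] = time.monotonic) -> float:
    """Return the time between sending a probe and receiving its response."""
    first_time_point = clock()    # detection signal is sent
    send_probe()                  # assumed to block until the response arrives
    second_time_point = clock()   # response signal has been received
    return second_time_point - first_time_point


def collect_response_times(send_probe: Callable[[], None],
                           period_s: float,
                           samples: int,
                           clock: Callable[[], float] = time.monotonic,
                           sleep: Callable[[float], None] = time.sleep) -> List[float]:
    """Probe the server once per preset period and collect the response times."""
    results = []
    for _ in range(samples):
        results.append(measure_response_time(send_probe, clock))
        sleep(period_s)           # wait for the next period
    return results
```

A monotonic clock is used here so that wall-clock adjustments on the monitoring host do not distort the measured interval.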

S103, determining a node association relationship based on the first time sequence data and the second time sequence data; the node association relationship comprises the interrelation between a fault record node and a running state node, wherein the fault record node and the running state node each comprise a plurality of attributes; the plurality of attributes of the running state node are used to represent a plurality of characteristics of the running state of the server; the plurality of attributes of the fault record node are used to represent a plurality of characteristics of a plurality of fault records;

S104, constructing a heterogeneous graph comprising fault record nodes and running state nodes according to the node association relationship; in the heterogeneous graph, each edge between a fault record node and a running state node is determined based on the node association relationship;

S105, updating the weight of each edge between the fault record node and the running state node in the heterogeneous graph based on the graph embedding vector of the server; the graph embedding vector of the server is constructed according to the first graph neural network;

S106, predicting the future fault condition of the server according to the updated heterogeneous graph.

Based on the embodiment provided by the application, a historical fault record in a first preset time period before prediction is obtained from a fault database of a server; acquiring second time sequence data corresponding to the running state of the server in a second preset time period before prediction; determining a node association relationship based on the first time sequence data and the second time sequence data; constructing a heterogeneous graph comprising fault record nodes and running state nodes according to the node association relation; updating the weight of each edge between the fault record node and the running state node in the heterogeneous graph based on the graph embedding vector of the server; and predicting the future fault condition of the server according to the updated heterogeneous graph. That is, through the prediction of the faults of the servers stored in the machine room, on one hand, the faults can be found out in time and solved, so that economic loss can be avoided to a certain extent, and the degree of data loss can be reduced; on the other hand, manpower and material resources can be saved, the maintenance cost of the server is saved, the running stability of the server is ensured, and the running stability of related businesses can be further ensured.
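Steps S103 and S104 can be pictured with the following minimal Python sketch. The association rule used here (linking a fault record to every running state sample taken within a fixed window before the fault) is an assumption made purely for illustration, as are the names `Node`, `associate`, `build_heterogeneous_graph` and `window_s`; the application does not specify how the node association relationship is computed.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    node_id: str
    kind: str                      # "fault_record" or "running_state"
    timestamp: float               # position of the node in its time series
    attributes: Dict[str, float] = field(default_factory=dict)


def associate(faults: List[Node], states: List[Node],
              window_s: float) -> List[Tuple[str, str]]:
    """Hypothetical rule: pair a fault with states seen up to window_s before it."""
    pairs = []
    for fault in faults:
        for state in states:
            if 0 <= fault.timestamp - state.timestamp <= window_s:
                pairs.append((fault.node_id, state.node_id))
    return pairs


def build_heterogeneous_graph(faults: List[Node], states: List[Node],
                              window_s: float):
    """Return (nodes, edges); every edge starts with weight 1.0."""
    nodes = {n.node_id: n for n in faults + states}
    edges = {pair: 1.0 for pair in associate(faults, states, window_s)}
    return nodes, edges
```

The graph is heterogeneous because its two node kinds carry different attribute sets, for example disk fault counters on one side and CPU utilization on the other.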

As an alternative, updating weights of edges between fault record nodes and running state nodes in a heterogeneous graph based on graph embedding vectors of a server includes:

inputting a plurality of attributes of the running state nodes and a plurality of attributes of the fault record nodes into the first graph neural network, and calculating the Gaussian similarity function value of each edge between the fault record nodes and the running state nodes;

and constructing an association weight matrix of the edges according to the Gaussian similarity function value of each edge and the graph embedding vector of the server, and updating the weight of each edge according to the association weight matrix.
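The Gaussian similarity part of this step can be sketched as follows. The sketch assumes the common form exp(-||x - y||^2 / (2σ^2)) and assumes that both node types have already been projected to vectors of equal length; the bandwidth `sigma` and the function names are hypothetical. How the graph embedding vector of the server enters the association weight matrix is not detailed in the text, so it is left out here.

```python
import math
from typing import List, Sequence, Tuple


def gaussian_similarity(x: Sequence[float], y: Sequence[float],
                        sigma: float = 1.0) -> float:
    """exp(-||x - y||^2 / (2 * sigma^2)); equals 1.0 for identical vectors."""
    squared_distance = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-squared_distance / (2.0 * sigma ** 2))


def association_weight_matrix(fault_vecs: List[Sequence[float]],
                              state_vecs: List[Sequence[float]],
                              edges: List[Tuple[int, int]],
                              sigma: float = 1.0) -> List[List[float]]:
    """Rows index fault record nodes, columns index running state nodes.

    Only pairs listed in `edges` receive a similarity; other entries stay 0.0.
    """
    matrix = [[0.0] * len(state_vecs) for _ in fault_vecs]
    for i, j in edges:
        matrix[i][j] = gaussian_similarity(fault_vecs[i], state_vecs[j], sigma)
    return matrix
```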

As an alternative, predicting the future fault condition of the server according to the updated heterogeneous graph includes:

determining the current network information of the updated heterogeneous graph according to the weight of each edge in the updated heterogeneous graph and the node grouping condition;

under the condition that the current network information of the updated heterogeneous graph meets the preset network condition, mapping the weights of the nodes in each group and the weight of each edge in the updated heterogeneous graph to a generalized adjacency matrix to obtain a heterogeneous weight matrix; wherein the nodes in each group comprise the fault record nodes and the running state nodes in that group;

In the embodiment of the application, the fault record node and the running state node each comprise a plurality of child nodes. When the child nodes of the nodes in the heterogeneous graph are grouped differently, the weight values of the child nodes in the different groups generally change, and nodes belonging to the same group have relatively larger weight values. The node groups may be partitioned based on the correlation characteristics of the child nodes included in the fault record node and the running state node, where the correlation characteristics may include software correlations or hardware correlations; the node groups may also be partitioned based on the time intervals to which those child nodes belong, or based on both the correlation characteristics of the child nodes and the time intervals to which they belong.

In some embodiments of the application, the preset network condition is determined based on the degree of difference of the nodes in each group and the density of neighbor nodes; under the condition that the current network information of the updated heterogeneous graph does not meet the preset network condition, the node grouping is updated until the current network information of the updated heterogeneous graph meets the preset network condition. The more neighbor nodes a node has, the higher its neighbor node density.

Based on the embodiment provided by the application, the weights of the nodes in each group and the weight of each edge in the updated heterogeneous graph are mapped to the generalized adjacency matrix according to the preset network condition, so that a more accurate heterogeneous weight matrix can be obtained, which facilitates accurate subsequent calculation.
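One possible reading of the mapping described above is sketched below in Python. It assumes that the generalized adjacency matrix is a square matrix over all nodes, with node weights on the diagonal and edge weights off the diagonal, and that neighbor node density is simply the neighbor count; both are assumptions for illustration, and the function names are hypothetical.

```python
from typing import Dict, List, Tuple


def neighbor_density(edges: Dict[Tuple[int, int], float],
                     node_count: int) -> List[int]:
    """Count the neighbors of each node; more neighbors means higher density."""
    density = [0] * node_count
    for i, j in edges:
        density[i] += 1
        density[j] += 1
    return density


def heterogeneous_weight_matrix(node_weights: List[float],
                                edges: Dict[Tuple[int, int], float]) -> List[List[float]]:
    """Assumed layout: node weights on the diagonal, edge weights elsewhere."""
    n = len(node_weights)
    matrix = [[0.0] * n for _ in range(n)]
    for i, weight in enumerate(node_weights):
        matrix[i][i] = weight
    for (i, j), weight in edges.items():
        matrix[i][j] = weight
        matrix[j][i] = weight      # undirected edge, so the matrix is symmetric
    return matrix
```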

acquiring the feature vector of the fault record node and the feature vector of the running state node in the updated heterogeneous graph;

updating the feature vector of the fault record node and the feature vector of the running state node by using a gated graph convolutional temporal neural network; performing, by using an attention network, a weighted summation of the updated feature vector of the fault record node and the updated feature vector of the running state node to obtain a node fusion feature vector;

the attention network forms a shared space by fusing the feature information of the fault record node with the feature information of the running state node, so that the parameters share all of the feature information;

In this embodiment, the updated feature vector of the fault record node and the updated feature vector of the running state node are input into the attention network, so that the dependency relationship between the context information in each node sequence can be captured.

processing the node fusion feature vector with a second graph neural network according to the heterogeneous weight matrix to obtain a node hidden vector;

and predicting the future fault condition of the server based on the heterogeneous weight matrix and the node hidden vector.
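The attention-based weighted summation in the steps above can be illustrated with a minimal dot-product attention sketch. The scoring scheme (a dot product with a `query` vector followed by a softmax) is an assumption chosen for brevity, and `attention_fuse` and `query` are hypothetical names; the application does not state which attention formulation is used.

```python
import math
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax; the outputs sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(vectors: List[Sequence[float]],
                   query: Sequence[float]) -> Tuple[List[float], List[float]]:
    """Weighted sum of feature vectors with attention weights from dot products."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    fused = [sum(w * vec[k] for w, vec in zip(weights, vectors))
             for k in range(dim)]
    return fused, weights
```

Here `vectors` would hold the updated feature vector of the fault record node and that of the running state node, and `fused` plays the role of the node fusion feature vector.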

As an alternative, the first graph neural network is built based on the following steps:

acquiring a plurality of training samples; wherein each training sample comprises fault record structure data and running state structure data; the fault record structure data is graph structure data corresponding to a plurality of characteristics of a plurality of sample fault records, and the running state structure data is graph structure data corresponding to a plurality of characteristics of a sample running state of the server;

And performing iterative training on the first graph neural network by using a plurality of training samples until the maximum iteration times are reached or the network parameters of the first graph neural network are converged, so as to obtain the trained first graph neural network.
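The two stopping conditions named above (a maximum number of iterations, or convergence of the network parameters) can be sketched with a generic training loop. Plain gradient descent on a list of parameters stands in for the first graph neural network here, which is a deliberate simplification; `train`, `grad_fn`, `lr` and `tol` are hypothetical names.

```python
from typing import Callable, List, Tuple


def train(params: List[float],
          grad_fn: Callable[[List[float]], List[float]],
          lr: float = 0.1,
          max_iters: int = 1000,
          tol: float = 1e-6) -> Tuple[List[float], int, bool]:
    """Iterate until parameters converge or max_iters is reached.

    Returns (parameters, iterations used, whether convergence was reached).
    """
    for iteration in range(1, max_iters + 1):
        grads = grad_fn(params)
        new_params = [p - lr * g for p, g in zip(params, grads)]
        # Convergence test: largest parameter change in this iteration.
        change = max(abs(a - b) for a, b in zip(new_params, params))
        params = new_params
        if change < tol:
            return params, iteration, True
    return params, max_iters, False
```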

As an alternative, before obtaining the failure database of the server, the method further includes:

After repairing the fault occurring in the server, the fault record is stored in a fault database.
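As a sketch only, storing a fault record after repair and later reading back the records of the first preset time period could look as follows, using Python's standard `sqlite3` module. The table layout, column names and function names are assumptions for illustration; the application does not specify a database schema.

```python
import sqlite3
from typing import List, Tuple


def open_fault_db(path: str = ":memory:") -> sqlite3.Connection:
    """Open the fault database and create the (hypothetical) table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS fault_records ("
        "id INTEGER PRIMARY KEY, server_id TEXT NOT NULL, "
        "fault_type TEXT NOT NULL, occurred_at REAL NOT NULL, "
        "repaired_at REAL NOT NULL)"
    )
    return conn


def store_after_repair(conn: sqlite3.Connection, server_id: str,
                       fault_type: str, occurred_at: float,
                       repaired_at: float) -> None:
    """Persist one fault record once the fault has been repaired."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO fault_records "
            "(server_id, fault_type, occurred_at, repaired_at) VALUES (?, ?, ?, ?)",
            (server_id, fault_type, occurred_at, repaired_at),
        )


def historical_faults(conn: sqlite3.Connection, server_id: str,
                      predict_at: float,
                      first_period_s: float) -> List[Tuple[str, float]]:
    """Fault records within the first preset time period before prediction."""
    return conn.execute(
        "SELECT fault_type, occurred_at FROM fault_records "
        "WHERE server_id = ? AND occurred_at >= ? AND occurred_at < ? "
        "ORDER BY occurred_at",
        (server_id, predict_at - first_period_s, predict_at),
    ).fetchall()
```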

As an optional embodiment, as shown in FIG. 2, the present application provides a server fault prediction apparatus, including:

A historical fault record obtaining unit 201, configured to obtain a historical fault record in a first preset time period before prediction from a fault database of a server; the fault database stores first time sequence data corresponding to a plurality of fault records, and each fault record comprises fault type data;

A historical running state obtaining unit 202, configured to obtain second time-series data corresponding to the running state of the server in a second preset time period before the prediction is performed; the running state of the server comprises the running state of a disk of the server, the utilization rate of a central processing unit of the server, the read-write state of a database deployed on the server and the response time of the server;

A node association relationship determining unit 203, configured to determine a node association relationship based on the first time sequence data and the second time sequence data; the node association relationship comprises the interrelation between a fault record node and a running state node, wherein the fault record node and the running state node each comprise a plurality of attributes; the plurality of attributes of the running state node are used to represent a plurality of characteristics of the running state of the server; the plurality of attributes of the fault record node are used to represent a plurality of characteristics of a plurality of fault records;

A heterogeneous graph construction unit 204, configured to construct a heterogeneous graph including fault record nodes and running state nodes according to the node association relationship; in the heterogeneous graph, each edge between a fault record node and a running state node is determined based on the node association relationship;

An edge weight updating unit 205, configured to update weights of edges between the fault record node and the running state node in the heterogeneous graph based on the graph embedding vector of the server; the graph embedding vector of the server is constructed according to the first graph neural network;

A fault prediction unit 206, configured to predict the future fault condition of the server according to the updated heterogeneous graph.

Based on the embodiment provided by the application, the functions realized by the units are combined, so that on one hand, faults can be found out and solved in time, economic loss can be avoided to a certain extent, and the degree of data loss is reduced; on the other hand, manpower and material resources can be saved, the maintenance cost of the server is saved, the running stability of the server is ensured, and the running stability of related businesses can be further ensured.

It should be noted that, for the embodiments of the server fault prediction apparatus, reference may be made to the embodiments of the server fault prediction method described above; they are not described again here.

The foregoing is merely a preferred embodiment of the present application. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present application, and such modifications and adaptations are intended to fall within the scope of the present application.

Claims

1. A server failure prediction method, comprising:

Acquiring a historical fault record in a first preset time period before prediction from a fault database of a server; the fault database stores first time sequence data corresponding to a plurality of fault records, and each fault record comprises fault type data;

Acquiring second time sequence data corresponding to the running state of the server in a second preset time period before prediction; the running state of the server comprises the running state of a disk of the server, the utilization rate of a central processing unit of the server, the read-write state of a database deployed on the server and the response time of the server;

Determining a node association relationship based on the first time sequence data and the second time sequence data; the node association relationship comprises the interrelation between a fault record node and a running state node, wherein the fault record node and the running state node each comprise a plurality of attributes; the plurality of attributes of the running state node are used to represent a plurality of characteristics of the running state of the server; the plurality of attributes of the fault record node are used to represent a plurality of characteristics of the plurality of fault records;

Constructing a heterogeneous graph comprising the fault record node and the running state node according to the node association relationship; in the heterogeneous graph, each edge between the fault record node and the running state node is determined based on the node association relationship;

inputting the multiple attributes of the running state nodes and the multiple attributes of the fault record nodes into a first graph neural network, and calculating Gaussian similarity function values of each edge between the fault record nodes and the running state nodes;

Constructing an association weight matrix of the edges according to the Gaussian similarity function value of each edge and the graph embedding vector of the server, and updating the weight of each edge according to the association weight matrix; the graph embedding vector of the server is constructed according to the first graph neural network;

Determining current network information of the updated heterogeneous graph according to the weight of each edge in the updated heterogeneous graph and the node grouping condition;

under the condition that the current network information of the updated heterogeneous graph meets the preset network condition, mapping the weights of the nodes in each group and the weight of each edge in the updated heterogeneous graph to a generalized adjacency matrix to obtain a heterogeneous weight matrix; wherein the nodes in each group include the fault record node and the running state node in each group;

Acquiring the feature vector of the fault record node and the feature vector of the running state node in the updated heterogeneous graph;

Updating the feature vector of the fault record node and the feature vector of the running state node by using a gated graph convolutional temporal neural network;

Performing, by using an attention network, a weighted summation of the updated feature vector of the fault record node and the updated feature vector of the running state node to obtain a node fusion feature vector; the attention network forms a shared space by fusing the feature information of the fault record node with the feature information of the running state node, so that the parameters share all of the feature information;

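Processing the node fusion feature vector with a second graph neural network according to the heterogeneous weight matrix to obtain a node hidden vector;

and predicting the future fault condition of the server based on the heterogeneous weight matrix and the node hidden vector.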

2. The server failure prediction method according to claim 1, characterized in that,

The preset network condition is determined based on the degree of difference of the nodes in each group and the density of neighbor nodes; and under the condition that the current network information of the updated heterogeneous graph does not meet the preset network condition, the node grouping is updated until the current network information of the updated heterogeneous graph meets the preset network condition.

3. The server failure prediction method according to claim 1, wherein the first graph neural network is established based on the steps of:

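acquiring a plurality of training samples; wherein each training sample comprises fault record structure data and running state structure data; the fault record structure data is graph structure data corresponding to a plurality of characteristics of a plurality of sample fault records, and the running state structure data is graph structure data corresponding to a plurality of characteristics of a sample running state of the server;

and carrying out iterative training on the first graph neural network by using the plurality of training samples until the maximum number of iterations is reached or the network parameters of the first graph neural network converge, so as to obtain the trained first graph neural network.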

4. The server failure prediction method according to claim 1, wherein the response time of the server is obtained based on the steps of:

sending a detection signal to the server at a first time point, the sending being repeated according to a preset period, wherein the server returns a response signal after receiving the detection signal;

determining a second time point of receiving the response signal returned by the server;

and taking the time period between the second time point and the first time point as the response time of the server.

5. The server failure prediction method according to claim 1, characterized in that before acquiring the failure database of the server, the method further comprises:

and after repairing the faults of the server, storing fault records into the fault database.

6. A server failure prediction apparatus, comprising:

A historical fault record obtaining unit, configured to obtain a historical fault record in a first preset time period before prediction from a fault database of a server; the fault database stores first time sequence data corresponding to a plurality of fault records, and each fault record comprises fault type data;

A historical running state acquisition unit, configured to acquire second time sequence data corresponding to the running state of the server in a second preset time period before prediction is performed; the running state of the server comprises the running state of a disk of the server, the utilization rate of a central processing unit of the server, the read-write state of a database deployed on the server and the response time of the server;

A node association relationship determining unit, configured to determine a node association relationship based on the first time sequence data and the second time sequence data; the node association relationship comprises the interrelation between a fault record node and a running state node, wherein the fault record node and the running state node each comprise a plurality of attributes; the plurality of attributes of the running state node are used to represent a plurality of characteristics of the running state of the server; the plurality of attributes of the fault record node are used to represent a plurality of characteristics of the plurality of fault records;

A heterogeneous graph construction unit, configured to construct a heterogeneous graph comprising the fault record node and the running state node according to the node association relationship; in the heterogeneous graph, each edge between the fault record node and the running state node is determined based on the node association relationship;

An edge weight updating unit, configured to input the plurality of attributes of the running state node and the plurality of attributes of the fault record node into a first graph neural network, and calculate the Gaussian similarity function value of each edge between the fault record node and the running state node; construct an association weight matrix of the edges according to the Gaussian similarity function value of each edge and the graph embedding vector of the server, and update the weight of each edge according to the association weight matrix; the graph embedding vector of the server is constructed according to the first graph neural network;

A fault prediction unit, configured to determine the current network information of the updated heterogeneous graph according to the weight of each edge in the updated heterogeneous graph and the node grouping condition; under the condition that the current network information of the updated heterogeneous graph meets the preset network condition, map the weights of the nodes in each group and the weight of each edge in the updated heterogeneous graph to a generalized adjacency matrix to obtain a heterogeneous weight matrix; wherein the nodes in each group include the fault record node and the running state node in each group; acquire the feature vector of the fault record node and the feature vector of the running state node in the updated heterogeneous graph; update the feature vector of the fault record node and the feature vector of the running state node by using a gated graph convolutional temporal neural network; perform, by using an attention network, a weighted summation of the updated feature vector of the fault record node and the updated feature vector of the running state node to obtain a node fusion feature vector; the attention network forms a shared space by fusing the feature information of the fault record node with the feature information of the running state node, so that the parameters share all of the feature information; process the node fusion feature vector with a second graph neural network according to the heterogeneous weight matrix to obtain a node hidden vector; and predict the future fault condition of the server based on the heterogeneous weight matrix and the node hidden vector.