
CN118394559B - Server fault prediction method and device - Google Patents

Server fault prediction method and device

Info

Publication number
CN118394559B
Authority
CN
China
Prior art keywords
node
server
fault
running state
graph
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202410852079.4A
Other languages
Chinese (zh)
Other versions
CN118394559A (en
Inventor
王帅
曾治富
刘燚
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Sichuan Chuanxi Data Industry Co ltd
Original Assignee
Sichuan Chuanxi Data Industry Co ltd
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Sichuan Chuanxi Data Industry Co ltd filed Critical Sichuan Chuanxi Data Industry Co ltd
Priority to CN202410852079.4A priority Critical patent/CN118394559B/en
Publication of CN118394559A publication Critical patent/CN118394559A/en
Application granted granted Critical
Publication of CN118394559B publication Critical patent/CN118394559B/en
Active legal-status Critical Current
Anticipated expiration legal-status Critical

Classifications

    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06F ELECTRIC DIGITAL DATA PROCESSING
    • G06F11/00 Error detection; Error correction; Monitoring
    • G06F11/07 Responding to the occurrence of a fault, e.g. fault tolerance
    • G06F11/0703 Error or fault processing not based on redundancy, i.e. by taking additional measures to deal with the error or fault not making use of redundancy in operation, in hardware, or in data representation
    • G06F11/079 Root cause analysis, i.e. error or fault diagnosis
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/042 Knowledge-based neural networks; Logical representations of neural networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/04 Architecture, e.g. interconnection topology
    • G06N3/045 Combinations of networks
    • G PHYSICS
    • G06 COMPUTING OR CALCULATING; COUNTING
    • G06N COMPUTING ARRANGEMENTS BASED ON SPECIFIC COMPUTATIONAL MODELS
    • G06N3/00 Computing arrangements based on biological models
    • G06N3/02 Neural networks
    • G06N3/08 Learning methods

Landscapes

  • Engineering & Computer Science (AREA)
  • Theoretical Computer Science (AREA)
  • Physics & Mathematics (AREA)
  • Health & Medical Sciences (AREA)
  • General Physics & Mathematics (AREA)
  • Biomedical Technology (AREA)
  • General Engineering & Computer Science (AREA)
  • General Health & Medical Sciences (AREA)
  • Data Mining & Analysis (AREA)
  • Evolutionary Computation (AREA)
  • Computational Linguistics (AREA)
  • Molecular Biology (AREA)
  • Computing Systems (AREA)
  • Biophysics (AREA)
  • Artificial Intelligence (AREA)
  • Mathematical Physics (AREA)
  • Software Systems (AREA)
  • Life Sciences & Earth Sciences (AREA)
  • Quality & Reliability (AREA)
  • Debugging And Monitoring (AREA)

Abstract

The application discloses a server fault prediction method and device, and relates to the technical field of server monitoring. The method comprises the following steps: acquiring, from a fault database of a server, a historical fault record in a first preset time period before prediction, the fault database storing first time sequence data corresponding to a plurality of fault records; acquiring second time sequence data corresponding to the running state of the server in a second preset time period before prediction; determining a node association relationship based on the first time sequence data and the second time sequence data; constructing a heterogeneous graph comprising fault record nodes and running state nodes according to the node association relationship; updating the weight of each edge between the fault record node and the running state node in the heterogeneous graph based on the graph embedding vector of the server; and predicting the future fault condition of the server according to the updated heterogeneous graph. The application thereby achieves the technical effect of predicting the future fault condition of the server.

Description

Server fault prediction method and device
Technical Field
The application relates to the technical field of server monitoring, in particular to a server fault prediction method and device.
Background
As data centers continue to grow in size, the demands on their security and stability keep increasing. The machine room serves as an important part of a data center, and if a server housed in the machine room fails, the related business operations are often disrupted, causing data damage, economic loss and the like.
Currently, the prior art monitors the fault condition of a server typically by manually and periodically testing the server. On the one hand, because the number of servers housed in a data center is huge, manually monitoring their fault condition consumes a great deal of manpower and increases the maintenance cost of the servers. On the other hand, because the faults of a server cannot be predicted, it is difficult to find and handle a fault in time after it occurs.
In view of the above problems, no effective solution has been proposed at present.
Disclosure of Invention
The embodiment of the application provides a server fault prediction method and device, which aim to solve the technical problems.
According to an aspect of the embodiment of the present application, there is provided a server fault prediction method, including: acquiring a historical fault record in a first preset time period before prediction from a fault database of a server; wherein the fault database stores first time sequence data corresponding to a plurality of fault records, and each fault record comprises fault type data; acquiring second time sequence data corresponding to the running state of the server in a second preset time period before prediction; the running state of the server comprises the running state of a disk of the server, the utilization rate of a central processing unit of the server, the read-write state of a database deployed on the server and the response time of the server; determining a node association relationship based on the first time sequence data and the second time sequence data; the node association relationship comprises the interrelation between a fault record node and a running state node, wherein the fault record node and the running state node each comprise a plurality of attributes; the plurality of attributes of the running state node are used for representing a plurality of characteristics of the running state of the server; the plurality of attributes of the fault record node are used for representing a plurality of characteristics of the plurality of fault records; constructing a heterogeneous graph comprising the fault record node and the running state node according to the node association relationship; in the heterogeneous graph, each edge between the fault record node and the running state node is determined based on the node association relationship; updating the weight of each edge between the fault record node and the running state node in the heterogeneous graph based on the graph embedding vector of the server; the graph embedding vector of the server is constructed according to a first graph neural network; and predicting the future fault condition of the server according to the updated heterogeneous graph.
According to another aspect of the embodiment of the present application, there is provided a server fault prediction device, including: a historical fault record obtaining unit, configured to obtain a historical fault record in a first preset time period before prediction from a fault database of a server; wherein the fault database stores first time sequence data corresponding to a plurality of fault records, and each fault record comprises fault type data; a historical running state obtaining unit, configured to obtain second time sequence data corresponding to the running state of the server in a second preset time period before prediction; the running state of the server comprises the running state of a disk of the server, the utilization rate of a central processing unit of the server, the read-write state of a database deployed on the server and the response time of the server; a node association relationship determining unit, configured to determine a node association relationship based on the first time sequence data and the second time sequence data; the node association relationship comprises the interrelation between a fault record node and a running state node, wherein the fault record node and the running state node each comprise a plurality of attributes; the plurality of attributes of the running state node are used for representing a plurality of characteristics of the running state of the server; the plurality of attributes of the fault record node are used for representing a plurality of characteristics of the plurality of fault records; a heterogeneous graph construction unit, configured to construct a heterogeneous graph comprising the fault record node and the running state node according to the node association relationship; in the heterogeneous graph, each edge between the fault record node and the running state node is determined based on the node association relationship; an edge weight updating unit, configured to update the weight of each edge between the fault record node and the running state node in the heterogeneous graph based on a graph embedding vector of the server; the graph embedding vector of the server is constructed according to a first graph neural network; and a fault prediction unit, configured to predict the future fault condition of the server according to the updated heterogeneous graph.
According to still another aspect of the embodiments of the present application, there is also provided an electronic device including a memory in which a computer program is stored, and a processor configured to execute the server failure prediction method described above by the computer program.
Based on the server fault prediction method and device provided by the application, a historical fault record in a first preset time period before prediction is obtained from a fault database of a server; acquiring second time sequence data corresponding to the running state of the server in a second preset time period before prediction; determining a node association relationship based on the first time sequence data and the second time sequence data; constructing a heterogeneous graph comprising fault record nodes and running state nodes according to the node association relation; updating the weight of each edge between the fault record node and the running state node in the heterogeneous graph based on the graph embedding vector of the server; and predicting the future fault condition of the server according to the updated heterogeneous graph. That is, through the prediction of the faults of the servers stored in the machine room, on one hand, the faults can be found out in time and solved, so that economic loss can be avoided to a certain extent, and the degree of data loss can be reduced; on the other hand, manpower and material resources can be saved, the maintenance cost of the server is saved, the running stability of the server is ensured, and the running stability of related businesses can be further ensured.
Drawings
The accompanying drawings, which are included to provide a further understanding of embodiments of the application and are incorporated in and constitute a part of this specification, illustrate embodiments of the application and together with the description serve to explain the application and without limitation to the application. In the drawings:
FIG. 1 is a flow chart of an alternative server failure prediction method according to an embodiment of the present application;
Fig. 2 is a block diagram of an alternative server failure prediction apparatus according to an embodiment of the present application.
The achievement of the objects, functional features and advantages of the present invention will be further described with reference to the accompanying drawings, in conjunction with the embodiments.
Detailed Description
In order to enable those skilled in the art to better understand the present application, the technical solutions of the embodiments of the present application are described clearly and completely below with reference to the accompanying drawings.
It should be noted that the terms "first," "second," and the like in the description and the claims of the present application and the above figures are used for distinguishing between similar objects and not necessarily for describing a particular sequential or chronological order. It is to be understood that the data so used may be interchanged where appropriate such that the embodiments of the application described herein may be implemented in sequences other than those illustrated or otherwise described herein. Furthermore, the terms "comprises," "comprising," and "having," and any variations thereof, are intended to cover a non-exclusive inclusion, such that a process, method, system, article, or apparatus that comprises a list of steps or elements is not necessarily limited to those steps or elements expressly listed but may include other steps or elements not expressly listed or inherent to such process, method, article, or apparatus.
As an optional implementation, as shown in fig. 1, an embodiment of the present application provides a server fault prediction method, including:
S101, acquiring a historical fault record in a first preset time period before prediction from a fault database of a server; the fault database stores first time sequence data corresponding to a plurality of fault records, and each fault record comprises fault type data;
It should be understood that the time sequence data referred to here is time series data, that is, a sequence of values of the same indicator recorded in chronological order. Values within the same sequence are comparable with one another. Time series data may be period data or point-in-time data. Time series analysis aims to build a time series model from the statistical characteristics and development regularity of the in-sample series, and to use that model for out-of-sample prediction.
In the embodiment of the present application, the server may be any one or more servers in the machine room. The fault type data may include, but is not limited to, a disk fault type, a database fault type, a network connection fault type, and a central processing unit fault type. Each fault record may also include fault feature data. Taking a disk fault type as an example, the fault feature data may include, but is not limited to, the number of unsafe shutdowns, the number of bad disk sectors, the number of hardware restarts, abnormal hard disk temperature, and abnormal hard disk start-up time.
S102, acquiring second time sequence data corresponding to the running state of the server in a second preset time period before prediction; the running state of the server comprises the running state of a disk of the server, the utilization rate of a central processing unit of the server, the read-write state of a database deployed on the server and the response time of the server;
In some embodiments of the application, the first preset time period and the second preset time period may be the same or different. The running state of the server may also include, but is not limited to, the temperature of the central processing unit, the motherboard temperature, etc.
As an alternative, the response time of the server may be obtained based on the following steps:
sending a detection signal to the server at a first time point, the sending being repeated according to a preset period, wherein the server returns a response signal after receiving the detection signal;
determining a second time point at which the response signal returned by the server is received;
taking the time period between the first time point and the second time point as the response time of the server.
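The three steps above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the patent's implementation: `measure_response_times`, `probe`, and `fake_probe` are hypothetical names, and the probe is assumed to block until the response signal arrives.

```python
import time
from typing import Callable, List


def measure_response_times(probe: Callable[[], None], period_s: float, count: int) -> List[float]:
    """Send `count` probes, one per `period_s` seconds, and return each response time in seconds."""
    response_times = []
    for i in range(count):
        first_time_point = time.monotonic()    # detection signal sent
        probe()                                 # assumed to block until the response signal arrives
        second_time_point = time.monotonic()   # response signal received
        elapsed = second_time_point - first_time_point
        response_times.append(elapsed)
        # Wait out the remainder of the preset period before the next probe.
        remaining = period_s - elapsed
        if remaining > 0 and i < count - 1:
            time.sleep(remaining)
    return response_times


# Hypothetical stand-in probe; a real one might open a TCP connection or issue an HTTP request.
def fake_probe() -> None:
    time.sleep(0.01)


samples = measure_response_times(fake_probe, period_s=0.02, count=3)
```

A monotonic clock is used so that the measured interval is not affected by wall-clock adjustments on the monitoring host.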
S103, determining a node association relationship based on the first time sequence data and the second time sequence data; the node association relationship comprises the interrelation between a fault record node and a running state node, wherein the fault record node and the running state node each comprise a plurality of attributes; the plurality of attributes of the running state node are used to represent a plurality of characteristics of the running state of the server; the plurality of attributes of the fault record node are used to represent a plurality of characteristics of a plurality of fault records;
S104, constructing a heterogeneous graph comprising fault record nodes and running state nodes according to the node association relationship; in the heterogeneous graph, each edge between a fault record node and a running state node is determined based on the node association relationship;
S105, updating the weight of each edge between the fault record node and the running state node in the heterogeneous graph based on the graph embedding vector of the server; the graph embedding vector of the server is constructed according to the first graph neural network;
S106, predicting the future fault condition of the server according to the updated heterogeneous graph.
Based on the embodiment provided by the application, a historical fault record in a first preset time period before prediction is obtained from a fault database of a server; acquiring second time sequence data corresponding to the running state of the server in a second preset time period before prediction; determining a node association relationship based on the first time sequence data and the second time sequence data; constructing a heterogeneous graph comprising fault record nodes and running state nodes according to the node association relation; updating the weight of each edge between the fault record node and the running state node in the heterogeneous graph based on the graph embedding vector of the server; and predicting the future fault condition of the server according to the updated heterogeneous graph. That is, through the prediction of the faults of the servers stored in the machine room, on one hand, the faults can be found out in time and solved, so that economic loss can be avoided to a certain extent, and the degree of data loss can be reduced; on the other hand, manpower and material resources can be saved, the maintenance cost of the server is saved, the running stability of the server is ensured, and the running stability of related businesses can be further ensured.
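Steps S103 and S104 can be illustrated with a small Python sketch. The text does not state how the node association relationship is derived from the two time series, so the rule used here is an assumption made for illustration only: a fault record node is linked to every running state node sampled within a fixed window before the fault. The names `Node`, `build_hetero_graph`, and `window_s` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class Node:
    node_id: str
    node_type: str                      # "fault" or "state"
    timestamp: float                    # seconds on a shared time axis
    attrs: Dict[str, float] = field(default_factory=dict)


def build_hetero_graph(faults: List[Node], states: List[Node], window_s: float):
    """Return (nodes, edges). ASSUMED association rule: an edge links a fault node to
    each state node sampled at most `window_s` seconds before the fault."""
    nodes = {n.node_id: n for n in faults + states}
    edges: Dict[Tuple[str, str], float] = {}
    for f in faults:
        for s in states:
            if 0 <= f.timestamp - s.timestamp <= window_s:
                edges[(f.node_id, s.node_id)] = 1.0   # initial weight, to be updated in S105
    return nodes, edges


faults = [Node("f1", "fault", 100.0, {"bad_sectors": 3.0})]
states = [
    Node("s1", "state", 40.0, {"cpu_util": 0.35}),
    Node("s2", "state", 95.0, {"cpu_util": 0.97}),
]
nodes, edges = build_hetero_graph(faults, states, window_s=30.0)
print(edges)  # {('f1', 's2'): 1.0}
```

Only `s2` falls inside the 30-second window before the fault, so it is the only running state node connected to `f1`.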
As an alternative, updating weights of edges between fault record nodes and running state nodes in a heterogeneous graph based on graph embedding vectors of a server includes:
inputting the plurality of attributes of the running state nodes and the plurality of attributes of the fault record nodes into the first graph neural network, and calculating a Gaussian similarity function value for each edge between the fault record nodes and the running state nodes;
and constructing an association weight matrix of the edges according to the Gaussian similarity function value of each edge and the graph embedding vector of the server, and updating the weight of each edge according to the association weight matrix.
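A rough Python sketch of this weight update follows. The Gaussian similarity is the standard kernel exp(-||x - y||² / (2σ²)). How the patent combines it with the server's graph embedding vector is not specified, so the gating used in `association_weights` is purely an assumption for illustration, as is the premise that both node types have already been projected to vectors of equal length.

```python
import math
from typing import List, Sequence


def gaussian_similarity(x: Sequence[float], y: Sequence[float], sigma: float = 1.0) -> float:
    """exp(-||x - y||^2 / (2 * sigma^2)): 1.0 for identical vectors, near 0 for distant ones."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def association_weights(fault_vecs: Sequence[Sequence[float]],
                        state_vecs: Sequence[Sequence[float]],
                        server_embedding: Sequence[float],
                        sigma: float = 1.0) -> List[List[float]]:
    """Rows are fault record nodes, columns are running state nodes.
    ASSUMPTION: every Gaussian value is scaled by a sigmoid gate computed from the mean of
    the server's graph embedding vector; the source does not give the real combination."""
    mean = sum(server_embedding) / len(server_embedding)
    gate = 1.0 / (1.0 + math.exp(-mean))
    return [[gate * gaussian_similarity(f, s, sigma) for s in state_vecs] for f in fault_vecs]


W = association_weights(
    fault_vecs=[[1.0, 0.0]],
    state_vecs=[[1.0, 0.0], [0.0, 1.0]],
    server_embedding=[0.0, 0.0],
)
```

With a zero embedding the gate is 0.5, so the edge to the identical state vector gets weight 0.5 and the edge to the orthogonal one gets 0.5 · e⁻¹.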
As an alternative, predicting the future fault condition of the server according to the updated heterogeneous graph includes:
determining the current network information of the updated heterogeneous graph according to the weight of each edge in the updated heterogeneous graph and the node grouping condition;
under the condition that the current network information of the updated heterogeneous graph meets the preset network condition, mapping the weight of the nodes in each group and the weight of each edge in the updated heterogeneous graph to a generalized adjacency matrix to obtain a heterogeneous weight matrix; wherein the nodes in each group comprise the fault record nodes and running state nodes in that group;
In the embodiment of the application, the fault record node and the running state node each comprise a plurality of child nodes. Assigning the nodes of the heterogeneous graph to different child-node groups generally changes the weight values of the child nodes, and nodes belonging to the same group have relatively larger weight values. The node groups may be divided based on correlation characteristics of the child nodes of the fault record node and the running state node, where the correlation characteristics may include software correlations or hardware correlations; the node groups may also be divided based on the time intervals to which those child nodes belong, or based on both the correlation characteristics and the time intervals.
In some embodiments of the application, the preset network condition is determined based on the degree of difference of the nodes in each group and the density of neighbor nodes; under the condition that the current network information of the updated heterogeneous graph does not meet the preset network condition, the node grouping condition is updated until the current network information of the updated heterogeneous graph meets the preset network condition. The more neighbor nodes a node has, the higher its neighbor node density.
Based on the embodiment provided by the application, the weight of the nodes in each group and the weight of each edge in the updated heterogeneous graph are mapped to the generalized adjacency matrix according to the preset network condition, so that a more accurate heterogeneous weight matrix can be obtained, which facilitates accurate subsequent calculation.
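One way to picture the mapping to a generalized adjacency matrix is sketched below in Python. The layout is an assumption, since the text does not define the matrix: node weights are placed on the diagonal and edge weights in the symmetric off-diagonal entries. The function name `hetero_weight_matrix` and the sample weights are hypothetical.

```python
from typing import Dict, List, Tuple


def hetero_weight_matrix(node_ids: List[str],
                         node_weights: Dict[str, float],
                         edge_weights: Dict[Tuple[str, str], float]) -> List[List[float]]:
    """ASSUMED layout: node weights on the diagonal, edge weights off the diagonal."""
    index = {nid: i for i, nid in enumerate(node_ids)}
    n = len(node_ids)
    matrix = [[0.0] * n for _ in range(n)]
    for nid, w in node_weights.items():
        matrix[index[nid]][index[nid]] = w       # node weight within its group
    for (u, v), w in edge_weights.items():
        matrix[index[u]][index[v]] = w           # edges are treated as undirected,
        matrix[index[v]][index[u]] = w           # so both entries are filled
    return matrix


A = hetero_weight_matrix(
    ["f1", "s1", "s2"],
    {"f1": 1.0, "s1": 0.5, "s2": 0.8},
    {("f1", "s2"): 0.9},
)
```

The resulting matrix keeps fault record nodes and running state nodes in one index space, which is what lets a single graph neural network layer consume it later.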
acquiring the feature vector of the fault record node and the feature vector of the running state node in the updated heterogeneous graph;
updating the feature vector of the fault record node and the feature vector of the running state node by using a gated graph convolutional temporal neural network; performing weighted summation on the updated feature vector of the fault record node and the updated feature vector of the running state node by using an attention network to obtain a node fusion feature vector;
The attention network forms a shared space by fusing the feature information of the fault record node and the feature information of the running state node, so that the parameters share all of the feature information;
in this embodiment, the updated feature vector of the fault record node and the updated feature vector of the running state node are input into the attention network, so that the dependency relationships between the context information in each node sequence can be captured.
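The weighted summation performed by the attention network can be sketched as dot-product attention in plain Python. The patent does not describe the attention network's structure, so the scoring function and the `query` vector (which would be learned in practice) are assumptions, as are the names `softmax` and `attention_fuse`.

```python
import math
from typing import List, Sequence


def softmax(scores: Sequence[float]) -> List[float]:
    """Numerically stable softmax."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention_fuse(vectors: Sequence[Sequence[float]], query: Sequence[float]) -> List[float]:
    """Weighted sum of `vectors`, with weights softmax(dot(query, vector)).
    ASSUMPTION: dot-product scoring against a single query vector."""
    scores = [sum(q * v for q, v in zip(query, vec)) for vec in vectors]
    weights = softmax(scores)
    dim = len(vectors[0])
    return [sum(w * vec[d] for w, vec in zip(weights, vectors)) for d in range(dim)]


fault_vec = [1.0, 0.0]   # updated feature vector of a fault record node (toy values)
state_vec = [0.0, 1.0]   # updated feature vector of a running state node (toy values)
fused = attention_fuse([fault_vec, state_vec], query=[0.0, 0.0])
print(fused)  # [0.5, 0.5]
```

A zero query scores both vectors equally, so the fused vector is their plain average; a trained query would shift the weights toward the more informative node type.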
processing the node fusion feature vector with a second graph neural network according to the heterogeneous weight matrix to obtain a node hidden vector;
and predicting the future fault condition of the server based on the heterogeneous weight matrix and the node hidden vector.
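As an illustration of how a graph neural network can turn the heterogeneous weight matrix and fused features into hidden vectors, the sketch below implements one mean-aggregation propagation step in plain Python. The architecture of the second graph neural network is not given in the text, so this layer form, and the names `gnn_layer`, `row_normalize`, and `matmul`, are assumptions.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (n x k) and b (k x m)."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def row_normalize(a: Matrix) -> Matrix:
    """Divide each row by its sum so that aggregation becomes a weighted mean."""
    out = []
    for row in a:
        total = sum(row)
        out.append([x / total if total else 0.0 for x in row])
    return out


def gnn_layer(adj: Matrix, feats: Matrix, weight: Matrix) -> Matrix:
    """One propagation step: ReLU(row_normalize(adj) @ feats @ weight)."""
    hidden = matmul(matmul(row_normalize(adj), feats), weight)
    return [[max(0.0, x) for x in row] for row in hidden]


adj = [[1.0, 1.0], [1.0, 1.0]]        # toy heterogeneous weight matrix
feats = [[1.0, 0.0], [0.0, 1.0]]      # toy node fusion feature vectors
identity = [[1.0, 0.0], [0.0, 1.0]]   # stand-in for a learned weight matrix
hidden = gnn_layer(adj, feats, identity)
print(hidden)  # [[0.5, 0.5], [0.5, 0.5]]
```

A prediction head, for example a sigmoid over a readout of the hidden vectors, would follow in a full model; it is omitted here because the text does not describe it.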
As an alternative, the first graph neural network is built based on the following steps:
acquiring a plurality of training samples; wherein each training sample comprises fault record structure data and running state structure data; the fault record structure data are graph structure data corresponding to a plurality of characteristics of a plurality of sample fault records, and the running state structure data are graph structure data corresponding to a plurality of characteristics of a sample running state of the server;
and performing iterative training on the first graph neural network by using the plurality of training samples until the maximum number of iterations is reached or the network parameters of the first graph neural network converge, so as to obtain the trained first graph neural network.
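The two stopping criteria (maximum number of iterations, or parameter convergence) can be shown with a generic gradient-descent loop in Python. This is only a sketch of the stopping logic: the toy quadratic objective stands in for the network's loss, and `train`, `grad_fn`, `lr`, and `tol` are hypothetical names and values.

```python
from typing import Callable, List, Tuple


def train(params: List[float],
          grad_fn: Callable[[List[float]], List[float]],
          lr: float = 0.1,
          max_iters: int = 1000,
          tol: float = 1e-6) -> Tuple[List[float], int]:
    """Gradient descent that stops when `max_iters` is reached or when the largest
    parameter change in one step falls below `tol` (treated as convergence)."""
    for it in range(1, max_iters + 1):
        grads = grad_fn(params)
        new_params = [p - lr * g for p, g in zip(params, grads)]
        max_change = max(abs(n - p) for n, p in zip(new_params, params))
        params = new_params
        if max_change < tol:
            return params, it          # converged before the iteration cap
    return params, max_iters           # stopped by the iteration cap


# Toy objective (p - 3)^2, whose gradient is 2 * (p - 3); it stands in for the real loss.
final, iters = train([0.0], lambda p: [2.0 * (p[0] - 3.0)])
```

On this toy problem the loop converges well before the iteration cap, so the convergence branch is the one that returns.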
As an alternative, before acquiring the historical fault record from the fault database of the server, the method further includes:
after a fault occurring in the server has been repaired, storing the corresponding fault record in the fault database.
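A fault database of this kind could be kept in any store; the sketch below uses Python's built-in `sqlite3` module purely as an example. The table layout, column names, and the helper functions `store_fault_record` and `faults_in_window` are assumptions chosen to mirror the fault type and fault feature data described earlier.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # in-memory for the example; use a file path to persist
conn.execute(
    """CREATE TABLE fault_records (
           id INTEGER PRIMARY KEY,
           server_id TEXT NOT NULL,
           occurred_at TEXT NOT NULL,      -- ISO 8601 timestamp, sorts chronologically
           fault_type TEXT NOT NULL,
           bad_sectors INTEGER,
           hardware_restarts INTEGER
       )"""
)


def store_fault_record(conn, server_id, occurred_at, fault_type,
                       bad_sectors=None, hardware_restarts=None):
    """Insert one fault record after the fault has been repaired."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO fault_records "
            "(server_id, occurred_at, fault_type, bad_sectors, hardware_restarts) "
            "VALUES (?, ?, ?, ?, ?)",
            (server_id, occurred_at, fault_type, bad_sectors, hardware_restarts),
        )


def faults_in_window(conn, server_id, start, end):
    """Return (occurred_at, fault_type) rows inside the preset time period."""
    cur = conn.execute(
        "SELECT occurred_at, fault_type FROM fault_records "
        "WHERE server_id = ? AND occurred_at BETWEEN ? AND ? ORDER BY occurred_at",
        (server_id, start, end),
    )
    return cur.fetchall()


store_fault_record(conn, "srv-01", "2024-06-01T08:30:00", "disk", bad_sectors=12)
rows = faults_in_window(conn, "srv-01", "2024-06-01T00:00:00", "2024-06-02T00:00:00")
print(rows)  # [('2024-06-01T08:30:00', 'disk')]
```

Storing timestamps as ISO 8601 strings keeps the window query in step S101 a simple range comparison.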
As an optional embodiment, as shown in fig. 2, the present application provides a server fault prediction device, including:
A historical fault record obtaining unit 201, configured to obtain a historical fault record in a first preset time period before prediction from a fault database of a server; the fault database stores first time sequence data corresponding to a plurality of fault records, and each fault record comprises fault type data;
A historical running state obtaining unit 202, configured to obtain second time-series data corresponding to the running state of the server in a second preset time period before the prediction is performed; the running state of the server comprises the running state of a disk of the server, the utilization rate of a central processing unit of the server, the read-write state of a database deployed on the server and the response time of the server;
A node association relationship determining unit 203, configured to determine a node association relationship based on the first time sequence data and the second time sequence data; the node association relationship comprises the interrelation between a fault record node and a running state node, wherein the fault record node and the running state node each comprise a plurality of attributes; the plurality of attributes of the running state node are used to represent a plurality of characteristics of the running state of the server; the plurality of attributes of the fault record node are used to represent a plurality of characteristics of a plurality of fault records;
A heterogeneous graph construction unit 204, configured to construct a heterogeneous graph including fault record nodes and running state nodes according to the node association relationship; in the heterogeneous graph, each edge between a fault record node and a running state node is determined based on the node association relationship;
An edge weight updating unit 205, configured to update the weight of each edge between the fault record node and the running state node in the heterogeneous graph based on the graph embedding vector of the server; the graph embedding vector of the server is constructed according to the first graph neural network;
A fault prediction unit 206, configured to predict the future fault condition of the server according to the updated heterogeneous graph.
Based on the embodiment provided by the application, the functions realized by the units are combined, so that on one hand, faults can be found out and solved in time, economic loss can be avoided to a certain extent, and the degree of data loss is reduced; on the other hand, manpower and material resources can be saved, the maintenance cost of the server is saved, the running stability of the server is ensured, and the running stability of related businesses can be further ensured.
It should be noted that, for the embodiments of the server fault prediction device, reference may be made to the embodiments of the server fault prediction method, which are not described again here.
The foregoing is merely a preferred embodiment of the present application. It should be noted that those skilled in the art may make modifications and adaptations without departing from the principles of the present application, and such modifications and adaptations are intended to fall within the scope of the present application.

Claims (6)

1. A server failure prediction method, comprising:
Acquiring a historical fault record in a first preset time period before prediction from a fault database of a server; the fault database stores first time sequence data corresponding to a plurality of fault records, and each fault record comprises fault type data;
Acquiring second time sequence data corresponding to the running state of the server in a second preset time period before prediction; the running state of the server comprises the running state of a disk of the server, the utilization rate of a central processing unit of the server, the read-write state of a database deployed on the server and the response time of the server;
Determining a node association relationship based on the first time sequence data and the second time sequence data; the node association relationship comprises the interrelation between a fault record node and a running state node, wherein the fault record node and the running state node each comprise a plurality of attributes; the plurality of attributes of the running state node are used to represent a plurality of characteristics of the running state of the server; the plurality of attributes of the fault record node are used to represent a plurality of characteristics of the plurality of fault records;
Constructing a heterogeneous graph comprising the fault record node and the running state node according to the node association relationship; in the heterogeneous graph, each edge between the fault record node and the running state node is determined based on the node association relationship;
inputting the multiple attributes of the running state nodes and the multiple attributes of the fault record nodes into a first graph neural network, and calculating Gaussian similarity function values of each edge between the fault record nodes and the running state nodes;
Constructing an association weight matrix of the edges according to the Gaussian similarity function value of each edge and the graph embedding vector of the server, and updating the weight of each edge according to the association weight matrix; the graph embedding vector of the server is constructed according to the first graph neural network;
Determining current network information of the updated heterogeneous graph according to the weight of each edge in the updated heterogeneous graph and the node grouping condition;
under the condition that the current network information of the updated iso-graph meets the preset network condition, mapping the weight of the nodes in each group and the weight of each side in the updated iso-graph to a generalized adjacent matrix to obtain a heterogeneous weight matrix; wherein the nodes in each group include the fault record node and the operational state node in each group;
Acquiring the characteristic vector of the fault record node and the characteristic vector of the running state node in the updated iso-graph;
Updating the feature vector of the fault record node and the feature vector of the running state node by using a gating chart convolution time sequence neural network;
performing a weighted summation of the updated feature vector of the fault record node and the updated feature vector of the running state node by using an attention network to obtain a node fusion feature vector; the attention network forms a shared space by fusing the feature information of the fault record node and the feature information of the running state node, so that each parameter shares each piece of feature information;
processing the node fusion feature vector with a second graph neural network according to the heterogeneous weight matrix to obtain a node hidden vector;
and predicting the future fault condition of the server based on the heterogeneous weight matrix and the node hidden vector.
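The edge-scoring step of claim 1 names a Gaussian similarity function but, in the claim text, does not give its formula. As a minimal sketch, assuming the common radial-basis form exp(-||x - y||^2 / (2 * sigma^2)) applied directly to node attribute vectors, the per-edge values could be computed as follows. The function names, the `sigma` bandwidth, and the dictionary layout are illustrative assumptions, not taken from the patent; the claimed method additionally passes the attributes through the first graph neural network and combines the result with the server's graph embedding vector, which is not reproduced here.

```python
import math


def gaussian_similarity(x, y, sigma=1.0):
    """Gaussian (RBF) similarity between two equal-length attribute vectors.

    Assumed form: exp(-||x - y||^2 / (2 * sigma^2)); sigma is a hypothetical
    bandwidth parameter.
    """
    if len(x) != len(y):
        raise ValueError("attribute vectors must have the same length")
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-sq_dist / (2.0 * sigma ** 2))


def edge_similarities(fault_nodes, state_nodes, edges, sigma=1.0):
    """Score every (fault record node, running state node) edge.

    fault_nodes / state_nodes map a node id to its attribute vector;
    edges is an iterable of (fault_id, state_id) pairs.
    """
    return {
        (f, s): gaussian_similarity(fault_nodes[f], state_nodes[s], sigma)
        for f, s in edges
    }


# Toy data: one fault record node, two running state nodes.
fault_nodes = {"fault_0": [0.0, 0.0]}
state_nodes = {"state_0": [0.0, 0.0], "state_1": [1.0, 1.0]}
edges = [("fault_0", "state_0"), ("fault_0", "state_1")]
scores = edge_similarities(fault_nodes, state_nodes, edges)
```

Identical attribute vectors score 1.0, and the score decays toward 0 as the vectors move apart, so the values can serve directly as non-negative edge weights.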
2. The server failure prediction method according to claim 1, characterized in that,
the preset network condition is determined based on the degree of difference among the nodes in each group and the density of neighbor nodes; when the current network information of the updated heterogeneous graph does not meet the preset network condition, the node grouping is updated until the current network information of the updated heterogeneous graph meets the preset network condition.
3. The server failure prediction method according to claim 1, wherein the first graph neural network is established based on the steps of:
acquiring a plurality of training samples; wherein each training sample comprises fault record structure data and running state structure data; the fault record structure data are graph-structured data corresponding to a plurality of characteristics of a plurality of sample fault records, and the running state structure data are graph-structured data corresponding to a plurality of characteristics of a sample running state of the server;
and iteratively training the first graph neural network with the plurality of training samples until the maximum number of iterations is reached or the network parameters of the first graph neural network converge, so as to obtain the trained first graph neural network.
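The stopping rule in claim 3 (stop at the maximum number of iterations or when the network parameters converge) can be illustrated independently of any particular graph neural network. The sketch below is an assumption-laden stand-in: it runs plain gradient descent on a list of scalar parameters and treats "converged" as the largest parameter change in one step falling below a tolerance. The names `train_until_converged`, `grad_fn`, `lr`, and `tol`, and the toy quadratic objective, are hypothetical and not from the patent.

```python
def train_until_converged(params, grad_fn, lr=0.1, max_iters=1000, tol=1e-6):
    """Gradient descent with the two stopping criteria named in claim 3.

    Returns (params, iterations_run, converged). Convergence is assumed to
    mean that no parameter changed by tol or more in the last update.
    """
    for iteration in range(1, max_iters + 1):
        grads = grad_fn(params)
        new_params = [p - lr * g for p, g in zip(params, grads)]
        change = max(abs(n - p) for n, p in zip(new_params, params))
        params = new_params
        if change < tol:
            # Parameters have stopped moving: stop before max_iters.
            return params, iteration, True
    # Iteration budget exhausted without meeting the tolerance.
    return params, max_iters, False


def toy_gradient(params):
    """Gradient of the toy loss (w - 3)^2, standing in for a GNN loss."""
    return [2.0 * (params[0] - 3.0)]


trained, iterations, converged = train_until_converged([0.0], toy_gradient)
capped, capped_iterations, capped_converged = train_until_converged(
    [0.0], toy_gradient, max_iters=5
)
```

With a real graph neural network the update would come from an optimizer step over the training samples, but the loop structure and the two exit conditions stay the same.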
4. The server failure prediction method according to claim 1, wherein the response time of the server is obtained based on the steps of:
periodically transmitting a detection signal to the server according to a preset period, each transmission occurring at a first time point, wherein the server returns a response signal after receiving the detection signal;
determining a second time point of receiving the response signal returned by the server;
taking the time elapsed between the first time point and the second time point as the response time of the server.
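The measurement in claim 4 amounts to timestamping the probe when it is sent and again when the response arrives. A minimal sketch follows, assuming the probe is represented by a hypothetical blocking callable `send_probe` that returns once the server's response has been received; the function names and the injectable `clock` and `sleep` parameters are illustrative choices, not part of the patent.

```python
import time


def measure_response_time(send_probe, clock=time.monotonic):
    """Return the elapsed time between sending a probe and its response.

    send_probe is assumed to block until the response signal is received.
    """
    first_time_point = clock()   # detection signal is sent
    send_probe()
    second_time_point = clock()  # response signal has been received
    return second_time_point - first_time_point


def probe_periodically(send_probe, period_s, count,
                       clock=time.monotonic, sleep=time.sleep):
    """Collect `count` response-time samples, pausing period_s between probes."""
    samples = []
    for i in range(count):
        samples.append(measure_response_time(send_probe, clock))
        if i < count - 1:
            sleep(period_s)
    return samples


# Deterministic demonstration with a fake clock and a no-op probe.
_ticks = iter([10.0, 10.25, 20.0, 20.5])
demo_samples = probe_periodically(
    lambda: None, period_s=1.0, count=2,
    clock=lambda: next(_ticks), sleep=lambda seconds: None,
)
```

A monotonic clock is used so that wall-clock adjustments cannot produce negative durations. As a simplification, the pause starts after each response arrives, so the effective interval between probes is the preset period plus the response time.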
5. The server failure prediction method according to claim 1, characterized in that before acquiring the historical fault records from the fault database of the server, the method further comprises:
after a fault of the server has been repaired, storing the corresponding fault record in the fault database.
6. A server failure prediction apparatus, comprising:
a historical fault record acquisition unit, configured to acquire historical fault records within a first preset time period before prediction from a fault database of a server; the fault database stores first time sequence data corresponding to a plurality of fault records, and each fault record comprises fault type data;
a historical running state acquisition unit, configured to acquire second time sequence data corresponding to the running state of the server in a second preset time period before prediction is performed; the running state of the server comprises the running state of a disk of the server, the utilization rate of a central processing unit of the server, the read-write state of a database deployed on the server and the response time of the server;
a node association relationship determining unit, configured to determine a node association relationship based on the first time sequence data and the second time sequence data; the node association relationship comprises the interrelationship between a fault record node and a running state node, wherein the fault record node and the running state node each comprise a plurality of attributes; the plurality of attributes of the running state node are used to represent a plurality of characteristics of the running state of the server; the plurality of attributes of the fault record node are used to represent a plurality of characteristics of the plurality of fault records;
a heterogeneous graph construction unit, configured to construct a heterogeneous graph comprising the fault record node and the running state node according to the node association relationship; in the heterogeneous graph, each edge between the fault record node and the running state node is determined based on the node association relationship;
an edge weight updating unit, configured to input the plurality of attributes of the running state node and the plurality of attributes of the fault record node into a first graph neural network, and calculate a Gaussian similarity function value for each edge between the fault record node and the running state node; construct an association weight matrix of each edge according to the Gaussian similarity function value of each edge and the graph embedding vector of the server, and update the weight of each edge according to the association weight matrix; the graph embedding vector of the server is constructed according to the first graph neural network;
a fault prediction unit, configured to determine the current network information of the updated heterogeneous graph according to the weight of each edge in the updated heterogeneous graph and the node grouping condition; when the current network information of the updated heterogeneous graph meets a preset network condition, map the weights of the nodes in each group and the weight of each edge in the updated heterogeneous graph to a generalized adjacency matrix to obtain a heterogeneous weight matrix; wherein the nodes in each group include the fault record node and the running state node in that group; acquire the feature vector of the fault record node and the feature vector of the running state node in the updated heterogeneous graph; update the feature vector of the fault record node and the feature vector of the running state node by using a gated graph convolutional temporal neural network; perform a weighted summation of the updated feature vector of the fault record node and the updated feature vector of the running state node by using an attention network to obtain a node fusion feature vector; the attention network forms a shared space by fusing the feature information of the fault record node and the feature information of the running state node, so that each parameter shares each piece of feature information; process the node fusion feature vector with a second graph neural network according to the heterogeneous weight matrix to obtain a node hidden vector; and predict the future fault condition of the server based on the heterogeneous weight matrix and the node hidden vector.
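The fusion step recited for the fault prediction unit combines the two node feature vectors by an attention-weighted sum, without the claim fixing how the attention scores are produced. One possible reading is sketched below, assuming dot-product scores against a hypothetical learned `query` vector followed by a softmax; the names `softmax`, `fuse_features`, and `query` are illustrative and the scoring rule is an assumption rather than the patented attention network.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def fuse_features(fault_vec, state_vec, query):
    """Attention-weighted sum of a fault record vector and a running state vector.

    query is a hypothetical learned vector; each input is scored by its dot
    product with query, and the softmax of the scores gives the weights.
    """
    w_fault, w_state = softmax([dot(query, fault_vec), dot(query, state_vec)])
    fused = [w_fault * f + w_state * s for f, s in zip(fault_vec, state_vec)]
    return fused, (w_fault, w_state)


# With an all-zero query both inputs score equally, so the fusion is their mean.
fused_vec, attn_weights = fuse_features([1.0, 3.0], [3.0, 5.0], [0.0, 0.0])
```

Because the weights come from a softmax they are non-negative and sum to one, so the fused vector always lies between the two inputs, component by component.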
CN202410852079.4A 2024-06-28 2024-06-28 Server fault prediction method and device Active CN118394559B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202410852079.4A CN118394559B (en) 2024-06-28 2024-06-28 Server fault prediction method and device

Publications (2)

Publication Number Publication Date
CN118394559A CN118394559A (en) 2024-07-26
CN118394559B CN118394559B (en) 2024-10-15

Family

ID=91986392

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202410852079.4A Active CN118394559B (en) 2024-06-28 2024-06-28 Server fault prediction method and device

Country Status (1)

Country Link
CN (1) CN118394559B (en)

Families Citing this family (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN119621442B (en) * 2025-02-12 2025-05-13 广州七喜电脑有限公司 Server hardware fault early warning and recovering method and system based on intelligent optimization algorithm

Citations (2)

Publication number Priority date Publication date Assignee Title
CN111400560A (en) * 2020-03-10 2020-07-10 支付宝(杭州)信息技术有限公司 Method and system for predicting based on heterogeneous graph neural network model
CN114580263A (en) * 2021-12-02 2022-06-03 国家电网有限公司信息通信分公司 Knowledge graph-based information system fault prediction method and related equipment

Family Cites Families (5)

Publication number Priority date Publication date Assignee Title
CN111461392B (en) * 2020-01-23 2022-06-17 华中科技大学 A method and system for predicting power failure based on graph neural network
CN112269901B (en) * 2020-09-14 2021-11-05 合肥中科类脑智能技术有限公司 Fault distinguishing and reasoning method based on knowledge graph
CN116208399A (en) * 2023-02-17 2023-06-02 中国电子科技集团公司电子科学研究院 Network malicious behavior detection method and device based on metagraph
CN117407256A (en) * 2023-10-18 2024-01-16 浙江大学 Micro-service abnormality detection method and device based on graph attention network
CN118096134B (en) * 2024-04-25 2024-08-13 江苏中天互联科技有限公司 Fault processing method, electronic device and storage medium

Also Published As

Publication number Publication date
CN118394559A (en) 2024-07-26

Similar Documents

Publication Publication Date Title
Chen et al. Outage prediction and diagnosis for cloud service systems
CN102257520B (en) Application performance analysis
CN111506478A (en) Method for realizing alarm management control based on artificial intelligence
US6393387B1 (en) System and method for model mining complex information technology systems
US20160055044A1 (en) Fault analysis method, fault analysis system, and storage medium
CN105550100A (en) Method and system for automatic fault recovery of information system
CN118394559B (en) Server fault prediction method and device
CN109062769B (en) Method, device and equipment for predicting IT system performance risk trend
Lan et al. A study of dynamic meta-learning for failure prediction in large-scale systems
CN114202179B (en) Method and device for identifying target enterprises
CN117170915A (en) Data center equipment fault prediction method and device and computer equipment
Cai et al. A real-time trace-level root-cause diagnosis system in alibaba datacenters
Raj et al. Cloud infrastructure fault monitoring and prediction system using LSTM based predictive maintenance
Gao et al. Modeling probabilistic measurement correlations for problem determination in large-scale distributed systems
CN118013256A (en) Resource prediction method based on space-time data fusion
CN119886890B (en) A blockchain-based method and system for monitoring carbon emissions of multiple enterprises in industrial parks
CN119420629A (en) A method for locating the root cause of microservice failures based on graph convolutional neural networks
Molan et al. Graafe: Graph anomaly anticipation framework for exascale hpc systems
CN119939362A (en) A measurement data processing method based on dynamic partition rebalancing and stream-batch collaboration
CN119623775A (en) Task chain self-healing transfer prediction method and system based on parameter fingerprint holographic perception
CN118779798B (en) Power edge optimization control method and system based on IoT cloud-edge collaboration
EP4027277A1 (en) Method, system and computer program product for drift detection in a data stream
CN119396803A (en) Optimization strategy determination method, device, computer equipment, readable storage medium and program product
CN118013043A (en) File data management method, device, equipment and storage medium
Abghari et al. A minimum spanning tree clustering approach for outlier detection in event sequences

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant