
CN115348159B - Micro-service fault positioning method and device based on self-encoder and service dependency graph - Google Patents


Info

Publication number
CN115348159B
Authority
CN
China
Prior art keywords
service
micro
graph
node
response time
Prior art date
Legal status (The legal status is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed.)
Active
Application number
CN202210958306.2A
Other languages
Chinese (zh)
Other versions
CN115348159A (en)
Inventor
常雨竹
刘月灿
孙建刚
李伟良
李静
羊麟威
高颖
杨庆甫
李明
宫帅
尹晓宇
程航
董小菱
饶涵宇
毛冬
张辰
Current Assignee (The listed assignees may be inaccurate. Google has not performed a legal analysis and makes no representation or warranty as to the accuracy of the list.)
Nanjing University of Aeronautics and Astronautics
State Grid Information and Telecommunication Group Co Ltd
Information and Telecommunication Branch of State Grid Zhejiang Electric Power Co Ltd
Information and Telecommunication Branch of State Grid Anhui Electric Power Co Ltd
State Grid Corp of China SGCC
Original Assignee
Nanjing University of Aeronautics and Astronautics
State Grid Information and Telecommunication Group Co Ltd
Information and Telecommunication Branch of State Grid Zhejiang Electric Power Co Ltd
Information and Telecommunication Branch of State Grid Anhui Electric Power Co Ltd
State Grid Corp of China SGCC
Priority date (The priority date is an assumption and is not a legal conclusion. Google has not performed a legal analysis and makes no representation as to the accuracy of the date listed.)
Filing date
Publication date
Application filed by Nanjing University of Aeronautics and Astronautics, State Grid Information and Telecommunication Group Co Ltd, Information and Telecommunication Branch of State Grid Zhejiang Electric Power Co Ltd, Information and Telecommunication Branch of State Grid Anhui Electric Power Co Ltd, State Grid Corp of China SGCC
Priority to CN202210958306.2A
Publication of CN115348159A
Application granted
Publication of CN115348159B
Legal status: Active
Anticipated expiration

Classifications

    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0677 Localisation of faults
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H ELECTRICITY
    • H04 ELECTRIC COMMUNICATION TECHNIQUE
    • H04L TRANSMISSION OF DIGITAL INFORMATION, e.g. TELEGRAPHIC COMMUNICATION
    • H04L41/00 Arrangements for maintenance, administration or management of data switching networks, e.g. of packet switching networks
    • H04L41/06 Management of faults, events, alarms or notifications
    • H04L41/0631 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis
    • H04L41/065 Management of faults, events, alarms or notifications using root cause analysis; using analysis of correlation between notifications, alarms or events based on decision criteria, e.g. hierarchy, tree or time analysis involving logical or physical relationship, e.g. grouping and hierarchies

Landscapes

  • Engineering & Computer Science (AREA)
  • Computer Networks & Wireless Communication (AREA)
  • Signal Processing (AREA)
  • Debugging And Monitoring (AREA)
  • Data Exchanges In Wide-Area Networks (AREA)

Abstract

The invention discloses a micro-service fault localization method and device based on an autoencoder and a service dependency graph, comprising: monitoring and collecting the operational metric data of the micro-service system in real time; determining, based on an autoencoder model, whether the micro-service system is anomalous; constructing a service call relationship graph to characterize the fault propagation paths; associating the running state of each micro-service with system resource utilization to compute the anomaly weight of each node in the service call relationship graph; and inferring and locating the faulty micro-service that caused the anomaly with an improved weighted PageRank algorithm. The invention removes the need, in existing micro-service fault localization methods, to manually set thresholds for the various monitoring metrics for anomaly diagnosis, and improves the accuracy of fault localization.

Figure 202210958306

Description

Micro-service fault positioning method and device based on self-encoder and service dependency graph
Technical Field
The invention belongs to the technical field of computer software fault localization and analysis, and in particular relates to a method and device for automatically locating micro-service faults based on an autoencoder and a service dependency graph.
Background
With the emergence of computing paradigms such as cloud computing and mobile computing, the micro-service architecture has become the prevailing trend in software service design, development and delivery. Each module is implemented and operated as a small, independent system that exposes its internal logic and data through well-defined interfaces, which gives application software high flexibility, good scalability and a large degree of autonomy. As a result, more and more internet companies adopt the micro-service architecture to develop and deploy distributed applications. While the architecture makes development more convenient, its complex dependencies and frequent delivery and deployment expose the system to more potential faults, and unexpected faults such as concurrency and asynchrony errors or runtime resource shortages can occur at any time. Because micro-services are designed independently and call one another flexibly, a failure in one micro-service module can cause the modules that depend on it to fail as well, producing large-scale cascading failures. To ensure reliable operation and quality of service, developers must repair system faults quickly, so accurately locating the root cause of a cascading failure is particularly critical.
However, locating faults in a micro-service architecture faces the following challenges: 1) Complex dependencies. In conventional architectures, system faults are typically diagnosed by examining runtime logs and analyzing system performance with performance monitoring tools. In a micro-service architecture, however, there are often hundreds or thousands of micro-services distributed across multiple hosts, and the calls and dependencies among them are complex and change dynamically. Performance degradation in one service can spread widely and make many services appear anomalous, so runtime logs and monitoring tools alone can hardly meet diagnosis and troubleshooting needs. 2) A large number of monitoring metrics. Communication and calls among services at scale produce a large number of metrics. Determining from them the threshold at which each metric indicates an anomalous micro-service is very time-consuming, and judging whether a micro-service is anomalous from such thresholds is often inaccurate. 3) Frequent micro-service updates. To meet user demand, micro-service modules must be updated frequently. During an update, old modules are replaced by new services and the dependencies among services change accordingly, forming a dynamic system architecture that makes automatic fault localization harder.
Disclosure of Invention
To address the shortcomings of the prior art, the invention provides a micro-service fault localization method and device based on an autoencoder and a service dependency graph. An autoencoder model learns how the monitoring metrics fluctuate during normal operation of the micro-services and uses this to judge their running state, which removes the need, in existing micro-service fault localization methods, to manually set thresholds for the various monitoring metrics for anomaly diagnosis. The utilization metrics of the various resources on the server host are then used to set the weights of the nodes in the service dependency graph, which improves the accuracy with which the faulty micro-service is located automatically.
In order to achieve the technical purpose, the invention adopts the following technical scheme:
A micro-service fault localization method based on an autoencoder and a service dependency graph comprises the following steps:
step 1: collecting, while the micro-service system is running, the physical resource utilization metrics of the server hosts and the response time metrics of call requests between micro-services;
step 2: using the monitoring sequences of call response time as training data to train an autoencoder model, reconstructing the response time, and determining whether the micro-service system is anomalous by computing the reconstruction error of the response time metric data;
step 3: mapping each micro-service to a corresponding node, capturing the call relationships among micro-services by parsing the communication data among them, constructing directed edges between nodes according to those call relationships, and building a service call relationship graph with the reconstruction error of the response time metric data as the node anomaly weight;
step 4: correlating the reconstruction error of the response time metric data with the physical resource utilization metric data of the service host, and computing the anomaly weight of each graph node in the service call relationship graph;
step 5: based on the anomaly weights of the graph nodes in the service call relationship graph as updated in step 4, using a weighted PageRank algorithm to infer and locate the faulty micro-service that caused the anomaly.
Wherein:
in step 1, the physical resource utilization metrics reflect the resource usage of the physical or virtual machines that run the micro-service instances, the response time metrics reflect the time a micro-service in the micro-service system takes to respond to requests from other micro-services, and both kinds of metrics are monitored and collected in real time with the Prometheus tool.
In step 2:
The response time monitoring sequence $\mathbf{x}_i^{t} \in \mathbb{R}^{h}$ of micro-service $v_i \in V$, collected at time $t$ over a time window $w$, is used as the input of the trained autoencoder, where $V$ denotes the set of micro-services. The encoder maps $\mathbf{x}_i^{t}$ to a $d$-dimensional latent feature representation $\mathbf{z}_i^{t}$:
$$\mathbf{z}_i^{t} = g\left(W^{\top}\mathbf{x}_i^{t} + b\right) \tag{1}$$
where $g$ is the activation function, $h$ is the number of response time samples collected within the time window $w$, $d$ is the dimension of the latent feature representation, $W \in \mathbb{R}^{h \times d}$ is the weight matrix (h rows, d columns) between the input layer and the hidden layer, and $b \in \mathbb{R}^{d}$ is the corresponding bias vector. The decoder then reconstructs the latent feature representation $\mathbf{z}_i^{t}$ into the response time monitoring sequence $\hat{\mathbf{x}}_i^{t}$ of micro-service $v_i$:
$$\hat{\mathbf{x}}_i^{t} = g\left(W'^{\top}\mathbf{z}_i^{t} + c\right) \tag{2}$$
where $W' \in \mathbb{R}^{d \times h}$ is the weight matrix (d rows, h columns) between the hidden layer and the output layer, and $c \in \mathbb{R}^{h}$ is the corresponding bias vector. The reconstruction mean square error $E_i^{t}$ between $\mathbf{x}_i^{t}$ and $\hat{\mathbf{x}}_i^{t}$ is then computed:
$$E_i^{t} = \frac{1}{h}\left\lVert \mathbf{x}_i^{t} - \hat{\mathbf{x}}_i^{t} \right\rVert_2^{2} \tag{3}$$
In the autoencoder training phase, the response time monitoring sequences of micro-service $v_i$ collected while the micro-service system runs normally are used as training data. After multiple rounds of training, the converged autoencoder model has learned the characteristics of normal response time series, so its reconstruction of the response times of $v_i$ during normal operation is close to the monitored values, and the corresponding reconstruction error is small and fluctuates within a stable range. The mean $\mu$ and standard deviation $\sigma$ of the reconstruction error in this state are computed, and the anomaly detection threshold of micro-service $v_i$ is set to $\alpha_i = \mu + 3\sigma$. During real-time monitoring of the running state of $v_i$, if the reconstruction error satisfies $E_i^{t} > \alpha_i$, micro-service $v_i$ is considered anomalous.
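The thresholding rule described above (a threshold of μ + 3σ fitted on reconstruction errors from normal operation) can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented implementation: the function names `fit_threshold` and `is_anomalous`, the sample error values, and the choice of the population standard deviation are all hypothetical.

```python
import statistics


def fit_threshold(normal_errors):
    """Return alpha = mu + 3*sigma from reconstruction errors seen in normal operation."""
    mu = statistics.fmean(normal_errors)
    # Assumption: population standard deviation; the sample standard
    # deviation (statistics.stdev) would be an equally plausible reading.
    sigma = statistics.pstdev(normal_errors)
    return mu + 3 * sigma


def is_anomalous(error, alpha):
    """Flag a micro-service when its current reconstruction error exceeds alpha."""
    return error > alpha


# Hypothetical reconstruction errors of one micro-service during normal operation.
normal_errors = [0.010, 0.012, 0.009, 0.011, 0.010, 0.013, 0.008]
alpha = fit_threshold(normal_errors)
```

Because the threshold is fitted per micro-service from its own normal-operation errors, no metric threshold has to be set by hand.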
The specific flow of step 3 is as follows:
step 3-1: The set of micro-services in the micro-service system is denoted $V = \{v_1, v_2, \dots, v_n\}$, where $n$ is the number of micro-services. Each $v_i \in V$ is mapped to a graph node $s_i$, which gives the graph node set $S = \{s_1, s_2, \dots, s_n\}$;
step 3-2: The call relationships among micro-services are captured by parsing the communication data among them. If micro-service $v_i$ sends a service request to micro-service $v_j$, a directed edge $z_{ij}$ from graph node $s_i$ to graph node $s_j$ is constructed, which gives the edge set $Z = \{z_{ij}\}$. Identical service requests produce only one directed edge. This yields a service call relationship graph without anomaly weights;
step 3-3: The reconstruction error of the response time metric data of micro-service $v_i$ is used as the initial anomaly weight $f_{s_i}$ of graph node $s_i$. Computing the initial anomaly weight of every micro-service in turn gives the graph node anomaly weight set $F = \{f_{s_1}, f_{s_2}, \dots, f_{s_n}\}$. Each anomaly weight $f_{s_i}$ in $F$ is attached to graph node $s_i$ as its anomaly weight, which finally gives the service call relationship graph $G(S, Z, F)$.
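Steps 3-1 to 3-3 can be sketched as below. This is an illustrative sketch only: the function name `build_call_graph`, the representation of observed calls as (caller, callee) pairs, and the service names and error values are assumptions made for the example.

```python
def build_call_graph(services, calls, recon_errors):
    """Build G(S, Z, F): nodes, directed edges, and initial anomaly weights.

    services     -- iterable of micro-service names (one graph node each)
    calls        -- iterable of (caller, callee) pairs parsed from traffic
    recon_errors -- dict mapping service name to its reconstruction error
    """
    nodes = list(services)                           # step 3-1: one node per service
    edges = set()
    for caller, callee in calls:                     # step 3-2: caller -> callee
        edges.add((caller, callee))                  # repeated requests collapse to one edge
    weights = {s: recon_errors[s] for s in nodes}    # step 3-3: initial anomaly weights
    return nodes, edges, weights


# Hypothetical observations: "frontend" calls "cart" twice, "cart" calls "db".
nodes, edges, weights = build_call_graph(
    ["frontend", "cart", "db"],
    [("frontend", "cart"), ("frontend", "cart"), ("cart", "db")],
    {"frontend": 0.2, "cart": 0.6, "db": 0.9},
)
```

Storing edges in a set is one simple way to guarantee that identical service requests yield a single directed edge.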
In step 4, based on the service call relationship graph $G(S, Z, F)$, the anomaly weight of each graph node is updated automatically according to the anomaly weights of its adjacent graph nodes. For any graph node $s_j \in S$, $j \in \{1, 2, \dots, n\}$, the graph nodes that have a directed edge pointing to $s_j$ form the set $AN(s_j)$, and the graph nodes that have a directed edge pointing to any graph node in $AN(s_j)$ form the set $NAN(s_j)$.
The average anomaly weight $aScore(s_j)$ of $AN(s_j)$ is computed as:
$$aScore(s_j) = \frac{1}{inDegree(s_j)} \sum_{s_i \in AN(s_j)} f_{s_i} \tag{4}$$
where $f_{s_i}$ denotes the anomaly weight of graph node $s_i$ and $inDegree(s_j)$ denotes the in-degree of graph node $s_j$. The average anomaly weight $cScore(s_j)$ of $NAN(s_j)$ is computed as:
$$cScore(s_j) = \frac{1}{\lvert NAN(s_j) \rvert} \sum_{s_k \in NAN(s_j)} f_{s_k} \tag{5}$$
Here $aScore(s_j)$ reflects the overall degree of anomaly of $AN(s_j)$, and $cScore(s_j)$ reflects the overall degree of anomaly of $NAN(s_j)$. The two are combined to compute the anomaly weight $acScore(s_j)$ of graph node $s_j$:
$$acScore(s_j) = aScore(s_j) - cScore(s_j) \tag{6}$$
The higher the value of $acScore(s_j)$, the more anomalous the graph nodes adjacent to $s_j$ are overall and the less anomalous the graph nodes adjacent to $AN(s_j)$ are, and therefore the more likely the micro-service $v_j$ corresponding to graph node $s_j$ is the root cause of the fault. The Pearson correlation function is used to measure the correlation between the reconstruction error sequence of the response time of micro-service $v_i$, collected at time $t$ over the time window $w$, and the sequence of each physical resource monitoring metric on the host where $v_i$ is deployed:
$$corr(m_r, v_i) = \frac{\sum_{e \in w}\left(m_r^{e} - \bar{m}_r\right)\left(\epsilon_i^{e} - \bar{\epsilon}_i\right)}{\sqrt{\sum_{e \in w}\left(m_r^{e} - \bar{m}_r\right)^{2}}\,\sqrt{\sum_{e \in w}\left(\epsilon_i^{e} - \bar{\epsilon}_i\right)^{2}}} \tag{7}$$
where $m_r^{e}$ is the value at time $e$ of the $r$-th physical resource monitoring metric within the sequence collected at time $t$ over the time window $w$, $\bar{m}_r$ is the mean of that sequence, $r \in \{1, 2, \dots, k\}$, and $k$ is the number of physical resource metrics. $\epsilon_i^{e}$ is the reconstruction error at time $e$, obtained from the reconstructed response time value $\hat{x}_i^{e}$ and the monitored response time value $x_i^{e}$ of micro-service $v_i$ at time $e$, and $\bar{\epsilon}_i$ is the mean of the reconstruction errors. The extent to which micro-service $v_i$ is affected by physical resources is indicated by the magnitude of this correlation score, which is combined with $acScore(s_i)$ to compute the final anomaly weight $AS(s_i)$ of graph node $s_i$:
Figure BDA0003788710090000412
In step 5:
The anomaly weight of each micro-service is computed in turn, which completes the update of the anomaly weights of all graph nodes in the service call relationship graph. A weighted PageRank algorithm then performs a random walk on the service call relationship graph $G(S, Z, F)$. First, the node transition probability matrix $U$ of the service call relationship graph is defined:
$$U = \begin{pmatrix} u_{11} & \cdots & u_{1n} \\ \vdots & \ddots & \vdots \\ u_{n1} & \cdots & u_{nn} \end{pmatrix} \tag{9}$$
where $u_{ij}$ denotes the probability of a random walk from graph node $s_j$ to graph node $s_i$. The walk probability is tied to the anomaly weights of the graph nodes, and the probability $u_{ij}$ of a random walk from $s_j$ to $s_i$ is computed as:
$$u_{ij} = \begin{cases} \dfrac{AS(s_i)}{linkOut(s_j)}, & s_j \to s_i \\ 0, & \text{otherwise} \end{cases} \tag{10}$$
where $s_j \to s_i$ denotes that a directed edge from graph node $s_j$ to graph node $s_i$ exists, and $linkOut(s_j)$ denotes the sum of the anomaly weights of all graph nodes that $s_j$ points to. For any graph node $s_i \in S$, $i \in \{1, 2, \dots, n\}$, the PR score is initialized to $PR_0(s_i) = 1/n$, and the PR scores of all graph nodes are expressed as the vector $R_0$:
$$R_0 = \left(PR_0(s_1), PR_0(s_2), \dots, PR_0(s_n)\right)^{\top} \tag{11}$$
In each round of the random walk, the PR score of each graph node is updated iteratively:
$$R_c = \beta U \cdot R_{c-1} + (1 - \beta) R_0 \tag{12}$$
where $R_c$ is the PR score vector of all graph nodes after iteration round $c$, $U \in \mathbb{R}^{n \times n}$ is the random walk probability matrix, and $\beta \in (0, 1)$ is the damping coefficient, typically $\beta = 0.85$. After iterative updating, the PR score of each graph node converges. The higher the PR score of a graph node, the more likely the corresponding micro-service is the root cause of the fault. Finally, a ranked list of root-cause micro-services is output in descending order of the graph nodes' PR scores.
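The weighted PageRank iteration of step 5 can be sketched as below. This is an illustrative sketch under assumptions, not the patented implementation: the function name `weighted_pagerank`, the convergence tolerance, the handling of nodes with no outgoing edges (their walk mass is simply dropped), and the requirement that anomaly weights be non-negative are all choices made for the example.

```python
def weighted_pagerank(nodes, edges, weights, beta=0.85, max_iters=100, tol=1e-10):
    """Rank nodes by weighted PageRank; returns (ranking, scores).

    A walk from s_j moves to a successor s_i with probability
    weights[s_i] / linkOut(s_j), where linkOut(s_j) is the sum of the
    anomaly weights of all successors of s_j.
    Assumption: all anomaly weights are non-negative.
    """
    n = len(nodes)
    successors = {s: [v for (u, v) in edges if u == s] for s in nodes}
    link_out = {s: sum(weights[v] for v in successors[s]) for s in nodes}
    r0 = {s: 1.0 / n for s in nodes}          # PR_0(s_i) = 1/n
    r = dict(r0)
    for _ in range(max_iters):
        nxt = {s: (1 - beta) * r0[s] for s in nodes}
        for j in nodes:
            if link_out[j] <= 0:
                continue                       # no weighted successors: mass dropped (assumption)
            for i in successors[j]:
                nxt[i] += beta * (weights[i] / link_out[j]) * r[j]
        delta = max(abs(nxt[s] - r[s]) for s in nodes)
        r = nxt
        if delta < tol:
            break
    ranking = sorted(nodes, key=lambda s: r[s], reverse=True)
    return ranking, r


# Hypothetical chain frontend -> cart -> db with made-up final anomaly weights.
nodes = ["frontend", "cart", "db"]
edges = {("frontend", "cart"), ("cart", "db")}
weights = {"frontend": 0.2, "cart": 0.6, "db": 0.9}
ranking, scores = weighted_pagerank(nodes, edges, weights)
```

Because edges point from caller to callee, walk mass flows toward the services that others depend on, so in this toy chain "db" ends up at the top of the ranking.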
The invention also provides a device that uses the above micro-service fault localization method based on an autoencoder and a service dependency graph, comprising a data collection module, an anomaly detection module and a fault localization module. The data collection module collects the physical resource utilization metrics of the server hosts and the response time metrics of call requests between micro-services while the micro-service system is running. The anomaly detection module trains an autoencoder model with the monitoring sequences of call response time as training data, reconstructs the response time, and determines whether the micro-service system is anomalous by computing the reconstruction error of the response time metric data. The fault localization module maps each micro-service to a corresponding node, captures the call relationships among micro-services by parsing the communication data among them, constructs directed edges between nodes according to those call relationships, and builds a service call relationship graph with the reconstruction error of the response time metric data as the node anomaly weight. It then correlates the reconstruction error of the response time metric data with the physical resource utilization metric data of the service host, computes the anomaly weight of each graph node in the service call relationship graph, and, based on the updated anomaly weights, uses a weighted PageRank algorithm to infer and locate the faulty micro-service that caused the anomaly.
The invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the above-described self-encoder and service dependency graph based micro-service fault locating method steps.
The invention also provides an electronic device comprising a processor and a memory, wherein the memory stores a computer program which, when executed by the processor, realizes the steps of the micro-service fault positioning method based on the self-encoder and the service dependency graph.
The invention also provides a computer program product characterized by comprising a computer program/instruction which, when executed by a processor, implements the above-mentioned microservice fault location method steps based on self-encoders and service dependency graphs.
Compared with the prior art, the invention has the advantages that:
1. By training an autoencoder model, the invention learns the relationship between the running state of a micro-service and the fluctuation of its response time, so that the running state can be detected in real time. This removes the need, in existing micro-service fault localization methods, to manually set thresholds for the various monitoring metrics for anomaly diagnosis. The call relationships among micro-services are captured by parsing the communication data among them, and a service call relationship graph is built to model the fault propagation paths. The anomaly weights of the nodes in the graph are updated by combining the autoencoder's reconstruction error for each micro-service's response time with system resource utilization, so that the faulty micro-service is located automatically with a weighted PageRank algorithm and the accuracy of fault localization is improved.
2. The invention uses the monitoring metric data collected while a micro-service runs normally as the input of the autoencoder, which encodes and reconstructs it. Compared with conventional anomaly detection methods, this captures the hidden characteristics of the metric data in the normal running state and thereby improves the accuracy of real-time anomaly detection.
3. The invention uses the autoencoder's reconstruction error for the micro-service monitoring metrics as the anomaly weight of the nodes in the service call relationship graph. The reconstruction error reflects how far the real-time monitoring metrics deviate from the normal monitoring metrics, so the reconstruction error (that is, the anomaly weight) reflects the degree of anomaly of the micro-service.
4. The invention measures how likely a micro-service is to be the root cause of a fault by introducing the aScore, cScore and acScore scores. aScore(s j ) reflects the overall degree of anomaly of AN(s j ). cScore(s j ) reflects the overall degree of anomaly of NAN(s j ). acScore(s j ) combines the degree of anomaly around node s j : the higher acScore(s j ) is, the more anomalous the nodes adjacent to s j are overall and the less anomalous the nodes adjacent to AN(s j ) are, and therefore the more likely the micro-service v j corresponding to node s j is the root cause of the fault.
5. The invention uses the Pearson correlation function to compute the correlation between the response time monitoring metric and the sequence data of each physical resource monitoring metric, and uses it to update the anomaly weight of each node. Resource utilization thus becomes part of the anomaly weight calculation, which ties the localization of the root-cause micro-service more closely to resource utilization.
6. The invention improves the PageRank algorithm by designing a weighted PageRank algorithm based on the anomaly weights of connected nodes in the service call relationship graph. The walk strategy of the random walk is tied to the anomaly weights of each node and its connected nodes, so that nodes with higher anomaly weights are visited more frequently.
Drawings
FIG. 1 is a general framework diagram of a method for automatically locating micro-service faults based on a self-encoder and a service dependency graph according to an embodiment of the present invention;
FIG. 2 is a flow chart of a method for automatically locating micro-service faults based on a self-encoder and a service dependency graph according to an embodiment of the present invention;
FIG. 3 is a diagram of fault location accuracy versus experimental results provided by an embodiment of the present invention;
FIG. 4 is a graph of failure location average accuracy versus experimental results provided by an embodiment of the present invention;
FIG. 5 is a schematic diagram of a micro-service fault locating device according to an embodiment of the present invention;
FIG. 6 is another schematic structural diagram of a micro-service fault locating device according to an embodiment of the present invention.
Detailed Description
Embodiments of the present invention are described in further detail below with reference to the accompanying drawings.
The invention provides a method and device for automatically locating micro-service faults based on an autoencoder and a service dependency graph. The overall framework is shown in FIG. 1 and consists mainly of three modules. The data collection module collects application-level metrics and system-level metrics: the application-level metrics are used to detect application performance problems, and the system-level metrics are used to update the node weights in the subsequent service call relationship graph. The anomaly detection module uses the autoencoder to encode and reconstruct the application metric data and detects whether a micro-service module is anomalous. Once a system anomaly is detected, the fault localization module builds a service call relationship graph by parsing the call relationships among micro-services in order to analyze the anomaly propagation paths, computes the anomaly weight of each micro-service node in the graph from the correlation between system resource utilization and micro-service performance, and finally locates the root-cause micro-service module with a weighted PageRank algorithm.
Referring to FIG. 2, which is a flowchart of the method for automatically locating micro-service faults based on an autoencoder and a service dependency graph according to an embodiment of the present invention, the method specifically comprises:
step 1: The physical resource utilization metrics and the response time metrics of calls between micro-services are monitored and collected in real time with the Prometheus tool while the micro-service system is running. The physical resource utilization metrics reflect the resource usage of the physical or virtual machines that run the micro-service instances, such as CPU utilization and memory utilization. The response time metric reflects how long a micro-service in the micro-service system takes to respond to requests from other micro-services.
The invention denotes the $k$ physical resource monitoring metrics as $M = \{m_1, m_2, \dots, m_k\}$. The continuously monitored values of any metric $m_r$ form the time series $(m_r^{1}, m_r^{2}, \dots)$, where $m_r \in M$, $r \in \{1, 2, \dots, k\}$. To model how the monitored values of $m_r$ change, a sliding window $w$ of length $h$ is used to extract, at collection time $t$, the monitoring sequence of $m_r$, expressed as $\mathbf{m}_r^{t} = (m_r^{t-h+1}, \dots, m_r^{t})$, where $m_r^{t}$ denotes the monitored value of metric $m_r$ at time $t$. The monitoring sequences of the $k$ physical resource metrics collected at time $t$ over the time window $w$ form the monitoring data matrix $\mathbf{M}^{t} = (\mathbf{m}_1^{t}, \mathbf{m}_2^{t}, \dots, \mathbf{m}_k^{t})$.
In a micro-service system $V = \{v_1, v_2, \dots, v_n\}$ with $n$ micro-service modules, for any micro-service $v_i \in V$, the response time monitoring sequence collected at time $t$ over the time window $w$ is expressed as $\mathbf{x}_i^{t} = (x_i^{t-h+1}, \dots, x_i^{t})$, where $x_i^{t}$ denotes the monitored response time of micro-service $v_i$ at time $t$. The response time monitoring sequences of the $n$ micro-services collected at time $t$ over the time window $w$ form the matrix $\mathbf{X}^{t} = (\mathbf{x}_1^{t}, \mathbf{x}_2^{t}, \dots, \mathbf{x}_n^{t})$.
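The sliding-window extraction described above can be sketched as follows. This is a minimal illustration: the function names `window_at` and `window_matrix`, and the convention that a window holds the h most recent samples ending at index t (inclusive), are assumptions made for the example.

```python
def window_at(series, t, h):
    """Return the h most recent samples of `series` ending at index t (inclusive)."""
    if t + 1 < h:
        raise ValueError("not enough history for a full window")
    return series[t - h + 1 : t + 1]


def window_matrix(all_series, t, h):
    """Stack the windows of several metrics (or micro-services) into one matrix."""
    return [window_at(series, t, h) for series in all_series]


# Hypothetical CPU and memory utilization series sampled at the same instants.
cpu = [0.31, 0.35, 0.40, 0.38, 0.90]
mem = [0.50, 0.51, 0.52, 0.52, 0.55]
resource_matrix = window_matrix([cpu, mem], t=4, h=3)
```

The same helper applies to response time series, with one row per micro-service instead of one row per resource metric.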
Step 2: will micro-serve v i Response time monitoring index sequence data collected at time t for duration of time window w
Figure BDA0003788710090000083
As input to the self-encoder after training, the coding layer will be +.>
Figure BDA0003788710090000084
Mapping to d-dimensional latent feature representation +.>
Figure BDA0003788710090000085
Figure BDA0003788710090000086
Where g is the activation function, h is the number of response time monitoring index collections within the time window w, d is the dimension of the potential feature representation,
Figure BDA0003788710090000087
is the weight matrix of the h rows and d columns of the input layer and the hidden layer, and b is the h-dimensional bias vector of the input layer. The potential feature is then represented by the decoder +.>
Figure BDA0003788710090000088
Reconstruction as a micro-service module v i Is a response time index monitoring sequence data +.>
Figure BDA0003788710090000089
Figure BDA00037887100900000810
where
Figure BDA00037887100900000811
is the d-row, h-column weight matrix between the hidden layer and the output layer, and c ∈ R^{d+1} is the bias vector of the hidden layer,
Figure BDA00037887100900000812
Figure BDA00037887100900000813
Then the reconstruction mean square error between
Figure BDA00037887100900000814
and
Figure BDA00037887100900000815
is calculated:
Figure BDA00037887100900000816
Figure BDA00037887100900000817
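As a non-limiting illustration, the encoding, decoding, and error calculation of step 2 might be sketched in Python as follows. The sigmoid activation, the function and parameter names, and the bias shapes (d entries for the encoder and h entries for the decoder, chosen so that the matrix products are well defined) are assumptions of this sketch and are not taken from the formulas of the embodiment. The weights would be obtained by training, which is not shown, and the input window is assumed to be scaled to the range [0, 1].

```python
import numpy as np


def sigmoid(x):
    """Assumed activation function g."""
    return 1.0 / (1.0 + np.exp(-x))


def autoencoder_forward(x, w_enc, b_enc, w_dec, b_dec):
    """Encode an h-sample response time window, decode it, and return the MSE.

    x:     shape (h,)   response time window (assumed scaled to [0, 1])
    w_enc: shape (h, d) input-to-hidden weights
    b_enc: shape (d,)   encoder bias (shape is an assumption of this sketch)
    w_dec: shape (d, h) hidden-to-output weights
    b_dec: shape (h,)   decoder bias (shape is an assumption of this sketch)
    """
    x = np.asarray(x, dtype=float)
    z = sigmoid(x @ w_enc + b_enc)          # d-dimensional latent representation
    x_hat = sigmoid(z @ w_dec + b_dec)      # reconstructed h-sample window
    mse = float(np.mean((x - x_hat) ** 2))  # reconstruction mean square error
    return z, x_hat, mse
```

With all-zero weights and biases, every latent and reconstructed entry equals 0.5, so a window of constant value 0.5 is reconstructed with zero error; a trained model would instead reproduce only the patterns seen during normal operation.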
In the self-encoder training phase, the response time monitoring sequences of microservice v_i recorded during normal operation of the microservice system are used as training data to train the self-encoder model. After multiple rounds of training, the converged self-encoder model has learned the characteristics of normal response time series data. The model's reconstruction of the response time of microservice v_i during normal operation is therefore close to the monitored value, and the corresponding reconstruction error is small and fluctuates within a stable range. The mean μ and standard deviation σ of the reconstruction error over this period are calculated, and the anomaly detection threshold of microservice v_i is set to α_i = μ + 3σ. During real-time monitoring of the running state of microservice v_i, if the reconstruction error exceeds this threshold,
Figure BDA00037887100900000818
then microservice v_i is considered to be anomalous.
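As a non-limiting illustration, the threshold rule α_i = μ + 3σ described above might be sketched in Python as follows. The function names are hypothetical, and the use of the population standard deviation (NumPy's default) is an assumption of this sketch, since the embodiment does not state which estimator is used.

```python
import numpy as np


def fit_threshold(normal_errors):
    """Return alpha = mu + 3*sigma from reconstruction errors seen in normal operation."""
    errors = np.asarray(normal_errors, dtype=float)
    # np.std defaults to the population standard deviation (assumption of this sketch).
    return float(errors.mean() + 3.0 * errors.std())


def is_anomalous(reconstruction_error, alpha):
    """A microservice is flagged when its reconstruction error exceeds its threshold."""
    return reconstruction_error > alpha
```

For example, normal-operation errors of 0.1, 0.2, and 0.3 give a threshold of about 0.445, so a later reconstruction error of 1.3 would be flagged while an error of 0.25 would not.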
Step 3: Each microservice is mapped to a corresponding node, the call relations among microservices are captured by parsing the communication data between them, directed edges are constructed between the nodes, and the reconstruction error of each microservice's response time index data is used as the node anomaly weight to build the service call relation graph. The specific flow is as follows:
Step 3-1: The set of microservices in the microservice system is denoted V = {v_1, v_2, …, v_n}, where n is the number of microservices. Each v_i ∈ V is mapped to a graph node s_i, yielding the graph node set S = {s_1, s_2, …, s_n};
Step 3-2: The call relations between microservices are captured by parsing the communication data between them. If microservice v_i sends a service request to microservice v_j, a directed edge z_ij from s_i to s_j is constructed, forming the edge set Z = {z_ij}, where 1 ≤ i, j ≤ n; repeated occurrences of the same service request produce only one directed edge;
Step 3-3: The reconstruction error of the response time monitoring index of microservice v_i is used as the initial anomaly weight of graph node s_i,
Figure BDA0003788710090000091
The initial anomaly weight of every microservice is calculated in turn to obtain the graph node anomaly weight set
Figure BDA0003788710090000092
Finally, the service call relation graph G(S, Z, F) is obtained.
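As a non-limiting illustration, steps 3-1 to 3-3 might be sketched in Python as follows. The function name, the representation of call records as (caller, callee) pairs, and the dictionary of reconstruction errors are assumptions of this sketch; how the call records are extracted from the communication data is not shown.

```python
def build_call_graph(services, call_records, reconstruction_errors):
    """Build G(S, Z, F) from observed calls and per-service reconstruction errors.

    services:              iterable of microservice names (the set V)
    call_records:          iterable of (caller, callee) pairs, possibly repeated
    reconstruction_errors: dict mapping service name -> reconstruction error
    """
    nodes = list(services)                    # S: one graph node per microservice
    known = set(nodes)
    edges = set()                             # Z: a set keeps one edge per (caller, callee)
    for caller, callee in call_records:
        if caller in known and callee in known:
            edges.add((caller, callee))
    # F: initial anomaly weight of each node is its reconstruction error
    weights = {s: float(reconstruction_errors[s]) for s in nodes}
    return nodes, edges, weights
```

Storing the edges in a set reflects the rule that repeated requests between the same pair of microservices yield a single directed edge.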
Step 4: The walk strategy of the random walk algorithm depends on the walk probability between nodes, and the walk probability is tied to the anomaly weights of the nodes, so the anomaly weight assigned to each graph node in the service call relation graph G(S, Z, F) is critical to the fault localization accuracy of the random walk algorithm. The invention updates the anomaly weights of the graph nodes using both the relations among graph nodes in the service call relation graph and the relation between the running state of a microservice and the resource utilization of its host, so that graph nodes with a higher degree of anomaly obtain a larger walk probability in the random walk algorithm, which improves the interpretability of fault localization. Based on the service call relation graph G(S, Z, F), the anomaly weight of each graph node is updated automatically according to the anomaly weight relations between adjacent graph nodes. For any graph node s_j ∈ S, j ∈ {1, 2, …, n}, the adjacent graph nodes that have a directed edge pointing to s_j form the set AN(s_j), and the adjacent graph nodes that have a directed edge pointing to any graph node in AN(s_j) form the set NAN(s_j).
Then the average anomaly weight aScore(s_j) of AN(s_j) is calculated:
Figure BDA0003788710090000093
where
Figure BDA0003788710090000094
denotes the anomaly weight of graph node s_i and inDegree(s_j) denotes the in-degree of graph node s_j. The average anomaly weight cScore(s_j) of NAN(s_j) is calculated as:
Figure BDA0003788710090000095
Here aScore(s_j) reflects the overall degree of anomaly of AN(s_j), and cScore(s_j) reflects the overall degree of anomaly of NAN(s_j). Combining aScore(s_j) and cScore(s_j), the anomaly weight acScore(s_j) of graph node s_j is calculated:
acScore(s_j) = aScore(s_j) − cScore(s_j)    (18)
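As a non-limiting illustration, the calculation of acScore(s_j) might be sketched in Python as follows. The function names are hypothetical. The sketch assumes that aScore and cScore are plain arithmetic means over AN(s_j) and NAN(s_j) respectively, and that an empty set contributes a score of zero; the exact normalisation of cScore is given by the formula of the embodiment and may differ.

```python
def predecessors(edges, node):
    """Nodes that have a directed edge pointing to `node`."""
    return {src for src, dst in edges if dst == node}


def ac_score(edges, weights, node):
    """acScore(s_j) = aScore(s_j) - cScore(s_j), with mean-based scores (assumption)."""
    an = predecessors(edges, node)            # AN(s_j)
    nan = set()                               # NAN(s_j)
    for neighbour in an:
        nan |= predecessors(edges, neighbour)
    # With one edge per caller/callee pair, len(an) equals inDegree(s_j).
    a_score = sum(weights[s] for s in an) / len(an) if an else 0.0
    c_score = sum(weights[s] for s in nan) / len(nan) if nan else 0.0
    return a_score - c_score
```

For a graph with edges a→b, b→c, and d→c and weights a = 0.2, b = 0.8, d = 0.4, node c has AN = {b, d} and NAN = {a}, so its aScore is 0.6, its cScore is 0.2, and its acScore is 0.4.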
acScore(s_j) thus integrates information about graph node s_j and its surrounding graph nodes. A higher value of acScore(s_j) means that the graph nodes adjacent to s_j have a high overall degree of anomaly while the graph nodes adjacent to AN(s_j) have a lower one, so the probability that the corresponding microservice v_j is the root cause of the fault is higher. In addition, the response time of a microservice is related to changes in the performance indexes of the host on which it is deployed. The invention uses the Pearson correlation function to measure the correlation between the response time series reconstruction error of microservice v_i, collected at time t over the time window,
Figure BDA0003788710090000101
and the sequence data of each physical resource monitoring index on the host where microservice v_i is deployed:
Figure BDA0003788710090000102
where
Figure BDA00037887100900001013
denotes the sequence data of the r-th physical resource monitoring index collected at time t over the time window w,
Figure BDA0003788710090000103
denotes the mean of
Figure BDA00037887100900001014
with r ∈ {1, 2, …, k},
Figure BDA0003788710090000104
denotes the reconstructed value of the response time of microservice v_i at time e,
Figure BDA0003788710090000105
denotes the monitored response time of microservice v_i at time e, and
Figure BDA0003788710090000106
denotes the mean of the reconstruction error
Figure BDA0003788710090000107
The degree of anomaly of microservice v_i can be reflected by how high the score
Figure BDA0003788710090000108
is, so the invention combines acScore(s_i) and
Figure BDA0003788710090000109
to calculate the final anomaly weight AS(s_i) of graph node s_i:
Figure BDA00037887100900001010
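As a non-limiting illustration, the Pearson correlation between the reconstruction error sequence of a microservice and each physical resource index sequence of its host might be computed in Python as follows. The function names and the dictionary layout are hypothetical, and returning 0 for a constant sequence is an assumption of this sketch. The rule that combines these correlations with acScore(s_i) into AS(s_i) is the one given by the formula of the embodiment and is not reproduced here.

```python
import numpy as np


def pearson(x, y):
    """Pearson correlation coefficient of two equally long sequences."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    dx = x - x.mean()
    dy = y - y.mean()
    denom = np.sqrt(np.sum(dx ** 2) * np.sum(dy ** 2))
    if denom == 0.0:
        return 0.0  # constant sequence: correlation undefined, treated as 0 here
    return float(np.sum(dx * dy) / denom)


def resource_correlations(error_sequence, resource_sequences):
    """Correlate the reconstruction error window with every resource index window.

    resource_sequences: dict mapping index name (e.g. "cpu") -> sequence of length h
    """
    return {name: pearson(error_sequence, seq)
            for name, seq in resource_sequences.items()}
```

A resource index whose sequence rises and falls together with the reconstruction error yields a coefficient close to 1, which suggests that the anomaly of the microservice is associated with that host resource.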
Step 5: The anomaly weight of each microservice node is calculated in turn, which completes the update of the anomaly weights of all graph nodes in the service call relation graph. A weighted PageRank algorithm then performs a random walk on the service call relation graph G(S, Z, F); the walk probability is calculated from the relation between the anomaly weight of a graph node and the anomaly weights of the graph nodes connected to it, which improves the accuracy of root cause localization. The walk strategy of the weighted PageRank algorithm is based on the probability that each graph node visits other graph nodes, so the node transition probability matrix U of the service call relation graph is defined first:
Figure BDA00037887100900001011
where u_ij denotes the probability of a random walk from graph node s_j to graph node s_i. Because the anomaly weight of a graph node is related to the walk probability, the probability u_ij of walking from s_j to s_i is calculated as:
Figure BDA00037887100900001012
where s_j → s_i indicates that a directed edge from graph node s_j to graph node s_i exists, and linkut(s_j) denotes the sum of the anomaly weights of the graph nodes that s_j points to. For any graph node s_i ∈ S, i ∈ {1, 2, …, n}, the PR score is initialized to PR_0(s_i) = 1/n, and the PR scores of all graph nodes are expressed as the vector R_0:
R_0 = (PR_0(s_1), PR_0(s_2), …, PR_0(s_n))^T    (23)
During each round of the random walk, the PR score of each graph node is iteratively updated:
R_c = βU·R_{c-1} + (1 − β)R_0    (24)
where R_c denotes the PR score vector of all graph nodes after iteration round c, U ∈ R^{n×n} denotes the random walk probability matrix, and β ∈ (0, 1) denotes the damping coefficient, typically β = 0.85. After repeated iterative updates the PR scores of the graph nodes converge; at that point, the higher the PR score of a graph node, the greater the probability that the corresponding microservice module is the root cause of the fault. Finally, a ranked list of candidate root cause microservices is output in descending order of graph node PR score.
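As a non-limiting illustration, the iterative update R_c = βU·R_{c-1} + (1 − β)R_0 might be sketched in Python as follows. The sketch assumes that u_ij equals the final anomaly weight of s_i divided by the sum of the anomaly weights of all graph nodes that s_j points to, that negative weights are clipped to zero, and that iteration stops when the L1 change falls below a tolerance; these choices, the function name, and the parameter names are assumptions and are not taken from the formulas of the embodiment.

```python
import numpy as np


def weighted_pagerank(nodes, edges, weights, beta=0.85, tol=1e-9, max_iter=1000):
    """Rank graph nodes by PR score; edges are (source, target) pairs."""
    n = len(nodes)
    index = {s: i for i, s in enumerate(nodes)}
    U = np.zeros((n, n))
    for src, dst in edges:
        # u_ij: walk from s_j (column) to s_i (row), proportional to weight of s_i
        U[index[dst], index[src]] = max(weights[dst], 0.0)
    col_sums = U.sum(axis=0)
    nonzero = col_sums > 0
    U[:, nonzero] /= col_sums[nonzero]        # normalise each column with outgoing weight
    r0 = np.full(n, 1.0 / n)                  # PR_0(s_i) = 1/n
    r = r0.copy()
    for _ in range(max_iter):
        nxt = beta * (U @ r) + (1.0 - beta) * r0
        converged = np.abs(nxt - r).sum() < tol
        r = nxt
        if converged:
            break
    scores = {s: float(r[index[s]]) for s in nodes}
    ranking = sorted(nodes, key=lambda s: scores[s], reverse=True)
    return ranking, scores
```

The first entries of the returned ranking are the candidate root cause microservices; a node that is both reachable along call edges and heavily weighted accumulates the highest score.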
To evaluate the effectiveness of the invention, the evaluation indexes AC@K and MAP are used to measure the fault localization effect. AC@K denotes the probability that the root cause microservice is included in the first K microservices of the ranked root cause prediction output; the higher the AC@K score, the more accurately the model localizes faults. MAP quantifies the average fault localization accuracy.
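As a non-limiting illustration, AC@K might be estimated over a set of fault injection cases in Python as follows. The function name is hypothetical, and the sketch assumes one known root cause microservice per case, with the probability estimated as the fraction of cases whose root cause appears among the first K ranked microservices.

```python
def ac_at_k(ranked_lists, true_roots, k):
    """Fraction of fault cases whose true root cause is in the top-k ranked services."""
    if not ranked_lists or len(ranked_lists) != len(true_roots):
        raise ValueError("need one ranking and one true root cause per fault case")
    hits = sum(1 for ranking, root in zip(ranked_lists, true_roots)
               if root in ranking[:k])
    return hits / len(ranked_lists)
```

For three cases in which the true root cause is ranked first, second, and third respectively, the sketch gives AC@1 = 1/3, AC@2 = 2/3, and AC@3 = 1.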
Table 1 shows the fault localization accuracy of the invention on different microservices in Sock-shop when faults such as network delay (Latency), CPU resource shortage (CPU Hog), and memory leak (Memory Leak) are injected. As Table 1 shows, the average fault localization accuracy (MAP) of MicroEncoder on each microservice is over 85%.
TABLE 1
Figure BDA0003788710090000111
To further evaluate its effectiveness, the invention was compared with other advanced methods, including Random selection, MicroRCA, and AAMR, among others. First, the fault localization accuracy of the invention and of the comparison methods was tested under three fault types: network delay (Latency), CPU resource shortage (CPU Hog), and memory leak (Memory Leak). The experimental results are shown in Fig. 3: for all three injected fault types and for different values of K, the fault localization accuracy of the invention is higher than that of the other methods, which indicates that the invention effectively improves fault localization accuracy.
Further, the average fault localization accuracy of the method is measured by calculating the evaluation index MAP; the experimental results are shown in Fig. 4. As Fig. 4 shows, the average fault localization accuracy of the invention is 92% for network delay (Latency), 86.4% for CPU resource shortage (CPU Hog), and 91.2% for memory leak (Memory Leak), all of which are superior to the comparison methods.
The invention also provides a device that uses the micro-service fault positioning method based on the self-encoder and the service dependency graph. The device comprises a data collection module, an anomaly detection module, and a fault locating module. The data collection module collects the physical resource indexes and response time indexes of the microservices; the response time indexes are used to detect the response speed of calls between microservices, and the physical resource indexes are used to update the anomaly weights of the graph nodes in the service call relation graph. The anomaly detection module encodes and reconstructs the data from the data collection module with the self-encoder and detects whether a microservice is anomalous. The fault locating module constructs the service call relation graph by analyzing the call relations among microservices, analyzes the anomaly propagation path, calculates the anomaly weight of each microservice in the service call relation graph using the physical resource indexes and the microservice response time indexes, and locates the root cause microservice using the weighted PageRank algorithm.
Referring to Fig. 5, which shows a schematic structural diagram of a micro-service fault positioning device based on a self-encoder and a service dependency graph according to an exemplary embodiment of the invention, the device provided in this embodiment includes a data collection unit 301, an anomaly detection unit 302, and a fault location unit 303. The data collection unit 301 collects the physical resource indexes and response time indexes of the microservices; the response time indexes are used to detect the response speed of calls between microservices, and the physical resource indexes are used to update the anomaly weights of the graph nodes in the service call relation graph. The anomaly detection unit 302 encodes and reconstructs the data from the data collection unit 301 with the self-encoder to detect whether a microservice is anomalous. The fault location unit 303 constructs the service call relation graph by analyzing the call relations among microservices, analyzes the anomaly propagation path, calculates the anomaly weight of each microservice in the service call relation graph using the physical resource indexes and the microservice response time indexes, and locates the fault root cause using the weighted PageRank algorithm.
Referring to Fig. 6, a schematic structural diagram of a micro-service fault positioning device based on a self-encoder and a service dependency graph according to an embodiment of the present application is shown, hereinafter referred to as device 6. The device 6 may be integrated in the foregoing electronic apparatus. As shown in Fig. 6, the device includes a memory 602, a processor 601, an input device 603, an output device 604, and a communication interface. The memory 602 may be a separate physical unit connected to the processor 601, the input device 603, and the output device 604 through a bus; alternatively, the memory 602 and the processor 601 may be integrated together and implemented in hardware. The memory 602 stores a program implementing the above method embodiment, or the modules of the above device embodiment, and the processor 601 invokes the program to perform the operations of the above method embodiment. The input device 603 includes, but is not limited to, a keyboard and a mouse; the output device 604 includes, but is not limited to, a display screen. The communication interface is used to send and receive various types of messages and includes, but is not limited to, a wireless interface or a wired interface.
Alternatively, when part or all of the micro-service fault positioning method of the above embodiment is implemented in software, the device may include only the processor. In that case the memory storing the program is located outside the device, and the processor is connected to the memory through a circuit or wire to read and execute the program stored in the memory. The processor may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP.
The processor may further comprise a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. The memory may include volatile memory, such as random-access memory (RAM); the memory may also include non-volatile memory, such as flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory may also comprise a combination of the above types of memory.
The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.
The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the invention without departing from the principles thereof are intended to be within the scope of the invention as set forth in the following claims.

Claims (9)

1. The micro-service fault positioning method based on the self-encoder and the service dependency graph is characterized by comprising the following steps of:
step 1: collecting the physical resource utilization indexes of the server hosts and the response time indexes of call requests between microservices while the microservice system is running;
step 2: using the monitoring sequence of the response time as training data to train a self-encoder model, reconstructing the response time, and judging whether the microservice system is abnormal by calculating the reconstruction error of the response time index data;
step 3: mapping each microservice to a corresponding node, capturing the call relations between microservices by parsing the communication data among the microservices, constructing directed edges between the nodes according to the call relations, and constructing a service call relation graph with the reconstruction error of the response time index data as the node anomaly weight;
step 4: associating the reconstruction error of the response time index data with the physical resource utilization index data of the service host, and calculating the anomaly weight of each graph node in the service call relation graph;
step 5: based on the updated anomaly weight of each graph node in the service call relation graph from step 4, using a weighted PageRank algorithm to infer and locate the faulty microservice that caused the abnormality.
2. The method for positioning micro service faults based on the self-encoder and the service dependency graph according to claim 1, wherein in step 1, the physical resource utilization indexes comprise indexes of the resource usage of the physical machine or virtual machine layer running the microservice instances, the response time indexes comprise indexes of the time a microservice takes to respond to requests from other microservices in the microservice system, and the physical resource utilization indexes and the response time indexes are monitored and collected in real time with the Prometheus tool.
3. The method for positioning micro service faults based on the self-encoder and the service dependency graph according to claim 1, wherein in the step 2:
the response time monitoring index sequence of microservice v_i, collected at time t over a time window w,
Figure FDA0004244730450000011
is used as the input of the trained self-encoder, where V denotes the set of microservices and
Figure FDA0004244730450000016
is an h-dimensional vector; the encoding layer maps
Figure FDA0004244730450000017
to a d-dimensional latent feature representation
Figure FDA0004244730450000012
Figure FDA0004244730450000013
where g is the activation function, h is the number of response time samples collected within the time window w, d is the dimension of the latent feature representation,
Figure FDA0004244730450000014
is the h-row, d-column weight matrix between the input layer and the hidden layer, and b is the h-dimensional bias vector of the input layer; the decoder reconstructs the latent feature representation
Figure FDA0004244730450000015
into the response time monitoring sequence of microservice v_i
Figure FDA0004244730450000021
Figure FDA0004244730450000022
where
Figure FDA0004244730450000023
is the d-row, h-column weight matrix between the hidden layer and the output layer, and c is the d-dimensional bias vector of the hidden layer,
Figure FDA0004244730450000024
the reconstruction mean square error between
Figure FDA0004244730450000025
and
Figure FDA0004244730450000026
is then calculated:
Figure FDA0004244730450000027
Figure FDA0004244730450000028
in the self-encoder training phase, the response time monitoring sequences of microservice v_i recorded during normal operation of the microservice system are used as training data to train the self-encoder model; after multiple rounds of training, the converged self-encoder model has learned the characteristics of normal response time series data, so its reconstruction of the response time of microservice v_i during normal operation is close to the monitored value, and the corresponding reconstruction error is small and fluctuates within a stable range; the mean μ and standard deviation σ of the reconstruction error over this period are calculated, and the anomaly detection threshold of microservice v_i is set to α_i = μ + 3σ; during real-time monitoring of the running state of microservice v_i, if the reconstruction error exceeds this threshold,
Figure FDA0004244730450000029
then microservice v_i is considered to be abnormal.
4. The method for positioning micro service failure based on self-encoder and service dependency graph according to claim 1, wherein the specific flow of step 3 is as follows:
step 3-1: the set of microservices in the microservice system is denoted V = {v_1, v_2, …, v_n}, where n is the number of microservices; each v_i ∈ V is mapped to a graph node s_i, yielding the graph node set S = {s_1, s_2, …, s_n};
step 3-2: the call relations between microservices are captured by parsing the communication data between them; if microservice v_i sends a service request to microservice v_j, a directed edge z_ij from graph node s_i to graph node s_j is constructed, finally forming the edge set Z = {z_ij}, where 1 ≤ i, j ≤ n; repeated occurrences of the same service request produce only one directed edge, which yields a service call relation graph without anomaly weights;
step 3-3: the reconstruction error of the response time monitoring index of microservice v_i is used as the initial anomaly weight of graph node s_i,
Figure FDA00042447304500000210
the initial anomaly weight of every microservice is calculated in turn to obtain the graph node anomaly weight set
Figure FDA00042447304500000211
Figure FDA00042447304500000212
the anomaly weight
Figure FDA00042447304500000213
in the anomaly weight set F is assigned to graph node s_i, and finally the service call relation graph G(S, Z, F) is obtained.
5. The method for positioning micro service fault based on self-encoder and service dependency graph according to claim 1, wherein in step 4, based on the service call relation graph G(S, Z, F), the anomaly weight of each graph node is updated automatically according to the anomaly weight relations between adjacent graph nodes in the service call relation graph; for any graph node s_j ∈ S, j ∈ {1, 2, …, n}, the adjacent graph nodes that have a directed edge pointing to s_j form the set AN(s_j), and the adjacent graph nodes that have a directed edge pointing to any graph node in AN(s_j) form the set NAN(s_j);
the average anomaly weight aScore(s_j) of AN(s_j) is calculated:
Figure FDA0004244730450000031
where
Figure FDA0004244730450000034
denotes the anomaly weight of graph node s_i and inDegree(s_j) denotes the in-degree of graph node s_j; the average anomaly weight cScore(s_j) of NAN(s_j) is calculated:
Figure FDA0004244730450000032
where aScore(s_j) reflects the overall degree of anomaly of AN(s_j) and cScore(s_j) reflects the overall degree of anomaly of NAN(s_j); combining aScore(s_j) and cScore(s_j), the anomaly weight acScore(s_j) of graph node s_j is calculated:
acScore(s_j) = aScore(s_j) − cScore(s_j)    (6)
the higher the value of acScore(s_j), that is, the higher the overall degree of anomaly of the graph nodes adjacent to s_j and the lower the overall degree of anomaly of the graph nodes adjacent to AN(s_j), the higher the probability that the corresponding microservice v_j is the root cause of the fault; the Pearson correlation function is used to measure the correlation between the response time series reconstruction error of microservice v_i, collected at time t over the time window,
Figure FDA0004244730450000033
and the sequence data of each physical resource monitoring index on the host where microservice v_i is deployed:
Figure FDA0004244730450000041
where
Figure FDA0004244730450000042
denotes the sequence data of the r-th physical resource monitoring index collected at time t over the time window w,
Figure FDA0004244730450000043
denotes the mean of
Figure FDA0004244730450000044
with r ∈ {1, 2, …, k}, k denotes the number of physical resource monitoring indexes,
Figure FDA0004244730450000045
denotes the reconstructed value of the response time of microservice v_i at time e,
Figure FDA0004244730450000046
denotes the monitored response time of microservice v_i at time e, and
Figure FDA0004244730450000047
denotes the mean of the reconstruction error
Figure FDA0004244730450000048
the degree of anomaly of microservice v_i is reflected by how high the score
Figure FDA0004244730450000049
is, and acScore(s_i) and
Figure FDA00042447304500000410
are combined to calculate the final anomaly weight AS(s_i) of graph node s_i:
Figure FDA00042447304500000411
6. The method for positioning micro service failure based on self-encoder and service dependency graph according to claim 5, wherein in step 5:
the anomaly weight of each microservice is calculated in turn to complete the update of the anomaly weights of all graph nodes in the service call relation graph; a weighted PageRank algorithm performs a random walk on the service call relation graph G(S, Z, F); the node transition probability matrix U of the service call relation graph is defined first:
Figure FDA00042447304500000412
where u_ij denotes the probability of a random walk from graph node s_j to graph node s_i; because the anomaly weight of a graph node is related to the walk probability, the probability u_ij of walking from s_j to s_i is calculated as:
Figure FDA00042447304500000413
where s_j → s_i indicates that a directed edge from graph node s_j to graph node s_i exists, and linkut(s_j) denotes the sum of the anomaly weights of all graph nodes that s_j points to; for any graph node s_i ∈ S, i ∈ {1, 2, …, n}, the PR score is initialized to PR_0(s_i) = 1/n, and the PR scores of all graph nodes are expressed as the vector R_0:
R_0 = (PR_0(s_1), PR_0(s_2), …, PR_0(s_n))^T    (11)
during each round of the random walk, the PR score of each graph node is iteratively updated:
R_c = βU·R_{c-1} + (1 − β)R_0    (12)
where R_c denotes the PR score vector of all graph nodes after iteration round c, U ∈ R^{n×n} denotes the random walk probability matrix, and β ∈ (0, 1) denotes the damping coefficient, typically β = 0.85; after iterative updating, the PR scores of the graph nodes converge, and the higher the PR score of a graph node, the greater the probability that the corresponding microservice is the root cause of the fault; finally, a ranked list of root cause microservices is output in descending order of graph node PR score.
7. The device for using the micro-service fault location method based on the self-encoder and the service dependency graph according to any one of claims 1-6 comprises a data collection module, an anomaly detection module and a fault location module, wherein the data collection module is responsible for collecting a response time index of a call request between a physical resource utilization index of a server host and a micro-service in the running process of a micro-service system; the anomaly detection module trains a self-encoder model by calling the monitoring sequence of the response time as training data, reconstructs the response time, and judges whether the micro-service system is abnormal or not by calculating the reconstruction error of the response time index data; the fault locating module generates a corresponding node from each micro service map, analyzes the communication data between each micro service to capture the call relation between the micro services, constructs a directed edge between the nodes through the call relation between the nodes, constructs a service call relation diagram by taking the response time index data reconstruction error as a node abnormal weight value, associates the response time index data reconstruction error with the physical resource utilization index data of the service host, calculates the abnormal weight of each graph node in the service call relation diagram, and deduces and locates the abnormal fault micro service caused by the abnormality based on the update of the abnormal weight of each graph node in the service call relation diagram by using a weighted PageRank algorithm.
8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method steps of any of claims 1-6.
9. An electronic device comprising a processor and a memory storing a computer program which, when executed by the processor, implements the method steps of any of claims 1-6.
CN202210958306.2A 2022-08-09 2022-08-09 Micro-service fault positioning method and device based on self-encoder and service dependency graph Active CN115348159B (en)

Priority Applications (1)

Application Number Priority Date Filing Date Title
CN202210958306.2A CN115348159B (en) 2022-08-09 2022-08-09 Micro-service fault positioning method and device based on self-encoder and service dependency graph

Applications Claiming Priority (1)

Application Number Priority Date Filing Date Title
CN202210958306.2A CN115348159B (en) 2022-08-09 2022-08-09 Micro-service fault positioning method and device based on self-encoder and service dependency graph

Publications (2)

Publication Number Publication Date
CN115348159A CN115348159A (en) 2022-11-15
CN115348159B true CN115348159B (en) 2023-06-27

Family

ID=83952520

Family Applications (1)

Application Number Title Priority Date Filing Date
CN202210958306.2A Active CN115348159B (en) 2022-08-09 2022-08-09 Micro-service fault positioning method and device based on self-encoder and service dependency graph

Country Status (1)

Country Link
CN (1) CN115348159B (en)

Families Citing this family (18)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11994975B2 (en) * 2022-04-15 2024-05-28 Dell Products L.P. Method and system for performing root cause analysis associated with service impairments in a distributed multi-tiered computing environment
US12353923B2 (en) 2022-04-15 2025-07-08 Dell Products L.P. Method and system for managing a distributed multi-tiered computing environment based on load predictions
US12265845B2 (en) 2022-04-15 2025-04-01 Dell Products L.P. Method and system for provisioning an application in a distributed multi-tiered computing environment using case based reasoning
US12327144B2 (en) 2022-04-15 2025-06-10 Dell Products L.P. Method and system for managing resource buffers in a distributed multi-tiered computing environment
US11953978B2 (en) 2022-04-15 2024-04-09 Dell Products L.P. Method and system for performing service remediation in a distributed multi-tiered computing environment
CN115883409A (en) * 2022-11-29 2023-03-31 中国科学院信息工程研究所 Multi-type index fusion anomaly detection method and device in micro-service cloud environment
CN116074181A (en) * 2022-12-23 2023-05-05 北京邮电大学 Service fault root cause positioning method and device based on graph reasoning under influence of protection mechanism
CN116610906B (en) * 2023-04-11 2024-05-14 深圳润世华软件和信息技术服务有限公司 Equipment fault diagnosis method and device, computer equipment and storage medium thereof
CN116662060B (en) * 2023-07-31 2024-02-06 深圳市创银科技股份有限公司 Data processing method and system of sensor signal acquisition processing system
CN116775364B (en) * 2023-08-16 2023-12-05 中国电子信息产业集团有限公司第六研究所 Application service health management method and device, electronic equipment and storage medium
CN117354362A (en) * 2023-09-27 2024-01-05 曙光智算信息技术有限公司 Calling service exception handling methods, devices, equipment, media and products
CN117170304B (en) * 2023-11-03 2024-01-05 傲拓科技股份有限公司 PLC remote monitoring control method and system based on industrial Internet of things
CN117909313B (en) * 2024-03-19 2024-05-14 成都融见软件科技有限公司 Distributed storage method for design code data, electronic equipment and medium
CN118520405B (en) * 2024-07-22 2024-10-01 天选智能科技(江苏)有限公司 Cloud data platform comprehensive service management system and method based on artificial intelligence
CN119087934A (en) * 2024-08-29 2024-12-06 南京海鲸药业股份有限公司 A method and device for monitoring the collection and treatment of inorganic salts based on digital twin technology
CN118869812B (en) * 2024-09-24 2024-12-13 广州尚航信息科技股份有限公司 Method and system for monitoring abnormal operation of micro-service architecture
CN119473691A (en) * 2024-11-27 2025-02-18 重庆邮电大学 An unsupervised microservice system fault location method based on multimodal data
CN119862092A (en) * 2025-03-24 2025-04-22 北京能科瑞元数字技术有限公司 Industrial data micro-service abnormality monitoring method, medium and system

Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
CN112698975A (en) * 2020-12-14 2021-04-23 北京大学 Fault root cause positioning method and system of micro-service architecture information system
CN113014421A (en) * 2021-02-08 2021-06-22 武汉大学 Micro-service root cause positioning method for cloud native system
EP3951598A1 (en) * 2020-08-07 2022-02-09 NEC Laboratories Europe GmbH Methods and systems for detecting anomalies in cloud services based on mining time-evolving graphs

Family Cites Families (1)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
US11663503B2 (en) * 2019-09-05 2023-05-30 International Business Machines Corporation Enhanced bottleneck analysis at an early stage in a microservice system

Patent Citations (3)

* Cited by examiner, † Cited by third party
Publication number Priority date Publication date Assignee Title
EP3951598A1 (en) * 2020-08-07 2022-02-09 NEC Laboratories Europe GmbH Methods and systems for detecting anomalies in cloud services based on mining time-evolving graphs
CN112698975A (en) * 2020-12-14 2021-04-23 北京大学 Fault root cause positioning method and system of micro-service architecture information system
CN113014421A (en) * 2021-02-08 2021-06-22 武汉大学 Micro-service root cause positioning method for cloud native system

Also Published As

Publication number Publication date
CN115348159A (en) 2022-11-15

Similar Documents

Publication Publication Date Title
CN115348159B (en) Micro-service fault positioning method and device based on self-encoder and service dependency graph
Nair et al. Using bad learners to find good configurations
Las-Casas et al. Sifter: Scalable sampling for distributed traces, without feature engineering
US8181132B2 (en) Validating one or more circuits using one or more grids
US20120005532A1 (en) Method and apparatus for determining ranked causal paths for faults in a complex multi-host system with probabilistic inference in a time series
US8069370B1 (en) Fault identification of multi-host complex systems with timesliding window analysis in a time series
US8627150B2 (en) System and method for using dependency in a dynamic model to relate performance problems in a complex middleware environment
US20120005533A1 (en) Methods And Apparatus For Cross-Host Diagnosis Of Complex Multi-Host Systems In A Time Series With Probablistic Inference
US10360140B2 (en) Production sampling for determining code coverage
CN114201326B (en) A microservice anomaly diagnosis method based on attribute relationship graph
BR112015019167B1 (en) Method performed by a computer processor and system
CN105283848A (en) Application Tracing with Distributed Objects
US8250408B1 (en) System diagnosis
Samir et al. Detecting and predicting anomalies for edge cluster environments using hidden Markov models
WO2005026961A1 (en) Methods and systems for model-based management using abstract models
Pham et al. Baro: Robust root cause analysis for microservices via multivariate bayesian online change point detection
Samir et al. Self-adaptive healing for containerized cluster architectures with hidden Markov models
CN119420629A (en) A method for locating the root cause of microservice failures based on graph convolutional neural networks
Jiang et al. L4: Diagnosing large-scale llm training failures via automated log analysis
Hardt et al. The PetShop Dataset--Finding Causes of Performance Issues across Microservices
CN117520040B (en) Micro-service fault root cause determining method, electronic equipment and storage medium
Mitra et al. Dealing with the unknown: Resilience to prediction errors
US20240330096A1 (en) Root Cause Identification in Hybrid Applications via Probing
JP7744727B2 (en) A self-optimizing analysis system for core dumps
US11803433B2 (en) Localization of potential issues to objects

Legal Events

Date Code Title Description
PB01 Publication
SE01 Entry into force of request for substantive examination
GR01 Patent grant