CN115348159B

CN115348159B - Micro-service fault positioning method and device based on self-encoder and service dependency graph

Info

Publication number: CN115348159B
Application number: CN202210958306.2A
Authority: CN
Inventors: 常雨竹; 刘月灿; 孙建刚; 李伟良; 李静; 羊麟威; 高颖; 杨庆甫; 李明; 宫帅; 尹晓宇; 程航; 董小菱; 饶涵宇; 毛冬; 张辰
Original assignee: Nanjing University of Aeronautics and Astronautics; State Grid Information and Telecommunication Group Co Ltd; Information and Telecommunication Branch of State Grid Zhejiang Electric Power Co Ltd; Information and Telecommunication Branch of State Grid Anhui Electric Power Co Ltd; State Grid Corp of China SGCC
Current assignee: Nanjing University of Aeronautics and Astronautics; State Grid Information and Telecommunication Group Co Ltd; Information and Telecommunication Branch of State Grid Zhejiang Electric Power Co Ltd; Information and Telecommunication Branch of State Grid Anhui Electric Power Co Ltd; State Grid Corp of China SGCC
Priority date: 2022-08-09
Filing date: 2022-08-09
Publication date: 2023-06-27
Anticipated expiration: 2042-08-09
Also published as: CN115348159A

Abstract

The invention discloses a microservice fault location method and device based on an autoencoder and a service dependency graph. A service call relationship graph describes the fault propagation path; the running status of each microservice is associated with system resource utilization to calculate the anomaly weight of each node in the service call graph; and the faulty microservice that causes the anomaly is inferred and located by an improved weighted PageRank algorithm. The method removes the need, present in existing microservice fault location methods, to manually set thresholds for the various monitoring indexes used in anomaly diagnosis, and improves the accuracy of fault location.

Description

Micro-service fault positioning method and device based on self-encoder and service dependency graph

Technical Field

The invention belongs to the technical field of computer software fault location analysis, and particularly relates to a micro-service fault automatic location method and device based on a self-encoder and a service dependency graph.

Background

With the advent of computing paradigms such as cloud computing and mobile computing, the microservice architecture has become the prevailing trend in the design, development and delivery of software services. Each module is implemented and operated as a small, independent system that exposes its internal logic and data through a well-defined interface, which gives application software high flexibility, good scalability and a large degree of autonomy; as a result, more and more internet enterprises adopt the microservice architecture to develop and deploy distributed application software. While the microservice architecture makes development more convenient, its complex dependencies and frequent delivery and deployment expose the system to more potential faults, and unexpected faults such as concurrent asynchronous errors or running-resource shortage errors can occur at any time. Because microservices are designed independently and call one another flexibly, when one microservice module fails, the modules that depend on it through calls also fail, causing large-scale cascading failures. To ensure the reliable operation and service quality of microservices, developers must repair system faults quickly. When a microservice system suffers a cascading failure, accurately locating the root cause of the fault is therefore particularly critical.

However, locating faults in a microservice architecture faces the following challenges: 1) Complex dependencies. In conventional architectures, system faults are typically diagnosed by examining operation logs and analyzing system performance with performance monitoring tools. In a microservice architecture, however, the number of microservices often reaches hundreds or thousands, usually distributed over multiple service hosts. Calls and dependencies among services are complex and change dynamically, so the performance degradation of one service may spread widely and cause anomalies in many services, and operation logs and monitoring tools alone can hardly meet the needs of diagnosis and troubleshooting. 2) A large number of monitoring indexes. Communication and invocation among large-scale services generate a large number of metrics. Analyzing these metrics to determine, for each one, the threshold at which a microservice is abnormal is very time-consuming, and judging from such thresholds whether a microservice is abnormal is often inaccurate. 3) Frequent microservice updates. To meet user demands, microservice modules must be updated frequently. During an update, old modules are replaced by new services and the dependencies between services change accordingly, forming a dynamic system architecture that makes automatic fault location even more difficult.

Disclosure of Invention

To address the defects of the prior art, the invention provides a microservice fault location method and device based on an autoencoder and a service dependency graph. An autoencoder model learns the fluctuation characteristics of the monitoring indexes during normal operation of a microservice and uses them to judge the running condition of the microservice, which solves the problem that existing microservice fault location methods require thresholds for the various monitoring indexes to be set manually for anomaly diagnosis. The utilization monitoring indexes of the various resources on the server host are then combined to set the weights of the nodes in the service dependency graph, thereby improving the accuracy of automatically locating the faulty microservice.

In order to achieve the technical purpose, the invention adopts the following technical scheme:

A microservice fault location method based on an autoencoder and a service dependency graph comprises the following steps:

Step 1: collecting, during the running of the microservice system, the physical resource utilization indexes of the server hosts and the response time indexes of call requests between microservices;

Step 2: using the response time monitoring sequences of the call requests as training data to train an autoencoder model, reconstructing the response time, and judging whether the microservice system is abnormal by calculating the reconstruction error of the response time index data;

Step 3: mapping each microservice to a corresponding node, capturing the call relationships between microservices by parsing the communication data between them, constructing directed edges between nodes according to those call relationships, and constructing a service call relationship graph that takes the reconstruction error of the response time index data as the node anomaly weight;

Step 4: correlating the reconstruction error of the response time index data with the physical resource utilization index data of the service host, and calculating the anomaly weight of each graph node in the service call relationship graph;

Step 5: based on the anomaly weights of the graph nodes in the service call relationship graph as updated in step 4, using a weighted PageRank algorithm to infer and locate the faulty microservice that causes the anomaly.

Wherein:

In step 1, the physical resource utilization indexes are indexes of the resource usage at the level of the physical machine or virtual machine that runs the microservice instance, the response time indexes are indexes of the time a microservice in the microservice system takes to respond to requests from other microservices, and both kinds of index are monitored and collected in real time through the Prometheus tool.

In step 2:

The response time monitoring sequence of microservice $v_i \in V$ collected at time $t$ over a time window $w$, denoted $X_t^{v_i}$, is used as the input to the trained autoencoder, where $V$ represents the set of microservices and $X_t^{v_i}$ is a vector of dimension $h$. The encoder maps $X_t^{v_i}$ to a $d$-dimensional latent feature representation $Z_t^{v_i}$:

$$Z_t^{v_i} = g\left(W X_t^{v_i} + b\right) \quad (1)$$

where $g$ is the activation function, $h$ is the number of response time samples collected within the time window $w$, $d$ is the dimension of the latent feature representation, $W$ is the weight matrix of $h$ rows and $d$ columns between the input layer and the hidden layer, and $b$ is the $h$-dimensional bias vector of the input layer. The decoder reconstructs the latent feature representation $Z_t^{v_i}$ as the response time monitoring sequence $\hat{X}_t^{v_i}$ of microservice $v_i$:

$$\hat{X}_t^{v_i} = g\left(W' Z_t^{v_i} + c\right) \quad (2)$$

where $W'$ is the weight matrix of $d$ rows and $h$ columns between the hidden layer and the output layer, and $c$ is the $d$-dimensional bias vector of the hidden layer. The reconstruction mean square error $E_t^{v_i}$ between $X_t^{v_i}$ and $\hat{X}_t^{v_i}$ is calculated as:

$$E_t^{v_i} = \frac{1}{h} \left\lVert X_t^{v_i} - \hat{X}_t^{v_i} \right\rVert_2^2 \quad (3)$$

In the autoencoder training phase, the response time monitoring data of microservice $v_i$ collected during normal operation of the microservice system are used as training data to train the autoencoder model. After multiple rounds of training, the converged autoencoder model has learned the characteristics of normal response time series data, so its reconstruction of the response time monitoring values of $v_i$ in normal operation is close to the monitored values, and the corresponding reconstruction error is small and fluctuates within a stable range. The mean $\mu$ and standard deviation $\sigma$ of the reconstruction error at this point are calculated, and the anomaly detection threshold of microservice $v_i$ is determined as $\alpha^i = \mu + 3\sigma$. During real-time detection of the running state of microservice $v_i$, if the reconstruction error satisfies

$$E_t^{v_i} > \alpha^i$$

then microservice $v_i$ is considered abnormal.
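As a non-limiting illustration, the threshold rule of step 2 can be sketched in Python. The function names `anomaly_threshold` and `is_abnormal` and the sample error values are hypothetical and do not come from the invention; the sketch also assumes the population standard deviation and a strict "greater than" comparison, neither of which the text fixes.

```python
import numpy as np

def anomaly_threshold(normal_errors):
    """Return mu + 3*sigma of reconstruction errors seen during normal operation.

    ASSUMPTION: sigma is the population standard deviation (ddof=0).
    """
    errors = np.asarray(normal_errors, dtype=float)
    return float(errors.mean() + 3.0 * errors.std())

def is_abnormal(error, threshold):
    """Flag a microservice whose current reconstruction error exceeds the threshold."""
    return error > threshold

# Hypothetical reconstruction errors recorded while the microservice ran normally.
normal_errors = [0.010, 0.012, 0.011, 0.009, 0.013, 0.010, 0.011, 0.012]
alpha = anomaly_threshold(normal_errors)

print(is_abnormal(0.011, alpha))  # an error inside the normal range
print(is_abnormal(0.050, alpha))  # an error far outside the normal range
```

Because the threshold is derived from each microservice's own error distribution, no per-index threshold has to be chosen by hand.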

The specific flow of step 3 is as follows:

Step 3-1: The set of microservices in the microservice system is denoted $V = \{v_1, v_2, \dots, v_n\}$, where $n$ represents the number of microservices. Each $v_i \in V$ is mapped to a graph node $s_i$, finally giving the graph node set $S = \{s_1, s_2, \dots, s_n\}$;

Step 3-2: The call relationships between microservices are captured by parsing the communication data between them. If microservice $v_i$ sends a service request to microservice $v_j$, a directed edge $z_{ij}$ from graph node $s_i$ to graph node $s_j$ is constructed, finally forming the edge set $Z = \{z_{ij}\}$. Identical service requests produce only one directed edge. This yields a service call relationship graph without anomaly weights;

Step 3-3: The reconstruction error of the response time monitoring index of microservice $v_i$ is used as the initial anomaly weight $f_{s_i}$ of graph node $s_i$. The initial anomaly weight of every microservice is calculated by traversal to obtain the graph node anomaly weight set $F = \{f_{s_1}, f_{s_2}, \dots, f_{s_n}\}$. Each anomaly weight $f_{s_i}$ in $F$ is assigned to graph node $s_i$, finally giving the service call relationship graph $G(S, Z, F)$.

In step 4, based on the service call relationship graph $G(S, Z, F)$, the anomaly weight of each graph node is automatically updated according to the anomaly weight relationships between adjacent graph nodes in the graph. For any graph node $s_j \in S$, $j \in \{1, 2, \dots, n\}$, the adjacent graph nodes that have a directed edge pointing to $s_j$ form the set $AN(s_j)$, and the adjacent graph nodes that have a directed edge pointing to any graph node in $AN(s_j)$ form the set $NAN(s_j)$.

The average anomaly weight $aScore(s_j)$ of $AN(s_j)$ is calculated as:

$$aScore(s_j) = \frac{\sum_{s_i \in AN(s_j)} f_{s_i}}{inDegree(s_j)} \quad (4)$$

where $f_{s_i}$ represents the anomaly weight of graph node $s_i$ and $inDegree(s_j)$ represents the in-degree of graph node $s_j$. The average anomaly weight $cScore(s_j)$ of $NAN(s_j)$ is calculated as:

$$cScore(s_j) = \frac{\sum_{s_i \in NAN(s_j)} f_{s_i}}{\lvert NAN(s_j) \rvert} \quad (5)$$

where $aScore(s_j)$ reflects the overall degree of abnormality of $AN(s_j)$ and $cScore(s_j)$ represents the overall degree of abnormality of $NAN(s_j)$. Combining the characteristics of $aScore(s_j)$ and $cScore(s_j)$, the anomaly weight $acScore(s_j)$ of graph node $s_j$ is calculated as:

$$acScore(s_j) = aScore(s_j) - cScore(s_j) \quad (6)$$

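The higher the value of $acScore(s_j)$, the higher the overall degree of abnormality of the graph nodes adjacent to $s_j$ and the lower the overall degree of abnormality of the graph nodes adjacent to $AN(s_j)$, and therefore the higher the probability that the microservice $v_j$ corresponding to graph node $s_j$ is the root cause of the fault. The Pearson correlation function is used to measure the correlation between the response time reconstruction error sequence of microservice $v_i$, collected at time $t$ over a time window, and the sequence data of each physical resource monitoring index on the host where $v_i$ is deployed:

$$corr\left(E_t^{v_i}, M_t^r\right) = \frac{\sum_{e \in w} \left(\varepsilon_e^{v_i} - \bar{\varepsilon}^{v_i}\right)\left(m_e^r - \bar{m}^r\right)}{\sqrt{\sum_{e \in w} \left(\varepsilon_e^{v_i} - \bar{\varepsilon}^{v_i}\right)^2}\,\sqrt{\sum_{e \in w} \left(m_e^r - \bar{m}^r\right)^2}} \quad (7)$$

where $M_t^r$ represents the sequence data of the $r$-th physical resource monitoring index collected at time $t$ over a time window $w$, $m_e^r$ is its value at time $e$, and $\bar{m}^r$ represents the mean of $M_t^r$; $r \in \{1, 2, \dots, k\}$, where $k$ represents the number of physical resources; $\hat{x}_e^{v_i}$ represents the reconstruction of the response time monitoring value of microservice $v_i$ at time $e$, $x_e^{v_i}$ represents the response time monitoring value of $v_i$ at time $e$, $\varepsilon_e^{v_i}$ is the reconstruction error at time $e$ computed from these two values, and $\bar{\varepsilon}^{v_i}$ represents the mean of the reconstruction error. The extent to which the abnormality of microservice $v_i$ is associated with the host resources is reflected by the magnitude of this correlation score, and $acScore(s_i)$ is combined with the correlation score to calculate the final anomaly weight $AS(s_i)$ of graph node $s_i$ (equation (8)).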

In step 5:

The anomaly weight of each microservice is calculated by traversal to complete the update of the anomaly weights of all graph nodes in the service call relationship graph, and a weighted PageRank algorithm then performs a random walk on the service call relationship graph $G(S, Z, F)$. The node transition probability matrix $U$ of the service call relationship graph is defined first:

$$U = \begin{pmatrix} u_{11} & \cdots & u_{1n} \\ \vdots & \ddots & \vdots \\ u_{n1} & \cdots & u_{nn} \end{pmatrix} \quad (9)$$

where $u_{ij}$ represents the probability of randomly walking from graph node $s_j$ to graph node $s_i$. The walk probability is tied to the anomaly weights of the graph nodes, and $u_{ij}$ is calculated as:

$$u_{ij} = \begin{cases} \dfrac{AS(s_i)}{linkOut(s_j)}, & s_j \to s_i \\ 0, & \text{otherwise} \end{cases} \quad (10)$$

where $s_j \to s_i$ indicates that a directed edge from graph node $s_j$ to graph node $s_i$ exists, and $linkOut(s_j)$ represents the sum of the anomaly weights of all graph nodes that $s_j$ points to. For any graph node $s_i \in S$, $i \in \{1, 2, \dots, n\}$, the PR score is initialized to $PR_0(s_i) = 1/n$, and the PR scores of all graph nodes are expressed as the vector $R_0$:

$$R_0 = \left(PR_0(s_1), PR_0(s_2), \dots, PR_0(s_n)\right)^T \quad (11)$$

During each round of the random walk, the PR score of each graph node is iteratively updated:

$$R_c = \beta U \cdot R_{c-1} + (1 - \beta) R_0 \quad (12)$$

where $R_c$ represents the PR score vector of all graph nodes after iteration round $c$, $U \in \mathbb{R}^{n \times n}$ represents the random walk probability matrix, and $\beta \in (0, 1)$ represents the damping coefficient, typically $\beta = 0.85$. After iterative updating, the PR score of each graph node tends to converge. The higher the PR score of a graph node, the greater the probability that the corresponding microservice is the root cause of the fault. Finally, a ranked list of fault root cause microservices is output in descending order of the PR scores of the graph nodes.

The invention also provides a device that uses the above microservice fault location method based on an autoencoder and a service dependency graph. The device comprises a data collection module, an anomaly detection module and a fault location module. The data collection module collects, during the running of the microservice system, the physical resource utilization indexes of the server hosts and the response time indexes of call requests between microservices. The anomaly detection module trains an autoencoder model using the response time monitoring sequences of the call requests as training data, reconstructs the response time, and judges whether the microservice system is abnormal by calculating the reconstruction error of the response time index data. The fault location module maps each microservice to a corresponding node, captures the call relationships between microservices by parsing the communication data between them, constructs directed edges between nodes according to those call relationships, and constructs a service call relationship graph that takes the reconstruction error of the response time index data as the node anomaly weight; it then correlates the reconstruction error of the response time index data with the physical resource utilization index data of the service host, calculates the anomaly weight of each graph node in the service call relationship graph, and, based on the updated anomaly weights, uses a weighted PageRank algorithm to infer and locate the faulty microservice that causes the anomaly.

The invention also provides a computer readable storage medium storing a computer program which, when executed by a processor, implements the above-described self-encoder and service dependency graph based micro-service fault locating method steps.

The invention also provides an electronic device comprising a processor and a memory, wherein the memory stores a computer program which, when executed by the processor, realizes the steps of the micro-service fault positioning method based on the self-encoder and the service dependency graph.

The invention also provides a computer program product characterized by comprising a computer program/instruction which, when executed by a processor, implements the above-mentioned microservice fault location method steps based on self-encoders and service dependency graphs.

Compared with the prior art, the invention has the advantages that:

1. By training the autoencoder model, the invention learns the relationship between the running state of a microservice and the fluctuation of its response time, so that the running state is detected in real time without manually setting thresholds for the various monitoring indexes, as existing microservice fault location methods require for anomaly diagnosis. The call relationships among microservices are captured by parsing the communication data among them, and a service call relationship graph is constructed to model the fault propagation path. The anomaly weights of the nodes in the graph are updated by combining the autoencoder's reconstruction error of the microservice response time with the system resource utilization, so that faulty microservices are located automatically by a weighted PageRank algorithm and fault location accuracy is improved.

2. The invention takes monitoring index data collected in the normal running state of a microservice as the input of the autoencoder, which encodes and reconstructs it. Compared with traditional anomaly detection methods, this captures the hidden characteristics of the index data in the normal running state, thereby improving the accuracy of real-time anomaly detection.

3. The invention uses the autoencoder's reconstruction error of the microservice monitoring index as the anomaly weight of the corresponding node in the service call relationship graph. Because the reconstruction error reflects how far the real-time monitoring index deviates from the normal monitoring index, the reconstruction error (that is, the anomaly weight) reflects the degree of abnormality of the microservice.

4. The invention measures the likelihood that a microservice is the root cause of a fault by introducing the aScore, cScore and acScore scores. $aScore(s_j)$ reflects the overall degree of abnormality of $AN(s_j)$, and $cScore(s_j)$ represents the overall degree of abnormality of $NAN(s_j)$. $acScore(s_j)$ integrates the characteristics of node $s_j$ itself and its surrounding nodes: a higher $acScore(s_j)$ indicates that the nodes adjacent to $s_j$ have a high overall degree of abnormality while the nodes adjacent to $AN(s_j)$ have a low overall degree of abnormality, so the microservice $v_j$ corresponding to node $s_j$ is more likely to be the root cause of the fault.

5. The invention updates the anomaly weight of each node by using the Pearson correlation function to calculate the correlation between the response time monitoring index and the sequence data of each physical resource monitoring index. Resource utilization thus becomes part of the anomaly weight calculation, which strengthens the link between locating the root cause of a microservice fault and resource utilization.

6. By improving the PageRank algorithm on the basis of the anomaly weights of connected nodes in the service call relationship graph, the invention designs a weighted PageRank algorithm in which the walk strategy of the random walk depends on the anomaly weights of each node and its connected nodes, so that nodes with higher anomaly weights are visited more frequently.

Drawings

FIG. 1 is a general framework diagram of a method for automatically locating micro-service faults based on a self-encoder and a service dependency graph according to an embodiment of the present invention;

FIG. 2 is a flow chart of a method for automatically locating micro-service faults based on a self-encoder and a service dependency graph according to an embodiment of the present invention;

FIG. 3 shows comparative experimental results for fault location accuracy provided by an embodiment of the present invention;

FIG. 4 shows comparative experimental results for average fault location accuracy provided by an embodiment of the present invention;

FIG. 5 is a schematic diagram of a micro-service fault locating device according to an embodiment of the present invention;

FIG. 6 is another schematic structural diagram of a micro-service fault locating device according to an embodiment of the present invention.

Detailed Description

Embodiments of the present invention are described in further detail below with reference to the accompanying drawings.

The invention provides a method and a device for automatically positioning micro-service faults based on a self-encoder and a service dependency graph, wherein the overall framework of the method and the device is shown in figure 1 and mainly comprises 3 modules. The data collection module is responsible for collecting application program indexes and system level index data, wherein the application program indexes are used for detecting application performance problems, and the system level indexes are used for updating node weights in a subsequent service call relation graph. The anomaly detection module is used for carrying out coding reconstruction on the application index data through the self-encoder and detecting whether the micro-service module is abnormal or not. Once the system abnormality is detected, the fault locating module constructs a service call relation diagram by analyzing call relations among the micro services to analyze an abnormal propagation path, calculates the abnormal weight of each micro service node in the service call relation diagram by utilizing the correlation of the system resource utilization rate and the micro service performance, and finally locates the fault root cause micro service module by utilizing a weighted PageRank algorithm.

Referring to fig. 2, fig. 2 is a flowchart of a method for automatically locating a micro service fault based on a self-encoder and a service dependency graph according to an embodiment of the present invention, and specifically, the method includes:

Step 1: The physical resource utilization indexes and the response time indexes of calls between microservices are monitored and collected in real time through the Prometheus tool while the microservice system is running. A physical resource utilization index reflects the resource usage at the level of the physical machine or virtual machine that runs the microservice instance, such as CPU utilization and memory utilization; a response time index reflects the length of time a microservice in the microservice system takes to respond to requests from other microservices.

The invention expresses the $k$ physical resource monitoring indexes as $M = \{m^1, m^2, \dots, m^k\}$, and the continuously monitored values of any monitoring index $m^r$ are expressed as time series data $\{m_t^r\}$, where $m^r \in M$, $r \in \{1, 2, \dots, k\}$. To model how the monitored values of index $m^r$ change, a sliding window $w$ of time length $h$ intercepts, at the monitoring index collection time $t$, the monitoring sequence data of index $m^r$, expressed as $M_t^r = \left(m_{t-h+1}^r, \dots, m_{t-1}^r, m_t^r\right)$, where $m_t^r$ represents the monitoring value of index $m^r$ at time $t$. The monitoring sequence data collected for the $k$ physical resource monitoring indexes at time $t$ over the time window $w$ form the monitoring index data matrix $M_t = \left(M_t^1, M_t^2, \dots, M_t^k\right)^T$.

In a microservice system $V = \{v_1, v_2, \dots, v_n\}$ having $n$ microservice modules, for any microservice $v_i \in V$, the response time monitoring sequence data collected at time $t$ over the time window $w$ are expressed as $X_t^{v_i} = \left(x_{t-h+1}^{v_i}, \dots, x_{t-1}^{v_i}, x_t^{v_i}\right)$, where $x_t^{v_i}$ represents the monitored response time of microservice $v_i$ at time $t$. The response time monitoring sequence data collected for the $n$ microservices at time $t$ over the time window $w$ form the matrix $X_t = \left(X_t^{v_1}, X_t^{v_2}, \dots, X_t^{v_n}\right)^T$.
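As a non-limiting illustration, the sliding-window interception described in step 1 can be sketched in Python. The helper names `window_at` and `window_matrix` and the sample CPU and memory values are hypothetical; the sketch also assumes that the window holds the `h` most recent samples ending at time `t`, with `t` given as a zero-based sample index.

```python
import numpy as np

def window_at(series, t, h):
    """Return the h most recent samples of `series` ending at index t (inclusive)."""
    if t - h + 1 < 0:
        raise ValueError("not enough history for a full window")
    return np.asarray(series[t - h + 1 : t + 1], dtype=float)

def window_matrix(all_series, t, h):
    """Stack one window per index (or per microservice) into a matrix with h columns."""
    return np.vstack([window_at(s, t, h) for s in all_series])

# Hypothetical utilization samples for k = 2 physical resource indexes.
cpu = [0.10, 0.20, 0.30, 0.40, 0.50]
mem = [0.50, 0.50, 0.60, 0.60, 0.70]

# Monitoring index data matrix at t = 4 with window length h = 3: one row per index.
m_t = window_matrix([cpu, mem], t=4, h=3)
print(m_t.shape)
```

The same helper applies unchanged to response time series, with one row per microservice instead of one row per resource index.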

Step 2: The response time monitoring sequence $X_t^{v_i}$ of microservice $v_i$, collected at time $t$ over the time window $w$, is used as the input to the trained autoencoder. The encoding layer maps $X_t^{v_i}$ to a $d$-dimensional latent feature representation $Z_t^{v_i}$:

$$Z_t^{v_i} = g\left(W X_t^{v_i} + b\right) \quad (13)$$

where $g$ is the activation function, $W$ is the weight matrix of $h$ rows and $d$ columns between the input layer and the hidden layer, and $b$ is the $h$-dimensional bias vector of the input layer. The decoder then reconstructs the latent feature representation $Z_t^{v_i}$ as the response time monitoring sequence $\hat{X}_t^{v_i}$ of microservice module $v_i$:

$$\hat{X}_t^{v_i} = g\left(W' Z_t^{v_i} + c\right) \quad (14)$$

where $W'$ is the weight matrix of $d$ rows and $h$ columns between the hidden layer and the output layer, and $c$ is the $d$-dimensional bias vector of the hidden layer. The reconstruction mean square error $E_t^{v_i}$ between $X_t^{v_i}$ and $\hat{X}_t^{v_i}$ is then calculated:

$$E_t^{v_i} = \frac{1}{h} \left\lVert X_t^{v_i} - \hat{X}_t^{v_i} \right\rVert_2^2 \quad (15)$$

In the autoencoder training phase, the response time monitoring data of microservice $v_i$ collected during normal operation of the microservice system are used as training data to train the autoencoder model. After multiple rounds of training, the converged autoencoder model has learned the characteristics of normal response time series data. Its reconstruction of the response time monitoring values of $v_i$ in normal operation is therefore close to the monitored values, and the corresponding reconstruction error is small and fluctuates within a stable range. By calculating the mean $\mu$ and standard deviation $\sigma$ of the reconstruction error at this point, the anomaly detection threshold of microservice module $v_i$ is determined as $\alpha^i = \mu + 3\sigma$. During real-time detection of the running state of microservice $v_i$, if the reconstruction error is found to satisfy

$$E_t^{v_i} > \alpha^i$$

then microservice $v_i$ is considered abnormal.
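As a non-limiting illustration, step 2 can be sketched with a single-hidden-layer autoencoder written in plain NumPy. Everything specific here is an assumption for the sake of a runnable example and is not taken from the invention: the `tanh` encoder activation, the linear decoder, the bias shapes (one bias per hidden unit and one per output), the learning rate, the synthetic response time windows, and the helper names `train_autoencoder` and `reconstruction_error`.

```python
import numpy as np

def train_autoencoder(x, d, epochs=300, lr=0.5, seed=0):
    """Fit a tanh encoder and linear decoder to rows of x by gradient descent."""
    rng = np.random.default_rng(seed)
    n, h = x.shape
    p = {"W1": 0.1 * rng.standard_normal((h, d)), "b1": np.zeros(d),
         "W2": 0.1 * rng.standard_normal((d, h)), "b2": np.zeros(h)}
    history = []
    for _ in range(epochs):
        z = np.tanh(x @ p["W1"] + p["b1"])        # latent representation
        diff = z @ p["W2"] + p["b2"] - x          # reconstruction minus input
        history.append(float(np.mean(diff ** 2)))
        g_out = 2.0 * diff / (n * h)              # gradient of the mean square error
        g_pre = (g_out @ p["W2"].T) * (1.0 - z ** 2)
        grads = {"W2": z.T @ g_out, "b2": g_out.sum(axis=0),
                 "W1": x.T @ g_pre, "b1": g_pre.sum(axis=0)}
        for name in p:
            p[name] -= lr * grads[name]
    return p, history

def reconstruction_error(p, x):
    """Mean square reconstruction error of one window or of each row of windows."""
    z = np.tanh(x @ p["W1"] + p["b1"])
    return np.mean((x - (z @ p["W2"] + p["b2"])) ** 2, axis=-1)

# Synthetic "normal operation" windows of h = 8 response times (seconds).
rng = np.random.default_rng(1)
h = 8
base = 0.1 + 0.02 * np.sin(np.linspace(0.0, np.pi, h))
normal = base + 0.002 * rng.standard_normal((64, h))

params, history = train_autoencoder(normal, d=3)
errors = reconstruction_error(params, normal)
alpha = errors.mean() + 3.0 * errors.std()        # threshold mu + 3*sigma

slow_window = 10.0 * base                          # a tenfold slowdown
print(reconstruction_error(params, slow_window) > alpha)
```

Because the model only ever sees normal windows, a window that departs strongly from the learned pattern cannot be reconstructed well, and its error rises above the threshold derived from the training errors.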

Step 3: Each microservice is mapped to a corresponding node, the call relationships between microservices are captured by parsing the communication data between them, directed edges are constructed between the nodes, and a service call relationship graph is constructed with the reconstruction error of each microservice's response time index data as the node anomaly weight. The specific flow is as follows:

Step 3-1: The set of microservices in the microservice system is denoted $V = \{v_1, v_2, \dots, v_n\}$, where $n$ represents the number of microservices. Each $v_i \in V$ is mapped to a graph node $s_i$, finally giving the graph node set $S = \{s_1, s_2, \dots, s_n\}$;

Step 3-2: The call relationships between microservices are captured by parsing the communication data between them. If microservice $v_i$ sends a service request to microservice $v_j$, a directed edge $z_{ij}$ from $s_i$ to $s_j$ is constructed, forming the edge set $Z = \{z_{ij}\}$, where $1 \le i, j \le n$. Identical service requests produce only one directed edge;

Step 3-3: The reconstruction error of the response time monitoring index of microservice $v_i$ is used as the initial anomaly weight $f_{s_i}$ of graph node $s_i$. The initial anomaly weight of every microservice is calculated by traversal to obtain the graph node anomaly weight set $F = \{f_{s_1}, f_{s_2}, \dots, f_{s_n}\}$. This finally gives the service call relationship graph $G(S, Z, F)$.
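As a non-limiting illustration, the construction of the service call relationship graph G(S, Z, F) in step 3 can be sketched in Python. The function name `build_call_graph`, the service names and the error values are hypothetical, and the sketch assumes that the captured communication data have already been reduced to (caller, callee) pairs.

```python
def build_call_graph(services, calls, errors):
    """Return (S, Z, F): nodes, de-duplicated directed edges, initial anomaly weights.

    services: iterable of microservice names
    calls:    iterable of (caller, callee) pairs parsed from communication data
    errors:   mapping from microservice name to its reconstruction error
    """
    nodes = list(services)
    # A set collapses repeated requests between the same pair into one directed edge.
    edges = sorted(set(calls))
    weights = {s: float(errors[s]) for s in nodes}
    return nodes, edges, weights

# Hypothetical system: the gateway calls orders twice, and orders calls the database.
services = ["gateway", "orders", "db"]
calls = [("gateway", "orders"), ("gateway", "orders"), ("orders", "db")]
errors = {"gateway": 0.2, "orders": 0.8, "db": 0.9}

S, Z, F = build_call_graph(services, calls, errors)
print(Z)
```

Keeping the edges as plain (caller, callee) pairs makes the later neighbor lookups and the transition matrix straightforward to derive from the same structure.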

Step 4: The walk strategy of the random walk algorithm between nodes depends on the walk probability between the nodes, and the walk probability is tied to the anomaly weights of the nodes, so the setting of the anomaly weight of each graph node in the service call relationship graph $G(S, Z, F)$ is particularly critical to the fault location accuracy of the random walk algorithm. The invention updates the anomaly weights of the graph nodes through the relationships between graph nodes in the service call relationship graph and the relationship between the running state of a microservice and the resource utilization of its host, so that graph nodes with a higher degree of abnormality obtain a larger walk probability in the random walk algorithm, which enhances the interpretability of fault location. Based on the service call relationship graph $G(S, Z, F)$, the anomaly weight of each graph node is automatically updated according to the anomaly weight relationships between adjacent graph nodes. For any graph node $s_j \in S$, $j \in \{1, 2, \dots, n\}$, the adjacent graph nodes that have a directed edge pointing to $s_j$ form the set $AN(s_j)$, and the adjacent graph nodes that have a directed edge pointing to any graph node in $AN(s_j)$ form the set $NAN(s_j)$.

The average anomaly weight $aScore(s_j)$ of $AN(s_j)$ is then calculated:

$$aScore(s_j) = \frac{\sum_{s_i \in AN(s_j)} f_{s_i}}{inDegree(s_j)} \quad (16)$$

where $f_{s_i}$ represents the anomaly weight of graph node $s_i$ and $inDegree(s_j)$ represents the in-degree of graph node $s_j$. The average anomaly weight $cScore(s_j)$ of $NAN(s_j)$ is calculated as:

$$cScore(s_j) = \frac{\sum_{s_i \in NAN(s_j)} f_{s_i}}{\lvert NAN(s_j) \rvert} \quad (17)$$

where $aScore(s_j)$ reflects the overall degree of abnormality of $AN(s_j)$ and $cScore(s_j)$ represents the overall degree of abnormality of $NAN(s_j)$. Combining the characteristics of $aScore(s_j)$ and $cScore(s_j)$, the anomaly weight $acScore(s_j)$ of node $s_j$ is calculated as:

$$acScore(s_j) = aScore(s_j) - cScore(s_j) \quad (18)$$
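As a non-limiting illustration, the aScore, cScore and acScore calculation can be sketched in Python. The helper names and the example graph are hypothetical. The sketch reads AN(s_j) as the nodes with a directed edge pointing to s_j and NAN(s_j) as the nodes with a directed edge pointing to any member of AN(s_j), and it assumes that the average over an empty set is 0, a case the text does not address.

```python
def in_neighbors(edges, node):
    """Nodes that have a directed edge pointing to `node`."""
    return {src for src, dst in edges if dst == node}

def mean_weight(nodes, weights):
    """Average anomaly weight of a node set (ASSUMPTION: 0.0 when the set is empty)."""
    return sum(weights[n] for n in nodes) / len(nodes) if nodes else 0.0

def ac_score(edges, weights, node):
    """acScore = aScore - cScore for one graph node."""
    an = in_neighbors(edges, node)                 # AN(s_j)
    nan = set()
    for neighbor in an:                            # NAN(s_j)
        nan |= in_neighbors(edges, neighbor)
    a_score = mean_weight(an, weights)             # |AN(s_j)| equals inDegree(s_j)
    c_score = mean_weight(nan, weights)
    return a_score - c_score

# Hypothetical call graph: gateway calls orders and cart, which both call db.
edges = [("gateway", "orders"), ("gateway", "cart"), ("orders", "db"), ("cart", "db")]
weights = {"gateway": 0.2, "orders": 0.8, "cart": 0.6, "db": 0.9}

for service in weights:
    print(service, round(ac_score(edges, weights, service), 3))
```

In this example the callers of `db` are strongly abnormal while their own caller is not, so `db` receives the highest acScore, which matches the intuition that an anomaly concentrated immediately around a node points to that node.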

Obviously, acScore (s _j ) Integrate graph nodes s _j Itself and its surrounding graph nodes, acScore(s) _j ) The higher the value of (a) indicates the graph node s _j Adjacent graph node wholeThe degree of abnormality is high, and AN(s) _j ) If the overall degree of anomaly of adjacent graph nodes of (a) is lower, then the graph node s _j Corresponding microservice v _j The higher the probability of being the root cause of the fault. In addition, the response time of the micro-service is related to the change of the performance index of the host where the micro-service is deployed, and the invention selects the Pearson correlation function to measure the micro-service v _i Response time series reconstruction errors collected at time t over a time window

where the resource term denotes the sequence data of time window w collected at time t by the r-th physical resource monitoring index, together with its mean, r ∈ {1, 2, …, k}; the response time term refers to the response time monitoring value of micro-service v_i at time e, together with the mean of the reconstruction error. The degree of abnormality of micro-service v_i is reflected by how high this correlation score is, so the invention combines acScore(s_i) with the correlation score to calculate the final anomaly weight AS(s_i) of graph node s_i:
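The correlation step can be illustrated with the standard Pearson coefficient. The sketch below only computes the correlation between one reconstruction error window and each of k resource windows; how the score is combined with acScore(s_i) is not recoverable from this text, so the code stops short of that. The names `pearson` and `resource_correlations`, and the choice to return 0 for a constant window, are assumptions made for illustration.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        # Assumption: a constant window has undefined correlation; use 0.
        return 0.0
    return cov / (sx * sy)


def resource_correlations(recon_errors, resource_windows):
    """Correlate one reconstruction error window with each resource window.

    recon_errors:     reconstruction errors over the time window.
    resource_windows: dict mapping a resource index name to its samples
                      over the same window.
    """
    return {name: pearson(recon_errors, series)
            for name, series in resource_windows.items()}
```

A resource index whose samples rise and fall with the reconstruction error yields a score near 1, which is the situation in which host resource usage plausibly explains the response time anomaly.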

Step 5: The anomaly weight of every micro-service node is calculated by traversal, which completes the update of the anomaly weights of all graph nodes in the service call relation graph. A weighted PageRank algorithm then performs a random walk in the service call relation graph G(S, Z, F), with the walk probability computed from the relation between a graph node's anomaly weight and the anomaly weights of the graph nodes connected to it, which improves the accuracy of locating the fault root cause. The walk strategy of the weighted PageRank algorithm is based on the probability that each graph node visits other graph nodes, so the transition probability matrix U of the service call relation graph must be defined first:

where u_ij denotes the probability of a random walk from graph node s_j to graph node s_i. Because the walk probability is tied to the anomaly weights of the graph nodes, u_ij is calculated as:

where s_j → s_i indicates that a directed edge from graph node s_j to graph node s_i exists, and linkut(s_j) denotes the sum of the anomaly weights of the graph nodes that s_j points to. For any graph node s_i ∈ S, i ∈ {1, 2, …, n}, the PR score is initialized to PR_0(s_i) = 1/n, and the PR scores of all graph nodes are expressed as the vector R_0:

R_0 = (PR_0(s_1), PR_0(s_2), …, PR_0(s_n))^T    (23)

R_c = β U · R_{c-1} + (1 - β) R_0    (24)

where R_c denotes the PR score vector of all graph nodes after iteration round c, U ∈ R^{n×n} denotes the random walk probability matrix, and β ∈ (0, 1) denotes the damping coefficient, typically β = 0.85. After repeated iterative updates the PR score of each graph node converges; the higher a graph node's PR score, the more likely the corresponding micro-service is the fault root cause. Finally, a ranked list of candidate root cause micro-services is output in descending order of PR score.
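The iteration of formula (24) can be sketched in a few lines. Because the formula for u_ij is not reproduced in this text, the sketch assumes u_ij = AS(s_i) / linkut(s_j) when a directed edge s_j → s_i exists and 0 otherwise, which matches the surviving definition of linkut(s_j). It also assumes non-negative anomaly weights and lets nodes without outgoing weight propagate nothing. The function name `weighted_pagerank` and its parameters are hypothetical.

```python
def weighted_pagerank(edges, anomaly, beta=0.85, tol=1e-9, max_iter=200):
    """Rank graph nodes by a PageRank score weighted by anomaly weights.

    edges:   iterable of (s_j, s_i) pairs, one per directed edge s_j -> s_i.
    anomaly: dict mapping each graph node to a non-negative anomaly weight.
    Returns (nodes sorted by descending score, dict of scores).
    """
    nodes = list(anomaly)
    n = len(nodes)
    out = {s: [] for s in nodes}
    for src, dst in edges:
        out[src].append(dst)
    # link_out[s_j]: sum of anomaly weights of the nodes that s_j points to.
    link_out = {s: sum(anomaly[d] for d in out[s]) for s in nodes}

    r0 = {s: 1.0 / n for s in nodes}  # PR_0(s_i) = 1/n
    r = dict(r0)
    for _ in range(max_iter):
        # R_c = beta * U * R_{c-1} + (1 - beta) * R_0
        nxt = {s: (1 - beta) * r0[s] for s in nodes}
        for sj in nodes:
            if link_out[sj] == 0:
                continue  # assumption: no outgoing weight, nothing propagated
            for si in out[sj]:
                # Assumed form of u_ij: anomaly[s_i] / link_out[s_j].
                nxt[si] += beta * (anomaly[si] / link_out[sj]) * r[sj]
        delta = sum(abs(nxt[s] - r[s]) for s in nodes)
        r = nxt
        if delta < tol:
            break
    ranking = sorted(nodes, key=lambda s: r[s], reverse=True)
    return ranking, r
```

Under these assumptions the scores need not sum to 1, since score mass leaving a node with no outgoing edges is dropped; only the relative order of the scores is used for the ranked list.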

To evaluate the effectiveness of the invention, AC@K and MAP are adopted as indexes for measuring the fault location effect. AC@K denotes the probability that the fault root cause micro-service is among the first K micro-services in the output root cause ranking; the higher the AC@K score, the more accurately the model locates faults. MAP quantifies the average fault location accuracy.
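The two indexes can be computed from a set of fault cases as sketched below. AC@K follows the definition given above, with one true root cause per case. The text does not give a formula for MAP, so the sketch assumes one common reading, the mean of AC@k over k = 1..N where N is the number of ranked micro-services; this is an assumption, not the patent's stated definition. The function names are hypothetical.

```python
def ac_at_k(rankings, root_causes, k):
    """Fraction of fault cases whose true root cause is in the top k.

    rankings:    list of ranked micro-service lists, one per fault case.
    root_causes: list of the true root cause micro-service per fault case.
    """
    hits = sum(1 for ranked, root in zip(rankings, root_causes)
               if root in ranked[:k])
    return hits / len(rankings)


def mean_ac(rankings, root_causes):
    """Assumed MAP: mean of AC@k over k = 1..N (N = ranking length)."""
    n = len(rankings[0])
    return sum(ac_at_k(rankings, root_causes, k)
               for k in range(1, n + 1)) / n
```

With two hypothetical fault cases, one ranked correctly at position 1 and the other at position 2, AC@1 is 0.5 and AC@2 is 1.0.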

Table 1 shows the fault location accuracy of the invention on different micro-services in Sock-shop when faults such as network delay (Latency), CPU resource shortage (CPU Hog), and memory leak (Memory Leak) are injected. As can be seen from Table 1, the average fault location accuracy (MAP) of MicroEncoder on each micro-service is over 85%.

TABLE 1

To evaluate its effectiveness, the invention was also compared with other advanced methods, including Random selection, MicroRCA, and AAMR. First, the fault location accuracy of the invention and of the comparison methods was tested under three fault types: network delay (Latency), CPU resource shortage (CPU Hog), and memory leak (Memory Leak). The experimental results are shown in fig. 3: for all three injected fault types and for different values of K, the fault location accuracy of the invention is higher than that of the other methods, which indicates that the invention effectively improves fault location accuracy.

Further, the average fault location accuracy of the methods is measured by calculating the fault location evaluation index MAP, and the experimental results are shown in fig. 4. As can be seen from fig. 4, the average fault location accuracy of the invention is 92% for network delay (Latency), 86.4% for CPU resource shortage (CPU Hog), and 91.2% for memory leak (Memory Leak), all of which are superior to the comparison methods.

The invention also provides a device for the micro-service fault location method based on the self-encoder and the service dependency graph, comprising a data collection module, an anomaly detection module, and a fault location module. The data collection module collects micro-service physical resource indexes and response time indexes; the response time indexes are used to detect the response speed of calls among micro-services, and the physical resource indexes are used to update the anomaly weights of graph nodes in the service call relation graph. The anomaly detection module encodes and reconstructs the data from the data collection module through the self-encoder and detects whether a micro-service is abnormal. The fault location module constructs the service call relation graph by analyzing the call relations among micro-services, analyzes the anomaly propagation path, calculates the anomaly weight of each micro-service in the service call relation graph using the physical resource indexes and the micro-service response time indexes, and locates the fault root cause micro-service using the weighted PageRank algorithm.

Referring to fig. 5, which shows a schematic structural diagram of a micro-service fault location device based on a self-encoder and a service dependency graph according to an exemplary embodiment of the present invention, the device provided in this embodiment includes a data collection unit 301, an anomaly detection unit 302, and a fault location unit 303. The data collection unit 301 is responsible for collecting micro-service physical resource indexes and response time indexes; the response time indexes are used to detect the response speed of calls among micro-services, and the physical resource indexes are used to update the anomaly weights of graph nodes in the service call relation graph. The anomaly detection unit 302 encodes and reconstructs the data from the data collection unit 301 through a self-encoder to detect whether a micro-service is abnormal. The fault location unit 303 constructs the service call relation graph by analyzing the call relations among micro-services, analyzes the anomaly propagation path, calculates the anomaly weight of each micro-service in the service call relation graph using the physical resource indexes and the micro-service response time indexes, and locates the fault root cause using the weighted PageRank algorithm.

Referring to fig. 6, which shows a schematic structural diagram of a micro-service fault location device based on a self-encoder and a service dependency graph according to an embodiment of the present application, hereinafter referred to as device 6. The device 6 may be integrated in the foregoing electronic apparatus and, as shown in fig. 6, includes a memory 602, a processor 601, an input device 603, an output device 604, and a communication interface. The memory 602 may be a separate physical unit connected to the processor 601, the input device 603, and the output device 604 through buses; the memory 602 and the processor 601 may also be integrated, implemented by hardware, or the like. The memory 602 is used to store a program implementing the above method embodiment, or the respective modules of the apparatus embodiment, and the processor 601 invokes the program to perform the operations of the above method embodiment. The input device 603 includes, but is not limited to, a keyboard and a mouse; the output device 604 includes, but is not limited to, a display screen. The communication interface is used to transmit and receive various types of messages and includes, but is not limited to, a wireless interface or a wired interface.

Alternatively, when part or all of the method of the above-described embodiments is implemented by software, the apparatus may include only the processor. The memory for storing the program is then located outside the device, and the processor is connected to the memory via a circuit or wire to read and execute the program stored in the memory. The processor may be a central processing unit (CPU), a network processor (NP), or a combination of a CPU and an NP.

The processor may further comprise a hardware chip. The hardware chip may be an application-specific integrated circuit (ASIC), a programmable logic device (PLD), or a combination thereof. The PLD may be a complex programmable logic device (CPLD), a field-programmable gate array (FPGA), generic array logic (GAL), or any combination thereof. The memory may include volatile memory, such as random-access memory (RAM); the memory may also include non-volatile memory, such as flash memory, a hard disk drive (HDD), or a solid-state drive (SSD); the memory may also comprise a combination of the above types of memory.

The present application is described with reference to flowchart illustrations and/or block diagrams of methods, apparatus (systems) and computer program products according to embodiments of the application. It will be understood that each flow and/or block of the flowchart illustrations and/or block diagrams, and combinations of flows and/or blocks in the flowchart illustrations and/or block diagrams, can be implemented by computer program instructions. These computer program instructions may be provided to a processor of a general purpose computer, special purpose computer, embedded processor, or other programmable data processing apparatus to produce a machine, such that the instructions, which execute via the processor of the computer or other programmable data processing apparatus, create means for implementing the functions specified in the flowchart flow or flows and/or block diagram block or blocks.

The above is only a preferred embodiment of the present invention, and the protection scope of the present invention is not limited to the above examples, and all technical solutions belonging to the concept of the present invention belong to the protection scope of the present invention. It should be noted that modifications and adaptations to the invention without departing from the principles thereof are intended to be within the scope of the invention as set forth in the following claims.

Claims

1. The micro-service fault positioning method based on the self-encoder and the service dependency graph is characterized by comprising the following steps of:

2. The method for positioning micro-service faults based on the self-encoder and the service dependency graph according to claim 1, wherein in step 1, the physical resource utilization index comprises indexes of the resource usage of the physical machine or virtual machine layer running the micro-service instance, the response time index comprises the time a micro-service takes to respond to requests from other micro-services in the micro-service system, and the physical resource utilization index and the response time index are monitored and collected in real time through the Prometheus tool.

3. The method for positioning micro service faults based on the self-encoder and the service dependency graph according to claim 1, wherein in the step 2:

the input is an h-dimensional vector, which is mapped by the encoding layer to a d-dimensional latent feature representation, and the reconstruction mean square error between the input and its reconstructed value is calculated;

in the self-encoder training phase, the response time monitoring sequence of micro-service v_i during normal operation of the micro-service system is used as training data to train the self-encoder model; after multiple rounds of training, the converged self-encoder model has learned the characteristics of normal response time sequence data, so its reconstruction of the response time monitoring value of micro-service v_i in normal operation is close to the monitored value, and the corresponding reconstruction error is small and fluctuates within a stable range; the mean μ and standard deviation σ of the reconstruction error at this point are calculated, and the anomaly detection threshold of micro-service v_i is determined as α^i = μ + 3σ; during real-time detection of the running state of micro-service v_i, if the reconstruction error exceeds α^i, micro-service v_i is considered abnormal.

4. The method for positioning micro service failure based on self-encoder and service dependency graph according to claim 1, wherein in step 3

The specific flow is as follows:

Step 3-1: the set of micro-services in the micro-service system is denoted V = {v_1, v_2, …, v_n}, where n is the number of micro-services; each v_i ∈ V is mapped to a graph node s_i, finally giving the graph node set S = {s_1, s_2, …, s_n};

Step 3-2: the call relations between micro-services are captured by parsing the communication data between them; if micro-service v_i sends a service request to micro-service v_j, a directed edge z_ij from graph node s_i to graph node s_j is constructed, finally forming the edge set Z = {z_ij}, where 1 ≤ i, j ≤ n; only one directed edge is constructed for repeated occurrences of the same service request, generating a service call relation graph without anomaly weights;

Step 3-3: the response time reconstruction error of micro-service v_i is used as the initial anomaly weight of graph node s_i, and these anomaly weights form the anomaly weight set F.

5. The method for positioning micro-service faults based on the self-encoder and the service dependency graph according to claim 1, wherein in step 4, based on the service call relation graph G(S, Z, F), the anomaly weight of each graph node is automatically updated according to the anomaly weights of its adjacent graph nodes in the service call relation graph; for any graph node s_j ∈ S, j ∈ {1, 2, …, n}, the adjacent graph nodes that have a directed edge pointing to s_j form the set AN(s_j), and the adjacent graph nodes that have a directed edge pointing to any graph node in AN(s_j) form the set NAN(s_j);

the average anomaly weight aScore(s_j) of AN(s_j) is calculated:

wherein

acScore(s_j) = aScore(s_j) - cScore(s_j)    (6)

the higher the value of acScore(s_j), the higher the overall degree of abnormality of the graph nodes adjacent to s_j and the lower the overall degree of abnormality of the graph nodes adjacent to AN(s_j), and the higher the probability that the micro-service v_j corresponding to graph node s_j is the fault root cause; the Pearson correlation function is used to measure the correlation between the response time reconstruction error sequence of micro-service v_i, collected at time t over a time window, and the host's physical resource monitoring index sequences:

wherein the resource term denotes the sequence data of time window w collected at time t by the r-th physical resource monitoring index, together with its mean, r ∈ {1, 2, …, k}, k being the number of physical resources; the response time term refers to the response time monitoring value of micro-service v_i at time e, together with the mean of the reconstruction error; the degree of abnormality of micro-service v_i is reflected by this correlation score, and the final anomaly weight AS(s_i) of graph node s_i is calculated:

6. The method for positioning micro service failure based on self-encoder and service dependency graph according to claim 5, wherein in step 5:

wherein s_j → s_i indicates that a directed edge from graph node s_j to graph node s_i exists, and linkut(s_j) denotes the sum of the anomaly weights of all graph nodes that graph node s_j points to; for any graph node s_i ∈ S, i ∈ {1, 2, …, n}, the PR score is initialized to PR_0(s_i) = 1/n, and the PR scores of all graph nodes are expressed as the vector R_0:

R_0 = (PR_0(s_1), PR_0(s_2), …, PR_0(s_n))^T    (11)

R_c = β U · R_{c-1} + (1 - β) R_0    (12)

wherein R_c denotes the PR score vector of all graph nodes after iteration round c, U ∈ R^{n×n} denotes the random walk probability matrix, and β ∈ (0, 1) denotes the damping coefficient, typically β = 0.85; after iterative updating, the PR score of each graph node converges, and the higher the PR score of a graph node, the greater the probability that the corresponding micro-service is the fault root cause; finally, a ranked list of fault root cause micro-services is output in descending order of the PR scores of the graph nodes.

7. A device using the micro-service fault positioning method based on the self-encoder and the service dependency graph according to any one of claims 1-6, comprising a data collection module, an anomaly detection module, and a fault location module, wherein the data collection module is responsible for collecting, during operation of the micro-service system, the physical resource utilization indexes of the server hosts and the response time indexes of call requests between micro-services; the anomaly detection module trains a self-encoder model using the monitoring sequence of call response times as training data, reconstructs the response times, and judges whether the micro-service system is abnormal by calculating the reconstruction error of the response time index data; the fault location module maps each micro-service to a corresponding graph node, parses the communication data between micro-services to capture their call relations, constructs directed edges between nodes from these call relations, constructs the service call relation graph with the response time index data reconstruction error as the node anomaly weight, associates the response time index data reconstruction error with the physical resource utilization index data of the server host to calculate the anomaly weight of each graph node in the service call relation graph, and, based on the updated anomaly weights of the graph nodes, uses the weighted PageRank algorithm to infer and locate the faulty micro-service that caused the abnormality.

8. A computer-readable storage medium, characterized in that the computer-readable storage medium stores a computer program which, when executed by a processor, implements the method steps of any of claims 1-6.

9. An electronic device comprising a processor and a memory storing a computer program which, when executed by the processor, implements the method steps of any of claims 1-6.