CN118467232B - Micro-service fault positioning method and electronic equipment - Google Patents
- Publication number: CN118467232B
- Application number: CN202410935654.7A
- Authority: CN (China)
- Legal status: Active (the legal status is an assumption and is not a legal conclusion; Google has not performed a legal analysis and makes no representation as to the accuracy of the status listed)
Classifications
- G06F11/079 — Root cause analysis, i.e. error or fault diagnosis (G—Physics; G06—Computing or calculating; G06F—Electric digital data processing; G06F11/00—Error detection, error correction, monitoring; G06F11/07—Responding to the occurrence of a fault, e.g. fault tolerance; G06F11/0703—Error or fault processing not based on redundancy)
- G06F18/2431 — Multiple classes (G06F18/00—Pattern recognition; G06F18/20—Analysing; G06F18/24—Classification techniques; G06F18/243—Classification techniques relating to the number of classes)
Landscapes
- Engineering & Computer Science (AREA)
- Theoretical Computer Science (AREA)
- Physics & Mathematics (AREA)
- Data Mining & Analysis (AREA)
- General Physics & Mathematics (AREA)
- General Engineering & Computer Science (AREA)
- Bioinformatics & Cheminformatics (AREA)
- Evolutionary Biology (AREA)
- Evolutionary Computation (AREA)
- Computer Vision & Pattern Recognition (AREA)
- Bioinformatics & Computational Biology (AREA)
- Artificial Intelligence (AREA)
- Life Sciences & Earth Sciences (AREA)
- Health & Medical Sciences (AREA)
- Biomedical Technology (AREA)
- Quality & Reliability (AREA)
- Debugging And Monitoring (AREA)
Abstract
The application discloses a micro-service fault positioning method and an electronic device, relating to the technical field of fault positioning. The method includes: in response to obtaining micro-service fault alarm information, determining fault characteristics based on the fault alarm information; determining a fault type based on the fault characteristics, where the fault types include performance-class faults and non-performance-class faults; if the fault type is a performance-class fault, determining a first sub-fault type based on the number of fault alarm messages, and determining the micro-service fault node based on a first fault troubleshooting path corresponding to the first sub-fault type, where the first sub-fault types include global performance faults and local performance faults; if the fault type is a non-performance-class fault, determining the micro-service fault node based on a second fault troubleshooting path corresponding to a second sub-fault type, where the second sub-fault types include business-exception faults and application-exception faults. In this way, faults can be located rapidly and accurately.
Description
Technical Field
The present application relates to the field of fault location technologies, and in particular to a micro-service fault positioning method and an electronic device.
Background
Currently, micro-service architectures are used more and more widely. Compared with a monolithic service, a micro-service architecture is more complex: calls between functional modules change from in-process calls to inter-process calls. As the number of calling nodes grows, the number of places where a fault can occur grows geometrically, making fault localization increasingly difficult.
Disclosure of Invention
In view of the above, embodiments of the present application provide a micro-service fault positioning method and an electronic device.
According to a first aspect, an embodiment of the present application provides a micro-service fault positioning method, including:
determining fault characteristics based on fault alarm information in response to obtaining micro-service fault alarm information;
determining a fault type based on the fault characteristics, where the fault types include performance-class faults and non-performance-class faults;
if the fault type is a performance-class fault, determining a first sub-fault type based on the number of fault alarm messages, and determining a micro-service fault node based on a first fault troubleshooting path corresponding to the first sub-fault type; the first sub-fault types include global performance faults and local performance faults;
if the fault type is a non-performance-class fault, determining the micro-service fault node based on a second fault troubleshooting path corresponding to a second sub-fault type; the second sub-fault types include business-exception faults and application-exception faults.
Optionally, determining the fault type based on the fault characteristics includes:
if the fault characteristics show that a first parameter is abnormal, determining that the fault type is a performance-class fault;
if the fault characteristics show that a second parameter is abnormal, determining that the fault type is a non-performance-class fault;
the first parameters include the service response time and the number of fault alarm messages; the second parameters include the business volume and the business error rate.
Optionally, determining the first sub-fault type based on the number of fault alert information includes:
if the number of the fault alarm information is greater than or equal to the alarm number threshold, determining that the first sub-fault type is a global performance fault;
if the number of the fault alarm information is smaller than the alarm number threshold, determining that the first sub-fault type is a local performance fault.
Optionally, determining the micro-service fault node based on the first fault troubleshooting path corresponding to the first sub-fault type includes:
if the first sub-fault type is a global performance fault, determining the micro-service fault node according to a fault troubleshooting path of data center network faults, single-system host faults, middleware faults, database faults, and critical application node faults;
if the first sub-fault type is a local performance fault, determining the micro-service fault node according to a fault troubleshooting path of single-server application flow faults, application-to-Structured-Query-Language (SQL) call faults, application code faults, CPU or thread faults, memory or Java Virtual Machine (JVM) faults, disk or file faults, and network faults.
Optionally, determining the micro-service fault node according to a fault troubleshooting path of a data center network fault, a single system host fault, a middleware fault, a database fault, and a critical application node fault includes:
determining whether the data center has a network fault based on the flow information of the data center;
If the data center has no network fault, determining whether a host in the single system has a fault or not based on the performance information of the host in the single system;
if the host computer in the single system has no fault, determining whether the middleware has the fault or not based on the middleware information;
If the middleware does not have a fault, determining whether the database has the fault or not based on the performance information of the database;
If the database has no fault, determining whether the key application node has the fault or not based on the service information of the key application node.
Optionally, determining the micro-service fault node according to a fault troubleshooting path of a single server application flow fault, an application call fault to SQL, an application code fault, a CPU or thread fault, a memory or JVM fault, a disk or file fault, and a network fault, includes:
determining whether the application flow of the single server fails or not based on the application flow information of the single server;
If the single server application flow has no fault, determining whether the application has SQL calling fault or not based on the SQL calling information of the application;
If the application does not have SQL call faults, determining whether an application code has faults or not based on an application running log;
If the application code has no fault, determining whether the CPU or the thread has the fault or not based on the running information of the CPU or the thread of the server;
if the CPU or the thread has no fault, determining whether the memory or the JVM has the fault or not based on the running information of the memory or the JVM of the server;
If the memory or the JVM has no fault, determining whether the disk or the file has the fault or not based on the running information of the disk or the file of the server;
if the disk or the file has no fault, determining whether the network has the fault or not based on the running information of the server network.
Optionally, determining the micro-service fault node based on the second fault troubleshooting path corresponding to the second sub-fault type includes:
for business-exception faults, determining the micro-service fault node according to a fault troubleshooting path of application interface faults, front-end page faults, and client faults;
for application-exception faults, determining the micro-service fault node according to a fault troubleshooting path of multi-data-center flow distribution faults, single-server flow faults, database faults, and memory or JVM faults.
Optionally, determining the micro-service fault node according to the fault troubleshooting path of the application interface fault, the front-end page fault and the client fault includes:
Determining whether a service interface fails based on the service error code;
If the service interface has no fault, determining whether the front-end page has the fault or not based on the front-end page running information;
if the front-end page has no fault, determining whether the client has the fault or not based on the client operation data.
Optionally, determining the micro-service failure node according to a failure troubleshooting path of a multi-data center traffic distribution failure, a single server traffic failure, a database failure, a memory or a JVM failure, includes:
determining whether the flow distribution of the multiple data centers fails based on the flow information of the multiple data centers;
If the flow distribution of the multiple data centers does not have faults, determining whether the flow distribution of the single data center has faults or not based on the flow information of the single data center;
if the single-center flow distribution has no fault, determining whether the single server flow has the fault or not based on the operation information of the service system;
If the single server flow has no fault, determining whether the database has the fault or not based on the access information of the server to the database;
If the database has no fault, determining whether the memory or the JVM has the fault based on the running information of the memory or the JVM of the server.
According to a second aspect of the present application, an embodiment of the present application provides an electronic device, including:
at least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the micro-service fault localization method as in the first aspect or any implementation of the first aspect.
According to the micro-service fault positioning method and electronic device provided by the embodiments of the application, in response to obtaining micro-service fault alarm information, fault characteristics are determined based on the fault alarm information; a fault type is determined based on the fault characteristics, where the fault types include performance-class faults and non-performance-class faults; if the fault type is a performance-class fault, a first sub-fault type is determined based on the number of fault alarm messages, and the micro-service fault node is determined based on the first fault troubleshooting path corresponding to the first sub-fault type, where the first sub-fault types include global performance faults and local performance faults; if the fault type is a non-performance-class fault, the micro-service fault node is determined based on the second fault troubleshooting path corresponding to a second sub-fault type, where the second sub-fault types include business-exception faults and application-exception faults. The method thus starts from the two major fault categories, performance-class and non-performance-class, analyzes their possible causes, subdivides each major category into two sub-fault types, and sets a corresponding fault troubleshooting path for each, so that faults can be located rapidly and accurately.
The foregoing is only an overview of the technical solution of the present application. In order that the technical means of the present application may be understood more clearly and implemented in accordance with the content of the specification, and in order to make the above and other objects, features, and advantages of the present application more apparent, specific embodiments of the application are set forth below.
Drawings
FIG. 1 is a flow chart of a method for positioning a micro-service fault according to an embodiment of the present application;
FIG. 2 is a schematic diagram of the fault troubleshooting path corresponding to a global performance fault in an embodiment of the present application;
FIG. 3 is a schematic diagram of the fault troubleshooting path corresponding to a local performance fault in an embodiment of the present application;
FIG. 4 is a schematic diagram of the fault troubleshooting path corresponding to a business-exception fault in an embodiment of the present application;
FIG. 5 is a schematic diagram of the fault troubleshooting path corresponding to an application-exception fault in an embodiment of the present application;
Fig. 6 is a schematic diagram of a hardware structure of an electronic device according to an embodiment of the present application.
Detailed Description
To make the objects, technical solutions, and advantages of the embodiments of the present application clearer, the technical solutions in the embodiments of the present application are described below clearly and completely with reference to the accompanying drawings. Evidently, the described embodiments are some, but not all, of the embodiments of the present application. All other embodiments obtained by those skilled in the art based on the embodiments of the present application without inventive effort fall within the scope of protection of the present application.
The embodiment of the application provides a micro-service fault positioning method, which, as shown in FIG. 1, includes the following steps:
S101. In response to obtaining micro-service fault alarm information, determine fault characteristics based on the fault alarm information.
In this embodiment, the fault alarm information describes the surface symptoms of a fault, such as slow processing of a large number of transactions, slow processing of a small number of transactions, abnormal business operation, and business interruption.
In this embodiment, the fault characteristics can be determined by analyzing the fault alarm information; they characterize the underlying problem of the fault, such as long service response times, a large number of fault alarm messages, reduced business volume, or an increased business error rate.
S102. Determine the fault type based on the fault characteristics, where the fault types include performance-class faults and non-performance-class faults.
In this embodiment, a performance-class fault mainly refers to abnormal system behavior such as increased response times or an increased number of fault alarm messages; the problematic node may be one or more of an application, a database, a cache, or a network device.
In this embodiment, a non-performance-class fault mainly refers to situations such as reduced business volume, an increased business error rate, an unavailable business interface, reduced application traffic caused by a flow interruption, a reduced business success rate, or increased business-interface latency.
In this embodiment, corresponding fault-feature tables may be configured for performance-class faults and non-performance-class faults, so that once the fault characteristics are determined, the fault type can be looked up.
S103. If the fault type is a performance-class fault, determine a first sub-fault type based on the number of fault alarm messages, and determine the micro-service fault node based on the first fault troubleshooting path corresponding to the first sub-fault type; the first sub-fault types include global performance faults and local performance faults.
In this embodiment, performance-class faults include global performance faults and local performance faults.
In this embodiment, a global performance fault indicates a serious problem in the underlying infrastructure or a serious single-point problem in the overall system architecture; its symptoms include customer service receiving a large number of complaints and large-scale alarms appearing across business and application monitoring. A local performance fault involves symptoms such as increased latency or a reduced response rate for some applications or some interfaces; its symptoms are customer service receiving a small number of work orders and small drops in various system indicators, accompanied by small-scale server alarms. Therefore, if the fault type is a performance-class fault, global and local performance faults can be distinguished by the number of fault alarm messages.
In some embodiments, determining the first sub-fault type based on the number of fault alert information includes:
if the number of the fault alarm information is greater than or equal to the alarm number threshold, determining that the first sub-fault type is a global performance fault; if the number of the fault alarm information is smaller than the alarm number threshold, determining that the first sub-fault type is a local performance fault.
In this embodiment, by setting the alarm number threshold, it is possible to quickly determine whether the fault is a global class performance fault or a local class performance fault.
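The alarm-count rule above can be sketched as follows; this is a minimal illustration, and the threshold value and function name are hypothetical rather than taken from the patent:

```python
ALARM_COUNT_THRESHOLD = 100  # hypothetical threshold; tuned per deployment


def first_sub_fault_type(alarm_count: int) -> str:
    """Classify a performance-class fault by the number of fault alarm messages."""
    if alarm_count >= ALARM_COUNT_THRESHOLD:
        return "global"  # global performance fault: large-scale alarms
    return "local"       # local performance fault: small-scale alarms
```

Note that the boundary case (count exactly equal to the threshold) is classified as global, matching the "greater than or equal to" wording above.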
S104. If the fault type is a non-performance-class fault, determine the micro-service fault node based on the second fault troubleshooting path corresponding to a second sub-fault type; the second sub-fault types include business-exception faults and application-exception faults.
In this embodiment, non-performance-class faults are divided into business-exception faults and application-exception faults, each corresponding to a different second fault troubleshooting path.
In this embodiment, for business-exception faults and application-exception faults, the troubleshooting principle is recovery first: normal function is restored first, and the cause is then located and investigated. The troubleshooting approach covers three cases: (1) if all functions of a module are affected across every channel (for example, none of the functions of the funds module can be used), evaluate whether to switch traffic between data centers; (2) if the fault affects only a certain business function of a single channel, the business entrance of the problem channel can be closed and customers guided to other channels to transact; (3) if a related change was deployed before the fault occurred, an emergency rollback can be performed for recovery. Based on these three points, the second fault troubleshooting paths corresponding to business-exception faults and application-exception faults can be determined respectively.
In this embodiment, when the fault type is a non-performance-class fault, the business-exception and application-exception investigations may be carried out simultaneously, and the final micro-service fault node is determined from the combined investigation results.
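The simultaneous investigation described above can be sketched as follows; the probe callables and their return convention (a fault-node name, or None when the path finds nothing) are assumptions for illustration, not part of the patent:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Optional


def locate_non_performance_fault(
    business_probe: Callable[[], Optional[str]],
    application_probe: Callable[[], Optional[str]],
) -> list:
    """Run the business-exception and application-exception investigations
    concurrently and collect every fault node either of them reports."""
    with ThreadPoolExecutor(max_workers=2) as pool:
        futures = [pool.submit(business_probe), pool.submit(application_probe)]
        results = [f.result() for f in futures]
    return [node for node in results if node is not None]
```

For example, if only the application-exception path finds a problem, the business probe returns None and the merged result contains just the node reported by the application probe.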
According to the micro-service fault positioning method provided by the embodiment of the application, the fault characteristics are determined based on the fault alarm information by responding to the acquired micro-service fault alarm information; determining fault types based on the fault characteristics, wherein the fault types comprise performance type faults and non-performance type faults; if the fault type is a performance type fault, determining a first sub-fault type based on the number of fault alarm information, and determining a micro-service fault node based on a first fault troubleshooting path corresponding to the first sub-fault type; the first sub-fault type includes a global class performance fault and a local class performance fault; if the fault type is a non-performance fault, determining a micro-service fault node based on a second fault troubleshooting path corresponding to the second sub-fault type; the second sub-fault type comprises a business abnormal fault and an application abnormal fault; therefore, the method starts from two major faults, namely the performance fault and the non-performance fault, analyzes possible reasons of the faults, subdivides the faults of each major class into two sub-fault types, and sets corresponding fault troubleshooting paths, so that the faults can be rapidly and accurately positioned.
In an alternative embodiment, in step S102, determining the fault type based on the fault characteristics includes: if the fault characteristics show that a first parameter is abnormal, determining that the fault type is a performance-class fault; if the fault characteristics show that a second parameter is abnormal, determining that the fault type is a non-performance-class fault; the first parameters include the service response time and the number of fault alarm messages, and the second parameters include the business volume and the business error rate.
In this embodiment, a first-parameter anomaly may be an increase in response time or an increase in the number of fault alarm messages; a second-parameter anomaly may be a decrease in business volume or an increase in the business error rate.
In this embodiment, by configuring the abnormal parameters corresponding to performance-class and non-performance-class faults, it is possible to quickly determine from the fault characteristics whether a fault is performance-class or non-performance-class.
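The parameter-based classification can be sketched as follows; the feature keys and the precedence given to first-parameter anomalies are illustrative assumptions, not specified by the patent:

```python
def classify_fault_type(features: dict) -> str:
    """Map abnormal fault characteristics to a fault type (step S102).

    First parameters (performance-class): service response time,
    number of fault alarm messages.
    Second parameters (non-performance-class): business volume,
    business error rate.
    """
    first_params = {"response_time_increased", "alarm_count_increased"}
    second_params = {"business_volume_decreased", "error_rate_increased"}
    abnormal = {key for key, is_abnormal in features.items() if is_abnormal}
    if abnormal & first_params:
        return "performance"
    if abnormal & second_params:
        return "non-performance"
    return "unknown"
```

A fault-feature table as described in step S102 could be represented by exactly such key sets, with the lookup reduced to set intersection.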
In an optional embodiment, in step S103, determining the micro service fault node based on the first fault troubleshooting path corresponding to the first sub-fault type includes:
If the first sub-fault type is global performance fault, determining a micro-service fault node according to a fault troubleshooting path of a data center network fault, a single system host fault, a middleware fault, a database fault and a key application node fault.
If the first sub-fault type is a local performance fault, determining a micro-service fault node according to a fault troubleshooting path of a single server application flow fault, an application to Structured Query Language (SQL) call fault, an application code fault, a CPU or thread fault, a memory or Java Virtual Machine (JVM) fault, a disk or file fault and a network fault.
In this embodiment, if the first sub-fault type is a global performance fault, the troubleshooting approach mainly includes the following steps:
1. Through the infrastructure monitoring dashboard and the resource monitoring dashboard, determine whether a global problem exists in the network, storage, or hosts; if so, move to the corresponding resolution procedure.
2. Through the dashboards for each type of middleware, determine whether some middleware has a large-scale failure; if so, move to the relevant middleware emergency handling.
3. Through the database monitoring dashboard, determine whether problems such as a large-scale database-cluster outage or a CPU usage spike exist; if so, move to the corresponding database emergency handling.
4. If no related problem is found in the global checks, perform application-problem analysis to determine whether a large-scale failure of an application node exists on a critical business link. Check the preconfigured global business monitoring dashboard to determine whether all business indicators show problems; if they do, a critical application is highly suspected; if not, determine the specific affected business point and continue investigating the related business problem there.
5. Investigate critical application nodes in the following ways: a) determine whether an application has a problem through the key application indicators preconfigured in the business monitoring dashboard; b) find interrupted or large-scale-erroring applications by inspecting the application topology or entering a trace ID in the distributed tracing system; c) identify problematic applications one by one through each application's monitoring panel.
Based on the above troubleshooting approach, the first fault troubleshooting path for global performance faults can be determined as: the fault troubleshooting path of data center network faults, single-system host faults, middleware faults, database faults, and critical application node faults.
In some embodiments, determining a micro-service failed node according to a failure troubleshooting path for a data center network failure, a single system host failure, a middleware failure, a database failure, a critical application node failure, comprises: determining whether the data center has a network fault based on the flow information of the data center; if the data center has no network fault, determining whether a host in the single system has a fault or not based on the performance information of the host in the single system; if the host computer in the single system has no fault, determining whether the middleware has the fault or not based on the middleware information; if the middleware does not have a fault, determining whether the database has the fault or not based on the performance information of the database; if the database has no fault, determining whether the key application node has the fault or not based on the service information of the key application node.
In implementation, as shown in FIG. 2, it may first be determined whether a global network problem exists by querying information such as the traffic, session count, and latency of the data center and comparing it with a baseline. If a global network problem exists, anomaly analysis is performed on the network to determine the fault node and the fault cause.
If no global network problem exists, it can be confirmed whether large-scale host alarms exist in the current data center; then the performance information of the hosts of each single system, such as CPU, memory, and disk metrics, is queried to confirm whether a host fault exists in the current system. If a host fault exists, the corresponding host performance information is queried in detail to determine the fault node and the fault cause.
If no host fault exists, middleware information, such as connection-pool utilization and JVM heap-memory utilization, can be queried to confirm whether the middleware as a whole is unavailable; if a middleware fault exists, the investigation moves to the corresponding middleware for problem querying, and the fault node and the fault cause are determined.
If no middleware fault exists, it can be queried whether the current database has a performance bottleneck, whether SQL operations consume excessive resources, and so on; if such conditions exist, anomaly analysis is performed on the database to determine the fault node and the fault cause.
If the database is confirmed to have no fault, the critical application nodes on the critical business link with slow or even unresponsive transactions can be examined; problem investigation is then performed on these nodes based on their business information to determine the fault node and the fault cause.
Through the above troubleshooting process, the fault node of a global performance fault can generally be located.
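The ordered walk through this troubleshooting path can be sketched as follows; the layer names follow the text, while the metric keys and probe functions are hypothetical stand-ins for the dashboard queries described above:

```python
from typing import Callable, Optional

# Ordered layers of the global-performance troubleshooting path, each paired
# with a probe that returns True when monitoring data shows a fault at that
# layer. The probes here are stand-ins for the real dashboard queries.
GLOBAL_PATH = [
    ("data center network",  lambda m: m.get("dc_traffic_off_baseline", False)),
    ("single-system host",   lambda m: m.get("host_perf_abnormal", False)),
    ("middleware",           lambda m: m.get("middleware_unavailable", False)),
    ("database",             lambda m: m.get("db_bottleneck", False)),
    ("key application node", lambda m: m.get("key_app_slow", False)),
]


def locate_global_fault(metrics: dict) -> Optional[str]:
    """Walk the path in order and stop at the first layer that shows a fault."""
    for layer, probe in GLOBAL_PATH:
        if probe(metrics):
            return layer
    return None
```

The first-hit-wins walk mirrors the text: each layer is checked only after every earlier layer has been cleared, so a database fault is reported only when network, host, and middleware checks all pass.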
In some embodiments, determining a microservice failure node according to a single server application flow failure, an application to SQL call failure, an application code failure, a CPU or thread failure, a memory or JVM failure, a disk or file failure, a network failure troubleshooting path comprises: determining whether the application flow of the single server fails or not based on the application flow information of the single server; if the single server application flow has no fault, determining whether the application has SQL calling fault or not based on the SQL calling information of the application; if the application does not have SQL call faults, determining whether an application code has faults or not based on an application running log; if the application code has no fault, determining whether the CPU or the thread has the fault or not based on the running information of the CPU or the thread of the server; if the CPU or the thread has no fault, determining whether the memory or the JVM has the fault or not based on the running information of the memory or the JVM of the server; if the memory or the JVM has no fault, determining whether the disk or the file has the fault or not based on the running information of the disk or the file of the server; if the disk or the file has no fault, determining whether the network has the fault or not based on the running information of the server network.
In the implementation, as shown in fig. 3, whether the single server application flow has a fault may be first determined by the information such as the transaction amount, response time, response rate, and success rate of the single server. If the fault exists, carrying out problem inquiry on load balancing and current limiting setting, and confirming fault nodes and fault reasons.
If the single server application flow has no fault, based on the ordering information of SQL running time, whether SQL call abnormality exists can be confirmed, if SQL call abnormality exists, whether abnormal database processes can be deleted can be confirmed, if abnormal database processes can be deleted, if abnormal database processes can not be deleted, analysis can be performed on a database execution plan, and fault nodes and fault reasons can be confirmed.
If no SQL call abnormality exists, the application running log is queried and analyzed for abnormal errors to determine whether the application code is faulty; if a fault exists, the fault node and fault cause are determined.
If no application code fault exists, the CPU of the single server may be analyzed through key information such as CPU utilization, CPU load average, and the number of context switches. If a CPU or thread fault is confirmed, the abnormal CPU is analyzed, and the fault node and fault cause are confirmed.
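As a hedged illustration, the three CPU indicators named above (utilization proxy via load average, and context-switch count) can be read from Linux counters with only the Python standard library. The "load above core count" threshold and the `/proc/stat` parsing are illustrative assumptions, not part of the patented method:

```python
# Hedged sketch: CPU-side checks using Linux counters via the stdlib.
# The "1-minute load above core count" rule is an assumed heuristic.
import os

def cpu_suspect(threshold=None):
    """Flag a possible CPU/thread problem: 1-minute load average above
    the given threshold (default: number of CPU cores)."""
    cores = os.cpu_count() or 1
    load1, _, _ = os.getloadavg()          # 1-, 5-, 15-minute averages
    return load1 > (threshold if threshold is not None else cores)

def context_switches():
    """Total context switches since boot, from /proc/stat (Linux)."""
    with open("/proc/stat") as f:
        for line in f:
            if line.startswith("ctxt "):
                return int(line.split()[1])
    return 0
```

On non-Linux systems `/proc/stat` is absent and `os.getloadavg()` may be unavailable, so this sketch assumes a Linux server host as in the embodiment.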
If no CPU or thread fault exists, the memory or JVM of the single server may be analyzed based on its running information; if a memory or JVM fault exists, the abnormal memory or JVM is analyzed, and the fault node and fault cause are confirmed. The running information of the memory or JVM includes free memory, used memory, available memory, buffers/cache, a process's virtual memory (including that of Java processes), resident memory, shared memory, the process's page fault counts (including major and minor page faults), and the like.
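As a hedged sketch of collecting the per-process major/minor page fault counts listed above, the counters can be read from `/proc/<pid>/stat` on Linux (field positions per the proc(5) man page: minflt is field 10, majflt field 12). This collection method is an assumption for illustration, not taken from the patent:

```python
# Read a process's minor/major page fault counts from /proc/<pid>/stat
# (Linux only). Field layout follows proc(5).
import os

def page_faults(pid=None):
    """Return (minor, major) page fault counts for a process."""
    pid = pid if pid is not None else os.getpid()
    with open(f"/proc/{pid}/stat") as f:
        data = f.read()
    # comm (field 2) may contain spaces, so split after its closing paren
    rest = data.rsplit(")", 1)[1].split()
    # rest[0] is field 3 (state); minflt = field 10, majflt = field 12
    return int(rest[7]), int(rest[9])
```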
If no memory or JVM fault exists, whether a disk or file is faulty may be determined based on the running information of the server's disks or files; if so, the abnormal disk or file is analyzed in detail to determine the fault node and fault cause. The running information of a disk or file includes disk utilization, disk throughput, disk response time, disk wait queue length, and the like.
If no disk or file fault exists, whether the server network is faulty may be determined from information such as network bandwidth and network latency; if so, the abnormal network running information is analyzed to determine the fault node and fault cause.
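The ordered checks of fig. 3 described above form a fall-through decision tree: traffic, then SQL, code, CPU/thread, memory/JVM, disk/file, and finally network. A minimal Python sketch follows; every metric key and threshold is an invented placeholder, since the patent does not specify concrete values:

```python
# Minimal sketch of the fig. 3 fall-through troubleshooting order.
# Metric keys and thresholds are illustrative assumptions only.

def locate_local_fault(metrics: dict) -> str:
    """Return the first layer whose check fires, in troubleshooting order."""
    checks = [
        ("app_traffic", lambda m: m.get("success_rate", 1.0) < 0.99),
        ("sql_call",    lambda m: m.get("slowest_sql_ms", 0) > 2000),
        ("app_code",    lambda m: m.get("error_log_count", 0) > 0),
        ("cpu_thread",  lambda m: m.get("cpu_usage", 0.0) > 0.9),
        ("memory_jvm",  lambda m: m.get("free_memory_mb", 1 << 20) < 128),
        ("disk_file",   lambda m: m.get("disk_usage", 0.0) > 0.95),
        ("network",     lambda m: m.get("latency_ms", 0) > 500),
    ]
    for layer, is_faulty in checks:
        if is_faulty(metrics):
            return layer          # fault node located at this layer
    return "unknown"              # every check passed

print(locate_local_fault({"cpu_usage": 0.97}))  # -> cpu_thread
```

The ordered list mirrors the "from application down to infrastructure" sequence of the embodiment: each layer is only examined once all earlier layers have been ruled out.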
In this embodiment, in the analysis of local performance faults, analyzing the application traffic, logs, host, middleware, database, network, and the like can cover more than 99% of fault causes, so the fault cause can essentially be located.
In this embodiment, by checking each item of system information one by one in the order from the whole to the local and from the cluster to the single machine, the fault node in both global and local performance faults can be accurately located.
In an optional embodiment, in step S104, determining the micro-service fault node based on the second fault troubleshooting path corresponding to the second sub-fault type includes:
For service exception faults, determining the micro-service fault node according to a troubleshooting path covering application interface faults, front-end page faults, and client faults.
For application exception faults, determining the micro-service fault node according to a troubleshooting path covering multi-data-center traffic distribution faults, single-server traffic faults, database faults, and memory or JVM faults.
In some embodiments, determining a micro-service fault node according to a troubleshooting path covering an application interface fault, a front-end page fault, and a client fault comprises: determining, based on a service error code, whether the service interface is faulty; if not, determining, based on front-end page running information, whether the front-end page is faulty; and if not, determining, based on client operation data, whether the client is faulty.
In a specific implementation, as shown in fig. 4, when a service problem occurs, a query can first be made with a common service error code or a transaction trace ID to see whether the error was raised by the business rule layer.
If the service error code is not a common one, the error was not raised by the business rule layer; the keywords of the service error code are then analyzed to determine whether the service interface is faulty. If it is, the service interface message can be analyzed to determine the fault node and fault cause.
If the service interface has no fault, the fault type and the faulty page can be checked based on the front-end page running information to determine whether the front-end page is faulty; if it is, the fault node and fault cause can be determined from the front-end page code.
If the front-end page has no fault, whether the client is faulty is determined based on the client running log; if it is, the fault node and fault cause are determined through analysis.
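The fig. 4 order described above (business rule layer first, then service interface, front-end page, client) can be sketched as a simple error-code router. The code prefixes (`IF-`, `FE-`, `CL-`) are invented purely for illustration; the patent does not define an error-code scheme:

```python
# Hedged sketch of routing a service fault to the layer to examine first,
# following the fig. 4 troubleshooting order. Prefixes are hypothetical.

def locate_business_fault(error_code: str) -> str:
    """Map an observed error code to the layer to investigate first."""
    if error_code.startswith("IF-"):   # service interface message error
        return "service_interface"
    if error_code.startswith("FE-"):   # front-end page error
        return "front_end_page"
    if error_code.startswith("CL-"):   # client runtime error
        return "client"
    return "business_rule_layer"       # common business error code

print(locate_business_fault("FE-404"))  # -> front_end_page
```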
In some embodiments, determining a micro-service fault node according to a troubleshooting path covering a multi-data-center traffic distribution fault, a single-server traffic fault, a database fault, and a memory or JVM fault comprises: determining, based on traffic information of the multiple data centers, whether the multi-data-center traffic distribution is faulty; if not, determining, based on traffic information of a single data center, whether the single-data-center traffic distribution is faulty; if not, determining, based on running information of the service system, whether the single-server traffic is faulty; if not, determining, based on the server's database access information, whether the database is faulty; and if not, determining, based on running information of the server's memory or JVM, whether the memory or JVM is faulty.
In a specific implementation, as shown in fig. 5, when an application problem occurs, the traffic of the multiple data centers may first be compared against traffic baselines to determine whether the multi-data-center traffic distribution deviates substantially; a substantial deviation indicates a problem in the multi-data-center traffic distribution, in which case the traffic entries of the data centers and the web application firewall (WAF) check rules may be examined to determine the fault node and fault cause.
If the multi-data-center traffic distribution has no fault, the traffic distribution across the servers of a single data center is checked to determine whether the single-data-center traffic distribution is faulty. If it is, the traffic entries of that data center's servers and the WAF check rules can be examined to determine the fault node and fault cause.
If the single-data-center traffic distribution has no fault, whether the traffic distribution among the multiple applications within a single server is faulty is determined; if it is, the applications' traffic entries and the WAF check rules are examined to determine the fault node and fault cause.
If the traffic distribution among the applications in a single server has no fault, whether the server encounters problems such as blocking or concurrency contention when accessing the database is checked to determine whether the database is faulty; if it is, the faulty database is analyzed to determine the fault node and fault cause.
If the database is not faulty, whether the memory or the JVM is faulty can be determined; if it is, the abnormal memory or JVM is analyzed to confirm the fault node and fault cause.
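The fig. 5 checks can likewise be sketched as ordered comparisons against traffic baselines, falling through to database and memory/JVM checks. The 30% deviation tolerance and all metric names below are illustrative assumptions, not values from the patent:

```python
# Hedged sketch of the fig. 5 application-exception troubleshooting order:
# multi-DC traffic -> single-DC traffic -> single-server traffic ->
# database -> memory/JVM. Tolerance and metric names are hypothetical.

def traffic_deviates(observed, baseline, tol=0.3):
    """True when traffic deviates from its baseline by more than tol."""
    return baseline > 0 and abs(observed - baseline) / baseline > tol

def locate_app_fault(obs: dict) -> str:
    if traffic_deviates(obs["multi_dc_traffic"], obs["multi_dc_baseline"]):
        return "multi_dc_distribution"      # inspect traffic entry / WAF rules
    if traffic_deviates(obs["single_dc_traffic"], obs["single_dc_baseline"]):
        return "single_dc_distribution"
    if traffic_deviates(obs["server_traffic"], obs["server_baseline"]):
        return "single_server_distribution"
    if obs.get("db_blocked_sessions", 0) > 0:   # blocking / concurrency issues
        return "database"
    if obs.get("jvm_old_gen_usage", 0.0) > 0.9:
        return "memory_jvm"
    return "unknown"

obs = {"multi_dc_traffic": 100, "multi_dc_baseline": 100,
       "single_dc_traffic": 40, "single_dc_baseline": 100,
       "server_traffic": 10, "server_baseline": 10}
print(locate_app_fault(obs))  # -> single_dc_distribution
```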
In this embodiment, by checking each item of system information one by one in the order from the whole to the local and from the cluster to the single machine, the fault node in both service exception faults and application exception faults can be accurately located.
According to an embodiment of the present application, the present application also provides an electronic device and a readable storage medium.
Fig. 6 shows a schematic block diagram of an example electronic device 800 that may be used to implement embodiments of the application. Electronic devices are intended to represent various forms of digital computers, such as laptops, desktops, workstations, personal digital assistants, servers, blade servers, mainframes, and other appropriate computers. The electronic device may also represent various forms of mobile devices, such as personal digital assistants, cellular telephones, smartphones, wearable devices, and other similar computing devices. The components shown herein, their connections and relationships, and their functions are meant to be exemplary only and are not meant to limit implementations of the application described and/or claimed herein.
As shown in fig. 6, the electronic device 800 includes a computing unit 801 that can perform various appropriate actions and processes according to a computer program stored in a Read Only Memory (ROM) 802 or a computer program loaded from a storage unit 808 into a Random Access Memory (RAM) 803. In the RAM 803, various programs and data required for the operation of the electronic device 800 can also be stored. The computing unit 801, the ROM 802, and the RAM 803 are connected to each other by a bus 804. An input/output (I/O) interface 805 is also connected to the bus 804.
Various components in electronic device 800 are connected to I/O interface 805, including: an input unit 806 such as a keyboard, mouse, etc.; an output unit 807 such as various types of displays, speakers, and the like; a storage unit 808, such as a magnetic disk, optical disk, etc.; and a communication unit 809, such as a network card, modem, wireless communication transceiver, or the like. The communication unit 809 allows the electronic device 800 to exchange information/data with other devices through a computer network such as the internet and/or various telecommunication networks.
The computing unit 801 may be any of a variety of general and/or special purpose processing components having processing and computing capabilities. Some examples of the computing unit 801 include, but are not limited to, a central processing unit (CPU), a graphics processing unit (GPU), various specialized artificial intelligence (AI) computing chips, various computing units running machine learning model algorithms, a digital signal processor (DSP), and any suitable processor, controller, microcontroller, etc. The computing unit 801 performs the various methods and processes described above, such as the micro-service fault localization method. For example, in some embodiments, the micro-service fault localization method may be implemented as a computer software program tangibly embodied on a machine-readable medium, such as the storage unit 808. In some embodiments, part or all of the computer program may be loaded and/or installed onto the electronic device 800 via the ROM 802 and/or the communication unit 809. When the computer program is loaded into the RAM 803 and executed by the computing unit 801, one or more steps of the micro-service fault localization method described above may be performed. Alternatively, in other embodiments, the computing unit 801 may be configured to perform the micro-service fault localization method by any other suitable means (e.g., by means of firmware).
Various implementations of the systems and techniques described above may be implemented in digital electronic circuitry, integrated circuitry, field programmable gate arrays (FPGAs), application specific integrated circuits (ASICs), application specific standard products (ASSPs), systems on chip (SOCs), complex programmable logic devices (CPLDs), computer hardware, firmware, software, and/or combinations thereof. These various embodiments may include: implementation in one or more computer programs that may be executed and/or interpreted on a programmable system including at least one programmable processor, which may be a special purpose or general purpose programmable processor that can receive data and instructions from, and transmit data and instructions to, a storage system, at least one input device, and at least one output device.
Program code for carrying out methods of the present application may be written in any combination of one or more programming languages. These program code may be provided to a processor or controller of a general purpose computer, special purpose computer, or other programmable data processing apparatus such that the program code, when executed by the processor or controller, causes the functions/operations specified in the flowchart and/or block diagram to be implemented. The program code may execute entirely on the machine, partly on the machine, as a stand-alone software package, partly on the machine and partly on a remote machine or entirely on the remote machine or server.
In the context of the present application, a machine-readable medium may be a tangible medium that can contain, or store a program for use by or in connection with an instruction execution system, apparatus, or device. The machine-readable medium may be a machine-readable signal medium or a machine-readable storage medium. The machine-readable medium may include, but is not limited to, an electronic, magnetic, optical, electromagnetic, infrared, or semiconductor system, apparatus, or device, or any suitable combination of the foregoing. More specific examples of a machine-readable storage medium would include an electrical connection based on one or more wires, a portable computer diskette, a hard disk, a Random Access Memory (RAM), a read-only memory (ROM), an erasable programmable read-only memory (EPROM or flash memory), an optical fiber, a portable compact disc read-only memory (CD-ROM), an optical storage device, a magnetic storage device, or any suitable combination of the foregoing.
To provide for interaction with a user, the systems and techniques described here can be implemented on a computer having: a display device (e.g., a CRT (cathode ray tube) or LCD (liquid crystal display) monitor) for displaying information to a user; and a keyboard and pointing device (e.g., a mouse or trackball) by which a user can provide input to the computer. Other kinds of devices may also be used to provide for interaction with a user; for example, feedback provided to the user may be any form of sensory feedback (e.g., visual feedback, auditory feedback, or tactile feedback); and input from the user may be received in any form, including acoustic input, speech input, or tactile input.
The systems and techniques described here can be implemented in a computing system that includes a back-end component (e.g., a data server), or that includes a middleware component (e.g., an application server), or that includes a front-end component (e.g., a user computer having a graphical user interface or a web browser through which a user can interact with an implementation of the systems and techniques described here), or any combination of such back-end, middleware, or front-end components. The components of the system can be interconnected by any form or medium of digital data communication (e.g., a communication network). Examples of communication networks include: local area networks (LANs), wide area networks (WANs), and the internet.
The computer system may include a client and a server. The client and server are typically remote from each other and typically interact through a communication network. The relationship of client and server arises by virtue of computer programs running on the respective computers and having a client-server relationship to each other. The server may be a cloud server, a server of a distributed system, or a server incorporating a blockchain.
It should be appreciated that various forms of the flows shown above may be used to reorder, add, or delete steps. For example, the steps described in the present application may be performed in parallel, sequentially, or in a different order, provided that the desired results of the disclosed embodiments are achieved, and are not limited herein.
Furthermore, the terms "first," "second," and the like, are used for descriptive purposes only and are not to be construed as indicating or implying a relative importance or implicitly indicating the number of technical features indicated. Thus, a feature defining "a first" or "a second" may explicitly or implicitly include at least one such feature. In the description of the present application, the meaning of "a plurality" is two or more, unless explicitly defined otherwise.
The foregoing is merely illustrative of the present application, and the present application is not limited thereto, and any person skilled in the art will readily recognize that variations or substitutions are within the scope of the present application. Therefore, the protection scope of the present application shall be subject to the protection scope of the claims.
Claims (9)
1. A method for locating a micro-service fault, comprising:
determining fault characteristics based on the fault alert information in response to obtaining the micro-service fault alert information;
Determining a fault type based on the fault characteristics, wherein the fault types comprise performance type faults and non-performance type faults; determining the fault type based on the fault characteristics comprises: if the fault characteristics show that a first parameter is abnormal, determining that the fault type is a performance type fault; if the fault characteristics show that a second parameter is abnormal, determining that the fault type is a non-performance type fault; the first parameter comprises service response time and the number of fault alarm messages; the second parameter comprises traffic volume and traffic error rate;
If the fault type is a performance type fault, determining a first sub-fault type based on the number of fault alarm information, and determining a micro-service fault node based on a first fault troubleshooting path corresponding to the first sub-fault type; the first sub-fault type includes a global class performance fault and a local class performance fault;
if the fault type is a non-performance fault, determining a micro-service fault node based on a second fault troubleshooting path corresponding to a second sub-fault type; the second sub-fault type includes a traffic exception type fault and an application exception type fault.
2. The micro service fault location method of claim 1, wherein determining the first sub fault type based on the number of fault alert information comprises:
if the number of the fault alarm information is greater than or equal to an alarm number threshold, determining that the first sub-fault type is a global performance fault;
And if the number of the fault alarm information is smaller than the alarm number threshold value, determining that the first sub-fault type is a local performance fault.
3. The method of claim 1, wherein determining a micro service failure node based on a first failure troubleshooting path corresponding to a first sub-failure type comprises:
If the first sub-fault type is a global performance fault, determining a micro-service fault node according to a fault troubleshooting path of a data center network fault, a single system host fault, a middleware fault, a database fault, and a key application node fault;
If the first sub-fault type is a local performance fault, determining a micro-service fault node according to a fault troubleshooting path of a single server application flow fault, an application-to-Structured-Query-Language (SQL) call fault, an application code fault, a CPU or thread fault, a memory or Java virtual machine (JVM) fault, a disk or file fault, and a network fault.
4. The method for positioning a micro service failure according to claim 3, wherein determining a micro service failure node according to a failure troubleshooting path of a data center network failure, a single system host failure, a middleware failure, a database failure, a critical application node failure comprises:
determining whether the data center has a network fault based on the flow information of the data center;
If the data center has no network fault, determining whether a host in the single system has a fault or not based on the performance information of the host in the single system;
if the host computer in the single system has no fault, determining whether the middleware has the fault or not based on the middleware information;
If the middleware does not have a fault, determining whether the database has the fault or not based on the performance information of the database;
If the database has no fault, determining whether the key application node has the fault or not based on the service information of the key application node.
5. The method for positioning a micro service failure according to claim 3, wherein determining a micro service failure node according to a failure troubleshooting path of a single server application flow failure, an application to SQL call failure, an application code failure, a CPU or thread failure, a memory or JVM failure, a disk or file failure, and a network failure, comprises:
determining whether the application flow of the single server fails or not based on the application flow information of the single server;
If the single server application flow has no fault, determining whether the application has SQL calling fault or not based on the SQL calling information of the application;
If the application does not have SQL call faults, determining whether an application code has faults or not based on an application running log;
If the application code has no fault, determining whether the CPU or the thread has the fault or not based on the running information of the CPU or the thread of the server;
if the CPU or the thread has no fault, determining whether the memory or the JVM has the fault or not based on the running information of the memory or the JVM of the server;
If the memory or the JVM has no fault, determining whether the disk or the file has the fault or not based on the running information of the disk or the file of the server;
if the disk or the file has no fault, determining whether the network has the fault or not based on the running information of the server network.
6. The method of claim 1, wherein determining a micro service failure node based on a second failure troubleshooting path corresponding to a second sub-failure type comprises:
aiming at abnormal service faults, determining micro service fault nodes according to fault investigation paths of application interface faults, front-end page faults and client faults;
Aiming at the application abnormal faults, determining micro-service fault nodes according to fault troubleshooting paths of multiple data center flow distribution faults, single server flow faults, database faults, memory or JVM faults.
7. The method for positioning a micro service fault according to claim 6, wherein determining a micro service fault node according to a fault troubleshooting path of an application interface fault, a front-end page fault, and a client fault comprises:
Determining whether a service interface fails based on the service error code;
If the service interface has no fault, determining whether the front-end page has the fault or not based on the front-end page running information;
if the front-end page has no fault, determining whether the client has the fault or not based on the client operation data.
8. The method for positioning a micro service failure according to claim 6, wherein determining a micro service failure node according to a failure troubleshooting path of a multi-data center traffic distribution failure, a single server traffic failure, a database failure, a memory or a JVM failure, comprises:
determining whether the flow distribution of the multiple data centers fails based on the flow information of the multiple data centers;
If the flow distribution of the multiple data centers does not have faults, determining whether the flow distribution of the single data center has faults or not based on the flow information of the single data center;
if the single-center flow distribution has no fault, determining whether the single server flow has the fault or not based on the operation information of the service system;
If the single server flow has no fault, determining whether the database has the fault or not based on the access information of the server to the database;
If the database has no fault, determining whether the memory or the JVM has the fault based on the running information of the memory or the JVM of the server.
9. An electronic device, comprising:
At least one processor; and a memory communicatively coupled to the at least one processor; wherein the memory stores instructions executable by the at least one processor to cause the at least one processor to perform the micro-service fault localization method of any one of claims 1-8.
Priority Applications (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410935654.7A CN118467232B (en) | 2024-07-12 | 2024-07-12 | Micro-service fault positioning method and electronic equipment |
Applications Claiming Priority (1)
| Application Number | Priority Date | Filing Date | Title |
|---|---|---|---|
| CN202410935654.7A CN118467232B (en) | 2024-07-12 | 2024-07-12 | Micro-service fault positioning method and electronic equipment |
Publications (2)
| Publication Number | Publication Date |
|---|---|
| CN118467232A CN118467232A (en) | 2024-08-09 |
| CN118467232B true CN118467232B (en) | 2024-10-08 |
Family
ID=92152714
Family Applications (1)
| Application Number | Title | Priority Date | Filing Date |
|---|---|---|---|
| CN202410935654.7A Active CN118467232B (en) | 2024-07-12 | 2024-07-12 | Micro-service fault positioning method and electronic equipment |
Country Status (1)
| Country | Link |
|---|---|
| CN (1) | CN118467232B (en) |
Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114528175A (en) * | 2020-10-30 | 2022-05-24 | 亚信科技(中国)有限公司 | Micro-service application system root cause positioning method, device, medium and equipment |
| CN116260703A (en) * | 2023-02-13 | 2023-06-13 | 中国工商银行股份有限公司 | Distributed message service node CPU performance fault self-recovery method and device |
Family Cites Families (11)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| US9842045B2 (en) * | 2016-02-19 | 2017-12-12 | International Business Machines Corporation | Failure recovery testing framework for microservice-based applications |
| CN113760634B (en) * | 2020-09-04 | 2025-01-14 | 北京沃东天骏信息技术有限公司 | A data processing method and device |
| CN114138522A (en) * | 2020-09-04 | 2022-03-04 | 大唐移动通信设备有限公司 | A fault recovery method, device, electronic device and medium for microservices |
| CN113094284B (en) * | 2021-04-30 | 2024-11-15 | 中国工商银行股份有限公司 | Application fault detection method and device |
| CN113269648A (en) * | 2021-06-10 | 2021-08-17 | 中国建设银行股份有限公司 | Fault node positioning method and device, storage medium and electronic equipment |
| CN116382963A (en) * | 2023-03-29 | 2023-07-04 | 平安银行股份有限公司 | Fault classification management method and related equipment thereof |
| CN116338363B (en) * | 2023-05-24 | 2023-09-15 | 南方电网数字电网研究院有限公司 | Fault detection method, device, computer equipment and storage medium |
| CN116627706A (en) * | 2023-06-20 | 2023-08-22 | 中国电信股份有限公司 | Resource pool fault diagnosis method and device and electronic equipment |
| CN117201277A (en) * | 2023-08-14 | 2023-12-08 | 南方电网数字电网集团信息通信科技有限公司 | Automatic repair system for abnormal faults of micro-service of monitoring data |
| CN117827505A (en) * | 2023-11-15 | 2024-04-05 | 苏州元脑智能科技有限公司 | Equipment fault analysis method and device, electronic equipment and storage medium |
| CN118014558A (en) * | 2024-02-05 | 2024-05-10 | 中国电信股份有限公司 | Fault processing method and device, nonvolatile storage medium and electronic equipment |
- 2024-07-12: CN CN202410935654.7A patent granted as CN118467232B (status: Active)
Patent Citations (2)
| Publication number | Priority date | Publication date | Assignee | Title |
|---|---|---|---|---|
| CN114528175A (en) * | 2020-10-30 | 2022-05-24 | 亚信科技(中国)有限公司 | Micro-service application system root cause positioning method, device, medium and equipment |
| CN116260703A (en) * | 2023-02-13 | 2023-06-13 | 中国工商银行股份有限公司 | Distributed message service node CPU performance fault self-recovery method and device |
Also Published As
| Publication number | Publication date |
|---|---|
| CN118467232A (en) | 2024-08-09 |
Legal Events
| Date | Code | Title | Description |
|---|---|---|---|
| PB01 | Publication | ||
| SE01 | Entry into force of request for substantive examination | ||
| GR01 | Patent grant | ||