Disclosure of Invention
To address the problems of existing database operation and maintenance methods, the application provides an intelligent database operation and maintenance diagnosis tree method based on multidimensional metrics, which comprises the following steps:
(1) Collecting and processing multidimensional data of a database in operation, constructing an expert knowledge base, and collecting and analyzing historical data;
(2) Performing anomaly detection on query delay and IO operation waiting time in database operation;
(3) Identifying an anomaly source in the database system by constructing a root cause candidate set, alarm event association and root cause diagnosis;
(4) Storing and managing historical alarms and root cause analysis results, supporting rapid root cause inference for alarms, and continuously optimizing the system through dynamic learning;
wherein step (3) uses the likelihood score PS to measure the likelihood of each candidate in the root cause candidate set of a database anomaly, and uses a change-amount ratio (VP) model, which combines an impact measure of time-series data changes with a misjudgment-rate control mechanism, to locate the root causes that account for a larger share of the change.
Further, the step (3) specifically includes:
3.1 constructing a root cause candidate set;
3.2 associating alarm events;
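3.3 diagnosing the root cause by calculating the likelihood score $PS = \max\left(1 - \frac{d(\vec{v},\vec{a})}{d(\vec{v},\vec{f})},\ 0\right)$, where PS takes values in the range 0 to 1; $d(\vec{v},\vec{f})$ is the distance between the vectors $\vec{v}$ and $\vec{f}$, $d(\vec{v},\vec{f}) = \sqrt{\sum_i (v_i - f_i)^2}$, in which $v_i$ and $f_i$ denote the $i$-th components of $\vec{v}$ and $\vec{f}$, respectively; $\vec{v}$ is the current state of the database, $\vec{f}$ is the normal behavior predicted by the system, and $\vec{a}$ is the behavior that results from a given combination of features in the candidate set.
3.4 calculating the change-amount ratio of the combination x over the time period $(T_0, T_1)$, $VP(x) = \frac{\mathrm{value}(x, T_1) - \mathrm{value}(x, T_0)}{\mathrm{value}(*, T_1) - \mathrm{value}(*, T_0)}$, where $\mathrm{value}(x, T_0)$ and $\mathrm{value}(x, T_1)$ are the values of the metric for the combination x at times $T_0$ and $T_1$, respectively, and $\mathrm{value}(*, T_0)$ and $\mathrm{value}(*, T_1)$ are the values of the overall metric at times $T_0$ and $T_1$, respectively.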
Based on the above method, the application also provides an intelligent database operation and maintenance diagnosis tree system based on multidimensional metrics, which comprises:
(1) The diagnosis tree basic module is responsible for multidimensional data acquisition and processing, expert knowledge base construction and historical data collection and analysis during database operation;
(2) The abnormality detection module is used for detecting abnormality of query delay and IO operation waiting time in the database operation;
(3) The intelligent operation and maintenance diagnosis tree construction module is used for identifying an abnormal source in the database system by constructing a root cause candidate set, alarm event association and root cause diagnosis;
(4) The knowledge base learning module, which stores and manages historical alarms and root cause analysis results, supports rapid root cause inference for alarms, and continuously optimizes the system through dynamic learning;
wherein the intelligent operation and maintenance diagnosis tree construction module uses the likelihood score PS to measure the likelihood of each candidate in the root cause candidate set of a database anomaly, and uses a change-amount ratio (VP) model, which combines an impact measure of time-series data changes with a misjudgment-rate control mechanism, to locate the root causes that account for a larger share of the change.
Further, the intelligent operation and maintenance diagnosis tree construction module is configured to perform the following steps:
3.1 constructing a root cause candidate set;
3.2 associating alarm events;
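3.3 diagnosing the root cause by calculating the likelihood score $PS = \max\left(1 - \frac{d(\vec{v},\vec{a})}{d(\vec{v},\vec{f})},\ 0\right)$, where PS takes values in the range 0 to 1; $d(\vec{v},\vec{f})$ is the distance between the vectors $\vec{v}$ and $\vec{f}$, $d(\vec{v},\vec{f}) = \sqrt{\sum_i (v_i - f_i)^2}$, in which $v_i$ and $f_i$ denote the $i$-th components of $\vec{v}$ and $\vec{f}$, respectively; $\vec{v}$ is the current state of the database, $\vec{f}$ is the normal behavior predicted by the system, and $\vec{a}$ is the behavior that results from a given combination of features in the candidate set.
3.4 calculating the change-amount ratio of the combination x over the time period $(T_0, T_1)$, $VP(x) = \frac{\mathrm{value}(x, T_1) - \mathrm{value}(x, T_0)}{\mathrm{value}(*, T_1) - \mathrm{value}(*, T_0)}$, where $\mathrm{value}(x, T_0)$ and $\mathrm{value}(x, T_1)$ are the values of the metric for the combination x at times $T_0$ and $T_1$, respectively, and $\mathrm{value}(*, T_0)$ and $\mathrm{value}(*, T_1)$ are the values of the overall metric at times $T_0$ and $T_1$, respectively.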
By comparing the current state of the database with the state predicted by the system, and by tracing the source of the anomaly through the feature combinations in the root cause candidate set, the system can quickly locate potential root cause combinations and shorten troubleshooting time. The distance $d(\vec{v},\vec{f}) = \sqrt{\sum_i (v_i - f_i)^2}$ used for the vectors $\vec{v}$ and $\vec{f}$ in the algorithm helps capture small changes in the running state of the database more accurately, which makes it effective for detecting complex and non-obvious anomalous features.
The invention addresses the shortcomings of traditional operation and maintenance methods when a complex database becomes abnormal: low efficiency, excessive dependence on manual experience, and the need to screen and analyze large volumes of operation data by hand. Compared with those methods, the invention improves the automation and accuracy of root cause analysis, reduces manual workload, and improves the efficiency and reliability of database operation and maintenance. With the invention, the range of the root cause candidate set is narrowed step by step: the candidates are ranked by likelihood score and by change-amount ratio separately, the two ranks of each candidate are added to give a ranking score, the candidates are re-sorted by that ranking score, and the most likely root cause is output. Through the intelligent diagnosis tree module, a high-level performance anomaly can be decomposed step by step down to specific fine-grained operations; this hierarchical fault localization improves the accuracy and speed of anomaly diagnosis and reduces the effort spent searching irrelevant data. In addition, the likelihood score PS quantifies the likelihood of each root cause candidate set, and comparing the current state of the database with the historical normal state narrows the root cause set further. The system can thus automatically evaluate how likely each root cause candidate set is to have caused the anomaly and provide an objective basis for diagnosis.
Detailed Description
To address the problems of existing operation and maintenance methods, the application provides an intelligent database operation and maintenance diagnosis tree method based on multidimensional metrics. The diagnosis tree basic module performs multidimensional data acquisition and builds the expert knowledge base, laying the foundation for root cause analysis. The core intelligent diagnosis tree module constructs a root cause candidate set through hierarchical analysis of performance anomalies, searches the root cause combinations with a Monte Carlo search algorithm, and ranks the root causes by combining the likelihood score and the change-amount ratio, thereby achieving operation and maintenance diagnosis of abnormal database metrics across multiple dimensions. The system flow diagram is shown in Fig. 1.
1. The diagnosis tree basic module is the foundation of the intelligent diagnosis tree and is responsible for multidimensional data acquisition and processing, expert knowledge base construction, and historical data collection and analysis during database operation. The module monitors data in real time and, drawing on anomaly judgment and trend analysis over historical data, provides a stable decision basis for the whole system. Its core functions comprise:
Multidimensional data acquisition: collecting alarm data, performance metrics (such as query delay, I/O wait time and resource utilization), service operation logs, database transaction information and other multidimensional data, so that the system is monitored comprehensively.
Expert knowledge base construction: establishing and continuously updating a knowledge base based on the experience of domain experts, to supplement and optimize the automatic analysis capability of the system.
Historical data analysis: collecting and analyzing historical operation data of the system and establishing a baseline model; the historical data are compared with the current state to judge the severity of an anomaly and its evolution trend.
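As a rough illustration of comparing the current state with a historical baseline, the sketch below uses a mean and standard deviation baseline and a z-score. The application does not specify the baseline model, so this choice, the function names `build_baseline` and `deviation`, and the sample numbers are all assumptions made for illustration.

```python
from statistics import mean, stdev

def build_baseline(history):
    """Summarize historical samples of one metric as (mean, standard deviation)."""
    return mean(history), stdev(history)

def deviation(value, baseline):
    """Return how many standard deviations `value` lies from the baseline mean."""
    mu, sigma = baseline
    if sigma == 0:
        # A flat history: any departure from the mean is treated as unbounded.
        return 0.0 if value == mu else float("inf")
    return (value - mu) / sigma

# Hypothetical query-delay samples in milliseconds.
history = [100, 102, 98, 101, 99]
baseline = build_baseline(history)
severity = deviation(130, baseline)  # a large positive value signals a severe deviation
```

A larger absolute deviation can be read as a more severe anomaly, and tracking it over successive samples gives a simple view of the evolution trend.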
2. The anomaly detection module intelligently detects anomalies in key metrics (such as query delay and I/O wait time) during database operation.
Pre-detection: each point to be checked first enters the pre-detection flow, in which a time-series feature similarity model filters out most normal points. If the pre-detection flow judges a point to be anomalous, the point proceeds to the feature extraction and classification flow.
Feature extraction and classification: when pre-detection reports an anomaly, multidimensional features are extracted from the alarm data and classified with a machine learning algorithm; the degree to which each dimension's features influence the metric anomaly is recorded, a dimension importance ranking is generated, and the classification result is fed back to the sample set.
3. The intelligent diagnosis tree construction module:
When the system detects a high-level performance anomaly, the diagnosis tree module automatically decomposes the anomaly downwards and analyzes the fine-grained operations of the system layer by layer. Through this hierarchical anomaly analysis, the system can quickly narrow the fault range and avoid fruitless searches through large volumes of operation data.
The module is designed around cross-analysis of multidimensional metrics: the intelligent diagnosis tree goes deeper layer by layer to diagnose database anomalies comprehensively and systematically. The core flow is as follows:
(1) Constructing the root cause candidate set: in a database environment, the root cause candidate set is the basis of diagnosis tree analysis. The set is built from combinations of dimensions, and potential causes are identified by cross-analyzing these combinations. The system thereby generates a set of potential root cause candidates that serves as the basis for subsequent anomaly localization.
(2) Alarm event association: when a performance metric becomes abnormal, the system triggers an alarm event, which serves as the start signal of the diagnosis tree and guides operation and maintenance personnel or automation tools to trace the root cause of the anomaly.
(3) Root cause diagnosis: the currently observed database state is compared with the historical normal state to narrow the root cause candidate set further, and the most probable anomaly source is determined through quantitative analysis, as follows:
The likelihood score PS measures how likely a candidate root cause set is to have caused a database anomaly (for example, a performance bottleneck or a query failure); the higher the score, the more likely the set is the cause. By comparing the observed state of the running database with the predicted values and with the behavioral characteristics of the candidate root causes, the score quantifies which root cause set is most likely the source of the problem. The calculation formula is as follows:
$$PS = \max\left(1 - \frac{d(\vec{v},\vec{a})}{d(\vec{v},\vec{f})},\ 0\right)$$
where PS takes values in the range 0 to 1, and $d(\vec{v},\vec{f})$ denotes the distance between the vectors $\vec{v}$ and $\vec{f}$, calculated as
$$d(\vec{v},\vec{f}) = \sqrt{\sum_i (v_i - f_i)^2}$$
where $v_i$ and $f_i$ denote the $i$-th components of $\vec{v}$ and $\vec{f}$, respectively. $\vec{v}$ represents the current state of the database, $\vec{f}$ represents the normal behavior predicted by the system, and $\vec{a}$ is the behavior resulting from a given combination of features in the root cause candidate set.
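The likelihood score can be sketched in a few lines of Python. The sketch assumes a Euclidean distance and the form PS = max(1 - d(v, a) / d(v, f), 0), which is consistent with the stated 0 to 1 range; the function names and the sample vectors are hypothetical and chosen only for illustration.

```python
import math

def euclidean(u, w):
    """Euclidean distance between two equal-length vectors (assumed distance form)."""
    return math.sqrt(sum((ui - wi) ** 2 for ui, wi in zip(u, w)))

def likelihood_score(v, f, a):
    """PS = max(1 - d(v, a) / d(v, f), 0), clipped to the range [0, 1].

    v: currently observed state, f: predicted normal behavior,
    a: behavior implied by one candidate root cause combination.
    """
    d_vf = euclidean(v, f)
    if d_vf == 0:
        return 0.0  # the observation matches the forecast, so there is nothing to explain
    return max(1.0 - euclidean(v, a) / d_vf, 0.0)

# Hypothetical three-metric state vectors.
v = [10.0, 5.0, 1.0]          # observed
f = [4.0, 5.0, 1.0]           # predicted normal behavior
a_full = [10.0, 5.0, 1.0]     # candidate that reproduces the observation exactly
a_half = [7.0, 5.0, 1.0]      # candidate that explains half of the deviation
a_none = [4.0, 5.0, 1.0]      # candidate that explains none of it

scores = [likelihood_score(v, f, a) for a in (a_full, a_half, a_none)]
```

Under these assumptions a candidate whose implied behavior coincides with the observation scores 1, and one that is no closer to the observation than the forecast scores 0.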
In addition, because database metric data are usually time series, the application uses the change-amount ratio to describe the share of the overall metric change within a time period $(T_0, T_1)$ that is contributed by a given attribute combination x, and focuses on the causes that account for a relatively large share. In practice, a database often shows small fluctuations that carry no significance but may still trigger unnecessary alarms. To prevent a slight fluctuation of an attribute combination with a small initial value from being misjudged as a major change in the subsequent period, and thus to reduce the misjudgment rate, the change-amount ratio VP of the combination x over the period $(T_0, T_1)$ can be defined as:
$$VP(x) = \frac{\mathrm{value}(x, T_1) - \mathrm{value}(x, T_0)}{\mathrm{value}(*, T_1) - \mathrm{value}(*, T_0)}$$
Here, $\mathrm{value}(x, T_0)$ and $\mathrm{value}(x, T_1)$ are the values of the metric for the combination x at times $T_0$ and $T_1$, respectively, and $\mathrm{value}(*, T_0)$ and $\mathrm{value}(*, T_1)$ are the values of the overall metric at times $T_0$ and $T_1$, respectively.
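A minimal sketch of the change-amount ratio follows, assuming VP is the change of the combination x divided by the change of the overall metric over the same period. The function name `change_ratio` and the numbers are hypothetical.

```python
def change_ratio(x_t0, x_t1, total_t0, total_t1):
    """Share of the overall metric change between T0 and T1 contributed by combination x."""
    total_change = total_t1 - total_t0
    if total_change == 0:
        return 0.0  # the overall metric did not move, so no share can be attributed
    return (x_t1 - x_t0) / total_change

# Hypothetical scan counts: the overall metric rises from 1000 to 2000.
vp_large = change_ratio(100, 800, 1000, 2000)  # combination with a large absolute change
vp_small = change_ratio(1, 3, 1000, 2000)      # tripled, but from a tiny initial value
```

The second combination triples in relative terms yet receives a ratio close to zero, which is the behavior the text relies on to keep small-valued combinations from being misjudged as major changes.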
When handling a database anomaly, if the space of dimension combinations of the metrics is too large, traditional search methods can struggle to find the root cause set because of their high complexity. A Monte Carlo search algorithm is therefore used to search the large dimension space for the most probable root cause combination. The process comprises the following four steps:
a. Selection: the likelihood score PS of each dimension combination among the multidimensional combinations of the database is calculated to measure how likely a given root cause candidate set is to have caused the anomaly, and the best exploration path is determined by ranking. The value $Q(s, a)$ is computed from the likelihood scores $PS(S(u))$, where $s$ is the current state, representing the current database dimension combination or system state; $a$ is the selected action, that is, the choice of a dimension combination or root cause candidate set to explore in state $s$; $s'$ is the successor node reached from state $s$ through action $a$; $descendent(s')$ is the set of successor states (descendants) of $s'$, that is, all states reachable from $s'$ through further expansion and search; $u$ ranges over all possible states on the current search path; and $S(u)$ is the root cause candidate set of $u$.
The UCB value of each node is
$$UCB(s, a) = Q(s, a) + C\sqrt{\frac{\ln N(s)}{N(s, a)}}$$
where $N(s)$ is the total number of visits to the current state node $s$; $N(s, a)$ is the number of times the given root cause set has been explored so far; $C$ is a constant, 2; and $Q(s, a)$ is the value obtained from the average potential score under a node, also called the average dimension value. The larger the UCB value of a node, the closer the node is to the root cause, so the node with the largest UCB value is sought.
According to the UCB formula, the node with the largest UCB value is selected as the next node to explore:
$$a^* = \arg\max_{a \in A(s)} UCB(s, a)$$
where $A(s)$ is the set of actions available in the current state $s$.
b. Expansion
From the set of elements that have not yet been explored, the element $e^*$ with the largest potential score is selected and added to the current root cause set $S(s)$ to form a new expanded node $s'$. The formula is:
$S(s') = S(s) \cup \{e^*\}$
where $e^*$ is the element with the largest potential score among the elements not yet added to the set during expansion.
c. Evaluation
After a new node is expanded, the node is initialized, its likelihood score PS is recalculated, and the associated Q and N values are updated.
d. Backtracking
Backtracking begins when all nodes have been expanded or the number of iterations exceeds a limit.
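The four steps can be put together in a compact sketch. It is an illustration under stated assumptions rather than the application's implementation: Q is kept as a running average of the scores seen below a node, the exploration term is the standard UCB form with C = 2, and `toy_ps` with its hypothetical ground truth `TRUE_CAUSE` stands in for the likelihood score PS so that the example is self-contained. All class, function and element names are hypothetical.

```python
import math

class Node:
    """One search state s; root_cause_set plays the role of S(s)."""
    def __init__(self, root_cause_set, parent=None):
        self.root_cause_set = frozenset(root_cause_set)
        self.parent = parent
        self.children = []
        self.n = 0     # N: number of visits
        self.q = 0.0   # Q: running average of the scores observed below this node

def ucb(child, c=2.0):
    # Q(s, a) + C * sqrt(ln N(s) / N(s, a)); an unvisited child is tried first.
    if child.n == 0:
        return math.inf
    return child.q + c * math.sqrt(math.log(child.parent.n) / child.n)

def select(node, elements):
    # a. Selection: follow the child with the largest UCB while the node is fully expanded.
    while node.children and len(node.children) == len(elements) - len(node.root_cause_set):
        node = max(node.children, key=ucb)
    return node

def expand(node, elements, element_score):
    # b. Expansion: S(s') = S(s) U {e*}, with e* the best-scoring element not yet tried here.
    tried = {e for ch in node.children for e in ch.root_cause_set - node.root_cause_set}
    remaining = [e for e in elements if e not in node.root_cause_set and e not in tried]
    if not remaining:
        return node  # nothing left to add; the node itself is evaluated again
    child = Node(node.root_cause_set | {max(remaining, key=element_score)}, parent=node)
    node.children.append(child)
    return child

def backup(node, score):
    # c./d. Evaluation result is propagated to all ancestors, updating N and Q.
    while node is not None:
        node.n += 1
        node.q += (score - node.q) / node.n
        node = node.parent

def search(elements, score_set, element_score, iterations=50):
    root = Node(())
    best_set, best_score = frozenset(), 0.0
    for _ in range(iterations):
        leaf = expand(select(root, elements), elements, element_score)
        score = score_set(leaf.root_cause_set)
        if score > best_score:
            best_set, best_score = leaf.root_cause_set, score
        backup(leaf, score)
    return best_set, best_score

# Toy stand-in for PS: overlap with a hypothetical ground-truth root cause set.
TRUE_CAUSE = frozenset({"missing_index", "lock_wait"})

def toy_ps(candidate):
    return len(candidate & TRUE_CAUSE) / len(candidate | TRUE_CAUSE)

elements = ["missing_index", "lock_wait", "large_sort"]
best_set, best_ps = search(elements, toy_ps, lambda e: toy_ps(frozenset({e})))
```

In a real deployment `score_set` would be the likelihood score of the candidate set, and the iteration limit corresponds to the stopping condition given in step d.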
Through the above algorithm, the system gradually narrows the range of the root cause candidate set. The candidates are ranked in descending order by likelihood score and by change-amount ratio separately; the two ranks of each candidate are added to give a ranking score, the candidates are re-sorted by this ranking score, and the most likely root cause is output.
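The rank addition can be sketched as follows. The PS values 0.85 and 0.60 and the 70% ratio are the figures quoted in the example of this specification; the third candidate, the remaining ratios, and the alphabetical tie-break are assumptions added for illustration.

```python
def fuse_rankings(ps, vp):
    """Rank candidates by PS and by VP (descending), add the two ranks, and re-sort."""
    def ranks(scores):
        ordered = sorted(scores, key=scores.get, reverse=True)
        return {name: position + 1 for position, name in enumerate(ordered)}

    ps_rank, vp_rank = ranks(ps), ranks(vp)
    rank_score = {name: ps_rank[name] + vp_rank[name] for name in ps}
    # A smaller summed rank is better; ties are broken alphabetically (an assumption).
    return sorted(rank_score, key=lambda name: (rank_score[name], name))

ps = {"missing_index": 0.85, "lock_wait": 0.60, "large_sort": 0.30}
vp = {"missing_index": 0.70, "lock_wait": 0.20, "large_sort": 0.10}
ordering = fuse_rankings(ps, vp)
most_likely = ordering[0]
```

Adding ranks rather than raw scores keeps the two measures on a common scale, so neither the likelihood score nor the change-amount ratio dominates because of its numeric range.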
4. The knowledge base learning module is responsible for storing and managing historical alarms and root cause analysis results, supporting rapid root cause inference for alarms, and continuously improving the intelligence of the system through dynamic learning.
Alarm knowledge base: stores the root cause analysis paths and handling schemes of historical alarms to form a searchable case base, so that solutions for similar alarm events can be found quickly.
Dynamic learning: after each alarm analysis is completed, the system writes the new alarm data and analysis results to the knowledge base, continuously refining the mapping between alarms and root causes so that the system becomes increasingly intelligent.
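A searchable case base of this kind can be sketched as a small in-memory store. The application does not say how similar alarms are matched, so the Jaccard overlap of alarm symptom sets used here is an assumption, as are the class name `AlarmKnowledgeBase` and the symptom labels.

```python
class AlarmKnowledgeBase:
    """In-memory case base mapping alarm symptom sets to root causes and solutions."""

    def __init__(self):
        self._cases = {}

    def record(self, symptoms, root_causes, solution):
        """Dynamic learning: store or overwrite the case for this symptom set."""
        self._cases[frozenset(symptoms)] = {
            "root_causes": list(root_causes),
            "solution": solution,
        }

    def lookup(self, symptoms):
        """Return the stored case with the greatest Jaccard overlap, or None if none overlaps."""
        query = frozenset(symptoms)
        best_case, best_similarity = None, 0.0
        for key, case in self._cases.items():
            union = query | key
            similarity = len(query & key) / len(union) if union else 0.0
            if similarity > best_similarity:
                best_case, best_similarity = case, similarity
        return best_case

kb = AlarmKnowledgeBase()
kb.record(
    {"query_delay_high", "io_wait_high"},          # hypothetical symptom labels
    ["lack of index", "lock wait"],
    "add index; optimize transactions",
)
match = kb.lookup({"query_delay_high"})
```

A later alarm that shares only part of the symptom set still retrieves the earlier case, which is the quick root cause inference the module is meant to support.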
Example 2
The database system of an enterprise handles a large number of user queries and supports real-time service requirements. During a certain time period, database performance suddenly becomes abnormal: user query delay increases markedly, and I/O wait time exceeds the normal range. Using the intelligent database operation and maintenance diagnosis tree method described above, the system helps operation and maintenance personnel locate the root cause quickly and provides a solution. The specific example is as follows:
The system pre-detects all queries running on the current database. The time-series characteristics of most queries match the normal state well, and these queries are judged to be normal. Some queries, however, show significantly increased delay and longer disk I/O wait time, and they enter the feature extraction and classification flow.
Feature extraction and classification: the system extracts multidimensional features from the anomalous queries, including:
Size of the tables queried
Execution plan of the query
Whether the query involves scanning or sorting large amounts of data
The extracted features show that:
Most queries with increased delay perform large data scans on the table Orders without using an index
Some queries are waiting for resource locks, which may delay I/O operations
From these features the system generates a dimension importance ranking and identifies "lack of index" and "lock wait" as the primary causes of the performance bottleneck.
The system takes the currently observed database state metric vector, the normal behavior vector predicted by the system, and the behavior vector generated from a feature combination in the root cause candidate set, and evaluates each root cause candidate quantitatively by computing its likelihood score (PS) from the distances between these vectors. The results are:
The likelihood score for lack of index is 0.85 (the higher of the two), indicating that it is the primary root cause.
The likelihood score for lock wait is 0.60, indicating a secondary cause.
In addition, the system uses the change-amount ratio method to analyze metric changes over the time period $(T_0, T_1)$ and finds that the scan volume of the table Orders has increased markedly, with a ratio of 70%, which further confirms that lack of index is the main cause of the problem.
The system records the root cause analysis path of the alarm (lack of index, lock wait) and the final solution (adding an index, optimizing transactions) in the knowledge base, so that similar alarms can be handled quickly in the future.
The units, devices or modules etc. set forth in the above embodiments may be implemented in particular by a computer chip or entity or by a product having a certain function. For convenience of description, the above devices are described as being functionally divided into various modules, respectively. Of course, when implementing the present application, the functions of each module may be implemented in the same or multiple pieces of software and/or hardware, or a module implementing the same function may be implemented by multiple sub-modules or a combination of sub-units. The above-described apparatus embodiments are merely illustrative, for example, the division of the units is merely a logical function division, and there may be additional divisions when actually implemented, for example, multiple units or components may be combined or integrated into another system, or some features may be omitted or not performed. Alternatively, the coupling or direct coupling or communication connection shown or discussed with each other may be an indirect coupling or communication connection via some interfaces, devices or units, which may be in electrical, mechanical or other form.
Those skilled in the art will also appreciate that, in addition to implementing the controller purely as computer readable program code, it is entirely possible to implement the same functionality by logically programming the method steps so that the controller takes the form of logic gates, switches, application specific integrated circuits, programmable logic controllers, embedded microcontrollers, and the like. Such a controller can be regarded as a hardware component, and the means included in it for implementing various functions can also be regarded as structures within the hardware component. Alternatively, the means for implementing the various functions may be regarded both as software modules implementing the methods and as structures within the hardware component.
The application may be described in the general context of computer-executable instructions, such as program modules, being executed by a computer. Generally, program modules include routines, programs, objects, components, data structures, classes, etc. that perform particular tasks or implement particular abstract data types. The application may also be practiced in distributed computing environments where tasks are performed by remote processing devices that are linked through a communications network. In a distributed computing environment, program modules may be located in both local and remote computer storage media including memory storage devices.
From the above description of the embodiments, it will be apparent to those skilled in the art that the present application may be implemented in software together with a necessary general hardware platform. Based on this understanding, the technical solution of the present application, in essence or in the part that contributes over the prior art, may be embodied in the form of a software product, which may be stored in a storage medium such as a ROM/RAM, a magnetic disk or an optical disk, and which includes several instructions for causing a computer device (which may be a personal computer, a mobile terminal, a server, a network device, or the like) to execute the method according to the embodiments, or parts of the embodiments, of the present application.
The embodiments in this specification are described in a progressive manner; for parts that are identical or similar among the embodiments, reference may be made from one embodiment to another, and each embodiment focuses on its differences from the others. The application is operational with numerous general purpose or special purpose computer system environments or configurations, for example a personal computer, a server computer, a hand-held or portable device, a tablet device, a multiprocessor system, a microprocessor-based system, a set top box, a programmable electronic device, a network PC, a minicomputer, a mainframe computer, or a distributed computing environment that includes any of the above systems or devices.
The foregoing description of the embodiments is provided to illustrate the general principles of the application and is not intended to limit the scope of the application or to restrict the application to the particular embodiments; any modifications, equivalents, improvements and the like that fall within the spirit and principles of the application are intended to be included within its scope.